In some cases you will not be able to expose your Databricks instance for Atlan to extract and ingest lineage. For example, this may happen when security requirements restrict access to sensitive, mission-critical data.
In such cases you may want to decouple the extraction of lineage from its ingestion in Atlan. This approach gives you full control over your resources and metadata transfer to Atlan.
Prerequisites
To extract lineage from your on-premises Databricks instance, you will need to use Atlan's databricks-extractor tool.
Install Docker Compose
Docker Compose is a tool for defining and running applications composed of many Docker containers. (Any guesses where the name came from?)
To install Docker Compose:
Get the databricks-extractor tool
To get the databricks-extractor tool:
- Raise a support ticket to get the link to the latest version.
- Download the image using the link provided by support.
- Load the image to the server you'll use to extract lineage from Databricks:
sudo docker load -i /path/to/databricks-extractor-master.tar
Get the compose file
Atlan provides you with a Docker compose file for the databricks-extractor tool.
To get the compose file:
- Download the latest compose file.
- Save the file to an empty directory on the server you'll use to access your on-premises Databricks instance.
- The file is docker-compose.yaml.
Define Databricks connections
The compose file includes three main sections:
- x-templates contains configuration fragments. You should ignore this section; do not make any changes to it.
- services is where you will define your Databricks connections.
- volumes contains mount information. You should ignore this section as well; do not make any changes to it.
Define services
For each on-premises Databricks instance, define an entry under services in the compose file.
Each entry will have the following structure:
services:
  connection-name:
    <<: *extract-lineage
    environment:
      <<: *databricks-defaults
      EXTRACT_QUERY_HISTORY: true
      QUERY_HISTORY_START_TIME_MS: 0
    volumes:
      - ./output/connection-name:/output
- Replace connection-name with the name of your connection.
- <<: *extract-lineage tells the databricks-extractor tool to run.
- environment contains all parameters for the tool.
  - EXTRACT_QUERY_HISTORY: specifies whether to extract query history for the Databricks connection, in addition to lineage. The query history output can then be used to calculate usage and popularity metrics.
  - QUERY_HISTORY_START_TIME_MS: specifies the time in epoch milliseconds from which to extract query history. If unspecified, the extractor will extract queries for the past 30 days by default. In Databricks, the query history retains query data for the past 30 days.
- volumes specifies where to store results. In this example, the extractor will store results in the ./output/connection-name folder on the local file system.
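Because QUERY_HISTORY_START_TIME_MS expects epoch milliseconds, it can be easier to compute the value than to type it by hand. The following is a minimal Python sketch and is not part of the databricks-extractor tool; the helper names to_epoch_ms and days_ago_epoch_ms are hypothetical and exist only for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Reference point for epoch timestamps (midnight UTC, 1 January 1970).
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_ms(moment: datetime) -> int:
    """Convert a timezone-aware datetime to whole epoch milliseconds."""
    if moment.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime to avoid local-time ambiguity")
    # Integer division of timedeltas avoids floating-point rounding.
    return (moment - EPOCH) // timedelta(milliseconds=1)


def days_ago_epoch_ms(days: int, now: Optional[datetime] = None) -> int:
    """Return the epoch-millisecond timestamp for `days` days before `now`."""
    current = now if now is not None else datetime.now(timezone.utc)
    return to_epoch_ms(current - timedelta(days=days))


if __name__ == "__main__":
    # Fixed date: 1 January 2024, 00:00 UTC.
    print(to_epoch_ms(datetime(2024, 1, 1, tzinfo=timezone.utc)))  # prints 1704067200000
    # Rolling window: seven days before the current time.
    print(days_ago_epoch_ms(7))
```

The printed number is the value you would place after QUERY_HISTORY_START_TIME_MS in the compose file.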
You can add as many Databricks connections as you want.
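To illustrate how several connections sit side by side, here is a small Python sketch that renders one services entry per connection name, following the structure shown above. It only produces text for you to review and paste; the function name render_services and the connection names in the usage example are hypothetical, and the output omits the x-templates, volumes, and secrets sections.

```python
from typing import Iterable

# One service entry, mirroring the structure of the compose file entry above.
SERVICE_TEMPLATE = """\
  {name}:
    <<: *extract-lineage
    environment:
      <<: *databricks-defaults
      EXTRACT_QUERY_HISTORY: true
      QUERY_HISTORY_START_TIME_MS: {start_ms}
    volumes:
      - ./output/{name}:/output
"""


def render_services(names: Iterable[str], start_ms: int = 0) -> str:
    """Render a services block with one entry per connection name."""
    unique_names = list(names)
    if len(set(unique_names)) != len(unique_names):
        # Duplicate keys would silently overwrite each other in YAML.
        raise ValueError("connection names must be unique")
    entries = (SERVICE_TEMPLATE.format(name=n, start_ms=start_ms) for n in unique_names)
    return "services:\n" + "".join(entries)


if __name__ == "__main__":
    # Hypothetical connection names, used only for illustration.
    print(render_services(["databricks-prod", "databricks-dev"]))
```

Each connection gets its own output directory, so results from different Databricks instances do not overwrite one another.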
Docker's documentation describes the services format in more detail.
Provide credentials
To define the credentials for your Databricks connections, you will need to provide a Databricks configuration file.
The Databricks configuration is a .ini file with the following format:
[DatabricksConfig]
host = <host>
port = <port>
# seconds to wait for a response from the server
timeout = 300
# Databricks authentication type. Options: personal_access_token, aws_service_principal
auth_type = personal_access_token
# Required only if auth_type is personal_access_token.
[PersonalAccessTokenAuth]
personal_access_token = <personal_access_token>
# Required only if auth_type is aws_service_principal.
[AWSServicePrincipalAuth]
client_id = <client_id>
client_secret = <client_secret>
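Before handing the file to the extractor, you may want to check that the placeholders have been filled in and that the section matching auth_type is present. The sketch below uses Python's standard configparser module; it is not part of the databricks-extractor tool, the function name check_databricks_config is hypothetical, and treating host, port, and auth_type as mandatory is an assumption based on the template above.

```python
import configparser

# auth_type value -> (section name, keys expected in that section), per the template above.
AUTH_SECTIONS = {
    "personal_access_token": ("PersonalAccessTokenAuth", ["personal_access_token"]),
    "aws_service_principal": ("AWSServicePrincipalAuth", ["client_id", "client_secret"]),
}


def _is_unset(value: str) -> bool:
    """True if the value is empty or still a <placeholder> from the template."""
    stripped = value.strip()
    return not stripped or (stripped.startswith("<") and stripped.endswith(">"))


def check_databricks_config(text: str) -> list:
    """Return a list of problems found in the configuration text (empty if none)."""
    # Disable interpolation so secrets containing '%' are read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    if not parser.has_section("DatabricksConfig"):
        return ["missing [DatabricksConfig] section"]

    problems = []
    main = parser["DatabricksConfig"]
    # Assumption: these three keys are required.
    for key in ("host", "port", "auth_type"):
        if _is_unset(main.get(key, fallback="")):
            problems.append(f"[DatabricksConfig] {key} is not set")

    auth_type = main.get("auth_type", fallback="").strip()
    if auth_type and auth_type not in AUTH_SECTIONS:
        problems.append(f"unknown auth_type: {auth_type}")
    elif auth_type:
        section, keys = AUTH_SECTIONS[auth_type]
        if not parser.has_section(section):
            problems.append(f"missing [{section}] section")
        else:
            for key in keys:
                if _is_unset(parser[section].get(key, fallback="")):
                    problems.append(f"[{section}] {key} is not set")
    return problems


if __name__ == "__main__":
    with open("databricks.ini", encoding="utf-8") as handle:
        for problem in check_databricks_config(handle.read()):
            print(problem)
```

An empty result means the file has the expected shape; it does not confirm that the credentials themselves are valid.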
Secure credentials
Using local files
To specify the local files in your compose file:
secrets:
  databricks_config:
    file: ./databricks.ini
The secrets section is at the same top level as the services section described earlier. It is not a subsection of the services section.
Using Docker secrets
To create and use Docker secrets:
- Store the Databricks configuration file:
sudo docker secret create databricks_config path/to/databricks.ini
- At the top of your compose file, add a secrets element to access your secret:
secrets:
  databricks_config:
    external: true
    name: databricks_config
  - The name should be the same one you used in the docker secret create command above.
  - Once stored as a Docker secret, you can remove the local Databricks configuration file.
- Within the service section of the compose file, add a new secrets element and specify the name of the secret within your service to use it.
Example
Let's explain in detail with an example:
secrets:
  databricks_config:
    external: true
    name: databricks_config
x-templates:
  # ...
services:
  databricks-lineage-example:
    <<: *extract-lineage
    environment:
      <<: *databricks-defaults
      EXTRACT_QUERY_HISTORY: true
      QUERY_HISTORY_START_TIME_MS: 0
    volumes:
      - ./output/databricks-lineage-example:/output
    secrets:
      - databricks_config
- In this example, we've defined the secrets at the top of the file (you could also define them at the bottom). The databricks_config refers to an external Docker secret created using the docker secret create command.
- The name of this service is databricks-lineage-example. You can use any meaningful name you want.
- The <<: *databricks-defaults sets the connection type to Databricks.
- The ./output/databricks-lineage-example:/output line tells the extractor where to store results. In this example, the extractor will store results in the ./output/databricks-lineage-example directory on the local file system. We recommend you output the extracted lineage for different connections in separate directories.
- The secrets section within services tells the extractor which secrets to use for this service. Each of these refers to the name of a secret listed at the beginning of the compose file.