Once you have set up the databricks-extractor tool, you can extract lineage from your on-premises Databricks instances by completing the following steps.
To extract lineage for a specific Databricks connection using the databricks-extractor tool:
- Log into the server with Docker Compose installed.
- Change to the directory containing the compose file.
- Run Docker Compose:
sudo docker-compose up <connection-name>
(Replace <connection-name> with the name of the connection from the services section of the compose file.)
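For reference, each connection name corresponds to a service entry in the compose file. A minimal sketch of that layout, where the service name databricks-lineage-example, the image name, and the volume mapping are illustrative assumptions rather than the tool's actual configuration:

```yaml
services:
  databricks-lineage-example:      # the <connection-name> passed to docker-compose up
    image: databricks-extractor    # illustrative image name
    environment:
      EXTRACT_QUERY_HISTORY: "true"  # also extract query history (assumed location of this setting)
    volumes:
      - ./output:/output             # extracted JSON files land under ./output
```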
(Optional) Review generated files
The databricks-extractor tool generates a folder of JSON files for each service it extracts. For example, query history results are written to the extracted-query-history folder (if EXTRACT_QUERY_HISTORY is set to true).
You can inspect the lineage and usage metadata to confirm it is acceptable before providing it to Atlan.
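As a quick way to review the output, you can walk the output directory and count the files and records in each service folder. A minimal sketch, assuming the output/ layout shown in the example paths below and that each JSON file holds either a list of records or a single object (the record structure is an assumption):

```python
import json
from pathlib import Path

def summarize_output(output_dir: str) -> dict:
    """Count JSON files and total records in each service folder."""
    summary = {}
    for result in Path(output_dir).rglob("*.json"):
        folder = result.parent.name  # e.g. extracted-query-history
        with result.open() as f:
            data = json.load(f)
        # Assume each file holds a list of records or a single object.
        count = len(data) if isinstance(data, list) else 1
        files, records = summary.get(folder, (0, 0))
        summary[folder] = (files + 1, records + count)
    return summary

# Example: print a per-folder summary before uploading to S3.
for folder, (files, records) in summarize_output("output").items():
    print(f"{folder}: {files} file(s), {records} record(s)")
```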
Upload generated files to S3
To provide Atlan access to the extracted lineage and usage metadata, you will need to upload the metadata to an S3 bucket.
To upload the metadata to S3:
- Ensure that all files for a particular connection have the same prefix. For example,
output/databricks-lineage-example/extracted-query-history/result-0.json, and so on.
- Upload the files to the S3 bucket using your preferred method.
For example, to upload all files using the AWS CLI:
aws s3 cp output/databricks-lineage-example s3://my-bucket/metadata/databricks-lineage-example --recursive
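To preview how the --recursive copy maps local files to S3 keys (and so confirm that everything keeps the same prefix), you can sketch the mapping locally. The s3_keys helper below is a hypothetical illustration, not part of the tool:

```python
from pathlib import Path

def s3_keys(local_dir: str, bucket_prefix: str) -> list:
    """Map each extracted file to the S3 key it will get when uploaded
    with `aws s3 cp <local_dir> s3://<bucket>/<bucket_prefix> --recursive`."""
    base = Path(local_dir)
    return sorted(
        f"{bucket_prefix}/{p.relative_to(base).as_posix()}"
        for p in base.rglob("*") if p.is_file()
    )

# Example: list the keys the files above would receive.
for key in s3_keys("output/databricks-lineage-example",
                   "metadata/databricks-lineage-example"):
    print(key)
```

Every key starts with the same metadata/databricks-lineage-example prefix, which satisfies the same-prefix requirement above.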
Extract lineage in Atlan
Once you have extracted lineage on-premises and uploaded the results to S3, you can extract lineage in Atlan. Be sure to select Offline for the Extraction method.