Once you have crawled assets from Databricks, you can retrieve lineage from Unity Catalog and usage and popularity metrics from query history. This is supported for all three authentication methods: personal access token, AWS service principal, and Azure service principal.
To retrieve lineage and usage from Databricks, review the order of operations and then complete the following steps.
Select the extractor
To select the Databricks lineage and usage extractor:
- In the top right of any screen, navigate to New and then click New Workflow.
- From the filters along the top, click Miner.
- From the list of packages, select Databricks Miner and click on Setup Workflow.
Configure the lineage extractor
Choose your lineage extraction method:
- In REST API, Atlan connects to your database and extracts lineage directly.
- In Offline, you will need to first extract lineage yourself and make it available in S3.
- In System Table, Atlan connects to your database and queries system tables to extract lineage directly.
REST API
To configure the Databricks lineage extractor:
- For Connection, select the connection to extract. (To select a connection, the crawler must have already run.)
- Click Next to proceed.
Offline extraction method
Atlan supports the offline extraction method for extracting lineage from Databricks. This method uses Atlan's databricks-extractor tool to extract lineage. You will need to first extract lineage yourself and make it available in S3.
To enter your S3 details:
- For Connection, select the connection to extract. (To select a connection, the crawler must have already run.)
- For Bucket name, enter the name of your S3 bucket.
- For Bucket prefix, enter the S3 prefix under which all the metadata files exist. These include extracted-lineage/result-0.json, extracted-query-history/result-0.json, and so on.
- For Bucket region, enter the name of the S3 region.
- When complete, at the bottom of the screen, click Next.
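As a rough illustration of how the bucket prefix relates to the metadata files named above, the following sketch builds the object keys you might expect to find under a given prefix. The function name, the example prefix, and the assumption that files are numbered result-0.json, result-1.json, and so on are illustrative only; the actual layout is whatever the databricks-extractor tool produced.

```python
from posixpath import join

# Folder names taken from the examples in the text above; "and so on" in the
# text means this tuple may not be exhaustive.
EXAMPLE_FOLDERS = ("extracted-lineage", "extracted-query-history")


def expected_keys(prefix: str, parts: int = 1) -> list[str]:
    """Return the S3 object keys expected under `prefix`.

    `parts` is a hypothetical count of result files per folder; the
    result-<n>.json numbering beyond result-0.json is an assumption.
    """
    base = prefix.strip("/")  # tolerate leading or trailing slashes
    return [
        join(base, folder, f"result-{i}.json")
        for folder in EXAMPLE_FOLDERS
        for i in range(parts)
    ]


if __name__ == "__main__":
    # "metadata/databricks" is a made-up prefix for demonstration.
    for key in expected_keys("metadata/databricks/"):
        print(key)
```

Comparing a list like this against the actual contents of your bucket is one way to confirm that the prefix you enter in the form points at the right location.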
System table
To configure the Databricks lineage extractor:
- For Connection, select the connection to extract. (To select a connection, the crawler must have already run.)
- For SQL Warehouse ID, enter the ID you copied from your SQL warehouse.
- Click Next to proceed.
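The three extraction methods each ask for a different set of inputs. The sketch below summarizes them as a simple lookup and reports which inputs are still missing. It is only a checklist aid: the method keys and field names are hypothetical labels for the form fields described above, not identifiers from any Atlan API.

```python
# Hypothetical labels for the form fields described in this section.
REQUIRED_FIELDS = {
    "rest_api": ("connection",),
    "offline": ("connection", "bucket_name", "bucket_prefix", "bucket_region"),
    "system_table": ("connection", "sql_warehouse_id"),
}


def missing_fields(method: str, settings: dict) -> list[str]:
    """List the required fields for `method` that are absent or empty."""
    if method not in REQUIRED_FIELDS:
        raise ValueError(f"unknown extraction method: {method!r}")
    return [name for name in REQUIRED_FIELDS[method] if not settings.get(name)]


if __name__ == "__main__":
    # Example: a system table setup where the warehouse ID was not yet copied.
    print(missing_fields("system_table", {"connection": "databricks-prod"}))
```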
Configure the usage extractor
Atlan extracts usage and popularity metrics from query history. This feature is currently limited to queries on SQL warehouses — queries on interactive clusters are not supported. Additionally, expensive queries and compute costs for Databricks assets are currently unavailable due to limitations of the Databricks APIs.
Even if you have configured lineage extraction using system tables, Atlan will calculate usage and popularity from query history.
To configure the Databricks usage and popularity extractor:
- (Optional) For Fetch Query History and Calculate Popularity, click Yes to retrieve usage and popularity metrics for your Databricks assets from query history.
- For Popularity Window (days), set the number of days over which popularity is calculated. The maximum is 30 days, and you can set a shorter window.
- For Start time, choose the earliest date from which to mine query history. If you're using the offline extraction method to extract query history from Databricks, you can skip to the next step.
- For Excluded Users, type the names of users to be excluded while calculating usage metrics for Databricks assets. Press Enter after each name to add more names.
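To make the usage settings concrete, here is a minimal sketch of how a popularity window and an excluded-users list could be applied to query records. It does not describe how Atlan computes popularity internally. The record shape, the function names, and the assumption that the window counts back from a given day are all hypothetical; only the 30-day maximum comes from the text above.

```python
from datetime import date, timedelta

MAX_POPULARITY_WINDOW_DAYS = 30  # maximum stated in the text above


def popularity_window(days: int, today: date) -> tuple[date, date]:
    """Return (start, end) dates for a window of `days` ending at `today`.

    Counting back from `today` is an assumption made for illustration.
    """
    if not 1 <= days <= MAX_POPULARITY_WINDOW_DAYS:
        raise ValueError(
            f"popularity window must be 1 to {MAX_POPULARITY_WINDOW_DAYS} days"
        )
    return today - timedelta(days=days), today


def drop_excluded(queries: list[dict], excluded_users: set[str]) -> list[dict]:
    """Keep only queries whose hypothetical 'user' field is not excluded."""
    return [q for q in queries if q["user"] not in excluded_users]


if __name__ == "__main__":
    start, end = popularity_window(30, date(2024, 3, 31))
    print(start, end)
    sample = [{"user": "etl-bot", "sql": "SELECT 1"}, {"user": "ana", "sql": "SELECT 2"}]
    print(drop_excluded(sample, {"etl-bot"}))
```

Excluding service accounts in this way keeps automated jobs from inflating the usage counts of the assets they touch.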
Run the extractor
To run the Databricks lineage and popularity extractor, after completing the steps above:
- To check for any permissions or other configuration issues before running the extractor, click Preflight checks. This is currently only supported when using the REST API and offline extraction methods. If you're using system tables, skip to step 2.
- You can either:
  - Run the extractor once immediately: at the bottom of the screen, click the Run button.
  - Schedule the extractor to run hourly, daily, weekly, or monthly: at the bottom of the screen, click the Schedule Run button.
Once the extractor has completed running, you will see lineage for Databricks assets! 🎉