How to extract lineage and usage from Databricks

Once you have crawled assets from Databricks, you can retrieve lineage from Unity Catalog and usage and popularity metrics from query history.

🚨 Careful! Usage and popularity metrics can be retrieved for all Databricks users. However, your Databricks workspace must be Unity Catalog-enabled for lineage retrieval to succeed. You may also need to upgrade existing tables and views to Unity Catalog, as well as reach out to your Databricks account executive to enable lineage in Unity Catalog. (As of publishing, the feature is still in preview from Databricks on AWS and Azure.) Lineage extraction is currently not supported for AWS and Azure service principal authentication.

To retrieve lineage and usage from Databricks, review the order of operations and then complete the following steps.

Select the extractor

To select the Databricks lineage and usage extractor:

  1. In the top right of any screen, navigate to New and then click New Workflow.
  2. From the filters along the top, click Miner.
  3. From the list of packages, select Databricks Miner and click on Setup Workflow.

Configure the lineage extractor

Choose your lineage extraction method:

REST API

To configure the Databricks lineage extractor:

  1. For Connection, select the connection to extract. (To select a connection, the crawler must have already run.)
  2. Click Next to proceed.

Offline extraction method

Atlan supports the offline extraction method for extracting lineage from Databricks. This method uses Atlan's databricks-extractor tool to extract lineage. You will first need to extract lineage yourself and make it available in S3.

To enter your S3 details:

  1. For Connection, select the connection to extract. (To select a connection, the crawler must have already run.)
  2. For Bucket name, enter the name of your S3 bucket.
  3. For Bucket prefix, enter the S3 prefix under which all the metadata files exist. These include extracted-lineage/result-0.json, extracted-query-history/result-0.json, and so on.
  4. For Bucket region, enter the name of the S3 region.
  5. When complete, at the bottom of the screen, click Next.
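Before entering the bucket prefix, it can help to list the object keys that should exist under it. The following is a minimal sketch, not part of Atlan's tooling: the function name `expected_keys` and the example prefix `metadata/databricks/` are hypothetical. It covers only the two file paths named in the steps above; your extraction may produce further files.

```python
import posixpath

# The two relative paths named in the steps above. Further result files
# may exist; they are not listed here because the article does not name them.
KNOWN_RELATIVE_PATHS = [
    "extracted-lineage/result-0.json",
    "extracted-query-history/result-0.json",
]


def expected_keys(bucket_prefix: str) -> list[str]:
    """Return the full S3 object keys for the known files under a prefix.

    Leading and trailing slashes on the prefix are ignored, so
    "metadata/databricks" and "/metadata/databricks/" behave the same.
    """
    prefix = bucket_prefix.strip("/")
    if not prefix:
        # Files sit at the root of the bucket.
        return list(KNOWN_RELATIVE_PATHS)
    return [posixpath.join(prefix, rel) for rel in KNOWN_RELATIVE_PATHS]


if __name__ == "__main__":
    # Hypothetical prefix, for illustration only.
    for key in expected_keys("metadata/databricks/"):
        print(key)
```

Comparing this list against the contents of your bucket before running the workflow may catch a mistyped prefix early.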

Configure the usage extractor

Atlan extracts usage and popularity metrics from query history. This feature is currently limited to queries on SQL warehouses; queries on interactive clusters are not supported. Additionally, expensive queries and compute costs for Databricks assets are currently unavailable due to limitations of the Databricks APIs.

To configure the Databricks usage and popularity extractor:

  • (Optional) For Fetch Query History and Calculate Popularity, click Yes to retrieve usage and popularity metrics for your Databricks assets from query history.
    • For Popularity Window (days), enter the number of days to include when calculating popularity. The maximum is 30 days; you can set a shorter window.
    • For Start time, choose the earliest date from which to mine query history. If you're using the offline extraction method to extract query history from Databricks, you can skip to the next step.
    • For Excluded Users, type the names of users to exclude when calculating usage metrics for Databricks assets. Press Enter after each name to add more names.
🚨 Careful! If running the miner for the first time, Atlan recommends setting a start date around three days prior to the current date and then scheduling it daily to build up to two weeks of query history. Mining two weeks of query history on the first miner run may cause delays. For all subsequent runs, Atlan requires a minimum lag of 24 to 48 hours to capture all the relevant transformations that were part of a session. Learn more about the miner logic here.
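The date arithmetic behind these settings can be sketched as follows. This is an illustrative helper only, not Atlan's implementation: the function names are hypothetical, and the only values taken from this article are the 30-day maximum for the popularity window and the suggested start date of about three days before the current date for a first run.

```python
from datetime import date, timedelta

# Maximum popularity window stated in this article.
MAX_POPULARITY_WINDOW_DAYS = 30


def first_run_start_date(today: date, days_back: int = 3) -> date:
    """Return a start date `days_back` days before `today`.

    The default of 3 mirrors the recommendation for a first miner run.
    """
    return today - timedelta(days=days_back)


def clamp_popularity_window(days: int) -> int:
    """Limit a requested popularity window to the 30-day maximum."""
    if days < 1:
        raise ValueError("popularity window must be at least 1 day")
    return min(days, MAX_POPULARITY_WINDOW_DAYS)


if __name__ == "__main__":
    print(first_run_start_date(date.today()))
    print(clamp_popularity_window(45))
```

Passing `today` explicitly, rather than reading the clock inside the function, keeps the calculation easy to check for a specific date.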

Run the extractor

To run the Databricks lineage and popularity extractor, after completing the steps above:

  1. To check for any permissions or other configuration issues before running the extractor, click Preflight checks.
  2. You can either:
    • To run the extractor once immediately, at the bottom of the screen, click the Run button.
    • To schedule the extractor to run hourly, daily, weekly, or monthly, at the bottom of the screen, click the Schedule Run button.

Once the extractor has completed running, you will see lineage for Databricks assets! 🎉
