How to extract lineage and usage from Databricks

Once you have crawled assets from Databricks, you can retrieve lineage from Unity Catalog and usage and popularity metrics from query history.

🚨 Careful! Usage and popularity metrics can be retrieved for all Databricks users. However, your Databricks workspace must be Unity Catalog-enabled for lineage retrieval to succeed. You may also need to upgrade existing tables and views to Unity Catalog and contact your Databricks account executive to enable lineage in Unity Catalog. (As of publishing, the feature is still in preview from Databricks on AWS and Azure.)

To retrieve lineage and usage from Databricks, review the order of operations and then complete the following steps.

Select the extractor

To select the Databricks lineage and usage extractor:

  1. In the top right of any screen, navigate to New and then click New Workflow.
  2. From the filters along the top, click Miner.
  3. From the list of packages, select Databricks Lineage and click Setup Workflow.

Configure the lineage extractor

To configure the Databricks lineage extractor:

  1. For Connection, select the connection to extract. (To select a connection, the crawler must have already run.)
  2. Click Next to proceed.

Configure the usage extractor

Atlan extracts usage and popularity metrics from query history. This feature is currently limited to queries on SQL warehouses; queries on interactive clusters are not supported. Additionally, expensive queries and compute costs for Databricks assets are currently unavailable due to limitations of the Databricks APIs.
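
To make that scope concrete, here is a minimal Python sketch of how popularity could be derived from query history: count queries per asset and skip anything that did not run on a SQL warehouse. The record fields (`user`, `asset`, `compute_type`), the sample data, and the function name are hypothetical and chosen only for illustration. They are not Atlan's or Databricks' actual schema or logic.

```python
from collections import Counter

# Hypothetical query-history records; field names and values are assumptions.
QUERY_HISTORY = [
    {"user": "ana", "asset": "sales.orders", "compute_type": "sql_warehouse"},
    {"user": "ben", "asset": "sales.orders", "compute_type": "sql_warehouse"},
    {"user": "ana", "asset": "sales.customers", "compute_type": "interactive_cluster"},
    {"user": "cy", "asset": "sales.customers", "compute_type": "sql_warehouse"},
]


def count_queries_per_asset(records):
    """Count queries per asset, keeping only queries run on a SQL warehouse."""
    counts = Counter()
    for record in records:
        if record["compute_type"] != "sql_warehouse":
            continue  # interactive-cluster queries are out of scope
        counts[record["asset"]] += 1
    return counts


popularity = count_queries_per_asset(QUERY_HISTORY)
print(popularity.most_common())
```

In this sketch, the query against `sales.customers` on an interactive cluster does not contribute to that asset's count.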

To configure the Databricks usage and popularity extractor:

  • (Optional) For Fetch Query History and Calculate Popularity, click Yes to retrieve usage and popularity metrics for your Databricks assets from query history.
    • For Popularity Window (days), enter the number of days of query history to use when calculating popularity. The maximum is 30 days; you can set a shorter window.
    • For Start time, choose the earliest date from which to mine query history.
    • For Excluded Users, type the names of users to exclude when calculating usage metrics for Databricks assets. Press Enter after each name to add another.
🚨 Careful! If running the miner for the first time, Atlan recommends setting a start date around three days prior to the current date and then scheduling it daily to build up to two weeks of query history. Mining two weeks of query history on the first miner run may cause delays. For all subsequent runs, Atlan requires a minimum lag of 24 to 48 hours to capture all the relevant transformations that were part of a session. Learn more about the miner logic here.
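
The settings above can be sanity-checked before you fill in the form. The following Python sketch validates a popularity window against the 30-day maximum and computes a first-run start date three days before today, as recommended. The function names, the constants, and the lower bound of one day are assumptions made for illustration. None of this is part of Atlan's product or API.

```python
from datetime import date, timedelta

MAX_POPULARITY_WINDOW_DAYS = 30  # maximum stated in this article
FIRST_RUN_LOOKBACK_DAYS = 3      # "around three days" for the first miner run


def validate_popularity_window(days):
    """Return days if it is within the allowed window, else raise ValueError.

    The lower bound of 1 is an assumption; the article only states the maximum.
    """
    if not 1 <= days <= MAX_POPULARITY_WINDOW_DAYS:
        raise ValueError(
            f"Popularity window must be between 1 and "
            f"{MAX_POPULARITY_WINDOW_DAYS} days, got {days}"
        )
    return days


def first_run_start_date(today):
    """Suggest a start date for the first miner run: three days before today."""
    return today - timedelta(days=FIRST_RUN_LOOKBACK_DAYS)


if __name__ == "__main__":
    print("Popularity window:", validate_popularity_window(14))
    print("First-run start date:", first_run_start_date(date.today()).isoformat())
```

A short first run followed by a daily schedule lets query history accumulate gradually, rather than mining two weeks in a single run.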

Run the extractor

To run the Databricks lineage and popularity extractor, after completing the steps above:

  1. To check for any permissions or other configuration issues before running the extractor, click Preflight checks.
  2. You can either:
    • To run the extractor once immediately, at the bottom of the screen, click the Run button.
    • To schedule the extractor to run hourly, daily, weekly, or monthly, at the bottom of the screen, click the Schedule Run button.

Once the extractor has completed running, you will see lineage for Databricks assets! 🎉
