How to crawl on-premises Databricks

Updated January 09, 2024 11:51

Once you have set up the databricks-extractor tool, you can extract metadata from your on-premises Databricks instances by completing the following steps.

Run databricks-extractor

Crawl all Databricks connections

To crawl all Databricks connections using the databricks-extractor tool:

Log in to the server where Docker Compose is installed.
Change to the directory containing the compose file.
Run Docker Compose: sudo docker-compose up
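
The steps above can be sketched as a small shell helper. This is only an illustration: the function name `crawl_all`, the `COMPOSE_CMD` override, and the directory path in the usage line are hypothetical. Only `sudo docker-compose up` comes from the steps themselves.

```shell
# Hypothetical helper: change to the compose directory and start every service.
# COMPOSE_CMD is an assumed override; set it to "echo" to preview the command
# without running anything.
crawl_all() {
  local compose_dir="$1"
  local compose_cmd="${COMPOSE_CMD:-sudo docker-compose}"
  # Run in a subshell so the caller's working directory is left unchanged.
  ( cd "$compose_dir" && $compose_cmd up )
}
```

For example, `crawl_all /path/to/compose-directory` would run the crawl, where the path is a placeholder for wherever your compose file lives.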

Crawl a specific connection

To crawl a specific Databricks connection using the databricks-extractor tool:

Log in to the server where Docker Compose is installed.
Change to the directory containing the compose file.
Run Docker Compose: sudo docker-compose up <connection-name>

(Replace <connection-name> with the name of the connection from the services section of the compose file.)
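
If you are unsure which names are available, `docker-compose config --services` prints the service names from the compose file, one per line. The sketch below wraps that lookup and the single-connection crawl. The function names and the `COMPOSE_CMD` override are hypothetical and exist only for illustration.

```shell
# Hypothetical helpers; run them from the directory containing the compose file.
# COMPOSE_CMD is an assumed override (for example "echo") for previewing commands.

# Print the service (connection) names defined in the compose file.
list_connections() {
  local compose_cmd="${COMPOSE_CMD:-sudo docker-compose}"
  $compose_cmd config --services
}

# Crawl a single connection. Refuse to run without a name, because
# "up" with no service name starts every service in the compose file.
crawl_connection() {
  local name="$1"
  local compose_cmd="${COMPOSE_CMD:-sudo docker-compose}"
  if [ -z "$name" ]; then
    echo "usage: crawl_connection <connection-name>" >&2
    return 2
  fi
  $compose_cmd up "$name"
}
```

A typical session might call `list_connections` first, then pass one of the printed names to `crawl_connection`.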

(Optional) Review generated files

The databricks-extractor tool generates a set of folders containing JSON files for each service. For example:

catalogs
schemas
tables

You can inspect these files to confirm that the metadata is acceptable to share with Atlan.
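
As a starting point for that review, the following sketch counts the JSON files in each top-level folder of one connection's output and flags zero-byte files. The function name `review_output` is hypothetical, and the check is deliberately shallow: it does not validate the JSON content itself.

```shell
# Hypothetical helper: summarize the JSON files under one connection's output
# directory, for example output/databricks-example.
review_output() {
  local root="$1"
  local dir count empty
  for dir in "$root"/*/; do
    [ -d "$dir" ] || continue
    # Count JSON files at any depth below this folder.
    count=$(find "$dir" -type f -name '*.json' | wc -l | tr -d ' ')
    echo "$(basename "$dir"): $count JSON file(s)"
  done
  # Flag zero-byte JSON files, which likely deserve a closer look.
  empty=$(find "$root" -type f -name '*.json' -size 0)
  if [ -n "$empty" ]; then
    echo "empty files:"
    echo "$empty"
    return 1
  fi
}
```

Running `review_output output/databricks-example` prints one line per folder, such as catalogs, schemas, and tables, and returns a non-zero status if any empty file is found.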

Upload generated files to S3

To give Atlan access to the extracted metadata, upload it to an S3 bucket.

💪 Did you know? We recommend uploading to the same S3 bucket that Atlan uses to avoid access issues. Reach out to your Data Success Manager for the details of your Atlan bucket. To create your own bucket instead, refer to the Create your own S3 bucket section of the dbt documentation. (The steps are exactly the same.)

To upload the metadata to S3:

Ensure that all files for a particular connection have the same prefix. For example, output/databricks-example/catalogs/success/result-0.json, output/databricks-example/schemas/{{catalog_name}}/success/result-0.json, output/databricks-example/tables/{{catalog_name}}/success/result-0.json, and so on.
Upload the files to the S3 bucket using your preferred method.

For example, to upload all files using the AWS CLI:

aws s3 cp output/databricks-example s3://my-bucket/metadata/databricks-example --recursive
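
Before copying, it can help to preview what would be uploaded. The AWS CLI's `--dryrun` flag prints the operations without performing them. The sketch below wraps the command above with that flag. The function name `preview_upload` and the `AWS_CMD` override are hypothetical, and the bucket and paths are the placeholders from the example.

```shell
# Hypothetical helper: show what the recursive copy would upload.
# AWS_CMD is an assumed override (for example "echo") for inspecting the command.
preview_upload() {
  local src="$1"    # for example: output/databricks-example
  local dest="$2"   # for example: s3://my-bucket/metadata/databricks-example
  local aws_cmd="${AWS_CMD:-aws}"
  # --dryrun lists the planned operations without copying anything.
  $aws_cmd s3 cp "$src" "$dest" --recursive --dryrun
}
```

If the preview from `preview_upload output/databricks-example s3://my-bucket/metadata/databricks-example` looks right, run the original command without `--dryrun` to perform the upload.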

Crawl metadata in Atlan

Once you have extracted metadata on-premises and uploaded the results to S3, you can crawl the metadata into Atlan:

How to crawl Databricks

Be sure to select Offline for the Extraction method.

Related articles

How to set up on-premises database access

How to crawl on-premises databases

Supported connections for on-premises databases

Troubleshooting on-premises database connectivity

How to set up on-premises Databricks access

How to crawl on-premises Databricks

How to connect on-premises databases to Kubernetes