Atlan extracts job-level operational metadata from Apache Spark and generates job lineage through OpenLineage.
To integrate Apache Spark/OpenLineage with Atlan, review the order of operations and then complete the following steps.
Create an API token in Atlan
Before running the workflow, you will need to create an API token in Atlan.
Select the source in Atlan
To select Apache Spark/OpenLineage as your source, from within Atlan:
- In the top right of any screen, click New and then click New workflow.
- From the filters along the top, click Data processing.
- From the list of packages, select Spark Assets and then click Setup Workflow.
Configure the integration in Atlan
To configure the Apache Spark/OpenLineage connection, from within Atlan:
- For Connection Name, provide a connection name that represents your source environment. For example, you might use values like production, development, gold, or analytics.
- (Optional) To change the users who are able to manage this connection, change the users or groups listed under Connection Admins.
🚨 Careful! If you do not specify any user or group, no one will be able to manage the connection — not even admins.
- To run the workflow, at the bottom of the screen, click the Run button.
Configure the integration in Apache Spark
Spark has a default SparkListener interface that OpenLineage leverages to collect information about Spark jobs.
To configure Apache Spark to send OpenLineage events to Atlan:
- To activate the listener, add the following properties to your Spark configuration:
```python
from pyspark.sql import SparkSession

# Initialize Spark session with the OpenLineage listener enabled
spark = (
    SparkSession.builder.master('local')
    .appName("SparkJobs")
    .config('spark.jars.packages', "io.openlineage:openlineage-spark:1.5.0")
    .config('spark.extraListeners', 'io.openlineage.spark.agent.OpenLineageSparkListener')
    .config('spark.openlineage.transport.type', 'http')
    .config('spark.openlineage.transport.url', 'https://<instance>.atlan.com')
    .config('spark.openlineage.transport.endpoint', '/events/openlineage/spark/api/v1/lineage')
    .config('spark.openlineage.namespace', '<connection-name>')
    .config('spark.openlineage.transport.auth.type', 'api_key')
    .config('spark.openlineage.transport.auth.apiKey', '<Atlan_api_key>')
    .config('spark.openlineage.facets.disabled', '[spark.logicalPlan;]')
    .config('spark.openlineage.debugFacet', 'enabled')
    .getOrCreate()
)
```
- Atlan recommends using the latest available version of the OpenLineage package for the Apache Spark integration.
url: set the URL of your Atlan instance, for example https://<instance>.atlan.com.
endpoint: points to the service that will consume OpenLineage events, for example /events/openlineage/spark/api/v1/lineage.
namespace: set the connection name as exactly configured in Atlan.
apiKey: set the API token generated in Atlan.
logicalPlan: disable this facet — this is only used to debug OpenLineage issues and has a performance cost while processing events.
debugFacet: enable this facet to retrieve debug logs for Spark jobs.
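If you set these properties programmatically, it can help to assemble them in one place before building the session. The helper below is a hypothetical sketch (not part of OpenLineage or Atlan) that collects the spark.openlineage.* properties described above from the three values you supply:

```python
def openlineage_conf(instance_url, connection_name, api_key):
    """Build the Spark properties for sending OpenLineage events to Atlan.

    Hypothetical helper for illustration; the property names match the
    configuration shown above, while the three arguments are the
    deployment-specific values.
    """
    return {
        "spark.extraListeners": "io.openlineage.spark.agent.OpenLineageSparkListener",
        "spark.openlineage.transport.type": "http",
        "spark.openlineage.transport.url": instance_url,
        "spark.openlineage.transport.endpoint": "/events/openlineage/spark/api/v1/lineage",
        "spark.openlineage.namespace": connection_name,
        "spark.openlineage.transport.auth.type": "api_key",
        "spark.openlineage.transport.auth.apiKey": api_key,
        # Disable the logicalPlan facet: it is only needed for debugging
        # and adds processing overhead per event.
        "spark.openlineage.facets.disabled": "[spark.logicalPlan;]",
    }
```

You can then apply the dictionary to a session builder in a loop, for example `for key, value in conf.items(): builder = builder.config(key, value)`.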
- Alternatively, add the above configuration to your cluster's spark-defaults.conf file, or pass it to specific jobs on submission via the spark-submit command.
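As a sketch of the spark-defaults.conf option, the entries might look like the following (the package version, instance URL, connection name, and API key are placeholders to replace with your own values):

```properties
spark.jars.packages                     io.openlineage:openlineage-spark:1.5.0
spark.extraListeners                    io.openlineage.spark.agent.OpenLineageSparkListener
spark.openlineage.transport.type        http
spark.openlineage.transport.url         https://<instance>.atlan.com
spark.openlineage.transport.endpoint    /events/openlineage/spark/api/v1/lineage
spark.openlineage.namespace             <connection-name>
spark.openlineage.transport.auth.type   api_key
spark.openlineage.transport.auth.apiKey <Atlan_api_key>
spark.openlineage.facets.disabled       [spark.logicalPlan;]
```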
Once your Spark jobs have finished running, you will see them along with lineage from OpenLineage events in Atlan! 🎉