Atlan extracts job-level operational metadata from Apache Spark and generates job lineage through OpenLineage. To learn more about OpenLineage, refer to OpenLineage configuration and facets.
To integrate Apache Spark/OpenLineage with Atlan, review the order of operations and then complete the following steps.
Create an API token in Atlan
Before running the workflow, you will need to create an API token in Atlan.
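The token created here is used later in the Spark configuration. One option, sketched below, is to expose it to your jobs through an environment variable rather than hard-coding it in scripts; `ATLAN_API_KEY` is an assumed variable name, not one Atlan itself defines:

```python
import os

# Read the Atlan API token from an environment variable so it never
# appears in job scripts. ATLAN_API_KEY is an assumed name; falls back
# to the documented placeholder if the variable is unset.
api_key = os.environ.get("ATLAN_API_KEY", "<Atlan_api_key>")
```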
Select the source in Atlan
To select Apache Spark/OpenLineage as your source, from within Atlan:
- In the top right of any screen, click New and then click New workflow.
- From the list of packages, select Spark Assets and then click Setup Workflow.
Configure the integration in Atlan
You only need to create a connection once to enable Atlan to receive incoming OpenLineage events. Once the connection is set up, you do not need to rerun or schedule the workflow. Atlan processes OpenLineage events as your jobs run to catalog your assets.
To configure the Apache Spark/OpenLineage connection, from within Atlan:
- For Connection Name, provide a connection name that represents your source environment. For example, you might use values like `production`, `development`, `gold`, or `analytics`.
- (Optional) To change the users who are able to manage this connection, change the users or groups listed under Connection Admins.
🚨 Careful! If you do not specify any user or group, no one will be able to manage the connection — not even admins.
- To run the workflow, at the bottom of the screen, click the Run button.
Configure the integration in Apache Spark
Spark exposes a SparkListener interface that OpenLineage uses to collect information about Spark jobs.
To configure Apache Spark to send OpenLineage events to Atlan:
- To activate the listener, add the following properties to your Spark configuration:
```python
# Initialize Spark session with the OpenLineage listener and transport settings
spark = (
    SparkSession.builder.master("local")
    .appName("SparkJobs")
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:<latest OpenLineage version>")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "https://<instance>.atlan.com")
    .config("spark.openlineage.transport.endpoint", "/events/openlineage/spark/api/v1/lineage")
    .config("spark.openlineage.namespace", "<connection-name>")
    .config("spark.openlineage.transport.auth.type", "api_key")
    .config("spark.openlineage.transport.auth.apiKey", "<Atlan_api_key>")
    .getOrCreate()
)
```
- Atlan recommends using the latest available version of the OpenLineage package for the Apache Spark integration. Replace `<latest OpenLineage version>` with the latest version of OpenLineage.
- `url`: set the URL of your Atlan instance, for example `https://<instance>.atlan.com`.
- `endpoint`: points to the service that will consume OpenLineage events, for example `/events/openlineage/spark/api/v1/lineage`.
- `namespace`: set the connection name exactly as configured in Atlan.
- `apiKey`: set the API token generated in Atlan.
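Taken together, the properties above can be assembled in ordinary Python before building the session, which keeps the placeholder substitution in one place. A sketch, where `instance`, `connection_name`, and `api_key` are illustrative values you must replace with your own:

```python
# Placeholder values -- replace with your own (these names are illustrative)
instance = "tenant"             # your Atlan instance subdomain
connection_name = "production"  # must match the connection name in Atlan
api_key = "<Atlan_api_key>"     # API token generated in Atlan

# OpenLineage properties, keyed exactly as Spark expects them
openlineage_conf = {
    "spark.jars.packages": "io.openlineage:openlineage-spark:<latest OpenLineage version>",
    "spark.extraListeners": "io.openlineage.spark.agent.OpenLineageSparkListener",
    "spark.openlineage.transport.type": "http",
    "spark.openlineage.transport.url": f"https://{instance}.atlan.com",
    "spark.openlineage.transport.endpoint": "/events/openlineage/spark/api/v1/lineage",
    "spark.openlineage.namespace": connection_name,
    "spark.openlineage.transport.auth.type": "api_key",
    "spark.openlineage.transport.auth.apiKey": api_key,
}
```

The dict can then be applied pair by pair via repeated `.config(key, value)` calls on the session builder.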
- Add the above configuration to your cluster's `spark-defaults.conf` file, or to specific jobs on submission via the `spark-submit` command.
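For cluster-wide setup, the same properties can be expressed in `spark-defaults.conf` form; a sketch using the same placeholders as above:

```
spark.jars.packages                      io.openlineage:openlineage-spark:<latest OpenLineage version>
spark.extraListeners                     io.openlineage.spark.agent.OpenLineageSparkListener
spark.openlineage.transport.type         http
spark.openlineage.transport.url          https://<instance>.atlan.com
spark.openlineage.transport.endpoint     /events/openlineage/spark/api/v1/lineage
spark.openlineage.namespace              <connection-name>
spark.openlineage.transport.auth.type    api_key
spark.openlineage.transport.auth.apiKey  <Atlan_api_key>
```

For a single job, the same keys can instead be passed to `spark-submit` as repeated `--conf key=value` flags.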
Once your Spark jobs have finished running, you will see them along with lineage from OpenLineage events in Atlan! 🎉
You can also view event logs in Atlan to track and debug events received from OpenLineage.