How to integrate Apache Spark/OpenLineage

Atlan extracts job-level operational metadata from Apache Spark and generates job lineage through OpenLineage. To learn more about OpenLineage, refer to OpenLineage configuration and facets.

To integrate Apache Spark/OpenLineage with Atlan, review the order of operations and then complete the following steps.

Create an API token in Atlan

Before running the workflow, you will need to create an API token in Atlan.

Select the source in Atlan

To select Apache Spark/OpenLineage as your source, from within Atlan:

  1. In the top right of any screen, click New and then click New workflow.
  2. From the list of packages, select Spark Assets and then click Setup Workflow.

Configure the integration in Atlan

You only need to create the connection once to enable Atlan to receive incoming OpenLineage events. Once the connection is set up, you do not need to rerun or schedule the workflow; Atlan processes OpenLineage events as your jobs run and catalogs your assets.

To configure the Apache Spark/OpenLineage connection, from within Atlan:

  1. For Connection Name, provide a connection name that represents your source environment. For example, you might use values like production, development, gold, or analytics.
  2. (Optional) To change the users who are able to manage this connection, change the users or groups listed under Connection Admins.
    🚨 Careful! If you do not specify any user or group, no one will be able to manage the connection — not even admins.
  3. To run the workflow, at the bottom of the screen, click the Run button.

Configure the integration in Apache Spark

💪 Did you know? You will need the Atlan API token and connection name to configure the integration in Apache Spark/OpenLineage. These values allow Apache Spark to send OpenLineage events to Atlan.

Spark exposes a SparkListener interface that OpenLineage uses to collect information about Spark jobs as they run.

To configure Apache Spark to send OpenLineage events to Atlan, you can either:

  • Activate the listener by adding the following properties to your Spark session configuration:
    # Initialize a Spark session with the OpenLineage listener enabled
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder.master("local")
             .appName("SparkJobs")
             # Pull in the OpenLineage Spark integration package
             .config("spark.jars.packages", "io.openlineage:openlineage-spark:<latest OpenLineage version>")
             # Register the listener that emits OpenLineage events
             .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
             # Send events to Atlan over HTTP
             .config("spark.openlineage.transport.type", "http")
             .config("spark.openlineage.transport.url", "https://<instance>.atlan.com")
             .config("spark.openlineage.transport.endpoint", "/events/openlineage/spark/api/v1/lineage")
             # Must exactly match the connection name configured in Atlan
             .config("spark.openlineage.namespace", "<connection-name>")
             # Authenticate with the API token generated in Atlan
             .config("spark.openlineage.transport.auth.type", "api_key")
             .config("spark.openlineage.transport.auth.apiKey", "<Atlan_api_key>")
             .getOrCreate())
    • Replace <latest OpenLineage version> with the latest available version of the OpenLineage package; Atlan recommends keeping the Apache Spark integration on the latest release.
    • url: set the URL of your Atlan instance — for example, https://<instance>.atlan.com.
    • endpoint: points to the service that will consume OpenLineage events — for example, /events/openlineage/spark/api/v1/lineage.
    • namespace: set this to the connection name exactly as configured in Atlan.
    • apiKey: set the API token generated in Atlan.
  • Alternatively, add the same configuration to your cluster’s spark-defaults.conf file, or pass it to specific jobs at submission time via the spark-submit command (see the sketch after this list).
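
If you take the spark-defaults.conf route, the same properties translate one-for-one into the file’s whitespace-separated key-value format. A minimal sketch, using the same placeholder values as above:

    spark.jars.packages                      io.openlineage:openlineage-spark:<latest OpenLineage version>
    spark.extraListeners                     io.openlineage.spark.agent.OpenLineageSparkListener
    spark.openlineage.transport.type         http
    spark.openlineage.transport.url          https://<instance>.atlan.com
    spark.openlineage.transport.endpoint     /events/openlineage/spark/api/v1/lineage
    spark.openlineage.namespace              <connection-name>
    spark.openlineage.transport.auth.type    api_key
    spark.openlineage.transport.auth.apiKey  <Atlan_api_key>

For per-job configuration, each of these can instead be passed to spark-submit as a --conf key=value pair.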

Once your Spark jobs have finished running, you will see them in Atlan along with lineage from the OpenLineage events! 🎉
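
For example, a minimal job like the following is enough to produce lineage. This is an illustrative sketch only: the file paths and column name are placeholders, and spark is the session configured above.

    # Illustrative job: read a CSV dataset, filter it, and write Parquet output.
    # OpenLineage reports the input and output datasets, which Atlan renders as lineage.
    orders = spark.read.option("header", "true").csv("/data/raw/orders.csv")
    recent = orders.filter(orders.order_date >= "2024-01-01")
    recent.write.mode("overwrite").parquet("/data/curated/recent_orders")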

You can also view event logs in Atlan to track and debug events received from OpenLineage.
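
If events do not appear, one way to isolate the problem is to post a hand-built OpenLineage event to the same endpoint, bypassing Spark entirely. The sketch below is a smoke test under stated assumptions, not part of Atlan’s documented API surface: it assumes the endpoint accepts a standard OpenLineage RunEvent with a bearer token (matching the api_key auth type above), and the job name and producer values are placeholders.

    import json
    import uuid
    from datetime import datetime, timezone

    import requests

    ATLAN_URL = "https://<instance>.atlan.com"            # your Atlan instance
    ENDPOINT = "/events/openlineage/spark/api/v1/lineage"
    API_KEY = "<Atlan_api_key>"                           # API token generated in Atlan

    # Minimal OpenLineage RunEvent; the namespace must match the Atlan connection name.
    event = {
        "eventType": "START",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "<connection-name>", "name": "smoke_test_job"},  # hypothetical job name
        "producer": "https://example.com/smoke-test",  # placeholder producer URI
        "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/RunEvent",
    }

    # Assumption: the api_key auth type is sent as a bearer token, as in
    # OpenLineage's HTTP transport.
    response = requests.post(
        ATLAN_URL + ENDPOINT,
        headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
        data=json.dumps(event),
    )
    print(response.status_code, response.text)

If the request returns a success status, the connection, endpoint, and token are all working, and any remaining issue is on the Spark side.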
