How to integrate Apache Spark/OpenLineage

🧪 Preview feature! This feature is available for your experimentation, and we'd love your feedback. It may change before its final generally-available form. If you'd like to participate in the preview, reach out to your customer success manager for more information.

Atlan extracts job-level operational metadata from Apache Spark and generates job lineage through OpenLineage.

To integrate Apache Spark/OpenLineage with Atlan, review the order of operations and then complete the following steps.

Create an API token in Atlan

Before running the workflow, you will need to create an API token in Atlan.

Select the source in Atlan

To select Apache Spark/OpenLineage as your source, from within Atlan:

  1. In the top right of any screen, click New and then click New workflow.
  2. From the filters along the top, click Data processing.
  3. From the list of packages, select Spark Assets and then click Setup Workflow.

Configure the integration in Atlan

To configure the Apache Spark/OpenLineage connection, from within Atlan:

  1. For Connection Name, provide a connection name that represents your source environment. For example, you might use values like production, development, gold, or analytics.
  2. (Optional) To change the users who are able to manage this connection, change the users or groups listed under Connection Admins.
    🚨 Careful! If you do not specify any user or group, no one will be able to manage the connection — not even admins.
  3. To run the workflow, at the bottom of the screen, click the Run button.

Configure the integration in Apache Spark

💪 Did you know? You will need the Atlan API token and connection name to configure the integration in Apache Spark/OpenLineage. This will allow Apache Spark to connect with the OpenLineage API and send events to Atlan.

Spark provides a SparkListener interface that OpenLineage leverages to collect information about Spark jobs.

To configure Apache Spark to send OpenLineage events to Atlan, you can either:

  • To activate the listener, add the following properties to your Spark configuration:
    # Initialize a Spark session with the OpenLineage listener enabled
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder.master('local')
             .config('spark.jars.packages', 'io.openlineage:openlineage-spark:1.5.0')
             .config('spark.extraListeners', 'io.openlineage.spark.agent.OpenLineageSparkListener')
             .config('spark.openlineage.transport.type', 'http')
             .config('spark.openlineage.transport.url', 'https://<instance>')
             .config('spark.openlineage.transport.endpoint', '/events/openlineage/spark/api/v1/lineage')
             .config('spark.openlineage.namespace', '<connection-name>')
             .config('spark.openlineage.transport.auth.type', 'api_key')
             .config('spark.openlineage.transport.auth.apiKey', '<Atlan_api_key>')
             .config('spark.openlineage.facets.disabled', '[spark.logicalPlan;]')
             .config('spark.openlineage.debugFacet', 'enabled')
             .getOrCreate())
    • Atlan recommends using the latest available version of the OpenLineage package for the Apache Spark integration.
    • url: set the URL of your Atlan instance, for example, https://<instance>
    • endpoint: points to the service that will consume OpenLineage events, for example, /events/openlineage/spark/api/v1/lineage.
    • namespace: set the connection name exactly as configured in Atlan.
    • apiKey: set the API token generated in Atlan.
    • logicalPlan: disable this facet; it is only used to debug OpenLineage issues and has a performance cost while processing events.
    • debugFacet: enable this facet to retrieve debug logs for Spark jobs.
  • Add the above configuration to your cluster’s spark-defaults.conf file, or apply it to specific jobs on submission via the spark-submit command.
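For reference, the same properties can be expressed in spark-defaults.conf form. This is a sketch only; the instance URL, connection name, and API key are placeholders that you must replace with your own values:

    # spark-defaults.conf: OpenLineage settings (replace placeholder values)
    spark.jars.packages                      io.openlineage:openlineage-spark:1.5.0
    spark.extraListeners                     io.openlineage.spark.agent.OpenLineageSparkListener
    spark.openlineage.transport.type         http
    spark.openlineage.transport.url          https://<instance>
    spark.openlineage.transport.endpoint     /events/openlineage/spark/api/v1/lineage
    spark.openlineage.namespace              <connection-name>
    spark.openlineage.transport.auth.type    api_key
    spark.openlineage.transport.auth.apiKey  <Atlan_api_key>
    spark.openlineage.facets.disabled        [spark.logicalPlan;]
    spark.openlineage.debugFacet             enabled

When configuring a single job at submission time instead, each of these properties can be passed to spark-submit as a separate --conf flag, for example --conf "spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener".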

Once your Spark jobs have finished running, you will see them in Atlan, along with lineage generated from the OpenLineage events! 🎉
