How to mine queries through S3

Once you have crawled assets from a supported connector, you can mine query history.

For each of its supported connectors, Atlan supports mining query history via S3. This is useful when you have files that hold query history beyond what the source itself retains.

To mine query history from S3 for these sources, complete the following steps.

Structure the query files

To make the query history files available for Atlan, ensure the files:

  • Use a .json extension.
  • Are present in a single S3 bucket and prefix (directory).

To structure the contents of the files for Atlan, ensure:

  • Each line is a single JSON value. (The JSON object cannot be pretty-formatted or span multiple lines.)
  • Each SQL query is on its own line.
  • Commas are not used to separate the lines.
💪 Did you know? You can also provide a default database and schema, as well as a session ID, in the JSON:
  • If a SQL query has only the name of the table or view it queries, Atlan will use the default database and schema to generate lineage for the query.
  • Including the session ID speeds up lineage processing. If provided, ensure that all queries belonging to the same session are next to each other in the file.

Here is an example of what your JSON should look like. (Here it is split across multiple lines to assist reading, but remember it must all be on a single line in the file!)

{
  "QUERY_TEXT": "insert into NETFLIX_DB.PUBLIC.MOVIES_FILTERED as select m.* from MOVIES m where m.RATING > 5;",
  "DATABASE_NAME": "NETFLIX_DB",
  "SCHEMA_NAME": "PUBLIC",
  "SESSION_ID": "5c2f0a41-5d02-46f1-b9bd-ef80ad571013"
}

The names of the keys (properties) in the JSON can be configured while setting up the miner package. In the example above, the default database (DATABASE_NAME) and schema (SCHEMA_NAME) will be used to qualify the query against the table MOVIES as NETFLIX_DB.PUBLIC.MOVIES.
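
If you generate these files programmatically, a small script makes it easier to keep each record on its own line. The following Python sketch is only illustrative: the file name, the example queries, and the record values are hypothetical, and the key names should match whatever you configure in the miner.

import json

# Hypothetical query history records; the key names must match the ones you
# configure in the miner (QUERY_TEXT, DATABASE_NAME, SCHEMA_NAME, SESSION_ID here).
records = [
    {
        "QUERY_TEXT": "insert into NETFLIX_DB.PUBLIC.MOVIES_FILTERED select m.* from MOVIES m where m.RATING > 5;",
        "DATABASE_NAME": "NETFLIX_DB",
        "SCHEMA_NAME": "PUBLIC",
        "SESSION_ID": "5c2f0a41-5d02-46f1-b9bd-ef80ad571013",
    },
    {
        # This query fully qualifies its tables, so the defaults can be null.
        "QUERY_TEXT": "insert into NETFLIX_DB.PUBLIC.MOVIES_TOP select * from NETFLIX_DB.PUBLIC.MOVIES_FILTERED;",
        "DATABASE_NAME": None,
        "SCHEMA_NAME": None,
        "SESSION_ID": "5c2f0a41-5d02-46f1-b9bd-ef80ad571013",
    },
]

# Write one compact JSON object per line: no pretty-printing and no commas
# between lines. Keep queries from the same session next to each other.
with open("query_history_001.json", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")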

Set up the S3 bucket

The query files must be available in an S3 bucket. You can either upload these files to the Atlan deployment bucket or use your own S3 bucket.

Option 1: Use the Atlan S3 bucket

To avoid access issues, we recommend uploading the required files to the same S3 bucket that Atlan uses. Raise a support request to get the details of your Atlan bucket, and include the ARN of the IAM user or IAM role that Atlan should provision access for.

To configure access, add the following IAM policy to the default EC2 instance role used by the Atlan EKS cluster.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListBucket",
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::<bucket-name>",
        "arn:aws:s3:::<bucket-name>/<prefix>/*"
      ]
    }
  ]
}
  • Replace <bucket-name> with the bucket where the data is uploaded.
  • Replace <prefix> with the prefix (directory) where all the files have been uploaded.
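
Once support confirms that access has been provisioned, you can upload the query files with the AWS CLI or an SDK. Here is a minimal boto3 sketch, assuming your AWS credentials are already configured; the bucket name, prefix, and file names are placeholders to replace with the values shared by Atlan support.

import boto3

# Placeholders: use the bucket and prefix shared by Atlan support.
bucket_name = "<bucket-name>"
prefix = "<prefix>"

s3 = boto3.client("s3")

# Upload each query history file under the agreed prefix (directory).
for filename in ["query_history_001.json", "query_history_002.json"]:
    s3.upload_file(filename, bucket_name, f"{prefix}/{filename}")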

Option 2: Use your own S3 bucket

🚨 Careful! S3 buckets with VPC endpoints currently do not support cross-region requests. This may result in workflows not picking up objects from your bucket.

You'll first need to create a cross-account bucket policy giving Atlan's IAM role access to your bucket. A cross-account bucket policy is required because your Atlan tenant and S3 bucket may not always be deployed in the same AWS account. The permissions required on the S3 bucket are GetBucketLocation, ListBucket, and GetObject.

To create a cross-account bucket policy:

  1. Raise a support ticket to get the ARN of the Node Instance Role for your Atlan EKS cluster.
  2. Create a new policy to allow access by this ARN and update your bucket policy with the following:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "VisualEditor0",
          "Effect": "Allow",
          "Principal": {
            "AWS": "<role-arn>"
          },
          "Action": [
            "s3:GetBucketLocation",
            "s3:ListBucket",
            "s3:GetObject"
          ],
          "Resource": [
            "arn:aws:s3:::<bucket-name>",
            "arn:aws:s3:::<bucket-name>/<prefix>/*"
          ]
        }
      ]
    }
    • Replace <role-arn> with the role ARN of Atlan's node instance role.
    • Replace <bucket-name> with the name of your bucket.
    • Replace <prefix> with the name of the prefix (directory) within that bucket where you will upload the files.
  3. Once the new policy has been set up, notify the support team and include the S3 bucket name and prefix in your request. Do this before setting up the workflow so that Atlan can create an IAM policy for your bucket and attach it to Atlan's IAM role.
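
Before handing these details to support, you can sanity-check from your own AWS account that the bucket policy is attached and that the query files sit under the expected prefix. This boto3 sketch is only a rough check under those assumptions; it does not prove that Atlan's role itself can read the bucket.

import json

import boto3

bucket_name = "<bucket-name>"  # placeholder: your bucket
prefix = "<prefix>"            # placeholder: directory holding the query files

s3 = boto3.client("s3")

# Confirm the cross-account bucket policy is attached.
policy = json.loads(s3.get_bucket_policy(Bucket=bucket_name)["Policy"])
print(json.dumps(policy, indent=2))

# Confirm the query files are present under the agreed prefix.
response = s3.list_objects_v2(Bucket=bucket_name, Prefix=f"{prefix}/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])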

(Optional) Update KMS policy

If your S3 bucket is encrypted, you will need to update your KMS policy. This will allow Atlan to decrypt the objects in your S3 bucket.

  1. Provide the KMS key ARN and KMS key alias ARN to the Atlan support team. The KMS key that you provide must be a customer managed KMS key. (This is because you can only change the key policy for a customer managed KMS key, and not for an AWS managed KMS key. Refer to AWS documentation to learn more.)
  2. To whitelist the ARN of Atlan's node instance role, update the KMS policy with the following:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "Decrypt Cross Account",
          "Effect": "Allow",
          "Principal": {
            "AWS": "<role-arn>"
          },
          "Action": [
            "kms:Decrypt",
            "kms:DescribeKey"
          ],
          "Resource": "*"
        }
      ]
    }
    • Replace <role-arn> with the role ARN of Atlan's node instance role.
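
If you want to confirm that your key is customer managed before sharing it, you can check its metadata. This is a minimal boto3 sketch, assuming you have kms:DescribeKey permission on the key; the key ARN is a placeholder.

import boto3

kms = boto3.client("kms")

# Placeholder: the ARN (or alias ARN) of the key that encrypts your S3 bucket.
key_arn = "<kms-key-arn>"

metadata = kms.describe_key(KeyId=key_arn)["KeyMetadata"]

# "CUSTOMER" means a customer managed key, whose key policy you can edit;
# "AWS" means an AWS managed key, whose key policy cannot be changed.
print(metadata["KeyManager"])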

Select the miner

To select the S3 miner:

  1. In the top right of any screen, navigate to New and then click New Workflow.
  2. From the filters along the top, click Miner.
  3. From the list of packages, select the miner for your source and click Setup Workflow.

Configure the miner

To configure the S3 miner:

  1. For Connection, select the connection to mine. (To select a connection, a crawler must have already run against that source.)
  2. For Miner extraction method, select S3.
  3. Enter the details for your files:
    • For Bucket Name, enter the name of your S3 bucket or Atlan's bucket, including s3://.
    • For Bucket Prefix, enter the S3 prefix (directory) within the bucket where the files are located.
    • (Optional) For Bucket Region, enter the name of the S3 region in which the bucket exists.
    • For SQL Json key, enter the JSON key containing the SQL query value. (In the example above, this was QUERY_TEXT.)
    • For Default Database Json Key, enter the JSON key containing the name of the default database. (In the example above, this was DATABASE_NAME.)
    • For Default Schema Json Key, enter the JSON key containing the name of the default schema. (In the example above, this was SCHEMA_NAME.)
    • For Session ID Json Key, enter the JSON key containing the session ID under which the query ran. (In the example above, this was SESSION_ID.)
  4. (Optional) For Control Config, if Atlan support has provided you a custom control configuration, select Custom and enter the configuration into the Custom Config box. You can also:
    • Enter {"ignore-all-case": true} to enable crawling assets with case-sensitive identifiers.

Run the miner

To run the S3 miner, after completing the steps above:

  • To run the miner once, immediately, click the Run button at the bottom of the screen.
  • To schedule the miner to run hourly, daily, weekly, or monthly, click the Schedule & Run button at the bottom of the screen.

Once the miner has completed running, you will see lineage for your source's assets created by the queries in S3! 🎉

Frequently asked questions

If I remove queries from S3 and run the miner, does it remove the lineage generated from those queries?

No, we do not remove lineage from older queries that are no longer in the bucket.

Does the miner reprocess files in the S3 prefix?

Yes, we process all files present in the S3 prefix and publish any new lineage generated. We recommend removing older files when updating the files in the S3 prefix.

I used this approach for initial mining. Can I convert the miner I already set up to do its future mining directly from the source (not S3)?

Yes, just edit the workflow configuration. Alternatively, you can also set up another miner for the same connection.

Are the database and schema name parameters always required in the JSON file?

The DATABASE_NAME and SCHEMA_NAME fields can be set to null if that information is already available in the query. These properties are used as a fallback for queries that were run in the context of a certain database or schema and therefore do not fully qualify the table names.

What SQL statements should be added to the S3 miner JSON file for lineage?

You will need to add DDL and DML statements to the S3 miner JSON file for mining lineage. SELECT statements are not required, since DQL statements do not generate lineage. Both UPDATE and DELETE can be based on values from another table, so include these statements to generate lineage from them.
