How to mine queries through S3

Once you have crawled assets from a supported connector (Snowflake, BigQuery, or Redshift), you can mine query history.

For each of the supported connectors, Atlan supports mining query history via S3. This is useful when you have files that hold query history beyond what the source itself retains.

To mine lineage for these sources via S3, complete the following steps.

Structure the query files

To make the query history files available for Atlan, ensure the files:

  • Use a .json extension.
  • Are present in a single S3 bucket and prefix (directory).

To structure the contents of the files for Atlan, ensure:

  • Each line is a single JSON value. (The JSON object cannot be pretty-formatted or span multiple lines.)
  • Each SQL query is on its own line.
  • Commas are not used to separate the lines.
💪 Did you know? You can also provide a default database, a default schema, and a session ID in the JSON.
  • If a SQL query has only the name of the table or view it queries, Atlan will use the default database and schema to generate lineage for the query.
  • Including the session ID speeds up lineage processing. If provided, ensure that all queries belonging to the same session are next to each other in the file.

Here is an example of what your JSON should look like. (It is split across multiple lines here for readability, but remember it must all be on a single line in the file!)

{
  "QUERY_TEXT": "insert into NETFLIX_DB.PUBLIC.MOVIES_FILTERED as select m.* from MOVIES m where m.RATING > 5;",
  "DATABASE_NAME": "NETFLIX_DB",
  "SCHEMA_NAME": "PUBLIC",
  "SESSION_ID": "5c2f0a41-5d02-46f1-b9bd-ef80ad571013"
}

The names of the keys (properties) in the JSON can be configured while setting up the miner package. In the example above, the default database (DATABASE_NAME) and schema (SCHEMA_NAME) will be used to qualify the query against the table MOVIES as NETFLIX_DB.PUBLIC.MOVIES.
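
To produce files in this shape, write each record as one compact JSON object per line (newline-delimited JSON). The following is a minimal sketch in Python, assuming the key names from the example above; the record values and output file name are illustrative only.

import json

# Illustrative query-history records. The key names must match the keys you
# configure later in the miner package (QUERY_TEXT, DATABASE_NAME, and so on).
records = [
    {
        "QUERY_TEXT": "insert into NETFLIX_DB.PUBLIC.MOVIES_FILTERED select m.* from MOVIES m where m.RATING > 5;",
        "DATABASE_NAME": "NETFLIX_DB",
        "SCHEMA_NAME": "PUBLIC",
        "SESSION_ID": "5c2f0a41-5d02-46f1-b9bd-ef80ad571013",
    },
    # ... one dictionary per query, keeping queries from the same session together
]

# Write one JSON object per line: no pretty-printing, no separating commas.
with open("query_history_0001.json", "w", encoding="utf-8") as f:  # hypothetical file name
    for record in records:
        f.write(json.dumps(record) + "\n")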

Set up the S3 bucket

The query files must be available in an S3 bucket.

If the S3 bucket is different from the Atlan deployment bucket, you must configure access to it by adding the following IAM policy to the default EC2 instance role used by the Atlan EKS cluster.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListBucket",
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::<bucket-name>",
        "arn:aws:s3:::<bucket-name>/<prefix>/*"
      ]
    }
  ]
}
  • Replace <bucket-name> with the bucket where the data is uploaded.
  • Replace <prefix> with the prefix (directory) where all the files have been uploaded.
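
Once the policy is in place, upload the query files to that bucket and prefix using any tool you prefer. As a minimal sketch, assuming Python with the boto3 library and AWS credentials that can write to the bucket (the bucket name, prefix, and local directory below are placeholders):

import pathlib

import boto3  # assumes AWS credentials with write access are configured locally

BUCKET = "<bucket-name>"  # same bucket as in the IAM policy above
PREFIX = "<prefix>"       # same prefix (directory) as in the IAM policy above

s3 = boto3.client("s3")

# Upload every .json query-history file from a local directory into the prefix.
for path in pathlib.Path("query_history").glob("*.json"):  # hypothetical local directory
    s3.upload_file(str(path), BUCKET, f"{PREFIX}/{path.name}")
    print(f"uploaded s3://{BUCKET}/{PREFIX}/{path.name}")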

Select the miner

To select the S3 miner:

  1. In the top right of any screen, navigate to New and then click New Workflow.
  2. From the filters along the top, click Miner.
  3. From the list of packages, select the miner for your source and click Setup Workflow.

Configure the miner

To configure the S3 miner:

  1. For Connection, select the connection to mine. (To select a connection, a crawler must have already run against that source.)
  2. For Miner extraction method, select S3.
  3. Enter the details for your files:
    • For S3 bucket, enter the bucket name including s3://.
    • For S3 prefix, enter the S3 prefix (directory) within the bucket where the files are located.
    • For SQL Json key, enter the JSON key containing the SQL query value. (In the example above, this was QUERY_TEXT.)
    • For Default Database Json Key, enter the JSON key containing the name of the default database. (In the example above, this was DATABASE_NAME.)
    • For Default Schema Json Key, enter the JSON key containing the name of the default schema. (In the example above, this was SCHEMA_NAME.)
    • For Session ID Json Key, enter the JSON key containing the session ID under which the query ran. (In the example above, this was SESSION_ID.)
  4. (Optional) For Control Config, if Atlan support has provided you a custom control configuration, select Custom and enter the configuration into the Custom Config box.

Run the miner

To run the S3 miner, after completing the steps above:

  • To run the miner once, immediately, click the Run button at the bottom of the screen.
  • To schedule the miner to run hourly, daily, weekly, or monthly, click the Schedule & Run button at the bottom of the screen.

Once the miner has completed running, you will see lineage for your source's assets created by the queries in S3! 🎉

Frequently asked questions

If I remove queries from S3 and run the miner, does it remove the lineage generated from those queries?

No, we do not remove lineage generated from older queries that are no longer in the bucket.

Does the miner reprocess files in the S3 prefix?

Yes, we process all files present in the S3 prefix and publish any new lineage generated. We recommend removing older files when updating the files in the S3 prefix.

I used this approach for initial mining. Can I convert the miner I already set up to do its future mining directly from the source (not S3)?

Yes, just edit the workflow configuration. Alternatively, you can set up another miner for the same connection.

Are the database and schema name parameters always required in the JSON file?

The DATABASE_NAME and SCHEMA_NAME fields can be set to null if that data is already available in the query. These properties are used as a fallback when queries are run in the context of a certain database or schema.

What SQL statements should be added to the S3 miner JSON file for lineage?

You will need to add DDL and DML statements to the S3 miner JSON file for mining lineage. Plain SELECT statements are not required, since SELECT is a DQL statement and does not write to any table. Both UPDATE and DELETE can be based on values from another table, so these statements are required for generating lineage.
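
As an illustration, continuing the Python sketch from earlier, records like the following are the kind worth including. The table names, column names, and session IDs here are hypothetical.

# DDL/DML that writes to one table based on another: include these for lineage.
lineage_records = [
    {
        # Unqualified table MOVIES, so the default database and schema are used.
        "QUERY_TEXT": "create table NETFLIX_DB.PUBLIC.TOP_MOVIES as select * from MOVIES where RATING > 8;",
        "DATABASE_NAME": "NETFLIX_DB",
        "SCHEMA_NAME": "PUBLIC",
        "SESSION_ID": "session-01",
    },
    {
        # DELETE driven by values from another table; the table names are fully
        # qualified, so the default database and schema can be null.
        "QUERY_TEXT": "delete from NETFLIX_DB.PUBLIC.MOVIES_FILTERED where ID in (select ID from NETFLIX_DB.PUBLIC.RETIRED_MOVIES);",
        "DATABASE_NAME": None,
        "SCHEMA_NAME": None,
        "SESSION_ID": "session-02",
    },
    # A record holding only a plain SELECT would not add lineage and can be left out.
]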
