How to automate data profiling

➕ Premium feature! This feature will be a paid addition. Reach out to your customer success manager for more information.
🤓 Who can do this? You need to be an admin user in Atlan to create profiling playbooks.

Monitoring and improving data quality is critical to building trust in your data assets. Atlan solves for this with profiling playbooks!

Profiling playbooks help power data observability for your assets in Atlan. You can create profiling playbooks to scan your assets at scale, identify any issues or inconsistencies, and improve the data quality of your assets.

Supported sources

Atlan currently supports column profiling for the following connectors:

Create a profiling playbook

To create a profiling playbook:

  1. In the left menu in Atlan, click Governance.
  2. Under the Governance heading of the Governance center, click Playbooks.
  3. To the right of the Create New button, click the downward arrow and then select Profiling Playbook.
  4. In the Create new profiling playbook dialog, enter the following details:
    1. For Name, enter a name for the task to be accomplished, for example, Tables scan. (Atlan recommends keeping playbook names to 46 characters or fewer.)
    2. For Connection, select a supported connection from the dropdown menu. In this example, we'll select a Google BigQuery connection named development.
    3. (Optional) For Description, enter a description for your playbook.
    4. (Optional) Select an icon for your playbook.
  5. Click Create to save your playbook.

Set up rules as filters

💪 Did you know? The assets to be scanned are pre-filled based on your selected connection.

To set up rules as filters for your profiling playbook:

  1. In the Build Rules page of your profiling playbook, click Filters.
  2. In the name field, enter a name for your filter, for example, Profiling action.
  3. To set a matching condition for the filters, select Match all or Match any. Match all will logically AND the criteria, while Match any will logically OR the criteria.
  4. For Attributes, select the relevant option. For this example, we'll select Name listed under Properties. (Optional) To further refine your asset selection:
    • Click Connection to select a specific connection. 
      1. Click All databases to filter by databases in a selected connection.
      2. Click All schemas to filter by schemas in a selected connection.
    • Click Connector to filter assets by supported connectors.
    • Click Asset type to filter by specific asset types, for example, tables, columns, queries, glossaries, and more.
    • Click Certificate to filter assets by certification status.
    • Click Owners to filter assets by asset owners.
    • Click Tags to filter assets by your tags in Atlan, including imported Snowflake and dbt tags.
    • Click Glossary, terms, & categories to filter by a specific glossary or category (to bulk update all of its nested terms) or by multiple glossaries and categories.
    • Click Linked terms to filter assets by linked terms.
    • Click Schema qualified Name to filter assets by the qualified name of a given schema.
    • Click Database qualified Name to filter assets by the qualified name of a given database.
    • Click dbt to filter assets by dbt-specific filters and then select a dbt Cloud or dbt Core filter.
    • Click Properties to filter assets by common asset properties.
    • Click Usage to filter assets by usage metrics.
    • Click Monte Carlo to filter assets by Monte Carlo-specific filters.
    • Click Soda to filter assets by Soda-specific filters.
    • Click Table/View to filter tables or views by row count, column count, or size.
    • Click Column to filter columns by column-specific filters, including parent asset type or name, data type, or column keys.
    • Click Process to filter lineage processes by the SQL query.
    • Click Query to filter assets by associated visual queries.
    • Click Measure to filter Microsoft Power BI measures using the external measures filter.
  5. For Operator, select Is one of for values to include or Is not for values to exclude. Depending on the selected attribute(s), you can also choose from additional operators:
    • Select Equals (=) or Not Equals (!=) to include or exclude assets through exact match search.
    • Select Starts With or Ends With to filter assets using the starting or ending sequence of values.
    • Select Contains or Does not contain to find assets with or without specified values contained within the attribute.
    • Select Pattern to filter assets using supported Elastic DSL regular expressions.
    • Select Is empty to filter assets with null values.
  6. For Values, select the relevant values. The values will vary depending on the selected attributes.
  7. (Optional) To add more filters, click Add filter and select Filter to add individual filters or Filter Group to nest more filters in a group.
  8. (Optional) To view all the assets that match your rules, in the Filters card, click View all for a preview.
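
The way Match all, Match any, operators, and nested filter groups combine can be sketched in a few lines of Python. This is only an illustration of the logic described in the steps above, not Atlan's implementation or API: the Filter and FilterGroup classes, the operator keys, and the sample asset dictionaries are all hypothetical. The Pattern operator is approximated with Python's re.fullmatch, since Elastic regular expressions match the whole value; the regular expression syntax itself differs between the two.

```python
import re
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical operator keys that mirror the operator labels in the UI.
OPERATORS: dict[str, Callable[[Any, Any], bool]] = {
    "is_one_of": lambda actual, expected: actual in expected,
    "is_not": lambda actual, expected: actual not in expected,
    "equals": lambda actual, expected: actual == expected,
    "not_equals": lambda actual, expected: actual != expected,
    "starts_with": lambda actual, expected: isinstance(actual, str) and actual.startswith(expected),
    "ends_with": lambda actual, expected: isinstance(actual, str) and actual.endswith(expected),
    "contains": lambda actual, expected: isinstance(actual, str) and expected in actual,
    "does_not_contain": lambda actual, expected: not (isinstance(actual, str) and expected in actual),
    # Approximation: whole-value match, using Python regex syntax.
    "pattern": lambda actual, expected: isinstance(actual, str) and re.fullmatch(expected, actual) is not None,
    "is_empty": lambda actual, _expected: actual is None,
}


@dataclass
class Filter:
    attribute: str
    operator: str
    value: Any = None

    def matches(self, asset: dict) -> bool:
        # A missing attribute is treated as null (an assumption of this sketch).
        return OPERATORS[self.operator](asset.get(self.attribute), self.value)


@dataclass
class FilterGroup:
    match: str     # "all" (logical AND) or "any" (logical OR)
    filters: list  # Filter objects or nested FilterGroup objects

    def matches(self, asset: dict) -> bool:
        combine = all if self.match == "all" else any
        return combine(f.matches(asset) for f in self.filters)


# Hypothetical assets, represented as plain dictionaries.
assets = [
    {"name": "orders_raw", "asset_type": "Table", "certificate": "VERIFIED"},
    {"name": "orders_tmp", "asset_type": "Table", "certificate": None},
    {"name": "customers", "asset_type": "View", "certificate": "DRAFT"},
]

# Match all: the name starts with "orders" AND the nested group matches.
# Nested group, Match any: certificate is VERIFIED or DRAFT, OR it is empty.
rules = FilterGroup(match="all", filters=[
    Filter("name", "starts_with", "orders"),
    FilterGroup(match="any", filters=[
        Filter("certificate", "is_one_of", ["VERIFIED", "DRAFT"]),
        Filter("certificate", "is_empty"),
    ]),
])

matched = [asset["name"] for asset in assets if rules.matches(asset)]
print(matched)
```

Modeling a filter group as a list that can hold other groups mirrors the Filter Group option in step 7, where more filters can be nested inside a group.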

Confirm profiling actions

🚨 Careful! Column profiling is currently only supported for number and text data types. The profiled column assets will be populated with preconfigured metrics.

To select the actions to be performed based on your rules:

  1. The default profiling actions to be performed include:
    • Base metrics:
      • Distinct count: number of rows that contain distinct values, relative to the column.
      • Missing count: number of rows that do not contain specific values.
    • Numeric metrics:
      • Minimum and maximum values: smallest and greatest values in a numeric column.
      • Average: calculated average of values in a numeric column.
      • Standard deviation: calculated standard deviation of values in a numeric column.
      • Variance: calculated variance of values in a numeric column.
    • String metrics:
      • Average length: average length of string values in a column.
      • Minimum and maximum length: minimum and maximum length of string values in a column.
  2. Click Next to proceed to the next step.
  3. In the Optimize your Profiling query popup, the following message will appear: This Profiling playbook will query x rows across y assets. To avoid significant computing costs, review your applied filters before proceeding. Click Review filters to review your existing filters or click Continue anyway to proceed.
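
To make the default metrics listed in step 1 concrete, here is a minimal Python sketch that computes them for a single column held in memory. It is not how Atlan computes profiles. The profile_column function and its output keys are hypothetical, and several choices are assumptions: missing values are treated as nulls (None), distinct count is taken as the number of distinct non-null values, and standard deviation and variance use the population formulas.

```python
import statistics
from typing import Optional, Sequence, Union

Value = Optional[Union[int, float, str]]


def profile_column(values: Sequence[Value]) -> dict:
    """Compute base, numeric, and string metrics for one column of values."""
    present = [v for v in values if v is not None]

    # Base metrics apply to every profiled column.
    profile = {
        "distinct_count": len(set(present)),
        "missing_count": len(values) - len(present),
    }

    numbers = [v for v in present if isinstance(v, (int, float)) and not isinstance(v, bool)]
    strings = [v for v in present if isinstance(v, str)]

    if present and len(numbers) == len(present):
        # Numeric metrics (population standard deviation and variance, by assumption).
        profile.update(
            minimum=min(numbers),
            maximum=max(numbers),
            average=statistics.fmean(numbers),
            standard_deviation=statistics.pstdev(numbers),
            variance=statistics.pvariance(numbers),
        )
    elif present and len(strings) == len(present):
        # String metrics are based on the length of each value.
        lengths = [len(s) for s in strings]
        profile.update(
            average_length=statistics.fmean(lengths),
            minimum_length=min(lengths),
            maximum_length=max(lengths),
        )
    return profile


numeric_profile = profile_column([2, 4, 4, 4, 5, 5, 7, 9, None])
string_profile = profile_column(["a", "bcd", None, "ef"])
print(numeric_profile)
print(string_profile)
```

Columns that mix numbers and text get only the base metrics in this sketch, which loosely mirrors the restriction of column profiling to number and text data types.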

Note that Atlan is working to support sampling functionality in the future.
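
Because sampling is not yet available, the row total shown in the Optimize your Profiling query popup is the main signal of how expensive a run may be. The sketch below shows one way to reason about that number before running a playbook. It assumes that each matched asset is a table with a known row count. The function names, the row_count key, and the row budget are hypothetical and are not part of Atlan.

```python
from typing import Iterable, Mapping


def summarize_scan(tables: Iterable[Mapping]) -> tuple[int, int]:
    """Return (total rows, number of assets) for the tables that match the filters."""
    tables = list(tables)
    # A missing or null row count is counted as 0 in this sketch.
    total_rows = sum(int(t.get("row_count") or 0) for t in tables)
    return total_rows, len(tables)


def scan_message(total_rows: int, asset_count: int) -> str:
    """Format the totals in the same shape as the popup message."""
    return f"This Profiling playbook will query {total_rows} rows across {asset_count} assets."


def should_review_filters(total_rows: int, row_budget: int) -> bool:
    """Flag scans that exceed a row budget chosen by your team (hypothetical)."""
    return total_rows > row_budget


# Hypothetical tables matched by the playbook filters.
matched_tables = [
    {"name": "orders", "row_count": 1_200_000},
    {"name": "customers", "row_count": 300_000},
]

total_rows, asset_count = summarize_scan(matched_tables)
print(scan_message(total_rows, asset_count))
print(should_review_filters(total_rows, row_budget=1_000_000))
```

If the total exceeds your budget, narrowing the filters (for example, by schema or asset name) reduces the number of rows the playbook will query.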

Run the playbook

If you'd like to continue working on your playbook, you can save it as a draft. If your playbook is ready, you can proceed to running it.

To run the playbook:

  1. Choose how to run the playbook:
    • To run the playbook once immediately, click Run once.
    • To schedule the playbook to run hourly, daily, weekly, or monthly, click Schedule and choose the preferred frequency, timezone, and time.
  2. Click Complete to confirm your selections.
  3. In the resulting screen, click Go to profile to view your playbook profile.

Once your playbook run is completed, you will see the data profile updated for your assets! 🎉

View profiled assets

To view the profiled assets for your playbook:

  1. In the Overview page of your playbook, to the right of Profiling action, click the total count of profiled assets.
  2. In the sidebar to the right, profiled assets will be indicated with a bar graph icon. Click any profiled asset to proceed to viewing profiling data.
  3. From the table sidebar, click the Column tab to view column assets and then select any of the profiled columns.
  4. In the column sidebar to the right, click Profile to view profiling data for the selected column asset.

💪 Did you know? Once you've created a profiling playbook, you can monitor, modify, or delete it at any time.
