High availability and disaster recovery (HA/DR)

High availability (HA)

Atlan uses Amazon Elastic Kubernetes Service (EKS) for high availability (HA).

Amazon EKS runs and scales the Kubernetes control plane across multiple AWS Availability Zones to ensure high availability. It automatically scales control plane instances based on load, detects and replaces unhealthy control plane instances, and patches the control plane.

The benefits of this approach are:

  • Scalability and reliability help the system remain stable
  • Self-healing keeps containers running in a healthy state
  • Handles node failures gracefully
  • Auto-scaling enables automated cluster creation

Application HA

Atlan ensures application HA through the following:

  • Multiple replicas for both stateless and stateful applications
  • Load balancing with services
  • Rolling updates to maintain the availability threshold
  • Pod-to-node distribution to ensure that critical application pods are running on dedicated nodes
  • Inter-Availability Zone (AZ) data transfers, so that a failure in a single zone has limited impact
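
To illustrate why running multiple replicas across Availability Zones limits the impact of a single-zone failure, here is a minimal Python sketch. It is not Atlan's scheduling logic: the zone names, the replica count, and the helper functions `spread_round_robin` and `surviving_replicas` are hypothetical and exist only for this example.

```python
def spread_round_robin(replicas: int, zones: list) -> dict:
    """Assign replicas to zones as evenly as possible (round-robin)."""
    placement = {zone: 0 for zone in zones}
    for i in range(replicas):
        placement[zones[i % len(zones)]] += 1
    return placement


def surviving_replicas(placement: dict, failed_zone: str) -> int:
    """Count the replicas still running if one zone becomes unavailable."""
    return sum(count for zone, count in placement.items() if zone != failed_zone)


# Hypothetical example: 3 replicas spread over 3 zones.
placement = spread_round_robin(3, ["az-a", "az-b", "az-c"])
print(placement)                              # {'az-a': 1, 'az-b': 1, 'az-c': 1}
print(surviving_replicas(placement, "az-a"))  # 2
```

With all three replicas in one zone, the same failure would leave zero replicas running, which is the single-zone impact that the distribution above is meant to avoid.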

Disaster recovery (DR)

Atlan follows industry best practices for disaster recovery. It uses Argo Workflows to orchestrate its disaster recovery strategy and reduce production downtime, so that business impact is minimized in the event of an outage.

The following factors help reduce downtime and speed up disaster recovery:


Single-tenant SaaS is the default deployment option for most Atlan users. In this model, Atlan manages the infrastructure needs and ensures that all instances are spread across multiple Availability Zones (AZ) in each AWS Region where the user instance is deployed.

Availability Zones are multiple, isolated locations within a single AWS Region. Multi-AZ deployments provide enhanced availability for instances within a single AWS Region. With multi-AZ, your data is synchronously replicated to a standby in a different Availability Zone.

🚨 Careful! Atlan currently does not support multi-region deployment.

Atlan service overview

The diagram below illustrates the relationships and communication flows between each service. The bottom-most layer shows the services that are entirely independent, such as Cassandra and Postgres. Most of the other services depend on these to function.

Backups and restore

Atlan runs backups on a daily basis through automated workflows and retains them for 15 days. The backups are encrypted and stored as object files in an S3 bucket.

Stateful data sources that are backed up:

  • Elasticsearch (metastore and logging)
  • Cassandra
  • Redis
  • Postgres
  • S3 (Argo)

If data corruption affects a single point of failure, such as the metastore and its components like Elasticsearch and Cassandra, Atlan can restore that component on its own. A full-cluster restore is also possible after an unintended operation or a data loss or corruption event.

💪 Did you know? Argo Workflows powers all the backup and restore packages in Atlan. It includes a retry mechanism in case any step in the workflow fails. As part of observability, it also sends alerts if an entire package fails.
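
The retry-then-alert behavior described above can be sketched in plain Python. This is an illustration of the general pattern only, not Argo Workflows' API or Atlan's actual packages; `run_with_retries`, `flaky_backup_step`, and the in-memory `alerts` list are hypothetical names introduced for this example.

```python
from typing import Callable, List, Optional


def run_with_retries(step: Callable[[], str],
                     max_attempts: int,
                     alert: Callable[[str], None]) -> Optional[str]:
    """Run a step, retrying on error; alert only if every attempt fails."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            return step()
        except Exception as exc:  # a real workflow would catch narrower errors
            last_error = exc
    alert(f"step failed after {max_attempts} attempts: {last_error}")
    return None


# Hypothetical step that fails twice before succeeding.
attempts = {"count": 0}

def flaky_backup_step() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("transient error")
    return "backup complete"


alerts: List[str] = []
result = run_with_retries(flaky_backup_step, max_attempts=3, alert=alerts.append)
print(result)  # backup complete
print(alerts)  # []
```

An alert is recorded only when all attempts are exhausted, which mirrors the idea of alerting on an entire package failure rather than on every transient error.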


Atlan has a process to migrate the application to other AWS Regions. If a total region outage requires migrating an instance to another region or account, the migration is performed using Atlan's backup and restore packages.

RTO, RPO, and retention

Low recovery time objectives (RTOs) and recovery point objectives (RPOs), together with reliable system recovery, are crucial for ensuring that mission-critical applications are quickly restored. It is possible to minimize the impact of a disruption and perform a recovery within a few hours of an outage.

  • Atlan backs up all critical services once every 24 hours, so the worst-case RPO is 24 hours.
  • Atlan retains daily backups for 15 days.
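
The arithmetic behind these two figures can be sketched as follows. The 24-hour interval and 15-day retention come from the text above; the timestamps and the helper names `data_loss_window` and `oldest_restorable` are hypothetical and used only for illustration.

```python
from datetime import datetime, timedelta

BACKUP_INTERVAL = timedelta(hours=24)  # one backup every 24 hours
RETENTION = timedelta(days=15)         # backups are kept for 15 days


def data_loss_window(last_backup: datetime, failure_time: datetime) -> timedelta:
    """Data written after the last backup is what a restore cannot recover."""
    return failure_time - last_backup


def oldest_restorable(now: datetime) -> datetime:
    """Earliest backup timestamp still inside the retention window."""
    return now - RETENTION


# Hypothetical timestamps: backup at 02:00, failure at 23:30 the same day.
last_backup = datetime(2024, 1, 1, 2, 0)
failure = datetime(2024, 1, 1, 23, 30)

loss = data_loss_window(last_backup, failure)
print(loss)                                           # 21:30:00
print(loss <= BACKUP_INTERVAL)                        # True
print(oldest_restorable(datetime(2024, 1, 16, 2, 0))) # 2024-01-01 02:00:00
```

The loss window approaches, but never exceeds, the backup interval as long as each daily backup succeeds, which is why the worst-case RPO equals 24 hours.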
