Corporate infrastructure security, compliance, and reliability practices are documented on the Cognite Trust Center. This page describes reliability and disaster recovery for CDF.
High availability
We have designed each CDF component with high availability to avoid single points of failure and reduce the effects of infrastructure maintenance. CDF runs on elastic cloud infrastructure that autoscales compute and storage with system load, and we use Kubernetes rolling updates for deployments with no downtime. For backup schedules, retention, and restore procedures, see Availability and business continuity.
On-premises components
If the data is critical for production solutions or there is a risk of data loss, on-premises components, like extractors, should run in a high-availability hosting environment. To configure the extractors for high availability, we recommend installing extractors in mutually redundant environments and setting up the necessary failover. For example, if you have extractors in three different environments, you can:
- Configure each extractor with an active instance in its primary environment and a passive instance in one of the other environments. Allow failover if one of the environments becomes unavailable.
- Configure each extractor with an active instance in its primary environment and a passive instance in the other environments. Allow failover if two of the environments become unavailable.
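The two layouts above can be sketched as a small placement model. This is only an illustration, not a Cognite API or extractor configuration format: the environment names, the `ExtractorPlacement` class, and all function names are hypothetical, and the second layout is read here as one passive instance in each of the other environments.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple

# Hypothetical environment names; replace with your own hosting environments.
ENVIRONMENTS = ("env-a", "env-b", "env-c")


@dataclass(frozen=True)
class ExtractorPlacement:
    name: str
    primary: str               # environment running the active instance
    standbys: Tuple[str, ...]  # environments with passive instances, in failover order


def single_standby(name: str, primary: str) -> ExtractorPlacement:
    """First layout: one passive instance in one of the other environments."""
    next_env = ENVIRONMENTS[(ENVIRONMENTS.index(primary) + 1) % len(ENVIRONMENTS)]
    return ExtractorPlacement(name, primary, (next_env,))


def full_standby(name: str, primary: str) -> ExtractorPlacement:
    """Second layout: a passive instance in every other environment."""
    others = tuple(env for env in ENVIRONMENTS if env != primary)
    return ExtractorPlacement(name, primary, others)


def active_environment(
    placement: ExtractorPlacement, unavailable: Iterable[str]
) -> Optional[str]:
    """Return the first available environment in failover order, or None."""
    down = set(unavailable)
    for env in (placement.primary, *placement.standbys):
        if env not in down:
            return env
    return None


single = single_standby("example-extractor", "env-a")
full = full_standby("example-extractor", "env-a")

# One environment down: both layouts still have a place to run.
print(active_environment(single, {"env-a"}))
print(active_environment(full, {"env-a"}))

# Two environments down: only the second layout keeps the extractor running.
print(active_environment(single, {"env-a", "env-b"}))
print(active_environment(full, {"env-a", "env-b"}))
```

The sketch only decides where the active instance should run. Detecting that an environment is unavailable and promoting the passive instance depend on your hosting setup and are not modeled here.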
Disaster recovery
Reliability also focuses on recovery from data loss and disaster scenarios. These types of incidents might result in downtime or permanent loss of data. Recovery often involves active intervention and depends largely on careful planning. Disaster recovery is a subset of Business Continuity Planning and builds on an impact analysis that defines the recovery time objective (RTO) and recovery point objective (RPO).
Recovery point objective (RPO)
The maximum acceptable period of data loss, counted back from the time of the incident. We measure RPO in units of time, not volume.
Recovery time objective (RTO)
The maximum duration of acceptable downtime.
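As a rough illustration of how the two objectives are used in planning, the sketch below checks a recovery plan against an RPO and an RTO. It assumes periodic backups are the only recovery source, so the worst-case data loss equals the backup interval. All function names and figures are hypothetical and do not describe CDF's actual backup schedule or targets.

```python
from datetime import timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Worst case: the incident happens just before the next backup,
    so everything written since the previous backup is lost."""
    return backup_interval <= rpo


def meets_rto(
    detection: timedelta,
    restore: timedelta,
    verification: timedelta,
    rto: timedelta,
) -> bool:
    """Downtime is the time to detect the incident, restore the data,
    and verify the result before service resumes."""
    return detection + restore + verification <= rto


# Hypothetical targets and plan figures, for illustration only.
rpo = timedelta(hours=4)
rto = timedelta(hours=8)

print(meets_rpo(timedelta(hours=6), rpo))  # backups every 6 hours
print(meets_rpo(timedelta(hours=1), rpo))  # backups every hour
print(
    meets_rto(
        detection=timedelta(minutes=30),
        restore=timedelta(hours=5),
        verification=timedelta(hours=1),
        rto=rto,
    )
)
```

Both checks compare durations, which reflects that RPO and RTO are expressed in units of time rather than data volume.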
Full cluster restore
Used when infrastructure, data stores, and services are lost at the cloud provider, and when data integrity is lost because of malicious users or data-corrupting bugs.
CDF project restore
Tailored to situations where the damage to data or data integrity is limited to one or a few CDF projects.
Disaster recovery testing
Cognite performs extensive multi-service disaster recovery tests twice per year. The tests engage all Cognite teams that own services in production. In addition, new services and services that have undergone significant changes also need to pass single-service disaster recovery tests before they’re made available. We select scenarios for the disaster recovery tests using risk analysis, experiences from earlier disaster recovery tests, and needs for validating redundancy in resources, skills, or infrastructure. Examples of earlier DR tests include:
- simulating a complete outage of a cloud service provider region.
- data corruption in several data stores.
- user errors that corrupt a customer’s data model.