Corporate infrastructure security, compliance, and reliability practices are documented on the Cognite Trust Center. This page describes reliability and disaster recovery for CDF.
We’ve designed Cognite Data Fusion (CDF) to ensure that it’s available to end-users during incidents and to ensure quick recovery from failures.

High availability

We have designed each CDF component with high availability to avoid single points of failure and reduce the effects of infrastructure maintenance. CDF runs on elastic cloud infrastructure that autoscales compute and storage with system load, and we use Kubernetes rolling updates for deployments with no downtime. For backup schedules, retention, and restore procedures, see Availability and business continuity.

On-premises components

If the data is critical for production solutions or there is a risk of data loss, on-premises components, like extractors, should run in a high-availability hosting environment. To configure the extractors for high availability, we recommend installing extractors in mutually redundant environments and setting up the necessary failover. For example, if you have extractors in three different environments, you can:
  • Configure each extractor with an active instance in its primary environment and a passive instance in one of the other environments. Allow failover if one of the environments becomes unavailable.
  • Configure each extractor with an active instance in its primary environment and a passive instance in the other environments. Allow failover if two of the environments become unavailable.
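The two layouts above can be pictured with a short Python sketch. It is only an illustration of the failover logic, not an extractor configuration format: the environment names, extractor names, and the `pick_active_environment` and `plan` helpers are all hypothetical.

```python
from typing import Dict, List, Optional, Set


def pick_active_environment(preference: List[str], available: Set[str]) -> Optional[str]:
    """Return the first environment in the preference order that is still available."""
    for env in preference:
        if env in available:
            return env
    return None  # every environment hosting this extractor is unavailable


# Hypothetical layouts: the primary environment comes first, passive instances follow.
ONE_STANDBY: Dict[str, List[str]] = {
    "extractor-a": ["env-1", "env-2"],
    "extractor-b": ["env-2", "env-3"],
    "extractor-c": ["env-3", "env-1"],
}
TWO_STANDBYS: Dict[str, List[str]] = {
    "extractor-a": ["env-1", "env-2", "env-3"],
    "extractor-b": ["env-2", "env-3", "env-1"],
    "extractor-c": ["env-3", "env-1", "env-2"],
}


def plan(layout: Dict[str, List[str]], available: Set[str]) -> Dict[str, Optional[str]]:
    """Decide where each extractor should be active, given the available environments."""
    return {name: pick_active_environment(order, available) for name, order in layout.items()}
```

With one passive instance per extractor, every extractor keeps running when any single environment is lost. With passive instances in both other environments, every extractor keeps running as long as one environment remains.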
CDF automatically resolves conflicts if it receives duplicate data from multiple streaming extractors. Batch extractors automatically backfill any missing data when they’re brought online after a failure or migration.

Disaster recovery

Reliability also covers recovery from data loss and disaster scenarios. These types of incidents might result in downtime or permanent loss of data. Recovery often requires active intervention and depends heavily on careful planning. Disaster recovery is a subset of business continuity planning and builds on an impact analysis that defines the recovery time objective (RTO) and recovery point objective (RPO).

Recovery point objective (RPO)

The maximum duration of acceptable data loss. We measure RPO in units of time, not volume.

Recovery time objective (RTO)

The maximum duration of acceptable downtime.
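Because both objectives are durations, an incident can be checked against them with simple time arithmetic. The following Python sketch shows one way to do that. The function names and the example objectives (1 hour RPO, 4 hours RTO) are hypothetical and are not Cognite's actual targets.

```python
from datetime import datetime, timedelta


def data_loss_window(last_good_backup: datetime, incident: datetime) -> timedelta:
    """Time span of data written after the last usable backup and lost in the incident."""
    return incident - last_good_backup


def downtime(incident: datetime, service_restored: datetime) -> timedelta:
    """Time span during which the service was unavailable."""
    return service_restored - incident


def within_objective(measured: timedelta, objective: timedelta) -> bool:
    """True if the measured duration does not exceed the agreed objective."""
    return measured <= objective


# Hypothetical objectives taken from an impact analysis.
RPO = timedelta(hours=1)
RTO = timedelta(hours=4)

# Hypothetical incident timeline.
last_backup = datetime(2024, 1, 1, 11, 30)
incident_at = datetime(2024, 1, 1, 12, 0)
restored_at = datetime(2024, 1, 1, 15, 0)

rpo_met = within_objective(data_loss_window(last_backup, incident_at), RPO)
rto_met = within_objective(downtime(incident_at, restored_at), RTO)
```

In this timeline, 30 minutes of data is lost and the service is down for 3 hours, so both objectives are met.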
For CDF, we configure security in the same way in the disaster recovery environment as in the production environment. We verify the security through testing, monitoring of policies, and infrastructure as code. We host the continuous deployment (CD) environment and artifacts in a location that ensures they’re available and operational in the event of a disaster. Cognite offers two approaches to restoring data:

Full cluster restore

For situations with loss of infrastructure, data stores, and services from the cloud provider, and for cases of data integrity loss caused by malicious users or data-corrupting bugs.

CDF project restore

Tailored to situations where the damage to data or data integrity is limited to one or a few CDF projects.
There is a significant difference in RTO between the two approaches. CDF project restore benefits from the versioning history in the databases for time series and sequences. In contrast, full cluster restore requires restoring backups for all resource types. For the most critical resource types, Cognite can restore all backups to the same point in time, even if the backups for the different resource types run at different times.
When Cognite has completed the disaster recovery, the data in the customer’s CDF project returns to the state it had at the restore time. The data model will be consistent, but you must update CDF with any changes made in your source systems after the restore point. Make sure your business continuity plan includes the steps to resume feeding data from this point in time.
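As a rough illustration of resuming the data feed, the Python sketch below selects the source-system changes that were made after the restore point so that they can be sent to CDF again. It assumes your source system can report a change timestamp per record. The `SourceChange` type and the `changes_to_replay` helper are hypothetical and not part of any Cognite SDK.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List


@dataclass(frozen=True)
class SourceChange:
    """A hypothetical record of one change in a source system."""
    key: str
    changed_at: datetime


def changes_to_replay(changes: Iterable[SourceChange], restore_point: datetime) -> List[SourceChange]:
    """Return the changes made after the restore point, oldest first."""
    later = [change for change in changes if change.changed_at > restore_point]
    return sorted(later, key=lambda change: change.changed_at)
```

Replaying the changes in chronological order keeps the final state in CDF aligned with the source system when the same record changed more than once after the restore point.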

Disaster recovery testing

Cognite performs extensive multi-service disaster recovery tests twice per year. The tests engage all Cognite teams that own services in production. In addition, new services and services that have undergone significant changes must pass single-service disaster recovery tests before they’re made available. We select scenarios for the disaster recovery tests based on risk analysis, experience from earlier disaster recovery tests, and the need to validate redundancy in resources, skills, or infrastructure. Examples of earlier disaster recovery tests include:
  • A simulated complete outage of a cloud service provider region.
  • Data corruption in several data stores.
  • User errors that corrupt a customer’s data model.

Last modified on May 19, 2026