High availability
We have designed each CDF component for high availability to avoid single points of failure and to reduce the effects of infrastructure maintenance. The goal is to limit the impact of incidents quickly and automatically and to ensure that CDF continues to process requests, even during incidents. We use native cloud platform features as much as possible to ensure high availability and resilience. Cognite Data Fusion (CDF) runs on an elastic cloud infrastructure that scales up and down with the system load—it autoscales. CDF autoscales both the compute capacity (throughput) and the storage capacity (volume). For deployment, we use native Kubernetes self-healing and scaling functionality to do rolling updates with no downtime.

On-premises components
If the data is critical for production solutions or there is a risk of data loss, on-premises components, such as extractors, should run in a high-availability hosting environment. To configure extractors for high availability, we recommend installing them in mutually redundant environments and setting up the necessary failover. For example, if you have extractors in three different environments, you can:

- Configure each extractor with an active instance in its primary environment and a passive instance in one of the other environments. This allows failover if one of the environments becomes unavailable.
- Configure each extractor with an active instance in its primary environment and a passive instance in each of the other environments. This allows failover if two of the environments become unavailable.
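The active/passive pattern described above can be sketched as a simple heartbeat check: the standby takes over when the primary stops signaling liveness. This is a minimal illustration, not the actual extractor mechanism; the instance names, class, and timeout value are assumptions, and real deployments would typically use a shared state store or leader election service.

```python
import time

# Hypothetical sketch of active/passive extractor failover.
# The timeout and environment names are illustrative assumptions.
HEARTBEAT_TIMEOUT = 30.0  # seconds without a heartbeat before failover


class ExtractorInstance:
    def __init__(self, name, active=False):
        self.name = name
        self.active = active
        self.last_heartbeat = time.monotonic()

    def heartbeat(self):
        """Called periodically by the active instance to signal liveness."""
        self.last_heartbeat = time.monotonic()

    def is_alive(self):
        return time.monotonic() - self.last_heartbeat < HEARTBEAT_TIMEOUT


def elect_active(primary, standby):
    """Fail over to the standby if the primary has stopped heartbeating."""
    if primary.is_alive():
        primary.active, standby.active = True, False
    else:
        primary.active, standby.active = False, True
    return primary if primary.active else standby


primary = ExtractorInstance("env-a", active=True)
standby = ExtractorInstance("env-b")

# Simulate a missed heartbeat from the primary environment.
primary.last_heartbeat -= HEARTBEAT_TIMEOUT + 1
current = elect_active(primary, standby)
print(current.name)  # the standby in env-b takes over
```

In a real setup, the passive instance would also need the extractor's state (for example, last-extracted timestamps) so it can resume without gaps or duplicates.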
Backing up the system
The services that support the resource types defined in the API have dedicated storage for each type, and the backups for the resource types run at different times. Backups of the data stores for time series and sequences may take days, and these data types therefore have a different backup schedule than the other data types. For assets, events, time series, sequences, files, and relationships, Cognite can restore to points in time between backups using point-in-time restore (PITR).

Disaster recovery
Reliability also covers recovery from data loss and disaster scenarios. These types of incidents can result in downtime or permanent loss of data. Recovery often involves active intervention and depends heavily on careful planning. Disaster recovery is a subset of Business Continuity Planning and builds on an impact analysis that defines the recovery time objective (RTO) and the recovery point objective (RPO).

Recovery point objective (RPO)
The maximum duration of acceptable data loss. We measure RPO in units of time, not volume.
Recovery time objective (RTO)
The maximum duration of acceptable downtime.
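To illustrate how RPO relates to a backup schedule (the six-hour interval and timestamps below are assumed figures for the example, not CDF's actual schedules): without point-in-time restore, everything written since the last backup is lost, so the worst-case data loss equals the backup interval, and the achievable RPO can be no shorter than that interval.

```python
from datetime import datetime, timedelta

# Illustrative assumptions only, not actual CDF backup schedules.
backup_interval = timedelta(hours=6)
last_backup = datetime(2024, 1, 1, 12, 0)
incident = datetime(2024, 1, 1, 17, 30)

# Without point-in-time restore, data written after the last
# backup is lost when restoring from that backup.
data_loss = incident - last_backup
print(data_loss)                     # 5:30:00
print(data_loss <= backup_interval)  # True: worst case equals the interval
```

This is why point-in-time restore matters: replaying changes between backups shrinks the effective data loss window well below the backup interval.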
Full cluster restore
Covers situations with loss of infrastructure, data stores, or services from the cloud provider, and cases of data integrity loss caused by malicious users or data-corrupting bugs.
CDF project restore
Tailored to situations where the damage to data or data integrity is limited to one or a few CDF projects.
Disaster recovery testing
Cognite performs extensive multi-service disaster recovery tests twice per year. The tests engage all Cognite teams that own services in general availability. In addition, new services and services that have undergone significant changes must pass single-service disaster recovery tests before they're made available. We select scenarios for the disaster recovery tests based on risk analysis, experience from earlier disaster recovery tests, and the need to validate redundancy in resources, skills, or infrastructure. Examples of earlier DR tests include:

- simulating a complete outage of a cloud service provider region.
- data corruption in several data stores.
- user errors that corrupt a customer’s data model.
Redundancy, cloud provider features, and incident handling
CDF has multiple redundancy levels and relies on standard cloud provider features for disaster recovery:

- The cloud backbone network uses advanced software-defined networking and edge-caching services to deliver fast, consistent, and scalable performance.
- Multiple points of presence across the globe provide robust redundancy. Your data is mirrored automatically across storage devices in multiple locations.
- Cloud services are designed to autoscale, even when you experience a huge traffic spike.
- Maintaining data security is critical for cloud providers. Reliability engineering teams help ensure high availability and prevent abuse of platform resources.
- Cloud providers regularly undergo independent third-party audits to verify that their services align with security, privacy, and compliance regulations and with best practices.