Architecting for reliability ensures that Cognite Data Fusion (CDF) is available to end-users during incidents and quickly recovers from failures.
We have designed each CDF component with high availability to avoid single points of failure and reduce the effects of infrastructure maintenance. The goal is to eliminate the impact of incidents quickly and automatically and ensure that CDF continues to process requests, even during incidents.
We use native cloud platform features as much as possible to ensure high availability and resilience. CDF runs on an elastic cloud infrastructure that autoscales up and down with the system load, covering both compute capacity (throughput) and storage capacity (volume).
For deployment, we use native Kubernetes self-healing and scaling functionality to perform rolling updates with no downtime.
If the data is critical for production solutions or there is a risk of data loss, on-premises components, like extractors, should run in a high-availability hosting environment. To configure the extractors for high availability, we recommend installing extractors in mutually redundant environments and setting up the necessary failover. For example, if you have extractors in three different environments, you can:
- Configure each extractor with an active instance in its primary environment and a passive instance in one of the other environments. This allows failover if one of the environments becomes unavailable.
- Configure each extractor with an active instance in its primary environment and a passive instance in each of the other environments. This allows failover if two of the environments become unavailable.
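The active/passive layout above can be sketched as a priority list per extractor. This is a minimal illustration, not a Cognite tool: the extractor names, environment names, and the `plan_failover` helper are all hypothetical, and a real setup would also need health checks and a mechanism to promote the passive instance.

```python
from typing import Dict, List, Optional, Set

# Hypothetical placement: each extractor lists its environments in priority
# order. The first entry hosts the active instance; the rest host passive ones.
PLACEMENT: Dict[str, List[str]] = {
    "extractor-a": ["env-1", "env-2", "env-3"],
    "extractor-b": ["env-2", "env-3", "env-1"],
    "extractor-c": ["env-3", "env-1", "env-2"],
}


def active_environment(priorities: List[str], unavailable: Set[str]) -> Optional[str]:
    """Return the first available environment in priority order, or None."""
    for env in priorities:
        if env not in unavailable:
            return env
    return None


def plan_failover(
    placement: Dict[str, List[str]], unavailable: Set[str]
) -> Dict[str, Optional[str]]:
    """Decide where each extractor should be active given the failed environments."""
    return {
        name: active_environment(envs, unavailable)
        for name, envs in placement.items()
    }
```

With a passive instance in every other environment, as in this placement, each extractor still has somewhere to run when two of the three environments are down. Shortening each list to two entries models the first option, which tolerates the loss of one environment.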
CDF automatically resolves conflicts if it receives duplicate data from multiple streaming extractors. Batch extractors automatically backfill any missing data when they are brought online after a failure or migration.
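To see why duplicate deliveries from redundant extractors can be harmless, consider an idempotent write keyed by series and timestamp. The sketch below illustrates the general idea only; it is not a description of how CDF resolves conflicts internally, and the in-memory `store` and the `upsert` helper are assumptions made for the example.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical data point shape: (series_id, timestamp_ms, value).
Datapoint = Tuple[str, int, float]


def upsert(store: Dict[Tuple[str, int], float], points: Iterable[Datapoint]) -> None:
    """Write points keyed by (series_id, timestamp_ms).

    Writing the same point twice leaves the store unchanged, so overlapping
    deliveries from two extractors do not create duplicates.
    """
    for series_id, timestamp_ms, value in points:
        store[(series_id, timestamp_ms)] = value


store: Dict[Tuple[str, int], float] = {}
from_primary = [("pump-1", 1000, 4.2), ("pump-1", 2000, 4.3)]
from_standby = [("pump-1", 2000, 4.3), ("pump-1", 3000, 4.1)]  # overlaps at t=2000

upsert(store, from_primary)
upsert(store, from_standby)
```

After both writes, the store holds three points rather than four, because the overlapping point at `t=2000` is written to the same key.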
Backing up CDF
The services that support the resource types defined in the API have dedicated storage for each type, and the backups for the resource types run at different times.
Backups of the data stores for time series and sequences may take days, and these data types have a different backup schedule than other data types. For assets, events, time series, sequences, files, and relationships, Cognite can restore to points in time between backups using point-in-time restore (PTR).
Reliability also covers recovery from data loss and disaster scenarios. These types of incidents might result in downtime or permanent loss of data. Recovery often involves active intervention and depends largely on careful planning. Disaster recovery is a subset of business continuity planning and builds on an impact analysis that defines the recovery time objective (RTO) and recovery point objective (RPO).
- Recovery point objective (RPO) is the maximum duration of acceptable data loss. We measure RPO in units of time, not volume.
- Recovery time objective (RTO) is the maximum duration of acceptable downtime.
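Because both objectives are durations, checking an incident against them is simple time arithmetic. The following is a minimal sketch with made-up timestamps and targets; the function names are hypothetical, and the values say nothing about actual CDF backup schedules or commitments.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


def data_loss_window(restore_points: List[datetime], incident: datetime) -> Optional[timedelta]:
    """Time between the incident and the latest restore point before it."""
    earlier = [p for p in restore_points if p <= incident]
    if not earlier:
        return None  # nothing to restore from
    return incident - max(earlier)


def meets_objectives(
    restore_points: List[datetime],
    incident: datetime,
    service_restored: datetime,
    rpo: timedelta,
    rto: timedelta,
) -> Tuple[bool, bool]:
    """Return (RPO met, RTO met) for a single incident."""
    loss = data_loss_window(restore_points, incident)
    downtime = service_restored - incident
    return (loss is not None and loss <= rpo, downtime <= rto)


# Illustrative values only.
backups = [datetime(2024, 1, 1, 0), datetime(2024, 1, 1, 6), datetime(2024, 1, 1, 12)]
incident = datetime(2024, 1, 1, 14, 30)
restored = datetime(2024, 1, 1, 16, 0)

result = meets_objectives(
    backups, incident, restored, rpo=timedelta(hours=3), rto=timedelta(hours=1)
)
```

In this example, the data loss window is 2.5 hours, which is within the 3-hour RPO, while the 1.5 hours of downtime exceeds the 1-hour RTO.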
For CDF, we configure security in the same way in the disaster recovery environment as in the production environment. We verify the security through testing, monitoring of policies, and infrastructure as code. We host the continuous deployment (CD) environment and artifacts in a location that ensures they are available and operational in the event of a disaster.
Cognite offers two approaches to restoring data:
- Full cluster restore is for situations where infrastructure, data stores, and services from the cloud provider are lost, and for cases of data integrity loss caused by malicious users or data-corrupting bugs.
- CDF project restore is tailored to situations where the damage to data or data integrity is limited to one or a few CDF projects.
There is a significant difference in RTO between the two approaches. CDF project restore benefits from versioning history in the databases for time series and sequences. In contrast, full cluster restore requires us to restore backups for all resource types.
For the most critical resource types, Cognite can restore all backups to the same point in time, even if the backups for the different resource types run at different times.
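One way to reason about a shared restore point is to treat each resource type as having a restorable window and to pick a time that falls inside all of them. The sketch below assumes such windows exist and are known; the window values and the `latest_common_restore_point` helper are hypothetical and do not describe Cognite's actual restore procedure.

```python
from datetime import datetime
from typing import Dict, Optional, Tuple

# Hypothetical restorable window per resource type: (earliest, latest).
Window = Tuple[datetime, datetime]


def latest_common_restore_point(windows: Dict[str, Window]) -> Optional[datetime]:
    """Return the latest time covered by every window, or None if they do not overlap."""
    if not windows:
        return None
    start = max(earliest for earliest, _ in windows.values())
    end = min(latest for _, latest in windows.values())
    return end if start <= end else None


windows = {
    "assets": (datetime(2024, 1, 1), datetime(2024, 1, 10, 12)),
    "events": (datetime(2024, 1, 3), datetime(2024, 1, 10, 9)),
    "time series": (datetime(2024, 1, 2), datetime(2024, 1, 9, 18)),
}

restore_point = latest_common_restore_point(windows)
```

Here the resource type whose window ends earliest determines the latest point to which all three types can be restored together.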
When Cognite has completed the disaster recovery, the data in the customer's CDF project is returned to the state it had at the restore point. The data model will be consistent, but you must update CDF with any changes made in your source systems after the restore point. Ensure that your business continuity plan includes the steps to resume feeding data from this point in time.
Disaster recovery testing
Cognite performs extensive multi-service disaster recovery tests twice per year. The tests engage all Cognite teams that own services in general availability. In addition, new services and services that have undergone significant changes must pass single-service disaster recovery tests before they are made generally available.
We select scenarios for the disaster recovery tests based on risk analysis, experience from previous disaster recovery tests, and the need to validate redundancy in resources, skills, or infrastructure. Examples of previous disaster recovery tests include simulating a complete outage of a cloud service provider region, data corruption in several data stores, and user errors accidentally corrupting a customer's data model.
Redundancy, cloud provider features, and incident handling
CDF has multiple redundancy levels and relies on standard cloud provider features for disaster recovery:
- The cloud backbone network uses advanced software-defined networking and edge-caching services to deliver fast, consistent, and scalable performance.
- Multiple points of presence across the globe provide robust redundancy. Your data is mirrored automatically across storage devices in multiple locations.
- Cloud services are designed to autoscale even when you experience a huge traffic spike.
- Maintaining data security is critical for cloud providers. Reliability engineering teams help ensure high availability and prevent abuse of platform resources.
- Cloud providers regularly undergo independent third-party audits to verify that their services align with security, privacy, compliance regulations, and best practices.
In addition to the standard cloud provider features, Cognite has a well-defined incident handling process, and we offer 24x7 support and an engineering on-call rotation to resolve incidents.