# Data governance

Before you start integrating and enhancing data in Cognite Data Fusion (CDF), define and implement your data governance policies. Data governance is a set of principles and practices that ensure high data quality throughout your data's lifecycle. It is a key part of data operations and helps you continuously optimize your data management practices.

We recommend that you appoint a CDF admin who can work with the IT department to ensure that CDF follows your organization’s security practices. Also, connect CDF to your identity provider (IdP), and use the existing IdP user identities to manage access to CDF and the data stored in it.

This unit looks at the CDF tools and features you can use to make sure that your data conforms to your organization's and your users' expectations.

# Secure access management

To control access to data in CDF, you define which capabilities users or applications have for the different resource types in CDF, for example, whether they can read a time series or delete an asset.

Instead of assigning capabilities directly to individual users and service accounts (for applications and services), you define capabilities for a group and then make the users or service accounts members of the relevant groups. Because you change a policy once on the group rather than on each member, you can manage and update your data governance policies quickly.

Both service accounts and users can be members of a group, but they differ in the way they get access to CDF:

  • Service accounts enable apps and services, for example, extractors and machine learning models, to interact with CDF resource types through the Cognite API or one of our SDKs.

    Each service account has its own API key that connects one application or service to one CDF project. Never share the same API key between multiple applications or services.

  • Users can use their existing organizational identity to sign in to CDF and related applications such as Asset Data Insight and InField. You can continue to manage the organizational identities for users in your organization's Identity Provider (IdP) service outside of CDF.

    We support Microsoft's Azure Active Directory (Azure AD), Google Cloud, and other OpenID Connect-compliant providers.
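
The group-based model described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the CDF API: the `Group` class, the `is_allowed` helper, and the group, user, and capability names are all hypothetical and chosen only to show how capabilities flow from groups to their members.

```python
from dataclasses import dataclass, field

# Hypothetical model for illustration only; none of these names come from CDF.
Capability = tuple[str, str]  # (resource type, action), e.g. ("timeseries", "read")


@dataclass
class Group:
    name: str
    capabilities: set[Capability] = field(default_factory=set)
    members: set[str] = field(default_factory=set)  # user or service account IDs


def is_allowed(principal: str, resource_type: str, action: str, groups: list[Group]) -> bool:
    """Return True if any group the principal belongs to grants the capability."""
    return any(
        principal in group.members and (resource_type, action) in group.capabilities
        for group in groups
    )


readers = Group("timeseries-readers", {("timeseries", "read")}, {"alice", "extractor-sa"})
asset_admins = Group("asset-admins", {("assets", "read"), ("assets", "delete")}, {"bob"})
groups = [readers, asset_admins]

print(is_allowed("alice", "timeseries", "read", groups))  # True
print(is_allowed("alice", "assets", "delete", groups))    # False

# A policy change is made once on the group and applies to every member.
readers.capabilities.add(("assets", "read"))
print(is_allowed("extractor-sa", "assets", "read", groups))  # True
```

Access is never stored on the user or service account itself, so revoking or extending a capability touches one group definition instead of every member.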

# Data lineage and integrity

When you rely on data to make operational decisions, it is critical that both you and your end users know when the data is reliable. CDF has tools and features to ensure that your data conforms to organizational and user expectations.

# Data sets

Data sets let you document and track data lineage, ensure data integrity, and allow third parties to write their insights securely back to your CDF project. We recommend that you organize all data in CDF in data sets so that you always know where the data comes from and who is responsible for it.

Data sets group and track data by its source. For example, a data set can contain all work orders originating from SAP. Typically, an organization will have one data set for each of its data ingestion pipelines in CDF. Each data object in CDF can belong to only one data set.

A data set is a container for data objects and has metadata with information about the data it contains. For example, you can use the data set metadata to document who is responsible for the data, upload documentation files, describe the data lineage, and so on. In CDF, data sets are a separate resource type.

Typically, the data ingestion pipelines define programmatically which data objects, for example, events, files, and time series, belong to a data set. Because a data object can belong to only one data set, you can unambiguously trace the data lineage for each data object.
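
As a rough sketch of that idea, the following Python models an ingestion pipeline that tags every object it produces with its data set. It is an in-memory illustration under stated assumptions, not the CDF API: the `DataSet` and `DataObject` classes, the `ingest` and `lineage` helpers, and the metadata keys and values (such as the owner) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DataSet:
    # Hypothetical stand-in for a data set: an ID plus descriptive metadata.
    id: int
    name: str
    metadata: dict[str, str] = field(default_factory=dict)


@dataclass
class DataObject:
    external_id: str
    kind: str  # for example "event", "file", or "timeseries"
    # A single ID (not a list) encodes the rule that an object has one data set.
    data_set_id: Optional[int] = None


def ingest(records: list[dict[str, str]], data_set: DataSet) -> list[DataObject]:
    """Tag every record produced by one pipeline with that pipeline's data set."""
    return [DataObject(r["external_id"], r["kind"], data_set.id) for r in records]


def lineage(obj: DataObject, data_sets: dict[int, DataSet]) -> dict[str, str]:
    """Look up the metadata of the single data set an object belongs to."""
    return data_sets[obj.data_set_id].metadata


sap = DataSet(1, "sap-work-orders", {"source": "SAP", "owner": "maintenance-team"})
objects = ingest([{"external_id": "WO-1001", "kind": "event"}], sap)
print(lineage(objects[0], {sap.id: sap}))
```

Because each object carries exactly one data set ID, the lineage lookup always resolves to a single source and a single responsible owner.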

# Data quality monitoring

Data quality monitoring lets you track the quality of the time series data used by apps and models that run on data from CDF. For example, a monitor can contain all the time series that feed a data science model tracking the health of an oil well.

Use rule sets to specify the data quality requirements for a monitor's data and to continuously verify that the data meets those requirements. Each data object in CDF can belong to multiple monitors.

You can set up alerts via email or a webhook URL to notify you whenever a data quality rule is broken and when the data quality is restored.
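
The rule-and-alert behavior described above can be illustrated with a small Python sketch. It is not how CDF implements monitoring: the `Rule` and `Monitor` classes, the two example rules (`max_gap` and `in_range`), and the alert message format are hypothetical, and a plain callback stands in for the email or webhook notification.

```python
from dataclasses import dataclass, field
from typing import Callable

Points = list[tuple[float, float]]  # (timestamp in seconds, value)


@dataclass
class Rule:
    name: str
    check: Callable[[Points], bool]  # returns True when the data meets the requirement


def max_gap(seconds: float) -> Rule:
    """Hypothetical rule: consecutive points are at most `seconds` apart."""
    def check(points: Points) -> bool:
        return all(b[0] - a[0] <= seconds for a, b in zip(points, points[1:]))
    return Rule(f"max_gap<={seconds}s", check)


def in_range(low: float, high: float) -> Rule:
    """Hypothetical rule: every value lies within [low, high]."""
    def check(points: Points) -> bool:
        return all(low <= value <= high for _, value in points)
    return Rule(f"value in [{low}, {high}]", check)


@dataclass
class Monitor:
    rules: list[Rule]
    notify: Callable[[str], None]  # stand-in for an email or webhook sender
    broken: set[str] = field(default_factory=set)

    def evaluate(self, points: Points) -> None:
        """Alert only on transitions: when a rule breaks and when it is restored."""
        for rule in self.rules:
            ok = rule.check(points)
            if not ok and rule.name not in self.broken:
                self.broken.add(rule.name)
                self.notify(f"BROKEN: {rule.name}")
            elif ok and rule.name in self.broken:
                self.broken.remove(rule.name)
                self.notify(f"RESTORED: {rule.name}")


alerts: list[str] = []
monitor = Monitor([max_gap(60), in_range(0, 100)], alerts.append)
monitor.evaluate([(0, 10.0), (30, 12.0), (200, 11.0)])  # 170 s gap breaks max_gap
monitor.evaluate([(0, 10.0), (30, 12.0), (200, 11.0)])  # still broken, no repeat alert
monitor.evaluate([(0, 10.0), (30, 12.0), (60, 11.0)])   # gap closed, rule restored
print(alerts)  # ['BROKEN: max_gap<=60s', 'RESTORED: max_gap<=60s']
```

Tracking which rules are currently broken is what lets the sketch send one alert per state change rather than one per evaluation.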

Last Updated: 12/1/2020, 10:23:35 AM