Get started with data sets
Data sets let you document and track data lineage, ensure data integrity, and allow third parties to write their insights securely to your Cognite Data Fusion (CDF) project.
What is a data set?
Data sets group and track data by its source. For example, a data set can contain all work orders originating from SAP or the output data from a third-party partner's machine-learning model. Typically, organizations have one data set for each data ingestion pipeline in CDF.
A data set is a container for data objects with metadata about the data it contains. For example, you can use the data set metadata to document who is responsible for the data, upload documentation files, and describe the data lineage. In CDF, data sets are a separate resource type with a /datasets API endpoint.
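As a rough illustration of a data set as a container with descriptive metadata, the sketch below builds a JSON request body for one data set. The helper name build_data_set_request, the field names (externalId, name, description, metadata), the "items" wrapper, and all values are assumptions made for this example; check the CDF API reference for the actual /datasets schema before relying on them.

```python
import json


def build_data_set_request(external_id, name, description, metadata):
    """Build a JSON-serializable body describing a single data set.

    The field names and the "items" wrapper are assumptions for this
    sketch, not a confirmed schema for the /datasets endpoint.
    """
    item = {
        "externalId": external_id,
        "name": name,
        "description": description,
        # Metadata is modeled as string key/value pairs (assumption).
        "metadata": {str(k): str(v) for k, v in metadata.items()},
    }
    return {"items": [item]}


# Hypothetical data set for an SAP work order ingestion pipeline.
body = build_data_set_request(
    external_id="sap-work-orders",
    name="SAP work orders",
    description="Work orders ingested from SAP",
    metadata={"owner": "data-manager@example.com", "source": "SAP"},
)
print(json.dumps(body, indent=2))
```

The metadata dictionary is where ownership and lineage notes would go in this sketch; nothing here sends a request to CDF.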
To define which data objects, such as events, files, and time series, belong to a data set, you specify the relevant dataSetId field for each data object. This is typically done programmatically in the data ingestion pipelines. Data objects can belong to only one data set, so you can unambiguously trace the data lineage for each data object.
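The assignment described above could look roughly like the following inside an ingestion pipeline. This is a minimal sketch that works on plain dictionaries: the helper assign_to_data_set, the numeric ID, and the event fields other than dataSetId are hypothetical and not taken from the CDF API reference.

```python
def assign_to_data_set(objects, data_set_id):
    """Return copies of the objects with their dataSetId field set.

    Each object carries a single dataSetId value, so it can point to
    only one data set at a time.
    """
    return [{**obj, "dataSetId": data_set_id} for obj in objects]


# Hypothetical numeric ID of the data set for the SAP pipeline.
SAP_DATA_SET_ID = 1234567890

# Hypothetical events produced by the ingestion pipeline.
events = [
    {"externalId": "wo-1001", "type": "workorder"},
    {"externalId": "wo-1002", "type": "workorder"},
]

tagged_events = assign_to_data_set(events, SAP_DATA_SET_ID)

# Lineage lookup: which data set does a given object come from?
lineage = {e["externalId"]: e["dataSetId"] for e in tagged_events}
print(lineage)
```

Because the data set reference is a single field rather than a list, looking up an object's origin is a direct read of that field.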
You can organize the following resource types into data sets:
- Assets
- Events
- Files
- Time series
- Sequences
Learn more about resource types in the CDF data model.
Why use data sets?
For proper data governance, you need to trace the data lineage to understand where data originates and to be confident that the data is reliable. Data managers need to ensure data integrity and to let third parties write data to CDF.