Create data sets and add data
A data set is a container for data objects with metadata about the data. For instance, you can use the data set metadata to document who is responsible for the data, upload documentation files, and describe the data lineage. In Cognite Data Fusion (CDF), you'll see data sets as a separate resource type with a /datasets API endpoint.
To define which data objects, such as events, files, and time series, belong to a data set, you specify the relevant dataSetId for each data object, typically in a data ingestion pipeline. Data objects can belong to only one data set, so you can unambiguously trace the data lineage for each data object.
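As a rough illustration of the ingestion step described above, the following minimal sketch tags a batch of data objects with a dataSetId before they are written. It uses only the Python standard library and plain dictionaries; the helper name assign_data_set, the example events, and the ID value are hypothetical and not part of any CDF SDK. The check that refuses to move an object to a different data set is one possible way to reflect the one-data-set-per-object rule in pipeline code, not a description of how CDF enforces it.

```python
from typing import Any, Dict, List


def assign_data_set(items: List[Dict[str, Any]], data_set_id: int) -> List[Dict[str, Any]]:
    """Return copies of the items with dataSetId set to data_set_id.

    Hypothetical helper: raises ValueError if an item already carries a
    different dataSetId, because an object can belong to only one data set.
    """
    tagged = []
    for item in items:
        existing = item.get("dataSetId")
        if existing is not None and existing != data_set_id:
            raise ValueError(
                f"{item.get('externalId')!r} already belongs to data set {existing}"
            )
        # Copy the item so the caller's original payload stays untouched.
        tagged.append({**item, "dataSetId": data_set_id})
    return tagged


# Illustrative event payloads; the field values are made up for this sketch.
events = [
    {"externalId": "pump-42-start", "type": "maintenance"},
    {"externalId": "pump-42-stop", "type": "maintenance"},
]

# 1234567890 stands in for the ID of a data set that already exists.
tagged_events = assign_data_set(events, data_set_id=1234567890)
```

In a real pipeline, the tagged payloads would then be passed to whatever client or request code writes the events, files, or time series to CDF.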
Step 1: Create a data set
- Navigate to Data management > Data catalog.
- Select Create data set.
- Fill in the basic information and select Create to create the data set.
- Follow the steps in the wizard to fill in basic information about the data set, document the data extraction and transformation processes, and add extra documentation for your data set.
The documentation provides data consumers and data managers with the lineage documentation they need for the data set. We recommend that you upload documentation about how the data is ingested. For example, the documentation could include instructions about how to sign in to a computer where an extractor is running and describe the type of data processing that has been done.
You don't have to add all documentation at once. We recommend that you update the information for the data set as you proceed with your data ingestion work.
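If you prefer to script this step instead of using the wizard, the same basic information could be sent to the /datasets API endpoint. The sketch below only builds the request path and JSON body and does not send anything. The path layout and the field names (externalId, name, description, metadata, writeProtected) are assumptions based on the general shape of the CDF API, and the project name, identifiers, and metadata values are hypothetical, so verify them against the API reference before relying on them.

```python
import json
from typing import Any, Dict, Optional, Tuple


def build_create_data_set_request(
    project: str,
    external_id: str,
    name: str,
    description: str,
    write_protected: bool = False,
    metadata: Optional[Dict[str, str]] = None,
) -> Tuple[str, Dict[str, Any]]:
    """Build the path and JSON body for a create-data-set request.

    Assumption: the endpoint accepts a body of the form {"items": [...]}
    under /api/v1/projects/<project>/datasets.
    """
    path = f"/api/v1/projects/{project}/datasets"
    item: Dict[str, Any] = {
        "externalId": external_id,
        "name": name,
        "description": description,
        "writeProtected": write_protected,
    }
    if metadata:
        # Free-form key/value pairs, for example who is responsible for the data.
        item["metadata"] = dict(metadata)
    return path, {"items": [item]}


# Hypothetical values, for illustration only.
path, body = build_create_data_set_request(
    project="my-project",
    external_id="src:pump-telemetry",
    name="Pump telemetry",
    description="Sensor data ingested from the pump extractor.",
    metadata={"owner": "data-team@example.com"},
)
payload = json.dumps(body, indent=2)
```

The resulting path and payload can be handed to any HTTP client together with the base URL and credentials for your CDF project. Documentation files and lineage details can still be added later in the wizard.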
You can also:
- Mark the data set as write-protected to ensure the integrity of the data it contains.
- Set labels for the data set, for instance, to group similar data sets and make them more discoverable.
- Set the governance status for the data set to indicate whether it has a defined owner and follows the data governance processes in your organization.
-