Data ingested into the CDF staging area, RAW, must fit the CDF data model before you can move the data into the CDF resource types. With CDF Transformations, you shape and move the data using Spark SQL queries that create, update, and delete data. Run one-time transformations or set up a schedule to run transformations regularly, for example, once a day to pick up new data from the extraction process. Set up email notifications to quickly catch any transformation failures.
Before you start
To access and run transformations, you need the capabilities listed here.
Navigate to Cognite Data Fusion and select Integrate > Transform data.
Select Create transformation, and enter a unique name and a unique external ID.
Optionally, associate the transformation with an existing data set.
To make a copy of an existing transformation, select Duplicate under More options (...).
Query and preview transformations
On the Recipe tab:
Select the destination resource type. If you're ingesting data into the Asset hierarchy, make sure a parent asset already exists in CDF.
Select an action. Notice that the Destination field lists the required and optional columns in the schema. Optional columns are marked as nullable when you hover over the name.
In the SQL editor, specify the Spark SQL query to select the RAW data you want to transform and how it should be transformed. A maximum of 10,000 rows will be displayed in the results view.
Select Preview to preview the query results.
Verify that the transformation produces the expected output in the query result. You'll see the columns that make up the schema for the chosen destination resource type above the Preview table. Hover over a column to see the type. You'll also see any source columns that don't exist in the destination schema and columns that have the wrong type in the destination schema.
Columns that are nullable may or may not be required by the destination schema. Check the API reference documentation for the relevant resource type.
- Optionally, preview RAW tables directly in the recipe editor by selecting the RAW button. The tables open in a new tab to make sure you don't lose the query results.
When you have verified the transformation, switch to the Schedule and Run tab to complete the preparations for the transformation job.
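The column checks described above (columns missing from the result, source columns that don't exist in the destination, and columns with the wrong type) can be sketched outside the UI as follows. This is a minimal illustration only: the `Column` class, the `check_columns` function, and the example schema are hypothetical and are not part of CDF or any Cognite SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Hypothetical description of one destination schema column."""
    name: str
    type: str
    nullable: bool


def check_columns(destination, result_types):
    """Compare query result columns (name -> type) with a destination schema.

    Only non-nullable columns are reported as missing. Nullable columns may
    still be required by the API, so this check is deliberately conservative.
    """
    dest_by_name = {col.name: col for col in destination}
    return {
        "missing_required": sorted(
            col.name for col in destination
            if not col.nullable and col.name not in result_types
        ),
        "not_in_destination": sorted(
            name for name in result_types if name not in dest_by_name
        ),
        "wrong_type": sorted(
            name for name, col_type in result_types.items()
            if name in dest_by_name and dest_by_name[name].type != col_type
        ),
    }


# Assumed, simplified schema used only for this example.
schema = [
    Column("name", "string", nullable=False),
    Column("description", "string", nullable=True),
]
report = check_columns(schema, {"description": "int", "foo": "string"})
print(report)
```

The Preview in the recipe editor remains the authoritative check. A helper like this is only useful for reasoning about why a query result does not match the destination schema.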
Handle null values in transformation updates
When you run SQL transformations to update CDF resources, you must decide how the transformation should treat null values in the query result:
- If fields in a CDF resource need to be set to null (clearing the existing values of the fields), select NULL means clear. This typically occurs when a piece of equipment is removed for maintenance. Use this option to disassociate the asset from its parent in the asset hierarchy.
- If fields in a CDF resource should not be updated (kept as is), select NULL means ignore. Use this option to keep the existing value as it is in CDF. This is the default value when you run transformations.
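The difference between the two options can be illustrated with a small model of an update. This is a sketch of the semantics described above, not the CDF implementation. The `apply_update` function and the example field values are hypothetical.

```python
def apply_update(existing, update, null_means_clear=False):
    """Return the resource after applying an update row.

    null_means_clear=False models "NULL means ignore" (the default):
    null values in the update leave the existing field untouched.
    null_means_clear=True models "NULL means clear":
    null values in the update reset the field to null.
    """
    result = dict(existing)
    for field, value in update.items():
        if value is None:
            if null_means_clear:
                result[field] = None
            # Otherwise keep the existing value as it is.
        else:
            result[field] = value
    return result


asset = {"name": "pump", "parentExternalId": "site-1"}
row = {"name": "pump-2", "parentExternalId": None}

print(apply_update(asset, row))                         # parent is kept
print(apply_update(asset, row, null_means_clear=True))  # parent is cleared
```

With NULL means ignore, the asset keeps its parent. With NULL means clear, the same row disassociates the asset from its parent.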
Select authentication method for reading/writing data
You need to specify either the OIDC credentials or the API keys that the transformation should use to authenticate when reading data from and writing data to CDF. Having separate OIDC credentials or API keys for reading and writing allows you to transform data between different projects.
For CDF projects running in Microsoft Azure, only OIDC credentials are supported. Using API keys is deprecated for CDF projects running in Google Cloud Platform. Learn more.
To authenticate using OIDC credentials, you need to specify the Client ID and Client secret, which are values from the CDF Transformations access configuration with Azure Active Directory (AAD). The remaining fields are prefilled, but you can edit these.
In the Client ID field, enter the object ID for the AAD group exactly as it exists in AAD.
In the Client secret field, enter the client secret for the AAD group exactly as it exists in AAD.
The Scopes field is the base URL of your CDF instance followed by /.default. To edit, use this format: <BASE_URL>/.default
The Token URL field is the token endpoint of your identity provider (IdP). To edit, use this format: https://login.microsoftonline.com/<YOUR_AAD_TENANT_ID>/oauth2/v2.0/token
The CDF project name field is the CDF project you have signed in to. To edit, use this format: <your_CDF_project_name>
If you don't know what values to enter in these fields, contact your internal help desk or the CDF admin for help.
- To authenticate using API keys, enter the API key generated for the CDF project under Manage and Configure > Manage access > API keys.
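As a sketch of how the prefilled OIDC values fit together, the following builds the Scopes and Token URL strings from a base URL and an AAD tenant ID in the formats given above. The `oidc_fields` function is hypothetical, and the base URL, tenant ID, and project name in the example are placeholders, not values to copy.

```python
def oidc_fields(base_url, tenant_id, project):
    """Build the Scopes, Token URL, and project name field values."""
    base = base_url.rstrip("/")  # avoid a double slash before .default
    return {
        "scopes": f"{base}/.default",
        "token_url": (
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
        ),
        "cdf_project_name": project,
    }


# Placeholder values for illustration only.
fields = oidc_fields(
    base_url="https://api.cognitedata.com/",
    tenant_id="<YOUR_AAD_TENANT_ID>",
    project="<your_CDF_project_name>",
)
for name, value in fields.items():
    print(f"{name}: {value}")
```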
If your destination resource type is RAW, specify the RAW database and table you want to write to.
You must create the RAW database and table before you can run the transformation, for instance, using the RAW Explorer.
Subscribe to email notifications to monitor the transformation process and solve any issues before they reach the data consumer. You can add up to 5 email addresses, for groups or for specific people.
Select the Transform button to manually start a new transformation job, or follow the steps in Schedule transformations below to schedule your transformation to run at regular intervals.
On the Schedule and Run tab, you can specify a schedule that determines when and how often the transformation should run.
Schedules are specified as cron expressions. For example, 45 23 * * * will run the transformation at 23:45 (11:45 PM) every day.
Select Add schedule to activate the schedule. When a transformation is scheduled, it becomes read-only to prevent unintentional changes to future scheduled jobs.
Pause a scheduled transformation that is not running by clicking the Pause icon on the transformation overview page.
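To see how a cron expression such as the 45 23 * * * example breaks down, the sketch below splits an expression into its fields and reports the time of day for simple daily schedules. It assumes the standard five-field cron layout (minute, hour, day of month, month, day of week). The `parse_cron` and `daily_time` helpers are hypothetical and do not validate ranges or step syntax.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def parse_cron(expression):
    """Split a five-field cron expression into named fields."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(
            f"expected {len(CRON_FIELDS)} fields, got {len(parts)}"
        )
    return dict(zip(CRON_FIELDS, parts))


def daily_time(expression):
    """Return 'HH:MM' if the schedule runs once a day at a fixed time.

    Returns None for anything more complex (steps, lists, ranges, or
    restrictions on day or month).
    """
    fields = parse_cron(expression)
    runs_every_day = all(
        fields[name] == "*"
        for name in ("day_of_month", "month", "day_of_week")
    )
    if runs_every_day and fields["minute"].isdigit() and fields["hour"].isdigit():
        return f"{int(fields['hour']):02d}:{int(fields['minute']):02d}"
    return None


print(parse_cron("45 23 * * *"))
print(daily_time("45 23 * * *"))
```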