# CDF Transformations

With CDF Transformations, you can use Spark SQL queries to transform data from the CDF staging area, RAW, into the CDF data model. CDF Transformations is an integrated part of CDF, and you can run it in your browser.

If you cannot implement your transformation logic in SQL, use a tool other than CDF Transformations that enables programmatic data transformations, for example Databricks.

## Getting started

NOTE

You need to be a member of the transformations group to access and use CDF Transformations. Contact your project administrator if you need access.

To open CDF Transformations, navigate to fusion.cognite.com and select Integrate > Transform data.

In the overview window you can see all transformations that you have access to.

  • To edit a transformation, click the name of the transformation.

  • To create a new transformation, click New transformation.

  • To make a copy of an existing transformation, click Duplicate.

## Create and run a transformation

To create and run a transformation with CDF Transformations:

  1. Open Integrate > Transform data.

  2. In the Transformations window, select New transformation.

    The new transformation opens in the transformation editor.

  3. On the Recipe tab, enter a transformation name using the pencil icon and select the desired destination resource type.

  4. In the SQL editor, specify the Spark SQL query that selects the RAW data you want to transform and defines how to transform it.

    See Writing SQL queries for tips on how to read data from RAW tables and other CDF resource types.
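
    For example, here's a minimal sketch of a query that shapes RAW rows to fit the assets destination schema. The table `mydb.mytable` and its columns `tag`, `description`, `supplier`, and `model` are placeholder names, and `to_metadata` is described under Custom SQL functions below:

    ```sql
    select
      concat('pump_', tag) as externalId,  -- build a stable external ID from the tag
      tag as name,
      description,
      to_metadata(supplier, model) as metadata  -- pack remaining columns into metadata
    from mydb.mytable
    ```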

    You can configure the source limit to change the maximum number of rows read from the data sources when previewing query results. Even if you set this to All, the results view displays at most 10,000 rows.

  5. Click Preview to preview the query results.

  6. Use the query results to verify that the transformation produces the expected output, and adjust the SQL query if necessary.

    Above the preview table, you can see the columns that make up the schema for the chosen destination resource type. To see the type of a column, hover over it.

    NOTE: Columns that are nullable may or may not be required by the destination schema. If you're in doubt, check the API reference documentation for the relevant resource type.

    You will also see any source columns that don't exist in the destination schema, as well as columns that have the wrong type for the destination schema.

    You can also preview RAW tables directly in the recipe editor by clicking the RAW button. Tables open in a new tab, so you don't lose your query results.

  7. When you have verified that the transformation works the way you want, switch to the Schedule and Run tab to complete the configuration of the transformation.

  8. Specify either the OIDC credentials or the API keys that the transformation should use to authenticate when reading and writing data from CDF. Separate OIDC credentials or API keys for reading and writing allow you to transform data between different projects.

**Ignore Nulls**

9a. When performing updates to CDF resources using SQL transformations, you must decide how the transformation should interpret null values:

  • If fields in a CDF resource need to be updated to null, select NULL means clear.

  • If fields in a CDF resource should not be updated, select NULL means ignore.

NOTE: By default, null is ignored in SQL transformations.

**OIDC configuration**

9b. To authenticate using OIDC credentials, you need to specify the Client ID and Client secret. These values come from the CDF Transformations access configuration in Azure Active Directory (AAD). The remaining fields are prefilled, but you can edit them.

  • In the Client ID field, enter the object ID for the AAD group exactly as it exists in AAD.

  • In the Client secret field, enter the client secret for the AAD group exactly as it exists in AAD.

  • The Scopes field is the base URL of your CDF instance followed by /.default. To edit, use this format: <BASE_URL>/.default

  • The Token URL field is a valid token URL from your IdP. To edit, use this format: https://login.microsoftonline.com/<YOUR_AAD_TENANT_ID>/oauth2/v2.0/token

  • The CDF project name field is the CDF project you have signed in to. To edit, use this format: <YOUR_CDF_PROJECT_NAME>

If you don't know what values to enter in these fields, contact your internal help desk or the CDF admin for help.

**API configuration**

9c. To authenticate using API keys, enter the API key generated for the CDF project under Manage & Configure > Manage access > API keys.

NOTE

For CDF projects running in Microsoft Azure, only OIDC credentials are supported. Using API keys is deprecated for CDF projects running in Google Cloud Platform.

  10. If your destination resource type is RAW, specify the RAW database and table you want to write to.

    NOTE: You must create the RAW database and table before you can run the transformation. You can do this, for example, in the RAW Explorer.

  11. Click the Transform button to manually start a new transformation job, or follow the steps in Schedule transformations below to schedule your transformation to run at regular intervals.

## Schedule transformations

  1. On the Schedule and Run tab, you can specify a schedule that determines when and how often the transformation should run.

    Schedules are specified as cron expressions. For example, 45 23 * * * will run the transformation at 23:45 (11:45 PM) every day.

    Learn more about cron expressions.
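
    A few more example expressions (standard five-field cron: minute, hour, day of month, month, day of week):

    ```
    0 * * * *     # every hour, on the hour
    */15 * * * *  # every 15 minutes
    0 8 * * 1     # every Monday at 08:00
    ```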

  2. Click Add schedule to activate the schedule. When a transformation is scheduled, it becomes read-only to prevent unintentional changes to future scheduled jobs.

## Monitor scheduled transformations

Subscribe to email notifications to monitor the transformation process and solve any issues before they reach the data consumer. You can add up to 5 email addresses, either to groups or to specific people.

## Share transformations

On the Transformation page, you can share transformations with other users in your project to allow them to edit, run, schedule, and delete the transformation.

IMPORTANT

When you share a transformation, you implicitly grant the other users in your project the permissions defined for the service accounts configured for the transformation. Make sure you use service accounts with only the minimum permissions they need to complete transformations.

## Writing SQL queries

The information in this section helps you efficiently query data from RAW tables and CDF resource types, and explains how you can load data incrementally.

### Read data from CDF

#### From a RAW table

To select data from a RAW table, use the syntax `mydb.mytable`.

```sql
select * from mydb.mytable
```

If your database or table name contains special characters, enclose the name in backticks, for example `my-db`.`my table`.
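
For example, using the names from above:

```sql
select * from `my-db`.`my table`
```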

#### From other CDF resource types

To select other CDF resource types, use the syntax `_cdf.resource_type`.

```sql
select * from _cdf.events
```

The supported resource types are:

  • _cdf.events
  • _cdf.assets
  • _cdf.timeseries
  • _cdf.datapoints
  • _cdf.stringdatapoints
  • _cdf.labels
  • _cdf.relationships
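
You can also combine RAW data with resources read from CDF. Here's a minimal sketch that joins a RAW table to assets on their name; `mydb.mytable` and its `assetName` column are placeholder names:

```sql
select
  t.*,
  a.id as assetId  -- look up the internal asset ID by name
from mydb.mytable t
join _cdf.assets a on a.name = t.assetName
```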

### Load data incrementally

When reading from RAW tables, you probably want to transform only the data that has changed since the last transformation job ran.

To achieve this, you can filter on the lastUpdatedTime column to query for the rows that have changed after a specific timestamp. When filtering on lastUpdatedTime, the filter is pushed down to the RAW service itself, so the query is performed efficiently. For example: `select * from mydb.mytable where lastUpdatedTime > to_timestamp(123456)`.

Instead of encoding the timestamp directly in the query and manually keeping it up to date every time new data has been processed, you can use the is_new function. This function returns true when a row has changed since the last time the transformation was run, and false otherwise.

The first time you run a transformation using the query below, all the rows of mytable will be processed:

```sql
select * from mydb.mytable where is_new("mydb_mytable", lastUpdatedTime)
```

If the transformation completes successfully, the second run will only process rows that have changed since the first run.

If the transformation fails, is_new filters the same rows the next time the transformation is run. This ensures that there is no data loss in the transformation from source to destination.

NOTE

Incremental load is disabled when previewing query results. That is, is_new will always return true for all rows.

Each is_new filter is identified by a name (for example, "mydb_mytable") and can be set to any constant string. This allows you to differentiate between multiple calls to is_new in the same query and to use is_new to filter on multiple tables, as in the sketch below. To easily identify the different filters, we recommend using the name of the table as the name of the is_new filter.
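
Here's a minimal sketch that filters on two RAW tables in one query; `mydb.pumps`, `mydb.valves`, and the `pumpKey` and `position` columns are placeholder names (key and lastUpdatedTime are built-in RAW columns):

```sql
select p.*, v.position
from mydb.pumps p
join mydb.valves v on v.pumpKey = p.key
-- process a joined row if either side has changed since the last run
where is_new("mydb_pumps", p.lastUpdatedTime)
   or is_new("mydb_valves", v.lastUpdatedTime)
```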

#### Backfill

To process all the data even if it hasn't changed since the last transformation, change the name of the is_new filter, for example by adding a suffix with an incrementing number (e.g. "mydb_mytable_1").

This is especially useful when the logic of the query changes, and data that has already been imported needs to be updated accordingly.
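
For example, renaming the filter from the earlier query causes all rows of mytable to be processed again on the next run:

```sql
select * from mydb.mytable where is_new("mydb_mytable_1", lastUpdatedTime)
```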

### Custom SQL functions

In addition to the built-in Spark SQL functions, we also provide a set of custom SQL functions to help you write efficient transformations.

NOTE

When a function expects var_args, it accepts a variable number of arguments of any type, including `*`.

#### get_names

`get_names(var_args): Array[String]`

Returns an array of the field names of a struct or row.

Example

```sql
select get_names(*) from mydb.mytable
-- Returns the column names of 'mydb.mytable'
```

```sql
select get_names(some_struct.*) from mydb.mytable
-- Returns the field names of 'some_struct'
```

#### cast_to_strings

`cast_to_strings(var_args): Array[String]`

Casts the arguments to an array of strings. Array, struct, and map types are handled by casting them to JSON strings.

Example

```sql
select cast_to_strings(*) from mydb.mytable
-- Returns the values of all columns in 'mydb.mytable' as strings
```

#### to_metadata

`to_metadata(var_args): Map[String, String]`

Creates a metadata-compatible type from the arguments. In practice, it does `map_from_arrays(get_names(var_args), cast_to_strings(var_args))`. Use this function when you want to transform your columns or structures into a format that fits the metadata field in CDF.

Example

```sql
select to_metadata(*) from mydb.mytable
-- Creates a metadata structure from all the columns found in 'mydb.mytable'
```
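
You can also pass specific columns instead of `*`. Here's a minimal sketch where `supplier` and `model` are placeholder column names:

```sql
select name, to_metadata(supplier, model) as metadata from mydb.mytable
-- Creates a metadata map with the keys 'supplier' and 'model'
```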

#### to_metadata_except

`to_metadata_except(excludeFilter: Array[String], var_args)`

Returns a metadata structure (`Map[String, String]`) where keys matching the strings in excludeFilter are excluded from var_args.

Use this function when you want to put most, but not all, columns into metadata, for example `to_metadata_except(array("someColumnToExclude"), *)`.

Example

```sql
select to_metadata_except(array("myCol"), myCol, testCol) from mydb.mytable
-- Creates a map where myCol is filtered out.
-- The result in this case will be Map("testCol" -> testCol.value.toString)
```

#### asset_ids

`asset_ids(assetNames: Array[String], rootAssetName: String): Array[BigInt]`

Attempts to find the given asset names in the asset hierarchy that has rootAssetName as its root asset. The function returns the IDs of the matched assets.

See Assets for more information about assets in CDF.

Warning

The entire job is aborted if asset_ids() does not find any matching assets.

Example

```sql
select asset_ids(array("PV10"), "MyBoat")
```
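
Here's a minimal sketch of using the function in a transformation; the `tag` and `description` columns are placeholder names:

```sql
select
  description,
  asset_ids(array(tag), "MyBoat") as assetIds  -- link each row to assets by name
from mydb.mytable
```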

#### is_new

`is_new(name: String, version: Timestamp)`

Returns true if the version provided is higher than the version found with the specified name, based on the last time the transformation was run. See Load data incrementally.

If you use more than one occurrence of is_new() in one transformation, we recommend using different filter names. This guarantees that subqueries within one transformation don't overwrite the lastUpdatedTime record before the transformation completes.

Example

```sql
select * from mydb.mytable where is_new("mydb_mytable_version", lastUpdatedTime)
-- Returns only rows that have changed since the last successful run
```

#### dataset_id

`dataset_id(externalId: String): BigInt`

Attempts to find the ID of the data set with the given externalId, and returns the ID if it exists.

Example

```sql
select dataset_id("EXAMPLE_DATASET") as dataSetId
```
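
Here's a minimal sketch of using the function to assign rows to a data set; `mydb.mytable` and its `name` column are placeholder names:

```sql
select name, dataset_id("EXAMPLE_DATASET") as dataSetId from mydb.mytable
-- Writes each row with the internal ID of the 'EXAMPLE_DATASET' data set
```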