> ## Documentation Index
> Fetch the complete documentation index at: https://docs.cognite.com/llms.txt
> Use this file to discover all available pages before exploring further.

# SQL patterns and best practices

> Patterns for writing efficient transformation SQL with predictable performance.

Use these patterns to improve performance, reduce unnecessary reads, and keep transformation SQL reliable in production.

## Start with incremental filters

When reading data, your transformation jobs run much faster and more efficiently if they only process data that has changed since the last run. This reduces transformation time and allows for more frequent runs.

### Use is\_new() for automatic incremental filtering

Instead of manually managing timestamps in your queries, use the `is_new()` function to automatically filter for changed data. The function returns `true` when a row has changed since the last successful transformation run and `false` otherwise.

```sql theme={"languages":{"custom":["/_languages/kuiper.json","../_languages/kuiper.json"]}}
select *
from mydb.mytable
where is_new("mydb_mytable", lastUpdatedTime)
```

**How it works:**

* **First run**: All rows are processed.
* **Successful subsequent runs**: Only rows that changed since the last successful run are processed.
* **Failed runs**: The same rows are reprocessed to prevent data loss.

<Info>
  Incremental load is disabled when previewing query results. That is, `is_new` always returns `true` for all rows during preview.
</Info>

<Warning>
  Avoid auto-updated columns, such as write timestamps, in RAW sources when using `is_new()`. These fields can cause all rows to appear new on every run.
</Warning>

### Naming is\_new filters

Each `is_new` filter is identified by a unique name (for example, `"mydb_mytable"` in the example above). The name is stored with the transformation and must be unique within that transformation. Use descriptive names based on the table or source to easily identify different filters.

If you use the same name in different transformations, they're tracked separately and don't interfere with each other.

<Info>
  It's not common to use multiple `is_new` filters in the same query. Instead, use `is_new` on the main resource you're accessing, then join in other resources to enrich the new entries. If you do use multiple `is_new` filters, they apply to each source separately before any join operations, meaning both sources must be updated at the same time for joins to work correctly.
</Info>

### Resource types supporting incremental loading

Incremental data loading with `lastUpdatedTime` filtering is supported for:

* RAW tables (`mydb.mytable`)
* `_cdf.assets`
* `_cdf.events`
* `_cdf.files`
* `_cdf.timeseries`
* `_cdf.relationships`
* `_cdf.datasets`

### Incremental loading with data modeling

For data modeling sources, use the `is_new()` variant with `cdf_nodes()`, `cdf_edges()`, or `cdf_data_models()` to rely on sync cursors instead of timestamp filters. This approach is more efficient than filtering on timestamp or int64 columns.

```sql theme={"languages":{"custom":["/_languages/kuiper.json","../_languages/kuiper.json"]}}
select * from cdf_nodes() where is_new('my_nodes')
```

With multiple sources in the same query, specify which source the `is_new` references using an alias:

```sql theme={"languages":{"custom":["/_languages/kuiper.json","../_languages/kuiper.json"]}}
select * from cdf_nodes() a, cdf_edges() b where is_new('a_nodes', a)
```

The cursor updates automatically every time the transformation runs successfully. You can combine this with other filters and take advantage of indexes in data modeling for optimal performance.

<Warning>
  When using `is_new` with data modeling, the transformation must run at least once per three days. If it doesn't run for three days or more, the transformation falls back to reading all input data.
</Warning>

### Backfilling data

To reprocess all data even if it hasn't changed, change the name of the `is_new` filter (for example, add an incrementing suffix like `"mydb_mytable_2"`). This is useful when query logic changes and previously imported data needs updating.

## Avoid double reads from the same table

If multiple sub-queries read from the same source table, combine them into a single query instead of using `UNION ALL`. This avoids scanning the same data twice in Spark and reduces executor load.

```sql theme={"languages":{"custom":["/_languages/kuiper.json","../_languages/kuiper.json"]}}
with base as (
  select * from mydb.mytable
)
select * from base where status = 'active'
union all
select * from base where status = 'inactive'
```

Prefer a single pass:

```sql theme={"languages":{"custom":["/_languages/kuiper.json","../_languages/kuiper.json"]}}
select *
from mydb.mytable
where status in ('active', 'inactive')
```

## Prefer LEFT JOIN for small lookup tables

If you join a RAW table with a very small table or a single-row table and there is no `ON` clause, use `LEFT JOIN` or inline the constant values. A join without a condition becomes a cartesian join, which can lead to inefficient plans. Switching to `LEFT JOIN` can significantly improve the query plan and reduce runtime.

```sql theme={"languages":{"custom":["/_languages/kuiper.json","../_languages/kuiper.json"]}}
select a.*, b.lookup_value
from mydb.main a
left join mydb.lookup b
```

## Filter early at the source

Reduce data volume before Spark reads it. RAW is a key-value store, and filtering happens on the client side in transformations. If you only need a subset of columns from a very wide RAW table, consider:

* Splitting the source upstream into smaller, purpose-built tables.
* Creating a staging table that includes only the columns used by transformations.

This reduces memory pressure and avoids reading unused data into Spark.

## Use CTEs for clarity and reuse

Use CTEs (common table expressions) for temporary tables and shared sub-logic. CTEs make complex queries easier to read and maintain, and they can simplify legacy transformations where logic has grown over time. For data modeling sources, keep the logic focused on graph queries and avoid deep CTE chains when they add little value.

```sql theme={"languages":{"custom":["/_languages/kuiper.json","../_languages/kuiper.json"]}}
with base as (
  select * from mydb.mytable where is_new("mydb_mytable", lastUpdatedTime)
),
enriched as (
  select b.*, r.site
  from base b
  left join mydb.site_map r
)
select * from enriched
```

## Keep joins inside the data modeling framework

When you are writing to data modeling, keep complex joins inside the data modeling framework. Filters and indexes are optimized for graph queries, and you can combine joins with `is_new()` on the primary source to enrich new entries efficiently.

## Manage schema inference explicitly

CDF infers schema from the first 10,000 rows it reads. If those rows are sparse or inconsistent, the inferred schema can be wrong.

Workarounds:

* Insert one or more “schema rows” with all expected columns and sort them to the top of the input.
* Use `get_json_object` to extract fields from JSON payloads when schema is uneven or evolving.

```sql theme={"languages":{"custom":["/_languages/kuiper.json","../_languages/kuiper.json"]}}
select *
from mydb.mytable
order by _schema_row desc, row_key
```

```sql theme={"languages":{"custom":["/_languages/kuiper.json","../_languages/kuiper.json"]}}
select
  get_json_object(payload, "$.externalId") as external_id,
  get_json_object(payload, "$.name") as name
from mydb.json_table
```

## Use clear staging boundaries

Avoid writing updates back to RAW from transformations. RAW is optimized for staging, not repeated read-modify-write cycles. Use transformations to write to structured targets such as data models, sequences, or files.

## Validate with small samples, then scale

Use targeted filters and limited datasets while iterating on query logic. When results are correct, remove the limiting filters and enable incremental processing.

## Related guidance

* [SQL syntax and functions](/cdf/integration/guides/transformation/write_sql_queries)
* [Operations and performance](/cdf/integration/guides/transformation/operations_and_performance)
