Use these practices to keep transformations efficient and reliable in production.

Use incremental processing with is_new()

Use is_new() to process only changed data. This reduces scan volume and transformation runtime. Apply change detection as close to the source as possible. For patterns and syntax, see Start with incremental filters and the is_new function reference.
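As a minimal sketch (database, table, and column names are placeholders), an incremental filter on a RAW source might look like:

```sql
-- Read only rows that changed since the last successful run.
-- "my_transformation_state" is an arbitrary state key for this transformation;
-- lastUpdatedTime is the RAW row's built-in update timestamp.
select
  key,
  cast(column_a as string) as column_a
from
  `my_db`.`my_table`
where
  is_new("my_transformation_state", lastUpdatedTime)
```

The state key is tracked per transformation, so the next scheduled run automatically picks up where the previous successful run left off.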

SQL query performance

For SQL optimization patterns including avoiding double reads, using efficient joins, handling wide RAW tables, and managing schema inference, see SQL patterns and best practices.
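One of those patterns, avoiding double reads, can be sketched as follows (table and column names are illustrative): read the source once in a common table expression and reuse it, instead of scanning the same table in two places.

```sql
-- Anti-pattern: two separate scans of the same RAW table, one per filter.
-- Better: one scan, filtered once, reused via a CTE.
with scanned as (
  select
    key,
    cast(event_type as string) as event_type,
    cast(ts as timestamp) as ts
  from
    `db`.`events`
  where
    event_type in ('start', 'stop')
)
select * from scanned
```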

Observability and load management

Use workflow orchestration to distribute load and avoid peak concurrency. If you see repeated 503 errors, rebalance schedules or dependencies instead of adding retries.
Transformations have a concurrency limit of 10 parallel jobs per project. Plan schedules and workflow dependencies with this limit in mind.

Avoid RAW anti-patterns

RAW is optimized for staging, not repeated read-modify-write cycles. Avoid these patterns:
  • Writing updates back to RAW from transformations.
  • Creating large multipurpose RAW tables used by many transformations.
  • Designing RAW tables wider than what downstream transformations need.
Structure RAW tables to match consumption patterns. If multiple transformations use different subsets of a large table, split the data upstream so each transformation reads only what it needs.
Avoid full delete-and-recreate cycles for RAW tables. This resets is_new() state and forces full reprocessing. Prefer row-level deletes and updates.
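For example (the schema is illustrative), a transformation that needs only two fields from a wide staging table should project just those fields rather than selecting everything:

```sql
-- Read only the columns this transformation actually consumes;
-- avoid select * against a wide RAW table.
select
  key as externalId,
  cast(sensor_id as string) as sensorId,
  cast(reading as double) as reading
from
  `staging_db`.`wide_sensor_table`
```

If several transformations each need a different narrow slice like this, that is the signal to split the table upstream.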

Keep consumers off RAW

Data consumers should read from curated targets, not RAW. RAW has no schema guarantees, so direct consumption risks breaking downstream clients. Use transformations to write to data models or resource types that match consumer needs.
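As an illustrative sketch (names are placeholders), a transformation can map RAW rows into a curated resource type such as assets, so consumers never read RAW directly:

```sql
-- Map staging rows to the asset schema; consumers read assets, not RAW.
select
  cast(tag as string) as externalId,
  cast(description as string) as name,
  cast(site as string) as source
from
  `staging_db`.`equipment`
where
  tag is not null
```

The destination resource type is chosen in the transformation configuration; the query only needs to produce columns matching that target's schema.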

Use Files and Functions for heavy data

RAW has limits on row and column size and is not suited for very large payloads or highly sparse tables. Instead:
  • Store large payloads as Files and reference them from metadata.
  • Process files with CDF Functions for scalable, parallel processing.
  • Orchestrate Functions and Transformations together using Data workflows.
For extremely large datasets, consider file-based staging instead of RAW. Process the files with CDF Functions before writing structured outputs. This scales better than wide RAW tables and simplifies schema handling.

Manage environments to avoid redundant loads

Avoid running the same heavy transformation logic across DEV, TEST, and PROD. Instead, use data replication to keep non-production environments aligned without reprocessing large datasets.
Last modified on March 18, 2026