Operations and monitoring
Error 429 or 503: Service overload
Symptoms
- Runs fail with HTTP 429 or 503 errors.
- Multiple transformations fail in the same time window.
- Peak concurrency or heavy workloads overload the service.
- Spread transformation start times to reduce concurrency peaks.
- Orchestrate with Data workflows and add dependencies so jobs run only after prerequisites succeed.
- Reduce per‑run data volume with incremental filters such as is_new().
- Use workflow orchestration to balance load and avoid synchronized schedules.
- Rebalance schedules instead of relying on retries when overload persists.
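As an illustration of spreading start times, the sketch below assigns each transformation its own minute of the hour. It assumes schedules are written as five-field cron strings and that every job runs hourly; the function name `stagger_hourly` and the transformation names are hypothetical, not part of any SDK.

```python
def stagger_hourly(names):
    """Give each transformation an evenly spaced minute-of-hour start.

    Returns a mapping of name -> five-field cron expression.
    """
    if not names:
        return {}
    step = 60 / len(names)  # minutes between consecutive start times
    return {
        name: f"{int(i * step)} * * * *"
        for i, name in enumerate(names)
    }


# Hypothetical transformation names, for illustration only.
schedules = stagger_hourly(["assets", "events", "timeseries", "files"])
for name, cron in schedules.items():
    print(name, cron)
```

With four jobs this yields starts at minutes 0, 15, 30, and 45 instead of four jobs firing together at minute 0. With more than 60 jobs some minutes repeat, so a longer interval would be needed.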
Error 408: Timeout on long-running queries
Symptoms
- Runs fail with HTTP 408.
- Data model queries take a long time to complete.
- Slow queries or large joins increase runtime.
- Apply incremental filters early to reduce scan volume.
- Review query patterns in SQL patterns and best practices.
- For data modeling sources, use the is_new() variant on cdf_nodes() or cdf_edges().
- Simplify complex joins in data modeling and test with smaller scopes first.
- Keep transformations scoped to one resource type and avoid wide RAW scans.
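To show why an incremental filter shortens queries, here is a conceptual model in plain Python. It is not how is_new() is implemented. It only assumes that each row carries a `lastUpdatedTime` value and that a watermark from the previous successful run is available. The function name `incremental_batch` is hypothetical.

```python
def incremental_batch(rows, watermark):
    """Return rows changed after `watermark`, plus the next watermark."""
    fresh = [r for r in rows if r["lastUpdatedTime"] > watermark]
    # If nothing changed, keep the old watermark.
    next_watermark = max(
        (r["lastUpdatedTime"] for r in fresh), default=watermark
    )
    return fresh, next_watermark


rows = [
    {"key": "a", "lastUpdatedTime": 10},
    {"key": "b", "lastUpdatedTime": 20},
    {"key": "c", "lastUpdatedTime": 30},
]

# First run after watermark 15 touches only two of three rows.
batch, wm = incremental_batch(rows, watermark=15)
# A second run with no new writes has nothing to process.
empty, wm2 = incremental_batch(rows, watermark=wm)
```

Each run reads only what changed since the last one, so the work per run depends on recent writes, not on the total table size.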
Driver restarts or repeated failures on long jobs
Symptoms
- Long runs fail and restart.
- The same job repeatedly fails after long execution time.
- Driver restarts can occur during rollouts or for transformation hygiene. The service starts a new driver to handle new requests and gives the old driver time to finish. Long‑running jobs may fail during this transition.
- Split large transformations into smaller, focused jobs.
- Orchestrate retries with Data workflows for automatic recovery.
- Use incremental processing to shorten run times.
- Favor incremental processing and smaller transformation scope.
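The retry idea can be sketched as a bounded loop with exponential backoff and jitter. This is a generic pattern, not the Data workflows retry mechanism. The `job` callable, the use of `RuntimeError` to signal failure, and the default limits are assumptions made for the example.

```python
import random
import time


def run_with_retries(job, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call job() until it succeeds or attempts run out.

    Waits base_delay * 2**(attempt - 1) seconds plus random jitter
    between attempts, and re-raises the last error when giving up.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except RuntimeError:
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            # Jitter keeps many retrying jobs from waking up together.
            sleep(delay + random.uniform(0, delay / 2))
```

Keeping `max_attempts` small matters here: a job that is too long to survive a driver restart will keep failing, and the more durable fix is a smaller or incremental job.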
Tooling limitations
Preview results differ from full runs
Symptoms
- Preview succeeds but full run fails.
- Preview returns unexpected results with multi‑source joins.
- Preview times out on long‑running transformations.
- Preview uses sampled data from the first rows and may not exercise full joins or filters. Preview is not a full execution and does not evaluate all input data.
- Validate logic with small but representative datasets.
- Run a full execution on a limited scope (for example, a single source or time window).
- Treat preview as a sanity check, not as a performance or correctness benchmark.
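A toy example can show how a join over only the first rows differs from a join over all rows. The sample size of 100 and the `assets` and `events` data are invented for illustration. The real preview sample size is not stated here.

```python
def inner_join(left, right, key):
    """Inner-join two lists of dicts on `key`."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [
        {**l_row, **r_row}
        for l_row in left
        for r_row in index.get(l_row[key], [])
    ]


assets = [{"id": i, "name": f"asset-{i}"} for i in range(1000)]
# Events only reference assets near the end of the asset table.
events = [{"id": i, "event": f"event-{i}"} for i in range(900, 1000)]

sample_size = 100  # hypothetical sample size
preview_like = inner_join(assets[:sample_size], events[:sample_size], "id")
full = inner_join(assets, events, "id")

print(len(preview_like), len(full))
```

The sampled join finds no matches while the full join finds 100. This is why a limited-scope full run is a better check than a preview.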
Columns not found in RAW tables
Symptoms
- Runs fail with column‑not‑found errors.
- Queries worked previously but fail after new RAW writes.
- RAW is schema‑less. Transformations infer schema from the first 10,000 rows. If a column is not present in that sample, it is not available to SQL.
- Use get_json_object for fields in semi‑structured payloads.
- Insert a schema row with all expected columns and sort it to the top.
- Keep RAW tables stable and avoid frequent schema drift in the first rows.
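The effect of sample-based schema inference can be modelled in a few lines. This is a simplified stand-in for the service's inference, assuming only what is stated above: columns are collected from the first 10,000 rows. The function `infer_columns` and the column names are hypothetical.

```python
def infer_columns(rows, sample_size=10_000):
    """Collect column names seen in the first `sample_size` rows."""
    columns = set()
    for row in rows[:sample_size]:
        columns.update(row)  # adds the row's keys
    return columns


# 10,000 rows without "pressure", then one late row that has it.
rows = [{"key": str(i), "temperature": 20.0} for i in range(10_000)]
rows.append({"key": "late", "temperature": 21.0, "pressure": 1.2})

missing = "pressure" not in infer_columns(rows)

# A schema row listing every expected column, placed first.
schema_row = {"key": "", "temperature": None, "pressure": None}
found = "pressure" in infer_columns([schema_row] + rows)
```

In this model the late column is invisible until a row that lists every expected column sits inside the sampled range.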
Logging and diagnostics
Limited error detail in run history
Symptoms
- Only request IDs are visible in run history.
- Error messages do not show root cause details.
- Run history surfaces high‑level errors without expanded context.
- Capture the full error response payload from the API or SDK.
- Correlate request IDs with backend logs when available.
- Use the expand control in run history to view detailed error messages when available.
- Use structured logging and store request IDs alongside transformation metadata.
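One way to store request IDs alongside transformation metadata is to emit one JSON object per failure, using only the Python standard library. The field names, the logger name, and the `log_failure` helper are assumptions for this sketch. Adapt them to whatever the API or SDK error response actually contains.

```python
import json
import logging

logger = logging.getLogger("transformations")


def log_failure(transformation_id, request_id, status, message):
    """Log a failed run as a single JSON line and return that line."""
    line = json.dumps(
        {
            "transformation_id": transformation_id,
            "request_id": request_id,
            "status": status,
            "message": message,
        },
        sort_keys=True,
    )
    logger.error(line)
    return line


# Hypothetical identifiers, for illustration only.
line = log_failure("tr-123", "req-abc", 503, "service overloaded")
```

Because each line is self-contained JSON, a request ID can later be searched for directly and matched against backend logs or a support ticket.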
Internal error or unknown failure
Symptoms
- Run fails with a generic internal error.
- Transient service issues or unhandled edge cases.
- Retry after reducing concurrency or data volume.
- If the error persists, contact Cognite Support with the transformation ID and request ID.