Configure the DB extractor

To configure the DB extractor, you must create a configuration file. The file must be in YAML format.

The configuration file allows substitutions with environment variables. For example, in a connection string:

connection-string: DRIVER={Oracle 19.3};DBQ=localhost:1521/XE;UID=SYSTEM;PWD=${DATABASE_PASSWORD}
Note

Implicit substitutions only work for unquoted value strings. For quoted strings, use the !env tag to activate environment substitution:

connection-string: !env "DRIVER={Oracle 19.3};DBQ=localhost:1521/XE;UID=SYSTEM;PWD=${DATABASE_PASSWORD}"

The configuration file also contains the global parameter version, which holds the version of the configuration schema used in the configuration file. This document describes version 2 of the configuration schema.
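
For orientation, the overall layout of a configuration file might look like the following sketch. All values are placeholders, and each section is described in detail below:

version: 2

logger:
  console:
    level: INFO

cognite:
  project: my-project            # placeholder project name
  host: https://api.cognitedata.com

databases:
  - name: my-database            # placeholder, referenced from the queries section
    connection-string: DRIVER={Oracle 19.3};DBQ=localhost:1521/XE;UID=SYSTEM;PWD=${DATABASE_PASSWORD}

queries:
  - name: my-query               # placeholder query name
    database: my-database
    primary-key: "row_{index}"   # assumes the table has an index column
    query: SELECT * FROM table
    destination:
      database: my-raw-database  # placeholder RAW names
      table: my-raw-table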

Logger

The optional logger section sets up logging to a console and files.

Parameter | Description
console | Sets up the console logger configuration. See the Console section.
file | Sets up the file logger configuration. See the File section.

Console

Include the console section to enable logging to standard output, such as a terminal window.

Parameter | Description
level | Select the verbosity level for console logging. Valid options, in decreasing level of verbosity, are DEBUG, INFO, WARNING, ERROR, and CRITICAL.

File

Include the file section to enable logging to a file. The files are rotated daily.

Parameter | Description
level | Select the verbosity level for file logging. Valid options, in decreasing level of verbosity, are DEBUG, INFO, WARNING, ERROR, and CRITICAL.
path | Insert the path to the log file.
retention | Specify the number of days to keep logs for. The default value is 7.
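
As an illustration, a logger section that logs at INFO level to the console and at DEBUG level to a daily-rotated file could look like this sketch; the file path is a placeholder:

logger:
  console:
    level: INFO
  file:
    level: DEBUG
    path: /var/log/db-extractor.log   # placeholder path to the log file
    retention: 14                     # keep logs for 14 days instead of the default 7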

Cognite

The cognite section describes which CDF project the extractor will load data into and how to connect to the project.

Parameter | Description
project | Insert the CDF project name. This is a required value.
host | Insert the base URL for the CDF project. The default value is https://api.cognitedata.com.
api-key | Enter the API key for the CDF project. You must enter either an API key or use IdP authentication.
idp-authentication | Insert the credentials for authenticating to CDF using an external identity provider. You must enter either an API key or use IdP authentication.
data-set | Insert an optional data set ID that will be used if you've set the extractor to create missing time series. Contains either id or external-id.

Identity provider (IdP) authentication

The idp-authentication section enables the extractor to authenticate to CDF using an external identity provider, such as Azure AD.

Parameter | Description
client-id | Enter the client ID from the IdP. This is a required value.
secret | Enter the client secret from the IdP. This is a required value.
scopes | List the scopes. This is a required value.
resource | Insert the resource to include in token requests. This is an optional field.
token-url | Insert the URL to fetch tokens from. You must enter either a token URL or an Azure tenant.
tenant | Enter the Azure tenant. You must enter either a token URL or an Azure tenant.
min-ttl | Insert the minimum time in seconds a token will be valid. If the cached token expires in less than min-ttl seconds, it will be refreshed. The default value is 30.
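
For illustration, a cognite section using IdP authentication against Azure AD might look like the following sketch. The project, data set ID, client ID, tenant, and scope are placeholders, and the secret is read from an environment variable:

cognite:
  project: my-project                          # placeholder project name
  host: https://api.cognitedata.com
  data-set:
    id: 1234567                                # placeholder data set ID
  idp-authentication:
    client-id: my-client-id                    # placeholder client ID from the IdP
    secret: ${COGNITE_CLIENT_SECRET}           # read from an environment variable
    tenant: my-azure-tenant                    # placeholder; alternatively, set token-url
    scopes:
      - https://api.cognitedata.com/.default   # placeholder scope
    min-ttl: 30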

Extractor

The optional extractor section contains tuning parameters.

Parameter | Description
upload-queue-size | Enter the size of the upload queue. The default value is 100 000 rows.
parallelism | Insert the number of parallel queries to run. The default value is 4.
state-store | The state store configuration. See the State store section. The default is no state store, in which case incremental load is deactivated.

State store

A state store is used to save extraction states between runs. This is used if data is loaded incrementally. We support multiple state stores, but you can only configure one at a time.

Parameter | Description
local | Local state store configuration. See the Local section.
raw | RAW state store configuration. See the RAW section.

Local

A local state store stores the extraction state in a JSON file on the local machine.

Parameter | Description
path | Insert the file path to a JSON file.
save-interval | Enter the interval in seconds between each save. The default value is 30.

RAW

A RAW state store stores the extraction state in a table in Cognite RAW.

Parameter | Description
database | Enter the database name in Cognite RAW.
table | Enter the table name in Cognite RAW.
upload-interval | Enter the interval in seconds between each save. The default value is 30.
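
To illustrate, an extractor section with a local state store, which enables incremental load, could look like this sketch; the file path is a placeholder, and a raw state store could be used instead:

extractor:
  upload-queue-size: 100000
  parallelism: 4
  state-store:
    local:
      path: states.json      # placeholder path to the JSON state file
      save-interval: 30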

Metrics

The metrics section describes where to send metrics on extractor performance for remote monitoring of the extractor. We recommend sending metrics to a Prometheus pushgateway, but you can also send metrics as time series in the CDF project.

Parameter | Description
push-gateways | List of pushgateway configurations. See the Pushgateways section.
cognite | Cognite metrics configuration. See the Cognite section.

Pushgateways

The push-gateways section contains a list of metric destinations, each following this schema:

Parameter | Description
host | Enter the address of the host to push metrics to. This is a required value.
job-name | Enter the value of the exported_job label to associate metrics with. This separates several deployments on a single pushgateway and should be unique. This is a required value.
username | Enter the username for the pushgateway. This is a required value.
password | Enter the password for the pushgateway. This is a required value.
clear-after | Enter the number of seconds to wait before clearing the pushgateway. When this parameter is present, the extractor will stall after the run is complete before deleting all metrics from the pushgateway. The recommended value is at least twice the scrape interval on the pushgateway, to ensure that the last metrics are gathered before the deletion.
push-interval | Enter the interval in seconds between each push. The default value is 30.

Cognite

Metrics can be sent as time series to the CDF project configured in cognite. Only numeric metrics, such as Prometheus counters and gauges, are sent.

Parameter | Description
external-id-prefix | Insert a prefix for all time series used to represent metrics for this deployment. This creates a scope for the set of time series produced by this metrics exporter and should be unique per deployment across the entire project. This is a required value.
asset-name | Enter the name of the asset to attach the time series to. The asset will be created if it doesn't already exist.
asset-external-id | Enter the external ID for the asset to create if the asset doesn't already exist.
push-interval | Enter the interval in seconds between each push. The default value is 30.
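
As a sketch, a metrics section pushing to a Prometheus pushgateway and also writing metrics as CDF time series might look like this. The host, job name, credentials, prefix, and asset names are placeholders:

metrics:
  push-gateways:
    - host: https://pushgateway.example.com    # placeholder pushgateway address
      job-name: db-extractor-prod              # placeholder, should be unique per deployment
      username: ${PUSHGATEWAY_USER}            # credentials read from environment variables
      password: ${PUSHGATEWAY_PASSWORD}
      clear-after: 120                         # at least twice the pushgateway scrape interval
      push-interval: 30
  cognite:
    external-id-prefix: db-extractor-prod      # placeholder, should be unique across the project
    asset-name: DB extractor metrics           # placeholder asset name
    asset-external-id: db-extractor-metrics    # placeholder asset external ID
    push-interval: 30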

Databases

The databases section contains a list of database configurations, each following this schema:

Parameter | Description
name | Enter a user-given name for this database. This name is used throughout the queries section and for logging. The name must be unique for each database in the configuration file. This is a required value.
connection-string | Enter the ODBC connection string. See the Connect to a database section for details. This is a required value.
database-type | Enter the database type to connect to. Options are GENERIC, MYSQL, and ORACLE. The default value is GENERIC.
timeout | Enter the timeout in seconds for the ODBC connection and queries. The default value is 60 seconds. Set this value to 0 to disable timeouts.

Some ODBC drivers don't accept the SQL_ATTR_CONNECTION_TIMEOUT or SQL_ATTR_QUERY_TIMEOUT options. The extractor will log an exception with the message Could not set timeout on the ODBC driver - timeouts might not work properly. Extraction will continue regardless, but without timeouts. To avoid this log line, disable timeouts for the database causing these problems.
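
For example, a databases section with a single Oracle database, reusing the connection string from the beginning of this article, might look like this:

databases:
  - name: oracle-xe           # placeholder name, referenced from the queries section
    connection-string: DRIVER={Oracle 19.3};DBQ=localhost:1521/XE;UID=SYSTEM;PWD=${DATABASE_PASSWORD}
    database-type: ORACLE
    timeout: 60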

Queries

The queries section contains a list of queries to run. Each query is configured according to this schema:

Parameter | Description
name | Enter the name of this query. This is used for logging and tagging metrics. The name must be unique for each query in the configuration file. This is a required value.
database | Enter the name of the database to connect to. This must be one of the database names configured in the Databases section. This is a required value.
primary-key | Insert the format of the row key in Cognite RAW. This parameter supports case-sensitive substitutions with values from the table columns. For example, if there's a column called index, setting primary-key: "row_{index}" will result in rows with keys row_0, row_1, etc. This is a required value.
query | Enter a SQL query to run against the database. The query can contain joins as long as the database supports them. If data is loaded incrementally, the query should contain a WHERE clause using the substitutions {incremental_field} and {start_at}, and the results must be sorted on the incremental field. Example: SELECT * FROM table WHERE {incremental_field} > {start_at} ORDER BY {incremental_field} ASC. This is a required value.
destination | Insert the Cognite RAW database and table to upload to. See the RAW destination section. This is a required value.
incremental-field | Insert the table column that holds the incremental field.
initial-start | Enter the {start_at} value for the first run. Subsequent runs will use the value stored in the state store.
schedule | Enter a CRON expression used to schedule runs when the extractor runs as a service. When you set up the schedule, allow additional time for the previous execution to finish.
destination-type | Enter either RAW or TIME_SERIES. For TIME_SERIES, the extractor expects the returned rows to contain exactly three columns: externalId, timestamp, and value.

RAW destination

The destination section enables writing data to Cognite RAW, the CDF staging area.

Parameter | Description
database | Enter the Cognite RAW database to upload data into. This will be created if it doesn't exist. This is a required value.
table | Enter the name of the Cognite RAW table to upload data into. This will be created if it doesn't exist. This is a required value.
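
Putting the last two sections together, a query that incrementally loads rows from the database configured above into Cognite RAW could look like the following sketch. The query name, source table, columns, schedule, and RAW names are placeholders:

queries:
  - name: load-measurements          # placeholder query name
    database: oracle-xe              # must match a name from the databases section
    primary-key: "row_{id}"          # assumes the source table has an id column
    query: >
      SELECT * FROM measurements
      WHERE {incremental_field} > {start_at}
      ORDER BY {incremental_field} ASC
    incremental-field: id            # column used for incremental load
    initial-start: 0                 # {start_at} value on the first run
    schedule: "*/10 * * * *"         # run every 10 minutes when running as a service
    destination:
      database: measurements-staging # created if it doesn't already exist
      table: measurements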