Configuration settings

To configure the DB extractor, you must create a configuration file. The file must be in YAML format.

Tip

You can set up extraction pipelines to use versioned extractor configuration files stored in the cloud.

Using values from environment variables

The configuration file allows substitutions with environment variables. For example:

cognite:
  secret: ${COGNITE_CLIENT_SECRET}

will load the value from the COGNITE_CLIENT_SECRET environment variable into the cognite/secret parameter. You can also do string interpolation with environment variables, for example:

url: http://my-host.com/api/endpoint?secret=${MY_SECRET_TOKEN}
Note

Implicit substitutions only work for unquoted value strings. For quoted strings, use the !env tag to activate environment substitution:

url: !env 'http://my-host.com/api/endpoint?secret=${MY_SECRET_TOKEN}'

Using values from Azure Key Vault

The DB extractor also supports loading values from Azure Key Vault. To load a configuration value from Azure Key Vault, use the !keyvault tag followed by the name of the secret you want to load. For example, to load the value of the my-secret-name secret in Key Vault into a password parameter, configure your extractor like this:

password: !keyvault my-secret-name

To use Key Vault, you also need to include the azure-keyvault section in your configuration, with the following parameters:

| Parameter | Description |
| --- | --- |
| keyvault-name | Name of Key Vault to load secrets from |
| authentication-method | How to authenticate to Azure. Either default or client-secret. For default, the extractor will look at the user running the extractor, and look for pre-configured Azure logins from tools like the Azure CLI. For client-secret, the extractor will authenticate with a configured client ID/secret pair. |
| client-id | Required for using the client-secret authentication method. The client ID to use when authenticating to Azure. |
| secret | Required for using the client-secret authentication method. The client secret to use when authenticating to Azure. |
| tenant-id | Required for using the client-secret authentication method. The tenant ID of the Key Vault in Azure. |

Example:

azure-keyvault:
  keyvault-name: my-keyvault-name
  authentication-method: client-secret
  tenant-id: 6f3f324e-5bfc-4f12-9abe-22ac56e2e648
  client-id: 6b4cc73e-ee58-4b61-ba43-83c4ba639be6
  secret: 1234abcd
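
For comparison, here is a minimal sketch of the default authentication method combined with a !keyvault reference. The vault name, secret name, and database entry are placeholders chosen for illustration, and the sketch assumes the machine running the extractor already has a usable Azure login (for example from the Azure CLI).

```yaml
# Hypothetical values throughout; replace with your own.
azure-keyvault:
  keyvault-name: my-keyvault-name
  authentication-method: default   # no client-id, secret, or tenant-id needed

databases:
  - type: postgres
    name: my-database
    host: pg.example.com
    user: extractor-user
    # Loaded from the Key Vault secret named my-secret-name
    password: !keyvault my-secret-name
```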

Base configuration object

| Parameter | Type | Description |
| --- | --- | --- |
| version | either string or integer | Configuration file version |
| type | either local or remote | Configuration file type. Either local, meaning the full config is loaded from this file, or remote, which means that only the cognite section is loaded from this file, and the rest is loaded from extraction pipelines. Default value is local. |
| cognite | object | The cognite section describes which CDF project the extractor will load data into and how to connect to the project. |
| logger | object | The optional logger section sets up logging to a console and files. |
| metrics | object | The metrics section describes where to send metrics on extractor performance for remote monitoring of the extractor. We recommend sending metrics to a Prometheus pushgateway, but you can also send metrics as time series in the CDF project. |
| queries | list | List of queries to execute |
| databases | list | List of databases to connect to |
| extractor | object | General extractor configuration |
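
To show how these top-level sections fit together, the following is a minimal sketch of a local configuration file. Every name and value is a placeholder, and the authentication settings under cognite are left out for brevity, so treat it as a skeleton rather than a complete working configuration.

```yaml
# Illustrative skeleton only; all names and values are hypothetical.
version: 1
type: local

cognite:
  project: my-project
  # idp-authentication settings omitted here for brevity

logger:
  console:
    level: INFO

databases:
  - type: postgres
    name: my-database
    host: pg.example.com
    user: extractor-user
    password: ${PG_PASSWORD}   # unquoted, so environment substitution applies

queries:
  - name: my-query
    database: my-database      # must match a name in the databases section
    query: SELECT * FROM my_table
    primary-key: "row_{id}"    # assumes the table has an id column
    destination:
      type: raw
      database: my-raw-database
      table: my-raw-table
```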

cognite

Global parameter.

The cognite section describes which CDF project the extractor will load data into and how to connect to the project.

| Parameter | Type | Description |
| --- | --- | --- |
| project | string | Insert the CDF project name. |
| idp-authentication | object | The idp-authentication section enables the extractor to authenticate to CDF using an external identity provider (IdP), such as Microsoft Entra ID (formerly Azure Active Directory). |
| data-set | object | Enter a data set the extractor should write data into |
| extraction-pipeline | object | Enter the extraction pipeline used for remote config and reporting statuses |
| host | string | Insert the base URL of the CDF project. Default value is https://api.cognitedata.com. |
| timeout | integer | Enter the timeout on requests to CDF, in seconds. Default value is 30. |
| external-id-prefix | string | Prefix on external ID used when creating CDF resources |
| connection | object | Configure network connection details |

idp-authentication

Part of cognite configuration.

The idp-authentication section enables the extractor to authenticate to CDF using an external identity provider (IdP), such as Microsoft Entra ID (formerly Azure Active Directory).

| Parameter | Type | Description |
| --- | --- | --- |
| authority | string | Insert the authority together with tenant to authenticate against Azure tenants. Default value is https://login.microsoftonline.com/. |
| client-id | string | Required. Enter the service principal client id from the IdP. |
| tenant | string | Enter the Azure tenant. |
| token-url | string | Insert the URL to fetch tokens from. |
| secret | string | Enter the service principal client secret from the IdP. |
| resource | string | Resource parameter passed along with token requests. |
| audience | string | Audience parameter passed along with token requests. |
| scopes | list | Enter a list of scopes requested for the token |
| min-ttl | integer | Insert the minimum time in seconds a token will be valid. If the cached token expires in less than min-ttl seconds, it will be refreshed even if it is still valid. Default value is 30. |
| certificate | object | Authenticate with a client certificate |
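
As an illustration, a cognite section that authenticates with a client ID and secret against an Azure tenant might look like the sketch below. The project name and environment variable names are placeholders, and the scope shown is an assumption that matches the default host value.

```yaml
# Hypothetical example; variable names and project are placeholders.
cognite:
  project: my-project
  host: https://api.cognitedata.com
  idp-authentication:
    tenant: ${AZURE_TENANT_ID}
    client-id: ${COGNITE_CLIENT_ID}
    secret: ${COGNITE_CLIENT_SECRET}
    scopes:
      # Assumed scope for the default host; adjust to your cluster
      - https://api.cognitedata.com/.default
```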

scopes

Part of idp-authentication configuration.

Enter a list of scopes requested for the token

Each element of this list should be a string.

certificate

Part of idp-authentication configuration.

Authenticate with a client certificate

| Parameter | Type | Description |
| --- | --- | --- |
| authority-url | string | Authentication authority URL |
| path | string | Required. Enter the path to the .pem or .pfx certificate to be used for authentication |
| password | string | Enter the password for the key file, if it is encrypted. |

data-set

Part of cognite configuration.

Enter a data set the extractor should write data into

| Parameter | Type | Description |
| --- | --- | --- |
| id | integer | Resource internal id |
| external-id | string | Resource external id |

extraction-pipeline

Part of cognite configuration.

Enter the extraction pipeline used for remote config and reporting statuses

| Parameter | Type | Description |
| --- | --- | --- |
| id | integer | Resource internal id |
| external-id | string | Resource external id |

connection

Part of cognite configuration.

Configure network connection details

| Parameter | Type | Description |
| --- | --- | --- |
| disable-gzip | boolean | Whether or not to disable gzipping of JSON bodies. |
| status-forcelist | string | HTTP status codes to retry. Defaults to 429, 502, 503 and 504. |
| max-retries | integer | Max number of retries on a given HTTP request. Default value is 10. |
| max-retries-connect | integer | Max number of retries on connection errors. Default value is 3. |
| max-retry-backoff | integer | Retry strategy employs exponential backoff. This parameter sets a max on the amount of backoff after any request failure. Default value is 30. |
| max-connection-pool-size | integer | The maximum number of connections which will be kept in the SDK's connection pool. Default value is 50. |
| disable-ssl | boolean | Whether or not to disable SSL verification. |
| proxies | object | Dictionary mapping from protocol to URL. |

proxies

Part of connection configuration.

Dictionary mapping from protocol to url.
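
A connection section that tightens the retry settings and routes traffic through a proxy could be sketched as follows. The proxy address is a placeholder, and the retry values are arbitrary choices for illustration rather than recommendations.

```yaml
# Hypothetical values; the proxy URL is a placeholder.
cognite:
  connection:
    max-retries: 5
    max-retries-connect: 3
    max-retry-backoff: 60
    proxies:
      # protocol -> proxy URL
      https: http://proxy.example.com:8080
```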

logger

Global parameter.

The optional logger section sets up logging to a console and files.

| Parameter | Type | Description |
| --- | --- | --- |
| console | object | Include the console section to enable logging to a standard output, such as a terminal window. |
| file | object | Include the file section to enable logging to a file. The files are rotated daily. |
| metrics | boolean | Enables metrics on the number of log messages recorded per logger and level. This requires metrics to be configured as well. |

console

Part of logger configuration.

Include the console section to enable logging to a standard output, such as a terminal window.

| Parameter | Type | Description |
| --- | --- | --- |
| level | either DEBUG, INFO, WARNING, ERROR or CRITICAL | Select the verbosity level for console logging. Valid options, in decreasing verbosity levels, are DEBUG, INFO, WARNING, ERROR, and CRITICAL. Default value is INFO. |

file

Part of logger configuration.

Include the file section to enable logging to a file. The files are rotated daily.

| Parameter | Type | Description |
| --- | --- | --- |
| level | either DEBUG, INFO, WARNING, ERROR or CRITICAL | Select the verbosity level for file logging. Valid options, in decreasing verbosity levels, are DEBUG, INFO, WARNING, ERROR, and CRITICAL. Default value is INFO. |
| path | string | Required. Insert the path to the log file. |
| retention | integer | Specify the number of days to keep logs for. Default value is 7. |
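
Putting the console and file sections together, a logger configuration might look like this sketch. The log path and retention period are placeholder choices.

```yaml
# Hypothetical example; path and retention are placeholders.
logger:
  console:
    level: INFO
  file:
    level: DEBUG
    path: logs/db-extractor.log
    retention: 14   # days
```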

metrics

Global parameter.

The metrics section describes where to send metrics on extractor performance for remote monitoring of the extractor. We recommend sending metrics to a Prometheus pushgateway, but you can also send metrics as time series in the CDF project.

| Parameter | Type | Description |
| --- | --- | --- |
| push-gateways | list | List of Prometheus pushgateway configurations |
| cognite | object | Push metrics to CDF time series. Requires CDF credentials to be configured |
| server | object | The extractor can also be configured to expose an HTTP server with Prometheus metrics for scraping |

push-gateways

Part of metrics configuration.

List of prometheus pushgateway configurations

Each element of this list should be a metric destination, configured with the following parameters.

| Parameter | Type | Description |
| --- | --- | --- |
| host | string | Enter the address of the host to push metrics to. |
| job-name | string | Enter the value of the exported_job label to associate metrics with. This separates several deployments on a single pushgateway, and should be unique. |
| username | string | Enter the credentials for the pushgateway. |
| password | string | Enter the credentials for the pushgateway. |
| clear-after | either null or integer | Enter the number of seconds to wait before clearing the pushgateway. When this parameter is present, the extractor will stall after the run is complete before deleting all metrics from the pushgateway. The recommended value is at least twice that of the scrape interval on the pushgateway. This is to ensure that the last metrics are gathered before the deletion. Default is disabled. |
| push-interval | integer | Enter the interval in seconds between each push. Default value is 30. |
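
The following sketch shows a single pushgateway destination. The host, job name, and environment variable names are placeholders, and the clear-after value assumes a scrape interval of 60 seconds, following the guidance to use at least twice the scrape interval.

```yaml
# Hypothetical example; host, job name, and variables are placeholders.
metrics:
  push-gateways:
    - host: https://pushgateway.example.com
      job-name: db-extractor-site-a   # should be unique per deployment
      username: ${PUSHGATEWAY_USER}
      password: ${PUSHGATEWAY_PASSWORD}
      push-interval: 30
      clear-after: 120                # assumes a 60-second scrape interval
```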

cognite

Part of metrics configuration.

Push metrics to CDF timeseries. Requires CDF credentials to be configured

| Parameter | Type | Description |
| --- | --- | --- |
| external-id-prefix | string | Required. Prefix on external ID used when creating CDF time series to store metrics. |
| asset-name | string | Enter the name for a CDF asset that will have all the metrics time series attached to it. |
| asset-external-id | string | Enter the external ID for a CDF asset that will have all the metrics time series attached to it. |
| push-interval | integer | Enter the interval in seconds between each push to CDF. Default value is 30. |
| data-set | object | Data set the metrics will be created under |

data-set

Part of cognite configuration.

Data set the metrics will be created under

| Parameter | Type | Description |
| --- | --- | --- |
| id | integer | Resource internal id |
| external-id | string | Resource external id |

server

Part of metrics configuration.

The extractor can also be configured to expose a HTTP server with prometheus metrics for scraping

| Parameter | Type | Description |
| --- | --- | --- |
| host | string | Host to run the Prometheus server on. Default value is 0.0.0.0. |
| port | integer | Local port to expose the Prometheus server on. Default value is 9000. |

queries

Global parameter.

List of queries to execute

Each element of this list should be a description of a SQL query against a database

| Parameter | Type | Description |
| --- | --- | --- |
| database | string | Required. Enter the name of the database to connect to. This must be one of the database names configured in the databases section. |
| name | string | Required. Enter a name for this query that will be used for logging and tagging metrics. The name must be unique for each query in the configuration file. |
| query | string | Required. SQL query to execute. Supports interpolation with {incremental_field} and {start_at}. |
| destination | configuration for either RAW, Events, Assets, Time series, Sequence or Files | Required. The destination of the data in CDF. Examples: {'destination': {'type': 'raw', 'database': 'my-database', 'table': 'my-table'}}, {'destination': {'type': 'events'}} |
| primary-key | string | Insert the format of the row key in CDF RAW. This parameter supports case-sensitive substitutions with values from the table columns. For example, if there's a column called index, setting primary-key: row_{index} will result in rows with keys row_0, row_1, etc. This is a required value if the destination is a raw type. Example: row_{index} |
| incremental-field | string | Insert the table column that holds the incremental field. Include to enable incremental loading, otherwise the extractor will default to a full run every time. To use incremental load, a state store is required. |
| freshness-field | string | Which column to use for freshness metric. Must be specified along with freshness-field-timezone. |
| freshness-field-timezone | string | Timezone to use for freshness metric |
| initial-start | either string, number or integer | Enter the {start_at} for the first run. This value is used only on the initial run; subsequent runs use the value stored in the state store. Required when incremental-field is set. |
| schedule | configuration for either Fixed interval or CRON expression | Enter the schedule for when this query should run. Avoid scheduling runs so often that a new run starts before the previous execution has finished. Required when running in continuous mode, ignored otherwise. Examples: {'schedule': {'type': 'interval', 'expression': '1h'}}, {'schedule': {'type': 'cron', 'expression': '0 7-17 * * 1-5'}} |
| collection | string | Specify the collection on which the query will be executed. This parameter is mandatory when connecting to mongodb databases. |
| container | string | Specify the container on which the query will be executed. This parameter is mandatory when connecting to cosmosdb databases. |
| sheet | string | Specify the sheet on which the query will be executed. This parameter is mandatory when connecting to spreadsheet files. |
| skip_rows | string | Specify the number of rows to be skipped when reading a spreadsheet. This parameter is optional when connecting to spreadsheet files. |
| has_header | string | Specify if the extractor should skip the file header while reading a spreadsheet. This parameter is optional when connecting to spreadsheet files. |
| parameters | string | Specify the parameters to be used when querying AWS DynamoDB. This parameter is mandatory when connecting to dynamodb databases. |
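
To make the interplay between query, incremental-field, initial-start, and schedule concrete, here is a sketch of an incremental query that writes to RAW. The table and column names are hypothetical, the quoting around {start_at} depends on your SQL dialect and column type, and the state store that incremental loading requires is configured elsewhere and not shown.

```yaml
# Hypothetical example; table, columns, and names are placeholders.
queries:
  - name: work-orders-incremental
    database: my-database
    query: >
      SELECT * FROM work_orders
      WHERE {incremental_field} >= '{start_at}'
      ORDER BY {incremental_field} ASC
    incremental-field: updated_at
    initial-start: "2020-01-01 00:00:00"   # only used on the first run
    primary-key: "wo_{id}"                 # assumes an id column
    schedule:
      type: interval
      expression: 1h
    destination:
      type: raw
      database: my-raw-database
      table: work-orders
```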

destination

Part of queries configuration.

The destination of the data in CDF.

Either one of the following options: raw, events, assets, time_series, sequence or files.

Examples:

destination:
  type: raw
  database: my-database
  table: my-table

destination:
  type: events

raw

Part of destination configuration.

The raw destination writes data to the CDF staging area (RAW). The raw destination requires the primary-key parameter in the query configuration.

| Parameter | Type | Description |
| --- | --- | --- |
| type | always raw | Type of CDF destination, set to raw to write data to RAW. |
| database | string | Required. Enter the CDF RAW database to upload data into. This will be created if it doesn't exist. |
| table | string | Required. Enter the CDF RAW table to upload data into. This will be created if it doesn't exist. |

events

Part of destination configuration.

The events destination inserts the resulting data as CDF events. The events destination is configured by setting the type parameter to events. No other parameters are required.

To ingest data into events, the query must produce a column named

  • externalId

In addition, columns named

  • startTime
  • endTime
  • description
  • source
  • type
  • subType

may be included and will be mapped to corresponding fields in CDF events. Any other columns returned by the query will be mapped to key/value pairs in the metadata field for events.

| Parameter | Type | Description |
| --- | --- | --- |
| type | always events | Type of CDF destination, set to events to write data to events. |

assets

Part of destination configuration.

The assets destination inserts the resulting data as CDF assets. The assets destination is configured by setting the type parameter to assets. No other parameters are required.

To ingest data into assets, the query must produce a column named

  • name

In addition, columns named

  • externalId
  • parentExternalId
  • description
  • source

may be included and will be mapped to corresponding fields in CDF assets. Any other columns returned by the query will be mapped to key/value pairs in the metadata field for assets.

| Parameter | Type | Description |
| --- | --- | --- |
| type | always assets | Type of CDF destination, set to assets to write data to assets. |

time_series

Part of destination configuration.

The time_series destination inserts the resulting data as data points in time series. The time series destination is configured by setting the type parameter to time_series. No other parameters are required.

To ingest data into a time series, the query must produce columns named

  • externalId
  • timestamp
  • value

In addition, include a column called status to give the data point a status code. Statuses include a category, and an optional comma-separated list of modifier flags. You can read more about status codes here. Some examples for status codes include Good (which is assumed if status is omitted), UNCERTAIN, HIGH and bad.

The extractor will insert data points into time series identified by the externalId column. If a time series does not exist, the extractor will create a minimal time series with only an external ID and the isString property inferred from the type of first data point processed for that time series. All other time series attributes need to be added separately.

| Parameter | Type | Description |
| --- | --- | --- |
| type | always time_series | Type of CDF destination, set to time_series to write data to time series. |
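
As a sketch of the column requirements above, the query below renames source columns to externalId, timestamp, and value. The table and column names are hypothetical, and whether aliases need double quotes to preserve their case depends on the database.

```yaml
# Hypothetical example; table and column names are placeholders.
queries:
  - name: sensor-readings
    database: my-database
    query: >
      SELECT tag AS "externalId", measured_at AS "timestamp", reading AS "value"
      FROM sensor_readings
    destination:
      type: time_series
```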

sequence

Part of destination configuration.

The sequence destination writes data to a CDF sequence.

The column set of the query result will determine the columns of the sequence.

The result must include a column named row_number, which must include an integer indicating which row number in the sequence to ingest the row into.

| Parameter | Type | Description |
| --- | --- | --- |
| type | always sequence | Type of CDF destination, set to sequence to write data to a sequence. |
| external-id | string | Required. Configured sequence external ID |
| value-types | either convert, drop or assert | How types are converted into the expected types in CDF. convert attempts to make a conversion, which may fail. drop drops the row if there is a mismatch. assert fails the query if the types do not match. Default value is convert. |
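
A sequence destination could be configured along the lines of this sketch. The table, columns, and sequence external ID are hypothetical, and the row_number alias may need quoting in SQL dialects that treat it as a reserved name.

```yaml
# Hypothetical example; table, columns, and IDs are placeholders.
queries:
  - name: well-trajectory
    database: my-database
    query: >
      SELECT point_index AS row_number, depth, inclination
      FROM trajectory_points
    destination:
      type: sequence
      external-id: my-sequence
      value-types: drop   # skip rows whose types do not match
```

With this sketch, depth and inclination would become the columns of the sequence, and point_index decides which row each result lands in.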

files

Part of destination configuration.

The files destination inserts the resulting data as CDF files. The files destination is configured by setting the type parameter to files. No other parameters are required.

To ingest data into files, the query must produce columns named

  • name
  • externalId
  • content

content will be treated as binary data and uploaded to CDF files as the content of the file

In addition, columns named

  • source
  • mimeType
  • directory
  • sourceCreatedTime
  • sourceModifiedTime
  • asset_ids

may be included and will be mapped to corresponding fields in CDF files. Any other columns returned by the query will be mapped to key/value pairs in the metadata field for files.

| Parameter | Type | Description |
| --- | --- | --- |
| type | always files | Type of CDF destination, set to files to write data to CDF files. |
| content-column | string | Column used as file content. Default value is content. |

schedule

Part of queries configuration.

Enter the schedule for when this query should run. Avoid scheduling runs so often that a new run starts before the previous execution has finished. Required when running in continuous mode, ignored otherwise.

Either one of the following options: fixed_interval or cron_expression.

Examples:

schedule:
  type: interval
  expression: 1h

schedule:
  type: cron
  expression: 0 7-17 * * 1-5

fixed_interval

Part of schedule configuration.

| Parameter | Type | Description |
| --- | --- | --- |
| type | always interval | Required. Type of time interval configuration. Use interval to configure a fixed interval. |
| expression | string | Required. Enter a time interval, with a unit. Available units are s (seconds), m (minutes), h (hours) and d (days). Examples: 45s, 15m, 2h |

cron_expression

Part of schedule configuration.

| Parameter | Type | Description |
| --- | --- | --- |
| type | always cron | Required. Type of time interval configuration. Use cron to configure a CRON schedule. |
| expression | string | Required. Enter a CRON expression. See crontab.guru for a guide on writing CRON expressions. Example: */15 8-16 * * * |

databases

Global parameter.

List of databases to connect to

Each element of this list should be a configuration for a database the extractor will connect to

Either one of the following options:

Example:

databases:
  - type: odbc
    name: my-odbc-database
    connection-string: DRIVER={Oracle 19.3};DBQ=localhost:1521/XE;UID=SYSTEM;PWD=oracle
  - type: postgres
    name: postgres-db
    host: pg.company.com
    user: postgres
    password: secret123Pas$word

odbc

Part of databases configuration.

Open Database Connectivity (ODBC) is a generic protocol for querying databases. To connect to a database using ODBC, you must first download and install an ODBC driver for your database system on the machine running the extractor. Consult the documentation or contact the vendor of your database system to find its driver.

Example:

type: odbc
name: asset-database
connection-string: Driver={ODBC Driver 17 for SQL Server};Server=10.24.5.162;Database=assets;UID=extractorUser;PWD=myPassword;

| Parameter | Type | Description |
| --- | --- | --- |
| type | always odbc | Select the type of database connection. Set to odbc for ODBC databases. |
| connection-string | string | Required. Enter the ODBC connection string. This will differ between database vendors. Examples: DRIVER={Oracle 19.3};DBQ=localhost:1521/XE;UID=SYSTEM;PWD=oracle and DSN={MyDatabaseDsn} |
| response-encoding | string | Override the encoding to expect on database responses if the driver does not adhere to the ODBC standard. Default is to follow the ODBC standard. Examples: utf8, iso-8859-1 |
| query-encoding | string | Override the encoding to use on database queries if the driver does not adhere to the ODBC standard. Default is to follow the ODBC standard. Examples: utf8, iso-8859-1 |
| timeout | integer | Enter the timeout in seconds for the ODBC connection and queries. The default value is no timeout. Some ODBC drivers don't accept either the SQL_ATTR_CONNECTION_TIMEOUT or the SQL_ATTR_QUERY_TIMEOUT option. The extractor will log an exception with the message Could not set timeout on the ODBC driver - timeouts might not work properly. Extractions will continue regardless but without timeouts. To avoid this log line, you can disable timeouts for the database causing these problems. |
| batch-size | integer | Enter the number of rows to fetch from the database at a time. You can decrease this number if the machine with the extractor runs out of memory. Note that this will increase the run time. Default value is 1000. |
| name | string | Enter a name for the database that will be used throughout the queries section and for logging. The name must be unique for each database in the configuration file. |
| timezone | configuration for either local time zone, universal coordinated time or offset from UTC | Specify how the extractor should handle timestamps from the source when timezone data is absent. Either local for the local timezone on the machine the extractor is running on, utc for UTC, or a number for a numerical offset from UTC. Default value is local. Examples: utc, -8, 5.5 |

postgresql

Part of databases configuration.

Example:

type: postgres
name: my-database
host: 10.42.39.12
user: extractor-user
password: mySecretPassword

| Parameter | Type | Description |
| --- | --- | --- |
| type | always postgres | Required. Type of database connection, set to postgres for PostgreSQL databases. |
| host | string | Required. Enter the hostname or address of the PostgreSQL database. Examples: 123.234.123.234, postgres.my-domain.com, localhost |
| user | string | Required. Enter the username for the PostgreSQL database |
| password | string | Required. Enter the password for the PostgreSQL database |
| database | string | Enter the database name to use. The default is to use the user name. |
| port | integer | Enter the port to connect to. Default value is 5432. |
| timeout | integer | Enter the timeout in seconds for the database connection and queries. The default value is no timeout. |
| batch-size | integer | Enter the number of rows to fetch from the database at a time. You can decrease this number if the machine with the extractor runs out of memory. Note that this will increase the run time. Default value is 1000. |
| name | string | Enter a name for the database that will be used throughout the queries section and for logging. The name must be unique for each database in the configuration file. |
| timezone | configuration for either local time zone, universal coordinated time or offset from UTC | Specify how the extractor should handle timestamps from the source when timezone data is absent. Either local for the local timezone on the machine the extractor is running on, utc for UTC, or a number for a numerical offset from UTC. Default value is local. Examples: utc, -8, 5.5 |

oracle_db

Part of databases configuration.

The Cognite DB Extractor can connect directly to an Oracle Database version 12.1 or later.

Example:

type: oracle
name: my-database
host: 10.42.39.12
user: extractor-user
password: mySecretPassword

| Parameter | Type | Description |
| --- | --- | --- |
| type | always oracle | Type of database connection, set to oracle for Oracle databases. |
| host | string | Required. Enter the hostname or address of the Oracle database. Examples: 123.234.123.234, database.my-domain.com, localhost |
| user | string | Required. Enter the user name |
| password | string | Required. Enter the user password |
| port | integer | Enter the port to connect to. Default value is 1521. |
| service-name | string | Optionally specify the service name of the database to connect to |
| timeout | integer | Timeout for statements to the database |
| batch-size | integer | Enter the number of rows to fetch from the database at a time. You can decrease this number if the machine with the extractor runs out of memory. Note that this will increase the run time. Default value is 1000. |
| name | string | Enter a name for the database that will be used throughout the queries section and for logging. The name must be unique for each database in the configuration file. |
| timezone | configuration for either local time zone, universal coordinated time or offset from UTC | Specify how the extractor should handle timestamps from the source when timezone data is absent. Either local for the local timezone on the machine the extractor is running on, utc for UTC, or a number for a numerical offset from UTC. Default value is local. Examples: utc, -8, 5.5 |

snowflake

Part of databases configuration.

| Parameter | Type | Description |
| --- | --- | --- |
| type | always snowflake | Type of database connection, set to snowflake for Snowflake data warehouses. |
| user | string | Required. User name for Snowflake |
| password | string | Required. Password for Snowflake |
| account | string | Required. Snowflake account ID |
| organization | string | Required. Snowflake organization name |
| database | string | Required. Snowflake database to use |
| schema | string | Required. Snowflake schema to use |
| name | string | Enter a name for the database that will be used throughout the queries section and for logging. The name must be unique for each database in the configuration file. |
| timezone | configuration for either local time zone, universal coordinated time or offset from UTC | Specify how the extractor should handle timestamps from the source when timezone data is absent. Either local for the local timezone on the machine the extractor is running on, utc for UTC, or a number for a numerical offset from UTC. Default value is local. Examples: utc, -8, 5.5 |
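
A Snowflake entry in the databases list might look like the following sketch. All identifiers are placeholders, and the password is read from a hypothetical environment variable.

```yaml
# Hypothetical example; all identifiers are placeholders.
databases:
  - type: snowflake
    name: my-snowflake
    user: extractor-user
    password: ${SNOWFLAKE_PASSWORD}
    account: my-account
    organization: my-organization
    database: MY_DATABASE
    schema: PUBLIC
    timezone: utc
```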

mongodb

Part of databases configuration.

| Parameter | Type | Description |
| --- | --- | --- |
| type | always mongodb | Type of database connection, set to mongodb for MongoDB databases. |
| uri | string | Required. Address and authentication data for the database as a Uniform Resource Identifier (URI). You can read more about MongoDB URIs here. Example: mongodb://mymongo:port/?retryWrites=true&connectTimeoutMS=10000 |
| database | string | Required. Name of the related MongoDB database to use. |
| name | string | Enter a name for the database that will be used throughout the queries section and for logging. The name must be unique for each database in the configuration file. |
| timezone | configuration for either local time zone, universal coordinated time or offset from UTC | Specify how the extractor should handle timestamps from the source when timezone data is absent. Either local for the local timezone on the machine the extractor is running on, utc for UTC, or a number for a numerical offset from UTC. Default value is local. Examples: utc, -8, 5.5 |

azure_cosmos_db

Part of databases configuration.

| Parameter | Type | Description |
| --- | --- | --- |
| type | always cosmosdb | Type of database connection, set to cosmosdb for Cosmos DB databases. |
| host | string | Required. Host address for the database. Example: https://my-cosmos-db.documents.azure.com |
| key | string | Required. Azure Key used to connect to the Cosmos DB instance |
| database | string | Required. Database name to use |
| name | string | Enter a name for the database that will be used throughout the queries section and for logging. The name must be unique for each database in the configuration file. |
| timezone | configuration for either local time zone, universal coordinated time or offset from UTC | Specify how the extractor should handle timestamps from the source when timezone data is absent. Either local for the local timezone on the machine the extractor is running on, utc for UTC, or a number for a numerical offset from UTC. Default value is local. Examples: utc, -8, 5.5 |

local_spreadsheet_files

Part of databases configuration.

The Cognite DB extractor can run against Excel spreadsheets and other files containing tabular data. The currently supported file types are

  • xlsx, xlsm and xlsb (modern Excel files)
  • xls (legacy Excel files)
  • odf, ods and odt (OpenDocument Format, used by e.g. LibreOffice and OpenOffice)
  • csv (Comma separated values)

When using Excel or OpenDocument Format spreadsheets, you need to provide an additional sheet parameter in the associated query configuration.

| Parameter | Type | Description |
| --- | --- | --- |
| type | always spreadsheet | Type of connection, set to spreadsheet for local spreadsheet files. |
| path | string | Required. Path to a single spreadsheet file. Examples: /path/to/my/excel/file.xlsx, ./relative/path/file.csv, C:\\Users\\Robert\\Documents\\spreadsheet.xls |
| name | string | Enter a name for the database that will be used throughout the queries section and for logging. The name must be unique for each database in the configuration file. |
| timezone | configuration for either local time zone, universal coordinated time or offset from UTC | Specify how the extractor should handle timestamps from the source when timezone data is absent. Either local for the local timezone on the machine the extractor is running on, utc for UTC, or a number for a numerical offset from UTC. Default value is local. Examples: utc, -8, 5.5 |
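
A spreadsheet entry in the databases list could be sketched as below. The file path and name are placeholders. Any query that refers to this entry would also need the sheet parameter when the file is an Excel or OpenDocument spreadsheet; the query itself is not shown here.

```yaml
# Hypothetical example; path and name are placeholders.
databases:
  - type: spreadsheet
    name: my-spreadsheet
    path: ./data/assets.xlsx
    timezone: utc
```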

amazon_dynamo_db

Part of databases configuration.

| Parameter | Type | Description |
| --- | --- | --- |
| type | always dynamodb | Type of database connection, set to dynamodb for DynamoDB databases. |
| aws-access-key-id | string | Required. AWS authentication access key ID |
| aws-secret-access-key | string | Required. AWS authentication access key secret |
| region-name | string | Required. AWS region where your database is located. Example: us-east-1 |
| name | string | Enter a name for the database that will be used throughout the queries section and for logging. The name must be unique for each database in the configuration file. |
| timezone | configuration for either local time zone, universal coordinated time or offset from UTC | Specify how the extractor should handle timestamps from the source when timezone data is absent. Either local for the local timezone on the machine the extractor is running on, utc for UTC, or a number for a numerical offset from UTC. Default value is local. Examples: utc, -8, 5.5 |

amazon_redshift

Part of databases configuration.

| Parameter | Type | Description |
| --- | --- | --- |
| type | always redshift | Type of database connection, set to redshift for Redshift databases. |
| aws-access-key-id | string | Required. AWS authentication access key ID |
| aws-secret-access-key | string | Required. AWS authentication access key secret |
| region-name | string | Required. AWS region where your database is located. Example: us-east-1 |
| database | string | Required. Redshift database |
| secret-arn | string | AWS Secret ARN |
| cluster-identifier | string | Name of the Redshift cluster to connect to. This parameter is required when connecting to a managed Redshift cluster. |
| workgroup-name | string | Name of the Redshift workgroup to connect to. This parameter is required when connecting to a Redshift Serverless database. |
| name | string | Enter a name for the database that will be used throughout the queries section and for logging. The name must be unique for each database in the configuration file. |
| timezone | configuration for either local time zone, universal coordinated time or offset from UTC | Specify how the extractor should handle timestamps from the source when timezone data is absent. Either local for the local timezone on the machine the extractor is running on, utc for UTC, or a number for a numerical offset from UTC. Default value is local. Examples: utc, -8, 5.5 |
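
The difference between a managed cluster and a Serverless database comes down to which of cluster-identifier and workgroup-name is set. The sketch below shows one entry of each kind, assuming that the databases section is a list of connection entries. All names, identifiers, and environment variable names are placeholders.

```yaml
databases:
  # Managed Redshift cluster: identified by cluster-identifier
  - type: redshift
    name: my-redshift-cluster                         # placeholder
    aws-access-key-id: ${AWS_ACCESS_KEY_ID}           # hypothetical variable name
    aws-secret-access-key: ${AWS_SECRET_ACCESS_KEY}   # hypothetical variable name
    region-name: us-east-1
    database: my_database                             # placeholder
    cluster-identifier: my-cluster                    # placeholder

  # Redshift Serverless: identified by workgroup-name
  - type: redshift
    name: my-redshift-serverless                      # placeholder
    aws-access-key-id: ${AWS_ACCESS_KEY_ID}
    aws-secret-access-key: ${AWS_SECRET_ACCESS_KEY}
    region-name: us-east-1
    database: my_database
    workgroup-name: my-workgroup                      # placeholder
```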

google_bigquery

Part of databases configuration.

The DB extractor can run queries against Google BigQuery using Google's SQL-like query language.

Because the connection builds on the Google SDK, you authenticate using Google's recommended method: set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of your authentication key.

| Parameter | Type | Description |
| --- | --- | --- |
| type | always bigquery | Type of database connection, set to bigquery for Google BigQuery |
| name | string | Enter a name for the database that will be used throughout the queries section and for logging. The name must be unique for each database in the configuration file. |
| timezone | configuration for either local time zone, universal coordinated time or offset from UTC | Specify how the extractor should handle timestamps from the source when timezone data is absent. Either local for the local timezone on the machine the extractor is running on, utc for UTC, or a number for a numerical offset from UTC. Default value is local. Examples: utc, -8, 5.5 |
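
Since credentials come from the environment, a BigQuery entry carries no authentication parameters. A minimal sketch, assuming that the databases section is a list of connection entries and using a placeholder name:

```yaml
# GOOGLE_APPLICATION_CREDENTIALS must be set in the environment
# before the extractor starts; it is not part of this file.
databases:
  - type: bigquery
    name: my-bigquery   # placeholder; must be unique
    timezone: utc       # optional; defaults to local
```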

extractor

Global parameter.

General extractor configuration.

| Parameter | Type | Description |
| --- | --- | --- |
| state-store | object | Include the state store section to save extraction states between runs. Use this if data is loaded incrementally. We support multiple state stores, but you can only configure one at a time. |
| upload-queue-size | integer | Maximum size of upload queue. Upload to CDF will be triggered once this limit is reached. Default value is 100000. |
| parallelism | integer | Maximum number of parallel queries. Default value is 4. |
| mode | either continuous or single | Extractor mode. Continuous runs the configured queries using the schedules configured per query. Single runs the queries once each. |
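
For reference, an extractor section that spells out the documented defaults might look like the sketch below. The choice of continuous mode is only an example.

```yaml
extractor:
  mode: continuous            # or single, to run each query once
  parallelism: 4              # documented default
  upload-queue-size: 100000   # documented default
```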

state-store

Part of extractor configuration.

Include the state store section to save extraction states between runs. Use this if data is loaded incrementally. We support multiple state stores, but you can only configure one at a time.

| Parameter | Type | Description |
| --- | --- | --- |
| raw | object | A RAW state store stores the extraction state in a table in CDF RAW. |
| local | object | A local state store stores the extraction state in a JSON file on the local machine. |

raw

Part of state-store configuration.

A RAW state store stores the extraction state in a table in CDF RAW.

| Parameter | Type | Description |
| --- | --- | --- |
| database | string | Required. Enter the database name in CDF RAW. |
| table | string | Required. Enter the table name in CDF RAW. |
| upload-interval | integer | Enter the interval in seconds between each upload to CDF RAW. Default value is 30. |
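
A RAW state store could be configured as in this sketch. The database and table names are placeholders.

```yaml
extractor:
  state-store:
    raw:
      database: extractor-states   # placeholder CDF RAW database name
      table: db-extractor          # placeholder CDF RAW table name
      upload-interval: 30          # seconds; documented default
```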

local

Part of state-store configuration.

A local state store stores the extraction state in a JSON file on the local machine.

| Parameter | Type | Description |
| --- | --- | --- |
| path | string | Required. Insert the file path to a JSON file. |
| save-interval | integer | Enter the interval in seconds between each save. Default value is 30. |
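
A local state store could be configured as in this sketch. The file path is a placeholder. Because only one state store can be configured at a time, it would replace a raw entry rather than sit alongside it.

```yaml
extractor:
  state-store:
    local:
      path: ./state.json   # placeholder path to a JSON file
      save-interval: 30    # seconds; documented default
```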