You can set up extraction pipelines to use versioned extractor configuration files stored in the cloud.

Using values from environment variables

The configuration file allows substitutions with environment variables. For example:
cognite:
  secret: ${COGNITE_CLIENT_SECRET}
will load the value from the COGNITE_CLIENT_SECRET environment variable into the cognite/secret parameter. You can also do string interpolation with environment variables, for example:
url: http://my-host.com/api/endpoint?secret=${MY_SECRET_TOKEN}
Implicit substitutions only work for unquoted value strings. For quoted strings, use the !env tag to activate environment substitution:
url: !env 'http://my-host.com/api/endpoint?secret=${MY_SECRET_TOKEN}'

Using values from Azure Key Vault

The extractor also supports loading values from Azure Key Vault. To load a configuration value from Azure Key Vault, use the !keyvault tag followed by the name of the secret you want to load. For example, to load the value of the my-secret-name secret in Key Vault into a password parameter, configure your extractor like this:
password: !keyvault my-secret-name
To use Key Vault, you also need to include the azure-keyvault section in your configuration, with the following parameters:
| Parameter | Description |
| --- | --- |
| keyvault-name | Name of the Key Vault to load secrets from. |
| authentication-method | How to authenticate to Azure. Either default or client-secret. With default, the extractor looks at the user running the extractor and picks up pre-configured Azure logins from tools like the Azure CLI. With client-secret, the extractor authenticates with a configured client ID/secret pair. |
| client-id | Required when using the client-secret authentication method. The client ID to use when authenticating to Azure. |
| secret | Required when using the client-secret authentication method. The client secret to use when authenticating to Azure. |
| tenant-id | Required when using the client-secret authentication method. The tenant ID of the Key Vault in Azure. |
Example:
azure-keyvault:
  keyvault-name: my-keyvault-name
  authentication-method: client-secret
  tenant-id: 6f3f324e-5bfc-4f12-9abe-22ac56e2e648
  client-id: 6b4cc73e-ee58-4b61-ba43-83c4ba639be6
  secret: 1234abcd

Base configuration object

| Parameter | Type | Description |
| --- | --- | --- |
| version | either string or integer | Input the configuration file version. |
| type | either local or remote | Input the configuration file type. The local option loads the full config from this file, while the remote option loads only the cognite section and the rest from extraction pipelines. Default value is local. |
| cognite | object | Describes which CDF project the extractor will load data into and how to connect to the project. |
| logger | object | Sets up logging to a console and files. This is an optional value. |
| extractor | object | Contains the common extractor configuration. |
| source | object | Insert the source configuration for data mirrored from Fabric. |
| destination | object | Insert the destination configuration for time series data mirrored from Fabric. |
| subscriptions | list | Insert the time series subscriptions configuration for time series mirrored to Fabric. |
| data-modeling | list | Insert the data modeling configuration for syncing a data model to Fabric. |
| event | object | Enter the event configuration for mirroring events to Fabric. |
| raw-tables | list | Enter the raw tables configuration to mirror the raw tables to Fabric. |

cognite

Global parameter. Describes which CDF project the extractor will load data into and how to connect to the project.
| Parameter | Type | Description |
| --- | --- | --- |
| project | string | Insert the CDF project name into which you want to ingest data. |
| idp-authentication | object | Insert the credentials for authenticating to CDF using an external identity provider (IdP), such as Microsoft Entra ID (formerly Azure Active Directory). |
| data-set | object | Enter a data set into which the extractor should write data. |
| extraction-pipeline | object | Enter the extraction pipeline for remote config and reporting statuses. |
| host | string | Insert the base URL of the CDF project. Default value is https://api.cognitedata.com. |
| timeout | integer | Enter the timeout on requests to CDF in seconds. Default value is 30. |
| external-id-prefix | string | Enter the external ID prefix to identify the documents in CDF. Leave empty for no prefix. |
| connection | object | This parameter configures the network connection details. |
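
Putting these parameters together, a minimal cognite section might look like the following sketch; the project name, data set external ID, and prefix are placeholders:

```yaml
cognite:
  project: my-cdf-project
  host: https://api.cognitedata.com
  timeout: 30
  external-id-prefix: "fabric:"
  data-set:
    external-id: my-data-set
```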

idp-authentication

Part of cognite configuration. Insert the credentials for authenticating to CDF using an external identity provider (IdP), such as Microsoft Entra ID (formerly Azure Active Directory).
| Parameter | Type | Description |
| --- | --- | --- |
| authority | string | Insert the authority, combined with tenant, to authenticate against Azure tenants. Default value is https://login.microsoftonline.com/. |
| client-id | string | Required. Enter the service principal client ID from the IdP. |
| tenant | string | Enter the Entra ID tenant ID. Do not use in combination with the token-url parameter. |
| token-url | string | Insert the URL to fetch tokens from. Do not use in combination with the tenant parameter. |
| secret | string | Enter the service principal client secret from the IdP. |
| resource | string | Input the resource parameter to include in token requests. |
| audience | string | Input the audience parameter to include in token requests. |
| scopes | list | Enter the list of scopes requested for the token. |
| min-ttl | integer | Insert the minimum time in seconds for a token to be valid. If the cached token expires in less than min-ttl seconds, the system will refresh the token, even if it's still valid. Default value is 30. |
| certificate | object | Authenticate with a client certificate. |
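
As a sketch, client-credentials authentication against Microsoft Entra ID could be configured like this; the tenant ID, client ID, and scope are placeholders, and the secret is read from an environment variable using implicit substitution:

```yaml
cognite:
  project: my-cdf-project
  idp-authentication:
    tenant: 00000000-0000-0000-0000-000000000000
    client-id: my-client-id
    secret: ${IDP_CLIENT_SECRET}
    scopes:
      - https://api.cognitedata.com/.default
    min-ttl: 30
```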

scopes

Part of idp-authentication configuration. Enter the list of scopes requested for the token. Each element of this list should be a string.

certificate

Part of idp-authentication configuration. Authenticate with a client certificate.
| Parameter | Type | Description |
| --- | --- | --- |
| authority-url | string | Input the authentication authority URL. |
| path | string | Required. Enter the path to the .pem or .pfx certificate for authentication. |
| password | string | Enter the password for the key file if it is encrypted. |

data-set

Part of cognite configuration. Enter a data set into which the extractor should write data.
| Parameter | Type | Description |
| --- | --- | --- |
| id | integer | Input the resource internal ID. |
| external-id | string | Input the resource external ID. |

extraction-pipeline

Part of cognite configuration. Enter the extraction pipeline for remote config and reporting statuses.
| Parameter | Type | Description |
| --- | --- | --- |
| id | integer | Input the resource internal ID. |
| external-id | string | Input the resource external ID. |

connection

Part of cognite configuration. This parameter configures the network connection details.
| Parameter | Type | Description |
| --- | --- | --- |
| disable-gzip | boolean | Set to true to turn off gzipping of JSON bodies. |
| status-forcelist | string | Enter the HTTP status codes to retry. |
| max-retries | integer | Enter the maximum number of retries on a given request. Default value is 10. |
| max-retries-connect | integer | Enter the maximum number of retries on connection errors. Default value is 3. |
| max-retry-backoff | integer | Sets a maximum backoff after any request failure. The retry strategy employs exponential backoff. Default value is 30. |
| max-connection-pool-size | integer | Sets the maximum number of connections in the SDK's connection pool. Default value is 50. |
| disable-ssl | boolean | Set to true to turn off SSL verification. |
| proxies | object | Input the dictionary mapping from protocol to URL. |

proxies

Part of connection configuration. Input the dictionary mapping from protocol to URL.
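
For example, a connection section tuning retry behavior and routing HTTPS traffic through a proxy might look like this sketch; the proxy URL is a placeholder:

```yaml
cognite:
  connection:
    max-retries: 10
    max-retries-connect: 3
    max-retry-backoff: 30
    proxies:
      https: http://my-proxy.example.com:8080
```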

logger

Global parameter. Sets up logging to a console and files. This is an optional value.
| Parameter | Type | Description |
| --- | --- | --- |
| console | object | Include the console section to enable logging to standard output, such as a terminal window. |
| file | object | Include the file section to enable logging to a file. The files are rotated daily. |
| metrics | boolean | Enables metrics on the number of log messages recorded per logger and level. Configure metrics to retrieve the logs. |

console

Part of logger configuration. Include the console section to enable logging to standard output, such as a terminal window.
| Parameter | Type | Description |
| --- | --- | --- |
| level | either DEBUG, INFO, WARNING, ERROR or CRITICAL | Select the verbosity level for console logging. Valid options, from most to least verbose, are DEBUG, INFO, WARNING, ERROR, and CRITICAL. Default value is INFO. |

file

Part of logger configuration. Include the file section to enable logging to a file. The files are rotated daily.
| Parameter | Type | Description |
| --- | --- | --- |
| level | either DEBUG, INFO, WARNING, ERROR or CRITICAL | Select the verbosity level for file logging. Valid options, from most to least verbose, are DEBUG, INFO, WARNING, ERROR, and CRITICAL. Default value is INFO. |
| path | string | Required. Insert the path to the log file. |
| retention | integer | Specify the number of days to keep logs. Default value is 7. |
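
A logger section that prints informational messages to the console and keeps a week of debug-level log files might be sketched like this; the log file path is a placeholder:

```yaml
logger:
  console:
    level: INFO
  file:
    level: DEBUG
    path: logs/extractor.log
    retention: 7
```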

extractor

Global parameter. Contains the common extractor configuration.
| Parameter | Type | Description |
| --- | --- | --- |
| state-store | object | Include the state store section to save extraction states between runs. Use a state store if data is loaded incrementally. We support multiple state stores, but you can only configure one at a time. |
| subscription-batch-size | integer | Input the batch size for time series subscriptions. Default value is 10000. |
| ingest-batch-size | integer | Input the batch size for time series ingestion. Default value is 100000. |
| fabric-ingest-batch-size | integer | Input the batch size for ingestion into Fabric. Default value is 1000. |
| poll-time | integer | Enter the time in seconds to wait between polling for new data. Default value is 3600. |

state-store

Part of extractor configuration. Include the state store section to save extraction states between runs. Use a state store if data is loaded incrementally. We support multiple state stores, but you can only configure one at a time.
| Parameter | Type | Description |
| --- | --- | --- |
| raw | object | Stores the extraction state in a table in CDF RAW. |
| local | object | Stores the extraction state in a JSON file on the local machine. |

raw

Part of state-store configuration. Stores the extraction state in a table in CDF RAW.
| Parameter | Type | Description |
| --- | --- | --- |
| database | string | Required. Enter the database name in CDF RAW. |
| table | string | Required. Enter the table name in CDF RAW. |
| upload-interval | integer | Enter the interval in seconds between each upload to CDF RAW. Default value is 30. |

local

Part of state-store configuration. Stores the extraction state in a JSON file on the local machine.
| Parameter | Type | Description |
| --- | --- | --- |
| path | string | Required. Insert the file path to a JSON file. |
| save-interval | integer | Enter the interval in seconds between each save. Default value is 30. |
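
For example, an extractor section that polls hourly and persists state to a table in CDF RAW between runs might look like this sketch; the database and table names are placeholders:

```yaml
extractor:
  poll-time: 3600
  state-store:
    raw:
      database: extractor-state
      table: fabric-extractor
      upload-interval: 30
```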

source

Global parameter. Insert the source configuration for data mirrored from Fabric.
| Parameter | Type | Description |
| --- | --- | --- |
| abfss-prefix | string | Input the ABFSS prefix for the data lake. |
| data-set-id | string | Input the data set ID. |
| event-path | string | Enter the folder, combined with the ABFSS prefix, holding the event data. |
| event-path-incremental-field | string | Input the field for incremental loading. |
| raw-time-series-path | string | Enter the folder, combined with the ABFSS prefix, holding the raw time series data. |
| read-batch-size | integer | Input the batch size for reading data from Fabric. |
| file-path | string | Input the file path for the file data. |
| raw-tables | list | Input the list of raw tables to be ingested. |
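
As an illustration, a source section might be sketched as follows; the ABFSS prefix, data set ID, and folder names are placeholders for your own lakehouse layout:

```yaml
source:
  abfss-prefix: abfss://my-workspace@onelake.dfs.fabric.microsoft.com/my-lakehouse.Lakehouse
  data-set-id: my-data-set
  event-path: Tables/events
  event-path-incremental-field: lastUpdatedTime
  raw-time-series-path: Tables/raw-timeseries
  read-batch-size: 1000
```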

raw-tables

Part of source configuration. Input the list of raw tables to be ingested. Each element of this list should be a CDF RAW table configuration.
| Parameter | Type | Description |
| --- | --- | --- |
| table-name | string | Enter the name of the RAW table in CDF to store rows. |
| db-name | string | Enter the database name in CDF to store the rows. |
| raw-path | string | Input the subpath in the lakehouse to read rows from. |
| incremental-field | string | Input the field for incremental loading. This value is normally a timestamp. |
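
For example, one raw table entry under source could be sketched like this; the table, database, path, and field names are placeholders:

```yaml
source:
  raw-tables:
    - table-name: assets
      db-name: my-database
      raw-path: Tables/assets
      incremental-field: lastUpdatedTime
```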

destination

Global parameter. Insert the destination configuration for time series data mirrored from Fabric.
| Parameter | Type | Description |
| --- | --- | --- |
| time-series-prefix | string | Enter the prefix to add to CDF time series external IDs created from Fabric. |

subscriptions

Global parameter. Insert the time series subscriptions configuration for time series mirrored to Fabric. Each element of this list should be a CDF time series subscription sync configuration.
| Parameter | Type | Description |
| --- | --- | --- |
| external-id | string | Input the external ID of the time series subscription. |
| partitions | list | Enter the list of partitions to be ingested. |
| lakehouse-abfss-path-dps | string | Input the ABFSS path to the data points. |
| lakehouse-abfss-path-ts | string | Input the ABFSS path to the time series. |
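
For example, a subscriptions entry might be sketched as follows; the subscription external ID and ABFSS paths are placeholders:

```yaml
subscriptions:
  - external-id: my-subscription
    partitions:
      - 0
      - 1
    lakehouse-abfss-path-dps: abfss://my-workspace@onelake.dfs.fabric.microsoft.com/my-lakehouse.Lakehouse/Tables/datapoints
    lakehouse-abfss-path-ts: abfss://my-workspace@onelake.dfs.fabric.microsoft.com/my-lakehouse.Lakehouse/Tables/timeseries
```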

partitions

Part of subscriptions configuration. Enter the list of partitions to be ingested. Each element of this list should be an integer.

data-modeling

Global parameter. Insert the data modeling configuration for syncing a data model to Fabric. Each element of this list should be a data modeling sync configuration.
| Parameter | Type | Description |
| --- | --- | --- |
| space | string | Enter the data modeling space name to synchronize to Fabric. |
| lakehouse-abfss-prefix | string | Enter the full ABFSS prefix for a folder in the lakehouse. |
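
For example, a data-modeling entry might look like this sketch; the space name and ABFSS prefix are placeholders:

```yaml
data-modeling:
  - space: my-space
    lakehouse-abfss-prefix: abfss://my-workspace@onelake.dfs.fabric.microsoft.com/my-lakehouse.Lakehouse/Tables
```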

event

Global parameter. Enter the event configuration for mirroring events to Fabric.
| Parameter | Type | Description |
| --- | --- | --- |
| lakehouse-abfss-path-events | string | Input the path to the table in the lakehouse to store CDF events. |
| batch-size | integer | Input the number of events to read in a single batch from CDF. |
| dataset_external_id | string | Input the external ID of the data set from which to pull events in CDF. This parameter is optional. |
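
For example, an event section might be sketched like this; the ABFSS path and data set external ID are placeholders:

```yaml
event:
  lakehouse-abfss-path-events: abfss://my-workspace@onelake.dfs.fabric.microsoft.com/my-lakehouse.Lakehouse/Tables/events
  batch-size: 1000
  dataset_external_id: my-data-set
```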

raw-tables

Global parameter. Enter the raw tables configuration to mirror the raw tables to Fabric. Each element of this list should be a CDF RAW table configuration to be synced to Fabric.
| Parameter | Type | Description |
| --- | --- | --- |
| table-name | string | Enter the name of the RAW table in CDF to sync to Fabric. |
| db-name | string | Enter the database name in CDF to sync to Fabric. |
| lakehouse-abfss-path-raw | string | Enter the full ABFSS path of the table to store RAW rows into. |
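
For example, a top-level raw-tables entry might be sketched as follows; the table name, database name, and ABFSS path are placeholders:

```yaml
raw-tables:
  - table-name: assets
    db-name: my-database
    lakehouse-abfss-path-raw: abfss://my-workspace@onelake.dfs.fabric.microsoft.com/my-lakehouse.Lakehouse/Tables/assets
```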