Configuration settings

To configure the File extractor, you must create a configuration file. The file must be in YAML format.

Tip

You can set up extraction pipelines to use versioned extractor configuration files stored in the cloud.

Using values from environment variables

The configuration file allows substitutions with environment variables. For example:

cognite:
  secret: ${COGNITE_CLIENT_SECRET}

This loads the value of the COGNITE_CLIENT_SECRET environment variable into the cognite/secret parameter. You can also interpolate environment variables into a longer string, for example:

url: http://my-host.com/api/endpoint?secret=${MY_SECRET_TOKEN}

Note

Implicit substitutions only work for unquoted value strings. For quoted strings, use the !env tag to activate environment substitution:

url: !env 'http://my-host.com/api/endpoint?secret=${MY_SECRET_TOKEN}'

Using values from Azure Key Vault

The File extractor also supports loading values from Azure Key Vault. To load a configuration value from Azure Key Vault, use the !keyvault tag followed by the name of the secret you want to load. For example, to load the value of the my-secret-name secret in Key Vault into a password parameter, configure your extractor like this:

password: !keyvault my-secret-name

To use Key Vault, you also need to include the azure-keyvault section in your configuration, with the following parameters:

| Parameter | Description |
| --- | --- |
| keyvault-name | Name of the Key Vault to load secrets from. |
| authentication-method | How to authenticate to Azure. Either default or client-secret. For default, the extractor looks at the user running the extractor and looks for pre-configured Azure logins from tools like the Azure CLI. For client-secret, the extractor authenticates with a configured client ID/secret pair. |
| client-id | Required for the client-secret authentication method. The client ID to use when authenticating to Azure. |
| secret | Required for the client-secret authentication method. The client secret to use when authenticating to Azure. |
| tenant-id | Required for the client-secret authentication method. The tenant ID of the Key Vault in Azure. |

Example:

azure-keyvault:
  keyvault-name: my-keyvault-name
  authentication-method: client-secret
  tenant-id: 6f3f324e-5bfc-4f12-9abe-22ac56e2e648
  client-id: 6b4cc73e-ee58-4b61-ba43-83c4ba639be6
  secret: 1234abcd
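When the extractor runs as a user who is already logged in to Azure, for example through the Azure CLI, the default authentication method should need only the vault name. The sketch below illustrates this and shows a !keyvault reference elsewhere in the file. The vault name and the secret name ftp-password are placeholders, not values from a real deployment.

```yaml
azure-keyvault:
  keyvault-name: my-keyvault-name     # placeholder vault name
  authentication-method: default      # reuse a pre-configured Azure login

files:
  file-provider:
    type: ftp
    host: ftp.myserver.com
    username: username
    password: !keyvault ftp-password  # hypothetical secret name in the vault
```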

Base configuration object

| Parameter | Type | Description |
| --- | --- | --- |
| version | either string or integer | Configuration file version. |
| type | either local or remote | Configuration file type. Either local, meaning the full config is loaded from this file, or remote, meaning that only the cognite section is loaded from this file and the rest is loaded from extraction pipelines. Default value is local. |
| cognite | object | The cognite section describes which CDF project the extractor will load data into and how to connect to the project. |
| logger | object | The optional logger section sets up logging to a console and files. |
| files | object | Configure files to be extracted to CDF. |
| extractor | object | General configuration for the file extractor. |
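As an orientation aid, the top-level sections fit together roughly as in the skeleton below. This is a sketch rather than a complete working configuration: the version value, the project name, and the log level are assumptions, and the cognite section still needs the authentication parameters described in its own section.

```yaml
version: 1            # assumed value; a string or an integer is accepted
type: local           # the default: the full config is read from this file

cognite:
  project: my-project # placeholder CDF project name
  # authentication and connection parameters go here

logger:
  console:
    level: INFO

files:
  file-provider:
    type: local
    path: /some/local/path

extractor:
  upload-queue-size: 10   # the documented default, shown for illustration
```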

cognite

Global parameter.

The cognite section describes which CDF project the extractor will load data into and how to connect to the project.

| Parameter | Type | Description |
| --- | --- | --- |
| project | string | Insert the CDF project name. |
| idp-authentication | object | The idp-authentication section enables the extractor to authenticate to CDF using an external identity provider (IdP), such as Microsoft Entra ID (formerly Azure Active Directory). |
| data-set | object | Enter a data set the extractor should write data into. |
| extraction-pipeline | object | Enter the extraction pipeline used for remote config and reporting statuses. |
| host | string | Insert the base URL of the CDF project. Default value is https://api.cognitedata.com. |
| timeout | integer | Enter the timeout on requests to CDF, in seconds. Default value is 30. |
| external-id-prefix | string | Prefix on external ID used when creating CDF resources. |
| connection | object | Configure network connection details. |

idp-authentication

Part of cognite configuration.

The idp-authentication section enables the extractor to authenticate to CDF using an external identity provider (IdP), such as Microsoft Entra ID (formerly Azure Active Directory).

| Parameter | Type | Description |
| --- | --- | --- |
| authority | string | Insert the authority together with tenant to authenticate against Azure tenants. Default value is https://login.microsoftonline.com/. |
| client-id | string | Required. Enter the service principal client ID from the IdP. |
| tenant | string | Enter the Azure tenant. |
| token-url | string | Insert the URL to fetch tokens from. |
| secret | string | Enter the service principal client secret from the IdP. |
| resource | string | Resource parameter passed along with token requests. |
| audience | string | Audience parameter passed along with token requests. |
| scopes | list | Enter a list of scopes requested for the token. |
| min-ttl | integer | Insert the minimum time in seconds a token will be valid. If the cached token expires in less than min-ttl seconds, it will be refreshed even if it is still valid. Default value is 30. |
| certificate | object | Authenticate with a client certificate. |
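A client-secret setup against an Azure tenant might look like the sketch below. The environment variable names and the project name are placeholders, and the scope shown is an assumption based on the default host; check the value your IdP expects.

```yaml
cognite:
  project: my-project                        # placeholder project name
  host: https://api.cognitedata.com          # the documented default
  idp-authentication:
    tenant: ${AZURE_TENANT_ID}               # assumed environment variable names
    client-id: ${COGNITE_CLIENT_ID}
    secret: ${COGNITE_CLIENT_SECRET}
    scopes:
      - https://api.cognitedata.com/.default # assumed scope, verify for your setup
    min-ttl: 30                              # the documented default
```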

scopes

Part of idp-authentication configuration.

Enter a list of scopes requested for the token

Each element of this list should be a string.

certificate

Part of idp-authentication configuration.

Authenticate with a client certificate

| Parameter | Type | Description |
| --- | --- | --- |
| authority-url | string | Authentication authority URL. |
| path | string | Required. Enter the path to the .pem or .pfx certificate to be used for authentication. |
| password | string | Enter the password for the key file, if it is encrypted. |
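To authenticate with a certificate instead of a client secret, the certificate section takes the place of the secret parameter. A minimal sketch, assuming an encrypted .pem file; the path, the environment variable names, and the scope are placeholders:

```yaml
cognite:
  project: my-project                        # placeholder project name
  idp-authentication:
    tenant: ${AZURE_TENANT_ID}               # assumed environment variable names
    client-id: ${COGNITE_CLIENT_ID}
    scopes:
      - https://api.cognitedata.com/.default # assumed scope, verify for your setup
    certificate:
      path: /path/to/certificate.pem         # hypothetical path
      password: ${CERTIFICATE_PASSWORD}      # only needed if the key file is encrypted
```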

data-set

Part of cognite configuration.

Enter a data set the extractor should write data into

| Parameter | Type | Description |
| --- | --- | --- |
| id | integer | Resource internal ID. |
| external-id | string | Resource external ID. |

extraction-pipeline

Part of cognite configuration.

Enter the extraction pipeline used for remote config and reporting statuses

| Parameter | Type | Description |
| --- | --- | --- |
| id | integer | Resource internal ID. |
| external-id | string | Resource external ID. |
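Both data-set and extraction-pipeline identify a CDF resource by internal or external ID. The sketch below references each by external ID in a remote configuration, where only the cognite section lives in the local file. The external IDs are placeholders, and it is an assumption that supplying one of the two identifiers is sufficient.

```yaml
type: remote                    # the rest of the config is loaded from the pipeline

cognite:
  project: my-project           # placeholder project name
  # idp-authentication goes here
  data-set:
    external-id: my-data-set                # hypothetical data set
  extraction-pipeline:
    external-id: my-file-extractor-pipeline # hypothetical pipeline
```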

connection

Part of cognite configuration.

Configure network connection details

| Parameter | Type | Description |
| --- | --- | --- |
| disable-gzip | boolean | Whether or not to disable gzipping of JSON bodies. |
| status-forcelist | string | HTTP status codes to retry. Defaults to 429, 502, 503 and 504. |
| max-retries | integer | Max number of retries on a given HTTP request. Default value is 10. |
| max-retries-connect | integer | Max number of retries on connection errors. Default value is 3. |
| max-retry-backoff | integer | The retry strategy employs exponential backoff. This parameter sets a max on the amount of backoff after any request failure. Default value is 30. |
| max-connection-pool-size | integer | The maximum number of connections which will be kept in the SDK's connection pool. Default value is 50. |
| disable-ssl | boolean | Whether or not to disable SSL verification. |
| proxies | object | Dictionary mapping from protocol to URL. |

proxies

Part of connection configuration.

Dictionary mapping from protocol to url.
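For illustration, a connection section that lowers the retry count and routes traffic through a proxy could look like this. The proxy address is hypothetical, and the protocol keys http and https are assumed to be the ones the extractor recognizes.

```yaml
cognite:
  connection:
    max-retries: 5            # fewer retries than the default of 10
    max-retries-connect: 3    # the documented default
    proxies:
      http: http://proxy.example.com:8080   # hypothetical proxy address
      https: http://proxy.example.com:8080
```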

logger

Global parameter.

The optional logger section sets up logging to a console and files.

| Parameter | Type | Description |
| --- | --- | --- |
| console | object | Include the console section to enable logging to a standard output, such as a terminal window. |
| file | object | Include the file section to enable logging to a file. The files are rotated daily. |
| metrics | boolean | Enables metrics on the number of log messages recorded per logger and level. This requires metrics to be configured as well. |

console

Part of logger configuration.

Include the console section to enable logging to a standard output, such as a terminal window.

| Parameter | Type | Description |
| --- | --- | --- |
| level | either DEBUG, INFO, WARNING, ERROR or CRITICAL | Select the verbosity level for console logging. Valid options, in decreasing verbosity levels, are DEBUG, INFO, WARNING, ERROR, and CRITICAL. Default value is INFO. |

file

Part of logger configuration.

Include the file section to enable logging to a file. The files are rotated daily.

| Parameter | Type | Description |
| --- | --- | --- |
| level | either DEBUG, INFO, WARNING, ERROR or CRITICAL | Select the verbosity level for file logging. Valid options, in decreasing verbosity levels, are DEBUG, INFO, WARNING, ERROR, and CRITICAL. Default value is INFO. |
| path | string | Required. Insert the path to the log file. |
| retention | integer | Specify the number of days to keep logs for. Default value is 7. |
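A logger section that combines both outputs might look like the following sketch, with terse console output and more detailed file logs kept for two weeks. The log file path is a placeholder.

```yaml
logger:
  console:
    level: INFO
  file:
    level: DEBUG
    path: logs/file-extractor.log   # hypothetical path
    retention: 14                   # days; the default is 7
```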

files

Global parameter.

Configure files to be extracted to CDF.

| Parameter | Type | Description |
| --- | --- | --- |
| file-provider | configuration for either Local Files, Sharepoint Online, FTP/FTPS, SFTP, GCP Cloud Storage, Azure Blob Storage, Samba, AWS S3 or Documentum | Configure a file provider for where the files are extracted from. |
| extensions | list | List of file extensions to include. If left out, all file extensions will be allowed. |
| labels | list | List of label external IDs to add to extracted files. |
| security-categories | list | List of security category IDs to add to extracted files. |
| max-file-size | either string or number | Maximum file size of files to include. Set to -1 to allow any file size. Syntax is a number followed by a unit: KB, MB, GB, TB, KiB, MiB, GiB or TiB. Note that the extractor supports files up to 1000GiB. Default value is 100GiB. |
| with-metadata | boolean | Add metadata extracted from the file source to files in CDF. |
| directory-prefix | string | Prefix to add to all extracted file directories. |
| metadata-to-raw | object | If this is configured, write metadata to a table in CDF Raw instead of files. |
| filter | configuration for either And, Or, Not, Equals or In | |
| delete-behavior | object | What the extractor should do with files that were removed from the source. |
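The general files parameters sit alongside the file provider. The sketch below caps the file size, keeps source metadata, and tags every extracted file. The label external ID, the security category ID, and the directory prefix are placeholders.

```yaml
files:
  file-provider:
    type: local
    path: /some/local/path
  max-file-size: 500MB        # number followed by a unit
  with-metadata: true
  directory-prefix: /extracted        # hypothetical prefix
  labels:
    - my-label-external-id            # hypothetical label external ID
  security-categories:
    - 123456789                       # hypothetical security category ID
```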

file-provider

Part of files configuration.

Configure a file provider for where the files are extracted from.

Either one of the following options:

local_files

Part of file-provider configuration.

Read files from a local folder. This file provider will recursively traverse the given path and extract all discovered files.

Example:

type: local
path: /some/local/path

| Parameter | Type | Description |
| --- | --- | --- |
| type | string | Select the type of file provider. Set to local for local files. |
| path | string | Required. Enter the path (absolute or relative) where the local files are located. |
| ignore_patterns | string | Any file path that matches this pattern will be ignored. |

sharepoint_online

Part of file-provider configuration.

Read files from one or more sharepoint online sites.

Example:

type: sharepoint_online
client-id: ${SP_CLIENT_ID}
client-secret: ${SP_CLIENT_SECRET}
tenant-id: ${SP_AZURE_TENANT_ID}
paths:
  - url: ${SP_EXTRACT_URL}

| Parameter | Type | Description |
| --- | --- | --- |
| type | string | Required. Select the type of file provider. Set to sharepoint_online for Sharepoint Online. |
| client-id | string | Required. Enter the App registration client ID. |
| client-secret | string | Enter the App registration secret. |
| certificate-path | string | Enter the path to a certificate used for authentication. Either this, client-secret, or certificate-data must be specified. |
| certificate-data | string | Provide authentication certificate data directly. |
| tenant-id | string | Required. Enter the Azure tenant containing the App registration. |
| paths | list | Required. Enter a list of Sharepoint base URLs to extract from. |
| datetime-format | string | Format string for timestamp metadata. Default value is %Y-%m-%dT%H:%M:%SZ. |
| extract-columns | object | Extract Sharepoint columns as metadata. This is a map from column names in Sharepoint to the names you want the extracted columns to have in file metadata in CDF. Example: {'columnNameInSharepoint': 'metadataNameInCdf'} |
| restrict-to | object | Restrict extraction to files visible to a given Sharepoint Group or SiteUser. Important: To use this, the extractor MUST authenticate to Sharepoint Online using a certificate, NOT a client secret. |
| ignore_patterns | string | Any file path that matches this pattern will be ignored. |

paths

Part of sharepoint_online configuration.

Enter a list of sharepoint base URLs to extract from.

Each element of this list should be a Sharepoint base URL to extract from.

| Parameter | Type | Description |
| --- | --- | --- |
| url | string | Required. URL to the Sharepoint location you want to extract from. This can be the URL to a site, a document library, or to a file or folder inside the library. |
| recursive | boolean | Whether to traverse into subfolders or not, for this path. Default value is True. |

extract-columns

Part of sharepoint_online configuration.

Extract Sharepoint columns as metadata. This is a map from column names in Sharepoint to the name you want to extracted columns to have in file metadata in CDF.

Example:

columnNameInSharepoint: metadataNameInCdf

| Parameter | Type | Description |
| --- | --- | --- |
| Any string | string | Name of metadata field in CDF. |

restrict-to

Part of sharepoint_online configuration.

Restrict to extract only files visible to a given Sharepoint Group or SiteUser. Important: In order to use this, the extractor MUST authenticate to Sharepoint online using a certificate, NOT a client secret.

| Parameter | Type | Description |
| --- | --- | --- |
| group-id | string | The ID of a Sharepoint Group. |
| username | string | The "login name" of a SiteUser. This can be useful if you, for example, want to restrict extraction to the built-in "all users" SiteUser. |
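Because restrict-to requires certificate authentication, a provider that uses it replaces client-secret with certificate-path. The sketch below combines this with a non-recursive path and a column mapping. The certificate path and the SP_GROUP_ID environment variable are placeholders.

```yaml
files:
  file-provider:
    type: sharepoint_online
    client-id: ${SP_CLIENT_ID}
    tenant-id: ${SP_AZURE_TENANT_ID}
    certificate-path: /path/to/certificate.pem  # hypothetical path
    paths:
      - url: ${SP_EXTRACT_URL}
        recursive: false        # do not traverse into subfolders
    extract-columns:
      columnNameInSharepoint: metadataNameInCdf
    restrict-to:
      group-id: ${SP_GROUP_ID}  # assumed environment variable name
```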

ftp/ftps

Part of file-provider configuration.

Read files from an FTP server.

Example:

type: ftp
host: ftp.myserver.com
username: username
password: ${FTP_PASSWORD}

| Parameter | Type | Description |
| --- | --- | --- |
| type | string | Required. Select the type of file provider. Set to ftp for FTP. |
| host | string | Required. Host name for the FTP server. Example: ftp.myserver.com |
| port | integer | FTP server port. Default value is 21. |
| username | string | Required. Username to use to log in to the FTP server. |
| password | string | Required. Password to use to log in to the FTP server. |
| root-directory | string | Root folder for extraction. Default value is /. |
| recursive | boolean | Whether to recursively traverse sub-folders for files. Default value is True. |
| use-ssl | boolean | Whether to connect using FTPS (FTP with SSL/TLS). Using SSL is strongly recommended. |
| certificate-file-path | string | Path to SSL certificate authority certificate for the FTP server, useful if using self-signed certificates. |
| ignore_patterns | string | Any file path that matches this pattern will be ignored. |
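Since SSL is strongly recommended, an FTPS variant of the provider is sketched below for a server that uses a self-signed certificate. The root directory and the certificate path are placeholders.

```yaml
files:
  file-provider:
    type: ftp
    host: ftp.myserver.com
    port: 21                    # the documented default
    username: username
    password: ${FTP_PASSWORD}
    use-ssl: true               # connect using FTPS
    certificate-file-path: /path/to/ca-certificate.pem  # hypothetical path
    root-directory: /exports    # hypothetical folder
    recursive: false
```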

sftp

Part of file-provider configuration.

Read files from an SFTP server, file transfer over SSH.

Example:

type: sftp
host: ftp.myserver.com
username: username
password: ${FTP_PASSWORD}

| Parameter | Type | Description |
| --- | --- | --- |
| type | string | Required. Select the type of file provider. Set to sftp for SFTP. |
| host | string | Required. Host name for the SSH server. Example: ftp.myserver.com |
| port | integer | SSH server port. Default value is 22. |
| username | string | Required. Username to use to log in to the SSH server. |
| password | string | Password to use to log in to the SSH server. Either password or key-path is required. |
| key-path | string | Path to SSH private key for the connection. Either password or key-path is required. |
| key-password | string | Password for SSH private key if the key is encrypted. Only used in combination with key-path. |
| root-directory | string | Root folder for extraction. Default value is /. |
| recursive | boolean | Whether to recursively traverse sub-folders for files. Default value is True. |
| ignore_patterns | string | Any file path that matches this pattern will be ignored. |
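Key-based login replaces password with key-path, plus key-password when the private key is encrypted. A minimal sketch, where the key location, the environment variable name, and the root directory are placeholders:

```yaml
files:
  file-provider:
    type: sftp
    host: ftp.myserver.com
    username: username
    key-path: /home/extractor/.ssh/id_rsa   # hypothetical key location
    key-password: ${SSH_KEY_PASSWORD}       # only needed for an encrypted key
    root-directory: /exports                # hypothetical folder
```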

gcp_cloud_storage

Part of file-provider configuration.

Read files from a GCP Cloud Storage bucket.

Example:

type: gcp_cloud_storage
google-application-credentials: ${GOOGLE_APPLICATION_CREDENTIALS}
bucket: bucket_name
folders:
  - list
  - of
  - folders

| Parameter | Type | Description |
| --- | --- | --- |
| type | string | Select the type of file provider. Set to gcp_cloud_storage for GCP Cloud Storage. |
| google-application-credentials | string | Base-64 encoded GCP service account credentials. |
| bucket | string | Name of GCP Cloud Storage bucket to fetch files from. |
| folders | list | List of folders in bucket to fetch files from. |
| ignore_patterns | string | Any file path that matches this pattern will be ignored. |

folders

Part of gcp_cloud_storage configuration.

List of folders in bucket to fetch files from.

Each element of this list should be a string.

azure_blob_storage

Part of file-provider configuration.

Read files from an Azure Blob Store.

Example:

type: azure_blob_storage
connection-string: ${AZURE_BLOB_STORAGE_CONNECTION_STRING}

| Parameter | Type | Description |
| --- | --- | --- |
| type | string | Required. Select the type of file provider. Set to azure_blob_storage for Azure Blob Storage. |
| connection-string | string | Required. Azure Blob Storage connection string. |
| containers | list | Optional list of containers to extract from. If left out or empty, all files will be read. |
| ignore_patterns | string | Any file path that matches this pattern will be ignored. |

containers

Part of azure_blob_storage configuration.

Optional list of containers to extract from. If left out or empty, all files will be read.

Each element of this list should be a string.
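To limit extraction to specific containers, list them by name. In this sketch the container names are placeholders.

```yaml
files:
  file-provider:
    type: azure_blob_storage
    connection-string: ${AZURE_BLOB_STORAGE_CONNECTION_STRING}
    containers:
      - documents   # hypothetical container names
      - drawings
```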

samba

Part of file-provider configuration.

Read files from a Samba file share.

Example:

type: smb
server: my.server.com
share-path: \\server\share_path
username: username
password: ${SMB_PASSWORD}

| Parameter | Type | Description |
| --- | --- | --- |
| type | string | Required. Select the type of file provider. Set to smb for Samba. |
| share-path | string | Required. Share path, i.e. name of path shared. Example: \\server\share_path |
| port | integer | Port to connect to on the Samba server. Default value is 445. |
| username | string | Username for authentication. |
| password | string | Password for authentication. |
| require-signing | boolean | Whether signing is required on messages sent to and received from the Samba server. Default value is True. |
| domain-controller | string | The domain controller hostname. When set, the file provider will send a DFS referral request to this hostname to populate the domain cache used for DFS connections or when connecting to SYSVOL or NETLOGON. |
| skip-dfs | boolean | Whether to skip using any DFS referral checks and treat any path as a normal path. This is only useful if there are problems with the DFS resolver or you wish to avoid the extra round trip(s) the resolver requires. |
| auth-protocol | either negotiate, kerberos or ntlm | The protocol to use for authentication. Default value is negotiate. |
| require-secure-negotiate | boolean | Whether to verify the negotiated dialects and capabilities on the connection to a share to protect against man-in-the-middle downgrade attacks. Default value is True. |
| ignore_patterns | string | Any file path that matches this pattern will be ignored. |

aws_s3

Part of file-provider configuration.

Read files from an AWS S3 cloud bucket.

Example:

type: aws_s3
bucket: some_bucket
aws-access-key-id: ${AWS_ACCESS_KEY_ID}
aws-secret-access-key: ${AWS_SECRET_ACCESS_KEY}

| Parameter | Type | Description |
| --- | --- | --- |
| type | string | Required. Select the type of file provider. Set to aws_s3 for AWS S3. |
| bucket | string | Required. AWS S3 cloud bucket to read files from. Valid formats are bucket-name or bucket-name/subfolder1/subfolder2. |
| aws-access-key-id | string | AWS access key ID to use for authentication. If left out, use default authentication configured on the machine the extractor is running on. |
| aws-secret-access-key | string | AWS secret access key to use for authentication. If left out, use default authentication configured on the machine the extractor is running on. |
| ignore_patterns | string | Any file path that matches this pattern will be ignored. |
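When the machine running the extractor already has AWS authentication configured, both key parameters can be left out. The sketch below also uses the subfolder form of the bucket parameter; the bucket and folder names are placeholders.

```yaml
files:
  file-provider:
    type: aws_s3
    # no access key parameters: fall back to the machine's default AWS authentication
    bucket: bucket-name/subfolder1/subfolder2   # placeholder bucket and subfolders
```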

documentum

Part of file-provider configuration.

Read files from an OpenText Documentum server.

Example:

type: documentum
base-url: https://my.documentum.server/dctm
username: username
password: ${DOCUMENTUM_PASSWORD}
repositories:
  - name: some_repo
    query: SELECT * FROM tech_document WHERE object_name LIKE 'PREFIX%' ORDER BY r_object_id

| Parameter | Type | Description |
| --- | --- | --- |
| type | string | Required. Select the type of file provider. Set to documentum to read from a Documentum server. |
| base-url | string | Required. URL of the Documentum server to read from. |
| username | string | Required. Documentum server username. |
| password | string | Required. Documentum server password. |
| repositories | list | Required. List of Documentum repositories to read from, and the query to make towards each repository. |
| ssl-verify | boolean | Enable SSL certificate verification. Warning: Disabling SSL verification is a potential security risk. Do not use this option over the internet, only on local networks secured through other means. Default value is True. |
| get-all-renditions | boolean | Get all renditions of each document, not just the primary. |
| external-id-separator | string | Text to use as a separator between the different parts of the created external IDs. Default value is "." (a single period). |
| include-extension-in-file-names | boolean | Include the file extension in the name of the file in CDF. For example, upload it as My File.pdf instead of My File. |
| timeout | string | Timeout on queries to Documentum, written as a number followed by a unit: s, m, h or d. Default value is 5m. |
| field-map | object | Map Documentum metadata to fields on files in CDF. For most Documentum deployments, the defaults should not need to be changed. |
| performance | object | Configuration to tune the parallelism of the file extractor. |
| ignore_patterns | string | Any file path that matches this pattern will be ignored. |

repositories

Part of documentum configuration.

List of documentum repositories to read from, and the query to make towards each repository.

Each element of this list should be a Documentum repository the file extractor should read from, and the query it should use.

| Parameter | Type | Description |
| --- | --- | --- |
| name | string | Required. Name of the Documentum repository. |
| query | string | Required. Query to retrieve file info from the repository. Warning: The query must return a consistent ordering. To ensure this, add an ORDER BY clause for some ID field. Example: SELECT * FROM tech_document WHERE object_name LIKE 'PREFIX%' ORDER BY r_object_id |

field-map

Part of documentum configuration.

Map documentum metadata to fields on files in CDF. For most Documentum deployments, the defaults should not need to be changed.

| Parameter | Type | Description |
| --- | --- | --- |
| external-id | list | Map Documentum columns to external ID. This is required in order to correctly track files in CDF. |
| name | list | Map Documentum columns to file name. |
| file-extension | list | Map Documentum columns to file extension. |
| modify-date | list | Map Documentum columns to source modified time in CDF. |
| mime-type | list | Map Documentum columns to MIME type. |

external-id

Part of field-map configuration.

Map documentum columns to external ID. This is required in order to correctly track files in CDF.

Each element of this list should be a string.

name

Part of field-map configuration.

Map documentum columns to file name.

Each element of this list should be a string.

file-extension

Part of field-map configuration.

Map documentum columns to file extension.

Each element of this list should be a string.

modify-date

Part of field-map configuration.

Map documentum columns to source modified time in CDF.

Each element of this list should be a string.

mime-type

Part of field-map configuration.

Map documentum columns to mime type.

Each element of this list should be a string.

performance

Part of documentum configuration.

Configuration to tune the parallelism of the file extractor.

| Parameter | Type | Description |
| --- | --- | --- |
| workers | integer | Number of parallel workers used to read from Documentum. Default value is 30. |
| document-buffer | integer | Number of document metadata instances to buffer before uploading the file contents to CDF. Default value is 60. |
| page-buffer | integer | Number of document metadata pages to buffer. Default value is 5. |
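If the Documentum server struggles under the default load, the performance section can be turned down. The values below are illustrative only, not recommendations, and the longer timeout is likewise an assumption about what a slow server might need.

```yaml
files:
  file-provider:
    type: documentum
    base-url: https://my.documentum.server/dctm
    username: username
    password: ${DOCUMENTUM_PASSWORD}
    repositories:
      - name: some_repo
        query: SELECT * FROM tech_document WHERE object_name LIKE 'PREFIX%' ORDER BY r_object_id
    timeout: 10m            # illustrative; the default is 5m
    performance:
      workers: 10           # illustrative; the default is 30
      document-buffer: 20   # illustrative; the default is 60
      page-buffer: 2        # illustrative; the default is 5
```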

extensions

Part of files configuration.

List of file extensions to include. If left out, all file extensions will be allowed.

Each element of this list should be a string.

labels

Part of files configuration.

List of label external IDs to add to extracted files.

Each element of this list should be a string.

security-categories

Part of files configuration.

List of security category IDs to add to extracted files.

Each element of this list should be an integer.

metadata-to-raw

Part of files configuration.

If this is configured, write metadata to a table in CDF Raw instead of files.

| Parameter | Type | Description |
| --- | --- | --- |
| database | string | Required. Write file metadata to this Raw database. |
| table | string | Required. Write file metadata to this Raw table. |
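A sketch of the section, with placeholder database and table names:

```yaml
files:
  metadata-to-raw:
    database: file_extractor   # hypothetical Raw database name
    table: file_metadata       # hypothetical Raw table name
```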

filter

Part of files configuration.

Either one of the following options:

Example:

and:
  - equals:
      property: some_metadata_field
      value: some_metadata_value
  - not:
      in:
        property: some_metadata_field
        values:
          - metadata
          - values

and

Part of filter configuration.

Matches if all sub filters match.

| Parameter | Type | Description |
| --- | --- | --- |
| and | list | Required. List of sub filters, all of these must match. |

and

Part of and configuration.

List of sub filters, all of these must match.

Each element of this list should be a configuration for either And, Or, Not, Equals or In.

or

Part of filter configuration.

Matches if any of the sub filters match.

| Parameter | Type | Description |
| --- | --- | --- |
| or | list | Required. List of sub filters, at least one of these must match. |

or

Part of or configuration.

List of sub filters, at least one of these must match.

Each element of this list should be a configuration for either And, Or, Not, Equals or In.

not

Part of filter configuration.

Matches if the sub filter does not match.

| Parameter | Type | Description |
| --- | --- | --- |
| not | configuration for either And, Or, Not, Equals or In | |

equals

Part of filter configuration.

Matches if the property on the file is equal to the given value.

| Parameter | Type | Description |
| --- | --- | --- |
| equals | object | Required. Equality filter. |

equals

Part of equals configuration.

Equality filter.

| Parameter | Type | Description |
| --- | --- | --- |
| property | string | Required. File property name. |
| value | string | Required. Property value to match. |

in

Part of filter configuration.

Matches if the property on the file is equal to one of the given values.

| Parameter | Type | Description |
| --- | --- | --- |
| in | object | Required. In filter. |

in

Part of in configuration.

In filter.

| Parameter | Type | Description |
| --- | --- | --- |
| property | string | Required. File property name. |
| values | list | Property values. One of these must match. |

values

Part of in configuration.

Property values. One of these must match.

Each element of this list should be a string.
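Filters nest freely, so an or filter can combine the equals and in forms described above. In this sketch the property names and values are placeholders.

```yaml
files:
  filter:
    or:
      # matches when the property equals the single given value
      - equals:
          property: some_metadata_field
          value: some_metadata_value
      # matches when the property equals any of the listed values
      - in:
          property: another_metadata_field
          values:
            - first_value
            - second_value
```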

delete-behavior

Part of files configuration.

What the extractor should do with files that were removed from the source.

| Parameter | Type | Description |
| --- | --- | --- |
| mode | either soft or hard | Configure how deleted files are treated. soft means that a metadata field is added to the file, given by key. hard means that the file is deleted from CDF. |
| key | string | Metadata field to add to the deleted file in CDF. Default value is deleted. |
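For example, to keep files in CDF and only mark them when they disappear from the source, a soft delete could be configured as sketched here. The key shown is the documented default and can be omitted.

```yaml
files:
  delete-behavior:
    mode: soft     # mark the file instead of deleting it from CDF
    key: deleted   # metadata field added to the removed file (the default)
```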

extractor

Global parameter.

General configuration for the file extractor.

| Parameter | Type | Description |
| --- | --- | --- |
| state-store | object | Include the state store section to save extraction states between runs. Use this if data is loaded incrementally. We support multiple state stores, but you can only configure one at a time. |
| errors-threshold | integer | Maximum number of retries for fallible operations in the extractor. Default value is 5. |
| upload-queue-size | integer | Maximum number of files in the upload queue at a time. Default value is 10. |
| parallelism | integer | Maximum number of files to upload to CDF in parallel. Note that files are streamed directly from the source, so this is also the number of parallel downloads. Default value is 4. |
| schedule | configuration for either Cron Expression or Interval | File extractor schedule. Examples: {'type': 'cron', 'expression': '*/30 * * * *'} and {'type': 'interval', 'expression': '10m'} |

state-store

Part of extractor configuration.

Include the state store section to save extraction states between runs. Use this if data is loaded incrementally. We support multiple state stores, but you can only configure one at a time.

| Parameter | Type | Description |
| --- | --- | --- |
| raw | object | A RAW state store stores the extraction state in a table in CDF RAW. |
| local | object | A local state store stores the extraction state in a JSON file on the local machine. |

raw

Part of state-store configuration.

A RAW state store stores the extraction state in a table in CDF RAW.

| Parameter | Type | Description |
| --- | --- | --- |
| database | string | Required. Enter the database name in CDF RAW. |
| table | string | Required. Enter the table name in CDF RAW. |
| upload-interval | integer | Enter the interval in seconds between each upload to CDF RAW. Default value is 30. |

local

Part of state-store configuration.

A local state store stores the extraction state in a JSON file on the local machine.

| Parameter | Type | Description |
| --- | --- | --- |
| path | string | Required. Insert the file path to a JSON file. |
| save-interval | integer | Enter the interval in seconds between each save. Default value is 30. |
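Since only one state store can be configured at a time, an extractor section picks either raw or local. The sketch below uses a local JSON file, with the RAW alternative shown as a comment. The file path and the RAW database and table names are placeholders.

```yaml
extractor:
  state-store:
    local:
      path: state.json      # hypothetical file path
      save-interval: 30     # seconds; the documented default
    # Alternative (configure only one state store at a time):
    # raw:
    #   database: file_extractor_state   # hypothetical RAW database
    #   table: states                    # hypothetical RAW table
    #   upload-interval: 30
  parallelism: 4            # the documented default
```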

schedule

Part of extractor configuration.

File extractor schedule.

Either one of the following options:

Examples:

type: cron
expression: '*/30 * * * *'

type: interval
expression: 10m

cron_expression

Part of schedule configuration.

| Parameter | Type | Description |
| --- | --- | --- |
| type | string | |
| expression | string | Required. Cron expression schedule. Example: */30 * * * * |

interval

Part of schedule configuration.

| Parameter | Type | Description |
| --- | --- | --- |
| type | string | |
| expression | string | Required. Fixed time interval, written as a number followed by a unit: s, m, h or d. Examples: 10m, 3h |