
Configure the file extractor

To configure the file extractor, you must create a configuration file. The file must be in YAML format.

Minimal YAML configuration file

You can substitute environment variables into the configuration file. Any value written as ${NAME} is replaced with the value of the environment variable called NAME.


logger:
  console:
    level: INFO

cognite:
  # Read these from environment variables
  host: ${COGNITE_BASE_URL}
  project: ${COGNITE_PROJECT}
  data_set:
    external_id: equipment

  idp-authentication:
    token-url: ${COGNITE_TOKEN_URL}

    client-id: ${COGNITE_CLIENT_ID}
    secret: ${COGNITE_CLIENT_SECRET}
    scopes:
      - ${COGNITE_BASE_URL}/.default

files:
  # List of file extensions to extract
  extensions:
    - .pdf
    - .json
  # How often to check files for changes
  schedule-interval-sec: 300

  # Number of consecutive errors to tolerate before failing
  errors-threshold: 5

  # Optional list of CDF labels. Currently, only already existing labels can be used
  labels:
    - some_label

  # Limit to files smaller than
  max-file-size: 64Mb

  # Optional flag if available file metadata should be stored in CDF. By default, equal to false
  with-metadata: true

  # Where files should be extracted from.
  file-provider:
    type: local
    # path to extract files from
    path: "d:\\test"
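
As a small illustration of the environment variable substitution used in this file, suppose COGNITE_PROJECT is set to my-project and COGNITE_BASE_URL is set to https://api.cognitedata.com. The project name is a made-up value, and the base URL is simply the documented default for host. The comments in this sketch show the values the extractor would be expected to read after substitution.

```yaml
cognite:
  # As written in the configuration file
  project: ${COGNITE_PROJECT}
  # Expected effective value, assuming COGNITE_PROJECT=my-project:
  #   project: my-project

  idp-authentication:
    scopes:
      # A placeholder can also be part of a longer value
      - ${COGNITE_BASE_URL}/.default
      # Expected effective value, assuming
      # COGNITE_BASE_URL=https://api.cognitedata.com:
      #   - https://api.cognitedata.com/.default
```

Keeping secrets such as the client secret in environment variables means the configuration file itself never has to contain them.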

Logger

Include the logger section to set up logging to a console.

Parameter | Description
console | Enable logging to standard output, such as a terminal window.
level | Select the verbosity level for console logging. Valid options, in decreasing order of verbosity, are trace, debug, info, warning, error, fatal, and off. The default value is info.
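
For example, a logger section that raises console verbosity while troubleshooting might look like the sketch below. It writes the level in uppercase, following the style of the minimal configuration. Whether lowercase spellings are also accepted is not confirmed here.

```yaml
logger:
  console:
    # More verbose than the default (info); only trace is more detailed
    level: DEBUG
```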

Cognite

Include the cognite section to configure which CDF project the extractor will load data into and how to connect to the project. This section is mandatory and should always contain the project and authentication configuration.

Parameter | Description
host | Insert the base URL of the CDF project. The default value is https://api.cognitedata.com.
project | Insert the name of the CDF project you want to ingest data into. This is a required value.
data_set | Specify the data set to assign to files in CDF. The minimal configuration identifies it by external_id. This parameter is optional.

Identity provider (IdP) authentication

Include the idp-authentication subsection to enable the extractor to authenticate to CDF using an external identity provider, such as Azure AD.

Parameter | Description
token-url | Insert the URL to fetch tokens from. You must enter either a token URL or an Azure tenant.
client-id | Enter the client ID from the IdP. This is a required value.
secret | Enter the client secret from the IdP. This is a required value.
scopes | List the scopes. This is a required value.

Files

Include the files section to configure the extraction options.

Parameter | Description
extensions | List the file extensions you want to extract. If you include this parameter, only files with matching extensions are uploaded.
schedule-interval-sec | Enter how often, in seconds, the extractor checks files for changes.
errors-threshold | Enter the number of consecutive errors to tolerate before the extractor fails.
labels | Enter the CDF labels to attach to uploaded files. Only labels that already exist in CDF can be used. This parameter is optional.
max-file-size | Enter the maximum file size. Larger files are not uploaded. By default, there is no maximum.
with-metadata | Set to true to store available file metadata in CDF. The default value is false.
file-provider | Configure where files are extracted from. In the minimal configuration, type is set to local and path points to the directory to read files from.
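
As a sketch of how these parameters combine, the files section below uploads only PDF files from a local directory and checks for changes every 10 minutes. The directory path is hypothetical, and the 10Mb value assumes the size format generalizes from the 64Mb used in the minimal configuration. The labels and with-metadata parameters are left out because the table describes them as optional or as having a default.

```yaml
files:
  # Upload only PDF files
  extensions:
    - .pdf
  # Check for changes every 10 minutes
  schedule-interval-sec: 600
  # Fail after 5 consecutive errors
  errors-threshold: 5
  # Skip files larger than this size (format assumed from the minimal example)
  max-file-size: 10Mb
  file-provider:
    type: local
    # Hypothetical directory to extract files from
    path: /data/documents
```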