Configure the file extractor
To configure the file extractor, you must create a configuration file. The file must be in YAML format.
Minimal YAML configuration file
You can substitute environment variables into the configuration file. Any value written as ${NAME} is replaced with the value of the environment variable called NAME.
logger:
  console:
    level: INFO

cognite:
  # Read these from environment variables
  host: ${COGNITE_BASE_URL}
  project: ${COGNITE_PROJECT}
  data_set:
    external_id: equipment

  idp-authentication:
    token-url: ${COGNITE_TOKEN_URL}
    client-id: ${COGNITE_CLIENT_ID}
    secret: ${COGNITE_CLIENT_SECRET}
    scopes:
      - ${COGNITE_BASE_URL}/.default

files:
  # List of file extensions to extract
  extensions:
    - .pdf
    - .json

  # How often to check files for changes
  schedule-interval-sec: 300

  # Number of consecutive errors to tolerate before failing
  errors-threshold: 5

  # Optional list of CDF labels. Currently, only already existing labels can be used
  labels:
    - some_label

  # Limit to files smaller than
  max-file-size: 64Mb

  # Optional flag if available file metadata should be stored in CDF. By default, equal to false
  with-metadata: true

  # Where files should be extracted from.
  file-provider:
    type: local

    # path to extract files from
    path: "d:\\test"
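To make the ${} substitution concrete, here is a minimal Python sketch of the idea. It is not the extractor's own implementation: the function name substitute_env, the accepted variable-name pattern, and the choice to raise an error for an unset variable are all assumptions made for illustration.

```python
import os
import re

# Matches ${NAME}, where NAME looks like a typical environment variable name.
# The exact set of accepted names is an assumption of this sketch.
_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def substitute_env(text, env=None):
    """Replace each ${NAME} in text with the value of the variable NAME."""
    env = os.environ if env is None else env

    def lookup(match):
        name = match.group(1)
        if name not in env:
            # Assumption: fail loudly instead of inserting an empty string.
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _PATTERN.sub(lookup, text)
```

For example, calling substitute_env("host: ${COGNITE_BASE_URL}") with COGNITE_BASE_URL set to https://api.cognitedata.com yields the line "host: https://api.cognitedata.com". The same variable can appear more than once, as it does in the host and scopes entries of the sample configuration.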
Logger
Include the logger section to set up logging to a console.
Parameter | Description |
---|---|
console | Enable logging to standard output, such as a terminal window. |
level | Select the verbosity level for console logging. Valid options, in decreasing order of verbosity, are trace, debug, info, warning, error, fatal, and off. The default value is info. |
Cognite
Include the cognite section to configure which CDF project the extractor loads data into and how it connects to that project. This section is mandatory and must always contain the project and authentication configuration.
Parameter | Description |
---|---|
host | Insert the base URL of the CDF project. The default value is https://api.cognitedata.com. |
project | Insert the CDF project name you want to ingest data into. This is a required value. |
data_set | Specify the data set to assign to the files uploaded to CDF, for example by external_id as in the sample configuration. This parameter is optional. |
Identity provider (IdP) authentication
Include the idp-authentication subsection to enable the extractor to authenticate to CDF using an external identity provider, such as Azure AD.
Parameter | Description |
---|---|
token-url | Insert the URL to fetch tokens from. You must enter either a token URL or an Azure tenant. |
client-id | Enter the client ID from the IdP. This is a required value. |
secret | Enter the client secret from the IdP. This is a required value. |
scopes | List the scopes. This is a required value. |
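The four parameters above match the inputs of a standard OAuth 2.0 client credentials token request. The following Python sketch shows how such a request could be assembled from them. It assumes the client credentials grant, which the source does not state explicitly, and the function name build_token_request is hypothetical. The request is only built, never sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_token_request(token_url, client_id, secret, scopes):
    """Build (but do not send) an OAuth 2.0 client credentials token request."""
    body = urlencode(
        {
            # Assumption: the identity provider accepts the client credentials grant.
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": secret,
            # OAuth 2.0 sends multiple scopes as one space-separated string.
            "scope": " ".join(scopes),
        }
    ).encode("ascii")
    return Request(
        token_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )
```

A successful response from a real identity provider would contain an access token, which the client then presents to CDF. How the extractor caches or refreshes that token is outside the scope of this sketch.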
Files
Include the files section to configure the extraction options.
Parameter | Description |
---|---|
extensions | List the file extensions you want to extract. If you include this parameter, only files with matching extensions are uploaded. |
schedule-interval-sec | Enter how often, in seconds, the extractor checks files for changes. |
errors-threshold | Enter the number of consecutive errors to tolerate before the extractor fails. |
labels | Enter CDF labels to attach to uploaded files. Currently, only labels that already exist in CDF can be used. This parameter is optional. |
max-file-size | Enter the maximum size of files to upload, for example 64Mb. Larger files are skipped. By default, there is no maximum. |
with-metadata | Set to true to store available file metadata in CDF. The default value is false. |
file-provider | Configure where files are extracted from. In the sample configuration, type is local and path is the directory to extract files from. |
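As an illustration of how the extensions and max-file-size options narrow down what gets uploaded, here is a minimal Python sketch of the filtering logic for a local directory. It is not the extractor's code: the function name select_files, the recursive directory walk, the case-insensitive extension match, and passing the size limit in bytes are all assumptions of this example.

```python
from pathlib import Path


def select_files(root, extensions=None, max_size_bytes=None):
    """Return files under root that pass the extension and size filters."""
    allowed = {ext.lower() for ext in extensions} if extensions else None
    selected = []
    # Assumption: subdirectories are searched recursively.
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        # Assumption: extensions are compared case-insensitively.
        if allowed is not None and path.suffix.lower() not in allowed:
            continue
        # Assumption: a file exactly at the limit is still accepted.
        if max_size_bytes is not None and path.stat().st_size > max_size_bytes:
            continue
        selected.append(path)
    return selected
```

With extensions set to [".pdf", ".json"] and a size limit, a small PDF is selected while a text file or an oversized JSON file is skipped. Leaving both filters unset selects every file, which mirrors the documented defaults of no extension list and no maximum size.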