Configure the File Extractor

To configure the File Extractor, you must create a configuration file. The file must be in YAML format.

The configuration file allows substitutions with environment variables:

config-parameter: ${CONFIG_VALUE}
Note

Implicit substitutions only work for unquoted value strings. For quoted strings, use the !env tag to activate environment substitution:

config-parameter: !env 'PARAM=SYSTEM;CONFIG=${CONFIG_VALUE}'

The configuration file also contains the global parameter version, which holds the version of the configuration schema used in the configuration file. This document describes version 3 of the configuration schema.

Tip

You can set up extraction pipelines to use versioned extractor configuration files stored in the cloud.

Logger

The optional logger section sets up logging to a console and files.

| Parameter | Description |
| --- | --- |
| console | Sets up the console logger configuration. See the Console section. |
| file | Sets up the file logger configuration. See the File section. |

Console

Include the console section to enable logging to standard output, such as a terminal window.

| Parameter | Description |
| --- | --- |
| level | Select the verbosity level for console logging. Valid options, in decreasing order of verbosity, are DEBUG, INFO, WARNING, ERROR, and CRITICAL. |

File

Include the file section to enable logging to a file. The files are rotated daily.

| Parameter | Description |
| --- | --- |
| level | Select the verbosity level for file logging. Valid options, in decreasing order of verbosity, are DEBUG, INFO, WARNING, ERROR, and CRITICAL. |
| path | Insert the path to the log file. |
| retention | Specify the number of days to keep logs for. The default value is 7. |
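
As a rough sketch, a logger section that combines both subsections could look like the following. The log file path and the chosen levels are illustrative assumptions, not recommended values.

```yaml
logger:
  console:
    level: INFO              # less verbose output on the terminal
  file:
    level: DEBUG             # more detail in the rotated log files
    path: logs/file-extractor.log   # hypothetical path
    retention: 14            # keep 14 days of logs instead of the default 7
```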

Cognite

The cognite section describes which CDF project the extractor will load data into and how to connect to the project.

| Parameter | Description |
| --- | --- |
| project | Insert the CDF project name. This is a required value. |
| host | Insert the base URL of the CDF project. The default value is https://api.cognitedata.com. |
| idp-authentication | Insert the credentials for authenticating to CDF using an external identity provider. You must enter either an API key or use IdP authentication. |
| data-set | Insert an optional data set ID that will be used if you've set the extractor to create missing time series. This value must contain either id or external-id. |

Identity provider (IdP) authentication

The idp-authentication section enables the extractor to authenticate to CDF using an external identity provider, such as Azure AD.

| Parameter | Description |
| --- | --- |
| client-id | Enter the client ID from the IdP. This is a required value. |
| secret | Enter the client secret from the IdP. This is a required value. |
| scopes | List the scopes. This is a required value. |
| resource | Insert the resource to send along with token requests. This is an optional field. |
| token-url | Insert the URL to fetch tokens from. You must enter either a token URL or an Azure tenant. |
| tenant | Enter the Azure tenant. You must enter either a token URL or an Azure tenant. |
| min-ttl | Insert the minimum time in seconds a token will be valid. If the cached token expires in less than min-ttl seconds, it will be refreshed. The default value is 30. |
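
The parameters above might be combined as in this sketch, which assumes Azure AD as the identity provider. The project name, data set external ID, environment variable names, and scope value are placeholders chosen for illustration; use the values from your own CDF project and app registration.

```yaml
cognite:
  project: my-project                      # hypothetical project name
  host: https://api.cognitedata.com        # the documented default
  idp-authentication:
    client-id: ${COGNITE_CLIENT_ID}        # hypothetical environment variable
    secret: ${COGNITE_CLIENT_SECRET}       # hypothetical environment variable
    scopes:
      - https://api.cognitedata.com/.default   # assumed scope, check your IdP setup
    tenant: ${AZURE_TENANT_ID}             # alternatively, set token-url
    min-ttl: 30
  data-set:
    external-id: my-data-set               # hypothetical; id can be used instead
```

Reading the secrets from environment variables keeps them out of the configuration file itself.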

Extractor

The optional extractor section contains tuning parameters.

| Parameter | Description |
| --- | --- |
| errors_threshold | Enter the number of retries the extractor should attempt when a file extraction fails. The default value is 5. |
| parallelism | Insert the number of parallel queries to run. The default value is 4. |
| state-store | Configure a state store. By default, there is no state store and incremental load is deactivated. See the State store section. |
| schedule | Set the interval at which the file extraction should be executed. Use this parameter when the extractor is set to continuous mode. See the Schedule section. |

Schedule

Use the schedule subsection to schedule runs when the extractor runs as a service.

| Parameter | Description |
| --- | --- |
| type | Insert the schedule type. Valid options are cron and interval. cron uses regular cron expressions. interval expects an interval-based schedule. |
| expression | Enter the cron or interval expression that triggers the extraction. For example, 1h repeats the extraction hourly, and 5m repeats it every 5 minutes. |
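
A minimal sketch of an extractor section with an interval-based schedule, assuming the schedule subsection nests directly under extractor as described above. The tuning values are simply the documented defaults.

```yaml
extractor:
  errors_threshold: 5    # documented default
  parallelism: 4         # documented default
  schedule:
    type: interval
    expression: 1h       # run once every hour
```

To use a cron schedule instead, set type to cron and put a regular cron expression in expression.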

State store

Use the state store subsection to save extraction states between runs. Use this if data is loaded incrementally. The extractor supports multiple state stores, but you can only configure one at a time.

| Parameter | Description |
| --- | --- |
| local | Local state store configuration. See the Local section. |
| raw | RAW state store configuration. See the RAW section. |

Local

Use the local section to store the extraction state in a JSON file on a local machine.

| Parameter | Description |
| --- | --- |
| path | Insert the file path to a JSON file. |
| save-interval | Enter the interval in seconds between each save. The default value is 30 seconds. |
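
For illustration, a local state store might be configured as follows. This assumes the state-store subsection nests under extractor, as the Extractor section indicates; the file name is hypothetical.

```yaml
extractor:
  state-store:
    local:
      path: states.json    # hypothetical file name
      save-interval: 30    # seconds between saves (documented default)
```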

RAW

Use the RAW section to store the extraction state in a table in the CDF staging area.

| Parameter | Description |
| --- | --- |
| database | Enter the database name in the CDF staging area. |
| table | Enter the table name in the CDF staging area. |
| upload-interval | Enter the interval in seconds between each save. The default value is 30 seconds. |
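
A comparable sketch for a RAW state store, again assuming the state-store subsection nests under extractor and uses the raw key listed in the State store table. The database and table names are placeholders.

```yaml
extractor:
  state-store:
    raw:
      database: extractor-states     # hypothetical database name
      table: file-extractor          # hypothetical table name
      upload-interval: 30            # seconds between saves (documented default)
```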

Files

The files section contains the configuration needed to connect to the file source. The schema for the file configuration depends on which file source you are connecting to. The type parameter distinguishes the sources. Possible file source types include:

• Azure Blob Storage
• FTP / FTPS
• Google Cloud Storage
• Local files
• Amazon S3
• Samba / SMB
• SFTP
• SharePoint Online

Navigate to Integrate > Connect to source system > Cognite File Extractor in CDF to see all supported sources and the recommended approach.

This is the schema for the Azure Blob Storage source:

| Parameter | Description |
| --- | --- |
| type | Type of file source. Set to azure_blob_storage for Azure Blob Storage files. |
| connection_string | Connection string needed to connect to Azure Blob Storage. This is a mandatory field. |
| containers | List of Azure blob containers. This is an optional field. |
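
A sketch of an Azure Blob Storage source. It assumes the parameters sit directly under files, as the table suggests; the environment variable and container names are hypothetical.

```yaml
files:
  type: azure_blob_storage
  connection_string: ${AZURE_STORAGE_CONNECTION_STRING}   # hypothetical environment variable
  containers:                # optional
    - reports                # hypothetical container names
    - drawings
```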

This is the schema for the FTP/FTPS source:

| Parameter | Description |
| --- | --- |
| type | Type of file source. Set to ftp for an FTP or FTPS source. |
| base-url | Enter the base URL for the FTP server. This is a mandatory field. |
| port | Enter the port of the FTP server. This is an optional field. |
| client-login | Enter the FTP username. This is a mandatory field. |
| client-password | Enter the FTP password. This is a mandatory field. |
| main-folder | Enter the root directory from which the extractor will start the extraction. This is an optional field. |
| with-subfolders | Flag that allows the extractor to traverse sub-folders to retrieve the related files. Possible values are true or false. The default value is false. This is an optional field. |
| use-ssl | When set to true, the extractor connects to the source using SSL (FTPS). Possible values are true or false. The default value is false. This is an optional field. |
| certificate-file-path | Enter the path to the certificate file. This is an optional field. |
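
As an illustration, an FTPS source could be described like this. The sketch assumes the parameters sit directly under files; the host, folder, certificate path, and environment variable names are placeholders.

```yaml
files:
  type: ftp
  base-url: ftp.example.com            # hypothetical host
  port: 21                             # standard FTP control port
  client-login: ${FTP_USER}            # hypothetical environment variable
  client-password: ${FTP_PASSWORD}     # hypothetical environment variable
  main-folder: /exports                # hypothetical root directory
  with-subfolders: true                # also traverse sub-folders
  use-ssl: true                        # connect with FTPS
  certificate-file-path: /etc/ssl/certs/ftp-server.pem   # hypothetical path
```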

This is the schema for the Google Cloud Storage source:

| Parameter | Description |
| --- | --- |
| type | Type of file source. Set to gcp_cloud_storage for a Google Cloud Storage source. |
| google-application-credentials | Enter the Google Cloud Platform service account credentials (encoded in base64 format). This is a mandatory field. |
| bucket | Enter the name of the bucket where the files are located. This is a mandatory field. |
| folders | Enter the list of folders where the files are located. This is a mandatory field. |
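
A sketch of a Google Cloud Storage source, assuming the parameters sit directly under files. The bucket, folder, and environment variable names are hypothetical.

```yaml
files:
  type: gcp_cloud_storage
  google-application-credentials: ${GCP_CREDENTIALS_BASE64}   # hypothetical variable holding the base64-encoded credentials
  bucket: my-bucket          # hypothetical bucket name
  folders:
    - incoming               # hypothetical folder names
    - archive/2023
```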

This is the schema for the local files source:

| Parameter | Description |
| --- | --- |
| type | Type of file source. Set to local for local files. |
| path | Enter the path (absolute or relative) where the local files are located. This is a mandatory parameter. |
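
The local source is the smallest configuration. A sketch, assuming the parameters sit directly under files and using a hypothetical directory:

```yaml
files:
  type: local
  path: /data/files-to-upload    # hypothetical directory, absolute or relative
```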

This is the schema for the Amazon S3 source:

| Parameter | Description |
| --- | --- |
| type | Type of file source. Set to aws_s3 for an Amazon S3 source. |
| aws_access_key_id | Enter the AWS Access Key ID. This is a mandatory field. |
| aws_secret_access_key | Enter the AWS Secret Access Key. This is a mandatory field. |
| bucket | Enter the name of the bucket where the files are located. This is a mandatory field. |
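
A sketch of an Amazon S3 source, assuming the parameters sit directly under files. The bucket and environment variable names are placeholders.

```yaml
files:
  type: aws_s3
  aws_access_key_id: ${AWS_ACCESS_KEY_ID}           # hypothetical environment variable
  aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}   # hypothetical environment variable
  bucket: my-bucket                                 # hypothetical bucket name
```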

This is the schema for the Samba / SMB source:

| Parameter | Description |
| --- | --- |
| type | Type of file source. Set to smb for a Samba source. |
| server | Enter the address of the Samba server. This is a mandatory field. |
| share_path | Enter the Samba server share path. This is a mandatory field. |
| username | Enter the Samba server username. This is a mandatory field. |
| password | Enter the Samba server password. This is a mandatory field. |
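
A sketch of a Samba source, assuming the parameters sit directly under files. The server address, share path, and environment variable names are hypothetical, and the exact share path format may differ in your environment.

```yaml
files:
  type: smb
  server: fileserver.example.com   # hypothetical server address
  share_path: shared/documents     # hypothetical share path
  username: ${SMB_USER}            # hypothetical environment variable
  password: ${SMB_PASSWORD}        # hypothetical environment variable
```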

This is the schema for the SFTP source:

| Parameter | Description |
| --- | --- |
| type | Type of file source. Set to sftp for an SFTP source. |
| base-url | Enter the base URL for the SFTP server. This is a mandatory field. |
| port | Enter the port of the SFTP server. This is an optional field. |
| client-login | Enter the SFTP username. This is a mandatory field. |
| client-password | Enter the SFTP password. This is a mandatory field. |
| main-folder | Enter the root directory from which the extractor will start the extraction. This is an optional field. |
| with-subfolders | Flag that allows the extractor to traverse sub-folders to retrieve the related files. Possible values are true or false. The default value is false. This is an optional field. |
| certificate-file-path | Enter the path to the certificate file. This is an optional field. |
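
A sketch of an SFTP source, assuming the parameters sit directly under files. The host, folder, and environment variable names are placeholders.

```yaml
files:
  type: sftp
  base-url: sftp.example.com           # hypothetical host
  port: 22                             # standard SSH/SFTP port
  client-login: ${SFTP_USER}           # hypothetical environment variable
  client-password: ${SFTP_PASSWORD}    # hypothetical environment variable
  main-folder: /upload                 # hypothetical root directory
  with-subfolders: false               # documented default
```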

This is the schema for the SharePoint Online source:

| Parameter | Description |
| --- | --- |
| type | Type of file source. Set to sharepoint_online for a SharePoint Online source. |
| client-id | Enter the App registration client ID. This is a mandatory field. |
| client-secret | Enter the App registration secret. This is a mandatory field. |
| tenant-id | Enter the Azure tenant related to the App registration. This is a mandatory field. |
| base-url | Enter the SharePoint Online base URL. This is a mandatory field. |
| site | Enter the SharePoint site where the document library is located. This is a mandatory field. |
| document-library | Enter the SharePoint document library where the files are located. This is a mandatory field. |
| with-subfolders | Flag that allows the extractor to traverse sub-folders to retrieve the related files. Possible values are true or false. The default value is false. This is an optional field. |
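
A sketch of a SharePoint Online source, assuming the parameters sit directly under files. The base URL, site, library, and environment variable names are hypothetical.

```yaml
files:
  type: sharepoint_online
  client-id: ${SHAREPOINT_CLIENT_ID}           # hypothetical environment variable
  client-secret: ${SHAREPOINT_CLIENT_SECRET}   # hypothetical environment variable
  tenant-id: ${AZURE_TENANT_ID}                # hypothetical environment variable
  base-url: https://example.sharepoint.com     # hypothetical base URL
  site: engineering                            # hypothetical site name
  document-library: Documents                  # hypothetical library name
  with-subfolders: true                        # also traverse sub-folders
```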