Configuration settings
To configure the File extractor, you must create a configuration file. The file must be in YAML format.
You can set up extraction pipelines to use versioned extractor configuration files stored in the cloud.
Using values from environment variables
The configuration file allows substitutions with environment variables. For example:
cognite:
  secret: ${COGNITE_CLIENT_SECRET}
This loads the value of the COGNITE_CLIENT_SECRET environment variable into the cognite/secret parameter. You can also interpolate environment variables into a longer string, for example:
url: http://my-host.com/api/endpoint?secret=${MY_SECRET_TOKEN}
Implicit substitutions only work for unquoted value strings. For quoted strings, use the !env tag to activate environment substitution:
url: !env 'http://my-host.com/api/endpoint?secret=${MY_SECRET_TOKEN}'
Using values from Azure Key Vault
The File extractor also supports loading values from Azure Key Vault. To load a configuration value from Azure Key Vault, use the !keyvault tag followed by the name of the secret you want to load. For example, to load the value of the my-secret-name secret in Key Vault into a password parameter, configure your extractor like this:
password: !keyvault my-secret-name
To use Key Vault, you also need to include the azure-keyvault section in your configuration, with the following parameters:
Parameter | Description |
---|---|
keyvault-name | Name of Key Vault to load secrets from |
authentication-method | How to authenticate to Azure. Either default or client-secret. For default, the extractor looks at the user running the extractor and uses pre-configured Azure logins from tools such as the Azure CLI. For client-secret, the extractor authenticates with a configured client ID/secret pair. |
client-id | Required for using the client-secret authentication method. The client ID to use when authenticating to Azure. |
secret | Required for using the client-secret authentication method. The client secret to use when authenticating to Azure. |
tenant-id | Required for using the client-secret authentication method. The tenant ID of the Key Vault in Azure. |
Example:
azure-keyvault:
  keyvault-name: my-keyvault-name
  authentication-method: client-secret
  tenant-id: 6f3f324e-5bfc-4f12-9abe-22ac56e2e648
  client-id: 6b4cc73e-ee58-4b61-ba43-83c4ba639be6
  secret: 1234abcd
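As a minimal sketch of the default authentication method, the fragment below assumes the machine running the extractor already has an Azure login (for example from the Azure CLI), so only keyvault-name is set. The vault name, the secret name, and the choice of the SFTP password as the target parameter are placeholders for illustration.

```yaml
# Hypothetical vault and secret names, for illustration only.
azure-keyvault:
  keyvault-name: my-keyvault-name
  authentication-method: default

# A configuration value can then reference a secret by name.
# The SFTP file provider's password is used here as an example target.
files:
  file-provider:
    type: sftp
    host: ftp.myserver.com
    username: username
    password: !keyvault my-secret-name
```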
Base configuration object
Parameter | Type | Description |
---|---|---|
version | either string or integer | Configuration file version |
type | either local or remote | Configuration file type. Either local, meaning the full config is loaded from this file, or remote, meaning that only the cognite section is loaded from this file and the rest is loaded from extraction pipelines. Default value is local. |
cognite | object | The cognite section describes which CDF project the extractor will load data into and how to connect to the project. |
logger | object | The optional logger section sets up logging to a console and files. |
files | object | Configure files to be extracted to CDF. |
extractor | object | General configuration for the file extractor. |
metrics | object | The metrics section describes where to send metrics on extractor performance for remote monitoring of the extractor. We recommend sending metrics to a Prometheus pushgateway, but you can also send metrics as time series in the CDF project. |
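To illustrate the remote type, here is a sketch of a local file that carries only the cognite section and leaves everything else to an extraction pipeline. The project name and the pipeline external ID are placeholders, and the authentication details are left out for brevity.

```yaml
# Sketch of a remote configuration: only the cognite section is read from this file.
version: 1
type: remote

cognite:
  project: my-project                         # placeholder project name
  extraction-pipeline:
    external-id: my-file-extractor-pipeline   # placeholder external ID
  # An idp-authentication section would also go here.
```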
cognite
Global parameter.
The cognite section describes which CDF project the extractor will load data into and how to connect to the project.
Parameter | Type | Description |
---|---|---|
project | string | Insert the CDF project name. |
idp-authentication | object | The idp-authentication section enables the extractor to authenticate to CDF using an external identity provider (IdP), such as Microsoft Entra ID (formerly Azure Active Directory). |
data-set | object | Enter a data set the extractor should write data into |
extraction-pipeline | object | Enter the extraction pipeline used for remote config and reporting statuses |
host | string | Insert the base URL of the CDF project. Default value is https://api.cognitedata.com . |
timeout | integer | Enter the timeout on requests to CDF, in seconds. Default value is 30 . |
external-id-prefix | string | Prefix on external ID used when creating CDF resources |
connection | object | Configure network connection details |
idp-authentication
Part of cognite
configuration.
The idp-authentication section enables the extractor to authenticate to CDF using an external identity provider (IdP), such as Microsoft Entra ID (formerly Azure Active Directory).
Parameter | Type | Description |
---|---|---|
authority | string | Insert the authority together with tenant to authenticate against Azure tenants. Default value is https://login.microsoftonline.com/ . |
client-id | string | Required. Enter the service principal client id from the IdP. |
tenant | string | Enter the Azure tenant. |
token-url | string | Insert the URL to fetch tokens from. |
secret | string | Enter the service principal client secret from the IdP. |
resource | string | Resource parameter passed along with token requests. |
audience | string | Audience parameter passed along with token requests. |
scopes | list | Enter a list of scopes requested for the token |
min-ttl | integer | Insert the minimum time in seconds a token will be valid. If the cached token expires in less than min-ttl seconds, it will be refreshed even if it is still valid. Default value is 30 . |
certificate | object | Authenticate with a client certificate |
scopes
Part of idp-authentication
configuration.
Enter a list of scopes requested for the token
Each element of this list should be a string.
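Putting the cognite and idp-authentication parameters together, a client-secret setup might look like the sketch below. The environment variable names, the project and data set names, and the scope value are assumptions chosen for illustration; use the values your IdP and CDF project actually require.

```yaml
cognite:
  project: my-project                  # placeholder
  host: https://api.cognitedata.com    # the documented default
  idp-authentication:
    tenant: ${AZURE_TENANT_ID}         # hypothetical variable names
    client-id: ${COGNITE_CLIENT_ID}
    secret: ${COGNITE_CLIENT_SECRET}
    scopes:
      - https://api.cognitedata.com/.default   # assumed scope value
  data-set:
    external-id: my-data-set           # placeholder
```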
certificate
Part of idp-authentication
configuration.
Authenticate with a client certificate
Parameter | Type | Description |
---|---|---|
authority-url | string | Authentication authority URL |
path | string | Required. Enter the path to the .pem or .pfx certificate to be used for authentication |
password | string | Enter the password for the key file, if it is encrypted. |
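A certificate-based variant could be sketched as follows. The file path, the variable names, and the shape of the authority-url value are assumptions; only path is marked as required in the table above.

```yaml
cognite:
  project: my-project                  # placeholder
  idp-authentication:
    client-id: ${COGNITE_CLIENT_ID}    # hypothetical variable name
    scopes:
      - https://api.cognitedata.com/.default   # assumed scope value
    certificate:
      # Assumed form: the default authority followed by the tenant.
      authority-url: https://login.microsoftonline.com/${AZURE_TENANT_ID}
      path: /path/to/certificate.pem
      password: ${CERTIFICATE_PASSWORD}   # only if the key file is encrypted
```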
data-set
Part of cognite
configuration.
Enter a data set the extractor should write data into
Parameter | Type | Description |
---|---|---|
id | integer | Resource internal id |
external-id | string | Resource external id |
extraction-pipeline
Part of cognite
configuration.
Enter the extraction pipeline used for remote config and reporting statuses
Parameter | Type | Description |
---|---|---|
id | integer | Resource internal id |
external-id | string | Resource external id |
connection
Part of cognite
configuration.
Configure network connection details
Parameter | Type | Description |
---|---|---|
disable-gzip | boolean | Whether or not to disable gzipping of json bodies. |
status-forcelist | string | HTTP status codes to retry. Defaults to 429, 502, 503 and 504 |
max-retries | integer | Max number of retries on a given http request. Default value is 10 . |
max-retries-connect | integer | Max number of retries on connection errors. Default value is 3 . |
max-retry-backoff | integer | Retry strategy employs exponential backoff. This parameter sets a max on the amount of backoff after any request failure. Default value is 30 . |
max-connection-pool-size | integer | The maximum number of connections that will be kept in the SDK's connection pool. Default value is 50. |
disable-ssl | boolean | Whether or not to disable SSL verification. |
proxies | object | Dictionary mapping from protocol to url. |
proxies
Part of connection
configuration.
Dictionary mapping from protocol to url.
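As a sketch of how these parameters nest, the fragment below tightens the retry settings and routes HTTPS traffic through a proxy. The proxy URL is hypothetical, and the numbers are arbitrary examples rather than recommendations.

```yaml
cognite:
  connection:
    max-retries: 5
    max-retries-connect: 3
    max-retry-backoff: 60
    proxies:
      # protocol: URL of the proxy to use for that protocol
      https: http://proxy.example.com:8080   # hypothetical proxy
```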
logger
Global parameter.
The optional logger section sets up logging to a console and files.
Parameter | Type | Description |
---|---|---|
console | object | Include the console section to enable logging to a standard output, such as a terminal window. |
file | object | Include the file section to enable logging to a file. The files are rotated daily. |
metrics | boolean | Enables metrics on the number of log messages recorded per logger and level. This requires metrics to be configured as well |
console
Part of logger
configuration.
Include the console section to enable logging to a standard output, such as a terminal window.
Parameter | Type | Description |
---|---|---|
level | either DEBUG, INFO, WARNING, ERROR or CRITICAL | Select the verbosity level for console logging. Valid options, in decreasing order of verbosity, are DEBUG, INFO, WARNING, ERROR, and CRITICAL. Default value is INFO. |
file
Part of logger
configuration.
Include the file section to enable logging to a file. The files are rotated daily.
Parameter | Type | Description |
---|---|---|
level | either DEBUG, INFO, WARNING, ERROR or CRITICAL | Select the verbosity level for file logging. Valid options, in decreasing order of verbosity, are DEBUG, INFO, WARNING, ERROR, and CRITICAL. Default value is INFO. |
path | string | Required. Insert the path to the log file. |
retention | integer | Specify the number of days to keep logs for. Default value is 7 . |
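A logger section that combines both outputs might look like this sketch. The log path and the retention value are illustrative choices, not defaults.

```yaml
logger:
  console:
    level: INFO
  file:
    level: DEBUG
    path: logs/file-extractor.log   # hypothetical path
    retention: 14                   # keep two weeks of daily log files
```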
files
Global parameter.
Configure files to be extracted to CDF.
Parameter | Type | Description |
---|---|---|
file-provider | configuration for either Local Files, Sharepoint Online, FTP/FTPS, SFTP, GCP Cloud Storage, Azure Blob Storage, Samba, AWS S3 or Documentum | Configure a file provider for where the files are extracted from. |
extensions | list | List of file extensions to include. If left out, all file extensions will be allowed. |
labels | list | List of label external IDs to add to extracted files. |
security-categories | list | List of security category IDs to add to extracted files. |
max-file-size | either string or number | Maximum size of files to include. Set to -1 to allow any file size. Syntax is N(KB|MB|GB|TB|KiB|MiB|GiB|TiB). Note that the extractor supports files up to 1000GiB. Default value is 100GiB. |
with-metadata | boolean | Add metadata extracted from the file source to files in CDF. |
directory-prefix | string | Prefix to add to all extracted file directories. |
metadata-to-raw | object | If this is configured, write metadata to a table in CDF Raw instead of files. |
data_model | object | When this is provided, all file metadata is uploaded to data models, which makes metadata-to-raw redundant. |
source | object | Sets the 'Source' metadata field for the related files. When data modelling is configured, it also updates the underlying CogniteSourceSystem with the corresponding source. This is an optional parameter. |
filter | configuration for either And, Or, Not, Equals or In | |
delete-behavior | object | What the extractor should do with files that were removed from the source. |
missing-as-deleted | boolean | Whether the extractor should treat files that were not returned from the source as deleted. |
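To show how the top-level files parameters fit around a file provider, here is a small sketch. The label external ID, the directory prefix, and the size limit are placeholder values.

```yaml
files:
  file-provider:
    type: local
    path: /some/local/path
  max-file-size: 500MB          # uses the N(unit) syntax described above
  with-metadata: true
  directory-prefix: /plant-a    # hypothetical prefix
  labels:
    - my-label-external-id      # hypothetical label external ID
```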
file-provider
Part of files
configuration.
Configure a file provider for where the files are extracted from.
Either one of the following options:
- Local Files
- Sharepoint Online
- FTP/FTPS
- SFTP
- GCP Cloud Storage
- Azure Blob Storage
- Samba
- AWS S3
- Documentum
local_files
Part of file-provider
configuration.
Read files from a local folder. This file provider will recursively traverse the given path and extract all discovered files.
Example with a single path:
type: local
path: /some/local/path
Example with a list of paths:
type: local
path:
  - /some/local/path
  - /another/path
Parameter | Type | Description |
---|---|---|
type | string | Select the type of file provider. Set to local for local files. |
path | either string or list | Path to the folder to read files from, or a list of such paths. |
ignore_patterns | configuration for either String or Pattern with flags | Any file path that matches this pattern will be ignored. |
ignore_patterns
Part of local_files
configuration.
Any file path that matches this pattern will be ignored.
Either one of the following options:
- String
- Pattern with flags
pattern_with_flags
Part of ignore_patterns
configuration.
Parameter | Type | Description |
---|---|---|
pattern | string | Pattern string |
flags | either a or i |
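The two forms of ignore_patterns could be written as in the sketch below. The source does not describe the pattern syntax or the meaning of the flags, so this assumes regular-expression patterns and that i requests case-insensitive matching; treat both as assumptions to verify.

```yaml
files:
  file-provider:
    type: local
    path: /some/local/path
    # Pattern with flags form. The plain string form would instead be:
    #   ignore_patterns: '.*\.tmp$'
    ignore_patterns:
      pattern: '.*\.tmp$'   # assumed to be a regular expression
      flags: i              # assumed to mean case-insensitive
```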
sharepoint_online
Part of file-provider
configuration.
Read files from one or more sharepoint online sites.
Example:
type: sharepoint_online
client-id: ${SP_CLIENT_ID}
client-secret: ${SP_CLIENT_SECRET}
tenant-id: ${SP_AZURE_TENANT_ID}
paths:
- url: ${SP_EXTRACT_URL}
Parameter | Type | Description |
---|---|---|
type | string | Required. Select the type of file provider. Set to sharepoint_online for Sharepoint Online. |
client-id | string | Required. Enter the App registration client ID. |
client-secret | string | Enter the App registration secret. |
certificate-path | string | Enter the path to a certificate used for authentication. Either this, client-secret , or certificate-data must be specified. |
certificate-data | string | Provide authentication certificate data directly. |
tenant-id | string | Required. Enter the Azure tenant containing the App registration. |
paths | list | Required. Enter a list of sharepoint base URLs to extract from. |
datetime-format | string | Format string for timestamp metadata. Default value is %Y-%m-%dT%H:%M:%SZ . |
extract-columns | object | Extract Sharepoint columns as metadata. This is a map from column names in Sharepoint to the names you want the extracted columns to have in file metadata in CDF. Example: {'columnNameInSharepoint': 'metadataNameInCdf'} |
restrict-to | object | Restrict to extract only files visible to a given Sharepoint Group or SiteUser. Important: In order to use this, the extractor MUST authenticate to Sharepoint online using a certificate, NOT a client secret. |
ignore_patterns | configuration for either String or Pattern with flags | Any file path that matches this pattern will be ignored. |
paths
Part of sharepoint_online
configuration.
Enter a list of sharepoint base URLs to extract from.
Each element of this list should be a Sharepoint base URL to extract from.
Parameter | Type | Description |
---|---|---|
url | string | Required. URL to the Sharepoint location you want to extract from. This can be the url to a site, a document library, or to a file or folder inside the library. |
recursive | boolean | Whether to traverse into subfolders or not, for this path. Default value is True . |
extract-columns
Part of sharepoint_online
configuration.
Extract Sharepoint columns as metadata. This is a map from column names in Sharepoint to the names you want the extracted columns to have in file metadata in CDF.
Example:
columnNameInSharepoint: metadataNameInCdf
Parameter | Type | Description |
---|---|---|
Any string | string | Name of metadata field in CDF. |
restrict-to
Part of sharepoint_online
configuration.
Restrict to extract only files visible to a given Sharepoint Group or SiteUser. Important: In order to use this, the extractor MUST authenticate to Sharepoint online using a certificate, NOT a client secret.
Parameter | Type | Description |
---|---|---|
group-id | string | The ID of a Sharepoint Group |
username | string | The "login name" of a SiteUser. This can be useful if you for example want to restrict extraction to the built-in "all users" SiteUser. |
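Because restrict-to requires certificate authentication, a configuration that uses it might look like the following sketch. The certificate path and group ID are hypothetical, and the environment variable names are reused from the earlier example.

```yaml
type: sharepoint_online
client-id: ${SP_CLIENT_ID}
tenant-id: ${SP_AZURE_TENANT_ID}
certificate-path: /path/to/certificate.pem   # used instead of client-secret
paths:
  - url: ${SP_EXTRACT_URL}
    recursive: false        # do not traverse into subfolders for this path
extract-columns:
  columnNameInSharepoint: metadataNameInCdf
restrict-to:
  group-id: "42"            # hypothetical group ID, quoted because the type is string
```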
ignore_patterns
Part of sharepoint_online
configuration.
Any file path that matches this pattern will be ignored.
Either one of the following options:
- String
- Pattern with flags
pattern_with_flags
Part of ignore_patterns
configuration.
Parameter | Type | Description |
---|---|---|
pattern | string | Pattern string |
flags | either a or i |
ftp/ftps
Part of file-provider
configuration.
Read files from an FTP server.
Example:
type: ftp
host: ftp.myserver.com
username: username
password: ${FTP_PASSWORD}
Parameter | Type | Description |
---|---|---|
type | string | Required. Select the type of file provider. Set to ftp for FTP. |
host | string | Required. Host name for the FTP server. Example: ftp.myserver.com |
port | integer | FTP server port. Default value is 21 . |
username | string | Required. Username to use to login to the FTP server. |
password | string | Required. Password to use to login to the FTP server. |
root-directory | string | Root folder for extraction. Default value is / . |
recursive | boolean | Whether to recursively traverse sub-folders for files. Default value is True . |
use-ssl | boolean | Whether to connect using FTPS (FTP with SSL/TLS). Using SSL is strongly recommended. |
certificate-file-path | string | Path to SSL certificate authority certificate for the FTP server, useful if using self-signed certificates. |
ignore_patterns | configuration for either String or Pattern with flags | Any file path that matches this pattern will be ignored. |
ignore_patterns
Part of ftp/ftps
configuration.
Any file path that matches this pattern will be ignored.
Either one of the following options:
- String
- Pattern with flags
pattern_with_flags
Part of ignore_patterns
configuration.
Parameter | Type | Description |
---|---|---|
pattern | string | Pattern string |
flags | either a or i |
sftp
Part of file-provider
configuration.
Read files from an SFTP server (file transfer over SSH).
Example:
type: sftp
host: ftp.myserver.com
username: username
password: ${FTP_PASSWORD}
Parameter | Type | Description |
---|---|---|
type | string | Required. Select the type of file provider. Set to sftp for SFTP. |
host | string | Required. Host name for the SSH server. Example: ftp.myserver.com |
port | integer | SSH server port. Default value is 22 . |
username | string | Required. Username to use to login to the SSH server. |
password | string | Password to use to login to the SSH server. Either password or key-path is required. |
key-path | string | Path to SSH private key for the connection. Either password or key-path is required. |
key-password | string | Password for SSH private key if the key is encrypted. Only used in combination with key-path . |
root-directory | string | Root folder for extraction. Default value is / . |
recursive | boolean | Whether to recursively traverse sub-folders for files. Default value is True . |
ignore_patterns | configuration for either String or Pattern with flags | Any file path that matches this pattern will be ignored. |
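Since either password or key-path is required, a key-based variant of the example above could be sketched like this. The key location, the variable name, and the root directory are placeholders.

```yaml
type: sftp
host: ftp.myserver.com
port: 22
username: username
key-path: /home/extractor/.ssh/id_rsa   # hypothetical key location
key-password: ${SSH_KEY_PASSWORD}       # only needed if the key is encrypted
root-directory: /exports                # hypothetical folder
recursive: false
```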
ignore_patterns
Part of sftp
configuration.
Any file path that matches this pattern will be ignored.
Either one of the following options:
- String
- Pattern with flags
pattern_with_flags
Part of ignore_patterns
configuration.
Parameter | Type | Description |
---|---|---|
pattern | string | Pattern string |
flags | either a or i |
gcp_cloud_storage
Part of file-provider
configuration.
Read files from a GCP Cloud Storage bucket.
Example:
type: gcp_cloud_storage
google-application-credentials: ${GOOGLE_APPLICATION_CREDENTIALS}
bucket: bucket_name
folders:
- list
- of
- folders
Parameter | Type | Description |
---|---|---|
type | string | Select the type of file provider. Set to gcp_cloud_storage for GCP Cloud Storage. |
google-application-credentials | string | Base-64 encoded GCP service account credentials. |
bucket | string | Name of GCP Cloud Storage bucket to fetch files from. |
folders | list | List of folders in bucket to fetch files from. |
ignore_patterns | configuration for either String or Pattern with flags | Any file path that matches this pattern will be ignored. |
folders
Part of gcp_cloud_storage
configuration.
List of folders in bucket to fetch files from.
Each element of this list should be a string.
ignore_patterns
Part of gcp_cloud_storage
configuration.
Any file path that matches this pattern will be ignored.
Either one of the following options:
- String
- Pattern with flags
pattern_with_flags
Part of ignore_patterns
configuration.
Parameter | Type | Description |
---|---|---|
pattern | string | Pattern string |
flags | either a or i |
azure_blob_storage
Part of file-provider
configuration.
Read files from an Azure Blob Store.
Example:
type: azure_blob_storage
connection-string: ${AZURE_BLOB_STORAGE_CONNECTION_STRING}
Parameter | Type | Description |
---|---|---|
type | string | Required. Select the type of file provider. Set to azure_blob_storage for Azure Blob Storage. |
connection-string | string | Required. Azure Blob Storage connection string. |
containers | list | Optional list of containers to extract from. If left out or empty, all files will be read. |
ignore_patterns | configuration for either String or Pattern with flags | Any file path that matches this pattern will be ignored. |
containers
Part of azure_blob_storage
configuration.
Optional list of containers to extract from. If left out or empty, all files will be read.
Each element of this list should be a string.
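To limit extraction to specific containers, the list can be added to the provider as in this sketch. The container names are placeholders.

```yaml
type: azure_blob_storage
connection-string: ${AZURE_BLOB_STORAGE_CONNECTION_STRING}
containers:
  - drawings   # hypothetical container names
  - reports
```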
ignore_patterns
Part of azure_blob_storage
configuration.
Any file path that matches this pattern will be ignored.
Either one of the following options:
- String
- Pattern with flags
pattern_with_flags
Part of ignore_patterns
configuration.
Parameter | Type | Description |
---|---|---|
pattern | string | Pattern string |
flags | either a or i |
samba
Part of file-provider
configuration.
Read files from a Samba file share.
Example:
type: smb
server: serverhost
share-path: \\serverhost\share_path
username: username
password: ${SMB_PASSWORD}
Parameter | Type | Description |
---|---|---|
type | string | Required. Select the type of file provider. Set to smb for Samba. |
share-path | string | Required. Share path, i.e. name of path shared. Example: \\server\share_path |
server | string | Required. The IP or hostname of the server to connect to |
port | integer | Port to connect to on the Samba server. Default value is 445 . |
username | string | Username for authentication. |
password | string | Password for authentication. |
require-signing | boolean | Whether signing is required on messages sent to and received from the Samba server. Default value is True . |
domain-controller | string | The domain controller hostname. When set the file provider will send a DFS referral request to this hostname to populate the domain cache used for DFS connections or when connecting to SYSVOL or NETLOGON |
skip-dfs | boolean | Whether to skip using any DFS referral checks and treat any path as a normal path. This is only useful if there are problems with the DFS resolver or you wish to avoid the extra round trip(s) the resolver requires. |
auth-protocol | either negotiate , kerberos or ntlm | The protocol to use for authentication. Default value is negotiate . |
require-secure-negotiate | boolean | Whether to verify the negotiated dialects and capabilities on the connection to a share to protect against man in the middle downgrade attacks. Default value is True . |
ignore_patterns | configuration for either String or Pattern with flags | Any file path that matches this pattern will be ignored. |
ignore_patterns
Part of samba
configuration.
Any file path that matches this pattern will be ignored.
Either one of the following options:
- String
- Pattern with flags
pattern_with_flags
Part of ignore_patterns
configuration.
Parameter | Type | Description |
---|---|---|
pattern | string | Pattern string |
flags | either a or i |
aws_s3
Part of file-provider
configuration.
Read files from an AWS S3 cloud bucket.
Example:
type: aws_s3
bucket: some_bucket
aws-access-key-id: ${AWS_ACCESS_KEY_ID}
aws-secret-access-key: ${AWS_SECRET_ACCESS_KEY}
Parameter | Type | Description |
---|---|---|
type | string | Required. Select the type of file provider. Set to aws_s3 for AWS S3. |
bucket | string | Required. AWS S3 cloud bucket to read files from. Valid formats are bucket-name or bucket-name/subfolder1/subfolder2 . |
aws-access-key-id | string | AWS access key ID to use for authentication. If left out, use default authentication configured on the machine the extractor is running on. |
aws-secret-access-key | string | AWS secret access key to use for authentication. If left out, use default authentication configured on the machine the extractor is running on. |
ignore_patterns | configuration for either String or Pattern with flags | Any file path that matches this pattern will be ignored. |
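As a sketch of the second bucket format combined with default authentication, the fragment below reads from a subfolder and leaves both key parameters out, so the extractor falls back to the credentials configured on the machine it runs on. The bucket and folder names are placeholders.

```yaml
type: aws_s3
# bucket-name/subfolder1/subfolder2 form, as described in the table above
bucket: some-bucket/subfolder1/subfolder2
# aws-access-key-id and aws-secret-access-key are intentionally omitted.
```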
ignore_patterns
Part of aws_s3
configuration.
Any file path that matches this pattern will be ignored.
Either one of the following options:
- String
- Pattern with flags
pattern_with_flags
Part of ignore_patterns
configuration.
Parameter | Type | Description |
---|---|---|
pattern | string | Pattern string |
flags | either a or i |
documentum
Part of file-provider
configuration.
Read files from an OpenText Documentum server.
Example:
type: documentum
base-url: https://my.documentum.server/dctm
username: username
password: ${DOCUMENTUM_PASSWORD}
repositories:
  - name: some_repo
    query: SELECT * FROM tech_document WHERE object_name LIKE 'PREFIX%' ORDER BY r_object_id
Parameter | Type | Description |
---|---|---|
type | string | Required. Select the type of file provider. Set to documentum to read from a Documentum server. |
base-url | string | Required. URL of the documentum server to read from. |
username | string | Required. Documentum server username. |
password | string | Required. Documentum server password. |
repositories | list | Required. List of documentum repositories to read from, and the query to make towards each repository. |
ssl-verify | boolean | Enable SSL certificate verification. Warning: Disabling SSL verification is a potential security risk. Do not use this option over the internet, only on local networks secured through other means. Default value is True . |
get-all-renditions | boolean | Get all renditions of each document, not just the primary. |
external-id-separator | string | Text to use as a separator between the different parts of the created external IDs. Default value is a single period (.). |
include-extension-in-file-names | boolean | Include the file extension in the name of the file in CDF. For example, upload it as My File.pdf instead of My File |
timeout | string | Timeout on queries to documentum. On the form N(s|m|h|d) . Default value is 5m . |
field-map | object | Map documentum metadata to fields on files in CDF. For most Documentum deployments, the defaults should not need to be changed. |
performance | object | Configuration to tune the parallelism of the file extractor. |
ignore_patterns | configuration for either String or Pattern with flags | Any file path that matches this pattern will be ignored. |
repositories
Part of documentum
configuration.
List of documentum repositories to read from, and the query to make towards each repository.
Each element of this list should be a Documentum repository the File extractor should read from, and the query it should use.
Parameter | Type | Description |
---|---|---|
name | string | Required. Name of the documentum repository. |
query | string | Required. Query to retrieve file info from the repository. Warning: The query must return a consistent ordering. To ensure this, add an ORDER BY clause on an ID field. Example: SELECT * FROM tech_document WHERE object_name LIKE 'PREFIX%' ORDER BY r_object_id |
field-map
Part of documentum
configuration.
Map documentum metadata to fields on files in CDF. For most Documentum deployments, the defaults should not need to be changed.
Parameter | Type | Description |
---|---|---|
external-id | list | Map documentum columns to external ID. This is required in order to correctly track files in CDF. |
name | list | Map documentum columns to file name. |
file-extension | list | Map documentum columns to file extension. |
modify-date | list | Map documentum columns to source modified time in CDF. |
mime-type | list | Map documentum columns to mime type. |
external-id
Part of field-map
configuration.
Map documentum columns to external ID. This is required in order to correctly track files in CDF.
Each element of this list should be a string.
name
Part of field-map
configuration.
Map documentum columns to file name.
Each element of this list should be a string.
file-extension
Part of field-map
configuration.
Map documentum columns to file extension.
Each element of this list should be a string.
modify-date
Part of field-map
configuration.
Map documentum columns to source modified time in CDF.
Each element of this list should be a string.
mime-type
Part of field-map
configuration.
Map documentum columns to mime type.
Each element of this list should be a string.
performance
Part of documentum
configuration.
Configuration to tune the parallelism of the file extractor.
Parameter | Type | Description |
---|---|---|
workers | integer | Number of parallel workers used to read from documentum. Default value is 30 . |
document-buffer | integer | Number of document metadata instances to buffer before uploading the file contents to CDF. Default value is 60 . |
page-buffer | integer | Number of document metadata pages to buffer. Default value is 5 . |
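For deployments that do need to override the defaults, field-map and performance could be set as in this sketch. The column names are borrowed from the query example above and are only assumed to be suitable mappings; the performance numbers are arbitrary.

```yaml
type: documentum
base-url: https://my.documentum.server/dctm
username: username
password: ${DOCUMENTUM_PASSWORD}
repositories:
  - name: some_repo
    query: SELECT * FROM tech_document ORDER BY r_object_id
field-map:
  external-id:
    - r_object_id     # assumed column, taken from the query example
  name:
    - object_name     # assumed column, taken from the query example
performance:
  workers: 10
  document-buffer: 20
  page-buffer: 2
```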
ignore_patterns
Part of documentum
configuration.
Any file path that matches this pattern will be ignored.
Either one of the following options:
- String
- Pattern with flags
pattern_with_flags
Part of ignore_patterns
configuration.
Parameter | Type | Description |
---|---|---|
pattern | string | Pattern string |
flags | either a or i |
extensions
Part of files
configuration.
List of file extensions to include. If left out, all file extensions will be allowed.
Each element of this list should be a string.
labels
Part of files
configuration.
List of label external IDs to add to extracted files.
Each element of this list should be a string.
security-categories
Part of files
configuration.
List of security category IDs to add to extracted files.
Each element of this list should be an integer.
metadata-to-raw
Part of files
configuration.
If this is configured, write metadata to a table in CDF Raw instead of files.
Parameter | Type | Description |
---|---|---|
database | string | Required. Write file metadata to this Raw database. |
table | string | Required. Write file metadata to this Raw table. |
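A minimal sketch of this section, with placeholder database and table names:

```yaml
files:
  metadata-to-raw:
    database: file_metadata    # hypothetical Raw database
    table: extracted_files     # hypothetical Raw table
```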
data_model
Part of files
configuration.
When this is provided, all file metadata is uploaded to data models, which makes metadata-to-raw redundant.
Parameter | Type | Description |
---|---|---|
space | string |
source
Part of files
configuration.
Sets the 'Source' metadata field for the related files. When data modelling is configured, it also updates the underlying CogniteSourceSystem with the corresponding source. This is an optional parameter.
Parameter | Type | Description |
---|---|---|
name | string | |
external_id | string |
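The data_model and source sections could be combined as in the sketch below. The space name and the source values are placeholders. Note that these keys use underscores, unlike most other parameters in this file.

```yaml
files:
  data_model:
    space: my_space                    # hypothetical space
  source:
    name: My SharePoint site           # hypothetical source name
    external_id: my_sharepoint_site    # hypothetical external ID
```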
filter
Part of files
configuration.
Either one of the following options:
- And
- Or
- Not
- Equals
- In
Example:
and:
  - equals:
      property: some_metadata_field
      value: some_metadata_value
  - not:
      in:
        property: some_metadata_field
        values:
          - metadata
          - values
and
Part of filter
configuration.
Matches if all sub filters match.
Parameter | Type | Description |
---|---|---|
and | list | Required. List of sub filters, all of these must match. |
and
Part of and
configuration.
List of sub filters, all of these must match.
Each element of this list should be a configuration for either And, Or, Not, Equals or In.
or
Part of filter
configuration.
Matches if any of the sub filters match.
Parameter | Type | Description |
---|---|---|
or | list | Required. List of sub filters, at least one of these must match. |
or
Part of or
configuration.
List of sub filters, at least one of these must match.
Each element of this list should be a configuration for either And, Or, Not, Equals or In.
not
Part of filter
configuration.
Matches if the sub filter does not match.
Parameter | Type | Description |
---|---|---|
not | configuration for either And, Or, Not, Equals or In |
equals
Part of filter
configuration.
Matches if the property on the file is equal to the given value.
Parameter | Type | Description |
---|---|---|
equals | object | Required. Equality filter. |
equals
Part of equals
configuration.
Equality filter.
Parameter | Type | Description |
---|---|---|
property | string | Required. File property name. |
value | string | Required. Property value to match. |
in
Part of filter
configuration.
Matches if the property on the file is equal to one of the given values.
Parameter | Type | Description |
---|---|---|
in | object | Required. In filter. |
in
Part of in
configuration.
In filter.
Parameter | Type | Description |
---|---|---|
property | string | Required. File property name. |
values | list | Property values. One of these must match. |
values
Part of in
configuration.
Property values. One of these must match.
Each element of this list should be a string.
delete-behavior
Part of files configuration.
What the extractor should do with files that were removed from the source.
Parameter | Type | Description |
---|---|---|
mode | either soft or hard | Configure how deleted files are treated. soft means that a metadata field, named by key, is added to the file in CDF. hard means that the file is deleted from CDF. |
key | string | Metadata field to add to the deleted file in CDF. Default value is deleted. |
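The fragment below is a small sketch of a soft delete setup, assuming delete-behavior sits directly under files as the note above indicates. The key value is a hypothetical field name chosen for illustration.

```yaml
files:
  delete-behavior:
    # soft keeps the file in CDF and marks it with a metadata field.
    mode: soft
    # Hypothetical metadata field name; omit this line to use the default, deleted.
    key: removed-from-source
```

Setting mode to hard instead removes the file from CDF, in which case key has nothing to mark.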
extractor
Global parameter.
General configuration for the file extractor.
Parameter | Type | Description |
---|---|---|
state-store | object | Include the state store section to save extraction states between runs. Use this if data is loaded incrementally. We support multiple state stores, but you can only configure one at a time. |
errors-threshold | integer | Maximum number of retries for fallible operations in the extractor. Default value is 5. |
upload-queue-size | integer | Maximum number of files in the upload queue at a time. Default value is 10. |
parallelism | integer | Maximum number of files to upload to CDF in parallel. Note that files are streamed directly from the source, so this is also the number of parallel downloads. Default value is 4. |
dry-run | boolean | Run the extractor in dry run mode. If set to true, nothing will be uploaded to CDF and no states will be stored. This means that we will load file metadata from the source, but not download any files. |
schedule | configuration for either Cron Expression or Interval | File extractor schedule. Examples: {'type': 'cron', 'expression': '*/30 * * * *'} {'type': 'interval', 'expression': '10m'} |
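To show how these parameters fit together, here is a sketch of an extractor section for a trial run. The values for parallelism and the schedule are illustrative choices, not recommendations from this documentation.

```yaml
extractor:
  # Retry fallible operations up to the documented default of 5 times.
  errors-threshold: 5
  # Illustrative value: upload (and therefore download) up to 8 files at once.
  parallelism: 8
  # Dry run: file metadata is read from the source, but nothing is
  # uploaded to CDF and no states are stored.
  dry-run: true
  # Run every 10 minutes using the interval schedule type.
  schedule:
    type: interval
    expression: 10m
```

Removing dry-run, or setting it to false, turns the same configuration into a live run.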
state-store
Part of extractor configuration.
Include the state store section to save extraction states between runs. Use this if data is loaded incrementally. We support multiple state stores, but you can only configure one at a time.
Parameter | Type | Description |
---|---|---|
raw | object | A RAW state store stores the extraction state in a table in CDF RAW. |
local | object | A local state store stores the extraction state in a JSON file on the local machine. |
raw
Part of state-store configuration.
A RAW state store stores the extraction state in a table in CDF RAW.
Parameter | Type | Description |
---|---|---|
database | string | Required. Enter the database name in CDF RAW. |
table | string | Required. Enter the table name in CDF RAW. |
upload-interval | integer | Enter the interval in seconds between each upload to CDF RAW. Default value is 30. |
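A RAW state store might be configured as in the sketch below. The database and table names are hypothetical and should be replaced with names from your own CDF project.

```yaml
extractor:
  state-store:
    raw:
      # Hypothetical CDF RAW database and table names.
      database: extractor-states
      table: file-extractor
      # Upload the state to CDF RAW every 30 seconds (the default).
      upload-interval: 30
```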
local
Part of state-store configuration.
A local state store stores the extraction state in a JSON file on the local machine.
Parameter | Type | Description |
---|---|---|
path | string | Required. Insert the file path to a JSON file. |
save-interval | integer | Enter the interval in seconds between each save. Default value is 30. |
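The local alternative looks similar. This sketch assumes a hypothetical relative path; since only one state store can be configured at a time, it leaves out the raw section.

```yaml
extractor:
  state-store:
    local:
      # Hypothetical path to the JSON file that holds the extraction state.
      path: ./file-extractor-states.json
      # Save the state to disk every 30 seconds (the default).
      save-interval: 30
```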
schedule
Part of extractor configuration.
File extractor schedule.
One of the following options: cron_expression, interval.
Examples:
type: cron
expression: '*/30 * * * *'

type: interval
expression: 10m
cron_expression
Part of schedule configuration.
Parameter | Type | Description |
---|---|---|
type | string | Set to cron to select this option. |
expression | string | Required. Cron expression schedule. Example: */30 * * * * |
interval
Part of schedule configuration.
Parameter | Type | Description |
---|---|---|
type | string | Set to interval to select this option. |
expression | string | Required. Fixed time interval. In the form N(s|m|h|d). Examples: 10m, 3h |
metrics
Global parameter.
The metrics section describes where to send metrics on extractor performance for remote monitoring of the extractor. We recommend sending metrics to a Prometheus pushgateway, but you can also send metrics as time series in the CDF project.
Parameter | Type | Description |
---|---|---|
push-gateways | list | List of Prometheus pushgateway configurations. |
cognite | object | Push metrics to CDF time series. Requires CDF credentials to be configured. |
server | object | The extractor can also be configured to expose an HTTP server with Prometheus metrics for scraping. |
push-gateways
Part of metrics configuration.
List of Prometheus pushgateway configurations.
The push-gateways section contains a list of metric destinations. Each element of this list is an object with the following parameters:
Parameter | Type | Description |
---|---|---|
host | string | Enter the address of the host to push metrics to. |
job-name | string | Enter the value of the exported_job label to associate metrics with. This separates several deployments on a single pushgateway, and should be unique. |
username | string | Enter the username for the pushgateway. |
password | string | Enter the password for the pushgateway. |
clear-after | either null or integer | Enter the number of seconds to wait before clearing the pushgateway. When this parameter is present, the extractor will stall after the run is complete before deleting all metrics from the pushgateway. The recommended value is at least twice that of the scrape interval on the pushgateway. This is to ensure that the last metrics are gathered before the deletion. Default is disabled. |
push-interval | integer | Enter the interval in seconds between each push. Default value is 30. |
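The sketch below shows one pushgateway destination. The host address, job name, and environment variable names are all hypothetical, and writing the host as a full URL is an assumption. The clear-after value assumes a 30-second scrape interval, following the advice to use at least twice that interval.

```yaml
metrics:
  push-gateways:
    # Hypothetical pushgateway address; the URL form is an assumption.
    - host: http://localhost:9091
      # Hypothetical job name; should be unique per deployment.
      job-name: file-extractor-site-a
      # Credentials read from hypothetical environment variables.
      username: ${PUSHGATEWAY_USERNAME}
      password: ${PUSHGATEWAY_PASSWORD}
      # Wait 60 seconds after the run before clearing the pushgateway
      # (twice an assumed 30-second scrape interval).
      clear-after: 60
      # Push every 30 seconds (the default).
      push-interval: 30
```

Because push-gateways is a list, further destinations can be added as additional elements.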
cognite
Part of metrics configuration.
Push metrics to CDF time series. Requires CDF credentials to be configured.
Parameter | Type | Description |
---|---|---|
external-id-prefix | string | Required. Prefix on external ID used when creating CDF time series to store metrics. |
asset-name | string | Enter the name for a CDF asset that will have all the metrics time series attached to it. |
asset-external-id | string | Enter the external ID for a CDF asset that will have all the metrics time series attached to it. |
push-interval | integer | Enter the interval in seconds between each push to CDF. Default value is 30. |
data-set | object | Data set the metrics will be created under. |
data-set
Part of cognite configuration.
Data set the metrics will be created under.
Parameter | Type | Description |
---|---|---|
id | integer | Resource internal ID. |
external-id | string | Resource external ID. |
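Pushing metrics to CDF could be configured roughly as follows. Every name here (prefix, asset, and data set external ID) is hypothetical, and the sketch identifies the data set by external-id only.

```yaml
metrics:
  cognite:
    # Hypothetical prefix for the external IDs of the metrics time series.
    external-id-prefix: file-extractor-metrics-
    # Hypothetical asset that the metrics time series are attached to.
    asset-name: File extractor metrics
    asset-external-id: file-extractor-metrics
    # Push to CDF every 30 seconds (the default).
    push-interval: 30
    data-set:
      # Hypothetical data set external ID.
      external-id: extractor-monitoring
```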
server
Part of metrics configuration.
The extractor can also be configured to expose an HTTP server with Prometheus metrics for scraping.
Parameter | Type | Description |
---|---|---|
host | string | Host to run the Prometheus server on. Default value is 0.0.0.0. |
port | integer | Local port to expose the Prometheus server on. Default value is 9000. |
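For scraping rather than pushing, a server section along these lines may be enough. Binding to 127.0.0.1 is an illustrative choice that restricts access to the local machine; it is not a documented recommendation.

```yaml
metrics:
  server:
    # Illustrative: listen on loopback only instead of the default 0.0.0.0.
    host: 127.0.0.1
    # Default port.
    port: 9000
```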