Configure the Documentum extractor
We are deprecating the Documentum extractor in favor of the File extractor. We strongly encourage you to adopt the File extractor as soon as possible.
To configure the Documentum extractor, you must create a configuration file. The file must be in YAML format.
You must name the configuration file config.yml.
You can set up extraction pipelines to use versioned extractor configuration files stored in the cloud.
The configuration file has a global parameter version, which holds the version of the configuration schema used in the configuration file. This document describes version 1 of the configuration schema.
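As a minimal sketch, a config.yml that follows this schema version could open as shown here. The comments only list the top-level sections described in this document; their contents are left out.

```yaml
# Version of the configuration schema used in this file
version: 1

# Top-level sections described in this document:
#   logger, cognite, extractor, metrics, documentum
```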
You can use substitutions with environment variables in the configuration files. Values wrapped in ${} are replaced with the value of the environment variable with that name. For example, ${COGNITE_PROJECT} is replaced with the value of the environment variable called COGNITE_PROJECT.
    cognite:
      project: ${COGNITE_PROJECT}
      idp-authentication:
        tenant: ${COGNITE_TENANT_ID}
        client-id: ${COGNITE_CLIENT_ID}
        secret: ${COGNITE_CLIENT_SECRET}
        scopes:
          - ${COGNITE_SCOPE}
Logger
Include the logger section to set up logging to a console and to files.
Parameter | Description |
---|---|
console | Enable logging to a standard output, such as a terminal window. See the Console section. |
file | Enable logging to a file. See the File section. |
Console
Include the console subsection to log events to a standard output, such as a terminal window. This subsection is optional. If level has an invalid value, no logs are sent to the console.
Parameter | Description |
---|---|
level | Select the verbosity level for console logging. Valid options, in decreasing levels of verbosity, are trace, debug, info, warning, error, fatal, off. The default value is info. |
File
Include the file subsection to log events to a file. This subsection is optional. If level has an invalid value, no logs are sent to the file.
Parameter | Description |
---|---|
level | Select the verbosity level for file logging. Valid options, in decreasing levels of verbosity, are trace, debug, info, warning, error, fatal, off. |
path | Insert the path to the log file. |
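The logger parameters above could be combined as in the following sketch. The log file path is a hypothetical example, and the chosen levels are only illustrative.

```yaml
logger:
  console:
    # Log info and more severe events to the terminal
    level: info
  file:
    # Log more verbose output to a file
    level: debug
    # Hypothetical path, adjust to your environment
    path: logs/documentum-extractor.log
```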
Cognite
Include the cognite section to configure which CDF project the extractor will load data into and how to connect to the project. This section is mandatory and should always contain the project and authentication configuration.
Parameter | Description |
---|---|
project | Insert the CDF project name you want to ingest data into. This is a required value. |
idp-authentication | Insert the credentials for authenticating to CDF using an external identity provider. You must either enter an API key or use IdP authentication. |
host | Insert the base URL of the CDF project. The default value is <https://api.cognitedata.com>. |
external-id-prefix | Enter the external ID prefix to identify the documents in CDF. Leave empty for no prefix. See also External IDs. |
source | Enter the source of the external ID. The default value is documentum. |
data-set-id | Specify the data set ID to assign to CDF Files. |
security-categories | Insert a list of internal IDs for security categories added to CDF Files. |
extraction-pipeline | Insert the external ID of an extraction pipeline in CDF. You should create the extraction pipeline before you configure this section. |
Identity provider (IdP) authentication
Include the idp-authentication subsection to enable the extractor to authenticate to CDF using an external identity provider, such as Azure AD.
Parameter | Description |
---|---|
client-id | Enter the client ID from the IdP. This is a required value. |
secret | Enter the client secret from the IdP. This is a required value. |
scopes | List the scopes. This is a required value. |
tenant | Enter the Azure tenant. This is a required value. |
authority | Insert the base URL of the authority. The default value is <https://login.microsoftonline.com>. |
min-ttl | Insert the minimum time in seconds a token will be valid. The cached token is refreshed if it expires in less than min-ttl seconds. The default value is 30. |
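A fuller cognite section, including some of the optional parameters from the two tables above, might look like the sketch below. The prefix, data set ID, and security category IDs are made-up placeholder values, and the host, source, authority, and min-ttl lines simply restate the documented defaults.

```yaml
cognite:
  project: ${COGNITE_PROJECT}
  # Documented default, shown explicitly
  host: https://api.cognitedata.com
  # Hypothetical prefix for external IDs in CDF
  external-id-prefix: "documentum:"
  source: documentum
  # Hypothetical internal IDs
  data-set-id: 1234567890
  security-categories:
    - 111111
    - 222222
  idp-authentication:
    tenant: ${COGNITE_TENANT_ID}
    client-id: ${COGNITE_CLIENT_ID}
    secret: ${COGNITE_CLIENT_SECRET}
    scopes:
      - ${COGNITE_SCOPE}
    # Documented defaults, shown explicitly
    authority: https://login.microsoftonline.com
    min-ttl: 30
```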
Extractor
The extractor section contains various configurations for the operation of the extractor.
Parameter | Description |
---|---|
tmp-folder | Insert a folder where the extractor places temporary files. The default value is data/files relative to the working directory. |
keep-files | Set to true to keep temporary files after processing. The default value is false, which means temporary files are deleted. |
upload | Set to false to run the extractor in dry-run mode where files are accessed and processed, but no changes are made in CDF. The default value is true. |
delete | Set to true to delete files from CDF. There are two triggers for deleting documents: |
delete-threshold | Insert a ratio between 0 and 1 that caps how much the extractor can delete in a single run. For full sync mode, the ratio is measured against the size of CDF Files. For quick sync mode, the ratio is measured against the size of the current extraction. The default value is 1, indicating no threshold. |
threads | Enter the number of documents to process in parallel. Note that this isn't the number of connections to CDF or Documentum. The default value is 10. |
sync-mode | Set the synchronization mode. Options are full or quick. Full sync is typically faster for many files, while quick sync is typically faster for a smaller number of files. The default value is full. See Sync data modes. |
quick-sync-interval | Enter the number of hours to go back for a quick sync. For instance, if you set this value to 24, only the documents changed during the last day are included. The default value is 24. |
dump-json-file | Enter the name of a file to dump this extraction to. This is used to activate a JSON dump. The JSON dump is only intended for debugging purposes and will use a lot of RAM. Don't use this parameter for extractions where you expect over ~50k documents. The default value is no dump. |
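As an illustration of how these parameters fit together, the following sketch configures a quick sync that looks back 24 hours and limits deletions to 10 % of the current extraction. The threshold value is an arbitrary example, not a recommendation.

```yaml
extractor:
  # Documented default location for temporary files
  tmp-folder: data/files
  keep-files: false
  # Set to false for a dry run that makes no changes in CDF
  upload: true
  delete: true
  # Example value: delete at most 10 % in a single run
  delete-threshold: 0.1
  threads: 10
  sync-mode: quick
  # Only include documents changed during the last 24 hours
  quick-sync-interval: 24
```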
Metrics
Include the metrics section to send metrics about the extractor performance for remote monitoring of the extractor. This section is optional.
Pushgateways
Include the push-gateways subsection to describe an array of Prometheus Pushgateway destinations to which the extractor will push metrics. This subsection is optional.
Parameter | Description |
---|---|
host | Insert the absolute URL of the Pushgateway host. Example: http://localhost:9091. If you are using Cognite's Pushgateway, this is https://prometheus-push.cognite.ai/. The default value is null/empty. |
job-name | Enter the value of the exported_job label you want to associate with metrics. |
username | Enter the user name in the Pushgateway. The default value is null/empty. |
password | Enter the password. The default value is null/empty. |
push-interval | Enter the interval in seconds between each push. The default value is 30. |
If you configure this section, the extractor pushes metrics that you, for instance, can display in Grafana. Create Grafana dashboards with the extractor metrics using Prometheus as the data source.
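A metrics section with a single Pushgateway destination could be sketched as follows. The job name and the two environment variable names are hypothetical; the host is the local example from the table above.

```yaml
metrics:
  push-gateways:
    # push-gateways is an array, so each destination is a list item
    - host: http://localhost:9091
      # Hypothetical value for the exported_job label
      job-name: documentum-extractor
      # Hypothetical environment variables holding the credentials
      username: ${PUSHGATEWAY_USERNAME}
      password: ${PUSHGATEWAY_PASSWORD}
      # Push every 30 seconds (documented default)
      push-interval: 30
```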
Documentum
Include the documentum section to configure the queries, sync mode, and access to Documentum.
If you're connecting via the Documentum Foundation Classes (DFC) Java SDK, you don't need to enter username, password, host, timeout, and retries since the extractor reads these values from dfc.properties.
Parameter | Description |
---|---|
mode | Enter how the extractor connects to Documentum. This is either via the D2 REST API (recommended) or the DFC Java SDK. The default value is D2. |
query | Enter a data query language (DQL) query to run on the Documentum server. This is a required value. |
metadata-properties | Insert the fields in a document's metadata that contain important information for the extractor. See the Metadata properties section. This is a required value. |
username | Enter the username for authenticating to D2. This is a required value for D2 extractions. |
password | Enter the password for authenticating to D2. This is a required value for D2 extractions. |
host | Insert the base URL of the D2 repository. This is a required value for D2 extractions. |
timeout | Specify the timeout in seconds for HTTP requests on D2. The default value is 60. |
retries | Specify the number of times to retry failed requests before stopping the extractor. The default value is 5. |
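For a D2 connection, the parameters above might be combined as in this sketch. The DQL query, the host URL, the environment variable names, and the r_modify_date field are illustrative assumptions; replace them with values that match your repository.

```yaml
documentum:
  mode: D2
  # Hypothetical DQL query, adjust to the documents you want to extract
  query: "SELECT * FROM dm_document WHERE FOLDER('/Engineering', DESCEND)"
  # Hypothetical base URL of the D2 repository
  host: https://d2.example.com
  # Hypothetical environment variables holding the D2 credentials
  username: ${DOCUMENTUM_USERNAME}
  password: ${DOCUMENTUM_PASSWORD}
  # Documented defaults, shown explicitly
  timeout: 60
  retries: 5
  metadata-properties:
    object-id: i_chronicle_id
    # Assumed metadata field for the last-changed time
    modify-date: r_modify_date
```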
Metadata properties
Include the metadata-properties section to describe where the extractor will look for information in a document's metadata.
Parameter | Description |
---|---|
file-type-short | Enter a shortened file type. This is the file name suffix, such as pdf. The default value is dos_extension. |
file-type-full | Enter a full MIME type. This is the full file type, such as application/pdf. The default value is mime_type. |
soft-delete-key | Include this parameter to turn on detection of soft-deletion. Include the metadata field that indicates a deleted document. The default value is no value. |
soft-delete-values | Insert values that trigger a deletion when this value and soft-delete-key form a key-value pair in the files' metadata. Values are case-sensitive. The default value is an empty list. |
object-id | This is the file ID for a document that tracks changes and generates external IDs in CDF. This value should be unique across the repository and stay the same when a document changes. The recommended value is i_chronicle_id. |
modify-date | This is the time when a document was last changed. Use this parameter to track changes. |
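The sketch below shows a metadata-properties block with soft-delete detection turned on, nested under the documentum section where the parameter is listed. The a_status key, its two values, and r_modify_date are assumptions chosen for illustration; the other values are the documented defaults and the recommended object ID.

```yaml
documentum:
  metadata-properties:
    # Documented defaults, shown explicitly
    file-type-short: dos_extension
    file-type-full: mime_type
    # Recommended value from the table above
    object-id: i_chronicle_id
    # Assumed metadata field for the last-changed time
    modify-date: r_modify_date
    # Hypothetical soft-delete setup: a document counts as deleted
    # when its a_status metadata field equals one of the listed values
    soft-delete-key: a_status
    soft-delete-values:
      - Deleted
      - Obsolete
```

Because the values are case-sensitive, deleted in lowercase would not trigger a deletion in this sketch.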