Configure the Documentum extractor

Deprecated

We are deprecating the Documentum extractor in favor of the File extractor. We strongly encourage you to adopt the File extractor as soon as possible.

To configure the Documentum extractor, you must create a configuration file. The file must be in YAML format.

Naming the configuration file

You must name the configuration file config.yml.

Tip

You can set up extraction pipelines to use versioned extractor configuration files stored in the cloud.

The configuration file has a global parameter version, which holds the version of the configuration schema used in the configuration file. This document describes version 1 of the configuration schema.

You can use environment variable substitution in the configuration file. Values wrapped in ${} are replaced with the value of the environment variable of that name. For example, ${COGNITE_PROJECT} is replaced with the value of the environment variable COGNITE_PROJECT.

cognite:
  project: ${COGNITE_PROJECT}
  idp-authentication:
    tenant: ${COGNITE_TENANT_ID}
    client-id: ${COGNITE_CLIENT_ID}
    secret: ${COGNITE_CLIENT_SECRET}
    scopes:
      - ${COGNITE_SCOPE}

Logger

Include the logger section to set up logging to a console and to files.

| Parameter | Description |
| --- | --- |
| console | Enable logging to standard output, such as a terminal window. See the Console section. |
| file | Enable logging to a file. See the File section. |

Console

Include the console subsection to log events to standard output, such as a terminal window. This subsection is optional. If level has an invalid value, no logs are sent to the console.

| Parameter | Description |
| --- | --- |
| level | Select the verbosity level for console logging. Valid options, in decreasing levels of verbosity, are trace, debug, info, warning, error, fatal, off. The default value is info. |

File

Include the file subsection to log events to a file. This subsection is optional. If level has an invalid value, no logs are sent to the file.

| Parameter | Description |
| --- | --- |
| level | Select the verbosity level for file logging. Valid options, in decreasing levels of verbosity, are trace, debug, info, warning, error, fatal, off. |
| path | Insert the path to the log file. |
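As an illustration of how the console and file subsections fit together, a logger section might look like the sketch below. The log file path is a hypothetical value; the keys and levels are the ones listed in the tables above.

```yaml
logger:
  console:
    # Log info and more severe events to the terminal
    level: info
  file:
    # Log more detail to a file
    level: debug
    # Hypothetical path, adjust to your environment
    path: logs/documentum-extractor.log
```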

Cognite

Include the cognite section to configure which CDF project the extractor will load data into and how to connect to the project. This section is mandatory and should always contain the project and authentication configuration.

| Parameter | Description |
| --- | --- |
| project | Insert the CDF project name you want to ingest data into. This is a required value. |
| idp-authentication | Insert the credentials for authenticating to CDF using an external identity provider. You must enter either an API key or use IdP authentication. |
| host | Insert the base URL of the CDF project. The default value is https://api.cognitedata.com. |
| external-id-prefix | Enter the external ID prefix to identify the documents in CDF. Leave empty for no prefix. See also External IDs. |
| source | Enter the source of the external ID. The default value is documentum. |
| data-set-id | Specify the data set ID to assign to CDF Files. |
| security-categories | Insert a list of internal IDs for security categories added to CDF Files. |
| extraction-pipeline | Insert the external ID of an extraction pipeline in CDF. You should create the extraction pipeline before you configure this section. |
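The optional parameters above can be sketched as follows. The prefix, data set ID, and security category ID are hypothetical placeholder values, and the idp-authentication subsection is left out of this fragment for brevity even though authentication is required in a real configuration.

```yaml
cognite:
  project: ${COGNITE_PROJECT}
  # Default base URL, shown explicitly
  host: https://api.cognitedata.com
  # Hypothetical prefix for external IDs in CDF
  external-id-prefix: "documentum:"
  # Default source value
  source: documentum
  # Hypothetical internal IDs, replace with IDs from your project
  data-set-id: 1234567890123456
  security-categories:
    - 6543210987654321
```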

Identity provider (IdP) authentication

Include the idp-authentication subsection to enable the extractor to authenticate to CDF using an external identity provider, such as Azure AD.

| Parameter | Description |
| --- | --- |
| client-id | Enter the client ID from the IdP. This is a required value. |
| secret | Enter the client secret from the IdP. This is a required value. |
| scopes | List the scopes. This is a required value. |
| tenant | Enter the Azure tenant. This is a required value. |
| authority | Insert the base URL of the authority. The default value is https://login.microsoftonline.com. |
| min-ttl | Insert the minimum time in seconds a token will be valid. The cached token is refreshed if it expires in less than min-ttl seconds. The default value is 30. |
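A minimal sketch of the subsection with the optional authority and min-ttl parameters set explicitly. The min-ttl value of 60 is only an illustrative choice, and the environment variable names follow the earlier example in this document.

```yaml
cognite:
  project: ${COGNITE_PROJECT}
  idp-authentication:
    tenant: ${COGNITE_TENANT_ID}
    client-id: ${COGNITE_CLIENT_ID}
    secret: ${COGNITE_CLIENT_SECRET}
    scopes:
      - ${COGNITE_SCOPE}
    # Default authority, shown explicitly
    authority: https://login.microsoftonline.com
    # Refresh the cached token when it expires in less than 60 seconds
    min-ttl: 60
```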

Extractor

The extractor section contains settings that control how the extractor runs, such as temporary storage, deletion, parallelism, and sync mode.

| Parameter | Description |
| --- | --- |
| tmp-folder | Insert a folder where the extractor places temporary files. The default value is data/files relative to the working directory. |
| keep-files | Set to true to keep temporary files after processing. The default value is false, which means temporary files are deleted. |
| upload | Set to false to run the extractor in dry-run mode where files are accessed and processed, but no changes are made in CDF. The default value is true. |
| delete | Set to true to delete files from CDF. There are two triggers for deleting documents: (1) A document with the same source and external ID prefix as this extractor exists in CDF but is absent from the query result. This trigger requires sync-mode to be set to full. (2) A document has the configured soft-delete-key in its metadata with a value equal to one of the configured soft-delete-values. Such a document is considered voided and is deleted from CDF. |
| delete-threshold | Insert a ratio between 0 and 1 that caps how much can be deleted in a single run. For full sync mode, the ratio is measured against the size of CDF Files. For quick sync mode, it is measured against the size of the current extraction. The default value is 1, indicating no threshold. |
| threads | Enter the number of documents to process in parallel. Note that this isn't the number of connections to CDF or Documentum. The default value is 10. |
| sync-mode | Set the synchronization mode. Options are full or quick. Full sync is typically faster for many files, while quick sync is typically faster for a smaller number of files. The default value is full. See Sync data modes. |
| quick-sync-interval | Enter the number of hours to go back for a quick sync. For instance, if you set this value to 24, only the documents changed during the last day are included. The default value is 24. |
| dump-json-file | Enter the name of a file to dump this extraction to. This is used to activate a JSON dump. The JSON dump is only intended for debugging purposes and will use a lot of RAM. Don't use this parameter for extractions where you expect over ~50k documents. The default value is no dump. |
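To make the interaction between deletion and sync mode concrete, here is a sketch of an extractor section that runs a full sync with deletion turned on and a threshold applied. The threshold of 0.1 is an illustrative value, not a recommendation.

```yaml
extractor:
  tmp-folder: data/files
  keep-files: false
  # Set to false for a dry run that makes no changes in CDF
  upload: true
  # Deleting documents that are absent from the query requires full sync
  delete: true
  sync-mode: full
  # Illustrative ratio: at most 10 % relative to the size of CDF Files
  delete-threshold: 0.1
  # Number of documents processed in parallel
  threads: 10
```

For a quick sync, you would instead set sync-mode to quick and use quick-sync-interval to choose how many hours to look back. In that mode, only the soft-delete trigger applies.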

Metrics

Include the metrics section to send metrics about the extractor performance for remote monitoring of the extractor. This section is optional.

Pushgateways

Include the push-gateways subsection to describe an array of Prometheus Pushgateway destinations to which the extractor will push metrics. This subsection is optional.

| Parameter | Description |
| --- | --- |
| host | Insert the absolute URL of the Pushgateway host. Example: http://localhost:9091. If you are using Cognite's Pushgateway, this is https://prometheus-push.cognite.ai/. The default value is null/empty. |
| job-name | Enter the value of the exported_job label you want to associate with metrics. |
| username | Enter the user name in the Pushgateway. The default value is null/empty. |
| password | Enter the password. The default value is null/empty. |
| push-interval | Enter the interval in seconds between each push. The default value is 30. |

If you configure this section, the extractor pushes metrics that you can display in, for instance, Grafana. Create Grafana dashboards with the extractor metrics using Prometheus as the data source.
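Because push-gateways is an array, each destination is a list item. The sketch below assumes a single local Pushgateway; the job name and the environment variable names for the credentials are hypothetical.

```yaml
metrics:
  push-gateways:
    - host: http://localhost:9091
      # Hypothetical value for the exported_job label
      job-name: documentum-extractor
      # Hypothetical environment variables holding the credentials
      username: ${PUSHGATEWAY_USERNAME}
      password: ${PUSHGATEWAY_PASSWORD}
      # Push every 30 seconds (the default)
      push-interval: 30
```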

Documentum

Include the documentum section to configure the queries, sync mode, and access to Documentum.

If you're connecting via the Documentum Foundation Classes (DFC) Java SDK, you don't need to enter username, password, host, timeout, and retries since the extractor reads these values from dfc.properties.

| Parameter | Description |
| --- | --- |
| mode | Enter how the extractor connects to Documentum. This is either via the D2 REST API (recommended) or the DFC Java SDK. The default value is D2. |
| query | Enter a Documentum Query Language (DQL) query to run on the Documentum server. This is a required value. |
| metadata-properties | Insert the fields in a document's metadata that contain information important to the extractor. See the Metadata properties section. This is a required value. |
| username | Enter the username for authenticating to D2. This is a required value for D2 extractions. |
| password | Enter the password for authenticating to D2. This is a required value for D2 extractions. |
| host | Insert the base URL of the D2 repository. This is a required value for D2 extractions. |
| timeout | Specify the timeout in seconds for HTTP requests on D2. The default value is 60. |
| retries | Specify the number of retries for failed requests before stopping the extractor. The default value is 5. |
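A sketch of a documentum section for a D2 extraction, under several assumptions: the host URL, the environment variable names, and the folder path in the DQL query are hypothetical and must be replaced with values from your own repository.

```yaml
documentum:
  mode: D2
  # Hypothetical base URL of the D2 repository
  host: https://d2.example.com/d2-rest
  # Hypothetical environment variables holding the D2 credentials
  username: ${DOCUMENTUM_USERNAME}
  password: ${DOCUMENTUM_PASSWORD}
  timeout: 60
  retries: 5
  # Illustrative DQL query: all documents below a hypothetical cabinet
  query: "SELECT * FROM dm_document WHERE FOLDER('/Engineering', DESCEND)"
  metadata-properties:
    object-id: i_chronicle_id
```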

Metadata properties

Include the metadata-properties subsection to describe where the extractor will look for information in a document's metadata.

| Parameter | Description |
| --- | --- |
| file-type-short | Enter a shortened file type. This is the file name suffix, such as pdf. The default value is dos_extension. |
| file-type-full | Enter a full MIME type. This is the full file type, such as application/pdf. The default value is mime_type. |
| soft-delete-key | Include this parameter to turn on detection of soft-deletion. Enter the metadata field that indicates a deleted document. The default value is no value. |
| soft-delete-values | Insert values that trigger a deletion when this value and soft-delete-key form a key-value pair in the files' metadata. Values are case-sensitive. The default value is an empty list. |
| object-id | This is the file ID for a document that tracks changes and generates external IDs in CDF. This value should be unique across the repository and stay the same when a document changes. The recommended value is i_chronicle_id. |
| modify-date | This is the time when a document was last changed. Use this parameter to track changes. |
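The parameters above might be combined as in the following sketch. The modify-date field, the soft-delete key, and the soft-delete value are assumptions chosen for illustration; use the attribute names and values that your repository actually stores.

```yaml
documentum:
  metadata-properties:
    # Defaults, shown explicitly
    file-type-short: dos_extension
    file-type-full: mime_type
    # Recommended value for a stable document ID
    object-id: i_chronicle_id
    # Assumed attribute holding the last-modified time
    modify-date: r_modify_date
    # Hypothetical soft-delete setup: documents whose a_status equals
    # "Voided" (case-sensitive) are treated as deleted
    soft-delete-key: a_status
    soft-delete-values:
      - Voided
```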