Cognite Documentum extractor
The Cognite Documentum extractor connects to OpenText Documentum and OpenText Documentum D2 and extracts documents in batches to the Cognite Data Fusion (CDF) files service. Metadata goes into the CDF staging area (RAW).
The extractor runs a configurable Documentum Query Language (DQL) query on the Documentum server, processes the output, and then exits.
Sync data modes
You can run the Cognite Documentum extractor in full sync mode or quick sync mode using the sync-mode configuration parameter.
Full sync queries the entire document database and processes every returned document. The extractor skips documents that already exist in CDF with an equal or newer document version. Full sync also detects hard deletes: if a previously extracted document isn't returned by the latest query, the extractor deletes it from CDF, provided you've enabled this with the delete configuration parameter.
Full sync starts by creating a local cache from CDF containing existing external IDs and modification times. Quick sync doesn't build this cache and instead queries CDF for external IDs and modification times as it processes each document. Because of the cache, a full sync is considerably faster when extracting a large set of documents.
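The full sync decisions described above can be sketched as follows. This is a minimal illustration, not the extractor's actual code: the function name plan_full_sync and its arguments are hypothetical, and it assumes the modification time is what stands in for the document version when comparing against the local cache.

```python
from datetime import datetime, timezone


def plan_full_sync(source_docs, cdf_cache, delete_enabled):
    """Decide what a full sync would upload, skip, and delete.

    source_docs: external ID -> modification time returned by the DQL query.
    cdf_cache:   external ID -> modification time cached from CDF.
    Both structures are hypothetical stand-ins for the extractor's state.
    """
    to_upload, skipped = [], []
    for external_id, source_modified in source_docs.items():
        cached_modified = cdf_cache.get(external_id)
        # Skip when CDF already holds an equal or newer version.
        if cached_modified is not None and cached_modified >= source_modified:
            skipped.append(external_id)
        else:
            to_upload.append(external_id)

    # Hard deletes: IDs in the cache that the latest query no longer returns.
    missing = sorted(set(cdf_cache) - set(source_docs))
    to_delete = missing if delete_enabled else []
    return to_upload, skipped, to_delete


t1 = datetime(2024, 1, 1, tzinfo=timezone.utc)
t2 = datetime(2024, 2, 1, tzinfo=timezone.utc)
source = {"doc-a": t1, "doc-b": t2, "doc-c": t1}
cache = {"doc-a": t1, "doc-b": t1, "doc-d": t1}

upload, skip, delete = plan_full_sync(source, cache, delete_enabled=True)
print(upload, skip, delete)
```

Here doc-a is skipped because the cached copy is just as new, doc-b and doc-c are uploaded, and doc-d is marked for deletion because the query no longer returns it.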
Quick sync appends a WHERE clause to the DQL query and looks for documents that are newer than a defined threshold. You define the threshold as an interval in the quick-sync-interval configuration parameter. For instance, if you set this parameter to 4, the extractor only looks for documents that have changed during the last 4 hours.
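As a rough illustration of how a quick sync threshold could turn into a WHERE clause, consider the sketch below. The function name add_quick_sync_filter is hypothetical, the r_modify_date attribute and the DATE(...) pattern are assumptions about the Documentum side, and the base query is assumed not to contain a WHERE clause already. The clause the extractor actually generates may differ.

```python
from datetime import datetime, timedelta, timezone


def add_quick_sync_filter(base_query, interval_hours, now=None,
                          modified_attr="r_modify_date"):
    """Append a WHERE clause limiting results to recently changed documents.

    Assumes base_query has no WHERE clause of its own, and that
    modified_attr names the modification-time attribute (an assumption).
    """
    now = now or datetime.now(timezone.utc)
    threshold = now - timedelta(hours=interval_hours)
    stamp = threshold.strftime("%m/%d/%Y %H:%M:%S")
    condition = f"{modified_attr} > DATE('{stamp}', 'mm/dd/yyyy hh:mi:ss')"
    return f"{base_query} WHERE {condition}"


# An interval of 4 evaluated at 12:00 gives a threshold of 08:00.
fixed_now = datetime(2024, 5, 1, 12, 0, 0, tzinfo=timezone.utc)
example_query = add_quick_sync_filter(
    "SELECT r_object_id FROM dm_document", 4, now=fixed_now
)
print(example_query)
```

Passing the current time as an argument keeps the sketch deterministic and easy to test.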
You can configure the extractor to send document metadata into the CDF staging area (RAW) or the CDF files service. If you choose the CDF files service, the file appears with the same metadata as in the source system. If you choose CDF RAW, the file object in CDF files only has outer-level fields, such as mime type and externalID, and the metadata is sent as a row to CDF RAW.
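The difference between the two metadata destinations can be pictured with the sketch below. The function route_metadata, the destination values "files" and "raw", and the dictionary keys are all hypothetical and chosen for illustration only. They aren't the extractor's configuration values or the CDF API payloads.

```python
def route_metadata(document, destination):
    """Split a document into a file object and an optional RAW row.

    Returns (file_object, raw_row). raw_row is None when the metadata
    stays on the file object.
    """
    # Outer-level fields that the file object carries in both cases.
    outer = {
        "external_id": document["external_id"],
        "mime_type": document["mime_type"],
    }
    metadata = dict(document["metadata"])

    if destination == "files":
        # The file keeps the same metadata as in the source system.
        return {**outer, "metadata": metadata}, None
    if destination == "raw":
        # The file keeps only outer-level fields; metadata becomes a row.
        raw_row = {"key": document["external_id"], "columns": metadata}
        return outer, raw_row
    raise ValueError(f"Unknown destination: {destination}")


doc = {
    "external_id": "doc-1",
    "mime_type": "application/pdf",
    "metadata": {"title": "Pump manual", "owner": "maintenance"},
}
file_only, no_row = route_metadata(doc, "files")
file_outer, row = route_metadata(doc, "raw")
print(file_outer, row)
```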
Nested metadata in Documentum D2
In Documentum D2, a document can have several versions or renderings, called content. For example, a document can exist as a PDF, .doc, or HTML file, each saved as separate content. Each content entry is linked to a parent document. The extractor gathers the metadata when it uploads the content to CDF. In CDF, the metadata carries a content: or document: prefix, depending on whether it comes from the content entry or the parent document entry.
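One way to read the prefixing described above is sketched below. It assumes the prefixes are applied to metadata keys, with document: marking fields from the parent document entry and content: marking fields from the content entry. The function merge_nested_metadata and the sample field names are hypothetical.

```python
def merge_nested_metadata(document_meta, content_meta):
    """Flatten parent-document and content metadata into one mapping.

    Keys are prefixed so their origin stays visible after flattening.
    """
    merged = {f"document:{key}": value for key, value in document_meta.items()}
    merged.update(
        {f"content:{key}": value for key, value in content_meta.items()}
    )
    return merged


parent = {"title": "Pump manual", "revision": "B"}
rendering = {"format": "pdf", "title": "Pump manual (PDF)"}
flattened = merge_nested_metadata(parent, rendering)
print(flattened)
```

Because of the prefixes, a field that exists on both entries, such as title in this example, doesn't collide after flattening.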
The extractor generates external IDs for CDF based on the object IDs in Documentum.
If you're using the D2 REST API, the extractor uses the document's object ID with the content type as a suffix. This is useful if you search for all content for a D2 document in CDF files using the externalIdPrefix parameter. Insert the field with the Documentum file ID in the
If you're using the DFC Java SDK, the extractor uses the object ID only.
It's good practice to give external IDs a deployment-specific prefix. Keeping source systems and deployments separated by unique external ID prefixes is useful for detecting deleted or changed documents. You configure this with the external-id-prefix configuration parameter.
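The external ID scheme described above could look roughly like the sketch below. The function build_external_id is hypothetical, the separator between the object ID and the content type is an assumption (the actual format isn't specified here), and the object ID and prefix values are made up.

```python
def build_external_id(object_id, prefix="", content_type=None, separator="_"):
    """Compose an external ID from a Documentum object ID.

    content_type is only passed for D2 REST API content. With the DFC
    Java SDK, the object ID is used on its own. The separator is an
    assumption made for this sketch.
    """
    external_id = f"{prefix}{object_id}"
    if content_type:
        external_id = f"{external_id}{separator}{content_type}"
    return external_id


object_id = "0900000180001234"  # made-up object ID
d2_pdf = build_external_id(object_id, prefix="site-a:", content_type="pdf")
d2_html = build_external_id(object_id, prefix="site-a:", content_type="html")
dfc_id = build_external_id(object_id, prefix="site-a:")
other = build_external_id("0900000180009999", prefix="site-b:", content_type="pdf")

# A prefix match finds every rendering of one D2 document, similar in
# spirit to filtering CDF files with externalIdPrefix.
wanted = build_external_id(object_id, prefix="site-a:")
matches = [eid for eid in (d2_pdf, d2_html, other) if eid.startswith(wanted)]
print(matches)
```

With a deployment-specific prefix in front of the ID, documents from another deployment, such as site-b in this example, never match the search.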