Cognite Documentum extractor
The Cognite File extractor can now connect to OpenText Documentum and OpenText Documentum D2. We encourage users to update to the File extractor, and we'll discontinue downloads of the Documentum extractor. Already installed Documentum extractors will continue to run.
The Cognite Documentum extractor connects to OpenText Documentum and OpenText Documentum D2 and extracts documents in batches to the Cognite Data Fusion (CDF) files service. Metadata goes into the CDF staging area (RAW).
The extractor runs a configurable Documentum Query Language (DQL) query on the Documentum server, processes the output, and then exits.
Sync data modes
You can run the Cognite Documentum extractor in full sync mode or quick sync mode using the sync-mode configuration parameter.
- Full sync queries the entire document database and processes every returned document. The extractor skips documents that already exist in CDF with an equal or newer document version. Full sync also detects hard deletes: if a previously extracted document isn't in the latest query result and you've set the delete configuration parameter to true, the extractor deletes the document from CDF. Full sync starts by creating a local cache from CDF containing existing external IDs and modification times, whereas quick sync looks up the external IDs and modification times on demand. A full sync is considerably faster when extracting a large set of documents.
- Quick sync appends a WHERE clause to the DQL query and looks for documents that are newer than a defined threshold. You define the threshold as an interval in the quick-sync-interval configuration parameter. For instance, if you set this parameter to 4, the extractor only looks for documents that have changed during the last 4 hours.
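The quick sync threshold can be illustrated with a short Python sketch. It is not the extractor's implementation: the function name quick_sync_query, the r_modify_date attribute, and the DATE(...) pattern are assumptions chosen for illustration, and the sketch assumes the base query has no WHERE clause of its own.

```python
from datetime import datetime, timedelta


def quick_sync_query(base_query: str, interval_hours: int, now: datetime) -> str:
    """Append a modification-time filter to a DQL query (illustrative only)."""
    # The threshold is "now" minus the configured interval in hours.
    threshold = now - timedelta(hours=interval_hours)
    # Assumption: r_modify_date holds the modification time and the threshold
    # is written as a DQL DATE literal; the real clause may look different.
    stamp = threshold.strftime("%m/%d/%Y %H:%M:%S")
    condition = f"r_modify_date > DATE('{stamp}', 'mm/dd/yyyy hh:mi:ss')"
    return f"{base_query} WHERE {condition}"
```

With an interval of 4 and a run at 12:00, the generated filter only matches documents modified after 08:00 the same day.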
Run a full sync when you set up the extractor or update a set of documents that haven't been updated in CDF for a while. Run quick syncs regularly via cron or the Windows Task Scheduler.
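The full sync decisions described above (skip documents that are already up to date, and optionally remove hard-deleted ones) can be sketched as a comparison between the query result and the local cache. This is a simplified model under stated assumptions: plan_full_sync is a hypothetical helper, and both inputs are assumed to map external IDs to modification times.

```python
from datetime import datetime


def plan_full_sync(source: dict, cache: dict, delete: bool = False):
    """Return (external IDs to upload, external IDs to delete).

    source: external ID -> modification time from the latest DQL query.
    cache:  external ID -> modification time already stored in CDF.
    """
    # Upload documents that are missing from CDF or newer in the source;
    # documents with an equal or newer copy in CDF are skipped.
    to_upload = sorted(
        ext_id
        for ext_id, modified in source.items()
        if ext_id not in cache or cache[ext_id] < modified
    )
    # Hard deletes: IDs known to CDF that the latest query no longer returns.
    # They are only acted on when the delete option is enabled.
    missing = sorted(set(cache) - set(source))
    to_delete = missing if delete else []
    return to_upload, to_delete
```

Building the cache once up front means each document is checked with a local dictionary lookup rather than a separate request.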
Extract metadata
You can configure the extractor to send document metadata to the CDF staging area (RAW) or the CDF files service. With the CDF files service, the file appears with the same metadata as in the source system. With CDF RAW, the object in CDF files only has outer-level fields, such as source, mime type, and externalID, and the metadata is sent as a row to CDF RAW.
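As a rough illustration of the RAW option, the sketch below splits one extracted document into the outer-level fields kept on the file and a row destined for RAW. The field names, the split_for_raw helper, and the use of the external ID as the row key are all assumptions made for the example, not the extractor's actual schema.

```python
# Hypothetical names for the outer-level fields that stay on the CDF file.
OUTER_FIELDS = ("source", "mime_type", "external_id")


def split_for_raw(document: dict):
    """Split a document into (file fields, RAW row) for the RAW destination."""
    # Only the outer-level fields remain on the file object.
    file_fields = {k: document[k] for k in OUTER_FIELDS if k in document}
    # Everything else becomes the columns of a single RAW row.
    # Assumption: the row is keyed by the document's external ID.
    raw_row = {
        "key": document["external_id"],
        "columns": {k: v for k, v in document.items() if k not in OUTER_FIELDS},
    }
    return file_fields, raw_row
```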
Nested metadata in Documentum D2
In Documentum D2, a document can have several versions or renderings, called Content. For example, a document can exist as a PDF, .doc, or HTML file, each saved as separate content. Each content entry is linked to a parent document. When the extractor uploads content to CDF, it gathers the metadata from the content entry. In CDF, the metadata from the content entry and the parent document entry has a content: or document: prefix.
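One way to picture the prefixes is as a flattening step that keeps content-level and document-level fields apart. The sketch below assumes the prefix is joined directly to the field name; merge_metadata and the sample field names are hypothetical.

```python
def merge_metadata(content_meta: dict, document_meta: dict) -> dict:
    """Combine content and parent-document metadata under distinct prefixes."""
    # Fields from the content entry get the "content:" prefix.
    merged = {f"content:{key}": value for key, value in content_meta.items()}
    # Fields from the parent document get the "document:" prefix, so a field
    # that exists on both levels does not overwrite itself.
    merged.update({f"document:{key}": value for key, value in document_meta.items()})
    return merged
```

The prefixes keep a field such as a title distinguishable when it exists on both the content entry and its parent document.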
External IDs
The extractor generates external IDs for CDF based on the object IDs in Documentum.
- If you're using the D2 REST API, the extractor uses the document's object ID with the content type as a suffix. This is useful if you search for all content for a D2 document in CDF files using the externalIdPrefix parameter. Specify the field that holds the Documentum file ID in the object-id configuration parameter.
- If you're using the DFC Java SDK, the extractor uses the object ID only.
It's good practice to give external IDs a deployment-specific prefix. Keeping source systems and deployments separated by unique external ID prefixes is useful for detecting deleted or changed documents. You configure this with the external-id-prefix configuration parameter.
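The naming scheme above can be sketched as follows. The build_external_id helper, the "." separator between object ID and content type, and the sample values are assumptions for illustration; the source doesn't specify the exact format the extractor produces.

```python
from typing import Optional


def build_external_id(object_id: str, prefix: str = "", content_type: Optional[str] = None) -> str:
    """Compose an external ID from a prefix, an object ID, and an optional suffix."""
    # D2 REST API case: the content type is appended as a suffix.
    # Assumption: "." is used as the separator.
    # DFC Java SDK case: no content type, so the object ID is used alone.
    base = object_id if content_type is None else f"{object_id}.{content_type}"
    # The deployment-specific prefix goes in front of every external ID.
    return f"{prefix}{base}"
```

Because every content entry of a document shares the same leading part (prefix plus object ID), a prefix search on that part returns all content for the document.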