Cognite Documentum extractor

Deprecated

The Cognite File extractor can now connect to OpenText Documentum and OpenText Documentum D2. We encourage users to update to the File extractor, and we'll discontinue downloads of the Documentum extractor. Already installed Documentum extractors will continue to run.

The Cognite Documentum extractor connects to OpenText Documentum and OpenText Documentum D2 and extracts documents in batches to the Cognite Data Fusion (CDF) files service. Metadata goes into the CDF staging area (RAW).

The extractor runs a configurable Documentum Query Language (DQL) query on the Documentum server, processes the output, and then exits.

Sync data modes

You can run the Cognite Documentum extractor in full sync mode or quick sync mode using the sync-mode configuration parameter.

  • Full sync queries the entire document database and processes every returned document. The extractor skips documents that already exist in CDF with an equal or newer document version. Full sync also detects hard deletes: if you've set the delete configuration parameter to true, a previously extracted document that isn't in the latest query result is deleted from CDF.

    Full sync starts by creating a local cache from CDF containing existing external IDs and modification times. Quick sync doesn't build this cache and queries the external IDs and modification times directly. A full sync is considerably faster when extracting a large set of documents.

  • Quick sync appends a WHERE clause to the DQL query and looks for documents that have changed after a defined threshold. You define the threshold as an interval in hours in the quick-sync-interval configuration parameter. For instance, if you set this parameter to 4, the extractor only looks for documents that have changed during the last 4 hours.
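
The two modes can be sketched in a few lines of Python. This is only an illustration of the logic described above, not the extractor's actual implementation: the function names are hypothetical, modification times stand in for document versions, the base query is assumed to have no WHERE clause yet, and the use of the r_modify_date attribute with the DQL DATE function is an assumption about how the threshold might be expressed.

```python
from datetime import datetime, timedelta, timezone


def plan_full_sync(source_docs, cdf_cache, delete=False):
    """Decide which documents to upload and which external IDs to delete.

    source_docs: {external_id: modification time} from the latest DQL query.
    cdf_cache:   {external_id: modification time} cached from CDF.
    """
    to_upload = [
        ext_id
        for ext_id, modified in source_docs.items()
        # Skip documents that CDF already holds in an equal or newer version.
        if ext_id not in cdf_cache or cdf_cache[ext_id] < modified
    ]
    # A cached ID that is missing from the query result indicates a hard delete.
    to_delete = sorted(set(cdf_cache) - set(source_docs)) if delete else []
    return to_upload, to_delete


def quick_sync_query(base_query, interval_hours, now=None):
    """Append a modification-time filter to a DQL query (hypothetical format)."""
    now = now or datetime.now(timezone.utc)
    threshold = now - timedelta(hours=interval_hours)
    stamp = threshold.strftime("%m/%d/%Y %H:%M:%S")
    # Assumption: base_query has no WHERE clause and r_modify_date is the
    # attribute that records the last change.
    return (
        f"{base_query} WHERE r_modify_date > "
        f"DATE('{stamp}', 'mm/dd/yyyy hh:mi:ss')"
    )
```

With delete left at its default of False, plan_full_sync never returns IDs to delete, which mirrors the opt-in nature of the delete configuration parameter.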

Tip

Run a full sync when you set up the extractor or update a set of documents that haven't been updated in CDF for a while. Run quick syncs regularly via cron or the Windows Task Scheduler.

Extract metadata

You can configure the extractor to send document metadata into the CDF staging area (RAW) or the CDF files service. If you send the metadata to the CDF files service, the file appears with the same metadata as in the source system. If you send the metadata to CDF RAW, the file object in CDF files only has outer-level fields, such as source, MIME type, and external ID, and the metadata is sent as a row to CDF RAW.
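
As a rough illustration of the difference between the two destinations, the sketch below splits one source document into a file object and an optional RAW row. The function name, the "files" and "raw" values, the field names, and the row layout are all hypothetical and chosen for this example only; they don't reflect the extractor's real configuration or the CDF API.

```python
def route_metadata(document, destination):
    """Split a source document into a file object and an optional RAW row.

    destination is "files" or "raw" (hypothetical values for this sketch).
    """
    outer_fields = ("external_id", "name", "source", "mime_type")
    file_object = {k: document[k] for k in outer_fields if k in document}
    metadata = dict(document.get("metadata", {}))

    if destination == "files":
        # The file carries the same metadata as in the source system.
        file_object["metadata"] = metadata
        return file_object, None
    if destination == "raw":
        # The file keeps only outer-level fields; metadata becomes a RAW row.
        raw_row = {"key": document["external_id"], "columns": metadata}
        return file_object, raw_row
    raise ValueError(f"unknown destination: {destination!r}")
```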

Nested metadata in Documentum D2

In Documentum D2, a document can have several versions or renderings, called Content. For example, a document can exist as a PDF, .doc, or HTML file, each saved as separate content. Each content entry is linked to a parent document. The metadata is gathered from the content entry when the extractor uploads content to CDF. In CDF, the parent document entry and the metadata have a content: or document: prefix.
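
One way to picture the prefixing is as a merge of two metadata dictionaries into a single flat one. The sketch below assumes the prefix is applied to each metadata key; the function name and the sample keys are hypothetical.

```python
def merge_d2_metadata(document_meta, content_meta):
    """Flatten parent-document and content metadata into one dictionary."""
    # Keys from the parent document entry get a "document:" prefix.
    merged = {f"document:{key}": value for key, value in document_meta.items()}
    # Keys from the content entry get a "content:" prefix.
    merged.update({f"content:{key}": value for key, value in content_meta.items()})
    return merged
```

Because the prefixes differ, a key that exists on both the parent document and the content entry, such as a title, is kept twice instead of one value overwriting the other.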

External IDs

The extractor generates external IDs for CDF based on the object IDs in Documentum.

  • If you're using the D2 REST API, the extractor uses the document's object ID with the content type as a suffix. This lets you find all content for a D2 document in CDF files by searching with the externalIdPrefix parameter. Specify the field that holds the Documentum file ID in the object-id configuration parameter.

  • If you're using the DFC Java SDK, the extractor uses the object ID only.

Tip

It's good practice to give external IDs a deployment-specific prefix. Keeping source systems and deployments separated by unique external ID prefixes is useful for detecting deleted or changed documents. You configure the prefix with the external-id-prefix configuration parameter.
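
The naming scheme described in this section could look roughly like the sketch below. The function names, the "_" separator between object ID and content type, and the sample IDs are assumptions made for illustration; the source doesn't specify the exact format the extractor produces.

```python
def make_external_id(object_id, prefix="", content_type=None, separator="_"):
    """Build an external ID from a Documentum object ID.

    Pass content_type for the D2 REST API case; leave it as None for the
    DFC Java SDK case, where only the object ID is used.
    """
    external_id = f"{prefix}{object_id}"
    if content_type is not None:
        # D2 REST API: the content type is appended as a suffix.
        external_id = f"{external_id}{separator}{content_type}"
    return external_id


def content_search_prefix(object_id, prefix=""):
    """Return the value to search for to match all content of one document."""
    return f"{prefix}{object_id}"
```

Since every content entry of a D2 document shares the same leading part, the value from content_search_prefix matches all of them when used as an external ID prefix filter.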