Parsing documents

The features described in this section are in public preview and may change. They are currently available only to customers in our Early Adopter program. For more information and to sign up, visit the Early Adopter group on the Cognite Hub.

This procedure is for data engineers who populate data model views from key-value PDFs. You should already have source documents that meet Input requirements and have completed the tasks in Before you start. Each extracted property value includes a confidence score to help you decide which values to review and correct before you approve the parsing job. Approved data is ingested into data model instances.

Input requirements

The input for the document parsing job must meet these criteria:

PDF documents with English text and up to 100 pages. Smaller files usually give better results.
Embedded text or scanned documents.
Documents that describe a single asset or piece of equipment.
Key-value pair data representation.
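
The measurable criteria above (file type, page count, language) can be screened before you create a parsing task. The following is a minimal sketch, not part of CDF: `DocumentInfo` and `input_problems` are hypothetical names, and the metadata values are assumed to come from your own tooling.

```python
from dataclasses import dataclass

MAX_PAGES = 100  # page limit stated in the input requirements


@dataclass
class DocumentInfo:
    """Hypothetical metadata record for one source file; not a CDF type."""
    name: str
    mime_type: str
    page_count: int
    language: str  # language code, for example "en"


def input_problems(doc: DocumentInfo) -> list[str]:
    """Return the reasons a document fails the stated input requirements."""
    problems = []
    if doc.mime_type != "application/pdf":
        problems.append("not a PDF")
    if doc.page_count > MAX_PAGES:
        problems.append(f"{doc.page_count} pages exceeds the {MAX_PAGES}-page limit")
    if doc.language != "en":
        problems.append("text is not English")
    return problems


# Illustrative sample data only.
docs = [
    DocumentInfo("pump-datasheet.pdf", "application/pdf", 4, "en"),
    DocumentInfo("plant-manual.pdf", "application/pdf", 240, "en"),
]
for doc in docs:
    print(doc.name, input_problems(doc) or "ok")
```

The sketch does not check whether a document describes a single asset or uses a key-value layout; those criteria still need a manual look.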

Before you start

Ingest the documents into CDF.
Set up access capabilities.
Create a view in a data model with properties that reflect the key-value data.

For better confidence scores when parsing, align property names with how fields appear in your documents. For broader guidance on data models for AI (including document parsing and search), see Optimizing data models for AI.
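
To spot property names that may align poorly with your documents before you run a parsing job, you can compare them with the field labels by plain string similarity. The sketch below uses Python's standard `difflib`. It is only a rough proxy and makes no claim about how the parser computes confidence scores. The property names, field labels, and the 0.8 threshold are illustrative assumptions.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and drop non-alphanumeric characters before comparing."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def best_match(prop: str, field_labels: list[str]) -> tuple[str, float]:
    """Return the closest document field label and a 0..1 similarity ratio."""
    scored = [
        (label, SequenceMatcher(None, normalize(prop), normalize(label)).ratio())
        for label in field_labels
    ]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical view properties and field labels as they appear in a document.
view_properties = ["designPressure", "mfr", "serialNumber"]
field_labels = ["Design Pressure", "Manufacturer", "Serial Number"]

for prop in view_properties:
    label, ratio = best_match(prop, field_labels)
    flag = "" if ratio >= 0.8 else "  <- consider renaming"  # assumed threshold
    print(f"{prop:15} -> {label:16} {ratio:.2f}{flag}")
```

In this sample, the abbreviated property `mfr` scores well below the threshold against `Manufacturer`, which is the kind of mismatch that renaming the property is meant to remove.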

Parse documents

Navigate to document parsing

Navigate to Data fusion > Contextualize > Document parsing.

Create parsing task

Select Create parsing task, and then select the documents you want to parse.

You can parse several documents at the same time, but data from each document is ingested into a separate data model view.

Continue to view selection

Select Next to continue.

Select views and run

Select the views you want to populate with parsed data, and then select Run.

Review the parsed data

Review the parsed data.

Select a property in the Parsed data sidebar to zoom into a field in the document.
Hover over a field to update the value.
Enable Confidence score to view the confidence level for each extracted property. Use the confidence score to decide how much to trust each value and where to focus your review before you approve:

Score range	Interpretation
High (for example, 80–100%)	Strong match. The extracted value is likely correct; spot-check if needed.
Medium (for example, 50–80%)	Moderate match. Review the value and the document to confirm it maps to the right property.
Low (for example, below 50%)	Weak match. The field name in the document may differ from the property name. Verify or correct the value before approving.

Exact score bands may vary. Focus review on lower-confidence properties; properties with higher scores usually require less review and can be trusted for automated workflows.
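
If you track scores outside the UI, the review order described above can be sketched as a simple triage. The thresholds reuse the example bands from the table. Treating a score of exactly 80 as high is an assumption, and `band` and `review_queue` are hypothetical helpers, not part of any Cognite API.

```python
def band(score: float) -> str:
    """Map a 0-100 confidence score to the example bands from the table."""
    if score >= 80:  # assumption: the shared boundary value counts as high
        return "high"
    if score >= 50:
        return "medium"
    return "low"


def review_queue(parsed: dict[str, float]) -> list[tuple[str, float, str]]:
    """Order properties so the lowest-confidence values are reviewed first."""
    ordered = sorted(parsed.items(), key=lambda item: item[1])
    return [(name, score, band(score)) for name, score in ordered]


# Illustrative scores only; not output from a real parsing job.
parsed = {"designPressure": 93.0, "manufacturer": 47.5, "serialNumber": 71.0}
for name, score, label in review_queue(parsed):
    print(f"{label:6} {score:5.1f}%  {name}")
```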

If many properties show low scores, your view property names may not align with the field names in the document, for example because of abbreviations, different wording, or spelling differences. Rename properties in the view so they resemble the field names more closely, then run parsing again.

Approve or reject parsing

Approve or reject the parsing job. The approved data is stored as a data model instance.
