# About document parsing

> Learn how the document parser extracts structured data from documents into data model views, and how the confidence score helps you verify the results.

<Warning>
  The features described in this section are in [public preview](/cdf/product_feature_status#public-preview) and may change. The features are currently only available to customers via our **Early Adopter** program. For more information and to sign up, visit the [Early Adopter group](https://hub.cognite.com/groups) on the [Cognite Hub](https://hub.cognite.com).
</Warning>

Document parsing extracts data from documents that present information as key-value pairs and writes it to [data model views](/cdf/dm/dm_concepts#view) in Cognite Data Fusion (CDF). Each key-value pair in the document maps to a property in the data model view. You can modify and verify the extracted data before you approve the parsing job. Approved data is ingested into [data model instances](/cdf/dm/dm_concepts#instance) and becomes available for users to explore and analyze, for example in asset and equipment monitoring.

## Assess extraction accuracy

The document parser assigns a **confidence score** to each extracted property value. The score appears as a **percentage** (for example, **60%**). See the [score ranges in Parsing documents](/cdf/integration/guides/contextualization/parse_documents#review-the-parsed-data) for how to interpret it.

### What the score measures

The score compares two strings: the **property name** from your data model view and the **field name** as it appears in the document, that is, the label or caption next to a value, such as "Design Pressure" or "Manufacturer". A higher score means a closer match between the two names. Because the score measures how well the names match, not the value itself, a high score does not guarantee that the extracted value is correct.

### How the score is calculated

The parser uses a **hybrid** calculation: it first scores how alike the two strings look (**syntactic similarity**). Only if that score falls below a threshold does it also score whether the strings mean the same thing (**semantic similarity**).

**Syntactic similarity** compares form: characters, spelling, and string structure. The document parser computes a similarity ratio by **finding the longest contiguous character sequences that match in both strings, then applying the same idea to the leftover parts**. Identical or near-identical strings score high, while different spelling or wording scores lower. For example, `Design Pressure` vs `design pressure` gives a high score, while `Design Pressure` vs `DP` gives a low score.
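The matching approach described above resembles the one used by `difflib.SequenceMatcher` in Python's standard library. The following is a minimal sketch under that assumption; the function name `syntactic_similarity` is hypothetical, and the parser's actual implementation, including any normalization it applies, may differ.

```python
from difflib import SequenceMatcher


def syntactic_similarity(property_name: str, field_name: str) -> float:
    """Return a similarity ratio between 0.0 and 1.0 for two strings.

    SequenceMatcher finds the longest contiguous matching block, then
    repeats the search on the pieces to the left and right of that block.
    The ratio is 2 * (matched characters) / (total characters in both strings).
    """
    # No case folding or other normalization is applied here (an assumption).
    return SequenceMatcher(None, property_name, field_name).ratio()


# Strings that differ only in capitalization still share long blocks.
near_match = syntactic_similarity("Design Pressure", "design pressure")

# An abbreviation shares only isolated characters with the full name.
abbreviation = syntactic_similarity("Design Pressure", "DP")

print(f"near match: {near_match:.0%}, abbreviation: {abbreviation:.0%}")
```

In this sketch, the near-identical pair scores well above the abbreviation pair, which mirrors the behavior described for the examples above.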

**Semantic similarity (optional)** compares **meaning**, not spelling or layout: two strings can match when they describe the same concept, even if the words or formatting differ. The parser runs this step only when the syntactic score falls below the threshold. It may turn each string into a **vector embedding** (a numeric representation of meaning) and measure how close those vectors are. For example, `Design pressure (bar)` vs `designPressure` can score high semantically even though the two strings look different.
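The hybrid flow can be sketched as follows. This is an illustration only: the threshold value, the use of cosine similarity to compare embeddings, the rule for combining the two scores, and the names `hybrid_score`, `toy_embed`, and `TOY_EMBEDDINGS` are all assumptions, and the vectors are hand-written placeholders, not output from a real embedding model.

```python
import math
from difflib import SequenceMatcher
from typing import Callable, Sequence

SYNTACTIC_THRESHOLD = 0.8  # hypothetical value; the real threshold is not documented here


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def hybrid_score(
    property_name: str,
    field_name: str,
    embed: Callable[[str], Sequence[float]],
    threshold: float = SYNTACTIC_THRESHOLD,
) -> float:
    """Score syntactically first; fall back to semantics only for low scores."""
    syntactic = SequenceMatcher(None, property_name, field_name).ratio()
    if syntactic >= threshold:
        return syntactic  # the strings already look alike, so skip the embedding step
    semantic = cosine_similarity(embed(property_name), embed(field_name))
    # Assumption: keep the higher of the two scores.
    return max(syntactic, semantic)


# Placeholder vectors standing in for a real embedding model.
TOY_EMBEDDINGS = {
    "Design pressure (bar)": [0.9, 0.1, 0.0],
    "designPressure": [0.8, 0.2, 0.0],
}


def toy_embed(text: str) -> Sequence[float]:
    return TOY_EMBEDDINGS[text]


score = hybrid_score("Design pressure (bar)", "designPressure", toy_embed)
print(f"{score:.0%}")
```

Because the semantic step runs only for pairs that score below the threshold, identical or near-identical names never reach the embedding step in this sketch.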

## Next steps

* [Parse documents](/cdf/integration/guides/contextualization/parse_documents) – Step-by-step procedure to parse documents and ingest data into data model views
