The documents API provides access to document content stored in CDF. You can search documents by metadata or content, perform semantic passage search, aggregate document properties, and retrieve document content.

Document structure

A document is a file that has been indexed by the document search engine. Every time a file is uploaded, updated, or deleted in the Files API, it is also scheduled for processing by the document search engine. Once processing completes, the file becomes searchable through the document search API.

The document search engine can extract content from a variety of document types, and can perform classification, contextualization, and other operations on the file. The extracted and derived information is made available as a Document object. A Document consists of a selection of derived fields, such as the title, author, and language of the document, plus some of the original fields from the raw file. The fields from the raw file are found in the sourceFile structure. The derived fields are described in more detail below.

Derived fields

Some document types (such as PDFs) contain additional metadata fields. If the document contains its title as part of this metadata, the title field is populated with that title.
We do not currently extract the title from the document content itself. If there is a need for this, we may consider adding such functionality in the future.
Like the title field, the author field can often be extracted from the document’s metadata.
The producer field is also taken from the document metadata. It identifies the software or system that was used to create the document.
The createdTime we assign to the document is not exactly the same as the one found in the Files API. We first try to extract the created time from the document metadata. If the document does not contain such a timestamp, we fall back to the time set in the Files API.
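The fallback for createdTime can be sketched as follows. This is a minimal illustration, not the service's implementation; the function name `resolve_created_time` and its parameters are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def resolve_created_time(
    metadata_created: Optional[datetime],
    files_api_created: datetime,
) -> datetime:
    """Prefer the timestamp embedded in the document metadata, if present."""
    if metadata_created is not None:
        return metadata_created
    # No embedded timestamp: fall back to the time set in the Files API.
    return files_api_created


# Hypothetical timestamps, for illustration only.
embedded = datetime(2020, 5, 17, tzinfo=timezone.utc)
uploaded = datetime(2023, 1, 9, tzinfo=timezone.utc)

print(resolve_created_time(embedded, uploaded))  # the embedded timestamp is used
print(resolve_created_time(None, uploaded))      # the Files API timestamp is used
```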
If a mime type is set on the file in the Files API, the mime type field is set to that same mime type. If no mime type is set on the file, we try to auto-detect it.
The extension field contains the extension of the file, derived from the file name. For instance, if the file name is My Document.docx, the extension field will contain docx.
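The mime type fallback and the extension rule can be approximated locally. In this sketch the function names are hypothetical, and Python's `mimetypes.guess_type` (which guesses from the file name only) stands in for the service's auto-detection, whose actual method is not described here.

```python
import mimetypes
from pathlib import PurePath
from typing import Optional


def resolve_mime_type(file_name: str, files_api_mime_type: Optional[str]) -> Optional[str]:
    """Use the mime type from the Files API if set, otherwise guess one."""
    if files_api_mime_type:
        return files_api_mime_type
    # Stand-in for auto-detection: guess from the file name only.
    guessed, _encoding = mimetypes.guess_type(file_name)
    return guessed


def derive_extension(file_name: str) -> str:
    """Return the part after the last dot in the file name, without the dot."""
    return PurePath(file_name).suffix.lstrip(".")


print(derive_extension("My Document.docx"))
print(resolve_mime_type("report.pdf", None))
```

A file name without a dot yields an empty string from `derive_extension`; how the service represents that case is not covered by this sketch.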
The page count field contains the number of pages in the document, when that can be determined.
The type field contains a high-level file type, derived from the mime type. Mime types can be hard to read and are not always easy to understand, so we map them to more user-friendly types. The types currently returned are listed below, but be aware that this list may be extended in the future.
Type          Description
------------  ----------------------------------------------------------------------
Document      Document files from Microsoft Word or similar word processing software
PDF           PDF files
Spreadsheet   Files from Microsoft Excel or similar spreadsheet software
Presentation  Slides from Microsoft PowerPoint or similar
Image         Any kind of image such as PNG or JPG files
Video         Any kind of video such as MOV or MP4 files
Tabular data  CSV, TSV and other kinds of tabular data files
Plain text    Plain text files
Compressed    ZIP files and other kinds of compressed archive files
Script        Program code such as Python or MATLAB
Other         Anything that doesn’t fit in any of the above types
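To illustrate the idea of mapping mime types to user-friendly types, here is a small sketch. The mapping entries and the `friendly_type` function are assumptions chosen for illustration; the mapping actually used by the service is not published here and may differ.

```python
from typing import Optional

# Illustrative mapping only; not the service's actual table.
EXACT_TYPES = {
    "application/pdf": "PDF",
    "application/msword": "Document",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "Document",
    "application/vnd.ms-excel": "Spreadsheet",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": "Spreadsheet",
    "application/vnd.ms-powerpoint": "Presentation",
    "text/csv": "Tabular data",
    "text/tab-separated-values": "Tabular data",
    "text/plain": "Plain text",
    "application/zip": "Compressed",
    "text/x-python": "Script",
}
PREFIX_TYPES = {"image/": "Image", "video/": "Video"}


def friendly_type(mime_type: Optional[str]) -> str:
    """Map a mime type to a high-level type, defaulting to 'Other'."""
    if mime_type is None:
        return "Other"
    if mime_type in EXACT_TYPES:
        return EXACT_TYPES[mime_type]
    for prefix, name in PREFIX_TYPES.items():
        if mime_type.startswith(prefix):
            return name
    return "Other"


print(friendly_type("application/pdf"))
print(friendly_type("image/png"))
print(friendly_type("application/x-unknown"))
```

Because the list of types may be extended, client code that switches on the type field should keep a default branch for values it does not recognize.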
If a geolocation is set on the file in the Files API, the geolocation field contains that same geolocation. If no geolocation is explicitly assigned, the document processing system tries to detect a location using two different techniques:
  1. Embedded GPS locations: Extract locations from files that contain embedded GPS locations. Photos and videos often have this kind of metadata.
  2. Related assets: Look at related assets that have locations, and assign the same location(s) to the document.
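The resolution order above can be sketched as a simple precedence chain. This is a hedged illustration: the function and parameter names are hypothetical, locations are simplified to (latitude, longitude) pairs, and trying embedded GPS data before related assets is an assumption, since only the precedence of the explicitly assigned geolocation is stated.

```python
from typing import List, Optional, Sequence, Tuple

Location = Tuple[float, float]  # (latitude, longitude), simplified for this sketch


def resolve_locations(
    explicit: Optional[Location],
    embedded_gps: Optional[Location],
    related_asset_locations: Sequence[Location],
) -> List[Location]:
    """Return the location(s) for a document, preferring explicit data."""
    if explicit is not None:
        # A geolocation set in the Files API is used as-is.
        return [explicit]
    if embedded_gps is not None:
        # Assumed order: embedded GPS metadata is tried first.
        return [embedded_gps]
    # Otherwise inherit the location(s) of related assets, if any.
    return list(related_asset_locations)


print(resolve_locations(None, None, [(59.9, 10.7), (60.4, 5.3)]))
```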

File type support

We create a document for each uploaded file, but only derive data from certain files. The following file types are eligible for further data extraction and enrichment:
  • PDF files
  • Spreadsheets, documents, and presentations from the Microsoft, LibreOffice and macOS office suites
  • Plain text files
  • Images
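A client that wants to know which documents can be expected to carry derived data could check the type field against this list. The sketch below assumes that the eligible categories correspond to the type values PDF, Spreadsheet, Document, Presentation, Plain text, and Image; that correspondence is an assumption, and `is_enrichable` is a hypothetical helper.

```python
# Assumed correspondence between the eligible categories and type values.
ENRICHABLE_TYPES = frozenset(
    {"PDF", "Spreadsheet", "Document", "Presentation", "Plain text", "Image"}
)


def is_enrichable(document_type: str) -> bool:
    """Return True if further extraction and enrichment is expected for this type."""
    return document_type in ENRICHABLE_TYPES


types_seen = ["PDF", "Video", "Image", "Compressed"]
print([t for t in types_seen if is_enrichable(t)])
```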

Automatic extraction

CDF automatically extracts text and metadata from uploaded files. Once ingested, documents are indexed for search. You can query by standard metadata fields or by semantic meaning using passage search.
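As a rough illustration of combining a content query with a metadata filter, the sketch below only builds a request body and does not call the API. The body layout (`search`, `query`, `filter`, `equals`, `property`, `limit`) and the helper name are assumptions made for this example; check the API reference for the exact request schema before relying on it.

```python
import json
from typing import Any, Dict, Optional


def build_search_body(
    query: str,
    document_type: Optional[str] = None,
    limit: int = 10,
) -> Dict[str, Any]:
    """Build a search request body (assumed shape, for illustration only)."""
    body: Dict[str, Any] = {"search": {"query": query}, "limit": limit}
    if document_type is not None:
        # Assumed filter syntax: match documents whose type equals the given value.
        body["filter"] = {"equals": {"property": ["type"], "value": document_type}}
    return body


body = build_search_body("pump maintenance procedure", document_type="PDF")
print(json.dumps(body, indent=2))
```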

Organizing documents

Documents are associated with files in CDF. You can filter by source type, metadata, and content. Use aggregations to understand document distributions and properties across your corpus.
Document content extraction happens asynchronously. Large files may take time to process before they appear in search results.
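Because processing is asynchronous, a client that uploads a file and then searches for it may need to wait. One common approach is to poll with an increasing delay. The sketch below is generic: `wait_until_indexed` is a hypothetical helper, and `is_indexed` is any callable you supply that returns True once the document appears in search results.

```python
import time
from typing import Callable


def wait_until_indexed(
    is_indexed: Callable[[], bool],
    timeout_s: float = 60.0,
    initial_delay_s: float = 1.0,
    max_delay_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll is_indexed with exponential backoff until it succeeds or times out."""
    deadline = time.monotonic() + timeout_s
    delay = initial_delay_s
    while True:
        if is_indexed():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline; double the delay up to the cap.
        sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay_s)


# Simulated check that succeeds on the third attempt; sleeping is disabled here.
attempts = iter([False, False, True])
found = wait_until_indexed(lambda: next(attempts), sleep=lambda seconds: None)
print(found)
```

The timeout and delay values are placeholders; suitable values depend on file size and on how long processing takes in practice.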

Key capabilities

  • Search by metadata or content
  • Semantic passage search for meaning-based retrieval
  • Aggregate properties for analytics
  • Retrieve document content for display or processing
Last modified on April 23, 2026