Document structure
A document is a file that has been indexed by the document search engine. Every time a file is uploaded, updated or deleted in the Files API, it will also be scheduled for processing by the document search engine. After some processing, it will be possible to search for the file in the document search API. The document search engine is able to extract content from a variety of document types, and perform classification, contextualization and other operations on the file. This extracted and derived information is made available in the form of a Document object.
The document structure consists of a selection of derived fields, such as the title, author and language of the document, plus some of the original fields from the raw file. The fields from the raw file can be found in the sourceFile structure. The derived fields are described in more detail below.
Derived fields
title
Some document types (such as PDFs) contain additional metadata fields. If the document contains its title as part of this metadata, this field will be populated with that title.
We do not currently extract the title from the document content itself. If there is a need for this, we may consider adding such functionality in the future.
author
producer
The producer field also exists in the document metadata. It contains information about the software or the system that was used to create the document.
createdTime
The createdTime we assign to the document is not exactly the same as the one found in the Files API. We first try to extract the created time from the document metadata. If the document does not contain such a timestamp, we fall back to the time set in the Files API.
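The fallback order described above can be sketched as follows. This is a minimal illustration, not the service's implementation: the function name and the use of integer timestamps are assumptions made for the example.

```python
from typing import Optional


def resolve_created_time(
    metadata_created_time: Optional[int],
    file_created_time: Optional[int],
) -> Optional[int]:
    """Pick the created time for a document (hypothetical helper).

    Prefers the timestamp embedded in the document metadata and falls
    back to the value set in the Files API when none is embedded.
    """
    if metadata_created_time is not None:
        return metadata_created_time
    return file_created_time
```

With this sketch, a PDF that carries its own creation timestamp keeps that value, while a plain text file with no embedded timestamp ends up with the Files API value.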
mimeType
If there is a mime type set on the file in the Files API, this field will be set to the same mime type. If there is no mime type set on the file, we will try to auto-detect it.
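As a rough sketch of this rule: keep the mime type from the Files API when present, otherwise try to detect one. The text does not say how auto-detection works, so the example below uses Python's standard `mimetypes` module, which guesses from the file name only, purely as a stand-in. The function name is hypothetical.

```python
import mimetypes
from typing import Optional


def resolve_mime_type(file_mime_type: Optional[str], file_name: str) -> Optional[str]:
    """Return the explicit mime type if set, otherwise a guessed one.

    The guess here is based on the file name alone (an assumption for
    this example); the real service may inspect the file content.
    """
    if file_mime_type:
        return file_mime_type
    guessed, _encoding = mimetypes.guess_type(file_name)
    return guessed
```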
extension
This field contains the extension of the file, derived from the file name. For instance, if the file name is My Document.docx, the extension field will contain docx.
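One way to derive an extension from a file name is shown below. It is only a sketch: the helper name is hypothetical, and details such as case handling or multi-part extensions (for example .tar.gz) are not specified in the text, so the behavior here is an assumption.

```python
from pathlib import PurePath


def derive_extension(file_name: str) -> str:
    """Return the final extension of a file name without the leading dot."""
    # PurePath.suffix yields the last suffix including the dot, e.g. ".docx",
    # or an empty string when the name has no extension.
    return PurePath(file_name).suffix.lstrip(".")


extension = derive_extension("My Document.docx")
```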
pageCount
Contains the number of pages in the document, if possible to determine.
type
The type field contains a high-level file type, derived from the mime type. Mime types are not that pleasant to look at, and not always easy to understand. That is why we map the mime types into more user-friendly types. Below is the list of types currently returned, but be aware that this list may be extended in the future.

| Type | Description |
|---|---|
| Document | Document files from Microsoft Word or similar word processing software |
| PDF | PDF files |
| Spreadsheet | Files from Microsoft Excel or similar spreadsheet software |
| Presentation | Slides from Microsoft PowerPoint or similar |
| Image | Any kind of image such as PNG or JPG files |
| Video | Any kind of video such as MOV or MP4 files |
| Tabular data | CSV, TSV and other kinds of tabular data files |
| Plain text | Plain text files |
| Compressed | ZIP files and other kinds of compressed archive files |
| Script | Program code such as Python or MATLAB |
| Other | Anything that doesn’t fit in any of the above types |
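To illustrate the idea of mapping mime types to these friendlier types, here is a small sketch. The actual mapping used by the service is not documented here, so the table below is an illustrative assumption covering only a few common mime types, and the function name is hypothetical.

```python
from typing import Optional

# Illustrative subset only; the real mapping is not specified in this document.
MIME_TO_TYPE = {
    "application/pdf": "PDF",
    "application/msword": "Document",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "Document",
    "application/vnd.ms-excel": "Spreadsheet",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": "Spreadsheet",
    "application/vnd.ms-powerpoint": "Presentation",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation": "Presentation",
    "text/csv": "Tabular data",
    "text/tab-separated-values": "Tabular data",
    "text/plain": "Plain text",
    "application/zip": "Compressed",
    "text/x-python": "Script",
}


def classify_mime_type(mime_type: Optional[str]) -> str:
    """Map a mime type to one of the high-level types listed above."""
    if not mime_type:
        return "Other"
    normalized = mime_type.lower()
    if normalized in MIME_TO_TYPE:
        return MIME_TO_TYPE[normalized]
    # Whole families of mime types share a single user-facing type.
    if normalized.startswith("image/"):
        return "Image"
    if normalized.startswith("video/"):
        return "Video"
    return "Other"
```

Because the list of types may be extended, client code that switches on the type value should treat unknown values gracefully rather than assume the set is closed.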
geoLocation
If there is a geolocation set on the file in the Files API, then this field will contain the same geolocation. If there is no explicitly assigned geolocation, the document processing system will try to detect a location using two different techniques:
- Embedded GPS locations: Extract locations from files that contain embedded GPS locations. Photos and videos often have this kind of metadata.
- Related assets: Look at related assets that have locations, and assign the same location(s) to the document.
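The resolution order above might be sketched like this. The text does not say which of the two detection techniques takes priority or how multiple asset locations are combined, so the sketch simply tries the detectors in the order given and returns the first hit. All names are hypothetical, and locations are represented as plain dictionaries for the example.

```python
from typing import Callable, Optional, Sequence

GeoLocation = dict  # placeholder representation, e.g. a GeoJSON-like dict
Detector = Callable[[], Optional[GeoLocation]]


def resolve_geo_location(
    explicit: Optional[GeoLocation],
    detectors: Sequence[Detector],
) -> Optional[GeoLocation]:
    """Return the explicit location if set, else the first detected one."""
    if explicit is not None:
        return explicit
    for detect in detectors:
        location = detect()
        if location is not None:
            return location
    return None


# Example: no explicit location, no embedded GPS data, one asset location.
asset_location = {"type": "Point", "coordinates": [10.75, 59.91]}
resolved = resolve_geo_location(None, [lambda: None, lambda: asset_location])
```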
File type support
We create a document for each uploaded file, but only derive data from certain files. The following file types are eligible for further data extraction and enrichment:
- PDF files
- Spreadsheets, documents, and presentations from the Microsoft, LibreOffice and macOS office suites
- Plain text files
- Images
Automatic extraction
CDF automatically extracts text and metadata from uploaded files. Once ingested, documents are indexed for search. You can query by standard metadata fields or by semantic meaning using passage search.
Organizing documents
Documents are associated with files in CDF. You can filter by source type, metadata, and content. Use aggregations to understand document distributions and properties across your corpus.
Document content extraction happens asynchronously. Large files may take time to process before they appear in search results.
Key capabilities
- Search by metadata or content
- Semantic passage search for meaning-based retrieval
- Aggregate properties for analytics
- Retrieve document content for display or processing