Building containers
This article outlines the best practices to design and implement containers for the Cognite Data Fusion (CDF) industrial knowledge graph.
To build efficient containers, first make sure to understand the use cases and information models you're implementing the containers for. Then, use the information in this article to meet your implementation needs.
To manage containers and govern schema components, you can use NEAT (open source) and the CDF Toolkit. The tools are community-supported with contributions from Cognite engineers.
Indexes and constraints
To improve the speed of data retrieval and sorting, you can index any property defined in a container. However, indexes also come with a performance cost during certain operations.
You can configure a maximum of 10 indexes for each container. Each index can include an individual property, or it can include combinations of properties (composite indexes). For composite indexes, the order of the properties in the index is important.
The most frequently included properties in your queries are primary candidates to consider including in an index. Also, you should consider the size of the data set, and the frequency of updates to the data.
You can specify two main index types for the data modeling service: Btree and inverted indexes.
The examples in this section assume the following container definitions:
# Example container definitions
space: site_assets
container:
# A container for the "Asset" type with a single property "asset_id"
- externalId: Asset
name: Asset
usedFor: node
description: The Asset container
properties:
asset_id:
name: asset_id
type:
type: int64
site:
name: site
type:
type: text
name:
name: name
type:
type: text
# [... more asset specific container properties could be included here]
# A container for the "Description" type
- externalId: Description
usedFor: node
properties:
title:
type:
type: text
list: false
collation: ucs_basic
nullable: true
description_text:
type:
type: text
list: false
collation: ucs_basic
nullable: true
labels:
type:
type: text
list: true
collation: ucs_basic
nullable: true
Why not index all properties?
Well-planned indexes that match the query patterns of your data consumers improve the performance of data retrieval operations. But indexes come at a cost, and creating an index on the wrong properties can be detrimental to both data retrieval, and insert/update operations:
- Indexes built on properties containing a lot of unique data will often impede performance.
- BTree indexes for properties with large amounts of data can use significant amounts of memory and can cause intermittent error/timeout statuses for update or add operations.
- Composite indexes for millions of pieces of data can lead to poor performance during add and update operations for the indexed properties. It may impact the experienced stability of your solutions and your data objects.
Having an index for every property is unlikely to give the desired result even if the total property count in the container would allow it. Too many indexes, across too many properties, in too many related containers may lead to slow performance during ingestion, query, search, and filtering operations. Slow performance can result in a lot of push-back on the client in the form of retry statuses from the data modeling service.
BTree indexes
BTree indexes maintain data in a sorted order and lets you quickly search, insert, delete, and read data sequentially. The indexes are ideal for range queries, and for ordered data retrieval.
BTree indexes are particularly useful for data stored as scalar data types. To see the scalar data types available in container properties, navigate to the type
> PrimitiveProperty
section of the Create or update containers entry in the API documentation.
The following example illustrates creating a single BTree index for the Asset container configuration:
indexes:
# A BTree index for the AssetId property in the Asset container
assetIdIndex:
properties:
- assetId
# alternative approach to property inclusion for the index:
# properties: [assetId]
indexType: btree
Inverted indexes
Inverted indexes are excellent for full-text search queries, and for multi-property indexes. For example, to list unique words that appear in any document. You can also use inverted indexes to calculate relevance. For example, to get the frequency at which a certain term appears.
This example uses 3 text properties, and a single inverted index for the labels in the Asset container:
# An inverted index for the "Description" container in section 3.1
indexes:
labels:
properties:
- labels
indexType: inverted
Composite indexes
Composite indexes are effective when using properties together in query conditions, for example, a WHERE
filter using several related properties. Build composite indexes for properties that are commonly used together to filter or sort data. This improves performance because the data modeling service data store quickly can find the data that meet all criteria included in the index.
The following query examples highlight the usefulness of a composite inverted index, and the order of the properties in the index:
a) return assets where their name starts with the letter A, from site X b) return all assets at site Y c) return all assets where their name starts with the letter B
Query a) and b) should see a performance benefit by defining the composite inverted index described as site_asset_name
in the the example below. However, the site_asset_name
won't improve the performance of query c).
To improve the performance for query c), you would need the asset_name_site
composite index below. The asset_name_site
index would also improve performance for query a), but not for query b).
You cannot create composite indexes from properties hosted in different containers
We suggest you use the requirements collection phase to test, and validate, your data volume expectations for the container(s). Also use it to identify the most likely query patterns. Then build the smallest number of composite indexes you need, so your graph will deliver the performance and scalability your consumers expect during data onboarding and consumption.
With large data sets, try using distinct and small composite indexes to support the specific queries you expect consumers will use.
Composite index examples
# Composite index combining the site and asset name for the Asset container
indexes:
site_asset_name:
# An inverted index for the combined site and asset name properties should deliver more performant queries
# when searching for asset names within a site.
properties:
- site
- name
# alternative definition for properties included in the index:
# properties: [site, name]
indexType: inverse
asset_name_site:
# An inverted index for the combined asset name and site properties should deliver more performant queries
# when searching for whether a site is linked to a (set of) asset name(s).
properties:
- name
- site
# alternative definition for properties included in the index:
# properties: [name, site]
indexType: inverse
Pagination (cursorable indexes)
Use cursorable indexes when you need to support efficient pagination through large data sets. For example, when
your users need to receive all available data, based on a set of filters. To enable these types of indexes, set
the cursorable:
parameter to true
for a BTree index.
Remember, cursorable indexes can consume significant amounts of resources and impact performance.
For incremental processing — "Return all data that has changed, based on this/these filters" — use the sync feature. For small or static data sets, we recommend that you keep the default value for cursorable:
(false
).
Inverted indexes can't be cursorable.
# A "cursorable" btree index for the asset_id property in the example Asset container
indexes:
# The BTree index for the asset_id property in the example Asset container
assetIdIndex:
properties:
- asset_id
# alternative definition for properties included in the index:
# properties: [asset_id]
indexType: btree
cursorable: true # Set to true to enable a cursorable index
Constraints
See Containers, views, polymorphism, data models for more information about constraints, and how to create them for your containers.
Adding constraints creates indexes for the properties you include in the constraint definition and affect performance when you add or update data (to asset_id
in the example). The reason is that CDF needs to make sure that the value you're adding or updating remains unique for the asset_id
property (in the example).
# Example constraint definition for asset_id in the Asset container
constraints:
asset_id_is_unique:
constraintType: uniqueness
properties:
- asset_id
Container sizes
To design a performant container, you need to consider questions like:
- How will your consumers use data hosted in the container(s)?
- How many instances and properties are needed for a search/query/filter operation?
- Which properties are indexed, and using what kind of index?
- Which constraints apply?
- Which, and how many, views are combining data from more than one container?
- How will the views access data from the container(s)?
- Will the data in the container be updated together with other data in other containers?
For example, creating a wide container with many indexed properties, and a frequently updated
column like last_inspection
or last_viewed
(properties) will probably lead to poor performance.
The same goes for frequently updated properties where the data is actually changed whenever
the property is updated. These kinds of containers are prime candidates for you to split
to "small" containers.
The content of a CDF data modeling container
How you design a container shapes the "physical" storage of the data in the data modeling services data store. The data store is a PostgreSQL database instance. When you use the data modeling service API to create a container, think of the container as a table in PostgreSQL. The table contains all specified properties and the additional internal properties the service needs.
For example, if you create a container with a single property named my-property
, the container will look like this in the data modeling data store:
internal_data | internal_data | space | my-property |
---|---|---|---|
YYY | 10 | foo | |
XXX | 10 | not foo |
Notice the internal data columns — properties — in the example. The data modeling service hides them and their data from you, but uses them extensively behind the scenes.
The underlying data store doesn't allow composite indexes for properties that are stored in different containers.
When is a container too big?
A container is too big (wide) when it has too many properties and indexes to be performant. Wide containers with many properties means that ingesting or updating data takes longer and the indexes can become bloated. This affects write and query performance for the data consumer and can cause more HTTP status 408 errors.
Each CDF project comes with a set of preconfigured default limits. One of the limits is the maximum number of properties per schema object. By default, this limit is 100 properties, but it can be increased.
In addition to consumption and data update patterns, the "correct" width of a container depends on the number of other wide containers in the graph.
In some cases, it may be efficient to have a container with 200 properties. If only a few, or none, of the properties have relationships, or dependencies, that must be updated when you onboard or update data, and if you don't have consumption relationships to and from other containers, having 10 wide containers (>50 properties) can also be a good solution.
To safeguard the performance of your solution, make sure that you experiment with, and test the scalability for, your onboarding and consumption patterns.
Don't use the data modeling service as an "entity-attribute-value" (EAV) store. Learn why here: Anti-pattern: the entity-attribute-value (EAV) model
When is a container too small?
Using only small containers results in excessive query-time JOINs as the data modeling service processes queries from your data consumers. This, in turn, makes it difficult to form useful queries and return relevant data to the consumer.
In addition, the overhead of internal index structures will likely result in poor performance. The data modeling service manages indexes for many of the internal properties and composite indexes for combinations of internal properties. The service uses the indexes to speed up access for an instance when it receives an explicit instance_id
to retrieve data from. The indexes are also used when a consumer enumerates everything in the container, with or without using a space filter. These enumerations are common when executing list operations using the has_data
filter(s).
Another side effect of using a "one container per property" approach is that you can't specify indexes consisting of multiple properties (a composite index). This reduces performance across the joined tables when consumers query for data.
There are, however, patterns where it may be efficient to use small containers. For example:
-
You want to store data that's independent, or infrequently used together with data from other containers.
-
You're hosting a small amount of data.
-
Your data is frequently updated. Example: The container has a
last_viewed
property, or alast_updated
property.
Properties for filtering vs fetching properties
In the data modeling service, a query has two "phases":
-
For any per-table expression, the service determines which instances to return in what order.
-
The service fetches the requested data from the containers representing those instances. For this, the data modeling service has two strategies:
-
Fetch all the data from the container.
-
Query a single system-internal container;
node_source
andedge_source
. This system-internal container holds JSON blob representations of the data stored in all the other containers.
-
The first strategy reflects how the schema and data modeling services work today. The strategy works well for queries that need to return specific data from a few specific containers.
However, the strategy doesn't work well when you have a query that returns all the data for any number of instances. If you, for example, have 10,000 containers, the first strategy would need to look in 10,000 tables for every instance — not a workable solution. Instead, the strategy searches through a table, with a (sub-)query equivalent to:
select * from node_source join the_nodes_we_need_to_return
using(internal_data, internal_data) order by ....
This operation can be slow, and will often result in timeouts and error conditions for the consumer.
The cost of updates
In most transactional systems, an update to data typically creates new copies of the data. It will likely not change any existing data "in place". As an example, if you have a container with document_body
and last_viewed
properties and have configured the system to update the last_viewed
property frequently, each update also saves the document_body
.
Also, any update to data including at least one indexed column results in updates to all indexes for that table. (See technical details.)
In sum, these updates contribute to container and index bloat that will affect performance for data consumers (clients).
We recommend that you:
- Don't include frequently updated data in wide containers. Examples of properties like these include data for
view_count
,last_viewed
,last_updated
,reactions
, etc. - Keep rarely changed and heavily indexed data in separate containers from frequently updated and indexed property data.
The cost of queries
In general, fewer JOINs result in the best query performance. A query only joins containers that your data consumers include by using filters. The database query optimizer uses statistics to decide the order of the filters and JOINs. The data store maintains the statistics and hosts it per container. These statistical values are based on sampling, and not continuous recording.
As you combine more properties or containers, the statistics used by the optimizer are less likely to support good decision-making. For instance, when the optimizer determines the order of JOINs, it estimates the expected number of instances (rows) in the container. If the statistics for the container confirm that it's processing for a small container, this is considered a strong signal that the container has few instances (rows).
The size of the table is a simple statistic to maintain. Knowing much about the number of instances and the size of the container, is a lot more difficult. This is especially the case for data where properties a
, b
, c
, d
, and e
are not null
for a specific row, but for many others they are null
. This may, for example, be the case for a sparsely populated container with many properties, but relatively little data in total.
This statistics problem impairs the precision of the optimizer. In some cases, it may result in the system needing to scan all the container data to produce the data to return. The data modeling service isn't optimized for full scans of containers hosting a lot of data for many properties. As a result, consumers may receive HTTP status code 408
after submitting a query to the data modeling service.
Share data containers
This example illustrates how you can create a performant data filtering pattern with a combination of properties, their composite index, and a WHERE
filter.
This example assumes that you have three properties containing data: a
, b
, and c
:
-
Create a composite index concatenating the property values in the same order as you want to structure a
WHERE
filter in a query:index(a, b, c)
. -
Add the filtering clause
WHERE a = $1 AND b = $2 AND c >= $3
to your query.$1
,$2
, and$3
are parameters. Before the query is sent to the data modeling service, the parameters are replaced with actual values to test for in theWHERE
clause.
However, you can't build a composite index if the properties are from separate containers, for example if (a, b)
are in a different container from c
. For such cases, you can move (a, b)
to the same container as c
(or the other way around), or you can repeat/duplicate property c
and its data to several different containers.
Use the information in this section to test and verify the best option for your data and your consumers.
An index scan determines which rows to return as matching your filter. Filter-less list/sync queries always use this method to return results.
If you split properties across different containers, and the properties are indexed separately, the data modeling service optimizer can choose different strategies. The service uses statistical analysis and is likely to assume that c >= $3
only matches a few instances (rows). Consequently, it'll look up those rows first.
Next, it verifies that the identified rows match (a, b)
. For the match, the system doesn't always use the composite (a, b)
index you have defined. If property c
represents a last_modified
timestamp property for an instance, it assumes that the property is frequently updated as is the case with most last_modified
properties.
When the query attempts to filter on last_modified >= 0
, or c >= 0
in the example above, it's not efficient to begin the comparison by processing the c
filter. The filter probably matches a lot of data. Also, for a platform like CDF, much of the data will have changed at about the same time. Additionally, the filter probably identifies less than all data content for the container. If so, the potential matches of both indexes — (a, b)
, and (c)
— might be exhausted. The data modeling service then uses the resulting bitmaps to intersect, looking for candidate rows. This is a necessary, but low-performing strategy. Client applications may see slow query responses or timeout errors (HTTP status 408) when they use the filter.
To summarize, if a property is frequently used by filters and the filters span data across many containers, it may be beneficial to duplicate the data for the property in multiple containers.
One such example is the default space
property. We observed that our consumers often filter space
. The property was originally protected, but we decided to add the space information to every container created. You can now add the space
property to any index definition. Rather than keeping the property and related information in a single container, we duplicate the small amount of data represented by the space identifier. This helps you achieve better performance for queries if you need to filter or query together with the space
property from any container.
Anti-pattern: the Entity-Attribute-Value (EAV) model
The EAV model is an anti-pattern and we strongly recommend not using it when building containers in CDF.
The Entity-Attribute-Value (EAV) model is often used in data stores where low-execution-cost flexibility is the most important outcome. Rather than using conventional structured tables with fixed columns, the EAV model uses three main components: entities, attributes, and values.
One of the appealing aspects of the EAV approach is its ability to handle dynamic and sparse data. You can add new attributes without altering the underlying schema. This makes the pattern highly adaptable to evolving data requirements.
However, we strongly recommend NOT using the EAV model with the CDF data modeling service. Using the model will have a significant negative impact on cost and performance. Some key problems are:
-
Instance explosion.
Each new property you add to the EAV store multiplies the nodes and instances in the system.
-
Increased storage cost.
Adding only two properties can increase your stored data by more than 600%. This is mostly due to additional instance-related system metadata. The storage and processing cost is exacerbated with more properties, not amortized.
-
Increased query processing.
With EAV instances, every query results in large matrices of many-to-many-to-many JOINs. The JOINs are based on tables where "everything is mixed together", which also undermines the service's ability to collect and use meaningful statistics to determine data distribution.
The indexes can't be granular. Instead, they have to include everything, also properties that don't need indexing. That causes the indexes to be far less efficient, more expensive to maintain, and each property far more expensive to query.
The query execution engines will be running blind, impacting the performance of the solution, and likely causing timeouts (HTTP status 408).
Instead of using the EAV model, we recommend that you establish schemas using first class properties. If properties are always used together in a lot of filters, and they effectively filter the data, you can include them in a composite index to enable fast lookup operations. The query engine will have statistics for the properties to reason about, and to execute the query. Typically, it does not JOIN every possible combination of edges-to-property
-nodes.
Everything is not a first-class property. We've seen that most data models follow a power/Zipf distribution. The data in the most popular property is used 1/10th of the time. The data in the second most popular property is used 1/20th of the time. This pattern continues for each subsequent property. For this reason, not all data stored by the service needs to use property typed properties. If you don't plan on using a property for filtering, sorting, or indexing, you can save it in a JSON-object with JSON-typed properties. That allows the service to store data and do string-like filtering without you having to define schemas for everything stored in the data model(s).