Data modeling performance considerations
This article outlines key data modeling factors that impact performance when ingesting and querying data.
To manage containers and govern schema components, you can use NEAT (open source) and the CDF Toolkit.
Indexes and constraints
To improve the speed of data retrieval and sorting, you can index any property defined in a container. However, indexes also come with a performance cost during certain operations.
You can configure a maximum of 10 indexes for each container. Each index can include an individual property, or combinations of properties (composite indexes). The order of the properties is important for composite indexes.
The properties most frequently included in queries are primary candidates to include in an index. Also, you should consider the size of the data set, and the update frequency of the data.
You can specify two main index types for the data modeling service: Btree and inverted indexes.
The examples in this section assume the following container definitions:
# Example container definitions
space: site_assets
container:
  # A container for the "Asset" type
  - externalId: Asset
    name: Asset
    usedFor: node
    description: The Asset container
    properties:
      asset_id:
        name: asset_id
        type:
          type: int64
      site:
        name: site
        type:
          type: text
      name:
        name: name
        type:
          type: text
    # [... more asset specific container properties could be included here]
  # A container for the "Description" type
  - externalId: Description
    usedFor: node
    properties:
      title:
        type:
          type: text
          list: false
          collation: ucs_basic
        nullable: true
      description_text:
        type:
          type: text
          list: false
          collation: ucs_basic
        nullable: true
      labels:
        type:
          type: text
          list: true
          collation: ucs_basic
        nullable: true
Why not index all properties?
Well-planned indexes that match the query patterns of your data consumers improve the performance of data retrieval operations. But indexes come at a cost, and creating an index on the wrong properties can be detrimental to both data retrieval, and insert/update operations:
- Indexes built on properties containing a lot of unique data will often impede performance.
- BTree indexes for properties with large amounts of data can use significant amounts of memory and can cause intermittent error/timeout statuses for update or add operations.
- Composite indexes spanning millions of instances can lead to poor performance during add and update operations for the indexed properties, and may affect the perceived stability of your solutions and data objects.
Having an index for every property is unlikely to give the desired result, even if the total property count in the container would allow it. Too many indexes, across too many properties, in too many related containers may lead to slow performance during ingestion, query, search, and filtering operations. Slow performance can surface as push-back on the client in the form of retry statuses from the data modeling service.
BTree indexes
BTree indexes maintain data in sorted order and let you quickly look up instances based on the properties defined in the index. These indexes are ideal for queries using both range and equality filters. Conceptually, a BTree index is similar to a phone book, which is sorted by name, allowing you to find a person's phone number quickly.
BTree indexes are particularly useful for data stored as scalar data types. To see the scalar data types available in container properties, navigate to the type > PrimitiveProperty section of the Create or update containers entry in the API documentation.
The following example illustrates creating BTree indexes for the asset name and site:
indexes:
  siteIndex:
    properties:
      - site
    indexType: btree
    cursorable: true
  nameIndex:
    properties:
      - name
    indexType: btree
Inverted indexes
Inverted indexes allow you to filter on the values of array properties efficiently. Conceptually, this is like a cookbook where you can quickly find all recipes that contain a specific ingredient. Inverted indexes cannot be made cursorable.
This example adds a single inverted index for the labels property in the asset container:
indexes:
  labels:
    properties:
      - labels
    indexType: inverted
Composite indexes
Composite indexes are effective when properties are used together in query conditions, for example, a WHERE filter using several related properties. Build composite indexes for properties that are commonly used together to filter or sort data. This improves performance because the data modeling service data store can quickly find the data that meets all criteria included in the index.
The following query examples illustrate the usefulness of a composite index, and the importance of the order of the properties in the index:
- a) return assets where their name starts with the letter A, from site X
- b) return all assets at site Y
- c) return all assets where their name starts with the letter B
Queries a) and b) should see a performance benefit from the composite index named site_asset_name in the example below. However, site_asset_name won't improve the performance of query c).
To improve the performance of query c), you need the asset_name_site composite index below. The asset_name_site index would also improve performance for query a), but not for query b).
You cannot create composite indexes using properties in different containers.
We suggest you use the requirements collection phase to test, and validate, your data volume expectations for the container(s). Also use it to identify the most likely query patterns. Then build the smallest number of composite indexes you need, so your graph will deliver the performance and scalability your consumers expect during data onboarding and consumption.
With large data sets, try using distinct and small composite indexes to support the specific queries you expect consumers will use.
Composite index examples
# Composite index combining the site and asset name for the Asset container
indexes:
  site_asset_name:
    # A composite index for the combined site and asset name properties should deliver more performant queries
    # when searching for asset names within a site.
    properties:
      - site
      - name
    # alternative definition for properties included in the index:
    # properties: [site, name]
    indexType: btree
  asset_name_site:
    # A composite index for the combined asset name and site properties should deliver more performant queries
    # when searching for whether a site is linked to a (set of) asset name(s).
    properties:
      - name
      - site
    # alternative definition for properties included in the index:
    # properties: [name, site]
    indexType: btree
Pagination (cursorable indexes)
Use cursorable indexes to support efficient pagination through large data sets, for example, when your users need to receive all available data based on a set of filters. To enable these types of indexes, set the cursorable parameter to true for a BTree index.
A cursorable index has an internal unique identifier appended to it, so cursors have a suitable tiebreaker when the indexed values are not unique. As a result, cursorable indexes are more expensive than non-cursorable indexes, so use them carefully.
For incremental processing ("Return all data that has changed, based on this/these filters"), use the sync feature. For small or static data sets, we recommend that you keep the default value for cursorable (false).
Inverted indexes can't be cursorable.
# A "cursorable" BTree index for the asset_id property in the example Asset container
indexes:
  assetIdIndex:
    properties:
      - asset_id
    # alternative definition for properties included in the index:
    # properties: [asset_id]
    indexType: btree
    cursorable: true # Set to true to enable a cursorable index
For DMS's query executor to use the cursorable index, the sort in your query must match the cursorable index.
A cursorable index is ordered. If you have a cursorable index on property1 and property2, its data could look like this:
property1 | property2
----------|----------
1         | 1
1         | 2
2         | 1
2         | null
null      | null
To cursor through this index efficiently, you need to either sort ascending with nulls last:
- property: property1
  direction: ascending
  nullsFirst: false
- property: property2
  direction: ascending
  nullsFirst: false
or descending with nulls first:
- property: property1
  direction: descending
  nullsFirst: true
- property: property2
  direction: descending
  nullsFirst: true
Indexed base properties
These base properties are indexed with B-Tree indexes by default:
- space, externalId
- type (not cursorable)
- startNode (not cursorable)
- endNode (not cursorable)
This implies that:
- Cursoring through all instances based on space is performant.
- type, startNode, and endNode filters are efficient, but not suitable for cursoring.
- Filters on other base properties (like lastUpdatedTime, createdTime, etc.) should be avoided unless it's for very small amounts of data.
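The first point above can be sketched as a request body that lists instances in one space, relying on the default cursorable index on the space base property. The ["node", "space"] property path and the site_assets space name are illustrative assumptions, not taken verbatim from this article:

```yaml
# Hedged sketch: list instances in a single space. The default B-Tree index
# on the "space" base property is cursorable, so paging through the result
# set with cursors stays efficient.
filter:
  equals:
    property: ["node", "space"]
    value: site_assets
limit: 1000
```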
Indexes on direct relations
Traversing direct relations performs a lookup on the identifier of the node at either end. This requires B-Tree indexes to be efficient. Suppose, for example, you have two pumps in a system of pumps:
To retrieve the system for Pump1, the query executor needs to look up the node in the core node container with the internal identifier corresponding to System1. This internal identifier property has a B-Tree index on it to make the lookup efficient.
If, however, you want to retrieve all pumps in System1, the query executor needs to do a lookup on the Pump.system direct relation property. To make this query performant, you need to add a B-Tree index to Pump.system.
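Continuing the example, such an index could be defined on the Pump container like the sketch below. The Pump container and its system property are assumptions carried over from the paragraph above, following the same index syntax as the earlier examples:

```yaml
# Hypothetical B-Tree index on the Pump container's "system" direct relation
# property, so that "all pumps in System1" lookups are efficient.
indexes:
  systemIndex:
    properties:
      - system
    indexType: btree
```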
Constraints
See Containers, views, polymorphism, data models for more information about constraints, and how to create them for your containers.
Adding constraints creates indexes for the properties you include in the constraint definition, and affects performance when you add or update data (to asset_id in the example). The reason is that CDF needs to make sure that the value you're adding or updating remains unique for the asset_id property (in the example).
# Example constraint definition for asset_id in the Asset container
constraints:
  asset_id_is_unique:
    constraintType: uniqueness
    properties:
      - asset_id
Query performance considerations
Excessive joins slow down queries
When querying instances, you can use the sources field to specify which containers or views to retrieve data from. You can also use hasData filters to return only instances with data in certain containers or views. Also, you can use nested filters to filter on the properties of a node pointed to by a direct relation. For all these operations, the query executor needs to join in the underlying containers.
For example, if you select data from a view that maps five containers, the query executor needs to perform five joins to fetch all the data. For every nested filter, the query executor must perform another two joins. The number of joins quickly adds up and significantly reduces query performance.
To keep your queries performant, limit the number of containers and views you query from and the number of nested filters. Also, consider carefully how you use hasData filters.
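As a sketch, a hasData filter keeping only instances with data in the Asset container from the earlier examples could look like this. The space and container identifiers are assumptions carried over from those examples:

```yaml
# Hedged sketch: return only instances that have data in the Asset container.
# Each container or view referenced in a query like this adds a join in the
# query executor, so keep the list short.
filter:
  hasData:
    - type: container
      space: site_assets
      externalId: Asset
```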
/list endpoint pitfalls
The default sort on the /list endpoint is by the internal ID. The index on the internal ID is cursorable, but it's not well-suited for use with arbitrary filters. The only filters that'll be performant with the default sort are space filters and hasData filters.
To use the list endpoint with other filters, add a cursorable index on the properties you want to filter on. Make sure the sort in your query matches the index.
The /list endpoint emits a cursor regardless of the sort, but not all sorts will be performant. If you have more than a few thousand instances, your sort must be backed by a cursorable index.
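For example, to filter and page on asset_id using the cursorable assetIdIndex defined earlier, the sort in the request should match that index. This sketch reuses the sort notation from the pagination section above; whether the shorthand property name applies to your view depends on your setup:

```yaml
# Hedged sketch: a sort matching the cursorable assetIdIndex, so the /list
# cursor stays efficient for large result sets (ascending with nulls last).
sort:
  - property: asset_id
    direction: ascending
    nullsFirst: false
```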
Container sizing
How large (wide), or small (narrow), should the containers be when you build your schema? The answer to this seemingly simple question is, frustratingly, "It depends".
The challenge is two-fold:
- We cannot exclusively use tiny containers, and
- We cannot exclusively use huge containers.
It might be very convenient for each property to be its own container. However, if that worked well, we would not need the container concept in the first place. On the other hand, it could also be convenient to construct a single container with all the properties described by your information model, or all the properties needed by the consumers of data from your CDF project. But again, we would not need containers in the first place.
Note that neither of these challenges implies that you can't ever use tiny or large containers. It does mean that you must fully understand the trade-offs of using large or small containers.
To design a performant container, you need to consider questions like:
- How will your consumers use data hosted in the container(s)?
- How many instances and properties are needed for a search/query/filter operation?
- Which properties are indexed, and using what kind of index?
- Which constraints apply?
- Which, and how many, views are combining data from more than one container?
- How will the views access data from the container(s)?
- Will the data in the container be updated together with other data in other containers?
For example, creating a wide container with many indexed properties and a frequently updated property like last_inspection or last_viewed will probably lead to poor performance. The same goes for frequently updated properties where the data actually changes whenever the property is updated. These kinds of containers are prime candidates for you to split into "small" containers.
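One way to apply this advice is to move the frequently updated properties into their own narrow container, so updates don't touch the wide, heavily indexed one. The AssetActivity container name below is an illustrative assumption; the property syntax follows the container examples earlier in this article:

```yaml
# Hypothetical narrow container holding only frequently updated properties,
# split out from a wide container such as the example Asset container.
- externalId: AssetActivity
  usedFor: node
  properties:
    last_inspection:
      type:
        type: timestamp
      nullable: true
    last_viewed:
      type:
        type: timestamp
      nullable: true
```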
The content of a CDF data modeling container
How you design a container shapes the "physical" storage of the data in the data modeling service's data store. The data store is a PostgreSQL database instance. When you use the data modeling service API to create a container, think of the container as a table in PostgreSQL. The table contains all specified properties and the additional internal properties the service needs.
For example, if you create a container with a single property named my-property
, the container will look like this in the data modeling data store:
internal_data | internal_data | space   | my-property
--------------|---------------|---------|------------
YYY           | 10            | foo     |
XXX           | 10            | not foo |
Notice the internal data columns — properties — in the example. The data modeling service hides them and their data from you, but uses them extensively behind the scenes.
The underlying data store doesn't allow composite indexes for properties that are stored in different containers.
When is a container too big?
A container is too big (wide) when it has too many properties and indexes to be performant. Wide containers with many properties mean that ingesting or updating data takes longer, and the indexes can become bloated. This affects write and query performance for the data consumer and can cause more HTTP status 408 errors.
Each CDF project comes with a set of preconfigured default limits. One of the limits is the maximum number of properties per schema object. By default, this limit is 100 properties, but it can be increased.
In addition to consumption and data update patterns, the "correct" width of a container depends on the number of other wide containers in the graph.
In some cases, it may be efficient to have a container with 200 properties. If only a few, or none, of the properties have relationships or dependencies that must be updated when you onboard or update data, and if you don't have consumption relationships to and from other containers, having 10 wide containers (>50 properties) can also be a good solution.
To safeguard the performance of your solution, make sure that you experiment with, and test the scalability for, your onboarding and consumption patterns.
Don't use the data modeling service as an "entity-attribute-value" (EAV) store. Learn why here: Anti-pattern: the entity-attribute-value (EAV) model