Configuration settings

To configure the PI extractor, you must create a configuration file in YAML format. The file is split into sections, each represented by a top-level entry in the YAML document. Subsections are nested under their parent section.

You can use either the sample complete or minimal configuration files included with the installer as a starting point for your configuration settings:

config.default.yml - This file contains all configuration options and descriptions.
config.minimal.yml - This file contains a minimum configuration and no descriptions.

Naming the configuration file

You must name the configuration file config.yml.

Tip

You can set up extraction pipelines to use versioned extractor configuration files stored in the cloud.

Before you start

Optionally, copy one of the sample files in the config directory and rename it to config.yml.
The config.minimal.yml file doesn't include a metrics section. Copy this section from the example below if the extractor is required to send metrics to a Prometheus Pushgateway.
Set up an extraction pipeline and note the external ID.

Minimal YAML configuration file

The YAML settings below contain valid PI extractor 2.1 configurations. The values wrapped in ${} are replaced with environment variables with that name. For example, ${COGNITE_PROJECT} will be replaced with the value of the environment variable called COGNITE_PROJECT.

The configuration file has a global parameter version, which holds the version of the configuration schema used in the configuration file. This document describes version 3 of the configuration schema.

version: 3

cognite:
  project: '${COGNITE_PROJECT}'
  idp-authentication:
    tenant: ${COGNITE_TENANT_ID}
    client-id: ${COGNITE_CLIENT_ID}
    secret: ${COGNITE_CLIENT_SECRET}
    scopes:
      - ${COGNITE_SCOPE}

time-series:
  external-id-prefix: 'pi:'

pi:
  host: ${PI_HOST}
  username: ${PI_USER}
  password: ${PI_PASSWORD}

state-store:
  database: LiteDb
  location: 'state.db'

logger:
  file:
    level: 'information'
    path: 'logs/log.log'

where:

version is the version of the configuration schema. Use version 3 to be compatible with the Cognite PI extractor 2.1.
cognite and pi configure how the extractor reads the authentication details for CDF (cognite) and PI (pi) from environment variables. Since no host is specified in the cognite section, the extractor uses the default value, <https://api.cognitedata.com>. Since native-authentication isn't set in the pi section, the extractor assumes that the PI server uses Windows authentication.
time-series configures the extractor to create time series in CDF where the external IDs will be prefixed with pi:. You can also use a data set ID configuration to add all time series created by the extractor to a particular data set.
state-store configures the extractor to save the extraction state locally using a LiteDB database file named state.db.
logger configures the extractor to log at information level and write log messages to the file logs/log.log. By default, a new file is created daily and files are retained for 31 days. The date is appended to the file name.

Writing to CDF Data Models

The extractor can write directly to a core data model time series, or more correctly, to an extended version of the CogniteTimeSeries Data Model type. To save/update instances of CogniteTimeSeries, specify the identifier for the target Data Modeling Space in the configuration, using the space-id parameter of the time series configuration section.

The behavior in this mode is almost identical to the classic time series destination, except for a few key differences:

Metadata is written as a JSON blob to the extended time series type (CogniteExtractorTimeSeries), and saved in the extractedData property.
The data-set-id configuration option is ignored.
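
As a minimal sketch of this mode, the time-series section below sets space-id next to the prefix used in the minimal configuration. The space name my-pi-space is a hypothetical placeholder, not a value from this document:

```yaml
time-series:
  external-id-prefix: 'pi:'
  # Hypothetical space identifier; replace with the ID of your own space.
  space-id: my-pi-space
  # data-set-id is ignored in this mode, so it is left out here.
```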

Using values from Azure Key Vault

The PI extractor also supports loading values from Azure Key Vault. To load a configuration value from Azure Key Vault, use the !keyvault tag followed by the name of the secret you want to load. For example, to load the value of the my-secret-name secret in Key Vault into a password parameter, configure your extractor like this:

password: !keyvault my-secret-name

To use Key Vault, you also need to include the key-vault section in your configuration, with the following parameters:

Parameter	Description
`keyvault-name`	Name of Key Vault to load secrets from
`authentication-method`	How to authenticate to Azure. Either `default` or `client-secret`. For `default`, the extractor will look at the user running the extractor, and look for pre-configured Azure logins from tools like the Azure CLI. For `client-secret`, the extractor will authenticate with a configured client ID/secret pair.
`client-id`	Required for using the `client-secret` authentication method. The client ID to use when authenticating to Azure.
`secret`	Required for using the `client-secret` authentication method. The client secret to use when authenticating to Azure.
`tenant-id`	Required for using the `client-secret` authentication method. The tenant ID of the Key Vault in Azure.

Example:

key-vault:
  keyvault-name: my-keyvault-name
  authentication-method: client-secret
  tenant-id: 6f3f324e-5bfc-4f12-9abe-22ac56e2e648
  client-id: 6b4cc73e-ee58-4b61-ba43-83c4ba639be6
  secret: 1234abcd
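
For comparison, here is a sketch of the default authentication method combined with a !keyvault reference. It assumes a pre-configured Azure login (for example from the Azure CLI) is available to the user running the extractor. The vault and secret names are the placeholders used above:

```yaml
key-vault:
  keyvault-name: my-keyvault-name
  # No client-id, secret, or tenant-id is needed with the default method.
  authentication-method: default

pi:
  host: ${PI_HOST}
  username: ${PI_USER}
  # Loaded from the Key Vault secret called my-secret-name.
  password: !keyvault my-secret-name
```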

Timestamps and intervals

In most places where time intervals are required, you can use a CDF-like syntax of [N][timeunit], for example 10m for 10 minutes or 1h for 1 hour. timeunit is one of d, h, m, s, ms. You can also use a cron expression in some places.

To configure the earliest point to backfill to, you can use a similar syntax: [N][timeunit] and [N][timeunit]-ago. 1d-ago means 1 day in the past from the time history starts, and 1h means 1 hour in the future. For instance, you can use this syntax to configure the extractor to backfill only recent history.

You can also set the backfill target to a specific date in RFC 3339 format, like 2025-07-02T22:23:12Z, or in a shorter form like 2022-11-20+12:00. A timezone specifier is required. Z means UTC.
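
The fragment below sketches how these formats might appear in practice. It uses only parameters described in this document (interval in state-store, time-series-update-interval in extractor, and to in backfill); the specific values are illustrative choices, not recommendations:

```yaml
state-store:
  database: LiteDb
  location: 'state.db'
  interval: 10s                      # [N][timeunit]: every 10 seconds

extractor:
  time-series-update-interval: 24h   # every 24 hours

backfill:
  to: 30d-ago                        # relative target, 30 days in the past
  # An absolute target would use a timestamp instead, for example:
  # to: '2025-07-02T22:23:12Z'
```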

Configure the PI extractor

Configuration for the PI extractor. Each section configures a different aspect of the extractor.

Parameter	Type	Description
`version`	integer	Version of the configuration file. Each version of the extractor specifies which configuration file versions it accepts.
`logger`	object	Configuration for logging to console or file. Log entries are either `Fatal`, `Error`, `Warning`, `Information`, `Debug`, or `Verbose`, in order of decreasing priority. The extractor will log any messages at an equal or higher log level than the configured level for each sink.
`metrics`	object	Configuration for publishing metrics.
`cognite`	object	Configure connection to Cognite Data Fusion (CDF)
`state-store`	object	Include the `state-store` section to configure the extractor to save the extraction state periodically. This makes the extraction resume faster in the next run. This section is optional. If not present, or if `database` is set to `none`, the extraction state is restored by querying the timestamps of the first and last data points of each time series. If CDF Raw is used as a state store, you can see the extracted ranges under Manage staged data in CDF.
`pi`	object	Configure the extractor to connect to a particular PI server or PI collective. If you configure the extractor with a PI collective, the extractor will transparently maintain a connection to one of the active servers in the collective. The default settings provide Active Directory authorization to the PI host when the server account for the Windows service is authorized.
`time-series`	object	Include the `time-series` section for configuration related to the time series ingested by the extractor. This section is optional.
`events`	object	Configuration for writing events on reconnect and data loss incidents
`extractor`	object	The `extractor` section contains various configuration options for the operation of the extractor itself. The options here can be used to extract only a subset of the PI points in the server. This is how the list is created: 1. If any of `include-tags`, `include-prefixes`, `include-patterns` or `include-attribute-values` is not empty, start with the union of the points they match. Otherwise, start with all points. 2. Remove points as specified by `exclude-tags`, `exclude-prefixes`, `exclude-patterns` and `exclude-attribute-values`.
`backfill`	object	Include the `backfill` section to configure how the extractor fills in historical data back in time with respect to the first data point in CDF. The backfill process completes when all the data points in the PI Data Archive are sent to CDF or when the extractor reaches the target timestamp for all time series if the `to` parameter is set.
`frontfill`	object	Include the `frontfill` section to configure how the extractor fills in historical data forward in time with respect to the last data point in CDF. At startup, the extractor fills in the gap between the last data point in CDF and the last data point in PI by querying the archived data in the PI Data Archive. After that, the extractor only receives data streamed through the PI Data Pipe. These are real-time changes made to the time series in PI before archiving.
`high-availability`	object	Configuration for a Redis based high availability store. Requires Redis to be configured in `state-store`.

`logger`

Global parameter.

Configuration for logging to console or file. Log entries are either Fatal, Error, Warning, Information, Debug, or Verbose, in order of decreasing priority. The extractor will log any messages at an equal or higher log level than the configured level for each sink.

Parameter	Type	Description
`console`	object	Configuration for logging to the console.
`file`	object	Configuration for logging to a rotating log file.
`trace-listener`	object	Adds a listener that uses the configured logger to output messages from `System.Diagnostics.Trace`

`console`

Part of logger configuration.

Configuration for logging to the console.

Parameter	Type	Description
`level`	either `verbose`, `debug`, `information`, `warning`, `error` or `fatal`	Required. Minimum level of log events to write to the console. If not present, or invalid, logging to console is disabled.
`stderr-level`	either `verbose`, `debug`, `information`, `warning`, `error` or `fatal`	Log events at this level or above are redirected to standard error.

`file`

Part of logger configuration.

Configuration for logging to a rotating log file.

Parameter	Type	Description
`level`	either `verbose`, `debug`, `information`, `warning`, `error` or `fatal`	Required. Minimum level of log events to write to file.
`path`	string	Required. Path to the log files. If this is set to `logs/log.txt`, log files of the form `logs/log[date].txt` are created, depending on `rolling-interval`.
`retention-limit`	integer	Maximum number of log files that are kept in the log folder. Default value is `31`.
`rolling-interval`	either `day` or `hour`	Rolling interval for log files. Default value is `day`.
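
A sketch of a logger section that combines the console and file sinks described above. The levels chosen here are illustrative; retention-limit and rolling-interval are shown with their default values:

```yaml
logger:
  console:
    level: information     # minimum level written to the console
    stderr-level: error    # error and fatal go to standard error
  file:
    level: debug           # the file sink can be more verbose than the console
    path: logs/log.txt     # produces files of the form logs/log[date].txt
    retention-limit: 31    # default
    rolling-interval: day  # default
```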

`trace-listener`

Part of logger configuration.

Adds a listener that uses the configured logger to output messages from System.Diagnostics.Trace

Parameter	Type	Description
`level`	either `verbose`, `debug`, `information`, `warning`, `error` or `fatal`	Required. Level to output trace messages at

`metrics`

Global parameter.

Configuration for publishing metrics.

Parameter	Type	Description
`server`	object	Configuration for having the extractor start a Prometheus scrape server on a local port.
`push-gateways`	list	A list of pushgateway destinations to push metrics to. The extractor will automatically push metrics to each of these.

`server`

Part of metrics configuration.

Configuration for having the extractor start a Prometheus scrape server on a local port.

Parameter	Type	Description
`host`	string	Required. Host name for the local Prometheus server. The server must be exposed to a Prometheus instance for scraping. Examples: `localhost`, `0.0.0.0`
`port`	integer	Required. The port used for a local Prometheus server.

`push-gateways`

Part of metrics configuration.

A list of pushgateway destinations to push metrics to. The extractor will automatically push metrics to each of these.

Parameter	Type	Description
`host`	string	Required. URI of the pushgateway host. Example: `http://my.pushgateway:9091`
`job`	string	Required. Name of the Prometheus pushgateway job.
`username`	string	Username for basic authentication
`password`	string	Password for basic authentication
`push-interval`	integer	Interval in seconds between each push to the gateway. Default value is `1`.
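
A sketch of a metrics section using both options. The port, job name, and environment variable names are hypothetical placeholders; the pushgateway URI is the example from the table above:

```yaml
metrics:
  # Local scrape server (hypothetical port).
  server:
    host: localhost
    port: 9000
  # One or more Pushgateway destinations.
  push-gateways:
    - host: http://my.pushgateway:9091
      job: pi-extractor                  # hypothetical job name
      username: ${PUSHGATEWAY_USER}      # hypothetical variable names
      password: ${PUSHGATEWAY_PASSWORD}
      push-interval: 10                  # seconds between pushes
```

In most setups you would likely configure only one of server or push-gateways, depending on whether Prometheus can reach the extractor host directly.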

`cognite`

Global parameter.

Configure connection to Cognite Data Fusion (CDF)

Parameter	Type	Description
`project`	string	CDF project to connect to.
`idp-authentication`	object	The `idp-authentication` section enables the extractor to authenticate to CDF using an external identity provider (IdP), such as Microsoft Entra ID (formerly Azure Active Directory). See OAuth 2.0 client credentials flow.
`host`	string	Insert the base URL of the CDF project. Default value is `https://api.cognitedata.com`.
`cdf-retries`	object	Configure automatic retries on requests to CDF.
`cdf-chunking`	object	Configure chunking of data on requests to CDF. Note that increasing these may cause requests to fail due to limits in the API itself
`cdf-throttling`	object	Configure the maximum number of parallel requests for different CDF resources.
`sdk-logging`	object	Configure logging of requests from the SDK
`nan-replacement`	either number or null	Replacement for NaN values when writing to CDF. If left out, NaN values are skipped.
`extraction-pipeline`	object	Configure an associated extraction pipeline
`certificates`	object	Configure special handling of SSL certificates. This should never be considered a permanent solution to certificate problems
`metadata-targets`	object	Configuration for targets for time series metadata.

`idp-authentication`

Part of cognite configuration.

The idp-authentication section enables the extractor to authenticate to CDF using an external identity provider (IdP), such as Microsoft Entra ID (formerly Azure Active Directory). See OAuth 2.0 client credentials flow.

Parameter	Type	Description
`authority`	string	Insert the authority together with `tenant` to authenticate against Azure tenants. Default value is `https://login.microsoftonline.com/`.
`client-id`	string	Required. Enter the service principal client id from the IdP.
`tenant`	string	Enter the Azure tenant.
`token-url`	string	Insert the URL to fetch tokens from.
`secret`	string	Enter the service principal client secret from the IdP.
`resource`	string	Resource parameter passed along with token requests.
`audience`	string	Audience parameter passed along with token requests.
`scopes`	configuration for either list or string
`min-ttl`	integer	Insert the minimum time in seconds a token will be valid. If the cached token expires in less than `min-ttl` seconds, it will be refreshed even if it is still valid. Default value is `30`.
`certificate`	object	Authenticate with a client certificate

`certificate`

Part of idp-authentication configuration.

Authenticate with a client certificate

Parameter	Type	Description
`authority-url`	string	Authentication authority URL
`path`	string	Required. Enter the path to the .pem or .pfx certificate to be used for authentication
`password`	string	Enter the password for the key file, if it is encrypted.
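
A sketch of certificate-based authentication, assuming the certificate subsection is used in place of secret (this document does not state whether the two can be combined). The certificate path and the COGNITE_CERT_PASSWORD variable name are hypothetical:

```yaml
cognite:
  project: ${COGNITE_PROJECT}
  idp-authentication:
    tenant: ${COGNITE_TENANT_ID}
    client-id: ${COGNITE_CLIENT_ID}
    scopes:
      - ${COGNITE_SCOPE}
    certificate:
      path: certs/client.pfx              # hypothetical path to a .pem or .pfx file
      password: ${COGNITE_CERT_PASSWORD}  # only needed if the key file is encrypted
```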

`cdf-retries`

Part of cognite configuration.

Configure automatic retries on requests to CDF.

Parameter	Type	Description
`timeout`	integer	Timeout in milliseconds for each individual request to CDF. Default value is `80000`.
`max-retries`	integer	Maximum number of retries on requests to CDF. If this is less than 0, retry forever. Default value is `5`.
`max-delay`	integer	Max delay in milliseconds between each retry. Base delay is calculated according to 125*2^retry milliseconds. If less than 0, there is no maximum. Default value is `5000`.
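
The fragment below writes out the default retry settings explicitly and annotates how the base delay formula interacts with max-delay. The arithmetic in the comments follows directly from 125*2^retry; how the extractor numbers its retries is not stated here, so the comments avoid assuming a starting index:

```yaml
cognite:
  cdf-retries:
    timeout: 80000     # 80 seconds per request
    max-retries: 5     # a value below 0 would retry forever
    # 125 * 2^5 = 4000 ms is still below the cap,
    # 125 * 2^6 = 8000 ms would be limited to 5000 ms.
    max-delay: 5000
```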

`cdf-chunking`

Part of cognite configuration.

Configure chunking of data on requests to CDF. Note that increasing these may cause requests to fail due to limits in the API itself

Parameter	Type	Description
`time-series`	integer	Maximum number of timeseries per get/create timeseries request. Default value is `1000`.
`assets`	integer	Maximum number of assets per get/create assets request. Default value is `1000`.
`data-point-time-series`	integer	Maximum number of timeseries per datapoint create request. Default value is `10000`.
`data-point-delete`	integer	Maximum number of ranges per delete datapoints request. Default value is `10000`.
`data-point-list`	integer	Maximum number of timeseries per datapoint read request. Used when getting the first point in a timeseries. Default value is `100`.
`data-points`	integer	Maximum number of datapoints per datapoints create request. Default value is `100000`.
`data-points-gzip-limit`	integer	Minimum number of datapoints in request to switch to using gzip. Set to -1 to disable, and 0 to always enable (not recommended). The minimum HTTP packet size is generally 1500 bytes, so this should never be set below 100 for numeric datapoints. Even for larger packages gzip is efficient enough that packages are compressed below 1500 bytes. At 5000 it is always a performance gain. It can be set lower if bandwidth is a major issue. Default value is `5000`.
`raw-rows`	integer	Maximum number of rows per request to cdf raw. Default value is `10000`.
`raw-rows-delete`	integer	Maximum number of row keys per delete request to raw. Default value is `1000`.
`data-point-latest`	integer	Maximum number of timeseries per datapoint read latest request. Default value is `100`.
`events`	integer	Maximum number of events per get/create events request. Default value is `1000`.
`sequences`	integer	Maximum number of sequences per get/create sequences request. Default value is `1000`.
`sequence-row-sequences`	integer	Maximum number of sequences per create sequence rows request. Default value is `1000`.
`sequence-rows`	integer	Maximum number of sequence rows per sequence when creating rows. Default value is `10000`.
`instances`	integer	Maximum number of data modeling instances per get/create instance request. Default value is `1000`.

`cdf-throttling`

Part of cognite configuration.

Configure the maximum number of parallel requests for different CDF resources.

Parameter	Type	Description
`time-series`	integer	Maximum number of parallel requests per timeseries operation. Default value is `20`.
`assets`	integer	Maximum number of parallel requests per assets operation. Default value is `20`.
`data-points`	integer	Maximum number of parallel requests per datapoints operation. Default value is `10`.
`raw`	integer	Maximum number of parallel requests per raw operation. Default value is `10`.
`ranges`	integer	Maximum number of parallel requests per get first/last datapoint operation. Default value is `20`.
`events`	integer	Maximum number of parallel requests per events operation. Default value is `20`.
`sequences`	integer	Maximum number of parallel requests per sequences operation. Default value is `10`.
`instances`	integer	Maximum number of parallel requests per data modeling instances operation. Default value is `4`.
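
A sketch that lowers a few chunking and throttling values below their defaults, for example to reduce load on a constrained network. The chosen numbers are illustrative only:

```yaml
cognite:
  cdf-chunking:
    data-points: 50000             # default is 100000
    data-point-time-series: 5000   # default is 10000
  cdf-throttling:
    data-points: 5                 # default is 10
    time-series: 10                # default is 20
```

Values are lowered rather than raised here because, as noted for cdf-chunking, increasing them may cause requests to fail due to limits in the API.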

`sdk-logging`

Part of cognite configuration.

Configure logging of requests from the SDK

Parameter	Type	Description
`disable`	boolean	Set to `true` to disable logging from the SDK. Logging is enabled by default.
`level`	either `trace`, `debug`, `information`, `warning`, `error`, `critical` or `none`	Log level to log messages from the SDK at. Default value is `debug`.
`format`	string	Format of the log message. Default value is `CDF ({Message}): {HttpMethod} {Url} {ResponseHeader[X-Request-ID]} - {Elapsed} ms`.

`extraction-pipeline`

Part of cognite configuration.

Configure an associated extraction pipeline

Parameter	Type	Description
`external-id`	string	External ID of the extraction pipeline
`frequency`	integer	Frequency to report `Seen` to the extraction pipeline in seconds. Less than or equal to zero will not report automatically. Default value is `600`.
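
A sketch of linking the extractor to an extraction pipeline. The external ID my-pi-pipeline is a hypothetical placeholder for the ID noted when the pipeline was set up:

```yaml
cognite:
  extraction-pipeline:
    external-id: my-pi-pipeline   # hypothetical external ID
    frequency: 600                # report Seen every 600 seconds (default)
```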

`certificates`

Part of cognite configuration.

Configure special handling of SSL certificates. This should never be considered a permanent solution to certificate problems

Parameter	Type	Description
`accept-all`	boolean	Accept all remote SSL certificates. This introduces a severe risk of man-in-the-middle attacks
`allow-list`	list	List of certificate thumbprints to automatically accept. This is a much smaller risk than accepting all certificates

`allow-list`

Part of certificates configuration.

List of certificate thumbprints to automatically accept. This is a much smaller risk than accepting all certificates

Each element of this list should be a string.
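
A sketch of an allow-list with a single entry. The thumbprint is a made-up placeholder, and the exact thumbprint format the extractor expects is an assumption here:

```yaml
cognite:
  certificates:
    # accept-all is deliberately left unset.
    allow-list:
      - '0123456789ABCDEF0123456789ABCDEF01234567'   # placeholder thumbprint
```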

`metadata-targets`

Part of cognite configuration.

Configuration for targets for time series metadata.

Parameter	Type	Description
`raw`	object	Configuration for writing metadata to CDF Raw.
`clean`	object	Configuration for enabling writing metadata to CDF Clean.

`raw`

Part of metadata-targets configuration.

Configuration for writing metadata to CDF Raw.

Parameter	Type	Description
`database`	string	Required. The Raw database to write to.
`timeseries-table`	string	Required. Name of the Raw table to write timeseries metadata to, enables writing metadata to Raw. Metadata in this case includes name, description, and unit.

`clean`

Part of metadata-targets configuration.

Configuration for enabling writing metadata to CDF Clean.

Parameter	Type	Description
`timeseries`	boolean	Set to `false` to disable writing metadata to time series. Default value is `True`.
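
A sketch that sends time series metadata to CDF Raw and turns off metadata writes to the time series themselves. The database and table names are hypothetical:

```yaml
cognite:
  metadata-targets:
    raw:
      database: pi-metadata          # hypothetical Raw database
      timeseries-table: timeseries   # hypothetical Raw table
    clean:
      timeseries: false              # skip writing metadata to time series
```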

`state-store`

Global parameter.

Include the state-store section to configure the extractor to save the extraction state periodically. This makes the extraction resume faster in the next run. This section is optional. If not present, or if database is set to none, the extraction state is restored by querying the timestamps of the first and last data points of each time series. If CDF Raw is used as a state store, you can see the extracted ranges under Manage staged data in CDF.

Parameter	Type	Description
`location`	string	Required. Path to .db file used for storage, or name of a CDF RAW database.
`database`	either `None`, `LiteDb` or `Raw`	Which type of database to use. Default value is `None`.
`interval`	string	Enter the time between each write to the state store. 0 or less disables the state store. Format is as given in Intervals. Default value is `10s`.
`time-series-table-name`	string	Table name in Raw/Redis/LiteDB for storing time series state. Default value is `ranges`.
`extractor-table-name`	string	Table name in Raw/Redis/LiteDB for storing general extractor state. Default value is `extractor`.
`redis`	boolean	Use a redis state store. Overrides `database`.
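
As an alternative to the LiteDb setup in the minimal configuration, this sketch stores state in CDF Raw. The database name pi-extractor-state is hypothetical; the table names are the defaults:

```yaml
state-store:
  database: Raw
  location: pi-extractor-state       # name of a CDF Raw database (hypothetical)
  interval: 30s
  time-series-table-name: ranges     # default
  extractor-table-name: extractor    # default
```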

`pi`

Global parameter.

Configure the extractor to connect to a particular PI server or PI collective. If you configure the extractor with a PI collective, the extractor will transparently maintain a connection to one of the active servers in the collective. The default settings provide Active Directory authorization to the PI host when the server account for the Windows service is authorized.

Parameter	Type	Description
`host`	string	Required. Enter the hostname or IP address of the PI server or the PI collective name. If the host is a collective member name, connect to that member.
`username`	string	Enter the username to use for authentication. Leave this empty if the Windows service account under which the extractor runs is authorized in Active Directory to access the PI host.
`password`	string	Enter the password for the given username, if any.
`native-authentication`	boolean	Determines whether the extractor will use native PI authentication or Windows authentication. The default value is `false`, which indicates Windows authentication.
`parallelism`	integer	Insert the number of parallel requests to the PI server. If `backfill-parallelism` is set, this excludes backfill requests. Default value is `1`.
`backfill-parallelism`	integer	Insert the number of parallel requests to the PI server for backfills. This allows the separate throttling of backfills. The default value is 0, which means that the value in `parallelism` is used.
`use-member-priority`	boolean	When connecting to a PI Collective, attempt to connect to the member with highest priority. If set to `false`, attempt to connect to the member with the same name as `host`.
`max-connection-retries`	integer	The maximum number of times to attempt to connect to the PI server before failing fatally. If this is `0`, retry forever.
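
A sketch of a pi section that uses native PI authentication and throttles backfill requests separately from other requests. The parallelism and retry values are illustrative:

```yaml
pi:
  host: ${PI_HOST}
  username: ${PI_USER}
  password: ${PI_PASSWORD}
  native-authentication: true   # default is false (Windows authentication)
  parallelism: 4                # applies to non-backfill requests here
  backfill-parallelism: 2       # separate limit for backfill requests
  max-connection-retries: 10    # 0 would retry forever
```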

`time-series`

Global parameter.

Include the time-series section for configuration related to the time series ingested by the extractor. This section is optional.

The example below creates time series of the following form in CDF:

{
 "externalId": "pi:12345",
 "name": "PI-Point-Name",
 "isString": false,
 "isStep": false,
 "dataSetId": 1234567890123456
}

Example:

external-id-prefix: 'pi:'
external-id-source: SourceId
data-set-id: 1234567890123456

Parameter	Type	Description
`external-id-prefix`	string	Enter the external ID prefix to identify the time series in CDF. Leave empty for no prefix. The external ID in CDF will be this prefix followed by either the PI Point name or PI Point ID.
`external-id-source`	either `Name` or `SourceId`	Enter the source of the external ID. `Name` means that the PI Point name is used, while `SourceId` means that the PI Point ID is used. Default value is `Name`.
`data-set-id`	integer	Specify the data set to assign to all time series controlled by the extractor, both new and current. If you don't configure this, the extractor will not change the current time series' data set.
`data-set-external-id`	string	Specify the external ID of the data set to use, see `data-set-id`. Using this requires the `dataSets:READ` ACL in CDF.
`sanitation`	either `Remove`, `Clean` or `None`	Specify what to do when the time series fields exceed CDF limits. `Remove` will skip any time series that fail sanitation. `Clean` will truncate and remove values to conform to limits. `None` does nothing (requests may fail as a result). External IDs are never truncated, any time series exceeding CDF limits will be skipped to avoid external ID collisions, regardless of this configuration. Default value is `Clean`.
`update-metadata`	boolean	Enable updating time series if extractor configuration or PI metadata changes. Default value is `True`.
`space-id`	string	The identifier for the Data modeling space. This extractor configuration parameter overrides `data-set-id` parameter, and changes the extractor destination from the Asset-Centric TimeSeries to a Core Data Model based TimeSeries, using the `CogniteExtractorTimeSeries` system model extension.
`source-id`	string	Data modeling source ID. This parameter overrides the external ID of the source node in data modeling, which defaults to the `host` parameter from the PI server configuration.

`events`

Global parameter.

Configuration for writing events on reconnect and data loss incidents

The example configuration produces events of the following form:

{
  "externalId": "pi-events:PiExtractor-2020-08-04 01:01:21.395(0)",
  "startTime": 1596502850668,
  "endTime": 1596502880393,
  "type": "DataLoss",
  "subtype": "DataPipeOverflow",
  "source": "PiExtractor",
  "dataSetId": 1234567890123456
}

Example:

source: PiExtractor
external-id-prefix: 'pi-events:'
data-set-id: 1234567890123456
store-extractor-events-interval: 5m

Parameter	Type	Description
`source`	string	Events have this value as the `source` in CDF.
`external-id-prefix`	string	Set an external ID prefix to identify events in CDF. Leave empty for no prefix.
`data-set-id`	integer	Select a data set to assign to all events created by the extractor. We recommend using the same data set ID used in the Time series section.
`data-set-external-id`	string	Data set external ID added to events written to CDF, see `data-set-id`. Using this requires the `dataSets:READ` ACL in CDF
`store-extractor-events-interval`	string	Store events in CDF with this interval. Format is as given in Intervals. If this is not set, events are not created in CDF. Examples: `10s` `5m`

`extractor`

Global parameter.

The extractor section contains various configuration options for the operation of the extractor itself. The options here can be used to extract only a subset of the PI points in the server. This is how the list is created:

If any of include-tags, include-prefixes, include-patterns or include-attribute-values is not empty, start with the union of the points they match. Otherwise, start with all points.
Remove points as specified by exclude-tags, exclude-prefixes, exclude-patterns and exclude-attribute-values.

Parameter	Type	Description
`include-tags`	list	Include tags with name exactly equal to one of these names.
`include-prefixes`	list	Include tags with name starting with one of these prefixes.
`include-patterns`	list	Include tags with name containing one of these substrings.
`include-attribute-values`	list	Include tags with attributes equal to at least one of these key/value pairs.
`exclude-tags`	list	Exclude tags with name exactly equal to one of these names.
`exclude-prefixes`	list	Exclude tags with name starting with one of these prefixes.
`exclude-patterns`	list	Exclude tags with name containing one of these substrings.
`exclude-attribute-values`	list	Exclude tags with attributes equal to at least one of these key/value pairs.
`time-series-update-interval`	either string or integer	Interval between checks for new time series in PI. Format is as given in Intervals. This option also accepts cron expressions. Default value is `24h`.
`deleted-time-series`	object	Include the `deleted-time-series` subsection to configure how the extractor handles time series that exist in CDF but not in PI. This subsection is optional, and the default behavior is `none` (do nothing).
`end-of-stream-interval`	string	Interval for fetching the end-of-stream timestamps from PI. Format is as given in Intervals.
`end-of-stream-chunking`	integer	Maximum number of time series per end-of-stream request. Default value is `10000`.
`end-of-stream-throttle`	integer	Maximum number of parallel end-of-stream requests. Default value is `10`.
`time-series-update-throttle`	integer	Maximum number of parallel time series updates. Default value is `10`.
`dry-run`	boolean	Set this to `true` to run the extractor in dry-run mode, where it does not push any data to CDF. Useful for testing the connection to the PI Server
`read-extracted-ranges`	boolean	Set this to `false` to disable reading from CDF on extractor startup. If this is set to `false`, it is strongly recommended to have a state-store configured, or the extractor will read all history from PI on every run. Default value is `True`.
`status-codes`	object	Configuration for ingesting status codes to CDF timeseries.
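
The two-step selection described for this section can be sketched as follows. All tag names, prefixes, and the attribute key/value pair are hypothetical; the shape of include-attribute-values follows the key and value parameters documented for that list:

```yaml
extractor:
  # Step 1: start with the union of everything matched by the include rules.
  include-prefixes:
    - 'Plant1.'
  include-attribute-values:
    - key: pointsource     # hypothetical attribute name
      value: R             # hypothetical attribute value
  # Step 2: remove points matched by any exclude rule.
  exclude-patterns:
    - 'test'
  exclude-tags:
    - 'Plant1.Unit3.Temp'
```

With these rules, a point named Plant1.Unit3.Temp would be included by the prefix rule in step 1 and then removed by exclude-tags in step 2.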

`include-tags`

Part of extractor configuration.

Include tags with name exactly equal to one of these names.

Each element of this list should be a string.

`include-prefixes`

Part of extractor configuration.

Include tags with name starting with one of these prefixes.

Each element of this list should be a string.

`include-patterns`

Part of extractor configuration.

Include tags with name containing one of these substrings.

Each element of this list should be a string.

`include-attribute-values`

Part of extractor configuration.

Include tags with attributes equal to at least one of these key/value pairs.

Parameter	Type	Description
`key`	string	Attribute name
`value`	string	Attribute value

`exclude-tags`

Part of extractor configuration.

Exclude tags with name exactly equal to one of these names.

Each element of this list should be a string.

`exclude-prefixes`

Part of extractor configuration.

Exclude tags with name starting with one of these prefixes.

Each element of this list should be a string.

`exclude-patterns`

Part of extractor configuration.

Exclude tags with name containing one of these substrings.

Each element of this list should be a string.

`exclude-attribute-values`

Part of extractor configuration.

Exclude tags with attributes equal to at least one of these key/value pairs.

Parameter	Type	Description
`key`	string	Attribute name
`value`	string	Attribute value
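
The exclude filters follow the same shape as the include filters. In the sketch below, all names and values are hypothetical placeholders.

```yaml
extractor:
  # Exact tag names to leave out (hypothetical)
  exclude-tags:
    - 'TEST_TAG_01'
  # Leave out tags whose names start with one of these prefixes (hypothetical)
  exclude-prefixes:
    - 'Sim.'
  # Leave out tags whose names contain one of these substrings (hypothetical)
  exclude-patterns:
    - '_old'
  # Leave out tags with an attribute equal to one of these key/value pairs (hypothetical)
  exclude-attribute-values:
    - key: 'pointsource'
      value: 'L'
```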

`deleted-time-series`

Part of extractor configuration.

Include the `deleted-time-series` subsection to configure how the extractor handles time series that exist in CDF but not in PI. This subsection is optional, and the default behavior is `none` (do nothing).

Note

This only affects time series with the same data set ID and external ID prefix as the time series configured in the extractor.

To find time series that exist in CDF but not in PI, the extractor:

Lists all time series in CDF that have the configured external ID prefix and data set ID.
Filters the time series using the include/exclude rules defined in the extractor section.
Matches the result against the time series obtained from the PI Server after filtering these using the include/exclude rules.

Parameter	Type	Description
`behavior`	either `None`, `Flag` or `Delete`	Select the action taken by the extractor. Setting this to `Flag` performs a soft deletion: the time series are flagged as deleted but not removed from CDF. Setting this to `Delete` permanently deletes from CDF the time series that cannot be found in PI. Default value is `None`.
`flag-name`	string	If you've set `behavior` to `Flag`, use this string to mark the time series as deleted. Metadata is added to the time series with this as the key, and the current time as the value. Default value is `deletedByExtractor`.
`time-series-delete-throttle`	integer	Maximum number of parallel delete operations. Default value is `10`.
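
For example, a soft-deletion setup could be sketched as follows. The choice of `Flag` is illustrative; the other two values are the documented defaults written out explicitly.

```yaml
extractor:
  deleted-time-series:
    # Flag time series missing from PI instead of deleting them from CDF
    behavior: Flag
    # Metadata key added to flagged time series (documented default)
    flag-name: 'deletedByExtractor'
    # Maximum number of parallel delete operations (documented default)
    time-series-delete-throttle: 10
```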

`status-codes`

Part of extractor configuration.

Configuration for ingesting status codes to CDF time series.

Parameter	Type	Description
`status-codes-to-ingest`	either `GoodOnly`, `Uncertain` or `All`	Which data points to ingest to CDF. `All` ingests all data points, including bad ones. `Uncertain` ingests good and uncertain data points. `GoodOnly` ingests only good data points. Default value is `GoodOnly`.
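
A short sketch of this subsection, assuming you want good and uncertain data points but not bad ones:

```yaml
extractor:
  status-codes:
    # Ingest good and uncertain data points; bad data points are skipped
    status-codes-to-ingest: Uncertain
```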

`backfill`

Global parameter.

Include the `backfill` section to configure how the extractor fills in historical data back in time with respect to the first data point in CDF. The backfill process completes when all the data points in the PI Data Archive are sent to CDF or, if the `to` parameter is set, when the extractor reaches the target timestamp for all time series.

Parameter	Type	Description
`skip`	boolean	Set to `true` to disable backfill.
`step-size-hours`	integer	Step size, in a whole number of hours. Set to 0 to freely backfill all time series. Each iteration of backfill will backfill all time series to the next step before stepping further backward. This helps even out the progression of long backfill processes. Default value is `168`.
`to`	string	The target CDF timestamp in milliseconds at which to stop the backfill. Format is as given in Timestamps and intervals. Add `-ago` to make a relative timestamp in the past. Default value is `0`. Examples: `3d-ago`, `2025-07-02T22:23:12Z`, `2022-11-20+12:00`
`time-series-delay`	integer	Delay in milliseconds between each time series backfill request within a step.
`step-delay`	integer	Delay in milliseconds between each step.
`retry-bucket-size`	integer	Maximum size of the retry bucket. Zero or less for no limit. Default value is `100`.
`retry-bucket-cleanup-time`	integer	Delay in seconds between each cleanup of the retry bucket. Default value is `3600`.
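
As a sketch, a `backfill` section that stops at a relative target timestamp could look like this. The `to` value is taken from the examples in the table above and is illustrative only; the remaining values are the documented defaults.

```yaml
backfill:
  # Keep backfill enabled
  skip: false
  # Backfill all time series one week at a time (documented default)
  step-size-hours: 168
  # Stop the backfill at this target timestamp (illustrative value)
  to: '3d-ago'
  # Documented defaults
  retry-bucket-size: 100
  retry-bucket-cleanup-time: 3600
```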

`frontfill`

Global parameter.

Include the frontfill section to configure how the extractor fills in historical data forward in time with respect to the last data point in CDF. At startup, the extractor fills in the gap between the last data point in CDF and the last data point in PI by querying the archived data in the PI Data Archive. After that, the extractor only receives data streamed through the PI Data Pipe. These are real-time changes made to the time series in PI before archiving.

Note

When data points are archived in PI, they may be subject to compression, reducing the total amount of data points in a time series. Therefore, the backfill and frontfill tasks will receive data points after compression, while the streaming task will receive data points before compression. Learn more about compression in this video.

Parameter	Type	Description
`skip`	boolean	Set this to `true` to disable frontfill and streaming.
`streaming-interval`	string	Interval between each call to the PI Data Pipe to fetch new data. Format is as given in Intervals. If set to zero, the extractor will not fetch live data from PI. If you set this parameter to a high value, there is a higher chance of having a client outbound queue overflow in the PI server. Overflow may cause loss of data points in CDF. Recommended range is between 1 and 30 seconds. If you observe data loss events, consider turning this value down. Default value is `1s`.
`delete-data-points`	boolean	If you set this to `true`, the corresponding data points are deleted in CDF when data point deletion events are received in the PI Data Pipe. Enabling this parameter may increase the streaming latency of the extractor since the extractor verifies the data point deletion before proceeding with the insertions.
`use-data-pipe`	boolean	Older PI servers may not support data pipes. If that's the case, set this value to `false` to disable data pipe streaming. The frontfiller task will run constantly and will frequently query the PI Data Archive for new data points. Default value is `True`.
`time-series-chunk`	integer	The maximum number of time series in each frontfill query to PI. The chunks will be adapted according to the density of data points per time series. Default value is `1000`.
`data-points-chunk`	integer	The maximum number of requested data points in each frontfill query to PI. Default value is `10000`.
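
The parameters above could be combined as in the following sketch. Most values repeat the documented defaults; `delete-data-points: false` is an illustrative choice, since no default is stated for it here.

```yaml
frontfill:
  # Keep frontfill and streaming enabled
  skip: false
  # Poll the PI Data Pipe every second (documented default)
  streaming-interval: '1s'
  # Do not mirror data point deletions from PI to CDF (illustrative choice)
  delete-data-points: false
  # Set to false for older PI servers without data pipe support
  use-data-pipe: true
  # Documented defaults
  time-series-chunk: 1000
  data-points-chunk: 10000
```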

`high-availability`

Global parameter.

Configuration for a Redis-based high availability store. Requires Redis to be configured in `state-store`.

Parameter	Type	Description
`interval`	string	Interval to update the high availability state in Redis. Format is as given in Intervals.
`timeout`	integer	Timeout in seconds before taking over as active extractor. Must be set greater than 0 to enable high availability.
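
A minimal sketch of this section is shown below. Both values are assumptions chosen for illustration, not recommended settings, and the Redis connection itself must be configured separately in `state-store`.

```yaml
high-availability:
  # How often to update the high availability state in Redis (assumed value)
  interval: '10s'
  # Seconds before taking over as the active extractor; must be greater than 0 (assumed value)
  timeout: 30
```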
