Configure the OSDU extractor

To configure the OSDU extractor, create a configuration file in YAML format. The file is split into sections, and each section is a top-level entry in the YAML file.

Naming the configuration file

You must name the configuration file config.yml.

You can use the sample configuration file included with the extractor as a starting point. At a minimum, you must adjust these settings before you run the extractor:

- connector:
  - extract:
    - raw-database: Enter the name of the target CDF RAW database.
    - dataset-id: Enter the ID of the target CDF data set (16-digit integer).
    - kinds: Check that this matches the list of OSDU schema IDs to extract.
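
The minimum settings listed above could look roughly like this in config.yml. This is only a sketch: the database name and data set ID are made-up placeholders, and the kind pattern is borrowed from the name-pattern example in the connector section.

```yaml
connector:
  extract:
    # Placeholder name of the target CDF RAW database
    raw-database: osdu_raw
    # Placeholder 16-digit data set ID; use the ID of your own data set
    dataset-id: 1234567890123456
    # One entry per OSDU kind (or wildcard pattern) to extract
    kinds:
      - name-pattern: "osdu:wks:master-data--Well*:*"
```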

Tip: You can set up extraction pipelines to use versioned extractor configuration files stored in the cloud.

cognite

Include the cognite section to configure which CDF project the extractor will load data into and how to connect to the project. This section is mandatory and should always contain the project and authentication configuration.

| Parameter | Description |
| --- | --- |
| host | Insert the base URL of the CDF project. The default value is https://api.cognitedata.com. |
| project | Insert the CDF project name you want to ingest data into. |
| timeout | Specify the number of seconds to wait for a response to a request made to CDF. The default value is 30 seconds. |
| idp-authentication | Insert the credentials for authenticating to CDF using an external identity provider. |
| idp-authentication > client-id | Enter the client ID from the IdP. |
| idp-authentication > scopes | List the scopes. This is usually [{host}/.default]. |
| idp-authentication > secret | Enter the client secret from the IdP. |
| idp-authentication > token-url | Insert the URL to fetch authentication tokens from. |
| extraction-pipeline | Insert the external ID of an extraction pipeline in CDF. You should create the extraction pipeline before you configure this section. This parameter is optional. |
| extraction-pipeline > external-id | Enter the external ID of the extraction pipeline in CDF. This parameter is optional. |
| extraction-pipeline > id | Enter the ID of the extraction pipeline in CDF. This parameter is optional. |
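
As an illustration, a cognite section built from the parameters above might look like the following. All values in angle brackets, and the project name, are placeholders rather than real settings; the scope assumes the default host.

```yaml
cognite:
  # Default base URL; change it if your project runs on another cluster
  host: https://api.cognitedata.com
  # Placeholder project name
  project: my-cdf-project
  # Seconds to wait for a response from CDF (documented default)
  timeout: 30
  idp-authentication:
    client-id: "<client-id-from-idp>"
    secret: "<client-secret-from-idp>"
    # Usually {host}/.default
    scopes:
      - https://api.cognitedata.com/.default
    token-url: "<token-url-of-your-idp>"
  # Optional: link the extractor to an existing extraction pipeline
  extraction-pipeline:
    external-id: "<extraction-pipeline-external-id>"
```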

osdu-client

Include the osdu-client section to configure the connection to the OSDU platform.

authentication

| Parameter | Description |
| --- | --- |
| api-url | Insert the base URL of the OSDU API. |
| client-id | Enter the OSDU client ID. |
| client-secret | Enter the OSDU client secret. |
| data-partition | Enter the name of the OSDU data partition. |
| scope | Specify the OSDU scope. |
| tenant-id | Enter the Azure tenant ID. This parameter is optional. |
| timeout | Enter the maximum time in seconds to wait for a response to a request made to OSDU. The default value is 30 seconds. |
| token-url | Insert the URL to fetch authentication tokens from. |

services

| Parameter | Description |
| --- | --- |
| cursor-fetch-size | Specify the number of search results to fetch in each request. |
| dms-parallelism | Insert the number of parallel threads hitting the DMS API. The default value is 6. |
| generic-parallelism | Insert the number of parallel threads hitting the generic API. The default value is 32. |
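
A possible osdu-client section is sketched below. It assumes that the authentication and services headings above correspond to nested keys of the same name, which this page does not state explicitly; check the sample configuration file for the actual layout. Values in angle brackets and the cursor-fetch-size value are placeholders.

```yaml
osdu-client:
  # Assumption: "authentication" is a nested key under osdu-client
  authentication:
    api-url: "<osdu-api-base-url>"
    client-id: "<osdu-client-id>"
    client-secret: "<osdu-client-secret>"
    data-partition: "<osdu-data-partition>"
    scope: "<osdu-scope>"
    # Optional Azure tenant ID
    tenant-id: "<azure-tenant-id>"
    # Seconds to wait for a response from OSDU (documented default)
    timeout: 30
    token-url: "<token-url>"
  # Assumption: "services" is a nested key under osdu-client
  services:
    # Placeholder value; no default is documented
    cursor-fetch-size: 1000
    # Documented defaults
    dms-parallelism: 6
    generic-parallelism: 32
```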

connector

Include the connector section to configure the general settings for the extractor.

| Parameter | Description |
| --- | --- |
| sleep-time | Enter the number of seconds to pause between polls for changes. |
| total-parallelism | Enter the maximum number of threads employed. The default value is 64. |
| upload-queue-size | Enter the number of items to accumulate in a queue between extraction and writing to CDF RAW and CDF Files. |
| upload-queue-interval | Enter the number of seconds between uploads of a queue when the size is not reached. |
| extract | Settings specific to the extraction direction. |
| extract > raw-database | Enter the database name in CDF RAW that stores the extracted OSDU records. |
| extract > statestore-table | Enter the name of the auxiliary CDF RAW table that stores the state of the extractor. |
| extract > dataset-id | Insert the data set ID for the extracted data files. |
| extract > kinds | List of the OSDU data types to extract. Each kind has the following settings: |
| extract > kinds > name-pattern | Enter the name / schema ID of the OSDU kind. You can include multiple kinds in a single entry by entering a pattern with Unix shell-style wildcards (* and ?). For example, osdu:wks:master-data--Well*:* would match any version of osdu:wks:master-data--Well as well as osdu:wks:master-data--Wellbore. |
| extract > kinds > filter | Enter a query following the Lucene syntax to filter which records to extract from OSDU, for example "createTime:[2022-04-19T16 TO *]" or "data.Source:\"BLENDED\"". This parameter is optional. |
| extract > kinds > dms-kind | Use this parameter when the data for the kind is stored in a DDMS instead of generic OSDU files. Supported values are wellbore_dms_well_log and wellbore_dms_trajectory. This parameter is optional. |
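
To show how the per-kind settings combine, here is a hedged sketch of a connector section. The timing and queue values, table names, and data set ID are placeholders, and pairing the WellLog kind with wellbore_dms_well_log is an illustrative assumption rather than something this page specifies.

```yaml
connector:
  # Placeholder polling and queue settings; no defaults are documented
  sleep-time: 60
  upload-queue-size: 1000
  upload-queue-interval: 30
  # Documented default
  total-parallelism: 64
  extract:
    raw-database: osdu_raw
    statestore-table: osdu_extractor_state
    dataset-id: 1234567890123456
    kinds:
      # Wildcard pattern covering Well and Wellbore, any version
      - name-pattern: "osdu:wks:master-data--Well*:*"
      # Same kind of entry, narrowed with a Lucene filter
      - name-pattern: "osdu:wks:master-data--Wellbore:*"
        filter: 'data.Source:"BLENDED"'
      # Illustrative kind whose data is read from a DDMS
      - name-pattern: "osdu:wks:work-product-component--WellLog:*"
        dms-kind: wellbore_dms_well_log
```

In a real configuration, you would typically avoid listing overlapping patterns such as the first two entries; they appear together here only to show the filter setting.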

logger

Include the logger section to set up logging to the console and to files.

| Parameter | Description |
| --- | --- |
| console | Enable logging to standard output, such as a terminal window. This parameter is optional. |
| console > level | Select the verbosity level for console logging. Valid options are debug, info, warning, and error. The default value is info. |
| file | Enable logging to a file. This parameter is optional. |
| file > level | Select the verbosity level for file logging. Valid options are debug, info, warning, and error. The default value is info. |
| file > log_json | Set to true to enable logging in JSON format. The default value is false. |
| file > path | Insert the file system path to the log file. |
| file > retention | Specify the maximum number of days to retain logs. The default value is 7. |
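
A logger section that enables both outputs might look like this. The log path is a placeholder, and placing log_json under file is an assumption based on the order of the parameters above.

```yaml
logger:
  console:
    # One of: debug, info, warning, error
    level: info
  file:
    level: debug
    # Assumption: log_json belongs to the file logger
    log_json: false
    # Placeholder path to the log file
    path: logs/osdu-extractor.log
    # Days to keep log files (documented default)
    retention: 7
```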

metrics

Include the metrics section to send metrics about the extractor's performance for remote monitoring. This section is optional. We recommend sending metrics to a Prometheus pushgateway, but you can also send metrics as time series in the CDF project.

| Parameter | Description |
| --- | --- |
| cognite | Cognite metrics configurations. This parameter is optional. |
| cognite > external_id_prefix | Enter an external ID prefix to identify the CDF time series created for each metric. |
| cognite > push-interval | Enter the interval in seconds between each push. The default value is 30. |
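
For sending metrics to CDF as time series, a minimal sketch could be the following. It assumes both parameters are nested under cognite, and the prefix is a placeholder.

```yaml
metrics:
  cognite:
    # Placeholder prefix for the time series created per metric
    external_id_prefix: "osdu_extractor_"
    # Seconds between pushes (documented default)
    push-interval: 30
```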