Configure the OSDU extractor
To configure the OSDU extractor, you must create a configuration file. This file must be in YAML format. The configuration file is split into sections, each represented by a top-level entry in the YAML format.
You must name the configuration file config.yml.
You can use the sample configuration file included with the extractor as a starting point for your configuration settings. At a minimum, adjust these settings in the extract subsection of the connector section before you run the extractor:
raw-database: Enter the name of the target CDF RAW database.
dataset-id: Enter the ID of the target CDF data set (16-digit integer).
kinds: Check that this matches the list of OSDU schema IDs to extract.
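
As a rough sketch, the minimum settings could look like this in config.yml. The nesting of extract under connector and the name-pattern key for each kinds entry are taken from the connector section of this page. The database name, data set ID, and kind pattern are placeholder values, not defaults.

```yaml
# Minimal sketch of the settings to adjust; all values are placeholders.
connector:
  extract:
    raw-database: "osdu-raw"            # hypothetical name of the target CDF RAW database
    dataset-id: 1234567890123456        # hypothetical 16-digit data set ID
    kinds:
      - name-pattern: "osdu:wks:master-data--Well*:*"   # pattern from the kinds description
```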
You can set up extraction pipelines to use versioned extractor configuration files stored in the cloud.
cognite
Include the cognite section to configure which CDF project the extractor loads data into and how to connect to the project. This section is mandatory and must always contain the project and authentication configuration.
Parameter | Description |
---|---|
host | Insert the base URL of the CDF project. The default value is https://api.cognitedata.com. |
project | Insert the CDF project name you want to ingest data into. |
timeout | Specify the number of seconds to wait for a response to a request made to CDF. The default value is 30 seconds. |
idp-authentication | Insert the credentials for authenticating to CDF using an external identity provider. |
client-id | Enter the client ID from the IdP. |
scopes | List the scopes. This is usually [{host}/.default]. |
secret | Enter the client secret from the IdP. |
token-url | Insert the URL to fetch authentication tokens from. |
extraction-pipeline | Insert the external ID of an extraction pipeline in CDF. You should create the extraction pipeline before you configure this section. This parameter is optional. |
external-id | Enter the external ID of the extraction pipeline in CDF. This parameter is optional. |
id | Enter the ID of the extraction pipeline in CDF. This parameter is optional. |
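
The following sketch shows how the cognite section might be laid out. It assumes that client-id, scopes, secret, and token-url are nested under idp-authentication and that external-id is nested under extraction-pipeline, which the flat table above does not state explicitly. The project name, credentials, token URL, and pipeline ID are placeholders.

```yaml
# Sketch of the cognite section; nesting is assumed, values are placeholders.
cognite:
  host: "https://api.cognitedata.com"   # default value from the table
  project: "my-cdf-project"             # hypothetical project name
  timeout: 30                           # seconds; default value from the table
  idp-authentication:
    client-id: "<client-id>"
    secret: "<client-secret>"
    scopes:
      - "https://api.cognitedata.com/.default"   # {host}/.default
    token-url: "<token-url-of-your-idp>"
  extraction-pipeline:                  # optional
    external-id: "osdu-extractor"       # hypothetical external ID
```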
osdu-client
Include the osdu-client section to configure the connection to the OSDU platform.
authentication
Parameter | Description |
---|---|
api-url | Insert the base URL of the OSDU API. |
client-id | Enter the OSDU client ID. |
client-secret | Enter the OSDU client secret. |
data-partition | Enter the name of the OSDU data partition. |
scope | Specify the OSDU scope. |
tenant-id | Enter the Azure tenant ID. This parameter is optional. |
timeout | Enter the maximum time in seconds to wait for a response to a request made to OSDU. The default value is 30 seconds. |
token-url | Insert the URL to fetch authentication tokens from. |
services
Parameter | Description |
---|---|
cursor-fetch-size | Specify the number of search results to fetch in each request. |
dms-parallelism | Insert the number of parallel threads hitting the DMS API. The default value is 6. |
generic-parallelism | Insert the number of parallel threads hitting the generic API. The default value is 32. |
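
A sketch of the osdu-client section, assuming that authentication and services are subsections nested under osdu-client, as the headings above suggest. The URLs, credentials, partition name, scope, and cursor-fetch-size value are placeholders; the timeout and parallelism values are the defaults listed in the tables.

```yaml
# Sketch of the osdu-client section; nesting is assumed, values are placeholders.
osdu-client:
  authentication:
    api-url: "<osdu-api-base-url>"
    client-id: "<osdu-client-id>"
    client-secret: "<osdu-client-secret>"
    data-partition: "<data-partition-name>"
    scope: "<osdu-scope>"
    tenant-id: "<azure-tenant-id>"      # optional
    timeout: 30                         # seconds; default value
    token-url: "<token-url>"
  services:
    cursor-fetch-size: 1000             # hypothetical value, no default is documented here
    dms-parallelism: 6                  # default value
    generic-parallelism: 32             # default value
```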
connector
Include the connector section to configure the general settings for the extractor.
Parameter | Description |
---|---|
sleep-time | Enter the number of seconds to pause between polls for changes. |
total-parallelism | Enter the maximum number of threads employed. The default value is 64. |
upload-queue-size | Enter the number of items to accumulate in a queue between extraction and writing to CDF RAW and CDF Files. |
upload-queue-interval | Enter the number of seconds between queue uploads when the queue size has not been reached. |
extract | Settings specific to the extraction direction. |
raw-database | Enter the database name in CDF RAW that stores the extracted OSDU records. |
statestore-table | Enter the name of the auxiliary CDF RAW table that stores the state of the extractor. |
dataset-id | Insert the data set ID for the extracted data files. |
kinds | List of the OSDU data types to extract. Each kind has the following settings: |
name-pattern | Enter the name (schema ID) of the OSDU kind. You can include multiple kinds in a single entry by entering a pattern with Unix shell-style wildcards (* and ?). For example, osdu:wks:master-data--Well*:* would match any version of osdu:wks:master-data--Well as well as osdu:wks:master-data--Wellbore. |
filter | Enter a query following the Lucene syntax to filter which records to extract from OSDU, for example "createTime:[2022-04-19T16 TO *]" or "data.Source:\"BLENDED\"". This parameter is optional. |
dms-kind | Use this parameter when the data for the kind is stored in a DDMS instead of generic OSDU files. Supported values are wellbore_dms_well_log and wellbore_dms_trajectory. This parameter is optional. |
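
A sketch of a connector section with two kinds entries. It assumes that raw-database, statestore-table, dataset-id, and kinds are nested under extract, and that name-pattern, filter, and dms-kind are keys of each kinds entry. All values except total-parallelism (the documented default) are placeholders, and the second kind name is hypothetical.

```yaml
# Sketch of the connector section; nesting is assumed, values are placeholders.
connector:
  sleep-time: 60                        # hypothetical: seconds between polls
  total-parallelism: 64                 # default value
  upload-queue-size: 1000               # hypothetical
  upload-queue-interval: 30             # hypothetical: seconds
  extract:
    raw-database: "osdu-raw"            # hypothetical database name
    statestore-table: "extractor-state" # hypothetical table name
    dataset-id: 1234567890123456        # hypothetical data set ID
    kinds:
      # Any version of Well and Wellbore, restricted by a Lucene filter.
      - name-pattern: "osdu:wks:master-data--Well*:*"
        filter: 'createTime:[2022-04-19T16 TO *]'
      # A kind whose data is stored in a DDMS (hypothetical kind name).
      - name-pattern: "osdu:wks:work-product-component--WellLog:*"
        dms-kind: wellbore_dms_well_log
```

Single quotes around the filter keep the square brackets and the asterisk from being interpreted as YAML syntax.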
logger
Include the logger section to set up logging to a console and files.
Parameter | Description |
---|---|
console | Enable logging to standard output, such as a terminal window. This parameter is optional. |
level | Select the verbosity level for console logging. Valid options are debug, info, warning, and error. The default value is info. |
file | Enable logging to a file. This parameter is optional. |
level | Select the verbosity level for file logging. Valid options are debug, info, warning, and error. The default value is info. |
log_json | Set to true to enable logging in JSON format. The default value is false. |
path | Insert the file system path to the log file. |
retention | Specify the maximum number of days to retain logs. The default value is 7. |
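
A sketch of a logger section that logs to both the console and a file. It assumes that level is nested under console, and that level, log_json, path, and retention are nested under file; the table above lists them flat, so verify the placement against the sample configuration file. The log path is a placeholder.

```yaml
# Sketch of the logger section; nesting is assumed.
logger:
  console:
    level: info                         # default value
  file:
    level: debug
    log_json: false                     # default value; placement under file is an assumption
    path: "logs/osdu-extractor.log"     # hypothetical path
    retention: 7                        # days; default value
```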
metrics
Include the metrics section to send metrics about extractor performance for remote monitoring. This section is optional. We recommend sending metrics to a Prometheus pushgateway, but you can also send metrics as time series in the CDF project.
Parameter | Description |
---|---|
cognite | Cognite metrics configurations. This parameter is optional. |
external_id_prefix | Enter an external ID prefix to identify the CDF time series created for each metric. |
push-interval | Enter the interval in seconds between each push. The default value is 30. |
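
A sketch of a metrics section that sends metrics as time series to the CDF project. It assumes that external_id_prefix and push-interval are nested under cognite. The prefix is a placeholder, and the key spellings (underscore versus hyphen) are copied from the table as written.

```yaml
# Sketch of the metrics section; nesting is assumed.
metrics:
  cognite:
    external_id_prefix: "osdu-extractor:"   # hypothetical prefix
    push-interval: 30                       # seconds; default value
```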