Skip to main content

Setting up the DB extractor

Follow the steps below to set up the extractor.

Before you start

  1. Set up ODBC drivers for the databases the extractor will connect to.
  2. Check the server requirements for the extractor.
  3. Create a configuration file according to the configuration settings. The file must be in YAML format.

Connect to a database

The DB extractor uses the ODBC layer to connect to databases, and you must install an ODBC driver for your database on the machine you're running the extractor from. Here's a list of links to ODBC drivers for some source systems:

ODBC connection strings

ODBC uses connection strings to reference databases. The connection strings contain information about which ODBC driver to use, where to find the database, sign-in credentials, etc. See Connection strings for examples.

Set up a Data Source Name (DSN)

We recommend setting up a DSN for the database if you're running the extractor on Windows. Then, the Windows DSN system handles password storage instead of keeping it in the configuration file or as an environment variable. In addition, the connection strings will be less complex.

To set up a DSN for your database:

  1. Open the ODBC Data Sources tool on the machine you're running the extractor from.

  2. Select the -bit version for your ODBC driver.

  3. Select Add and the ODBC driver you want to use. In this example, we're configuring a PostgreSQL database:

    Add ODBC driver
  4. Select Finish.

  5. Enter the connection information and the database name. Note that the image below may differ depending on which database type you are configuring a DSN for.

    Add ODBC driver
  6. Select Test to verify that the information is correct, then Save.

The DSN is now set up, and you can use the simplified connection string in your configuration file:

databases:-    name: my-postgres    connection-string: "DSN=MyPostgreSQL"

Run as a Windows standalone executable file

  1. Download the dbextractor-standalone-{VERSION}-win32.exe file via the download links available from the Cognite DB extractor section on the Extract data page in the CDF user interface.
  2. Save the file in a folder.
  3. Open a command line window and run the file with a configuration file as an argument.

In this example, the configuration file is named config.yml and saved in the same folder as the executable file:


> .\dbextractor-standalone-<VERSION>-win32.exe .\config.yml

Tips

For scheduled routine runs, you can automate these calls with the Windows Task Scheduler.

Run as a Windows service

  1. Download the dbextractor-service-{VERSION}-win32.exe file via the download links available from the Cognite DB extractor section on the Extract data page in the CDF user interface.

  2. Save the file to the same directory as a configuration file named config.yml. This must be the exact name of the file.

  3. As an administrator, open up the command line window in the folder you placed the executable file and the configuration file and run the following command:


    > .\dbextractor-service-<VERSION>-win32.exe install

  4. Open the Services app in Windows and find the Cognite DB Extractor Executor service.

  5. Right-click and select Properties.

  6. Configure the service according to your requirements.

Run as a Docker container

Use a Docker container to run the DB extractor on Mac OS or Linux systems. This article describes how to create file-specific bindings with the container. You should consider sharing a folder if a deployment uses more than one file, for example, logs and local state files.

Step 1: Set up the Docker container

  1. Create a Docker image containing the ODBC drivers for the database systems you want to extract data from. This example uses a PostgreSQL database.

    1. Create a Docker file:

      FROM cognite/db-extractor-base:2.5.0
      1. Install your ODBC drivers. For example, for PostgreSQL:
      RUN apt-get install odbc-postgresql
      1. Save the Docker file with a descriptive name. In this example, we use postgres.Dockerfile.

      2. Build the Docker image:

      docker build -f postgres.Dockerfile -t db-extractor-postgres

      Replace postgres.Dockerfile with the name of your Docker file.

      Replace db-extractor-postgres with the tag you want to refer to this Docker image as.

  2. Create a Docker container running this image:

    docker run <other options in here, such as volume sharing and network config> db-extractor-postgres

    Replace db-extractor-postgres with the tag you created above.

Run against a local host on Docker

When a Docker container runs, the localhost address points to the container and not the host. This creates an issue if the container runs against a local database. If you're running Docker for a Mac OC, use the docker.for.mac.localhost DNS name to resolve to the host.

Step 2: Run the Docker container

Step 2a: Using docker run

  1. Enter this docker run statement:
  `$ docker run --network=host --name db-extractor -v /path/to/config.yaml:/config.yaml eu.gcr.io/cognite-registry/db-extractor:<version> /config.yaml`

Where:

  • docker run starts a new container.

  • --name db-extractor names the container db-extractor. This is to recognize the extractor when running docker ps and to stop the extractor with docker stop db-extractor.

  • --network=host grants the container full network access. This is required for pushing metrics and connecting to CDF.

  • -v /path/to/config.yaml:/config.yaml shares the file at /path/to/config.yaml with the container, putting it at /config.yaml.

  • eu.gcr.io/cognite-registry/db-extractor:<version> is the name of the image. Replace <version> with the version you'll run.

  • /config.yaml is sent to the extractor as the config file to be opened.

  1. Share log files similarly to config files if you want the log files to persist when a run stops.

    Tip

    If you're using the local state store option, remember to share the extractor state JSON file.

  2. If you're using environment variables to hold configuration values, you must share the environment variable with the container. Use the -e parameter, and follow this syntax:

-e CONTAINER_VAR_NAME=value

Step 2b: Using docker-compose

Store the parameters in a .yaml file and use the docker-compose up command to run the extractor.

  1. Create a docker-compose.yml file:
    version: '3'    services:      db-extractor:         `image: eu.gcr.io/cognite-registry/db-extractor:<version>`        volumes:        `/path/to/config.yaml:/config.yaml`        # Other volumes as necessary, such as logs and state file        network_mode: host        command: `/config.yaml
  1. Share any environment variables with the container. Use the environment section under db-extractor.

Here's an example using an environment variable for the API key:

version: '3'services:  db-extractor:    image: eu.gcr.io/cognite-registry/db-extractor:<version>    volumes:      - /path/to/config.yaml:/config.yaml      # Other volumes as necessary, such as logs and state file    environment:      - <COGNITE_API_KEY=<cognite api key>    network_mode: host    command: /config.yaml

Load data incrementally

If the database table has a column containing an incremental field, you can set up the extractor to only process new or updated rows since the last extraction. An incremental field is a field that increases for new entries, such as time stamps for the latest update or insertions if the rows never change, or a numerical index.

To load data incrementally for a query:

  1. Include the state-store parameter in the extractor section of the configuration.
  2. Include the incremental-field and initial-start parameters in the query configuration.
  3. Update the SQL query with a WHERE statement using {incremental-field} and {start-at}. For example:
SELECT * FROM table WHERE {incremental-field} >= '{start-at}' ORDER BY {incremental-field} ASC

If the query returns many rows, consider paging the results with the LIMIT statement. Be aware that this demands a well-defined paging strategy. Consider this example:

SELECT * FROM data WHERE {incremental-field} >= '{start-at}' ORDER BY {incremental-field} ASC LIMIT 10
Multiple rows with identical values in the incremental field

If there are 10 rows or more with the same time, the start-at field will not be updated since the largest value of the incremental field didn't increase in the run. The next run will start with the same value, and the extractor is stuck in a loop. However, changing the query from >= to > can cause data loss if multiple rows have the same value for the incremental field.

Therefore, if there are duplicates in the incremental field and you're using result paging, make sure you set a sufficiently high limit.

There might be a slight boost in the extractor runtime if the database has an index on the incremental field. However, for a normal run of the DB extractor, network delays are often the main bottleneck for performance, so in many cases, the difference is minimal.

Schedule automatic runs

To schedule automatic runs on Windows, read more in Run extractors in Windows Task Scheduler.

To schedule automatic runs on Mac OS and Linux, use cron expressions. To enter a new cron job, run crontab -e to edit the cron table file with the default system text editor.

Here's the format for a job in the cron table:

<minute>  <hour>  <day of month (1-31)>  <month (1-12)>  <day of week (0-6 starting on Sunday)>  <command>

Most cron versions also allow human-readable entries for month and day-of-week, such as jan or thu. An asterisk matches all possible values. For example, add this line to the cron table to run the extractor every night at 01:00:

0 1 * * * docker run ...