Databricks is a collaborative, Jupyter-style notebook application that lets you analyze and transform data in Cognite Data Fusion (CDF) using distributed cloud computing, Spark, and the Cognite Spark Data Source.
The collaborative features make Databricks an excellent tool to use for experimentation and development during the initial phases of a project. Later, you can move the code from the Databricks notebooks to your production setup, or continue to use Databricks notebooks as part of your production workflow. For example, you can use Databricks to:
- Quickly set up complex transformations in an ingestion pipeline.
- Cooperatively develop transformations.
- Explore data.
- Contextualize data.
- Train machine learning models.
Databricks notebooks can also be part of your Azure Data Factory pipeline, where they are a good fit for the more complex calculations and processing-heavy components of data science workflows.
Enable Databricks and create a cluster in Azure
To use Databricks in Azure, you need to create an Azure Databricks workspace and an Apache Spark cluster:
To create an Azure Databricks workspace, follow these instructions.
To create an Apache Spark cluster, follow these instructions.
Install the Cognite libraries on your cluster
To work with the data in your CDF project, you need to install both the Cognite Python SDK and Cognite Spark Data Source on your cluster.
Install the Cognite Python SDK
Click the Clusters icon in the sidebar.
Select your cluster.
Click the Libraries tab.
Click Install New.
Type cognite-sdk in the Package field, and click Install.
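Alternatively, on recent Databricks Runtime versions you can install the SDK as a notebook-scoped library with the %pip magic in the first cell of a Python notebook (a sketch; this installs the package for the current notebook session only):

```
%pip install cognite-sdk
```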
Install the Cognite Spark Data Source
Click the Clusters icon in the sidebar.
Select your cluster.
Click the Libraries tab.
Click Install New.
Click Search Packages.
Choose to search in Maven Central, and search for com.cognite.spark.datasource.
Select the row where the Scala version suffix in the Artifact Id column (*cdf-spark-datasource_X.xx*) matches the version of Scala running on your cluster. Then click Select.
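For reference, the library you install corresponds to a Maven coordinate of the form groupId:artifactId:version. A hypothetical example for a cluster running Scala 2.12 (the version placeholder is illustrative; check Maven Central for the current release):

```
com.cognite.spark.datasource:cdf-spark-datasource_2.12:<version>
```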
Create a notebook and connect it to your cluster
Click the Azure Databricks icon in the sidebar.
Under Common Tasks, click New Notebook.
Give your notebook a Name, choose Python as the language and choose the cluster you just created. Then click Create.
In the notebook menu bar, find the dropdown with the name of your cluster. Click the small downward arrow, and then click Start Cluster.
Confirm that you want to start the cluster.
The cluster takes a few minutes to start. A solid green light shows when it is running.
Set up and use a secret scope to store your API key
We recommend that you use the Databricks CLI and secrets to store the API key for your CDF project. Do not store API keys in clear text in your notebooks.
To set up a scope and a secret:
To install the Databricks CLI, follow these instructions.
To generate a personal access token, follow these instructions.
Run databricks configure --token, and specify your Databricks host and your personal access token to set up authentication.
Run databricks secrets create-scope --scope <scope-name> to create your secret scope.
Run databricks secrets put --scope <scope-name> --key <key-name> to add the API key as a secret to the scope.
In the editor that starts, paste in the API key for your CDF project. Then save and exit the editor.
To use the secret in your notebook, add this code snippet into the first cell in the notebook:
ts = spark.read.format("cognite.spark.v1") \
  .option("type", "timeseries") \
  .option("apiKey", dbutils.secrets.get("<scope-name>", "<key-name>")) \
  .load()
display(ts)

Replace <scope-name> and <key-name> with your own values, and then press Ctrl + Enter to run the cell.
The result is a table with all the time series you have ingested into CDF.
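The same read pattern works for other CDF resource types (for example assets or events) by changing the type option. As a sketch, a small helper that builds the option map (cdf_read_options is a hypothetical helper, not part of the Cognite API; the Spark usage in the comment assumes a running cluster):

```python
def cdf_read_options(resource_type, api_key):
    """Build the options passed to spark.read.format("cognite.spark.v1").

    resource_type: e.g. "timeseries", "assets", or "events".
    api_key: the secret fetched with dbutils.secrets.get(...).
    """
    return {"type": resource_type, "apiKey": api_key}

# Hypothetical usage in a notebook cell (requires a Spark session):
# assets = spark.read.format("cognite.spark.v1") \
#     .options(**cdf_read_options("assets", api_key)) \
#     .load()
```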
Share a notebook with other users
To collaborate with your colleagues, you can give other users permissions to use your notebook:
Open the notebook you want to share, and in the notebook menu bar, click Permissions.
Select the users and groups you want to share the notebook with, and specify the permissions you want to grant them. Then click Add and Done.
Share the link to your notebook.