Data science tools

This article introduces the tools you need to develop data science applications for the data in Cognite Data Fusion (CDF). We'll cover both built-in tools and tools made by other companies.

APIs and SDKs

The Cognite REST API is the main interface to CDF. We recommend that you use one of our SDKs to develop your solution and interact with the API.

The Python and JavaScript SDKs for CDF are primarily for backend and frontend work, respectively. The Python SDK is also used in the Jupyter Notebook and Streamlit-powered web experiences, and the JavaScript SDK is used with Node.js.

We also offer toolkits for setting up and handling CDF projects, and for creating extractors to import data into CDF.

Learn more about our SDKs and toolkits at https://developer.cognite.com/sdks/.

Jupyter Notebook

Jupyter Notebook is integrated with Cognite Data Fusion (CDF), so you can create interactive notebook documents that contain live code, equations, visualizations, and media. For details, see Jupyter Notebook.

Cognite data analysis tools

Cognite supports both out-of-the-box and custom tools for exploring, trending, and analyzing the industrial data stored in CDF. For details, see Analyze and explore data.

Cognite Charts

Charts is a built-in CDF tool that helps you explore, track, and analyze time series and engineering diagrams.

Industrial Canvas

Add assets, engineering diagrams, sensor data, charts, images, work orders (events), and 3D models to a canvas. Visualize and explore data from different data sets, summarize results, leave comments, and mention people to collaborate with your coworkers.

Cognite Data Source for Grafana

Grafana is an open-source web app to analyze and visualize data. It works on different devices and connects to CDF using the Grafana Data Source connector included with CDF. You need to set up your own Grafana instance to visualize your data and share insights across your organization.

Cognite Power BI Connector

Power BI is Microsoft's interactive data visualization software focused on business intelligence. It connects to CDF using the Power BI connector included with CDF. You need to set up your own Power BI instance to visualize your data and share insights across your organization.

Custom data analysis tools

These are examples of other data analysis tools we often use:

Name | Description
Streamlit | Streamlit is an open-source app framework for machine learning and data science teams. CDF includes a simple Streamlit interface, and we often use Streamlit for simple proof-of-concept projects. It's easy to set up and it integrates with the Python SDK.
Plotly Dash | Dash is Plotly's Python framework for creating interactive web applications. We typically use Dash for applications with a high level of user interaction, and when we need more flexibility than what Streamlit offers. Dash integrates with the Python SDK.
React | React is an open-source front-end JavaScript library for building user interfaces. We use React when we need even more flexibility than what we get with Plotly's Dash. You can connect to CDF with the JavaScript SDK. One common alternative is to use the Python SDK for back-end integration and the JavaScript SDK only for the front-end components.

Testing

Testing is essential to ensure that the algorithms, models, and data used in a data science application are accurate and produce the expected results. Without proper testing, the application could produce unreliable or misleading results.

As an example, one common approach to testing data science applications is to use test-driven development (TDD) in combination with the Given-When-Then framework.
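As an illustration of the Given-When-Then structure, here is a minimal sketch of a test written before (or alongside) the function it describes. The function remove_outliers and the sample readings are hypothetical and only serve to show the three phases.

```python
def remove_outliers(values, lower, upper):
    """Hypothetical function under test: keep values within [lower, upper]."""
    return [v for v in values if lower <= v <= upper]


def test_remove_outliers_drops_values_outside_range():
    # Given: sensor readings that contain two implausible values
    readings = [20.5, 21.0, -999.0, 22.3, 1e6]

    # When: the readings are cleaned with a plausible temperature range
    cleaned = remove_outliers(readings, lower=-50.0, upper=150.0)

    # Then: only the plausible readings remain, in their original order
    assert cleaned == [20.5, 21.0, 22.3]
```

In a test-driven workflow, you would typically write the test first, watch it fail, and then implement remove_outliers until the test passes.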

Regardless of the tools and frameworks you use, we strongly recommend that you make testing an integral part of your development process.

Unit testing

For unit testing, we recommend using a tool like pytest to establish strategies for basic testing, mocking, snapshot testing, and integration testing.

Basic testing strategies

These are some strategies you can use to test different components of a function:

  • Assertion: An assertion is a boolean expression that is true unless there is a bug in the function. For example, you can use an assert statement if you know what the output should be for a specific set of inputs.

  • Instance: If the type returned by the function matters more than the exact value, you can use the built-in isinstance function in the assert statement to check the type instead of the value.

  • Error: To check that your Python code raises the errors you expect, you can use a with statement together with the pytest.raises context manager. For example, you can use pytest.raises(ZeroDivisionError) to check that an operation involving division by zero raises the expected error.

  • Parameterization: Parameterization allows you to group tests that use the same code and avoid code duplication. This is particularly useful for functions that have to account for different corner cases.
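The four strategies above can be sketched with pytest as follows. This is a minimal example, assuming pytest is installed; the divide function and the test values are hypothetical.

```python
import pytest


def divide(a, b):
    """Hypothetical function under test."""
    return a / b


def test_divide_returns_expected_value():
    # Assertion: the output is known for this specific input
    assert divide(10, 4) == 2.5


def test_divide_returns_float():
    # Instance: only the returned type matters here, not the value
    assert isinstance(divide(10, 5), float)


def test_divide_by_zero_raises():
    # Error: the test passes only if the expected exception is raised
    with pytest.raises(ZeroDivisionError):
        divide(1, 0)


@pytest.mark.parametrize(
    "a, b, expected",
    [
        (0, 5, 0.0),     # zero numerator
        (-9, 3, -3.0),   # negative numerator
        (1, 8, 0.125),   # result smaller than one
    ],
)
def test_divide_corner_cases(a, b, expected):
    # Parameterization: one test body, several input/output combinations
    assert divide(a, b) == expected
```

When run with pytest, the parameterized test is reported as three separate test cases, which makes it easier to see which corner case fails.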

Mocking

Mocking means replacing a dependency of the code you are testing, such as a call to an external service, with a dummy version that exposes the same interface and returns controlled values. You can use the monkeypatch_cognite_client context manager from the cognite.client.testing module in the Python SDK to set up mocking for your Cognite Client.
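To illustrate the idea without requiring the SDK, the sketch below uses unittest.mock from the Python standard library as a stand-in for the Cognite client. The function count_time_series is a hypothetical helper, and the mocked return value is made up for the example. With the Python SDK installed, monkeypatch_cognite_client is intended to give you a similarly mocked client.

```python
from unittest.mock import MagicMock


def count_time_series(client, asset_id):
    """Hypothetical helper: count the time series linked to an asset."""
    return len(client.time_series.list(asset_ids=[asset_id], limit=None))


def test_count_time_series_uses_mocked_client():
    # The mock stands in for the real client, so no network call is made
    fake_client = MagicMock()
    fake_client.time_series.list.return_value = ["ts-1", "ts-2", "ts-3"]

    assert count_time_series(fake_client, asset_id=42) == 3

    # The mock also records how it was called
    fake_client.time_series.list.assert_called_once_with(
        asset_ids=[42], limit=None
    )
```

Because the helper receives the client as an argument, the test can pass in the mock directly instead of patching a global object.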

Other testing tools

The coverage library can generate reports to help you identify lines that aren't run by your tests. You can also use doctest to run the examples in your docstrings as tests, which is a lightweight way to document and check edge cases that may not be covered by your unit tests.
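As a minimal sketch of the doctest approach, the hypothetical function below documents its zero-denominator edge case directly in the docstring, and doctest checks that the documented output matches the actual output.

```python
import doctest


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero.

    >>> safe_ratio(6, 3)
    2.0
    >>> safe_ratio(1, 4)
    0.25
    >>> safe_ratio(1, 0) is None
    True
    """
    if denominator == 0:
        return None
    return numerator / denominator


if __name__ == "__main__":
    # Runs every docstring example in this module and reports mismatches
    doctest.testmod()
```

Assuming your tests are run with pytest, you can collect coverage data with `coverage run -m pytest` and list the lines that were never executed with `coverage report -m`.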

Snapshot testing

Snapshot testing compares the text or UI output of a test with the output saved from a previous run. Reviewing the differences can help you spot inconsistencies, for example to understand whether your application is failing due to changes in the code or changes in CDF.

To perform snapshot testing with pytest, you can, for example, use the pytest-regressions plugin.
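The pytest-regressions plugin handles storing and comparing snapshots for you. To show the underlying idea only, here is a hand-rolled sketch that uses the standard library; check_snapshot and summarize are hypothetical names and not part of any plugin.

```python
import json
from pathlib import Path


def summarize(values):
    """Hypothetical function whose output we want to keep stable."""
    return {"count": len(values), "min": min(values), "max": max(values)}


def check_snapshot(result, snapshot_path):
    """Compare result with the stored snapshot.

    On the first run there is no snapshot yet, so the result is saved
    and the check passes. On later runs the result must match the file.
    """
    path = Path(snapshot_path)
    serialized = json.dumps(result, indent=2, sort_keys=True)
    if not path.exists():
        path.write_text(serialized)
        return True
    return path.read_text() == serialized
```

A test would then assert check_snapshot(summarize(data), "summary.json"). When an intentional change alters the output, you delete the snapshot file and let the next run regenerate it.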

Integration and end-to-end testing

Integration testing checks that two or more components work together, and end-to-end testing checks the entire pipeline.

Integration testing could be as simple as pinging an endpoint and checking that it is reachable. However, real-world integration testing is often much more complex and involves testing the interdependencies of modules, data flow, error handling, and so on. Its primary purpose is to verify the functionality, performance, and reliability of the integrated units working together.
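As a small illustration, the sketch below tests two components together rather than in isolation. Both functions and the input format are hypothetical; a real integration test would usually exercise actual services or data sources as well.

```python
def parse_rows(lines):
    """Component 1: parse 'timestamp,value' text lines into tuples."""
    rows = []
    for line in lines:
        timestamp, value = line.strip().split(",")
        rows.append((int(timestamp), float(value)))
    return rows


def hourly_mean(rows):
    """Component 2: average the values per hour (timestamps in seconds)."""
    buckets = {}
    for timestamp, value in rows:
        buckets.setdefault(timestamp // 3600, []).append(value)
    return {hour: sum(vals) / len(vals) for hour, vals in buckets.items()}


def test_parse_and_aggregate_together():
    # The output of the parser is fed directly into the aggregator,
    # so the test fails if the two components disagree on the data format
    lines = ["0,1.0", "1800,3.0", "3600,10.0"]
    assert hourly_mean(parse_rows(lines)) == {0: 2.0, 1: 10.0}
```

Unit tests for parse_rows and hourly_mean could each pass on their own while this test fails, for example if one component changed its timestamp unit.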