# Entity matching beta

Use entity matching to contextualize your data with machine learning (ML) and rules engines, and then let domain experts validate and fine-tune the results.

Different sources of industrial data can use different naming standards when they refer to the same entity. With CDF, you can match all entities like time series, files, and sequences, as well as 3d from different source systems to assets.

In this article:

# About entity matching

Assets in CDF connect related data from different sources, such as time series, files, sequences, and events. The data often use different naming conventions, even when they refer to the same entity. Entity matching applies artificial intelligence techniques to automatically match the different resources according to name, description, etc.

TIP

See the entity matching API documentation for more information about how to work with relationships.

# Example data

This tutorial uses a small example data set to demonstrate how to do entity matching. The data is presented in Python format, and you can convert it to other formats that fits your programming language.

sources = [
    {"id":0, "name" : "KKL_21AA1019CA.PV", "description": "correct"},
    {"id":1, "name" : "KKL_13FV1234BU.VW", "description": "ok"}
]
targets = [
    {"id":0, "name" : "21AA1019CA", "description": "correct"},
    {"id":1, "name" : "21AA1019CA", "description": "wrong"},
    {"id":2, "name" : "13FV1234BU"},
    {"id":3, "name" : "13FV1234BU", "description": "ok"}
]
true_matches =  [{"sourceId": 0,"targetId": 0}]
1
2
3
4
5
6
7
8
9
10
11

# Fit a supervised ML model and predict for the same data

The supervised model calculates one or more similarity measures between match-to and match-from items. Then it uses these calculated similarity measures as features and fits a classification model using the labeled data.

Note that before calculating the similarity measures and training a model, a set of candidate matches is selected. A pair of match-to and match-from items is considered a candidate if they have at least one token in common. Only the candidate match-from, match-to combinations are used in the training. This is done to reduce computing time - calculating similarity measures for all possible combinations can be extremely heavy (10.000 time series and 30.000 assets -> 300.000.000 combinations).

# When to use a supervised ML model?

A supervised ML model is applicable when you have a lot of labeled data items. The more labeled data you have, the better results you can achieve. We don't recommend applying a supervised model if you have fewer than 500 labeled data items.

# Create a model

POST /api/v1/projects/publicdata/entitymatching
Host: api.cognitedata.com
api-key: <key>
Content-Type: application/json
{
   "sources":sources,
   "targets":targets,
   "trueMatches":true_matches,
}
1
2
3
4
5
6
7
8
9
10

# Predict

When predict is called without any data, predictions are on the training data.

num_matches determines the number of matches to return for each matchFrom item. The default is 1.

POST /api/v1/projects/publicdata/entitymatching/predict
Host: api.cognitedata.com
api-key: <key>
Content-Type: application/json
{
  "id": modelId,
  "numMatches": 3,
  "scoreThreshold": 0.7
}
1
2
3
4
5
6
7
8
9
10

# Get matching results

Replace job_id in the following get body with the jobId returned by the response above.

GET /api/v1/projects/publicdata/entitymatching/jobs/<job_id>
Host: api.cognitedata.com
api-key: <key>
Content-Type: application/json
1
2
3
4

Example output:

[
  {
    "matchFrom": {
      "description": "correct",
      "id": 0,
      "name": "KKL_21AA1019CA.PV"
    },
    "matches": [
      {
        "matchTo": { "description": "correct", "id": 0, "name": "21AA1019CA" },
        "score": 0.35,
        "target": { "description": "correct", "id": 0, "name": "21AA1019CA" }
      },
      {
        "matchTo": { "description": "wrong", "id": 1, "name": "21AA1019CA" },
        "score": 0.35,
        "target": { "description": "wrong", "id": 1, "name": "21AA1019CA" }
      }
    ],
    "source": { "description": "correct", "id": 0, "name": "KKL_21AA1019CA.PV" }
  },
  {
    "matchFrom": { "description": "ok", "id": 1, "name": "KKL_13FV1234BU.VW" },
    "matches": [
      {
        "matchTo": { "id": 2, "name": "13FV1234BU" },
        "score": 0.35,
        "target": { "id": 2, "name": "13FV1234BU" }
      },
      {
        "matchTo": { "description": "ok", "id": 3, "name": "13FV1234BU" },
        "score": 0.35,
        "target": { "description": "ok", "id": 3, "name": "13FV1234BU" }
      }
    ],
    "source": { "description": "ok", "id": 1, "name": "KKL_13FV1234BU.VW" }
  }
]
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38

Note: For both matchFrom items, the two matches have an equal score, and the model is not able to distinguish between the correct and incorrect match. Also, the scores are relatively low. When the data set is small, unsupervised learning makes more sense.

# Refit

Refit lets you retrain a model, using the same parameters, with additional labels/true-matches. The new true_matches (1,3) are added to the true_matches list from the original model.

To fit a model using only the (1,3) label. A new model must be trained using fit.

true_matches=[{"sourceId": 0,"targetId": 3}]
1
POST /api/v1/projects/publicdata/entitymatching/refit
Host: api.cognitedata.com
api-key: <key>
Content-Type: application/json
{
   "id": modelId
   "trueMatches":true_matches,
}
1
2
3
4
5
6
7
8
9
POST /api/v1/projects/publicdata/entitymatching/predict
Host: api.cognitedata.com
api-key: <key>
Content-Type: application/json
{
  "id": modelId,
  "numMatches": 3,
  "scoreThreshold": 0.7
}
1
2
3
4
5
6
7
8
9
10
GET /api/v1/projects/publicdata/entitymatching/jobs/<job_id>
Host: api.cognitedata.com
api-key: <key>
Content-Type: application/json
1
2
3
4

Note: In this example, the results are the same. The new true-match follows the exact same pattern as the original.

# Fit an unsupervised ML model

If there are no true_matches included in the fit call, an unsupervised model is applied.

As with supervised models, candidates are selected and similarity measures between the candidates are calculated. Instead of training a classification model, the average of the average of the similarity measures are calculated and returned as the score.

# When to use an unsupervised ML model?

Use an unsupervised ML model when there are no or few true matches (labeled data).

# Create a model

POST /api/v1/projects/publicdata/entitymatching/
Host: api.cognitedata.com
api-key: <key>
Content-Type: application/json
{
   "sources":sources,
   "targets":targets,
   "name":"simple_model_1",
   "description":"Simple model 1",
   "featureType":"simple",
   "classifier":"randomforest"
}
1
2
3
4
5
6
7
8
9
10
11
12
13

Example response:

{
  "classifier": "randomforest",
  "createdTime": 1606151476539,
  "description": "Simple model 1",
  "externalId": "None",
  "featureType": "simple",
  "id": 7895111848480381,
  "ignoreMissingFields": True,
  "matchFields": [
    {
      "source": "name",
      "target": "name"
    },
    {
      "source": "description",
      "target": "description"
    }
  ],
  "name": "simple_model_1",
  "startTime": "None",
  "status": "Queued",
  "statusTime": 1606151476539
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23

# Create a matching job

In the following POST body, replace id with the id returned by the response above. You can set maximum matching items and threshold to filter valid matching.

POST /api/v1/projects/publicdata/entitymatching/predict
Host: api.cognitedata.com
api-key: <key>
Content-Type: application/json
{
  "id": 7895111848480381,
  "numMatches": 3,
  "scoreThreshold": 0.7
}
1
2
3
4
5
6
7
8
9
10

Example response:

{
   "createdTime":1606151691970,
   "jobId":6147120367590349,
   "startTime":"None",
   "status":"Queued",
   "statusTime":1606151691970
}
1
2
3
4
5
6
7

# Get matching results

In the following GET body, replace job_id with the jobId return by the response above.

GET /api/v1/projects/publicdata/entitymatching/jobs/<job_id>
Host: api.cognitedata.com
api-key: <key>
Content-Type: application/json
1
2
3
4

Example response:

{'createdTime': 1606216545060,
 'items': [{'matchFrom': {'description': 'correct',
    'id': 0,
    'name': 'KKL_21AA1019CA.PV'},
   'matches': [{'matchTo': {'description': 'correct',
      'id': 0,
      'name': '21AA1019CA'},
     'score': 1.0,
     'target': {'description': 'correct', 'id': 0, 'name': '21AA1019CA'}},
    {'matchTo': {'description': 'wrong', 'id': 1, 'name': '21AA1019CA'},
     'score': 1.0,
     'target': {'description': 'wrong', 'id': 1, 'name': '21AA1019CA'}}],
   'source': {'description': 'correct', 'id': 0, 'name': 'KKL_21AA1019CA.PV'}},
  {'matchFrom': {'description': 'ok', 'id': 1, 'name': 'KKL_13FV1234BU.VW'},
   'matches': [{'matchTo': {'id': 2, 'name': '13FV1234BU'},
     'score': 1.0,
     'target': {'id': 2, 'name': '13FV1234BU'}},
    {'matchTo': {'description': 'ok', 'id': 3, 'name': '13FV1234BU'},
     'score': 1.0,
     'target': {'description': 'ok', 'id': 3, 'name': '13FV1234BU'}}],
   'source': {'description': 'ok', 'id': 1, 'name': 'KKL_13FV1234BU.VW'}}],
 'jobId': 6147120367590349,
 'startTime': 1606216546372,
 'status': 'Completed',
 'statusTime': 1606216546552}
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26

# Add additional key match_fields

By default, only name in sources and name in targets are used to calculate similarity measures. The match_fields parameter lets you specify all combinations of fields in sources and targets that should be used to calculate features.

In this example, it looks like comparing the description field for both targets and sources could improve the model.

Note: Calculating similarity measures can be time consuming. We recommend that you don't use match_fields combinations that adds little or no information to the model.

POST /api/v1/projects/publicdata/entitymatching/
Host: api.cognitedata.com
api-key: <key>
Content-Type: application/json
{
   "sources":sources,
   "targets":targets,
   "name":"simple_model_1",
   "description":"Simple model 1",
   "featureType":"simple",
   "matchFields":[
      {
         "source":"name",
         "target":"name"
      },
      {
         "source":"description",
         "target":"description"
      }
   ],
   "classifier":"randomforest",
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23

The request above results in an error because one of the items in match_to is missing description. If complete_missing is set to True, missing values are replaced by empty strings.

# Add ignore_missing_fields

To avoid errors if items in sources or targets have missing values, add ignore_missing_fields=True.

POST /api/v1/projects/publicdata/entitymatching/
Host: api.cognitedata.com
api-key: <key>
Content-Type: application/json
{
   "sources":sources,
   "targets":targets,
   "name":"simple_model_1",
   "description":"Simple model 1",
   "featureType":"simple",
   "matchFields":[
      {
         "source":"name",
         "target":"name"
      },
      {
         "source":"description",
         "target":"description"
      }
   ],
   "classifier":"randomforest",
   "ignoreMissingFields":true
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24

Note: The model gives the correct matches a score of 1, and the incorrect matches a score 0.5.

# Using feature types

By default, feature_type is set to simple. The different feature types are created to improve the model's accuracy for different types of input data. Which feature type works best for your model varies based on what your data look like. The options for feature_type are:"simple", "bigram", "frequency-weighted-bigram", "bigram-extra-tokenizers", "bigram-combo". This section illustrates the strengths and weaknesses of the different feature types.

# When to use feature_type=simple?

Simple is the default feature type and is preferred when one string is a substring of the other and the other characters do not appear in the other string. For example, "BCDEF" is a substring of "ABCDEFG" and the other characters "AG" do not appear in the string "BCDEF". This feature type is also the fastest option.

# Limitations of the simple feature type

The data below is the same as in the previous examples, except two new items in the targets. Id 10 and 13 are similar to 0 and 3, respectively. The first letter combination ("AA" and "FV") is swapped with the prefix for the match_from items (KKL).

This leads to difficulties if we use the simple feature type.

sources = [
    {"id":0, "name" : "KKL_21AA1019CA.PV", "description": "correct"},
    {"id":1, "name" : "KKL_13FV1234BU.VW", "description": "ok"}
]
targets = [
    {"id":0, "name" : "21AA1019CA", "description": "correct"},
    {"id":10, "name" : "21KKL1019CA", "description": "correct"},
    {"id":1, "name" : "21AA1119CA", "description": "wrong"},
    {"id":2, "name" : "13FV1234BU"},
    {"id":3, "name" : "13FV1334BU", "description": "ok"},
    {"id":13, "name" : "13KKL1234BU", "description": "ok"}
]
1
2
3
4
5
6
7
8
9
10
11
12
POST /api/v1/projects/publicdata/entitymatching/
Host: api.cognitedata.com
api-key: <key>
Content-Type: application/json
{
   "sources":sources,
   "targets":targets,
   "name":"simple_model_1",
   "description":"Simple model 1",
   "featureType":"simple",
   "matchFields":[
      {
         "source":"name",
         "target":"name"
      }
   ],
   "classifier":"randomforest",
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19

Example results:

{'createdTime': 1606153318209,
 'items': [{'matchFrom': {'description': 'correct',
    'id': 0,
    'name': 'KKL_21AA1019CA.PV'},
   'matches': [{'matchTo': {'description': 'correct',
      'id': 0,
      'name': '21AA1019CA'},
     'score': 0.8944271909999159,
     'target': {'description': 'correct', 'id': 0, 'name': '21AA1019CA'}},
    {'matchTo': {'description': 'correct', 'id': 10, 'name': '21KKL1019CA'},
     'score': 0.8944271909999159,
     'target': {'description': 'correct', 'id': 10, 'name': '21KKL1019CA'}}],
   'source': {'description': 'correct', 'id': 0, 'name': 'KKL_21AA1019CA.PV'}},
  {'matchFrom': {'description': 'ok', 'id': 1, 'name': 'KKL_13FV1234BU.VW'},
   'matches': [{'matchTo': {'id': 2, 'name': '13FV1234BU'},
     'score': 0.8944271909999159,
     'target': {'id': 2, 'name': '13FV1234BU'}},
    {'matchTo': {'description': 'ok', 'id': 13, 'name': '13KKL1234BU'},
     'score': 0.8944271909999159,
     'target': {'description': 'ok', 'id': 13, 'name': '13KKL1234BU'}}],
   'source': {'description': 'ok', 'id': 1, 'name': 'KKL_13FV1234BU.VW'}}],
 'jobId': 5049964554908808,
 'startTime': 1606153318586,
 'status': 'Completed',
 'statusTime': 1606153318777}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25

Note: The new target items have identical scores to the correct matches. This is because with feature_type="simple", only the number of matching tokens are considered.

Target items with id 0 21, AA, 1019, and CA match a token in sources items with id 0. Target items with id 10 21, KKL, 1019, and CA matches a token in source items with id 0. Thus, the same number of tokens matches.

The model does not consider that the target items with id 0 have more and longer contiguous sequences of tokens.

# When to use feature_type=bigram?

The "bigram" feature type does account for sequences of tokens. In addition to counting the number of matching tokens, it looks at the number of matching bigrams. That is, the number of matching tokens when two and two adjacent tokens are combined, BCDEF and ABBCDE1F, for example.

sources = [
    {"id":0, "name" : "KKL_21AA1019CA.PV", "description": "correct"},
    {"id":1, "name" : "KKL_13FV1234BU.VW", "description": "ok"}
]
targets = [
    {"id":0, "name" : "21AA1019CA", "description": "correct"},
    {"id":10, "name" : "21KKL1019CA", "description": "correct"},
    {"id":1, "name" : "21AA1119CA", "description": "wrong"},
    {"id":2, "name" : "13FV1234BU"},
    {"id":3, "name" : "13FV1334BU", "description": "ok"},
    {"id":13, "name" : "13KKL1234BU", "description": "ok"}
]
1
2
3
4
5
6
7
8
9
10
11
12
POST /api/v1/projects/publicdata/entitymatching/
Host: api.cognitedata.com
api-key: <key>
Content-Type: application/json
{
   "sources":sources,
   "targets":targets,
   "name":"bigram_model_1",
   "description":"bigram model 1",
   "featureType":"bigram",
   "matchFields":[
      {
         "source":"name",
         "target":"name"
      }
   ],
   "classifier":"randomforest",
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19

Example results:

{'createdTime': 1606153733720,
 'items': [{'matchFrom': {'description': 'correct',
    'id': 0,
    'name': 'KKL_21AA1019CA.PV'},
   'matches': [{'matchTo': {'description': 'correct',
      'id': 0,
      'name': '21AA1019CA'},
     'score': 0.9149207688467006,
     'target': {'description': 'correct', 'id': 0, 'name': '21AA1019CA'}}],
   'source': {'description': 'correct', 'id': 0, 'name': 'KKL_21AA1019CA.PV'}},
  {'matchFrom': {'description': 'ok', 'id': 1, 'name': 'KKL_13FV1234BU.VW'},
   'matches': [{'matchTo': {'id': 2, 'name': '13FV1234BU'},
     'score': 0.9149207688467006,
     'target': {'id': 2, 'name': '13FV1234BU'}}],
   'source': {'description': 'ok', 'id': 1, 'name': 'KKL_13FV1234BU.VW'}}],
 'jobId': 6677803085358639,
 'startTime': 1606153733995,
 'status': 'Completed',
 'statusTime': 1606153734183}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19

# When to use feature_type=Frequency-Weighted-Bigram?

The "Frequency-Weighted-Bigram" feature type calculates a similarity score based on the sequence of the terms. It also gives higher weights to less commonly occurring tokens. This can be helpful when the "simple" feature type does not return useful results.

sources = [
    {"id":0, "name" : "KKL_21AA1019CA.PV", "description": "correct"},
    {"id":1, "name" : "KKL_13FV1234BU.VW", "description": "ok"}
]
targets = [
    {"id":0, "name" : "21AA1019CA", "description": "correct"},
    {"id":10, "name" : "21KKL1019CA", "description": "correct"},
    {"id":1, "name" : "21AA1119CA", "description": "wrong"},
    {"id":2, "name" : "13FV1234BU"},
    {"id":3, "name" : "13FV1334BU", "description": "ok"},
    {"id":13, "name" : "13KKL1234BU", "description": "ok"}
]
1
2
3
4
5
6
7
8
9
10
11
12
POST /api/v1/projects/publicdata/entitymatching/
Host: api.cognitedata.com
api-key: <key>
Content-Type: application/json
{
   "sources":sources,
   "targets":targets,
   "name":"fbw_model_1",
   "description":"fwb model 1",
   "featureType":"Frequency-Weighted-Bigram",
   "matchFields":[
      {
         "source":"name",
         "target":"name"
      }
   ],
   "classifier":"randomforest",
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19

Example results:

{'createdTime': 1606153836538,
 'items': [{'matchFrom': {'description': 'correct',
    'id': 0,
    'name': 'KKL_21AA1019CA.PV'},
   'matches': [{'matchTo': {'description': 'correct',
      'id': 0,
      'name': '21AA1019CA'},
     'score': 0.9149207688467006,
     'target': {'description': 'correct', 'id': 0, 'name': '21AA1019CA'}}],
   'source': {'description': 'correct', 'id': 0, 'name': 'KKL_21AA1019CA.PV'}},
  {'matchFrom': {'description': 'ok', 'id': 1, 'name': 'KKL_13FV1234BU.VW'},
   'matches': [{'matchTo': {'id': 2, 'name': '13FV1234BU'},
     'score': 0.9149207688467006,
     'target': {'id': 2, 'name': '13FV1234BU'}}],
   'source': {'description': 'ok', 'id': 1, 'name': 'KKL_13FV1234BU.VW'}}],
 'jobId': 5204435061404568,
 'startTime': 1606153836790,
 'status': 'Completed',
 'statusTime': 1606153837071}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19

# When to use feature_type=Bigram-Extra-Tokenizers?

The "Bigram-Extra-Tokenizers" feature type is similar to bigram, but can learn that leading zeros and spaces should be ignored in matching. For example ABCDE and 000ABBCDE1F.

sources = [
    {"id":0, "name" : "KKL_21AA1019CA.PV", "description": "correct"},
    {"id":1, "name" : "KKL_13FV1234BU.VW", "description": "ok"}
]
targets = [
    {"id":0, "name" : "000021AA1019CA", "description": "correct"},
    {"id":10, "name" : "21KKL1019CA", "description": "correct"},
    {"id":1, "name" : "21AA1119CA", "description": "wrong"},
    {"id":2, "name" : "000013FV1234BU"},
    {"id":3, "name" : "13FV1334BU", "description": "ok"},
    {"id":13, "name" : "13KKL1234BU", "description": "ok"}
]
1
2
3
4
5
6
7
8
9
10
11
12
POST /api/v1/projects/publicdata/entitymatching/
Host: api.cognitedata.com
api-key: <key>
Content-Type: application/json
{
   "sources":sources,
   "targets":targets,
   "name":"BET_model_1",
   "description":"BET model 1",
   "featureType":"Bigram-Extra-Tokenizers",
   "matchFields":[
      {
         "source":"name",
         "target":"name"
      }
   ],
   "classifier":"randomforest",
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19

Example results:

{'createdTime': 1606154023578,
 'items': [{'matchFrom': {'description': 'correct',
    'id': 0,
    'name': 'KKL_21AA1019CA.PV'},
   'matches': [{'matchTo': {'description': 'correct',
      'id': 0,
      'name': '000021AA1019CA'},
     'score': 0.8477338488399959,
     'target': {'description': 'correct', 'id': 0, 'name': '000021AA1019CA'}}],
   'source': {'description': 'correct', 'id': 0, 'name': 'KKL_21AA1019CA.PV'}},
  {'matchFrom': {'description': 'ok', 'id': 1, 'name': 'KKL_13FV1234BU.VW'},
   'matches': [{'matchTo': {'id': 2, 'name': '000013FV1234BU'},
     'score': 0.8477338488399959,
     'target': {'id': 2, 'name': '000013FV1234BU'}}],
   'source': {'description': 'ok', 'id': 1, 'name': 'KKL_13FV1234BU.VW'}}],
 'jobId': 6675842531673047,
 'startTime': 1606154023915,
 'status': 'Completed',
 'statusTime': 1606154024190}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19

# When to use feature_type=Bigram-Combo?

The "Bigram-Combo" feature type calculates all of the above options, relying on the model to determine the appropriate features to use. This is the slowest option and mostly suitable for supervised models.

POST /api/v1/projects/publicdata/entitymatching/
Host: api.cognitedata.com
api-key: <key>
Content-Type: application/json
{
   "sources":sources,
   "targets":targets,
   "name":"BC_model_1",
   "description":"BC model 1",
   "featureType":"Bigram-Combo",
   "matchFields":[
      {
         "source":"name",
         "target":"name"
      }
   ],
   "classifier":"randomforest",
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19

Example results:

{'createdTime': 1606154392343,
 'items': [{'matchFrom': {'description': 'correct',
    'id': 0,
    'name': 'KKL_21AA1019CA.PV'},
   'matches': [{'matchTo': {'description': 'correct',
      'id': 0,
      'name': '21AA1019CA'},
     'score': 0.8195704160778803,
     'target': {'description': 'correct', 'id': 0, 'name': '21AA1019CA'}}],
   'source': {'description': 'correct', 'id': 0, 'name': 'KKL_21AA1019CA.PV'}},
  {'matchFrom': {'description': 'ok', 'id': 1, 'name': 'KKL_13FV1234BU.VW'},
   'matches': [{'matchTo': {'id': 2, 'name': '13FV1234BU'},
     'score': 0.8195704160778803,
     'target': {'id': 2, 'name': '13FV1234BU'}}],
   'source': {'description': 'ok', 'id': 1, 'name': 'KKL_13FV1234BU.VW'}}],
 'jobId': 4397635623636246,
 'startTime': 1606154392670,
 'status': 'Completed',
 'statusTime': 1606154392848}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19

# About the score

A score above 0.8 indicates that the source and the target are matched with high probability.

Note that a score below 0.8 and above 0.5 does not indicate that the source and target are matched with more than 50% probability. The example below illustrates that even if the source and the target don't match at all, they can still receive a score over 0.6:

sources = [{"id":0, "name" : "J04_ONSTREAM_HOUR_AVG", "description": "correct"}]
targets = [{"id":0, "name" : "87-JB-004-J04", "description": "correct"}]
1
2
POST /api/v1/projects/publicdata/entitymatching/
Host: api.cognitedata.com
api-key: <key>
Content-Type: application/json
{
   "sources":sources,
   "targets":targets,
   "name":"bigram_model_1",
   "description":"bigram model 1",
   "featureType":"bigram",
   "matchFields":[
      {
         "source":"name",
         "target":"name"
      }
   ],
   "classifier":"randomforest",
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
POST /api/v1/projects/publicdata/entitymatching/predict
Host: api.cognitedata.com
api-key: <key>
Content-Type: application/json
{
  "id": 6147120367590349,
  "numMatches": 3,
  "scoreThreshold": 0.5
}
1
2
3
4
5
6
7
8
9
10
GET /api/v1/projects/publicdata/entitymatching/jobs/<job_id>
Host: api.cognitedata.com
api-key: <key>
Content-Type: application/json
1
2
3
4

Example matching results

{'createdTime': 1606154525247,
 'items': [{'matchFrom': {'description': 'correct',
    'id': 0,
    'name': 'J04_ONSTREAM_HOUR_AVG'},
   'matches': [{'matchTo': {'description': 'correct',
      'id': 0,
      'name': '87-JB-004-J04'},
     'score': 0.6049029006116509,
     'target': {'description': 'correct', 'id': 0, 'name': '87-JB-004-J04'}}],
   'source': {'description': 'correct',
    'id': 0,
    'name': 'J04_ONSTREAM_HOUR_AVG'}}],
 'jobId': 1019931106940009,
 'startTime': 1606154525580,
 'status': 'Completed',
 'statusTime': 1606154525758}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16

# Get model info

If you have a model_id and want to know which parameters you used when training the model, use the retrieve method.

GET /api/v1/projects/publicdata/entitymatching/<model_id>
Host: api.cognitedata.com
api-key: <key>
Content-Type: application/json
1
2
3
4
Last Updated: 11/30/2020, 8:58:21 AM