API Reference

Documentation

Similarity API provides scalable, deterministic fuzzy string matching via a REST API — from any language or platform that can make an HTTP request.


Quick start

Get your first result in three steps.

1. Get a token

POST your email to /token-gen. Your token comes back immediately; no account setup is required.

2. Call /reconcile or /dedupe

POST a JSON body with your data, and send your token in the Authorization header as a Bearer token. Results are synchronous.

3. Read your results

The response_data field contains your matches in the format you chose.

Authentication

All endpoints require a Bearer token in the Authorization header.

POST https://api.similarity-api.com/token-gen

Returns a JWT for your email address. No password or account required to start.

Request body

Field | Type | Description
email (required) | string | Your real email address. Placeholder values are rejected.

Response

{ "token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9..." }

Token behaviour

  • Tokens do not expire on a timer — reuse the same token across sessions.
  • Free-tier tokens remain valid until your trial row limit is exhausted.
  • If your limit is already exhausted when you call this endpoint, you receive a 403 instead of a token.

Using the token

Authorization: Bearer <your-token>
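
As a rough sketch of requesting a token and building the header, assuming Node.js 18+ (for the built-in fetch) and that /token-gen accepts a JSON body with the email field shown above. The helper names are illustrative and not part of the API.

```javascript
// Illustrative helpers; the function names are not part of the API.
const BASE_URL = 'https://api.similarity-api.com';

// Build the POST /token-gen request (URL plus fetch options) for an email.
// Assumption: the endpoint accepts a JSON body of the form { "email": ... }.
function buildTokenRequest(email) {
  return {
    url: `${BASE_URL}/token-gen`,
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ email }),
    },
  };
}

// Build the Authorization header expected by the other endpoints.
function authHeaders(token) {
  return { Authorization: `Bearer ${token}` };
}

// Request a token. A 403 means the free-tier limit is already exhausted.
async function getToken(email) {
  const { url, options } = buildTokenRequest(email);
  const res = await fetch(url, options);
  if (res.status === 403) {
    throw new Error('Free-tier limit exhausted; no token issued');
  }
  if (!res.ok) {
    throw new Error(`token-gen failed with HTTP ${res.status}`);
  }
  const { token } = await res.json();
  return token;
}
```

Because tokens do not expire on a timer, you would typically call getToken once, store the result (for example in an environment variable), and reuse it with authHeaders on later requests.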

Choosing your path

Two endpoints, two directions of matching:

Aspect | Reconcile | Dedupe
Operation | Match A → B (two lists) | Find duplicates within A (one list)
Input | data_a + data_b | data
Typical use | Link new records to a reference database; fuzzy join two datasets | Clean a list before import; find near-duplicate CRM records
Delivery | Synchronous (JSON body) | Synchronous (JSON body) or async file upload

For the dedupe endpoint, there are two delivery options. Use the standard endpoint for most work — POST your array, get results back immediately. Switch to file upload when your payload is too large for a single HTTP request or you're processing data from an existing file pipeline.
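
One way to make that choice in code is to measure the JSON body before sending it. The sketch below assumes the 10 MB figure from the file upload section is a sensible cutoff and reads it as 10 * 1024 * 1024 bytes; the documentation does not state an exact request size limit, so treat the constant as a placeholder.

```javascript
// Assumed cutoff: the file upload section recommends uploads above 10 MB.
// The exact HTTP body limit is not documented, so this value is a guess.
const MAX_SYNC_BYTES = 10 * 1024 * 1024;

// Size in bytes of the JSON body that would be sent to POST /dedupe.
function jsonBodyBytes(data, config = {}) {
  return Buffer.byteLength(JSON.stringify({ data, config }), 'utf8');
}

// Pick 'sync' (POST /dedupe) or 'file_upload' (POST /dedupe/jobs).
function chooseDedupeDelivery(data, config = {}) {
  return jsonBodyBytes(data, config) > MAX_SYNC_BYTES ? 'file_upload' : 'sync';
}
```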


POST /reconcile

Matches each string in data_a against the strings in data_b and returns the best matches above your threshold. Ideal for linking new records to a reference list, fuzzy-joining two datasets, or checking whether items from one source exist in another.

POST https://api.similarity-api.com/reconcile

Request body

{
  "data_a": ["string1", "string2", ...],
  "data_b": ["ref1", "ref2", ...],
  "config": { ... }
}

Field | Type | Default | Description
data_a (required) | string[] | - | The list to reconcile, e.g., incoming records or a spreadsheet column.
data_b (required) | string[] | - | The reference list to match against, e.g., your CRM, master database, or canonical list.
config | object | {} | Optional. All tuning parameters go here.

Config parameters

Parameter | Type | Default | Description
similarity_threshold | float | 0.85 | Minimum score (0–1) to consider a match. Lower = more matches, higher = stricter.
top_n | int | 10 | Maximum number of data_b matches to return per data_a item.
output_format | string | "index_pairs" | "flat_table", "string_pairs", or "index_pairs". See Output formats.
use_case | string or null | null | Set to "company_names" to strip business entity suffixes (Inc., Corp., Ltd., LLC, GmbH, S.A., PLC, "The", etc.) before matching. Works best combined with to_lowercase: true.
to_lowercase | boolean | false | Lowercase all strings before comparison. Recommended for most use cases.
remove_punctuation | boolean | false | Strip punctuation before comparison.
use_token_sort | boolean | false | Sort tokens alphabetically before comparison. Useful when word order varies ("Smith John" vs "John Smith").
Tip: Quota is counted as len(data_a) + len(data_b) rows per request.
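
To estimate the cost of a request before sending it, the rule can be written directly. This is a small sketch; the function names are illustrative, and remainingRows is assumed to come from wherever you track your remaining quota (for example the X-Trial-Remaining response header).

```javascript
// Rows charged for one /reconcile request: len(data_a) + len(data_b).
function reconcileRowCost(dataA, dataB) {
  return dataA.length + dataB.length;
}

// True when the request fits in the remaining quota.
function fitsInQuota(dataA, dataB, remainingRows) {
  return reconcileRowCost(dataA, dataB) <= remainingRows;
}
```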

Output formats

flat_table (recommended) — One entry per data_a item, including non-matches. Best for data analysis and spreadsheet-style review.

{
  "status": "success",
  "response_data": [
    {
      "index_a": 0,
      "text_a": "Microsoft",
      "matched": true,
      "index_b": 1,
      "text_b": "Microsoft Corp",
      "score": 0.9412,
      "threshold": 0.75
    },
    {
      "index_a": 1,
      "text_a": "Apple Inc",
      "matched": false,
      "index_b": null,
      "text_b": null,
      "score": null,
      "threshold": 0.75
    }
  ]
}
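
Since flat_table includes non-matches, a common first step is to separate the two groups. A minimal sketch, using the rows from the response above as sample data; splitFlatTable is an illustrative name, not part of the API.

```javascript
// Split flat_table rows into matched and unmatched entries.
function splitFlatTable(responseData) {
  const matched = [];
  const unmatched = [];
  for (const row of responseData) {
    (row.matched ? matched : unmatched).push(row);
  }
  return { matched, unmatched };
}

// Sample rows copied from the response shown above.
const sample = [
  { index_a: 0, text_a: 'Microsoft', matched: true, index_b: 1, text_b: 'Microsoft Corp', score: 0.9412, threshold: 0.75 },
  { index_a: 1, text_a: 'Apple Inc', matched: false, index_b: null, text_b: null, score: null, threshold: 0.75 },
];

const { matched, unmatched } = splitFlatTable(sample);
console.log('Matched:', matched.map((r) => `${r.text_a} -> ${r.text_b}`));
console.log('Needs review:', unmatched.map((r) => r.text_a));
```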

Examples

const axios = require('axios');

// Reconcile two short lists and print the flat_table result.
async function reconcile() {
  const url = 'https://api.similarity-api.com/reconcile';
  // Token from POST /token-gen, supplied via an environment variable.
  const token = process.env.SIMILARITY_API_KEY;

  const payload = {
    data_a: ['Microsoft', 'appLE'],
    data_b: ['Apple Inc.', 'Microsoft'],
    config: {
      similarity_threshold: 0.75,
      top_n: 10,
      remove_punctuation: true,
      to_lowercase: true,
      use_token_sort: true,
      output_format: 'flat_table'
    }
  };

  try {
    const res = await axios.post(url, payload, {
      headers: { Authorization: `Bearer ${token}`, 'Content-Type': 'application/json' }
    });
    console.log('Reconcile results:', res.data);
  } catch (err) {
    // API errors carry a JSON body; otherwise fall back to the network error message.
    console.error('Error:', err.response?.data || err.message);
  }
}

reconcile();

POST /dedupe

Find near-duplicate records within a single list. Send your array, get results back in the same response.

POST https://api.similarity-api.com/dedupe

Request body

{
  "data": ["string1", "string2", ...],
  "config": { ... }
}

Field | Type | Default | Description
data (required) | string[] | - | Array of strings to deduplicate.
config | object | {} | Optional. All tuning parameters go here.

Config parameters

Parameter | Type | Default | Description
similarity_threshold | float | 0.85 | Minimum score (0–1) to consider two strings duplicates.
output_format | string | "index_pairs" | See Output formats.
use_case | string or null | null | Set to "company_names" to strip business entity suffixes (Inc., Corp., Ltd., LLC, GmbH, S.A., PLC, "The", etc.) before matching. Works best combined with to_lowercase: true.
to_lowercase | boolean | false | Lowercase all strings before comparison.
remove_punctuation | boolean | false | Strip punctuation before comparison.
use_token_sort | boolean | false | Sort tokens alphabetically before comparison.
top_n | int | 10 | Max candidates considered per string. Reducing it speeds up large inputs at the cost of potentially missing distant matches.

Response

{
  "status": "success",
  "response_data": [ ... ]
}

Response headers on every call: X-Trial-Limit and X-Trial-Remaining.
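
A small sketch for reading those headers, assuming their values are plain integer strings (the documentation does not show sample values). It accepts any object with a get(name) method, such as the headers of a fetch response; readTrialQuota is an illustrative name.

```javascript
// Read the trial quota headers from a response's headers object.
// Works with anything exposing get(name), such as fetch's Headers.
// Assumption: both header values are integer strings.
function readTrialQuota(headers) {
  const toInt = (value) => (value == null ? null : Number.parseInt(value, 10));
  return {
    limit: toInt(headers.get('X-Trial-Limit')),
    remaining: toInt(headers.get('X-Trial-Remaining')),
  };
}
```

With fetch you would call readTrialQuota(res.headers). With axios, as used in the examples, response headers are typically exposed as lower-cased properties instead, e.g. res.headers['x-trial-remaining'].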

Output formats

row_annotations

Returns one object per input row, including the cluster representative and the row's similarity to that representative.

{
  "status": "success",
  "response_data": [
    {
      "index": 0,
      "original_string": "Acme Corp",
      "rep_index": 0,
      "rep_string": "Acme Corp",
      "similarity_to_rep": 1
    },
    {
      "index": 1,
      "original_string": "ACME Corporation",
      "rep_index": 0,
      "rep_string": "Acme Corp",
      "similarity_to_rep": 0.91
    },
    {
      "index": 2,
      "original_string": "acme inc",
      "rep_index": 0,
      "rep_string": "Acme Corp",
      "similarity_to_rep": 0.82
    },
    {
      "index": 3,
      "original_string": "Globex Ltd",
      "rep_index": 3,
      "rep_string": "Globex Ltd",
      "similarity_to_rep": 1
    }
  ]
}
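
To turn these rows into clusters, group them by rep_index. A minimal sketch using the field names from the response above; groupByRepresentative is an illustrative name.

```javascript
// Group row_annotations rows into clusters keyed by rep_index.
function groupByRepresentative(rows) {
  const clusters = new Map();
  for (const row of rows) {
    if (!clusters.has(row.rep_index)) {
      clusters.set(row.rep_index, { representative: row.rep_string, members: [] });
    }
    clusters.get(row.rep_index).members.push(row.index);
  }
  return clusters;
}
```

For the sample response, this yields two clusters: one represented by "Acme Corp" with rows 0, 1, and 2, and one represented by "Globex Ltd" with row 3 alone.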

Examples

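const axios = require('axios');

(async () => {
  try {
    const res = await axios.post(
      'https://api.similarity-api.com/dedupe',
      {
        // Spelling variants of two company names.
        data: [
          'Apple Inc',
          'Apple Inc.',
          'Apple Incorporated',
          'Microsoft Corporation',
          'Microsoft Corp'
        ],
        config: {
          similarity_threshold: 0.85,
          remove_punctuation: false,
          to_lowercase: false,
          use_token_sort: false,
          output_format: 'index_pairs'
        }
      },
      // Token from POST /token-gen, supplied via an environment variable.
      { headers: { Authorization: `Bearer ${process.env.SIMILARITY_API_KEY}` } }
    );
    console.log(res.data);
  } catch (err) {
    // API errors carry a JSON body; otherwise fall back to the network error message.
    console.error('Error:', err.response?.data || err.message);
  }
})();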

Dedupe — File upload

For datasets over 10 MB, upload a Parquet or CSV file. Five steps:

1. Create job: POST /jobs

Returns a job_id and a signed upload URL (valid 1 hour).

2. Upload your file

PUT your file to the signed URL. No auth header needed; the URL is pre-signed.

3. Commit: POST /jobs/:id/commit

Triggers processing. Returns rows_total after file inspection.

4. Poll (optional): GET /jobs/:id

Check job_status, stage, and progress (0–100) until job_status is completed or failed.

5. Download results

Fetch result_url from the completed job response. The URL expires after 1 hour.
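
The five steps above can be sketched end to end as follows. This assumes Node.js 18+ (built-in fetch), that job_id and upload_url are top-level fields of the create-job response, and that commit needs no request body; none of these details are confirmed here, and the function names are illustrative.

```javascript
const fs = require('fs/promises');

const JOBS_URL = 'https://api.similarity-api.com/dedupe/jobs';

// Terminal job_status values: stop polling once one of these is reached.
function isTerminal(jobStatus) {
  return ['completed', 'failed', 'cancelled'].includes(jobStatus);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Send an authenticated JSON request and fail loudly on non-2xx responses.
async function callApi(method, url, token, body) {
  const res = await fetch(url, {
    method,
    headers: { Authorization: `Bearer ${token}`, 'Content-Type': 'application/json' },
    body: body === undefined ? undefined : JSON.stringify(body),
  });
  const json = await res.json();
  if (!res.ok) throw new Error(`${method} ${url}: ${json.message || res.status}`);
  return json;
}

async function dedupeFile(token, filePath, config, pollMs = 10000) {
  // 1. Create the job (assumes job_id and upload_url are top-level fields).
  const job = await callApi('POST', JOBS_URL, token, { config });

  // 2. Upload the file to the pre-signed URL, without an Authorization header.
  const upload = await fetch(job.upload_url, {
    method: 'PUT',
    headers: { 'Content-Type': 'application/octet-stream' },
    body: await fs.readFile(filePath),
  });
  if (!upload.ok) throw new Error(`Upload failed with HTTP ${upload.status}`);

  // 3. Commit to trigger processing (assumes no request body is needed).
  await callApi('POST', `${JOBS_URL}/${job.job_id}/commit`, token);

  // 4. Poll until the job reaches a terminal status.
  let state;
  do {
    await sleep(pollMs);
    state = await callApi('GET', `${JOBS_URL}/${job.job_id}`, token);
  } while (!isTerminal(state.job_status));

  if (state.job_status !== 'completed') {
    throw new Error(`Job ${state.job_status}: ${state.error}`);
  }
  // 5. Hand back the signed result URL for download.
  return state.result_url;
}
```

A call might look like dedupeFile(process.env.SIMILARITY_API_KEY, 'your_data.parquet', { input_format: 'parquet', input_column: 'company_name' }). Download the returned URL promptly, since it expires after 1 hour.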


POST /jobs

Creates a job and returns a signed upload URL. All processing config is set here — it cannot be changed after commit.

POST https://api.similarity-api.com/dedupe/jobs

Request body

{
  "config": {
    "input_format": "parquet",
    "input_column": "company_name",
    "similarity_threshold": 0.85,
    "use_case": "company_names",
    "output_format": "row_annotations",
    "output_file_format": "parquet"
  }
}

Config parameter | Type | Default | Description
input_format | string | "parquet" | "parquet" or "csv". Must match the file you upload.
input_column | string or null | null | Column to use when your file has multiple columns. Required for multi-column files.
output_file_format | string | "parquet" | "parquet" or "csv" for the result file.
output_format | string | "index_pairs" | Same options as the standard endpoint. See Output formats.
similarity_threshold, use_case, to_lowercase, remove_punctuation, use_token_sort, top_n | - | - | Same as the standard dedupe endpoint. See Parameters reference.
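
Because the config cannot be changed after commit, it can help to validate it locally first. The sketch below applies the defaults from the table and rejects file formats the table does not list; buildJobConfig is a hypothetical helper, and the server may validate more than this.

```javascript
const FILE_FORMATS = ['parquet', 'csv'];

// Build the config for POST /dedupe/jobs, applying the documented defaults
// and rejecting file formats that the table does not list.
function buildJobConfig(overrides = {}) {
  const config = {
    input_format: 'parquet',
    input_column: null,
    output_file_format: 'parquet',
    output_format: 'index_pairs',
    ...overrides,
  };
  for (const key of ['input_format', 'output_file_format']) {
    if (!FILE_FORMATS.includes(config[key])) {
      throw new Error(`${key} must be "parquet" or "csv", got "${config[key]}"`);
    }
  }
  return config;
}
```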

Upload your file

PUT your file to the upload_url. No Authorization header — the URL is pre-authenticated.

curl -X PUT \
  -H 'Content-Type: application/octet-stream' \
  --data-binary @your_data.parquet \
  "<upload_url>"
Warning: The upload URL expires after 1 hour. If it expires, create a new job and start over.

File requirements: Parquet or CSV matching config.input_format. Single-column files use that column automatically; multi-column files require config.input_column. Null values are treated as empty strings.

POST /jobs/:job_id/commit

Validates the uploaded file, counts rows, and queues processing. A failed job can be recommitted after re-uploading a corrected file.

POST https://api.similarity-api.com/dedupe/jobs/:job_id/commit

Response

{
  "status": "success",
  "job_id": "a3f9c2d1e8b047...",
  "rows_total": 250000,
  "auto_started": true
}

GET /jobs/:job_id

Returns the current state of a job. Poll until job_status is completed or failed. A 5–15 second polling interval is reasonable.

GET https://api.similarity-api.com/dedupe/jobs/:job_id

Response — completed

{
  "status": "success",
  "job_id": "a3f9c2d1e8b047...",
  "job_status": "completed",
  "stage": "completed",
  "progress": 100,
  "rows_total": 250000,
  "rows_processed": 250000,
  "started_at_ms": 1718200200000,
  "completed_at_ms": 1718200420000,
  "error": null,
  "result_url": "https://storage.googleapis.com/..."
}
Warning: result_url expires after 1 hour. Download immediately, or call this endpoint again for a fresh URL.

Job statuses

Status | Description
created | Waiting for file upload and commit.
queued | Committed and row count verified. Processing queued.
running | Processing. Check stage and progress for detail.
completed | result_url is available for download.
failed | Check error for the reason. Can be recommitted.
cancelled | Cancelled by the user. Input and output files are deleted.
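
One way to act on these statuses in client code is a simple lookup from job_status to the next step. The action strings below are suggestions derived from the table, not values returned by the API.

```javascript
// Suggested client action for each documented job_status value.
const NEXT_ACTION = new Map([
  ['created', 'upload the file, then commit'],
  ['queued', 'keep polling'],
  ['running', 'keep polling'],
  ['completed', 'download result_url'],
  ['failed', 'inspect error, then re-upload and recommit'],
  ['cancelled', 'stop; input and output files were deleted'],
]);

// Look up the next step, failing on any status the table does not list.
function nextAction(jobStatus) {
  if (!NEXT_ACTION.has(jobStatus)) {
    throw new Error(`Unknown job_status: ${jobStatus}`);
  }
  return NEXT_ACTION.get(jobStatus);
}
```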

Processing stages

downloading → loading → preprocessing → encoding → index_build → ann_search → clustering → writing_output → completed

Note: Jobs that make no progress for 2 hours are automatically marked failed. This prevents jobs from staying in running or queued indefinitely.

POST /jobs/:job_id/cancel

Requests cancellation. Cancellation is cooperative and may take a short time. Input and output files are deleted on cancellation.

POST https://api.similarity-api.com/dedupe/jobs/:job_id/cancel

Response

{
  "status": "success",
  "job_id": "a3f9c2d1e8b047...",
  "cancel_requested": true
}

Parameters reference

All parameters live inside the config object. Shared parameters apply to both endpoints unless noted.

Parameter | Type | Default | Description
similarity_threshold | float | 0.85 | Minimum similarity (0–1) to consider a match. Start with 0.8–0.9 for company names.
to_lowercase | boolean | false | Lowercase all strings before comparison. Recommended for most use cases.
remove_punctuation | boolean | false | Strip punctuation before comparison.
use_token_sort | boolean | false | Sort tokens alphabetically before comparison. Useful when word order varies.
output_format | string | varies | Shape of results. See Output formats.
use_case (dedupe only) | string or null | null | "company_names" enables built-in normalisation: strips Inc., Corp., Ltd., LLC, GmbH, S.A., PLC, "The ", and more. Loops until fully resolved, so "Acme Corp Ltd" becomes "Acme". Use with to_lowercase: true.
top_n (dedupe only) | int | 10 | Max candidates per string. Reducing it speeds up large inputs at the cost of recall.
top_n (reconcile only) | int | 10 | Max data_b matches returned per data_a item.
input_format (file upload only) | string | "parquet" | "parquet" or "csv". Must match the uploaded file.
input_column (file upload only) | string or null | null | Column to use for multi-column files.
output_file_format (file upload only) | string | "parquet" | "parquet" or "csv" for the result file.
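
To build intuition for what the preprocessing flags do, here is a local approximation. It is not the service's implementation: the suffix list is a hypothetical subset taken from the examples above, and the order in which the service applies these steps is an assumption.

```javascript
// Hypothetical suffix list based on the examples named in the table;
// the service's real list is longer ("and more") and is not documented here.
const SUFFIXES = ['inc', 'corp', 'ltd', 'llc', 'gmbh', 'sa', 'plc'];

// Approximate the effect of the preprocessing flags on a single string.
function approximateNormalize(text, opts = {}) {
  let s = text;
  if (opts.to_lowercase) s = s.toLowerCase();
  // Keep only letters, digits, and whitespace.
  if (opts.remove_punctuation) s = s.replace(/[^\p{L}\p{N}\s]/gu, '');
  let tokens = s.split(/\s+/).filter(Boolean);
  if (opts.use_case === 'company_names') {
    const bare = (t) => t.toLowerCase().replace(/\./g, '');
    // Drop a leading "The", then strip trailing suffixes until none remain.
    if (tokens.length > 1 && bare(tokens[0]) === 'the') tokens = tokens.slice(1);
    while (tokens.length > 1 && SUFFIXES.includes(bare(tokens[tokens.length - 1]))) {
      tokens = tokens.slice(0, -1);
    }
  }
  if (opts.use_token_sort) tokens = [...tokens].sort();
  return tokens.join(' ');
}
```

With use_case set to "company_names", "Acme Corp Ltd" reduces to "Acme" in this sketch, in line with the looping behaviour described in the table. With use_token_sort, "Smith John" and "John Smith" normalise to the same string.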

Output formats

Reconcile

flat_table

One entry per data_a item including non-matches. Best for spreadsheet-style review.

string_pairs

Matched pairs only, with text values. Human-readable.

index_pairs

Matched pairs only, indices and score. Most compact.

Dedupe

row_annotations

One entry per input row with cluster info and similarity to representative. Best for review and export.

membership_map

Maps each row index to its representative index. Compact for programmatic use.

deduped_strings

The representative string from each cluster — a clean deduplicated list.

deduped_indices

The indices of representatives. Useful for filtering a DataFrame.

index_pairs

All matching pairs as [left, right, score].

string_pairs

All matching pairs with string values and score.

Note: How representatives are chosen. Within each cluster, the record with the lowest index in the original input is the representative. Membership is transitive: if A matches B and B matches C, all three are in the same cluster even if A and C don't directly match.
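
The transitive rule can be reproduced locally from index_pairs output with a union-find structure. This is a sketch of the behaviour described above, not the service's actual clustering code; it assumes pairs arrive as [left, right, score] arrays and that you know the number of input rows.

```javascript
// Build clusters from index_pairs ([left, right, score]) with union-find.
// Each root is kept as the lowest index in its cluster, mirroring the
// "lowest index is the representative" rule.
function clusterFromPairs(rowCount, pairs) {
  const parent = Array.from({ length: rowCount }, (_, i) => i);

  // Find the root of i, shortening the path as we go (path halving).
  const find = (i) => {
    while (parent[i] !== i) {
      parent[i] = parent[parent[i]];
      i = parent[i];
    }
    return i;
  };

  for (const [left, right] of pairs) {
    const a = find(left);
    const b = find(right);
    // Attach the higher root under the lower one so roots stay minimal.
    if (a !== b) parent[Math.max(a, b)] = Math.min(a, b);
  }

  // membership[i] is the representative index for row i.
  return parent.map((_, i) => find(i));
}
```

For four rows with pairs [0, 1] and [1, 2], rows 0, 1, and 2 all map to representative 0 even though 0 and 2 were never paired directly, and row 3 remains its own representative.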

Errors

All errors follow the same shape:

{ "status": "error", "message": "Description of what went wrong" }

Code | Meaning
200 | Success.
400 | Bad request: missing or invalid parameters. Check message.
401 | Authentication failed: missing, malformed, or invalid token.
402 | Free-tier quota exceeded. Response includes limit and remaining.
403 | /token-gen: free-tier limit already exhausted. Job endpoints: accessing another account's job.
404 | Job not found.
500 | Internal server error.

Quota error (402)

{
  "status": "error",
  "message": "You have exceeded your free trial...",
  "limit": 100000,
  "remaining": 0
}
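
A client might map these status codes to a reaction along the following lines. The retry flags are suggestions rather than documented guidance, and classifyError is an illustrative name.

```javascript
// Map an HTTP status and parsed error body to a suggested client reaction.
// The retry flags are a judgment call, not documented API behaviour.
function classifyError(status, body = {}) {
  switch (status) {
    case 400:
      return { retry: false, reason: `Bad request: ${body.message}` };
    case 401:
      return { retry: false, reason: 'Check the Authorization: Bearer <token> header' };
    case 402:
      return { retry: false, reason: `Quota exceeded (${body.remaining} of ${body.limit} rows left)` };
    case 403:
      return { retry: false, reason: 'Free-tier limit exhausted, or the job belongs to another account' };
    case 404:
      return { retry: false, reason: 'Job not found' };
    case 500:
      return { retry: true, reason: 'Server error; retrying later may help' };
    default:
      return { retry: false, reason: `Unexpected status ${status}` };
  }
}
```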

Common mistakes

Symptom | Likely cause
401 on every request | Wrong Authorization header format. Use exactly: Bearer <token>
400 "CSV has multiple columns" | Add "input_column": "your_column_name" to config.
result_url returns 403 | Signed URL expired (1 hour TTL). Call GET /jobs/:id again for a fresh URL.
Job committed but immediately failed | File wasn't uploaded before commit, or input_format doesn't match the file.

Custom Solution

Need a custom solution? Our team can help you implement specialized string similarity and deduplication solutions tailored to your specific needs. Contact us to set up an exploratory call.