API Reference
Similarity API provides scalable, deterministic fuzzy string matching over a REST API, callable from any language or platform that can make an HTTP request.
Quick start
Get your first result in three steps.
Get a token
POST your email to /token-gen. Your token comes back immediately — no account setup required.
Call /reconcile or /dedupe
POST a JSON body with your data and Bearer token. Results are synchronous.
Read your results
The response_data field contains your matches in the format you chose.
Authentication
All endpoints require a Bearer token in the Authorization header.
POST /token-gen
Returns a JWT for your email address. No password or account required to start.
Request body
| Field | Type | Description |
|---|---|---|
| email required | string | Your real email address. Placeholder values are rejected. |
Response
{ "token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9..." }
Token behaviour
- Tokens do not expire on a timer — reuse the same token across sessions.
- Free-tier tokens remain valid until your trial row limit is exhausted.
- If your limit is already exhausted when you call this endpoint, you receive a 403 instead of a token.
Using the token
Authorization: Bearer <your-token>
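In Node, a small helper keeps the header format consistent across calls. This is a sketch; SIMILARITY_API_KEY is a placeholder name for wherever you store the token, not part of the API:

```javascript
// Build the headers for any Similarity API request.
// SIMILARITY_API_KEY is a placeholder environment variable name.
function authHeaders(token) {
  if (!token) throw new Error('Missing API token');
  return {
    Authorization: `Bearer ${token}`,
    'Content-Type': 'application/json'
  };
}

console.log(authHeaders(process.env.SIMILARITY_API_KEY || 'example-token'));
```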
Choosing your path
Two endpoints, two directions of matching:
| | Reconcile | Dedupe |
|---|---|---|
| Operation | Match A → B (two lists) | Find duplicates within A (one list) |
| Input | data_a + data_b | data |
| Typical use | Link new records to a reference database; fuzzy join two datasets | Clean a list before import; find near-duplicate CRM records |
| Delivery | Synchronous (JSON body) | Synchronous (JSON body) or async file upload |
For the dedupe endpoint, there are two delivery options. Use the standard endpoint for most work — POST your array, get results back immediately. Switch to file upload when your payload is too large for a single HTTP request or you're processing data from an existing file pipeline.
POST /reconcile
Matches each string in data_a against the strings in data_b and returns the best matches above your threshold. Ideal for linking new records to a reference list, fuzzy-joining two datasets, or checking whether items from one source exist in another.
Request body
{
"data_a": ["string1", "string2", ...],
"data_b": ["ref1", "ref2", ...],
"config": { ... }
}
| Field | Type | Default | Description |
|---|---|---|---|
| data_a required | string[] | — | The list to reconcile — e.g., incoming records or a spreadsheet column. |
| data_b required | string[] | — | The reference list to match against — e.g., your CRM, master database, or canonical list. |
| config | object | {} | Optional. All tuning parameters go here. |
Config parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| similarity_threshold | float | 0.85 | Minimum score (0–1) to consider a match. Lower = more matches, higher = stricter. |
| top_n | int | 10 | Maximum number of data_b matches to return per data_a item. |
| output_format | string | "index_pairs" | "flat_table", "string_pairs", or "index_pairs". See Output formats. |
| use_case | string \| null | null | Set to "company_names" to strip business entity suffixes (Inc., Corp., Ltd., LLC, GmbH, S.A., PLC, "The", etc.) before matching. Works best combined with to_lowercase: true. |
| to_lowercase | boolean | false | Lowercase all strings before comparison. Recommended for most use cases. |
| remove_punctuation | boolean | false | Strip punctuation before comparison. |
| use_token_sort | boolean | false | Sort tokens alphabetically before comparison. Useful when word order varies ("Smith John" vs "John Smith"). |
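The three boolean flags are plain text normalisations. The sketch below mirrors their effect locally so you can predict how two strings will compare after preprocessing; it is an illustration, not the service's actual implementation:

```javascript
// Illustrative local version of the preprocessing flags
// (to_lowercase, remove_punctuation, use_token_sort).
// The API applies equivalent steps server-side before scoring.
function normalise(s, { toLowercase = false, removePunctuation = false, useTokenSort = false } = {}) {
  let out = s;
  if (toLowercase) out = out.toLowerCase();
  if (removePunctuation) out = out.replace(/[^\p{L}\p{N}\s]/gu, '');
  if (useTokenSort) out = out.split(/\s+/).filter(Boolean).sort().join(' ');
  return out;
}

const opts = { toLowercase: true, removePunctuation: true, useTokenSort: true };
console.log(normalise('Smith, John', opts) === normalise('John Smith', opts)); // true
```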
Quota is counted as len(data_a) + len(data_b) rows per request.
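As a sanity check before sending a large request, the rows a call will consume can be computed up front:

```javascript
// Rows charged for one /reconcile request: len(data_a) + len(data_b).
function quotaRows(dataA, dataB) {
  return dataA.length + dataB.length;
}

console.log(quotaRows(['Microsoft', 'appLE'], ['Apple Inc.', 'Microsoft'])); // 4
```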
Output formats
flat_table (recommended) — One entry per data_a item, including non-matches. Best for data analysis and spreadsheet-style review.
{
"status": "success",
"response_data": [
{
"index_a": 0,
"text_a": "Microsoft",
"matched": true,
"index_b": 1,
"text_b": "Microsoft Corp",
"score": 0.9412,
"threshold": 0.75
},
{
"index_a": 1,
"text_a": "Apple Inc",
"matched": false,
"index_b": null,
"text_b": null,
"score": null,
"threshold": 0.75
}
]
}
Examples
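Because flat_table includes non-matches, a response in this shape splits cleanly into matched and unmatched rows. A minimal sketch, using the sample response above:

```javascript
// Split a flat_table response into matched and unmatched rows.
// `response` mirrors the sample above; in practice it comes from the API.
const response = {
  status: 'success',
  response_data: [
    { index_a: 0, text_a: 'Microsoft', matched: true, index_b: 1, text_b: 'Microsoft Corp', score: 0.9412, threshold: 0.75 },
    { index_a: 1, text_a: 'Apple Inc', matched: false, index_b: null, text_b: null, score: null, threshold: 0.75 }
  ]
};

const matched = response.response_data.filter(r => r.matched);
const unmatched = response.response_data.filter(r => !r.matched);

console.log(matched.map(r => `${r.text_a} -> ${r.text_b} (${r.score})`));
console.log(unmatched.map(r => r.text_a)); // rows left for manual review
```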
const axios = require('axios');
async function reconcile() {
const url = 'https://api.similarity-api.com/reconcile';
const token = process.env.SIMILARITY_API_KEY;
const payload = {
data_a: ['Microsoft', 'appLE'],
data_b: ['Apple Inc.', 'Microsoft'],
config: {
similarity_threshold: 0.75,
top_n: 10,
remove_punctuation: true,
to_lowercase: true,
use_token_sort: true,
output_format: 'flat_table'
}
};
try {
const res = await axios.post(url, payload, {
headers: { Authorization: `Bearer ${token}`, 'Content-Type': 'application/json' }
});
console.log('Reconcile results:', res.data);
} catch (err) {
console.error('Error:', err.response?.data || err.message);
}
}
reconcile();
POST /dedupe
Find near-duplicate records within a single list. Send your array, get results back in the same response.
Request body
{
"data": ["string1", "string2", ...],
"config": { ... }
}
| Field | Type | Default | Description |
|---|---|---|---|
| data required | string[] | — | Array of strings to deduplicate. |
| config | object | {} | Optional. All tuning parameters go here. |
Config parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| similarity_threshold | float | 0.85 | Minimum score (0–1) to consider two strings duplicates. |
| output_format | string | "index_pairs" | See Output formats. |
| use_case | string \| null | null | Set to "company_names" to strip business entity suffixes (Inc., Corp., Ltd., LLC, GmbH, S.A., PLC, "The", etc.) before matching. Works best combined with to_lowercase: true. |
| to_lowercase | boolean | false | Lowercase all strings before comparison. |
| remove_punctuation | boolean | false | Strip punctuation before comparison. |
| use_token_sort | boolean | false | Sort tokens alphabetically before comparison. |
| top_n | int | 10 | Max candidates considered per string. Reducing speeds up large inputs at the cost of potentially missing distant matches. |
Response
{
"status": "success",
"response_data": [ ... ]
}
Response headers on every call: X-Trial-Limit and X-Trial-Remaining.
Output formats
row_annotations — one object per input row, including the cluster representative and the row's similarity to that representative.
{
"status": "success",
"response_data": [
{
"index": 0,
"original_string": "Acme Corp",
"rep_index": 0,
"rep_string": "Acme Corp",
"similarity_to_rep": 1
},
{
"index": 1,
"original_string": "ACME Corporation",
"rep_index": 0,
"rep_string": "Acme Corp",
"similarity_to_rep": 0.91
},
{
"index": 2,
"original_string": "acme inc",
"rep_index": 0,
"rep_string": "Acme Corp",
"similarity_to_rep": 0.82
},
{
"index": 3,
"original_string": "Globex Ltd",
"rep_index": 3,
"rep_string": "Globex Ltd",
"similarity_to_rep": 1
}
]
}
Examples
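A row_annotations response folds into clusters client-side with one pass. A sketch, using the sample response above:

```javascript
// Group a row_annotations response into clusters keyed by representative index.
// `rows` mirrors the sample response above.
const rows = [
  { index: 0, original_string: 'Acme Corp', rep_index: 0, rep_string: 'Acme Corp', similarity_to_rep: 1 },
  { index: 1, original_string: 'ACME Corporation', rep_index: 0, rep_string: 'Acme Corp', similarity_to_rep: 0.91 },
  { index: 2, original_string: 'acme inc', rep_index: 0, rep_string: 'Acme Corp', similarity_to_rep: 0.82 },
  { index: 3, original_string: 'Globex Ltd', rep_index: 3, rep_string: 'Globex Ltd', similarity_to_rep: 1 }
];

const clusters = new Map();
for (const row of rows) {
  if (!clusters.has(row.rep_index)) clusters.set(row.rep_index, []);
  clusters.get(row.rep_index).push(row.original_string);
}

for (const [rep, members] of clusters) {
  console.log(`cluster ${rep}: ${members.join(', ')}`);
}
```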
const axios = require('axios');
(async () => {
const res = await axios.post(
'https://api.similarity-api.com/dedupe',
{
data: [
'Apple Inc',
'Apple Inc.',
'Apple Incorporated',
'Microsoft Corporation',
'Microsoft Corp'
],
config: {
similarity_threshold: 0.85,
remove_punctuation: false,
to_lowercase: false,
use_token_sort: false,
output_format: 'index_pairs'
}
},
{ headers: { Authorization: `Bearer ${process.env.SIMILARITY_API_KEY}` } }
);
console.log(res.data);
})();
Dedupe — File upload
For datasets over 10MB, upload a Parquet or CSV file. Five steps:
Create job — POST /jobs
Returns a job_id and a signed upload URL (valid 1 hour).
Upload your file
PUT your file to the signed URL. No auth header needed — the URL is pre-signed.
Commit — POST /jobs/:id/commit
Triggers processing. Returns rows_total after file inspection.
Poll (optional) — GET /jobs/:id
Check job_status, stage, and progress (0–100) until status is completed or failed.
Download results
Fetch result_url from the completed job response. Expires after 1 hour.
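The five steps above can be sketched as one async function. This is an illustrative outline using Node's built-in fetch (Node 18+); field names follow the responses documented below, and nothing here executes against the live API:

```javascript
// Illustrative outline of the file-upload job lifecycle.
// Assumes Node 18+ (global fetch) and a token in SIMILARITY_API_KEY (placeholder name).
const BASE = 'https://api.similarity-api.com';

const isTerminal = (jobStatus) => jobStatus === 'completed' || jobStatus === 'failed';
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function runDedupeJob(fileBytes, config, token) {
  const headers = { Authorization: `Bearer ${token}`, 'Content-Type': 'application/json' };

  // 1. Create the job: config is fixed here and cannot change after commit.
  const job = await (await fetch(`${BASE}/jobs`, {
    method: 'POST', headers, body: JSON.stringify({ config })
  })).json();

  // 2. Upload to the pre-signed URL: no Authorization header needed.
  await fetch(job.upload_url, {
    method: 'PUT',
    headers: { 'Content-Type': 'application/octet-stream' },
    body: fileBytes
  });

  // 3. Commit: validates the file and queues processing.
  await fetch(`${BASE}/jobs/${job.job_id}/commit`, { method: 'POST', headers });

  // 4. Poll every 10 seconds until the job reaches a terminal status.
  let state;
  do {
    await sleep(10_000);
    state = await (await fetch(`${BASE}/jobs/${job.job_id}`, { headers })).json();
  } while (!isTerminal(state.job_status));

  if (state.job_status === 'failed') throw new Error(state.error);

  // 5. result_url expires after 1 hour; download promptly.
  return state.result_url;
}
```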
POST /jobs
Creates a job and returns a signed upload URL. All processing config is set here — it cannot be changed after commit.
Request body
{
"config": {
"input_format": "parquet",
"input_column": "company_name",
"similarity_threshold": 0.85,
"use_case": "company_names",
"output_format": "row_annotations",
"output_file_format": "parquet"
}
}
| Config parameter | Type | Default | Description |
|---|---|---|---|
| input_format | string | "parquet" | "parquet" or "csv". Must match the file you upload. |
| input_column | string \| null | null | Column to use when your file has multiple columns. Required for multi-column files. |
| output_file_format | string | "parquet" | "parquet" or "csv" for the result file. |
| output_format | string | "index_pairs" | Same options as the standard endpoint. See Output formats. |
| similarity_threshold, use_case, to_lowercase, remove_punctuation, use_token_sort, top_n | — | — | Same as the standard dedupe endpoint. See Parameters reference. |
Upload your file
PUT your file to the upload_url. No Authorization header — the URL is pre-authenticated.
curl -X PUT \
  -H 'Content-Type: application/octet-stream' \
  --data-binary @your_data.parquet \
  "<upload_url>"
The upload URL expires after 1 hour. If it expires, create a new job and start over.
File requirements: Parquet or CSV matching config.input_format. Single-column files use that column automatically; multi-column files require config.input_column. Null values are treated as empty strings.
POST /jobs/:job_id/commit
Validates the uploaded file, counts rows, and queues processing. A failed job can be recommitted after re-uploading a corrected file.
Response
{
"status": "success",
"job_id": "a3f9c2d1e8b047...",
"rows_total": 250000,
"auto_started": true
}
GET /jobs/:job_id
Returns the current state of a job. Poll until job_status is completed or failed. A 5–15 second polling interval is reasonable.
Response — completed
{
"status": "success",
"job_id": "a3f9c2d1e8b047...",
"job_status": "completed",
"stage": "completed",
"progress": 100,
"rows_total": 250000,
"rows_processed": 250000,
"started_at_ms": 1718200200000,
"completed_at_ms": 1718200420000,
"error": null,
"result_url": "https://storage.googleapis.com/..."
}
result_url expires after 1 hour. Download immediately, or call this endpoint again for a fresh URL.
Job statuses
| Status | Description |
|---|---|
| created | Waiting for file upload and commit. |
| queued | Committed and row count verified. Processing queued. |
| running | Processing. Check stage and progress for detail. |
| completed | result_url is available for download. |
| failed | Check error for the reason. Can be recommitted. |
| cancelled | Cancelled by the user. Input and output files are deleted. |
Processing stages
downloading → loading → preprocessing → encoding → index_build → ann_search → clustering → writing_output → completed
Jobs that make no progress for 2 hours are automatically marked failed. This prevents jobs staying in running or queued indefinitely.
POST /jobs/:job_id/cancel
Requests cancellation. Cancellation is cooperative and may take a short time. Input and output files are deleted on cancellation.
Response
{
"status": "success",
"job_id": "a3f9c2d1e8b047...",
"cancel_requested": true
}
Parameters reference
All parameters live inside the config object. Shared parameters apply to both endpoints unless noted.
| Parameter | Type | Default | Description |
|---|---|---|---|
| similarity_threshold | float | 0.85 | Minimum similarity (0–1) to consider a match. Start with 0.8–0.9 for company names. |
| to_lowercase | boolean | false | Lowercase all strings before comparison. Recommended for most use cases. |
| remove_punctuation | boolean | false | Strip punctuation before comparison. |
| use_token_sort | boolean | false | Sort tokens alphabetically before comparison. Useful when word order varies. |
| output_format | string | varies | Shape of results. See Output formats. |
| use_case dedupe only | string \| null | null | "company_names" enables built-in normalisation: strips Inc., Corp., Ltd., LLC, GmbH, S.A., PLC, "The ", and more. Loops until fully resolved — so "Acme Corp Ltd" becomes "Acme". Use with to_lowercase: true. |
| top_n dedupe only | int | 10 | Max candidates per string. Reducing speeds up large inputs at the cost of recall. |
| top_n reconcile only | int | 10 | Max data_b matches returned per data_a item. |
| input_format file upload only | string | "parquet" | "parquet" or "csv". Must match the uploaded file. |
| input_column file upload only | string \| null | null | Column to use for multi-column files. |
| output_file_format file upload only | string | "parquet" | "parquet" or "csv" for the result file. |
Output formats
Reconcile
flat_table
One entry per data_a item including non-matches. Best for spreadsheet-style review.
string_pairs
Matched pairs only, with text values. Human-readable.
index_pairs
Matched pairs only, indices and score. Most compact.
Dedupe
row_annotations
One entry per input row with cluster info and similarity to representative. Best for review and export.
membership_map
Maps each row index to its representative index. Compact for programmatic use.
deduped_strings
The representative string from each cluster — a clean deduplicated list.
deduped_indices
The indices of representatives. Useful for filtering a DataFrame.
index_pairs
All matching pairs as [left, right, score].
string_pairs
All matching pairs with string values and score.
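The compact formats compose naturally: deduped_strings and deduped_indices can be derived from a membership map plus the original input. The sketch below assumes the map arrives as an array where position i holds the representative index of row i — verify the exact shape against your own response:

```javascript
// Derive a deduplicated list from a membership map (shape assumed: array form,
// membership[i] = representative index of row i).
const input = ['Acme Corp', 'ACME Corporation', 'acme inc', 'Globex Ltd'];
const membership = [0, 0, 0, 3];

const dedupedIndices = [...new Set(membership)];       // representative indices
const dedupedStrings = dedupedIndices.map((i) => input[i]);

console.log(dedupedStrings); // [ 'Acme Corp', 'Globex Ltd' ]
```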
How representatives are chosen. Within each cluster, the record with the lowest index in the original input is the representative. Membership is transitive — if A matches B and B matches C, all three are in the same cluster even if A and C don't directly match.
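This transitive rule can be reproduced with a small union-find. The sketch below illustrates the behaviour as described — lowest-index representative, transitive membership — and is not the service's actual implementation:

```javascript
// Union-find over match pairs: transitive clusters, lowest index as representative.
function clusterReps(n, pairs) {
  const parent = Array.from({ length: n }, (_, i) => i);
  const find = (x) => (parent[x] === x ? x : (parent[x] = find(parent[x])));
  for (const [a, b] of pairs) {
    const ra = find(a), rb = find(b);
    // Always point the higher root at the lower one, so each cluster's
    // root is its lowest original index.
    if (ra !== rb) parent[Math.max(ra, rb)] = Math.min(ra, rb);
  }
  return parent.map((_, i) => find(i)); // representative index for each row
}

// A matches B and B matches C: all three share representative 0,
// even though A and C were never directly compared.
console.log(clusterReps(4, [[0, 1], [1, 2]])); // [ 0, 0, 0, 3 ]
```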
Errors
All errors follow the same shape:
{ "status": "error", "message": "Description of what went wrong" }
Every error includes a message; quota errors additionally include limit and remaining.
Forbidden (403)
/token-gen: free-tier limit already exhausted. Job endpoints: accessing another account's job.
Quota error (402)
{
"status": "error",
"message": "You have exceeded your free trial...",
"limit": 100000,
"remaining": 0
}
Common mistakes
| Symptom | Likely cause |
|---|---|
| 401 on every request | Wrong Authorization header format. Use exactly: Bearer <token> |
| 400 "CSV has multiple columns" | Add "input_column": "your_column_name" to config. |
| result_url returns 403 | Signed URL expired (1 hour TTL). Call GET /jobs/:id again for a fresh URL. |
| Job committed but immediately failed | File wasn't uploaded before commit, or input_format doesn't match the file. |
Custom Solution
Need a custom solution? Our team can help you implement specialized string similarity and deduplication solutions tailored to your specific needs. Get in touch to set up an exploratory call.