1M-Row Fuzzy Matching Benchmark (2025): Similarity API vs RapidFuzz, TheFuzz, Levenshtein
TL;DR
- Similarity API is 12× faster at 10k rows, ~20× faster at 100k rows, and 300×–1,000× faster at 1M rows compared to RapidFuzz, TheFuzz, and python‑Levenshtein.
- Local Python fuzzy matching libraries scale as O(N²) and become infeasible beyond ~50k–100k strings without heavy custom engineering.
- Similarity API handled 1,000,000-string deduplication in a single warm API call on a basic 2‑CPU machine.
- Preprocessing and matching behavior are configured through simple API flags, with no custom pipelines, blocking rules, or indexing layers.
- Full benchmark code + reproducible Colab notebook available.
Why This Benchmark Matters
Fuzzy string matching is at the core of common data tasks—cleaning CRM data, merging product catalogs, reconciling records, or doing fuzzy joins inside ETL pipelines. Yet most developers still rely on local Python libraries that work great at 1k–10k records but don't scale when you hit real-world volumes.
This benchmark compares:
- Similarity API (cloud-native, adaptive matching engine)
- RapidFuzz (fast, modern C++/Python library)
- TheFuzz (FuzzyWuzzy fork)
- python-Levenshtein (core edit-distance implementation)
We test them at 10k, 100k, and 1M strings.
Data & Benchmark Setup
Environment
Tests ran in a standard Google Colab CPU environment:
- 2 vCPUs
- ~13GB RAM
- Python 3.x
Timings represent warm runs. The first API call has a small cold-start penalty, but subsequent calls match production steady-state behavior.
Synthetic Data
We generate names from a curated base list (people, companies, etc.) and apply realistic typos:
- Insertions / deletions
- Adjacent swaps
- Random character replacements
This produces realistic noisy variants such as:
- Micsrosoft Corpp
- Aplpe Inc.
- Charlle Brown
Each string gets a label based on its base name so we can run a quick accuracy sanity check.
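For illustration, here is a minimal sketch of this kind of typo generator. It is not the exact benchmark code; the alphabet, typo counts, and example name are placeholders.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def add_typos(name, n_typos=1, seed=None):
    # Apply random insertions, deletions, adjacent swaps, and character
    # replacements to produce a noisy variant of a base name.
    rng = random.Random(seed)
    chars = list(name)
    for _ in range(n_typos):
        op = rng.choice(["insert", "delete", "swap", "replace"])
        i = rng.randrange(len(chars))
        if op == "insert":
            chars.insert(i, rng.choice(ALPHABET))
        elif op == "delete" and len(chars) > 1:
            chars.pop(i)
        elif op == "swap" and i < len(chars) - 1:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
        else:  # replace (also the fallback when a delete/swap guard fails)
            chars[i] = rng.choice(ALPHABET)
    return "".join(chars)

print(add_typos("Microsoft Corp", n_typos=2, seed=7))
```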
Dataset Sizes
Benchmarks run at:
- 10,000 strings
- 100,000 strings
- 1,000,000 strings
Local libraries are measured directly at 10k (RapidFuzz also at 100k); the remaining larger sizes are estimated by extrapolating their O(N²) scaling.
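The extrapolation itself is simple arithmetic: multiplying the dataset size by 10 multiplies the pairwise work, and therefore the runtime, by roughly 100. A quick sketch using RapidFuzz's measured 10k timing from the results table below:

```python
def extrapolate_quadratic(measured_seconds, measured_n, target_n):
    # O(N^2) pairwise comparison: runtime scales with the square of the input size.
    return measured_seconds * (target_n / measured_n) ** 2

# RapidFuzz measured 13.0 s at 10k strings:
print(extrapolate_quadratic(13.0, 10_000, 100_000))    # ~1,300 s (1,301.8 s was actually measured)
print(extrapolate_quadratic(13.0, 10_000, 1_000_000))  # ~130,000 s, roughly 36 hours
```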
How Each Tool Was Used
Similarity API
A simple /dedupe call with configurable preprocessing:
POST https://api.similarity-api.com/dedupe
{
  "data": [...strings...],
  "config": {
    "similarity_threshold": 0.85,
    "remove_punctuation": false,
    "to_lowercase": false,
    "use_token_sort": false,
    "output_format": "index_pairs"
  }
}

Changing matching behavior is simply toggling config options—no preprocessing code or custom pipelines.
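In Python, a warm call looks roughly like this. It is a minimal sketch using requests; the Authorization header name and the exact shape of the JSON response are assumptions, so check the API documentation for the auth scheme tied to your key.

```python
import requests

API_KEY = "YOUR_API_KEY"  # free key from similarity-api.com

payload = {
    "data": ["Microsoft Corp", "Micsrosoft Corpp", "Apple Inc.", "Aplpe Inc."],
    "config": {
        "similarity_threshold": 0.85,
        "remove_punctuation": False,
        "to_lowercase": False,
        "use_token_sort": False,
        "output_format": "index_pairs",
    },
}

resp = requests.post(
    "https://api.similarity-api.com/dedupe",
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},  # header name is an assumption
    timeout=600,
)
resp.raise_for_status()
# With output_format="index_pairs", the response contains pairs of matching
# indices; the exact JSON structure is not shown here.
print(resp.json())
```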
RapidFuzz
We use RapidFuzz's optimized C++ engine:
process.cdist(strings, strings, scorer=fuzz.ratio, workers=-1)
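Turning the resulting score matrix into match pairs is then a thresholding step. The sketch below shows the pattern; the thresholding and pair extraction are our illustration rather than the verbatim benchmark code.

```python
import numpy as np
from rapidfuzz import fuzz, process

def rapidfuzz_pairs(strings, threshold=85):
    # Full N x N similarity matrix in one call (the O(N^2) cost is unavoidable here).
    scores = process.cdist(strings, strings, scorer=fuzz.ratio, workers=-1)
    # Keep only the upper triangle so each pair is reported once and
    # self-matches on the diagonal are excluded.
    rows, cols = np.where(np.triu(scores, k=1) >= threshold)
    return list(zip(rows.tolist(), cols.tolist()))

print(rapidfuzz_pairs(["Apple Inc.", "Aplpe Inc.", "Charlie Brown"]))  # [(0, 1)]
```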
TheFuzz & python-Levenshtein
Both are used through naive Python loops, since neither offers a bulk, vectorized similarity-matrix API comparable to RapidFuzz's cdist.
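A minimal sketch of that naive pairwise loop, assuming fuzz.ratio from TheFuzz and Levenshtein.ratio as the scorers (the example strings and thresholds are illustrative):

```python
import Levenshtein
from thefuzz import fuzz

def naive_pairs(strings, scorer, threshold):
    # O(N^2) pure-Python double loop: this is what makes these libraries
    # impractical once you pass a few tens of thousands of strings.
    pairs = []
    n = len(strings)
    for i in range(n):
        for j in range(i + 1, n):
            if scorer(strings[i], strings[j]) >= threshold:
                pairs.append((i, j))
    return pairs

strings = ["Apple Inc.", "Aplpe Inc.", "Charlie Brown", "Charlle Brown"]
# TheFuzz scores on a 0-100 scale; python-Levenshtein's ratio() on 0.0-1.0.
print(naive_pairs(strings, fuzz.ratio, 85))
print(naive_pairs(strings, Levenshtein.ratio, 0.85))
```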
Quick Accuracy Sanity Check
Using a 2,000-string subset with known duplicate labels, we ran a lightweight sanity check:
- All tools achieved very high precision at a reasonably strict threshold (everything they returned was actually a duplicate).
- We also checked that the number of unique entities after deduplication is close to the ground truth (using Similarity API's deduped_indices output format and clustering the pairs returned by the local libraries).
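One simple way to cluster index pairs into entity counts is a small union-find pass. This is a sketch of the idea, not the verbatim benchmark code:

```python
def count_entities(n, pairs):
    # Treat each match pair as an edge and count connected components:
    # every component collapses to one unique entity after deduplication.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in pairs:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj

    return len({find(i) for i in range(n)})

# e.g. 4 strings with one matched pair -> 3 unique entities
print(count_entities(4, [(0, 1)]))  # 3
```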
Results
Our benchmarks revealed significant performance differences between the tools, particularly as dataset sizes increased:
| Library | 10K Records | 100K Records | 1M Records |
|---|---|---|---|
| Similarity API | 0.8 s | 58.8 s | 421.8 s |
| RapidFuzz | 13.0 s | 1301.8 s | 130,180.0 s (est.) |
| python-Levenshtein | 46.8 s | 4684.9 s (est.) | 468,490.0 s (est.) |
| TheFuzz | 39.4 s | 3938.2 s (est.) | 393,820.0 s (est.) |
[Chart: Performance at 10K and 100K rows]
[Chart: Performance across all dataset sizes (10K, 100K, 1M rows)]
A few things stand out:
- At 10k rows, Similarity API is already an order of magnitude faster than the fastest local library.
- By 100k rows, local libraries are effectively in "batch job" territory, while Similarity API is still something you can run interactively.
- At 1M rows, Similarity API finishes in about 7 minutes, while naive estimates for the local libraries are in the tens to hundreds of hours.
If you're cleaning real-world datasets or running dedupe inside production pipelines, these differences are the line between "runs during a coffee break" and "needs an overnight batch job plus a lot of custom infrastructure."
Get a free API key and plug your own data into the benchmark, or drop Similarity API directly into your ETL / data quality pipeline.
Why Similarity API Wins
1. Adaptive proprietary algorithm
Similarity API uses an internal algorithm that adapts its strategy depending on input size and structure—indexing, parallelization, and optimized data layouts—so you get top-tier fuzzy matching without designing complex systems.
2. Preprocessing as configuration, not code
Lowercasing, punctuation removal, token sorting—just toggle a boolean instead of writing preprocessing pipelines.
3. Zero infrastructure
No servers, threading, batch jobs, or memory concerns. You pass strings; the API scales.
4. Transparent pricing with a generous free tier
Process 100,000 rows for free. Pay-as-you-go and tier plans available.
Try It Yourself
Run the full benchmark yourself using the Google Colab notebook (interactive) or by cloning the GitHub repository (local development).
Google Colab Notebook
Run the benchmark instantly in your browser. No setup required—just click and execute the code cells.
Open in Colab
GitHub Repository
Clone the full source code and run the benchmark locally. Customize and extend the tests to your needs.
View on GitHub
You'll need to sign up to get a free API key to run the benchmarks.