1M-Row Fuzzy Matching Benchmark (2025): Similarity API vs RapidFuzz, TheFuzz, Levenshtein
TL;DR
When should you stop using local fuzzy matching?
- < 50k rows → local libraries are usually fine
- 50k–200k rows → slow iteration, painful tuning
- ~1M rows → local approaches become impractical
In this benchmark:
- Similarity API deduplicates 1,000,000 strings in ~7 minutes
- RapidFuzz, TheFuzz, and python-Levenshtein extrapolate to tens to hundreds of hours
- That's roughly 300×–1,000× faster compute at 1M rows (excluding implementation and maintenance time)
Want to see this on your own data?
Paste sample strings or upload a CSV (up to 100k rows free, no setup).
Why This Benchmark Matters
Fuzzy string matching is at the core of common data tasks—cleaning CRM data, merging product catalogs, reconciling records, or doing fuzzy joins inside ETL pipelines. Yet most developers still rely on local Python libraries that work great at 1k–10k records but don't scale when you hit real-world volumes.
This benchmark compares:
- Similarity API (cloud-native, adaptive matching engine)
- RapidFuzz (fast, modern C++/Python library)
- TheFuzz (FuzzyWuzzy fork)
- python-Levenshtein (core edit-distance implementation)
We test them at 10k, 100k, and 1M strings.
Data & Benchmark Setup
Environment
Tests ran in a standard Google Colab CPU environment:
- 2 vCPUs
- ~13GB RAM
- Python 3.x
Timings represent warm runs. The first API call has a small cold-start penalty, but subsequent calls match production steady-state behavior.
Synthetic Data
We generate names from a curated base list (people, companies, etc.) and apply realistic typos:
- Insertions / deletions
- Adjacent swaps
- Random character replacements
This produces realistic noisy variants such as:
- Micsrosoft Corpp
- Aplpe Inc.
- Charlle Brown
Each string gets a label based on its base name so we can run a quick accuracy sanity check.
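As an illustration, here is a minimal sketch of this kind of typo generator. The base list and error counts below are placeholders, not the exact ones used in the benchmark:

```python
import random
import string

def add_typos(text: str, n_errors: int = 1) -> str:
    """Apply random insertions, deletions, adjacent swaps, or character replacements."""
    chars = list(text)
    for _ in range(n_errors):
        op = random.choice(["insert", "delete", "swap", "replace"])
        i = random.randrange(len(chars))
        if op == "insert":
            chars.insert(i, random.choice(string.ascii_lowercase))
        elif op == "delete" and len(chars) > 1:
            del chars[i]
        elif op == "swap" and i < len(chars) - 1:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
        else:  # replace (also the fallback when the delete/swap guards fail)
            chars[i] = random.choice(string.ascii_lowercase)
    return "".join(chars)

# Placeholder base list; the benchmark uses a larger curated list.
base_names = ["Microsoft Corp", "Apple Inc.", "Charlie Brown"]
# Each noisy variant keeps its base name as a label for the accuracy sanity check.
dataset = [(add_typos(name, random.randint(1, 2)), name)
           for name in base_names for _ in range(3)]
```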
Dataset Sizes
Benchmarks run at:
- 10,000 strings
- 100,000 strings
- 1,000,000 strings
Local libraries are measured at 10k (RapidFuzz is also measured at 100k) and estimated for the larger sizes via O(N²) scaling: 10× more rows means roughly 100× more pairwise comparisons, so RapidFuzz's 13.0 s at 10k extrapolates to about 1,300 s at 100k, which closely matches its measured 1,301.8 s.
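In code, that extrapolation is just quadratic scaling from a measured run (`extrapolate_quadratic` is an illustrative helper name, not part of the benchmark repo):

```python
def extrapolate_quadratic(measured_seconds: float, measured_n: int, target_n: int) -> float:
    """Estimate an O(N^2) runtime at target_n from a run measured at measured_n."""
    return measured_seconds * (target_n / measured_n) ** 2

# RapidFuzz measured 13.0 s at 10k rows -> roughly 130,000 s estimated at 1M rows
print(extrapolate_quadratic(13.0, 10_000, 1_000_000))
```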
How Each Tool Was Used
Similarity API
A simple /dedupe call with configurable preprocessing:
POST https://api.similarity-api.com/dedupe

```json
{
  "data": [...strings...],
  "config": {
    "similarity_threshold": 0.85,
    "remove_punctuation": false,
    "to_lowercase": false,
    "use_token_sort": false,
    "output_format": "index_pairs"
  }
}
```

Changing matching behavior is just a matter of toggling config options: no preprocessing code or custom pipelines.
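For context, a request like the one above could be sent from Python roughly as follows. The `Authorization` header and the shape of the response are assumptions based on typical REST conventions; check the Similarity API docs for the exact scheme:

```python
import requests

# Hypothetical auth header; consult the Similarity API docs for the real scheme.
headers = {"Authorization": "Bearer YOUR_API_KEY"}

payload = {
    "data": ["Microsoft Corp", "Micsrosoft Corpp", "Apple Inc.", "Aplpe Inc."],
    "config": {
        "similarity_threshold": 0.85,
        "remove_punctuation": False,
        "to_lowercase": False,
        "use_token_sort": False,
        "output_format": "index_pairs",
    },
}

resp = requests.post("https://api.similarity-api.com/dedupe", json=payload, headers=headers)
resp.raise_for_status()
print(resp.json())  # expected: pairs of indices judged to be duplicates
```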
RapidFuzz
We use RapidFuzz's optimized C++ engine via its bulk scoring API (`strings` is the list being deduplicated):

```python
from rapidfuzz import fuzz, process

# Full N x N similarity matrix; workers=-1 parallelizes across all CPU cores.
scores = process.cdist(strings, strings, scorer=fuzz.ratio, workers=-1)
```
TheFuzz & python-Levenshtein
Both are used through naive Python loops, since neither offers a bulk vectorized similarity matrix; a sketch of that loop follows.
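Concretely, a minimal sketch of the pairwise loop, shown with TheFuzz (python-Levenshtein's `Levenshtein.ratio` drops into the same loop, with scores in 0–1 rather than 0–100):

```python
from thefuzz import fuzz

def naive_dedupe_pairs(strings: list[str], threshold: int = 85) -> list[tuple[int, int]]:
    """Compare every pair of strings: O(N^2) scorer calls dominate the runtime."""
    pairs = []
    for i in range(len(strings)):
        for j in range(i + 1, len(strings)):
            if fuzz.ratio(strings[i], strings[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```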
Quick Accuracy Sanity Check
Using a 2,000-string subset with known duplicate labels, we ran a lightweight sanity check:
All tools achieved very high precision at a reasonably strict threshold (everything they returned was actually a duplicate).
We also checked that the number of unique entities after deduplication is close to the ground truth, using Similarity API's deduped_indices output format and clustering the matched pairs from the local libraries (see the sketch below).
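For the local libraries, that pair-to-entity clustering step can be done with a small union-find over the matched index pairs; a minimal sketch:

```python
def count_entities(n: int, duplicate_pairs: list[tuple[int, int]]) -> int:
    """Cluster duplicate index pairs with union-find and count distinct entities."""
    parent = list(range(n))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for i, j in duplicate_pairs:
        parent[find(i)] = find(j)

    return len({find(i) for i in range(n)})

# 5 strings where (0,1) and (1,2) match -> 3 unique entities
assert count_entities(5, [(0, 1), (1, 2)]) == 3
```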
Results
Our benchmarks revealed significant performance differences between the tools, particularly as dataset sizes increased:
| Library | 10K Records | 100K Records | 1M Records |
|---|---|---|---|
| Similarity API | 0.8 s | 58.8 s | 421.8 s |
| RapidFuzz | 13.0 s | 1301.8 s | 130,180.0 s (est.) |
| python-Levenshtein | 46.8 s | 4684.9 s (est.) | 468,490.0 s (est.) |
| TheFuzz | 39.4 s | 3938.2 s (est.) | 393,820.0 s (est.) |
[Chart: Performance at 10K and 100K Rows]
[Chart: Performance Across All Dataset Sizes (10K, 100K, 1M Rows)]
A few things stand out:
- At 10k rows, Similarity API is already an order of magnitude faster than the fastest local library.
- By 100k rows, local libraries are effectively in "batch job" territory, while Similarity API is still something you can run interactively.
- At 1M rows, Similarity API finishes in about 7 minutes, while naive estimates for the local libraries are in the tens to hundreds of hours.
If you're cleaning real-world datasets or running dedupe inside production pipelines, these differences are the line between "runs during a coffee break" and "needs an overnight batch job plus a lot of custom infrastructure."
Stop building fuzzy matching pipelines
If your datasets are already in the 100k+ range, local libraries will keep slowing you down — even before accuracy becomes a problem.
Why Similarity API Wins
1. Adaptive proprietary algorithm
Similarity API uses an internal algorithm that adapts its strategy depending on input size and structure—indexing, parallelization, and optimized data layouts—so you get top-tier fuzzy matching without designing complex systems.
2. Preprocessing as configuration, not code
Lowercasing, punctuation removal, token sorting—just toggle a boolean instead of writing preprocessing pipelines.
3. Zero infrastructure
No servers, threading, batch jobs, or memory concerns. You pass strings; the API scales.
4. Transparent pricing with a generous free tier
Process 100,000 rows for free. Pay-as-you-go and tier plans available.
Try It Yourself
Run the full benchmark yourself in Google Colab (interactive) or by cloning the GitHub repository (local development).
Google Colab Notebook
Run the benchmark instantly in your browser. No setup required—just click and execute the code cells.
Open in Colab

GitHub Repository
Clone the full source code and run the benchmark locally. Customize and extend the tests to your needs.
View on GitHub

You'll need to sign up to get a free API key to run the benchmarks.