1M-Row Fuzzy Matching Benchmark (2025): Similarity API vs RapidFuzz, TheFuzz, Levenshtein

November 19, 2025 · 5 min read · By Similarity API Team

TL;DR

  • Similarity API is 12× faster at 10k rows, ~20× faster at 100k rows, and 300×–1,000× faster at 1M rows compared to RapidFuzz, TheFuzz, and python‑Levenshtein.
  • Local Python fuzzy matching libraries scale as O(N²) and become infeasible beyond ~50k–100k strings without heavy custom engineering.
  • Similarity API handled 1,000,000-string deduplication in a single warm API call on a basic 2‑CPU machine.
  • Preprocessing and matching behavior are configured through simple API flags—no custom pipelines, blocking rules, or indexing layers.
  • Full benchmark code + reproducible Colab notebook available.

Why This Benchmark Matters

Fuzzy string matching is at the core of common data tasks—cleaning CRM data, merging product catalogs, reconciling records, or doing fuzzy joins inside ETL pipelines. Yet most developers still rely on local Python libraries that work great at 1k–10k records but don't scale when you hit real-world volumes.

This benchmark compares:

  • Similarity API (cloud-native, adaptive matching engine)
  • RapidFuzz (fast, modern C++/Python library)
  • TheFuzz (FuzzyWuzzy fork)
  • python-Levenshtein (core edit-distance implementation)

We test them at 10k, 100k, and 1M strings.

Data & Benchmark Setup

Environment

Tests ran in a standard Google Colab CPU environment:

  • 2 vCPUs
  • ~13GB RAM
  • Python 3.x

Timings represent warm runs. The first API call has a small cold-start penalty, but subsequent calls match production steady-state behavior.

Synthetic Data

We generate names from a curated base list (people, companies, etc.) and apply realistic typos:

  • Insertions / deletions
  • Adjacent swaps
  • Random character replacements

This produces realistic noisy variants such as:

  • Micsrosoft Corpp
  • Aplpe Inc.
  • Charlle Brown

Each string gets a label based on its base name so we can run a quick accuracy sanity check.
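
For reference, here is a minimal sketch of this kind of typo injection. It is illustrative only; the helper name add_typo and the single-edit-per-string behavior are our own simplifications, not the exact benchmark code.

import random

def add_typo(name: str) -> str:
    # Apply one random edit: insertion, deletion, adjacent swap, or replacement
    if not name:
        return name
    letters = "abcdefghijklmnopqrstuvwxyz"
    i = random.randrange(len(name))
    op = random.choice(["insert", "delete", "swap", "replace"])
    if op == "insert":
        return name[:i] + random.choice(letters) + name[i:]
    if op == "delete":
        return name[:i] + name[i + 1:]
    if op == "swap" and i < len(name) - 1:
        return name[:i] + name[i + 1] + name[i] + name[i + 2:]
    return name[:i] + random.choice(letters) + name[i + 1:]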

Dataset Sizes

Benchmarks run at:

  • 10,000 strings
  • 100,000 strings
  • 1,000,000 strings

Local libraries are measured directly at 10k (RapidFuzz also at 100k); the larger sizes are estimated by extrapolating with O(N²) scaling.
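
That extrapolation is just quadratic scaling from the largest measured run. A short sketch (our own helper, not library code):

def estimate_runtime(measured_seconds: float, measured_n: int, target_n: int) -> float:
    # Pairwise comparison is O(N^2), so runtime grows with the square of the input size
    return measured_seconds * (target_n / measured_n) ** 2

# e.g. RapidFuzz measured at 100k rows, extrapolated to 1M rows
estimate_runtime(1301.8, 100_000, 1_000_000)  # ~130,180 s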

How Each Tool Was Used

Similarity API

A simple /dedupe call with configurable preprocessing:

POST https://api.similarity-api.com/dedupe
{
  "data": [...strings...],
  "config": {
    "similarity_threshold": 0.85,
    "remove_punctuation": false,
    "to_lowercase": false,
    "use_token_sort": false,
    "output_format": "index_pairs"
  }
}

Changing matching behavior is simply toggling config options—no preprocessing code or custom pipelines.
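
For context, calling this from Python takes a few lines with requests. This is a hypothetical client sketch: the endpoint and config fields are the ones shown above, but the auth header name and the exact response shape are assumptions, so check the docs for the real contract.

import requests

strings = ["Micsrosoft Corpp", "Microsoft Corp", "Aplpe Inc.", "Apple Inc."]

# Hypothetical sketch: "X-API-Key" and the JSON response shape are assumptions.
resp = requests.post(
    "https://api.similarity-api.com/dedupe",
    headers={"X-API-Key": "YOUR_API_KEY"},
    json={
        "data": strings,
        "config": {
            "similarity_threshold": 0.85,
            "remove_punctuation": False,
            "to_lowercase": False,
            "use_token_sort": False,
            "output_format": "index_pairs",
        },
    },
    timeout=600,
)
resp.raise_for_status()
matches = resp.json()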

RapidFuzz

We use RapidFuzz's optimized C++ engine:

from rapidfuzz import process, fuzz
scores = process.cdist(strings, strings, scorer=fuzz.ratio, workers=-1)

TheFuzz & python-Levenshtein

Used through naive Python loops, as they do not offer a bulk vectorized similarity matrix.
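
As a rough illustration, the comparison loop looks like the sketch below (our own helper; the threshold is on thefuzz's 0–100 scale, while python-Levenshtein's Levenshtein.ratio returns values between 0 and 1):

from thefuzz import fuzz  # for python-Levenshtein: import Levenshtein and use Levenshtein.ratio

def naive_duplicate_pairs(strings, threshold=85):
    # Compare every unique pair: O(N^2) scorer calls, which is what kills scaling
    pairs = []
    for i in range(len(strings)):
        for j in range(i + 1, len(strings)):
            if fuzz.ratio(strings[i], strings[j]) >= threshold:
                pairs.append((i, j))
    return pairs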

Quick Accuracy Sanity Check

Using a 2,000-string subset with known duplicate labels, we ran a lightweight sanity check:

All tools achieved very high precision at a reasonably strict threshold (everything they returned was actually a duplicate).

We also checked that the number of unique entities after deduplication is close to the ground truth (using Similarity API's deduped_indices output format and clustering the pairs from the local libraries).
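
For the local libraries, turning matched index pairs into an entity count is a small union-find exercise. A sketch of our own helper (not part of any of the libraries):

def count_entities(n: int, pairs: list[tuple[int, int]]) -> int:
    # Union-find over matched index pairs; each resulting root is one entity
    parent = list(range(n))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in pairs:
        parent[find(i)] = find(j)

    return len({find(i) for i in range(n)})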

Results

Our benchmarks revealed significant performance differences between the tools, particularly as dataset sizes increased:

Library              10K Records   100K Records       1M Records
Similarity API       0.8 s         58.8 s             421.8 s
RapidFuzz            13.0 s        1301.8 s           130,180.0 s (est.)
python-Levenshtein   46.8 s        4684.9 s (est.)    468,490.0 s (est.)
TheFuzz              39.4 s        3938.2 s (est.)    393,820.0 s (est.)

[Chart: Performance at 10K and 100K Rows]

[Chart: Performance Across All Dataset Sizes (10K, 100K, 1M Rows)]

A few things stand out:

  • At 10k rows, Similarity API is already an order of magnitude faster than the fastest local library.
  • By 100k rows, local libraries are effectively in "batch job" territory, while Similarity API is still something you can run interactively.
  • At 1M rows, Similarity API finishes in about 7 minutes, while naive estimates for the local libraries are in the tens to hundreds of hours.

If you're cleaning real-world datasets or running dedupe inside production pipelines, these differences are the line between "runs during a coffee break" and "needs an overnight batch job plus a lot of custom infrastructure."

Get a free API key and plug your own data into the benchmark, or drop Similarity API directly into your ETL / data quality pipeline.

Why Similarity API Wins

1. Adaptive proprietary algorithm

Similarity API uses an internal algorithm that adapts its strategy depending on input size and structure—indexing, parallelization, and optimized data layouts—so you get top-tier fuzzy matching without designing complex systems.

2. Preprocessing as configuration, not code

Lowercasing, punctuation removal, token sorting—just toggle a boolean instead of writing preprocessing pipelines.

3. Zero infrastructure

No servers, threading, batch jobs, or memory concerns. You pass strings; the API scales.

4. Transparent pricing with a generous free tier

Process 100,000 rows for free. Pay-as-you-go and tiered plans are available.

See the full docs to explore the /dedupe and /reconcile endpoints.

Try It Yourself

Run the full benchmark yourself using either the Google Colab notebook (interactive) or the GitHub repository (local development).

Google Colab Notebook

Run the benchmark instantly in your browser. No setup required—just click and execute the code cells.

Open in Colab

GitHub Repository

Clone the full source code and run the benchmark locally. Customize and extend the tests to your needs.

View on GitHub

You'll need to sign up to get a free API key to run the benchmarks.

FAQ