Battle of the Matches: A Fuzzy String Matching Benchmark
String deduplication is a core component in modern data workflows, from cleaning CRM databases and merging customer records, to normalizing product catalogs and detecting near-duplicate news articles.
In this article, we benchmark four widely used tools and APIs for fuzzy deduplication, with a special focus on threshold-based string similarity. We also highlight how the Similarity API delivers significantly better performance at scale compared to traditional libraries.
We target readers evaluating string matching tools for use cases such as:
- Entity resolution across tables or databases
- Fuzzy joins in ETL pipelines
- Product name normalization in e-commerce
- Deduplicating people names, places, or company records
- Deduplication before training ML models
Tools Compared in This Benchmark
We evaluated:
- Similarity API – A cloud-native, highly configurable, ultra-fast string similarity API
- RapidFuzz – A performant local fuzzy matching library in Python
- TheFuzz – The modern fork of FuzzyWuzzy
- python-Levenshtein – A backend library used by other fuzzy tools
Each tool was tested on the same threshold deduplication task: identify all near-duplicate pairs with a similarity score of at least 0.85 (85 on the 0–100 scale used by the local libraries).
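To make the threshold concrete, here is a quick illustration using RapidFuzz (the specific strings are just an example):

```python
from rapidfuzz import fuzz

# A single transposed character still scores well above the cutoff.
# RapidFuzz reports similarity on a 0-100 scale, so 0.85 maps to 85.
score = fuzz.ratio("Alice Johnson", "Alcie Johnson")
print(score)        # ~92.3
print(score >= 85)  # True -> counted as a near-duplicate pair
```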
Benchmark Setup and Methodology
We ran the benchmarks on synthetic data with real-world structure: thousands of strings with slight variations (e.g. typos and formatting differences). For each tool, we ran all-to-all comparisons, n(n−1)/2 pairs in total, which is roughly 50 million pairs at 10,000 strings, and kept the matches above the similarity threshold.
We tested datasets with:
- 2,000 strings
- 5,000 strings
- 10,000 strings
All local benchmarks were run on Google Colab (single CPU). The Similarity API was tested with real HTTP requests, so its timings include network latency as well as server-side processing.
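If you want to reproduce the local benchmarks, the libraries can be installed in a Colab cell like this (package names as of writing; note that `python-Levenshtein` is imported as `Levenshtein`):

```python
# Run in a Colab cell; the leading '!' shells out to pip.
!pip install rapidfuzz thefuzz python-Levenshtein requests
```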
Code Used in the Benchmark
Step 1: Generate Synthetic Data
```python
import random

def typo(s):
    """Introduce a typo by swapping two adjacent characters."""
    if len(s) < 4:
        return s
    idx = random.randint(0, len(s) - 2)
    return s[:idx] + s[idx + 1] + s[idx] + s[idx + 2:]

def generate_strings(n):
    """Sample n strings from a small set of base names, each with an adjacent-character swap."""
    base = ["Alice Johnson", "Bob Smith", "Charlie Brown", "Diana Prince", "Eve Adams"]
    return [typo(random.choice(base)) for _ in range(n)]
```
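Note that the script does not fix a random seed, so the exact strings differ between runs. If you want repeatable data, seed the generator first:

```python
random.seed(42)  # any fixed seed makes generate_strings() repeatable
sample = generate_strings(5)
print(sample)    # five typo'd variants of the base names
```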
Step 2: Benchmark Local Libraries
```python
import time
from rapidfuzz import fuzz as rf_fuzz
from thefuzz import fuzz as tf_fuzz   # TheFuzz, the maintained fork of FuzzyWuzzy
import Levenshtein as lev             # provided by the python-Levenshtein package

def threshold_matches(strings, fn, threshold=85):
    """All-to-all comparison: keep every pair scoring at or above the threshold (0-100 scale)."""
    matches = []
    for i, s1 in enumerate(strings):
        for j in range(i + 1, len(strings)):
            s2 = strings[j]
            score = fn(s1, s2)
            if score >= threshold:
                matches.append((i, j))
    return matches

def benchmark_tool(name, strings, fn, threshold=85):
    """Return the wall-clock time in seconds for one full deduplication pass."""
    start = time.time()
    _ = threshold_matches(strings, fn, threshold)
    return round(time.time() - start, 2)
```
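Before scaling up, it's worth sanity-checking the harness on a small sample (the timing here is illustrative; it will vary by machine):

```python
strings = generate_strings(100)
print("RapidFuzz on 100 strings:",
      benchmark_tool("RapidFuzz", strings, rf_fuzz.ratio), "s")
```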
Step 3: Call the Similarity API
```python
import requests
import json

API_URL = "https://api.similarity-api.com/close-match"
API_TOKEN = "eyJhbGciOiJI...JCiiqErKDqkdYjGWsMGFgb2LGUlNL1FcmGEsvKkL2Ps"  # truncated for the article

def similarity_api(strings):
    headers = {"Content-Type": "application/json"}
    data = {
        "auth_token": API_TOKEN,
        "data": strings,
        "config": {
            "similarity_threshold": 0.85,
            "remove_punctuation": False,
            "to_lowercase": False,
            "output_format": "deduped_indices",
        },
    }
    # Time the full round trip: serialization, network, and server-side matching.
    start = time.time()
    r = requests.post(API_URL, data=json.dumps(data), headers=headers)
    r.raise_for_status()
    return round(time.time() - start, 2)
```
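The function above only measures the round trip; it discards the matches themselves. If you also want the deduplication result, you can parse the response body. The sketch below assumes the API returns JSON and that `deduped_indices` yields the indices of the strings that survive deduplication; check the API docs for the exact schema:

```python
def similarity_api_result(strings, threshold=0.85):
    # Same request as above, but return the parsed payload instead of a timing.
    payload = {
        "auth_token": API_TOKEN,
        "data": strings,
        "config": {"similarity_threshold": threshold,
                   "output_format": "deduped_indices"},
    }
    r = requests.post(API_URL, data=json.dumps(payload),
                      headers={"Content-Type": "application/json"})
    r.raise_for_status()
    return r.json()  # assumed JSON; shape follows the requested output_format
```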
Step 4: Run the Experiment
```python
for size in [2000, 5000, 10000]:
    strings = generate_strings(size)
    print(f"\nBenchmarking with {size} strings:")
    print("RapidFuzz:", benchmark_tool("RapidFuzz", strings, rf_fuzz.ratio))
    print("TheFuzz:", benchmark_tool("TheFuzz", strings, tf_fuzz.ratio))
    # Convert Levenshtein edit distance into a 0-100 similarity score.
    print("Levenshtein:", benchmark_tool(
        "Levenshtein", strings,
        lambda s1, s2: int((1 - lev.distance(s1, s2) / max(len(s1), len(s2))) * 100)))
    print("Similarity API:", similarity_api(strings))
```
Results
Our benchmarks revealed significant performance differences between the tools, particularly as dataset sizes increased:
| Library | 2K Records | 5K Records | 10K Records |
|---|---|---|---|
| Similarity API | 0.91 s | 1.77 s | 3.30 s |
| RapidFuzz | 1.77 s | 12.05 s | 48.83 s |
| python-Levenshtein | 1.83 s | 12.59 s | 53.52 s |
| TheFuzz | 5.13 s | 35.65 s | 150.49 s |
Performance Visualization
The performance advantage of Similarity API is clear: on the 10,000-string dataset it is roughly 46x faster than TheFuzz (150.49 s vs 3.30 s) and roughly 15x faster than RapidFuzz (48.83 s vs 3.30 s).
Why Similarity API Wins
Similarity API is designed from the ground up for:
- Large-scale string matching workloads
- Real-time deduplication
- Multi-threaded, cloud-native performance
- Easy integration into pipelines
- Custom thresholds and scoring functions
Because it runs on managed infrastructure, you don't have to worry about memory usage, multi-threading, or slow Python loops. It just works, and it's fast.
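As a concrete illustration of pipeline integration, here is a hypothetical pandas helper built on the `similarity_api_result` sketch from Step 3. It assumes the `deduped_indices` response is a list of row indices to keep, which you should verify against the API docs:

```python
import pandas as pd

def dedupe_column(df, column):
    # Send one column to the API and keep only the surviving rows.
    keep = similarity_api_result(df[column].astype(str).tolist())
    return df.iloc[keep].reset_index(drop=True)

customers = pd.DataFrame({"name": ["Alice Johnson", "Alcie Johnson", "Bob Smith"]})
deduped = dedupe_column(customers, "name")  # expect the typo'd duplicate to drop out
```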
Try It Yourself
You can try this benchmark using Google Colab. You'll need to sign up to get an API key and access the free tier.
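One practical tip: rather than hard-coding the token as in the snippet above, keep it in an environment variable (the variable name here is just a suggestion):

```python
import os

API_TOKEN = os.environ["SIMILARITY_API_TOKEN"]  # set once in your Colab/shell environment
```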
What's Coming Next
In future posts, we'll explore:
- Accuracy benchmarks – Similarity API is not just fast; we'll measure how its configurability translates into match quality
- Threshold tuning strategies for different use cases
- A head-to-head against cosine similarity techniques
- Using string similarity in AI agent pipelines – because the future is agentic!