Battle of the Matches: A fuzzy string match benchmark exercise!

May 18, 2025 · 5 min read · By Similarity API Team

String deduplication is a core component of modern data workflows, from cleaning CRM databases and merging customer records to normalizing product catalogs and detecting near-duplicate news articles.

In this article, we benchmark four widely used tools and APIs for fuzzy deduplication, with a special focus on threshold-based string similarity. We also highlight how the Similarity API delivers significantly better performance at scale compared to traditional libraries.

We target readers evaluating string matching tools for use cases such as:

  • Entity resolution across tables or databases
  • Fuzzy joins in ETL pipelines
  • Product name normalization in e-commerce
  • Deduplicating people names, places, or company records
  • Deduplication before training ML models

Tools Compared in This Benchmark

We evaluated:

  • Similarity API – A cloud-native, highly configurable, ultra-fast string similarity API
  • RapidFuzz – A performant local fuzzy matching library in Python
  • TheFuzz – The modern fork of FuzzyWuzzy
  • python-Levenshtein – A backend library used by other fuzzy tools

Each was tested using a threshold deduplication task: identify all near-duplicate pairs above a similarity score of 0.85.
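To make the task concrete, here is a minimal sketch of threshold matching using only Python's standard-library difflib (not one of the benchmarked tools; its ratio is on a 0–1 scale):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # SequenceMatcher.ratio() returns a score in [0, 1]; 1.0 means identical.
    return SequenceMatcher(None, a, b).ratio()

# A single adjacent-character transposition stays above the 0.85 threshold...
print(similarity("Alice Johnson", "Alice Jhonson"))
# ...while unrelated names fall well below it.
print(similarity("Alice Johnson", "Bob Smith"))
```

Any scorer that maps a pair of strings to a number works the same way; the benchmarked libraries differ mainly in how fast they compute that score.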

Benchmark Setup and Methodology

We ran the benchmarks on synthetic data with real-world structure: thousands of strings with slight variations (e.g. typos, formatting). For each tool, we ran all-to-all comparisons and filtered matches above the similarity threshold.

We tested datasets with:

  • 2,000 strings
  • 5,000 strings
  • 10,000 strings

All local benchmarks were run on Google Colab (single CPU). The Similarity API was tested by making actual HTTP requests.

Code Used in the Benchmark

Step 1: Generate Synthetic Data

import random 

def typo(s):
    # Swap two adjacent characters at a random position to simulate a typo.
    if len(s) < 4:
        return s
    idx = random.randint(0, len(s) - 2)
    return s[:idx] + s[idx + 1] + s[idx] + s[idx + 2:]

def generate_strings(n):
    # Sample n strings from a small pool of base names, each with one typo.
    base = ["Alice Johnson", "Bob Smith", "Charlie Brown", "Diana Prince", "Eve Adams"]
    return [typo(random.choice(base)) for _ in range(n)]
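A quick sanity check on the generator: since typo() only swaps two adjacent characters, every corrupted string keeps the length and character multiset of its base name. (The seed below is our own choice for reproducibility, not part of the benchmark; the helper is repeated so the snippet runs on its own.)

```python
import random

def typo(s):
    # Swap two adjacent characters at a random position (as in Step 1).
    if len(s) < 4:
        return s
    idx = random.randint(0, len(s) - 2)
    return s[:idx] + s[idx + 1] + s[idx] + s[idx + 2:]

random.seed(42)  # arbitrary seed, chosen only for reproducibility
corrupted = typo("Alice Johnson")

# The swap preserves both the length and the multiset of characters.
assert len(corrupted) == len("Alice Johnson")
assert sorted(corrupted) == sorted("Alice Johnson")
print(corrupted)
```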

Step 2: Benchmark Local Libraries

import time
from rapidfuzz import fuzz as rf_fuzz
from thefuzz import fuzz as fw_fuzz  # TheFuzz, the maintained fork of FuzzyWuzzy
import Levenshtein as lev

def threshold_matches(strings, fn, threshold=85):
    # All-to-all comparison: O(n^2) scorer calls.
    matches = []
    for i, s1 in enumerate(strings):
        for j in range(i + 1, len(strings)):
            s2 = strings[j]
            score = fn(s1, s2)
            if score >= threshold:
                matches.append((i, j))
    return matches

def benchmark_tool(name, strings, fn, threshold=85):
    # Time a full threshold-deduplication pass. Scores are on a 0-100
    # scale, so 85 corresponds to the 0.85 threshold used throughout.
    start = time.time()
    _ = threshold_matches(strings, fn, threshold)
    return round(time.time() - start, 2)
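To try the loop without installing any of the libraries, a standard-library scorer (difflib, rescaled to the 0–100 range that RapidFuzz and TheFuzz use) plugs straight into the same interface. The helper is repeated here so the snippet is self-contained:

```python
from difflib import SequenceMatcher

def threshold_matches(strings, fn, threshold=85):
    # All-to-all comparison, as in Step 2: O(n^2) scorer calls.
    matches = []
    for i, s1 in enumerate(strings):
        for j in range(i + 1, len(strings)):
            if fn(s1, strings[j]) >= threshold:
                matches.append((i, j))
    return matches

def difflib_ratio(a, b):
    # Rescale difflib's [0, 1] ratio to 0-100, matching fuzz.ratio().
    return SequenceMatcher(None, a, b).ratio() * 100

strings = ["Alice Johnson", "Alice Jhonson", "Bob Smith"]
print(threshold_matches(strings, difflib_ratio))  # -> [(0, 1)]
```

difflib is considerably slower than any of the benchmarked tools, so treat this as a correctness check rather than a fifth contender.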

Step 3: Call the Similarity API

import requests

def similarity_api(strings):
    API_URL = "https://api.similarity-api.com/close-match"
    API_TOKEN = "eyJhbGciOiJI...JCiiqErKDqkdYjGWsMGFgb2LGUlNL1FcmGEsvKkL2Ps"
    data = {
        "auth_token": API_TOKEN,
        "data": strings,
        "config": {
            "similarity_threshold": 0.85,
            "remove_punctuation": False,
            "to_lowercase": False,
            "output_format": "deduped_indices",
        },
    }
    start = time.time()
    # Passing json= lets requests serialize the payload and set the
    # Content-Type header automatically.
    r = requests.post(API_URL, json=data)
    r.raise_for_status()
    return round(time.time() - start, 2)

Step 4: Running the Experiment!

for size in [2000, 5000, 10000]:
    strings = generate_strings(size)
    print(f"\nBenchmarking with {size} strings:")

    # Normalize Levenshtein edit distance to a 0-100 similarity score.
    lev_ratio = lambda s1, s2: int((1 - lev.distance(s1, s2) / max(len(s1), len(s2))) * 100)

    print("RapidFuzz:", benchmark_tool("RapidFuzz", strings, rf_fuzz.ratio))
    print("TheFuzz:", benchmark_tool("TheFuzz", strings, fw_fuzz.ratio))
    print("Levenshtein:", benchmark_tool("Levenshtein", strings, lev_ratio))
    print("Similarity API:", similarity_api(strings))

Results

Our benchmarks revealed significant performance differences between the tools, particularly as dataset sizes increased:

Performance Results (execution time in seconds)
Library              2K Records   5K Records   10K Records
Similarity API       0.91 s       1.77 s       3.3 s
RapidFuzz            1.77 s       12.05 s      48.83 s
python-Levenshtein   1.83 s       12.59 s      53.52 s
TheFuzz              5.13 s       35.65 s      150.49 s
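The headline speedups follow directly from the 10,000-record column; a quick check of the arithmetic:

```python
# Execution times in seconds for the 10,000-string run, taken from the table.
api, rapidfuzz, thefuzz = 3.3, 48.83, 150.49

print(f"vs RapidFuzz: {rapidfuzz / api:.1f}x")  # 14.8x
print(f"vs TheFuzz:   {thefuzz / api:.1f}x")    # 45.6x
```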

Performance Visualization

The performance advantage of Similarity API is clear: on the 10,000-string dataset it is almost 46x faster than TheFuzz and almost 15x faster than RapidFuzz!

Why Similarity API Wins

Similarity API is designed from the ground up for:

  • Large-scale string matching workloads
  • Real-time deduplication
  • Multi-threaded, cloud-native performance
  • Easy integration into pipelines
  • Custom thresholds and scoring functions

Because it runs on managed infrastructure, you avoid worrying about memory usage, multi-threading, or slow Python loops. It just works — fast.

Try It Yourself

You can try this benchmark using Google Colab. You'll need to sign up to get an API key and access the free tier.

What's Coming Next

In future posts, we'll explore:

  • Accuracy benchmarks – Similarity API is not just fast, it's also highly configurable for maximum accuracy in any use case
  • Threshold tuning strategies for different use cases
  • A head-to-head benchmark against cosine similarity approaches
  • Using string similarity in AI agent pipelines – because the future is agentic!