Scaling Fuzzy Matching for Graph Clustering
I’ve been building a knowledge graph that represents hundreds of thousands of pages of content - documentation, guides, reference material - ingested and broken down into pages, sections, topics, categories, and tags. After ingestion, the graph had grown to millions of nodes. The next step was clustering: figuring out which nodes are actually talking about the same thing, or are closely related enough to be grouped together.
The clustering pipeline had three tiers of matching. Tier 1 was exact matching - normalize the names, compare strings, done. Fast, cheap, and it catches a lot. But the interesting (and painful) problem was everything Tier 1 missed: nodes that refer to the same concept but use different words, different casing conventions, abbreviations, or slightly different phrasing. Think configureS3Bucket vs. configure-s3-bucket vs. Configuring S3 Buckets. Same thing, three different strings.
That’s where fuzzy matching (Tier 2) and eventually semantic matching with embeddings (Tier 3) come in. And that’s where everything fell over.
The quadratic wall
Here’s the math that broke things. After Tier 1 exact matching cleaned up the obvious duplicates, there were still 800,000 nodes in the testing corpus that needed fuzzy comparison. A naive approach - compare every node to every other node - means roughly 800,000 * 800,000 / 2, or about 320 billion pairwise comparisons.
Even if each comparison only takes a microsecond (which is optimistic for fuzzy string metrics), that's days of single-core compute time - and realistic per-comparison costs push it into weeks. Running this on Lambda with batches of 100 nodes was already timing out at the 15-minute limit. Throwing bigger machines at it (say, Fargate) just kicks the can down the road. Double the compute and you've halved the time, but the exponent hasn't changed. Not exactly a win.
The problem isn’t that we need more compute. The problem is that O(n^2) doesn’t scale, and no amount of horizontal scaling changes the exponent. We need to reduce the search space algorithmically - to figure out which pairs are even worth comparing before we spend CPU cycles on expensive fuzzy scoring.
What blocking is and why it works
The technique that solves this is called blocking, and the concept is simple: instead of comparing every node to every other node, you group nodes into “blocks” based on some cheap-to-compute criterion, and then only compare nodes within the same block.
If you can design blocks where true matches are very likely to land in the same block, and where blocks are small relative to the full dataset, you’ve turned an O(n^2) problem into something much more manageable.
Think of it like a library. If you’re looking for a book about Python programming, you don’t walk through every shelf in the building. You go to the Computer Science section first. That’s blocking - you’ve reduced your search space from “the entire library” to “one section,” and you’re betting that the book you want is in that section.
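To make the idea concrete before getting into the real implementation, here's a toy sketch (the names and the first-letter blocking key are invented for illustration) showing how blocking shrinks the number of pairs you actually score:

```python
from collections import defaultdict
from itertools import combinations

names = ["apple", "apricot", "avocado", "banana", "blueberry", "cherry"]

# Naive: score every pair - C(6, 2) = 15 comparisons
naive = list(combinations(names, 2))

# Blocking: group by a cheap key (first letter, standing in for a real
# blocking key), then only compare within each block
blocks = defaultdict(list)
for name in names:
    blocks[name[0]].append(name)

blocked = [pair for group in blocks.values()
           for pair in combinations(group, 2)]
# 'a' block: 3 pairs, 'b' block: 1 pair, 'c' block: 0 -> 4 comparisons
```

Fifteen comparisons become four, and the gap widens fast as the dataset grows - as long as true matches tend to share a blocking key.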
The specific flavor of blocking that worked best here is token blocking with frequency filtering. Let me break that down piece by piece.
Token blocking: the core idea
Token blocking works like this: take each node’s name, break it into individual tokens (words), and build an inverted index - a mapping from each token to the set of nodes whose names contain that token.
When you want to find candidates to compare against a given node, you look up its tokens in the index and take the union of all the node sets. Those are your candidates. Any node that shares at least one meaningful token with your target is worth scoring; everything else gets skipped.
Here’s a minimal implementation:
import re
from collections import defaultdict

def tokenize(name: str) -> set[str]:
    """
    Break a node name into normalized tokens.

    Splits on camelCase boundaries, hyphens, underscores, and whitespace,
    then lowercases everything.

    'configureS3Bucket'      -> {'configure', 's3', 'bucket'}
    'configure-s3-bucket'    -> {'configure', 's3', 'bucket'}
    'Configuring S3 Buckets' -> {'configuring', 's3', 'buckets'}
    """
    # Insert a space before each uppercase letter that follows a lowercase
    # letter or digit, so 'configureS3Bucket' becomes 'configure S3 Bucket'
    spaced = re.sub(r'([a-z0-9])([A-Z])', r'\1 \2', name)
    # Split on hyphens, underscores, whitespace, and dots
    raw_tokens = re.split(r'[-_.\s]+', spaced)
    return {t.lower() for t in raw_tokens if t}
The tokenizer handles the exact naming variations that show up in this kind of graph. configureS3Bucket, configure-s3-bucket, and Configuring S3 Buckets all produce overlapping token sets - the first two match on every token, and all three share s3. They'll land in the same blocks, which means they'll be compared against each other - exactly what we want.
Now build the inverted index:
def build_token_index(
    nodes: dict[str, str]
) -> dict[str, set[str]]:
    """
    Build an inverted index from tokens to node IDs.

    Args:
        nodes: mapping of node_id -> node_name

    Returns:
        mapping of token -> set of node_ids containing that token
    """
    index: dict[str, set[str]] = defaultdict(set)
    for node_id, name in nodes.items():
        for token in tokenize(name):
            index[token].add(node_id)
    return index
And retrieve candidates:
def get_candidates(
    node_id: str,
    name: str,
    index: dict[str, set[str]]
) -> set[str]:
    """
    Find all nodes that share at least one token with the given name.
    """
    candidates: set[str] = set()
    for token in tokenize(name):
        if token in index:
            candidates.update(index[token])
    # Don't compare a node to itself
    candidates.discard(node_id)
    return candidates
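Here's the end-to-end flow on a toy node set (the tokenizer is restated inline in condensed form so the sketch runs standalone; the node names are invented):

```python
import re
from collections import defaultdict

def tokenize(name: str) -> set[str]:
    # Same logic as the tokenizer above, condensed
    spaced = re.sub(r'([a-z0-9])([A-Z])', r'\1 \2', name)
    return {t.lower() for t in re.split(r'[-_.\s]+', spaced) if t}

nodes = {
    "n1": "configureS3Bucket",
    "n2": "configure-s3-bucket",
    "n3": "Configuring S3 Buckets",
    "n4": "deleteIamRole",          # shares no tokens with n1
}

# Inverted index: token -> node IDs
index: dict[str, set[str]] = defaultdict(set)
for node_id, name in nodes.items():
    for token in tokenize(name):
        index[token].add(node_id)

# Candidates for n1: union of the blocks for 'configure', 's3', 'bucket'
candidates = set()
for token in tokenize(nodes["n1"]):
    candidates |= index.get(token, set())
candidates.discard("n1")

print(candidates)  # {'n2', 'n3'} - n4 is never scored
```

n2 matches on every token and n3 matches via s3, while n4 shares nothing and is skipped entirely - no fuzzy scoring ever touches it.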
This is already a huge improvement over brute-force pairwise comparison. But there’s a catch, and it’s a big one.
The frequency problem
Some tokens are useless for blocking because they’re everywhere. In a knowledge graph built from cloud documentation, tokens like aws, amazon, service, configure, and create appear in tens of thousands of node names. If you include service in your inverted index, then every node with “service” in its name gets compared against every other node with “service” in its name. You’ve basically recreated the quadratic problem, just scoped to a slightly smaller (but still enormous) subset.
This is the same insight behind stopwords in search engines. Words like “the”, “a”, and “is” appear in nearly every document, so including them in a search index just adds noise without helping you find anything. In our case, high-frequency domain tokens are the equivalent of stopwords.
The fix is frequency filtering: drop any token that appears in more than a certain percentage of names for a given label type. A simple cutoff works well in practice. Say you set the threshold at 0.1% of the total node count for that label. For ~800,000 concept nodes, that means any token appearing in more than ~800 names gets dropped from the index.
def build_filtered_token_index(
    nodes: dict[str, str],
    max_frequency_pct: float = 0.001
) -> dict[str, set[str]]:
    """
    Build an inverted index, but drop tokens that appear in too many names.

    Args:
        nodes: mapping of node_id -> node_name
        max_frequency_pct: maximum fraction of nodes a token can appear in
            before it's considered too common (default 0.1%)

    Returns:
        mapping of token -> set of node_ids (with high-frequency tokens removed)
    """
    # First pass: build the full index
    index: dict[str, set[str]] = defaultdict(set)
    for node_id, name in nodes.items():
        for token in tokenize(name):
            index[token].add(node_id)

    # Second pass: filter out high-frequency tokens
    max_count = max(1, int(len(nodes) * max_frequency_pct))
    filtered = {
        token: node_ids
        for token, node_ids in index.items()
        if len(node_ids) <= max_count
    }
    return filtered
With frequency filtering in place, your average block size drops dramatically - typically to somewhere between 50 and 200 nodes per token. That means each candidate lookup returns a few hundred comparisons instead of hundreds of thousands. A node with 3 tokens might have a candidate set of 400 nodes after deduplication. Run fuzzy scoring on those 400 pairs, and you’re done in milliseconds.
At that scale, even a sequential scan through all ~800,000 nodes - each comparing against a few hundred candidates - is comfortably within Lambda’s 15-minute window.
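Claims like "average block size of 50-200" are easy to verify directly. A small diagnostic helper (hypothetical, not part of the pipeline above) summarizes the block-size distribution of any index:

```python
def block_size_stats(index: dict[str, set[str]]) -> dict[str, float]:
    """Summarize block sizes to confirm filtering kept them small."""
    sizes = [len(node_ids) for node_ids in index.values()]
    return {
        "blocks": len(sizes),
        "max_size": max(sizes),
        "avg_size": sum(sizes) / len(sizes),
    }

# Toy index with two small blocks:
stats = block_size_stats({"s3": {"n1", "n2"}, "bucket": {"n1", "n2", "n3"}})
# {'blocks': 2, 'max_size': 3, 'avg_size': 2.5}
```

Running this before and after frequency filtering makes the effect of the cutoff visible: the max block size should drop from tens of thousands to a few hundred.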
A more refined approach: IDF weighting
The hard frequency cutoff is simple and usually sufficient, but there’s a more precise alternative if you need it: IDF (Inverse Document Frequency) weighting. Instead of a binary keep/drop decision, you assign each token a weight based on how rare it is. Rare tokens get high weights, common tokens get low weights, and you rank candidates by the sum of weights of their shared tokens.
import math

def build_idf_weighted_index(
    nodes: dict[str, str],
    max_frequency_pct: float = 0.001
) -> tuple[dict[str, set[str]], dict[str, float]]:
    """
    Build an inverted index with IDF weights for each token.

    Still drops tokens above the frequency threshold (they'd have near-zero
    IDF anyway), but weights the remaining tokens by informativeness.

    Returns:
        (index, idf_weights) - the inverted index and a token -> weight mapping
    """
    index: dict[str, set[str]] = defaultdict(set)
    for node_id, name in nodes.items():
        for token in tokenize(name):
            index[token].add(node_id)

    n = len(nodes)
    max_count = max(1, int(n * max_frequency_pct))
    filtered_index = {}
    idf_weights = {}
    for token, node_ids in index.items():
        if len(node_ids) <= max_count:
            filtered_index[token] = node_ids
            # IDF: log(total_nodes / num_nodes_with_token)
            # Higher value = rarer token = more informative
            idf_weights[token] = math.log(n / len(node_ids))
    return filtered_index, idf_weights
def get_ranked_candidates(
    node_id: str,
    name: str,
    index: dict[str, set[str]],
    idf_weights: dict[str, float],
    top_k: int = 200
) -> list[tuple[str, float]]:
    """
    Find candidate nodes ranked by cumulative IDF weight of shared tokens.

    Instead of returning every node that shares any token, this ranks
    candidates so the most promising matches (sharing rare, informative
    tokens) float to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for token in tokenize(name):
        if token in index:
            weight = idf_weights.get(token, 0.0)
            for candidate_id in index[token]:
                if candidate_id != node_id:
                    scores[candidate_id] += weight
    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]
IDF weighting is useful when your token frequency distribution is more of a gradient than a binary split. If you have tokens that are uncommon enough to survive the frequency cutoff but still appear in thousands of nodes, IDF weighting will deprioritize them relative to truly distinctive tokens. In practice, the hard cutoff alone worked fine for this dataset, but IDF is worth knowing about as an option.
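To get a feel for the weights (using the document's ~800K corpus size; the per-token counts are invented for illustration), a token that barely survives the 0.1% cutoff still carries noticeably less weight than a genuinely rare one:

```python
import math

n = 800_000

# A token appearing in 750 names squeaks under the ~800-name cutoff...
idf_common = math.log(n / 750)   # ~6.97
# ...but a token appearing in only 5 names carries much more weight
idf_rare = math.log(n / 5)       # ~11.98
```

So a candidate sharing one rare token outranks one sharing a single borderline-common token, which is exactly the behavior the hard cutoff can't express.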
Putting it all together: the matching pipeline
With token blocking in place, the full matching pipeline for a node looks like this:
from rapidfuzz import fuzz
from rapidfuzz.distance import JaroWinkler
A quick note on these two scoring functions, since they show up throughout the pipeline:
Jaro-Winkler is a character-level string similarity metric. It scores two strings by the number of characters they have in common (within a sliding window) and the number of transpositions among those matching characters, with a bonus for strings that share a common prefix. It returns a score between 0 (completely different) and 1 (identical). It's good at catching typos and minor character-level variations, but it penalizes length differences heavily - so "EKS" vs. "Elastic Kubernetes Service" would score very low even though they mean the same thing.
Token-set ratio (from the rapidfuzz library, a faster drop-in replacement for fuzzywuzzy) takes a different approach. It splits both strings into sets of tokens, then compares the intersection and remainder sets separately. This makes it much more forgiving of word reordering and extra words - “Configure S3 Bucket” vs. “S3 Bucket Configuration” would score high because the token overlap is strong, even though the character order is different. It returns a score between 0 and 100.
def score_pair(name_a: str, name_b: str) -> dict:
    """
    Score a candidate pair using Jaro-Winkler and token-set ratio.
    """
    jw = JaroWinkler.similarity(name_a, name_b)
    tsr = fuzz.token_set_ratio(name_a, name_b) / 100.0
    return {
        "jaro_winkler": jw,
        "token_set_ratio": tsr,
        "is_match": jw >= 0.80 and tsr >= 0.85
    }
def run_matching_pipeline(
    nodes: dict[str, str],
    max_frequency_pct: float = 0.001
) -> list[dict]:
    """
    Full pipeline: build blocked index, then score only candidate pairs.
    """
    index = build_filtered_token_index(nodes, max_frequency_pct)
    matches = []
    seen_pairs: set[tuple[str, str]] = set()

    for node_id, name in nodes.items():
        candidates = get_candidates(node_id, name, index)
        for candidate_id in candidates:
            # Avoid scoring the same pair twice: (A, B) and (B, A)
            pair = tuple(sorted([node_id, candidate_id]))
            if pair in seen_pairs:
                continue
            seen_pairs.add(pair)

            result = score_pair(name, nodes[candidate_id])
            if result["is_match"]:
                matches.append({
                    "source": node_id,
                    "target": candidate_id,
                    "jaro_winkler": result["jaro_winkler"],
                    "token_set_ratio": result["token_set_ratio"]
                })
    return matches
The key thing to notice: the expensive fuzzy scoring (score_pair) only runs on candidate pairs that share at least one informative token. Everything else is skipped. The inverted index lookup and set union are cheap - O(tokens per name * block size) - and they gate access to the expensive part.
What about stemming?
One thing the tokenizer above doesn’t do is stemming - reducing words to their root form. “Configuring” and “Configure” produce different tokens (configuring vs. configure), so they won’t generate a block overlap on that token alone. They might still land in the same block via other shared tokens (like s3 or bucket), but it’s not guaranteed.
Adding a stemmer makes blocking more aggressive - more true matches land in the same block - at the cost of some extra false positives. For this use case, a lightweight stemmer like Porter or Snowball is a good tradeoff. Both are available via NLTK:
from nltk.stem import PorterStemmer

_stemmer = PorterStemmer()

def tokenize_with_stemming(name: str) -> set[str]:
    """
    Tokenize and stem. 'Configuring' and 'Configure' both become 'configur'.
    """
    spaced = re.sub(r'([a-z0-9])([A-Z])', r'\1 \2', name)
    raw_tokens = re.split(r'[-_.\s]+', spaced)
    return {_stemmer.stem(t.lower()) for t in raw_tokens if t}
Whether you need stemming depends on how much inflectional variation your data has. In this graph, where names came from technical documentation with lots of gerund forms (“Configuring”, “Managing”, “Deploying”) alongside imperative forms (“Configure”, “Manage”, “Deploy”), stemming caught a meaningful number of extra matches.
The abbreviation edge case
There’s one class of match that token blocking handles poorly: abbreviations. “EKS” and “Elastic Kubernetes Service” share zero tokens. No amount of token overlap analysis will put them in the same block.
This is a real problem in a knowledge graph built from cloud documentation, where abbreviations are everywhere. The main line of defense is the alias table in Tier 1 - a lookup of known abbreviation-to-expansion mappings that catches these before fuzzy matching even starts.
But if abbreviations slip through the alias table, you need a separate detection path. One approach is to check whether one name looks like an acronym of the other’s tokens:
def is_acronym_of(short: str, long: str) -> bool:
    """
    Check if 'short' could be an acronym of 'long'.

    is_acronym_of('EKS', 'Elastic Kubernetes Service') -> True
    is_acronym_of('ECS', 'Elastic Container Service')  -> True
    """
    short_clean = short.strip().upper()
    long_tokens = [t for t in re.split(r'[-_.\s]+', long) if t]
    initials = ''.join(t[0].upper() for t in long_tokens)
    return short_clean == initials
This is a narrow heuristic and won’t catch every abbreviation pattern (e.g., “S3” is only the first letter plus the count), but it covers the common case. For anything more complex, you’re in Tier 3 territory - that’s where semantic embeddings shine, because a model that’s seen enough cloud documentation will learn that “EKS” and “Elastic Kubernetes Service” are the same thing, regardless of surface-level token overlap.
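The letter-plus-count pattern can also be checked directly. Here's one possible extension (a hypothetical helper, not part of the original pipeline) that treats a single letter followed by a digit as matching when the digit equals the number of tokens starting with that letter:

```python
import re

def is_counted_acronym(short: str, long: str) -> bool:
    """
    Hypothetical check for letter-plus-count abbreviations: a single
    letter followed by a digit, where the digit is the number of tokens
    in 'long' that start with that letter.

    is_counted_acronym('S3', 'Simple Storage Service') -> True
    """
    m = re.fullmatch(r'([A-Za-z])(\d+)', short.strip())
    if not m:
        return False
    letter, count = m.group(1).upper(), int(m.group(2))
    tokens = [t for t in re.split(r'[-_.\s]+', long) if t]
    return sum(t[0].upper() == letter for t in tokens) == count
```

Like the initials check, this is deliberately narrow: it only fires on the one pattern it understands and leaves everything else to later tiers.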
Where this sits in the bigger picture
Token blocking isn’t meant to be the final answer. It’s one layer in a pipeline that gets progressively more expensive and more powerful:
Tier 1 - Exact and alias matching. Normalize names, do direct string comparison and alias table lookups. Catches the easy cases. Essentially free.
Tier 2 - Token blocking + fuzzy scoring. Build the inverted index, filter high-frequency tokens, generate candidate pairs, and score with Jaro-Winkler and token-set ratio. Catches token-level variations (casing, hyphenation, inflection, word reordering). Cheap enough to run on all ~800K nodes.
Tier 3 - Semantic matching with embeddings. Generate vector embeddings for each node name (or a richer representation including context), index them with approximate nearest neighbor search (FAISS, Annoy, etc.), and find nodes that are semantically similar even when they share zero tokens. Catches paraphrases, synonyms, and abbreviations that the earlier tiers miss. More expensive to build and query, but only needs to run on the unmatched residual.
The important thing is that each tier reduces the workload for the next one. Tier 1 cleans up the obvious duplicates, so Tier 2 doesn't waste time on them. Tier 2 catches the token-level variations, so Tier 3 only needs to handle the hard semantic cases. By the time you get to embeddings, you're comparing a much smaller set of genuinely ambiguous nodes - which is good, because embedding comparison is the most expensive operation in the pipeline.
A note on the fuzzy scoring itself
The chat that prompted this post flagged something worth calling out: the dual-threshold fuzzy scoring (requiring both Jaro-Winkler >= 0.80 and token-set ratio >= 0.85) has a blind spot.
As mentioned above, Jaro-Winkler heavily penalizes length differences. It works great for typos and minor character variations, but struggles with names that are semantically identical but structurally very different. Compare “EKS” to “Elastic Kubernetes Service” - the Jaro-Winkler score there would be well under 0.80, even though they’re the same thing.
Token-set ratio is more forgiving of token reordering and extra tokens, but still relies on character overlap between tokens.
Requiring both metrics to pass means that any pair where either metric is weak gets rejected. For abbreviation-to-expansion pairs, Jaro-Winkler will almost always fail. One approach is to switch from a dual-threshold to a weighted combination - something like 0.4 * jw + 0.6 * tsr with a single threshold - which lets a strong token-set-ratio score compensate for a weak Jaro-Winkler score. Another is to detect abbreviation pairs early (using the is_acronym_of check above) and route them through a separate scoring path that doesn’t rely on character-level similarity.
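The weighted-combination idea can be sketched as a pure function (the 0.4/0.6 weights come from the paragraph above; the 0.82 threshold is an illustrative value, not a tuned one):

```python
def combined_score(jw: float, tsr: float,
                   threshold: float = 0.82) -> tuple[float, bool]:
    """Blend the two metrics into one score with a single threshold.

    Weights and threshold are illustrative, not tuned values.
    """
    score = 0.4 * jw + 0.6 * tsr
    return score, score >= threshold

# Strong token overlap can now compensate for weak character similarity:
combined_score(0.65, 0.98)   # ~0.85 -> passes
combined_score(0.55, 0.60)   # ~0.58 -> rejected
```

The first pair would have been rejected by the dual-threshold scheme (Jaro-Winkler below 0.80) despite its near-perfect token overlap; the blended score lets it through.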
The numbers
Before token blocking: ~800,000 nodes * ~800,000 nodes / 2 = roughly 320 billion comparisons. Fuzzy string metrics aren’t cheap per-call, and at that volume there’s no batch size that fits within Lambda’s 15-minute timeout while making meaningful progress through the full dataset.
After token blocking with frequency filtering at 0.1%: average block size of ~50-200 nodes per token, candidate sets of roughly 200-400 per node after deduplication. Assuming an average of ~300 candidates per node, the total unique pairs come out to approximately 800,000 * 300 / 2 = ~120 million comparisons - down from 320 billion. That’s a reduction of about 2,500x.
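The arithmetic behind those figures, as a quick sanity check using the document's own numbers:

```python
n = 800_000
naive = n * n // 2                   # ~320 billion pairwise comparisons
avg_candidates = 300                 # midpoint of the 200-400 range
blocked = n * avg_candidates // 2    # ~120 million unique pairs
reduction = naive / blocked          # ~2,667x, i.e. "about 2,500x"
```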
At a microsecond per fuzzy score (which is realistic for Jaro-Winkler + token-set ratio via rapidfuzz), 120 million comparisons takes roughly 2 minutes of CPU time. Even if you’re generous and assume 5-10 microseconds per pair to account for Python overhead and the token lookup itself, you’re still looking at 10-20 minutes total - easily parallelizable across a handful of Lambda invocations, or runnable in a single pass on a slightly beefier instance.
The inverted index itself builds in seconds and fits easily in memory. The frequency filtering pass is a single scan through the index. The whole thing is a few hundred lines of Python with no infrastructure changes needed.