By MyQuants | November 2025
Introduction
Clustering products by their descriptions is one of those fundamental problems for which there are practically no resources online, so I've written this to help out fellow lost souls.
The basic concept is simple: take a large volume of products from multiple vendors and cluster them into groups of the same or similar products based on their descriptions. It sounds easy on the face of it, but as we all know, it's never that easy.
Problem Definition
The primary problem is to identify which products across two datasets are similar or belong in the same cluster based on textual descriptions.
- High volume: Catalogs often contain tens or hundreds of thousands of SKUs.
- Structured short text: Product descriptions are usually brief and template-like.
- Deterministic results: Clustering needs to be reproducible for downstream pipelines.
- Memory efficiency: Large-scale matching must avoid memory bloat.
Conceptual Approach
The idea is simple: match products by tokenizing cleaned and refined words, put those tokens into NumPy arrays, count the number of matches inside the arrays, then take the product with the highest match value and put it in a group (a minimal sketch follows the list). The process is as follows:
- The Data: Get a large list of product descriptions.
- Text Cleaning: Normalize and remove irrelevant words.
- Tokenization: Turn each unique word into a number.
- Vocabulary Mapping: Identify unique tokens across all descriptions.
- Vectorization: Map descriptions to fixed-length NumPy vectors.
- Vector Matching: Compare vectors efficiently using NumPy.
- Thresholding: Determine cluster membership based on token match thresholds.
- Fuzzy Match: From that massively reduced list of cluster members, filter out elements with low match values.
- Cluster: The remaining elements form the cluster.
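To make that concrete, here is a minimal, self-contained sketch of the matching step using two made-up descriptions. It deliberately skips the full cleaning pipeline shown further down and exists only to show how token vectors and thresholding interact.

import numpy as np

desc_a = 'stainless steel water bottle 750ml'
desc_b = 'steel water bottle insulated'

#// simplistic cleaning for the sketch: lowercase, alphabetic words only
clean = lambda s: [w for w in s.lower().split() if w.isalpha()]
words_a, words_b = clean(desc_a), clean(desc_b)

#// vocabulary: every unique word across both descriptions gets an integer token
vocab = {w: i for i, w in enumerate(sorted(set(words_a + words_b)))}

#// vectorize: fixed-length vectors of token ids, padded with NaN
dims = 6
def to_vector(words):
    ids = [vocab[w] for w in words][:dims]
    return np.array(ids + [np.nan] * (dims - len(ids)), dtype=float)

vec_a, vec_b = to_vector(words_a), to_vector(words_b)

#// match: count shared tokens and compare against 60% of b's real tokens (minimum 2)
shared = np.isin(vec_b, vec_a).sum()
needed = max(2, 0.6 * np.count_nonzero(~np.isnan(vec_b)))
print(shared, needed, shared >= needed)  #// -> 3 2.4 True

The full implementation below does the same thing, but over whole catalogues at once and with much more aggressive cleaning.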
Comparison to Embedding-Based Methods
| Feature | Vectorized Token Approach | Transformer Embeddings |
|---|---|---|
| Determinism | ✅ | ❌ |
| Explainability | ✅ | ❌ |
| Memory Usage | Low | High |
| Speed | Very Fast | Slower |
| Semantic Awareness | Low | High |
Applications
- Marketplace Integration
- SKU Deduplication
- Variant Normalization
- Preprocessing for AI pipelines
- Cross-language SKU Matching
Full Code Implementation
For you lazy copy/pasters, below is the full codebase for efficient product matching. You can pass in two lists of description strings and get back a dictionary mapping each product in the first list to its matches in the second.
import numpy as np
import re
SIZES = [
# Basic
'extra small','xs','xxs','tiny','mini','small','sml','sm','petite',
'medium','med','m','avg','standard','regular','classic',
'large','lrg','lg','l','big','grand','xl','extra large','xxl','xxxl','oversize',
# Volume (litres, gallons, cups, etc.)
'ml','millilitre','milliliter','centilitre','cl','decilitre','dl',
'litre','liter','ltr','l','kilolitre','kl',
'gallon','gal','quart','qt','pint','pt','cup','c','fluid ounce','fl oz','oz',
'tablespoon','tbsp','teaspoon','tsp','drop','dash','splash','shot','jug','bottle', 'mtr',
# Weight (grams, kilos, pounds, etc.)
'mg','milligram','gram','g','kg','kilogram','tonne','t','ounce','oz','lb','pound',
'stone','st','hundredweight','cwt',
# Length/Dimension
'mm','millimeter','millimetre','cm','centimeter','centimetre','dm','decimeter',
'm','meter','metre','km','kilometer','kilometre','inch','in','foot','ft','yard','yd','mile','mi',
# Clothing / Fashion
'slim','slim fit','skinny','regular fit','relaxed fit','oversized','stretch',
'short','tall','petite','plus size','curvy','husky','boy’s','girl’s','men’s','women’s','unisex',
# Food/Drink Serving Sizes
'bite size','snack size','fun size','personal','single','double','triple','family size',
'party size','jumbo','super size','mega','giga','colossal','king size','queen size','length','cut',
'economy','bulk','value pack',
# Misc / Abstract Sizes
'nano','micro','mini','tiny','compact','moderate','ample','chunky','giant','immense',
'massive','enormous','huge','titanic','monumental','infinite'
]
STOPWORDS = ['ourselves', 'hers', 'between', 'yourself', 'but', 'again', 'there', 'about', 'once', 'during',
'out', 'very', 'having', 'with', 'they', 'own', 'an', 'be', 'some', 'for', 'do', 'its', 'yours',
'such', 'into', 'of', 'most', 'itself', 'other', 'off', 'is', 'am', 'or', 'who', 'as', 'from',
'him', 'each', 'the', 'themselves', 'until', 'below', 'are', 'we', 'these', 'your', 'his', 'through',
'don', 'nor', 'me', 'were', 'her', 'more', 'himself', 'this', 'down', 'should', 'our', 'their', 'while',
'above', 'both', 'up', 'to', 'ours', 'had', 'she', 'all', 'no', 'when', 'at', 'any', 'before', 'them',
'same', 'and', 'been', 'have', 'in', 'will', 'on', 'does', 'yourselves', 'then', 'that', 'because', 'what',
'over', 'why', 'so', 'can', 'did', 'not', 'now', 'under', 'he', 'you', 'herself', 'has', 'just', 'where', 'too',
'only', 'myself', 'which', 'those', 'i', 'after', 'few', 'whom', 'being', 'if', 'theirs', 'my',
'against', 'by', 'doing', 'it', 'how', 'further', 'was', 'here', 'than', 'need', 'want', 'pack', 'delivery', 'collection', 'sold']
#// use this list to prioritise colours
COLOURS = [
'white','yellow','blue','red','green','black','brown','azure','ivory','teal',
'silver','purple','navy blue','pea green','gray','orange','maroon','charcoal',
'aquamarine','coral','fuchsia','wheat','lime','crimson','khaki','hot pink',
'magenta','golden','plum','olive','cyan',
# Additional colors
'beige','lavender','mint','salmon','gold','bronze','mustard','indigo','turquoise',
'peach','rose','burgundy','emerald','jade','ruby','sapphire','amethyst','topaz',
'copper','brass','onyx','pearl','ivory','sand','sepia','ochre','taupe','mauve',
'periwinkle','chartreuse','scarlet','vermilion','cerulean','cobalt','denim',
'sky blue','baby blue','midnight blue','royal blue','steel blue','dodger blue',
'powder blue','seafoam','forest green','lime green','hunter green','kelly green',
'sage','pistachio','mint green','pine green','spring green','apple green',
'neon green','yellow green','army green','moss green','viridian','sea green',
'cyan blue','ice blue','arctic blue','glacier blue','indigo dye','prussian blue',
'ultramarine','lapis','bistre','umber','burnt sienna','burnt umber','raw sienna',
'camel','tan','desert sand','buff','ecru','linen','almond','coffee','mocha',
'espresso','mahogany','chestnut','cinnamon','ginger','hazel','caramel',
'butterscotch','lemon','banana yellow','canary yellow','goldenrod','amber',
'honey','dandelion','butter','sunflower','saffron','flax','apricot',
'pumpkin','tangerine','persimmon','rust','firebrick','brick red','blood red',
'rose red','ruby red','wine','claret','mulberry','raspberry','strawberry',
'cranberry','cherry','poppy','vermillion','persian red','alizarin','oxblood',
'dusty rose','blush pink','baby pink','bubblegum','cotton candy','rose quartz',
'orchid','heliotrope','thistle','violet','lavender blush','lilac','grape',
'iris','eggplant','boysenberry','indigo purple','electric purple','deep plum',
'moss','fern','basil','artichoke','avocado','seaweed','shamrock','parakeet',
'spearmint','jade green','chartreuse yellow','bright lime','key lime',
'neon yellow','laser lemon','citrine','opal','smoky topaz','gunmetal',
'slate gray','ash gray','pewter','zinc','nickel','lead','platinum',
'cloud','mist','storm gray','dove gray','granite','charcoal black','jet',
'ink black','obsidian','ebony','raven','shadow','coal',
# abbreviated
'blk', 'blck'
]
remove_stopwords = lambda text: ' '.join(word for word in text.split(' ') if word.strip().lower() not in STOPWORDS)
remove_sizes = lambda text: ' '.join(word for word in text.split(' ') if word.strip().lower() not in SIZES)
remove_colours = lambda text: ' '.join(word for word in text.split(' ') if word.strip().lower() not in COLOURS)
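#// quick example of the helpers above:
#//   remove_stopwords('the red ball') -> 'red ball'
#//   remove_colours('red ball') -> 'ball'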
def clean_text(description: str) -> list:
    #// lowercase & drop words that mix letters and digits (e.g. '500ml', SKU codes)
    text = re.sub(r'\b(?=\w*[a-z])(?=\w*\d)\w+\b', ' ', description.lower())
    #// remove any remaining non-letter characters
    text = re.sub(r'[^a-z\s]', ' ', text)
    #// remove stopwords, colours and sizes
    text = remove_stopwords(text)
    text = remove_colours(text)
    text = remove_sizes(text)
    #// collapse consecutive duplicate letters (and repeated spaces)
    text = re.sub(r'(.)\1+', r'\1', text)
    #// collapse multiple spaces
    text = re.sub(r'\s+', ' ', text).strip()
    #// drop short words and trailing plural 's', returning a list of key words
    text = [x.strip().rstrip('s') for x in text.split(' ') if len(x.strip()) > 2]
    return text
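#// example: clean_text('Blue Widget 500ml Pack of 6') -> ['widget']
#// ('blue' is a colour, '500ml' mixes letters and digits, 'pack' and 'of' are stopwords, '6' is dropped as a non-letter)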
def get_first_n_word_coords(description, vocab, n=10, token_size=1):
    #// map each token to its index in the vocab list, giving it a numeric token value
    get_tokens = lambda desc: [' '.join(sorted(desc[i:i + token_size])) for i in range(len(desc) - (token_size - 1))]
    description_tokens = get_tokens(clean_text(description))  #// change this to individual words at some point
    description_coords = []
    for x in description_tokens:
        if len(description_coords) >= n:  #// stop once we have n token coordinates
            break
        for idx, m in enumerate(vocab):  #// look up the token's index in the vocab
            if m == x:
                description_coords.append(idx)
                break
    description_coords += [None for _ in range(int(n))]  #// pad with None if the description is too short
    return description, description_coords[:n]
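#// example with token_size=1: assuming vocab == ['aple', 'juice', 'orange'],
#// get_first_n_word_coords('apple juice', vocab, n=4) -> ('apple juice', [0, 1, None, None])
#// (note 'apple' becomes 'aple' because clean_text collapses consecutive duplicate letters)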
def cluster_product_descriptions(descriptions_a: list = [], descriptions_b: list = [], dimensions: int = 6, token_size=1, threshold=0.6):
    raw_a = [clean_text(desc) for desc in descriptions_a]  #// cleaned key words for every description in a
    raw_b = [clean_text(desc) for desc in descriptions_b]  #// cleaned key words for every description in b
    #// build the token lists: sorted word groups of token_size (token_size=1, i.e. individual words, seems to give the best results)
    tokens_a = [' '.join(sorted(words[i:i + token_size])) for words in raw_a for i in range(len(words) - (token_size - 1))]
    tokens_b = [' '.join(sorted(words[i:i + token_size])) for words in raw_b for i in range(len(words) - (token_size - 1))]
    sorted_unique_tokens = sorted(set(tokens_a + tokens_b))  #// the shared vocabulary
    print(f' Token Count: {len(sorted_unique_tokens)}')  #// vocabulary size
    ########### MAIN VECTOR LOGIC #############
    #// map every description to a fixed-length vector of vocab indices
    full_vector_map_a = [get_first_n_word_coords(desc, sorted_unique_tokens, dimensions, token_size) for desc in descriptions_a]
    full_vector_map_b = [get_first_n_word_coords(desc, sorted_unique_tokens, dimensions, token_size) for desc in descriptions_b]
    vectors_a = np.array([n[1] for n in full_vector_map_a], dtype=float)  #// token vectors for descriptions_a (None padding becomes NaN)
    print(f'UV: {len(vectors_a)}')
    vectors_b = np.array([n[1] for n in full_vector_map_b], dtype=float)
    vectors_b_ids = [n[0] for n in full_vector_map_b]
    nonnan_b = ~np.isnan(vectors_b)
    #// for each vector in a, flag which elements of every vector in b share a token;
    #// a python-level loop over a is not faster but keeps memory usage down versus one huge broadcast
    vectmatch = np.array([np.isin(vectors_b, i) for i in vectors_a])  #// shape: (len(a), len(b), dimensions)
    #// sum over vector elements to get a per-pair match count
    match_count = np.sum(vectmatch, axis=2)  #// shape: (len(a), len(b))
    #// count non-NaN elements in each vector of b for thresholding
    non_nan_count = np.sum(nonnan_b, axis=1)  #// shape: (len(b),)
    if threshold < 1:
        #// fractional threshold: require at least threshold * (real tokens in b), with a floor of 2
        thresh = np.where(non_nan_count * threshold <= 2, 2, non_nan_count * threshold)
    else:
        #// absolute threshold: the same fixed match count for every vector in b
        thresh = np.full(non_nan_count.shape, threshold, dtype=float)
    #// broadcast the per-b threshold across vectors_a
    valid = match_count >= thresh[None, :]
    matches = {
        f"{descriptions_a[u]}": [vectors_b_ids[m] for m in np.where(valid[u])[0]]
        for u in range(vectors_a.shape[0])
        if np.any(valid[u])
    }
    ######################################################################
    #// optional post-processing: keep each matched product only in its largest cluster
    reduce_clusters = False
    if reduce_clusters:
        from collections import defaultdict
        #// step 1: track the largest group each product appears in
        product_to_group = {}
        for gid, products in matches.items():
            for p in products:
                if p not in product_to_group or len(matches[gid]) > len(matches[product_to_group[p]]):
                    product_to_group[p] = gid
        #// step 2: rebuild the final grouped structure
        final_groups = defaultdict(list)
        for product, gid in product_to_group.items():
            final_groups[gid].append(product)
        return dict(final_groups)
    return matches
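For completeness, here is how a call might look. The product strings below are made up for illustration, and what actually matches will depend on your threshold, dimensions and token_size settings.

if __name__ == '__main__':
    catalogue_a = [
        'Stainless Steel Water Bottle 750ml - Blue',
        'Ceramic Coffee Mug 350ml White',
    ]
    catalogue_b = [
        'Water Bottle Stainless Steel Insulated 750 ml Black',
        'Large Ceramic Mug for Coffee - 350ml',
        'Bamboo Chopping Board 30cm',
    ]
    matches = cluster_product_descriptions(catalogue_a, catalogue_b, dimensions=6, token_size=1, threshold=0.6)
    for product, candidates in matches.items():
        print(product, '->', candidates)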
Summary
This method appears, on the face of it, to be many times faster than looping through and calculating similarity directly. With a bit of extra post-processing using the technique outlined in our article Comparing Product Descriptions Using Python, you could easily return a list of product description matches with very little compute spend.