By MyQuants | November 2025
Introduction
Clustering products by their descriptions is one of those fundamental problems for which there are practically no resources online, so I've written this to help out fellow lost souls.
The basic concept is simple: take a large volume of products from multiple vendors and cluster them into groups of the same or similar products based on their descriptions. It sounds easy on the face of it, but as we all know, it's never that easy.
Problem Definition
The primary problem is to identify which products across two datasets are similar or belong in the same cluster based on textual descriptions.
- High volume: Catalogs often contain tens or hundreds of thousands of SKUs.
- Structured short text: Product descriptions are usually brief and template-like.
- Deterministic results: Clustering needs to be reproducible for downstream pipelines.
- Memory efficiency: Large-scale matching must avoid memory bloat.
Conceptual Approach
The idea is simple: match products by tokenizing cleaned and refined words, put those tokens into NumPy arrays, count the number of matches inside the arrays, then take the product with the highest match value and put it in a group (a minimal sketch follows the list). The process is as follows:
- The Data: Get a large list of product descriptions.
- Text Cleaning: Normalize and remove irrelevant words.
- Tokenization: Turn each unique word into a number.
- Vocabulary Mapping: Identify unique tokens across all descriptions.
- Vectorization: Map descriptions to fixed-length NumPy vectors.
- Vector Matching: Compare vectors efficiently using NumPy.
- Thresholding: Determine cluster membership based on token match thresholds.
- Fuzzy Match: From that massively reduced list of cluster members, filter out elements with low match values.
- Cluster: The remaining elements form the cluster.
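To make that concrete, here is a minimal, self-contained sketch of the matching step using two made-up descriptions. It deliberately skips the full cleaning pipeline shown further down and exists only to show how token vectors and thresholding interact.

import numpy as np

desc_a = 'stainless steel water bottle 750ml'
desc_b = 'steel water bottle insulated'

#// simplistic cleaning for the sketch: lowercase, alphabetic words only
clean = lambda s: [w for w in s.lower().split() if w.isalpha()]
words_a, words_b = clean(desc_a), clean(desc_b)

#// vocabulary: every unique word across both descriptions gets an integer token
vocab = {w: i for i, w in enumerate(sorted(set(words_a + words_b)))}

#// vectorize: fixed-length vectors of token ids, padded with NaN
dims = 6
def to_vector(words):
    ids = [vocab[w] for w in words][:dims]
    return np.array(ids + [np.nan] * (dims - len(ids)), dtype=float)

vec_a, vec_b = to_vector(words_a), to_vector(words_b)

#// match: count shared tokens and compare against 60% of b's real tokens (minimum 2)
shared = np.isin(vec_b, vec_a).sum()
needed = max(2, 0.6 * np.count_nonzero(~np.isnan(vec_b)))
print(shared, needed, shared >= needed)  #// -> 3 2.4 True

The full implementation below does the same thing, but over whole catalogues at once and with much more aggressive cleaning.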
Comparison to Embedding-Based Methods
| Feature | Vectorized Token Approach | Transformer Embeddings |
|---|---|---|
| Determinism | ✅ | ❌ |
| Explainability | ✅ | ❌ |
| Memory Usage | Low | High |
| Speed | Very Fast | Slower |
| Semantic Awareness | Low | High |
Applications
- Marketplace Integration
- SKU Deduplication
- Variant Normalization
- Preprocessing for AI pipelines
- Cross-language SKU Matching
Full Code Implementation
For you lazy copy/pasters, below is the full codebase for efficient product matching. You can pass in two lists of description strings and get back a dictionary mapping each product in the first list to its matches in the second.
import numpy as np
import re
SIZES = [
# Basic
'extra small','xs','xxs','tiny','mini','small','sml','sm','petite',
'medium','med','m','avg','standard','regular','classic',
'large','lrg','lg','l','big','grand','xl','extra large','xxl','xxxl','oversize',
# Volume (litres, gallons, cups, etc.)
'ml','millilitre','milliliter','centilitre','cl','decilitre','dl',
'litre','liter','ltr','l','kilolitre','kl',
'gallon','gal','quart','qt','pint','pt','cup','c','fluid ounce','fl oz','oz',
'tablespoon','tbsp','teaspoon','tsp','drop','dash','splash','shot','jug','bottle', 'mtr',
# Weight (grams, kilos, pounds, etc.)
'mg','milligram','gram','g','kg','kilogram','tonne','t','ounce','oz','lb','pound',
'stone','st','hundredweight','cwt',
# Length/Dimension
'mm','millimeter','millimetre','cm','centimeter','centimetre','dm','decimeter',
'm','meter','metre','km','kilometer','kilometre','inch','in','foot','ft','yard','yd','mile','mi',
# Clothing / Fashion
'slim','slim fit','skinny','regular fit','relaxed fit','oversized','stretch',
'short','tall','petite','plus size','curvy','husky','boy’s','girl’s','men’s','women’s','unisex',
# Food/Drink Serving Sizes
'bite size','snack size','fun size','personal','single','double','triple','family size',
'party size','jumbo','super size','mega','giga','colossal','king size','queen size','length','cut',
'economy','bulk','value pack',
# Misc / Abstract Sizes
'nano','micro','mini','tiny','compact','moderate','ample','chunky','giant','immense',
'massive','enormous','huge','titanic','monumental','infinite'
]
STOPWORDS = ['ourselves', 'hers', 'between', 'yourself', 'but', 'again', 'there', 'about', 'once', 'during',
'out', 'very', 'having', 'with', 'they', 'own', 'an', 'be', 'some', 'for', 'do', 'its', 'yours',
'such', 'into', 'of', 'most', 'itself', 'other', 'off', 'is', 'am', 'or', 'who', 'as', 'from',
'him', 'each', 'the', 'themselves', 'until', 'below', 'are', 'we', 'these', 'your', 'his', 'through',
'don', 'nor', 'me', 'were', 'her', 'more', 'himself', 'this', 'down', 'should', 'our', 'their', 'while',
'above', 'both', 'up', 'to', 'ours', 'had', 'she', 'all', 'no', 'when', 'at', 'any', 'before', 'them',
'same', 'and', 'been', 'have', 'in', 'will', 'on', 'does', 'yourselves', 'then', 'that', 'because', 'what',
'over', 'why', 'so', 'can', 'did', 'not', 'now', 'under', 'he', 'you', 'herself', 'has', 'just', 'where', 'too',
'only', 'myself', 'which', 'those', 'i', 'after', 'few', 'whom', 'being', 'if', 'theirs', 'my',
'against', 'by', 'doing', 'it', 'how', 'further', 'was', 'here', 'than', 'need', 'want', 'pack', 'delivery', 'collection', 'sold']
#// use this list to prioritise colours
COLOURS = [
'white','yellow','blue','red','green','black','brown','azure','ivory','teal',
'silver','purple','navy blue','pea green','gray','orange','maroon','charcoal',
'aquamarine','coral','fuchsia','wheat','lime','crimson','khaki','hot pink',
'magenta','golden','plum','olive','cyan',
# Additional colors
'beige','lavender','mint','salmon','gold','bronze','mustard','indigo','turquoise',
'peach','rose','burgundy','emerald','jade','ruby','sapphire','amethyst','topaz',
'copper','brass','onyx','pearl','ivory','sand','sepia','ochre','taupe','mauve',
'periwinkle','chartreuse','scarlet','vermilion','cerulean','cobalt','denim',
'sky blue','baby blue','midnight blue','royal blue','steel blue','dodger blue',
'powder blue','seafoam','forest green','lime green','hunter green','kelly green',
'sage','pistachio','mint green','pine green','spring green','apple green',
'neon green','yellow green','army green','moss green','viridian','sea green',
'cyan blue','ice blue','arctic blue','glacier blue','indigo dye','prussian blue',
'ultramarine','lapis','bistre','umber','burnt sienna','burnt umber','raw sienna',
'camel','tan','desert sand','buff','ecru','linen','almond','coffee','mocha',
'espresso','mahogany','chestnut','cinnamon','ginger','hazel','caramel',
'butterscotch','lemon','banana yellow','canary yellow','goldenrod','amber',
'honey','dandelion','butter','sunflower','saffron','flax','apricot',
'pumpkin','tangerine','persimmon','rust','firebrick','brick red','blood red',
'rose red','ruby red','wine','claret','mulberry','raspberry','strawberry',
'cranberry','cherry','poppy','vermillion','persian red','alizarin','oxblood',
'dusty rose','blush pink','baby pink','bubblegum','cotton candy','rose quartz',
'orchid','heliotrope','thistle','violet','lavender blush','lilac','grape',
'iris','eggplant','boysenberry','indigo purple','electric purple','deep plum',
'moss','fern','basil','artichoke','avocado','seaweed','shamrock','parakeet',
'spearmint','jade green','chartreuse yellow','bright lime','key lime',
'neon yellow','laser lemon','citrine','opal','smoky topaz','gunmetal',
'slate gray','ash gray','pewter','zinc','nickel','lead','platinum',
'cloud','mist','storm gray','dove gray','granite','charcoal black','jet',
'ink black','obsidian','ebony','raven','shadow','coal',
# abbreviated
'blk', 'blck'
]
remove_stopwords = lambda text: ' '.join(word for word in text.split(' ') if word.strip().lower() not in STOPWORDS)
remove_sizes = lambda text: ' '.join(word for word in text.split(' ') if word.strip().lower() not in SIZES)
remove_colours = lambda text: ' '.join(word for word in text.split(' ') if word.strip().lower() not in COLOURS)
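#// quick example of the helpers above:
#//   remove_stopwords('the red ball') -> 'red ball'
#//   remove_colours('red ball') -> 'ball'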
def clean_text(description: str) -> list:
    #// lowercase & drop words that mix letters and digits (e.g. '500ml', SKU codes)
    text = re.sub(r'\b(?=\w*[a-z])(?=\w*\d)\w+\b', ' ', description.lower())
    #// remove any remaining non-letter characters
    text = re.sub(r'[^a-z\s]', ' ', text)
    #// remove stopwords, colours and sizes
    text = remove_stopwords(text)
    text = remove_colours(text)
    text = remove_sizes(text)
    #// collapse consecutive duplicate letters (and repeated spaces)
    text = re.sub(r'(.)\1+', r'\1', text)
    #// collapse multiple spaces
    text = re.sub(r'\s+', ' ', text).strip()
    #// drop short words and trailing plural 's', returning a list of key words
    text = [x.strip().rstrip('s') for x in text.split(' ') if len(x.strip()) > 2]
    return text
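#// example: clean_text('Blue Widget 500ml Pack of 6') -> ['widget']
#// ('blue' is a colour, '500ml' mixes letters and digits, 'pack' and 'of' are stopwords, '6' is dropped as a non-letter)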
def get_first_n_word_coords(description, vocab, n=10, token_size=1):
    #// map each token to its index in the vocab list, giving it a numeric token value
    get_tokens = lambda desc: [' '.join(sorted(desc[i:i + token_size])) for i in range(len(desc) - (token_size - 1))]
    description_tokens = get_tokens(clean_text(description))  #// change this to individual words at some point
    description_coords = []
    for x in description_tokens:
        if len(description_coords) >= n:  #// stop once we have n token coordinates
            break
        for idx, m in enumerate(vocab):  #// look up the token's index in the vocab
            if m == x:
                description_coords.append(idx)
                break
    description_coords += [None for _ in range(int(n))]  #// pad with None if the description is too short
    return description, description_coords[:n]
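#// example with token_size=1: assuming vocab == ['aple', 'juice', 'orange'],
#// get_first_n_word_coords('apple juice', vocab, n=4) -> ('apple juice', [0, 1, None, None])
#// (note 'apple' becomes 'aple' because clean_text collapses consecutive duplicate letters)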
def cluster_product_descriptions(descriptions_a: list = [], descriptions_b: list = [], dimensions: int = 6, token_size=1, threshold=0.6):
    raw_a = [clean_text(desc) for desc in descriptions_a]  #// cleaned key words for every description in a
    raw_b = [clean_text(desc) for desc in descriptions_b]  #// cleaned key words for every description in b
    #// build the token lists: sorted word groups of token_size (token_size=1, i.e. individual words, seems to give the best results)
    tokens_a = [' '.join(sorted(words[i:i + token_size])) for words in raw_a for i in range(len(words) - (token_size - 1))]
    tokens_b = [' '.join(sorted(words[i:i + token_size])) for words in raw_b for i in range(len(words) - (token_size - 1))]
    sorted_unique_tokens = sorted(set(tokens_a + tokens_b))  #// the shared vocabulary
    print(f' Token Count: {len(sorted_unique_tokens)}')  #// vocabulary size
    ########### MAIN VECTOR LOGIC #############
    #// map every description to a fixed-length vector of vocab indices
    full_vector_map_a = [get_first_n_word_coords(desc, sorted_unique_tokens, dimensions, token_size) for desc in descriptions_a]
    full_vector_map_b = [get_first_n_word_coords(desc, sorted_unique_tokens, dimensions, token_size) for desc in descriptions_b]
    vectors_a = np.array([n[1] for n in full_vector_map_a], dtype=float)  #// token vectors for descriptions_a (None padding becomes NaN)
    print(f'UV: {len(vectors_a)}')
    vectors_b = np.array([n[1] for n in full_vector_map_b], dtype=float)
    vectors_b_ids = [n[0] for n in full_vector_map_b]
    nonnan_b = ~np.isnan(vectors_b)
    #// for each vector in a, flag which elements of every vector in b share a token;
    #// a python-level loop over a is not faster but keeps memory usage down versus one huge broadcast
    vectmatch = np.array([np.isin(vectors_b, i) for i in vectors_a])  #// shape: (len(a), len(b), dimensions)
    #// sum over vector elements to get a per-pair match count
    match_count = np.sum(vectmatch, axis=2)  #// shape: (len(a), len(b))
    #// count non-NaN elements in each vector of b for thresholding
    non_nan_count = np.sum(nonnan_b, axis=1)  #// shape: (len(b),)
    if threshold < 1:
        #// fractional threshold: require at least threshold * (real tokens in b), with a floor of 2
        thresh = np.where(non_nan_count * threshold <= 2, 2, non_nan_count * threshold)
    else:
        #// absolute threshold: the same fixed match count for every vector in b
        thresh = np.full(non_nan_count.shape, threshold, dtype=float)
    #// broadcast the per-b threshold across vectors_a
    valid = match_count >= thresh[None, :]
    matches = {
        f"{descriptions_a[u]}": [vectors_b_ids[m] for m in np.where(valid[u])[0]]
        for u in range(vectors_a.shape[0])
        if np.any(valid[u])
    }
    ######################################################################
    #// optional post-processing: keep each matched product only in its largest cluster
    reduce_clusters = False
    if reduce_clusters:
        from collections import defaultdict
        #// step 1: track the largest group each product appears in
        product_to_group = {}
        for gid, products in matches.items():
            for p in products:
                if p not in product_to_group or len(matches[gid]) > len(matches[product_to_group[p]]):
                    product_to_group[p] = gid
        #// step 2: rebuild the final grouped structure
        final_groups = defaultdict(list)
        for product, gid in product_to_group.items():
            final_groups[gid].append(product)
        return dict(final_groups)
    return matches
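For completeness, here is how a call might look. The product strings below are made up for illustration, and what actually matches will depend on your threshold, dimensions and token_size settings.

if __name__ == '__main__':
    catalogue_a = [
        'Stainless Steel Water Bottle 750ml - Blue',
        'Ceramic Coffee Mug 350ml White',
    ]
    catalogue_b = [
        'Water Bottle Stainless Steel Insulated 750 ml Black',
        'Large Ceramic Mug for Coffee - 350ml',
        'Bamboo Chopping Board 30cm',
    ]
    matches = cluster_product_descriptions(catalogue_a, catalogue_b, dimensions=6, token_size=1, threshold=0.6)
    for product, candidates in matches.items():
        print(product, '->', candidates)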
Summary
This method appears, on the face of it, to be many times faster than looping through and calculating similarity directly. With a bit of extra post-processing using the technique outlined in our article Comparing Product Descriptions Using Python, you could easily return a list of product description matches with very little compute spend.