Product Description Similarity: A Practical Approach
Matching product descriptions is a common yet tricky problem in e-commerce, inventory management, and catalog deduplication. While classical string matching techniques exist, they often fall short when faced with real-world product descriptions.
In this article, we’ll explore why traditional approaches struggle and present a more robust method for comparing product descriptions. We’ll also provide Python code snippets to implement these methods.
Why Traditional String Matching Fails
Commonly used, easily accessible string similarity techniques include:
- Levenshtein distance
- Jaro-Winkler
- N-grams
These methods work well for general text comparison, but product descriptions pose unique challenges: attributes such as size, shape, and colour carry outsized importance. Consider these six product descriptions:
Black Square angle bracket 2.2mm
Black Round Angle Bracket 2.2mm.
Forkstore black Megasquare angled bracket 2.2mm
Forkstore Rounded angled bracket, black, 22mm
Forkstore angled black pipe coupler 22mm.
Blue Round Angle Bracket 23mm.
Let's write a quick example that calculates the Levenshtein distance for each string. The Levenshtein distance measures the minimum number of single-character edits (insertions, deletions, and substitutions) required to transform one string into another. Here's a Python implementation:
def levenshtein(a: str, b: str) -> int:
    a, b = a.lower(), b.lower()
    n = len(b)
    dp = list(range(n + 1))  # distances for the previous row
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            cur = prev if ca == cb else prev + 1
            cur = min(cur, dp[j] + 1, dp[j - 1] + 1)
            prev, dp[j] = dp[j], cur
    return dp[n]

def similarity_ratio(a: str, b: str) -> float:
    d = levenshtein(a, b)
    return 1 - (d / max(len(a), len(b)))

descriptions = [
    "Black Square angle bracket 2.2mm",
    "Black Round Angle Bracket 2.2mm.",
    "Forkstore black Megasquare angled bracket 2.2mm",
    "Forkstore Rounded angled bracket, black, 22mm",
    "Forkstore angled black pipe coupler 22mm.",
    "Blue Round Angle Bracket 23mm."
]

similarities = [[f"{x} => {y} = {round(similarity_ratio(x, y), 5)}" for y in descriptions] for x in descriptions]
for row in similarities:
    for item in row:
        print(item)
When we compare the first description against the others, we get this result:
Black Square angle bracket 2.2mm => Black Square angle bracket 2.2mm = 1.0
Black Square angle bracket 2.2mm => Black Round Angle Bracket 2.2mm. = 0.8125
Black Square angle bracket 2.2mm => Forkstore black Megasquare angled bracket 2.2mm = 0.68085
Black Square angle bracket 2.2mm => Forkstore Rounded angled bracket, black, 22mm = 0.46667
Black Square angle bracket 2.2mm => Forkstore angled black pipe coupler 22mm. = 0.41463
Black Square angle bracket 2.2mm => Blue Round Angle Bracket 23mm. = 0.65625
While the ranking is roughly sensible, it ignores critical attributes such as colours, dimensions, and SKU identifiers, so the closest-scoring product is actually the wrong one: the SKU we want is square, not round. Traditional methods do not understand this context.
So here is a potential solution for our problems. It's not complete, but it's a start.
Improving Ratios
Let's first choose our baseline similarity calculation. Levenshtein is OK, but in testing I've found one of the most effective methods is the n-gram ratio: take each word in a sentence and break it into overlapping substrings of length n, most commonly three-letter combinations (trigrams), with words of three letters or fewer passed through whole. For example:
"this is a cat" -> ['thi', 'his', 'is', 'a', 'cat']
N-grams are great because they don't care much about word order or exact matches, and elements like a trailing "s" (for plurals) matter far less to an n-gram overlap than they do to Levenshtein distance.
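To make the idea concrete, here is a minimal sketch of word-wise trigram extraction. (`word_trigrams` is a simplified stand-alone helper for illustration, not the weighted version used in the full implementation below.)

```python
def word_trigrams(text: str, n: int = 3) -> list:
    # Split into words; words of n letters or fewer pass through whole,
    # longer words are broken into overlapping n-letter grams.
    grams = []
    for word in text.lower().split():
        if len(word) <= n:
            grams.append(word)
        else:
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams

print(word_trigrams("bracket"))   # → ['bra', 'rac', 'ack', 'cke', 'ket']
print(word_trigrams("brackets"))  # → ['bra', 'rac', 'ack', 'cke', 'ket', 'ets']
```

Note how the plural "brackets" only adds a single extra gram, so the two spellings still overlap heavily.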
Another major factor we'll consider is string-length difference: comparing two products like Mastercrete 25kg Cement and Blue Circle Mastercrete 25kg Plastic Bag Cement Sold Bagged using standard techniques returns an inherently low ratio simply because the strings differ in length. To resolve this, we'll take the shorter string as the base, then measure how many of its components appear in the longer one.
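A minimal sketch of that containment idea, using whole words instead of n-grams for brevity (`containment_ratio` is a hypothetical helper, not part of the final implementation):

```python
def containment_ratio(short_grams: list, long_grams: list) -> float:
    # Fraction of the shorter string's components found in the longer one,
    # so a long, verbose title doesn't drag the score down.
    if not short_grams:
        return 0.0
    matched = [g for g in short_grams if g in long_grams]
    return len(matched) / len(short_grams)

a = "mastercrete 25kg cement".lower().split()
b = "blue circle mastercrete 25kg plastic bag cement".lower().split()
print(containment_ratio(a, b))  # → 1.0 — every word of the short title appears in the long one
```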
For some extra zest, we can weight prefixes by doubling the number of times they appear in the extraction.
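As a sketch, one simple way to implement that prefix weighting (`weighted_trigrams` is a hypothetical helper; the full implementation below folds the same idea into its n-gram routine):

```python
def weighted_trigrams(word: str, n: int = 3) -> list:
    # Break a word into overlapping n-letter grams.
    grams = [word[i:i + n] for i in range(len(word) - n + 1)]
    if grams:
        grams.insert(0, grams[0])  # duplicate the prefix gram to double its weight
    return grams

print(weighted_trigrams("cement"))  # → ['cem', 'cem', 'eme', 'men', 'ent']
```

Because the prefix gram now appears twice, two words sharing an opening match more strongly than two words sharing only a middle fragment.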
Additionally, we can further improve accuracy by cleaning the input first: lowercasing it and removing stopwords like "for," "and," and "of" to reduce noise.
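A minimal sketch of that cleaning step (the stopword set here is deliberately abbreviated; the full implementation below carries a much longer list):

```python
STOPWORDS = {"for", "and", "of", "the", "with", "a"}

def clean(text: str) -> str:
    # Lowercase and drop stopwords so they can't inflate the gram overlap.
    return " ".join(w for w in text.lower().split() if w not in STOPWORDS)

print(clean("Bracket for the Wall"))  # → 'bracket wall'
```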
On top of that, let's add a method that enforces colour matching and dimension matching, using a list of colours and some regex for dimensions. That way, 2.2mm and 22mm will never match, and Pipe Blue and Pipe Red will never match either.
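As a sketch of that gating idea, with a deliberately tiny colour list and a simplified dimension regex (`hard_filters_pass` is hypothetical; the full implementation below is far more thorough):

```python
import re

COLOURS = {"black", "blue", "red", "white"}
# Capture the number in front of a small set of units, e.g. "2.2mm", "25kg".
DIM_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(?:mm|cm|m|kg)\b")

def hard_filters_pass(search: str, sku: str) -> bool:
    # Dimensions: every dimension in the search must appear in the SKU.
    search_dims = set(DIM_RE.findall(search.lower()))
    sku_dims = set(DIM_RE.findall(sku.lower()))
    if search_dims and not search_dims <= sku_dims:
        return False
    # Colours: every colour asked for must appear in the SKU.
    search_cols = {w for w in search.lower().split() if w in COLOURS}
    sku_cols = {w for w in sku.lower().split() if w in COLOURS}
    return search_cols <= sku_cols

print(hard_filters_pass("Pipe Blue 2.2mm", "Blue Pipe 2.2mm"))  # → True
print(hard_filters_pass("Pipe Blue 2.2mm", "Red Pipe 22mm"))    # → False
```

Products that fail either gate get a similarity of zero regardless of how well their n-grams overlap.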
Combining all of this, we get the below code:
import re

def Description_Similarity(search_term: str, sku_name: str, match_numbers: bool = True) -> float:

    def get_ngrams(string: str) -> list:
        ngram_list = []
        for word in string.split():  # split the string into words
            if len(word) <= n:
                # short words are already gram-sized; add them twice so a
                # whole-word match carries weight comparable to longer words
                ngram_list.append(word)
                ngram_list.append(word)
            else:
                for idx in range(len(word)):
                    val = word[idx:idx + n]  # slide an n-letter window along the word
                    if len(val) < n:  # terminating fragment shorter than n, skip it
                        continue
                    ngram_list.append(val)
                    if idx == 0:  # weight prefixes by doubling their entries
                        ngram_list.append(val)
        return ngram_list

    STOPWORDS = ['ourselves', 'hers', 'between', 'yourself', 'but', 'again', 'there', 'about', 'once', 'during',
                 'out', 'very', 'having', 'with', 'they', 'own', 'an', 'be', 'some', 'for', 'do', 'its', 'yours',
                 'such', 'into', 'of', 'most', 'itself', 'other', 'off', 'is', 'am', 'or', 'who', 'as', 'from',
                 'him', 'each', 'the', 'themselves', 'until', 'below', 'are', 'we', 'these', 'your', 'his', 'through',
                 'don', 'nor', 'me', 'were', 'her', 'more', 'himself', 'this', 'down', 'should', 'our', 'their', 'while',
                 'above', 'both', 'up', 'to', 'ours', 'had', 'she', 'all', 'no', 'when', 'at', 'any', 'before', 'them',
                 'same', 'and', 'been', 'have', 'in', 'will', 'on', 'does', 'yourselves', 'then', 'that', 'because', 'what',
                 'over', 'why', 'so', 'can', 'did', 'not', 'now', 'under', 'he', 'you', 'herself', 'has', 'just', 'where', 'too',
                 'only', 'myself', 'which', 'those', 'i', 'after', 'few', 'whom', 'being', 'if', 'theirs', 'my',
                 'against', 'by', 'doing', 'it', 'how', 'further', 'was', 'here', 'than', 'need', 'want', 'pack', 'delivery', 'collection', 'sold']

    # use this list to prioritise colour words
    colours = [
        'white','yellow','blue','red','green','black','brown','azure','ivory','teal',
        'silver','purple','navy blue','pea green','gray','orange','maroon','charcoal',
        'aquamarine','coral','fuchsia','wheat','lime','crimson','khaki','hot pink',
        'magenta','golden','plum','olive','cyan',
        # additional colours
        'beige','lavender','mint','salmon','gold','bronze','mustard','indigo','turquoise',
        'peach','rose','burgundy','emerald','jade','ruby','sapphire','amethyst','topaz',
        'copper','brass','onyx','pearl','ivory','sand','sepia','ochre','taupe','mauve',
        'periwinkle','chartreuse','scarlet','vermilion','cerulean','cobalt','denim',
        'sky blue','baby blue','midnight blue','royal blue','steel blue','dodger blue',
        'powder blue','seafoam','forest green','lime green','hunter green','kelly green',
        'sage','pistachio','mint green','pine green','spring green','apple green',
        'neon green','yellow green','army green','moss green','viridian','sea green',
        'cyan blue','ice blue','arctic blue','glacier blue','indigo dye','prussian blue',
        'ultramarine','lapis','bistre','umber','burnt sienna','burnt umber','raw sienna',
        'camel','tan','desert sand','buff','ecru','linen','almond','coffee','mocha',
        'espresso','mahogany','chestnut','cinnamon','ginger','hazel','caramel',
        'butterscotch','lemon','banana yellow','canary yellow','goldenrod','amber',
        'honey','dandelion','butter','sunflower','saffron','flax','apricot',
        'pumpkin','tangerine','persimmon','rust','firebrick','brick red','blood red',
        'rose red','ruby red','wine','claret','mulberry','raspberry','strawberry',
        'cranberry','cherry','poppy','vermillion','persian red','alizarin','oxblood',
        'dusty rose','blush pink','baby pink','bubblegum','cotton candy','rose quartz',
        'orchid','heliotrope','thistle','violet','lavender blush','lilac','grape',
        'iris','eggplant','boysenberry','indigo purple','electric purple','deep plum',
        'moss','fern','basil','artichoke','avocado','seaweed','shamrock','parakeet',
        'spearmint','jade green','chartreuse yellow','bright lime','key lime',
        'neon yellow','laser lemon','citrine','opal','smoky topaz','gunmetal',
        'slate gray','ash gray','pewter','zinc','nickel','lead','platinum',
        'cloud','mist','storm gray','dove gray','granite','charcoal black','jet',
        'ink black','obsidian','ebony','raven','shadow','coal',
        # abbreviated
        'blk', 'blck'
    ]

    n = 3  # n-gram length
    remove_stopwords = lambda text: ' '.join(word for word in text.split(' ') if word.strip().lower() not in STOPWORDS)

    ngram_search = get_ngrams(remove_stopwords(search_term).lower())
    ngram_sku = get_ngrams(remove_stopwords(sku_name).lower())
    if len(ngram_sku) < len(ngram_search):  # always use the shorter gram list as the base
        ngram_search, ngram_sku = ngram_sku, ngram_search
    if not ngram_search:  # guard: a description made entirely of stopwords can never match
        return 0

    # standalone integers delimited by whitespace or brackets (quantities, pack counts, etc.)
    pattern = r'(?<=\s|[(=\[\]])\d+(?=\s|[\[\])=]|$)'
    potential_num_matches_search = re.findall(pattern, search_term)
    potential_num_matches_sku = re.findall(pattern, sku_name)
    if match_numbers and len(potential_num_matches_search) > 0:
        num_ratio = len([x for x in potential_num_matches_sku if x in potential_num_matches_search]) / len(potential_num_matches_search)
    else:
        num_ratio = 1

    # the search grams are the base list: most of them need to exist in the SKU grams to call it a match
    gram_match = [g for g in ngram_search if g in ngram_sku]
    colour_words = [w.lower().strip() for w in search_term.split(' ') if w.lower().strip() in colours]  # colour words in the search term
    colour_words_in_sku = [w.lower().strip() for w in sku_name.split(' ') if w.lower().strip() in colour_words]  # the same colours in the SKU name

    num_re = r"(?:\.\d{1,2}|\d{1,4}\.?\d{0,2}|\d{5}\.?\d?|\d{6}\.?)"  # numbers that look like dimensions
    by_re = r"(?: )?(?:by|x)(?: )?"
    unit_re = r"(?:\s*mm|\s*cm|\s*millimeter|\s*centimeter|\s*millimeters|\s*centimeters|\s*MM|\s*CM|\s*m|\s*in|\s*meter|\s*H|\s*W|\s*L|\s*kg|\s*meters|\s*metres|\s*grams|\s*g|\s*lbs|\s*pounds|\s*oz|\s*ounces|\s*litres|\s*L|\s*ml|\s*milliliters|\s*mL|\s*gallons|\s*gal|\s*sq\s*m|\s*sq\s*ft|\s*sq\s*in|\s*cu\s*m|\s*cu\s*ft|\s*cu\s*in|\s*v|\s*ah)"
    x_cm = "(?:" + num_re + r" *(?:to|\-) *" + unit_re + "|" + num_re + unit_re + ")"
    xy_cm = ("(?:" + num_re + unit_re + by_re + num_re + unit_re + "|"
             + num_re + by_re + num_re + by_re + num_re + "|"
             + num_re + by_re + num_re + unit_re + "|"
             + num_re + unit_re + by_re + num_re + "|"
             + num_re + by_re + num_re + ")")

    search_dimms = [re.findall(r"\d+(?:\.\d+)?", m) for m in re.findall(f"{x_cm}|{xy_cm}", search_term.lower())]  # dimensions in the search term
    search_dimms = [item for sublist in search_dimms for item in sublist]
    sku_dimms = [re.findall(r"\d+(?:\.\d+)?", m) for m in re.findall(f"{x_cm}|{xy_cm}", sku_name.lower())]  # dimensions in the SKU name
    sku_dimms = [item for sublist in sku_dimms for item in sublist]
    if len(search_dimms) > 0:
        dimm_set = [d for d in search_dimms if d in sku_dimms]
        dimm_ratio = (len(dimm_set) / len(search_dimms)) * 100
    else:
        dimm_ratio = 100  # no dimensions requested, so nothing to disqualify on

    similarity = (len(gram_match) / len(ngram_search)) * 100
    if not match_numbers:
        return similarity
    if len(colour_words) == 0:
        if dimm_ratio < 70 or num_ratio < 1:  # too few dimensions or standalone numbers match
            return 0
        return similarity
    colour_val = len(colour_words_in_sku) / len(colour_words)  # were all requested colours found in the SKU name?
    if colour_val != 1:
        return 0  # colours were asked for but not all found: exclude wrongly coloured products
    if dimm_ratio < 70 or num_ratio < 1:  # few or no dimensions match where they exist
        return 0
    return similarity
Results
Using this enhanced approach, the similarity results are far more accurate:
Black Square angle bracket 2.2mm => Black Square angle bracket 2.2mm = 100.0
Black Square angle bracket 2.2mm => Black Round Angle Bracket 2.2mm. = 78.26
Black Square angle bracket 2.2mm => Forkstore black Megasquare angled bracket 2.2mm = 100.0
Black Square angle bracket 2.2mm => Forkstore Rounded angled bracket, black, 22mm = 0
Black Square angle bracket 2.2mm => Forkstore angled black pipe coupler 22mm. = 0
Black Square angle bracket 2.2mm => Blue Round Angle Bracket 23mm. = 0
As you can see, the correct product now scores highest, while the wrong-colour, wrong-shape, and wrong-dimension candidates are excluded outright. This method isn't perfect, but it's noticeably more accurate than the standard techniques.