Comparing Product Descriptions Using Python

Matching product descriptions using Python and regex to find the most similar products

Product Description Similarity: A Practical Approach

Matching product descriptions is a common yet tricky problem in e-commerce, inventory management, and catalog matching. While classical string matching techniques exist, they often fall short when faced with real-world product descriptions.

In this article, we’ll explore why traditional approaches struggle and present a more robust method for comparing product descriptions. We’ll also provide Python code snippets to implement these methods.

Why Traditional String Matching Fails

Common, easily accessible string similarity techniques include:

- Levenshtein (edit) distance
- Hamming distance
- Jaro-Winkler similarity
- token-based measures such as Jaccard similarity

These methods work well for general text comparison, but product descriptions present unique challenges: attributes such as size, shape, and colour carry most of the meaning. Consider these six product descriptions:


        Black Square angle bracket 2.2mm
        Black Round Angle Bracket 2.2mm.
        Forkstore black Megasquare angled bracket 2.2mm
        Forkstore Rounded angled bracket, black, 22mm
        Forkstore angled black pipe coupler 22mm.
        Blue Round Angle Bracket 23mm.

We'll write a quick example to calculate the Levenshtein distance between each pair of strings.

The Levenshtein distance measures the number of edits required to transform one string into another. Here’s a Python implementation:


        def levenshtein(a: str, b: str) -> int:
            a, b = a.lower(), b.lower()
            n = len(b)
            dp = list(range(n + 1))  # distances for the previous row
            for i, ca in enumerate(a, 1):
                prev, dp[0] = dp[0], i  # prev holds the diagonal value
                for j, cb in enumerate(b, 1):
                    cur = prev if ca == cb else prev + 1
                    cur = min(cur, dp[j] + 1, dp[j - 1] + 1)
                    prev, dp[j] = dp[j], cur
            return dp[n]
        
        def similarity_ratio(a: str, b: str) -> float:
            d = levenshtein(a, b)
            return 1 - (d / max(len(a), len(b), 1))  # guard against two empty strings
        
        descriptions = [
            "Black Square angle bracket 2.2mm",
            "Black Round Angle Bracket 2.2mm.",
            "Forkstore black Megasquare angled bracket 2.2mm",
            "Forkstore Rounded angled bracket, black, 22mm",
            "Forkstore angled black pipe coupler 22mm.",
            "Blue Round Angle Bracket 23mm."
        ]
        
        similarities = [[f"{x} => {y} = {round(similarity_ratio(x, y), 5)}" for y in descriptions] for x in descriptions]
        
        for row in similarities:
            for item in row:
                print(item)

When we compare the first description against the others, we get this result:


        Black Square angle bracket 2.2mm => Black Square angle bracket 2.2mm = 1.0
        Black Square angle bracket 2.2mm => Black Round Angle Bracket 2.2mm. = 0.8125
        Black Square angle bracket 2.2mm => Forkstore black Megasquare angled bracket 2.2mm = 0.68085
        Black Square angle bracket 2.2mm => Forkstore Rounded angled bracket, black, 22mm = 0.46667
        Black Square angle bracket 2.2mm => Forkstore angled black pipe coupler 22mm. = 0.41463
        Black Square angle bracket 2.2mm => Blue Round Angle Bracket 23mm. = 0.65625

While these scores are roughly in the right order, they ignore critical factors such as colours, dimensions, and SKU identifiers. As a result, the closest-scoring product is not the correct one: the SKU we want is square, not round. Traditional methods do not understand this context.

Here is a potential solution to these problems. It's not complete, but it's a start.

Improving Ratios

Let's first choose our baseline similarity calculation. Levenshtein is OK, but in testing I've found one of the most effective methods is the n-gram ratio: take each word in a sentence and break it into overlapping chunks of n letters, most commonly three-letter combinations (trigrams). For example:

"this is a cat" -> ['thi', 'his', 'is', 'a', 'cat']

N-grams are great because they don't care much about word order or exact matches, and elements like a trailing "s" (for plurals) carry far less weight with n-grams than with Levenshtein distance.
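To make that idea concrete, here's a minimal standalone sketch of per-word trigram extraction. `word_trigrams` is an illustrative helper, simpler than the weighted version used in the full implementation later:

```python
def word_trigrams(text: str, n: int = 3) -> list:
    # Break each word into overlapping n-letter grams;
    # words of n letters or fewer pass through whole.
    grams = []
    for word in text.lower().split():
        if len(word) <= n:
            grams.append(word)
        else:
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams

print(word_trigrams("this is a cat"))  # ['thi', 'his', 'is', 'a', 'cat']
```

Note that "cats" and "cat" share the gram 'cat', so the plural costs only one mismatched gram rather than an edit at every comparison point.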

Another major factor we'll consider is string length. Comparing two products like Mastercrete 25kg Cement and Blue Circle Mastercrete 25kg Plastic Bag Cement Sold Bagged using standard techniques returns an inherently low ratio purely because of the length difference. To resolve this, we'll take the shorter string as the base, then measure how many of the shorter string's components appear in the longer one.
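A sketch of that idea, using whole words rather than n-grams for brevity (`containment_ratio` is a hypothetical helper, not part of the final code):

```python
def containment_ratio(a: str, b: str) -> float:
    # Use the shorter description as the base and score how much
    # of it appears in the longer one.
    short, long_ = sorted((a.lower().split(), b.lower().split()), key=len)
    if not short:
        return 0.0
    return sum(1 for token in short if token in long_) / len(short)

print(containment_ratio(
    "Mastercrete 25kg Cement",
    "Blue Circle Mastercrete 25kg Plastic Bag Cement Sold Bagged"))  # 1.0
```

Every token of the short description appears in the long one, so the pair scores a perfect 1.0 despite the big length difference that would sink a plain edit-distance ratio.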

For some extra zest, we can weight prefixes by doubling the number of times they appear in the extraction.

Additionally, we can further improve accuracy by cleaning the input: lowercasing everything and removing stopwords like "for", "and", and "of" to reduce noise.
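That cleaning step is a one-liner; here's a sketch with a deliberately tiny illustrative stopword list (the full implementation below carries a much longer one):

```python
STOPWORDS = {"for", "and", "of", "with", "the"}  # illustrative subset

def clean(text: str) -> str:
    # Lowercase and drop stopwords so they don't inflate or dilute the ratio.
    return " ".join(w for w in text.lower().split() if w not in STOPWORDS)

print(clean("Bracket for Pipes and Fittings"))  # 'bracket pipes fittings'
```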

On top of that, let's add a method that enforces colour and dimension matching, using a list of colours and some regex for dimensions. That way, 2.2mm and 22mm will never match, and Pipe Blue and Pipe Red will never match either.
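A simplified sketch of that gating step (`hard_filters_pass` is a hypothetical helper with a trimmed colour list and a much simpler dimension regex than the full version below):

```python
import re

COLOURS = {"black", "blue", "red", "white", "green"}  # trimmed illustrative list
DIM = r"\d+(?:\.\d+)?(?=\s*(?:mm|cm|m)\b)"  # a number directly followed by a unit

def hard_filters_pass(search: str, sku: str) -> bool:
    # Reject the SKU outright when a requested colour or dimension is missing.
    s_words = {w.strip(".,") for w in search.lower().split()}
    k_words = {w.strip(".,") for w in sku.lower().split()}
    wanted_colours = s_words & COLOURS
    if wanted_colours and not wanted_colours <= (k_words & COLOURS):
        return False
    wanted_dims = set(re.findall(DIM, search.lower()))
    sku_dims = set(re.findall(DIM, sku.lower()))
    if wanted_dims and not wanted_dims <= sku_dims:
        return False
    return True

print(hard_filters_pass("Black Square angle bracket 2.2mm",
                        "Blue Round Angle Bracket 23mm."))  # False
```

Because the filter returns a hard False rather than a lowered score, a wrongly coloured or wrongly sized product can never outrank a correct one, no matter how similar the rest of its text is.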

Combining all of this, we get the below code:



        import re

        N = 3  # n-gram length

        # Stopwords add noise rather than meaning; they are stripped before comparison.
        STOPWORDS = {
            'ourselves', 'hers', 'between', 'yourself', 'but', 'again', 'there', 'about', 'once', 'during',
            'out', 'very', 'having', 'with', 'they', 'own', 'an', 'be', 'some', 'for', 'do', 'its', 'yours',
            'such', 'into', 'of', 'most', 'itself', 'other', 'off', 'is', 'am', 'or', 'who', 'as', 'from',
            'him', 'each', 'the', 'themselves', 'until', 'below', 'are', 'we', 'these', 'your', 'his', 'through',
            'don', 'nor', 'me', 'were', 'her', 'more', 'himself', 'this', 'down', 'should', 'our', 'their', 'while',
            'above', 'both', 'up', 'to', 'ours', 'had', 'she', 'all', 'no', 'when', 'at', 'any', 'before', 'them',
            'same', 'and', 'been', 'have', 'in', 'will', 'on', 'does', 'yourselves', 'then', 'that', 'because', 'what',
            'over', 'why', 'so', 'can', 'did', 'not', 'now', 'under', 'he', 'you', 'herself', 'has', 'just', 'where',
            'too', 'only', 'myself', 'which', 'those', 'i', 'after', 'few', 'whom', 'being', 'if', 'theirs', 'my',
            'against', 'by', 'doing', 'it', 'how', 'further', 'was', 'here', 'than',
            # domain-specific noise words
            'need', 'want', 'pack', 'delivery', 'collection', 'sold',
        }

        # Colour vocabulary used to enforce colour matching.
        COLOURS = {
            'white', 'yellow', 'blue', 'red', 'green', 'black', 'brown', 'azure', 'ivory', 'teal',
            'silver', 'purple', 'navy blue', 'pea green', 'gray', 'orange', 'maroon', 'charcoal',
            'aquamarine', 'coral', 'fuchsia', 'wheat', 'lime', 'crimson', 'khaki', 'hot pink',
            'magenta', 'golden', 'plum', 'olive', 'cyan',

            # additional colours
            'beige', 'lavender', 'mint', 'salmon', 'gold', 'bronze', 'mustard', 'indigo', 'turquoise',
            'peach', 'rose', 'burgundy', 'emerald', 'jade', 'ruby', 'sapphire', 'amethyst', 'topaz',
            'copper', 'brass', 'onyx', 'pearl', 'sand', 'sepia', 'ochre', 'taupe', 'mauve',
            'periwinkle', 'chartreuse', 'scarlet', 'vermilion', 'cerulean', 'cobalt', 'denim',
            'sky blue', 'baby blue', 'midnight blue', 'royal blue', 'steel blue', 'dodger blue',
            'powder blue', 'seafoam', 'forest green', 'lime green', 'hunter green', 'kelly green',
            'sage', 'pistachio', 'mint green', 'pine green', 'spring green', 'apple green',
            'neon green', 'yellow green', 'army green', 'moss green', 'viridian', 'sea green',
            'cyan blue', 'ice blue', 'arctic blue', 'glacier blue', 'indigo dye', 'prussian blue',
            'ultramarine', 'lapis', 'bistre', 'umber', 'burnt sienna', 'burnt umber', 'raw sienna',
            'camel', 'tan', 'desert sand', 'buff', 'ecru', 'linen', 'almond', 'coffee', 'mocha',
            'espresso', 'mahogany', 'chestnut', 'cinnamon', 'ginger', 'hazel', 'caramel',
            'butterscotch', 'lemon', 'banana yellow', 'canary yellow', 'goldenrod', 'amber',
            'honey', 'dandelion', 'butter', 'sunflower', 'saffron', 'flax', 'apricot',
            'pumpkin', 'tangerine', 'persimmon', 'rust', 'firebrick', 'brick red', 'blood red',
            'rose red', 'ruby red', 'wine', 'claret', 'mulberry', 'raspberry', 'strawberry',
            'cranberry', 'cherry', 'poppy', 'vermillion', 'persian red', 'alizarin', 'oxblood',
            'dusty rose', 'blush pink', 'baby pink', 'bubblegum', 'cotton candy', 'rose quartz',
            'orchid', 'heliotrope', 'thistle', 'violet', 'lavender blush', 'lilac', 'grape',
            'iris', 'eggplant', 'boysenberry', 'indigo purple', 'electric purple', 'deep plum',
            'moss', 'fern', 'basil', 'artichoke', 'avocado', 'seaweed', 'shamrock', 'parakeet',
            'spearmint', 'jade green', 'chartreuse yellow', 'bright lime', 'key lime',
            'neon yellow', 'laser lemon', 'citrine', 'opal', 'smoky topaz', 'gunmetal',
            'slate gray', 'ash gray', 'pewter', 'zinc', 'nickel', 'lead', 'platinum',
            'cloud', 'mist', 'storm gray', 'dove gray', 'granite', 'charcoal black', 'jet',
            'ink black', 'obsidian', 'ebony', 'raven', 'shadow', 'coal',

            # abbreviations
            'blk', 'blck',
        }

        def Description_Similarity(search_term: str, sku_name: str, match_numbers: bool = True) -> float:

            def get_ngrams(string: str) -> list:
                # Break each word into overlapping N-letter grams; short words and
                # the gram nearest the prefix get extra copies as weighting.
                ngram_list = []
                for word in string.split():
                    if len(word) <= N:
                        ngram_list.append(word)
                        ngram_list.append(word)  # double short words to weight them
                    else:
                        for idx in range(len(word)):
                            gram = word[idx:idx + N]
                            if len(gram) < N:  # trailing fragment, not a full gram
                                continue
                            if idx == 1:
                                ngram_list.append(gram)
                                ngram_list.append(gram)  # extra copies weight the prefix area
                            ngram_list.append(gram)
                return ngram_list

            remove_stopwords = lambda text: ' '.join(
                word for word in text.split(' ') if word.strip().lower() not in STOPWORDS)
            ngram_search = get_ngrams(remove_stopwords(search_term).lower())
            ngram_sku = get_ngrams(remove_stopwords(sku_name).lower())

            # Always search with the shorter description against the longer one
            if len(ngram_sku) < len(ngram_search):
                ngram_search, ngram_sku = ngram_sku, ngram_search

            # Standalone integers; longer runs of digits are assumed to be product codes
            pattern = r'(?<=\s|[(=\[\]])\d+(?=\s|[\[\])=]|$)'
            num_matches_search = re.findall(pattern, search_term)
            num_matches_sku = re.findall(pattern, sku_name)

            if match_numbers and num_matches_search:
                num_ratio = len([x for x in num_matches_sku
                                 if x in num_matches_search]) / len(num_matches_search)
            else:
                num_ratio = 1

            # N-grams of the (shorter) search term that also appear in the SKU name
            gram_match = [x for x in ngram_search if x in ngram_sku]

            # Colours requested in the search term, and those the SKU name satisfies
            colour_words = [x.lower().strip() for x in search_term.split(' ')
                            if x.lower().strip() in COLOURS]
            colour_words_in_sku = [x.lower().strip() for x in sku_name.split(' ')
                                   if x.lower().strip() in colour_words]

            # Dimension pattern: a number, an optional "x"/"by" separator, then a unit
            num = r"(?:\.\d{1,2}|\d{1,4}\.?\d{0,2}|\d{5}\.?\d?|\d{6}\.?)"
            by = r"(?: )?(?:by|x)(?: )?"
            unit = (r"(?:\s*mm|\s*cm|\s*millimeter|\s*centimeter|\s*millimeters|\s*centimeters"
                    r"|\s*MM|\s*CM|\s*m|\s*in|\s*meter|\s*H|\s*W|\s*L|\s*kg|\s*meters|\s*metres"
                    r"|\s*grams|\s*g|\s*lbs|\s*pounds|\s*oz|\s*ounces|\s*litres|\s*L|\s*ml"
                    r"|\s*milliliters|\s*mL|\s*gallons|\s*gal|\s*sq\s*m|\s*sq\s*ft|\s*sq\s*in"
                    r"|\s*cu\s*m|\s*cu\s*ft|\s*cu\s*in\s*v|\s*ah)")
            x_cm = "(?:" + num + r" *(?:to|-) *" + unit + "|" + num + unit + ")"
            xy_cm = ("(?:" + num + unit + by + num + unit + "|" + num + by + num + by + num + "|"
                     + num + by + num + unit + "|" + num + unit + by + num + "|" + num + by + num + ")")

            # Pull the numeric parts out of any dimension expressions found
            search_dimms = [re.findall(r"\d+(?:\.\d+)?", m)
                            for m in re.findall(f"{x_cm}|{xy_cm}", search_term.lower())]
            search_dimms = [item for sublist in search_dimms for item in sublist]
            sku_dimms = [re.findall(r"\d+(?:\.\d+)?", m)
                         for m in re.findall(f"{x_cm}|{xy_cm}", sku_name.lower())]
            sku_dimms = [item for sublist in sku_dimms for item in sublist]

            if search_dimms:
                dimm_ratio = (len([d for d in search_dimms if d in sku_dimms])
                              / len(search_dimms)) * 100
            else:
                dimm_ratio = 100  # no dimensions to check, so don't penalise

            similarity = (len(gram_match) / max(len(ngram_search), 1)) * 100

            if not match_numbers:
                return similarity

            if colour_words and not set(colour_words) <= set(colour_words_in_sku):
                return 0  # a requested colour is missing from the SKU name

            if dimm_ratio < 70 or num_ratio < 1:
                return 0  # dimensions or standalone numbers disagree

            return similarity

        

Results

Using this enhanced approach, the similarity results are far more accurate:



        Black Square angle bracket 2.2mm => Black Square angle bracket 2.2mm = 100.0
        Black Square angle bracket 2.2mm => Black Round Angle Bracket 2.2mm. = 78.57
        Black Square angle bracket 2.2mm => Forkstore black Megasquare angled bracket 2.2mm = 100.0
        Black Square angle bracket 2.2mm => Forkstore Rounded angled bracket, black, 22mm = 0
        Black Square angle bracket 2.2mm => Forkstore angled black pipe coupler 22mm. = 0
        Black Square angle bracket 2.2mm => Blue Round Angle Bracket 23mm. = 0

As you can see, the correct product has been selected, and the wrongly sized or wrongly coloured candidates are excluded outright. This method isn't perfect, but it's a tad more accurate than standard techniques.