Motivation for the package:

  • a human intern can identify that these shows are the same:
Cirque du Soleil Zarkana New York
Cirque du Soleil-Zarkana
Cirque du Soleil: Zarkanna
Cirque Du Soleil - Zarkana Tickets 8/31/11 (New York)
Cirque Du Soleil - ZARKANA (Matinee) (New York)
Cirque du Soleil - New York
  • the same human will also identify that these are different:
Cirque du Soleil Kooza New York
Cirque du Soleil: KA
Cirque du Soleil Zarkana Las Vegas
  • but it’s hard for computers!

Use cases

  • simple similarity of the two whole strings:
fuzz.ratio("NEW YORK METS", "NEW YORK MEATS") ⇒ 96
  • smarter similarity based on the best-matching substring:
fuzz.partial_ratio("YANKEES", "NEW YORK YANKEES") ⇒ 100
fuzz.partial_ratio("NEW YORK METS", "NEW YORK YANKEES") ⇒ 69
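
To make the whole-string and substring ideas concrete, here is a minimal sketch built on Python's standard difflib. It is not the package's own implementation: the helper names ratio_sketch and partial_ratio_sketch are hypothetical, and the sliding-window search is a simplifying assumption, so scores can differ from the figures quoted above.

```python
from difflib import SequenceMatcher


def ratio_sketch(a: str, b: str) -> int:
    """Whole-string similarity on a 0-100 scale (hypothetical helper)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def partial_ratio_sketch(a: str, b: str) -> int:
    """Best score of the shorter string against any same-length window
    of the longer string (assumed approach, not the library's code)."""
    shorter, longer = sorted((a, b), key=len)
    width = len(shorter)
    return max(
        ratio_sketch(shorter, longer[i:i + width])
        for i in range(len(longer) - width + 1)
    )


print(ratio_sketch("NEW YORK METS", "NEW YORK MEATS"))
print(partial_ratio_sketch("YANKEES", "NEW YORK YANKEES"))
```

Because "YANKEES" appears verbatim inside the longer string, one window matches it exactly and the substring score reaches 100, while the whole-string score would be dragged down by the extra "NEW YORK ".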
  • out-of-order token similarity (sort the tokens, then compare):
fuzz.token_sort_ratio("New York Mets vs Atlanta Braves", "Atlanta Braves vs New York Mets") ⇒ 100
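
The sort-then-compare idea can be sketched in a few lines with the standard difflib. This assumes lowercasing and whitespace tokenization; the name token_sort_ratio_sketch is hypothetical and this is not the package's own code.

```python
from difflib import SequenceMatcher


def token_sort_ratio_sketch(a: str, b: str) -> int:
    """Compare two strings after sorting their tokens (hypothetical helper)."""

    def normalize(s: str) -> str:
        # Lowercase, split on whitespace, sort the tokens, rejoin.
        return " ".join(sorted(s.lower().split()))

    return round(100 * SequenceMatcher(None, normalize(a), normalize(b)).ratio())


print(token_sort_ratio_sketch(
    "New York Mets vs Atlanta Braves",
    "Atlanta Braves vs New York Mets",
))
```

Both inputs normalize to the same string, "atlanta braves mets new vs york", so word order no longer affects the score.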
  • token set (token sort scores poorly when the strings have very different lengths)
    • Solution: split the tokens into two groups: the intersection (tokens in both strings) and the remainder of each string.
    • Then build comparison strings from those groups and compare them.
fuzz.token_set_ratio("mariners vs angels", "los angeles angels of anaheim at seattle mariners") ⇒ 90
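
A rough sketch of the intersection-and-remainder idea, again using the standard difflib. The way the comparison strings are assembled here (sorted intersection, then intersection plus each remainder, keeping the best pairwise score) is an assumption for illustration, and token_set_ratio_sketch is a hypothetical name; the result may differ by a point or so from the figure quoted above.

```python
from difflib import SequenceMatcher


def _ratio(a: str, b: str) -> int:
    """Whole-string similarity on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_set_ratio_sketch(a: str, b: str) -> int:
    """Score two strings by their shared tokens plus each remainder."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())

    # Group 1: tokens common to both strings. Groups 2 and 3: what is left.
    common = " ".join(sorted(tokens_a & tokens_b))
    only_a = " ".join(sorted(tokens_a - tokens_b))
    only_b = " ".join(sorted(tokens_b - tokens_a))

    # Comparison strings: the shared part alone, and with each remainder.
    with_a = f"{common} {only_a}".strip()
    with_b = f"{common} {only_b}".strip()

    # A large shared part dominates, however long one remainder is.
    return max(
        _ratio(common, with_a),
        _ratio(common, with_b),
        _ratio(with_a, with_b),
    )


print(token_set_ratio_sketch(
    "mariners vs angels",
    "los angeles angels of anaheim at seattle mariners",
))
```

Here the shared tokens are "angels" and "mariners"; comparing them against the short string's tokens gives a high score even though the second string carries six extra words.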