
Note that this is not a replacement for the stock equals() method on the String class. Approximate comparison of this kind is increasingly used in applications such as data cleaning and data integration. The bigram and dice methods above suggest one way to implement Dice's coefficient in Java as a simple measure of fuzzy string similarity. We can also use the Levenshtein distance to determine the similarity between two strings. The Levenshtein distance (or edit distance) algorithm tells us how different two strings are by counting the minimum number of single-character operations (insertions, deletions, or substitutions) required to transform one string into the other.
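The edit distance described above can be sketched in Java with the classic dynamic-programming recurrence. This is a minimal illustration, not a reference implementation: the class name LevenshteinSketch and the method name distance are hypothetical, and the code compares UTF-16 code units, so characters outside the Basic Multilingual Plane count as two units.

```java
// Minimal sketch of the Levenshtein (edit) distance.
// LevenshteinSketch and distance are illustrative names, not a library API.
final class LevenshteinSketch {

    static int distance(String a, String b) {
        // prev[j]: distance between the first i-1 chars of a and the first j chars of b.
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Turning the empty prefix of a into the first j chars of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            // Turning the first i chars of a into the empty string takes i deletions.
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1,  // insertion
                                 prev[j] + 1),     // deletion
                        prev[j - 1] + cost);       // substitution, or match when cost is 0
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        // After the final swap, prev holds the last completed row.
        return prev[b.length()];
    }
}
```

For instance, LevenshteinSketch.distance("kitten", "sitting") evaluates to 3: two substitutions and one insertion. Keeping only two rows holds memory use proportional to the length of the second string, at the cost of not being able to recover the actual edit sequence.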

We can use the edit distance algorithm to determine the similarity between two strings in C. This article illustrates the different techniques to calculate the similarity between two strings in C. Edit-distance-based string similarity join is a fundamental operator in string databases.

Abstract: Comparing strings and assessing their similarity is a basic operation in many application domains of machine learning, such as in information retrieval, natural language processing and bioinformatics. The practitioner can choose from a large variety of available similarity measures for this task, each emphasizing different aspects of the string data. In this article, we present Harry, a small tool specifically designed for measuring the similarity of strings. Harry implements over 20 similarity measures, including common string distances and string kernels, such as the Levenshtein distance and the Subsequence kernel. The tool has been designed with efficiency in mind and allows for multi-threaded as well as distributed computing, enabling the analysis of large data sets of strings. Harry supports common data formats and thus can interface with analysis environments, such as Matlab, Pylab and Weka.

Konrad Rieck, Christian Wressnegger. 17(9):1−5, 2016.


Harry: A Tool for Measuring String Similarity

String metrics are text-based measures that produce a similarity or dissimilarity (distance) score between two text strings for approximate matching or comparison.
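As one concrete example of such a similarity score, Dice's coefficient over character bigrams can be sketched in Java as follows. The class DiceSketch and its methods bigrams and similarity are hypothetical names chosen for illustration; they are not the methods referred to elsewhere in this text, and the handling of strings shorter than two characters is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of Dice's coefficient on character bigrams.
// DiceSketch, bigrams and similarity are illustrative names, not a library API.
final class DiceSketch {

    // Splits s into overlapping character bigrams, e.g. "night" -> [ni, ig, gh, ht].
    static List<String> bigrams(String s) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + 1 < s.length(); i++) {
            result.add(s.substring(i, i + 2));
        }
        return result;
    }

    // Dice's coefficient: 2 * (shared bigrams) / (bigrams in a + bigrams in b).
    static double similarity(String a, String b) {
        List<String> x = bigrams(a);
        List<String> y = bigrams(b);
        if (x.isEmpty() && y.isEmpty()) {
            // Assumption: with no bigrams on either side, fall back to exact equality.
            return a.equals(b) ? 1.0 : 0.0;
        }
        int total = x.size() + y.size();
        int shared = 0;
        for (String gram : x) {
            // remove(Object) deletes a single occurrence, so each bigram in y
            // can be matched at most once even if it repeats in x.
            if (y.remove(gram)) {
                shared++;
            }
        }
        return (2.0 * shared) / total;
    }
}
```

For instance, DiceSketch.similarity("night", "nacht") evaluates to 0.25, because the two words share only the bigram "ht" out of eight bigrams in total. The score always lies between 0.0 (no shared bigrams) and 1.0 (identical bigram multisets), which makes it easy to compare against a fixed threshold.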
