public class StringDistance extends Object
Constructor and Description |
---|
StringDistance() |
Modifier and Type | Method and Description |
---|---|
static int |
editDistance(String s,
String t)
The edit distance counts the differences between two strings, where we would count a difference not only when
strings have different characters but also when one has a character whereas the other does not.
|
static int |
hammingDistance(String a,
String b)
The Hamming distance H is defined only for strings of the same length.
|
static double |
prefixWeightedHammingDistance(String s,
String t,
double weight)
Compute the Hamming distance between two strings of equal length (if they are of unequal length the longer one is
trimmed), giving higher weight to characters at the start of the strings, so strings that are similar at the
starts are given higher scores (shorter distances) than strings that are similar at the ends.
|
static double |
suffixWeightedHammingDistance(String s,
String t,
double weight) |
public static int editDistance(String s, String t)
For a string s, let s(i) stand for its ith character. For two characters a and b, define r(a, b) = 0 if a = b. Let r(a, b) = 1, otherwise.
Assume we are given two strings s and t of length n and m, respectively. We are going to fill an (n+1) x (m+1) array d with integers such that the low right corner element d(n+1, m+1) will furnish the required values of the Levenshtein distance L(s, t).
The definition of entries of d is recursive. First set d(i, 0) = i, i = 0, 1,..., n, and d(0, j) = j, j = 0, 1, ..., m. For other pairs i, j use (1) d(i, j) = min(d(i-1, j)+1, d(i, j-1)+1, d(i-1, j-1) + r(s(i), t(j)))
That last step is also described as:
(Description partly cribbed from http://www.cut-the-knot.org/do_you_know/Strings.shtml and http://www.merriampark.com/ld.htm)
a
- b
- public static int hammingDistance(String a, String b)
a
- b
- public static double prefixWeightedHammingDistance(String s, String t, double weight)
s
- t
- weight
- Controls how quickly (normalized by the length of the string) the bias towards the front drops off.
The penalty for mismatches drops off linearly for the fraction of the String represented by
weight
. A weight of 1.0 indicates the bias should be linear across the length of the string.
Intermediate values (e.g., 0.25) mean that differences beyond the first 1/4 of the string results in no
penalty.Copyright © 2003–2023 UBC Michael Smith Laboratories. All rights reserved.