After some text classification (more than 5-6 words per input) experiments with Edit Distance, I’ve noticed:
- Edit Distance works good (tolerable results, but not good enough), but with some text constellations it inclines to odd results.
- The procedure starts with 2 strings that are aligned with the most similarities (according to their position). This can be a good fit or even a bad one.
- In most cases the deleted, inserted or substituted token does not correspond with the “correct” counterpart.
To overcome these problems I’ve extended the original Edit Distance with a preprocessing step. To calculate the distance between two strings with more than 5-6 words, it is considerable but not the best choice to choose a tokenwise working procedure. Therefore I propose to delete all words, that these two comparing strings have in common. The length of these words should be the starting point of an Edit Distance calculation. Below you can see the pseudo-code for the described algorithm.
1. Set score = 0
2. Find similar sub phrases in both phrases
3. Determine the length of all similar sub phrases
4. Set score = length_of_similar_subphrases * (-1)
5. Delete similar sub phrases from both phrases
6. Apply Edit Distance to the rest of the phrases
7. Set ED_Score = value of Edit Distance
7. score = score + ED_Score
After this, the score contains the Substitution Distance between these two strings. The lower this value is the more are these strings in common (small distance).
In an experiment with a proprietary corpus containing ~1530 phrases the Substitution Distance outperforms the Edit Distance in accuracy. The input was preprocessed with the Lucene GermanStemmer and 10-fold crossvalidated for test. Substitution Distance reached 92.69% and Edit Distance 89.97% accuracy.
To extend the Substitution Distance method a trick can be used. Consider two words, both are linguistically related to each other (example below). These words are not exactly the same but they are related. The method introduced above wouldn’t match this.
Example: extension extensible
The example above shows two words that have 7 consecutive tokens in common. With the introduction of a threshold value that marks related words as ‘the same’ and deleted it (step 5), the accuracy of Substitution Distance could be improved. A possible threshold is 5 (or 6, 7 and 8). With the implementation of this enhancement a stemming-like functionality can be simulated.
Certainly, the Substitution Distance can’t fix all weaknesses of Edit Distance, but the results are significantly more accurate using SD. The following problems aren’t solved:
* After deleting (step 5) the sub phrases, Edit Distance also deleting, substituting and inserting tokens that have no relevance to the counterparts (in the comparison string).
* The residuals (also after step 5) of the phrase mostly have more semantically than syntactically meaning.
Indeed, the performance of Substitution Distance seems improved in contrast to Edit Distance but distance metrics have one limitation. The comparison of syntaxes of phrases is very restricted. A real text classification task needs semantic constraints, which a distance metric like ED or SD can’t provide.
 Navarro, Gonzales; „A guided tour to approximate string matching“, CSUR March 2001