File: README.md

package info (click to toggle)
rust-fuzzt 0.3.1-2
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid
  • size: 204 kB
  • sloc: makefile: 2
file content (225 lines) | stat: -rw-r--r-- 7,811 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
# Fuzzt

[Rust](https://www.rust-lang.org) implementations of
[string similarity metrics]:

- [Hamming](#hamming)
- [Levenshtein](#levenshtein) (distance & normalized)
- [Optimal string alignment](#optimal-string-alignment)
- [Damerau-Levenshtein](#damerau-levenshtein) (distance & normalized)
- [Jaro and Jaro-Winkler](#jaro-and-jaro-winkler)
- [Sørensen-Dice](#sørensen-dice)
- [Gestalt pattern matching](#gestalt-pattern-matching)

The normalized versions return values between `0.0` and `1.0`, where `1.0` means
an exact match.

There are also generic versions of the functions for non-string inputs.

## What is new?

This crate is heavily based on the
[strsim-rs](https://github.com/rapidfuzz/strsim-rs) crate, with some nice
additions:

- [Gestalt pattern matching](#gestalt-pattern-matching), the algorithm used by
  python difflib SequenceMatcher
- [Top-N matching](#top-n-matching), a method to retrieve the best N matches
  from a collection of choices.
- [Feature selection](#feature-selection), allows you to select only the
  features (metrics) you want to use, reducing the memory footprint of your
  application.

### Top-N Matching

The method `get_top_n` gets a list of the best matches from a collection of
choices. This feature is inspired by the `extractBests` method from the Python
[fuzzywuzzy](https://github.com/seatgeek/fuzzywuzzy) package (now
[thefuzz](https://github.com/seatgeek/thefuzz)).

The `get_top_n` method takes a query string, an array of choice strings, a
cutoff similarity score, an optional number of top matches to return, an
optional string processor, and an optional similarity metric. It processes each
choice and the query using the provided or default string processor, computes
the similarity between the processed query and each processed choice using the
provided or default similarity metric, and returns the top-N matches that have a
similarity score greater than or equal to the cutoff.

Here's the signature of the `get_top_n` method:

```rust
extern crate fuzzt;
use fuzzt::{algorithms::NormalizedLevenshtein, get_top_n, processors::NullStringProcessor};

fn main() {
    let matches = get_top_n(
        "apple",
        &["apply", "apples", "ape", "applet", "applesauce"],
        Some(0.8),
        Some(3),
        Some(&NullStringProcessor),
        Some(&NormalizedLevenshtein),
    );
    assert_eq!(matches, ["apples", "applet", "apply"]);
}
```

### Feature selection

`fuzzt` is designed with flexibility in mind, allowing you to select only the
features you need for your specific use case. This can help to reduce the
footprint of your application and optimize performance.

The crate includes the following features:

- damerau_levenshtein
- gestalt
- hamming
- jaro
- levenshtein
- optimal_string_alignment
- sorensen_dice

By default, all features are included when you add `fuzzt` as a dependency.
However, you can choose to include only specific features by listing them under
the `features` key in your `Cargo.toml` file. For example:

```toml
[dependencies]
fuzzt = { version = "*", default-features = false, features = ["levenshtein", "jaro"] }
```

## Installation

`Fuzzt` is available on [crates.io](https://crates.io/crates/fuzzt). Add it to
your project:

```sh
cargo add fuzzt
```

## Usage

Go to [Docs.rs](https://docs.rs/fuzzt/) for the full documentation. You can also
clone the repo, and run `$ cargo doc --open`.

### Examples

```rust
extern crate fuzzt;

use fuzzt::{
    damerau_levenshtein, hamming, jaro, jaro_winkler, levenshtein, normalized_damerau_levenshtein,
    normalized_levenshtein, osa_distance, sequence_matcher, sorensen_dice,
};

fn main() {
    match hamming("hamming", "hammers") {
        Ok(distance) => assert_eq!(3, distance),
        Err(why) => panic!("{:?}", why),
    }

    assert_eq!(levenshtein("kitten", "sitting"), 3);
    assert!((normalized_levenshtein("kitten", "sitting") - 0.571).abs() < 0.001);
    assert_eq!(osa_distance("ac", "cba"), 3);
    assert_eq!(damerau_levenshtein("ac", "cba"), 2);
    assert!((normalized_damerau_levenshtein("levenshtein", "löwenbräu") - 0.272).abs() < 0.001);
    assert_eq!(jaro("Friedrich Nietzsche", "Jean-Paul Sartre"), 0.3918859649122807);
    assert_eq!(
        jaro_winkler("cheeseburger", "cheese fries"),
        0.8666666666666666
    );
    assert_eq!(
        sorensen_dice("web applications", "applications of the web"),
        0.7878787878787878
    );
    assert_eq!(
        sequence_matcher("this is a test", "this is a test!"),
        0.9655172413793104
    );
}
```

Using the generic versions of the functions:

```rust
extern crate fuzzt;

use fuzzt::generic_levenshtein;

fn main() {
    assert_eq!(2, generic_levenshtein(&[1, 2, 3], &[0, 2, 5]));
}
```

## Algorithms

### Hamming

The Hamming distance between two strings of equal length is the number of
positions at which the corresponding symbols are different. It measures the
minimum number of substitutions required to change one string into the other.

### Levenshtein

The Levenshtein distance is a string metric for measuring the difference between
two sequences. It quantifies how many edits (insertions, deletions, or
substitutions) you need to make to change one string into another. The
normalized version of this metric gives you a proportion between 0 and 1, where
1 means the strings are identical.

### Optimal String Alignment

The Optimal String Alignment (OSA), also known as the restricted
Damerau-Levenshtein distance, computes the shortest distance considering only
adjacent transpositions. This means it doesn't allow substrings to move as a
block, unlike the Damerau-Levenshtein distance.

### Damerau-Levenshtein

Damerau-Levenshtein distance is an extension of the Levenshtein distance,
allowing for transpositions of two adjacent characters along with insertions,
deletions, and substitutions. The normalized version gives a proportion between
0 and 1, where 1 means the strings are identical.

### Jaro and Jaro-Winkler

The Jaro distance allows for transpositions and takes into account the number
and order of common characters between two strings. The Jaro-Winkler distance is
a modification of the Jaro distance that gives more favorable ratings to strings
that match from the beginning.

### Sørensen-Dice

This coefficient is a statistic used to gauge the similarity of two samples.
It's calculated as twice the size of the intersection of the sets, divided by
the sum of the sizes of the two sets.

### Gestalt Pattern Matching

This is the algorithm used by Python's `difflib.SequenceMatcher`. It uses a
heuristic called "Ratcliff/Obershelp" that computes the doubled number of
matching characters divided by the total number of characters in the two
strings. It's particularly good at detecting close matches and some types of
typos.

## Contributing

If you don't want to install Rust itself, you can run `$ ./dev` for a
development CLI if you have [Docker] installed.

Benchmarks require a Nightly toolchain. Run `$ cargo +nightly bench`.

## License

[MIT](https://github.com/luizvbo/fuzzt/blob/main/LICENSE)

[string similarity metrics]: http://en.wikipedia.org/wiki/String_metric
[Damerau-Levenshtein]: http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
[Jaro and Jaro-Winkler]: http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance
[Levenshtein]: http://en.wikipedia.org/wiki/Levenshtein_distance
[Hamming]: http://en.wikipedia.org/wiki/Hamming_distance
[Optimal string alignment]: https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance#Optimal_string_alignment_distance
[Sørensen-Dice]: http://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient
[Gestalt pattern matching]: https://en.wikipedia.org/wiki/Gestalt_pattern_matching
[Docker]: https://docs.docker.com/engine/installation/