1 /*!
2 An implementation of the [Rabin-Karp substring search algorithm][rabinkarp].
3
4 Rabin-Karp works by creating a hash of the needle provided and then computing
5 a rolling hash for each needle sized window in the haystack. When the rolling
6 hash matches the hash of the needle, a byte-wise comparison is done to check
7 if a match exists. The worst case time complexity of Rabin-Karp is `O(m *
8 n)` where `m ~ len(needle)` and `n ~ len(haystack)`. Its worst case space
9 complexity is constant.
10
11 The main utility of Rabin-Karp is that the searcher can be constructed very
12 quickly with very little memory. This makes it especially useful when searching
13 for small needles in small haystacks, as it might finish its search before a
14 beefier algorithm (like Two-Way) even starts.
15
16 [rabinkarp]: https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm
17 */
18
19 /*
20 (This was the comment I wrote for this module originally when it was not
21 exposed. The comment still looks useful, but it's a bit in the weeds, so it's
22 not public itself.)
23
24 This module implements the classical Rabin-Karp substring search algorithm,
25 with no extra frills. While its use would seem to break our time complexity
26 guarantee of O(m+n) (RK's time complexity is O(mn)), we are careful to only
27 ever use RK on a constant subset of haystacks. The main point here is that
28 RK has good latency properties for small needles/haystacks. It's very quick
29 to compute a needle hash and zip through the haystack when compared to
30 initializing Two-Way, for example. And this is especially useful for cases
31 where the haystack is just too short for vector instructions to do much good.
32
33 The hashing function used here is the same one recommended by ESMAJ.
34
35 Another choice instead of Rabin-Karp would be Shift-Or. But its latency
36 isn't quite as good since its preprocessing time is a bit more expensive
37 (both in practice and in theory). However, perhaps Shift-Or has a place
38 somewhere else for short patterns. I think the main problem is that it
39 requires space proportional to the alphabet and the needle. If we, for
40 example, supported needles up to length 16, then the total table size would be
41 len(alphabet)*size_of::<u16>()==512 bytes. Which isn't exactly small, and it's
42 probably bad to put that on the stack. So ideally, we'd throw it on the heap,
43 but we'd really like to write as much code without using alloc/std as possible.
44 But maybe it's worth the special casing. It's a TODO to benchmark.
45
46 Wikipedia has a decent explanation, if a bit heavy on the theory:
47 https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm
48
49 But ESMAJ provides something a bit more concrete:
50 http://www-igm.univ-mlv.fr/~lecroq/string/node5.html
51
52 Finally, aho-corasick uses Rabin-Karp for multiple pattern match in some cases:
53 https://github.com/BurntSushi/aho-corasick/blob/3852632f10587db0ff72ef29e88d58bf305a0946/src/packed/rabinkarp.rs
54 */
55
56 use crate::ext::Pointer;
57
58 /// A forward substring searcher using the Rabin-Karp algorithm.
59 ///
60 /// Note that, as a lower level API, a `Finder` does not have access to the
61 /// needle it was constructed with. For this reason, executing a search
62 /// with a `Finder` requires passing both the needle and the haystack,
63 /// where the needle is exactly equivalent to the one given to the `Finder`
64 /// at construction time. This design was chosen so that callers can have
65 /// more precise control over where and how many times a needle is stored.
66 /// For example, in cases where Rabin-Karp is just one of several possible
67 /// substring search algorithms.
68 #[derive(Clone, Debug)]
69 pub struct Finder {
70 /// The actual hash.
71 hash: Hash,
72 /// The factor needed to multiply a byte by in order to subtract it from
73 /// the hash. It is defined to be 2^(n-1) (using wrapping exponentiation),
74 /// where n is the length of the needle. This is how we "remove" a byte
75 /// from the hash once the hash window rolls past it.
76 hash_2pow: u32,
77 }
78
79 impl Finder {
80 /// Create a new Rabin-Karp forward searcher for the given `needle`.
81 ///
82 /// The needle may be empty. The empty needle matches at every byte offset.
83 ///
84 /// Note that callers must pass the same needle to all search calls using
85 /// this `Finder`.
86 #[inline]
new(needle: &[u8]) -> Finder87 pub fn new(needle: &[u8]) -> Finder {
88 let mut s = Finder { hash: Hash::new(), hash_2pow: 1 };
89 let first_byte = match needle.get(0) {
90 None => return s,
91 Some(&first_byte) => first_byte,
92 };
93 s.hash.add(first_byte);
94 for b in needle.iter().copied().skip(1) {
95 s.hash.add(b);
96 s.hash_2pow = s.hash_2pow.wrapping_shl(1);
97 }
98 s
99 }
100
101 /// Return the first occurrence of the `needle` in the `haystack`
102 /// given. If no such occurrence exists, then `None` is returned.
103 ///
104 /// The `needle` provided must match the needle given to this finder at
105 /// construction time.
106 ///
107 /// The maximum value this can return is `haystack.len()`, which can only
108 /// occur when the needle and haystack both have length zero. Otherwise,
109 /// for non-empty haystacks, the maximum value is `haystack.len() - 1`.
110 #[inline]
find(&self, haystack: &[u8], needle: &[u8]) -> Option<usize>111 pub fn find(&self, haystack: &[u8], needle: &[u8]) -> Option<usize> {
112 unsafe {
113 let hstart = haystack.as_ptr();
114 let hend = hstart.add(haystack.len());
115 let nstart = needle.as_ptr();
116 let nend = nstart.add(needle.len());
117 let found = self.find_raw(hstart, hend, nstart, nend)?;
118 Some(found.distance(hstart))
119 }
120 }
121
122 /// Like `find`, but accepts and returns raw pointers.
123 ///
124 /// When a match is found, the pointer returned is guaranteed to be
125 /// `>= start` and `<= end`. The pointer returned is only ever equivalent
126 /// to `end` when both the needle and haystack are empty. (That is, the
127 /// empty string matches the empty string.)
128 ///
129 /// This routine is useful if you're already using raw pointers and would
130 /// like to avoid converting back to a slice before executing a search.
131 ///
132 /// # Safety
133 ///
134 /// Note that `start` and `end` below refer to both pairs of pointers given
135 /// to this routine. That is, the conditions apply to both `hstart`/`hend`
136 /// and `nstart`/`nend`.
137 ///
138 /// * Both `start` and `end` must be valid for reads.
139 /// * Both `start` and `end` must point to an initialized value.
140 /// * Both `start` and `end` must point to the same allocated object and
141 /// must either be in bounds or at most one byte past the end of the
142 /// allocated object.
143 /// * Both `start` and `end` must be _derived from_ a pointer to the same
144 /// object.
145 /// * The distance between `start` and `end` must not overflow `isize`.
146 /// * The distance being in bounds must not rely on "wrapping around" the
147 /// address space.
148 /// * It must be the case that `start <= end`.
149 #[inline]
find_raw( &self, hstart: *const u8, hend: *const u8, nstart: *const u8, nend: *const u8, ) -> Option<*const u8>150 pub unsafe fn find_raw(
151 &self,
152 hstart: *const u8,
153 hend: *const u8,
154 nstart: *const u8,
155 nend: *const u8,
156 ) -> Option<*const u8> {
157 let hlen = hend.distance(hstart);
158 let nlen = nend.distance(nstart);
159 if nlen > hlen {
160 return None;
161 }
162 let mut cur = hstart;
163 let end = hend.sub(nlen);
164 let mut hash = Hash::forward(cur, cur.add(nlen));
165 loop {
166 if self.hash == hash && is_equal_raw(cur, nstart, nlen) {
167 return Some(cur);
168 }
169 if cur >= end {
170 return None;
171 }
172 hash.roll(self, cur.read(), cur.add(nlen).read());
173 cur = cur.add(1);
174 }
175 }
176 }
177
178 /// A reverse substring searcher using the Rabin-Karp algorithm.
179 #[derive(Clone, Debug)]
180 pub struct FinderRev(Finder);
181
182 impl FinderRev {
183 /// Create a new Rabin-Karp reverse searcher for the given `needle`.
184 #[inline]
new(needle: &[u8]) -> FinderRev185 pub fn new(needle: &[u8]) -> FinderRev {
186 let mut s = FinderRev(Finder { hash: Hash::new(), hash_2pow: 1 });
187 let last_byte = match needle.last() {
188 None => return s,
189 Some(&last_byte) => last_byte,
190 };
191 s.0.hash.add(last_byte);
192 for b in needle.iter().rev().copied().skip(1) {
193 s.0.hash.add(b);
194 s.0.hash_2pow = s.0.hash_2pow.wrapping_shl(1);
195 }
196 s
197 }
198
199 /// Return the last occurrence of the `needle` in the `haystack`
200 /// given. If no such occurrence exists, then `None` is returned.
201 ///
202 /// The `needle` provided must match the needle given to this finder at
203 /// construction time.
204 ///
205 /// The maximum value this can return is `haystack.len()`, which can only
206 /// occur when the needle and haystack both have length zero. Otherwise,
207 /// for non-empty haystacks, the maximum value is `haystack.len() - 1`.
208 #[inline]
rfind(&self, haystack: &[u8], needle: &[u8]) -> Option<usize>209 pub fn rfind(&self, haystack: &[u8], needle: &[u8]) -> Option<usize> {
210 unsafe {
211 let hstart = haystack.as_ptr();
212 let hend = hstart.add(haystack.len());
213 let nstart = needle.as_ptr();
214 let nend = nstart.add(needle.len());
215 let found = self.rfind_raw(hstart, hend, nstart, nend)?;
216 Some(found.distance(hstart))
217 }
218 }
219
220 /// Like `rfind`, but accepts and returns raw pointers.
221 ///
222 /// When a match is found, the pointer returned is guaranteed to be
223 /// `>= start` and `<= end`. The pointer returned is only ever equivalent
224 /// to `end` when both the needle and haystack are empty. (That is, the
225 /// empty string matches the empty string.)
226 ///
227 /// This routine is useful if you're already using raw pointers and would
228 /// like to avoid converting back to a slice before executing a search.
229 ///
230 /// # Safety
231 ///
232 /// Note that `start` and `end` below refer to both pairs of pointers given
233 /// to this routine. That is, the conditions apply to both `hstart`/`hend`
234 /// and `nstart`/`nend`.
235 ///
236 /// * Both `start` and `end` must be valid for reads.
237 /// * Both `start` and `end` must point to an initialized value.
238 /// * Both `start` and `end` must point to the same allocated object and
239 /// must either be in bounds or at most one byte past the end of the
240 /// allocated object.
241 /// * Both `start` and `end` must be _derived from_ a pointer to the same
242 /// object.
243 /// * The distance between `start` and `end` must not overflow `isize`.
244 /// * The distance being in bounds must not rely on "wrapping around" the
245 /// address space.
246 /// * It must be the case that `start <= end`.
247 #[inline]
rfind_raw( &self, hstart: *const u8, hend: *const u8, nstart: *const u8, nend: *const u8, ) -> Option<*const u8>248 pub unsafe fn rfind_raw(
249 &self,
250 hstart: *const u8,
251 hend: *const u8,
252 nstart: *const u8,
253 nend: *const u8,
254 ) -> Option<*const u8> {
255 let hlen = hend.distance(hstart);
256 let nlen = nend.distance(nstart);
257 if nlen > hlen {
258 return None;
259 }
260 let mut cur = hend.sub(nlen);
261 let start = hstart;
262 let mut hash = Hash::reverse(cur, cur.add(nlen));
263 loop {
264 if self.0.hash == hash && is_equal_raw(cur, nstart, nlen) {
265 return Some(cur);
266 }
267 if cur <= start {
268 return None;
269 }
270 cur = cur.sub(1);
271 hash.roll(&self.0, cur.add(nlen).read(), cur.read());
272 }
273 }
274 }
275
276 /// Whether RK is believed to be very fast for the given needle/haystack.
277 #[inline]
is_fast(haystack: &[u8], _needle: &[u8]) -> bool278 pub(crate) fn is_fast(haystack: &[u8], _needle: &[u8]) -> bool {
279 haystack.len() < 16
280 }
281
282 /// A Rabin-Karp hash. This might represent the hash of a needle, or the hash
283 /// of a rolling window in the haystack.
284 #[derive(Clone, Copy, Debug, Default, Eq, PartialEq)]
285 struct Hash(u32);
286
287 impl Hash {
288 /// Create a new hash that represents the empty string.
289 #[inline(always)]
new() -> Hash290 fn new() -> Hash {
291 Hash(0)
292 }
293
294 /// Create a new hash from the bytes given for use in forward searches.
295 ///
296 /// # Safety
297 ///
298 /// The given pointers must be valid to read from within their range.
299 #[inline(always)]
forward(mut start: *const u8, end: *const u8) -> Hash300 unsafe fn forward(mut start: *const u8, end: *const u8) -> Hash {
301 let mut hash = Hash::new();
302 while start < end {
303 hash.add(start.read());
304 start = start.add(1);
305 }
306 hash
307 }
308
309 /// Create a new hash from the bytes given for use in reverse searches.
310 ///
311 /// # Safety
312 ///
313 /// The given pointers must be valid to read from within their range.
314 #[inline(always)]
reverse(start: *const u8, mut end: *const u8) -> Hash315 unsafe fn reverse(start: *const u8, mut end: *const u8) -> Hash {
316 let mut hash = Hash::new();
317 while start < end {
318 end = end.sub(1);
319 hash.add(end.read());
320 }
321 hash
322 }
323
324 /// Add 'new' and remove 'old' from this hash. The given needle hash should
325 /// correspond to the hash computed for the needle being searched for.
326 ///
327 /// This is meant to be used when the rolling window of the haystack is
328 /// advanced.
329 #[inline(always)]
roll(&mut self, finder: &Finder, old: u8, new: u8)330 fn roll(&mut self, finder: &Finder, old: u8, new: u8) {
331 self.del(finder, old);
332 self.add(new);
333 }
334
335 /// Add a byte to this hash.
336 #[inline(always)]
add(&mut self, byte: u8)337 fn add(&mut self, byte: u8) {
338 self.0 = self.0.wrapping_shl(1).wrapping_add(u32::from(byte));
339 }
340
341 /// Remove a byte from this hash. The given needle hash should correspond
342 /// to the hash computed for the needle being searched for.
343 #[inline(always)]
del(&mut self, finder: &Finder, byte: u8)344 fn del(&mut self, finder: &Finder, byte: u8) {
345 let factor = finder.hash_2pow;
346 self.0 = self.0.wrapping_sub(u32::from(byte).wrapping_mul(factor));
347 }
348 }
349
350 /// Returns true when `x[i] == y[i]` for all `0 <= i < n`.
351 ///
352 /// We forcefully don't inline this to hint at the compiler that it is unlikely
353 /// to be called. This causes the inner rabinkarp loop above to be a bit
354 /// tighter and leads to some performance improvement. See the
355 /// memmem/krate/prebuilt/sliceslice-words/words benchmark.
356 ///
357 /// # Safety
358 ///
359 /// Same as `crate::arch::all::is_equal_raw`.
360 #[cold]
361 #[inline(never)]
is_equal_raw(x: *const u8, y: *const u8, n: usize) -> bool362 unsafe fn is_equal_raw(x: *const u8, y: *const u8, n: usize) -> bool {
363 crate::arch::all::is_equal_raw(x, y, n)
364 }
365
366 #[cfg(test)]
367 mod tests {
368 use super::*;
369
370 define_substring_forward_quickcheck!(|h, n| Some(
371 Finder::new(n).find(h, n)
372 ));
373 define_substring_reverse_quickcheck!(|h, n| Some(
374 FinderRev::new(n).rfind(h, n)
375 ));
376
377 #[test]
forward()378 fn forward() {
379 crate::tests::substring::Runner::new()
380 .fwd(|h, n| Some(Finder::new(n).find(h, n)))
381 .run();
382 }
383
384 #[test]
reverse()385 fn reverse() {
386 crate::tests::substring::Runner::new()
387 .rev(|h, n| Some(FinderRev::new(n).rfind(h, n)))
388 .run();
389 }
390 }
391