1 /*!
2 An implementation of the [Rabin-Karp substring search algorithm][rabinkarp].
3 
4 Rabin-Karp works by creating a hash of the needle provided and then computing
5 a rolling hash for each needle sized window in the haystack. When the rolling
6 hash matches the hash of the needle, a byte-wise comparison is done to check
7 if a match exists. The worst case time complexity of Rabin-Karp is `O(m *
8 n)` where `m ~ len(needle)` and `n ~ len(haystack)`. Its worst case space
9 complexity is constant.
10 
11 The main utility of Rabin-Karp is that the searcher can be constructed very
12 quickly with very little memory. This makes it especially useful when searching
13 for small needles in small haystacks, as it might finish its search before a
14 beefier algorithm (like Two-Way) even starts.
15 
16 [rabinkarp]: https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm
17 */
18 
19 /*
20 (This was the comment I wrote for this module originally when it was not
21 exposed. The comment still looks useful, but it's a bit in the weeds, so it's
22 not public itself.)
23 
24 This module implements the classical Rabin-Karp substring search algorithm,
25 with no extra frills. While its use would seem to break our time complexity
26 guarantee of O(m+n) (RK's time complexity is O(mn)), we are careful to only
27 ever use RK on a constant subset of haystacks. The main point here is that
28 RK has good latency properties for small needles/haystacks. It's very quick
29 to compute a needle hash and zip through the haystack when compared to
30 initializing Two-Way, for example. And this is especially useful for cases
31 where the haystack is just too short for vector instructions to do much good.
32 
33 The hashing function used here is the same one recommended by ESMAJ.
34 
35 Another choice instead of Rabin-Karp would be Shift-Or. But its latency
36 isn't quite as good since its preprocessing time is a bit more expensive
37 (both in practice and in theory). However, perhaps Shift-Or has a place
38 somewhere else for short patterns. I think the main problem is that it
39 requires space proportional to the alphabet and the needle. If we, for
40 example, supported needles up to length 16, then the total table size would be
41 len(alphabet)*size_of::<u16>()==512 bytes. Which isn't exactly small, and it's
42 probably bad to put that on the stack. So ideally, we'd throw it on the heap,
43 but we'd really like to write as much code without using alloc/std as possible.
44 But maybe it's worth the special casing. It's a TODO to benchmark.
45 
46 Wikipedia has a decent explanation, if a bit heavy on the theory:
47 https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm
48 
49 But ESMAJ provides something a bit more concrete:
50 http://www-igm.univ-mlv.fr/~lecroq/string/node5.html
51 
52 Finally, aho-corasick uses Rabin-Karp for multiple pattern match in some cases:
53 https://github.com/BurntSushi/aho-corasick/blob/3852632f10587db0ff72ef29e88d58bf305a0946/src/packed/rabinkarp.rs
54 */
55 
56 use crate::ext::Pointer;
57 
58 /// A forward substring searcher using the Rabin-Karp algorithm.
59 ///
60 /// Note that, as a lower level API, a `Finder` does not have access to the
61 /// needle it was constructed with. For this reason, executing a search
62 /// with a `Finder` requires passing both the needle and the haystack,
63 /// where the needle is exactly equivalent to the one given to the `Finder`
64 /// at construction time. This design was chosen so that callers can have
65 /// more precise control over where and how many times a needle is stored.
66 /// For example, in cases where Rabin-Karp is just one of several possible
67 /// substring search algorithms.
68 #[derive(Clone, Debug)]
69 pub struct Finder {
70     /// The actual hash.
71     hash: Hash,
72     /// The factor needed to multiply a byte by in order to subtract it from
73     /// the hash. It is defined to be 2^(n-1) (using wrapping exponentiation),
74     /// where n is the length of the needle. This is how we "remove" a byte
75     /// from the hash once the hash window rolls past it.
76     hash_2pow: u32,
77 }
78 
79 impl Finder {
80     /// Create a new Rabin-Karp forward searcher for the given `needle`.
81     ///
82     /// The needle may be empty. The empty needle matches at every byte offset.
83     ///
84     /// Note that callers must pass the same needle to all search calls using
85     /// this `Finder`.
86     #[inline]
new(needle: &[u8]) -> Finder87     pub fn new(needle: &[u8]) -> Finder {
88         let mut s = Finder { hash: Hash::new(), hash_2pow: 1 };
89         let first_byte = match needle.get(0) {
90             None => return s,
91             Some(&first_byte) => first_byte,
92         };
93         s.hash.add(first_byte);
94         for b in needle.iter().copied().skip(1) {
95             s.hash.add(b);
96             s.hash_2pow = s.hash_2pow.wrapping_shl(1);
97         }
98         s
99     }
100 
101     /// Return the first occurrence of the `needle` in the `haystack`
102     /// given. If no such occurrence exists, then `None` is returned.
103     ///
104     /// The `needle` provided must match the needle given to this finder at
105     /// construction time.
106     ///
107     /// The maximum value this can return is `haystack.len()`, which can only
108     /// occur when the needle and haystack both have length zero. Otherwise,
109     /// for non-empty haystacks, the maximum value is `haystack.len() - 1`.
110     #[inline]
find(&self, haystack: &[u8], needle: &[u8]) -> Option<usize>111     pub fn find(&self, haystack: &[u8], needle: &[u8]) -> Option<usize> {
112         unsafe {
113             let hstart = haystack.as_ptr();
114             let hend = hstart.add(haystack.len());
115             let nstart = needle.as_ptr();
116             let nend = nstart.add(needle.len());
117             let found = self.find_raw(hstart, hend, nstart, nend)?;
118             Some(found.distance(hstart))
119         }
120     }
121 
122     /// Like `find`, but accepts and returns raw pointers.
123     ///
124     /// When a match is found, the pointer returned is guaranteed to be
125     /// `>= start` and `<= end`. The pointer returned is only ever equivalent
126     /// to `end` when both the needle and haystack are empty. (That is, the
127     /// empty string matches the empty string.)
128     ///
129     /// This routine is useful if you're already using raw pointers and would
130     /// like to avoid converting back to a slice before executing a search.
131     ///
132     /// # Safety
133     ///
134     /// Note that `start` and `end` below refer to both pairs of pointers given
135     /// to this routine. That is, the conditions apply to both `hstart`/`hend`
136     /// and `nstart`/`nend`.
137     ///
138     /// * Both `start` and `end` must be valid for reads.
139     /// * Both `start` and `end` must point to an initialized value.
140     /// * Both `start` and `end` must point to the same allocated object and
141     /// must either be in bounds or at most one byte past the end of the
142     /// allocated object.
143     /// * Both `start` and `end` must be _derived from_ a pointer to the same
144     /// object.
145     /// * The distance between `start` and `end` must not overflow `isize`.
146     /// * The distance being in bounds must not rely on "wrapping around" the
147     /// address space.
148     /// * It must be the case that `start <= end`.
149     #[inline]
find_raw( &self, hstart: *const u8, hend: *const u8, nstart: *const u8, nend: *const u8, ) -> Option<*const u8>150     pub unsafe fn find_raw(
151         &self,
152         hstart: *const u8,
153         hend: *const u8,
154         nstart: *const u8,
155         nend: *const u8,
156     ) -> Option<*const u8> {
157         let hlen = hend.distance(hstart);
158         let nlen = nend.distance(nstart);
159         if nlen > hlen {
160             return None;
161         }
162         let mut cur = hstart;
163         let end = hend.sub(nlen);
164         let mut hash = Hash::forward(cur, cur.add(nlen));
165         loop {
166             if self.hash == hash && is_equal_raw(cur, nstart, nlen) {
167                 return Some(cur);
168             }
169             if cur >= end {
170                 return None;
171             }
172             hash.roll(self, cur.read(), cur.add(nlen).read());
173             cur = cur.add(1);
174         }
175     }
176 }
177 
178 /// A reverse substring searcher using the Rabin-Karp algorithm.
179 #[derive(Clone, Debug)]
180 pub struct FinderRev(Finder);
181 
182 impl FinderRev {
183     /// Create a new Rabin-Karp reverse searcher for the given `needle`.
184     #[inline]
new(needle: &[u8]) -> FinderRev185     pub fn new(needle: &[u8]) -> FinderRev {
186         let mut s = FinderRev(Finder { hash: Hash::new(), hash_2pow: 1 });
187         let last_byte = match needle.last() {
188             None => return s,
189             Some(&last_byte) => last_byte,
190         };
191         s.0.hash.add(last_byte);
192         for b in needle.iter().rev().copied().skip(1) {
193             s.0.hash.add(b);
194             s.0.hash_2pow = s.0.hash_2pow.wrapping_shl(1);
195         }
196         s
197     }
198 
199     /// Return the last occurrence of the `needle` in the `haystack`
200     /// given. If no such occurrence exists, then `None` is returned.
201     ///
202     /// The `needle` provided must match the needle given to this finder at
203     /// construction time.
204     ///
205     /// The maximum value this can return is `haystack.len()`, which can only
206     /// occur when the needle and haystack both have length zero. Otherwise,
207     /// for non-empty haystacks, the maximum value is `haystack.len() - 1`.
208     #[inline]
rfind(&self, haystack: &[u8], needle: &[u8]) -> Option<usize>209     pub fn rfind(&self, haystack: &[u8], needle: &[u8]) -> Option<usize> {
210         unsafe {
211             let hstart = haystack.as_ptr();
212             let hend = hstart.add(haystack.len());
213             let nstart = needle.as_ptr();
214             let nend = nstart.add(needle.len());
215             let found = self.rfind_raw(hstart, hend, nstart, nend)?;
216             Some(found.distance(hstart))
217         }
218     }
219 
220     /// Like `rfind`, but accepts and returns raw pointers.
221     ///
222     /// When a match is found, the pointer returned is guaranteed to be
223     /// `>= start` and `<= end`. The pointer returned is only ever equivalent
224     /// to `end` when both the needle and haystack are empty. (That is, the
225     /// empty string matches the empty string.)
226     ///
227     /// This routine is useful if you're already using raw pointers and would
228     /// like to avoid converting back to a slice before executing a search.
229     ///
230     /// # Safety
231     ///
232     /// Note that `start` and `end` below refer to both pairs of pointers given
233     /// to this routine. That is, the conditions apply to both `hstart`/`hend`
234     /// and `nstart`/`nend`.
235     ///
236     /// * Both `start` and `end` must be valid for reads.
237     /// * Both `start` and `end` must point to an initialized value.
238     /// * Both `start` and `end` must point to the same allocated object and
239     /// must either be in bounds or at most one byte past the end of the
240     /// allocated object.
241     /// * Both `start` and `end` must be _derived from_ a pointer to the same
242     /// object.
243     /// * The distance between `start` and `end` must not overflow `isize`.
244     /// * The distance being in bounds must not rely on "wrapping around" the
245     /// address space.
246     /// * It must be the case that `start <= end`.
247     #[inline]
rfind_raw( &self, hstart: *const u8, hend: *const u8, nstart: *const u8, nend: *const u8, ) -> Option<*const u8>248     pub unsafe fn rfind_raw(
249         &self,
250         hstart: *const u8,
251         hend: *const u8,
252         nstart: *const u8,
253         nend: *const u8,
254     ) -> Option<*const u8> {
255         let hlen = hend.distance(hstart);
256         let nlen = nend.distance(nstart);
257         if nlen > hlen {
258             return None;
259         }
260         let mut cur = hend.sub(nlen);
261         let start = hstart;
262         let mut hash = Hash::reverse(cur, cur.add(nlen));
263         loop {
264             if self.0.hash == hash && is_equal_raw(cur, nstart, nlen) {
265                 return Some(cur);
266             }
267             if cur <= start {
268                 return None;
269             }
270             cur = cur.sub(1);
271             hash.roll(&self.0, cur.add(nlen).read(), cur.read());
272         }
273     }
274 }
275 
276 /// Whether RK is believed to be very fast for the given needle/haystack.
277 #[inline]
is_fast(haystack: &[u8], _needle: &[u8]) -> bool278 pub(crate) fn is_fast(haystack: &[u8], _needle: &[u8]) -> bool {
279     haystack.len() < 16
280 }
281 
282 /// A Rabin-Karp hash. This might represent the hash of a needle, or the hash
283 /// of a rolling window in the haystack.
284 #[derive(Clone, Copy, Debug, Default, Eq, PartialEq)]
285 struct Hash(u32);
286 
287 impl Hash {
288     /// Create a new hash that represents the empty string.
289     #[inline(always)]
new() -> Hash290     fn new() -> Hash {
291         Hash(0)
292     }
293 
294     /// Create a new hash from the bytes given for use in forward searches.
295     ///
296     /// # Safety
297     ///
298     /// The given pointers must be valid to read from within their range.
299     #[inline(always)]
forward(mut start: *const u8, end: *const u8) -> Hash300     unsafe fn forward(mut start: *const u8, end: *const u8) -> Hash {
301         let mut hash = Hash::new();
302         while start < end {
303             hash.add(start.read());
304             start = start.add(1);
305         }
306         hash
307     }
308 
309     /// Create a new hash from the bytes given for use in reverse searches.
310     ///
311     /// # Safety
312     ///
313     /// The given pointers must be valid to read from within their range.
314     #[inline(always)]
reverse(start: *const u8, mut end: *const u8) -> Hash315     unsafe fn reverse(start: *const u8, mut end: *const u8) -> Hash {
316         let mut hash = Hash::new();
317         while start < end {
318             end = end.sub(1);
319             hash.add(end.read());
320         }
321         hash
322     }
323 
324     /// Add 'new' and remove 'old' from this hash. The given needle hash should
325     /// correspond to the hash computed for the needle being searched for.
326     ///
327     /// This is meant to be used when the rolling window of the haystack is
328     /// advanced.
329     #[inline(always)]
roll(&mut self, finder: &Finder, old: u8, new: u8)330     fn roll(&mut self, finder: &Finder, old: u8, new: u8) {
331         self.del(finder, old);
332         self.add(new);
333     }
334 
335     /// Add a byte to this hash.
336     #[inline(always)]
add(&mut self, byte: u8)337     fn add(&mut self, byte: u8) {
338         self.0 = self.0.wrapping_shl(1).wrapping_add(u32::from(byte));
339     }
340 
341     /// Remove a byte from this hash. The given needle hash should correspond
342     /// to the hash computed for the needle being searched for.
343     #[inline(always)]
del(&mut self, finder: &Finder, byte: u8)344     fn del(&mut self, finder: &Finder, byte: u8) {
345         let factor = finder.hash_2pow;
346         self.0 = self.0.wrapping_sub(u32::from(byte).wrapping_mul(factor));
347     }
348 }
349 
350 /// Returns true when `x[i] == y[i]` for all `0 <= i < n`.
351 ///
352 /// We forcefully don't inline this to hint at the compiler that it is unlikely
353 /// to be called. This causes the inner rabinkarp loop above to be a bit
354 /// tighter and leads to some performance improvement. See the
355 /// memmem/krate/prebuilt/sliceslice-words/words benchmark.
356 ///
357 /// # Safety
358 ///
359 /// Same as `crate::arch::all::is_equal_raw`.
360 #[cold]
361 #[inline(never)]
is_equal_raw(x: *const u8, y: *const u8, n: usize) -> bool362 unsafe fn is_equal_raw(x: *const u8, y: *const u8, n: usize) -> bool {
363     crate::arch::all::is_equal_raw(x, y, n)
364 }
365 
366 #[cfg(test)]
367 mod tests {
368     use super::*;
369 
370     define_substring_forward_quickcheck!(|h, n| Some(
371         Finder::new(n).find(h, n)
372     ));
373     define_substring_reverse_quickcheck!(|h, n| Some(
374         FinderRev::new(n).rfind(h, n)
375     ));
376 
377     #[test]
forward()378     fn forward() {
379         crate::tests::substring::Runner::new()
380             .fwd(|h, n| Some(Finder::new(n).find(h, n)))
381             .run();
382     }
383 
384     #[test]
reverse()385     fn reverse() {
386         crate::tests::substring::Runner::new()
387             .rev(|h, n| Some(FinderRev::new(n).rfind(h, n)))
388             .run();
389     }
390 }
391