1 /*!
2 A low level regular expression library that uses deterministic finite automata.
3 It supports a rich syntax with Unicode support, has extensive options for
4 configuring the best space vs time trade off for your use case and provides
5 support for cheap deserialization of automata for use in `no_std` environments.
6 
7 # Overview
8 
9 This section gives a brief overview of the primary types in this crate:
10 
11 * A [`Regex`](struct.Regex.html) provides a way to search for matches of a
12   regular expression. This includes iterating over matches with both the start
13   and end positions of each match.
14 * A [`RegexBuilder`](struct.RegexBuilder.html) provides a way configure many
15   compilation options for a regex.
16 * A [`DenseDFA`](enum.DenseDFA.html) provides low level access to a DFA that
17   uses a dense representation (uses lots of space, but fast searching).
18 * A [`SparseDFA`](enum.SparseDFA.html) provides the same API as a `DenseDFA`,
19   but uses a sparse representation (uses less space, but slower matching).
20 * A [`DFA`](trait.DFA.html) trait that defines an interface that all DFAs must
21   implement.
22 * Both dense DFAs and sparse DFAs support
23   [serialization to raw bytes](enum.DenseDFA.html#method.to_bytes_little_endian)
24   and
25   [cheap deserialization](enum.DenseDFA.html#method.from_bytes).
26 
27 # Example: basic regex searching
28 
29 This example shows how to compile a regex using the default configuration
30 and then use it to find matches in a byte string:
31 
32 ```
33 use regex_automata::Regex;
34 
35 let re = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}").unwrap();
36 let text = b"2018-12-24 2016-10-08";
37 let matches: Vec<(usize, usize)> = re.find_iter(text).collect();
38 assert_eq!(matches, vec![(0, 10), (11, 21)]);
39 ```
40 
41 # Example: use sparse DFAs
42 
43 By default, compiling a regex will use dense DFAs internally. This uses more
44 memory, but executes searches more quickly. If you can abide slower searches
45 (somewhere around 3-5x), then sparse DFAs might make more sense since they can
46 use significantly less space.
47 
48 Using sparse DFAs is as easy as using `Regex::new_sparse` instead of
49 `Regex::new`:
50 
51 ```
52 use regex_automata::Regex;
53 
54 # fn example() -> Result<(), regex_automata::Error> {
55 let re = Regex::new_sparse(r"[0-9]{4}-[0-9]{2}-[0-9]{2}").unwrap();
56 let text = b"2018-12-24 2016-10-08";
57 let matches: Vec<(usize, usize)> = re.find_iter(text).collect();
58 assert_eq!(matches, vec![(0, 10), (11, 21)]);
59 # Ok(()) }; example().unwrap()
60 ```
61 
62 If you already have dense DFAs for some reason, they can be converted to sparse
63 DFAs and used to build a new `Regex`. For example:
64 
65 ```
66 use regex_automata::Regex;
67 
68 # fn example() -> Result<(), regex_automata::Error> {
69 let dense_re = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}").unwrap();
70 let sparse_re = Regex::from_dfas(
71     dense_re.forward().to_sparse()?,
72     dense_re.reverse().to_sparse()?,
73 );
74 let text = b"2018-12-24 2016-10-08";
75 let matches: Vec<(usize, usize)> = sparse_re.find_iter(text).collect();
76 assert_eq!(matches, vec![(0, 10), (11, 21)]);
77 # Ok(()) }; example().unwrap()
78 ```
79 
80 # Example: deserialize a DFA
81 
82 This shows how to first serialize a DFA into raw bytes, and then deserialize
83 those raw bytes back into a DFA. While this particular example is a bit
84 contrived, this same technique can be used in your program to deserialize a
85 DFA at start up time or by memory mapping a file. In particular,
86 deserialization is guaranteed to be cheap because it will always be a constant
87 time operation.
88 
89 ```
90 use regex_automata::{DenseDFA, Regex};
91 
92 # fn example() -> Result<(), regex_automata::Error> {
93 let re1 = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}").unwrap();
94 // serialize both the forward and reverse DFAs, see note below
95 let fwd_bytes = re1.forward().to_u16()?.to_bytes_native_endian()?;
96 let rev_bytes = re1.reverse().to_u16()?.to_bytes_native_endian()?;
97 // now deserialize both---we need to specify the correct type!
98 let fwd: DenseDFA<&[u16], u16> = unsafe { DenseDFA::from_bytes(&fwd_bytes) };
99 let rev: DenseDFA<&[u16], u16> = unsafe { DenseDFA::from_bytes(&rev_bytes) };
100 // finally, reconstruct our regex
101 let re2 = Regex::from_dfas(fwd, rev);
102 
103 // we can use it like normal
104 let text = b"2018-12-24 2016-10-08";
105 let matches: Vec<(usize, usize)> = re2.find_iter(text).collect();
106 assert_eq!(matches, vec![(0, 10), (11, 21)]);
107 # Ok(()) }; example().unwrap()
108 ```
109 
110 There are a few points worth noting here:
111 
112 * We need to extract the raw DFAs used by the regex and serialize those. You
113   can build the DFAs manually yourself using
114   [`dense::Builder`](dense/struct.Builder.html), but using the DFAs from a
115   `Regex` guarantees that the DFAs are built correctly.
116 * We specifically convert the dense DFA to a representation that uses `u16`
117   for its state identifiers using
118   [`DenseDFA::to_u16`](enum.DenseDFA.html#method.to_u16). While this isn't
119   strictly necessary, if we skipped this step, then the serialized bytes would
120   use `usize` for state identifiers, which does not have a fixed size. Using
121   `u16` ensures that we can deserialize this DFA even on platforms with a
122   smaller pointer size. If our DFA is too big for `u16` state identifiers, then
123   one can use `u32` or `u64`.
124 * To convert the DFA to raw bytes, we use the `to_bytes_native_endian`
125   method. In practice, you'll want to use either
126   [`DenseDFA::to_bytes_little_endian`](enum.DenseDFA.html#method.to_bytes_little_endian)
127   or
128   [`DenseDFA::to_bytes_big_endian`](enum.DenseDFA.html#method.to_bytes_big_endian),
129   depending on which platform you're deserializing your DFA from. If you intend
130   to deserialize on either platform, then you'll need to serialize both and
131   deserialize the right one depending on your target's endianness.
132 * Deserializing a DFA requires the use of `unsafe` because the raw bytes must
133   be *trusted*. In particular, while some degree of sanity checks are
134   performed, nothing guarantees the integrity of the DFA's transition table
135   since deserialization is a constant time operation. Since searching with a
136   DFA must be able to follow transitions blindly for performance reasons,
137   giving incorrect bytes to the deserialization API can result in memory
138   unsafety.
139 
140 The same process can be achieved with sparse DFAs as well:
141 
142 ```
143 use regex_automata::{SparseDFA, Regex};
144 
145 # fn example() -> Result<(), regex_automata::Error> {
146 let re1 = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}").unwrap();
147 // serialize both
148 let fwd_bytes = re1.forward().to_u16()?.to_sparse()?.to_bytes_native_endian()?;
149 let rev_bytes = re1.reverse().to_u16()?.to_sparse()?.to_bytes_native_endian()?;
150 // now deserialize both---we need to specify the correct type!
151 let fwd: SparseDFA<&[u8], u16> = unsafe { SparseDFA::from_bytes(&fwd_bytes) };
152 let rev: SparseDFA<&[u8], u16> = unsafe { SparseDFA::from_bytes(&rev_bytes) };
153 // finally, reconstruct our regex
154 let re2 = Regex::from_dfas(fwd, rev);
155 
156 // we can use it like normal
157 let text = b"2018-12-24 2016-10-08";
158 let matches: Vec<(usize, usize)> = re2.find_iter(text).collect();
159 assert_eq!(matches, vec![(0, 10), (11, 21)]);
160 # Ok(()) }; example().unwrap()
161 ```
162 
163 Note that unlike dense DFAs, sparse DFAs have no alignment requirements.
164 Conversely, dense DFAs must be be aligned to the same alignment as their
165 state identifier representation.
166 
167 # Support for `no_std`
168 
169 This crate comes with a `std` feature that is enabled by default. When the
170 `std` feature is enabled, the API of this crate will include the facilities
171 necessary for compiling, serializing, deserializing and searching with regular
172 expressions. When the `std` feature is disabled, the API of this crate will
173 shrink such that it only includes the facilities necessary for deserializing
174 and searching with regular expressions.
175 
176 The intended workflow for `no_std` environments is thus as follows:
177 
178 * Write a program with the `std` feature that compiles and serializes a
179   regular expression. Serialization should only happen after first converting
180   the DFAs to use a fixed size state identifier instead of the default `usize`.
181   You may also need to serialize both little and big endian versions of each
182   DFA. (So that's 4 DFAs in total for each regex.)
183 * In your `no_std` environment, follow the examples above for deserializing
184   your previously serialized DFAs into regexes. You can then search with them
185   as you would any regex.
186 
187 Deserialization can happen anywhere. For example, with bytes embedded into a
188 binary or with a file memory mapped at runtime.
189 
190 Note that the
191 [`ucd-generate`](https://github.com/BurntSushi/ucd-generate)
192 tool will do the first step for you with its `dfa` or `regex` sub-commands.
193 
194 # Syntax
195 
196 This crate supports the same syntax as the `regex` crate, since they share the
197 same parser. You can find an exhaustive list of supported syntax in the
198 [documentation for the `regex` crate](https://docs.rs/regex/1.1/regex/#syntax).
199 
200 Currently, there are a couple limitations. In general, this crate does not
201 support zero-width assertions, although they may be added in the future. This
202 includes:
203 
204 * Anchors such as `^`, `$`, `\A` and `\z`.
205 * Word boundary assertions such as `\b` and `\B`.
206 
207 It is possible to run a search that is anchored at the beginning of the input.
208 To do that, set the
209 [`RegexBuilder::anchored`](struct.RegexBuilder.html#method.anchored)
210 option when building a regex. By default, all searches are unanchored.
211 
212 # Differences with the regex crate
213 
214 The main goal of the [`regex`](https://docs.rs/regex) crate is to serve as a
215 general purpose regular expression engine. It aims to automatically balance low
216 compile times, fast search times and low memory usage, while also providing
217 a convenient API for users. In contrast, this crate provides a lower level
218 regular expression interface that is a bit less convenient while providing more
219 explicit control over memory usage and search times.
220 
221 Here are some specific negative differences:
222 
223 * **Compilation can take an exponential amount of time and space** in the size
224   of the regex pattern. While most patterns do not exhibit worst case
225   exponential time, such patterns do exist. For example, `[01]*1[01]{N}` will
226   build a DFA with `2^(N+1)` states. For this reason, untrusted patterns should
227   not be compiled with this library. (In the future, the API may expose an
228   option to return an error if the DFA gets too big.)
229 * This crate does not support sub-match extraction, which can be achieved with
230   the regex crate's "captures" API. This may be added in the future, but is
231   unlikely.
232 * While the regex crate doesn't necessarily sport fast compilation times, the
233   regexes in this crate are almost universally slow to compile, especially when
234   they contain large Unicode character classes. For example, on my system,
235   compiling `\w{3}` with byte classes enabled takes just over 1 second and
236   almost 5MB of memory! (Compiling a sparse regex takes about the same time
237   but only uses about 500KB of memory.) Conversly, compiling the same regex
238   without Unicode support, e.g., `(?-u)\w{3}`, takes under 1 millisecond and
239   less than 5KB of memory. For this reason, you should only use Unicode
240   character classes if you absolutely need them!
241 * This crate does not support regex sets.
242 * This crate does not support zero-width assertions such as `^`, `$`, `\b` or
243   `\B`.
244 * As a lower level crate, this library does not do literal optimizations. In
245   exchange, you get predictable performance regardless of input. The
246   philosophy here is that literal optimizations should be applied at a higher
247   level, although there is no easy support for this in the ecosystem yet.
248 * There is no `&str` API like in the regex crate. In this crate, all APIs
249   operate on `&[u8]`. By default, match indices are guaranteed to fall on
250   UTF-8 boundaries, unless
251   [`RegexBuilder::allow_invalid_utf8`](struct.RegexBuilder.html#method.allow_invalid_utf8)
252   is enabled.
253 
254 With some of the downsides out of the way, here are some positive differences:
255 
256 * Both dense and sparse DFAs can be serialized to raw bytes, and then cheaply
257   deserialized. Deserialization always takes constant time since searching can
258   be performed directly on the raw serialized bytes of a DFA.
259 * This crate was specifically designed so that the searching phase of a DFA has
260   minimal runtime requirements, and can therefore be used in `no_std`
261   environments. While `no_std` environments cannot compile regexes, they can
262   deserialize pre-compiled regexes.
263 * Since this crate builds DFAs ahead of time, it will generally out-perform
264   the `regex` crate on equivalent tasks. The performance difference is likely
265   not large. However, because of a complex set of optimizations in the regex
266   crate (like literal optimizations), an accurate performance comparison may be
267   difficult to do.
268 * Sparse DFAs provide a way to build a DFA ahead of time that sacrifices search
269   performance a small amount, but uses much less storage space. Potentially
270   even less than what the regex crate uses.
271 * This crate exposes DFAs directly, such as
272   [`DenseDFA`](enum.DenseDFA.html) and [`SparseDFA`](enum.SparseDFA.html),
273   which enables one to do less work in some cases. For example, if you only
274   need the end of a match and not the start of a match, then you can use a DFA
275   directly without building a `Regex`, which always requires a second DFA to
276   find the start of a match.
277 * Aside from choosing between dense and sparse DFAs, there are several options
278   for configuring the space usage vs search time trade off. These include
279   things like choosing a smaller state identifier representation, to
280   premultiplying state identifiers and splitting a DFA's alphabet into
281   equivalence classes. Finally, DFA minimization is also provided, but can
282   increase compilation times dramatically.
283 */
284 
285 #![deny(missing_docs)]
286 #![cfg_attr(not(feature = "std"), no_std)]
287 
288 #[cfg(feature = "std")]
289 extern crate core;
290 
291 #[cfg(all(test, feature = "transducer"))]
292 extern crate bstr;
293 #[cfg(feature = "transducer")]
294 extern crate fst;
295 #[cfg(feature = "std")]
296 extern crate regex_syntax;
297 
298 pub use dense::DenseDFA;
299 pub use dfa::DFA;
300 #[cfg(feature = "std")]
301 pub use error::{Error, ErrorKind};
302 pub use regex::Regex;
303 #[cfg(feature = "std")]
304 pub use regex::RegexBuilder;
305 pub use sparse::SparseDFA;
306 pub use state_id::StateID;
307 
308 mod byteorder;
309 mod classes;
310 #[path = "dense.rs"]
311 mod dense_imp;
312 #[cfg(feature = "std")]
313 mod determinize;
314 mod dfa;
315 #[cfg(feature = "std")]
316 mod error;
317 #[cfg(feature = "std")]
318 mod minimize;
319 #[cfg(feature = "std")]
320 #[doc(hidden)]
321 pub mod nfa;
322 mod regex;
323 #[path = "sparse.rs"]
324 mod sparse_imp;
325 #[cfg(feature = "std")]
326 mod sparse_set;
327 mod state_id;
328 #[cfg(feature = "transducer")]
329 mod transducer;
330 
331 /// Types and routines specific to dense DFAs.
332 ///
333 /// This module is the home of [`DenseDFA`](enum.DenseDFA.html) and each of its
334 /// corresponding variant DFA types, such as [`Standard`](struct.Standard.html)
335 /// and [`ByteClass`](struct.ByteClass.html).
336 ///
337 /// This module also contains a [builder](struct.Builder.html) for
338 /// configuring the construction of a dense DFA.
339 pub mod dense {
340     pub use dense_imp::*;
341 }
342 
343 /// Types and routines specific to sparse DFAs.
344 ///
345 /// This module is the home of [`SparseDFA`](enum.SparseDFA.html) and each of
346 /// its corresponding variant DFA types, such as
347 /// [`Standard`](struct.Standard.html) and
348 /// [`ByteClass`](struct.ByteClass.html).
349 ///
350 /// Unlike the [`dense`](../dense/index.html) module, this module does not
351 /// contain a builder specific for sparse DFAs. Instead, the intended way to
352 /// build a sparse DFA is either by using a default configuration with its
353 /// [constructor](enum.SparseDFA.html#method.new),
354 /// or by first
355 /// [configuring the construction of a dense DFA](../dense/struct.Builder.html)
356 /// and then calling
357 /// [`DenseDFA::to_sparse`](../enum.DenseDFA.html#method.to_sparse).
358 pub mod sparse {
359     pub use sparse_imp::*;
360 }
361