1 /*! 2 A low level regular expression library that uses deterministic finite automata. 3 It supports a rich syntax with Unicode support, has extensive options for 4 configuring the best space vs time trade off for your use case and provides 5 support for cheap deserialization of automata for use in `no_std` environments. 6 7 # Overview 8 9 This section gives a brief overview of the primary types in this crate: 10 11 * A [`Regex`](struct.Regex.html) provides a way to search for matches of a 12 regular expression. This includes iterating over matches with both the start 13 and end positions of each match. 14 * A [`RegexBuilder`](struct.RegexBuilder.html) provides a way configure many 15 compilation options for a regex. 16 * A [`DenseDFA`](enum.DenseDFA.html) provides low level access to a DFA that 17 uses a dense representation (uses lots of space, but fast searching). 18 * A [`SparseDFA`](enum.SparseDFA.html) provides the same API as a `DenseDFA`, 19 but uses a sparse representation (uses less space, but slower matching). 20 * A [`DFA`](trait.DFA.html) trait that defines an interface that all DFAs must 21 implement. 22 * Both dense DFAs and sparse DFAs support 23 [serialization to raw bytes](enum.DenseDFA.html#method.to_bytes_little_endian) 24 and 25 [cheap deserialization](enum.DenseDFA.html#method.from_bytes). 26 27 # Example: basic regex searching 28 29 This example shows how to compile a regex using the default configuration 30 and then use it to find matches in a byte string: 31 32 ``` 33 use regex_automata::Regex; 34 35 let re = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}").unwrap(); 36 let text = b"2018-12-24 2016-10-08"; 37 let matches: Vec<(usize, usize)> = re.find_iter(text).collect(); 38 assert_eq!(matches, vec![(0, 10), (11, 21)]); 39 ``` 40 41 # Example: use sparse DFAs 42 43 By default, compiling a regex will use dense DFAs internally. This uses more 44 memory, but executes searches more quickly. If you can abide slower searches 45 (somewhere around 3-5x), then sparse DFAs might make more sense since they can 46 use significantly less space. 47 48 Using sparse DFAs is as easy as using `Regex::new_sparse` instead of 49 `Regex::new`: 50 51 ``` 52 use regex_automata::Regex; 53 54 # fn example() -> Result<(), regex_automata::Error> { 55 let re = Regex::new_sparse(r"[0-9]{4}-[0-9]{2}-[0-9]{2}").unwrap(); 56 let text = b"2018-12-24 2016-10-08"; 57 let matches: Vec<(usize, usize)> = re.find_iter(text).collect(); 58 assert_eq!(matches, vec![(0, 10), (11, 21)]); 59 # Ok(()) }; example().unwrap() 60 ``` 61 62 If you already have dense DFAs for some reason, they can be converted to sparse 63 DFAs and used to build a new `Regex`. For example: 64 65 ``` 66 use regex_automata::Regex; 67 68 # fn example() -> Result<(), regex_automata::Error> { 69 let dense_re = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}").unwrap(); 70 let sparse_re = Regex::from_dfas( 71 dense_re.forward().to_sparse()?, 72 dense_re.reverse().to_sparse()?, 73 ); 74 let text = b"2018-12-24 2016-10-08"; 75 let matches: Vec<(usize, usize)> = sparse_re.find_iter(text).collect(); 76 assert_eq!(matches, vec![(0, 10), (11, 21)]); 77 # Ok(()) }; example().unwrap() 78 ``` 79 80 # Example: deserialize a DFA 81 82 This shows how to first serialize a DFA into raw bytes, and then deserialize 83 those raw bytes back into a DFA. While this particular example is a bit 84 contrived, this same technique can be used in your program to deserialize a 85 DFA at start up time or by memory mapping a file. In particular, 86 deserialization is guaranteed to be cheap because it will always be a constant 87 time operation. 88 89 ``` 90 use regex_automata::{DenseDFA, Regex}; 91 92 # fn example() -> Result<(), regex_automata::Error> { 93 let re1 = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}").unwrap(); 94 // serialize both the forward and reverse DFAs, see note below 95 let fwd_bytes = re1.forward().to_u16()?.to_bytes_native_endian()?; 96 let rev_bytes = re1.reverse().to_u16()?.to_bytes_native_endian()?; 97 // now deserialize both---we need to specify the correct type! 98 let fwd: DenseDFA<&[u16], u16> = unsafe { DenseDFA::from_bytes(&fwd_bytes) }; 99 let rev: DenseDFA<&[u16], u16> = unsafe { DenseDFA::from_bytes(&rev_bytes) }; 100 // finally, reconstruct our regex 101 let re2 = Regex::from_dfas(fwd, rev); 102 103 // we can use it like normal 104 let text = b"2018-12-24 2016-10-08"; 105 let matches: Vec<(usize, usize)> = re2.find_iter(text).collect(); 106 assert_eq!(matches, vec![(0, 10), (11, 21)]); 107 # Ok(()) }; example().unwrap() 108 ``` 109 110 There are a few points worth noting here: 111 112 * We need to extract the raw DFAs used by the regex and serialize those. You 113 can build the DFAs manually yourself using 114 [`dense::Builder`](dense/struct.Builder.html), but using the DFAs from a 115 `Regex` guarantees that the DFAs are built correctly. 116 * We specifically convert the dense DFA to a representation that uses `u16` 117 for its state identifiers using 118 [`DenseDFA::to_u16`](enum.DenseDFA.html#method.to_u16). While this isn't 119 strictly necessary, if we skipped this step, then the serialized bytes would 120 use `usize` for state identifiers, which does not have a fixed size. Using 121 `u16` ensures that we can deserialize this DFA even on platforms with a 122 smaller pointer size. If our DFA is too big for `u16` state identifiers, then 123 one can use `u32` or `u64`. 124 * To convert the DFA to raw bytes, we use the `to_bytes_native_endian` 125 method. In practice, you'll want to use either 126 [`DenseDFA::to_bytes_little_endian`](enum.DenseDFA.html#method.to_bytes_little_endian) 127 or 128 [`DenseDFA::to_bytes_big_endian`](enum.DenseDFA.html#method.to_bytes_big_endian), 129 depending on which platform you're deserializing your DFA from. If you intend 130 to deserialize on either platform, then you'll need to serialize both and 131 deserialize the right one depending on your target's endianness. 132 * Deserializing a DFA requires the use of `unsafe` because the raw bytes must 133 be *trusted*. In particular, while some degree of sanity checks are 134 performed, nothing guarantees the integrity of the DFA's transition table 135 since deserialization is a constant time operation. Since searching with a 136 DFA must be able to follow transitions blindly for performance reasons, 137 giving incorrect bytes to the deserialization API can result in memory 138 unsafety. 139 140 The same process can be achieved with sparse DFAs as well: 141 142 ``` 143 use regex_automata::{SparseDFA, Regex}; 144 145 # fn example() -> Result<(), regex_automata::Error> { 146 let re1 = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}").unwrap(); 147 // serialize both 148 let fwd_bytes = re1.forward().to_u16()?.to_sparse()?.to_bytes_native_endian()?; 149 let rev_bytes = re1.reverse().to_u16()?.to_sparse()?.to_bytes_native_endian()?; 150 // now deserialize both---we need to specify the correct type! 151 let fwd: SparseDFA<&[u8], u16> = unsafe { SparseDFA::from_bytes(&fwd_bytes) }; 152 let rev: SparseDFA<&[u8], u16> = unsafe { SparseDFA::from_bytes(&rev_bytes) }; 153 // finally, reconstruct our regex 154 let re2 = Regex::from_dfas(fwd, rev); 155 156 // we can use it like normal 157 let text = b"2018-12-24 2016-10-08"; 158 let matches: Vec<(usize, usize)> = re2.find_iter(text).collect(); 159 assert_eq!(matches, vec![(0, 10), (11, 21)]); 160 # Ok(()) }; example().unwrap() 161 ``` 162 163 Note that unlike dense DFAs, sparse DFAs have no alignment requirements. 164 Conversely, dense DFAs must be be aligned to the same alignment as their 165 state identifier representation. 166 167 # Support for `no_std` 168 169 This crate comes with a `std` feature that is enabled by default. When the 170 `std` feature is enabled, the API of this crate will include the facilities 171 necessary for compiling, serializing, deserializing and searching with regular 172 expressions. When the `std` feature is disabled, the API of this crate will 173 shrink such that it only includes the facilities necessary for deserializing 174 and searching with regular expressions. 175 176 The intended workflow for `no_std` environments is thus as follows: 177 178 * Write a program with the `std` feature that compiles and serializes a 179 regular expression. Serialization should only happen after first converting 180 the DFAs to use a fixed size state identifier instead of the default `usize`. 181 You may also need to serialize both little and big endian versions of each 182 DFA. (So that's 4 DFAs in total for each regex.) 183 * In your `no_std` environment, follow the examples above for deserializing 184 your previously serialized DFAs into regexes. You can then search with them 185 as you would any regex. 186 187 Deserialization can happen anywhere. For example, with bytes embedded into a 188 binary or with a file memory mapped at runtime. 189 190 Note that the 191 [`ucd-generate`](https://github.com/BurntSushi/ucd-generate) 192 tool will do the first step for you with its `dfa` or `regex` sub-commands. 193 194 # Syntax 195 196 This crate supports the same syntax as the `regex` crate, since they share the 197 same parser. You can find an exhaustive list of supported syntax in the 198 [documentation for the `regex` crate](https://docs.rs/regex/1.1/regex/#syntax). 199 200 Currently, there are a couple limitations. In general, this crate does not 201 support zero-width assertions, although they may be added in the future. This 202 includes: 203 204 * Anchors such as `^`, `$`, `\A` and `\z`. 205 * Word boundary assertions such as `\b` and `\B`. 206 207 It is possible to run a search that is anchored at the beginning of the input. 208 To do that, set the 209 [`RegexBuilder::anchored`](struct.RegexBuilder.html#method.anchored) 210 option when building a regex. By default, all searches are unanchored. 211 212 # Differences with the regex crate 213 214 The main goal of the [`regex`](https://docs.rs/regex) crate is to serve as a 215 general purpose regular expression engine. It aims to automatically balance low 216 compile times, fast search times and low memory usage, while also providing 217 a convenient API for users. In contrast, this crate provides a lower level 218 regular expression interface that is a bit less convenient while providing more 219 explicit control over memory usage and search times. 220 221 Here are some specific negative differences: 222 223 * **Compilation can take an exponential amount of time and space** in the size 224 of the regex pattern. While most patterns do not exhibit worst case 225 exponential time, such patterns do exist. For example, `[01]*1[01]{N}` will 226 build a DFA with `2^(N+1)` states. For this reason, untrusted patterns should 227 not be compiled with this library. (In the future, the API may expose an 228 option to return an error if the DFA gets too big.) 229 * This crate does not support sub-match extraction, which can be achieved with 230 the regex crate's "captures" API. This may be added in the future, but is 231 unlikely. 232 * While the regex crate doesn't necessarily sport fast compilation times, the 233 regexes in this crate are almost universally slow to compile, especially when 234 they contain large Unicode character classes. For example, on my system, 235 compiling `\w{3}` with byte classes enabled takes just over 1 second and 236 almost 5MB of memory! (Compiling a sparse regex takes about the same time 237 but only uses about 500KB of memory.) Conversly, compiling the same regex 238 without Unicode support, e.g., `(?-u)\w{3}`, takes under 1 millisecond and 239 less than 5KB of memory. For this reason, you should only use Unicode 240 character classes if you absolutely need them! 241 * This crate does not support regex sets. 242 * This crate does not support zero-width assertions such as `^`, `$`, `\b` or 243 `\B`. 244 * As a lower level crate, this library does not do literal optimizations. In 245 exchange, you get predictable performance regardless of input. The 246 philosophy here is that literal optimizations should be applied at a higher 247 level, although there is no easy support for this in the ecosystem yet. 248 * There is no `&str` API like in the regex crate. In this crate, all APIs 249 operate on `&[u8]`. By default, match indices are guaranteed to fall on 250 UTF-8 boundaries, unless 251 [`RegexBuilder::allow_invalid_utf8`](struct.RegexBuilder.html#method.allow_invalid_utf8) 252 is enabled. 253 254 With some of the downsides out of the way, here are some positive differences: 255 256 * Both dense and sparse DFAs can be serialized to raw bytes, and then cheaply 257 deserialized. Deserialization always takes constant time since searching can 258 be performed directly on the raw serialized bytes of a DFA. 259 * This crate was specifically designed so that the searching phase of a DFA has 260 minimal runtime requirements, and can therefore be used in `no_std` 261 environments. While `no_std` environments cannot compile regexes, they can 262 deserialize pre-compiled regexes. 263 * Since this crate builds DFAs ahead of time, it will generally out-perform 264 the `regex` crate on equivalent tasks. The performance difference is likely 265 not large. However, because of a complex set of optimizations in the regex 266 crate (like literal optimizations), an accurate performance comparison may be 267 difficult to do. 268 * Sparse DFAs provide a way to build a DFA ahead of time that sacrifices search 269 performance a small amount, but uses much less storage space. Potentially 270 even less than what the regex crate uses. 271 * This crate exposes DFAs directly, such as 272 [`DenseDFA`](enum.DenseDFA.html) and [`SparseDFA`](enum.SparseDFA.html), 273 which enables one to do less work in some cases. For example, if you only 274 need the end of a match and not the start of a match, then you can use a DFA 275 directly without building a `Regex`, which always requires a second DFA to 276 find the start of a match. 277 * Aside from choosing between dense and sparse DFAs, there are several options 278 for configuring the space usage vs search time trade off. These include 279 things like choosing a smaller state identifier representation, to 280 premultiplying state identifiers and splitting a DFA's alphabet into 281 equivalence classes. Finally, DFA minimization is also provided, but can 282 increase compilation times dramatically. 283 */ 284 285 #![deny(missing_docs)] 286 #![cfg_attr(not(feature = "std"), no_std)] 287 288 #[cfg(feature = "std")] 289 extern crate core; 290 291 #[cfg(all(test, feature = "transducer"))] 292 extern crate bstr; 293 #[cfg(feature = "transducer")] 294 extern crate fst; 295 #[cfg(feature = "std")] 296 extern crate regex_syntax; 297 298 pub use dense::DenseDFA; 299 pub use dfa::DFA; 300 #[cfg(feature = "std")] 301 pub use error::{Error, ErrorKind}; 302 pub use regex::Regex; 303 #[cfg(feature = "std")] 304 pub use regex::RegexBuilder; 305 pub use sparse::SparseDFA; 306 pub use state_id::StateID; 307 308 mod byteorder; 309 mod classes; 310 #[path = "dense.rs"] 311 mod dense_imp; 312 #[cfg(feature = "std")] 313 mod determinize; 314 mod dfa; 315 #[cfg(feature = "std")] 316 mod error; 317 #[cfg(feature = "std")] 318 mod minimize; 319 #[cfg(feature = "std")] 320 #[doc(hidden)] 321 pub mod nfa; 322 mod regex; 323 #[path = "sparse.rs"] 324 mod sparse_imp; 325 #[cfg(feature = "std")] 326 mod sparse_set; 327 mod state_id; 328 #[cfg(feature = "transducer")] 329 mod transducer; 330 331 /// Types and routines specific to dense DFAs. 332 /// 333 /// This module is the home of [`DenseDFA`](enum.DenseDFA.html) and each of its 334 /// corresponding variant DFA types, such as [`Standard`](struct.Standard.html) 335 /// and [`ByteClass`](struct.ByteClass.html). 336 /// 337 /// This module also contains a [builder](struct.Builder.html) for 338 /// configuring the construction of a dense DFA. 339 pub mod dense { 340 pub use dense_imp::*; 341 } 342 343 /// Types and routines specific to sparse DFAs. 344 /// 345 /// This module is the home of [`SparseDFA`](enum.SparseDFA.html) and each of 346 /// its corresponding variant DFA types, such as 347 /// [`Standard`](struct.Standard.html) and 348 /// [`ByteClass`](struct.ByteClass.html). 349 /// 350 /// Unlike the [`dense`](../dense/index.html) module, this module does not 351 /// contain a builder specific for sparse DFAs. Instead, the intended way to 352 /// build a sparse DFA is either by using a default configuration with its 353 /// [constructor](enum.SparseDFA.html#method.new), 354 /// or by first 355 /// [configuring the construction of a dense DFA](../dense/struct.Builder.html) 356 /// and then calling 357 /// [`DenseDFA::to_sparse`](../enum.DenseDFA.html#method.to_sparse). 358 pub mod sparse { 359 pub use sparse_imp::*; 360 } 361