# re2 - safer regular expressions in Ruby
Ruby bindings to [RE2][], a "fast, safe, thread-friendly alternative to
backtracking regular expression engines like those used in PCRE, Perl, and
Python".
**Current version:** 2.7.0
**Bundled RE2 version:** libre2.11 (2023-11-01)
```ruby
RE2('h.*o').full_match?("hello") #=> true
RE2('e').full_match?("hello") #=> false
RE2('h.*o').partial_match?("hello") #=> true
RE2('e').partial_match?("hello") #=> true
RE2('(\w+):(\d+)').full_match("ruby:1234")
#=> #<RE2::MatchData "ruby:1234" 1:"ruby" 2:"1234">
```
## Table of Contents
* [Why RE2?](#why-re2)
* [Usage](#usage)
    * [Compiling regular expressions](#compiling-regular-expressions)
    * [Matching interface](#matching-interface)
    * [Submatch extraction](#submatch-extraction)
    * [Scanning text incrementally](#scanning-text-incrementally)
    * [Searching simultaneously](#searching-simultaneously)
    * [Encoding](#encoding)
* [Requirements](#requirements)
    * [Native gems](#native-gems)
    * [Verifying the gems](#verifying-the-gems)
    * [Installing the `ruby` platform gem](#installing-the-ruby-platform-gem)
    * [Using system libraries](#using-system-libraries)
* [Thanks](#thanks)
* [Contact](#contact)
* [License](#license)
    * [Dependencies](#dependencies)
## Why RE2?
While [recent
versions](https://www.ruby-lang.org/en/news/2022/12/25/ruby-3-2-0-released/) of
Ruby have improved defences against [regular expression denial of service
(ReDoS) attacks](https://en.wikipedia.org/wiki/ReDoS), it is still possible for
users to craft malicious patterns that take a long time to process by using
syntactic features such as [back-references, lookaheads and possessive
quantifiers](https://bugs.ruby-lang.org/issues/19104#note-3). RE2 aims to
eliminate ReDoS by design:
> **_Safety is RE2's raison d'être._**
>
> RE2 was designed and implemented with an explicit goal of being able to
> handle regular expressions from untrusted users without risk. One of its
> primary guarantees is that the match time is linear in the length of the
> input string. It was also written with production concerns in mind: the
> parser, the compiler and the execution engines limit their memory usage by
> working within a configurable budget – failing gracefully when exhausted –
> and they avoid stack overflow by eschewing recursion.
— [Why RE2?](https://github.com/google/re2/wiki/WhyRE2)
## Usage
Install re2 as a dependency:
```ruby
# In your Gemfile
gem "re2"
# Or without Bundler
gem install re2
```
Include in your code:
```ruby
require "re2"
```
Full API documentation automatically generated from the latest version is
available at https://mudge.name/re2/.
While re2 uses the same naming scheme as Ruby's built-in regular expression
library (with [`Regexp`](https://mudge.name/re2/RE2/Regexp.html) and
[`MatchData`](https://mudge.name/re2/RE2/MatchData.html)), its API is slightly
different:
### Compiling regular expressions
> [!WARNING]
> RE2's regular expression syntax differs from PCRE and Ruby's built-in
> [`Regexp`](https://docs.ruby-lang.org/en/3.2/Regexp.html) library, see the
> [official syntax page](https://github.com/google/re2/wiki/Syntax) for more
> details.
The core class is [`RE2::Regexp`](https://mudge.name/re2/RE2/Regexp.html) which
takes a regular expression as a string and compiles it internally into an `RE2`
object. A global function `RE2` is available to concisely compile a new
`RE2::Regexp`:
```ruby
re = RE2('(\w+):(\d+)')
#=> #<RE2::Regexp /(\w+):(\d+)/>
re.ok? #=> true
re = RE2('abc(def')
re.ok? #=> false
re.error #=> "missing ): abc(def"
```
> [!TIP]
> Note the use of *single quotes* when passing the regular expression as
> a string to `RE2` so that the backslashes aren't interpreted as escapes.
When compiling a regular expression, an optional second argument can be used to change RE2's default options. For example, setting `log_errors` to `false` stops RE2 logging syntax and execution errors to stderr:
```ruby
RE2('abc)def', log_errors: false)
```
See the API documentation for [`RE2::Regexp#initialize`](https://mudge.name/re2/RE2/Regexp.html#initialize-instance_method) for all the available options.
### Matching interface
There are two main methods for matching: [`RE2::Regexp#full_match?`](https://mudge.name/re2/RE2/Regexp.html#full_match%3F-instance_method) requires the regular expression to match the entire input text, while [`RE2::Regexp#partial_match?`](https://mudge.name/re2/RE2/Regexp.html#partial_match%3F-instance_method) looks for a match anywhere within the input text. Both return a boolean indicating whether the match was successful.
```ruby
RE2('h.*o').full_match?("hello") #=> true
RE2('e').full_match?("hello") #=> false
RE2('h.*o').partial_match?("hello") #=> true
RE2('e').partial_match?("hello") #=> true
```
### Submatch extraction
> [!TIP]
> Only extract the number of submatches you need as performance is improved
> with fewer submatches (with the best performance when avoiding submatch
> extraction altogether).
Both matching methods have a second form that can extract submatches as [`RE2::MatchData`](https://mudge.name/re2/RE2/MatchData.html) objects: [`RE2::Regexp#full_match`](https://mudge.name/re2/RE2/Regexp.html#full_match-instance_method) and [`RE2::Regexp#partial_match`](https://mudge.name/re2/RE2/Regexp.html#partial_match-instance_method).
```ruby
m = RE2('(\w+):(\d+)').full_match("ruby:1234")
#=> #<RE2::MatchData "ruby:1234" 1:"ruby" 2:"1234">
m[0] #=> "ruby:1234"
m[1] #=> "ruby"
m[2] #=> "1234"
m = RE2('(\w+):(\d+)').full_match("r")
#=> nil
```
`RE2::MatchData` supports retrieving submatches by numeric index or by name if present in the regular expression:
```ruby
m = RE2('(?P<word>\w+):(?P<number>\d+)').full_match("ruby:1234")
#=> #<RE2::MatchData "ruby:1234" 1:"ruby" 2:"1234">
m["word"] #=> "ruby"
m["number"] #=> "1234"
```
`RE2::MatchData` objects can also be used with Ruby's [pattern matching](https://docs.ruby-lang.org/en/3.2/syntax/pattern_matching_rdoc.html):
```ruby
case RE2('(\w+):(\d+)').full_match("ruby:1234")
in [word, number]
  puts "Word: #{word}, Number: #{number}"
else
  puts "No match"
end
# Word: ruby, Number: 1234

case RE2('(?P<word>\w+):(?P<number>\d+)').full_match("ruby:1234")
in word:, number:
  puts "Word: #{word}, Number: #{number}"
else
  puts "No match"
end
# Word: ruby, Number: 1234
```
By default, both `full_match` and `partial_match` will extract all submatches into the `RE2::MatchData` based on the number of capturing groups in the regular expression. This can be changed by passing an optional second argument when matching:
```ruby
m = RE2('(\w+):(\d+)').full_match("ruby:1234", submatches: 1)
#=> #<RE2::MatchData "ruby:1234" 1:"ruby">
```
> [!WARNING]
> If the regular expression has no capturing groups or you pass `submatches:
> 0`, the matching method will behave like its `full_match?` or
> `partial_match?` form and only return `true` or `false` rather than
> `RE2::MatchData`.
### Scanning text incrementally
If you want to repeatedly match regular expressions from the start of some input text, you can use [`RE2::Regexp#scan`](https://mudge.name/re2/RE2/Regexp.html#scan-instance_method) to return an `Enumerable` [`RE2::Scanner`](https://mudge.name/re2/RE2/Scanner.html) object which will lazily consume matches as you iterate over it:
```ruby
scanner = RE2('(\w+)').scan(" one two three 4")
scanner.each do |match|
  puts match.inspect
end
# ["one"]
# ["two"]
# ["three"]
# ["4"]
```
### Searching simultaneously
[`RE2::Set`](https://mudge.name/re2/RE2/Set.html) represents a collection of
regular expressions that can be searched for simultaneously. Calling
[`RE2::Set#add`](https://mudge.name/re2/RE2/Set.html#add-instance_method) with
a regular expression will return the integer index at which it is stored within
the set. After all patterns have been added, the set can be compiled using
[`RE2::Set#compile`](https://mudge.name/re2/RE2/Set.html#compile-instance_method),
and then
[`RE2::Set#match`](https://mudge.name/re2/RE2/Set.html#match-instance_method)
will return an array containing the indices of all the patterns that matched.
```ruby
set = RE2::Set.new
set.add("abc") #=> 0
set.add("def") #=> 1
set.add("ghi") #=> 2
set.compile #=> true
set.match("abcdefghi") #=> [0, 1, 2]
set.match("ghidefabc") #=> [2, 1, 0]
```
### Encoding
> [!WARNING]
> Note RE2 only supports UTF-8 and ISO-8859-1 encoding so strings will be
> returned in UTF-8 by default or ISO-8859-1 if the `:utf8` option for the
> `RE2::Regexp` is set to `false` (any other encoding's behaviour is undefined).
For backward compatibility, re2 won't automatically convert string inputs to
the right encoding, so this is the responsibility of the caller, e.g.
```ruby
# By default, RE2 will process patterns and text as UTF-8
RE2(non_utf8_pattern.encode("UTF-8")).match(non_utf8_text.encode("UTF-8"))
# If the :utf8 option is false, RE2 will process patterns and text as ISO-8859-1
RE2(non_latin1_pattern.encode("ISO-8859-1"), utf8: false).match(non_latin1_text.encode("ISO-8859-1"))
```
## Requirements
This gem requires the following to run:
* [Ruby](https://www.ruby-lang.org/en/) 2.6 to 3.3
It supports the following RE2 ABI versions:
* libre2.0 (prior to release 2020-03-02) to libre2.11 (2023-07-01 to 2023-11-01)
### Native gems
Where possible, a pre-compiled native gem will be provided for the following platforms:
* Linux
    * `aarch64-linux` and `arm-linux` (requires [glibc](https://www.gnu.org/software/libc/) 2.29+)
    * `x86-linux` and `x86_64-linux` (requires [glibc](https://www.gnu.org/software/libc/) 2.17+)
    * [musl](https://musl.libc.org/)-based systems such as [Alpine](https://alpinelinux.org) are supported as long as a [glibc-compatible library is installed](https://wiki.alpinelinux.org/wiki/Running_glibc_programs)
* macOS `x86_64-darwin` and `arm64-darwin`
* Windows `x64-mingw32` and `x64-mingw-ucrt`
### Verifying the gems
SHA256 checksums are included in the [release notes](https://github.com/mudge/re2/releases) for each version and can be checked with `sha256sum`, e.g.
```console
$ gem fetch re2 -v 2.6.0
Fetching re2-2.6.0-arm64-darwin.gem
Downloaded re2-2.6.0-arm64-darwin
$ sha256sum re2-2.6.0-arm64-darwin.gem
ba6fda7a29cd16179d5401c1b4917ba204c92e5ca9d25df80d840ed76fca439f re2-2.6.0-arm64-darwin.gem
```
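Where `sha256sum` is not available, the same check can be sketched with Ruby's standard `digest` library. The `checksum_matches?` helper is a hypothetical name used for illustration. The file name and checksum are the ones from the example above.

```ruby
require "digest"

# Hypothetical helper: compare a downloaded gem's SHA256 digest with the
# checksum copied from the release notes.
def checksum_matches?(path, expected_hex)
  Digest::SHA256.file(path).hexdigest == expected_hex.strip.downcase
end

gem_path = "re2-2.6.0-arm64-darwin.gem"
expected = "ba6fda7a29cd16179d5401c1b4917ba204c92e5ca9d25df80d840ed76fca439f"

# Only run the check if the gem has actually been fetched.
if File.exist?(gem_path)
  puts checksum_matches?(gem_path, expected) ? "checksum OK" : "checksum MISMATCH"
end
```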
[GPG](https://www.gnupg.org/) signatures are attached to each release (the assets ending in `.sig`) and can be verified if you import [our signing key `0x39AC3530070E0F75`](https://mudge.name/39AC3530070E0F75.asc) (or fetch it from a public keyserver, e.g. `gpg --keyserver keyserver.ubuntu.com --recv-key 0x39AC3530070E0F75`):
```console
$ gpg --verify re2-2.6.0-arm64-darwin.gem.sig re2-2.6.0-arm64-darwin.gem
gpg: Signature made Wed 27 Dec 19:26:53 2023 GMT
gpg: using RSA key 702609D9C790F45B577D7BEC39AC3530070E0F75
gpg: Good signature from "Paul Mucur <mudge@mudge.name>" [unknown]
gpg: aka "Paul Mucur <paul@ghostcassette.com>" [unknown]
gpg: WARNING: This key is not certified with a trusted signature!
gpg: There is no indication that the signature belongs to the owner.
Primary key fingerprint: 7026 09D9 C790 F45B 577D 7BEC 39AC 3530 070E 0F75
```
The fingerprint should be as shown above or you can independently verify it with the ones shown in the footer of https://mudge.name.
### Installing the `ruby` platform gem
> [!WARNING]
> We strongly recommend using the native gems where possible to avoid the need
> for compiling the C++ extension and its dependencies which will take longer
> and be less reliable.
If you wish to compile the gem, you will need to explicitly install the `ruby` platform gem:
```ruby
# In your Gemfile with Bundler 2.3.18+
gem "re2", force_ruby_platform: true
# With Bundler 2.1+
bundle config set force_ruby_platform true
# With older versions of Bundler
bundle config force_ruby_platform true
# Without Bundler
gem install re2 --platform=ruby
```
You will need a full compiler toolchain for compiling Ruby C extensions (see
[Nokogiri's "The Compiler
Toolchain"](https://nokogiri.org/tutorials/installing_nokogiri.html#appendix-a-the-compiler-toolchain))
plus the toolchain required for compiling the vendored version of RE2 and its
dependency [Abseil][] which includes
[CMake](https://cmake.org) and a compiler with C++14 support such as
[clang](http://clang.llvm.org/) 3.4 or [gcc](https://gcc.gnu.org/) 5. On
Windows, you'll also need pkgconf 2.1.0+ to avoid [`undefined reference`
errors](https://github.com/pkgconf/pkgconf/issues/322) when attempting to
compile Abseil.
### Using system libraries
If you already have RE2 installed, you can instruct the gem not to use its own vendored version:
```ruby
gem install re2 --platform=ruby -- --enable-system-libraries
# If RE2 is not installed in /usr/local, /usr, or /opt/homebrew:
gem install re2 --platform=ruby -- --enable-system-libraries --with-re2-dir=/path/to/re2/prefix
```
Alternatively, you can set the `RE2_USE_SYSTEM_LIBRARIES` environment variable instead of passing `--enable-system-libraries` to the `gem` command.
## Thanks
* Thanks to [Jason Woods](https://github.com/driskell) who contributed the
original implementations of `RE2::MatchData#begin` and `RE2::MatchData#end`.
* Thanks to [Stefano Rivera](https://github.com/stefanor) who first contributed
C++11 support.
* Thanks to [Stan Hu](https://github.com/stanhu) for reporting a bug with empty
patterns and `RE2::Regexp#scan`, contributing support for libre2.11
(2023-07-01) and for vendoring RE2 and abseil and compiling native gems in
2.0.
* Thanks to [Sebastian Reitenbach](https://github.com/buzzdeee) for reporting
the deprecation and removal of the `utf8` encoding option in RE2.
* Thanks to [Sergio Medina](https://github.com/serch) for reporting a bug when
using `RE2::Scanner#scan` with an invalid regular expression.
* Thanks to [Pritam Baral](https://github.com/pritambaral) for contributing the
initial support for `RE2::Set`.
* Thanks to [Mike Dalessio](https://github.com/flavorjones) for reviewing the
precompilation of native gems in 2.0.
* Thanks to [Peter Zhu](https://github.com/peterzhu2118) for
[ruby_memcheck](https://github.com/Shopify/ruby_memcheck) and helping find
the memory leaks fixed in 2.1.3.
* Thanks to [Jean Boussier](https://github.com/byroot) for contributing the
switch to Ruby's `TypedData` API and the resulting garbage collection
improvements in 2.4.0.
* Thanks to [Manuel Jacob](https://github.com/manueljacob) for reporting a bug
when passing strings with null bytes.
## Contact
All issues and suggestions should go to [GitHub Issues](https://github.com/mudge/re2/issues).
## License
This library is licensed under the BSD 3-Clause License, see `LICENSE.txt`.
Copyright © 2010, Paul Mucur.
### Dependencies
The source code of [RE2][] is distributed in the `ruby` platform gem. This code is licensed under the BSD 3-Clause License, see `LICENSE-DEPENDENCIES.txt`.

The source code of [Abseil][] is distributed in the `ruby` platform gem. This code is licensed under the Apache License 2.0, see `LICENSE-DEPENDENCIES.txt`.

[RE2]: https://github.com/google/re2
[Abseil]: https://abseil.io