# CharacterSet
This is a C-extended Ruby gem to work with sets of Unicode codepoints.
It can [read](#parseinitialize) and [write](#write) sets of codepoints in various formats and it implements the stdlib `Set` interface for them.
It also offers a [way of scrubbing and scanning characters in Strings](#interact-with-strings) that is more semantic than the stdlib's `Regexp` and `String` methods and consistently outperforms them (see [benchmarks](./BENCHMARK.md)).
Many parts can be used independently, e.g.:
- `CharacterSet::Character`
- `CharacterSet::ExpressionConverter`
- `CharacterSet::Parser`
- `CharacterSet::Writer`
## Usage
### Usage examples
```ruby
CharacterSet.url_query.cover?('?a=(b$c;)') # => true
CharacterSet.non_ascii.delete_in!(string)
CharacterSet.emoji.sample(5) # => ["⛷", "👈", "🌞", "♑", "⛈"]
```
### Parse/Initialize
These all produce a `CharacterSet` containing `a`, `b` and `c`:
```ruby
CharacterSet['a', 'b', 'c']
CharacterSet[97, 98, 99]
CharacterSet.new('a'..'c')
CharacterSet.new(0x61..0x63)
CharacterSet.of('abacababa')
CharacterSet.parse('[a-c]')
CharacterSet.parse('\U00000061-\U00000063')
```
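The forms above are interchangeable because a character and its codepoint are two views of the same member. The following stdlib-only sketch (it does not call `CharacterSet` at all) shows that mapping; the variable names are chosen purely for illustration.

```ruby
# Stdlib-only illustration; no CharacterSet calls are involved.
chars      = ('a'..'c').to_a     # ["a", "b", "c"]
codepoints = (0x61..0x63).to_a   # [97, 98, 99]

# String#ord maps a one-character String to its codepoint ...
chars_as_codepoints = chars.map(&:ord)

# ... and Array#pack('U*') maps codepoints back to a UTF-8 String.
codepoints_as_string = codepoints.pack('U*')

# 'abacababa' contains only these three distinct characters.
distinct = 'abacababa'.chars.uniq.sort
```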
If the gems [`regexp_parser`](https://github.com/ammar/regexp_parser) and [`regexp_property_values`](https://github.com/jaynetics/regexp_property_values) are installed, `Regexp` instances and Unicode property names can also be read:
```ruby
CharacterSet.of(/./) # => #<CharacterSet (size: 1112064)>
CharacterSet.of_property('Thai') # => #<CharacterSet (size: 86)>
require 'character_set/core_ext/regexp_ext'
/[\D&&[:ascii:]&&\p{emoji}]/.character_set.size # => 2
```
### Predefined utility sets
`ascii`, `ascii_alnum`, `ascii_letter`, `assigned`, `bmp`, `crypt`, `emoji`, `newline`, `surrogate`, `unicode`, `url_fragment`, `url_host`, `url_path`, `url_query`, `whitespace`
```ruby
CharacterSet.ascii # => #<CharacterSet (size: 128)>
# all can be prefixed with `non_`, e.g.
CharacterSet.non_ascii
```
### Interact with Strings
`CharacterSet` can replace some types of `String` handling with better performance than the stdlib.
`#used_by?` and `#cover?` can replace some `Regexp#match?` calls:
```ruby
CharacterSet.ascii.used_by?('Tüür') # => true
CharacterSet.ascii.cover?('Tüür') # => false
CharacterSet.ascii.cover?('Tr') # => true
```
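For comparison, here is a rough sketch of the stdlib `Regexp#match?` calls that these two methods can stand in for. It only assumes that "used by" means "at least one character is in the set" and "cover" means "every character is in the set", as the results above suggest. The string is written with `\u` escapes and is the same `'Tüür'` as above.

```ruby
text = "T\u00FC\u00FCr" # same content as the example string above

# Roughly the question `CharacterSet.ascii.used_by?` answers:
# does the string contain at least one ASCII character?
uses_ascii = text.match?(/[[:ascii:]]/)

# Roughly the question `CharacterSet.ascii.cover?` answers:
# does the string contain nothing but ASCII characters?
covered_by_ascii = !text.match?(/[^[:ascii:]]/)

# A pure-ASCII string is covered.
tr_covered = !'Tr'.match?(/[^[:ascii:]]/)
```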
`#delete_in(!)` and `#keep_in(!)` can replace `String#gsub(!)` and the like:
```ruby
string = 'Tüür'
CharacterSet.ascii.delete_in(string) # => 'üü'
CharacterSet.ascii.keep_in(string) # => 'Tr'
string # => 'Tüür'
CharacterSet.ascii.delete_in!(string) # => 'üü'
string # => 'üü'
CharacterSet.ascii.keep_in!(string) # => ''
string # => ''
```
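As a point of reference, the stdlib calls being replaced might look like the sketch below. It is an approximation built on POSIX bracket expressions, not the gem's implementation, and the variable names are illustrative only.

```ruby
string = "T\u00FC\u00FCr" # same content as the example string above

# Non-destructive variants, comparable to #delete_in and #keep_in.
without_ascii = string.gsub(/[[:ascii:]]/, '')   # drops "T" and "r"
only_ascii    = string.gsub(/[^[:ascii:]]/, '')  # keeps "T" and "r"

# Destructive variant, comparable to #delete_in!.
copy = string.dup
copy.gsub!(/[[:ascii:]]/, '')
```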
`#count_in` and `#scan` can replace `String#count` and `String#scan`:
```ruby
CharacterSet.non_ascii.count_in('Tüür') # => 2
CharacterSet.non_ascii.scan('Tüür') # => ['ü', 'ü']
```
There is also a core extension for String interaction.
```ruby
require 'character_set/core_ext/string_ext'
"a\rb".character_set & CharacterSet.newline # => CharacterSet["\r"]
"a\rb".uses_character_set?(CharacterSet['ä', 'ö', 'ü']) # => false
"a\rb".covered_by_character_set?(CharacterSet.newline) # => false
# predefined sets can also be referenced via Symbols
"a\rb".covered_by_character_set?(:ascii) # => true
"a\rb".delete_character_set(:newline) # => 'ab'
# etc.
```
### Manipulate
Use [any Ruby Set method](https://ruby-doc.org/stdlib-2.5.1/libdoc/set/rdoc/Set.html), e.g. `#+`, `#-`, `#&`, `#^`, `#intersect?`, `#<`, `#>` etc. to interact with other sets. Use `#add`, `#delete`, `#include?` etc. to change or check for members.
Where appropriate, methods take both chars and codepoints, e.g.:
```ruby
CharacterSet['a'].add('b') # => CharacterSet['a', 'b']
CharacterSet['a'].add(98) # => CharacterSet['a', 'b']
CharacterSet['a'].include?('a') # => true
CharacterSet['a'].include?(0x61) # => true
```
`#inversion` can be used to create a `CharacterSet` with all valid Unicode codepoints that are not in the current set:
```ruby
non_a = CharacterSet['a'].inversion
# => #<CharacterSet (size: 1112063)>
non_a.include?('a') # => false
non_a.include?('ü') # => true
# surrogate pair halves are not included by default
CharacterSet['a'].inversion(include_surrogates: true)
# => #<CharacterSet (size: 1114111)>
```
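The default inversion size follows from the layout of the Unicode codespace. This stdlib-only sketch reproduces the arithmetic; the variable names are illustrative.

```ruby
# The Unicode codespace runs from U+0000 to U+10FFFF.
all_codepoints = (0x0..0x10FFFF).size    # 1_114_112

# Surrogate halves occupy U+D800..U+DFFF and are not valid scalar values.
surrogates = (0xD800..0xDFFF).size       # 2_048

# Valid codepoints, excluding surrogates.
valid_codepoints = all_codepoints - surrogates

# Inverting a one-member set leaves every valid codepoint except that member.
default_inversion_size = valid_codepoints - 1
```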
`#case_insensitive` can be used to create a `CharacterSet` where upper/lower case codepoints are supplemented:
```ruby
CharacterSet['1', 'A'].case_insensitive # => CharacterSet['1', 'A', 'a']
```
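Conceptually, this adds the other-case counterpart of each member. A minimal stdlib approximation, assuming simple `String#upcase`/`String#downcase` mappings (the gem's actual case handling may differ, e.g. for multi-character or locale-specific foldings):

```ruby
members = ['1', 'A']

# Add the upper- and lower-case variant of every member, then deduplicate.
# Characters without case, such as '1', map to themselves.
case_insensitive = members.flat_map { |c| [c, c.upcase, c.downcase] }.uniq.sort
```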
### Write
```ruby
set = CharacterSet['a', 'b', 'c', 'j', '-']
# safely printable ASCII chars are not escaped by default
set.to_s # => 'a-cj\x2D'
set.to_s(escape_all: true) # => '\x61-\x63\x6A\x2D'
# brackets may be added
set.to_s(in_brackets: true) # => '[a-cj\x2D]'
# the default escape format is Ruby/ES6 compatible, others are available
set = CharacterSet['a', 'b', 'c', 'ɘ', '🤩']
set.to_s # => 'a-c\u0258\u{1F929}'
set.to_s(format: 'U+') # => 'a-cU+0258U+1F929'
set.to_s(format: 'Python') # => "a-c\u0258\U0001F929"
set.to_s(format: 'raw') # => 'a-cɘ🤩'
# or pass a block
set.to_s { |char| "[#{char.codepoint}]" } # => "a-c[600][129321]"
set.to_s(escape_all: true) { |c| "<#{c.hex}>" } # => "<61>-<63><258><1F929>"
# disable abbreviation (grouping of codepoints in ranges)
set.to_s(abbreviate: false) # => "abc\u0258\u{1F929}"
# astral members require some trickery if we want to target environments
# that are based on UTF-16 or "UCS-2 with surrogates", such as JavaScript.
set = CharacterSet['a', 'b', '🤩', '🤪', '🤫']
# Use #to_s_with_surrogate_ranges e.g. for JavaScript:
set.to_s_with_surrogate_ranges
# => '(?:[ab]|\uD83E[\uDD29-\uDD2B])'
# Or use #to_s_with_surrogate_alternation if such surrogate set pairs
# don't work in your target environment:
set.to_s_with_surrogate_alternation
# => '(?:[ab]|\uD83E\uDD29|\uD83E\uDD2A|\uD83E\uDD2B)'
```
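The surrogate values in the output above follow from the standard UTF-16 encoding of astral codepoints. The sketch below shows that calculation using only the stdlib; `surrogate_pair` is a hypothetical helper written for this illustration, not part of the gem's API.

```ruby
# Hypothetical helper: split an astral codepoint into its UTF-16 surrogates.
def surrogate_pair(codepoint)
  unless (0x10000..0x10FFFF).cover?(codepoint)
    raise ArgumentError, 'not an astral codepoint'
  end

  offset = codepoint - 0x10000
  high   = 0xD800 + (offset >> 10)    # top 10 bits of the offset
  low    = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits of the offset
  [high, low]
end

# The three astral members of the example set (U+1F929..U+1F92B).
pairs = [0x1F929, 0x1F92A, 0x1F92B].map { |cp| surrogate_pair(cp) }

# Render each pair in the \uXXXX escape style used above.
escaped = pairs.map { |high, low| format('\u%04X\u%04X', high, low) }

# Cross-check against Ruby's own UTF-16 encoder.
encoded = [0x1F929].pack('U').encode('UTF-16BE').unpack('n*')
```

All three codepoints share the high surrogate `0xD83E`, which is why they can be grouped as `\uD83E[\uDD29-\uDD2B]`.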
### Other features
#### Secure tokens
Generate secure random strings of characters from a set:
```ruby
CharacterSet.new('a'..'z').secure_token(8) # => "ugwpujmt"
CharacterSet.crypt.secure_token # => "8.1w7aBT737/pMfcMoO4y2y8/=0xtmo:"
```
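For intuition, a comparable token can be built with the stdlib's `SecureRandom`. This is only a sketch of the general idea, not the gem's implementation; `secure_token_sketch` is a hypothetical name, and the default length of 32 is an assumption based on the length of the second example output above.

```ruby
require 'securerandom'

# Hypothetical helper: pick `length` members of `alphabet` at random,
# using a cryptographically secure random number generator.
def secure_token_sketch(alphabet, length = 32)
  Array.new(length) { alphabet[SecureRandom.random_number(alphabet.size)] }.join
end

alphabet = ('a'..'z').to_a
token    = secure_token_sketch(alphabet, 8)
```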
#### Unicode planes
There are some methods to check for planes and to handle ASCII, [BMP](https://en.wikipedia.org/wiki/Plane_%28Unicode%29#Basic_Multilingual_Plane) and astral parts:
```ruby
CharacterSet['a', 'ü', '🤩'].ascii_part # => CharacterSet['a']
CharacterSet['a', 'ü', '🤩'].ascii_part? # => true
CharacterSet['a', 'ü', '🤩'].ascii_only? # => false
CharacterSet['a', 'ü', '🤩'].ascii_ratio # => 0.3333333
CharacterSet['a', 'ü', '🤩'].bmp_part # => CharacterSet['a', 'ü']
CharacterSet['a', 'ü', '🤩'].astral_part # => CharacterSet['🤩']
CharacterSet['a', 'ü', '🤩'].bmp_ratio # => 0.6666666
CharacterSet['a', 'ü', '🤩'].planes # => [0, 1]
CharacterSet['a', 'ü', '🤩'].plane(1) # => CharacterSet['🤩']
CharacterSet['a', 'ü', '🤩'].member_in_plane?(7) # => false
CharacterSet::Character.new('a').plane # => 0
```
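The plane of a codepoint is its value divided by 0x10000 (65,536), so plane 0 is the BMP and planes 1 to 16 are astral. A stdlib-only sketch of the calculations behind the results above, with `plane_of` as a hypothetical helper and the characters written as `\u` escapes:

```ruby
# Hypothetical helper: each Unicode plane spans 0x10000 codepoints.
def plane_of(char)
  char.ord >> 16
end

# Same members as the example set: 'a', U+00FC and U+1F929.
chars = ['a', "\u00FC", "\u{1F929}"]

planes      = chars.map { |c| plane_of(c) }.uniq
bmp_ratio   = chars.count { |c| plane_of(c).zero? } / chars.size.to_f
ascii_ratio = chars.count(&:ascii_only?) / chars.size.to_f
```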
## Contributions
Feel free to send suggestions, point out issues, or submit pull requests.