1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343
|
* pcre2el: convert between PCRE, Emacs and rx regexp syntax
** Overview
=pcre2el= or =rxt= (RegeXp Translator or RegeXp Tools) is a utility
for working with regular expressions in Emacs, based on a
recursive-descent parser for regexp syntax. In addition to
converting (a subset of) PCRE syntax into its Emacs equivalent, it
can do the following:
- convert Emacs syntax to PCRE
- convert either syntax to =rx=, an S-expression based regexp
syntax
- untangle complex regexps by showing the parse tree in =rx= form
and highlighting the corresponding chunks of code
- show the complete list of strings (productions) matching a
regexp, provided the list is finite
- provide live font-locking of regexp syntax (so far only for
Elisp buffers -- other modes on the TODO list)
** Usage
Enable =rxt-mode= or its global equivalent =rxt-global-mode=
to get the default key-bindings. There are three sets of commands:
commands that take a PCRE regexp, commands which take an Emacs
regexp, and commands that try to do the right thing based on the
current mode. Currently, this means Emacs syntax in
=emacs-lisp-mode= and =lisp-interaction-mode=, and PCRE syntax
everywhere else.
The default key bindings all begin with =C-c /= and have a mnemonic
structure: =C-c / <source> <target>=, or just =C-c / <target>= for
the "do what I mean" commands. The complete list of key bindings is
given here and explained in more detail below:
- "Do-what-I-mean" commands:
- =C-c / /= :: =rxt-explain=
- =C-c / c= :: =rxt-convert-syntax=
- =C-c / x= :: =rxt-convert-to-rx=
- =C-c / ′= :: =rxt-convert-to-strings=
- Commands that work on a PCRE regexp:
- =C-c / p e= :: =rxt-pcre-to-elisp=
- =C-c / %= :: =pcre-query-replace-regexp=
- =C-c / p x= :: =rxt-pcre-to-rx=
- =C-c / p s= :: =rxt-pcre-to-sre=
- =C-c / p ′= :: =rxt-pcre-to-strings=
- =C-c / p /= :: =rxt-explain-pcre=
- Commands that work on an Emacs regexp:
- =C-c / e /= :: =rxt-explain-elisp=
- =C-c / e p= :: =rxt-elisp-to-pcre=
- =C-c / e x= :: =rxt-elisp-to-rx=
- =C-c / e s= :: =rxt-elisp-to-sre=
- =C-c / e ′= :: =rxt-elisp-to-strings=
- =C-c / e t= :: =rxt-toggle-elisp-rx=
- =C-c / t= :: =rxt-toggle-elisp-rx=
*** Interactive input and output
When used interactively, the conversion commands can read a regexp
either from the current buffer or from the minibuffer. The output
is displayed in the minibuffer and copied to the kill-ring.
- When called with a prefix argument (=C-u=), they read a regular
expression from the minibuffer literally, without further
processing -- meaning there's no need to double the backslashes if
it's an Emacs regexp. This is the same way commands like
=query-replace-regexp= read input.
- When the region is active, they use they the region contents,
again literally (without any translation of string syntax).
- With neither a prefix arg nor an active region, the behavior
depends on whether the command expects an Emacs regexp or
a PCRE one.
Commands that take an Emacs regexp behave like =C-x C-e=: they
evaluate the sexp before point (which could be simply a string
literal) and use its value. This is designed for use in Elisp
buffers. As a special case, if point is *inside* a string, it's
first moved to the string end, so in practice they should work
as long as point is somewhere within the regexp literal.
Commands that take a PCRE regexp try to read a Perl-style
delimited regex literal *after* point in the current buffer,
including its flags. For example, putting point before the =m=
in the following example and doing =C-c / p e=
(=rxt-pcre-to-elisp=) displays =\(?:bar\|foo\)=, correctly
stripping out the whitespace and comment:
: $x =~ m/ foo | (?# comment) bar /x
The PCRE reader currently only works with =/ ... /= delimiters. It
will ignore any preceding =m=, =s=, or =qr= operator, as well as
the replacement part of an =s= construction.
Readers for other PCRE-using languages are on the TODO list.
The translation functions display their result in the minibuffer
and copy it to the kill ring. When translating something into
Elisp syntax, you might need to use the result either literally
(e.g. for interactive input to a command like
=query-replace-regexp=), or as a string to paste into Lisp code.
To allow both uses, =rxt-pcre-to-elisp= copies both versions
successively to the kill-ring. The literal regexp without string
quoting is the top element of the kill-ring, while the Lisp string
is the second-from-top. You can paste the literal regexp somewhere
by doing =C-y=, or the Lisp string by =C-y M-y=.
*** Syntax conversion commands
=rxt-convert-syntax= (=C-c / c=) converts between Emacs and PCRE
syntax, depending on the major mode in effect when called.
Alternatively, you can specify the conversion direction explicitly
by using either =rxt-pcre-to-elisp= (=C-c / p e=) or
=rxt-elisp-to-pcre= (=C-c / e p=).
Similarly, =rxt-convert-to-rx= (=C-c / x=) converts either kind of
syntax to =rx= form, while =rxt-convert-pcre-to-rx= (=C-c / p x=)
and =rxt-convert-elisp-to-rx= (=C-c / e x=) convert to =rx= from a
specified source type.
In Elisp buffers, you can use =rxt-toggle-elisp-rx= (=C-c / t= or
=C-c / e t=) to switch the regexp at point back and forth between
string and =rx= syntax. Point should either be within an =rx= or
=rx-to-string= form or a string literal for this to work.
*** PCRE mode (experimental)
If you want to use emulated PCRE regexp syntax in all Emacs
commands, try =pcre-mode=, which uses Emacs's advice system to
make all commands that read regexps using the minibuffer use
emulated PCRE syntax. It should also work with Isearch.
This feature is still fairly experimental. It may fail to work or
do the wrong thing with certain commands. Please report bugs.
=pcre-query-replace-regexp= was originally defined to do
query-replace using emulated PCRE regexps, and is now made
somewhat obsolete by =pcre-mode=. It is bound to =C-c / %= by
default, by analogy with =M-%=. Put the following in your
=.emacs= if you want to use PCRE-style query replacement
everywhere:
: (global-set-key [(meta %)] 'pcre-query-replace-regexp)
*** Explain regexps
When syntax-highlighting isn't enough to untangle some gnarly
regexp you find in the wild, try the 'explain' commands:
=rxt-explain= (=C-c / /=), =rxt-explain-pcre= (=C-c / p=) and
=rxt-explain-elisp= (=C-c / e=). These display the original regexp
along with its pretty-printed =rx= equivalent in a new buffer.
Moving point around either in the original regexp or the =rx=
translation highlights corresponding pieces of syntax, which can
aid in seeing things like the scope of quantifiers.
I call them "explain" commands because the =rx= form is close to a
plain syntax tree, and this plus the wordiness of the operators
usually helps to clarify what is going on. People who dislike
Lisp syntax might disagree with this assessment.
*** Generate all matching strings (productions)
Occasionally you come across a regexp which is designed to match a
finite set of strings, e.g. a set of keywords, and it would be
useful to recover the original set. (In Emacs you can generate
such regexps using =regexp-opt=). The commands
=rxt-convert-to-strings= (=C-c / ′=), =rxt-pcre-to-strings= (=C-c
/ p ′=) or =rxt-elisp-to-strings= (=C-c / e ′=) accomplish this by
generating all the matching strings ("productions") of a regexp.
(The productions are copied to the kill ring as a Lisp list).
An example in Lisp code:
: (regexp-opt '("cat" "caterpillar" "catatonic"))
: ;; => "\\(?:cat\\(?:atonic\\|erpillar\\)?\\)"
: (rxt-elisp-to-strings "\\(?:cat\\(?:atonic\\|erpillar\\)?\\)")
: ;; => '("cat" "caterpillar" "catatonic")
For obvious reasons, these commands only work with regexps that
don't include any unbounded quantifiers like =+= or =*=. They also
can't enumerate all the characters that match a named character
class like =[[:alnum:]]=. In either case they will give a (hopefully
meaningful) error message. Due to the nature of permutations, it's
still possible for a finite regexp to generate a huge number of
productions, which will eat memory and slow down your Emacs. Be
ready with =C-g= if necessary.
*** RE-Builder support
The Emacs RE-Builder is a useful visual tool which allows using
several different built-in syntaxes via =reb-change-syntax= (=C-c
TAB=). It supports Elisp read and literal syntax and =rx=, but it
can only convert from the symbolic forms to Elisp, not the other
way. This package hacks the RE-Builder to also work with emulated
PCRE syntax, and to convert transparently between Elisp, PCRE and
rx syntaxes. PCRE mode reads a delimited Perl-like literal of the
form =/ ... /=, and it should correctly support using the =x= and
=s= flags.
*** Use from Lisp
Example of using the conversion functions:
: (rxt-pcre-to-elisp "(abc|def)\\w+\\d+")
: ;; => "\\(\\(?:abc\\|def\\)\\)[_[:alnum:]]+[[:digit:]]+"
All the conversion functions take a single string argument, the
regexp to translate:
- =rxt-pcre-to-elisp=
- =rxt-pcre-to-rx=
- =rxt-pcre-to-sre=
- =rxt-pcre-to-strings=
- =rxt-elisp-to-pcre=
- =rxt-elisp-to-rx=
- =rxt-elisp-to-sre=
- =rxt-elisp-to-strings=
** Bugs and Limitations
*** Limitations on PCRE syntax
PCRE has a complicated syntax and semantics, only some of which
can be translated into Elisp. The following subset of PCRE should
be correctly parsed and converted:
- parenthesis grouping =( .. )=, including shy matches =(?: ... )=
- backreferences (various syntaxes), but only up to 9 per expression
- alternation =|=
- greedy and non-greedy quantifiers =*=, =*?=, =+=, =+?=, =?= and =??=
(all of which are the same in Elisp as in PCRE)
- numerical quantifiers ={M,N}=
- beginning/end of string =\A=, =\Z=
- string quoting =\Q .. \E=
- word boundaries =\b=, =\B= (these are the same in Elisp)
- single character escapes =\a=, =\c=, =\e=, =\f=, =\n=, =\r=,
=\t=, =\x=, and =\octal digits= (but see below about non-ASCII
characters)
- character classes =[...]= including Posix escapes
- character classes =\d=, =\D=, =\h=, =\H=, =\s=, =\S=, =\v=, =\V=
both within character class brackets and outside
- word and non-word characters =\w= and =\W=
(Emacs has the same syntax, but its meaning is different)
- =s= (single line) and =x= (extended syntax) flags, in regexp
literals, or set within the expression via =(?xs-xs)= or =(?xs-xs:
.... )= syntax
- comments =(?# ... )=
Most of the more esoteric PCRE features can't really be supported
by simple translation to Elisp regexps. These include the
different lookaround assertions, conditionals, and the
"backtracking control verbs" =(* ...)= . OTOH, there are a few
other syntaxes which are currently unsupported and possibly could be:
- =\L=, =\U=, =\l=, =\u= case modifiers
- =\g{...}= backreferences
*** Other limitations
- The order of alternatives and characters in char classes
sometimes gets shifted around, which is annoying.
- Although the string parser tries to interpret PCRE's octal and
hexadecimal escapes correctly, there are problems with matching
8-bit characters that I don't use enough to properly understand,
e.g.:
: (string-match-p (rxt-pcre-to-elisp "\\377") "\377") => nil
A fix for this would be welcome.
- Most of PCRE's rules for how =^=, =\A=, =$= and =\Z= interact
with newlines are not implemented, since they seem less relevant
to Emacs's buffer-oriented rather than line-oriented model.
However, the different meanings of the =.= metacharacter *are*
implemented (it matches newlines with the =/s= flag, but not
otherwise).
- Not currently namespace clean (both =rxt-= and a couple of
=pcre-= functions).
*** TODO:
- Python-specific extensions to PCRE?
- Language-specific stuff to enable regexp font-locking and
explaining in different modes. Each language would need two
functions, which could be kept in an alist:
1. A function to read PCRE regexps, taking the string syntax into
account. E.g., Python has single-quoted, double-quoted and raw
strings, each with different quoting rules. PHP has the kind
of belt-and-suspenders solution you would expect: regexps are
in strings, /and/ you have to include the =/ ... /=
delimiters! Duh.
2. A function to copy faces back from the parsed string to the
original buffer text. This has to recognize any escape
sequences so they can be treated as a single character.
** Internal details
Internally, =rxt= defines an abstract syntax tree data type for
regular expressions, parsers for Elisp and PCRE syntax, and
"unparsers" from to PCRE, rx, and SRE syntax. Converting from a
parsed syntax tree to Elisp syntax is a two-step process: first
convert to =rx= form, then let =rx-to-string= do the heavy lifting.
See =rxt-parse-re=, =rxt-adt->pcre=, =rxt-adt->rx=, and
=rxt-adt->sre=, and the section beginning "Regexp ADT" in
pcre2el.el for details.
This code is partially based on Olin Shivers' reference SRE
implementation in scsh, although it is simplified in some respects
and extended in others. See =scsh/re.scm=, =scsh/spencer.scm= and
=scsh/posixstr.scm= in the =scsh= source tree for details. In
particular, =pcre2el= steals the idea of an abstract data type for
regular expressions and the general structure of the string regexp
parser and unparser. The data types for character sets are extended
in order to support symbolic translation between character set
expressions without assuming a small (Latin1) character set. The
string parser is also extended to parse a bigger variety of
constructions, including POSIX character classes and various Emacs
and Perl regexp assertions. Otherwise, only the bare minimum of
scsh's abstract data type is implemented.
** Soapbox
Emacs regexps have their annoyances, but it is worth getting used
to them. The Emacs assertions for word boundaries, symbol
boundaries, and syntax classes depending on the syntax of the mode
in effect are especially useful. (PCRE has =\b= for word-boundary,
but AFAIK it doesn't have separate assertions for beginning-of-word
and end-of-word). Other things that might be done with huge regexps
in other languages can be expressed more understandably in Elisp
using combinations of `save-excursion' with the various searches
(regexp, literal, skip-syntax-forward, sexp-movement functions,
etc.).
There's not much point in using =rxt-pcre-to-elisp= to use PCRE
notation in a Lisp program you're going to maintain, since you
still have to double all the backslashes. Better to just use the
converted result (or better yet, the =rx= form).
** History and acknowledgments
This was originally created out of an answer to a stackoverflow
question:
http://stackoverflow.com/questions/9118183/elisp-mechanism-for-converting-pcre-regexps-to-emacs-regexps
Thanks to:
- Wes Hardaker (hardaker) for the initial inspiration and
subsequent hacking
- priyadarshan for requesting RX/SRE support
- Daniel Colascione (dcolascione) for a patch to support Emacs's
explicitly-numbered match groups
- Aaron Meurer (asmeurer) for requesting Isearch support
- Philippe Vaucher (silex) for a patch to support
=ibuffer-do-replace-regexp= in PCRE mode
|