.\" $Header: /nfs/papaya/u2/rsalz/src/newsgate/RCS/regex.3,v 1.4 91/02/12 14:50:56 rsalz Exp $
.TH REGEX 3 LOCAL
.SH NAME
re_comp, re_exec, re_subs, re_modw, re_fail \- regular expression handling
.SH SYNOPSIS
.nf
.B "char *re_comp(pat)"
.B "char *pat;"
.sp
.B "re_exec(str)"
.B "char *str;"
.sp
.B "re_subs(src, dst)"
.B "char *src;"
.B "char *dst;"
.sp
.B "void re_fail(msg, op)"
.B "char *msg;"
.B "char op;"
.sp
.B "void re_modw(str)"
.B "char *str;"
.fi
.SH DESCRIPTION
These functions implement
.IR ed (1)\-style
partial regular expressions and supporting facilities.
.PP
.I Re_comp
compiles a pattern string into an internal form (a deterministic finite\-state
automaton) to be executed by
.I re_exec
for pattern matching.
.I Re_comp
returns zero if the pattern is compiled successfully; otherwise it returns a
pointer to an error message string.
If
.I re_comp
is called with a null pointer or a pointer to an empty string, it returns
without changing the currently compiled regular expression.
.PP
.I Re_comp
supports the same limited set of
.I "regular expressions"
found in
.I ed
and Berkeley
.IR regex (3)
routines:
.in +1i
.de Ti
.sp
.ti -1i
.ta 0.5i +0.5i +0.5i
..
.Ti
[1]	\fIchar\fP	Matches itself, unless it is a special
character (meta\-character): \fB. \e [ ] * + ^ $\fP
.Ti
[2]	\fB.\fP	Matches \fIany\fP character.
.Ti
[3]	\fB\e\fP	Matches the character following it, except
when followed by a digit, \fB(\fP, \fB)\fP, \fB<\fP or \fB>\fP
(see [7], [8] and [9]).
It is used as an escape character for all other meta\-characters, and itself.
When used in a set ([4]), it is treated as an ordinary character.
.Ti
[4]	\fB[\fP\fIset\fP\fB]\fP	Matches one of the characters in the set.
If the first character in the set is \fB^\fP, it matches a character \fInot\fP
in the set.
The shorthand
.IR S \- E
specifies the set of characters
.I S
through
.IR E ,
inclusive.
The special characters \fB]\fP and \fB\-\fP have no special meaning if they
appear as the first characters in the set.
.nf
.ta \w'[a\-zA\-Z0\-9]    'u
Examples	Match
[a\-z]		any lowercase alpha
[^]\-]		any char except ] and \-
[^A\-Z]		any char except uppercase alpha
[a\-zA\-Z0\-9]	any alphanumeric
.fi
.Ti
[5]	\fB*\fP	Any regular expression form [1] to [4], followed by the
closure character (\fB*\fP), matches zero or more occurrences of that form.
.Ti
[6]	\fB+\fP	Same as [5], except it matches one or more.
.Ti
[7]	\e\|( \e)	A regular expression in the form [1] to [10], enclosed
as \e\|(\fIform\fP\e), matches what \fIform\fP matches.
The enclosure creates a set of tags, used for [8] and for pattern
substitution in
.IR re_subs .
The tagged forms are numbered starting from one.
.Ti
[8]	\ed	A \e followed by a digit matches whatever a previously tagged
regular expression ([7]) matched.
.Ti
[9]	\fB\e<\fP	Matches the beginning of a \fIword\fP; that is,
an empty string followed by a letter, digit, or _ and not preceded by a
letter, digit, or _ .
.Ti
	\fB\e>\fP	Matches the end of a \fIword\fP; that is, an empty
string preceded by a letter, digit, or _ , and not followed by a letter,
digit, or _ .
.Ti
[10]		A composite regular expression \fIxy\fP where \fIx\fP and
\fIy\fP are in the form of [1] to [10] matches the longest match of \fIx\fP
followed by a match for \fIy\fP.
.Ti
[11]	\fB^ $\fP	A regular expression starting with a \fB^\fP character
and/or ending with a \fB$\fP character restricts the pattern matching to the
beginning of the line and/or the end of the line (anchors).
Elsewhere in the pattern, \fB^\fP and \fB$\fP are treated as ordinary
characters.
.in -1i
.PP
.I Re_exec
executes the internal form produced by
.I re_comp
and searches the argument string for the regular expression described
by the internal form.
.I Re_exec
returns 1 if the last regular expression pattern is matched within the string,
0 if no match is found.
In case of an internal error (corrupted internal form),
.I re_exec
calls the user\-supplied
.I re_fail
and returns 0.
.PP
The strings passed to both
.I re_comp
and
.I re_exec
may have trailing or embedded newline characters, but must be properly
terminated with a NUL.
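.PP
The calling convention described above (one current pattern, a null return
from
.I re_comp
on success, 1 or 0 from
.IR re_exec )
can be sketched in C as follows.
This is an illustration only, not the source of these routines: it assumes
that a POSIX
.IR regcomp (3)
is available and uses it as a stand\-in, so the pattern syntax it accepts
differs in places from the list above (in POSIX basic expressions \fB+\fP is
an ordinary character, for example).
The names
.I sketch_re_comp
and
.I sketch_re_exec
are hypothetical, and returning the
.I "No previous regular expression"
message for a null or empty pattern when nothing has been compiled yet is an
assumption.
.PP
.nf
```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

static regex_t compiled;        /* the single "current" pattern */
static int have_pattern = 0;    /* nonzero once a compile has succeeded */
static char errbuf[128];        /* holds the latest error message */

/* Hypothetical stand-in for re_comp: a null pointer means success. */
const char *sketch_re_comp(const char *pat)
{
    int rc;

    if (pat == NULL || *pat == 0) {
        /* Null or empty pattern: keep whatever is compiled now. */
        if (have_pattern)
            return NULL;
        /* Assumption: this is when the message below is reported. */
        return "No previous regular expression";
    }
    if (have_pattern) {
        regfree(&compiled);
        have_pattern = 0;
    }
    rc = regcomp(&compiled, pat, REG_NOSUB);
    if (rc != 0) {
        regerror(rc, &compiled, errbuf, sizeof errbuf);
        return errbuf;
    }
    have_pattern = 1;
    return NULL;
}

/* Hypothetical stand-in for re_exec: 1 on a match, 0 otherwise. */
int sketch_re_exec(const char *str)
{
    assert(str != NULL);
    if (!have_pattern)
        return 0;
    return regexec(&compiled, str, 0, NULL, 0) == 0;
}
```
.fi
.PP
With these definitions, compiling \fBfo[ob]a[rz]\fP returns a null pointer,
after which
.I sketch_re_exec
returns 1 for \fBfobar\fP and 0 for \fBfoxar\fP.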
.PP
.I Re_subs
does
.IR ed \-style
pattern substitution, after a successful match is found by
.IR re_exec .
The source string parameter to
.I re_subs
is copied to the destination string with the following interpretation:
.sp
.in +1i
.Ti
[1]	&	Substitute the entire matched string in the destination.
.Ti
[2]	\e\fId\fP	Substitute the substring matched by a tagged subpattern
numbered \fId\fP, where \fId\fP is between 1 and 9, inclusive.
.Ti
[3]	\e\fIc\fP	Treat the next character literally, unless the
character is a digit ([2]).
.in -1i
.PP
If the copy operation with the substitutions is successful,
.I re_subs
returns 1.
If the source string is corrupted, or the last call to
.I re_exec
failed, it returns 0.
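.PP
The three copy rules above can be sketched as a small C routine.
This is an illustration under stated assumptions rather than the code of
.IR re_subs :
the name
.I sketch_subs
and its parameters are hypothetical, the match offsets come from a POSIX
.I regmatch_t
array as filled in by
.IR regexec (3)
instead of from the state left by
.IR re_exec ,
and a \fB\e\fP followed by \fB0\fP is simply handled by rule [3].
.PP
.nf
```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

enum { ESC = 92 };  /* the escape character, a backslash (code 92 in ASCII) */

/*
 * Hypothetical helper: expand src into dst using the offsets in m[],
 * where m[0] is the whole match and m[1] to m[9] are the tagged parts
 * of subject.  Returns 1 on success, 0 if a tag took no part in the
 * match or dst is too small.
 */
int sketch_subs(const char *src, char *dst, size_t dstlen,
                const char *subject, const regmatch_t m[10])
{
    size_t used = 0;

    assert(src != NULL && dst != NULL && subject != NULL);
    if (dstlen == 0)
        return 0;
    while (*src != 0) {
        int tag = 0;            /* which entry of m[] to copy */
        int copy_match = 0;     /* nonzero when a match is copied */
        char c = *src++;

        if (c == '&') {
            copy_match = 1;     /* rule [1]: the entire match */
        } else if (c == ESC && *src >= '1' && *src <= '9') {
            tag = *src++ - '0'; /* rule [2]: a tagged subpattern */
            copy_match = 1;
        } else if (c == ESC && *src != 0) {
            c = *src++;         /* rule [3]: next character, literally */
        }

        if (copy_match) {
            size_t len;

            if (m[tag].rm_so < 0)
                return 0;       /* unused tag */
            len = (size_t)(m[tag].rm_eo - m[tag].rm_so);
            if (used + len >= dstlen)
                return 0;       /* no room, keeping one byte for the NUL */
            memcpy(dst + used, subject + m[tag].rm_so, len);
            used += len;
        } else {
            if (used + 1 >= dstlen)
                return 0;
            dst[used++] = c;
        }
    }
    dst[used] = 0;
    return 1;
}
```
.fi
.PP
For a match of \fB\e\|(fo.*\e)\-\e1\fP against \fBfoo\-foo\fP, a source string
of \fB\e1:&\fP would leave \fBfoo:foo\-foo\fP in the destination.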
.PP
.I Re_modw
adds new characters to an internal table, changing what
.I re_exec
treats as a \fIword\fP when matching the \fB\e<\fP and \fB\e>\fP
constructs.
If the string parameter is a null pointer or an empty string,
the table is reset to the default, which contains \fBA\-Z a\-z 0\-9 _\fP .
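.PP
One way to picture the table that
.I re_modw
maintains is a per\-character flag array.
The sketch below is hypothetical (the names
.IR sketch_modw ,
.I sketch_word_begins
and
.I sketch_word_ends
do not exist in these routines) and assumes only what is stated above: a
default set of letters, digits and _ , characters added on request, and a
reset on a null or empty argument.
.PP
.nf
```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Default word characters, as listed in the text above. */
static const char word_default[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789_";

static unsigned char word_table[UCHAR_MAX + 1]; /* 1 = word character */
static int table_ready = 0;

static void add_chars(const char *s)
{
    for (; *s != 0; s++)
        word_table[(unsigned char)*s] = 1;
}

static void reset_table(void)
{
    memset(word_table, 0, sizeof word_table);
    add_chars(word_default);
    table_ready = 1;
}

/* Hypothetical stand-in for re_modw. */
void sketch_modw(const char *str)
{
    if (!table_ready || str == NULL || *str == 0)
        reset_table();
    if (str != NULL)
        add_chars(str);
}

static int is_word(char c)
{
    if (!table_ready)
        reset_table();
    return word_table[(unsigned char)c];
}

/* Nonzero if a word begins at s[i]; i must not exceed strlen(s). */
int sketch_word_begins(const char *s, size_t i)
{
    assert(s != NULL);
    return (i == 0 || !is_word(s[i - 1])) && is_word(s[i]);
}

/* Nonzero if a word ends just before s[i]; i must not exceed strlen(s). */
int sketch_word_ends(const char *s, size_t i)
{
    assert(s != NULL);
    return i > 0 && is_word(s[i - 1]) && !is_word(s[i]);
}
```
.fi
.PP
With the default table, a word begins at the \fBb\fP of \fBfoo\-bar\fP;
after the hyphen has been added with
.IR sketch_modw ,
it no longer does, because the whole string counts as one word.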
.PP
.I Re_fail
is a user\-supplied routine to handle internal errors.
.I Re_exec
calls
.I re_fail
with an error message string, and the opcode character that caused the error.
The default
.I re_fail
routine simply prints the message and the opcode character to the standard
error and calls
.IR exit (2).
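.PP
A replacement
.I re_fail
that records the error instead of exiting might look like the sketch below.
It assumes that a definition in the calling program takes the place of the
default one at link time, which depends on how the routines are built.
The variables
.IR last_fail_msg ,
.I last_fail_op
and
.IR fail_count ,
and the helper
.IR sketch_take_failures ,
are hypothetical.
.PP
.nf
```c
#include <assert.h>
#include <stddef.h>

static const char *last_fail_msg = NULL; /* message from the latest call */
static int last_fail_op = 0;             /* opcode from the latest call */
static int fail_count = 0;               /* calls seen since the last check */

/*
 * Replacement handler.  The second parameter is written as int because
 * an old-style char parameter is passed after promotion to int.
 */
void re_fail(char *msg, int op)
{
    assert(msg != NULL);
    last_fail_msg = msg;
    last_fail_op = op;
    fail_count++;
}

/* Report how many failures were recorded, then clear the count. */
int sketch_take_failures(void)
{
    int n = fail_count;

    fail_count = 0;
    return n;
}
```
.fi
.PP
A caller could then check
.I sketch_take_failures
after each
.I re_exec
that returns 0, to tell an internal error from an ordinary failure to match.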
.SH EXAMPLES
For additional details, refer to the sources.
.PP
.RS
.nf
.ta \w'\e\|(foo\e)[1\-3]\e1    'u
foo*.*	fo foo fooo foobar fobar foxx ...

fo[ob]a[rz]	fobar fooar fobaz fooaz

foo\e\e+	foo\e foo\e\e foo\e\e\e  ...

\e\|(foo\e)[1\-3]\e1	foo1foo foo2foo foo3foo
(This is the same as \fIfoo[1\-3]foo\fP, but it takes less internal space.)

\e\|(fo.*\e)\-\e1	foo\-foo fo\-fo fob\-fob foobar\-foobar ...
.fi
.RE
.SH DIAGNOSTICS
.I Re_comp
returns one of the following strings if an error occurs:
.RS
.nf
.I "No previous regular expression"
.I "Empty closure"
.I "Illegal closure"
.I "Cyclical reference"
.I "Undetermined reference"
.I "Unmatched \e\|("
.I "Missing ]"
.I "Null pattern inside \e\|(\e)"
.I "Null pattern inside \e<\e>"
.I "Too many \e\|(\e) pairs"
.I "Unmatched \e)"
.fi
.RE
.SH REFERENCES
.nf
.IR "Software tools" ", Kernighan & Plauger."
.IR "Software tools in Pascal" ", Kernighan & Plauger."
.IR "Grep sources [rsx\-11 C dist]" ", David Conroy."
.IR "Ed \- text editor" ", Unix Programmer's Manual."
.IR "Advanced editing on Unix" ", B. W. Kernighan."
.IR "RegExp sources" ", Henry Spencer."
.fi
.SH "HISTORY AND NOTES"
These routines are in the
.IR "Public Domain" ;
you can get them in source.
They are derived from various implementations found in the
.I "Software Tools"
books, and David Conroy's
.IR grep .
They are NOT derived from licensed/restricted software.
For more interesting/academic/complicated implementations, see Henry Spencer's
.I regexp
routines, or the
.I "GNU Emacs"
pattern
matching module.
.PP
.I Re_comp
and
.I re_exec
generally perform at least as well as their licensed counterparts.
In a very few instances, they are about 10% to 15% slower.
.SH AUTHOR
Ozan S. Yigit <yunexus!oz>.
.br
This manual page was edited from the original by Rich $alz <rsalz@bbn.com>.
.SH BUGS
The internal buffer for the compiled pattern is not checked for overflow;
the size is currently 1024 bytes.
.br
There are no doubt other bugs, too.
.SH "SEE ALSO"
ed(1), egrep(1), fgrep(1), grep(1)