File: regexp.3

package info (click to toggle)
erlang-manpages 1%3A12.b.3-1
links: PTS
area: main
in suites: lenny
size: 4,188 kB
ctags: 2
sloc: makefile: 68; perl: 30; sh: 15
file content (381 lines) | stat: -rw-r--r-- 9,272 bytes
.TH regexp 3 "stdlib  1.15.3" "Ericsson AB" "ERLANG MODULE DEFINITION"
.SH MODULE
regexp \- Regular Expression Functions for Strings
.SH DESCRIPTION
.LP
This module contains functions for regular expression matching and substitution\&.

.SH EXPORTS
.LP
.B
match(String, RegExp) -> MatchRes
.br
.RS
.TP
Types
String = RegExp = string()
.br
MatchRes = {match, Start, Length} | nomatch | {error, errordesc()}
.br
Start = Length = integer()
.br
.RE
.RS
.LP
Finds the first, longest match of the regular expression \fIRegExp\fR in \fIString\fR\&. This function searches for the longest possible match and returns the first one found if there are several expressions of the same length\&. It returns as follows:
.RS 2
.TP 4
.B
\fI{match, Start, Length}\fR:
if the match succeeded\&. \fIStart\fR is the starting position of the match, and \fILength\fR is the length of the matching string\&.
.TP 4
.B
\fInomatch\fR:
if there were no matching characters\&.
.TP 4
.B
\fI{error, Error}\fR:
if there was an error in \fIRegExp\fR\&.
.RE
.RE
.LP
.B
first_match(String, RegExp) -> MatchRes
.br
.RS
.TP
Types
String = RegExp = string()
.br
MatchRes = {match, Start, Length} | nomatch | {error, errordesc()}
.br
Start = Length = integer()
.br
.RE
.RS
.LP
Finds the first match of the regular expression \fIRegExp\fR in \fIString\fR\&. This call is usually faster than \fImatch\fR and it is also a useful way to ascertain that a match exists\&. It returns as follows:
.RS 2
.TP 4
.B
\fI{match, Start, Length}\fR:
if the match succeeded\&. \fIStart\fR is the starting position of the match and \fILength\fR is the length of the matching string\&.
.TP 4
.B
\fInomatch\fR:
if there were no matching characters\&.
.TP 4
.B
\fI{error, Error}\fR:
if there was an error in \fIRegExp\fR\&.
.RE
.RE
.LP
.B
matches(String, RegExp) -> MatchRes
.br
.RS
.TP
Types
String = RegExp = string()
.br
MatchRes = {match, Matches} | {error, errordesc()}
.br
Matches = list()
.br
.RE
.RS
.LP
Finds all non-overlapping matches of the expression \fIRegExp\fR in \fIString\fR\&. It returns as follows:
.RS 2
.TP 4
.B
\fI{match, Matches}\fR:
if the regular expression was correct\&. The list will be empty if there was no match\&. Each element in the list looks like \fI{Start, Length}\fR, where \fIStart\fR is the starting position of the match, and \fILength\fR is the length of the matching string\&.
.TP 4
.B
\fI{error, Error}\fR:
if there was an error in \fIRegExp\fR\&.
.RE
.RE
.LP
.B
sub(String, RegExp, New) -> SubRes
.br
.RS
.TP
Types
String = RegExp = New = string()
.br
SubRes = {ok, NewString, RepCount} | {error, errordesc()}
.br
RepCount = integer()
.br
.RE
.RS
.LP
Substitutes the first occurrence of a substring matching \fIRegExp\fR in \fIString\fR with the string \fINew\fR\&. A \fI&\fR in the string \fINew\fR is replaced by the matched substring of \fIString\fR\&. \fI\e&\fR puts a literal \fI&\fR into the replacement string\&. It returns as follows:
.RS 2
.TP 4
.B
\fI{ok, NewString, RepCount}\fR:
if \fIRegExp\fR is correct\&. \fIRepCount\fR is the number of replacements which have been made (this will be either 0 or 1)\&.
.TP 4
.B
\fI{error, Error}\fR:
if there is an error in \fIRegExp\fR\&.
.RE
.RE
.LP
.B
gsub(String, RegExp, New) -> SubRes
.br
.RS
.TP
Types
String = RegExp = New = string()
.br
SubRes = {ok, NewString, RepCount} | {error, errordesc()}
.br
RepCount = integer()
.br
.RE
.RS
.LP
The same as \fIsub\fR, except that all non-overlapping occurrences of a substring matching \fIRegExp\fR in \fIString\fR are replaced by the string \fINew\fR\&. It returns:
.RS 2
.TP 4
.B
\fI{ok, NewString, RepCount}\fR:
if \fIRegExp\fR is correct\&. \fIRepCount\fR is the number of replacements which have been made\&.
.TP 4
.B
\fI{error, Error}\fR:
if there is an error in \fIRegExp\fR\&.
.RE
.RE
.LP
.B
split(String, RegExp) -> SplitRes
.br
.RS
.TP
Types
String = RegExp = string()
.br
SubRes = {ok, FieldList} | {error, errordesc()}
.br
Fieldlist = [string()]
.br
.RE
.RS
.LP
\fIString\fR is split into fields (sub-strings) by the regular expression \fIRegExp\fR\&.
.LP
If the separator expression is \fI" "\fR (a single space), then the fields are separated by blanks and/or tabs and leading and trailing blanks and tabs are discarded\&. For all other values of the separator, leading and trailing blanks and tabs are not discarded\&. It returns:
.RS 2
.TP 4
.B
\fI{ok, FieldList}\fR:
to indicate that the string has been split up into the fields of \fIFieldList\fR\&.
.TP 4
.B
\fI{error, Error}\fR:
if there is an error in \fIRegExp\fR\&.
.RE
.RE
.LP
.B
sh_to_awk(ShRegExp) -> AwkRegExp
.br
.RS
.TP
Types
ShRegExp AwkRegExp = string()
.br
SubRes = {ok, NewString, RepCount} | {error, errordesc()}
.br
RepCount = integer()
.br
.RE
.RS
.LP
Converts the \fIsh\fR type regular expression \fIShRegExp\fR into a full \fIAWK\fR regular expression\&. Returns the converted regular expression string\&. \fIsh\fR expressions are used in the shell for matching file names and have the following special characters:
.RS 2
.TP 4
.B
\fI*\fR:
matches any string including the null string\&.
.TP 4
.B
\fI?\fR:
matches any single character\&.
.TP 4
.B
\fI[\&.\&.\&.]\fR:
matches any of the enclosed characters\&. Character ranges are specified by a pair of characters separated by a \fI-\fR\&. If the first character after \fI[\fR is a \fI!\fR, then any character not enclosed is matched\&.
.RE
.LP
It may sometimes be more practical to use \fIsh\fR type expansions as they are simpler and easier to use, even though they are not as powerful\&.
.RE
.LP
.B
parse(RegExp) -> ParseRes
.br
.RS
.TP
Types
RegExp = string()
.br
ParseRes = {ok, RE} | {error, errordesc()}
.br
.RE
.RS
.LP
Parses the regular expression \fIRegExp\fR and builds the internal representation used in the other regular expression functions\&. Such representations can be used in all of the other functions instead of a regular expression string\&. This is more efficient when the same regular expression is used in many strings\&. It returns:
.RS 2
.TP 4
.B
\fI{ok, RE}\fRif \fIRegExp\fRis correct and \fIRE\fRis the internal representation\&.:

.TP 4
.B
\fI{error, Error}\fRif there is an error in \fIRegExpString\fR\&.:

.RE
.RE
.LP
.B
format_error(ErrorDescriptor) -> Chars
.br
.RS
.TP
Types
ErrorDescriptor = errordesc()
.br
Chars = [char() | Chars]
.br
.RE
.RS
.LP
Returns a string which describes the error \fIErrorDescriptor\fR returned when there is an error in a regular expression\&.
.RE
.SH REGULAR EXPRESSIONS
.LP
The regular expressions allowed here is a subset of the set found in \fIegrep\fR and in the \fIAWK\fR programming language, as defined in the book, \fIThe AWK Programming Language, by A\&. V\&. Aho, B\&. W\&. Kernighan, P\&. J\&. Weinberger\fR\&. They are composed of the following characters:
.RS 2
.TP 4
.B
c:
matches the non-metacharacter \fIc\fR\&.
.TP 4
.B
\ec:
matches the escape sequence or literal character \fIc\fR\&.
.TP 4
.B
\&.:
matches any character\&.
.TP 4
.B
^:
matches the beginning of a string\&.
.TP 4
.B
$:
matches the end of a string\&.
.TP 4
.B
[abc\&.\&.\&.]:
character class, which matches any of the characters \fIabc\&.\&.\&.\fR Character ranges are specified by a pair of characters separated by a \fI-\fR\&.
.TP 4
.B
[^abc\&.\&.\&.]:
negated character class, which matches any character except \fIabc\&.\&.\&.\fR\&.
.TP 4
.B
r1 | r2:
alternation\&. It matches either \fIr1\fR or \fIr2\fR\&.
.TP 4
.B
r1r2:
concatenation\&. It matches \fIr1\fR and then \fIr2\fR\&.
.TP 4
.B
r+:
matches one or more \fIr\fRs\&.
.TP 4
.B
r*:
matches zero or more \fIr\fRs\&.
.TP 4
.B
r?:
matches zero or one \fIr\fRs\&.
.TP 4
.B
(r):
grouping\&. It matches \fIr\fR\&.
.RE
.LP
The escape sequences allowed are the same as for Erlang strings:
.RS 2
.TP 4
.B
\fI\eb\fR:
backspace
.TP 4
.B
\fI\ef\fR:
form feed 
.TP 4
.B
\fI\e\fR:
newline (line feed) 
.TP 4
.B
\fI\er\fR:
carriage return 
.TP 4
.B
\fI\et\fR:
tab 
.TP 4
.B
\fI\ee\fR:
escape 
.TP 4
.B
\fI\ev\fR:
vertical tab 
.TP 4
.B
\fI\es\fR:
space 
.TP 4
.B
\fI\ed\fR:
delete 
.TP 4
.B
\fI\eddd\fR:
the octal value ddd 
.TP 4
.B
\fI\ec\fR:
any other character literally, for example \fI\e\e\fR for backslash, \fI\e"\fR for ")
.RE
.LP
To make these functions easier to use, in combination with the function \fIio:get_line\fR which terminates the input line with a new line, the \fI$\fR characters also matches a string ending with \fI"\&.\&.\&.\e "\fR\&. The following examples define Erlang data types:

.nf
Atoms     [a-z][0-9a-zA-Z_]*

Variables [A-Z_][0-9a-zA-Z_]*

Floats    (\e+|-)?[0-9]+\e\&.[0-9]+((E|e)(\e+|-)?[0-9]+)?
.fi
.LP
Regular expressions are written as Erlang strings when used with the functions in this module\&. This means that any \fI\e\fR or \fI"\fR characters in a regular expression string must be written with \fI\e\fR as they are also escape characters for the string\&. For example, the regular expression string for Erlang floats is: \fI"(\e\e+|-)?[0-9]+\e\e\&.[0-9]+((E|e)(\e\e+|-)?[0-9]+)?"\fR\&.
.LP
It is not really necessary to have the escape sequences as part of the regular expression syntax as they can always be generated directly in the string\&. They are included for completeness and can they can also be useful when generating regular expressions, or when they are entered other than with Erlang strings\&.