File: tindale.html

package info (click to toggle)
lg-issue55 2-1
  • links: PTS
  • area: main
  • in suites: sarge
  • size: 1,600 kB
  • ctags: 94
  • sloc: sh: 37; makefile: 34
file content (188 lines) | stat: -rw-r--r-- 9,515 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
<!--startcut  ==============================================-->
<!-- *** BEGIN HTML header *** -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML><HEAD>
<title>Regular Expressions in C LG #55</title>
</HEAD>
<BODY BGCOLOR="#FFFFFF" TEXT="#000000" LINK="#0000FF" VLINK="#0000AF"
ALINK="#FF0000">
<!-- *** END HTML header *** -->

<CENTER>
<A HREF="http://www.linuxgazette.com/">
<H1><IMG ALT="LINUX GAZETTE" SRC="../gx/lglogo.jpg" 
	WIDTH="600" HEIGHT="124" border="0"></H1></A> 

<!-- *** BEGIN navbar *** -->
<IMG ALT="" SRC="../gx/navbar/left.jpg" WIDTH="14" HEIGHT="45" BORDER="0" ALIGN="bottom"><A HREF="stoddard.html"><IMG ALT="[ Prev ]" SRC="../gx/navbar/prev.jpg" WIDTH="16" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="index.html"><IMG ALT="[ Table of Contents ]" SRC="../gx/navbar/toc.jpg" WIDTH="220" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../index.html"><IMG ALT="[ Front Page ]" SRC="../gx/navbar/frontpage.jpg" WIDTH="137" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="http://www.linuxgazette.com/cgi-bin/talkback/all.py?site=LG&article=http://www.linuxgazette.com/issue55/tindale.html"><IMG ALT="[ Talkback ]" SRC="../gx/navbar/talkback.jpg" WIDTH="121" HEIGHT="45" BORDER="0" ALIGN="bottom"  ></A><A HREF="../faq/index.html"><IMG ALT="[ FAQ ]" SRC="./../gx/navbar/faq.jpg"WIDTH="62" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="williams.html"><IMG ALT="[ Next ]" SRC="../gx/navbar/next.jpg" WIDTH="15" HEIGHT="45" BORDER="0" ALIGN="bottom"  ></A><IMG ALT="" SRC="../gx/navbar/right.jpg" WIDTH="15" HEIGHT="45" ALIGN="bottom">
<!-- *** END navbar *** -->
<P>
</CENTER>

<!--endcut ============================================================-->

<H4 ALIGN="center">
"Linux Gazette...<I>making Linux just a little more fun!</I>"
</H4>

<P> <HR> <P> 
<!--===================================================================-->

<center>
<H1><font color="maroon">Regular Expressions in C</font></H1>
<H4>By <a href="mailto:Ben.Tindale@aals27.alcatel.com.au">Ben Tindale</a></H4>
</center>
<P> <HR> <P>  

<!-- END header -->




<h1>Scope</h1>
<p>
In this series of articles I intend to explore the varying
implementations of strings in languages that are common on the Linux
platform. The first article will explore the regular expression
library provided with GNU libc. In future articles I hope to look at
other common libraries and languages - hashing functions in Java and
strings in KDE versus string in Gnome.
</p>
<p>
Each language has its strengths and its weaknesses. I hope that by
doing a little grunt work on your behalf, I'll be able to give you a
brief overview of the abilities and weaknesses of the common languages
and their libraries with respect to string handling.
</p>
<p>
I won't be talking about internationalization and localisation in this
series of articles, since those subjects are worthy of volumes of
study - not a short summary.
</p>
<p>

<h2>The Gnu C Library and Regular Expressions</h2>
</p>
<p>
The GNU C library is the most basic system element on any Linux
installation from a programmer's perspective. Most higher level
libraries are based on libc, and most of what we think of as the &quot;C
language&quot; are really functions in libc.
</p>
<p>
Strings in C are just null terminated arrays of chars or wide
chars. This is the simplest and most efficient implementation of
strings in terms of computer resources, but probably the trickiest and
least efficient implementation in terms of programmer resources. Since
strings are either constants (ie literals) or pointers, the programmer
has the power to manipulate the strings down to the bit level and has
all kinds of opportunities to optimise their code (for example
<a href="http://sourceforge.net/snippet/detail.php?type=snippet&id=100055">
this</a> snippet). On the other hand, null termination of strings and
the absence of in-built length checking mean that problems such as
infinite loops and buffer-overflows are inevitably going to appear in
code.
</p>
<p>
The GNU C library is rich in string manipulation functions. There are
standard calls to copy, move, concatenate, compare and find the length
of a string (or a section of memory). In addition to these, libc also
supports tokenization and regular expression searches.
</p>
<p>
Regular expressions are a powerful method for searching for text that
matches a particular pattern. Most users will have first encountered
the idea of regular expressions while using the command line, where
characters such as '*' have a special meaning (in this case, matching
zero or more characters). To illustrate the power of regular
expressions and how they are used, we will implement a simple form of
grep.
</p>
<p>
<h2><a href="misc/tindale/mygrep.c.txt">Mygrep.c</a></h2>
</p>
<p>
Mygrep.c uses the powerful regex.h library for the task of searching
through a text file for a line that matches the given pattern.
</p>
<pre>
	bash&gt; ./mygrep -f mygrep.c -p int Line 17: int
	match_patterns(regex_t *r, FILE *FH) Line 36: printf(&quot;Line %d:
	%s&quot;, line_no, line); Line 52: printf(&quot;In error\n&quot;);
        bash&gt;
</pre>
<p>
Libc makes the use of regular expressions comparitively easy. Of
course, it would be much easier to use a language with regular
expression matching as part of its core definition (such as perl) for
this example, but the C library does have the advantage of easy
integration with existing code and maybe speed (although in languages
such as perl the regular expression matching is highly optimised).
</p>
<p>
If you examine the program listing, you will see that mygrep.c
consists of a main function that handles the user options and two
functions that perform the actual regular expression matching. The
first of these functions, logically, is the function do_regex(). This
function takes in as its parameters a pointer to a regular expression
structure, a string holding the pattern to search for and a string
holding the filename. The first task that do_regex() performs is to
&quot;compile&quot; the regular expression pattern into the format native to the
GNU library by calling regcomp(). This format is a data structure
optimised for pattern matching, the details of which are hidden from
the user. Next, the file to be scanned is opened, then the file handle
and the compiled regular expression are passed to match_patterns() to
execute the search and output the results.
</p>
<p>
Match_patterns() scans through each line of the file, looking for
patterns that match the regular expression. We begin scanning the
lines one by one - note that we have assumed that the lines are less
than 1023 bytes long (the array called &quot;line&quot; is 1024 bytes long and
we need one byte for the null termination). If the input is more than
1023 bytes long, then the line is wrapped over and interpreted as a
new line until the '\n' character is met. The function regexec() scans
the line for a set of characters that match the user specified
pattern. Every set of characters that matches the regular expression
forces regexec() to return 0, at which point we print out the line and
the line number that match. If a regular expression matches more than
once, then the line is printed out more than once. The offset from the
beginning of the line is updated so that we do not match on the same
pattern again.
</p>
<p>
This example, while fairly trivial, illustrates how powerful the GNU C
library can be. Some of the more salient features of the library that
we have used include: 
<ul>
<li>The ability to automagically handle extremely
   long lines.</li>
<li>Optimised data structures for particular functions.</li>
<li> Standard, portable error handling.  </li>
<li> Standard, portable handling of command line options.</li>
    </ul>
In particular, we explored the capable GNU
regular expression library, regex.h, which simplifies the inclusion of
regular expression matching into your program, and
provides a safe and simple interface to these capabilities.</p>




<!-- *** BEGIN copyright *** -->
<P> <hr> <!-- P --> 
<H5 ALIGN=center>

Copyright &copy; 2000, Ben Tindale<BR> 
Published in Issue 55 of <i>Linux Gazette</i>, July 2000</H5>
<!-- *** END copyright *** -->

<!--startcut ==========================================================-->
<HR><P>
<CENTER>
<!-- *** BEGIN navbar *** -->
<IMG ALT="" SRC="../gx/navbar/left.jpg" WIDTH="14" HEIGHT="45" BORDER="0" ALIGN="bottom"><A HREF="stoddard.html"><IMG ALT="[ Prev ]" SRC="../gx/navbar/prev.jpg" WIDTH="16" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="index.html"><IMG ALT="[ Table of Contents ]" SRC="../gx/navbar/toc.jpg" WIDTH="220" HEIGHT="45" BORDER="0" ALIGN="bottom" ></A><A HREF="../index.html"><IMG ALT="[ Front Page ]" SRC="../gx/navbar/frontpage.jpg" WIDTH="137" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="http://www.linuxgazette.com/cgi-bin/talkback/all.py?site=LG&article=http://www.linuxgazette.com/issue55/tindale.html"><IMG ALT="[ Talkback ]" SRC="../gx/navbar/talkback.jpg" WIDTH="121" HEIGHT="45" BORDER="0" ALIGN="bottom"  ></A><A HREF="../faq/index.html"><IMG ALT="[ FAQ ]" SRC="./../gx/navbar/faq.jpg"WIDTH="62" HEIGHT="45" BORDER="0" ALIGN="bottom"></A><A HREF="williams.html"><IMG ALT="[ Next ]" SRC="../gx/navbar/next.jpg" WIDTH="15" HEIGHT="45" BORDER="0" ALIGN="bottom"  ></A><IMG ALT="" SRC="../gx/navbar/right.jpg" WIDTH="15" HEIGHT="45" ALIGN="bottom">
<!-- *** END navbar *** -->
</CENTER>
</BODY></HTML>
<!--endcut ============================================================-->