<!--startcut ==========================================================-->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<title>bash String Manipulations Issue 18</title>
</HEAD>
<BODY BGCOLOR="#EEE1CC" TEXT="#000000" LINK="#0000FF" VLINK="#0020F0"
ALINK="#FF0000">
<!--endcut ============================================================-->

<H4>
&quot;Linux Gazette...<I>making Linux just a little more fun!</I>&quot;
</H4>

<P> <HR> <P> 
<!--===================================================================-->

<center>
<H2>bash String Manipulations</H2>
<H4>By Jim Dennis,
<a href="mailto:jimd@starshine.org">jimd@starshine.org</a></H4>
</center>
<P><HR><P>
<P>
The <i>bash</i> shell has many features that are 
sufficiently obscure you almost never see them used.  One of
the problems is that the man page offers no examples. <P>

Here I'm going to show how to use some of these features to do
the sorts of simple string manipulations that are commonly 
needed on file and path names. <H2>

Background </H2><P>

In traditional Bourne shell programming you might see references
to the <i>basename</i> and <i>dirname</i> commands.
These perform simple string manipulations on their arguments. You'll
also see many uses of <i>sed</i> and <i>awk</i> or
<i>perl -e</i> to perform simple string manipulations.<P>

Often these machinations must be performed on lists of filenames
and paths. There are many specialized programs that are conventionally
included with Unix to perform these sorts of utility functions: 
<i>tr</i>, <i>cut</i>, <i>paste</i>, and <i>join</i>.<P>

Given a filename like <i>/home/myplace/a.data.directory/a.filename.txt
</i> which we'll call <b>$f</b> you could use commands like:

	<blockquote><pre>
	<i>dirname</i> <b>$f</b> 
	<i>basename</i> <b>$f</b> 
	<i>basename</i> <b>$f</b> .txt
	</pre></blockquote>

... to see output like:
	<blockquote><pre><i>
	/home/myplace/a.data.directory
	a.filename.txt
	a.filename </i></pre></blockquote>

Notice that the GNU version of <i>basename</i> takes an 
optional parameter.  This is handy for specifying a filename "extension"
like <b>.tar.gz</b> which will be stripped off of the output.  Note that
<i>basename</i> and <i>dirname</i> don't verify that these parameters 
are valid filenames or paths.  They simply perform
string operations on a single argument. You shouldn't use wild cards
with them -- since <i>dirname</i> takes exactly one argument
(and complains if given more) and <i>basename</i> takes one argument
and an optional one which is not a filename. <P>

Despite their simplicity these two commands are used frequently in 
shell programming because most shells don't have any built-in string
handling functions -- and we frequently need to refer to just the
directory or just the file name parts of a given full file specification. <P>
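As a minimal sketch of that common use, the lines below capture both
parts of the sample path from above in shell variables, using the
<i>$(...)</i> form of command substitution.  The variable names
<i>dir</i> and <i>file</i> are only illustrative:

<blockquote><pre>
f=/home/myplace/a.data.directory/a.filename.txt
dir=$(dirname "$f")           # directory part of the path
file=$(basename "$f" .txt)    # file name with the .txt suffix removed
echo "$dir"                   # /home/myplace/a.data.directory
echo "$file"                  # a.filename
</pre></blockquote>
<P>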

Usually these commands are used within the "back tick" shell operators
like  <i>TARGETDIR=`dirname $1`</i>.  The "back tick" operators
are equivalent to the <i>$(...)</i> construct.  This latter construct
is valid in Korn shell and <i>bash</i> -- and I find it easier to read
(since I don't have to squint at my screen wondering which direction the
"tick" is slanted). <P><H2>

A Better Way </H2>

Although the <i>basename</i> and <i>dirname</i> commands
embody the "small is beautiful" spirit of Unix -- they may push the
envelope towards the "too simple to be worth a separate program" end
of simplicity. <P>

Naturally you can call on <i>sed</i>, <i>awk</i>, TCL or
<i>perl</i> for more flexible and complete string handling.
However this can be overkill -- and a little ungainly. <P>

So, <i>bash</i> (which long ago abandoned the "small is beautiful"
principle and went the way of <i>emacs</i>) has some built-in
syntactical candy for doing these operations.  Since <i>bash</i>
is the default shell on most Linux systems, there is no reason not to 
use these features when writing scripts for Linux.<P>

	<ul><lh>
	If you're concerned about portability to other shells and
	systems -- you may want to stick with <i>dirname</i>,
	<i>basename</i>, and <i>sed</i></lh></ul><P>


<H2>
The <i>bash</i> Man Page </H2><P>

The <i>bash</i> man page is huge.  It contains a complete
reference to the "readline" libraries and how to write a <b>.inputrc</b>
file (which I think should all go in a separate man page) -- and a
run down of all the <i>csh</i> "history" or <b>bang!</b> operators
(which I think should be replaced with a simple statement like:
"Most of the <b>bang!</b> tricks that work in <i>csh</i> work the
same way in <i>bash</i>"). <P>

However, buried in there is a section on <b>Parameter Expansion</b>
which tells us that $foo is really a shorthand for ${foo} which is 
really the simplest case of several ${foo<i>:operators</i>} and similar
constructs. <P>

Are you confused, yet? <P>

Here's where a few examples would have helped.  To understand the 
man page I simply experimented with the echo command and several 
shell variables.  This is what it all means: 

	<ul><lh>
	Given:<ul><lh>
		foo=/tmp/my.dir/filename.tar.gz </lh></ul>
		</lh></ul>

	<ul>
	We can use these expressions:<dl><dt>
		path=${foo%/*} <dd>
			To get: /tmp/my.dir (like <i>dirname</i>)<dt>
		file=${foo##*/} <dd>
			To get: filename.tar.gz (like <i>basename</i>)<dt>
		base=${file%%.*} <dd>
			To get: filename <dt>
		ext=${file#*.} <dd>
			To get: tar.gz
			</dl></ul>

	<ul><B>Note that the last two depend on the 
	assignment made in the second one</b></ul>
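<P>
The resemblance to <i>dirname</i> only holds when the value actually
contains a slash.  A small sketch of the difference, using a made-up
value with no directory part:

<blockquote><pre>
foo=filename.tar.gz     # a name with no directory part
echo "${foo%/*}"        # filename.tar.gz (no "/" to match, so nothing is trimmed)
dirname "$foo"          # .
</pre></blockquote>
<P>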

Here we notice two different "operators" being used inside the 
parameters (curly braces).  Those are the <b>#</b> and the <b>%</b>
operators.  We also see them used as single characters and in pairs.
This gives us four combinations for trimming patterns off the 
beginning or end of a string:<DL><DT>
	${variable%pattern} <DD>
		Trim the shortest match from the end <DT>
	${variable##pattern} <DD>
		Trim the longest match from the beginning <DT>
	${variable%%pattern} <DD>
		Trim the longest match from the end <DT>
	${variable#pattern} <DD>
		Trim the shortest match from the beginning 
		
		</DL><P>
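To see all four side by side, here is a short sketch that applies each
operator to the same value.  The variable name <i>file</i> and its
value are just an example:

<blockquote><pre>
file=filename.tar.gz
echo "${file%.*}"     # filename.tar  (shortest match of ".*" from the end)
echo "${file%%.*}"    # filename      (longest match of ".*" from the end)
echo "${file#*.}"     # tar.gz        (shortest match of "*." from the beginning)
echo "${file##*.}"    # gz            (longest match of "*." from the beginning)
</pre></blockquote>
<P>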

It's important to understand that these use shell "globbing"
rather than "regular expressions" to <b>match</b> these patterns. 
Naturally a simple string like "txt" will match sequences of exactly
those three characters in that sequence -- so the difference between
"shortest" and "longest" only applies if you are using a shell 
wild card in your pattern.<P>
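A few glob patterns in action may make the distinction clearer.  This
is only a sketch; the variable <i>src</i> and its value are invented
for the illustration:

<blockquote><pre>
src=module7.c
echo "${src%.[ch]}"      # module7  ([ch] matches one character, c or h)
echo "${src%%[0-9]*}"    # module   (longest match that starts at a digit)
echo "${src#???}"        # ule7.c   (each ? matches exactly one character)
echo "${src%.*}"         # module7  (in a glob the "." is literal and "*" matches any string)
</pre></blockquote>
<P>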

A simple example of using these operators comes in the common
question of copying or renaming all the *.txt files to change the 
.txt to .bak (in MS-DOS' COMMAND.COM that would be REN *.TXT *.BAK).<P>

This is complicated in Unix/Linux because of a fundamental difference
in the programming APIs.  In most Unix shells the expansion of a
wild card pattern into a list of filenames (called "globbing") is done
by the shell -- before the command is executed.  Thus the command normally
sees a list of filenames (like "foo.txt bar.txt etc.txt") whereas DOS
(COMMAND.COM) hands external programs a pattern like *.TXT.  <P>

Under Unix shells, if a pattern doesn't match any filenames the parameter 
is usually left on the command line literally.  Under <i>bash</i>
this is a user-settable option.  In fact, under <i>bash</i> you can
disable shell "globbing" if you like -- there's a simple option to do this.
It's almost never used -- because commands like <i>mv</i>, and
<i>cp</i> won't work properly if their arguments are passed to them
in this manner.<P>
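For the curious, here is a sketch of both settings.  It assumes a
present-day <i>bash</i>, where the options are spelled
<i>shopt -s nullglob</i> and <i>set -f</i> (older releases may spell
them differently), and it wraps everything in a throwaway function,
<i>glob_demo</i>, that runs in a subshell with an empty scratch
directory:

<blockquote><pre>
glob_demo() (
    cd "$(mktemp -d)" || exit    # empty scratch directory, so *.txt matches nothing
    echo *.txt                   # prints the pattern itself: *.txt
    shopt -s nullglob            # unmatched patterns now expand to nothing
    echo *.txt                   # prints an empty line
    shopt -u nullglob
    touch a.txt b.txt
    set -f                       # disable globbing entirely
    echo *.txt                   # prints *.txt even though files now match
    set +f                       # globbing back on
    echo *.txt                   # prints a.txt b.txt
)
glob_demo
</pre></blockquote>
<P>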

However here's a way to accomplish a similar result:

	<blockquote>
	for i in *.txt; do cp "$i" "${i%.txt}.bak"; done
		</blockquote>

... obviously this is more typing. If you tried to create a
shell function or alias for it -- you'd have to figure out how to 
pass it parameters.  Certainly the following seems simple enough:

	<blockquote>
	function cp-pattern {<br>
	for i in $1; do cp "$i" "${i%$1}$2"; done<br>
	}
		</blockquote>

... but that doesn't work like most Unix users would expect.  You'd
have to pass this command a pair of specially <em>chosen</em>, and
<em>quoted</em> arguments like:

	<blockquote>
	cp-pattern '*.txt' .bak
		</blockquote>

... note how the second pattern has no wild cards and how the first is
quoted to prevent any shell globbing.  That's fine for something you
might just use yourself -- if you remember to quote it right.  It's 
easy enough to add a check for the number of arguments and to ensure that
there is at least one file that exists in the $1 pattern.  However it 
becomes much harder to make this command reasonably safe and robust.  
Inevitably it becomes less "unix-like" and thus more difficult to use
with other Unix tools.<P>
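As one possible direction, here is a sketch of a more defensive
variant.  It is not the function above: the name <i>cp_suffix</i> and
its argument order (old suffix, new suffix, then the files) are
assumptions made for this illustration.

<blockquote><pre>
# Hypothetical helper: cp_suffix OLD NEW FILE...
# Copies each FILE that ends in OLD to the same name ending in NEW.
cp_suffix() {
    if [ "$#" -lt 3 ]; then
        echo "usage: cp_suffix OLD NEW FILE..."
        return 2
    fi
    local old=$1 new=$2 f
    shift 2
    for f in "$@"; do
        if [ ! -e "$f" ]; then
            echo "skipping $f: no such file"
            continue
        fi
        case $f in
            *"$old") cp -- "$f" "${f%"$old"}$new" ;;
            *)       echo "skipping $f: does not end in $old" ;;
        esac
    done
}
</pre></blockquote>

It would be called as <i>cp_suffix .txt .bak *.txt</i>.  Because the
file list comes last, the shell can expand the wild card in the usual
way and nothing needs to be quoted. <P>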

I generally just take a whole different approach.  Rather than trying
to use <i>cp</i> to make a backup of each file under a slightly
changed name I might just make a directory (usually using the date
and my login ID as a template) and use a simple <i>cp</i> command
to copy all my target files into the new directory.<P>
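A sketch of that approach follows.  The function name
<i>backup_txt</i> and the exact directory naming scheme are only
assumptions for the example:

<blockquote><pre>
# Hypothetical helper: copy the *.txt files of the current directory
# into a new subdirectory named after today's date and the login name.
backup_txt() {
    local dir
    dir="backup-$(date +%Y%m%d)-$(id -un)"
    mkdir -p "$dir" || return
    cp -- *.txt "$dir"/
}
</pre></blockquote>
<P>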

Another interesting thing we can do with these "parameter expansion"
features is to iterate over a list of components in a single variable.<P>

For example, you might want to do something to traverse over every 
directory listed in your path -- perhaps to verify that everything
listed therein is really a directory and is accessible to you.<P>

Here's a command that will echo each directory named on your path
on its own line:

	<blockquote>
	p=$PATH<br>
	until [ "$p" = "$d" ]; do d=${p%%:*}; p=${p#*:}; echo "$d"; done
		</blockquote>

... obviously you can replace the <i>echo $d</i> part of this
command with anything you like. <P>
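For instance, the check mentioned earlier (is every entry really a
directory, and is it accessible?) might be sketched like this.  The
function name <i>check_dirs</i> is made up, and the loop is a slightly
different arrangement of the same two expansions:

<blockquote><pre>
# Hypothetical helper: report the entries of a colon-separated list
# (default: $PATH) that are not accessible directories.
check_dirs() {
    local p="${1:-$PATH}" d
    while [ -n "$p" ]; do
        d=${p%%:*}                  # first component
        case $p in
            *:*) p=${p#*:} ;;       # drop it and the colon after it
            *)   p= ;;              # that was the last component
        esac
        if [ ! -d "$d" ] || [ ! -x "$d" ]; then
            echo "not an accessible directory: $d"
        fi
    done
}
</pre></blockquote>

Running <i>check_dirs</i> with no argument examines $PATH.  Note that
an empty entry (two colons in a row, or a leading colon) is reported
with an empty name. <P>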

Another case might be where you'd want to traverse a list of directories
that were all part of a path.  Here's a command pair that echos each
directory from the root down to the "current working directory":

	<blockquote>
	p=$(pwd)<br>
	until [ "$p" = "$d" ]; do p=${p#*/}; d=${p%%/*}; echo "$d"; done
		</blockquote>

... here we've reversed the assignments to <i>p</i> and <i>d</i>
so that we skip the root directory itself -- which must be "special cased"
since it appears to be a "null" entry if we do it the other way.  The 
same problem would have occurred in the previous example -- if the value
assigned to <i>$PATH</i> had started with a ":" character. <P>
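One way to special-case the root is to print it first and then walk
the remaining components.  The following is only a sketch; the
function name <i>ancestors</i> is invented, and it prints each
directory as a full path rather than as a bare name:

<blockquote><pre>
# Print every prefix of an absolute path, from / down to the path itself.
ancestors() {
    local rest="${1#/}" prefix=
    echo /
    while [ -n "$rest" ]; do
        prefix=$prefix/${rest%%/*}     # append the next component
        case $rest in
            */*) rest=${rest#*/} ;;    # more components remain
            *)   rest= ;;              # that was the last one
        esac
        echo "$prefix"
    done
}
ancestors /usr/local/bin
</pre></blockquote>

For <i>/usr/local/bin</i> this prints /, /usr, /usr/local and
/usr/local/bin, one per line. <P>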


Of course, it's important to realize that this is not the only, or 
necessarily the best, method to parse a line or value into separate 
fields.  Here's an example that uses the old <i>IFS</i> variable
(the "internal field separator" in the Bourne and Korn shells as well as 
<i>bash</i>) to parse each line of <i>/etc/passwd</i> and extract 
just two fields:

		
		<blockquote><pre>
		cat /etc/passwd | ( \
			IFS=: ; while read lognam pw id gp fname home sh; \
				do echo $home \"$fname\"; done \
				)
			</pre></blockquote>

Here we see the parentheses used to isolate the contents in a subshell
-- such that the assignment to IFS doesn't affect our current shell.
Setting the IFS to a "colon" tells the shell to treat that character as 
the separator between "words" -- instead of the usual "whitespace" that's
assigned to it.  For this particular function it's very important that
IFS consist solely of that character -- usually it is set to "space,"
"tab," and "newline."<P>

After that we see a typical <i>while read</i> loop -- where we
read values from each line of input (from <i>/etc/passwd</i>) into
seven variables per line.  This allows us to use any of these fields
that we need from within the loop.  Here we are just using the <i>
echo</i> command -- as we have in the other examples.<P>
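A variation, offered as an alternative sketch rather than a
replacement: putting the IFS assignment directly in front of
<i>read</i> limits it to that one command, so no subshell is needed to
protect the current shell's IFS.  The function name
<i>passwd_fields</i> and the two sample accounts are made up so that
the example doesn't depend on the real <i>/etc/passwd</i>:

<blockquote><pre>
# Read passwd-style lines on standard input; print home and full name.
passwd_fields() {
    while IFS=: read -r lognam pw id gp fname home sh; do
        echo "$home \"$fname\""
    done
}
printf '%s\n' \
    'alice:x:1000:1000:Alice Example:/home/alice:/bin/bash' \
    'bob:x:1001:1001:Bob Example:/home/bob:/bin/sh' | passwd_fields
</pre></blockquote>
<P>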

My point here has been to show how we can do quite a bit of 
string parsing and manipulation directly within <i>bash</i>
-- which will allow our shell scripts to run faster with less overhead
and may be easier than some of the more complex sorts of pipes and 
command substitutions one might have to employ to pass data to the
various external commands and return the results. <P>

Many people might ask: <i>Why not simply do it all in <b>perl</b>?</i>
I won't dignify that with a response.  Part of the beauty of Unix is
that each user has many options about how they choose to program something.
Well written scripts and programs interoperate regardless of what particular
scripting or programming facility was used to create them.  Issue the
command <i>file /usr/bin/*</i> on your system and you may be 
surprised at how many Bourne and C shell scripts there are in there.<P>

In conclusion I'll just provide a sampler of some other 
<i>bash</i> parameter expansions:

	<DL><DT>
	<i>${parameter:-word}</i><DD>
	Provide a default if <i>parameter</i> is unset or null.<br>
	Example:<ul><lh>
	      <i> echo ${1:-"default"}</i></lh></ul>

	Note:  this would have to be used from within a
	function or shell script -- the point is to show 
	that some of the parameter substitutions can be used
	with the shell's numbered arguments.   In this case the
	string "default" would be returned if the function
	or script was called with no $1 (or if all of the 
	arguments had been <i>shift</i>ed out of existence).<DT>

	<i>${parameter:=word}</i><DD>
	Assign a value to <i>parameter</i> if it was previously
	unset or null.<br>
	Example:<ul><lh>
	       <i>echo ${HOME:="/home/.nohome"}</i></lh></ul><DT>

	<i>${parameter:?word}</i><DD>
	Generate an error if <i>parameter</i> is unset or null by
	printing <i>word</i> to <i>stderr</i>.<br>
	Example:<ul><lh>
	      <i>: ${HOME:="/home/.nohome"} </i><br>
	      <i>: ${TMP:?"Error: Must have a valid Temp Variable Set"}</i></lh></ul>
       </DL>
       
This one just uses the shell "null command" (the : command) to
evaluate the expression.  If the variable doesn't exist or has a 
null value -- this will print the string to the standard error
file handle and exit the script with a return code of one.<P>
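Pulling these three together, here is a short sketch that can be
pasted into a script.  The function name <i>greet</i> and the variable
<i>EDITOR_CHOICE</i> are invented for the example, and the
<i>:?</i> test is run inside a subshell so that its failure doesn't
end the script:

<blockquote><pre>
greet() {
    echo "Hello, ${1:-world}"       # falls back to "world" when $1 is unset or null
}
greet                               # Hello, world
greet Linux                         # Hello, Linux

unset EDITOR_CHOICE
: "${EDITOR_CHOICE:=vi}"            # assigns vi because the variable was unset
echo "$EDITOR_CHOICE"               # vi

if ( unset TMP; : "${TMP:?Error: Must have a valid Temp Variable Set}" ); then
    echo "TMP is set"
else
    echo "TMP is unset or null"     # this branch runs; the message itself went to stderr
fi
</pre></blockquote>
<P>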

Oddly enough -- while it is easy to redirect the standard error
of processes under <i>bash</i> -- it is less obvious how to
explicitly generate a message or redirect output
<b>to</b> stderr.  The portable method is to duplicate the file
descriptor, as in <i>echo "$*" &gt;&amp;2</i>.  On Linux you can also use
the /proc/ filesystem (process table) like so:

	<ul><ul>
	function error { echo "$*" &gt; /proc/self/fd/2; }
	</ul></ul>

... <i>self</i> is always a set of entries that refers to the current
process -- and <i>self/fd/</i> is a directory full of the currently
open file descriptors.  Under Unix and DOS every process is given
the following pre-opened file descriptors:  stdin, stdout, and stderr.
<p><dl><dt>

       ${parameter:+word}<dd>
       Alternative value.<br>
       Example: ${TMP:+"/mnt/tmp"} <br>
       
       Use /mnt/tmp instead of $TMP, but expand to nothing if TMP was
       unset or null.  This is a weird one that I can't ever see myself
       using.  But it is a logical complement to the ${var:-value} 
       we saw above.<dt>


	${#variable}<dd>
	Return the length of the variable's value in characters.<br>

	Example:<ul><lh>
       echo The length of your PATH is ${#PATH} </lh></ul><DD>
	</DL>
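<P>
One place where <i>${parameter:+word}</i> can earn its keep is in
building up a colon-separated list without leaving a stray leading
colon.  This is only a sketch, and the variable name <i>list</i> and
the two directories are arbitrary:

<blockquote><pre>
list=
list=${list:+$list:}/usr/local/bin     # list was empty, so the prefix expands to nothing
list=${list:+$list:}/opt/bin           # list is set now, so "$list:" is put in front
echo "$list"                           # /usr/local/bin:/opt/bin
echo "${#list}"                        # 23
</pre></blockquote>
<P>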


<!--===================================================================-->
<P> <hr> <P> 
<center><H5>Copyright &copy; 1997, Jim Dennis<BR> 
Published in Issue 18 of the Linux Gazette, June 1997</H5></center>

<!--===================================================================-->
<P> <hr> <P> 
<A HREF="./lg_toc18.html"><IMG ALIGN=BOTTOM SRC="../gx/indexnew.gif" 
ALT="[ TABLE OF CONTENTS ]"></A>
<A HREF="../lg_frontpage.html"><IMG ALIGN=BOTTOM SRC="../gx/homenew.gif"
ALT="[ FRONT PAGE ]"></A>
<A HREF="./lg_answer18.html"><IMG SRC="../gx/back2.gif"
ALT=" Back "></A>
<A HREF="./gnu.html"><IMG SRC="../gx/fwd.gif" ALT=" Next "></A>
<P> <hr> <P> 
<!--startcut ==========================================================-->
</BODY>
</HTML>
<!--endcut ============================================================-->