1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396
|
<!--startcut ==========================================================-->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<title>bash String Manipulations Issue 18</title>
</HEAD>
<BODY BGCOLOR="#EEE1CC" TEXT="#000000" LINK="#0000FF" VLINK="#0020F0"
ALINK="#FF0000">
<!--endcut ============================================================-->
<H4>
"Linux Gazette...<I>making Linux just a little more fun!</I>"
</H4>
<P> <HR> <P>
<!--===================================================================-->
<center>
<H2>bash String Manipulations</H2>
<H4>By Jim Dennis,
<a href="mailto:jimd@starshine.org">jimd@starshine.org</a></H4>
</center>
<P><HR><P>
<P>
The <i>bash</i> shell has many features that are
sufficiently obscure you almost never see them used. One of
the problems is that the man page offers no examples. <P>
Here I'm going to show how to use some of these features to do
the sorts of simple string manipulations that are commonly
needed on file and path names. <H2>
Background </H2><P>
In traditional Bourne shell programming you might see references
to the <i>basename</i> and <i>dirname</i> commands.
These perform simple string manipulations on their arguments. You'll
also see many uses of <i>sed</i> and <i>awk</i> or
<i>perl -e</i> to perform simple string manipulations.<P>
Often these machinations are necessary perform on lists of filenames
and paths. There are many specialized programs that are conventionally
included with Unix to perform these sorts of utility functions:
<i>tr</i>, <i>cut</i>, <i>paste</i>, and <i>join</i>.
Given a filename like <i>/home/myplace/a.data.directory/a.filename.txt
</i> which we'll call <b>$f</b> you could use commands like:
<blockquote><pre>
<i>dirname</i> <b>$f</b>
<i>basename</i> <b>$f</b>
<i>basename</i> <b>$f</b> .txt
</pre></blockquote>
... to see output like:
<blockquote><pre><i>
/home/myplace/a.data.directory
a.filename.txt
a.filename </i></pre></blockquote>
Notice that the GNU version of <i>basename</i> takes an
optional parameter. This handy for specifying a filename "extension"
like <b>.tar.gz</b> which will be stripped off of the output. Note that
<i>basename</i> and <i>dirname</i> don't verify that these parameters
are valid filenames or paths. They simple perform simple
string operations on a single argument. You shouldn't use wild cards
with them -- since <i>dirname</i> takes exactly one argument
(and complains if given more) and <i>basename</i> takes one argument
and an optional one which is not a filename. <P>
Despite their simplicity these two commands are used frequently in
shell programming because most shells don't have any built-in string
handling functions -- and we frequently need to refer to just the
directory or just the file name parts of a given full file specification. <P>
Usually these commands are used within the "back tick" shell operators
like <i>TARGETDIR=`dirname $1`</i>. The "back tick" operators
are equivalent to the <i>$(...)</i> construct. This latter construct
is valid in Korn shell and <i>bash</i> -- and I find it easier to read
(since I don't have to squint at me screen wondering which direction the
"tick" is slanted). <P><H2>
A Better Way </H2>
Although the <i>basename</i> and <i>dirname</i> commands
embody the "small is beautiful" spirit of Unix -- they may push the
envelope towards the "too simple to be worth a separate program" end
of simplicity. <P>
Naturally you can call on <i>sed</i>, <i>awk</i>, TCL or
<i>perl</i> for more flexible and complete string handling.
However this can be overkill -- and a little ungainly. <P>
So, <i>bash</i> (which long ago abandoned the "small is beautiful"
principal and went the way of <i>emacs</i>) has some built in
syntactical candy for doing these operations. Since <i>bash</i>
is the default shell on Linux systems then there is no reason not to
use these features when writing scripts for Linux.<P>
<ul><lh>
If your concerned about portability to other shells and
systems -- you may want to stick with <i>dirname</i>,
<i>basename</i>, and <i>sed</i></lh></ul><P>
<H2>
The <i>bash</i> Man Page </H2><P>
The <i>bash</i> man page is huge. In contains a complete
reference to the "readline" libraries and how to write a <b>.inputrc</b>
file (which I think should all go in a separate man page) -- and a
run down of all the <i>csh</i> "history" or <b>bang!</b> operators
(which I think should be replaced with a simple statement like:
"Most of the <b>bang!</b> tricks that work in <i>csh</i> work the
same way in <i>bash</i>"). <P>
However, buried in there is a section on <b>Parameter Substitution</b>
which tells us that $foo is really a shorthand for ${foo} which is
really the simplest case of several ${foo<i>:operators</i>} and similar
constructs. <P>
Are you confused, yet? <P>
Here's where a few examples would have helped. To understand the
man page I simply experimented with the echo command and several
shell variables. This is what it all means:
<ul><lh>
Given:<ul><lh>
foo=/tmp/my.dir/filename.tar.gz </lh></ul>
</lh></ul>
<ul>
We can use these expressions:<dl><dt>
path = ${foo%/*} <dd>
To get: /tmp/my.dir (like <i>dirname</i>)<dt>
file = ${foo##*/} <dd>
To get: filename.tar.gz (like <i>basename</i>)<dt>
base = ${file%%.*} <dd>
To get: filename <dt>
ext = ${file#*.} <dd>
To get: tar.gz
</dl></ul>
<ul><B>Note that the last two depend on the
assignment made in the second one</b></ul>
Here we notice two different "operators" being used inside the
parameters (curly braces). Those are the <b>#</b> and the <b>%</b>
operators. We also see them used as single characters and in pairs.
This gives us four combinations for trimming patterns off the
beginning or end of a string:<DL><DT>
${variable%pattern} <DD>
Trim the shortest match from the end <DT>
${variable##pattern} <DD>
Trim the longest match from the beginning <DT>
${variable%%pattern} <DD>
Trim the shortest match from the end <DT>
${variable#pattern} <DD>
Trim the shortest match from the beginning
</DL><P>
It's important to understand that these use shell "globbing"
rather than "regular expressions" to <b>match</b> these patterns.
Naturally a simple string like "txt" will match sequences of exactly
those three characters in that sequence -- so the difference between
"shortest" and "longest" only applies if you are using a shell
wild card in your pattern.<P>
A simple example of using these operators comes in the common
question of copying or renaming all the *.txt to change the
.txt to .bak (in MS-DOS' COMMAND.COM that would be REN *.TXT *.BAK).<P>
This is complicated in Unix/Linux because of a fundamental difference
in the programming API's. In most Unix shells the expansion of a
wild card pattern into a list of filenames (called "globbing") is done
by the shell -- before the command is executed. Thus the command normally
sees a list of filenames (like "foo.txt bar.txt etc.txt") where DOS
(COMMAND.COM) hands external programs a pattern like *.TXT. <P>
Under Unix shells, if a pattern doesn't match any filenames the parameter
is usually left on the command like literally. Under <i>bash</i>
this is a user-settable option. In fact, under <i>bash</i> you can
disable shell "globbing" if you like -- there's a simple option to do this.
It's almost never used -- because commands like <i>mv</i>, and
<i>cp</i> won't work properly if their arguments are passed to them
in this manner.<P>
However here's a way to accomplish a similar result:
<blockquote>
for i in *.txt; do cp $i ${i%.txt}.bak; done
</blockquote>
... obviously this is more typing. If you tried to create a
shell function or alias for it -- you have to figure out how to
pass this parameters. Certainly the following seems simple enough:
<blockquote>
function cp-pattern {
for i in $1; do cp $i ${i%$1}$2; done
</blockquote>
... but that doesn't work like most Unix users would expect. You'd
have to pass this command a pair of specially <em>chosen</em>, and
<em>quoted</em> arguments like:
<blockquote>
cp-pattern '*.txt' .bak
</blockquote>
... note how the second pattern has no wild cards and how the first is
quoted to prevent any shell globbing. That's fine for something you
might just use yourself -- if you remember to quote it right. It's
easy enough to add check for the number of arguments and to ensure that
there is at least one file that exists in the $1 pattern. However it
becomes much harder to make this command reasonably safe and robust.
Inevitably it becomes less "unix-like" and thus more difficult to use
with other Unix tools.<P>
I generally just take a whole different approach. Rather than trying
to use <i>cp</i> to make a backup of each file under a slightly
changed name I might just make a directory (usually using the date
and my login ID as a template) and use a simple <i>cp</i> command
to copy all my target files into the new directory.<P>
Another interesting thing we can do with these "parameter expansion"
features is to iterate over a list of components in a single variable.<P>
For example, you might want to do something to traverse over every
directory listed in your path -- perhaps to verify that everything
listed therein is really a directory and is accessible to you.<P>
Here's a command that will echo each directory named on your path
on it's own line:
<blockquote>
p=$PATH
until [ $p = $d ]; do d=${p%%:*}; p=${p#*:}; echo $d; done
</blockquote>
... obviously you can replace the <i>echo $d</i> part of this
command with anything you like. <P>
Another case might be where you'd want to traverse a list of directories
that were all part of a path. Here's a command pair that echos each
directory from the root down to the "current working directory":
<blockquote>
p=$(pwd)
until [ $p = $d ]; do p=${p#*/}; d=${p%%/*}; echo $d; done
</blockquote>
... here we've reversed the assignments to <i>p</i> and <i>d</i>
so that we skip the root directory itself -- which must be "special cased"
since it appears to be a "null" entry if we do it the other way. The
same problem would have occurred in the previous example -- if the value
assigned to <i>$PATH</i> had started with a ":" character. <P>
Of course, its important to realize that this is not the only, or
necessarily the best method to parse a line or value into separate
fields. Here's an example that uses the old <i>IFS</i> variable
(the "inter-field separator in the Bourne, and Korn shells as well as
<i>bash</i>) to parse each line of <i>/etc/passwd</i> and extract
just two fields:
<blockquote><pre>
cat /etc/passwd | ( \
IFS=: ; while read lognam pw id gp fname home sh; \
do echo $home \"$fname\"; done \
)
</pre></blockquote>
Here we see the parentheses used to isolate the contents in a subshell
-- such that the assignment to IFS doesn't affect our current shell.
Setting the IFS to a "colon" tells the shell to treat that character as
the separater between "words" -- instead of the usual "whitespace" that's
assigned to it. For this particular function it's very important that
IFS consist solely of that character -- usually it is set to "space,"
"tab," and "newline.<P>
After that we see a typical <i>while read</i> loop -- where we
read values from each line of input (from <i>/etc/passwd</i> into
seven variables per line. This allows us to use any of these fields
that we need from within the loop. Here we are just using the <i>
echo</i> command -- as we have in the other examples.<P>
My point here has been to show how we can do quite a bit of
string parsing and manipulation directly within <i>bash</i>
-- which will allow our shell scripts to run faster with less overhead
and may be easier than some of the more complex sorts of pipes and
command substitutions one might have to employ to pass data to the
various external commands and return the results. <P>
Many people might ask: <i>Why not simply do it all in <b>perl</b>?</i>
I won't dignify that with a response. Part of the beauty of Unix is
that each user has many options about how they choose to program something.
Well written scripts and programs interoperate regardless of what particular
scripting or programming facility was used to create them. Issue the
command <i>file /usr/bin/*</i> on your system and and you may be
surprised at how many Bourne and C shell scripts there are in there<P>
In conclusion I'll just provide a sampler of some other
<i>bash</i> parameter expansions:
<DL><DT>
<i>${parameter:-word}</i><DD>
Provide a default if <i>parameter</i> is unset or null.<br>
Example:<ul><lh>
<i> echo ${1:-"default"}</i></lh></ul><DT>
Note: this would have to be used from within a
functions or shell script -- the point is to show
that some of the parameter substitutions can be use
with shell numbered arguments. In this case the
string "default" would be returned if the function
or script was called with no $1 (or if all of the
arguments had been <i>shift</i>ed out of existence.
<i>${parameter:=word}</i><DD>
Assign a value to <i>parameter</i> if it was previously
unset or null.<dt>
Example:<ul><lh>
<i>echo ${HOME:="/home/.nohome"}</i></lh></ul><DD>
${parameter:?word}<DD>
Generate an error if <i>parameter</i> is unset or null by
printing <i>word</i> to <i>stdout</i>.<dt>
Example:<ul><lh>
<i>: ${HOME:="/home/.nohome"} </i></lh></ul><DD>
${TMP:?"Error: Must have a valid Temp Variable Set"}
</DL>
This one just uses the shell "null command" (the : command) to
evaluate the expression. If the variable doesn't exist or has a
null value -- this will print the string to the standard error
file handle and exit the script with a return code of one.<P>
Oddly enough -- while it is easy to redirect the standard error
of processes under <i>bash</i> -- there doesn't seem to be an
easy portable way to explicitly generate message or redirect output
<b>to</b> stderr. The best method I've come up with is to use
the /proc/ filesystem (process table) like so:
<ul><ul>
function error { echo "$*" > /proc/self/fd/2 }
</ul></ul>
... <i>self</i> is always a set of entries that refers to the current
process -- and <i>self/fd/</i> is a directory full of the currently
open file descriptors. Under Unix and DOS every process is given
the following pre-opened file descriptors: stdin, stdout, and stderr.
<p><dl><dt>
${parameter:+word}<dd>
Alternative value.
${TMP:+"/mnt/tmp"} <br>
use /mnt/tmp instead of $TMP but do nothing if TMP was
unset. This is a weird one that I can't ever see myself
using. But it is a logical complement to the ${var:-value}
we saw above.<dt>
${#variable}<dd>
Return the length of the variable in characters.<br>
Example:<ul><lh>
echo The length of your PATH is ${#PATH} </lh></ul><DD>
</DL>
<!--===================================================================-->
<P> <hr> <P>
<center><H5>Copyright © 1997, Jim Dennis<BR>
Published in Issue 18 of the Linux Gazette, June 1997</H5></center>
<!--===================================================================-->
<P> <hr> <P>
<A HREF="./lg_toc18.html"><IMG ALIGN=BOTTOM SRC="../gx/indexnew.gif"
ALT="[ TABLE OF CONTENTS ]"></A>
<A HREF="../lg_frontpage.html"><IMG ALIGN=BOTTOM SRC="../gx/homenew.gif"
ALT="[ FRONT PAGE ]"></A>
<A HREF="./lg_answer18.html"><IMG SRC="../gx/back2.gif"
ALT=" Back "></A>
<A HREF="./gnu.html"><IMG SRC="../gx/fwd.gif" ALT=" Next "></A>
<P> <hr> <P>
<!--startcut ==========================================================-->
</BODY>
</HTML>
<!--endcut ============================================================-->
|