File: bash.html

package info (click to toggle)
lg-issue18 5-4
links: PTS
area: main
in suites: woody
size: 2,928 kB
ctags: 148
sloc: makefile: 36; sh: 4
file content (396 lines) | stat: -rw-r--r-- 15,795 bytes
parent folder | download | duplicates (4)
<!--startcut ==========================================================-->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<title>bash String Manipulations Issue 18</title>
</HEAD>
<BODY BGCOLOR="#EEE1CC" TEXT="#000000" LINK="#0000FF" VLINK="#0020F0"
ALINK="#FF0000">
<!--endcut ============================================================-->

<H4>
&quot;Linux Gazette...<I>making Linux just a little more fun!</I>&quot;
</H4>

<P> <HR> <P> 
<!--===================================================================-->

<center>
<H2>bash String Manipulations</H2>
<H4>By Jim Dennis,
<a href="mailto:jimd@starshine.org">jimd@starshine.org</a></H4>
</center>
<P><HR><P>
<P>
The <i>bash</i> shell has many features that are 
sufficiently obscure you almost never see them used.  One of
the problems is that the man page offers no examples. <P>

Here I'm going to show how to use some of these features to do
the sorts of simple string manipulations that are commonly 
needed on file and path names. <H2>

Background </H2><P>

In traditional Bourne shell programming you might see references
to the <i>basename</i> and <i>dirname</i> commands.
These perform simple string manipulations on their arguments. You'll
also see many uses of <i>sed</i> and <i>awk</i> or
<i>perl -e</i> to perform simple string manipulations.<P>

Often these machinations are necessary perform on lists of filenames
and paths. There are many specialized programs that are conventionally
included with Unix to perform these sorts of utility functions: 
<i>tr</i>, <i>cut</i>, <i>paste</i>, and <i>join</i>.

Given a filename like <i>/home/myplace/a.data.directory/a.filename.txt
</i> which we'll call <b>$f</b> you could use commands like:

	<blockquote><pre>
	<i>dirname</i> <b>$f</b> 
	<i>basename</i> <b>$f</b> 
	<i>basename</i> <b>$f</b> .txt
	</pre></blockquote>

... to see output like:
	<blockquote><pre><i>
	/home/myplace/a.data.directory
	a.filename.txt
	a.filename </i></pre></blockquote>

Notice that the GNU version of <i>basename</i> takes an 
optional parameter.  This handy for specifying a filename "extension"
like <b>.tar.gz</b> which will be stripped off of the output.  Note that
<i>basename</i> and <i>dirname</i> don't verify that these parameters 
are valid filenames or paths.  They simple perform simple
string operations on a single argument. You shouldn't use wild cards
with them -- since <i>dirname</i> takes exactly one argument
(and complains if given more) and <i>basename</i> takes one argument
and an optional one which is not a filename. <P>

Despite their simplicity these two commands are used frequently in 
shell programming because most shells don't have any built-in string
handling functions -- and we frequently need to refer to just the
directory or just the file name parts of a given full file specification. <P>

Usually these commands are used within the "back tick" shell operators
like  <i>TARGETDIR=`dirname $1`</i>.  The "back tick" operators
are equivalent to the <i>$(...)</i> construct.  This latter construct
is valid in Korn shell and <i>bash</i> -- and I find it easier to read
(since I don't have to squint at me screen wondering which direction the
"tick" is slanted). <P><H2>

A Better Way </H2>

Although the <i>basename</i> and <i>dirname</i> commands
embody the "small is beautiful" spirit of Unix -- they may push the
envelope towards the "too simple to be worth a separate program" end
of simplicity. <P>

Naturally you can call on <i>sed</i>, <i>awk</i>, TCL or
<i>perl</i> for more flexible and complete string handling.
However this can be overkill -- and a little ungainly. <P>

So, <i>bash</i> (which long ago abandoned the "small is beautiful"
principal and went the way of <i>emacs</i>) has some built in
syntactical candy for doing these operations.  Since <i>bash</i>
is the default shell on Linux systems then there is no reason not to 
use these features when writing scripts for Linux.<P>

	<ul><lh>
	If your concerned about portability to other shells and
	systems -- you may want to stick with <i>dirname</i>,
	<i>basename</i>, and <i>sed</i></lh></ul><P>


<H2>
The <i>bash</i> Man Page </H2><P>

The <i>bash</i> man page is huge.  In contains a complete
reference to the "readline" libraries and how to write a <b>.inputrc</b>
file (which I think should all go in a separate man page) -- and a
run down of all the <i>csh</i> "history" or <b>bang!</b> operators
(which I think should be replaced with a simple statement like:
"Most of the <b>bang!</b> tricks that work in <i>csh</i> work the
same way in <i>bash</i>"). <P>

However, buried in there is a section on <b>Parameter Substitution</b>
which tells us that $foo is really a shorthand for ${foo} which is 
really the simplest case of several ${foo<i>:operators</i>} and similar
constructs. <P>

Are you confused, yet? <P>

Here's where a few examples would have helped.  To understand the 
man page I simply experimented with the echo command and several 
shell variables.  This is what it all means: 

	<ul><lh>
	Given:<ul><lh>
		foo=/tmp/my.dir/filename.tar.gz </lh></ul>
		</lh></ul>

	<ul>
	We can use these expressions:<dl><dt>
		path = ${foo%/*} <dd>
			To get: /tmp/my.dir (like <i>dirname</i>)<dt>
		file = ${foo##*/} <dd>
			To get: filename.tar.gz (like <i>basename</i>)<dt>
		base = ${file%%.*} <dd>
			To get: filename <dt>
		ext  = ${file#*.} <dd>
			To get: tar.gz
			</dl></ul>

	<ul><B>Note that the last two depend on the 
	assignment made in the second one</b></ul>

Here we notice two different "operators" being used inside the 
parameters (curly braces).  Those are the <b>#</b> and the <b>%</b>
operators.  We also see them used as single characters and in pairs.
This gives us four combinations for trimming patterns off the 
beginning or end of a string:<DL><DT>
	${variable%pattern} <DD>
		Trim the shortest match from the end <DT>
	${variable##pattern} <DD>
		Trim the longest match from the beginning <DT>
	${variable%%pattern} <DD>
		Trim the shortest match from the end <DT>
	${variable#pattern} <DD>
		Trim the shortest match from the beginning 
		
		</DL><P>

It's important to understand that these use shell "globbing"
rather than "regular expressions" to <b>match</b> these patterns. 
Naturally a simple string like "txt" will match sequences of exactly
those three characters in that sequence -- so the difference between
"shortest" and "longest" only applies if you are using a shell 
wild card in your pattern.<P>

A simple example of using these operators comes in the common
question of copying or renaming all the *.txt to change the 
.txt to .bak (in MS-DOS' COMMAND.COM that would be REN *.TXT *.BAK).<P>

This is complicated in Unix/Linux because of a fundamental difference
in the programming API's.  In most Unix shells the expansion of a
wild card pattern into a list of filenames (called "globbing") is done
by the shell -- before the command is executed.  Thus the command normally
sees a list of filenames (like "foo.txt bar.txt etc.txt") where DOS
(COMMAND.COM) hands external programs a pattern like *.TXT.  <P>

Under Unix shells, if a pattern doesn't match any filenames the parameter 
is usually left on the command like literally.  Under <i>bash</i>
this is a user-settable option.  In fact, under <i>bash</i> you can
disable shell "globbing" if you like -- there's a simple option to do this.
It's almost never used -- because commands like <i>mv</i>, and
<i>cp</i> won't work properly if their arguments are passed to them
in this manner.<P>

However here's a way to accomplish a similar result:

	<blockquote>
	for i in *.txt; do cp $i ${i%.txt}.bak; done
		</blockquote>

... obviously this is more typing. If you tried to create a
shell function or alias for it -- you have to figure out how to 
pass this parameters.  Certainly the following seems simple enough:

	<blockquote>
	function cp-pattern {
	for i in $1; do cp $i ${i%$1}$2; done
		</blockquote>

... but that doesn't work like most Unix users would expect.  You'd
have to pass this command a pair of specially <em>chosen</em>, and
<em>quoted</em> arguments like:

	<blockquote>
	cp-pattern '*.txt' .bak
		</blockquote>

... note how the second pattern has no wild cards and how the first is
quoted to prevent any shell globbing.  That's fine for something you
might just use yourself -- if you remember to quote it right.  It's 
easy enough to add check for the number of arguments and to ensure that
there is at least one file that exists in the $1 pattern.  However it 
becomes much harder to make this command reasonably safe and robust.  
Inevitably it becomes less "unix-like" and thus more difficult to use
with other Unix tools.<P>

I generally just take a whole different approach.  Rather than trying
to use <i>cp</i> to make a backup of each file under a slightly
changed name I might just make a directory (usually using the date
and my login ID as a template) and use a simple <i>cp</i> command
to copy all my target files into the new directory.<P>

Another interesting thing we can do with these "parameter expansion"
features is to iterate over a list of components in a single variable.<P>

For example, you might want to do something to traverse over every 
directory listed in your path -- perhaps to verify that everything
listed therein is really a directory and is accessible to you.<P>

Here's a command that will echo each directory named on your path
on it's own line:

	<blockquote>
	p=$PATH
	until [ $p = $d ]; do d=${p%%:*}; p=${p#*:}; echo $d; done
		</blockquote>

... obviously you can replace the <i>echo $d</i> part of this
command with anything you like. <P>

Another case might be where you'd want to traverse a list of directories
that were all part of a path.  Here's a command pair that echos each
directory from the root down to the "current working directory":

	<blockquote>
	p=$(pwd)
	until [ $p = $d ]; do p=${p#*/}; d=${p%%/*}; echo $d; done
		</blockquote>

... here we've reversed the assignments to <i>p</i> and <i>d</i>
so that we skip the root directory itself -- which must be "special cased"
since it appears to be a "null" entry if we do it the other way.  The 
same problem would have occurred in the previous example -- if the value
assigned to <i>$PATH</i> had started with a ":" character. <P>


Of course, its important to realize that this is not the only, or 
necessarily the best method to parse a line or value into separate 
fields.  Here's an example that uses the old <i>IFS</i> variable
(the "inter-field separator in the Bourne, and Korn  shells as well as 
<i>bash</i>) to parse each line of <i>/etc/passwd</i> and extract 
just two fields:

		
		<blockquote><pre>
		cat /etc/passwd | ( \
			IFS=: ; while read lognam pw id gp fname home sh; \
				do echo $home \"$fname\"; done \
				)
			</pre></blockquote>

Here we see the parentheses used to isolate the contents in a subshell
-- such that the assignment to IFS doesn't affect our current shell.
Setting the IFS to a "colon" tells the shell to treat that character as 
the separater between "words" -- instead of the usual "whitespace" that's
assigned to it.  For this particular function it's very important that
IFS consist solely of that character -- usually it is set to "space,"
"tab," and "newline.<P>

After that we see a typical <i>while read</i> loop -- where we
read values from each line of input (from <i>/etc/passwd</i> into
seven variables per line.  This allows us to use any of these fields
that we need from within the loop.  Here we are just using the <i>
echo</i> command -- as we have in the other examples.<P>

My point here has been to show how we can do quite a bit of 
string parsing and manipulation directly within <i>bash</i>
-- which will allow our shell scripts to run faster with less overhead
and may be easier than some of the more complex sorts of pipes and 
command substitutions one might have to employ to pass data to the
various external commands and return the results. <P>

Many people might ask: <i>Why not simply do it all in <b>perl</b>?</i>
I won't dignify that with a response.  Part of the beauty of Unix is
that each user has many options about how they choose to program something.
Well written scripts and programs interoperate regardless of what particular
scripting or programming facility was used to create them.  Issue the
command <i>file /usr/bin/*</i> on your system and and you may be 
surprised at how many Bourne and C shell scripts there are in there<P>

In conclusion I'll just provide a sampler of some other 
<i>bash</i> parameter expansions:

	<DL><DT>
	<i>${parameter:-word}</i><DD>
	Provide a default if <i>parameter</i> is unset or null.<br>
	Example:<ul><lh>
	      <i> echo ${1:-"default"}</i></lh></ul><DT>

Note:  this would have to be used from within a
functions or shell script -- the point is to show 
that some of the parameter substitutions can be use
with shell numbered arguments.   In this case the
string "default" would be returned if the function
or script was called with no $1 (or if all of the 
arguments had been <i>shift</i>ed out of existence.
       
	<i>${parameter:=word}</i><DD>
	Assign a value to <i>parameter</i> if it was previously
	unset or null.<dt>
	Example:<ul><lh>
	       <i>echo ${HOME:="/home/.nohome"}</i></lh></ul><DD>

	${parameter:?word}<DD>
	Generate an error if <i>parameter</i> is unset or null by
	printing <i>word</i> to <i>stdout</i>.<dt>

	Example:<ul><lh>
	      <i>: ${HOME:="/home/.nohome"} </i></lh></ul><DD>
       ${TMP:?"Error: Must have a valid Temp Variable Set"}
       </DL>
       
This one just uses the shell "null command" (the : command) to
evaluate the expression.  If the variable doesn't exist or has a 
null value -- this will print the string to the standard error
file handle and exit the script with a return code of one.<P>

Oddly enough -- while it is easy to redirect the standard error
of processes under <i>bash</i> -- there doesn't seem to be an
easy portable way to explicitly generate message or redirect output
<b>to</b> stderr.  The best method I've come up with is to use 
the /proc/ filesystem (process table)  like so:

	<ul><ul>
	function error { echo "$*" > /proc/self/fd/2 }
	</ul></ul>

... <i>self</i> is always a set of entries that refers to the current
process -- and <i>self/fd/</i> is a directory full of the currently
open file descriptors.  Under Unix and DOS every process is given
the following pre-opened file descriptors:  stdin, stdout, and stderr.
<p><dl><dt>

       ${parameter:+word}<dd>
       Alternative value.
       ${TMP:+"/mnt/tmp"} <br>
       
       use /mnt/tmp instead of $TMP but do nothing if TMP was
       unset.  This is a weird one that I can't ever see myself
       using.  But it is a logical complement to the ${var:-value} 
       we saw above.<dt>


	${#variable}<dd>
	Return the length of the variable in characters.<br>

	Example:<ul><lh>
       echo The length of your PATH is ${#PATH} </lh></ul><DD>
	</DL>


<!--===================================================================-->
<P> <hr> <P> 
<center><H5>Copyright &copy; 1997, Jim Dennis<BR> 
Published in Issue 18 of the Linux Gazette, June 1997</H5></center>

<!--===================================================================-->
<P> <hr> <P> 
<A HREF="./lg_toc18.html"><IMG ALIGN=BOTTOM SRC="../gx/indexnew.gif" 
ALT="[ TABLE OF CONTENTS ]"></A>
<A HREF="../lg_frontpage.html"><IMG ALIGN=BOTTOM SRC="../gx/homenew.gif"
ALT="[ FRONT PAGE ]"></A>
<A HREF="./lg_answer18.html"><IMG SRC="../gx/back2.gif"
ALT=" Back "></A>
<A HREF="./gnu.html"><IMG SRC="../gx/fwd.gif" ALT=" Next "></A>
<P> <hr> <P> 
<!--startcut ==========================================================-->
</BODY>
</HTML>
<!--endcut ============================================================-->