A word about the intended audience is in order before we get too far along (no need to waste your time if the NCL will not be helpful to you). The intended audience for both this documentation and the accompanying class library comprises computer programmers who wish to endow their C++ programs with the ability to read NEXUS data files. If you are not a programmer and simply use NEXUS files as a means of inputting data to the programs you use for analyzing your data, the NCL is not something that will be useful to you. The NCL is also not for you if you are a programmer but do not use the C++ language, since the NCL depends heavily on the object-oriented programming features built into C++.
The NEXUS data file format was specified in the publication cited below. Please read this paper for further information about the format specification itself; the documentation for the NCL does not attempt to explain this.
Maddison, D. R., D. L. Swofford, and Wayne P. Maddison. 1997. NEXUS: an extensible file format for systematic information. Systematic Biology 46(4): 590-621.
The basic goal of the NCL is to provide a relatively easy way to endow a C++ program with the ability to read NEXUS data files. The steps necessary to use the NCL to create a bare-bones program that can read a NEXUS data file are simple and few (see below), and it is hoped that the availability of this class library will encourage the use of the NEXUS format. This will in turn encourage consistency in how programs read NEXUS files and how programs respond to errors in data files.
A further benefit can be seen by looking at the large number of special
data file formats that are out there. This places an extra burden on the
end user, who must deal with an increasing number of file formats all differing
in a number of ways. To port one's data to another file format often involves
manual manipulation of the data, an activity that is inherently dangerous and
probably has resulted in the corruption of many data files. At the very least,
the large number of formats in existence has led to a proliferation of data
file variants. With many copies of a given data file on a hard disk, each
formatted differently for various analysis programs, it becomes very easy to
change one (say, correct a datum found to be in error) and fail to correct the
other versions. The NEXUS file format provides a means for keeping one master
copy of the data and using it with several programs without modification. The
NCL provides a means for encouraging programmers to use the NEXUS file format
in future programs they write.
Obtaining the NCL
The current version of the NCL is available by anonymous ftp from
alleyn.eeb.uconn.edu (137.99.27.148) in the directory pub/ncl2.
The NCL is available in two archive formats: compressed tar archive or
zip format. Download the compressed tar archive if you plan to develop
only for the Unix platform; the zip file contains everything in the tar
file as well as project files for both the Borland and Metrowerks IDEs.
The links below are provided to make the anonymous ftp access easier:
Compressed tar file
Zip file
README file
Characteristics of the NCL
Portability
The NCL has been designed to be as portable as possible for a C++ class library.
The NCL does make use of the ANSI Standard C++ Library, which may cause problems
if the compiler you are using is a couple of years old or older. The standard
library has been adopted by all modern compilers, including the most recent versions
of Metrowerks CodeWarrior Pro and Borland C++, as well as the EGCS compiler (a
free compiler available for most Unix platforms). Thus, if your compiler chokes
on the template-based container classes used in the NCL (such as vector, list, map, string, etc.)
then you will need to upgrade your compiler in order to use the NCL. Assuming
you have a modern compiler, however, the NCL is fully portable across Mac, Windows,
and most Unix platforms.
Cross-platform features
I have attempted to create the NCL in such a way that one is not limited in the
type of platform targeted. For example, NEXUS files can contain "output comments"
that are supposed to be displayed in the output of the program reading the NEXUS
file. Such comments are handled automatically by the NCL, and are sent
to a virtual function that can be overridden by you in a derived class.
This provides a means for you to tailor the output of such comments to the platform
of your choice. For example, if you are writing a standard Unix console application
(i.e., not a graphical X-Windows application), you might want such output comments
to simply be sent to standard output or to an ofstream object. For a graphical
Windows, Macintosh or X-Windows application, you might deem it more user-friendly to pop up
a message box with the output comment as the message. This would ensure that the
user noticed the output comment. You also have the option of having your program
completely ignore such comments in the data file.
The NCL provides similar hooks for noting the progress in reading the data file.
For example, a virtual function called EnteringBlock is called and provided with
the name of the block about to be read. You can override EnteringBlock in your
derived class to allow, for example, a message to be displayed in a status bar
at the bottom of your program's main window (in a graphical application) indicating
which block is currently being read. Other such virtual functions include
SkippingBlock (to allow users to be warned that your program is ignoring a block
in the data file), SkippingCommand (to allow users to be warned about particular
commands being skipped within a block), and NexusError, which is the function
called whenever anything unexpected happens when reading the file.
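The hook mechanism described above can be sketched without the NCL itself. In the minimal example below, ReaderHooks, ConsoleHooks, StatusBarHooks, and SimulateRead are stand-ins invented for illustration (they are not NCL classes, and the signatures are assumptions); only the hook names EnteringBlock and SkippingBlock come from the text. The point is that the reading code calls the same virtual functions whether the derived class writes to a stream or updates a status line.

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Stand-in for a reader base class with notification hooks.
// This is NOT the NCL's Nexus class; it only mirrors the idea.
class ReaderHooks {
public:
    virtual ~ReaderHooks() {}
    virtual void EnteringBlock(const std::string& blockName) = 0;
    virtual void SkippingBlock(const std::string& blockName) = 0;
};

// Console flavour: each notification becomes one line on an output stream.
class ConsoleHooks : public ReaderHooks {
    std::ostream& out;
public:
    explicit ConsoleHooks(std::ostream& os) : out(os) {}
    void EnteringBlock(const std::string& blockName) override {
        out << "Reading \"" << blockName << "\" block...\n";
    }
    void SkippingBlock(const std::string& blockName) override {
        out << "Skipping unknown block (" << blockName << ")...\n";
    }
};

// GUI flavour (simulated): keep only the latest message, as a status bar would.
class StatusBarHooks : public ReaderHooks {
public:
    std::string statusText;
    void EnteringBlock(const std::string& blockName) override {
        statusText = "Reading " + blockName;
    }
    void SkippingBlock(const std::string& blockName) override {
        statusText = "Skipping " + blockName;
    }
};

// Hypothetical driver: pretends to meet one known and one unknown block.
void SimulateRead(ReaderHooks& hooks) {
    hooks.EnteringBlock("TAXA");
    hooks.SkippingBlock("GDADATA");
}
```

Passing a ConsoleHooks built on std::cout to SimulateRead prints two lines, while a StatusBarHooks object ends up holding only the most recent message. The driver is the same in both cases, which is the portability benefit the virtual functions are meant to provide.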
Extensibility
The basic tools provided in the NCL allow you to create your own NEXUS blocks
and use them in your program. This makes it easy to define a private block
to contain commands that only your program recognizes, allowing your users to
run your program in batch mode (see the section below entitled
General Advice for more information on this topic).
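To give a feel for what a private block involves, here is a minimal, self-contained sketch. It does not derive from the NCL's NexusBlock class: BlockBase, FooBlock, the token-vector interface, and the "set name = value;" command syntax are all assumptions made up for this example. It only shows the general shape, a class that knows its own block name and reads commands until it reaches the end of the block.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Stand-in for a block base class; NOT the NCL's NexusBlock.
class BlockBase {
public:
    virtual ~BlockBase() {}
    virtual std::string Id() const = 0;
    // Consumes tokens from pos (just past "begin <id> ;") through "end ;".
    virtual void Read(const std::vector<std::string>& tokens, std::size_t& pos) = 0;
};

// Hypothetical private block understanding:  set <name> = <value> ;
class FooBlock : public BlockBase {
public:
    std::map<std::string, std::string> settings;

    std::string Id() const override { return "FOO"; }

    void Read(const std::vector<std::string>& tokens, std::size_t& pos) override {
        while (pos < tokens.size() && tokens[pos] != "end") {
            bool isSet = tokens[pos] == "set" && pos + 4 < tokens.size() &&
                         tokens[pos + 2] == "=" && tokens[pos + 4] == ";";
            if (isSet) {
                settings[tokens[pos + 1]] = tokens[pos + 3];
                pos += 5;
            } else {
                // Unrecognized command: skip everything up to and including ';'.
                while (pos < tokens.size() && tokens[pos] != ";") ++pos;
                if (pos < tokens.size()) ++pos;
            }
        }
        if (pos < tokens.size()) ++pos;                        // consume "end"
        if (pos < tokens.size() && tokens[pos] == ";") ++pos;  // consume ';'
    }
};
```

Given the tokens of `set reps = 100; frobnicate now; set seed = 42; end;`, the sketch stores the two settings and silently skips the unrecognized command. A real block would more likely report the skipped command to the user.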
Current limitations
The main current limitation is that the NCL is incomplete. Some standard NEXUS
blocks have been provided with this distribution, but because the NEXUS format
is so extensive, I have not had the time to write code implementing even all of
the standard blocks described in the paper cited above (or even all of the standard
blocks I have included!). Here is a summary table showing what has been implemented
thus far:
Block | Current Limitations |
---|---|
ASSUMPTIONS | Only TAXSETS, CHARSETS, and EXSETS have been implemented thus far. |
ALLELES | Cannot yet handle transposed MATRIX, and only DATAPOINT=STANDARD is implemented (however, as far as I know, no other program in existence handles anything but standard datapoints so this is not a great limitation at the moment). |
CHARACTERS | Only ITEMS=STATES and STATESFORMAT=STATESPRESENT have been implemented thus far, and DATATYPE=CONTINUOUS has not been implemented. |
DISTANCES | No limitations, completely implemented |
DATA | Since the DATA block is essentially the same as a CHARACTERS block, the same limitations apply. |
TAXA | No limitations, completely implemented |
TREES | No limitations, completely implemented |
The NCL is written in C++ and designed for use in a C++ program. It is not available in a C version (or any other language for that matter), which will represent a limitation if you are not a C++ programmer. I do intend to produce a Java version eventually (once the bugs are worked out of the C++ version). I have written the NCL in a style that mimics Java (i.e., no operators are defined, objects are passed by reference whenever possible, etc.) so that this conversion, when it happens, will be as easy as possible.
The NCL has been designed to be portable, easy-to-use, and informative in the
error messages produced. It will be apparent to anyone who looks very closely
at the code that efficiency has been sacrificed to meet these goals.
You can expect the minimum size of an executable handling only the reading of
TAXA and TREES blocks to be at least 92 KB (94208 bytes). Adding the ability
to read the CHARACTERS block pushes this up to 135 KB (138752 bytes). These
figures are based on compiling a Win32 console program with the Borland C++ 5.02
compiler with no optimizations. Speed has also been sacrificed; however, I
think it can be argued that speed in reading a data file is not all that
important compared to the NCL's other benefits.
Building a NEXUS File Reader
This section illustrates how you could build a simple NEXUS file reader application
capable of reading in a TAXA and a TREES block. Note that the application NCLTEST,
for which you have the source code, is essentially an expanded version of this
sample program that can read all of the NEXUS blocks incorporated to date into the
NCL. To keep things simple, we will just write output to an ofstream object
(nothing graphical here).
As you work through this example, feel free to look into the NCL classes in more
detail. Each class has its own documentation in the form of a web page having a
name of the form NCLClassName.html, where NCLClassName is replaced by
the name of the class. For example, the NexusBlock class is described in the file
NexusBlock.html. An index to all classes in the NCL is
provided at the end of this document for quick access to the full range of
class-specific web pages.
The Main Function

    int main()
    {
        taxa = new TaxaBlock();
        trees = new TreesBlock(*taxa);

        MyNexus nexus( "testfile.nex", "output.txt" );
        nexus.Add( taxa );
        nexus.Add( trees );

        MyToken token( nexus.inf, nexus.outf );
        nexus.Execute( token );

        taxa->Report( nexus.outf );
        trees->Report( nexus.outf );

        return 0;
    }

Creating block objects. The first two lines of main involve the creation of objects corresponding to the two types of NEXUS blocks we want our program to recognize. TaxaBlock is declared in the header file "taxablock.h" and defined in the source code file "taxablock.cpp", whereas the TreesBlock class is declared in "treesblock.h" and defined in "treesblock.cpp". Note that the TreesBlock constructor requires a reference to an object of type TaxaBlock. This is because a TREES block in a NEXUS data file requires the number of taxa and the taxon labels to have been defined earlier in the data file. In the NCL, any block that defines taxon labels stores this information in the TaxaBlock object, and any block that needs such information requires a reference to the TaxaBlock object in its constructor.
Adding the block objects to the NEXUS object. The next three lines involve creating a NEXUS object and adding our two block objects to a linked list maintained by the NEXUS object. The MyNexus class is derived from the Nexus class, which is declared in "nexus.h" and defined in "nexus.cpp". Objects cannot be created from the Nexus class alone, as it contains pure virtual functions that must be overridden in a derived class. Examples of pure virtual functions in Nexus that must be overridden are: EnteringBlock, SkippingBlock, and NexusError. The reason the Nexus object must maintain a list of block objects is so that it can figure out who is responsible for reading each block found in the data file. The block objects taxa and trees have each inherited an id variable of type char* that stores their block name (i.e., "TAXA" for the TaxaBlock and "TREES" for the TreesBlock). When the Nexus object's Execute method encounters a block name, it searches its linked list of block objects until it finds one whose id variable is identical to the name of the block encountered. It then calls the Read function of that block object to do the work of reading the block from the data file and storing its contents. It is possible of course that a block name will appear in a data file for which there is no corresponding block object. In this case, the Nexus Execute method calls the SkippingBlock method to report the fact that it is skipping over the contents of the unknown block.
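The lookup described in this paragraph can be illustrated with a short, self-contained sketch. Block and Dispatcher below are stand-ins invented for this example, not the NCL's NexusBlock and Nexus classes. The sketch also assumes an exact string comparison and uses a std::vector where the text describes a linked list, purely to keep the code short.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Stand-in block: it knows its name and counts how often it was asked to read.
struct Block {
    std::string id;
    int timesRead;
    explicit Block(const std::string& name) : id(name), timesRead(0) {}
    void Read() { ++timesRead; }  // a real block would parse its contents here
};

// Stand-in for the object that owns the list of registered blocks.
class Dispatcher {
    std::vector<Block*> blocks;
public:
    std::vector<std::string> skipped;  // names reported as skipped

    void Add(Block* block) { blocks.push_back(block); }

    // Finds the block whose id matches blockName and lets it read.
    // Returns false (and records the name) if no block claims it.
    bool Handle(const std::string& blockName) {
        for (std::size_t i = 0; i < blocks.size(); ++i) {
            if (blocks[i]->id == blockName) {
                blocks[i]->Read();
                return true;
            }
        }
        skipped.push_back(blockName);
        return false;
    }
};
```

With "TAXA" and "TREES" blocks registered, handling the names TAXA, GDADATA, and TREES in that order reads the two known blocks once each and records GDADATA as skipped. This mirrors what happens to the unknown block in the sample data file.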
Reading the data file. The next two lines create a token object (MyToken is derived from the NexusToken class), and initiate the reading of the NEXUS data file using the Nexus Execute function. The input and output files are created within the MyNexus class. While this is not required, it facilitates handling messages generated while the data file is being read. The NexusToken class has one virtual member function - OutputComment - which enables you to control how output comments are displayed. The NexusToken version of OutputComment does nothing, so you must derive your own token class from NexusToken and override the OutputComment method in order for the output comments in the data file to be displayed. The main function of the NexusToken class is to provide a means for grabbing NEXUS tokens one by one from the data file. Calling the GetNextToken function reads and stores the next token found in the data file, correctly handling any comments found along the way. This greatly simplifies reading a NEXUS data file.
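To make the token-by-token idea concrete, here is a simplified tokenizer sketch. It is not the NexusToken implementation: ReadComment and Tokenize are hypothetical helpers, and the rules they apply are simplifying assumptions (bracketed comments may nest and are invisible to tokens, comments starting with `!` are collected as output comments, `; , = ( )` are single-character tokens, and single-quoted text forms one token with no support for doubled quotes).

```cpp
#include <cctype>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Reads one comment; the opening '[' has already been consumed.
// Nesting is tracked so that "[a [b] c]" counts as a single comment.
static std::string ReadComment(std::istream& in) {
    std::string text;
    int depth = 1;
    char ch;
    while (depth > 0 && in.get(ch)) {
        if (ch == '[') ++depth;
        else if (ch == ']') --depth;
        if (depth > 0) text += ch;
    }
    return text;
}

// Splits the stream into tokens; output comments are appended to outputComments.
std::vector<std::string> Tokenize(std::istream& in,
                                  std::vector<std::string>& outputComments) {
    std::vector<std::string> tokens;
    std::string current;
    char ch;
    while (in.get(ch)) {
        if (ch == '[') {
            std::string comment = ReadComment(in);
            if (!comment.empty() && comment[0] == '!')
                outputComments.push_back(comment.substr(1));
        } else if (ch == '\'') {
            // Quoted token: copy everything up to the closing quote.
            while (in.get(ch) && ch != '\'') current += ch;
            tokens.push_back(current);
            current.clear();
        } else if (std::isspace(static_cast<unsigned char>(ch))) {
            if (!current.empty()) { tokens.push_back(current); current.clear(); }
        } else if (std::string(";,=()").find(ch) != std::string::npos) {
            if (!current.empty()) { tokens.push_back(current); current.clear(); }
            tokens.push_back(std::string(1, ch));
        } else {
            current += ch;
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}
```

Feeding the function a std::istringstream holding `begin [skip me]taxa; [!hello [nested] world] dimensions ntax=2; taxlabels A 'B c';` yields twelve tokens and one output comment. The caller never sees the ordinary comment, which is what makes code that consumes tokens so much simpler than code that reads raw characters.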
Reporting on block objects' contents.
The last two lines call the Report functions of each of the blocks. This just
spits out a summary of any data contained in these objects that has been
read from the data file.
Deriving From the Nexus Class
Note that the ifstream is opened in binary mode. You should always open your
input file in binary mode so that the file can be read properly regardless of
the platform on which it was created. For example, suppose someone created a NEXUS
data file on a Macintosh and wanted to read it with your program, which is
running on a Windows 95 machine. Opening the file in binary mode allows the
NexusToken object you are using to recognize the newline character in the Mac
file as such, even though Macintosh computers use a different symbol (ASCII 13)
to represent the newline character than computers running Windows (which use
the ASCII 13, ASCII 10 combination for newlines).
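The following sketch shows why reading raw bytes makes platform-independent newline handling possible. ReadLines is a hypothetical helper written for this illustration and is not how NexusToken is implemented. It treats a lone ASCII 13 (classic Mac), a lone ASCII 10 (Unix), and the ASCII 13, ASCII 10 pair (Windows) as one line break each.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Splits a stream into lines, accepting CR, LF, or CR LF as the line ending.
// The stream should be opened in binary mode so no bytes are translated
// before this function sees them.
std::vector<std::string> ReadLines(std::istream& in) {
    std::vector<std::string> lines;
    std::string current;
    char ch;
    while (in.get(ch)) {
        if (ch == '\r') {
            // CR alone, or the first half of a CR LF pair.
            if (in.peek() == '\n') in.get(ch);
            lines.push_back(current);
            current.clear();
        } else if (ch == '\n') {
            lines.push_back(current);
            current.clear();
        } else {
            current += ch;
        }
    }
    if (!current.empty()) lines.push_back(current);  // last line without a newline
    return lines;
}
```

In a real program the stream would be a std::ifstream opened with std::ios::binary. A std::istringstream containing text with mixed line endings is enough to check that all three conventions produce the same lines.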
Also, note the special version of one line in the NexusError method that is necessary when using the Metrowerks CodeWarrior Professional (Release 4) compiler. Metrowerks uses a class called streampos to keep track of the file position rather than typedefing streampos to a simple long value. You must call the streampos object's offset() member function to obtain a value that you can use (no automatic conversions from streampos to long). If you are using CodeWarrior Pro Release 5, the special handling is not necessary (there was apparently a bug in Release 4 that has since been fixed).

    class MyNexus : public Nexus
    {
    public:
        ifstream inf;
        ofstream outf;

        MyNexus( char* infname, char* outfname ) : Nexus() {
            inf.open( infname, ios::binary );
            outf.open( outfname );
        }

        ~MyNexus() {
            inf.close();
            outf.close();
        }

        void EnteringBlock( char* blockName ) {
            cout << "Reading \"" << blockName << "\" block..." << endl;
            outf << "Reading \"" << blockName << "\" block..." << endl;
        }

        void SkippingBlock( char* blockName ) {
            cout << "Skipping unknown block (" << blockName << ")..." << endl;
            outf << "Skipping unknown block (" << blockName << ")..." << endl;
        }

        void OutputComment( char* msg ) {
            outf << msg;
        }

        void NexusError( char* msg, streampos pos, long line, long col ) {
            cerr << endl;
            cerr << "Error found at line " << line;
            cerr << ", column " << col;
            // if using MetroWerks CodeWarrior Pro 5, use other (normal) version
    #if defined( __MWERKS__ )
            cerr << " (file position " << pos.offset() << "):" << endl;
    #else
            cerr << " (file position " << pos << "):" << endl;
    #endif
            cerr << msg << endl;

            outf << endl;
            outf << "Error found at line " << line;
            outf << ", column " << col;
            // if using MetroWerks CodeWarrior Pro 5, use other (normal) version
    #if defined( __MWERKS__ )
            outf << " (file position " << pos.offset() << "):" << endl;
    #else
            outf << " (file position " << pos << "):" << endl;
    #endif
            outf << msg << endl;

            exit(0);
        }
    };


    class MyToken : public NexusToken
    {
        ostream& out;
    public:
        MyToken( istream& is, ostream& os ) : out(os), NexusToken(is) {}

        void OutputComment( char* msg ) {
            cout << msg << endl;
            out << msg << endl;
        }
    };


    #include <stdlib.h>
    #include <fstream.h>
    #include "nexustoken.h"
    #include "labellist.h"
    #include "nexus.h"
    #include "taxablock.h"
    #include "treesblock.h"

    TaxaBlock* taxa;
    TreesBlock* trees;

    class MyToken : public NexusToken
    {
        ostream& out;
    public:
        MyToken( istream& is, ostream& os ) : out(os), NexusToken(is) {}

        void OutputComment( char* msg ) {
            cout << msg << endl;
            out << msg << endl;
        }
    };

    class MyNexus : public Nexus
    {
    public:
        ifstream inf;
        ofstream outf;

        MyNexus( char* infname, char* outfname ) : Nexus() {
            inf.open( infname, ios::binary );
            outf.open( outfname );
        }

        ~MyNexus() {
            inf.close();
            outf.close();
        }

        void EnteringBlock( char* blockName ) {
            cout << "Reading \"" << blockName << "\" block..." << endl;
            outf << "Reading \"" << blockName << "\" block..." << endl;
        }

        void SkippingBlock( char* blockName ) {
            cout << "Skipping unknown block (" << blockName << ")..." << endl;
            outf << "Skipping unknown block (" << blockName << ")..." << endl;
        }

        void OutputComment( char* msg ) {
            outf << msg;
        }

        void NexusError( char* msg, streampos pos, long line, long col ) {
            cerr << endl;
            cerr << "Error found at line " << line;
            cerr << ", column " << col;
    #if defined( __MWERKS__ )
            cerr << " (file position " << pos.offset() << "):" << endl;
    #else
            cerr << " (file position " << pos << "):" << endl;
    #endif
            cerr << msg << endl;

            outf << endl;
            outf << "Error found at line " << line;
            outf << ", column " << col;
    #if defined( __MWERKS__ )
            outf << " (file position " << pos.offset() << "):" << endl;
    #else
            outf << " (file position " << pos << "):" << endl;
    #endif
            outf << msg << endl;

            exit(0);
        }
    };

    int main()
    {
        taxa = new TaxaBlock();
        trees = new TreesBlock(*taxa);

        MyNexus nexus( "testfile.nex", "output.txt" );
        nexus.Add( taxa );
        nexus.Add( trees );

        MyToken token( nexus.inf, nexus.outf );
        nexus.Execute( token );

        taxa->Report( nexus.outf );
        trees->Report( nexus.outf );

        return 0;
    }

Feel free to introduce various sorts of errors (e.g., delete semicolons, misspell keywords, etc.) into the sample data file to get a feel for what types of error messages the NEXUS file reader generates.

    #nexus

    [!Output comment before first block]

    begin gdadata; [this is an unknown block]
      dimensions npops=2 nloci=3;
    end;

    [!Let's see if we can deal with [nested] comments]

    [! What happens if we do this! ]

    begin [comment at beginning of token]taxa;
      dimensions[comment at end of token] ntax=11;
      taxlabels [comment between tokens]
        P._fimbriata
        'P. robusta'
        'P. americana'
        'P. myriophylla'
        'P. articulata'
        'P. parksii'
        'P. gracilis'
        'P. macrophylla'
        'P. polygama'
        'P. basiramia'
        'P. ciliata' [!output comment in TAXLABELS command]
      ;
    end;

    begin trees;
      translate
        1  P._fimbriata,
        2  P._robusta,
        3  P._americana,
        4  P._myriophylla,
        5  P._articulata,
        6  P._parksii,
        7  P._polygama,
        8  P._macrophylla,
        9  P._gracilis,
        10 P._basiramia,
        11 P._ciliata
      ;
      tree alpha = (1,2,((((3,4),5),6),((7,8),(9,(10,11)))));
      tree beta = (1,2,((((3,4),5),6),(7,(8,(9,(10,11))))));
    end;


    if( !token.Equals(";") ) {
        sprintf( errormsg, "Expecting ';' but found %s instead", token.GetToken() );
        throw XNexus(errormsg);
    }

Such checks will give your users some hope of finding where they have made a mistake in constructing their data file. We all know how frustrating it can be to have a program exit with an uninformative error message.
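The same kind of check can be wrapped in a small helper so that every command ends with one consistent test. The sketch below is self-contained and deliberately does not use the NCL: ParseError stands in for an exception type such as XNexus, and Expect is a hypothetical function name. It builds the message with std::string rather than sprintf, which avoids having to size a message buffer.

```cpp
#include <stdexcept>
#include <string>

// Stand-in exception type for this sketch; NOT the NCL's XNexus class.
class ParseError : public std::runtime_error {
public:
    explicit ParseError(const std::string& msg) : std::runtime_error(msg) {}
};

// Throws a descriptive ParseError unless the token found is the one expected.
void Expect(const std::string& expected, const std::string& found) {
    if (found != expected) {
        throw ParseError("Expecting '" + expected + "' but found " + found +
                         " instead");
    }
}
```

Calling `Expect(";", "end")` throws an exception whose message names both the expected and the offending token, while `Expect(";", ";")` returns quietly. A handler near the top of the program can then report the message together with the line and column where reading stopped.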

    makedoc myblock.cpp

at the system prompt to create a help file for the class defined in myblock. Note that there can be only one class defined per source code file to use this system, and all of the special comments must be in the same file for a particular class. Here are some examples of source code comments for use with the makedoc program. Feel free to look through the NCL source code files for other examples.

    /**
     * @class TaxaBlock
     * @file taxablock.h
     * @file taxablock.cpp
     * @author Paul O. Lewis
     * @copyright Copyright © 1999. All Rights Reserved.
     * @variable ntax [int:private] number of taxa (set from NTAX specification)
     * @variable taxonLabels [LabelList:private] storage for list of taxon labels
     * @see LabelList
     * @see Nexus
     * @see NexusBlock
     * @see NexusToken
     * @see XNexus
     *
     * This class handles reading and storage for the NEXUS block TAXA.
     * It overrides the member functions Read and Reset, which are abstract
     * virtual functions in the base class NexusBlock. The taxon names are
     * stored in an array of strings (taxonLabels) that is accessible through
     * the member functions GetTaxonLabel, AddTaxonLabel, ChangeTaxonLabel,
     * and GetNumTaxonLabels.
     */

    /**
     * @enumeration
     * @enumitem saveCommandComment [0x0001] if set, command comments expected and will be saved
     * @enumitem parentheticalToken [0x0002] if set, parenthetical token expected
     *
     * For use with the variable labileFlags.
     */

    /**
     * @constructor
     *
     * Default constructor. Initializes id to "TAXA" and ntax to 0.
     */

    /**
     * @destructor
     *
     * Deletes the memory used by id and flushes taxonLabels.
     */

    /**
     * @method Reset [void:protected]
     *
     * Flushes taxonLabels and sets ntax to 0 in preparation for reading a
     * new TAXA block.
     */

    /**
     * @method ChangeTaxonLabel [void:public]
     * @param i [int] the taxon label number to change
     * @param s [char*] the string used to replace label i
     *
     * Changes the label for taxon i to s. Deletes the old
     * label and reallocates enough memory to store the new
     * label.
     */

    /**
     * @method Read [void:protected]
     * @param token [NexusToken&] the token used to read from in
     * @param in [istream&] the input stream from which to read
     * @throws XNexus
     *
     * This function provides the ability to read everything following the block name
     * (which is read by the Nexus object) to the end or endblock statement.
     * Characters are read from the input stream in. Overrides the
     * abstract virtual function in the base class.
     */

    /**
     * @operator = [NxsGDADataBlock&:public]
     * @param gdaData [const NxsGDADataBlock&] the NxsGDADataBlock to be copied
     *
     * Copies the information from gdaData to this object.
     */

    /**
     * @castoperator () [double:public]
     *
     * Casts value of cell to a double.
     */

    /**
     * @manipulator setleft
     *
     * Specifies the left boundary of the Table body.
     * All columns added before this point are considered
     * row headers and will be repeated for each output
     * page.
     */

After developing several programs like this, I have come up with the following strategy that makes efficient use of the object-oriented nature of the NCL. I will assume your non-graphical program will be called simply "foo" and will read a private NEXUS block named "FOO". I will assume that the GUI version will be targeted for the Windows platform, and will be called "winfoo".
Although you are not obligated in any way to me as a result of using this package to improve your programs, there are a few things that you can do to help encourage me to continue improving this library. Please make use of any of the following means of support that you feel comfortable with:
My current mail and email addresses as well as my phone and fax numbers are given below:
Paul O. Lewis, Assistant Professor
Department of Ecology and Evolutionary Biology
The University of Connecticut
U-43, 75 North Eagleville Road
Storrs, CT 06269-3043
Ph: +1-860-486-2069
FAX: +1-860-486-6364 (the departmental fax machine)
Email: plewis@uconnvm.uconn.edu
URL: http://www.eeb.uconn.edu/faculty/plewis.htm