1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655
|
\documentclass[10pt]{article}
% \usepackage{utopia}
\pagestyle{plain}
\addtolength{\hoffset}{-2cm}
\addtolength{\textwidth}{4cm}
\addtolength{\voffset}{-1.5cm}
\addtolength{\textheight}{3cm}
\setlength{\parindent}{0pt}
\setlength{\parskip}{11pt}
\title{A Parallel API for Creating and Reading NetCDF Files}
\begin{document}
\maketitle
\begin{abstract}
Scientists recognize the importance of portable and efficient mechanisms for
storing datasets created and used by their applications. NetCDF is one such
mechanism and is popular in a number of applicaiton domains because of its
availability on a wide variety of platforms and its easy to use API. However,
this API was originally designed for use in serial codes, and so the semantics
of the interface are not designed to allow for high performance parallel
access.
In this work we present a new API for creating and reading NetCDF datasets,
the \emph{PnetCDF API}. This interface builds on the original NetCDF
interface and defines semantics for parallel access to NetCDF datasets. The
interface is built on top of MPI-IO, allowing for further performance gains
through the use of collective I/O optimizations that already exist in MPI-IO
implementations.
\end{abstract}
\section{Introduction}
NetCDF is a popular package for storing data files in scientific applications.
NetCDF consists of both an API and a portable file format. The API provides a
consistent interface for access NetCDF files across multiple platforms, while
the NetCDF file format guarantees data portability.
The NetCDF API provides a convenient mechanism for a single process to define
and access variables and attributes in a NetCDF file. However, it does not
define a parallel access mechanism. In particular there is no mechanism for
concurrently writing to a NetCDF data file. Because of this, parallel
applications operating on NetCDF files must serialize access. This is
typically accomplished by shipping all data to and from a single process that
performs NetCDF operations. This mode of access is both cumbersome to the
application programmer and considerably slower than parallel access to the
NetCDF file. This mode can be particularly inconvenient when data set sizes
exceed the size of available memory on a single node; in this case the data
must be split into pieces to be shipped and written.
In this document we propose an alternative API for accessing NetCDF format
files in a parallel application. This API allows all processes in a parallel
application to access the NetCDF file simultaneously, which is considerably
more convenient than the serial API and allows for significantly higher
performance.
\emph{Note subset of interface that we are implementing, including both what
types we support and any functions that we might be leaving out.}
\section{Preliminaries}
In MPI, communicators are typically used to describe collections of processes
to MPI calls (e.g. the collective MPI-1 calls). In our PnetCDF API we
will similarly use a communicator to denote a collection of MPI processes that
will access a NetCDF file. By describing this collection of processes, we
provide the underlying implementation (of our PnetCDF API) with
information that it can use to ensure that the file is kept in a consistent
state.
Further, by using the collective operations provided in our PnetCDF
API (ones in which all processes in this collection participate), application
programmers provide the underlying implementation with an opportunity to
further optimize access to the NetCDF file. These optimizations are performed
without further intervention by the application programmer and have been
proven to provide huge performance wins in multidimensional dataset access \cite{thakur:romio},
exactly the kinds of accesses used in NetCDF.
All this said, the original NetCDF interface is made available with a minimum
of changes so that users migrating from the original NetCDF interface will
have little trouble moving to this new, PnetCDF interface.
The decision to slightly modify the API was not made lightly. It is
relatively trivial to port NetCDF to use MPI-IO through the use of the MPI-IO
independent access calls. However, it was only though adding this concept of
a collection of processes simultaneously accessing the file and adding
collective access semantics that we could hope to eliminate the serialization
step necessary in the original API or gain the performance advantages
available from the use of collective optimizations. Thus our performance
requirements mandated these small adjustments.
% Finally, some of the calls in our API utilize MPI datatypes. These datatypes
% allow one to describe noncontiguous regions of data (in this case memory
% regions) as an endpoint of a data transfer.
\section{PnetCDF API}
The NetCDF interface breaks access into two \emph{modes}, ``define'' mode and
``data'' mode. The define mode is used to describe the data set to be stored,
while the data mode is used for storing and retrieving data values.
We maintain these modes and (for the most part) maintain the operations when
in define mode. We will discuss the API for opening and closing a dataset and
for moving between modes first, next cover inquiry functions, then cover the
define mode, attribute functions, and finally discuss the API for data mode.
%
% PREFIX
%
We will prefix our C interface calls with ``ncmpi'' and our Fortran interface
calls with ``nfmpi''. This ensures no naming clashes with existing NetCDF
libraries and does not conflict with the desire to reserve the ``MPI'' prefix
for functions that are part of the MPI standard.
All of our functions return integer NetCDF status values, just as the original
NetCDF interface does.
We will only discuss points where our interface deviates from the original
interface in the following sections. A complete function listing is included
in Appendix A.
\subsection{Variable and Parameter Types}
%
% MPI_Offset
%
Rather than using \texttt{size\_t} types for size parameters passed in to our
functions, we choose to use \texttt{MPI\_Offset} type instead. For many
systems \texttt{size\_t} is a 32-bit unsigned value, which limits the maximum
range of values to 4~GBytes. The \texttt{MPI\_Offset} is typically a 64-bit
value, so it does not have this limitation. This gives us room to extend the
file size capabilities of NetCDF at a later date.
\emph{Add mapping of MPI types to NetCDF types.}
\emph{Is NetCDF already exactly in external32 format?}
\subsection{Dataset Functions}
As mentioned before, we will define a collection of processes that are
operating on the file by passing in a MPI communicator. This communicator is
passed in the call to \texttt{ncmpi\_create} or \texttt{ncmpi\_open}. These
calls are collective across all processes in the communicator. The second
additional parameter is an \texttt{MPI\_Info}. This is used to pass hints in
to the implementation (e.g. expected access pattern, aggregation
information). The value \texttt{MPI\_INFO\_NULL} may be passed in if the user
does not want to take advantage of this feature.
\begin{verbatim}
int ncmpi_create(MPI_Comm comm,
const char *path,
int cmode,
MPI_Info info,
int *ncidp)
int ncmpi_open(MPI_Comm comm,
const char *path,
int omode,
MPI_Info info,
int *ncidp)
\end{verbatim}
\subsection{Define Mode Functions}
\emph{All define mode functions are collective} (see Appendix B for
rationale).
All processes in the communicator must call them with the same values. At the
end of the define mode the values passed in by all processes are checked to
verify that they match, and if they do not then an error is returned from the
\texttt{ncmpi\_enddef}.
\subsection{Inquiry Functions}
\emph{These calls are all collective operations} (see Appendix B for
rationale).
As in the original NetCDF interface, they may be called from either define or
data mode. \emph{ They return information stored prior to the last open,
enddef, or sync.}
% In the original NetCDF interface the inquiry functions could be called from
% either data or define mode. To aid in the implementation of these functions
% in a parallel library, \emph{inquiry functions may only be called in data mode
% in the PnetCDF API}. They return information stored prior to the last
% open, enddef, or sync. This ensures that the NetCDF metadata need only be
% kept up-to-date on all nodes when in data mode.
%
\subsection{Attribute Functions}
\emph{These calls are all collective operations} (see Appendix B for
rationale).
Attributes in NetCDF are intended for storing scalar or vector values that
describe a variable in some way. As such the expectation is that these
attributes will be small enough to fit into memory.
In the original interface, attribute operations can be performed in either
define or data mode; however, it is possible for attribute operations that
modify attributes (e.g. copy or create attributes) to fail if in data mode.
This is possible because such operations can cause the amount of space needed
to grow. In this case the cost of the operation can be on the order of a copy
of the entire dataset. We will maintain these semantics.
\subsection{Data Mode Functions}
The most important change from the original NetCDF interface with respect to
data mode functions is the split of data mode into two distinct modes:
\emph{collective data mode} and \emph{independent data mode}. \emph{By default when
a user calls \texttt{ncmpi\_enddef} or \texttt{ncmpi\_open}, the user will be
in collective data mode.} The expectation is that most users will be using the
collective operations; these users will never need to switch to independent
data mode. In collective data mode, all processes must call the same function
on the same ncid at the same point in the code. Different parameters for
values such as start, count, and stride, are acceptable. Knowledge that all
processes will be calling the function allows for additional optimization
under the API. In independent mode processes do not need to coordinate calls
to the API; however, this limits the optimizations that can be applied to I/O operations.
A pair of new dataset operations \texttt{ncmpi\_begin\_indep\_data} and
\texttt{ncmpi\_end\_indep\_data} switch into and out of independent data mode.
These calls are collective. Calling \texttt{ncmpi\_close} or
\texttt{ncmpi\_redef} also leaves independent data mode.
Note that it is illegal to enter independent data mode while in define mode.
Users are reminded to call {\tt ncmpi\_enddef} to leave define mode and enter data mode.
\begin{verbatim}
int ncmpi_begin_indep_data(int ncid)
int ncmpi_end_indep_data(int ncid)
\end{verbatim}
The separation of the data mode into two distinct data modes is necessary to
provide consistent views of file data when moving between MPI-IO collective
and independent operations.
We have chosen to implement two collections of data mode functions. The first
collection closely mimics the original NetCDF access functions and is intended
to serve as an easy path of migration from the original NetCDF interface to
the PnetCDF interface. We call this subset of our PnetCDF
interface the \emph{high level data mode} interface.
The second collection uses more MPI functionality in order to provide better
handling of internal data representations and to more fully expose the
capabilities of MPI-IO to the application programmer. All of the first
collection will be implemented in terms of these calls. We will denote this
the \emph{flexible data mode} interface.
In both collections, both independent and collective operations are provided.
Collective function names end with \texttt{\_all}. They are collective across
the communicator associated with the ncid, so all those processes must call
the function at the same time.
Remember that in all cases the input data type is converted into the
appropriate type for the variable stored in the NetCDF file.
\subsubsection{High Level Data Mode Interface}
The independent calls in this interface closely resemble the NetCDF data mode
interface. The only major change is the use of \texttt{MPI\_Offset} types in
place of \texttt{size\_t} types, as described previously.
The collective calls have the same arguments as their independent
counterparts, but they must be called by all processes in the communicator
associated with the ncid.
Here are the example prototypes for accessing a strided subarray of a variable
in a NetCDF file; the remainder of the functions are listed in Appendix A.
In our initial implementation the following data function types will be
implemented for independent access: single data value read and write (var1),
entire variable read and write (var), array of values read and write (vara),
and subsampled array of values read and write (vars). Collective versions of
these types will also be provided, with the exception of a collective entire
variable write; semantically this doesn't make sense.
We could use the same function names for both independent and collective
operations (relying instead on the mode associated with the ncid); however, we
feel that the interface is cleaner, and it will be easier to detect bugs, with
separate calls for independent and collective access.
% \textbf{Strided Subarray Access}
%
Independent calls for writing or reading a strided subarray of values to/from
a NetCDF variable (values are contiguous in memory):
\begin{verbatim}
int ncmpi_put_vars_uchar(int ncid,
int varid,
const MPI_Offset start[],
const MPI_Offset count[],
const MPI_Offset stride[],
const unsigned char *up)
int ncmpi_get_vars_uchar(int ncid,
int varid,
const MPI_Offset start[],
const MPI_Offset count[],
const MPI_Offset stride[],
unsigned char *up)
\end{verbatim}
Collective calls for writing or reading a strided subarray of values to/from a
NetCDF variable (values are contiguous in memory).
\begin{verbatim}
int ncmpi_put_vars_uchar_all(int ncid,
int varid,
const MPI_Offset start[],
const MPI_Offset count[],
const MPI_Offset stride[],
unsigned char *up)
int ncmpi_get_vars_uchar_all(int ncid,
int varid,
const MPI_Offset start[],
const MPI_Offset count[],
const MPI_Offset stride[],
unsigned char *up)
\end{verbatim}
\emph{Note what calls are and aren't implemented at this time.}
\subsubsection{Flexible Data Mode Interface}
This smaller set of functions is all that is needed to implement the data mode
functions. These are also made available to the application programmer.
The advantage of these functions is that they allow the programmer to use MPI
datatypes to describe the in-memory organization of the values. The only
mechanism provides in the original NetCDF interface for such a description is
the mapped array calls. Mapped arrays are a suboptimal method of describing
any regular pattern in memory.
In all these functions the varid, start, count, and stride values refer to the
data in the file (just as in a NetCDF vars-type call). The buf, count, and
datatype fields refer to data in memory.
Here are examples for subarray access:
\begin{verbatim}
int ncmpi_put_vars(int ncid,
int varid,
MPI_Offset start[],
MPI_Offset count[],
MPI_Offset stride[],
const void *buf,
int count,
MPI_Datatype datatype)
int ncmpi_get_vars(int ncid,
int varid,
MPI_Offset start[],
MPI_Offset count[],
MPI_Offset stride[],
void *buf,
int count,
MPI_Datatype datatype)
int ncmpi_put_vars_all(int ncid,
int varid,
MPI_Offset start[],
MPI_Offset count[],
MPI_Offset stride[],
void *buf,
int count,
MPI_Datatype datatype)
int ncmpi_get_vars_all(int ncid,
int varid,
MPI_Offset start[],
MPI_Offset count[],
MPI_Offset stride[],
void *buf,
int count,
MPI_Datatype datatype)
\end{verbatim}
\subsubsection{Mapping Between NetCDF and MPI Types}
It is assumed here that the datatypes passed to the flexible NetCDF interface
use only one basic datatype. For example, the datatype can be arbitrarily
complex, but it cannot consist of both \texttt{MPI\_FLOAT} and
\texttt{MPI\_INT} values, but only one of these basic types.
\emph{Describe status of type support.}
\subsection{Missing}
Calls that were in John M.'s list but that we haven't mentioned here yet.
Attribute functions, strerror, text functions, get\_vara\_text (?).
\section{Examples}
This section will hopefully soon hold some examples, perhaps based on writing
out the 1D and 2D Jacobi examples in the Using MPI book using our interface?
\section{Implementation Notes}
Here we will keep any particular implementation details. As the
implementation matures, this section should discuss implementation decisions.
One trend that will be seen throughout here is the use of collective I/O
routines when it would be possible to use independent operations. There are
two reasons for this. First, for some operations (such as reading the
header), there are important optimizations that can be made to more
efficiently read data from the I/O system, especially as the number of NetCDF
application processes increases. Second, the use of collective routines
allows for the use of aggregation routines in the MPI-IO implementation. This
allows us to redirect I/O to nodes that have access to I/O resources in
systems where not all processes have access. This isn't currently possible
using the independent I/O calls.
See the ROMIO User's Guide for more information on the aggregation hints, in
particular the \texttt{cb\_config\_list} hint.
\subsection{Questions for Users on Implementation}
\begin{itemize}
\item Is this emphasis on collective operations appropriate or problematic?
\item Is C or Fortran the primary language for NetCDF programs?
\end{itemize}
\subsection{I/O routines}
All I/O within our implementation will be performed through MPI-IO.
No temporary files will be created at any time.
%
% HEADER I/O
%
\subsection{Header I/O}
\emph{It is assumed that headers are too small to benefit from parallel I/O.}
All header updates will be performed with collective I/O, but only rank 0 will
provide any input data. This is important because only through the collective
calls can our \texttt{cb\_config\_list} hint be used to control what hosts
actually do writing. Otherwise we could pick some arbitrary process to do
I/O, but we have no way of knowing if that was a process that the user
intended to do I/O in the first place (thus that process might not even be
able to write to the file!)
Headers are written all at once at \texttt{ncmpi\_enddef}.
Likewise collective I/O will be used when reading the header, which should
simply be used to read the entire header to everyone on open.
\emph{First cut might not do this.}
\subsection{Code Reuse}
We will not reuse any NetCDF code. This will give us an opportunity to leave
out any code pertaining to optimizations on specific machines (e.g. Cray) that
we will not need and, more importantly, cannot test.
\subsection{Providing Independent and Collective I/O}
In order to make use of \texttt{MPI\_File\_set\_view} for both independent and
collective NetCDF operations, we will need to open the NetCDF file separately
for both, with the input communicator for collective I/O and with
MPI\_COMM\_SELF for independent I/O. However, we can defer opening the file
for independent access until an independent access call is made if we like.
This can be an important optimization as we scale.
Synchronization when switching between collective and independent access is
mandatory to ensure correctness under the MPI I/O model.
\subsection{Outline of I/O}
Outline of steps to I/O:
\begin{itemize}
\item MPI\_File is extracted from ncid
\item variable type is extracted from varid
\item file view is created from:
\begin{itemize}
\item metadata associated with ncid
\item variable index info, array size, limited/unlimited, type from varid
\item start/count/stride info
\end{itemize}
\item datatype/count must match number of elements specified by
start/count/stride
\item status returns normal mpi status information, which is mapped to a
NetCDF error code.
\end{itemize}
\section{Future Work}
More to add.
%
% BIBLIOGRAPHY
%
\bibliography{pnetcdf-api}
\bibliographystyle{plain}
%
% APPENDIX A: FUNCTION LISTING
%
\section*{Appendix A: C API Listing}
\input c_api
%
% APPENDIX B: RATIONALE
%
\section*{Appendix B: Rationale}
This section covers the rationale behind our decisions on API specifics.
%
% define mode function semantics
%
\subsection*{B.1 Define Mode Function Semantics}
There are two options to choose from here, either forcing a single process to
make these calls (funnelled) or forcing all these calls to be collective with
the same data. Note that making them collective does \emph{not} imply any
communication at the time the call is made.
Both of these options allow for better error detection than what we
previously described (functions independent, anyone could call, only node
0's data was used). Error detection could be performed at the end of the
define mode to minimize costs.
In fact, it is only fractionally more costly to implement collectives than
funneled (including the error detection). To do this one simply bcast's
process 0's values out (which one would have to do anyway) and then
allgathers a single char or int from everyone indicating if there was a
problem.
There has to be some synchronization at the end of the define mode in any
case, so this extra error detection comes at an especially low cost.
\subsection*{B.2 Attribute and Inquiry Function Semantics}
There are similar options here to the ones for define mode functions.
One option would be to make these functions independent. This would be easy
for read operations, but would be more difficult for write operations. In
particular we would need to gather up modifications at some point in order to
distribute them out to all processes. Ordering of modifications might also be
an issue. Finally, we would want to constrain use of these independent
operations to the define mode so that we would have an obvious point at which
to perform this collect and distribute operation (e.g. \texttt{ncmpi\_enddef}).
Another option would be to make all of these functions collective. This is an
unnecessary constraint for the read operations, but it helps in implementing
the write operations (we can distribute modifications right away) and allows
us to maintain the use of these functions outside define mode if we wish.
This is also more consistent with the SPMD model that (we think) our users are
using.
The final option would be to allow independent use of the read operations but
force collective use of the write operations. This would result in confusing
semantics.
For now we will implement the second option, all collective operation. Based
on feedback from users we will consider relaxing this constraint.
\subsubsection*{Questions for Users Regarding Attribute and Inquiry Functions}
\begin{itemize}
\item Are attribute calls used in data mode?
\item Is it inconvenient that the inquiry functions are collective?
\end{itemize}
\subsection*{B.3 Splitting Data Mode into Collective and Independent Data Modes}
In both independent and collective MPI-IO operations, it is important to be
able to set the file view to allow for noncontiguous file regions. However,
since the \texttt{MPI\_File\_set\_view} is a collective operation, it is
impossible to use a single file handle to perform collective I/O and still be
able to arbitrarily reset the file view before an independent operation
(because all the processes would need to participate in the file set view).
For this reason it is necessary to have two file handles in the case where
both independent and collective I/O will be performed. One file handle is
opened with \texttt{MPI\_COMM\_SELF} and is used for independent operations,
while the other is opened with the communicator containing all the processes
for the collective operations.
It is difficult if not impossible in the general case to ensure consistency of
access when a collection of processes are using multiple MPI\_File handles to
access the same file with mixed independent and collective operations.
However, if explicit and collective synchronization points are introduced
between phases where collective and independent I/O operations will be
performed, then the correct set of operations to ensure a consistent view can
be inserted at this point.
\emph{Does this explanation make any sense? I think we need to document this.}
\subsection*{B.4 Even More MPI-Like Data Mode Functions}
Recall that our flexible NetCDF interface has functions such as:
\begin{verbatim}
int ncmpi_put_vars(int ncid,
int varid,
MPI_Offset start[],
MPI_Offset count[],
MPI_Offset stride[],
void *buf,
int count,
MPI_Datatype datatype)
\end{verbatim}
It is possible to move to an even more MPI-like interface by using an MPI
datatype to describe the file region in addition to using datatypes for the
memory region:
\begin{verbatim}
int ncmpi_put(int ncid,
int varid,
MPI_Datatype file_dtype,
void *buf,
int count,
MPI_Datatype mem_dtype)
\end{verbatim}
At first glance this looks rather elegant. However, this isn't as clean as it
seems. The \texttt{file\_dtype} in this case has to describe the variable
layout \emph{in terms of the variable array}, not in terms of the file,
because the user doesn't know about the internal file layout. So the
underlying implementation would need to tear apart \texttt{file\_dtype} and
rebuild a new datatype, on the fly, that corresponded to the data layout in
the file as a whole. This is complicated by the fact that
\texttt{file\_dtype} could be arbitrarily complex.
The flexible NetCDF interface parameters, \texttt{start}, \texttt{count}, and
\texttt{stride} must also be used to build a file type, but the process is
considerably simpler.
\subsection*{B.5 MPI and NetCDF Types}
\subsubsection*{Questions for Users Regarding MPI and NetCDF Types}
\begin{itemize}
\item How do users use text strings?
\item What types are our users using?
\end{itemize}
\end{document}
|