- Q: Is PnetCDF a file format?
A: No. NetCDF
is a file format. PnetCDF is an I/O library that lets MPI programs
read and write NetCDF files in parallel. Note that "NetCDF" also refers to
an I/O library that defines a set of application programming interfaces
(APIs) for reading and writing files in the netCDF format.
- Q: How do I improve I/O performance on Lustre?
A: Lustre is a parallel file system that allows users to customize a file's
striping setting. If the amount of your I/O requests
is sufficiently large, then the best strategy is to set the striping count
to the maximal allowable by the file system. For Lustre, the user
configurable parameters are striping count, striping size, and striping
offset.
- Striping count is the number of object storage targets (OSTs), i.e. the
number of file servers storing the file in a round-robin fashion.
- Striping size is the size of each block. 1 MB is a good size.
- Striping offset is the index of the starting OST (default -1). In most cases, OSTs are selected by the system and are not configurable by users.
To find the (default) striping setting of your files (or folders), use the following command:
% lfs getstripe filename
stripe_count: 12 stripe_size: 1048576 stripe_offset: -1
The command to change a directory's or file's striping setting is "lfs setstripe".
Its syntax is:
% lfs setstripe -s stripe_size -c stripe_count -o start_ost_index directory|filename
Note that users can change a directory's striping settings. New files created in a Lustre directory inherit the same settings.
All in all, we recommend the following.
- Use command "lfs" to set the striping count and size for the output directory and create your output files there.
- Use collective APIs. Collective I/O coordinates application processes and reorganizes their requests into an access pattern that fits better for the underlying file system.
- Use nonblocking APIs for multiple small requests. Nonblocking APIs aggregate small requests into large ones and hence have a better chance to achieve higher performance.
- Q: Should I compile my program (or PnetCDF library) to use shared or static libraries?
A: If performance is your top priority, then we recommend static libraries. Although using shared (aka dynamic) libraries produces smaller executable files, it may introduce delays. See further notes from NERSC.
- Q: What is file striping?
A: On parallel file systems, a file can be divided into blocks of the same size,
called the striping size, which are stored across a set of file servers in a
round-robin fashion. File striping allows multiple file servers to service
I/O requests simultaneously, achieving higher I/O bandwidth.
- Q: Must I use a parallel file system to run PnetCDF?
A: No. However, using a parallel file system with a proper file
striping setting can significantly improve your parallel I/O performance.
- Q: Can a netCDF-4 program make use of PnetCDF internally (instead of HDF5) to perform parallel I/O?
A: Yes. Starting from release 4.1, NetCDF has incorporated the PnetCDF library to enable parallel I/O operations on files in the classic formats (i.e. CDF-1 and CDF-2).
Note that when using HDF5 to carry out parallel I/O, the files will be created in the HDF5 format, instead of the classic netCDF format.
To create a file in HDF5 format, users must add the NC_NETCDF4 flag (optionally together with
NC_CLASSIC_MODEL, and with NC_MPIIO) to the file create mode when calling the API nc_create_par.
To use PnetCDF for parallel I/O, users must add the NC_MPIIO (or NC_PNETCDF) flag to the create mode argument when calling nc_create_par.
Without this flag, netCDF-4 programs can only perform sequential I/O on the classic CDF-1 and CDF-2 files.
Starting from version 4.4.0, netCDF-4 supports the CDF-5 format, which allows larger variables and 8-byte integers.
Example netCDF-4 programs that use PnetCDF for parallel I/O are available here.
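Below is a minimal sketch (not part of the netCDF or PnetCDF releases) of creating a CDF-2 file in parallel through PnetCDF; the file name "out.nc" is hypothetical and error checking is omitted.
#include <mpi.h>
#include <netcdf.h>
#include <netcdf_par.h>   /* declares nc_create_par */

int err, ncid;
/* NC_MPIIO (or NC_PNETCDF) selects the PnetCDF path for this classic-format file */
err = nc_create_par("out.nc", NC_CLOBBER|NC_64BIT_OFFSET|NC_MPIIO,
                    MPI_COMM_WORLD, MPI_INFO_NULL, &ncid);
/* ... define dimensions and variables, then write collectively ... */
err = nc_close(ncid);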
- Q: How do I avoid the "data shift penalty" due to growth of the file header?
A: "Data shift" occurs when the size of the file header grows.
Files in CDF formats comprise two sections: a metadata section and a data section.
The metadata section, also referred to as the file header, is stored at the beginning of the file.
The data section is stored after the metadata section, and its starting file offset is
determined at the first call to the end-define API (e.g. ncmpi_enddef or nc_enddef), when
the file is created.
Afterward, the starting offset is recalculated every time the program calls the end-define
API again (after entering re-define mode, making changes to the metadata, and exiting).
The "data shift penalty" happens when the new file header grows bigger than
the space reserved for the original header and forces PnetCDF to "shift"
the entire data section to a location with higher file offset. The header
of a netCDF file can grow if a program opens an existing file and enters
the redefine mode to add more metadata (e.g. new attributes, dimensions, or
variables). PnetCDF provides an I/O hint, nc_header_align_size, to allow
users to reserve a larger space for the file header if it is expected to grow.
We refer to the space allocated for the file header as the "file header
extent".
The default file header extent is 512 bytes if the file system's
striping size cannot be obtained from the underlying MPI-IO library. If the
file striping size can be obtained (such as on Lustre) and the total size of
all defined fixed-size variables is larger than 4 times the file striping
size, then PnetCDF aligns the file header extent to the file striping size.
Note that PnetCDF always sets the header extent to a multiple of
nc_header_align_size.
Below is an example code fragment that sets the file header hint to 1 MB and
passes it to PnetCDF when creating a file.
MPI_Info_create(&info);
MPI_Info_set(info, "nc_header_align_size", "1048576");
ncmpi_create(MPI_COMM_WORLD, "filename.nc", NC_CLOBBER|NC_64BIT_DATA, info, &ncid);
You can also use the run-time environment variable PNETCDF_HINTS to set a desired value.
For more information on the netCDF file layout, readers are referred to Parts of a NetCDF Classic File.
Note that all I/O hints in PnetCDF and MPI-IO are advisory. The actual values
used by PnetCDF and MPI-IO may be different from the ones set by the user
programs. Users are encouraged to print the actual values used by both
libraries.
See I/O hints for how to print the hint values.
- Q: How do I enable file access layout alignment for fixed-size variables?
A: On most file systems, file locking is performed in
units of file blocks. If a write straddles two blocks, then locks must be
acquired for both blocks. Aligning the start of a variable to a block
boundary can often eliminate all unaligned file system accesses. For IBM's
GPFS and Lustre, the locking unit size is also the file striping size.
The PnetCDF hint for setting the file alignment size is nc_var_align_size.
Below is an example of setting the alignment size to 1 MB.
MPI_Info_create(&info);
MPI_Info_set(info, "nc_var_align_size", "1048576");
ncmpi_create(MPI_COMM_WORLD, "filename.nc", NC_CLOBBER|NC_64BIT_DATA, info, &ncid);
If you are using independent APIs, then setting this hint is more important
than when using collective APIs. This is because most recent MPI-IO
implementations already incorporate file access alignment in their
collective I/O functions, provided MPI-IO can successfully retrieve the file
striping information from the underlying parallel file system. This is one
of the reasons we encourage PnetCDF users to use collective APIs whenever
possible.
To disable the alignment, set the hint value of nc_var_align_size to 1.
If you are using nonblocking APIs to write data,
we recommend disabling the alignment.
Note that all I/O hints in PnetCDF and MPI-IO are advisory. The actual values
used by PnetCDF and MPI-IO may be different from the ones set by the user
programs. Users are encouraged to print the actual values used by both
libraries.
See I/O hints for how to print the hint values.
- Q: What run-time environment variables are available in PnetCDF?
A: The list of PnetCDF run-time environment variables can be found
here.
- PNETCDF_HINTS allows users to pass I/O hints to the PnetCDF library. Hints
include both PnetCDF and MPI-IO hints. The value is a string of hints
separated by ";", where each hint is in the form "keyword=value". For example,
under csh/tcsh, use the command:
setenv PNETCDF_HINTS "romio_ds_write=disable;nc_header_align_size=1048576"
- PNETCDF_VERBOSE_DEBUG_MODE is used to print the location in the source
code where an error code originates, whether or not the error is intended.
This run-time environment variable only takes effect when PnetCDF is
configured in debug mode, i.e. --enable-debug was used on the configure
command line. Be warned that enabling this mode may print a large amount
of debugging messages to stderr. Set this variable to 1 to enable the mode;
setting it to 0 or leaving it unset disables it. The default is 0, i.e.
disabled.
- PNETCDF_SAFE_MODE is used to enable/disable the internal checking for
attribute/argument consistency across all processes. Set it to 1 to enable
the checking. Default is 0, i.e. disabled.
Note that the environment variables take precedence over the (hint) values set in the
application program.
- Q: How do I find out the PnetCDF and MPI-IO hint values used in my program?
A: Hint values can be retrieved from calls to ncmpi_get_file_info
and MPI_Info_get. Users are encouraged to check the hint values actually
used in their programs. Since all hints are advisory, the actual
values used by PnetCDF and MPI-IO may be different from the values set by
the user programs. The real hint values are automatically adjusted based
on many factors, including file size, variable sizes, and file system
settings.
Example programs that print PnetCDF hints only can be found in the
"examples" directory of the PnetCDF release:
hints.c,
hints.f, and
hints.f90.
Below is a code fragment in C that prints all I/O hints, including PnetCDF and MPI-IO.
MPI_Info info_used;
int i, nkeys, err;

err = ncmpi_get_file_info(ncid, &info_used);
MPI_Info_get_nkeys(info_used, &nkeys);
for (i=0; i<nkeys; i++) {
    char key[MPI_MAX_INFO_KEY], value[MPI_MAX_INFO_VAL];
    int valuelen, flag;
    MPI_Info_get_nthkey(info_used, i, key);
    MPI_Info_get_valuelen(info_used, key, &valuelen, &flag);
    MPI_Info_get(info_used, key, valuelen+1, value, &flag);
    printf("I/O hint: key = %21s, value = %s\n", key, value);
}
MPI_Info_free(&info_used);
- Q: Should I consider using nonblocking APIs?
A: Using nonblocking APIs can aggregate a sequence of small requests
into a large one and hence achieve better I/O performance. We encourage
all PnetCDF users to use nonblocking APIs if their programs exhibit the
following I/O patterns:
- There are many variables defined in the netCDF file, and MPI processes
read/write a number of these variables in sequence. (PnetCDF aggregation can
handle requests across variables.)
- MPI processes read/write a sequence of subarrays of the same variable.
(PnetCDF aggregation can also handle multiple requests to a single variable.)
- The numbers of read/write requests are different among processes.
Note that the user buffers should not be touched between posting the nonblocking
API calls and the return of the wait APIs, except when using the buffered nonblocking write APIs. If the contents of
the buffers are changed before the wait call, then the outcome (the contents of the
user read buffer or of the file) is undefined.
If the user buffers are freed before the wait call, then the program may
crash.
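Below is a sketch of aggregating two write requests with nonblocking APIs; ncid, varid1, varid2, start, count, buf1, and buf2 are assumed to be set up already, and error checking is omitted.
int req[2], st[2], err;
err = ncmpi_iput_vara_float(ncid, varid1, start, count, buf1, &req[0]);
err = ncmpi_iput_vara_float(ncid, varid2, start, count, buf2, &req[1]);
/* buf1 and buf2 must remain untouched until the wait call returns */
err = ncmpi_wait_all(ncid, 2, req, st);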
- Q: How do I use the buffered nonblocking write APIs?
A: Buffered nonblocking write APIs copy the contents of user buffers into an internally allocated buffer, so the user buffers can be reused immediately after the calls return.
A typical way to use these APIs is described below.
- First, tell PnetCDF how much space may be allocated for use by these APIs.
- Make calls to the buffered put APIs.
- Make calls to the (collective) wait APIs.
- Free the internally allocated buffer space.
For further information about the buffered nonblocking APIs, readers are referred to this page.
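Below is a sketch of this flow; ncid, varid, start, count, buf, and bufsize are assumed to be set up already, and error checking is omitted.
int req, st, err;
err = ncmpi_buffer_attach(ncid, bufsize);                  /* reserve internal buffer space */
err = ncmpi_bput_vara_float(ncid, varid, start, count, buf, &req);
/* buf may be reused immediately; its contents were copied to the internal buffer */
err = ncmpi_wait_all(ncid, 1, &req, &st);                  /* commit the request to the file */
err = ncmpi_buffer_detach(ncid);                           /* free the internal buffer */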
- Q: What is the difference between collective and independent APIs?
A: Collective APIs require all MPI processes to participate in the
call. This requirement allows MPI-IO and PnetCDF to coordinate the I/O
requesting processes and rearrange all requests into a form that can achieve
the best performance from the underlying file system.
In contrast, independent APIs (also referred to as non-collective) impose
no such requirement.
All PnetCDF collective APIs (except create, open, and
close) have a suffix of "_all", corresponding to their independent
counterparts. To switch from collective data mode to independent data mode,
users must call ncmpi_begin_indep_data. To exit independent mode and return
to collective mode, call ncmpi_end_indep_data.
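Below is a sketch of switching between the two data modes; ncid, varid, start, count, buf, and err are assumed to be defined.
err = ncmpi_begin_indep_data(ncid);   /* leave collective data mode */
err = ncmpi_put_vara_float(ncid, varid, start, count, buf);      /* independent write */
err = ncmpi_end_indep_data(ncid);     /* return to collective data mode */
err = ncmpi_put_vara_float_all(ncid, varid, start, count, buf);  /* collective write */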
- Q: Should I use collective APIs or independent APIs?
A: Users are encouraged to use collective APIs whenever possible.
Collective API calls require the participation of all MPI processes that
open the shared file. This requirement allows MPI-IO and PnetCDF to
coordinate the I/O requesting processes to rearrange requests into a form
that can achieve the best performance from the underlying file system. If
the nature of the user's I/O does not permit calling collective APIs (for example,
the number of requests differs among processes or is determined at
run time), then we recommend the following.
- Use nonblocking APIs. Individual processes
can make any number of calls to nonblocking APIs independently of
other processes. At the end, a collective wait API, ncmpi_wait_all, is
recommended, to allow all nonblocking requests to be committed to the
file system.
- Have all processes participate in the collective calls. When a process
has nothing to request, it can still call a collective API with a
zero-length request. This is achieved by setting the contents of the
count argument to zero, as sketched below.
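For instance, a process with nothing to write could make the collective call as in the sketch below (assuming a 2-dimensional variable; ncid, varid, and err are assumed to be defined).
float dummy;
MPI_Offset start[2] = {0, 0};
MPI_Offset count[2] = {0, 0};   /* zero-length request: nothing is written by this process */
err = ncmpi_put_vara_float_all(ncid, varid, start, count, &dummy);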
- Q: Is there an API to read/write multiple subarrays of a single variable?
A: The family of varn APIs can read/write a list of subarrays of a variable in a single call.
These APIs have similar functionality to H5Sselect_elements API in HDF5.
See their C Interface Guide for detailed information.
Example programs of using these APIs can be found under the directory examples of PnetCDF release (C/put_varn_int.c, C/put_varn_float.c, F77/put_varn_int.f, F77/put_varn_real.f, F90/put_varn_int.f90, and F90/put_varn_real.f90).
ncmpi_get_varn_<type>_all
ncmpi_get_varn_<type>
ncmpi_put_varn_<type>_all
ncmpi_put_varn_<type>
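Below is a sketch (not one of the release examples) that writes two noncontiguous subarrays of a 1-dimensional float variable in a single varn call; ncid, varid, and err are assumed to be defined.
MPI_Offset s0[1] = {0}, c0[1] = {2};   /* subarray 1: elements 0..1 */
MPI_Offset s1[1] = {5}, c1[1] = {3};   /* subarray 2: elements 5..7 */
MPI_Offset *starts[2] = {s0, s1};
MPI_Offset *counts[2] = {c0, c1};
float buf[5] = {1, 2, 3, 4, 5};        /* 2 + 3 values, packed contiguously */
err = ncmpi_put_varn_float_all(ncid, varid, 2, starts, counts, buf);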
- Q: What file formats does PnetCDF support and what are their differences?
A: PnetCDF supports CDF-1, CDF-2, and CDF-5 formats.
CDF-1 has been used by netCDF through version 3.5.1.
In CDF-1, both the file size and individual variable sizes are limited by what a signed 4-byte integer can represent (2^31 = 2147483648 bytes).
Starting from version 3.6.0, netCDF added support for the CDF-2 format.
CDF-2 allows file sizes larger than 2 GB.
In addition, CDF-2 allows more special characters in the name strings of defined dimensions, variables, and attributes.
CDF-2 is backward compatible with the CDF-1 format.
CDF-5 further relaxes the variable size limitation to allow individual variables larger than 2 GB.
CDF-5 also adds new data types, including all unsigned and 64-bit integer types.
Check CDF-5 format specification for detailed differences (highlighted in colors).
- Q: How do I obtain the error message corresponding to a returned error code?
A: All PnetCDF APIs return an integer value, an error code indicating the error status.
NC_NOERR, NF_NOERR, and NF90_NOERR mean the API ran successfully.
All error codes are non-positive integral values, constants defined in header file pnetcdf.h.
APIs ncmpi_strerror/nfmpi_strerror/nf90mpi_strerror turn an error code into a human-readable string.
For example, NC_EBADID becomes "NetCDF: Not a valid ID".
The code fragment below shows a way to check for error and prints the error message.
int err, ncid;
err = ncmpi_create(comm, path, cmode, info, &ncid);
if (err != NC_NOERR) {
    int rank;
    MPI_Comm_rank(comm, &rank);
    printf("Error at rank %d: %s\n", rank, ncmpi_strerror(err));
}
Starting from release 1.8.0, a new API, ncmpi_strerrno, has been added, which
returns the name of the error code, such as "NC_EBADID".
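For example, the error check above could be extended as in the sketch below (requires release 1.8.0 or later).
if (err != NC_NOERR)
    printf("Error %s: %s\n", ncmpi_strerrno(err), ncmpi_strerror(err));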
- Q: When should I call API ncmpi_sync?
A: In most cases, applications need not call ncmpi_sync
before closing a file.
The API ncmpi_sync does nothing but call MPI_File_sync. It is expected
to have a very high cost when files are stored on a parallel file system,
because MPI_File_sync (which internally most likely calls the POSIX sync function) is
designed to ensure the data is safely stored on the persistent storage
hardware before the function returns. POSIX sync usually incurs a huge
performance penalty. This API should be used only when extremely cautious
behavior is required.
The term "sync" should not be confused with "flush". The PnetCDF API
ncmpi_flush is also available, which is used to flush data cached in
memory to the file system. This API does not call MPI_File_sync internally.
POSIX sync guarantees the write data is safely stored on the persistent
storage devices, such as hard disks, before the function call returns to
users. "Safe" means that even if the file servers crash right after the function
returns, the data remains stored on the disks. On parallel file systems,
calling MPI_File_sync will result in all MPI processes calling sync, which
causes all file servers to carry out the sync operations at the same time.
When using a POSIX compliant file system, such as Lustre and GPFS,
applications usually need not call ncmpi_sync before closing the file.
- Q: Does PnetCDF support fill mode?
A: Prior to version 1.6.1, PnetCDF did not support fill mode. This
is because I/O under fill mode can be very expensive (i.e. variables are
prefilled first and later overwritten with the user's data). See the netCDF
interface guide on
nc_set_fill
for more explanation.
Starting from version 1.6.1, PnetCDF supports fill mode in a slightly
different way from netCDF. The API ncmpi_set_fill()
sets the fill mode for all the non-record variables defined in the file and
can only be called in the define mode. For record variables, users are
required to call ncmpi_fill_var() explicitly to fill one record of a
variable at a time.
Similar to nc_def_var_fill() in netCDF-4, the API ncmpi_def_var_fill()
can be used to set the fill mode for individual variables.
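Below is a sketch of enabling fill mode while still in define mode; ncid, a float variable varid, and err are assumed to be defined.
int old_mode;
float fill_value = -999.0f;                             /* a hypothetical fill value */
err = ncmpi_set_fill(ncid, NC_FILL, &old_mode);         /* fill all non-record variables */
err = ncmpi_def_var_fill(ncid, varid, 0, &fill_value);  /* per-variable fill (0 enables filling) */
err = ncmpi_enddef(ncid);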
- Q: Is there an API that reports the amount of data read/written that was carried out by PnetCDF?
A: The following two APIs report the amount of data that has been
read/written since the file was opened/created. The amount includes the I/O
to the file header as well as to the variables. The reported amount is on a
per-process basis. The APIs can be called at any time between file open/create
and close.
int ncmpi_inq_get_size(int ncid, MPI_Offset *size);
int ncmpi_inq_put_size(int ncid, MPI_Offset *size);
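Below is a sketch that reports the per-process I/O amounts right before closing the file; ncid, rank, and err are assumed to be defined.
MPI_Offset get_size, put_size;
err = ncmpi_inq_get_size(ncid, &get_size);
err = ncmpi_inq_put_size(ncid, &put_size);
printf("rank %d: read %lld bytes, wrote %lld bytes\n",
       rank, (long long)get_size, (long long)put_size);
err = ncmpi_close(ncid);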
- Q: What are the numerical and non-numerical data types referred to in NetCDF/PnetCDF?
A: The NetCDF CDF format specifications define a set of external NC data types and describe their intended uses.
External type   No. of bits   Intended use
-------------   -----------   -------------------------------------------------
NC_CHAR               8       text data (the only non-numerical type in NetCDF)
NC_BYTE               8       1-byte integer
NC_SHORT             16       2-byte signed integer
NC_INT               32       4-byte signed integer
NC_FLOAT             32       4-byte floating-point number
NC_DOUBLE            64       8-byte floating-point number (double precision)
NC_UBYTE              8       unsigned 1-byte integer
NC_USHORT            16       unsigned 2-byte integer
NC_UINT              32       unsigned 4-byte integer
NC_INT64             64       signed 8-byte integer
NC_UINT64            64       unsigned 8-byte integer
Note that NC_CHAR is the only non-numerical data type available in the NetCDF
realm. All other external types are considered numerical, and they cannot
be converted (type-cast) to or from a NetCDF variable defined with the
NC_CHAR type (the error code NC_ECHAR will be returned). The only
legal APIs for reading/writing a variable of type NC_CHAR are the "_text" APIs.
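For example, below is a sketch of writing into an NC_CHAR variable with a "_text" API; ncid, varid, and err are assumed to be defined, and start/count are assumed to select 11 characters.
char text[] = "temperature";   /* 11 characters */
err = ncmpi_put_vara_text_all(ncid, varid, start, count, text);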
- Q: Why do I get NC_EBADTYPE error when I used MPI_BYTE in flexible APIs?
A: Starting from 1.7.0, PnetCDF translates internal data types (i.e. the data
types of the I/O buffer, which also appear in the API names, such as text,
schar, uchar, short, int, etc.) to MPI data types based on the table
below. Note that MPI_BYTE does not correspond to any internal data type used
in NetCDF/PnetCDF APIs. Thus, when MPI_BYTE is used to construct an MPI
derived data type that is later used as the buftype argument in a flexible
API, the error code NC_EBADTYPE (Not a valid data type) will be returned.
Internal I/O buffer   Example API               Corresponding
data type                                       MPI datatype
-------------------   -----------------------   ----------------------
text                  ncmpi_put_var_text        MPI_CHAR
schar                 ncmpi_put_var_schar       MPI_SIGNED_CHAR
uchar                 ncmpi_put_var_uchar       MPI_UNSIGNED_CHAR
short                 ncmpi_put_var_short       MPI_SHORT
ushort                ncmpi_put_var_ushort      MPI_UNSIGNED_SHORT
int                   ncmpi_put_var_int         MPI_INT
uint                  ncmpi_put_var_uint        MPI_UNSIGNED
long                  ncmpi_put_var_long        MPI_LONG
float                 ncmpi_put_var_float       MPI_FLOAT
double                ncmpi_put_var_double      MPI_DOUBLE
longlong              ncmpi_put_var_longlong    MPI_LONG_LONG_INT
ulonglong             ncmpi_put_var_ulonglong   MPI_UNSIGNED_LONG_LONG
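Below is a sketch of a flexible API call in which the I/O buffer is described by an MPI derived datatype built from MPI_INT (matching the "int" internal type) rather than MPI_BYTE; ncid, varid, buf, and err are assumed to be defined, and start/count are assumed to select 100 elements.
MPI_Datatype buftype;
MPI_Type_contiguous(100, MPI_INT, &buftype);   /* 100 integers in the user buffer */
MPI_Type_commit(&buftype);
err = ncmpi_put_vara_all(ncid, varid, start, count, buf, 1, buftype);
MPI_Type_free(&buftype);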
- Q: Does PnetCDF support compound data types?
A: No. This is due to the limitations of the CDF file format specifications.
- Q: Can I run my PnetCDF program sequentially?
A: Yes. Because a PnetCDF program is also an MPI program, it can run on a single process under the MPI run-time environment.
- Q: What level of parallel I/O data consistency is supported by PnetCDF?
A: PnetCDF follows the same parallel I/O data consistency as the
MPI-IO standard.
Readers are also referred to the following paper.
Rajeev Thakur, William Gropp, and Ewing Lusk, On Implementing MPI-IO Portably and with High Performance,
in the Proceedings of the 6th Workshop on I/O in Parallel and Distributed Systems, pp. 23-32, May 1999.
- Q: Where can I find PnetCDF example programs?
A: PnetCDF releases come with a set of example programs in C, Fortran, and Fortran 90.
They are available under the directory named "examples".
Refer to the README file for a description of each example program.
- Q: How to find out the PnetCDF version I am using?
A: A utility program named pnetcdf_version, which comes with all PnetCDF releases, reports the version information. Check its man page for command-line options. Below is an example run of this command:
% pnetcdf_version
PnetCDF Version: 1.7.0
PnetCDF Release date: 03 Mar 2016
PnetCDF configure: --prefix=/usr/local/PnetCDF --with-mpi=/usr/local
MPICC: /usr/local/bin/mpicc -g -O2
MPICXX: /usr/local/bin/mpicxx -g -O2
MPIF77: /usr/local/bin/mpif77 -g -O2
MPIF90: /usr/local/bin/mpif90 -g -O2
Alternatively, the two commands below also show the version information.
% ident libpnetcdf.a
src/lib/libpnetcdf.a:
$Id: @(#) PnetCDF library version 1.7.0 of 03 Mar 2016 $
% strings libpnetcdf.a |grep "PnetCDF library version"
PnetCDF library version 1.7.0 of 03 Mar 2016
$Id: @(#) PnetCDF library version 1.7.0 of 03 Mar 2016 $
- Q: Is there a mailing list for PnetCDF discussions and questions?
A: We discuss the design and use of the PnetCDF library on the parallel-netcdf@mcs.anl.gov mailing list.
Anyone interested in developing or using PnetCDF is encouraged to join.
Visit the list information page for details.