- Q: Is PnetCDF a file format?
A: No. NetCDF
is a file format. PnetCDF is an I/O library that lets MPI programs
read and write NetCDF files in parallel. Note that "NetCDF" also refers to
an I/O library that defines a set of application programming interfaces
(APIs) for reading and writing files in the netCDF format.
- Q: How do I improve I/O performance on Lustre?
A: Lustre is a parallel file system that allows users to customize a file's
striping setting. If the amount of your I/O requests
is sufficiently large, then the best strategy is to set the striping count
to the maximal allowable by the file system. For Lustre, the user
configurable parameters are striping count, striping size, and striping
offset.
- Striping count is the number of object storage targets (OSTs), i.e. the
number of file servers storing the file in a round-robin fashion.
- Striping size is the size of each block. 1 MB is a good size.
- Striping offset is the index of the starting OST (default -1). In most cases, OSTs are selected by the system and are not configurable by users.
To find the (default) striping setting of your files (or folders), use the following command:
% lfs getstripe filename
stripe_count: 12 stripe_size: 1048576 stripe_offset: -1
The command to change a directory's or file's striping setting is "lfs setstripe".
Its syntax is:
% lfs setstripe -s stripe_size -c stripe_count -o start_ost_index directory|filename
Note that users can change a directory's striping settings. New files created in a Lustre directory inherit the same settings.
All in all, we recommend the following.
- Use command "lfs" to set the striping count and size for the output directory and create your output files there.
- Use collective APIs. Collective I/O coordinates application processes and reorganizes their requests into an access pattern that fits better for the underlying file system.
- Use nonblocking APIs for multiple small requests. Nonblocking APIs aggregate small requests into large ones and hence have a better chance to achieve higher performance.
- Q: Should I compile my program (or PnetCDF library) to use shared or static libraries?
A: If performance is your top priority, then we recommend static libraries. Although using shared (aka dynamic) libraries produces smaller executable files, it may introduce delays. See further notes from NERSC.
- Q: What is file striping?
A: On parallel file systems, a file can be divided into blocks of the same size,
called the striping size, which are stored across a set of file servers in a
round-robin fashion. File striping allows multiple file servers to service
I/O requests simultaneously, achieving higher I/O bandwidth.
- Q: Must I use a parallel file system to run PnetCDF?
A: No. However, using a parallel file system with a proper file
striping setting can significantly improve your parallel I/O performance.
- Q: Can a netCDF-4 program make use of PnetCDF internally (instead of HDF5) to perform parallel I/O?
A: Yes. Starting from release 4.1, NetCDF has incorporated the PnetCDF library to enable parallel I/O operations on files in the classic formats (i.e. CDF-1 and CDF-2).
Note that when using HDF5 to carry out parallel I/O, the files will be created in the HDF5 format, instead of the classic netCDF format.
To create a file in HDF5 format, users must add the NC_NETCDF4 flag (optionally together with
NC_CLASSIC_MODEL, and with NC_MPIIO) to the file create mode when calling the API nc_create_par.
To use PnetCDF for parallel I/O, users must add the NC_MPIIO (or NC_PNETCDF) flag to the create mode argument when calling nc_create_par.
Without this flag, netCDF-4 programs can only perform sequential I/O on the classic CDF-1 and CDF-2 files.
Starting from version 4.4.0, netCDF-4 supports the CDF-5 format, which allows larger variables and 8-byte integers.
Example netCDF-4 programs that use PnetCDF for parallel I/O are available here.
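Below is a minimal sketch (not part of the netCDF or PnetCDF releases) of creating a CDF-2 file in parallel through PnetCDF; the file name "out.nc" is hypothetical and error checking is omitted.
#include <mpi.h>
#include <netcdf.h>
#include <netcdf_par.h>   /* declares nc_create_par */

int err, ncid;
/* NC_MPIIO (or NC_PNETCDF) selects the PnetCDF path for this classic-format file */
err = nc_create_par("out.nc", NC_CLOBBER|NC_64BIT_OFFSET|NC_MPIIO,
                    MPI_COMM_WORLD, MPI_INFO_NULL, &ncid);
/* ... define dimensions and variables, then write collectively ... */
err = nc_close(ncid);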
- Q: How do I avoid the "data shift penalty" due to growth of the file header?
A: "Data shift" occurs when the size of the file header grows.
Files in CDF formats comprise two sections: a metadata section and a data section.
The metadata section, also referred to as the file header, is stored at the beginning of the file.
The data section is stored after the metadata section, and its starting file offset is
determined at the first call to the end-define API (e.g. ncmpi_enddef or nc_enddef), when
the file is created.
Afterward, the starting offset is recalculated every time the program calls the end-define
API again (after entering re-define mode, making changes to the metadata, and exiting).
The "data shift penalty" happens when the new file header grows bigger than
the space reserved for the original header and forces PnetCDF to "shift"
the entire data section to a location with higher file offset. The header
of a netCDF file can grow if a program opens an existing file and enters
the redefine mode to add more metadata (e.g. new attributes, dimensions, or
variables). PnetCDF provides an I/O hint, nc_header_align_size, to allow
users to reserve a larger space for the file header if it is expected to grow.
We refer to the space allocated for the file header as the "file header
extent".
The default file header extent is 512 bytes if the file system's
striping size cannot be obtained from the underlying MPI-IO library. If the
file striping size can be obtained (such as on Lustre) and the total size of
all defined fixed-size variables is larger than 4 times the file striping
size, then PnetCDF aligns the file header extent to the file striping size.
Note that PnetCDF always sets the header extent to a multiple of
nc_header_align_size.
Below is an example code fragment that sets the file header hint to 1 MB and
passes it to PnetCDF when creating a file.
MPI_Info_create(&info);
MPI_Info_set(info, "nc_header_align_size", "1048576");
ncmpi_create(MPI_COMM_WORLD, "filename.nc", NC_CLOBBER|NC_64BIT_DATA, info, &ncid);
You can also use the run-time environment variable PNETCDF_HINTS to set a desired value.
For more information on the netCDF file layout, readers are referred to Parts of a NetCDF Classic File.
Note that all I/O hints in PnetCDF and MPI-IO are advisory. The actual values
used by PnetCDF and MPI-IO may be different from the ones set by the user
programs. Users are encouraged to print the actual values used by both
libraries.
See I/O hints for how to print the hint values.
- Q: How do I enable file access layout alignment for fixed-size variables?
A: On most file systems, file locking is performed in
units of file blocks. If a write straddles two blocks, then locks must be
acquired for both blocks. Aligning the start of a variable to a block
boundary can often eliminate all unaligned file system accesses. For IBM's
GPFS and Lustre, the locking unit size is also the file striping size.
The PnetCDF hint for setting the file alignment size is nc_var_align_size.
Below is an example of setting the alignment size to 1 MB.
MPI_Info_create(&info);
MPI_Info_set(info, "nc_var_align_size", "1048576");
ncmpi_create(MPI_COMM_WORLD, "filename.nc", NC_CLOBBER|NC_64BIT_DATA, info, &ncid);
If you are using independent APIs, then setting this hint is more important
than when using collective APIs. This is because most recent MPI-IO
implementations already incorporate file access alignment in their
collective I/O functions, provided MPI-IO can successfully retrieve the file
striping information from the underlying parallel file system. This is one
of the reasons we encourage PnetCDF users to use collective APIs whenever
possible.
To disable the alignment, set the hint value of nc_var_align_size to 1.
If you are using nonblocking APIs to write data,
we recommend disabling the alignment.
Note that all I/O hints in PnetCDF and MPI-IO are advisory. The actual values
used by PnetCDF and MPI-IO may be different from the ones set by the user
programs. Users are encouraged to print the actual values used by both
libraries.
See I/O hints for how to print the hint values.
- Q: What run-time environment variables are available in PnetCDF?
A: The list of PnetCDF run-time environment variables can be found
here.
- PNETCDF_HINTS allows users to pass I/O hints to the PnetCDF library. Hints
include both PnetCDF and MPI-IO hints. The value is a string of hints
separated by ";", where each hint is in the form "keyword=value". For example,
under csh/tcsh, use the command:
setenv PNETCDF_HINTS "romio_ds_write=disable;nc_header_align_size=1048576"
- PNETCDF_VERBOSE_DEBUG_MODE is used to print the location in the source
code where an error code originates, whether or not the error is intended.
This run-time environment variable only takes effect when PnetCDF is
configured in debug mode, i.e. --enable-debug was used on the configure
command line. Be warned that enabling this mode may print a large amount
of debugging messages to stderr. Set this variable to 1 to enable the mode;
setting it to 0 or leaving it unset disables it. The default is 0, i.e.
disabled.
- PNETCDF_SAFE_MODE is used to enable/disable the internal checking for
attribute/argument consistency across all processes. Set it to 1 to enable
the checking. Default is 0, i.e. disabled.
Note that the environment variables take precedence over the (hint) values set in the
application program.
- Q: How do I find out the PnetCDF and MPI-IO hint values used in my program?
A: Hint values can be retrieved from calls to ncmpi_get_file_info
and MPI_Info_get. Users are encouraged to check the hint values actually
used in their programs. Since all hints are advisory, the actual
values used by PnetCDF and MPI-IO may be different from the values set by
the user programs. The real hint values are automatically adjusted based
on many factors, including file size, variable sizes, and file system
settings.
Example programs that print PnetCDF hints only can be found in the
"examples" directory of the PnetCDF release:
hints.c,
hints.f, and
hints.f90.
Below is a code fragment in C that prints all I/O hints, including PnetCDF and MPI-IO.
MPI_Info info_used;
int i, nkeys, err;

err = ncmpi_get_file_info(ncid, &info_used);
MPI_Info_get_nkeys(info_used, &nkeys);
for (i=0; i<nkeys; i++) {
    char key[MPI_MAX_INFO_KEY], value[MPI_MAX_INFO_VAL];
    int valuelen, flag;
    MPI_Info_get_nthkey(info_used, i, key);
    MPI_Info_get_valuelen(info_used, key, &valuelen, &flag);
    MPI_Info_get(info_used, key, valuelen+1, value, &flag);
    printf("I/O hint: key = %21s, value = %s\n", key, value);
}
MPI_Info_free(&info_used);
- Q: Should I consider using nonblocking APIs?
A: Using nonblocking APIs can aggregate a sequence of small requests
into a large one and hence achieve better I/O performance. We encourage
all PnetCDF users to use nonblocking APIs if their programs exhibit the
following I/O patterns:
- There are many variables defined in the netCDF file, and MPI processes
read/write a number of these variables in sequence. (PnetCDF aggregation can
handle requests across variables.)
- MPI processes read/write a sequence of subarrays of the same variable.
(PnetCDF aggregation can also handle multiple requests to a single variable.)
- The numbers of read/write requests are different among processes.
Note that the user buffers should not be touched between posting the nonblocking
API calls and the return of the wait APIs, except when using the buffered nonblocking write APIs. If the contents of
the buffers are changed before the wait call, then the outcome (the contents of the
user read buffer or of the file) is undefined.
If the user buffers are freed before the wait call, then the program may
crash.
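Below is a sketch of aggregating two write requests with nonblocking APIs; ncid, varid1, varid2, start, count, buf1, and buf2 are assumed to be set up already, and error checking is omitted.
int req[2], st[2], err;
err = ncmpi_iput_vara_float(ncid, varid1, start, count, buf1, &req[0]);
err = ncmpi_iput_vara_float(ncid, varid2, start, count, buf2, &req[1]);
/* buf1 and buf2 must remain untouched until the wait call returns */
err = ncmpi_wait_all(ncid, 2, req, st);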
- Q: How do I use the buffered nonblocking write APIs?
A: Buffered nonblocking write APIs copy the contents of user buffers into an internally allocated buffer, so the user buffers can be reused immediately after the calls return.
A typical way to use these APIs is described below.
- First, tell PnetCDF how much space may be allocated for use by these APIs.
- Make calls to the buffered put APIs.
- Make calls to the (collective) wait APIs.
- Free the internally allocated buffer space.
For further information about the buffered nonblocking APIs, readers are referred to this page.
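Below is a sketch of this flow; ncid, varid, start, count, buf, and bufsize are assumed to be set up already, and error checking is omitted.
int req, st, err;
err = ncmpi_buffer_attach(ncid, bufsize);                  /* reserve internal buffer space */
err = ncmpi_bput_vara_float(ncid, varid, start, count, buf, &req);
/* buf may be reused immediately; its contents were copied to the internal buffer */
err = ncmpi_wait_all(ncid, 1, &req, &st);                  /* commit the request to the file */
err = ncmpi_buffer_detach(ncid);                           /* free the internal buffer */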
- Q: What is the difference between collective and independent APIs?
A: Collective APIs require all MPI processes to participate in the
call. This requirement allows MPI-IO and PnetCDF to coordinate the I/O
requesting processes and rearrange all requests into a form that can achieve
the best performance from the underlying file system.
In contrast, independent APIs (also referred to as non-collective) impose
no such requirement.
All PnetCDF collective APIs (except create, open, and
close) have a suffix of "_all", corresponding to their independent
counterparts. To switch from collective data mode to independent data mode,
users must call ncmpi_begin_indep_data. To exit independent mode and return
to collective mode, call ncmpi_end_indep_data.
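Below is a sketch of switching between the two data modes; ncid, varid, start, count, buf, and err are assumed to be defined.
err = ncmpi_begin_indep_data(ncid);   /* leave collective data mode */
err = ncmpi_put_vara_float(ncid, varid, start, count, buf);      /* independent write */
err = ncmpi_end_indep_data(ncid);     /* return to collective data mode */
err = ncmpi_put_vara_float_all(ncid, varid, start, count, buf);  /* collective write */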
- Q: Should I use collective APIs or independent APIs?
A: Users are encouraged to use collective APIs whenever possible.
Collective API calls require the participation of all MPI processes that
open the shared file. This requirement allows MPI-IO and PnetCDF to
coordinate the I/O requesting processes to rearrange requests into a form
that can achieve the best performance from the underlying file system. If
the nature of the user's I/O does not permit calling collective APIs (for example,
the number of requests differs among processes or is determined at
run time), then we recommend the following.
- Use nonblocking APIs. Individual processes
can make any number of calls to nonblocking APIs independently of
other processes. At the end, a collective wait API, ncmpi_wait_all, is
recommended, to allow all nonblocking requests to be committed to the
file system.
- Have all processes participate in the collective calls. When a process
has nothing to request, it can still call a collective API with a
zero-length request. This is achieved by setting the contents of the
count argument to zero, as sketched below.
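For instance, a process with nothing to write could make the collective call as in the sketch below (assuming a 2-dimensional variable; ncid, varid, and err are assumed to be defined).
float dummy;
MPI_Offset start[2] = {0, 0};
MPI_Offset count[2] = {0, 0};   /* zero-length request: nothing is written by this process */
err = ncmpi_put_vara_float_all(ncid, varid, start, count, &dummy);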
- Q: Is there an API to read/write multiple subarrays of a single variable?
A: The family of varn APIs can read/write a list of subarrays of a variable in a single call.
These APIs have similar functionality to H5Sselect_elements API in HDF5.
See their C Interface Guide for detailed information.
Example programs of using these APIs can be found under the directory examples of PnetCDF release (C/put_varn_int.c, C/put_varn_float.c, F77/put_varn_int.f, F77/put_varn_real.f, F90/put_varn_int.f90, and F90/put_varn_real.f90).
ncmpi_get_varn_<type>_all
ncmpi_get_varn_<type>
ncmpi_put_varn_<type>_all
ncmpi_put_varn_<type>
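Below is a sketch (not one of the release examples) that writes two noncontiguous subarrays of a 1-dimensional float variable in a single varn call; ncid, varid, and err are assumed to be defined.
MPI_Offset s0[1] = {0}, c0[1] = {2};   /* subarray 1: elements 0..1 */
MPI_Offset s1[1] = {5}, c1[1] = {3};   /* subarray 2: elements 5..7 */
MPI_Offset *starts[2] = {s0, s1};
MPI_Offset *counts[2] = {c0, c1};
float buf[5] = {1, 2, 3, 4, 5};        /* 2 + 3 values, packed contiguously */
err = ncmpi_put_varn_float_all(ncid, varid, 2, starts, counts, buf);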
- Q: What file formats does PnetCDF support and what are their differences?
A: PnetCDF supports CDF-1, CDF-2, and CDF-5 formats.
CDF-1 has been used by netCDF through version 3.5.1.
In CDF-1, both the file size and individual variable sizes are limited by what a signed 4-byte integer can represent (2^31 = 2147483648 bytes).
Starting from version 3.6.0, netCDF added support for the CDF-2 format.
CDF-2 allows file sizes larger than 2 GB.
In addition, CDF-2 allows more special characters in the name strings of defined dimensions, variables, and attributes.
CDF-2 is backward compatible with the CDF-1 format.
CDF-5 further relaxes the variable size limitation to allow individual variables larger than 2 GB.
CDF-5 also adds new data types, including all unsigned and 64-bit integer types.
Check CDF-5 format specification for detailed differences (highlighted in colors).
- Q: How do I obtain the error message corresponding to a returned error code?
A: All PnetCDF APIs return an integer value, an error code indicating the error status.
NC_NOERR, NF_NOERR, and NF90_NOERR mean the API ran successfully.
All error codes are non-positive integral values, constants defined in header file pnetcdf.h.
APIs ncmpi_strerror/nfmpi_strerror/nf90mpi_strerror turn an error code into a human-readable string.
For example, NC_EBADID becomes "NetCDF: Not a valid ID".
The code fragment below shows a way to check for error and prints the error message.
int err, ncid;
err = ncmpi_create(comm, path, cmode, info, &ncid);
if (err != NC_NOERR) {
    int rank;
    MPI_Comm_rank(comm, &rank);
    printf("Error at rank %d: %s\n", rank, ncmpi_strerror(err));
}
Starting from release 1.8.0, a new API, ncmpi_strerrno, has been added, which
returns the name of the error code, such as "NC_EBADID".
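For example, the error check above could be extended as in the sketch below (requires release 1.8.0 or later).
if (err != NC_NOERR)
    printf("Error %s: %s\n", ncmpi_strerrno(err), ncmpi_strerror(err));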
- Q: When should I call API ncmpi_sync?
A: In most cases, applications need not call ncmpi_sync
before closing a file.
The API ncmpi_sync does nothing but call MPI_File_sync. It is expected
to have a very high cost when files are stored on a parallel file system,
because MPI_File_sync (which internally most likely calls the POSIX sync function) is
designed to ensure the data is safely stored on the persistent storage
hardware before the function returns. POSIX sync usually incurs a huge
performance penalty. This API should be used only when extremely cautious
behavior is required.
The term "sync" should not be confused with "flush". The PnetCDF API
ncmpi_flush is also available, which is used to flush data cached in
memory to the file system. This API does not call MPI_File_sync internally.
POSIX sync guarantees the write data is safely stored on the persistent
storage devices, such as hard disks, before the function call returns to
users. "Safe" means that even if the file servers crash right after the function
returns, the data remains stored on the disks. On parallel file systems,
calling MPI_File_sync will result in all MPI processes calling sync, which
causes all file servers to carry out the sync operations at the same time.
When using a POSIX compliant file system, such as Lustre and GPFS,
applications usually need not call ncmpi_sync before closing the file.
- Q: Does PnetCDF support fill mode?
A: Prior to version 1.6.1, PnetCDF did not support fill mode. This
is because I/O under fill mode can be very expensive (i.e. variables are
prefilled first and later overwritten with the user's data). See the netCDF
interface guide on
nc_set_fill
for more explanation.
Starting from version 1.6.1, PnetCDF supports fill mode in a slightly
different way from netCDF. The API ncmpi_set_fill()
sets the fill mode for all the non-record variables defined in the file and
can only be called in the define mode. For record variables, users are
required to call ncmpi_fill_var() explicitly to fill one record of a
variable at a time.
Similar to nc_def_var_fill() in netCDF-4, the API ncmpi_def_var_fill()
can be used to set the fill mode for individual variables.
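Below is a sketch of enabling fill mode while still in define mode; ncid, a float variable varid, and err are assumed to be defined.
int old_mode;
float fill_value = -999.0f;                             /* a hypothetical fill value */
err = ncmpi_set_fill(ncid, NC_FILL, &old_mode);         /* fill all non-record variables */
err = ncmpi_def_var_fill(ncid, varid, 0, &fill_value);  /* per-variable fill (0 enables filling) */
err = ncmpi_enddef(ncid);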
- Q: Is there an API that reports the amount of data read/written that was carried out by PnetCDF?
A: The following two APIs report the amount of data that has been
read/written since the file was opened/created. The amount includes the I/O
to the file header as well as to the variables. The reported amount is on a
per-process basis. The APIs can be called at any time between file open/create
and close.
int ncmpi_inq_get_size(int ncid, MPI_Offset *size);
int ncmpi_inq_put_size(int ncid, MPI_Offset *size);
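Below is a sketch that reports the per-process I/O amounts right before closing the file; ncid, rank, and err are assumed to be defined.
MPI_Offset get_size, put_size;
err = ncmpi_inq_get_size(ncid, &get_size);
err = ncmpi_inq_put_size(ncid, &put_size);
printf("rank %d: read %lld bytes, wrote %lld bytes\n",
       rank, (long long)get_size, (long long)put_size);
err = ncmpi_close(ncid);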
- Q: What are the numerical and non-numerical data types referred to in NetCDF/PnetCDF?
A: The NetCDF CDF format specifications define a set of external NC data types and describe their intended uses.
External type   No. of bits   Intended use
-------------   -----------   -------------------------------------------------
NC_CHAR               8       text data (the only non-numerical type in NetCDF)
NC_BYTE               8       1-byte integer
NC_SHORT             16       2-byte signed integer
NC_INT               32       4-byte signed integer
NC_FLOAT             32       4-byte floating-point number
NC_DOUBLE            64       8-byte floating-point number (double precision)
NC_UBYTE              8       unsigned 1-byte integer
NC_USHORT            16       unsigned 2-byte integer
NC_UINT              32       unsigned 4-byte integer
NC_INT64             64       signed 8-byte integer
NC_UINT64            64       unsigned 8-byte integer
Note that NC_CHAR is the only non-numerical data type available in the NetCDF
realm. All other external types are considered numerical, and they cannot
be converted (type-cast) to or from a NetCDF variable defined with the
NC_CHAR type (the error code NC_ECHAR will be returned). The only
legal APIs for reading/writing a variable of type NC_CHAR are the "_text" APIs.
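For example, below is a sketch of writing into an NC_CHAR variable with a "_text" API; ncid, varid, and err are assumed to be defined, and start/count are assumed to select 11 characters.
char text[] = "temperature";   /* 11 characters */
err = ncmpi_put_vara_text_all(ncid, varid, start, count, text);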
- Q: Why do I get NC_EBADTYPE error when I used MPI_BYTE in flexible APIs?
A: Starting from 1.7.0, PnetCDF translates internal data types (i.e. the data
types of the I/O buffer, which also appear in the API names, such as text,
schar, uchar, short, int, etc.) to MPI data types based on the table
below. Note that MPI_BYTE does not correspond to any internal data type used
in NetCDF/PnetCDF APIs. Thus, when MPI_BYTE is used to construct an MPI
derived data type that is later used as the buftype argument in a flexible
API, the error code NC_EBADTYPE (Not a valid data type) will be returned.
Internal I/O buffer   Example API               Corresponding
data type                                       MPI datatype
-------------------   -----------------------   ----------------------
text                  ncmpi_put_var_text        MPI_CHAR
schar                 ncmpi_put_var_schar       MPI_SIGNED_CHAR
uchar                 ncmpi_put_var_uchar       MPI_UNSIGNED_CHAR
short                 ncmpi_put_var_short       MPI_SHORT
ushort                ncmpi_put_var_ushort      MPI_UNSIGNED_SHORT
int                   ncmpi_put_var_int         MPI_INT
uint                  ncmpi_put_var_uint        MPI_UNSIGNED
long                  ncmpi_put_var_long        MPI_LONG
float                 ncmpi_put_var_float       MPI_FLOAT
double                ncmpi_put_var_double      MPI_DOUBLE
longlong              ncmpi_put_var_longlong    MPI_LONG_LONG_INT
ulonglong             ncmpi_put_var_ulonglong   MPI_UNSIGNED_LONG_LONG
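Below is a sketch of a flexible API call in which the I/O buffer is described by an MPI derived datatype built from MPI_INT (matching the "int" internal type) rather than MPI_BYTE; ncid, varid, buf, and err are assumed to be defined, and start/count are assumed to select 100 elements.
MPI_Datatype buftype;
MPI_Type_contiguous(100, MPI_INT, &buftype);   /* 100 integers in the user buffer */
MPI_Type_commit(&buftype);
err = ncmpi_put_vara_all(ncid, varid, start, count, buf, 1, buftype);
MPI_Type_free(&buftype);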
- Q: Does PnetCDF support compound data types?
A: No. This is due to the limitations of the CDF file format specifications.
- Q: Can I run my PnetCDF program sequentially?
A: Yes. Because a PnetCDF program is also an MPI program, it can run on a single process under the MPI run-time environment.
- Q: What level of parallel I/O data consistency is supported by PnetCDF?
A: PnetCDF follows the same parallel I/O data consistency as the
MPI-IO standard.
Readers are also referred to the following paper.
Rajeev Thakur, William Gropp, and Ewing Lusk, On Implementing MPI-IO Portably and with High Performance,
in the Proceedings of the 6th Workshop on I/O in Parallel and Distributed Systems, pp. 23-32, May 1999.
- Q: Where can I find PnetCDF example programs?
A: PnetCDF releases come with a set of example programs in C, Fortran, and Fortran 90.
They are available under the directory named "examples".
Refer to the README file for a description of each example program.
- Q: How to find out the PnetCDF version I am using?
A: A utility program named pnetcdf_version, which comes with all PnetCDF releases, reports the version information. Check its man page for command-line options. Below is an example run of this command:
% pnetcdf_version
PnetCDF Version: 1.7.0
PnetCDF Release date: 03 Mar 2016
PnetCDF configure: --prefix=/usr/local/PnetCDF --with-mpi=/usr/local
MPICC: /usr/local/bin/mpicc -g -O2
MPICXX: /usr/local/bin/mpicxx -g -O2
MPIF77: /usr/local/bin/mpif77 -g -O2
MPIF90: /usr/local/bin/mpif90 -g -O2
Alternatively, the two commands below also show the version information.
% ident libpnetcdf.a
src/lib/libpnetcdf.a:
$Id: @(#) PnetCDF library version 1.7.0 of 03 Mar 2016 $
% strings libpnetcdf.a |grep "PnetCDF library version"
PnetCDF library version 1.7.0 of 03 Mar 2016
$Id: @(#) PnetCDF library version 1.7.0 of 03 Mar 2016 $
- Q: Is there a mailing list for PnetCDF discussions and questions?
A: We discuss the design and use of the PnetCDF library on the parallel-netcdf@mcs.anl.gov mailing list.
Anyone interested in developing or using PnetCDF is encouraged to join.
Visit the list information page for details.