Using DataWarp as Burst Buffers in PnetCDF

Using Burst Buffers with PnetCDF

PnetCDF has a built-in I/O driver that aggregates variable write requests on the burst buffer. Through this driver, write requests are first stored on the burst buffer and later flushed to the parallel file system when the file is closed or when the user explicitly calls flush. This driver is in beta and has been tested on the Cori parallel computer at NERSC. Please report any problems on our GitHub repository.

Figure 1. Design of the Burst Buffer Driver. The green lines show the data flow for write operations when the Burst Buffer Driver is disabled. The blue lines indicate the write requests are first stored on the burst buffer. The red line shows the data flow when flushing the data from the burst buffer to the parallel file system.

Enable the Burst Buffer Driver

To build PnetCDF with Burst Buffer Driver support, add the "--enable-burst-buffering" option to the configure command line.

    ./configure --prefix=/path/to/install --enable-burst-buffering

The remaining steps are the same as usual. To build the PnetCDF library, run:

    make

To install PnetCDF, run:

    make install

Use Burst Buffer Driver

The use of the Burst Buffer Driver is controlled by PnetCDF I/O hints. There are two ways of setting PnetCDF I/O hints.

One is to set the hint in an MPI info object and pass it to the file creation API ncmpi_create or the file open API ncmpi_open.
The other is to set the environment variable PNETCDF_HINTS.

For instance, to enable the Burst Buffer Driver, add to the MPI-IO info object the hint "nc_burst_buf" and set it to "enable".

    MPI_Info_set(info, "nc_burst_buf", "enable");

The hint can also be set using the environment variable PNETCDF_HINTS.

    export PNETCDF_HINTS="nc_burst_buf=enable"

For more information on setting PnetCDF hints through the environment variable, see the PnetCDF C Interface Guide.
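The value of PNETCDF_HINTS is a list of key=value pairs. The job script example in this document separates the pairs with semicolons, and the sketch below assumes that format. The helper name hint_append is hypothetical and not part of PnetCDF; it only assembles the string.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not a PnetCDF API): append "key=value" to a
 * PNETCDF_HINTS-style string held in buf, inserting ';' between entries.
 * Returns 0 on success, or -1 if the result would not fit in cap bytes,
 * in which case buf is left unchanged. */
int hint_append(char *buf, size_t cap, const char *key, const char *value)
{
    size_t used = strlen(buf);
    /* key + '=' + value, plus one ';' when buf already holds an entry */
    size_t need = strlen(key) + 1 + strlen(value) + (used > 0 ? 1 : 0);

    if (used + need + 1 > cap)   /* +1 for the terminating NUL */
        return -1;
    snprintf(buf + used, cap - used, "%s%s=%s",
             used > 0 ? ";" : "", key, value);
    return 0;
}
```

Assuming PnetCDF reads PNETCDF_HINTS when a file is created or opened, a program could pass the assembled string to setenv("PNETCDF_HINTS", buf, 1) before calling ncmpi_create. Setting the variable in the shell, as shown above, avoids that assumption.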

PnetCDF Hints for Controlling Burst Buffer Driver

Below is a list of hints for controlling the behavior of the Burst Buffer Driver.

Hint Key	Value	Default Value	Description
nc_burst_buf	enable or disable	disable	Whether to enable Burst Buffer Driver or not. The rest of the hints will be ignored if Burst Buffer Driver is disabled.
nc_burst_buf_dirname	a directory name	(see description)	The location on the burst buffer system used to store I/O data. On Cori at NERSC, for example, it is usually set to the value of the environment variable DW_JOB_PRIVATE or DW_JOB_STRIPED, depending on the access_mode set in the batch script. See the user guide How to use the Burst Buffer on Cori for guidance on picking the access mode. If nc_burst_buf_dirname is not set by the user, PnetCDF displays a warning and uses the directory that contains the NetCDF file.
nc_burst_buf_del_on_close	enable or disable	enable	Whether the intermediate files created by the Burst Buffer Driver on the burst buffer should be deleted after closing the NetCDF file. If the job scheduler recycles the burst buffer space automatically, users can disable this option to improve performance.
nc_burst_buf_flush_buffer_size	an integer in bytes	0	Amount of memory that PnetCDF may use in each MPI process when flushing data stored in the burst buffer to the parallel file system. 0 means unlimited. Note that any single write request larger than the buffer size will not be buffered. Instead, it is written to the parallel file system directly, bypassing the burst buffer.
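The interaction between nc_burst_buf_flush_buffer_size and the size of a write request can be summarized in a few lines. This is only a sketch of the rule as described in the table, not the driver's actual code, and the function name is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Sketch of the documented rule (hypothetical helper, not a PnetCDF API):
 * a flush buffer size of 0 means unlimited, so every request is buffered;
 * otherwise a request larger than the limit bypasses the burst buffer
 * and goes directly to the parallel file system. */
bool goes_through_burst_buffer(size_t request_bytes, size_t flush_buffer_size)
{
    if (flush_buffer_size == 0)
        return true;
    return request_bytes <= flush_buffer_size;
}
```

With a limit of 1048576 bytes, for example, a 4 MiB write request would bypass the burst buffer under this rule, while a 512 KiB request would be buffered.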

Example Program

Below is an example that creates a NetCDF file using Burst Buffer Driver. Other than adding a new MPI-IO hint, the PnetCDF program appears the same as before.

More example programs can be found under example/burst_buffer folder.

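    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>
    #include <pnetcdf.h>

    int main(int argc, char *argv[]) {
        int ncid, err;
        /* Output file name: first command-line argument, or a placeholder default */
        const char *filename = (argc > 1) ? argv[1] : "testfile.nc";
        MPI_Info info;

        MPI_Init(&argc, &argv);
        MPI_Info_create(&info);

        /* Enable the Burst Buffer Driver.
         * The hint is not required if it is set in the environment variable PNETCDF_HINTS
         */
        MPI_Info_set(info, "nc_burst_buf", "enable");

        /* Create a NetCDF file with the hint to enable the Burst Buffer Driver */
        err = ncmpi_create(MPI_COMM_WORLD, filename, NC_CLOBBER, info, &ncid);
        if (err != NC_NOERR) {
            fprintf(stderr, "ncmpi_create failed: %s\n", ncmpi_strerror(err));
            MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
        }

        MPI_Info_free(&info);

        /* For doing other I/O operations, the code is the same as usual.
         * No actions are required after file creation.
         */

        /* Data stored in the burst buffer will be flushed automatically to the PFS when the file is closed */
        err = ncmpi_close(ncid);
        if (err != NC_NOERR)
            fprintf(stderr, "ncmpi_close failed: %s\n", ncmpi_strerror(err));

        MPI_Finalize();
        return 0;
    }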

Running Jobs Using the Burst Buffer Driver

This section briefly describes how to submit jobs that use the Burst Buffer Driver on Cori at NERSC.

Requesting Burst Buffer Access

To request burst buffer access for a job, users add a "#DW jobdw" command to the batch script. Users can specify the capacity, access mode, and type of the burst buffer. We highly recommend the use of private access mode. The only type supported for now is scratch, in which the burst buffer is mounted as a file system to the user application.
Users can also choose from 2 granularity levels, wlm_pool and sm_pool. A burst buffer server hosts 20.14 GiB of data. As a result, the stripe count is determined by the capacity of the burst buffer space. For example, if we want the stripe count to be at least 64, we have to set the capacity to at least 20.14 * 64 = 1288.96 GiB even if we don't need that much space. The stripe size is fixed at 8 MiB. Users have no control over the stripe size.

Here is an example command that requests burst buffer space in sm_pool with a stripe count of at least 64.

    #DW jobdw capacity=1289GiB access_mode=private type=scratch
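The capacity needed for a given stripe count follows directly from the 20.14 GiB per-server figure quoted above. The sketch below works in hundredths of a GiB to stay in integer arithmetic and rounds up to a whole GiB. The function name is hypothetical, and the per-server figure is an assumption that should be checked against the current system configuration.

```c
#include <assert.h>

/* Capacity contributed by one burst buffer server, in hundredths of a GiB
 * (20.14 GiB, the figure quoted in this document). */
#define SERVER_CENTI_GIB 2014L

/* Hypothetical helper: smallest whole-GiB capacity that spans at least
 * stripe_count burst buffer servers. */
long min_capacity_gib(long stripe_count)
{
    assert(stripe_count > 0);
    /* Round up to the next whole GiB */
    return (SERVER_CENTI_GIB * stripe_count + 99) / 100;
}
```

For a stripe count of 64, min_capacity_gib(64) returns 1289, which matches the capacity used in the example command.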

For more information, please refer to the "How to use the Burst Buffer" and "Performance Tuning" pages on the NERSC website.

Configuring the Parallel File System

The Burst Buffer Driver does not require any specific support from the parallel file system that stores the NetCDF file. However, for best performance, we suggest using a stripe count and stripe size no smaller than those used on the burst buffer. To set the stripe count and stripe size for a directory under Lustre on Cori, run the "lfs setstripe" command.

Here's an example of setting the stripe count to 64 and the stripe size to 8 MiB for the current working directory on a Lustre file system on Cori:

    lfs setstripe -c 64 -s 8m .

For more information, please refer to the "Optimizing I/O performance on the Lustre file system" page on the NERSC website and "Configuring Lustre File Striping" on the Lustre wiki site.

Example Job Script

Below is an example script for running the FLASH I/O benchmark using the burst buffer on Cori at NERSC. For detailed information about burst buffer settings, users can refer to How to use the Burst Buffer on Cori. This example script can also be found in the example/burst_buffer folder.

    #!/bin/bash
    #SBATCH -p debug
    #SBATCH -N 1
    #SBATCH -C haswell
    #SBATCH -t 00:01:00
    #SBATCH -o burst_buffer_FLASH_example.txt
    #SBATCH -L scratch
    #DW jobdw capacity=1289GiB access_mode=private type=scratch
    NNodes=${SLURM_NNODES}
    NProc=$((NNodes * 32))

    export PNETCDF_HINTS="nc_burst_buf=enable;nc_burst_buf_del_on_close=disable;nc_burst_buf_dirname=${DW_JOB_PRIVATE}"

    srun -n ${NProc} ./flash_benchmark_io ${SCRATCH}/flash_

Known Issues

While we design the Burst Buffer Driver to be as transparent as possible, there are some behaviors that can change when the Burst Buffer Driver is used.

Error reporting: The error codes returned from write-related APIs are the errors encountered when writing to the burst buffer system. When flushing data from the burst buffer to the parallel file system, the error code returned will be the first error encountered.
Non-blocking I/O: Non-blocking write requests are written to the burst buffer as well. However, for performance reasons, non-blocking requests may be flushed to the parallel file system before the wait API, ncmpi_wait or ncmpi_wait_all, is called. In that case, users are not able to cancel those non-blocking requests, and a call to ncmpi_cancel returns the error code NC_EFLUSHED.
Unsupported API type: Currently, the Burst Buffer Driver does not support buffering on ncmpi_put_vard and ncmpi_put_vard_all. When these APIs are called, the data will be written to the parallel file system directly, bypassing the burst buffer.

Questions and Answers

Q: Can I run the Burst Buffer Driver on a different machine with a different burst buffer configuration?
A: So far, we have tested the Burst Buffer Driver only on Cori at NERSC with the Cray burst buffer system installed there. It may also work on other burst buffer implementations where the burst buffer is mounted as a file system.

Q: Can I run the Burst Buffer Driver without the burst buffer?
A: Yes, as long as the hint "nc_burst_buf_dirname" is a valid directory, PnetCDF will use it as a burst buffer. However, performance will depend on the storage device that backs that directory.

Q: Do I need to enable DataWarp module on Cori to build PnetCDF with Burst Buffer Driver support?
A: No, we currently do not use DataWarp APIs in the implementation.

Q: Can I leave the data on burst buffer without flushing it to the parallel file system?
A: No, in this version, the data will always be flushed when the file is closed. We do not recommend terminating the program without closing the file.

Q: What are the names of the intermediate files that the Burst Buffer Driver uses to buffer I/O data? Will they overwrite my existing files if there is a name conflict?
A: The intermediate files are named <NetCDF file name>_<ncid>_<rank>.meta and <NetCDF file name>_<ncid>_<rank>.data. To prevent overwriting other files by accident, the Burst Buffer Driver will not proceed when there is a file name conflict. In that case, ncmpi_create and ncmpi_open will fail with error code NC_EEXIST.
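To check a directory for possible conflicts ahead of time, the expected names can be generated from the pattern above. The sketch below assumes that the ncid and rank are formatted as plain decimal integers, which the pattern suggests but does not state. The function name is hypothetical and not part of PnetCDF.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build the two intermediate file names for one process
 * following the documented pattern <NetCDF file name>_<ncid>_<rank>.meta/.data.
 * Assumes ncid and rank are printed as decimal integers.
 * Returns 0 on success, or -1 if nc_name is empty or a name does not fit
 * in cap bytes. */
int burst_buffer_file_names(const char *nc_name, int ncid, int rank,
                            char *meta, char *data, size_t cap)
{
    int m, d;

    if (strlen(nc_name) == 0)
        return -1;
    m = snprintf(meta, cap, "%s_%d_%d.meta", nc_name, ncid, rank);
    d = snprintf(data, cap, "%s_%d_%d.data", nc_name, ncid, rank);
    if (m < 0 || d < 0 || (size_t)m >= cap || (size_t)d >= cap)
        return -1;
    return 0;
}
```

A caller could then test each generated name in the directory given by nc_burst_buf_dirname, for instance with access() or stat(), before creating the NetCDF file.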

Q: Does the directory used by burst buffer to store I/O data need to be an empty folder?
A: No, there's no such restriction. However, we strongly recommend the use of an empty folder to prevent errors caused by file name conflict.

Q: Is there a way to trigger a flush without closing the NetCDF file?
A: Yes, calling ncmpi_sync triggers a flush. A flush is also triggered if the user waits on a non-blocking write request that is still buffered on the burst buffer.

Source Code Checkout

Use the following command to download the source code.

    git clone https://github.com/Parallel-NetCDF/PnetCDF.git

Initialize the autoconf utility settings using the command:

    autoreconf -i

Next, run the configure command to build the PnetCDF library. For more information about running configure, see the README files in the doc folder.

Please email your questions to <parallel-netcdf@lists.mcs.anl.gov>