eXtensible Data Model and Format
The need for a standardized method to exchange scientific data between High Performance Computing codes and tools lead to the development of the eXtensible Data Model and Format (XDMF) . Uses for XDMF range from a standard format used by HPC codes to take advantage of widely used visualization programs like ParaView, to a mechanism for performing coupled calculations using multiple, previously stand alone codes.
Data format refers to the raw data to be manipulated. Information like number type ( float, integer, etc.), precision, location, rank, and dimensions completely describe the any dataset regardless of its size. The description of the data is also separate from the values themselves. We refer to the description of the data as Light data and the values themselves as Heavy data. Light data is small and can be passed between modules easily. Heavy data may be potentially enormous; movement needs to be kept to a minimum. Due to the different nature of heavy and light data, they are stored using separate mechanisms. Light data is stored using XML, Heavy data is typically stored using HDF5. While we could have chosen to store the light data using HDF5 attributes using XML does not require every tool to have access to the compiled HDF5 libraries in order to perform simple operations.
Data model refers to the intended use of the data. For example, a three dimensional array of floating point vales may be the X,Y,Z geometry for a grid or calculated vector values. Without a data model, it is impossible to tell the difference. Since the data model only describes the data, it is purely light data and thus stored using XML. It is targeted at scientific simulation data concentrating on scalars, vectors, and tensors defined on some type of computational grid. Structured and Unstructured grids are described via their topology and geometry. Calculated, time varying data values are described as attributes of the grid. The actual values for the grid geometry, connectivity, and attribute values are contained in the data format. This separation of data format and model allows HPC codes to efficiently produce and store vales in a convenient manner without being encumbered by our data model which may be different from their internal arrangement.
XDMF uses XML to store Light data and to describe the data Model. HDF5 is used to store Heavy data. The data Format is stored redundantly in both XML and HDF5. This allows tools to parse XML to determine the resources that will be required to access the Heavy data.
The data model in XDMF stored in XML provides the knowledge of what is represented by the Heavy data. In this model, HPC data is viewed as a hierarchy of Domains. A Domain must contain at least one Grid. A Grid is the basic representation of both the geometric and computed/measured values. A Grid is considered to be a group of elements with Structured or Unstructured Topology and their associated values. In addition to the topology of the Grid, Geometry, specifying the X, Y, and Z positions of the Grid is required. Finally, a Grid may have one or more Attributes. Attributes are used to store any other value associated with the grid and may be referenced to the Grid or to individual cells that comprise the Grid.
The concept of separating the light data from the heavy data is critical to the performance of this data model and format. HPC codes can read and write data in large, contiguous chunks that are natural to their internal data storage, to achieve optimal I/O performance. If codes were required to significantly re-arrange data prior to I/O operations, data locality, and thus performance, could be adversely affected, particularly on codes that attempt to make maximum use of memory cache. The complexity of the dataset is described in the light data portion, which is small and transportable. For example, the light data might specify a topology of one million hexaherda while the heavy data would contain the geometric XYZ values of the mesh and pressure values at the cell centers stored in large, contiguous arrays. This key feature will allow reusable tools to be built that do not put onerous requirements on HPC codes. Despite the complexity of the organization described in the XML below, the HPC code only needs to produce the three HDF5 datasets for geometry, connectivity, and pressure values.
While not required, a C++ API is provided to read and write XDMF data. This API has also been wrapped so it is available from popular languages like Python, Tcl, and Java. The API is not necessary in order to produce or consume XDMF data. Currently several HPC codes that already produced HDF5 data, use native text output to produce the XML necessary for valid XDMF.
How do I ...
- Join the Xdmf mailing list here
- To edit these pages, send an email to Dave DeMarle with XDMF in the subject that includes your preferred user name and email address in the message body.