PreVent Tools is a set of tools designed to facilitate conversion of physiological data monitoring data to differnet formats. Physiological data monitors and their constellation of tools generally create data in specific formats, and interoperability is a problem. PreVent Tools seeks to provide conversion capabilities to/from a variety of formats, and also provide "native" format for archiving and sharing.
| Monitor/Format | Input | Output |
|---|---|---|
| HDF5 (PreVent Native) | X | X |
| WFDB | X | X |
| STP XML (versions 6-8) | X3 | |
| STP (GE) | X1 | |
| STP (Philips MX800) | X1,2 | |
| Matlab 7.X | X | |
| CPC | X | |
| Data Warehouse Connect | X | |
| TDMS | X | |
| MEDI | X | |
| Auton Lab | X | |
| CSV | X2 | X2 |
1 Experimental support
2 Waveforms not implemented
3 Supports gzipped/compressed/zipped files as well
PreVent Tools comprises two tools at this time: formatconverter is a command-line tool for converting between formats, while preventtools provides several useful tools for working with the data. preventtools primarily works with the native HDF5 format.
The easiest way to run PreVent Tools is to use the docker images for formatconverter and preventtools. However, we have tried to make building the tools as easy as possible. If you are interested, we develop on Ubuntu, using NetBeans, though neither is required.
Dependencies to build these programs (ubuntu)
- g++
- pkg-config
- libhdf5-dev
- libmatio-dev
- libexpat-dev
- libzip-dev
- wfdb-10.6.0 (https://archive.physionet.org/physiotools/wfdb-linux-quick-start.shtml)
- TDMSpp (https://github.com/Ostrich-Emulators/TDMSpp)
- googletest 1.8.0 (for unit testing [optional]) (https://github.com/google/googletest)
handy:
- hdf5-tools
This package uses CMake. To install the software:
> cmake -DCMAKE_BUILD_TYPE=Release CMakeLists.txt
> make
> sudo make install
Every file format organizes data differently. Sometimes data is stored sequentially; sometimes in parallel; sometimes in segments or chunks. With all the options available to designers, physiological data developers decided: "yes!" So, here we are.
At a high-level, formatconverter simply selects the "reader" to read the input, a "writer" to write the output, and then coordinates their operation. Each reader supports one or more input formats, and is responsible for organizing the data into standardized temporary data structures until enough data has been read to necessitate writing. The writer then flushes the data to one or more output files. Each input and output format may consist of one or more files. WFDB is an example of a format that spans multiple files.
Because input files can be very large, care has been taken to ensure that only as much data is read as is necessary. For example, the STP XML DOM is not read into memory at once; it is read piecemeal to populate internal data structures, and flushed as needed. If a large amount of data is needed before writing, extra data is cached to disk. This progressive reading and caching algorithm is part of all the readers, and keeps memory usage very low. The formatconverter API for creating readers and writers is documented in code only at this time.
Also, the native output files are usually 15-60% smaller than the input file.
formatconverter has a number of features worth mentioning, though not all formats support all features:
- millisecond time resolution
- per-signal frequency
- per-signal auxillary data
- pre-signal arbitrary metadata
- arbitrary global metadata
- waveforms and vitals data are distinct
- "event" support
- cross-platform
- automatic input format resolution
Additionally, there are a number of optional features available:
- compression
- local time/GMT time handling
- anonymization
- set arbitrary start date
- create SQLite database of metadata during conversion
- create new file per day/patient
- store timing information as offset from start date
- arbitrary output file naming conventions
formatconverter accepts a variety of command line options to enable features or change default behaviors. The table below describes these options
| Long Option | Short Option | Valid Arguments {Default} | Description |
|---|---|---|---|
| --from | -f | wfdb, hdf5, stpxml, stpge, stpp, cpcxml, tdms, medi, dwc {auto} | Specify the input format |
| --to | -t | wfdb, hdf5, mat4, mat7, au, nop | Specify the output format |
| --compression | -z | 0-9 {6} | Compression level |
| --sqlite | -s | db file | Create/Append SQLite metadata database |
| --quiet | -q | Print less stuff to console (repeat to further lessen output) | |
| --verbose | -v | Print more stuff to console (repeat to further increase output) | |
| --stop-after-one | -1 | Stop conversion after first file is generated. Useful for troubleshooting | |
| --localtime | -l | Convert times to local time | |
| --offset | -Z | time string (MM/DD/YYYY) or seconds since 01/01/1970 | Shift dates by the desired amount |
| --opening-date | -S | time string (MM/DD/YYYY) or seconds since 01/01/1970 | Shift dates so that the first time in the output is the given date |
| --no-break or --one-file | -n | Do not split output files by day (convenience for --split 0) | |
| --no-cache | -C | Do not cache anything to disk | |
| --time-step | -T | Store timing information as offset from start of file | |
| --anonymize | -a | Attempt to anonymize the output files | |
| --release | -R | Show release information and exit | |
| --pattern | -p | format string | Set the output file naming pattern |
| --skip-waves | -w | Skip waves during reading and writing files | |
| --tmpdir | -m | Place all temporary files in the specified directory | |
| --split | -x | 'm[idnight]' or <0-9>[h] {midnight} | roll over the output files at midnight or every X hours. 'h' ensures rollover at the top of the hour. '0' disables rollover. Rollover times are affected by --localtime option. |
Because each input file can generate multiple output files, it is necessary to specify how those files should be named. This is accomplished using format specifiers within a string. The specifiers are:
%p - patient ordinal
%i - input filename (without directory or extension)
%d - input directory (with trailing separator)
%C - current directory (with trailing separator)
%x - input extension
%m - modified date of input file
%c - creation date of input file
%D - date of conversion
%s - date of first data point
%e - date of last data point
%T - time of first data point (24hr clock)
%E - time of last data point (24hr clock)
%o - output file ordinal
%M - value of the MRN metadata attribute
%t - the --to option's extension (e.g., hdf5, wfdb)
%S - same as %d%i-p%p-%s.%t
The default output filename pattern is %d%i.%t, that is, the output filename is the input file name with a different extension. Note that some specifiers are similar to command-line options; these are separate concepts.
The "native" output format for formatconverter is HDF5. We consider this "native" because it is the most feature-rich output format, and creates the smallest on-disk files. HDF5 is a flexible data format that is essentially a filesystem within a file. Data is organized in Groups and Datasets, an approach that is perfect for storing physiological data.
The formatconverter's native format consists of three main Groups: Events, VitalSigns, and Waveforms, though VitalSigns and Waveforms use the exact same structure, and are separated merely for ease of use. Two other groups, Calculated_Data and Auxillary_Data may be present at the root level as well.
The file itself contains metadata useful for understanding/troubleshooting the data:
- Build Number The formatconverter that generated the file
- Source Reader The reader that generated the data
- Filename The original input file name
- Layout Version A number that denotes how the data is organized in the file. Should the layout change between versions, this number will orient the user/other tools to the actual format.
- HDF5 Version The HDF5 version for this file.
The file metadata, and all signal Datasets contain timing information. At the file level, this metadata are the "global" values for all Datasets (e.g., Start Time is the earliest start time of all the signals). Signals are not required to start and/or stop at the same time. Times need not be contiguous, though it is expected they will be sorted chronologically from earliest to latest.
- Duration The total duration for this Dataset
- Start Time The earliest time value contained in the Dataset
- Start Date/Time an ISO8601 version of Start Time
- End Time The last time value contained in the Dataset
- End Date/Time an ISO8601 version of End Time
- Timezone The timezone of Start Time and End Time
If the source input format supports metadata or attributes, these are duplicated in the file's metadata.
Lastly, every Dataset has a Columns attribute that describes the data each column of the Dataset. An example of this attribute might be "timestamp (ms), segment offset", telling the user that the first column is a timestamp, and the second is something called "segment offset." All Dataset columns have the same data type.
The Events Group contains one main Dataset: Global_Times. This is a Dataset containing a list of all times in the other Datasets. This Dataset is generally useful in conjunction with the --time-step option. Times are always in milliseconds since the Unix Epoch.
Segment_Offsets may also exist in the Events group. This is auxillary data provided by the STP XML reader.
VitalSigns and Waveforms use the same structure, so they are described together here. A signal Group comprises two Datasets: data and time, plus metadata specific to the signal:
- Data Label The name of the signal. Signal names are cleaned to make the Dataset name easily useable by other tools. Things like spaces or punctuation are removed. This attribute provides the "raw" name of the signal.
- Unit of Measure The unit of the data points
- Sample Period (ms) What is the duration of a single unit of time?
- Readings Per Sample How many data points are present in a unit of time? In general, the difference between a waveform and a vital sign is this number: vital signs have 1 reading per sample, while waveforms have >1.
The data Dataset contains the data points for this signal. It is usually a single-column Dataset, but it can contain multiple columns if needed. For example, some monitors provide "quality" metrics about each reading. data Datasets always have an integer data type (short of regular). data-specific metadata include:
- Missing Value Marker A specific value that represents a missing value. This is useful primarily with waveform data.
- Scale A scaling factor used to convert floating-point numbers to integers. To calculate the "raw" value, divide the data point by 10scale
- Min Value The smallest raw data point value in the Dataset.
- Max Value The largest raw data point value in the Dataset.
As with the file metadata, if an input format supports per-signal metadata, it is duplicated in the data Dataset.
The time Dataset contains the timing information the data points in data. It is always a single-column of long numbers. time contains a single attribute to help users/tools interpret the data: Time Source is either raw or indexed. If it is raw, the times in the column are actual times. If indexed, the times are index numbers to Events/Global_Times. Actual time values are always in milliseconds since the Unix Epoch.
Both Calculated_Data and Auxillary_Data Groups have the same basic structure as Signal Groups--data and time Datasets-- but with different semantics. The Calculated_Data Group is used for separating data that has been added after the initial conversion. Very often, it is useful to convert a file, and then calculate some other values (e.g. RR intervals) based on it. This data goes in the Calculated_Data Group. Note that Calculated_Data's times are not required to be present in the Events/Global_Times Dataset.
The Auxillary_Data Dataset is to support input formats that provide time series that are neither vitals or waveforms. For example, DWC has a "Wall Times" Dataset. Auxillary_Data's data Dataset differs from other data Datasets because its data type is string instead on integer. In addition to the root level, each signal can have an arbitrary number of auxillary Datasets.
The CSV format is very basic: The first column is the time, and the subsequent columns are vitals data. Times can be either millisecond or second resolution.
Metadata for the CSV format is added from a separate metadata file. This file has the same filename as the CSV file, but the CSV extension (if any) is replaced with
".meta." For example, The test.meta file contains metadata for the test.csv file. The metadata file must be in the same directory as the CSV file, and is optional.
The metadata file format is basically a CSV file with three fields, using | as the separator. The first field is the location to set the metadata. This must be either /
for the file metadata, or /VitalSigns/HR HR signal. The second field is the attribute to set, and the third field is the string value to set. Only string values
are supported at this time.
A sample file:
/ | Bed | 5YE-4
/VitalSigns/HR|Unit of Measure|Bpm
/VitalSigns/SPO2|Unit of Measure|%