Table Of Contents

Previous topic

gromacs.formats – Accessing various files

Next topic

Gromacs XPM file format

This Page

Simple xmgrace XVG file format

Gromacs produces graphs in the xmgrace (“xvg”) format. These are simple multi-column data files. The class XVG encapsulates access to such files and adds a number of methods to access the data (as NumPy arrays), compute aggregates, or quickly plot it.

The XVG class is useful beyond reading xvg files. With the array keyword or the XVG.set() method one can load data from an array instead of a file. The array should be simple “NXY” data (typically: first column time or position, further columns scalar observables). The data should be a NumPy numpy.ndarray array a with shape (M, N) where M-1 is the number of observables and N the number of observations, e.g.the number of time points in a time series. a[0] is the time or position and a[1:] the M-1 data columns.

Errors

The XVG.error attribute contains the statistical error for each timeseries. It is computed from the standard deviation of the fluctuations from the mean and their correlation time. The parameters for the calculations of the correlation time are set with XVG.set_correlparameters().

Plotting

The XVG.plot() and XVG.errorbar() methods are set up to produce graphs of multiple columns simultaneously. It is typically assumed that the first column in the selected (sub)array contains the abscissa (“x-axis”) of the graph and all further columns are plotted against the first one.

Data selection

Plotting from XVG is fairly flexible as one can always pass the columns keyword to select which columns are to be plotted. Assuming that the data contains [t, X1, X2, X3], then one can

  1. plot all observable columns (X1 to X3) against t:

    xvg.plot()
    
  2. plot only X2 against t:

    xvg.plot(columns=[0,2])
    
  3. plot X2 and X3 against t:

    xvg.plot(columns=[0,2,3])
    
  4. plot X1 against X3:

    xvg.plot(columns=[2,3])
    

Coarse grainining of data

It is also possible to coarse grain the data for plotting (which typically results in visually smoothing the graph because noise is averaged out).

Currently, two alternative algorithms to produce “coarse grained” (decimated) graphs are implemented and can be selected with the method keyword for the plotting functions in conjuction with maxpoints (the number of points to be plotted):

  1. mean histogram (default) — bin the data (using numkit.timeseries.regularized_function() and compute the mean for each bin. Gives the exact number of desired points but the time data are whatever the middle of the bin is.
  2. smooth subsampled — smooth the data with a running average (other windows like Hamming are also possible) and then pick data points at a stepsize compatible with the number of data points required. Will give exact times but not the exact number of data points.

For simple test data, both approaches give very similar output.

For the special case of periodic data such as angles, one can use the circular mean (“circmean”) to coarse grain. In this case, jumps across the -180º/+180º boundary are added as masked datapoints and no line is drawn across the jump in the plot. (Only works with the simple XVG.plot() method at the moment, errorbars or range plots are not implemented yet.)

See also

XVG.decimate()

Examples

In this example we generate a noisy time series of a sine wave. We store the time, the value, and an error. (In a real example, the value might be the mean over multiple observations and the error might be the estimated error of the mean.)

>>> import numpy as np
>>> import gromacs.formats
>>> X = np.linspace(-10,10,50000)
>>> yerr = np.random.randn(len(X))*0.05
>>> data = np.vstack((X, np.sin(X) + yerr, np.random.randn(len(X))*0.05))
>>> xvg = gromacs.formats.XVG(array=data)

Plot value for all time points:

>>> xvg.plot(columns=[0,1], maxpoints=None, color="black", alpha=0.2)

Plot bin-averaged (decimated) data with the errors, over 1000 points:

>>> xvg.errorbar(maxpoints=1000, color="red")

(see output in Figure Plot of Raw vs Decimated data)

plot of a raw noisy sin(x) graph versus its decimated version

Plot of Raw vs Decimated data. Example of plotting raw data (sine on 50,000 points, gray) versus the decimated graph (reduced to 1000 points, red line). The errors were also decimated and reduced to the errors within the 5% and the 95% percentile. The decimation is carried out by histogramming the data in the desired number of bins and then the data in each bin is reduced by either numpy.mean() (for the value) or scipy.stats.scoreatpercentile() (for errors).

In principle it is possible to use other functions to decimate the data. For XVG.plot(), the method keyword can be changed (see XVG.decimate() for allowed method values). For XVG.errorbar(), the method to reduce the data values (typically column 1) is fixed to “mean” but the errors (typically columns 2 and 3) can also be reduced with error_method = “rms”.

If one wants to show the variation of the raw data together with the decimated and smoothed data then one can plot the percentiles of the deviation from the mean in each bin:

>>> xvg.errorbar(columns=[0,1,1], maxpoints=1000, color="blue", demean=True)

The demean keyword indicates if fluctuations from the mean are regularised [1]. The method XVG.plot_coarsened() automates this approach and can plot coarsened data selected by the columns keyword.

Footnotes

[1]When error_method = “percentile” is selected for XVG.errorbar() then demean does not actually force a regularisation of the fluctuations from the mean. Instead, the (symmetric) percentiles are computed on the full data and the error ranges for plotting are directly set to the percentiles. In this way one can easily plot the e.g. 10th-percentile to 90th-percentile band (using keyword percentile = 10).

Classes and functions

class gromacs.fileformats.xvg.XVG(filename=None, names=None, array=None, permissive=False, **kwargs)

Class that represents the numerical data in a grace xvg file.

All data must be numerical. NAN and INF values are supported via python’s float() builtin function.

The array attribute can be used to access the the array once it has been read and parsed. The ma attribute is a numpy masked array (good for plotting).

Conceptually, the file on disk and the XVG instance are considered the same data. Whenever the filename for I/O (XVG.read() and XVG.write()) is changed then the filename associated with the instance is also changed to reflect the association between file and instance.

With the permissive = True flag one can instruct the file reader to skip unparseable lines. In this case the line numbers of the skipped lines are stored in XVG.corrupted_lineno.

A number of attributes are defined to give quick access to simple statistics such as

  • mean: mean of all data columns
  • std: standard deviation
  • min: minimum of data
  • max: maximum of data
  • error: error on the mean, taking correlation times into account (see also XVG.set_correlparameters())
  • tc: correlation time of the data (assuming a simple exponential decay of the fluctuations around the mean)

These attributes are numpy arrays that correspond to the data columns, i.e. :attr:`XVG.array`[1:].

Note

  • Only simple XY or NXY files are currently supported, not Grace files that contain multiple data sets separated by ‘&’.
  • Any kind of formatting (i.e. xmgrace commands) is discarded.

Initialize the class from a xvg file.

Arguments :
filename

is the xvg file; it can only be of type XY or NXY. If it is supplied then it is read and parsed when XVG.array is accessed.

names

optional labels for the columns (currently only written as comments to file); string with columns separated by commas or a list of strings

array

read data from array (see XVG.set())

permissive

False raises a ValueError and logs and errior when encountering data lines that it cannot parse. True ignores those lines and logs a warning—this is a risk because it might read a corrupted input file [False]

stride

Only read every stride line of data [1].

savedata

True includes the data (XVG.array` and associated caches) when the instance is pickled (see pickle); this is oftens not desirable because the data are already on disk (the xvg file filename) and the resulting pickle file can become very big. False omits those data from a pickle. [False]

metadata

dictionary of metadata, which is not touched by the class

array

Represent xvg data as a (cached) numpy array.

The array is returned with column-first indexing, i.e. for a data file with columns X Y1 Y2 Y3 ... the array a will be a[0] = X, a[1] = Y1, ... .

decimate(method, a, **kwargs)

Decimate data a to maxpoints using method.

If a is a 1D array then it is promoted to a (2, N) array where the first column simply contains the index.

If the array contains fewer than maxpoints points or if maxpoints is None then it is returned as it is. The default for maxpoints is 10000.

Valid values for the reduction method:

  • “mean”, uses XVG.decimate_mean() to coarse grain by averaging the data in bins along the time axis
  • “circmean”, uses XVG.decimate_circmean() to coarse grain by calculating the circular mean of the data in bins along the time axis. Use additional keywords low and high to set the limits. Assumes that the data are in degrees.
  • “min” and “max* select the extremum in each bin
  • “rms”, uses XVG.decimate_rms() to coarse grain by computing the root mean square sum of the data in bins along the time axis (for averaging standard deviations and errors)
  • “percentile” with keyword per: XVG.decimate_percentile() reduces data in each bin to the percentile per
  • “smooth”, uses XVG.decimate_smooth() to subsample from a smoothed function (generated with a running average of the coarse graining step size derived from the original number of data points and maxpoints)
Returns :numpy array (M', N') from a (M', N) array with M' == M (except when M == 1, see above) and N' <= N (N' is maxpoints).
decimate_circmean(a, maxpoints, **kwargs)

Return data a circmean-decimated on maxpoints.

Histograms each column into maxpoints bins and calculates the weighted circular mean in each bin as the decimated data, using numkit.timeseries.circmean_histogrammed_function(). The coarse grained time in the first column contains the centers of the histogram time.

If a contains <= maxpoints then a is simply returned; otherwise a new array of the same dimensions but with a reduced number of maxpoints points is returned.

Keywords low and high can be used to set the boundaries. By default they are [-pi, +pi].

This method returns a masked array where jumps are flagged by an insertion of a masked point.

Note

Assumes that the first column is time and that the data are in degrees.

Warning

Breaking of arrays only works properly with a two-column array because breaks are only inserted in the x-column (a[0]) where y1 = a[1] has a break.

decimate_error(a, maxpoints, **kwargs)

Return data a error-decimated on maxpoints.

Histograms each column into maxpoints bins and calculates an error estimate in each bin as the decimated data, using numkit.timeseries.error_histogrammed_function(). The coarse grained time in the first column contains the centers of the histogram time.

If a contains <= maxpoints then a is simply returned; otherwise a new array of the same dimensions but with a reduced number of maxpoints points is returned.

Note

Assumes that the first column is time.

Does not work very well because often there are too few datapoints to compute a good autocorrelation function.

decimate_max(a, maxpoints, **kwargs)

Return data a max-decimated on maxpoints.

Histograms each column into maxpoints bins and calculates the maximum in each bin as the decimated data, using numkit.timeseries.max_histogrammed_function(). The coarse grained time in the first column contains the centers of the histogram time.

If a contains <= maxpoints then a is simply returned; otherwise a new array of the same dimensions but with a reduced number of maxpoints points is returned.

Note

Assumes that the first column is time.

decimate_mean(a, maxpoints, **kwargs)

Return data a mean-decimated on maxpoints.

Histograms each column into maxpoints bins and calculates the weighted average in each bin as the decimated data, using numkit.timeseries.mean_histogrammed_function(). The coarse grained time in the first column contains the centers of the histogram time.

If a contains <= maxpoints then a is simply returned; otherwise a new array of the same dimensions but with a reduced number of maxpoints points is returned.

Note

Assumes that the first column is time.

decimate_min(a, maxpoints, **kwargs)

Return data a min-decimated on maxpoints.

Histograms each column into maxpoints bins and calculates the minimum in each bin as the decimated data, using numkit.timeseries.min_histogrammed_function(). The coarse grained time in the first column contains the centers of the histogram time.

If a contains <= maxpoints then a is simply returned; otherwise a new array of the same dimensions but with a reduced number of maxpoints points is returned.

Note

Assumes that the first column is time.

decimate_percentile(a, maxpoints, **kwargs)

Return data a percentile-decimated on maxpoints.

Histograms each column into maxpoints bins and calculates the percentile per in each bin as the decimated data, using numkit.timeseries.percentile_histogrammed_function(). The coarse grained time in the first column contains the centers of the histogram time.

If a contains <= maxpoints then a is simply returned; otherwise a new array of the same dimensions but with a reduced number of maxpoints points is returned.

Note

Assumes that the first column is time.

Keywords :
per
percentile as a percentage, e.g. 75 is the value that splits the data into the lower 75% and upper 25%; 50 is the median [50.0]
decimate_rms(a, maxpoints, **kwargs)

Return data a rms-decimated on maxpoints.

Histograms each column into maxpoints bins and calculates the root mean square sum in each bin as the decimated data, using numkit.timeseries.rms_histogrammed_function(). The coarse grained time in the first column contains the centers of the histogram time.

If a contains <= maxpoints then a is simply returned; otherwise a new array of the same dimensions but with a reduced number of maxpoints points is returned.

Note

Assumes that the first column is time.

decimate_smooth(a, maxpoints, window='flat')

Return smoothed data a decimated on approximately maxpoints points.

  1. Produces a smoothed graph using the smoothing window window; “flat” is a running average.
  2. select points at a step size approximatelt producing maxpoints

If a contains <= maxpoints then a is simply returned; otherwise a new array of the same dimensions but with a reduced number of points (close to maxpoints) is returned.

Note

Assumes that the first column is time (which will never be smoothed/averaged), except when the input array a is 1D and therefore to be assumed to be data at equidistance timepoints.

TODO: - Allow treating the 1st column as data

default_color_cycle = ['black', 'red', 'blue', 'orange', 'magenta', 'cyan', 'yellow', 'brown', 'green']

Default color cycle for XVG.plot_coarsened(): ['black', 'red', 'blue', 'orange', 'magenta', 'cyan', 'yellow', 'brown', 'green']

default_extension = 'xvg'

Default extension of XVG files.

error

Error on the mean of the data, taking the correlation time into account.

See [FrenkelSmit2002] p526:

error = sqrt(2*tc*acf[0]/T)

where acf() is the autocorrelation function of the fluctuations around the mean, y-<y>, tc is the correlation time, and T the total length of the simulation.

[FrenkelSmit2002]D. Frenkel and B. Smit, Understanding Molecular Simulation. Academic Press, San Diego 2002
errorbar(**kwargs)

errorbar plot for a single time series with errors.

Set columns keyword to select [x, y, dy] or [x, y, dx, dy], e.g. columns=[0,1,2]. See XVG.plot() for details. Only a single timeseries can be plotted and the user needs to select the appropriate columns with the columns keyword.

By default, the data are decimated (see XVG.plot()) for the default of maxpoints = 10000 by averaging data in maxpoints bins.

x,y,dx,dy data can plotted with error bars in the x- and y-dimension (use filled = False).

For x,y,dy use filled = True to fill the region between y±dy. fill_alpha determines the transparency of the fill color. filled = False will draw lines for the error bars. Additional keywords are passed to pylab.errorbar().

By default, the errors are decimated by plotting the 5% and 95% percentile of the data in each bin. The percentile can be changed with the percentile keyword; e.g. percentile = 1 will plot the 1% and 99% perentile (as will percentile = 99).

The error_method keyword can be used to compute errors as the root mean square sum (error_method = “rms”) across each bin instead of percentiles (“percentile”). The value of the keyword demean is applied to the decimation of error data alone.

See also

XVG.plot() lists keywords common to both methods.

ma

Represent data as a masked array.

The array is returned with column-first indexing, i.e. for a data file with columns X Y1 Y2 Y3 ... the array a will be a[0] = X, a[1] = Y1, ... .

inf and nan are filtered via numpy.isfinite().

max

Maximum of the data columns.

maxpoints_default = 10000

Aim for plotting around that many points

mean

Mean value of all data columns.

min

Minimum of the data columns.

parse(stride=None)

Read and cache the file as a numpy array.

Store every stride line of data; if None then the class default is used.

The array is returned with column-first indexing, i.e. for a data file with columns X Y1 Y2 Y3 ... the array a will be a[0] = X, a[1] = Y1, ... .

plot(**kwargs)

Plot xvg file data.

The first column of the data is always taken as the abscissa X. Additional columns are plotted as ordinates Y1, Y2, ...

In the special case that there is only a single column then this column is plotted against the index, i.e. (N, Y).

Keywords :
columns : list

Select the columns of the data to be plotted; the list is used as a numpy.array extended slice. The default is to use all columns. Columns are selected after a transform.

transform : function

function transform(array) -> array which transforms the original array; must return a 2D numpy array of shape [X, Y1, Y2, ...] where X, Y1, ... are column vectors. By default the transformation is the identity [lambda x: x].

maxpoints : int

limit the total number of data points; matplotlib has issues processing png files with >100,000 points and pdfs take forever to display. Set to None if really all data should be displayed. At the moment we simply decimate the data at regular intervals. [10000]

method

method to decimate the data to maxpoints, see XVG.decimate() for details

color

single color (used for all plots); sequence of colors (will be repeated as necessary); or a matplotlib colormap (e.g. “jet”, see matplotlib.cm). The default is to use the XVG.default_color_cycle.

kwargs

All other keyword arguments are passed on to pylab.plot().

plot_coarsened(**kwargs)

Plot data like XVG.plot() with the range of all data shown.

Data are reduced to maxpoints (good results are obtained with low values such as 100) and the actual range of observed data is plotted as a translucent error band around the mean.

Each column in columns (except the abscissa, i.e. the first column) is decimated (with XVG.decimate()) and the range of data is plotted alongside the mean using XVG.errorbar() (see for arguments). Additional arguments:

Kewords :
maxpoints

number of points (bins) to coarsen over

color

single color (used for all plots); sequence of colors (will be repeated as necessary); or a matplotlib colormap (e.g. “jet”, see matplotlib.cm). The default is to use the XVG.default_color_cycle.

method

Method to coarsen the data. See XVG.decimate()

The demean keyword has no effect as it is required to be True.

read(filename=None)

Read and parse xvg file filename.

set(a)

Set the array data from a (i.e. completely replace).

No sanity checks at the moment...

set_correlparameters(**kwargs)

Set and change the parameters for calculations with correlation functions.

The parameters persist until explicitly changed.

Keywords :
nstep

only process every nstep data point to speed up the FFT; if left empty a default is chosen that produces roughly 25,000 data points (or whatever is set in ncorrel)

ncorrel

If no nstep is supplied, aim at using ncorrel data points for the FFT; sets XVG.ncorrel [25000]

force

force recalculating correlation data even if cached values are available

kwargs

see numkit.timeseries.tcorrel() for other options

std

Standard deviation from the mean of all data columns.

tc

Correlation time of the data.

See XVG.error() for details.

write(filename=None)

Write array to xvg file filename in NXY format.

Note

Only plain files working at the moment, not compressed.

gromacs.fileformats.xvg.break_array(a, threshold=3.141592653589793, other=None)

Create a array which masks jumps >= threshold.

Extra points are inserted between two subsequent values whose absolute difference differs by more than threshold (default is pi).

Other can be a secondary array which is also masked according to a.

Returns (a_masked, other_masked) (where other_masked can be None)