# Statistics of the dynamics

Once qualitative intuition has been gained by plotting time series from a few
samples (see Plotting samples), one can quantitatively inspect the
dynamics of a dataset using `tunacell`'s pre-defined tools.

Univariate and bivariate analysis tools are coded in `tunacell` to describe
the statistics of a single observable and of a couple of observables,
respectively.

We start with a background introduction to these concepts. Then we set up the
session and show `tunacell`'s procedures.

Note

This manual page is rather tedious. For a more hands-on approach, open, read,
and run the scripts `univariate-analysis.py`, `univariate-analysis-2.py`, and
`bivariate-analysis.py`. The background information and figures on this page
may serve as side help.

## Background

We consider a stochastic process \(x(t)\).

### One-point functions

One-point functions are statistical estimates of functions of a single time-point. Typical one-point functions are the average at a given time

\[\langle x(t) \rangle\]

where the notation \(\langle \cdot \rangle\) means taking the ensemble average of the quantity; or the variance

\[\sigma^2(t) = \langle \left( x(t) - \langle x(t) \rangle \right)^2 \rangle\]

Inspecting these functions provides a first quantitative view of the studied process.

In our case, time series are indexed to acquisition times \(t_i\), where \(i=0,\ 1,\ 2,\ \dots\). Usually

\[t_i = t_{\mathrm{offset}} + i \, \Delta t\]

where \(\Delta t\) is the time interval between two frame acquisitions, and \(t_{\mathrm{offset}}\) is an offset that sets the origin of times.

Then, if we denote by \(s^{(1)}_i\) the number of cells acquired at time index
\(i\), the average value of observable \(x\) at that time is
evaluated in `tunacell` as:

\[\langle x(t_i) \rangle \approx \frac{1}{s^{(1)}_i} \sum_{k=1}^{s^{(1)}_i} x^{(k)}(t_i)\]

where \(x^{(k)}(t)\) is the value of observable \(x\) in cell \(k\) at time \(t\).
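As a minimal illustration of these one-point estimates (plain NumPy, outside of tunacell):

```python
import numpy as np

# toy samples: values of observable x for cells alive at each time-point
# (sample counts s_i differ between time-points, as in a growing colony)
samples = {
    0.0: np.array([1.0, 2.0, 3.0]),       # s_0 = 3 cells at t = 0
    5.0: np.array([2.0, 4.0, 6.0, 8.0]),  # s_1 = 4 cells at t = 5
}

counts = {t: len(values) for t, values in samples.items()}
average = {t: values.mean() for t, values in samples.items()}

print(counts)   # {0.0: 3, 5.0: 4}
print(average)  # {0.0: 2.0, 5.0: 5.0}
```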

### Two-point functions

Two-point functions are statistical estimates of functions of two time-points. The typical two-point function is the auto-correlation function, defined as:

\[a(t_i, t_j) = \langle \left( x(t_i) - \langle x(t_i) \rangle \right) \left( x(t_j) - \langle x(t_j) \rangle \right) \rangle\]

In `tunacell` it is estimated using:

\[a_{ij} \approx \frac{1}{s^{(2)}_{ij}} \sum_{k=1}^{s^{(2)}_{ij}} \left( x^{(k)}(t_i) - \langle x(t_i) \rangle \right) \left( x^{(k)}(t_j) - \langle x(t_j) \rangle \right)\]

where the sum over \(k\) runs over lineages connecting times \(t_i\) to \(t_j\) (there are \(s^{(2)}_{ij}\) such lineages).

With our method, for a cell living at time \(t_i\), there will be at most one associated descendant cell at time \(t_j\). There may be more descendants living at \(t_j\), but only one is picked at random according to our lineage decomposition procedure.
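A sketch of this estimator in plain NumPy, using synthetic arrays in place of tunacell's lineage decomposition:

```python
import numpy as np

# each row is one lineage time series, sampled at times t_0 .. t_3
lineages = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 1.0, 2.0, 1.0],
    [3.0, 3.0, 1.0, 2.0],
])

mean = lineages.mean(axis=0)   # <x(t_i)> at each time-point
dev = lineages - mean          # deviations from the ensemble average
s2 = lineages.shape[0]         # s^(2)_ij: here every lineage spans all time-points
autocov = dev.T @ dev / s2     # matrix of coefficients a_ij

# for identical times the coefficient reduces to the variance
print(np.allclose(np.diag(autocov), lineages.var(axis=0)))  # True
```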

For identical times, the auto-correlation function reduces to the variance:

\[a(t_i, t_i) = \sigma^2(t_i)\]

Under the stationarity hypothesis, the auto-correlation function depends only on the time difference:

\[a(t_i, t_j) = \tilde{a}(t_i - t_j)\]

(and the function \(\tilde{a}\) is symmetric: \(\tilde{a}(-u)=\tilde{a}(u)\)).

In `tunacell`, there is a special procedure to estimate \(\tilde{a}\): the
lineage decomposition is used to generate time series, and then, for a given
time interval \(u\), all couples of points \((t_i, t_j)\) such that
\(u = t_i - t_j\) are collected, and the average is performed over all these
samples.
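This pooling procedure can be sketched in plain NumPy (a naive illustration of the idea, not tunacell's implementation):

```python
import numpy as np

def stationary_autocov(lineages, times, u):
    """Estimate a~(u): pool all couples (t_i, t_j) with t_i - t_j == u."""
    dev = lineages - lineages.mean()   # subtract the global mean (stationarity)
    pooled = []
    for i, ti in enumerate(times):
        for j, tj in enumerate(times):
            if np.isclose(ti - tj, u):
                pooled.extend(dev[:, i] * dev[:, j])
    return np.mean(pooled)

times = np.arange(4) * 5.0                      # t_i = 0, 5, 10, 15
rng = np.random.default_rng(0)
lineages = rng.normal(size=(200, times.size))   # toy stationary series

# at u = 0 the estimate reduces to the pooled variance,
# and the function is symmetric: a~(-u) == a~(u)
var0 = stationary_autocov(lineages, times, 0.0)
```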

We can extend this computation to two observables \(x(t)\) and \(y(t)\). The relevant quantity is the cross-correlation function

\[c(s, t) = \langle \left( x(s) - \langle x(s) \rangle \right) \left( y(t) - \langle y(t) \rangle \right) \rangle\]

which we estimate through cross-correlation coefficients

\[c_{ij} \approx \frac{1}{s^{(2)}_{ij}} \sum_{k=1}^{s^{(2)}_{ij}} \left( x^{(k)}(t_i) - \langle x(t_i) \rangle \right) \left( y^{(k)}(t_j) - \langle y(t_j) \rangle \right)\]

Again under the stationarity hypothesis, the cross-correlation function depends only on the time difference:

\[c(s, t) = \tilde{c}(s - t)\]

though now the function \(\tilde{c}\) need not be symmetric.
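Why \(\tilde{c}\) need not be symmetric can be seen with a toy sketch (plain NumPy, independent of tunacell): if \(x\) drives \(y\) with a one-step delay, the correlation is strong at positive lag but vanishes at negative lag:

```python
import numpy as np

def ctilde(x, y, lag):
    """Toy stationary cross-covariance <dx(t) dy(t + lag)>."""
    dx, dy = x - x.mean(), y - y.mean()
    if lag < 0:
        return ctilde(y, x, -lag)
    if lag == 0:
        return np.mean(dx * dy)
    return np.mean(dx[:-lag] * dy[lag:])

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
y = np.roll(x, 1)   # y(t) = x(t - 1): x drives y one step later

forward = ctilde(x, y, 1)    # close to var(x), i.e. near 1
backward = ctilde(x, y, -1)  # close to 0: the function is asymmetric
```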

Note

At this stage of development, no special care has been taken to ensure ideal properties of our statistical estimates, such as unbiasedness. These estimates should therefore be interpreted with caution.

## Warming-up

We start with:

```
from tunacell import Experiment, Observable, FilterSet
from tunacell.filters.cells import FilterCellIDparity
exp = Experiment('~/tmptunacell/simutest')
# define a condition
even = FilterCellIDparity('even')
condition = FilterSet(label='evenID', filtercell=even)
```

Note

The condition used in this example serves only as a test; we do not expect the subgroup of cells with even identifiers to differ from the whole population. However, it roughly halves the sample, which lets us appreciate finite-size effects.

In this example, we look at the following dynamic observable:

```
ou = Observable(name='exact-growth-rate', raw='ou')
```

The `ou` (Ornstein-Uhlenbeck) observable models instantaneous growth rate.
Since the data come from a numerical simulation, we know some statistics of
this process exactly. We import some of them from the metadata:

```
md = exp.metadata
params = md['ornstein_uhlenbeck_params']
ref_mean = params['target']
ref_var = params['noise']/(2 * params['spring'])
ref_decayrate = params['spring']
```

## Starting with the univariate analysis

To investigate the statistics of a single observable over time, `tunacell`
uses the lineage decomposition to parse samples and computes one- and
two-point functions incrementally.

Estimated one-point functions are the number of samples and the average value at each time-point. Estimated two-point functions are the correlation matrix between any couple of time-points, which reduces to the variance for identical times.

The module `tunacell.stats.api` stores most of the functions to be used.
To perform the computations, we import
`tunacell.stats.api.compute_univariate()` and call it:

```
from tunacell.stats.api import compute_univariate
univ = compute_univariate(exp, ou, cset=[condition, ])
```

This function computes one-point and two-point functions as described above
and stores the results in `univ`, a `tunacell.stats.single.Univariate`
instance. Results are reported for the unconditioned data, under the `master`
label, and for each of the conditions provided in the `cset` list. Each
individual group is an instance of
`tunacell.stats.single.UnivariateConditioned`, whose attributes point directly
to the estimated one- and two-point functions.
These items can be accessed as values of a dictionary:

```
result = univ['master']
result_conditioned = univ[repr(condition)]
```

As the master is always defined, one can alternatively use the attribute:

```
result = univ.master
```

### Inspecting univariate results

The objects `result` and `result_conditioned` are instances of the
`UnivariateConditioned` class, which exposes the following attributes:
`time`, `count_one`, `average`, `count_two`, and `autocorr`; these are NumPy
arrays.

To be explicit, the `time` array holds the times \(t_i\) at which the
observable has been evaluated.
The `count_one` array stores the corresponding sample counts
\(s^{(1)}_i\) (see Background), and the `average` array
stores the average values \(\langle x(t_i) \rangle\).

One can see an excerpt of the table of one-point functions by typing:

```
result.display_onepoint(10) # 10 lines excerpt
```

which should be like:

```
time counts average std_dev
0 0.0 200 0.011725 0.001101
1 5.0 207 0.011770 0.001175
2 10.0 225 0.011780 0.001201
3 15.0 253 0.011766 0.001115
4 20.0 265 0.011694 0.001119
5 25.0 286 0.011635 0.001149
6 30.0 301 0.011627 0.001147
7 35.0 318 0.011592 0.001173
8 40.0 337 0.011564 0.001189
9 45.0 354 0.011578 0.001150
```

The `count_two` 2d array stores the matrix elements \(s^{(2)}_{ij}\),
i.e. the number of independent lineages connecting time
\(t_i\) to \(t_j\), and the `autocorr` attribute stores the
matrix elements \(a_{ij}\) (auto-covariance coefficients).
The `std_dev` column of the previous table is in fact computed as the
square root of the diagonal of this auto-covariance matrix (the diagonal
being the variance at each time-point).
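This relation is easy to check on a toy matrix (plain NumPy, not tunacell code):

```python
import numpy as np

# toy auto-covariance matrix (diagonal entries are variances)
autocov = np.array([[4.0, 1.0],
                    [1.0, 9.0]])

std_dev = np.sqrt(np.diag(autocov))
print(std_dev)   # [2. 3.]
```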

An excerpt of the auto-covariance function can be printed:

```
result.display_twopoint(10)
```

which should produce something like:

```
time-row time-col counts autocovariance
0 0.0 0.0 200 1.211721e-06
1 0.0 5.0 200 1.093628e-06
2 0.0 10.0 200 7.116838e-07
3 0.0 15.0 200 3.415255e-07
4 0.0 20.0 200 6.881773e-07
5 0.0 25.0 200 1.027559e-06
6 0.0 30.0 200 1.053278e-06
7 0.0 35.0 200 5.925049e-07
8 0.0 40.0 200 -7.884958e-08
9 0.0 45.0 200 -8.413113e-08
```

### Examples

To fix ideas, here is how to plot the sample average as a function of time for the whole statistical ensemble:

```
import matplotlib.pyplot as plt
plt.plot(univ.master.time, univ.master.average)
plt.show()
```

If one wants to plot the variance as a function of time for the `condition`
results:

```
import numpy as np
res = univ[repr(condition)]
plt.plot(res.time, np.diag(res.autocorr))
```

To obtain a representation of the auto-correlation function, we set a time of reference and find the closest index in the time array:

```
tref = 80.
iref = np.argmin(np.abs(res.time - tref))  # index in time array
plt.plot(res.time, res.autocorr[iref, :])
```

Such a plot represents the autocorrelation \(a(t_{\mathrm{ref}}, t)\) as a function of \(t\).
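As noted above, `autocorr` stores auto-covariance coefficients; to display correlation coefficients instead (unit diagonal, values bounded by 1), one can normalize by the standard deviations. A plain NumPy sketch:

```python
import numpy as np

# toy auto-covariance matrix
autocov = np.array([[4.0, 2.0],
                    [2.0, 9.0]])

std = np.sqrt(np.diag(autocov))
corrcoef = autocov / np.outer(std, std)   # unit diagonal, entries in [-1, 1]
print(corrcoef[0, 1])   # 2 / (2 * 3) = 0.333...
```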

We will see below some pre-defined plotting capabilities.

### Computations can be exported as text files

To save the computations, just type:

```
univ.export_text()
```

This convenient function exports computations as text files, under a folder structure that stores the context of the computation such as the filter set, the various conditions that have been applied, and the different observables over which computation has been performed:

```
simutest/analysis/filterset/observable/condition
```

The advantage of such an export is that parameters from a previous analysis can be re-loaded in a different session.

### Plotting results

`tunacell` comes with the following plotting functions:

```
from tunacell.plotting.dynamics import plot_onepoint, plot_twopoints
```

These functions work with `tunacell.stats.single.Univariate` instances such
as our results stored in `univ`:

```
fig = plot_onepoint(univ, mean_ref=ref_mean, var_ref=ref_var, show_ci=True, save=True)
```

One-point plots are saved in the `simutest/analysis/filterset/observable`
folder, since all conditions are represented.

The first figure, stored in `fig`, shows the estimated one-point functions over time.

We can represent two-point functions:

```
fig2 = plot_twopoints(univ, condition_label='master', trefs=[40., 80., 150.],
                      show_exp_decay=ref_decayrate)
```

The second figure, stored in `fig2`, shows the auto-correlation functions for the chosen times of reference.

The view of auto-correlation functions at specific times of reference is not enough to quantify the decay and to associate a correlation time. A clever trick to gain statistics is to pool all data where the process is stationary and to evaluate \(\tilde{a}\) numerically.

## Computing the auto-correlation function under stationarity

By inspecting the average and variance in the one-point function figure above, the user can estimate whether the process is stationary and where (over the whole time course, or just over a subset of it). The user is prompted to define regions where the studied process is (or might be) stationary. These regions are saved automatically:

```
# %% define region(s) for steady-state analysis
from tunacell.stats.utils import Regions, CompuParams  # import path may vary across tunacell versions

# the Regions object is initialized on the experiment
regs = Regions(exp)
print(regs)  # shows previously defined regions
# then use one of the defined regions
region = regs.get('ALL')  # we take the entire time course
```

Computation options need to be provided. They dictate how the mean value is subtracted: either the global mean over all time-points within the region, or the local, time-dependent average value; and how segments are sampled: disjointly or not. Default settings use the global mean value and disjoint segments:

```
# define computation options
from tunacell.stats.utils import CompuParams  # import path may vary across tunacell versions
options = CompuParams()
```

To compute the stationary auto-correlation function \(\tilde{a}\) use:

```
from tunacell.stats.api import compute_stationary
stat = compute_stationary(univ, region, options)
```

The first argument is the `Univariate` instance `univ`, the second argument is
the time region over which to accept samples, and the third is the computation
options.

Here our process is stationary by construction over the whole time period of the simulation, so we choose the 'ALL' region. Our options are to subtract the global average value of the process, and to accept only disjoint segments for a given time interval: this ensures that samples used for a given time interval are independent (as long as the process is Markovian), so that we can estimate the confidence interval from the standard deviation of all samples for that time interval.
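The disjoint-segment sampling can be sketched as follows (an illustration of the principle, not tunacell's actual implementation):

```python
def disjoint_pairs(n_points, lag_steps):
    """Index couples (i, j) with j - i == lag_steps, taken from
    non-overlapping segments of the time series."""
    assert lag_steps >= 1
    pairs = []
    i = 0
    while i + lag_steps < n_points:
        pairs.append((i, i + lag_steps))
        i += lag_steps   # next segment starts where this one ends
    return pairs

# from a 10-point series, three disjoint segments of lag 3 can be drawn
print(disjoint_pairs(10, 3))   # [(0, 3), (3, 6), (6, 9)]
```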

`stat` is an instance of `tunacell.stats.single.StationaryUnivariate`,
which is structured in the same way with respect to `master` and conditions.
Each of its items (*e.g.* `stat.master`, or `stat[repr(condition)]`) is
an instance of `tunacell.stats.single.StationaryUnivariateConditioned`
and stores information in the following attributes:

- `time`: the 1d array of time interval values,
- `counts`: the 1d array of corresponding sample counts,
- `autocorr`: the 1d array of values of the auto-correlation function \(\tilde{a}\) for the corresponding time intervals,
- `dataframe`: a `pandas.DataFrame` instance collecting the data points used in the computation; each row corresponds to a single data point (in a single cell), with the acquisition time, the cell identifier, the value of the observable, and as many boolean columns as there are conditions (plus the master, i.e. no condition), indicating whether the sample has been taken. This dataframe is convenient for drawing *e.g.* marginal distributions.

### Plotting results

`tunacell` provides a plotting function that returns a `Figure` instance:

```
from tunacell.plotting.dynamics import plot_stationary
fig = plot_stationary(stat, show_exp_decay=ref_decayrate, save=True)
```

The first argument must be a `tunacell.stats.single.StationaryUnivariate`
instance. The second parameter displays an exponential decay to compare with
the data.

### Exporting results as text files

Again it is possible to export results as text files under the same folder structure by typing:

```
stat.export_text()
```

This will create a tab-separated text file called
`stationary_<region.name>.tsv` that can be read with any spreadsheet reader.

In addition, the dataframe of single time-point values is exported as a csv file under
the filterset folder as `data_<region.name>_<observable.label()>.csv`.

## A note on loading results

As described above, results can be saved in a specific folder structure that stores not only the numerical results but also the context (filterset, conditions, observables, regions).

Then it is possible to load results by parsing the folder structure and reading the text files. To do so, initialize an analysis object with some settings, and try to read results from files:

```
from tunacell.stats.api import load_univariate
# load univariate analysis of experiment defined in parser
univ = load_univariate(exp, ou, cset=[condition, ])
```

The last call will work only if the analysis has been performed and exported to text files before. Hence a convenient way to work is:

```
from tunacell.stats.api import compute_univariate, load_univariate
from tunacell.stats.single import UnivariateIOError  # import path may vary across versions

try:
    univ = load_univariate(exp, ou, cset=[condition, ])
except UnivariateIOError as uerr:
    print('Impossible to load univariate {}'.format(uerr))
    print('Launching computation')
    univ = compute_univariate(exp, ou, cset=[condition, ])
    univ.export_text()
```

## Bivariate analysis: cross-correlations

Key questions are to check which observables correlate, and how they correlate in time. The appropriate quantities to look at are the cross-correlation function \(c(s, t)\) and the stationary cross-correlation function \(\tilde{c}(\Delta t)\) defined above (see Background).

To estimate these functions, one first needs to have run the univariate
analyses on the corresponding observables. We take the univariate objects
corresponding to the `ou` and `gr` observables:

```
# local estimate of growth rate by differentiating the size measurement
# (the raw column 'exp_ou_int' plays the role of cell size in our simulations)
gr = Observable(name='approx-growth-rate', raw='exp_ou_int',
                differentiate=True, scale='log',
                local_fit=True, time_window=15.)
univ_gr = compute_univariate(exp, gr, [condition, ])
# import api functions
from tunacell.stats.api import (compute_bivariate,
                                compute_stationary_bivariate)
# compute cross-correlation matrix
biv = compute_bivariate(univ, univ_gr)
biv.export_text()
# compute cross-correlation function under stationarity hypothesis
sbiv = compute_stationary_bivariate(univ, univ_gr, region, options)
sbiv.export_text()
```

These objects again point to items corresponding to the unconditioned data and each of the conditions.

Again, for cross-correlation functions of two time-points (results stored in
`biv`), the low sample size limits the smoothness of the numerical estimate,
so we turn to the estimate under the stationarity hypothesis in order to pool
all samples.

### Inspecting cross-correlation results

We can inspect the `master` result:

```
master = biv.master
```

or any of the conditioned dataset:

```
cdt = biv[repr(condition)]
```

where `condition` is an item of each of the `cset` lists (one for each
`single` object). Important attributes are:

- `times`: a couple of lists of time sequences, corresponding respectively to the times evaluated for each item in `singles`, i.e. \(\{ \{s_i\}_i, \{t_j\}_j \}\), where \(\{s_i\}_i\) is the sequence of times at which the first `single` item has been evaluated, and \(\{t_j\}_j\) the sequence for the second. Note that the lengths \(p\) and \(q\) of these vectors may not coincide.
- `counts`: the \((p, q)\) matrix giving for entry \((i, j)\) the number of samples where an independent lineage has been drawn between times \(s_i\) and \(t_j\).
- `corr`: the \((p, q)\) matrix giving for entry \((i, j)\) the value of the estimated correlation \(c(s_i, t_j)\).

It is possible to export data in text format using:

```
biv.export_text()
```

It will create a new folder `<obs1>_<obs2>` under each condition folder and
store the items listed above in text files.

### Inspecting cross-correlation function at stationarity

In the same spirit:

```
master = sbiv.master
```

gets among its attributes `array`, which stores time intervals, counts,
and correlation values as a NumPy structured array. The `dataframe`
attribute points to a `pandas.DataFrame` that recapitulates single
time-point data in a table, with boolean columns for each condition.

It is possible to use the same plotting function used for stationary autocorrelation functions:

```
plot_stationary(sbiv, show_exp_decay=ref_decayrate)
```

which should produce a plot of the stationary cross-correlation function.