# Setting up your analysis¶

Once raw data files are organized following requirements in Input file format, analysis can get started. A first step is to follow the guidelines in 10 minute tutorial. Here we go into more detail about:

- how to parse your data,
- how to define the observable to look at, and
- how to define conditions.

Contents

## Experiment and filters¶

To start the analysis, you need to tell tunacell which experiment to analyse, and whether to apply filters.

### Loading an experiment¶

To set up the experiment, you have to give the path to the experiment folder on
your computer. We will denote this path as `<path-to-exp>`

, then use:

```
from tunacell import Experiment
exp = Experiment(<path-to-exp>)
```

By default, no filter is applied. But it is possible to associate a set of filters to an experiment, giving instructions to how data structures will be parsed.

### Defining the statistical ensemble¶

The statistical ensemble is the set of cells, lineages, colonies, and containers that are parsed to compute statistics. In some cases, you might be likely to remove outliers, such as cells carrying anomalous values.

To do so a `FilterSet`

instance must be defined
and associated it with the `Experiment`

object.
Detailed description about how to define filters and filter sets is in
Filters. Here we give a simple, concrete example. Suppose you’d like to
filter out cells that didn’t divide symmetrically. To do so, you first instantiate
the `FilterSymmetricDivision`

class:

```
from tunacell.filters.cells import FilterSymmetricDivision
myfilter = FilterSymmetricDivision(raw='length', lower_bound=0.4, upper_bound=0.6)
```

`length`

is used as the raw quantifier (assuming you have a column
`length`

in your data files). Such filter requires that the
daughter cell `length`

at birth must be bound within 40 and 60 percent
of the mother cell’s `length`

at division. Then:

```
from tunacell import FilterSet
myfset = FilterSet(filtercells=myfilter)
```

In the last line, the keyword argument specifies `filtercells`

since our
filter `myfilter`

acts on `Cell`

instances. You can define one filter
for each type of structures: `Cell`

, `Colony`

, `Lineage`

,
and `Container`

.

Once that a `FilterSet`

instance is defined, load it with:

```
exp.set_filter(fset)
```

Note

Filtering cell outliers may affect the tree structure, decomposing original tree in multiple subtrees where outlier node has been removed. Hence the number of trees generated from one container file depends on the filter applied to cells.

### Defining particular samples¶

All samples from an experiment are used for statistics, under the filtering assumption discussed above. However, visualization of trajectories is performed over a subset of reasonable size: this is what we’ll be calling small samples.

Small samples can be chosen specifically by user (*“I am intrigued by this
cell, let’s have a look on its trajectory”*), or randomly. To do so:

```
from tunacell import Parser
parser = Parser(exp)
```

Note that a sample is identified by a couple of labels: the container label, and the cell identifier. For example:

```
parser.add_sample({'container_label': 'FOV_001', 'cellID': 12})
```

or synonymously:

```
parser.add_sample(('FOV_001', 12))
```

This information is stored under the `samples`

, and you can get a
print of the registered samples with:

```
print(parser.info_samples())
```

You can also add randomly chosen samples:

```
parser.add_sample(10)
```

adds 10 such samples.

Please refer to Parser for more information about how to use it.

### Iterating through samples¶

The `exp`

object provides a set of iterators to parse data at each level,
with the appropriate applied filters:

`Container`

level with the method`iter_containers()`

, filtered at the container level,`Colony`

level with the method`iter_colonies()`

, filtered at the container, cell, and colony levels,`Lineage`

level with the method`iter_lineages()`

, filtered at the container, cell, colony, and lineages levels`Cell`

level with the method`iter_cells()`

, filtered at the container, cell, colony, and lineages levels.

The idea behind `tunacell`

is to decompose colonies into sets of lineages, *i.e.* into
sets of sequences of parentally linked cells. This way, it is possible
to extract time-series that span time ranges larger than single cell cycles.

Note

Decomposition in lineages is performed randomly: at cell division, one daughter cell is chosen randomly to be the next step in the lineage. This way, lineages are independent: a given cell belongs to one, and only one independent lineage.

#### Iterating over listed samples¶

Use above-mentioned methods on the `Parser`

instance.

See Parser for more details.

#### Iterating over all samples¶

Use above-mentioned methods on the `Experiment`

instance.

See Experiment for more details.

## Defining the observable¶

To define an observable, *i.e.* a measurable quantity that evolves through time,
use the `Observable`

class:

```
from tunacell import Observable
```

and instantiate it with parameters to define a particular observable.

First parameter is the name to give to the observable (to find it back in the analysis process).

Second, mandatory parameter is the column to use as raw data
(*e.g.* ‘length’, ‘size’, ‘fluo’, …).

Then, it is possible to use time-lapse data (as stored in data files, or processed using a time-derivative estimate) or to determine the value of said raw observable at a particular cell cycle stage, for example length at birth.

### Indicating raw data¶

First, one needs to indicate which column to be used in the raw data
file, by specifying `raw='<column-name>'`

.

When raw data is expected to be steady, or to be a linear function of time
within cell cycle, then use `scale='linear'`

(default setting). When it is
expected to be an exponential function of time within cell cycle, use
`scale='log'`

. We will mention below how this parameter affects some
procedures.

Raw data can be used as is, or further processed to provide user-defined observable. Two main modes are used to process raw data:

- The
*dynamics*mode is used when one wants to analyze observables for all time points; examples are: length, growth rate, … - The
*cell-cycle*mode indicates observables that are defined as a single value per cell cycle; examples are: length at birth, average growth rate, …

### Dynamic mode¶

It corresponds to the parameter `mode='dynamics'`

.
It sets automatically the *timing* parameter as `timing='t'`

where `t`

stands for time-lapse timing. It is meant to study observables for all time
points (time-lapse, dynamic analysis).

### Cell-cycle modes¶

Cell-cycle modes are used when one observable need to be quantified at the cell-cycle
level, *i.e.* quantified once per cell cycle.There are few cell cycle modes:

`mode='birth'`

: extrapolates values to estimate observable at cell birth;`mode='division'`

: extrapolates values to estimate observable at cell division;`mode='net-increase-additive'`

: returns the difference between division and birth values of observable;`mode='net-increase-multiplicative'`

: returns the ratio between division and birth values of observable;`mode='average'`

: returns the average value of observable along cell cycle;`mode='rate'`

: proceeds to a linear/exponential fit of observable depending on the chosen`scale`

parameter. In fact, the procedure always performs linear fits, when`scale='log'`

the log of raw data is used, thereby performing an exponential fit on raw data.

### Choosing the timing¶

For dynamic mode, the only associated timing is `t`

(stands for “time-lapse”).
The parameter `tref`

may be used to align time points. When provided as a
number, it will be substracted to acquisition time. A string code can be given,
`'root'`

that aligns data with the colony’s root cell division time (caution:
when filtering happens, some cells that were acquired at the middle of your
experiment can become root cells if their parent cell is an outlier; this may
affect dangerously the alignement of your time-series).

For cell-cycle modes it associates to the estimated observable a time-point to be chosen between:

`b`

: time at birth, when known;`d`

: time at division, when known;`m`

: time at mid-point trhough cell-cycle;`g`

: generation index, which can be used in conjunction with the parameter`tref`

. When the later is set to a floating number, generation index will be offset to the generation index of the cell’s ancestor that lived at this time of reference if it exists, otherwise, data from this lineage is discarded in analysis. When`tref=None`

, then the generation index is relative to the colony to which belongs current cell.

End-point values are estimated by extrapolation. This is because cell divisions
are recorded halfway between parent cell last frame and daughter cell first
frame. The extrapolation uses local fits over `join_points`

points.

Warning

generation index may be used with care in statistical estimates over the dynamics of dividing cells, since generation 0 for a given colony does not necessarily correspond to generation 0 of another colony.

### Differentiation¶

In *dynamics* mode, differentiation is obtained either by default using finite
differences with two consecutive points, either by a sliding window fit.
For an observable \(x(t)\), depending on the chosen scale, linear or log,
it returns the estimate of \(\frac{dx}{dt}\) or
\(\frac{d}{dt} \log x(t)\) respectively.

### Local fit estimates¶

As finite difference estimates of derivatives are very sensitive to measurement precision, the user can opt for a local fitting procedure.

This procedure can be applied to estimate derivatives, or values of the
observables by performing local linear fit of the scaled observable over
a given time window. To use said option, user needs to provide the time window
extent, *e.g.* `time_window=15`

, will proceed to a local fit over
a time window of 15 units of time (usually minutes).

Such a local fit procedure restricted to scanning cell-cycle time segments
would lead to a loss of exploitable times, as large as the time window,
for each cell. To deal with that, the procedure provide a way to use daughter
cell information to “fill data estimates” towards the end of cell-cycle.
The parameter `join_points=3`

indicates that end-point values are
estimated using 3 first frames, or 3 last frames.

Warning

Using local fitting procedure is likely to artificially correlate time points
over the time window time range. Such option can help with data visualization
since it smoothens measurement errors, but **extreme caution** is adviced when
this feature is used in statistical analysis.

### Examples¶

Let’s assume that raw data column names include `'length'`

and `'fluo'`

.

#### Example 1: length vs. time¶

This is the trivial example. We stay in dynamic mode, and we do not associate any further processing to collected data:

```
>>> length = Observable(name='length', raw='length')
```

#### Example 2: length at birth¶

We go to the corresponding cell-cycle mode with the appropriate timing:

```
>>> length_at_birth = Observable(name='birth-length', raw='length', mode='birth', timing='b')
```

Note

one could associate the observable length at birth with another timing,
*e.g.* time at mid cell cycle.

#### Example 3: Fluorescence production rate (finite differences)¶

```
>>> k = Observable(name='prod-rate', raw='fluo', differentiate=True)
```

#### Example 4: Fluorescence production rate (local fit)¶

We found that the later led to really noisy timeseries, so we choose to produce local estimates over 3 points, in an experiment where acquisition period is 4 minutes, it means to have a 12 minutes time-window:

```
>>> kw = Observable(name='window-prod-rate', raw='fluo', differentiate=True, scale='linear',
local_fit=True, time_window=12.)
```

It computes

using 12 minutes time windows.

#### Example 5: Fluorescence production rate (using local fits) at birth¶

And we want to have it as a function of generation index, setting 0 for cells that live at time 300 minutes:

```
>>> kw = Observable(name='window-prod-rate-at-birth'raw='fluo', differentiate=True, scale='linear',
local_fit=True, time_window=12.,
mode='birth', timing='g', tref=300.)
```

## Conditional analysis¶

We saw in Defining the statistical ensemble that one can define filters that act on
cells, or colonies, and to group them in a `FilterSet`

instance that
essentially sets the statistical ensemble over which analysis is performed.

There is another utility of these `FilterSet`

objects: they may define
sub-ensembles over which analysis is performed in order to compare results
over chosen sub-populations. One example is to “gate” cell-cycle quantifiers
and observe the statistics of the different sub-populations. Here we extend
the gating procedure to analyse any dynamic observable.

To do os, a list of `FilterSet`

instances, one per condition, can be
provided to our analysis functions. We refer to the following users pages for
further reading on how to use filters, see Filters, and how to run
statistical analysis Statistics of the dynamics.