Commit bb1420e

Little update to README.md
1 parent a0c3592 commit bb1420e

README.md

Lines changed: 35 additions & 30 deletions
@@ -2,8 +2,8 @@
22

33
[![Build Status](https://travis-ci.org/sandyjmacdonald/dots_for_microarrays.svg?branch=master)](https://travis-ci.org/sandyjmacdonald/dots_for_microarrays) [![Coverage Status](https://coveralls.io/repos/sandyjmacdonald/dots_for_microarrays/badge.svg?branch=master&service=github)](https://coveralls.io/github/sandyjmacdonald/dots_for_microarrays?branch=master)
44

5-
Dots is a Python package for working with microarray data.
6-
Its back-end is a standalone package for reading in, normalisation, statistical
5+
Dots is a Python package for working with microarray data.
6+
Its back-end is a standalone package for reading in, normalisation, statistical
77
analysis and plotting of Agilent single-colour microarray data. Its front-end
88
isn't finished yet (more on that below).
99

@@ -18,7 +18,7 @@ sudo pip install dots_for_microarrays
1818
**OR, ALTERNATIVELY:**
1919

2020
Dots has a number of dependencies including NumPy and SciPy and the least painful
21-
way of getting these is to use the
21+
way of getting these is to use the
2222
[Anaconda Python distribution](https://store.continuum.io/cshop/anaconda/) which includes
2323
NumPy and SciPy and a couple of the other required dependencies like Pandas, Scikit-learn
2424
and Bokeh.
@@ -35,7 +35,7 @@ Setuptools should take care of the dependencies but, in testing, I've found the
3535
and Scikit-learn installations to be problematic, hence my recommendation of using Anaconda
3636
to relieve those headaches.
3737

38-
Once you have Anaconda, if you'd like to install and use Dots in a fenced-off virtual
38+
Once you have Anaconda, if you'd like to install and use Dots in a fenced-off virtual
3939
environment that won't interfere with anything else, then you can do so as follows:
4040

4141
```
@@ -56,17 +56,17 @@ sudo python setup.py nosetests
5656
## What Dots does
5757

5858
1. Reads in a series of Agilent single-color array files.
59-
**It's important that your array files are named correctly, in order for Dots to work out
60-
to which group and replicate they belong e.g. for treated and untreated groups each with
59+
**It's important that your array files are named correctly, in order for Dots to work out
60+
to which group and replicate they belong e.g. for treated and untreated groups each with
6161
three replicates name the files `treated_1.txt, treated_2.txt, treated_3.txt, untreated_1.txt,
6262
untreated_2.txt, untreated_3.txt`.
6363
2. Normalises the data by log2-transforming, 75th percentile-shifting and setting the baseline
6464
to the median for each gene across all samples.
6565
3. Calculates fold changes and log fold changes for all of the pairs of groups.
66-
4. Runs either a T-test or ANOVA (determined automagically by the number of groups) with
66+
4. Runs either a T-test or ANOVA (determined automagically by the number of groups) with
6767
Benjamini-Hochberg p-value adjustment and a Tukey HSD post hoc test to determine significant
6868
pairs from the ANOVA.
69-
5. Provides a number of different visualisations of the data: box and whisker plots of the
69+
5. Provides a number of different visualisations of the data: box and whisker plots of the
7070
normalised data for each sample, a PCA plot of all of the samples, a hierarchically-clustered
7171
(by gene) heatmap for the significantly differentially expressed genes (> +/- 2-fold, p < 0.05),
7272
a plot of k-means clustered groups of genes with similar expression patterns across the samples,
@@ -78,7 +78,7 @@ and volcano plots for each pair of samples.
7878

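The file-naming convention described in step 1 can be illustrated with a short sketch. This is not the code Dots uses; `parse_array_filename` and `group_files` are hypothetical helper names, and the sketch assumes the replicate number is whatever follows the last underscore.

```python
import os
from collections import defaultdict


def parse_array_filename(path):
    """Split e.g. 'treated_1.txt' into ('treated', 1).

    Hypothetical helper: assumes '<group>_<replicate>.<ext>' naming.
    """
    base = os.path.splitext(os.path.basename(path))[0]
    group, sep, replicate = base.rpartition("_")
    if not sep or not group or not replicate.isdigit():
        raise ValueError("Expected '<group>_<replicate>' but got %r" % base)
    return group, int(replicate)


def group_files(paths):
    """Map each group name to its files, ordered by replicate number."""
    groups = defaultdict(list)
    for path in paths:
        group, replicate = parse_array_filename(path)
        groups[group].append((replicate, path))
    return {g: [p for _, p in sorted(items)] for g, items in groups.items()}


files = ["treated_2.txt", "untreated_1.txt", "treated_1.txt"]
print(group_files(files))
```

A file such as `sample.txt`, with no underscore and replicate number, raises a `ValueError` in this sketch, which is one way of surfacing a misnamed file early.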
7979
## What Dots will do in the future
8080

81-
1. Read the array data into an SQLite3 database, significantly speeding the whole workflow if
81+
1. Read the array data into an SQLite3 database, significantly speeding the whole workflow if
8282
you re-analyse your array data at a later date.
8383
2. Assess the quality of the arrays.
8484
3. Provide a web front-end to guide you through the workflow.
@@ -99,17 +99,17 @@ You can run it, for example, on the sample data included here (in `dots_sample_d
9999
python dots_workflow.py dots_sample_data -o sample_data_output
100100
```
101101

102-
The `-o` is an optional argument and, if you don't include it, then the output will be put in a
102+
The `-o` is an optional argument and, if you don't include it, then the output will be put in a
103103
folder named `output`.
104104

105105
## Getting your hands dirty
106106

107107
I've tried to comment the code as thoroughly as possible, so the best way to find out everything
108-
that it can do is to dig into the code. Currently, it's organised in three modules that handle
108+
that it can do is to dig into the code. Currently, it's organised in three modules that handle
109109
reading in the arrays, analysing them and the plotting. The docstrings allow you to get information
110110
about a function or class by typing, e.g. `help(run_stats)`.
111111

112-
The three modules - `dots_arrays`, `dots_analysis` and `dots_plotting` - are all part of the
112+
The three modules - `dots_arrays`, `dots_analysis` and `dots_plotting` - are all part of the
113113
`dots_backend` package. As an example, you can import `dots_arrays` by typing
114114

115115
```python
@@ -128,7 +128,7 @@ This module handles reading in individual arrays or a series of arrays as an exp
128128
has classes for each of these - the `Array` class and the `Experiment` class - that have a
129129
bunch of methods that you can run on them.
130130

131-
There are also two functions - `read_array` and `read_experiment` - that pretty much do what
131+
There are also two functions - `read_array` and `read_experiment` - that pretty much do what
132132
they say on the tin. Both of these return `Array` and `Experiment` instances that both have
133133
Pandas data frame attributes that contain the data. Where possible, the dots modules use
134134
Pandas data frames because... Pandas.
@@ -178,7 +178,7 @@ You can read in a whole experiment as follows:
178178
experiment = read_experiment(array_filenames, baseline=True)
179179
```
180180

181-
The `array_filenames` should be a list of filenames and the `baseline` option determines whether
181+
The `array_filenames` should be a list of filenames and the `baseline` option determines whether
182182
the baseline is set to the median.
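
To make the normalisation steps concrete (log2 transform, 75th percentile shift, optional baseline to the per-gene median), here is a minimal pure-Python sketch. It is an illustration under assumptions, not the Dots implementation: `percentile` and `normalise` are hypothetical names, the percentile uses linear interpolation, and the "shift" is assumed to mean subtracting each sample's 75th percentile.

```python
import math
from statistics import median


def percentile(values, q):
    """Linear-interpolation percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower, upper = math.floor(pos), math.ceil(pos)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def normalise(samples, baseline=True):
    """samples maps sample name -> raw intensities, in the same gene order."""
    # 1. log2-transform every intensity.
    logged = {n: [math.log2(v) for v in vals] for n, vals in samples.items()}
    # 2. Shift each sample so that its 75th percentile sits at zero.
    shifted = {}
    for name, vals in logged.items():
        p75 = percentile(vals, 75)
        shifted[name] = [v - p75 for v in vals]
    if not baseline:
        return shifted
    # 3. Subtract each gene's median across all samples.
    names = list(shifted)
    n_genes = len(shifted[names[0]])
    gene_medians = [median([shifted[n][i] for n in names]) for i in range(n_genes)]
    return {n: [v - m for v, m in zip(shifted[n], gene_medians)] for n in names}


raw = {"treated_1": [1, 2, 4, 8, 16], "untreated_1": [2, 4, 8, 16, 32]}
print(normalise(raw, baseline=False))
```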
183183

184184
The `Experiment` class is essentially a collection of `Array` instances with some neat methods to,
@@ -203,7 +203,7 @@ if you haven't already set the baseline to median.
203203
experiment = experiment.remove_sample('treated_1')
204204
```
205205

206-
This method will be of more use once the quality control features are added, allowing you to remove
206+
This method will be of more use once the quality control features are added, allowing you to remove
207207
samples that are of low quality before proceeding with the analysis and plotting.
208208

209209
### The read_annotations function
@@ -222,33 +222,33 @@ You can read them in as part of your `read_experiment` call as follows:
222222
experiment = read_experiment(array_filenames, baseline=True, annotations_file='annotations.txt')
223223
```
224224

225-
Note that the `annotations_file` is an
225+
Note that the `annotations_file` is an
226226

227227
## The dots_analysis module
228228

229-
This is the meat of the dots_backend.
229+
This is the meat of the dots_backend.
230230

231-
The `get_fold_changes` function is straightforward and just takes an experiment instance and
232-
returns a data frame with e.g. `FC_treated_untreated` and `logFC_treated_untreated` columns for
231+
The `get_fold_changes` function is straightforward and just takes an experiment instance and
232+
returns a data frame with e.g. `FC_treated_untreated` and `logFC_treated_untreated` columns for
233233
each pair of groups in the experiment. Use it as follows:
234234

235235
```python
236236
fold_changes = get_fold_changes(experiment)
237237
```
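
For readers wondering what those columns contain, the following sketch shows one common way to compute them for a single gene from log2-normalised values. It is an assumption that Dots uses the difference of group means; the hypothetical `fold_changes_for_gene` helper only borrows the `FC_`/`logFC_` column naming from the description above.

```python
from itertools import combinations
from statistics import mean


def fold_changes_for_gene(values, groups):
    """values: sample name -> log2 expression for one gene.
    groups: group name -> list of sample names in that group.
    """
    result = {}
    for a, b in combinations(groups, 2):
        # With log2 data, the log fold change is a difference of means.
        log_fc = mean([values[s] for s in groups[a]]) - mean([values[s] for s in groups[b]])
        result["logFC_%s_%s" % (a, b)] = log_fc
        result["FC_%s_%s" % (a, b)] = 2 ** log_fc
    return result


values = {"treated_1": 3.0, "treated_2": 3.0, "untreated_1": 1.0, "untreated_2": 1.0}
groups = {"treated": ["treated_1", "treated_2"], "untreated": ["untreated_1", "untreated_2"]}
print(fold_changes_for_gene(values, groups))
```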
238238

239-
The `run_stats` function is similarly simple. It automagically decides whether to run just a
239+
The `run_stats` function is similarly simple. It automagically decides whether to run just a
240240
T-test (if there are two groups) or to run an ANOVA and Tukey HSD post hoc (if there are three
241241
or more groups), and also adjusts the p values with a Benjamini-Hochberg correction. It
242242
returns a data frame with `p_val` and `p_val_adj` columns. The significances from the post
243-
hoc test are in columns in the data frame named e.g. `significant_treated_untreated`. Use it
243+
hoc test are in columns in the data frame named e.g. `significant_treated_untreated`. Use it
244244
as follows:
245245

246246
```python
247247
stats = run_stats(experiment)
248248
```
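
The Benjamini-Hochberg correction behind the `p_val_adj` column can be sketched in a few lines. This is the standard step-up procedure written in plain Python for illustration; `benjamini_hochberg` is a hypothetical name and this is not necessarily how `run_stats` performs the adjustment internally.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```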
249249

250250
There's a simple `run_pca` function that is used by the `do_pcaplot` function in the `dots_plotting`
251-
module. It returns a data frame with the x/y coordinates from the first two principal components.
251+
module. It returns a data frame with the x/y coordinates from the first two principal components.
252252
Use it as follows:
253253

254254
```python
@@ -260,11 +260,11 @@ do slightly different things. Both functions have an option to select either k-m
260260
clustering.
261261

262262
The `find_clusters` function returns a list of cluster numbers in the same order as the rows in the
263-
experiment data frame. If the method is hierarchical - `how='hierarchical'` - then the number of
263+
experiment data frame. If the method is hierarchical - `how='hierarchical'` - then the number of
264264
clusters is set at the square root of (number of rows divided by two), a good approximation. If the
265265
method is k-means - `how='kmeans'` - then values of k (the number of clusters) from 3 to 10 are tested
266266
using silhouette analysis and the best value picked. An additional argument passed to the function -
267-
`k_range=(3,51)` allows you to increase the number of values tested to, in this example, 50. Here's
267+
`k_range=(3,51)` allows you to increase the number of values tested to, in this example, 50. Here's
268268
how to get a list of clusters with either hierarchical or k-means clustering:
269269

270270
```python
@@ -277,7 +277,7 @@ filters the data frame down to only significantly differentially expressed genes
277277
things up considerably, especially for the k-means clustering, and makes the heat maps more compact).
278278
It returns the filtered data frame along with an extra column `cluster` that contains the cluster
279279
numbers. As with the `find_clusters` function, it allows hierarchical or k-means clustering to be
280-
selected. Here's how to get a filtered data frame with clusters with either hierarchical or k-means
280+
selected. Here's how to get a filtered data frame with clusters with either hierarchical or k-means
281281
clustering:
282282

283283
```python
@@ -320,16 +320,21 @@ do_heatmap(experiment, show=False, image=False, html_file='heatmap.html')
320320
do_clusters_plot(experiment, show=True, image=False, html_file='clustersplot.html')
321321
```
322322

323-
As you'll see, they all take an experiment instance and have a number of other optional arguments.
324-
The `show=False/True` argument determines whether the plot is shown in your browser after it is
325-
generated, with the default being false. The `image=False/True` argument determines whether a PNG
323+
As you'll see, they all take an experiment instance and have a number of other optional arguments.
324+
The `show=False/True` argument determines whether the plot is shown in your browser after it is
325+
generated, with the default being false. The `image=False/True` argument determines whether a PNG
326326
format image of the plot is created in addition to the HTML version. Lastly, the `html_file='boxplot.html'`
327327
allows you to specify a custom filename for your HTML plot (this is also used for the image filename).
328328

329+
As of version 0.2.0, the box plot outliers are limited to a total of 250,000 glyphs across all of the
330+
samples. Also, the heatmaps are limited to 2,500 rows, by tuning the fold change cutoff until there are
331+
less than 2,500 rows left. These limitations help to prevent strange behaviour when Bokeh deals with a
332+
really large number of glyphs.
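
The heatmap row limit described above can be pictured as a simple loop that raises the fold change cutoff until few enough rows remain. The sketch below is only a guess at the idea: `tune_cutoff`, its starting cutoff and its step size are hypothetical and do not come from the Dots source.

```python
def tune_cutoff(abs_log_fcs, max_rows=2500, start=1.0, step=0.1):
    """Raise the |log2 fold change| cutoff until fewer than max_rows remain.

    abs_log_fcs: absolute log2 fold changes, one per candidate row.
    Returns the final cutoff and the number of rows kept.
    """
    if max_rows < 1:
        raise ValueError("max_rows must be at least 1")
    cutoff = start
    kept = [v for v in abs_log_fcs if v >= cutoff]
    while len(kept) >= max_rows:
        cutoff += step
        kept = [v for v in kept if v >= cutoff]
    return cutoff, len(kept)


print(tune_cutoff([1, 1, 1, 2, 2, 3], max_rows=3, start=1.0, step=0.5))
```

A starting cutoff of 1.0 on the log2 scale corresponds to the 2-fold threshold mentioned for the heatmap.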
333+
329334
All of the plots, with the exception of the clusters plot, use Bokeh's nifty hover function to show
330335
you information about the points on the plots, e.g. gene name, normalised expression values, etc.
331336

332-
The `do_volcanoplot` function takes an additional argument (a tuple) that specifies the pair of groups
337+
The `do_volcanoplot` function takes an additional argument (a tuple) that specifies the pair of groups
333338
to plot on the volcano plot, for example:
334339

335340
```python
@@ -338,4 +343,4 @@ do_volcanoplot(experiment, ('treated', 'untreated'), show=False, image=False, ht
338343

339344
**Note that the function, `render_plot_to_png`, that generates the PNG versions of the plots requires the
340345
[PhantomJS](http://phantomjs.org) JavaScript API to be installed (it's essentially a headless browser) to
341-
work properly.**
346+
work properly.**
