Tutorial 2: Working with cytometry data
You can load any FCS file using the loadFCS function. For example, the Levine dataset (obtainable here) may be loaded as follows:
params, data = loadFCS("Levine_13dim.fcs")
params now contains the list of FCS parameters; you can parse a lot of interesting information from it using the getMetaData function:
getMetaData(params)
14×8 DataFrames.DataFrame. Omitted printing of 3 columns
│ Row │ E │ S │ N │ RMIN │ R │
│ │ String │ String │ String │ String │ String │
├─────┼────────┼────────┼────────┼───────────────────┼────────┤
│ 1 │ 0,0 │ │ CD45 │ -2.03601189282714 │ 1024 │
│ 2 │ 0,0 │ │ CD45RA │ -2.99700270621007 │ 1024 │
│ 3 │ 0,0 │ │ CD19 │ -3.05850183816765 │ 1024 │
│ 4 │ 0,0 │ │ CD11b │ -2.99956408593931 │ 1024 │
│ 5 │ 0,0 │ │ CD4 │ -2.22860674361335 │ 1024 │
│ 6 │ 0,0 │ │ CD8 │ -3.29174106765763 │ 1024 │
│ 7 │ 0,0 │ │ CD34 │ -2.74278770893026 │ 1024 │
│ 8 │ 0,0 │ │ CD20 │ -3.40866348184011 │ 1024 │
│ 9 │ 0,0 │ │ CD33 │ -2.31371406643428 │ 1024 │
│ 10 │ 0,0 │ │ CD123 │ -3.02624638359366 │ 1024 │
│ 11 │ 0,0 │ │ CD38 │ -3.14752313833461 │ 1024 │
│ 12 │ 0,0 │ │ CD90 │ -2.55305031846157 │ 1024 │
│ 13 │ 0,0 │ │ CD3 │ -3.52459385416266 │ 1024 │
│ 14 │ 0,0 │ │ label │ 1 │ 1024 │
data is a matrix of cell expressions, with one cell per row and one marker per column. If you want to run a SOM analysis on it, you can cluster and visualize it just as in the previous tutorial, with one exception: we start by cutting off the label column, which contains NaN values:
data = data[:,1:13]
som = initGigaSOM(data, 16, 16)
som = trainGigaSOM(som, data)
clusters = mapToGigaSOM(som, data)
e = embedGigaSOM(som, data)
# ... save/plot results, etc...
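For instance, the computed embedding can be exported for further processing; here is a minimal sketch of that, assuming the CSV and DataFrames packages are installed:
using CSV, DataFrames
# write the per-cell 2D embedding coordinates into a CSV file
CSV.write("embedding.csv", DataFrame(e, [:e1, :e2]))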
Working with distributed data
Typical experiments produce multiple FCS files, and distributed or parallel processing is very helpful for crunching through all the data.
To load multiple FCS files, use loadFCSSet. This function works well in the usual single-process environment, but it is additionally designed to handle situations when the data is too big to fit into memory, in which case it attempts to split the data among the available distributed workers.
For the purpose of data distribution, you need to identify each dataset by a unique dataset name that will be used to refer to your loaded data in the cluster environment. The dataset name is a simple Julia symbol; basically a variable name prefixed with a colon (:).
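For instance:
# dataset names are ordinary Julia symbols
name = :levine
typeof(name)   # Symbol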
Using such a name, we can load the Levine13 dataset as follows:
datainfo = loadFCSSet(:levine, ["Levine_13dim.fcs"])
If you have more files, just write their names into the array and the function will handle the rest.
The result datainfo carries information about your selected dataset name and its distribution among the cluster. It can again be used just like the "data" parameter in all SOM-related functions, e.g. as trainGigaSOM(som, datainfo).
The following example exploits the possibility to actually split the data, and processes the Levine dataset in parallel on 2 workers:
using Distributed
addprocs(2) # add any number of CPUs/tasks/workers you have available
@everywhere using GigaSOM # load GigaSOM also on the workers
datainfo = loadFCSSet(:levine, ["Levine_13dim.fcs"]) # add more files as needed
dselect(datainfo, Vector(1:13)) # select columns that contain expressions (column 14 contains labels)
som = initGigaSOM(datainfo, 20, 20)
som = trainGigaSOM(som, datainfo)
To prevent memory overload of the "master" computation node, the results of all per-cell operations are also stored in distributed datainfo objects. In this case, the following code does the embedding, but leaves the resulting data safely scattered among the cluster:
e = embedGigaSOM(som, datainfo)
If you are sure you have enough RAM, you can collect the data to the master node. (In the case of the relatively small Levine13 dataset, you very probably have the required 2.5MB of RAM, but many other datasets are much larger.)
e = gather_array(e)
167044×2 Array{Float64,2}:
16.8251 11.5002
18.2608 12.884
12.0103 5.18401
18.381 12.3436
18.357 11.6622
14.8936 12.0897
17.6441 12.3652
17.8752 12.7206
17.301 11.0767
14.2055 12.2227
⋮
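Once gathered, the embedding can be visualized with any plotting package. A minimal sketch, assuming the Plots package is installed:
using Plots
# draw each cell as a small dot at its embedded coordinates
scatter(e[:,1], e[:,2], markersize=1, markerstrokewidth=0, legend=false)
savefig("levine_embedding.png")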
Working with larger datasets
In this example we will use a subset of the Cytometry data by Bodenmiller et al. This dataset contains samples from peripheral blood mononuclear cells (PBMCs) in unstimulated and stimulated conditions for 8 healthy donors.
10 cell surface markers (lineage markers) are used to identify different cell populations. The dataset is described in two files:
PBMC8_panel.xlsx (antigen names, categorized as lineage markers and functional markers)
PBMC8_metadata.xlsx (file names, sample IDs, condition IDs and patient IDs)
Download and prepare the dataset
The example data can be downloaded from imlspenticton.uzh.ch/robinson_lab/cytofWorkflow/
You can fetch the files directly from within Julia:
# fetch the required data for testing: download the files and unzip the FCS archive
dataFiles = ["PBMC8_metadata.xlsx", "PBMC8_panel.xlsx", "PBMC8_fcs_files.zip"]
for f in dataFiles
    if !isfile(f)
        download("http://imlspenticton.uzh.ch/robinson_lab/cytofWorkflow/"*f, f)
        if occursin(".zip", f)
            run(`unzip PBMC8_fcs_files.zip`)
        end
    end
end
The metadata is present in external files; we read it into DataFrames and extract information about the FCS data columns from there. First, we read the actual content using the XLSX package:
using XLSX
md = GigaSOM.DataFrame(readtable("PBMC8_metadata.xlsx", "Sheet1", infer_eltypes=true)...)
panel = GigaSOM.DataFrame(readtable("PBMC8_panel.xlsx", "Sheet1", infer_eltypes=true)...)
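At this point, you can quickly check that the tables have loaded as expected, e.g. by peeking at their contents:
first(md, 3)   # show the first 3 rows of the metadata table
names(panel)   # list the panel columns (these include Antigen and Lineage)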
After that, we can get the parameter structure from the first FCS file:
_, fcsParams = loadFCSHeader(md[1, :file_name])
Continue by extracting the marker names using the prepared functions:
_, fcsAntigens = getMarkerNames(getMetaData(fcsParams))
Now, select the antigens we want to use (here we assume we want only the lineage markers):
antigens = panel[panel[:,:Lineage].==1, :Antigen]
Finally, it is often useful to make the names a bit more Julia-friendly and predictable:
cleanNames!(antigens)
cleanNames!(fcsAntigens)
Load and prepare the data
Now we have the vector of fcsAntigens that the FCS files store, and a list of antigens that we want to analyze. We continue by loading the data, reducing it to the desired antigens, and transforming it a bit:
di = loadFCSSet(:pbmc8, md[:,:file_name])
(If data distribution and parallelization are required, you must add the parallel workers using addprocs before this step.)
Now that the data is loaded, let's prepare it a bit by reducing it to the actually interesting columns, and by transforming and scaling the values:
# select only the columns that correspond to the lineage antigens we have prepared before
dselect(di, fcsAntigens, antigens)
cols = Vector(1:length(antigens)) # shortcut for "all columns"
# perform asinh transformation on all data in the dataset, '5' is the cofactor for the transformation
dtransform_asinh(di, cols, 5)
# normalize all dataset columns to mean=0 sdev=1
dscale(di, cols)
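As a side note, the asinh transformation with cofactor 5 maps each expression value x to asinh(x/5); this behaves roughly logarithmically for large values, but remains smooth and well-defined around zero and for negative values:
# effect of the asinh transform with cofactor 5 on a few sample values
asinh(10 / 5)     # ≈ 1.44
asinh(1000 / 5)   # ≈ 5.99 (roughly logarithmic growth)
asinh(-10 / 5)    # ≈ -1.44 (negative values are preserved symmetrically)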
Create a Self-Organizing Map (SOM)
With the data prepared, running the SOM algorithm is straightforward:
# randomly initialize the SOM
som = initGigaSOM(di, 16, 16)
# train the SOM for 20 epochs (the default is 10, but nothing bad happens if the
# number of epochs is slightly overdone)
som = trainGigaSOM(som, di, epochs = 20)
Finally, calculate the clustering:
somClusters = mapToGigaSOM(som, di)
FlowSOM-style metaclustering
One disadvantage of SOMs is that they output a large number of small clusters that are relatively hard to interpret manually. FlowSOM improved that situation by running a "clustering on clusters" (metaclustering) that addresses the problem.
In this example, we reduce the original 256 small clusters from the 16×16 SOM to only 10 "metaclusters", using standard hierarchical clustering:
using Clustering
import Distances
metaClusters =
    cutree(k=10,
        hclust(linkage=:average,
            GigaSOM.distMatrix(Distances.Euclidean())(som.codes)))
The metaClusters represent the membership of the SOM codes in the metaclusters; these can be expanded to the membership of all cells using mapToGigaSOM:
mapping = gather_array(mapToGigaSOM(som, di), free=true)
clusters = metaClusters[mapping]
clusters now contains, for each cell in the dataset, an integer from 1 to 10 with its metacluster classification.
(The argument free=true of gather_array automatically removes the distributed data from the workers after collecting it, which saves their memory for other datasets.)
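As a quick sanity check of the result, you can count how many cells were assigned to each metacluster; a minimal sketch:
# count the cells assigned to each of the 10 metaclusters
cellCounts = [count(==(i), clusters) for i in 1:10]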