Functions
Data structures
GigaSOM.Som
— TypeSom
Structure to hold all data of a trained SOM.
Fields:
- codes::Array{Float64,2}: 2D-array of codebook vectors. One vector per row
- xdim::Int: number of neurons in x-direction
- ydim::Int: number of neurons in y-direction
- numCodes::Int: total number of neurons
- grid::Array{Float64,2}: 2D-array of coordinates of neurons on the map (2 columns (x,y) for rectangular and hexagonal maps, 3 columns (x,y,z) for spherical maps)
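As a rough illustration of how these fields relate to each other, the following sketch builds a stand-in structure for a 3×2 rectangular map over 4-dimensional data. The name SomSketch and the placeholder values are assumptions made for illustration; this is not the GigaSOM definition.

```julia
# Hypothetical stand-in mirroring the documented fields; not the GigaSOM type.
struct SomSketch
    codes::Array{Float64,2}  # codebook vectors, one per row
    xdim::Int                # number of neurons in x-direction
    ydim::Int                # number of neurons in y-direction
    numCodes::Int            # total number of neurons
    grid::Array{Float64,2}   # neuron coordinates, one row per neuron
end

xdim, ydim, ncol = 3, 2, 4
# Rectangular map: two coordinate columns (x, y), one row per neuron.
grid = hcat(repeat(0.0:(xdim - 1), outer = ydim), repeat(0.0:(ydim - 1), inner = xdim))
som = SomSketch(zeros(xdim * ydim, ncol), xdim, ydim, xdim * ydim, grid)
```

The codebook and the grid share the same row count, so row i of codes belongs to the neuron located at row i of grid.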
Data loading and preparation
GigaSOM.distributeFCSFileVector
— FunctiondistributeFCSFileVector(name::Symbol, fns::Vector{String}, pids=workers())::Dinfo
Distribute a vector of integers among the workers that describes which file from fns the cell comes from. Useful for producing per-file statistics. The vector is saved on workers specified by pids as a distributed variable name.
GigaSOM.distributeFileVector
— FunctiondistributeFileVector(name::Symbol, sizes::Vector{Int}, slices::Vector{Tuple{Int,Int,Int,Int}}, pids=workers())::Dinfo
Generalized version of distributeFCSFileVector that produces the integer vector from any sizes and slices.
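To make the idea of the file-index vector concrete, here is a minimal single-process sketch of what one slice of such a vector could contain. The helper name fileIndexSlice is hypothetical, and the slice tuple is assumed to follow the (startFile, startOff, finalFile, finalOff) layout described for slicesof, with inclusive 1-based offsets.

```julia
# Hypothetical helper, for illustration only; not part of GigaSOM.
# Returns, for every element of one slice, the index of the file it comes from.
function fileIndexSlice(sizes::Vector{Int},
                        (startFile, startOff, finalFile, finalOff)::NTuple{4,Int})
    out = Int[]
    for f in startFile:finalFile
        lo = f == startFile ? startOff : 1         # first element taken from file f
        hi = f == finalFile ? finalOff : sizes[f]  # last element taken from file f
        append!(out, fill(f, hi - lo + 1))
    end
    return out
end

# Two files with 3 and 7 cells; the slice spans the end of file 1 and the start of file 2.
idx = fileIndexSlice([3, 7], (1, 2, 2, 2))
```

Here idx holds two entries for file 1 followed by two entries for file 2, which is the kind of vector that can be used for grouping per-file statistics.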
GigaSOM.getCSVSize
— Methodfunction getCSVSize(fn::String; args...)::Tuple{Int,Int}
Read the dimensions (number of rows and columns, respectively) from a CSV file fn. args are passed to function CSV.File.
Example
getCSVSize("test.csv", header=false)
GigaSOM.getFCSSize
— MethodgetFCSSize(offsets, params)::Tuple{Int,Int}
Convert the offsets and keywords from an FCS file to cell and parameter count, respectively.
GigaSOM.loadCSV
— Methodfunction loadCSV(fn::String; args...)::Matrix{Float64}
CSV equivalent of loadFCS. The metadata (header, column names) are not extracted. args are passed to CSV.read.
GigaSOM.loadCSVSet
— Functionfunction loadCSVSet(
name::Symbol,
fns::Vector{String},
pids = workers();
postLoad = (d, i) -> d,
csvargs...,
)::Dinfo
CSV equivalent of loadFCSSet. csvargs are passed as keyword arguments to CSV-loading functions.
GigaSOM.loadCSVSizes
— Methodfunction loadCSVSizes(fns::Vector{String}; args...)::Vector{Int}
Determine the number of rows in a list of CSV files (passed as fns). Equivalent to loadFCSSizes.
GigaSOM.loadFCS
— MethodloadFCS(fn::String; applyCompensation::Bool=true)::Tuple{Dict{String,String}, Matrix{Float64}}
Read an FCS file. Return a tuple that contains in order:
- dictionary of the keywords contained in the file
- raw column names
- prettified and annotated column names
- raw data matrix
If applyCompensation is set, the function parses and retrieves a spillover matrix (if any valid keyword in the FCS is found that would contain it) and applies it to compensate the data.
GigaSOM.loadFCSHeader
— MethodloadFCSHeader(fn::String)::Tuple{Vector{Int}, Dict{String,String}}
Efficiently extract data offsets and keyword dictionary from an FCS file.
GigaSOM.loadFCSSet
— FunctionloadFCSSet(name::Symbol, fns::Vector{String}, pids=workers(); applyCompensation=true, postLoad=(d,i)->d)::Dinfo
This runs the FCS loading machinery in a distributed way, so that the files fns (with full path) are sliced into equal parts and saved as a distributed variable name on workers specified by pids.
applyCompensation is passed to the loadFCS function.
See slicesof for description of the slicing.
postLoad is applied to the loaded FCS file data (and the index) – use this function to e.g. filter out certain columns right on loading, using selectFCSColumns.
The loaded dataset can be manipulated by the distributed functions, e.g.
- dselect for removing columns
- dscale for normalization
- dtransform_asinh (and others) for transformation
- etc.
GigaSOM.loadFCSSizes
— MethodloadFCSSizes(fns::Vector{String})
Load cell counts in many FCS files at once. Useful as input for slicesof.
GigaSOM.selectFCSColumns
— MethodselectFCSColumns(selectColnames::Vector{String})
Return a function useful with loadFCSSet, which loads only the specified (prettified) column names from the FCS files. Use getMetaData, getMarkerNames and cleanNames! to retrieve the usable column names for an FCS file.
GigaSOM.cleanNames!
— MethodcleanNames!(mydata::Vector{String})
Replaces problematic characters in column names, avoids duplicate names, and prefixes an '_' if the name starts with a number.
Arguments:
- mydata: vector of names (gets modified)
GigaSOM.compensate!
— Methodcompensate!(data::Matrix{Float64}, spillover::Matrix{Float64}, cols::Vector{Int})
Apply a compensation matrix in spillover (the individual columns of which describe, in order, the spillover of cols in data) to the matrix data in-place.
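The following sketch shows one common way such a compensation can be carried out. It assumes the usual flow-cytometry convention that the observed signal equals the true signal multiplied by the spillover matrix, so compensation multiplies the selected columns by the inverse of that matrix. The name compensateSketch! and this convention are assumptions for illustration, not a statement about the GigaSOM implementation.

```julia
using LinearAlgebra

# Illustrative only: assumes observed = truth * spillover for the selected columns.
function compensateSketch!(data::Matrix{Float64}, spillover::Matrix{Float64}, cols::Vector{Int})
    # Right division is equivalent to multiplying by inv(spillover).
    data[:, cols] = data[:, cols] / spillover
    return data
end

# Channel 1 leaks 10% of its signal into channel 2.
spill = [1.0 0.1; 0.0 1.0]
truth = [100.0 0.0; 0.0 50.0]
observed = truth * spill
compensateSketch!(observed, spill, [1, 2])
```

After the call, observed again matches truth up to floating-point error.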
GigaSOM.getMarkerNames
— MethodgetMarkerNames(meta::DataFrame)::Tuple{Vector{String}, Vector{String}}
Extract suitable raw names (useful for selecting columns) and pretty readable names (useful for humans) from FCS file metadata.
GigaSOM.getMetaData
— MethodgetMetaData(f)
Collect the meta data information in a more user friendly format.
Arguments:
- f: input structure with .params and .data fields
GigaSOM.getSpillover
— MethodgetSpillover(params::Dict{String, String})::Union{Tuple{Vector{String},Matrix{Float64}}, Nothing}
Get a spillover matrix from FCS params. Returns a pair with description of columns to be applied, and with the actual spillover matrix. Returns nothing in case spillover is not present.
GigaSOM.parseSpillover
— MethodparseSpillover(str::String)::Union{Tuple{Vector{String},Matrix{Float64}}, Nothing}
Parse the spillover matrix from the string value of an FCS parameter.
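For orientation, the sketch below parses a spillover string under the assumption that it follows the FCS 3.1 layout: the number of channels n, then n channel names, then n×n values listed row by row, all comma-separated. The function name parseSpilloverSketch and the sample string are hypothetical; the library function may handle more variants than this.

```julia
# Illustrative parser; assumes the "n,name1,...,namen,v11,v12,..." layout.
function parseSpilloverSketch(str::String)
    fields = split(str, ',')
    n = parse(Int, fields[1])
    # Reject strings whose field count does not match the declared size.
    length(fields) == 1 + n + n * n || return nothing
    names = String.(fields[2:(n + 1)])
    values = parse.(Float64, fields[(n + 2):end])
    # Values are listed row by row, while reshape fills column by column.
    matrix = permutedims(reshape(values, n, n))
    return (names, matrix)
end

parsed = parseSpilloverSketch("2,FL1-A,FL2-A,1.0,0.1,0.0,1.0")
```

A string with the wrong number of fields yields nothing, mirroring the documented return type.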
GigaSOM.collectSlice
— MethodcollectSlice(loadVec, (startFile, startOff, finalFile, finalOff)::Tuple{Int,Int,Int,Int})::Vector
Alternative of vcollectSlice for 1D vectors.
GigaSOM.slicesof
— Methodslicesof(lengths::Vector{Int}, slices::Int)::Vector{Tuple{Int,Int,Int,Int}}
Given a list of lengths of input arrays, compute a slicing into a specified amount of equally-sized slices.
The output is a vector of 4-tuples where each specifies how to create one slice. The i-th tuple contains, in order:
- the index of input array at which the i-th slice begins
- first element of the i-th slice in that input array
- the index of input array with the last element of the i-th slice
- the index of the last element of the i-th slice in that array
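The tuple layout above can be illustrated with a small self-contained sketch. The function equalSlices is a hypothetical re-implementation written for this example; it assumes 1-based inclusive offsets and at least as many elements as slices, and its rounding of slice boundaries may differ from slicesof.

```julia
# Hypothetical re-implementation for illustration; not the GigaSOM source.
function equalSlices(lengths::Vector{Int}, n::Int)
    total = sum(lengths)
    ends = cumsum(lengths)         # global index of the last element of each array
    starts = ends .- lengths .+ 1  # global index of the first element of each array
    # Translate a global element index into (array index, offset inside the array).
    function locate(g)
        a = searchsortedfirst(ends, g)
        return (a, g - starts[a] + 1)
    end
    result = Vector{NTuple{4,Int}}()
    for i in 1:n
        firstGlobal = div((i - 1) * total, n) + 1
        lastGlobal = div(i * total, n)
        push!(result, (locate(firstGlobal)..., locate(lastGlobal)...))
    end
    return result
end

# Two arrays with 3 and 7 elements, split into 2 slices of 5 elements each.
parts = equalSlices([3, 7], 2)
```

The first slice starts at element 1 of array 1 and ends at element 2 of array 2; the second slice covers elements 3 to 7 of array 2.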
GigaSOM.vcollectSlice
— MethodvcollectSlice(loadMtx, (startFile, startOff, finalFile, finalOff)::Tuple{Int,Int,Int,Int})::Matrix
Given a method to obtain matrix content (loadMtx), reconstruct a slice from the information generated by slicesof.
This function is specialized for reconstructing matrices and arrays, where the "element counts" split by slicesof are in fact matrix rows. The function is therefore named vcollect (the slicing and concatenation is vertical).
The actual data content and loading method is abstracted out – function loadMtx gets the index of the input part that it is required to fetch (e.g. index of one FCS file), and is expected to return that input part as a whole matrix. vcollectSlice correctly calls this function as required and extracts relevant portions of the matrices, so that at the end the whole slice can be pasted together.
Example:
# get a list of files
filenames = ["a.fcs", "b.fcs"]
# get descriptions of 5 equally sized parts of the data
slices = slicesof(loadFCSSizes(filenames), 5)
# reconstruct first 3 columns of the first slice
# (loadMtx receives a file index, so it is used to look up the file name)
mySlice = vcollectSlice(
    i -> last(loadFCS(filenames[i]))[:,1:3],
    slices[1])
# (note: the data matrix is the last item returned by loadFCS)
GigaSOM.dtransform_asinh
— Functiondtransform_asinh(dInfo::Dinfo, columns::Vector{Int}, cofactor=5)
Transform columns of the dataset by asinh transformation with cofactor.
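On a plain in-memory matrix, the transformation can be sketched as follows. The helper name asinhColumns! is hypothetical, and the formula asinh(x / cofactor) is assumed from the usual cytometry convention rather than taken from the library source.

```julia
# Illustrative only: assumes the transformation is asinh(x / cofactor).
function asinhColumns!(data::Matrix{Float64}, columns::Vector{Int}, cofactor = 5)
    data[:, columns] .= asinh.(data[:, columns] ./ cofactor)
    return data
end

m = [0.0 10.0; 5.0 20.0]
asinhColumns!(m, [1])  # transform only the first column
```

Columns that are not listed stay untouched, which is why the column indices are passed explicitly.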
SOM training
GigaSOM.distributedEpoch
— MethoddistributedEpoch(dInfo::Dinfo, codes::Matrix{Float64}, tree)
Execute doEpoch in parallel on workers described by dInfo and collect the results. Returns a pair of numerator and denominator matrices.
GigaSOM.doEpoch
— MethoddoEpoch(x::Array{Float64, 2}, codes::Array{Float64, 2}, tree)
vectors and the adjustment in radius after each epoch.
Arguments:
- x: training data
- codes: codebook
- tree: knn-compatible tree built upon the codes
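One plausible reading of the numerator and denominator mentioned for distributedEpoch is the batch-SOM accumulation sketched below: for every neuron, sum the data rows whose nearest codebook vector it is, and count them. The function epochSums is hypothetical, uses a brute-force search instead of a kNN tree, and returns the counts as a vector; how the library combines these sums with the neighbourhood kernel is not shown here.

```julia
# Illustrative batch accumulation; not the GigaSOM implementation.
function epochSums(x::Matrix{Float64}, codes::Matrix{Float64})
    numerator = zeros(size(codes))       # per-neuron sum of assigned data rows
    denominator = zeros(size(codes, 1))  # per-neuron count of assigned data rows
    for i in 1:size(x, 1)
        # Brute-force search for the nearest codebook row (squared Euclidean distance).
        dists = [sum(abs2, x[i, :] .- codes[j, :]) for j in 1:size(codes, 1)]
        b = argmin(dists)
        numerator[b, :] .+= x[i, :]
        denominator[b] += 1
    end
    return numerator, denominator
end

x = [1.0 1.0; 9.0 8.0; -2.0 0.5]
codes = [0.0 0.0; 10.0 10.0]
num, den = epochSums(x, codes)
```

Because both results are plain sums, partial results computed on separate workers can simply be added together before dividing.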
GigaSOM.initGigaSOM
— Functionfunction initGigaSOM(ncol::Int64,
means::Vector{Float64}, sdevs::Vector{Float64},
xdim::Int64, ydim::Int64 = xdim;
seed = rand(Int), rng = StableRNG(seed))
Generate a stable random initial SOM with the random distribution that matches the parameters.
Arguments:
- ncol: number of desired data columns
- means, sdevs: vectors that describe the data distribution, both of size ncol
- xdim, ydim: size of the SOM
- seed: a seed (defaults to random seed from the current default random generator)
- rng: a random number generator to be used (defaults to a StableRNG initialized with the seed)
Returns: a new Som structure
GigaSOM.initGigaSOM
— Methodfunction initGigaSOM(data::Dinfo,
xdim::Int64, ydim::Int64 = xdim;
seed=rand(Int), rng=StableRNG(seed))
initGigaSOM overload for working with distributed-style Dinfo data. The rest of the arguments are passed to the data-independent initGigaSOM.
Arguments:
- data: a Dinfo object with the distributed dataset matrix
GigaSOM.initGigaSOM
— MethodinitGigaSOM(data, args...)
Initializes a SOM by random selection from the training data. A generic overload that works for matrices and DataFrames that can be coerced to Matrix{Float64}. Other arguments are passed to the data-independent initGigaSOM.
Arguments:
- data: matrix of data for running the initialization
GigaSOM.mapToGigaSOM
— MethodmapToGigaSOM(som::Som, data;
knnTreeFun = BruteTree,
metric = Euclidean())
Overload of mapToGigaSOM for simple DataFrames and matrices. This slices the data using DistributedArrays, sends them to the workers, and runs normal mapToGigaSOM. Data is unscattered after the computation.
GigaSOM.mapToGigaSOM
— MethodmapToGigaSOM(som::Som, dInfo::Dinfo;
knnTreeFun = BruteTree, metric = Euclidean(),
output::Symbol=tmp_symbol(dInfo))::Dinfo
Compute the index of the best-matching unit (BMU) for each row of the input data.
Arguments
- som: a trained SOM
- dInfo: Dinfo that describes the loaded and distributed data
- knnTreeFun: Constructor of the KNN-tree (e.g. from NearestNeighbors package)
- metric: Passed as metric argument to the KNN-tree constructor
- output: Symbol to save the result, defaults to tmp_symbol(dInfo)
Data must have the same number of dimensions as the training dataset and will be normalised with the same parameters.
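The mapping itself amounts to a nearest-neighbour query against the codebook. A minimal single-process sketch, assuming Euclidean distance and a brute-force search in place of the kNN tree, could look like this; the function name bmuIndices is hypothetical.

```julia
# Illustrative brute-force BMU lookup; a kNN tree would replace the inner search.
function bmuIndices(codes::Matrix{Float64}, data::Matrix{Float64})
    return [argmin([sum(abs2, data[i, :] .- codes[j, :]) for j in 1:size(codes, 1)])
            for i in 1:size(data, 1)]
end

codes = [0.0 0.0; 10.0 10.0]
data = [1.0 1.0; 9.0 8.0; -2.0 0.5]
bmus = bmuIndices(codes, data)
```

Each entry of bmus is the row number in codes of the codebook vector closest to the corresponding data row, which is why the data must have the same columns as the training dataset.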
GigaSOM.scaleEpochTime
— MethodscaleEpochTime(iteration::Int64, epochs::Int64)
Convert iteration ID and epoch number to relative time in training.
GigaSOM.trainGigaSOM
— MethodtrainGigaSOM(som::Som, train;
kwargs...)
Overload of trainGigaSOM for simple DataFrames and matrices. This slices the data, distributes them to the workers, and runs normal trainGigaSOM. Data is unscattered after the computation.
GigaSOM.trainGigaSOM
— MethodtrainGigaSOM(
som::Som,
dInfo::Dinfo;
kernelFun::Function = gaussianKernel,
metric = Euclidean(),
somDistFun = distMatrix(Chebyshev()),
knnTreeFun = BruteTree,
rStart = 0.0,
rFinal = 0.1,
radiusFun = expRadius(-5.0),
epochs = 20,
eachEpoch = (e, r, som) -> nothing,
)
Arguments:
- som: object of type Som with an initialised SOM
- dInfo: Dinfo object that describes a loaded dataset
- kernelFun::Function: optional distance kernel; one of (bubbleKernel, gaussianKernel), default is gaussianKernel
- metric: Passed as metric argument to the KNN-tree constructor
- somDistFun: Function for computing the distances in the SOM map
- knnTreeFun: Constructor of the KNN-tree (e.g. from NearestNeighbors package)
- rStart: optional training radius. If zero (default), it is computed from the SOM grid size.
- rFinal: target radius at the last epoch, defaults to 0.1
- radiusFun: Function that generates radius decay, e.g. linearRadius or expRadius(10.0)
- epochs: number of SOM training iterations (default 20)
- eachEpoch: a function to call back after each epoch, accepting arguments (epochNumber, radius, som). For simplicity, this gets additionally called once before the first epoch, with epochNumber set to zero.
GigaSOM.bubbleKernel
— MethodbubbleKernel(x, r::Float64)
Return a "bubble" (spherical) distribution kernel.
GigaSOM.distMatrix
— FunctiondistMatrix(metric=Chebyshev())
Return a function that uses the metric (compatible with metrics from package Distances) to calculate distance matrices from normal row-wise data matrices.
Use as a parameter of trainGigaSOM.
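To show what such a distance matrix looks like for the default Chebyshev metric, the following sketch computes it directly for a few grid coordinates. The function chebyshevMatrix is a hypothetical stand-in that avoids the Distances dependency; it is not the function returned by distMatrix.

```julia
# Illustrative pairwise Chebyshev (maximum-coordinate) distances between rows.
function chebyshevMatrix(m::Matrix{Float64})
    n = size(m, 1)
    return [maximum(abs.(m[i, :] .- m[j, :])) for i in 1:n, j in 1:n]
end

coords = [0.0 0.0; 1.0 0.0; 3.0 1.0]
d = chebyshevMatrix(coords)
```

With this metric, diagonal neighbours on a rectangular grid are as close as horizontal or vertical ones, since only the largest coordinate difference counts.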
GigaSOM.expRadius
— FunctionexpRadius(steepness::Float64)
Return a function to be used as a radiusFun of trainGigaSOM, which causes exponential decay with the selected steepness.
Use: trainGigaSOM(..., radiusFun = expRadius(0.5))
Arguments
- steepness: Steepness of exponential descent. Good values range from -100.0 (almost linear) to 100.0 (really quick decay).
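The effect of the steepness parameter can be illustrated with a sketch of one way to build such a decay. Everything here is an assumption for illustration: the name expDecaySketch, the use of a relative training time t in [0, 1] instead of iteration and epoch counts, and the trick of shifting both radii by a steepness-dependent offset before interpolating geometrically.

```julia
# Illustrative steepness-controlled decay; not the GigaSOM implementation.
function expDecaySketch(steepness::Float64)
    return (initRadius::Float64, finalRadius::Float64, t::Float64) -> begin
        # Shift both radii by a common offset. A large positive steepness moves the
        # shifted final radius close to zero, which makes the geometric decay sharper;
        # a large negative steepness makes the curve close to linear.
        offset = finalRadius * (1 - 1.1^(-steepness))
        a = initRadius - offset
        b = finalRadius - offset
        return offset + a * (b / a)^t
    end
end

plain = expDecaySketch(0.0)
steep = expDecaySketch(50.0)
gentle = expDecaySketch(-50.0)
mid = (plain(10.0, 0.1, 0.5), steep(10.0, 0.1, 0.5), gentle(10.0, 0.1, 0.5))
```

All three curves start at the initial radius and end at the final radius; they differ only in how quickly the radius drops in between.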
GigaSOM.gaussianKernel
— MethodgaussianKernel(x, r::Float64)
Return the value of normal distribution PDF (σ=r, μ=0) at x
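Written out, the normal density with mean zero and standard deviation r can be sketched as below. The function name gaussianSketch is hypothetical and is used only to avoid implying that this is the library code.

```julia
# Normal distribution PDF with mean 0 and standard deviation r, evaluated at x.
gaussianSketch(x, r) = exp(-x^2 / (2 * r^2)) / (r * sqrt(2π))

peak = gaussianSketch(0.0, 1.0)  # density at the centre
tail = gaussianSketch(3.0, 1.0)  # density three radii away
```

The value falls off smoothly with distance, so neurons far from the winner still receive a small, non-zero weight.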
GigaSOM.gridRectangular
— MethodgridRectangular(xdim, ydim)
Create coordinates of all neurons on a rectangular SOM.
The return-value is an array of size (Number-of-neurons, 2) with x- and y- coordinates of the neurons in the first and second column respectively. The distance between neighbours is 1.0. The point of origin is bottom-left. The first neuron sits at (0,0).
Arguments
- xdim: number of neurons in x-direction
- ydim: number of neurons in y-direction
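A small sketch of the documented layout follows. The function rectGridSketch is hypothetical, and the order in which neurons are enumerated (x varying fastest) is an assumption; only the array shape, the unit spacing, and the first neuron at (0,0) come from the description above.

```julia
# Illustrative rectangular grid; enumeration order is assumed, not documented.
function rectGridSketch(xdim::Int, ydim::Int)
    grid = zeros(Float64, xdim * ydim, 2)
    for i in 1:(xdim * ydim)
        grid[i, 1] = (i - 1) % xdim    # x-coordinate, varies fastest
        grid[i, 2] = div(i - 1, xdim)  # y-coordinate
    end
    return grid
end

g = rectGridSketch(3, 2)
```

The result has one row per neuron and two columns, matching the grid field of the Som structure for rectangular maps.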
GigaSOM.linearRadius
— MethodlinearRadius(initRadius::Float64, iteration::Int64, decay::String, epochs::Int64)
Return a neighbourhood radius. Use as the radiusFun parameter for trainGigaSOM.
Arguments
- initRadius: Initial Radius
- finalRadius: Final Radius
- iteration: Training iteration
- epochs: Total number of epochs
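A linear decay built from the four arguments listed above can be sketched as follows. The names relativeTime and linearDecay are hypothetical, and the mapping of iterations to relative time (first iteration at 0, last at 1) is an assumption about what scaleEpochTime might do, not a description of it.

```julia
# Illustrative relative training time: 0.0 at the first iteration, 1.0 at the last.
relativeTime(iteration::Int, epochs::Int) =
    epochs > 1 ? (iteration - 1) / (epochs - 1) : 0.0

# Illustrative linear interpolation between the initial and the final radius.
linearDecay(initRadius::Float64, finalRadius::Float64, iteration::Int, epochs::Int) =
    initRadius - relativeTime(iteration, epochs) * (initRadius - finalRadius)

radii = [linearDecay(10.0, 0.1, i, 21) for i in 1:21]
```

The resulting radii shrink in equal steps from the initial to the final radius over the course of training.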
GigaSOM.thresholdKernel
— FunctionthresholdKernel(x, r::Float64)
Simple FlowSOM-like hard-threshold kernel
Embedding
GigaSOM.embedGigaSOM
— MethodembedGigaSOM(som::GigaSOM.Som,
data;
knnTreeFun = BruteTree,
metric = Euclidean(),
k::Int64=0,
adjust::Float64=1.0,
smooth::Float64=0.0,
m::Float64=10.0)
Overload of embedGigaSOM for simple DataFrames and matrices. This slices the data using DistributedArrays, sends them to the workers, and runs normal embedGigaSOM. All data is properly unscattered after the computation.
Examples:
Produce a 2-column matrix with 2D cell coordinates:
e = embedGigaSOM(som, data)
Plot the result using 2D histogram from Gadfly:
using Gadfly
draw(PNG("output.png",20cm,20cm),
plot(x=e[:,1], y=e[:,2],
Geom.histogram2d(xbincount=200, ybincount=200)))
GigaSOM.embedGigaSOM
— MethodembedGigaSOM(som::GigaSOM.Som,
dInfo::Dinfo;
knnTreeFun = BruteTree,
metric = Euclidean(),
k::Int64=0,
adjust::Float64=1.0,
smooth::Float64=0.0,
m::Float64=10.0,
output::Symbol=tmp_symbol(dInfo))::Dinfo
Return a data frame with X,Y coordinates of EmbedSOM projection of the data.
Arguments:
- som: a trained SOM
- dInfo: Dinfo that describes the loaded dataset
- knnTreeFun: Constructor of the KNN-tree (e.g. from NearestNeighbors package)
- metric: Passed as metric argument to the KNN-tree constructor
- k: number of nearest neighbors to consider (high values get quadratically slower)
- adjust: position adjustment parameter (higher values avoid non-local approximations)
- smooth: approximation smoothness (the higher the value, the larger the neighborhood of approximate local linearity of the projection)
- m: exponential decay rate for the score when approaching the k+1-th neighbor distance
- output: variable name for storing the distributed result
Data must have the same number of dimensions as the training dataset, and must be normalized using the same parameters.
GigaSOM.embedGigaSOM_internal
— MethodembedGigaSOM_internal(som::GigaSOM.Som,
data::Matrix{Float64},
tree,
k::Int64,
adjust::Float64,
boost::Float64,
m::Float64)
Internal function to compute parts of the embedding on a prepared kNN-tree structure (tree) and smooth converted to boost.