Functions
Data structures
GigaSOM.Som — Type
Som
Structure to hold all data of a trained SOM.
Fields:
- codes::Array{Float64,2}: 2D-array of codebook vectors. One vector per row
- xdim::Int: number of neurons in x-direction
- ydim::Int: number of neurons in y-direction
- numCodes::Int: total number of neurons
- grid::Array{Float64,2}: 2D-array of coordinates of neurons on the map (2 columns (x,y) for rectangular and hexagonal maps, 3 columns (x,y,z) for spherical maps)
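The field layout above can be pictured with a small stand-in structure. This is only a sketch: `SomSketch` is a hypothetical name, not the library's own definition, and the row ordering of the grid is an assumption.

```julia
# Hypothetical stand-in for the documented structure; not GigaSOM's own definition.
struct SomSketch
    codes::Array{Float64,2}   # one codebook vector per row
    xdim::Int                 # neurons in x-direction
    ydim::Int                 # neurons in y-direction
    numCodes::Int             # total number of neurons
    grid::Array{Float64,2}    # neuron coordinates, one neuron per row
end

xdim, ydim, ncol = 3, 2, 4
numCodes = xdim * ydim
codes = zeros(numCodes, ncol)
# Rectangular map: two coordinate columns (x, y); x varies fastest (assumption).
grid = hcat(repeat(0.0:(xdim - 1), outer = ydim), repeat(0.0:(ydim - 1), inner = xdim))
som = SomSketch(codes, xdim, ydim, numCodes, grid)
```

Both `codes` and `grid` have one row per neuron, so row i of `grid` gives the map position of the codebook vector in row i of `codes`.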
Data loading and preparation
GigaSOM.distributeFCSFileVector — Function
distributeFCSFileVector(name::Symbol, fns::Vector{String}, pids=workers())::Dinfo
Distribute a vector of integers among the workers that describes which file from fns the cell comes from. Useful for producing per-file statistics. The vector is saved on workers specified by pids as a distributed variable name.
GigaSOM.distributeFileVector — Function
distributeFileVector(name::Symbol, sizes::Vector{Int}, slices::Vector{Tuple{Int,Int,Int,Int}}, pids=workers())::Dinfo
Generalized version of distributeFCSFileVector that produces the integer vector from any sizes and slices.
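To make the idea of the file-index vector concrete, here is a minimal single-process sketch. `fileIndexVector` is a hypothetical helper; the library functions additionally split the vector among workers according to the slices.

```julia
# Hypothetical helper, for illustration only (no distribution among workers).
function fileIndexVector(sizes::Vector{Int})::Vector{Int}
    # Repeat each file index as many times as the file has cells.
    return reduce(vcat, [fill(i, n) for (i, n) in enumerate(sizes)]; init = Int[])
end

sizes = [3, 1, 2]                 # cell counts of three files
ids = fileIndexVector(sizes)      # one file index per cell
# Per-file statistics then reduce to grouping by this vector, e.g. counting cells:
counts = [count(==(i), ids) for i in 1:length(sizes)]
```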
GigaSOM.getCSVSize — Method
function getCSVSize(fn::String; args...)::Tuple{Int,Int}
Read the dimensions (number of rows and columns, respectively) from a CSV file fn. args are passed to function CSV.File.
Example
getCSVSize("test.csv", header=false)
GigaSOM.getFCSSize — Method
getFCSSize(offsets, params)::Tuple{Int,Int}
Convert the offsets and keywords from an FCS file to cell and parameter count, respectively.
GigaSOM.loadCSV — Method
function loadCSV(fn::String; args...)::Matrix{Float64}
CSV equivalent of loadFCS. The metadata (header, column names) are not extracted. args are passed to CSV.read.
GigaSOM.loadCSVSet — Function
function loadCSVSet(
    name::Symbol,
    fns::Vector{String},
    pids = workers();
    postLoad = (d, i) -> d,
    csvargs...,
)::Dinfo
CSV equivalent of loadFCSSet. csvargs are passed as keyword arguments to CSV-loading functions.
GigaSOM.loadCSVSizes — Method
function loadCSVSizes(fns::Vector{String}; args...)::Vector{Int}
Determine the number of rows in a list of CSV files (passed as fns). Equivalent to loadFCSSizes.
GigaSOM.loadFCS — Method
loadFCS(fn::String; applyCompensation::Bool=true)::Tuple{Dict{String,String}, Matrix{Float64}}
Read an FCS file. Return a tuple that contains in order:
- dictionary of the keywords contained in the file
- raw column names
- prettified and annotated column names
- raw data matrix
If applyCompensation is set, the function parses and retrieves a spillover matrix (if any valid keyword in the FCS is found that would contain it) and applies it to compensate the data.
GigaSOM.loadFCSHeader — Method
loadFCSHeader(fn::String)::Tuple{Vector{Int}, Dict{String,String}}
Efficiently extract data offsets and keyword dictionary from an FCS file.
GigaSOM.loadFCSSet — Function
loadFCSSet(name::Symbol, fns::Vector{String}, pids=workers(); applyCompensation=true, postLoad=(d,i)->d)::Dinfo
This runs the FCS loading machinery in a distributed way, so that the files fns (with full path) are sliced into equal parts and saved as a distributed variable name on workers specified by pids.
applyCompensation is passed to loadFCS function.
See slicesof for description of the slicing.
postLoad is applied to the loaded FCS file data (and the index) – use this function to e.g. filter out certain columns right on loading, using selectFCSColumns.
The loaded dataset can be manipulated by the distributed functions, e.g.
- dselect for removing columns
- dscale for normalization
- dtransform_asinh (and others) for transformation
- etc.
GigaSOM.loadFCSSizes — Method
loadFCSSizes(fns::Vector{String})
Load cell counts in many FCS files at once. Useful as input for slicesof.
GigaSOM.selectFCSColumns — Method
selectFCSColumns(selectColnames::Vector{String})
Return a function useful with loadFCSSet, which loads only the specified (prettified) column names from the FCS files. Use getMetaData, getMarkerNames and cleanNames! to retrieve the usable column names for an FCS file.
GigaSOM.cleanNames! — Method
cleanNames!(mydata::Vector{String})
Replaces problematic characters in column names, avoids duplicate names, and prefixes an '_' if the name starts with a number.
Arguments:
- mydata: vector of names (gets modified)
GigaSOM.compensate! — Method
compensate!(data::Matrix{Float64}, spillover::Matrix{Float64}, cols::Vector{Int})
Apply a compensation matrix in spillover (the individual columns of which describe, in order, the spillover of cols in data) to the matrix data in-place.
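As an illustration of in-place compensation, the sketch below assumes the conventional model in which the observed signal equals the true signal multiplied by the spillover matrix, so compensation multiplies the selected columns by the inverse. `compensateSketch!` is a hypothetical name, and the orientation of the matrix is an assumption rather than something this page specifies.

```julia
using LinearAlgebra

# Hypothetical sketch; assumes observed = truth * spillover on the selected columns.
function compensateSketch!(data::Matrix{Float64}, spillover::Matrix{Float64}, cols::Vector{Int})
    # Right-multiplying by the inverse undoes the mixing in the selected columns.
    data[:, cols] = data[:, cols] * inv(spillover)
    return data
end

spill = [1.0 0.1; 0.0 1.0]                    # 10% of channel 1 leaks into channel 2
truth = [100.0 0.0 7.0; 50.0 20.0 7.0]
observed = copy(truth)
observed[:, [1, 2]] = truth[:, [1, 2]] * spill  # simulate the spillover
compensateSketch!(observed, spill, [1, 2])      # third column is left untouched
```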
GigaSOM.getMarkerNames — Method
getMarkerNames(meta::DataFrame)::Tuple{Vector{String}, Vector{String}}
Extract suitable raw names (useful for selecting columns) and pretty readable names (useful for humans) from FCS file metadata.
GigaSOM.getMetaData — Method
getMetaData(f)
Collect the metadata in a more user-friendly format.
Arguments:
- f: input structure with .params and .data fields
GigaSOM.getSpillover — Method
getSpillover(params::Dict{String, String})::Union{Tuple{Vector{String},Matrix{Float64}}, Nothing}
Get a spillover matrix from FCS params. Returns a pair with a description of the columns the matrix applies to, and the actual spillover matrix. Returns nothing if no spillover is present.
GigaSOM.parseSpillover — Method
parseSpillover(str::String)::Union{Tuple{Vector{String},Matrix{Float64}}, Nothing}
Parses the spillover matrix from the string value of an FCS parameter.
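A minimal parsing sketch follows, assuming the FCS 3.1 layout of the spillover keyword value: the channel count n, then n channel names, then n×n values listed row by row, all comma-separated. `parseSpilloverSketch` is a hypothetical name and is not the library routine.

```julia
# Hypothetical parser sketch; assumes the layout "n,name_1,...,name_n,v_11,v_12,...".
function parseSpilloverSketch(str::String)
    fields = split(str, ',')
    n = tryparse(Int, fields[1])
    # Malformed input yields nothing, mirroring the documented return type.
    (n === nothing || n < 1 || length(fields) != 1 + n + n * n) && return nothing
    names = String.(fields[2:(n + 1)])
    vals = tryparse.(Float64, fields[(n + 2):end])
    any(isnothing, vals) && return nothing
    # Values are listed row by row, while reshape fills column by column.
    return (names, permutedims(reshape(Float64.(vals), n, n)))
end

names, matrix = parseSpilloverSketch("2,FL1-A,FL2-A,1.0,0.1,0.0,1.0")
```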
GigaSOM.collectSlice — Method
collectSlice(loadVec, (startFile, startOff, finalFile, finalOff)::Tuple{Int,Int,Int,Int})::Vector
Alternative to vcollectSlice for 1D vectors.
GigaSOM.slicesof — Method
slicesof(lengths::Vector{Int}, slices::Int)::Vector{Tuple{Int,Int,Int,Int}}
Given a list of lengths of input arrays, compute a slicing into a specified number of equally-sized slices.
The output is a vector of 4-tuples where each specifies how to create one slice. The i-th tuple contains, in order:
- the index of input array at which the i-th slice begins
- the index of the first element of the i-th slice in that input array
- the index of input array with the last element of the i-th slice
- the index of the last element of the i-th slice in that array
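The 4-tuple format can be illustrated with a small re-implementation. This is a sketch under assumptions: `slicesofSketch` is a hypothetical name, the rounding of slice boundaries may differ from the library, and the total number of elements is assumed to be at least the number of slices.

```julia
# Hypothetical re-implementation for illustration; GigaSOM's own rounding may differ.
function slicesofSketch(lengths::Vector{Int}, slices::Int)
    total = sum(lengths)
    ends = cumsum(lengths)            # global index of the last element of each array
    # Translate a global element index into (array index, offset within that array).
    function locate(g)
        f = findfirst(>=(g), ends)
        return (f, g - (ends[f] - lengths[f]))
    end
    result = Tuple{Int,Int,Int,Int}[]
    for i in 1:slices
        lo = div((i - 1) * total, slices) + 1   # first global element of slice i
        hi = div(i * total, slices)             # last global element of slice i
        push!(result, (locate(lo)..., locate(hi)...))
    end
    return result
end

# Two arrays with 5 and 3 elements, cut into 2 slices of 4 elements each.
parts = slicesofSketch([5, 3], 2)
```

Here the second slice starts at element 5 of the first array and ends at element 3 of the second array, which is why a slice needs both a start and an end position.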
GigaSOM.vcollectSlice — Method
vcollectSlice(loadMtx, (startFile, startOff, finalFile, finalOff)::Tuple{Int,Int,Int,Int})::Matrix
Given a method to obtain matrix content (loadMtx), reconstruct a slice from the information generated by slicesof.
This function is specialized for reconstructing matrices and arrays, where the "element counts" split by slicesof are in fact matrix rows. The function is therefore named vcollect (the slicing and concatenation is vertical).
The actual data content and loading method is abstracted out – function loadMtx gets the index of the input part that it is required to fetch (e.g. index of one FCS file), and is expected to return that input part as a whole matrix. vcollectSlice correctly calls this function as required and extracts relevant portions of the matrices, so that at the end the whole slice can be pasted together.
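The reconstruction described above can be sketched without any files, using in-memory matrices in place of the inputs. `vcollectSliceSketch` is a hypothetical name and only approximates the behaviour described here.

```julia
# Hypothetical sketch of the slice reconstruction; not the library code.
function vcollectSliceSketch(loadMtx, (startFile, startOff, finalFile, finalOff)::Tuple{Int,Int,Int,Int})
    parts = map(startFile:finalFile) do i
        m = loadMtx(i)                                # whole i-th input matrix
        from = i == startFile ? startOff : 1          # trim the head of the first part
        to = i == finalFile ? finalOff : size(m, 1)   # trim the tail of the last part
        m[from:to, :]
    end
    return vcat(parts...)                             # paste the row ranges vertically
end

# Two in-memory matrices stand in for the input files (5 and 3 rows).
inputs = [reshape(collect(1.0:10.0), 5, 2), reshape(collect(11.0:16.0), 3, 2)]
slice = vcollectSliceSketch(i -> inputs[i], (1, 5, 2, 3))
```

The loader is called once per input part that the slice touches, so a worker only has to read the files that overlap its slice.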
Example:
# get a list of files
filenames=["a.fcs", "b.fcs"]
# get descriptions of 5 equally sized parts of the data
slices = slicesof(loadFCSSizes(filenames), 5)
# reconstruct first 3 columns of the first slice
mySlice = vcollectSlice(
    i -> last(loadFCS(filenames[i]))[:,1:3],
    slices[1])
# (note: function loadFCS returns 4 items, the matrix is the last one)
GigaSOM.dtransform_asinh — Function
dtransform_asinh(dInfo::Dinfo, columns::Vector{Int}, cofactor=5)
Transform columns of the dataset by asinh transformation with cofactor.
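For orientation, a single-process sketch of the asinh transformation is shown below. It assumes the convention common in cytometry, asinh(x / cofactor); `transformAsinhSketch!` is a hypothetical name, and the library version operates on distributed data instead of a local matrix.

```julia
# Hypothetical in-memory equivalent; assumes the transformation asinh(x / cofactor).
function transformAsinhSketch!(data::Matrix{Float64}, columns::Vector{Int}, cofactor = 5)
    data[:, columns] = asinh.(data[:, columns] ./ cofactor)
    return data
end

cells = [0.0 10.0; 5.0 20.0; 50.0 30.0]
transformAsinhSketch!(cells, [1])   # only the first column is transformed
```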
SOM training
GigaSOM.distributedEpoch — Method
distributedEpoch(dInfo::Dinfo, codes::Matrix{Float64}, tree)
Execute the doEpoch in parallel on workers described by dInfo and collect the results. Returns a pair of numerator and denominator matrices.
GigaSOM.doEpoch — Method
doEpoch(x::Array{Float64, 2}, codes::Array{Float64, 2}, tree)
vectors and the adjustment in radius after each epoch.
Arguments:
- x: training Data
- codes: Codebook
- tree: knn-compatible tree built upon the codes
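The numerator/denominator pair can be understood through a simplified batch step. The sketch below is an assumption-laden illustration: `epochSketch` is a hypothetical name, a brute-force search replaces the kNN tree, the denominator is kept as a plain vector, and the neighborhood kernel used in real training is left out.

```julia
# Hypothetical single-process sketch; brute-force search replaces the kNN tree.
function epochSketch(x::Matrix{Float64}, codes::Matrix{Float64})
    numerator = zeros(size(codes))         # per-neuron sums of assigned rows
    denominator = zeros(size(codes, 1))    # per-neuron counts of assigned rows
    for r in 1:size(x, 1)
        # Best-matching unit: the codebook row closest to the data row.
        dists = [sum(abs2, x[r, :] .- codes[c, :]) for c in 1:size(codes, 1)]
        bmu = argmin(dists)
        numerator[bmu, :] .+= x[r, :]
        denominator[bmu] += 1
    end
    return numerator, denominator
end

x = [0.0 0.0; 0.2 0.0; 1.0 1.0]
codes = [0.0 0.0; 1.0 1.0]
num, den = epochSketch(x, codes)
# Sums and counts from several workers can be added up before this division.
updated = num ./ max.(den, 1)
```

Keeping sums and counts separate is what makes the step easy to parallelize: partial results are additive, and the division happens once at the end.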
GigaSOM.initGigaSOM — Function
function initGigaSOM(ncol::Int64,
                     means::Vector{Float64}, sdevs::Vector{Float64},
                     xdim::Int64, ydim::Int64 = xdim;
                     seed = rand(Int), rng = StableRNG(seed))
Generate a stable random initial SOM with the random distribution that matches the parameters.
Arguments:
- ncol: number of desired data columns
- means, sdevs: vectors that describe the data distribution, both of size ncol
- xdim, ydim: size of the SOM
- seed: a seed (defaults to a random seed from the current default random generator)
- rng: a random number generator to be used (defaults to a StableRNG initialized with the seed)
Returns: a new Som structure
GigaSOM.initGigaSOM — Method
function initGigaSOM(data::Dinfo,
                     xdim::Int64, ydim::Int64 = xdim;
                     seed=rand(Int), rng=StableRNG(seed))
initGigaSOM overload for working with distributed-style Dinfo data. The rest of the arguments are passed to the data-independent initGigaSOM.
Arguments:
- data: a Dinfo object with the distributed dataset matrix
GigaSOM.initGigaSOM — Method
initGigaSOM(data, args...)
Initializes a SOM by random selection from the training data. A generic overload that works for matrices and DataFrames that can be coerced to Matrix{Float64}. Other arguments are passed to the data-independent initGigaSOM.
Arguments:
- data: matrix of data for running the initialization
GigaSOM.mapToGigaSOM — Method
mapToGigaSOM(som::Som, data;
             knnTreeFun = BruteTree,
             metric = Euclidean())
Overload of mapToGigaSOM for simple DataFrames and matrices. This slices the data using DistributedArrays, sends them to the workers, and runs normal mapToGigaSOM. Data is unscattered after the computation.
GigaSOM.mapToGigaSOM — Method
mapToGigaSOM(som::Som, dInfo::Dinfo;
    knnTreeFun = BruteTree, metric = Euclidean(),
    output::Symbol=tmp_symbol(dInfo))::Dinfo
Compute the index of the BMU for each row of the input data.
Arguments
- som: a trained SOM
- dInfo: Dinfo that describes the loaded and distributed data
- knnTreeFun: Constructor of the KNN-tree (e.g. from NearestNeighbors package)
- metric: Passed as metric argument to the KNN-tree constructor
- output: Symbol to save the result, defaults to tmp_symbol(dInfo)
Data must have the same number of dimensions as the training dataset and will be normalised with the same parameters.
GigaSOM.scaleEpochTime — Method
scaleEpochTime(iteration::Int64, epochs::Int64)
Convert iteration ID and epoch number to relative time in training.
GigaSOM.trainGigaSOM — Method
trainGigaSOM(som::Som, train;
             kwargs...)
Overload of trainGigaSOM for simple DataFrames and matrices. This slices the data, distributes them to the workers, and runs normal trainGigaSOM. Data is unscattered after the computation.
GigaSOM.trainGigaSOM — Method
trainGigaSOM(
    som::Som,
    dInfo::Dinfo;
    kernelFun::Function = gaussianKernel,
    metric = Euclidean(),
    somDistFun = distMatrix(Chebyshev()),
    knnTreeFun = BruteTree,
    rStart = 0.0,
    rFinal = 0.1,
    radiusFun = expRadius(-5.0),
    epochs = 20,
    eachEpoch = (e, r, som) -> nothing,
)
Arguments:
- som: object of type Som with an initialised som
- dInfo: Dinfo object that describes a loaded dataset
- kernelFun::Function: optional distance kernel; one of (bubbleKernel, gaussianKernel), default is gaussianKernel
- metric: Passed as metric argument to the KNN-tree constructor
- somDistFun: Function for computing the distances in the SOM map
- knnTreeFun: Constructor of the KNN-tree (e.g. from NearestNeighbors package)
- rStart: optional training radius. If zero (default), it is computed from the SOM grid size.
- rFinal: target radius at the last epoch, defaults to 0.1
- radiusFun: Function that generates radius decay, e.g. linearRadius or expRadius(10.0)
- epochs: number of SOM training iterations (default 20)
- eachEpoch: a function to call back after each epoch, accepting arguments (epochNumber, radius, som). For simplicity, this gets additionally called once before the first epoch, with epochNumber set to zero.
GigaSOM.bubbleKernel — Method
bubbleKernel(x, r::Float64)
Return a "bubble" (spherical) distribution kernel.
GigaSOM.distMatrix — Function
distMatrix(metric=Chebyshev())
Return a function that uses the metric (compatible with metrics from the Distances package) to calculate distance matrices from normal row-wise data matrices.
Use as a parameter of trainGigaSOM.
GigaSOM.expRadius — Function
expRadius(steepness::Float64)
Return a function to be used as a radiusFun of trainGigaSOM, which causes exponential decay with the selected steepness.
Use: trainGigaSOM(..., radiusFun = expRadius(0.5))
Arguments
- steepness: Steepness of exponential descent. Good values range from -100.0 (almost linear) to 100.0 (really quick decay).
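To give a feel for exponential radius decay, the sketch below interpolates geometrically between a start and a final radius over the relative training time. It is an assumption-based illustration only: `relativeTime` and `expRadiusSketch` are hypothetical names, iterations are assumed to be numbered from 1, and the `steepness` adjustment of the library function is not modelled.

```julia
# Hypothetical helpers; the library's own formula additionally uses `steepness`.
# Relative training time in [0, 1]; assumes iterations are numbered from 1.
relativeTime(iteration::Int, epochs::Int) = epochs > 1 ? (iteration - 1) / (epochs - 1) : 0.0

# Geometric interpolation: the radius shrinks by a constant factor each epoch.
function expRadiusSketch(initRadius::Float64, finalRadius::Float64, iteration::Int, epochs::Int)
    return initRadius * (finalRadius / initRadius)^relativeTime(iteration, epochs)
end

# Radius schedule for 5 epochs, from 10.0 down to 0.1.
radii = [expRadiusSketch(10.0, 0.1, i, 5) for i in 1:5]
```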
GigaSOM.gaussianKernel — Method
gaussianKernel(x, r::Float64)
Return the value of normal distribution PDF (σ=r, μ=0) at x.
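The density in question is exp(-x² / (2r²)) / (r·sqrt(2π)). A minimal sketch that evaluates it for scalars or arrays is given below; `gaussianKernelSketch` is a hypothetical name used to avoid suggesting this is the library's code.

```julia
# Hypothetical name; evaluates the zero-mean normal density with standard deviation r.
gaussianKernelSketch(x, r::Float64) = exp.(-(x .^ 2) ./ (2 * r^2)) ./ (r * sqrt(2π))

peak = gaussianKernelSketch(0.0, 1.0)                  # density at the mean
weights = gaussianKernelSketch([0.0, 1.0, 2.0], 1.0)   # weights fall off with distance
```

In SOM training, x would be the map distance between a neuron and the best-matching unit, so nearby neurons receive larger weights than distant ones.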
GigaSOM.gridRectangular — Method
gridRectangular(xdim, ydim)
Create coordinates of all neurons on a rectangular SOM.
The return-value is an array of size (Number-of-neurons, 2) with x- and y- coordinates of the neurons in the first and second column respectively. The distance between neighbours is 1.0. The point of origin is bottom-left. The first neuron sits at (0,0).
Arguments
- xdim: number of neurons in x-direction
- ydim: number of neurons in y-direction
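A rectangular grid matching this description can be generated as sketched below. `gridRectangularSketch` is a hypothetical name, and the order in which neurons are enumerated (x varying fastest) is an assumption.

```julia
# Hypothetical sketch; the row ordering (x varies fastest) is an assumption.
function gridRectangularSketch(xdim::Int, ydim::Int)
    grid = zeros(Float64, xdim * ydim, 2)
    for y in 0:(ydim - 1), x in 0:(xdim - 1)
        row = y * xdim + x + 1
        grid[row, 1] = x    # first column: x-coordinate
        grid[row, 2] = y    # second column: y-coordinate
    end
    return grid
end

g = gridRectangularSketch(3, 2)
# Chebyshev distance between the first and the last neuron on the map.
span = maximum(abs.(g[1, :] .- g[end, :]))
```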
GigaSOM.linearRadius — Method
linearRadius(initRadius::Float64, iteration::Int64, decay::String, epochs::Int64)
Return a neighbourhood radius. Use as the radiusFun parameter for trainGigaSOM.
Arguments
- initRadius: Initial Radius
- finalRadius: Final Radius
- iteration: Training iteration
- epochs: Total number of epochs
GigaSOM.thresholdKernel — Function
thresholdKernel(x, r::Float64)
Simple FlowSOM-like hard-threshold kernel.
Embedding
GigaSOM.embedGigaSOM — Method
embedGigaSOM(som::GigaSOM.Som,
             data;
             knnTreeFun = BruteTree,
             metric = Euclidean(),
             k::Int64=0,
             adjust::Float64=1.0,
             smooth::Float64=0.0,
             m::Float64=10.0)
Overload of embedGigaSOM for simple DataFrames and matrices. This slices the data using DistributedArrays, sends them to the workers, and runs normal embedGigaSOM. All data is properly unscattered after the computation.
Examples:
Produce a 2-column matrix with 2D cell coordinates:
e = embedGigaSOM(som, data)
Plot the result using 2D histogram from Gadfly:
using Gadfly
draw(PNG("output.png",20cm,20cm),
     plot(x=e[:,1], y=e[:,2],
     Geom.histogram2d(xbincount=200, ybincount=200)))
GigaSOM.embedGigaSOM — Method
embedGigaSOM(som::GigaSOM.Som,
             dInfo::Dinfo;
             knnTreeFun = BruteTree,
             metric = Euclidean(),
             k::Int64=0,
             adjust::Float64=1.0,
             smooth::Float64=0.0,
             m::Float64=10.0,
             output::Symbol=tmp_symbol(dInfo))::Dinfo
Return a data frame with X,Y coordinates of EmbedSOM projection of the data.
Arguments:
- som: a trained SOM
- dInfo: Dinfo that describes the loaded dataset
- knnTreeFun: Constructor of the KNN-tree (e.g. from NearestNeighbors package)
- metric: Passed as metric argument to the KNN-tree constructor
- k: number of nearest neighbors to consider (high values get quadratically slower)
- adjust: position adjustment parameter (higher values avoid non-local approximations)
- smooth: approximation smoothness (the higher the value, the larger the neighborhood of approximate local linearity of the projection)
- m: exponential decay rate for the score when approaching the k+1-th neighbor distance
- output: variable name for storing the distributed result
Data must have the same number of dimensions as the training dataset, and must be normalized using the same parameters.
GigaSOM.embedGigaSOM_internal — Method
embedGigaSOM_internal(som::GigaSOM.Som,
                      data::Matrix{Float64},
                      tree,
                      k::Int64,
                      adjust::Float64,
                      boost::Float64,
                      m::Float64)
Internal function to compute parts of the embedding on a prepared kNN-tree structure (tree) and smooth converted to boost.