Functions

Data structures

GigaSOM.Som – Type
Som

Structure to hold all data of a trained SOM.

Fields:

  • codes::Array{Float64,2}: 2D-array of codebook vectors. One vector per row
  • xdim::Int: number of neurons in x-direction
  • ydim::Int: number of neurons in y-direction
  • numCodes::Int: total number of neurons
  • grid::Array{Float64,2}: 2D-array of coordinates of neurons on the map (2 columns (x,y) for rectangular and hexagonal maps, 3 columns (x,y,z) for spherical maps)

Data loading and preparation

GigaSOM.distributeFCSFileVector – Function
distributeFCSFileVector(name::Symbol, fns::Vector{String}, pids=workers())::Dinfo

Distribute among the workers a vector of integers that describes, for each cell, which file from fns it comes from. Useful for producing per-file statistics. The vector is saved on the workers specified by pids as a distributed variable name.

GigaSOM.distributeFileVector – Function
distributeFileVector(name::Symbol, sizes::Vector{Int}, slices::Vector{Tuple{Int,Int,Int,Int}}, pids=workers())::Dinfo

Generalized version of distributeFCSFileVector that produces the integer vector from any sizes and slices.
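
As a rough, non-distributed illustration of the integer vector described above, the per-cell file indices can be built from the file sizes alone. `fileVectorSketch` is a hypothetical helper, not part of GigaSOM, and it ignores the slicing and the distribution to workers:

```julia
# Hypothetical helper: for every cell, store the index of the file it came from.
# This only shows the content of the vector, not how it is distributed.
function fileVectorSketch(sizes::Vector{Int})
    return [i for (i, n) in enumerate(sizes) for _ in 1:n]
end

# two cells from file 1, three cells from file 2
fileIdx = fileVectorSketch([2, 3])
```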

GigaSOM.getCSVSize – Method
function getCSVSize(fn::String; args...)::Tuple{Int,Int}

Read the dimensions (number of rows and columns, respectively) from a CSV file fn. args are passed to the function CSV.File.

Example

getCSVSize("test.csv", header=false)

GigaSOM.getFCSSize – Method
getFCSSize(offsets, params)::Tuple{Int,Int}

Convert the offsets and keywords from an FCS file to cell and parameter count, respectively.

GigaSOM.loadCSV – Method
function loadCSV(fn::String; args...)::Matrix{Float64}

CSV equivalent of loadFCS. The metadata (header, column names) are not extracted. args are passed to CSV.read.

GigaSOM.loadCSVSet – Function
function loadCSVSet(
    name::Symbol,
    fns::Vector{String},
    pids = workers();
    postLoad = (d, i) -> d,
    csvargs...,
)::Dinfo

CSV equivalent of loadFCSSet. csvargs are passed as keyword arguments to CSV-loading functions.

GigaSOM.loadCSVSizes – Method
function loadCSVSizes(fns::Vector{String}; args...)::Vector{Int}

Determine number of rows in a list of CSV files (passed as fns). Equivalent to loadFCSSizes.

GigaSOM.loadFCS – Method
loadFCS(fn::String; applyCompensation::Bool=true)::Tuple{Dict{String,String}, Matrix{Float64}}

Read an FCS file. Return a tuple that contains in order:

  • dictionary of the keywords contained in the file
  • raw column names
  • prettified and annotated column names
  • raw data matrix

If applyCompensation is set, the function parses and retrieves a spillover matrix (if any valid keyword in the FCS is found that would contain it) and applies it to compensate the data.

GigaSOM.loadFCSHeader – Method
loadFCSHeader(fn::String)::Tuple{Vector{Int}, Dict{String,String}}

Efficiently extract data offsets and keyword dictionary from an FCS file.

GigaSOM.loadFCSSet – Function
loadFCSSet(name::Symbol, fns::Vector{String}, pids=workers(); applyCompensation=true, postLoad=(d,i)->d)::Dinfo

This runs the FCS loading machinery in a distributed way, so that the files fns (with full path) are sliced into equal parts and saved as a distributed variable name on workers specified by pids.

applyCompensation is passed to loadFCS function.

See slicesof for description of the slicing.

postLoad is applied to the loaded FCS file data (and the index) – use this function to e.g. filter out certain columns right on loading, using selectFCSColumns.

The loaded dataset can be manipulated by the distributed functions, e.g.

  • dselect for removing columns
  • dscale for normalization
  • dtransform_asinh (and others) for transformation
  • etc.

GigaSOM.loadFCSSizes – Method
loadFCSSizes(fns::Vector{String})

Load cell counts in many FCS files at once. Useful as input for slicesof.

GigaSOM.selectFCSColumns – Method
selectFCSColumns(selectColnames::Vector{String})

Return a function useful with loadFCSSet, which loads only the specified (prettified) column names from the FCS files. Use getMetaData, getMarkerNames and cleanNames! to retrieve the usable column names for an FCS file.

GigaSOM.cleanNames! – Method
cleanNames!(mydata::Vector{String})

Replaces problematic characters in column names, avoids duplicate names, and prefixes an '_' if the name starts with a number.

Arguments:

  • mydata: vector of names (gets modified)

GigaSOM.compensate! – Method
compensate!(data::Matrix{Float64}, spillover::Matrix{Float64}, cols::Vector{Int})

Apply a compensation matrix in spillover (the individual columns of which describe, in order, the spillover of cols in data) to the matrix data in-place.
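
A minimal sketch of what such a compensation step can look like, assuming the common convention that the observed signal equals the true signal multiplied by the spillover matrix (so compensation multiplies by its inverse). `compensateSketch!` is a hypothetical stand-in, not the library function, and the orientation of the matrix is an assumption:

```julia
using LinearAlgebra

# Hypothetical in-place compensation: multiply the selected columns by the
# inverse of the spillover matrix (written here as a right division).
function compensateSketch!(data::Matrix{Float64}, spillover::Matrix{Float64}, cols::Vector{Int})
    data[:, cols] = data[:, cols] / spillover
    return data
end

spill = [1.0 0.1; 0.0 1.0]          # 10% of channel 1 leaks into channel 2
truth = [10.0 20.0 7.0]             # the third column is left uncompensated
observed = copy(truth)
observed[:, 1:2] = truth[:, 1:2] * spill   # simulate the measured signal
compensateSketch!(observed, spill, [1, 2]) # recover the original values
```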

GigaSOM.getMarkerNames – Method
getMarkerNames(meta::DataFrame)::Tuple{Vector{String}, Vector{String}}

Extract suitable raw names (useful for selecting columns) and pretty readable names (useful for humans) from FCS file metadata.

GigaSOM.getMetaData – Method
getMetaData(f)

Collect the metadata in a more user-friendly format.

Arguments:

  • f: input structure with .params and .data fields

GigaSOM.getSpillover – Method
getSpillover(params::Dict{String, String})::Union{Tuple{Vector{String},Matrix{Float64}}, Nothing}

Get a spillover matrix from FCS params. Returns a pair with description of columns to be applied, and with the actual spillover matrix. Returns nothing in case spillover is not present.

GigaSOM.parseSpillover – Method
parseSpillover(str::String)::Union{Tuple{Vector{String},Matrix{Float64}}, Nothing}

Parse the spillover matrix from the string value of an FCS parameter.
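
To make the input format concrete, here is a hedged sketch of such a parser. It assumes the usual layout of the FCS `$SPILLOVER` keyword value: the number of channels, then the channel names, then the matrix values in row-major order, all separated by commas. `parseSpilloverSketch` is hypothetical and does less validation than a real parser would:

```julia
# Hypothetical parser for a value such as "2,FL1,FL2,1,0.1,0.2,1".
# Returns (names, matrix), or nothing if the field count does not fit.
function parseSpilloverSketch(str::String)
    fields = split(str, ',')
    n = tryparse(Int, fields[1])
    n === nothing && return nothing
    length(fields) == 1 + n + n * n || return nothing
    names = String.(fields[2:n+1])
    vals = parse.(Float64, fields[n+2:end])
    # values are assumed to be row-major, so transpose the column-major reshape
    return (names, permutedims(reshape(vals, n, n)))
end

names, spill = parseSpilloverSketch("2,FL1,FL2,1,0.1,0.2,1")
```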

GigaSOM.collectSlice – Method
collectSlice(loadVec, (startFile, startOff, finalFile, finalOff)::Tuple{Int,Int,Int,Int})::Vector

Alternative to vcollectSlice for 1D vectors.

GigaSOM.slicesof – Method
slicesof(lengths::Vector{Int}, slices::Int)::Vector{Tuple{Int,Int,Int,Int}}

Given a list of lengths of input arrays, compute a slicing into the specified number of equally sized slices.

The output is a vector of 4-tuples, each of which specifies how to create one slice. The i-th tuple contains, in order:

  • the index of input array at which the i-th slice begins
  • first element of the i-th slice in that input array
  • the index of input array with the last element of the i-th slice
  • the index of the last element of the i-th slice in that array
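
The tuple layout above can be illustrated with a small sketch. `slicesSketch` is a hypothetical re-implementation; in particular, the way slice boundaries are rounded is an assumption and may differ from what slicesof does:

```julia
# Hypothetical sketch of the slicing; boundary rounding is an assumption.
function slicesSketch(lengths::Vector{Int}, slices::Int)
    total = sum(lengths)
    ends = cumsum(lengths)   # global index of the last element of each input
    # translate a global element index into (input index, offset in that input)
    function locate(g)
        f = searchsortedfirst(ends, g)
        return (f, g - (ends[f] - lengths[f]))
    end
    return map(1:slices) do i
        lo = div((i - 1) * total, slices) + 1   # first global element of slice i
        hi = div(i * total, slices)             # last global element of slice i
        (locate(lo)..., locate(hi)...)
    end
end

# two inputs of 10 elements each, cut into 4 slices of 5 elements
s = slicesSketch([10, 10], 4)
```

With inputs of unequal length a slice may span several inputs, which is why each tuple carries separate start and end input indices.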

GigaSOM.vcollectSlice – Method
vcollectSlice(loadMtx, (startFile, startOff, finalFile, finalOff)::Tuple{Int,Int,Int,Int})::Matrix

Given a method to obtain matrix content (loadMtx), reconstruct a slice from the information generated by slicesof.

This function is specialized for reconstructing matrices and arrays, where the "element counts" split by slicesof are in fact matrix rows. The function is therefore named vcollect (the slicing and concatenation is vertical).

The actual data content and loading method is abstracted out – function loadMtx gets the index of the input part that it is required to fetch (e.g. index of one FCS file), and is expected to return that input part as a whole matrix. vcollectSlice correctly calls this function as required and extracts relevant portions of the matrices, so that at the end the whole slice can be pasted together.

Example:

# get a list of files
filenames=["a.fcs", "b.fcs"]
# get descriptions of 5 equally sized parts of the data
slices = slicesof(loadFCSSizes(filenames), 5)

# reconstruct first 3 columns of the first slice
mySlice = vcollectSlice(
    i -> last(loadFCS(filenames[i]))[:,1:3],
    slices[1])
# (note: the data matrix is the last item of the tuple returned by loadFCS)
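
The reconstruction itself can be sketched without any FCS files. `collectRowsSketch` below is a hypothetical stand-in that follows the description above: it asks the loader for each input part touched by the slice and keeps only the relevant rows.

```julia
# Hypothetical stand-in for the row-wise reconstruction: `loadPart(i)` returns
# the i-th input matrix, and the 4-tuple has the layout produced by slicesof.
function collectRowsSketch(loadPart, (startFile, startOff, finalFile, finalOff))
    pieces = map(startFile:finalFile) do f
        m = loadPart(f)
        lo = f == startFile ? startOff : 1           # cut the head of the first part
        hi = f == finalFile ? finalOff : size(m, 1)  # cut the tail of the last part
        m[lo:hi, :]
    end
    return reduce(vcat, pieces)   # vertical concatenation of the pieces
end

parts = [[1 2; 3 4; 5 6], [7 8; 9 10]]
rows = collectRowsSketch(i -> parts[i], (1, 2, 2, 1))
```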

GigaSOM.dtransform_asinh – Function
dtransform_asinh(dInfo::Dinfo, columns::Vector{Int}, cofactor=5)

Transform columns of the dataset by asinh transformation with cofactor.
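
For reference, a non-distributed sketch of the transformation, assuming the usual cytometry convention of computing asinh(x / cofactor). `asinhSketch!` is a hypothetical helper working on a plain matrix rather than on a Dinfo:

```julia
# Hypothetical local version: apply asinh(x / cofactor) to the chosen columns.
function asinhSketch!(data::Matrix{Float64}, columns::Vector{Int}, cofactor = 5)
    data[:, columns] .= asinh.(data[:, columns] ./ cofactor)
    return data
end

m = [0.0 100.0; 5.0 200.0]
asinhSketch!(m, [1])   # only the first column is transformed
```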


SOM training

GigaSOM.distributedEpoch – Method
distributedEpoch(dInfo::Dinfo, codes::Matrix{Float64}, tree)

Execute doEpoch in parallel on the workers described by dInfo and collect the results. Returns a pair of numerator and denominator matrices.

GigaSOM.doEpoch – Method
doEpoch(x::Array{Float64, 2}, codes::Array{Float64, 2}, tree)

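Train the SOM for one epoch on the data x. This is the building block for the batch update of the codebook vectors and the adjustment in radius after each epoch.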

Arguments:

  • x: training Data
  • codes: Codebook
  • tree: knn-compatible tree built upon the codes
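
A sketch of what one such pass over the data might accumulate, assuming the standard batch-SOM scheme: each data row is assigned to its nearest codebook vector, and per-neuron sums (numerator) and counts (denominator) are collected. `epochSketch` is hypothetical; it uses a brute-force search instead of the kNN tree and leaves out the neighbourhood kernel:

```julia
# Hypothetical single-epoch accumulation (brute-force search, no kNN tree).
function epochSketch(x::Matrix{Float64}, codes::Matrix{Float64})
    num = zeros(size(codes))        # per-neuron sum of assigned data rows
    den = zeros(size(codes, 1))     # per-neuron count of assigned data rows
    for i in 1:size(x, 1)
        # squared Euclidean distance from row i to every codebook vector
        d = [sum(abs2, x[i, :] .- codes[c, :]) for c in 1:size(codes, 1)]
        bmu = argmin(d)             # best matching unit
        num[bmu, :] .+= x[i, :]
        den[bmu] += 1
    end
    return num, den
end

num, den = epochSketch([0.0 0.0; 0.2 0.0; 1.0 1.0], [0.0 0.0; 1.0 1.0])
```

Sums and counts computed on separate workers can simply be added together, which is what makes this form convenient for distributed training.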

GigaSOM.initGigaSOM – Function
function initGigaSOM(ncol::Int64,
                     means::Vector{Float64}, sdevs::Vector{Float64},
                     xdim::Int64, ydim::Int64 = xdim;
                     seed = rand(Int), rng = StableRNG(seed))

Generate a stable random initial SOM with the random distribution that matches the parameters.

Arguments:

  • ncol: number of desired data columns
  • means, sdevs: vectors that describe the data distribution, both of size ncol
  • xdim, ydim: Size of the SOM
  • seed: a seed (defaults to a random seed from the current default random generator)
  • rng: a random number generator to be used (defaults to a StableRNG initialized with the seed)

Returns: a new Som structure

GigaSOM.initGigaSOM – Method
function initGigaSOM(data::Dinfo,
                     xdim::Int64, ydim::Int64 = xdim;
                     seed=rand(Int), rng=StableRNG(seed))

initGigaSOM overload for working with distributed-style Dinfo data. The rest of the arguments are passed to the data-independent initGigaSOM.

Arguments:

  • data: a Dinfo object with the distributed dataset matrix

GigaSOM.initGigaSOM – Method
initGigaSOM(data, args...)

Initializes a SOM by random selection from the training data. A generic overload that works for matrices and DataFrames that can be coerced to Matrix{Float64}. Other arguments are passed to the data-independent initGigaSOM.

Arguments:

  • data: matrix of data for running the initialization

GigaSOM.mapToGigaSOM – Method
mapToGigaSOM(som::Som, data;
             knnTreeFun = BruteTree,
             metric = Euclidean())

Overload of mapToGigaSOM for simple DataFrames and matrices. This slices the data using DistributedArrays, sends the slices to the workers, and runs the normal mapToGigaSOM. The data is unscattered after the computation.

GigaSOM.mapToGigaSOM – Method
mapToGigaSOM(som::Som, dInfo::Dinfo;
    knnTreeFun = BruteTree, metric = Euclidean(),
    output::Symbol=tmp_symbol(dInfo))::Dinfo

Compute the index of the BMU for each row of the input data.

Arguments

  • som: a trained SOM
  • dInfo: Dinfo that describes the loaded and distributed data
  • knnTreeFun: Constructor of the KNN-tree (e.g. from NearestNeighbors package)
  • metric: Passed as metric argument to the KNN-tree constructor
  • output: Symbol to save the result, defaults to tmp_symbol(dInfo)

Data must have the same number of dimensions as the training dataset and will be normalised with the same parameters.

GigaSOM.scaleEpochTime – Method
scaleEpochTime(iteration::Int64, epochs::Int64)

Convert iteration ID and epoch number to relative time in training.

GigaSOM.trainGigaSOM – Method
trainGigaSOM(som::Som, train;
             kwargs...)

Overload of trainGigaSOM for simple DataFrames and matrices. This slices the data, distributes the slices to the workers, and runs the normal trainGigaSOM. The data is unscattered after the computation.

GigaSOM.trainGigaSOM – Method
trainGigaSOM(
    som::Som,
    dInfo::Dinfo;
    kernelFun::Function = gaussianKernel,
    metric = Euclidean(),
    somDistFun = distMatrix(Chebyshev()),
    knnTreeFun = BruteTree,
    rStart = 0.0,
    rFinal = 0.1,
    radiusFun = expRadius(-5.0),
    epochs = 20,
    eachEpoch = (e, r, som) -> nothing,
)

Arguments:

  • som: object of type Som with an initialised som
  • dInfo: Dinfo object that describes a loaded dataset
  • kernelFun::Function: optional distance kernel, one of bubbleKernel or gaussianKernel (default: gaussianKernel)
  • metric: Passed as metric argument to the KNN-tree constructor
  • somDistFun: Function for computing the distances in the SOM map
  • knnTreeFun: Constructor of the KNN-tree (e.g. from NearestNeighbors package)
  • rStart: optional training radius. If zero (default), it is computed from the SOM grid size.
  • rFinal: target radius at the last epoch, defaults to 0.1
  • radiusFun: Function that generates radius decay, e.g. linearRadius or expRadius(10.0)
  • epochs: number of SOM training iterations (default 20)
  • eachEpoch: a function to call back after each epoch, accepting arguments (epochNumber, radius, som). For simplicity, this gets additionally called once before the first epoch, with epochNumber set to zero.

GigaSOM.distMatrix – Function
distMatrix(metric=Chebyshev())

Return a function that calculates distance matrices from normal row-wise data matrices, using the given metric (compatible with metrics from the Distances package).

Use as a parameter of trainGigaSOM.
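
The "function that returns a function" pattern can be sketched in plain Julia. Both names below are hypothetical: `chebyshevSketch` stands in for a metric from the Distances package, and `distMatrixSketch` for the closure-producing function:

```julia
# Plain-Julia stand-in for a Chebyshev metric (assumption for this sketch).
chebyshevSketch(a, b) = maximum(abs.(a .- b))

# Return a closure that turns a row-wise coordinate matrix into a
# matrix of pairwise distances between its rows.
function distMatrixSketch(metric = chebyshevSketch)
    return grid -> [metric(grid[i, :], grid[j, :])
                    for i in 1:size(grid, 1), j in 1:size(grid, 1)]
end

grid = [0.0 0.0; 1.0 0.0; 1.0 2.0]
d = distMatrixSketch()(grid)
```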

GigaSOM.expRadius – Function
expRadius(steepness::Float64)

Return a function to be used as a radiusFun of trainGigaSOM, which causes exponential decay with the selected steepness.

Use: trainGigaSOM(..., radiusFun = expRadius(0.5))

Arguments

  • steepness: Steepness of exponential descent. Good values range from -100.0 (almost linear) to 100.0 (really quick decay).

GigaSOM.gridRectangular – Method
gridRectangular(xdim, ydim)

Create coordinates of all neurons on a rectangular SOM.

The return-value is an array of size (Number-of-neurons, 2) with x- and y- coordinates of the neurons in the first and second column respectively. The distance between neighbours is 1.0. The point of origin is bottom-left. The first neuron sits at (0,0).

Arguments

  • xdim: number of neurons in x-direction
  • ydim: number of neurons in y-direction
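
The layout described above can be sketched as follows. `gridRectSketch` is a hypothetical re-implementation, and the order in which neurons are enumerated (x varies fastest here) is an assumption:

```julia
# Hypothetical re-implementation; the neuron ordering is an assumption.
function gridRectSketch(xdim::Int, ydim::Int)
    grid = zeros(Float64, xdim * ydim, 2)
    for y in 0:ydim-1, x in 0:xdim-1
        row = y * xdim + x + 1   # row index of the neuron at (x, y)
        grid[row, 1] = x
        grid[row, 2] = y
    end
    return grid
end

g = gridRectSketch(3, 2)   # 6 neurons, neighbours are 1.0 apart
```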

GigaSOM.linearRadius – Method
linearRadius(initRadius::Float64, finalRadius::Float64, iteration::Int64, epochs::Int64)

Return a neighbourhood radius. Use as the radiusFun parameter for trainGigaSOM.

Arguments

  • initRadius: Initial Radius
  • finalRadius: Final Radius
  • iteration: Training iteration
  • epochs: Total number of epochs
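
To show how a radius function with these four arguments can behave, here is a hedged sketch of two decay shapes. All names are hypothetical, the mapping of the iteration to a relative time in [0, 1] is an assumption, and the exponential variant is a plain geometric interpolation without the steepness parameter of expRadius:

```julia
# Relative training time in [0, 1]; this mapping is an assumption of the sketch.
relTime(iteration, epochs) = epochs > 1 ? (iteration - 1) / (epochs - 1) : 0.0

# Straight-line interpolation between the initial and the final radius.
linearSketch(r0, rf, iteration, epochs) = r0 + (rf - r0) * relTime(iteration, epochs)

# Geometric interpolation: the radius shrinks by a constant factor per epoch.
expSketch(r0, rf, iteration, epochs) = r0 * (rf / r0)^relTime(iteration, epochs)

radii = [linearSketch(10.0, 0.1, i, 20) for i in 1:20]
```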

Embedding

GigaSOM.embedGigaSOM – Method
embedGigaSOM(som::GigaSOM.Som,
             data;
             knnTreeFun = BruteTree,
             metric = Euclidean(),
             k::Int64=0,
             adjust::Float64=1.0,
             smooth::Float64=0.0,
             m::Float64=10.0)

Overload of embedGigaSOM for simple DataFrames and matrices. This slices the data using DistributedArrays, sends the slices to the workers, and runs the normal embedGigaSOM. All data is properly unscattered after the computation.

Examples:

Produce a 2-column matrix with 2D cell coordinates:

e = embedGigaSOM(som, data)

Plot the result using 2D histogram from Gadfly:

using Gadfly
draw(PNG("output.png",20cm,20cm),
     plot(x=e[:,1], y=e[:,2],
     Geom.histogram2d(xbincount=200, ybincount=200)))

GigaSOM.embedGigaSOM – Method
embedGigaSOM(som::GigaSOM.Som,
             dInfo::Dinfo;
             knnTreeFun = BruteTree,
             metric = Euclidean(),
             k::Int64=0,
             adjust::Float64=1.0,
             smooth::Float64=0.0,
             m::Float64=10.0,
             output::Symbol=tmp_symbol(dInfo))::Dinfo

Compute the X,Y coordinates of the EmbedSOM projection of the data. The result is stored as a distributed variable (see output) and described by the returned Dinfo.

Arguments:

  • som: a trained SOM
  • dInfo: Dinfo that describes the loaded dataset
  • knnTreeFun: Constructor of the KNN-tree (e.g. from NearestNeighbors package)
  • metric: Passed as metric argument to the KNN-tree constructor
  • k: number of nearest neighbors to consider (high values get quadratically slower)
  • adjust: position adjustment parameter (higher values avoid non-local approximations)
  • smooth: approximation smoothness (the higher the value, the larger the neighborhood of approximate local linearity of the projection)
  • m: exponential decay rate for the score when approaching the k+1-th neighbor distance
  • output: variable name for storing the distributed result

Data must have the same number of dimensions as the training dataset, and must be normalized using the same parameters.

GigaSOM.embedGigaSOM_internal – Method
embedGigaSOM_internal(som::GigaSOM.Som,
                      data::Matrix{Float64},
                      tree,
                      k::Int64,
                      adjust::Float64,
                      boost::Float64,
                      m::Float64)

Internal function to compute parts of the embedding on a prepared kNN-tree structure (tree), with smooth already converted to boost.
