Distributed processing and HPC environments

Distributed processing in Julia is represented mainly by the package Distributed.jl.

COBREXA.jl can utilize this existing system to run large parallelizable analyses almost transparently on multiple CPU cores and on multiple computers connected through a network. Ultimately, the approach scales to thousands of computing nodes in large HPC facilities.

Here, we give a short overview of how to work in the distributed environment and utilize the resources for COBREXA.jl analyses.

Starting the distributed workers

COBREXA.jl follows the structure imposed by the Distributed package: You operate a main (usually called "master") computation node, connect to multiple other computers and start worker Julia processes there, and distribute the workload across this cluster.

To start, you need to load the package and add a few processes. This starts 5 processes locally:

using Distributed
addprocs(5)
`Distributed.jl` installation

Distributed.jl usually comes pre-installed with the Julia distribution, but you may still need to "enable" it by typing ] add Distributed.
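If you prefer to manage packages programmatically rather than from the Pkg REPL, the same can be done from a script (a minimal sketch, equivalent to the command above):

import Pkg
Pkg.add("Distributed")   # same effect as typing ] add Distributed in the Pkg REPL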

You may check that the workers are really there using workers(). In this case, it should return a vector of worker IDs, very likely equal to [2, 3, 4, 5, 6].
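For illustration, a quick sanity check of the freshly created cluster may look as follows (the exact IDs depend on how many processes you have added before):

workers()    # the worker IDs, e.g. [2, 3, 4, 5, 6]
nworkers()   # the number of workers, e.g. 5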

If you have compute resources available via a network, you may connect these as well, provided you have secure shell (ssh) access to them. You will likely want to establish key-based authentication (refer to the ssh documentation) to make the connection easier.

From a shell, check that you can ssh to a remote node and run Julia there:

user@pc> ssh server
...
user@server> julia
...
julia> _
Running shell commands from Julia

If you don't want to quit your Julia session to try out the ssh connection from the shell, press ; at the beginning of the line in the Julia prompt. The REPL switches to shell mode and executes your next line as a shell command.
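For example, the connection test may look like this without ever leaving the REPL (a sketch; the exact prompts and output depend on your setup):

julia> ;                           # the prompt changes to shell>
shell> ssh server julia --version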

If this works for you, you can add workers that run on the server directly from the Julia session running on your pc. For example, the following starts 20 workers on the remote server and 10 workers on your friend's computer called joe_pc:

addprocs([("server", 20), ("joe_pc", 10)])

With this, you can schedule various computations on the workers; see the Distributed section of the Julia manual for basic details. You may also try convenience packages such as DistributedArrays.jl and DistributedData.jl to process data in a distributed fashion.
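As a small illustration (not specific to COBREXA.jl), pmap from Distributed is one way to spread a function over the available workers and collect the results:

using Distributed

# compute the squares of 1..100, distributing the evaluations over all available workers
squares = pmap(i -> i^2, 1:100)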

Running a distributed analysis

While not all COBREXA functions can be parallelized naturally, those that can accept a special workers argument that specifies the list of worker IDs to which the computation should be distributed. As the value, you can specify your desired worker IDs manually (e.g. [2, 3, 4]), or simply use workers().

For example, flux_variability_analysis can naturally parallelize the computation of the minima and maxima of all reaction fluxes, finishing the computation faster. To enable the parallelization, you first need to make sure that all workers have loaded both the COBREXA package and the optimizer:

using COBREXA, GLPK, Distributed
addprocs(10)                       # add any kind and number of processes here
@everywhere using COBREXA, GLPK    # loads the necessary packages on all workers

When the packages are loaded and precompiled everywhere, you may load your model and run the FVA with the workers parameter:

model = load_model("e_coli_core.xml")
result = flux_variability_analysis(model, GLPK.Optimizer; workers=workers())

With the extra computing capacity of N workers available, the FVA should finish roughly N times faster.

Distribution and parallelization overhead

Communication between the workers and your Julia session is not free. If the task that you are parallelizing is small and the model structure is very large, the distributed computation will spend most of its time just distributing the large model to the workers, and almost no time executing the small parallel tasks. In such a case, the performance will not improve by adding more resources. Check that the computation task is sufficiently large before investing extra resources into distributed execution. Amdahl's and Gustafson's laws give a good overview of the consequences of this overhead.
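To check whether the distribution actually pays off for your model and analysis, you can simply time both variants (a sketch reusing the FVA example from above):

@time flux_variability_analysis(model, GLPK.Optimizer)                      # serial run
@time flux_variability_analysis(model, GLPK.Optimizer; workers=workers())   # distributed run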

Interacting with HPC schedulers

Many researchers have access to institutional HPC facilities that time-share the capacity of a large computer cluster among many users. Julia and COBREXA.jl work well in this environment, but your programs require some additional customization to find and utilize the resources allocated by the HPC.

In our case, this reduces to a relatively complex task: you need to find out how many resources were allocated for your job, and you need to add the remote workers precisely at the places that were allocated for you. Fortunately, the package ClusterManagers.jl can do precisely that.

For simplicity, we will assume that your HPC is scheduled by Slurm.

Adding the workers from Slurm is done as follows:

  • you import the ClusterManagers package
  • you find how many processes to spawn from the SLURM_NTASKS environment variable
  • you use the function addprocs_slurm to precisely connect to your allocated computational resources

The Julia script that does a parallel analysis may then start as follows:

using COBREXA, Distributed, ClusterManagers

available_workers = parse(Int, ENV["SLURM_NTASKS"])

addprocs_slurm(available_workers)

...
result = flux_variability_analysis(...; workers=workers())
...

After adding the Slurm workers, you may continue as if the workers were added using normal addprocs, and (for example) run the flux_variability_analysis as shown above.

What about the other HPC schedulers?

ClusterManagers.jl supports many other common HPC scheduling systems, including LSF, Sun Grid Engine (SGE), PBS, and Scyld, in a way almost identical to Slurm. See the package documentation for details.
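For example, on a cluster scheduled by Sun Grid Engine, the workers would typically be added with addprocs_sge instead (a sketch; NSLOTS is the usual SGE variable that holds the allocated slot count, and the details may differ at your site):

using ClusterManagers

addprocs_sge(parse(Int, ENV["NSLOTS"]))   # analogous to addprocs_slurm above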

Wrapping your script in a Slurm job

To be able to submit your script for later processing using the sbatch Slurm command, you need to wrap it in a small "batch" script that tells Slurm how many resources the process needs.

Assuming you have a Julia computation script written down in myJob.jl and saved on your HPC cluster's access node, the corresponding Slurm batch script (let's call it myJob.sbatch) may look as follows:

#!/bin/bash -l
#SBATCH -n 100           # the job will require 100 individual workers
#SBATCH -c 1             # each worker will sit on a single CPU
#SBATCH -t 30            # the whole job will take less than 30 minutes
#SBATCH -J myJob         # the name of the job

module load lang/Julia   # this is usually required to make Julia available to your job

julia myJob.jl

To run the computation, simply run sbatch myJob.sbatch on the access node. The job will be scheduled and eventually executed. You may watch the output of sacct and squeue in the meantime to see the progress.

Remember that you need to explicitly save the results of your Julia script's computation to files to be able to retrieve them later; the standard output of the jobs is often mangled or discarded. If you still want to collect the standard output, you may change the last line of the batch script to

julia myJob.jl > myJob.log

and collect the output from the log later. This is especially convenient if you log various computation details using @info and similar macros.
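As noted above, it is best to save the actual results to files explicitly. A minimal sketch using the Serialization standard library (the file name fva_result.jls is arbitrary):

using Serialization

result = flux_variability_analysis(model, GLPK.Optimizer; workers=workers())
serialize("fva_result.jls", result)   # write the result to a file
# ...later, possibly on another machine:
# result = deserialize("fva_result.jls")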