We have the privilege to be part of the VSC and have private nodes at VSC-5 (since 2022), VSC-4 (since 2020), and VSC-3 (since 2014), which was retired in 2022.
Access is primarily via SSH:
ssh to VSC

```bash
ssh user@vsc5.vsc.ac.at
ssh user@vsc4.vsc.ac.at
```
Please follow the connection instructions on the wiki; they are similar for all other servers (e.g. SRVX1).
The VSC is only reachable from within UNINET (e.g. via VPN). Authentication requires a mobile phone.
We have private nodes at our disposal; in order to use them, you need to specify the correct account in the jobs you submit to the queueing system (SLURM). The correct information will be given to you in the registration email.
IMGW customizations in the shell
If you want, you can use some shared shell scripts that provide information about the VSC system.
Load IMGW environment settings
```bash
# run the install script, which just appends to your PATH variable
/gpfs/data/fs71386/imgw/install_imgw.sh
```
The following commands are then available:
imgw-quota shows the current quota on VSC for both HOME and DATA
imgw-container singularity/apptainer container run script, see below
imgw-transfersh Transfer.sh service on wolke, to easily share small files
imgw-cpuinfo Show CPU information
A shared folder is available at /gpfs/data/fs71386/imgw/shared; add data there that needs to be used by multiple people, and please remove it again as soon as possible. Thanks.
Node Information VSC-5
There are usually two sockets per node, i.e. 2 CPUs per node.
VSC-5 Compute Node

```
CPU model: AMD EPYC 7713 64-Core Processor
2 CPU, 64 physical cores per CPU, total 256 logical CPU units
512 GB Memory
```
We have access to 11 private nodes of that kind, as well as 1 GPU node with Nvidia A100 accelerators. Find the partition information with:
VSC-5 Quality of Service

```
$ sqos
qos name         type  total res  used res  free res  walltime     priority  total n*  used n*  free n*
========================================================================================================
p71386_0512      cpu        2816      2816         0  10-00:00:00    100000        11       11        0
p71386_a100dual  gpu           2         0         2  10-00:00:00    100000         1        0        1
* node values do not always align with resource values since nodes can be partially allocated
```
Storage on VSC-5
The HOME and DATA partitions are the same as on VSC-4.
Since fall 2023, after a major update, JET and VSC-5 are connected: your files on JET are now accessible from VSC-5, e.g.:
JET and VSC-5

```
# a directory on JET
/jetfs/home/[username]
# can be found on VSC-5 at
/gpfs/jetfs/home/[username]
```
JETFS on VSC
JETFS can only be accessed from VSC-5, not the other way around.
You can also write directly to these directories, although the performance is higher on VSC-5 storage. This does not work on VSC-4.
Node Information VSC-4
VSC-4 Compute Node

```
CPU model: Intel(R) Xeon(R) Platinum 8174 CPU @ 3.10GHz
2 CPU, 24 physical cores per CPU, total 96 logical CPU units
378 GB Memory
```
We have access to 5 private nodes of that kind. We also have access to the JupyterHub on VSC. Check with:
VSC-4 Quality of Service

```
$ sqos
qos name              type  total res  used res  free res  walltime     priority  total n*  used n*  free n*
=============================================================================================================
p71386_0384           cpu         480       288       192  10-00:00:00    100000         5        3        2
skylake_0096_jupyter  cpu         288        12       276   3-00:00:00      1000         3        1        2
* node values do not always align with resource values since nodes can be partially allocated
```
Storage on VSC-4
All quotas are shared between all IMGW/Project users:
$HOME (up to 100 GB, all home directories)
$DATA (up to 10 TB, backed up)
$BINFL (up to 1 TB, fast scratch; will be retired)
$BINFS (up to 2 GB, fast SSD; will be retired)
$TMPDIR (50% of main memory, deleted after the job finishes)
/local (compute nodes, 480 GB SSD, deleted after the job finishes)
Check the quotas by running the following commands yourself (insert your PROJECTID), or use the imgw-quota command from the IMGW shell extensions.
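If you prefer to query the filesystem directly: VSC storage runs on GPFS, where quotas can be listed with mmlsquota. A minimal sketch, assuming the filesets follow the home_fs&lt;PROJECTID&gt; / data_fs&lt;PROJECTID&gt; naming seen in the paths above:

```bash
# query the GPFS quotas directly
# (fileset names are an assumption; replace 71386 with your PROJECTID)
mmlsquota --block-size auto -j home_fs71386 home
mmlsquota --block-size auto -j data_fs71386 data
```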
We have access to the Earth Observation Data Center (EODC), where one can find primarily the following data sets:
Sentinel-1, 2, 3
Wegener Center GPS RO
These datasets can be found directly under /eodc/products/.
We have been given a private data storage location (/eodc/private/uniwien), where we can store up to 22 TB on VSC-4. However, that might change in the future.
We have to use the following keywords to make sure that the correct partitions are used:
--partition=mem_xxxx (per email)
--qos=xxxxxx (see below)
--account=xxxxxx (see below)
The core hours will be charged to the specified account. If not specified, the default account will be used.
Put the following in the job file (example for VSC-5 nodes):
VSC slurm example job

```bash
#!/bin/bash
#
#SBATCH -J TEST_JOB
#SBATCH -N 2
#SBATCH --ntasks-per-node=16
#SBATCH --ntasks-per-core=1
#SBATCH --mail-type=BEGIN                # first have to state the type of event to occur
#SBATCH --mail-user=<email@address.at>   # and then your email address
#SBATCH --partition=zen3_0512
#SBATCH --qos=p71386_0512
#SBATCH --account=p71386
#SBATCH --time=<time>

# when srun is used, you need to set (different from Jet):
srun -l -N2 -n32 a.out
# or
mpirun -np 32 a.out
```
-J job name
-N number of nodes requested (16 cores per node available)
-n, --ntasks= specifies the number of tasks to run
--ntasks-per-node number of processes run in parallel on a single node
--ntasks-per-core number of tasks a single core should work on
srun is an alternative command to mpirun. It provides direct access to SLURM inherent variables and settings.
-l adds task-specific labels to the beginning of all output lines.
--mail-type sends an email at specific events. The SLURM documentation lists the following valid mail-type values: "BEGIN, END, FAIL, REQUEUE, ALL (equivalent to BEGIN, END, FAIL and REQUEUE), TIME_LIMIT, TIME_LIMIT_90 (reached 90 percent of time limit), TIME_LIMIT_80 (reached 80 percent of time limit), and TIME_LIMIT_50 (reached 50 percent of time limit). Multiple type values may be specified in a comma separated list." (cited from the SLURM documentation)
--mail-user sends an email to this address
slurm basic commands

```bash
sbatch check.slrm     # to submit the job
squeue -u `whoami`    # to check the status of your own jobs
scancel JOBID         # for premature removal, where JOBID is obtained from the previous command
```
Example of multiple simulations inside one job
Sample job for running multiple MPI jobs on a VSC-4 node.
Note: mem_per_task should be set such that
mem_per_task * mytasks < mem_per_node - 2 GB
The roughly 2 GB reduction in available memory accounts for the operating system held in memory. For a standard node with 96 GB of memory this would be, e.g.:
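The job sketch below illustrates this pattern; the executable my_model, its inputs, and the 24-task split are placeholders, chosen so that 24 * 3900 MB stays below 96 GB - 2 GB:

```bash
#!/bin/bash
#SBATCH -J MULTI_SIM
#SBATCH -N 1
#SBATCH --ntasks-per-node=24
#SBATCH --partition=mem_0384
#SBATCH --qos=p71386_0384
#SBATCH --account=p71386

# 24 tasks sharing one 96 GB node: mem_per_task = (96 - 2) / 24, i.e. roughly 3900 MB
for i in $(seq 1 24); do
    # --exclusive gives each job step its own cores; --mem caps the step's memory
    # ./my_model and input_${i} are placeholders for your executable and inputs
    srun --exclusive -N 1 -n 1 --mem=3900M ./my_model input_${i} &
done
wait   # keep the job alive until all background steps have finished
```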
Modules

```bash
module avail          # lists the available applications, compilers, parallel environments, and libraries
module list           # shows the currently loaded packages of your session
module unload <xyz>   # unload a particular package <xyz> from your session
module load <xyz>     # load a particular package <xyz> into your session
```
Loading a compiler module, e.g. module load intel, will load the Intel compiler suite and add variables to your environment.
Please do not forget to add the module load statements to your jobs.
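For example, a minimal job skeleton with a module load could look like this (the module name intel and the executable are placeholders; pick the exact module name from module avail):

```bash
#!/bin/bash
#SBATCH -J MODULE_TEST
#SBATCH -N 1

# load the required software environment before running anything
module load intel   # placeholder; use the exact name from `module avail`
./my_program        # placeholder executable
```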
It is possible to install user site packages into your .local/lib/python3.* directory:
installing python packages in your HOME

```bash
# installing a user site package
pip install --user [package]
```
Please remember that all HOME and DATA quotas are shared; installing a lot of packages creates a lot of files!
Python importing user site packages

```python
import sys, site
sys.path.append(site.getusersitepackages())  # this adds the correct path
```
Then you will be able to load all packages that are located in the user site.
Containers
We can use complex software that is packaged in singularity containers (doc) and can be executed on VSC-4. Please consider using one of the following containers:
py3centos7anaconda3-2020-07-dev
located in the $DATA directory of IMGW: /gpfs/data/fs71386/imgw
How to use?
Currently there is only one container with a run script.
Bash

```bash
# The directory of the containers, and the general run script usage
/gpfs/data/fs71386/imgw/run.sh [arguments]
# executing the python inside
/gpfs/data/fs71386/imgw/run.sh python
# or ipython
/gpfs/data/fs71386/imgw/run.sh ipython
# with other arguments
/gpfs/data/fs71386/imgw/run.sh python analysis.py
```
Understanding the container
In principle, a run script needs to do only 3 things:
load the module singularity
set SINGULARITY_BIND environment variable
execute the container with your arguments
It is necessary to set SINGULARITY_BIND because the $HOME and $DATA (or $BINFS) paths are not standard Linux paths; the Linux inside the container does not know about them, so files on them cannot be accessed from within the container. If you have problems accessing other paths in the future, adding them to SINGULARITY_BIND might fix the issue.
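A minimal sketch of such a run script, with the three steps as comments (the bind paths are assumptions derived from the storage locations above):

```bash
#!/bin/bash
# 1. load the singularity module
module load singularity
# 2. bind the non-standard GPFS paths so that e.g. $HOME and $DATA are visible inside
export SINGULARITY_BIND="/gpfs/home,/gpfs/data"   # assumed paths; adjust if needed
# 3. execute the container, passing along all arguments given to this script
/gpfs/data/fs71386/imgw/py3centos7anaconda3-2020-07-dev.sif "$@"
```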
In principle, one can also execute the container directly like this:
Bash

```bash
# check if the module is loaded
$ module load singularity
# just run the container, initiating the built-in runscript (running ipython):
$ /gpfs/data/fs71386/imgw/py3centos7anaconda3-2020-07-dev.sif
Python 3.8.3 (default, Jul  2 2020, 16:21:59)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.16.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]:
In [2]: %env DATA
Out[2]: '/gpfs/data/fs71386/USER'
In [3]: ls /gpfs/data/fs71386/USER
ls: cannot access /gpfs/data/fs71386/USER: No such file or directory
# Please note that the path is not available, because we did not set SINGULARITY_BIND
```
This shows you some information about the container, e.g. that CentOS 7 is installed, with Python 3.8 and glibc 2.17.
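If you want to query such details yourself, singularity's inspect subcommand prints the image's labels and metadata:

```bash
# show the labels/metadata of the container image
$ singularity inspect /gpfs/data/fs71386/imgw/py3centos7anaconda3-2020-07-dev.sif
```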
You can also check the applications inside:
Execute commands inside a container
```bash
# List all executables inside the container
$ py3centos7anaconda3-2020-07-dev.sif ls /opt/view/bin
# or use conda to show the environment
$ py3centos7anaconda3-2020-07-dev.sif conda info
# for the package list
$ py3centos7anaconda3-2020-07-dev.sif conda list
```
Currently (June 2021) there is no development queue on VSC-4, and the support team suggested the following:
Debugging on VSC-4

```bash
# Request resources from slurm (-N 1, a full node)
$ salloc -N 1 -p mem_0384 --qos p71386_0384 --no-shell
# Once the node is assigned / the job is running, check with
$ squeue -u $USER
# connect to the node with ssh
$ ssh [Node]
# test and debug the model there
```
Otherwise, you can use one of the *_devel queues/partitions and submit short test jobs to check your setup.
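To see which partitions (including the *_devel ones) exist and what their time limits are, the standard SLURM sinfo command can help:

```bash
# list all partitions with their time limits and node counts
sinfo -o "%P %l %D"
```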