Slurm
Introduction
The Slurm platform provides a multi-node HPC environment based on the Slurm workload manager and Open OnDemand. The platform is accessible from a web browser through the Open OnDemand web interface, or via SSH.
Launch configuration
Warning
Platforms and their names are visible to all members of the cloud project!
Option | Explanation |
---|---|
Platform name | A name to identify the Slurm platform. |
External IP | Use the plus button to assign an external IP address to your cloud project if the list is empty, and then select an IP to assign to the login node of your Slurm platform. |
Compute node count | The number of Slurm compute (worker) nodes to configure for your Slurm platform. |
Compute node size | The size of the Slurm compute (worker) nodes. The options in this menu are set by the cloud operator, and the number of CPUs and quantity of RAM are displayed for each size. |
Run post-configuration validation? | Run a small suite of tests to check that the Slurm platform is functioning as expected. |
Advanced
Platform monitoring
A Grafana dashboard for system monitoring is included in the platform and is accessible from the platforms page. It shows general current and historical system information.
Additionally, Open OnDemand presents monitoring dashboards for each Slurm job.
Root access
The azimuth user has passwordless sudo. Only this user can ssh between nodes, so to get sudo access to a non-login node, ssh as azimuth from the login node first, then use sudo.
Note that node names can be retrieved from the /etc/hosts file on the login node, e.g.:
cat /etc/hosts
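The lookup and hop described above can be sketched as a small shell helper. This is a minimal illustration, not part of the platform: the function name list_node_names is hypothetical, and it assumes the usual hosts-file layout of an IP address followed by one or more names.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first host name of each entry in a hosts
# file (default: /etc/hosts), skipping comments, blank lines and
# localhost/IPv6 helper entries.
list_node_names() {
    local hosts_file="${1:-/etc/hosts}"
    awk '
        /^[[:space:]]*#/         { next }  # skip comment lines
        NF < 2                   { next }  # skip blank or malformed lines
        $2 ~ /^(localhost|ip6-)/ { next }  # skip loopback and IPv6 helper names
        { print $2 }                       # first name after the IP address
    ' "$hosts_file"
}

# Typical use on the login node, as the azimuth user:
#   list_node_names              # pick a node name from the output
#   ssh azimuth@<node-name>      # <node-name> is a placeholder
#   sudo -i                      # root shell on that node
```

The placeholder <node-name> stands for whichever name appears in your own /etc/hosts; actual node names depend on the platform name chosen at launch.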
Additional software
Software installed directly via sudo will be lost when the platform is upgraded, as upgrades are performed by reimaging all nodes with a new image.
Where possible, it is preferable to package additional software for use via Apptainer, which is installed on all Slurm platforms. Apptainer supports both SIF and Docker/OCI container formats.
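As a rough sketch of the container workflow, the helpers below wrap the standard apptainer pull and apptainer exec commands. The function names, the image reference and the SIF file name are all illustrative assumptions. The srun line also assumes the SIF file sits on a filesystem visible to the compute nodes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert a Docker/OCI image to a SIF file, unless the
# SIF file already exists.
# Arguments: <sif-file> <docker-image-reference>
pull_image() {
    local sif="$1" image="$2"
    if [ ! -f "$sif" ]; then
        apptainer pull "$sif" "docker://${image}"
    fi
}

# Hypothetical helper: run a command inside the container image.
# Arguments: <sif-file> <command> [args...]
run_in_container() {
    local sif="$1"
    shift
    apptainer exec "$sif" "$@"
}

# Example (image and file names are illustrative only):
#   pull_image python.sif python:3.12-slim
#   run_in_container python.sif python3 --version
#   srun apptainer exec python.sif python3 --version   # same command on a compute node
```

Because the image is a single file in the user's own storage rather than software installed with sudo, it does not depend on what is installed in the node image.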
Some software is also available via the EESSI pilot repository; follow the EESSI instructions to use it.
If these methods are not appropriate and the software has wide applicability, consider making a PR to the Ansible Slurm Appliance, which contains code for building images and configuring Slurm that is used by Azimuth.
Created: July 31, 2024