Ceres downtime is scheduled for Monday, August 23 – Wednesday, September 1, 2021
By: Marina Kraeva | 08-10-2021
Ceres cluster is back up.
This maintenance window will be longer than normal as there are several important hardware upgrades occurring during this window to enhance the overall power and capacity of the CERES HPC cluster. These upgrades include the remaining new priority nodes, sixty eight additional compute nodes, two additional high memory compute nodes, six management nodes, and faster Infiniband switching technology used by the HPC nodes to access storage. VRSC will re-rack and re-wire the whole cluster to accommodate additional hardware while adhering to power and cooling limits.
Queued jobs will not start if they cannot complete by 7AM August 23. These include jobs submitted to the long partition with the default 3-weeks long time limit. In the output of the squeue command the reason for those jobs will state (ReqNodeNotAvail, Reserved for maintenance). The jobs will start after the scheduled outage completes.
The Atlas cluster will stay up and running during Ceres downtime. All Ceres users can run jobs on Atlas. If you don’t have a large enough project quota on Atlas, remember that you can use /90daydata on Atlas that has no quotas.
Update on September 2:
We regret that the Ceres maintenance window must be extended until 5PM CT Friday, September 3.
The delay is due to issues we’ve encountered with the new Infiniband equipment needed to expand Ceres and improve cluster performance. Due to supply chain shortages, the failed components cannot be easily and rapidly replaced and the team is creatively working with what’s available to get back online.
We’re working on bringing up at least a part of the cluster as soon as possible, and will notify everyone via email when logins are permitted.
Update on September 3, 2021:
Ceres is back up. It has most of its nodes up, and single node jobs are running.