SCINet Past Scheduled Outages
The table below lists information about past SCINet outages. See the SCINet Forum Announcements page (a SCINet account is required for access) for communications about emergency outages.
Ceres Storage Updates · Ceres - All · 2024
Ceres will be unavailable for maintenance starting at 4pm CDT on Friday, October 11th. The final sync for the cutover to the new all-flash Vast storage appliance will start then. Below you will find information on the new storage implementation and the transition so far, as well as actions to take if you have queued jobs or would like to run jobs with the new storage.
Highlights:
Maintenance to cut over to the new storage starts at 4pm CDT 10/11/24 and is planned to run through 10/15/24.
- Users with jobs that will be held over the maintenance will be required to issue scontrol release <JobID> commands for them to start.
- Retired storage will be available in a read-only state for a limited time.
- /90daydata-old will be available, read-only, for 90 days while the data ages out.
- /project-old will be available, read-only, until the final sync to the new /project is done.
- If the final sync takes longer than the maintenance window, the new /project will become available as soon as the sync finishes.
- New /90daydata will be immediately available (and empty).
What is happening:
Ceres is transitioning to a new storage appliance. After the maintenance, /project, /90daydata, and /home directories will all be served from the new Vast appliance. It has several performance and resilience advantages over the retiring storage. Most notable for users is the transition to all-flash storage instead of the traditional spinning disks used by the retiring storage.
The reason a maintenance window has been dedicated to this cutover is to ensure the smooth and complete transfer of data from the old /project to the new one. Since /project holds over 2PB of data spread across more than 1 billion individual files, the transfer takes a considerable amount of time. VRSC has been copying data from /project to the new storage for the last few weeks in preparation. We’ve been limiting the transfer to about 100TB/day to avoid impacting running jobs. Since files have been added, removed, and overwritten in the normal day-to-day operations of the cluster while this initial sync has been taking place, a final, complete sync must be done to capture the complete state of the retiring filesystem.
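The exact tooling VRSC uses for the sync isn’t stated here, but the pattern described is the familiar two-pass copy; a minimal sketch using rsync, with an illustrative destination mount point:
# Repeatable bulk passes while the filesystem stays live
# (/mnt/vast/project is an illustrative path, not the real mount):
rsync -aH /project/ /mnt/vast/project/
# Final pass during the maintenance, with writes stopped, to capture
# anything added, changed, or removed since the earlier passes:
rsync -aH --delete /project/ /mnt/vast/project/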
What you may have to do:
- After the maintenance, check whether you have jobs waiting in the queue with the squeue --me command.
- Review your jobs waiting in the queue to identify what storage they need. Run the scontrol show job <JobID> command to view information on a submitted job.
- Release your jobs AFTER reading the following.
Jobs will be placed into a held state over the maintenance. This is being done to prevent them from running automatically, since the storage may not be in the state the jobs rely on to run. If the final sync for /project takes longer than the maintenance window, we’re going to make the cluster available without it so jobs can be run in /90daydata. The retiring filesystems will still be available at /90daydata-old (for 90 days while the data ages out) and /project-old (until the sync to the new /project completes). If you have jobs that are in a held state and you have confirmed that they will be able to access the directories and data they need, you can put them back into the queue with the scontrol release <JobID> command. A combined sketch of these steps follows.
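Taken together, the post-maintenance steps look like this (<JobID> is a placeholder for the numeric ID shown by squeue):
squeue --me                  # list your queued and held jobs
scontrol show job <JobID>    # check fields such as WorkDir and Command for the storage the job needs
scontrol release <JobID>     # release the job once its data is confirmed available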
If you have jobs that will require data in its original location in /project and the sync hasn’t finished yet:
You can either wait for the sync to finish before running scontrol release <JobID>, or you can cancel the jobs with scancel <JobID>, copy the data to /90daydata, and start new jobs working out of /90daydata. This is the most likely scenario for most jobs.
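A minimal sketch of that workflow, with illustrative group, directory, and script names:
scancel <JobID>                                                    # cancel the held job
rsync -a /project-old/mygroup/mydata/ /90daydata/mygroup/mydata/   # copy the inputs
sbatch --chdir=/90daydata/mygroup myjob.sh                         # resubmit with paths updated to /90daydata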
If you have data in /90daydata-old that you would like to use:
Transfer data from /90daydata-old to the new /90daydata to keep it for another 90 days.
If the new /project isn’t available yet and you would like to run jobs with data from /project-old:
Transfer data from /project-old to /90daydata and run your jobs from /90daydata. Directly referencing /project-old in jobs is NOT recommended as that storage mount will be removed when the sync is finished.
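Both transfers follow the same pattern (group and directory names are illustrative):
rsync -a /90daydata-old/mygroup/keep/ /90daydata/mygroup/keep/     # retain data for another 90 days
rsync -a /project-old/mygroup/inputs/ /90daydata/mygroup/inputs/   # stage data for jobs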
If you have any questions or concerns, or if you need help after the maintenance, please feel free to contact us at scinet_vrsc@usda.gov.
Maintenance · Ceres - All · 2024
Ceres cluster maintenance is scheduled for June 17-21, 2024 (the week of Juneteenth).
During the maintenance, the following major modifications to Ceres will take place in addition to the usual maintenance items:
- System software updates:
- Ceres will be transitioned from running AlmaLinux to Red Hat Enterprise Linux.
- Infiniband switches will be updated.
- Storage:
- A new Vast storage appliance will be added to the cluster.
- The new storage will eventually replace existing storage hardware.
- Data will not be moved to the new storage during the maintenance.
- Hardware management:
- Old ethernet switches will be removed.
- IPA migration:
- The identity management system will be migrated to a new domain.
- Some users will need to perform a one-time account migration action after the maintenance.
Queued jobs will not start if they cannot complete by 6AM June 17. In the output of the squeue command the reason for those jobs will state (ReqNodeNotAvail, Reserved for maintenance). The jobs will start after the scheduled outage completes.
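To see the hold reason explicitly, one option is squeue’s format string (%i is the job ID, %T the state, %r the reason):
squeue --me -o "%.12i %.10T %r"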
The Atlas cluster will be available during the Ceres maintenance. Make sure to copy data from Ceres to Atlas prior to the maintenance, if needed.
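One way to stage data is rsync over SSH via the Atlas DTNs; the hostname below appears in SCINet documentation, but verify hosts and paths in the current data transfer guide, and note the source and destination directories are illustrative:
rsync -av /project/mygroup/mydata/ atlas-dtn.hpc.msstate.edu:/project/mygroup/mydata/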
Please submit any questions you may have via email to scinet_vrsc@usda.gov.
Maintenance · Atlas - All · 2024
The Atlas compute cluster is scheduled for downtime/upgrade beginning April 30 at 6am Central and lasting through May 1. Downtime is required to repair a chilled water line for the cooling system. Lack of cooling during this repair will necessitate the shutdown of Atlas.
Taking advantage of this shutdown, the operating system on Atlas will be upgraded from CentOS 7.8 to the Rocky 9.x distribution of Linux. The upgrade will also bring a newer software stack, so users may need to recompile their software.
The Ceres system will not be affected by this maintenance.
An announcement will be made once the system is returned to operational status.
Any issues/problems should be addressed to the help desk.
scinet_vrsc@usda.gov
help-usda@hpc.msstate.edu
OS and Network Update · Ceres - All · 2024
Ceres cluster maintenance is scheduled for February 19-21, 2024 (Presidents’ Day, and the following two days).
During the maintenance, the following major modifications to Ceres will take place in addition to the usual maintenance items:
- Operating System:
- Ceres will be transitioned from running AlmaLinux to Red Hat Enterprise Linux
- Network:
- Updates on existing switches
- Installation of new switches
- Recabling of Ceres to accommodate the new switches
Queued jobs will not start if they cannot complete by 6AM February 19. In the output of the squeue command the reason for those jobs will state (ReqNodeNotAvail, Reserved for maintenance). The jobs will start after the scheduled outage completes.
The Atlas cluster will be available during the Ceres maintenance. Make sure to copy data from Ceres to Atlas prior to the maintenance, if needed.
Please submit any questions you may have via email to scinet_vrsc@usda.gov.
Holiday · VRSC Support - All · 2023
Due to the upcoming holiday, there will not be any VRSC support available from December 25-27.
Please direct all questions to scinet_vrsc@usda.gov.
Holiday · VRSC Support - All · 2023
Due to the upcoming holiday, there will not be any VRSC support available from November 22-24.
Please direct all questions to scinet_vrsc@usda.gov.
Maintenance · Galaxy - Ceres - Tuesday, November 14 · 2023
Galaxy will be unavailable between 9AM and 5PM on 11/14/2023.
Downtime is required to change the location of Galaxy-related paths from /90daydata to /project.
Background - Galaxy saves upload, output, and intermediate files in /90daydata on Ceres. The 90daydata file system has been experiencing frequent performance issues that are causing job timeouts and, in some extreme cases, job failures.
Changes - During the maintenance, the paths for upload, output, and intermediate files will be set to /project, as that filesystem is performant and still under warranty. This is our current best option for Galaxy.
Notes - Only new files created after the maintenance will be saved in /project; existing files will remain on /90daydata (reminder: these files will be purged by the filesystem after 90 days, so please save them elsewhere).
Maintenance · Site Service - Ames - Monday, October 23 · 2023
Site Service at Ames will be impacted by essential maintenance. Outages are expected and the entire window is reserved.
Software Update · Ceres - All - Monday, October 9 · 2023
Ceres cluster maintenance is scheduled for October 9-10, 2023 (Indigenous Peoples Day, and the following day), to update system software.
During the maintenance we will also upgrade Open OnDemand to version 3 and BeeGFS file system to version 7.4.
Queued jobs will not start if they cannot complete by 6AM October 9. In the output of the squeue command the reason for those jobs will state (ReqNodeNotAvail, Reserved for maintenance). The jobs will start after the scheduled outage completes.
The Atlas cluster will be available during the Ceres maintenance. Make sure to copy data from Ceres to Atlas prior to the maintenance, if needed.
Please submit any questions you may have via email to scinet_vrsc@usda.gov.
Maintenance · Site Service - Ames - Friday, September 29 · 2023
ARS SCINet Site Service Ames will be unavailable while Internet2 circuit vendor Lumen performs circuit maintenance. The entire window is reserved.
Maintenance · Site Service NAL - Beltsville · 2023
Site Service NAL (Beltsville) will be unavailable while Fiberlight engineers perform maintenance. Outages are expected. The entire maintenance window is reserved.
Maintenance · Backbone - NAL · 2023
Backbone NAL-NAL will be unavailable while Fiberlight engineers perform maintenance. Outages are expected. The entire maintenance window is reserved.
Maintenance · Site Service - AMES, NAL - Tuesday, August 29 · 2023
Site Service at AMES and NAL will be impacted while Internet2 performs maintenance to upgrade core nodes. Outages are expected and the entire window is reserved.
Emergency Maintenance · Site Service - Ames - Friday, July 7 · 2023
The listed asset will be unavailable while vendor Internet2 performs software maintenance and troubleshooting tasks on core1.eqch. Multiple 20-minute hard-down events are expected. The entire window is reserved. This will not affect the Ceres cluster or its jobs.
Maintenance · Site Service - Beltsville - Friday, June 23 · 2023
Site Service Beltsville will be unavailable while Fiberlight engineers perform maintenance. Outages are expected. The entire maintenance window is reserved.
Maintenance · Site Service - Stoneville - Thursday, June 22 · 2023
Site Service at Stoneville will be impacted while Internet2 performs maintenance to upgrade core nodes. Outages are expected and the entire window is reserved.
Maintenance · Site Service - Multiple locations - Wednesday, June 21 · 2023
Site Service at Fort Collins, Albany & Clay Center will be impacted while Internet2 performs maintenance to upgrade core nodes. Outages are expected and the entire window is reserved.
Maintenance · Site Service - Multiple - Tuesday, June 20 · 2023
Site Service at Ames & Beltsville will be impacted while Internet2 performs maintenance to upgrade core nodes. Outages are expected and the entire window is reserved.
System Update · Ceres - All · 2023
Ceres cluster maintenance is scheduled for the week of June 19 to update system software. The cluster will be down for several days.
Maintenance · Site Service - Beltsville - Sunday, June 18 · 2023
Site Service Beltsville will be unavailable while Fiberlight engineers perform maintenance. Outages are expected. The entire maintenance window is reserved.
Maintenance · Juno - All - Tuesday, June 13 · 2023
A planned maintenance event will occur on Tuesday, June 13th, 2023, between 6am and 5pm ET at the National Agricultural Library (NAL).
This maintenance is necessary to transfer core network equipment at NAL onto newer and more reliable backup power, which will promote future stability and reliability for services at this site.
During this time, access to Juno storage will be disrupted. We apologize in advance for any inconvenience this may cause.
We will be working closely with our partners to minimize the impact of this maintenance and hope to complete the work early. We will provide updates on the status of the maintenance on the SCINet Forum: https://forum.scinet.usda.gov/t/access-to-juno-storage-disrupted-on-june-13-2023
Maintenance · Site Service - Ames - Tuesday, May 23 · 2023
The listed assets may become unavailable due to scheduled maintenance being performed by Internet2 vendor Lumen. Outages are expected. The entire window is reserved.
Maintenance · Site Service - Ames - Friday, May 19 · 2023
Site Service at Ames will be impacted while Lumen performs maintenance.
Outages are expected and the entire window is reserved.
Maintenance · Site Service - Stoneville - Thursday, May 18 · 2023
The listed assets will be unavailable while Internet2 engineers perform Core Node maintenance. Outages are expected. The entire window is reserved.
Maintenance · Juno - all - Wednesday, May 10 · 2023
At 6:00 PM Eastern on May 10th, the Juno long term storage system at Beltsville will be unmounted from SCINet DTNs and become inaccessible.
This is being done in preparation for network maintenance to be performed after hours.
The storage will be remounted, and access restored, the following morning.
Maintenance · Site Service - Fort Collins - Tuesday, May 2 · 2023
Site Service Fort Collins will be unavailable while BISON engineers perform maintenance. Outages are expected. The entire maintenance window is reserved.
Maintenance · Atlas - all - Monday, May 1 · 2023
In order to replace a valve in the cooling loop supply for the Atlas cluster system, a reservation has been made for Monday, May 1, beginning at 3:00am CST.
- No running jobs will be killed.
- All jobs that cannot complete before the maintenance start time will be held and started once the system has returned to operation.
Maintenance · Site Service - Ames - Thursday, April 27 · 2023
Site Service at Ames will be impacted while Lumen performs maintenance. Outages are expected and the entire window is reserved.
Maintenance · Ceres - All · 2023
The data center that hosts the Ceres cluster will have reduced cooling capacity starting the morning of April 12 and lasting through the end of the week.
To lessen the heat generated by Ceres compute nodes during this maintenance, a reservation has been created. New jobs will not start if they cannot complete by 6:00AM on April 12, 2023.
In the output of the squeue command, the reason for those jobs will state (ReqNodeNotAvail, Reserved for maintenance). The jobs will start after the scheduled outage completes.
Idle nodes will be turned off. Running jobs that started prior to the reservation will be allowed to continue running as long as the temperature in the data center does not exceed the set threshold.
The login and DTN nodes, as well as storage, are scheduled to stay up.
More nodes may be turned back on and be available for jobs on Thursday and Friday.
The Ceres cluster is expected to run at full capacity starting Monday, April 18.
Maintenance · Atlas - All (Atlas offline) - Tuesday, April 4 · 2023
The Mississippi State University High Performance Computing Collaboratory’s (MSU/HPC2) Computing Office has scheduled maintenance for the Atlas cluster.
During this maintenance window, the compute nodes and all support nodes (login, devel, dtn, ood, etc.), along with associated services including cron, Globus, and login, will be shut down and unavailable.
Helpdesk tickets should be submitted for any associated problems.
Maintenance · SCINet - Albany - Thursday, March 2 · 2023
The Albany site location will experience loss of connectivity to SCINet intermittently during the hours of 4:00 pm to 6:00 pm EST on March 2, 2023.
Maintenance · Ceres - All (Ceres offline) · 2023
Maintenance · Ceres - All (/project) - Thursday, October 27 · 2022
Due to recent issues with Ceres’ /project storage hardware, the hardware needs to be replaced. The replacement hardware is expected to be delivered by the end of the day on 10/26/2022, and the work will likely be done on 10/27/2022.
Before replacing the hardware, we will post on the SCINet Forum and update the message of the day displayed at login to Ceres.
While replacing the hardware, Ceres’ /project will not be accessible. We plan to suspend all running jobs before unmounting /project and resume the jobs once the maintenance completes.
While we expect this will not affect running jobs, we recommend submitting new jobs to run on /90daydata to minimize the risk of jobs dying due to this maintenance.
Maintenance · Ceres - All (Ceres offline) · 2022
Maintenance · Ceres - All (Ceres offline) · 2022
Maintenance · Atlas - All (connections to Atlas) - Tuesday, May 17 · 2022
Maintenance · Ceres - All (Ceres offline) - Monday, February 21 · 2022
Maintenance · SCINet - Stoneville - Thursday, January 20 · 2022
The maintenance window is one (1) hour in duration. This will impact service to the Stoneville site only.
Full cluster Maintenance · Atlas - All (connections to Atlas) - Wednesday, December 8 · 2021
Wednesday, December 8, beginning at 8am CST, the HPC2 Computing Office has scheduled maintenance for the Atlas compute cluster. During this maintenance window, the login, devel, dtn, ood, and compute nodes for Atlas will be unavailable and all associated cron jobs will be disabled.
Downtime is expected to last most of the day. For any associated problems, submit a help desk ticket:
- help-usda@hpc.msstate.edu - specific Atlas issues
- scinet_vrsc@usda.gov - general operational issues
Network Maintenance in Ames · SCINet - All (connections to Ceres) - Thursday, November 18 · 2021
SCINet network maintenance has been scheduled for Ames, IA. The maintenance window is from 8:30 to 10:30 Central Time (1430-1630 UTC) on 18 November 2021. Connectivity to SCINet will be sporadic during the maintenance window.
Network Maintenance in Ames · SCINet - All (connections to Ceres) - Tuesday, November 16 · 2021
Connectivity to SCINet will be sporadic during the maintenance window.
Network Maintenance in Albany · SCINet - Albany - Monday, November 15 · 2021
Local connectivity to SCINet will be sporadic during the maintenance window.
Maintenance · Ceres - All (Ceres offline) - Thursday, November 11 · 2021
Ceres maintenance is scheduled for Thursday, November 11, 2021, to upgrade the internal cluster network.
Queued jobs will not start if they cannot complete by 6AM November 11. These include jobs submitted to the long partition with the default 3-week time limit. In the output of the squeue command the reason for those jobs will state (ReqNodeNotAvail, Reserved for maintenance). The jobs will start after the scheduled outage completes.
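Jobs that request a walltime short enough to finish before the reservation can still start. For example, an illustrative batch script header:
#!/bin/bash
#SBATCH --time=24:00:00   # short enough to complete before 6AM November 11
srun mycommand            # mycommand is a placeholder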
The Atlas cluster will stay up and running during the Ceres downtime. All Ceres users can run jobs on Atlas and use /90daydata, which has no quotas.
Fiber relocation · Ceres - All (connections to Ceres) · 2021
The listed asset will be unavailable while Lumen engineers perform preventative fiber relocation work. The outage is expected to be two hours each day, but up to 5 hours is possible. The entire window is reserved.
Network update · Ceres, Juno - All (connections to Ceres, Juno) - Thursday, October 28 · 2021
A maintenance window has been scheduled for 28 October 2021 from 1530-1730 UTC (10:30am to 12:30pm Central time) to stabilize a router (Albany MX480 RE downgrade).
Periodic outages will be experienced as equipment is rebooted. Connectivity to Ceres and Juno cannot be guaranteed during the maintenance window.
Network update · Ceres, Juno - All (connections to Ceres, Juno) - Tuesday, October 26 · 2021
A maintenance window has been scheduled for 26 October 2021 from 4:30pm to 8:30pm Central time to stabilize the SCINet Network. Periodic outages will be experienced as equipment is rebooted. Connectivity to Ceres and Juno cannot be guaranteed during the maintenance window.
Router update · Ceres - All (connections to Ceres) - Tuesday, October 19 · 2021
The router at Ames will be rebooted at about 4:30 CT. The reboot should take about 15 minutes. After that, the router will be upgraded to the latest OS. Outages may occur during that process.
Router update · SCINet - various · 2021
More SCINet network hardware OS updates. Check the announcement page for more details.
OS Upgrade · SCINet - various · 2021
GNOC plans to upgrade the OS on the SCINet gear at the six locations listed below. This will result in connectivity interruptions during the upgrade. The upgrade schedule is as follows:
- Albany - 9/16 8AM PST
- Clay Center - 9/16 4PM CST
- Ames - 9/17 8AM CST
- Stoneville - 9/20 8AM CST
- NAL - 9/20 3PM CST
- CSU - 9/21 9AM CST
Maintenance · Ceres - All (connections to Ceres) · 2021
This maintenance window will be longer than normal, as several important hardware upgrades will occur during this window to enhance the overall power and capacity of the Ceres HPC cluster. These upgrades include the remaining new priority nodes, sixty-eight additional compute nodes, two additional high-memory compute nodes, six management nodes, and faster Infiniband switching technology used by the HPC nodes to access storage. VRSC will re-rack and re-wire the whole cluster to accommodate the additional hardware while adhering to power and cooling limits.
Queued jobs will not start if they cannot complete by 7AM August 23. These include jobs submitted to the long partition with the default 3-week time limit. In the output of the squeue command the reason for those jobs will state (ReqNodeNotAvail, Reserved for maintenance). The jobs will start after the scheduled outage completes.
The Atlas cluster will stay up and running during the Ceres downtime. All Ceres users can run jobs on Atlas. If you don’t have a large enough project quota on Atlas, remember that you can use /90daydata on Atlas, which has no quotas.
Outage · Ceres · 2021
Connection Restored on 07-21-2021
Maintenance · Ceres - All (connections to Ceres) · 2021
The listed assets will be unavailable while contractors perform testing on the electrical service switchgear, generators, and turbine. Outages throughout the window are expected. The entire window is reserved.
Maintenance · Atlas - All (connections to Atlas) - Tuesday, February 23 · 2021
The HPC2 Computing Office has scheduled maintenance for its core networking services. During this time, all network connectivity both inside and outside the HPC2 will be unavailable, including access to the Atlas cluster systems.
Maintenance · Ceres - All (Ceres offline) - Tuesday, February 16 · 2021
Maintenance · Ceres - All (Ceres offline) - Monday, February 15 · 2021
Maintenance · Ceres - All (Ceres offline) - Monday, October 12 · 2020
UPS Maintenance · SCINet - Stoneville - Tuesday, August 25 · 2020
SCINet equipment will be shut down in order to perform maintenance on the UPS. SCINet connectivity at the Stoneville location will be impacted. The maintenance window is reserved from 0700 to 1600 Central Time.
Maintenance · Ceres - All (Ceres offline) · 2020
Planned power outage · SCINet & AWS - Multiple locations · 2020
SCINet equipment at the National Agricultural Library will be powered down in advance of a planned power outage to the NAL building. The outage is expected to last for 24 hrs or less. We expect that normal access to SCINet resources will be restored on or before Monday, April 20.
Please check Basecamp during the outage period for updates.
Router migration · SCINet - Ft Collins - Thursday, March 19 · 2020
Router replacement · SCINet - Ft Collins - Thursday, March 12 · 2020
Router replacement · SCINet - Clay Center - Monday, March 2 · 2020
Maintenance · Ceres - All (Ceres offline) - Monday, February 17 · 2020
Upgrades/expansion · Ceres - All (Ceres offline) · 2019
Ceres downtime is scheduled for Monday, December 2 - Friday, December 6. This downtime is to rewire both power and networking on Ceres for the addition of new compute nodes and to ready it for storage expansion.
We do not anticipate any further extended downtimes for rewiring, as this should allow us to maximize the size of Ceres simply by adding additional compute nodes.
Since this affects authentication for SCINet, it will also affect logins to the Data Transfer Nodes at Ames, Stoneville, Fort Collins, Clay Center, Albany CA, and Beltsville.
GlobalNOC will also be upgrading software on the SCINet network infrastructure during this time.