Cluster Maintenance Policy

Last modified by juhaheli@helsinki_fi on 2024/02/08 06:49

1.0 Preventative Maintenance (PM)

Scheduled, or preventative, maintenance (PM later in this document) requires that users be notified a minimum of 30 calendar days in advance, or according to a pre-planned quarterly schedule. Scheduled maintenance activities are not performed without notifying local and grid users.

1.1 Scheduled Activities

All system upgrade operations that cause, or may cause, a system-wide outage require a PM to be scheduled in advance, including but not limited to:



    • Major hardware upgrades
    • Operating system upgrades
    • Required firmware upgrades
    • Non-critical bug fixes requiring downtime

1.2 Fixed Schedules

Scheduled outages are arranged to occur according to a pre-planned schedule, but if there are no required activities, a PM can be skipped. The fixed quarterly maintenance schedules are as follows:

1st Quarter Schedule


    • Ukko: Last Wednesday of February
    • Kale: First Wednesday of March
    • Vorna: Second Wednesday of March

2nd Quarter Schedule


    • Ukko: Last Wednesday of May
    • Kale: First Wednesday of June
    • Vorna: Second Wednesday of June

3rd Quarter Schedule


    • Ukko: Last Wednesday of August
    • Kale: First Wednesday of September
    • Vorna: Second Wednesday of September

4th Quarter Schedule


    • Ukko: Last Wednesday of November
    • Kale: First Wednesday of December
    • Vorna: Second Wednesday of December

Should major activities take considerable time, and/or have other major impact on users, notifications need to be made several times.
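Since every fixed date above is a week-ordinal rule ("last Wednesday of", "first Wednesday of", "second Wednesday of"), the schedule for any year can be computed. A minimal sketch; the function and variable names are illustrative, only the cluster names and rules come from the schedule above:

```python
from datetime import date, timedelta

WED = 2  # Monday == 0 in the datetime module

def nth_weekday(year, month, weekday, n):
    """Return the n-th given weekday of a month (n=1 is first, n=-1 is last)."""
    if n > 0:
        first = date(year, month, 1)
        offset = (weekday - first.weekday()) % 7
        return first + timedelta(days=offset + 7 * (n - 1))
    # last occurrence: start from the final day of the month and walk back
    last = date(year + month // 12, month % 12 + 1, 1) - timedelta(days=1)
    return last - timedelta(days=(last.weekday() - weekday) % 7)

def pm_schedule(year):
    """Quarterly PM dates per cluster, following the fixed schedule."""
    quarters = [(2, 3), (5, 6), (8, 9), (11, 12)]  # (Ukko month, Kale/Vorna month)
    return [
        {
            "Ukko": nth_weekday(year, m1, WED, -1),  # last Wednesday
            "Kale": nth_weekday(year, m2, WED, 1),   # first Wednesday
            "Vorna": nth_weekday(year, m2, WED, 2),  # second Wednesday
        }
        for m1, m2 in quarters
    ]
```

For example, pm_schedule(2024)[0] yields 2024-02-28 for Ukko, 2024-03-06 for Kale, and 2024-03-13 for Vorna.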

1.3 Duration of Scheduled Outage

The standard duration of a scheduled outage is 9 hours. However, an outage can always be shorter than the standard duration.

1.4 Software Upgrade Policies

All system software upgrades (operating system upgrades, vendor-provided firmware upgrades, etc.) which cause, or may cause, partial or full system downtime require the following testing procedure before deployment to a production system. Scientific software modules, compilers, and the like are excluded from this list.


    • Software is tested on a single representative node of its kind (e.g. regular vs. GPU) that has been taken down with an advance reservation.
    • Once tested, node functionality has to be verified before the upgrade is approved for deployment.
    • Once approved, the upgrade is deployed to the production system in the next scheduled outage.

1.5 Handling Planned Outage

A planned outage requires the use of advance reservations: a PM reservation is created in Slurm for the date of the outage. This allows the system to drain towards the outage date, and gives very wide, short jobs an opportunity to execute. Users are able to submit jobs at the virtual login host pair, and to access files during the course of the outage, provided that access is allowed.
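The draining behaviour follows from the reservation: the scheduler only starts a job whose requested time limit ends before the PM window begins. A simplified sketch of that admission check (this is an illustration of the principle, not Slurm's actual implementation; all names here are hypothetical):

```python
from datetime import datetime, timedelta

def can_start_now(now, walltime, pm_start, pm_duration=timedelta(hours=9)):
    """A job may start only if its requested walltime ends before the PM
    reservation begins; otherwise it waits until the outage is over.
    As the outage approaches, only ever-shorter jobs are admitted, which
    is what opens a window for very wide but short jobs."""
    if now >= pm_start + pm_duration:
        return True  # the outage is over; normal scheduling resumes
    return now + walltime <= pm_start
```

For example, with a PM starting 08:00 on the outage day, a 12-hour job submitted the previous morning is admitted, while the same job submitted at midnight before the outage waits until after the maintenance window.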

Outages affecting storage require a procedure where any upgrades are properly verified (and, in the case of Lustre, first performed well in advance on a virtual test platform) before being applied to the production system(s). Failure to heed this requirement may cause total loss of all user data, or permanent data corruption.

2.0 Unscheduled Outage

An unscheduled outage occurs when a critical hardware or software component fails, or when a critical external threat or exploit becomes known. An unscheduled outage is by definition a reactive action. Local, demonstrated exploits may be considered critical only after proper individual evaluation; normally, potential local exploits fall into the category of issues that are resolved during planned outages.

2.1 Critical Hardware Interrupt

Should a critical hardware interrupt occur:


    • Local and grid users are notified about the event through official channels, with an estimated duration.
    • The root cause is isolated, and the cause of the event is recorded in the system swap log.
    • An appropriate service call is made.
    • If feasible, a workaround is created. If not, users are pointed to alternate resources.
    • Users have to be informed about the expected duration of the outage.

2.2 Critical Software Interrupt

Should a critical software interrupt occur:


    • Local and grid users are notified about the event through official channels, with an estimated duration.
    • The root cause is isolated, and the cause of the event is recorded in the system swap log.
    • If no permanent fix is available through a rollback or similar, a workaround needs to be created.
    • Users have to be informed about the expected duration of the outage and the workaround.

2.3 Interrupt Due to Critical External Threat

The policy for handling critical external threats, whether a likely critical bug that may expose local users to elevated risk from outside, or a critical issue with a known, published exploit that can put users and/or user data at risk, includes two phases:


    • The first is to limit the scope by isolating the system from the network and/or denying access where applicable. This can be achieved by issuing a nologin policy and, if necessary, by isolating the system from any external network. In extreme cases a system shutdown may be needed. An emergency notice has to be sent out to both local and grid users.
    • Once immediate damage-control actions have been performed, a risk analysis is required before any other actions.
      If running batch jobs are not considered a threat, they can run to conclusion while the system is not accessible from outside. Once follow-up activities are planned, remaining jobs can be requeued when possible and appropriate recovery actions taken.
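On Linux login hosts, the standard nologin mechanism is a file (conventionally /etc/nologin, honoured by pam_nologin) whose presence denies non-root logins and whose contents are shown to refused users. A minimal sketch of toggling it, assuming that mechanism is in use on the login hosts (the path is a parameter here only so the sketch can be exercised safely):

```python
import os

NOLOGIN_DEFAULT = "/etc/nologin"  # pam_nologin denies non-root logins while this file exists

def enable_nologin(message, path=NOLOGIN_DEFAULT):
    """Deny further interactive logins by creating the nologin file.
    The message is displayed to users whose login attempt is refused."""
    with open(path, "w") as f:
        f.write(message.rstrip() + "\n")

def disable_nologin(path=NOLOGIN_DEFAULT):
    """Re-enable logins once the threat has been handled."""
    if os.path.exists(path):
        os.remove(path)
```

Removing the file restores normal logins, which keeps the damage-control step easy to reverse after the risk analysis.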

2.4 Recovery Actions

In the event of an unscheduled outage, the IDs of compromised batch jobs and the user names of affected users have to be collected, and the end users notified accordingly. It is the responsibility of the service provider to safeguard data integrity as best they can and to notify users about potential loss or corruption of data in a timely manner. Under exceptional circumstances, we can elect to requeue user batch jobs if the user has allowed it by setting an appropriate flag.
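The recovery bookkeeping above amounts to a small filter over the affected jobs: collect the users to notify, and the subset of job IDs eligible for requeue. A sketch under the assumption that per-job permission is recorded at submit time (in Slurm this corresponds to the --requeue / --no-requeue submit options; the record layout here is illustrative):

```python
from dataclasses import dataclass

@dataclass
class AffectedJob:
    job_id: int
    user: str
    requeue_allowed: bool  # e.g. the user submitted with a requeue flag set

def plan_recovery(jobs):
    """Collect the affected users for notification, and the job IDs that
    may be requeued because the user explicitly allowed it."""
    notify = sorted({j.user for j in jobs})
    requeue = [j.job_id for j in jobs if j.requeue_allowed]
    return notify, requeue
```

Every affected user is notified regardless of the flag; only the flag gates the actual requeue, matching the opt-in policy stated above.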