Wiki source code of Cluster Maintenance Policy

Last modified by juhaheli@helsinki_fi on 2024/02/08 06:49

1 = 1.0 Preventative Maintenance (PM) =
2
3 Scheduled, or preventative, maintenance (//PM later in this document//) requires that users be notified a minimum of 30 calendar days in advance, or according to a pre-planned quarterly schedule. Scheduled maintenance activities are not performed without notifying local and grid users.
4
5 (% style="margin-left: 30.0px;" %)
6 == 1.1 Scheduled Activities ==
7
8 (% style="margin-left: 30.0px;" %)
9 All system upgrade operations that cause, or may cause, a system-wide outage require PM to be scheduled in advance, including but not limited to:
10
11 (% style="list-style-type: square;" %)
12 *
13 (% style="list-style-type: square;" %)
14 **
15 (% style="list-style-type: square;" %)
16 *** Major hardware upgrades
17 *** Operating system upgrades
18 *** Required firmware upgrades
19 *** Non-critical bug fixes requiring downtime
20
21 (% style="margin-left: 30.0px;" %)
22 == 1.2 Fixed Schedules ==
23
24 (% style="margin-left: 30.0px;" %)
25 Scheduled outages occur according to a pre-planned schedule, but if there are no required activities, PM can be skipped. The fixed quarterly maintenance schedules are as follows:
26
27 (% style="margin-left: 30.0px;" %)
28 1st Quarter Schedule
29
30 (% style="list-style-type: square;" %)
31 *
32 (% style="list-style-type: square;" %)
33 ** (% style="letter-spacing: 0.0px;" %)**Ukko**: Last Wednesday of February
34 ** **Kale**: First Wednesday of March
35 ** **Vorna**: Second Wednesday of March
36
37 (% style="margin-left: 30.0px;" %)
38 2nd Quarter Schedule
39
40 (% style="list-style-type: square;" %)
41 *
42 (% style="list-style-type: square;" %)
43 ** **Ukko**: Last Wednesday of May
44 ** **Kale**: First Wednesday of June
45 ** **Vorna**: Second Wednesday of June
46
47 (% style="margin-left: 30.0px;" %)
48 3rd Quarter Schedule
49
50 (% style="list-style-type: square;" %)
51 *
52 (% style="list-style-type: square;" %)
53 ** **Ukko**: Last Wednesday of August
54 ** **Kale**: First Wednesday of September
55 ** **Vorna**: Second Wednesday of September
56
57 (% style="margin-left: 30.0px;" %)
58 4th Quarter Schedule
59
60 (% style="list-style-type: square;" %)
61 *
62 (% style="list-style-type: square;" %)
63 ** **Ukko**: Last Wednesday of November
64 ** **Kale**: First Wednesday of December
65 ** **Vorna**: Second Wednesday of December
66
67 (% style="margin-left: 30.0px;" %)
68 Should major activities take considerable time, and/or have another major impact on users, notifications need to be sent several times.
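
The fixed quarterly schedule above can be turned into concrete calendar dates for a given year. The following is a minimal sketch using only the Python standard library; the function names `nth_wednesday` and `pm_schedule` and the `RULES` table are illustrative assumptions, not part of the policy.

```python
import calendar
from datetime import date


def nth_wednesday(year: int, month: int, n: int) -> date:
    """Return the n-th Wednesday (1-based) of a month; n = -1 means the last one."""
    wednesdays = [
        d
        for d in calendar.Calendar().itermonthdates(year, month)
        if d.month == month and d.weekday() == calendar.WEDNESDAY
    ]
    return wednesdays[-1] if n == -1 else wednesdays[n - 1]


# (cluster, month offset from the quarter's first PM month, which Wednesday)
# Hypothetical encoding of the schedule listed above.
RULES = [("Ukko", 0, -1), ("Kale", 1, 1), ("Vorna", 1, 2)]


def pm_schedule(year: int) -> dict:
    """Map (quarter, cluster) to the scheduled PM date for the given year."""
    schedule = {}
    # PM months per quarter: Feb/Mar, May/Jun, Aug/Sep, Nov/Dec.
    for quarter, first_month in enumerate((2, 5, 8, 11), start=1):
        for cluster, offset, n in RULES:
            schedule[(quarter, cluster)] = nth_wednesday(year, first_month + offset, n)
    return schedule


if __name__ == "__main__":
    for (quarter, cluster), day in sorted(pm_schedule(2024).items()):
        print(f"Q{quarter} {cluster}: {day.isoformat()}")
```

A date computed this way is only a candidate: as stated above, PM can be skipped when there are no required activities.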
69
70 (% style="margin-left: 30.0px;" %)
71 == 1.3 Duration of Scheduled Outage ==
72
73 (% style="margin-left: 30.0px;" %)
74 The standard duration of a scheduled outage is **9 hours**. However, an outage can always be shorter than the standard duration.
75
76 (% style="margin-left: 30.0px;" %)
77 == 1.4 Software Upgrade Policies ==
78
79 (% style="margin-left: 30.0px;" %)
80 All system software upgrades (//operating system upgrades, vendor-provided firmware upgrades, etc.//) that cause, or may cause, partial or full system downtime require the following testing procedure before deployment into a production system. Scientific software modules, compilers and the like are excluded from this requirement.
81
82 (% style="list-style-type: square;" %)
83 *
84 (% style="list-style-type: square;" %)
85 ** Software is tested on a single representative node of its kind (e.g. //regular vs. GPU//) that has been taken down with an advance reservation.
86 ** Once tested, the node's functionality has to be verified before the upgrade is approved for deployment.
87 ** Once approved, the upgrade will be deployed to the production system in the __next scheduled outage__.
88
89 (% style="margin-left: 30.0px;" %)
90 == 1.5 Handling Planned Outage ==
91
92 (% style="margin-left: 30.0px;" %)
93 A planned outage __requires the use of advance reservations__: a PM time reservation is created with Slurm for the date of the outage. This allows the system to drain towards the outage date and gives very wide, short jobs an opportunity to execute. Users are able to submit jobs on the virtual login host pair and to access files during the outage, provided that access is allowed.
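
As an illustration of the advance reservation described above, the sketch below builds an `scontrol create reservation` command line for a PM window of the standard 9 hours. The reservation name, the 08:00 start time, and the choice of `Nodes=ALL`, `Users=root` and `Flags=maint` are assumptions made for the example; the policy does not prescribe them. The function only builds the argument list and does not run it.

```python
from datetime import datetime


def pm_reservation_command(name: str, start: datetime, hours: int = 9,
                           nodes: str = "ALL") -> list:
    """Build (but do not run) an scontrol command for a maintenance reservation."""
    return [
        "scontrol", "create", "reservation",
        f"ReservationName={name}",                         # hypothetical name
        f"StartTime={start.strftime('%Y-%m-%dT%H:%M:%S')}",
        f"Duration={hours * 60}",                          # a bare number is read as minutes
        f"Nodes={nodes}",
        "Users=root",                                      # assumption: only root may use the window
        "Flags=maint",
    ]


if __name__ == "__main__":
    # Assumed start time of 08:00 on a PM date; the policy only fixes the date.
    cmd = pm_reservation_command("pm_ukko", datetime(2024, 2, 28, 8, 0))
    print(" ".join(cmd))
```

Because Slurm will not start a job whose time limit overlaps a reservation it cannot use, creating the reservation well ahead of the outage is what lets the system drain while short jobs still fit in before the window.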
94
95 (% style="margin-left: 30.0px;" %)
96 Outages affecting storage require a procedure in which any upgrades are first properly verified (//in the case of Lustre, performed well in advance on a virtual test platform//) and only then applied to the production system(s). Failure to heed this requirement may cause total loss of all user data or permanent data corruption.
97
98 = 2.0 Unscheduled Outage =
99
100 An unscheduled outage occurs when a critical hardware or software component fails, or when a critical external threat or exploit becomes known. An unscheduled outage is by definition a reactive action. Local, demonstrated exploits may be considered critical only after proper individual evaluation. Normally, potential local exploits fall into the category of issues that are resolved during planned outages.
101
102 (% style="margin-left: 30.0px;" %)
103 == 2.1 Critical Hardware Interrupt ==
104
105 (% style="margin-left: 30.0px;" %)
106 Should a critical hardware interrupt occur:
107
108 (% style="list-style-type: square;" %)
109 *
110 (% style="list-style-type: square;" %)
111 ** Local and grid users are notified about the event through official channels, with an estimated duration.
112 ** The root cause is to be isolated, and the cause of the event recorded in the system swap log.
113 ** An appropriate service call is to be made.
114 ** If feasible, a workaround is to be created. If not, users are pointed to alternate resources.
115 ** Users have to be informed about the expected duration of the outage.
116
117 (% style="margin-left: 30.0px;" %)
118 == 2.2 Critical Software Interrupt ==
119
120 (% style="margin-left: 30.0px;" %)
121 Should a critical software interrupt occur:
122
123 *
124 ** Local and grid users are notified about the event through official channels, with an estimated duration.
125 ** The root cause is to be isolated, and the cause of the event recorded in the system swap log.
126 ** If no permanent fix is available through rollback etc., a workaround needs to be created.
127 ** Users have to be informed about the expected duration of the outage and the workaround.
128
129 (% style="margin-left: 30.0px;" %)
130 == 2.3 Interrupt due to Critical External Threat ==
131
132 (% style="margin-left: 30.0px;" %)
133 The policy for handling critical external threats, whether a likely critical bug that may result in elevated risk to local users from outside, or a critical issue with a known, published exploit that can put users and/or user data at risk, includes two phases:
134
135 (% style="list-style-type: square;" %)
136 *
137 (% style="list-style-type: square;" %)
138 ** The first phase is to limit the scope by isolating the system from the network and/or denying access when applicable. This can be achieved by issuing a nologin policy and, if necessary, isolating the system from any external network. In extreme cases a system shutdown may be needed. An emergency notice has to be sent out to both local and grid users.
139 ** Once immediate damage-control actions have been performed, a risk analysis is required before any other action is taken.
140 (% style="letter-spacing: 0.0px;" %)If running batch jobs are not considered a threat, they can run to completion while the system is not accessible from outside. Once follow-up activities are planned, remaining jobs can be requeued when possible and appropriate recovery actions taken.
141
142 (% style="margin-left: 30.0px;" %)
143 == 2.4 Recovery Actions ==
144
145 (% style="margin-left: 30.0px;" %)
146 In the event of an unscheduled outage, the IDs of compromised batch jobs and the user names of affected users have to be collected and the end users notified accordingly. It is the responsibility of the service provider to safeguard data integrity as best they can and to notify users about the potential loss or corruption of data in a timely manner. Under exceptional circumstances, we can elect to requeue user batch jobs if the user has allowed it by setting an appropriate flag.
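
The recovery bookkeeping above can be sketched as follows. The policy does not name the flag; this example assumes it corresponds to the Slurm `--requeue` submission option, and it assumes job records have already been gathered into dictionaries with hypothetical `job_id`, `user` and `requeue` keys. The function `requeue_command` only builds an `scontrol requeue` argument list and does not execute it.

```python
from collections import defaultdict


def affected_users(jobs) -> dict:
    """Group compromised job IDs by user name, for notification."""
    by_user = defaultdict(list)
    for job in jobs:
        by_user[job["user"]].append(job["job_id"])
    return dict(by_user)


def requeue_command(jobs):
    """Build an scontrol requeue command for jobs whose owner allowed requeuing.

    Returns None when no job is eligible.
    """
    allowed = [str(job["job_id"]) for job in jobs if job.get("requeue")]
    if not allowed:
        return None
    # scontrol requeue accepts a comma-separated list of job IDs.
    return ["scontrol", "requeue", ",".join(allowed)]


if __name__ == "__main__":
    # Made-up records, for illustration only.
    jobs = [
        {"job_id": 101, "user": "alice", "requeue": True},
        {"job_id": 102, "user": "bob", "requeue": False},
        {"job_id": 103, "user": "alice", "requeue": True},
    ]
    print(affected_users(jobs))
    print(requeue_command(jobs))
```

Keeping the notification list separate from the requeue list reflects the text above: every affected user is notified, while only jobs whose owners opted in are candidates for requeuing.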
147
148 \\