Wednesday, April 3, 2019

Preventive Maintenance Program for IT

Preventive maintenance is performed on IT systems to resolve issues before they lead to failures that impact operations. By identifying problems early and conducting periodic repairs and upgrades, we realize savings in cost and labor by avoiding corrective/reactive maintenance activities.

Preventive maintenance is performed on a regular schedule to minimize the chance that a piece of IT equipment will fail and cause unscheduled downtime. A bi-weekly after-hours maintenance window is established to perform compliance, corrective/repair, and preventive maintenance on IT equipment. Prior to entering the maintenance window, the Operations Lead prepares a list of activities to be performed during the maintenance window and obtains approval for execution from the Operations Manager. The plan for each maintenance window includes an estimate of the time required to perform the planned actions and is prioritized with emphasis on corrective/repair and compliance activities. If the time allocated for the maintenance window does not allow preventive maintenance activities to be performed, our Operations Lead coordinates with the Operations Manager to set aside a supplemental maintenance period, or the actions are planned so as not to disrupt ongoing operations (e.g. replacing failed drives in a RAID that will not require any downtime). We develop the planned list of preventive maintenance activities to be performed during a given maintenance window using:
- Lifecycle Management (LCM) reports are produced weekly. Equipment with an end of support (EOS) or end of life (EOL) date within six months of the current date is flagged for replacement and a determination is made to sustain or retire the equipment or capability. If the equipment is to be retired, we follow a decommissioning process. If the equipment or capability is to be sustained, our engineering team evaluates replacement solutions and prepares an acquisition request which includes budgetary pricing, justification, and an implementation plan.
- Network and system monitoring tools (e.g. SolarWinds) are used to query and log capacity utilization for server CPU, memory, disk, and network resources. For CPU and memory components, additional resources are provisioned if utilization exceeds 70% over a prolonged interval (i.e. not an abnormal spike due to a surge in use). For disk resources, capacity or quota is increased by 20% when a volume reaches 80%. For physical systems, an acquisition request is made for the additional hardware required. For virtual systems, hypervisor capacity is analyzed to verify additional resources can be provisioned.
- For systems configured to retain local log files, utilization of local storage is inspected weekly. When utilization exceeds 90%, the oldest events are deleted after verification that logs are being successfully delivered to the security information and event management (SIEM) system.
- SAN and NAS health is analyzed weekly by examining the management interface, logs, and/or alerts. Any disks flagged as failed or failing by the SAN or NAS are replaced. An acquisition is initiated for additional storage when capacity reaches 80%, with the amount of capacity increase determined by analysis of the growth rate and any projections provided by information owners.
- Tier 3 System Administrators, Network Engineers, and Database Administrators monitor vendor web sites and/or newsfeeds to maintain awareness of updates and/or manufacturer recommendations for periodic maintenance for the systems they manage. Operating system, application, SAN or NAS, switch, router, firewall, or other IT system updates are applied when enhancements to features, stability, and/or performance can be achieved.
- Security Information and Event Management (SIEM) logs and alerts are reviewed to identify indications of potential failures. These could include: network interface errors indicative of failing hardware, loose connections, or wiring damage; warnings for PKI certificate expiration; and equipment temperature or power fluctuations which could indicate failing fans or power supplies. Testing (e.g. wiring) or replacement of components is performed during the scheduled maintenance window.
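
The thresholds in the list above can be expressed as a few simple checks. The following Python is only a minimal sketch under stated assumptions: the function names are hypothetical, "six months" is approximated as 183 days, and a "prolonged interval" is modeled as the average of the utilization samples collected over that interval. It is not tied to SolarWinds or any other monitoring product.

```python
from datetime import date, timedelta
from statistics import mean
from typing import Sequence

# Thresholds taken from the program description above.
EOL_HORIZON_DAYS = 183     # "within six months", approximated in days (assumption)
CPU_MEM_THRESHOLD = 70.0   # percent utilization over a prolonged interval
DISK_THRESHOLD = 80.0      # percent of a volume in use
DISK_INCREASE = 0.20       # grow capacity or quota by 20%
LOG_THRESHOLD = 90.0       # percent of local log storage in use


def flag_for_replacement(eol_date: date, today: date) -> bool:
    """True when the EOS/EOL date falls within six months of today (or has passed)."""
    return eol_date <= today + timedelta(days=EOL_HORIZON_DAYS)


def needs_more_cpu_or_memory(samples: Sequence[float]) -> bool:
    """True when average utilization over the interval exceeds 70%.

    Averaging is an assumed way to ignore a short, abnormal spike.
    """
    return len(samples) > 0 and mean(samples) > CPU_MEM_THRESHOLD


def new_disk_capacity(capacity_gb: float, used_gb: float) -> float:
    """Return the capacity after a 20% increase if the volume has reached 80%."""
    if used_gb / capacity_gb * 100 >= DISK_THRESHOLD:
        return capacity_gb * (1 + DISK_INCREASE)
    return capacity_gb


def should_prune_logs(used_pct: float, siem_delivery_verified: bool) -> bool:
    """Oldest local events may be deleted only above 90% AND after SIEM delivery is verified."""
    return used_pct > LOG_THRESHOLD and siem_delivery_verified
```

For example, under these assumptions a 1000 GB volume with 800 GB in use would be grown to about 1200 GB, while the same volume with 500 GB in use would be left unchanged.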

During the maintenance window, the planned and approved preventive maintenance actions are performed. At the conclusion of the maintenance, the Operations Lead provides a report to the Operations Manager indicating which activities were successful.
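
The window planning described earlier (estimate the time for each activity, prioritize corrective/repair and compliance work, and defer whatever does not fit to a supplemental period) can be sketched as follows. This is an illustrative sketch only: the `Activity` class, the `plan_window` function, and the choice to give corrective/repair and compliance work equal priority are assumptions, not part of the program as written.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Lower number = scheduled first. Equal priority for corrective and
# compliance work is an assumption; the program only says both are emphasized.
PRIORITY = {"corrective": 0, "compliance": 0, "preventive": 1}


@dataclass
class Activity:
    name: str
    category: str      # "corrective", "compliance", or "preventive"
    est_minutes: int   # estimated time required


def plan_window(activities: List[Activity],
                window_minutes: int) -> Tuple[List[Activity], List[Activity]]:
    """Split activities into (scheduled, deferred) for one maintenance window."""
    scheduled: List[Activity] = []
    deferred: List[Activity] = []
    remaining = window_minutes
    # sorted() is stable, so the submitted order is kept within each priority.
    for act in sorted(activities, key=lambda a: PRIORITY[a.category]):
        if act.est_minutes <= remaining:
            scheduled.append(act)
            remaining -= act.est_minutes
        else:
            # Candidate for a supplemental maintenance period.
            deferred.append(act)
    return scheduled, deferred
```

The deferred list corresponds to the work the Operations Lead would raise with the Operations Manager when requesting a supplemental maintenance period.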

A computerized maintenance management system (CMMS) is recommended to implement this preventive maintenance program.
