Major Incident Management Process
This document describes Skyhigh Security’s Incident Management (IM) and Major Incident Management (MIM) processes. It explains how Skyhigh Security detects, evaluates, and manages major incidents to reduce service disruption and maintain platform reliability. The MIM process is designed to ensure rapid response, coordinated technical engagement, and clear communication with customers during high-severity service events.
Skyhigh Security’s MIM process follows industry-recognized best practices, incorporating core principles from ITIL standards for Incident Management. The process is also aligned with the requirements of ISO 22301:2012 for business continuity management and ISO/IEC 27001:2013 for information security management. These standards guide our commitment to consistent, effective, and transparent incident handling.
Determine When an Incident Becomes a Major Incident
A Major Incident is defined as an event that causes a significant disruption to critical business services, impacting a large number of users or customers, and requires an immediate, coordinated response beyond the standard incident management process.
Criteria for Major Incident Classification
| Criterion | Description | Examples |
|---|---|---|
| Severity | Critical impact on a core service or system. | Complete service outage, data loss, severe security breach. |
| Scope | Affecting a significant number of customers or users. | Global platform degradation, regional control plane failure. |
| Business Impact | Significant financial or reputational risk. | Inability to enforce core security policies, major regulatory compliance failure. |
| Urgency | Requires immediate and high-priority attention. | Rapidly escalating failure, immediate need for executive awareness. |
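
The classification criteria can be expressed as a simple triage check. The following is a minimal sketch and not Skyhigh Security's actual tooling: the `IncidentAssessment` record, its field names, and both functions are hypothetical, and the rule that all four criteria must be met is an assumption, since the process does not state how the criteria are combined.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class IncidentAssessment:
    """Hypothetical triage record; one flag per criterion in the table."""
    severity_critical: bool      # critical impact on a core service or system
    scope_significant: bool      # significant number of customers or users affected
    business_impact_high: bool   # significant financial or reputational risk
    urgent: bool                 # requires immediate, high-priority attention


def unmet_criteria(a: IncidentAssessment) -> List[str]:
    """Return the names of the criteria that the assessment does not satisfy."""
    checks = {
        "Severity": a.severity_critical,
        "Scope": a.scope_significant,
        "Business Impact": a.business_impact_high,
        "Urgency": a.urgent,
    }
    return [name for name, met in checks.items() if not met]


def is_major_incident(a: IncidentAssessment) -> bool:
    """Assumption: an incident is major only when every criterion is met."""
    return not unmet_criteria(a)
```

Returning the unmet criteria alongside the yes/no answer keeps the decision explainable: a reviewer can see which criterion prevented a Major Incident declaration.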
Objectives of the Major Incident Management Process
Our Incident and Major Incident Management processes are designed to achieve the following core objectives:
- Rapid Discovery and Recovery. Ensure clear roles, responsibilities, and timely allocation of resources for efficient investigation and resolution of service-impacting issues.
- Effective Communication. Provide accurate and timely reports and communications to internal and external stakeholders.
- Proactive Escalation. Implement clear escalation paths (functional or hierarchical) to address critical issues promptly.
- Customer-Centric Resolution. Resolve incidents to customer satisfaction with the shortest Mean Time To Resolve (MTTR), through effective workarounds or permanent fixes.
- Continuous Improvement. Facilitate a thorough post-incident review process to identify and implement follow-up actions, address root causes, and prevent recurrence.
Major Incident Workflow and Response Stages
Once an issue is declared a Major Incident (MI), an immediate, centralized response is initiated.
Immediate Response Process
- Detection & Declaration. The issue is identified, confirmed to meet MI criteria, and the Incident Commander is engaged.
- Mobilization. Technical teams, the Communications Lead, and necessary resources are assembled on the internal bridge.
- Diagnosis & Action. Technical teams work to diagnose the fault and implement immediate actions to restore service (workarounds, fixes, restarts).
- Status Updates. Regular updates are published to the status page and via email.
- Service Restoration. A resolution or successful workaround is implemented, and the service is verified as operational.
- MI Conclusion. The Incident Commander declares the MI resolved, and the Technical Bridge is closed.
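
The response stages above can be modeled as an ordered sequence. This is an illustrative sketch only: the `MIStage` enum and `next_stage` function are hypothetical names, and treating the stages as strictly sequential is a simplifying assumption (status updates, for example, may well continue while diagnosis is still in progress).

```python
from enum import Enum
from typing import Optional


class MIStage(Enum):
    """Response stages in the order listed in the process (hypothetical model)."""
    DETECTION_AND_DECLARATION = 1
    MOBILIZATION = 2
    DIAGNOSIS_AND_ACTION = 3
    STATUS_UPDATES = 4
    SERVICE_RESTORATION = 5
    MI_CONCLUSION = 6


def next_stage(stage: MIStage) -> Optional[MIStage]:
    """Return the stage that follows `stage`, or None after MI Conclusion."""
    if stage is MIStage.MI_CONCLUSION:
        return None
    return MIStage(stage.value + 1)
```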
Roles and Responsibilities During a Major Incident
| Role | Responsibility | Customer Interaction Expectation |
|---|---|---|
| Incident Commander (IC) | Owns the overall resolution of the MI, directs the technical teams, and approves communications | Direct interaction is rare; your updates come via the Communications Lead |
| Technical Leads/Resolver Group | Subject Matter Experts (SMEs) focused on diagnosing and resolving the technical issue | Usually none; in some cases, they may engage ad hoc to gather specific information that aids resolution |
| Communications Lead | Manages all internal and external communication regarding the MI status and progress | Primary point of contact for status updates (via email, status page, or Bridge) |
| Executive Sponsors | Provide the necessary authority, resources, and strategic guidance | None during the MI |
Communication Channels and Update Mechanisms
1. Incident Bridges
- Internal Technical Bridge. A dedicated virtual conference is established immediately for technical teams to collaborate on diagnosis and resolution. Customers do not join this bridge.
- Customer Update Bridge (as needed). In high-severity, prolonged incidents, a dedicated bridge may be opened for key customers to receive verbal updates from the Communications Lead. You will be invited if this is necessary.
2. Update Mechanism
- Status Page. The primary channel for real-time status updates (https://status.skyhighsecurity.com/). Customers can subscribe to status notifications by email, webhook, or Slack.
- Email Notifications. The Communications Lead sends proactive emails to internal stakeholders and distribution lists at regular intervals (e.g., every 30-60 minutes, or upon significant change). In some cases, external notifications are also sent by email, separately from status page updates. For customers with an assigned CSM/TCSM, proactive updates are shared depending on the severity of the incident.
- Support Ticket. Your existing support ticket will be linked to the MI, and updates will be logged there.
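
The email update cadence can be illustrated with a small scheduling check. This is a sketch under stated assumptions, not a description of internal tooling: the function names are hypothetical, and the "every 30-60 minutes" example is interpreted here as a window in which the next update is expected, with the 60-minute mark treated as the latest acceptable time.

```python
from datetime import datetime, timedelta
from typing import Tuple

# Assumption: the "every 30-60 minutes" example defines a window for the next
# update; the actual cadence may be set per incident.
MIN_INTERVAL = timedelta(minutes=30)
MAX_INTERVAL = timedelta(minutes=60)


def next_update_window(last_update: datetime) -> Tuple[datetime, datetime]:
    """Return the (earliest, latest) times for the next scheduled update."""
    return last_update + MIN_INTERVAL, last_update + MAX_INTERVAL


def update_due(last_update: datetime, now: datetime,
               significant_change: bool = False) -> bool:
    """An update is due on a significant change, or once the latest time is reached."""
    _, latest = next_update_window(last_update)
    return significant_change or now >= latest
```

For example, with a last update at 12:00, `update_due` returns `False` at 12:45 unless `significant_change=True` is passed, and returns `True` from 13:00 onward.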
What Happens After a Major Incident
Once the service is restored, the focus shifts from immediate restoration to understanding the root cause and preventing recurrence.
Post Incident Report (PIR) and Root Cause Analysis (RCA) Timelines
| Activity | Description | Target Completion Timeline (Post-MI Resolution) |
|---|---|---|
| Post Incident Report (PIR) | Publication of incident summary (observed fault), incident timelines, and steps taken to restore service | Within 24-48 hours |
| Root Cause Analysis (RCA) | Publication of the root cause analysis, including specific preventative actions (e.g., code fixes, infrastructure changes) | Within 7 business days |
| Corrective Action Tracking | The scheduled preventative actions are tracked through the Skyhigh Problem Management process. | Ongoing, based on complexity |
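
The PIR and RCA targets can be worked out from the MI resolution time. The sketch below is illustrative only: the function names and dictionary keys are hypothetical, the 24-48 hour PIR window is counted in calendar hours, and business days are assumed to be Monday through Friday with public holidays ignored.

```python
from datetime import datetime, timedelta
from typing import Dict


def add_business_days(start: datetime, days: int) -> datetime:
    """Advance by `days` business days, skipping Saturdays and Sundays.

    Assumption: business days are Monday-Friday; public holidays are ignored.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday through Friday
            remaining -= 1
    return current


def post_incident_targets(resolved_at: datetime) -> Dict[str, datetime]:
    """Return target times measured from MI resolution."""
    return {
        "pir_target": resolved_at + timedelta(hours=24),  # start of the 24-48 hour window
        "pir_latest": resolved_at + timedelta(hours=48),  # end of the 24-48 hour window
        "rca_due": add_business_days(resolved_at, 7),     # within 7 business days
    }
```

For an MI resolved on Friday, 1 March 2024 at 10:00, this yields a PIR window ending on 3 March at 10:00 and an RCA target of Tuesday, 12 March at 10:00.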
