Major Incident Management Process


This document describes Skyhigh Security’s Incident Management (IM) and Major Incident Management (MIM) processes. It explains how Skyhigh Security detects, evaluates, and manages major incidents to reduce service disruption and maintain platform reliability. The MIM process is designed to ensure rapid response, coordinated technical engagement, and clear communication with Users during high-severity service events.

Skyhigh Security’s MIM process follows industry-recognized best practices, incorporating core principles from ITIL standards for Incident Management. The process is also aligned with the requirements of ISO 22301:2012 (business continuity management) and ISO/IEC 27001:2013 (information security management). These standards guide our commitment to consistent, effective, and transparent incident handling.

Determine When an Incident Becomes a Major Incident

A Major Incident is defined as an event that causes a significant disruption to critical business services, impacts a large number of users or customers, and requires an immediate, coordinated response beyond the standard incident management process.

Criteria for Major Incident Classification

Criterion	Description	Examples
Severity	Critical impact on a core service or system.	Complete service outage, data loss, severe security breach.
Scope	Affecting a significant number of customers or users.	Global platform degradation, regional control plane failure.
Business Impact	Significant financial or reputational risk.	Inability to enforce core security policies, major regulatory compliance failure.
Urgency	Requires immediate and high-priority attention.	Rapidly escalating failure, immediate need for executive awareness.
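The criteria above can be read as a checklist. The following is a minimal sketch in Python, assuming an incident is classified as major only when all four criteria are met; that rule, the `IncidentAssessment` record, and its field names are illustrative assumptions, not part of Skyhigh Security's tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class IncidentAssessment:
    """Hypothetical triage record; one flag per row of the criteria table."""
    critical_severity: bool            # Severity: critical impact on a core service
    broad_scope: bool                  # Scope: significant number of customers affected
    significant_business_impact: bool  # Business Impact: financial or reputational risk
    urgent: bool                       # Urgency: needs immediate, high-priority attention


def unmet_criteria(assessment: IncidentAssessment) -> List[str]:
    """Return the names of the criteria that the incident does not meet."""
    checks = [
        ("Severity", assessment.critical_severity),
        ("Scope", assessment.broad_scope),
        ("Business Impact", assessment.significant_business_impact),
        ("Urgency", assessment.urgent),
    ]
    return [name for name, met in checks if not met]


def is_major_incident(assessment: IncidentAssessment) -> bool:
    """Assumed rule: major only when every criterion is met."""
    return not unmet_criteria(assessment)
```

For example, a complete outage that affects only a single tenant would fail the Scope check under this assumed rule and stay in the standard incident process.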

Objectives of the Major Incident Management Process

Our Incident and Major Incident Management processes are designed to achieve the following core objectives:

Rapid Discovery and Recovery. Ensure clear roles, responsibilities, and timely allocation of resources for efficient investigation and resolution of service-impacting issues.
Effective Communication. Provide accurate and timely reports and communications to internal and external stakeholders.
Proactive Escalation. Implement clear escalation paths (functional or hierarchical) to address critical issues promptly.
Customer-Centric Resolution. Resolve incidents to customer satisfaction with the shortest Mean Time To Resolve (MTTR), through effective workarounds or permanent fixes.
Continuous Improvement. Facilitate a thorough post-incident review process to identify and implement follow-up actions, address root causes, and prevent recurrence.

Major Incident Workflow and Response Stages

Once an issue is declared a Major Incident (MI), an immediate, centralized response is initiated.

Immediate Response Process

Detection & Declaration. The issue is identified and confirmed to meet MI criteria, and the Incident Commander is engaged.
Mobilization. Technical teams, the Communications Lead, and necessary resources are assembled on the internal bridge.
Diagnosis & Action. Technical teams work to diagnose the fault and implement immediate actions to restore service (workarounds, fixes, restarts).
Status Updates. Regular updates are published to the status page and via email.
Service Restoration. A resolution or successful workaround is implemented, and the service is verified as operational.
MI Conclusion. The Incident Commander declares the MI resolved, and the Technical Bridge is closed.
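The stages above can be sketched as a simple ordered state model. This is an illustrative Python sketch only: the `Stage` enum and `next_stage` helper are hypothetical names, and treating the stages as strictly sequential is a simplifying assumption made for the example.

```python
from enum import IntEnum
from typing import Optional


class Stage(IntEnum):
    """Response stages in the order listed above (names are illustrative)."""
    DETECTION_AND_DECLARATION = 1
    MOBILIZATION = 2
    DIAGNOSIS_AND_ACTION = 3
    STATUS_UPDATES = 4
    SERVICE_RESTORATION = 5
    MI_CONCLUSION = 6


def next_stage(current: Stage) -> Optional[Stage]:
    """Return the stage that follows `current`, or None once the MI is concluded."""
    if current is Stage.MI_CONCLUSION:
        return None
    return Stage(current + 1)


if __name__ == "__main__":
    # Walk the workflow from declaration to conclusion.
    stage: Optional[Stage] = Stage.DETECTION_AND_DECLARATION
    while stage is not None:
        print(stage.name)
        stage = next_stage(stage)
```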

Roles and Responsibilities During a Major Incident

Role	Responsibility	Customer Interaction Expectation
Incident Commander (IC)	Owns the overall resolution of the MI, directs the technical teams, and approves communications	Direct interaction is rare; your updates come via the Communications Lead
Technical Leads/Resolver Group	Subject Matter Experts (SMEs) focused on diagnosing and resolving the technical issue	Usually none; in some cases, they may engage ad hoc to gather specific information that aids resolution
Communications Lead	Manages all internal and external communication regarding the MI status and progress	Primary point of contact for status updates (via email, status page, or Bridge)
Executive Sponsors	Provide necessary authority, resources, and strategic guidance	None during MI

Communication Channels and Update Mechanisms

1. Incident Bridges

Internal Technical Bridge. A dedicated virtual conference is established immediately for technical teams to collaborate on diagnosis and resolution. Customers do not join this bridge.
Customer Update Bridge (as needed). In high-severity, prolonged incidents, a dedicated bridge may be opened for key customers to receive verbal updates from the Communications Lead. You will be invited if this is necessary.

2. Update Mechanism

Status Page. The primary channel for real-time status updates (https://status.skyhighsecurity.com/). Customers may subscribe to status notifications by email, webhook, or Slack.
Email Notifications. Proactive emails sent by the Communications Lead to internal stakeholders and distribution lists at regular intervals (e.g., every 30-60 minutes, or upon significant change). In some cases, external notifications are also sent by email, separately from Status Page updates. For customers with an assigned CSM/TCSM, proactive updates are shared depending on the severity of the incident.
Support Ticket. Your existing support ticket will be linked to the MI, and updates will be logged there.
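The email cadence described above (a regular interval, or sooner upon significant change) can be expressed as a small check. This is a hedged Python sketch: the `update_due` function is hypothetical, and the 30-minute default is simply the lower bound of the example range given above, not a committed service level.

```python
from datetime import datetime, timedelta


def update_due(
    last_sent: datetime,
    now: datetime,
    significant_change: bool = False,
    interval: timedelta = timedelta(minutes=30),  # assumed default, illustrative only
) -> bool:
    """Return True when a new stakeholder update should be sent.

    An update is due when the regular interval has elapsed since the last
    update, or immediately when a significant change has occurred.
    """
    return significant_change or (now - last_sent) >= interval


if __name__ == "__main__":
    sent = datetime(2024, 1, 1, 12, 0)
    print(update_due(sent, sent + timedelta(minutes=10)))
    print(update_due(sent, sent + timedelta(minutes=10), significant_change=True))
    print(update_due(sent, sent + timedelta(minutes=45)))
```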

What Happens After a Major Incident

Once the service is restored, the focus shifts from immediate restoration to understanding the root cause and preventing recurrence.

Post Incident Report (PIR) and Root Cause Analysis (RCA) Timelines

Activity	Description	Target Completion Timeline (Post-MI Resolution)
Post Incident Report (PIR)	Publication of incident summary (observed fault), incident timelines, and steps taken to restore service	Within 24-48 hours
Root Cause Analysis (RCA)	Publication of the root cause analysis, including specific preventative actions (e.g., code fixes, infrastructure changes)	Within 7 business days
Corrective Action Tracking	The scheduled preventative actions are tracked through the Skyhigh Problem Management process	Ongoing, based on complexity
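As a worked illustration of the timelines in the table above, the following Python sketch computes the latest target dates from the MI resolution time. It assumes that business days are Monday through Friday with no holiday calendar, and it uses the 48-hour upper bound for the PIR; the function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict


def add_business_days(start: datetime, days: int) -> datetime:
    """Advance `start` by `days` business days (assumed Monday-Friday, no holidays)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday-Friday
            remaining -= 1
    return current


def report_deadlines(resolved_at: datetime) -> Dict[str, datetime]:
    """Return the latest target times for the PIR and the RCA."""
    return {
        "pir_latest": resolved_at + timedelta(hours=48),     # upper bound of 24-48 hours
        "rca_latest": add_business_days(resolved_at, 7),     # within 7 business days
    }


if __name__ == "__main__":
    # Example: MI resolved on Friday, 5 January 2024 at 09:00.
    for name, due in report_deadlines(datetime(2024, 1, 5, 9, 0)).items():
        print(name, due.isoformat())
```

For an MI resolved on a Friday morning, the seven business days skip two weekends under these assumptions, so the RCA target lands on the Tuesday of the second following week.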