Skyhigh Security

Major Incident Management Process

This document describes Skyhigh Security’s Incident Management (IM) and Major Incident Management (MIM) processes. It explains how Skyhigh Security detects, evaluates, and manages major incidents to reduce service disruption and maintain platform reliability. The MIM process is designed to ensure rapid response, coordinated technical engagement, and clear communication with customers during high-severity service events.

Skyhigh Security’s MIM process follows industry-recognized best practices, incorporating core principles from ITIL standards for Incident Management. The process is also aligned with the requirements of ISO 22301:2012 for business continuity management and ISO/IEC 27001:2013 for information security management. These standards guide our commitment to consistent, effective, and transparent incident handling.

Determine When an Incident Becomes a Major Incident

A Major Incident is defined as an event that causes a significant disruption to critical business services, impacting a large number of users or customers, and requires an immediate, coordinated response beyond the standard incident management process. 

Criteria for Major Incident Classification

| Criterion | Description | Examples |
| --- | --- | --- |
| Severity | Critical impact on a core service or system. | Complete service outage, data loss, severe security breach. |
| Scope | Affecting a significant number of customers or users. | Global platform degradation, regional control plane failure. |
| Business Impact | Significant financial or reputational risk. | Inability to enforce core security policies, major regulatory compliance failure. |
| Urgency | Requires immediate and high-priority attention. | Rapidly escalating failure, immediate need for executive awareness. |
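
The four criteria above can be read as a checklist. The following is a minimal sketch of that reading, not Skyhigh Security's actual tooling. The names `IncidentAssessment`, `unmet_criteria`, and `is_major_incident` are hypothetical, and the rule that all four criteria must hold is an assumption: this document does not say how the criteria are combined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncidentAssessment:
    """One yes/no judgment per criterion in the classification table."""
    severity: bool         # critical impact on a core service or system
    scope: bool            # significant number of customers or users affected
    business_impact: bool  # significant financial or reputational risk
    urgency: bool          # requires immediate, high-priority attention


def unmet_criteria(assessment):
    """Return the names of the criteria that are not satisfied."""
    names = ("severity", "scope", "business_impact", "urgency")
    return [name for name in names if not getattr(assessment, name)]


def is_major_incident(assessment):
    # ASSUMPTION: all four criteria must hold. The source document does not
    # state whether the criteria are combined with "all" or "any".
    return not unmet_criteria(assessment)
```

For example, `unmet_criteria(IncidentAssessment(True, False, True, False))` lists `scope` and `urgency`, which shows at a glance why an incident stays in the standard process under this assumed rule.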

Objectives of the Major Incident Management Process

Our Incident and Major Incident Management processes are designed to achieve the following core objectives:

  • Rapid Discovery and Recovery. Ensure clear roles, responsibilities, and timely allocation of resources for efficient investigation and resolution of service-impacting issues.
  • Effective Communication. Provide accurate and timely reports and communications to internal and external stakeholders.
  • Proactive Escalation. Implement clear escalation paths (functional or hierarchical) to address critical issues promptly.
  • Customer-Centric Resolution. Resolve incidents to customer satisfaction with the shortest Mean Time To Resolve (MTTR), through effective workarounds or permanent fixes.
  • Continuous Improvement. Facilitate a thorough post-incident review process to identify and implement follow-up actions, address root causes, and prevent recurrence.

Major Incident Workflow and Response Stages

Once an issue is declared a Major Incident (MI), an immediate, centralized response is initiated.

Immediate Response Process

  1. Detection & Declaration. The issue is identified and confirmed to meet MI criteria, and the Incident Commander is engaged.
  2. Mobilization. Technical teams, the Communications Lead, and necessary resources are assembled on the internal bridge.
  3. Diagnosis & Action. Technical teams work to diagnose the fault and implement immediate actions to restore service (workarounds, fixes, restarts).
  4. Status Updates. Regular updates are published to the status page and via email.
  5. Service Restoration. A resolution or successful workaround is implemented, and the service is verified as operational.
  6. MI Conclusion. The Incident Commander declares the MI resolved, and the Technical Bridge is closed.
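
The six stages above can be modeled as a simple ordered state machine. This is an illustrative sketch only: the `Stage` and `MajorIncident` names are hypothetical, and treating the stages as strictly sequential is a simplifying assumption, since status updates in practice continue alongside diagnosis.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Stage(IntEnum):
    """The six response stages, numbered as in the list above."""
    DETECTION_AND_DECLARATION = 1
    MOBILIZATION = 2
    DIAGNOSIS_AND_ACTION = 3
    STATUS_UPDATES = 4
    SERVICE_RESTORATION = 5
    MI_CONCLUSION = 6


@dataclass
class MajorIncident:
    title: str
    stage: Stage = Stage.DETECTION_AND_DECLARATION
    history: list = field(default_factory=list)

    def advance(self):
        """Move to the next stage; refuse to move past MI Conclusion."""
        if self.stage is Stage.MI_CONCLUSION:
            raise ValueError("MI is already concluded")
        self.history.append(self.stage)
        self.stage = Stage(self.stage + 1)
        return self.stage

    @property
    def bridge_open(self):
        # The bridge is assembled at Mobilization and closed at MI Conclusion.
        return Stage.MOBILIZATION <= self.stage < Stage.MI_CONCLUSION
```

Calling `advance()` once on a new `MajorIncident` moves it to `Stage.MOBILIZATION`, at which point `bridge_open` becomes true; it becomes false again once the incident reaches `Stage.MI_CONCLUSION`.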

Roles and Responsibilities During a Major Incident

| Role | Responsibility | Customer Interaction Expectation |
| --- | --- | --- |
| Incident Commander (IC) | Owns the overall resolution of the MI, directs the technical teams, and approves communications | Direct interaction is rare; your updates come via the Communications Lead |
| Technical Leads/Resolver Group | Subject Matter Experts (SMEs) focused on diagnosing and resolving the technical issue | Usually none; in some cases, they may engage ad hoc to gather specific information to aid in resolution |
| Communications Lead | Manages all internal and external communication regarding the MI status and progress | Primary point of contact for status updates (via email, status page, or Bridge) |
| Executive Sponsors | Provide necessary authority, resources, and strategic guidance | None during MI |

Communication Channels and Update Mechanisms

1. Incident Bridges

  • Internal Technical Bridge. A dedicated virtual conference is established immediately for technical teams to collaborate on diagnosis and resolution. Customers do not join this bridge.
  • Customer Update Bridge (as needed). In high-severity, prolonged incidents, a dedicated bridge may be opened for key customers to receive verbal updates from the Communications Lead. You will be invited if this is necessary.

2. Update Mechanism

  • Status Page. The primary channel for real-time status updates (https://status.skyhighsecurity.com/). Customers may subscribe to status notifications by email, webhook, or Slack.
  • Email Notifications. Proactive emails sent by the Communications Lead to internal stakeholders and distribution lists at regular intervals (e.g., every 30-60 minutes, or upon significant change). In some cases, external notifications are also sent by email, separately from Status Page updates. For customers with an assigned CSM/TCSM, proactive updates will be shared depending on the severity of the incident.
  • Support Ticket. Your existing support ticket will be linked to the MI, and updates will be logged there.
  • Support Ticket. Your existing support ticket will be linked to the MI, and updates will be logged there.
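
The email cadence described above (every 30-60 minutes, or upon significant change) can be expressed as a small scheduling check. This is a hedged sketch: the 30 and 60 minute values are the example figures from the text rather than a guaranteed interval, and the function names `update_window` and `update_due` are hypothetical.

```python
from datetime import datetime, timedelta

# Example interval from the text ("e.g., every 30-60 minutes"); the actual
# cadence for a given incident may differ.
MIN_INTERVAL = timedelta(minutes=30)
MAX_INTERVAL = timedelta(minutes=60)


def update_window(last_sent):
    """Return the (earliest, latest) time the next routine update is expected."""
    return last_sent + MIN_INTERVAL, last_sent + MAX_INTERVAL


def update_due(last_sent, now, significant_change=False):
    """An update is due on any significant change or once the window closes."""
    if significant_change:
        return True
    return now - last_sent >= MAX_INTERVAL
```

With a last update at 10:00, `update_window` returns 10:30 and 11:00, and `update_due` stays false until 11:00 unless a significant change occurs earlier.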

What Happens After a Major Incident

Once the service is restored, the focus shifts from immediate restoration to understanding the root cause and preventing recurrence.

Post Incident Report (PIR) and Root Cause Analysis (RCA) Timelines

| Activity | Description | Target Completion Timeline (Post-MI Resolution) |
| --- | --- | --- |
| Post Incident Report (PIR) | Publication of incident summary (observed fault), incident timelines, and steps taken to restore service | Within 24-48 hours |
| Publication of Root Cause Analysis (RCA) | Publication of the root cause analysis, including specific preventative actions (e.g., code fixes, infrastructure changes) | Within 7 business days |
| Corrective Action Tracking | The scheduled preventative actions are tracked through the Skyhigh Problem Management process. | Ongoing, based on complexity |
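
The PIR and RCA targets above can be turned into concrete dates once the MI resolution time is known. The sketch below is illustrative and rests on stated assumptions: business days are taken to be Monday through Friday with no holiday calendar, counting starts on the day after resolution, and the names `add_business_days` and `report_targets` are hypothetical.

```python
from datetime import datetime, timedelta


def add_business_days(start, days):
    """Step forward day by day, counting only Monday-Friday.

    ASSUMPTION: no public holidays are considered, and the resolution day
    itself does not count toward the total.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0 = Monday ... 4 = Friday
            remaining -= 1
    return current


def report_targets(resolved_at):
    """Target dates for the PIR (24-48 hours) and RCA (7 business days)."""
    return {
        "pir_window_start": resolved_at + timedelta(hours=24),
        "pir_window_end": resolved_at + timedelta(hours=48),
        "rca_due": add_business_days(resolved_at, 7),
    }
```

For an MI resolved on Monday, 1 January 2024 at 10:00, this gives a PIR window ending Wednesday, 3 January at 10:00 and an RCA target of Wednesday, 10 January at 10:00.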
