Beta

HA Scale

Failures happen. Downtime doesn't.

The HA Scale makes your Tophan cluster resilient to hardware failures, network partitions, and maintenance windows. VMs restart automatically, storage rebuilds itself, and workloads migrate without downtime.

Features

| Feature | Description | Status |
|---|---|---|
| VRRP Failover | Virtual IP failover between nodes. Sub-second switchover for critical services. | Beta |
| VM Auto-Restart | If a node fails, its VMs restart on surviving nodes automatically. Priority-based ordering. | Beta |
| Live Migration | Move running VMs between nodes with zero downtime for planned maintenance. | Beta |
| Storage Migration | Relocate VM storage between pools or nodes without shutting down the VM. | Planned |
| Maintenance Mode | Drain a node of all workloads before maintenance. One click, zero downtime. | Beta |
| Self-Healing | Automatic recovery from detected failures. Failed services restart, failed nodes are fenced. | Beta |
| Fencing | IPMI/iLO-based node fencing ensures failed nodes are definitively shut down before recovery begins. No split-brain data corruption. | Beta |
| Split-Brain Prevention | Quorum-based decisions prevent both halves of a partition from operating independently. Configurable quorum policies. | Beta |
| Health Monitoring | Continuous health checks across all nodes and services. Failure detection in seconds. | Beta |
| Affinity Rules | Control VM placement: keep VMs together, keep them apart, prefer specific nodes. | Planned |
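
To make the Split-Brain Prevention row concrete, here is a minimal Python sketch of one possible quorum policy: a strict majority of votes. The function name `has_quorum` and the one-vote-per-node model are assumptions for illustration only; they are not Tophan's API, and the actual configurable policies may differ.

```python
def has_quorum(reachable_votes: int, total_votes: int) -> bool:
    """Return True if this side of a partition holds a strict majority.

    Assumption: every node carries exactly one vote.
    """
    if total_votes <= 0 or not 0 <= reachable_votes <= total_votes:
        raise ValueError("vote counts out of range")
    # Strict majority: at most one side of any partition can satisfy this.
    return 2 * reachable_votes > total_votes


# A 5-node cluster split 3/2: only the 3-node side may keep operating.
print(has_quorum(3, 5))  # True
print(has_quorum(2, 5))  # False
# An even 2/2 split of a 4-node cluster leaves neither side with quorum.
print(has_quorum(2, 4))  # False
```

Under a strict-majority rule, two halves of a partition can never both pass the check, which is the property the feature description relies on. An even split passes on neither side, so clusters using this kind of policy typically prefer an odd number of votes.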

How Failover Works

  1. Detection: Health monitoring detects a node failure (missed heartbeats, failed health checks)
  2. Fencing: The failed node is fenced via IPMI/iLO to ensure it’s truly down — no assumptions
  3. Recovery: VMs from the failed node are restarted on surviving nodes based on priority and resource availability
  4. Storage: The Storage Scale automatically begins reconstructing any data that was on the failed node’s disks

The entire sequence — detection through recovery — completes in under a minute for most configurations. No human intervention required.
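
The fencing and recovery steps above can be sketched as a small Python simulation. Everything here is hypothetical: the `VM` and `Node` classes, the `plan_recovery` function, the "higher priority number restarts first" convention, and the use of free memory as the only resource are assumptions made for illustration, not Tophan's scheduler.

```python
from dataclasses import dataclass, field


@dataclass
class VM:
    name: str
    priority: int    # assumption: higher value restarts first
    memory_gb: int


@dataclass
class Node:
    name: str
    free_memory_gb: int
    vms: list = field(default_factory=list)


def plan_recovery(failed: Node, survivors: list, fenced: bool) -> dict:
    """Map each VM on the failed node to a surviving node, or None if nothing fits."""
    if not fenced:
        # Fencing must be confirmed first: restarting VMs while the old node
        # might still be running them risks two writers on the same disk.
        raise RuntimeError(f"{failed.name} is not confirmed fenced")
    free = {n.name: n.free_memory_gb for n in survivors}
    plan = {}
    # Highest priority first, each placed on the node with the most free memory.
    for vm in sorted(failed.vms, key=lambda v: v.priority, reverse=True):
        target = max(free, key=free.get, default=None)
        if target is not None and free[target] >= vm.memory_gb:
            free[target] -= vm.memory_gb
            plan[vm.name] = target
        else:
            plan[vm.name] = None  # no surviving node has room for this VM
    return plan


failed = Node("node-a", 0, [VM("db", 100, 16), VM("web", 50, 8), VM("batch", 10, 32)])
survivors = [Node("node-b", 24), Node("node-c", 20)]
print(plan_recovery(failed, survivors, fenced=True))
# {'db': 'node-b', 'web': 'node-c', 'batch': None}
```

In this example the low-priority `batch` VM stays unplaced because no survivor has 32 GB free, which shows why priority ordering matters when the remaining capacity cannot hold everything.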

Maintenance Without Downtime

When you need to patch, upgrade, or replace hardware:

  1. Put the node in maintenance mode
  2. All VMs live-migrate to other nodes (zero downtime)
  3. Perform maintenance
  4. Exit maintenance mode
  5. VMs can migrate back (or stay where they are — the scheduler optimises placement)

This is routine, not exceptional. Tophan is designed to be maintained without maintenance windows.
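
The drain step of the maintenance workflow can be illustrated with a short Python sketch. The `drain_node` function, the VM-count capacity model, and the least-loaded target choice are assumptions for illustration; they do not describe how Tophan's scheduler actually places workloads.

```python
from collections import Counter


def drain_node(node: str, placement: dict, capacity: dict) -> dict:
    """Return a new placement with every VM moved off `node`.

    placement maps VM name -> node name; capacity maps node name -> max VM count.
    Raises RuntimeError if the other nodes cannot absorb the workload, so a
    drain either completes fully or leaves the original placement untouched.
    """
    load = Counter(placement.values())
    new_placement = dict(placement)  # work on a copy; the input is not modified
    targets = [n for n in capacity if n != node]
    for vm in sorted(v for v, host in placement.items() if host == node):
        candidates = [n for n in targets if load[n] < capacity[n]]
        if not candidates:
            raise RuntimeError(f"no capacity left to migrate {vm} off {node}")
        target = min(candidates, key=lambda n: load[n])  # least-loaded node
        new_placement[vm] = target
        load[target] += 1
        load[node] -= 1
    return new_placement


placement = {"vm1": "node-a", "vm2": "node-a", "vm3": "node-b"}
capacity = {"node-a": 4, "node-b": 2, "node-c": 2}
print(drain_node("node-a", placement, capacity))
# {'vm1': 'node-c', 'vm2': 'node-b', 'vm3': 'node-b'}
```

Refusing the drain when capacity is insufficient is a deliberate choice in this sketch: it keeps the "zero downtime" expectation honest by never stopping a VM that has nowhere to go.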