
High Availability Architecture for Vessel Communication Systems

Shipwize · 7 min read

Why HA Matters More at Sea Than Ashore

In a shore-based IT environment, a server failure means calling the on-call engineer or waiting for the next business day. In a maritime environment, a server failure during a medical emergency or fire situation means the communication system is unavailable at the moment it's most needed.

This is the asymmetry that makes high availability (HA) non-optional for safety-critical vessel communication systems.

The Single-Point-of-Failure Problem

A typical first-generation vessel server deployment looks like this:

  • One server
  • One database instance
  • One Matrix homeserver process
  • One SIP proxy process

When this server fails — hardware fault, storage corruption, process crash — everything stops. No messaging, no calls, no push notifications.

For a vessel in port, this is inconvenient. For a vessel in the South Atlantic during a medical emergency, it's a serious operational risk.

HA Architecture: The Core Pattern

A high-availability maritime communication deployment requires at minimum:

Two servers (active/standby or active/active) — located in physically separate locations on the vessel (server room and a secondary location)

Redundant database — PostgreSQL with streaming replication between the two nodes, using Patroni for automatic primary promotion on failure

Shared fast storage or synchronised storage — media files (voice messages, shared documents, images) must be accessible from both nodes

Virtual IP (VIP) failover — a single IP address floating between nodes, managed by keepalived. Client devices always connect to the VIP; failover is transparent

Health monitoring — each node continuously monitors the other; automatic failover triggers within 30 seconds of primary failure

Component Deep Dive

PostgreSQL + Patroni

Database availability is the core dependency. Matrix (Synapse) and the SIP proxy both require the database. A database failure brings down the entire platform.

Patroni manages a PostgreSQL streaming replication cluster:

  • Primary node accepts reads/writes
  • Standby node receives streaming replication in real time
  • etcd (or Consul) coordinates which node is primary
  • On primary failure, Patroni promotes the standby in under 30 seconds
  • Synapse reconnects to the new primary automatically
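A minimal Patroni node configuration might look like the following sketch. All hostnames, addresses, and credentials here are illustrative assumptions, not values from a real deployment:

```yaml
# patroni.yml on node A (illustrative values)
scope: vessel-comms            # cluster name shared by both nodes
name: node-a                   # unique per node

etcd3:
  hosts: 10.0.0.5:2379         # DCS coordinating primary election

restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.0.11:8008

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.0.11:5432
  data_dir: /var/lib/postgresql/16/main
  authentication:
    replication:
      username: replicator
      password: change-me

bootstrap:
  dcs:
    ttl: 30                    # leader lease; expiry triggers failover
    loop_wait: 10
    retry_timeout: 10
    postgresql:
      use_pg_rewind: true      # lets a failed primary rejoin as standby
```

The standby node runs the same file with its own `name` and `connect_address` values; Patroni derives the primary/standby roles from the DCS, not from the config.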

Keepalived for VIP

Keepalived manages the virtual IP address. Both servers run keepalived. The primary holds the VIP. On primary failure, the standby detects the failure via heartbeat and takes over the VIP.

Clients — crew devices — connect to the VIP. They don't know which physical server is currently primary. Failover is transparent except for in-flight requests during the 5–10 second transition.
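A VRRP instance sketch for the primary node's keepalived.conf, assuming interface eth0 and an illustrative VIP of 10.0.0.10 (the standby would use state BACKUP and a lower priority):

```conf
vrrp_instance COMMS_VIP {
    state MASTER              # BACKUP on the standby node
    interface eth0
    virtual_router_id 51
    priority 150              # standby uses e.g. 100
    advert_int 1              # heartbeat interval in seconds
    authentication {
        auth_type PASS
        auth_pass change-me
    }
    virtual_ipaddress {
        10.0.0.10/24          # the VIP clients connect to
    }
}
```

If the backup node stops receiving VRRP advertisements, it transitions to MASTER and claims the VIP on its own interface.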

Matrix Homeserver Failover

Synapse (the Matrix homeserver) should be started on the standby node automatically when the VIP transfers. This can be handled by keepalived notify scripts or a systemd unit that tracks VIP state.
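One way to tie Synapse startup to VIP ownership is a keepalived notify script. This is a sketch that assumes systemd manages Synapse under the unit name matrix-synapse and that the script is registered with a `notify` line inside the vrrp_instance block:

```shell
#!/bin/sh
# /usr/local/bin/vip-notify.sh
# keepalived invokes notify scripts as:
#   vip-notify.sh <INSTANCE|GROUP> <name> <MASTER|BACKUP|FAULT>
STATE="$3"

case "$STATE" in
  MASTER)
    # This node now holds the VIP: bring up the homeserver
    systemctl start matrix-synapse
    ;;
  BACKUP|FAULT)
    # VIP lost: stop Synapse so only one instance writes to the database
    systemctl stop matrix-synapse
    ;;
esac
```

Stopping Synapse on the node that loses the VIP matters as much as starting it on the new primary: two homeservers writing to the same database is a fast route to corruption.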

TURN Server Redundancy

TURN servers handle WebRTC media relay. Both nodes run Coturn, and clients are configured with both TURN server addresses. If the primary is unreachable, ICE candidate gathering simply uses whichever TURN server responds, so media relay falls back to the standby automatically.
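On the client side this is just an ICE server list containing both Coturn instances; a sketch in WebRTC's RTCConfiguration shape, with illustrative addresses and credentials:

```json
{
  "iceServers": [
    { "urls": "turn:10.0.0.11:3478", "username": "crew", "credential": "change-me" },
    { "urls": "turn:10.0.0.12:3478", "username": "crew", "credential": "change-me" }
  ]
}
```

Because ICE gathers relay candidates from every listed server, no explicit failover logic is needed in the client.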

Testing the Failover

No HA deployment should be considered complete without testing:

  • Planned failover — Manually trigger failover while a voice call is in progress. Call should survive or reconnect within 30 seconds.
  • Hard power failure — Cut power to the primary node. Measure time to full platform availability on standby.
  • Storage corruption test — Simulate database corruption on the primary (for example, by damaging a data file in a controlled test environment). Confirm the streaming replica is clean and that Patroni promotes it correctly.
  • Network partition — Disconnect the primary from the vessel network. Confirm VIP transfers to standby. Reconnect primary and confirm it rejoins as standby, not incorrectly as primary (split-brain prevention).
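The planned-failover drill can be driven from the command line. A sketch using patronictl, assuming the cluster name vessel-comms and the illustrative VIP 10.0.0.10 from earlier:

```shell
# Confirm current leader and replication state
patronictl -c /etc/patroni.yml list

# Planned switchover to the standby (prompts for confirmation)
patronictl -c /etc/patroni.yml switchover vessel-comms

# On each node, check whether it currently holds the VIP
ip addr show | grep 10.0.0.10
```

Run the drill with an active voice call in progress and time each step; the measured numbers, not the design targets, are what go in the operational runbook.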
Monitoring

HA without monitoring is incomplete. At minimum:

  • Patroni REST API monitored for leader status, replication lag, and timeline
  • keepalived health status and VIP location
  • PostgreSQL replication lag (alert if > 10 seconds)
  • Synapse process health and federation state
  • Disk space on both nodes (a full disk is the most common cause of database failure)

Monitoring should alert a shore-based operations NOC when anomalies are detected, so that IT support can intervene remotely before a failure occurs.
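The Patroni checks above can be scripted against its REST API. A hedged sketch: the helper below only evaluates a response shaped like Patroni's /cluster endpoint (members with role, state, and lag fields) and returns alert strings, leaving the HTTP call and alert delivery to the surrounding monitoring stack. The threshold value and messages are our own assumptions:

```python
# Evaluate a Patroni /cluster-style response and return alert strings.
# The JSON shape (members, role, state, lag) follows Patroni's REST API;
# the threshold is an assumption in the API's lag units (a time-based
# "10 seconds behind" check would instead query pg_stat_replication).

LAG_ALERT = 10  # tune per deployment

def check_cluster(cluster: dict) -> list[str]:
    alerts = []
    members = cluster.get("members", [])

    # Exactly one leader is healthy; zero or several means trouble.
    leaders = [m for m in members if m.get("role") == "leader"]
    if len(leaders) != 1:
        alerts.append(f"expected exactly one leader, found {len(leaders)}")

    for m in members:
        name = m.get("name", "?")
        if m.get("state") != "running":
            alerts.append(f"{name}: state is {m.get('state')}")
        lag = m.get("lag")
        if isinstance(lag, int) and lag > LAG_ALERT:
            alerts.append(f"{name}: replication lag {lag} exceeds threshold")

    return alerts

if __name__ == "__main__":
    sample = {
        "members": [
            {"name": "node-a", "role": "leader", "state": "running"},
            {"name": "node-b", "role": "replica", "state": "running", "lag": 0},
        ]
    }
    print(check_cluster(sample))  # prints [] for a healthy cluster
```

A cron job or exporter can poll each node's REST API, feed the parsed JSON through this check, and forward any non-empty result to the shore-side NOC.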

Sizing for a Vessel Communication Deployment

For a vessel with 200–500 crew:

  • Two mini-PC servers (e.g., Intel NUC or HPE ProLiant MicroServer) are adequate
  • 16–32 GB RAM per node
  • 500 GB SSD in RAID 1 per node
  • 1 GbE network interface

The hardware cost for a redundant deployment is modest, approximately €2,000–€3,000 in total. Against the risk of communication failure during a safety incident, this is straightforward capital expenditure.

See Shipwize in Action

Experience offline-first maritime communication and Augmented Communication live.

Request a Demo