High Availability Architecture for Vessel Communication Systems
Why HA Matters More at Sea Than Ashore
In a shore-based IT environment, a server failure means calling the on-call engineer or waiting for the next business day. In a maritime environment, a server failure during a medical emergency or a fire means the communication system is unavailable at the moment it is most needed.
This is the asymmetry that makes high availability (HA) non-optional for safety-critical vessel communication systems.
The Single-Point-of-Failure Problem
A typical first-generation vessel server deployment looks like this:
- One server
- One database instance
- One Matrix homeserver process
- One SIP proxy process
Every component in that list is a single point of failure. For a vessel in port, this is inconvenient. For a vessel in the South Atlantic during a medical emergency, it is a serious operational risk.
HA Architecture: The Core Pattern
A high-availability maritime communication deployment requires at minimum:
- Two servers (active/standby or active/active) — located in physically separate locations on the vessel (server room and a secondary location)
- Redundant database — PostgreSQL with streaming replication between the two nodes, using Patroni for automatic primary promotion on failure
- Shared or synchronised fast storage — media files (voice messages, shared documents, images) must be accessible from both nodes
- Virtual IP (VIP) failover — a single IP address floating between nodes, managed by keepalived. Client devices always connect to the VIP, so failover is transparent to them
- Health monitoring — each node continuously monitors the other; automatic failover triggers within 30 seconds of primary failure
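The health-monitoring step above can be sketched as a simple heartbeat loop. This is a minimal illustration, not production code: the peer URL, intervals, and miss threshold are assumptions chosen so that detection lands inside the 30-second window.

```python
import time
from urllib import request, error

PEER_HEALTH_URL = "http://10.0.0.2:8008/health"  # hypothetical peer endpoint
CHECK_INTERVAL = 5.0    # seconds between probes
MAX_MISSES = 6          # 6 x 5 s = failover triggered within ~30 s

def peer_alive(url: str, timeout: float = 2.0) -> bool:
    """Return True if the peer answers its health endpoint with HTTP 200."""
    try:
        with request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (error.URLError, OSError):
        return False

def monitor(probe, max_misses: int = MAX_MISSES,
            interval: float = CHECK_INTERVAL) -> int:
    """Count consecutive failed probes; return once the threshold is hit.

    `probe` is any zero-argument callable returning True/False. The loop
    runs indefinitely while the peer stays healthy; the caller promotes
    the standby when this function returns.
    """
    misses = 0
    while misses < max_misses:
        if probe():
            misses = 0      # any success resets the counter
        else:
            misses += 1
        time.sleep(interval)
    return misses
```

A real deployment would use keepalived and Patroni for this (as described below); the sketch only shows why the probe interval and miss threshold together determine the detection window.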
Component Deep Dive
PostgreSQL + Patroni
Database availability is the core dependency. Matrix (Synapse) and the SIP proxy both require the database. A database failure brings down the entire platform.
Patroni manages a PostgreSQL streaming replication cluster:
- Primary node accepts reads/writes
- Standby node receives streaming replication in real time
- etcd (or consul) coordinates which node is primary
- On primary failure, Patroni promotes the standby in under 30 seconds
- Synapse reconnects to the new primary automatically
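A minimal Patroni node configuration for this two-node pattern might look like the following sketch. All hostnames, addresses, paths, and credentials are placeholders, not values from a real deployment; note how the DCS `ttl` bounds how long a dead primary can hold the leader key, which is what puts promotion under 30 seconds.

```yaml
scope: vessel-comms            # cluster name, shared by both nodes
name: node1                    # unique per node

restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.0.1:8008

etcd3:
  hosts: 10.0.0.1:2379,10.0.0.2:2379

bootstrap:
  dcs:
    ttl: 30                    # leader key TTL: standby promotes within ~30 s
    loop_wait: 10
    retry_timeout: 10

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.0.1:5432
  data_dir: /var/lib/postgresql/16/main
  authentication:
    replication:
      username: replicator
      password: change-me
```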
Keepalived for VIP
Keepalived manages the virtual IP address. Both servers run keepalived. The primary holds the VIP. On primary failure, the standby detects the failure via heartbeat and takes over the VIP.
Clients — crew devices — connect to the VIP. They don't know which physical server is currently primary. Failover is transparent except for in-flight requests during the 5–10 second transition.
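The VIP arrangement above corresponds roughly to a keepalived configuration like this sketch (interface name, router ID, priorities, and the VIP itself are illustrative placeholders):

```conf
# /etc/keepalived/keepalived.conf on the primary node
vrrp_instance COMMS_VIP {
    state MASTER             # BACKUP on the standby node
    interface eth0
    virtual_router_id 51
    priority 150             # standby uses a lower priority, e.g. 100
    advert_int 1             # 1 s VRRP heartbeat; a dead master is
                             # detected after a few missed advertisements
    authentication {
        auth_type PASS
        auth_pass change-me
    }
    virtual_ipaddress {
        10.0.0.10/24         # the VIP that crew devices connect to
    }
}
```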
Matrix Homeserver Failover
Synapse (the Matrix homeserver) should be started on the standby node automatically when the VIP transfers. This can be handled by keepalived notify scripts or a systemd service that monitors network interface state.
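One way to tie Synapse startup to VIP ownership is a keepalived notify script. This sketch assumes a systemd-managed Synapse unit named `matrix-synapse`; the script path and unit name are assumptions.

```shell
#!/bin/sh
# Referenced from keepalived.conf as:  notify /usr/local/bin/vip-notify.sh
# keepalived invokes it as: <script> <INSTANCE|GROUP> <name> <MASTER|BACKUP|FAULT>
STATE="$3"

case "$STATE" in
    MASTER)
        systemctl start matrix-synapse   # this node now holds the VIP
        ;;
    BACKUP|FAULT)
        systemctl stop matrix-synapse    # VIP moved away; stop serving
        ;;
esac
```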
TURN Server Redundancy
TURN servers handle WebRTC media relay. Both nodes run Coturn. Clients are configured with both TURN server IPs. If the primary is unreachable, the client falls back to the standby automatically via the STUN/TURN protocol.
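On the client side, offering both relays amounts to listing two entries in the WebRTC ICE server configuration, roughly as in this sketch (addresses and credentials are placeholders):

```json
{
  "iceServers": [
    { "urls": "turn:10.0.0.1:3478", "username": "crew", "credential": "change-me" },
    { "urls": "turn:10.0.0.2:3478", "username": "crew", "credential": "change-me" }
  ]
}
```

During ICE gathering the client tries every listed server, so candidates from the standby relay are available even when the primary node is down.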
Testing the Failover
No HA deployment should be considered complete without exercising the failure modes it claims to survive: power off the primary node, pull its network cable, and kill the database process, then verify that the standby takes over within the expected window and that client devices reconnect without manual intervention.
Monitoring
HA without monitoring is incomplete. At minimum:
- Patroni REST API monitored for leader status, replication lag, and timeline
- keepalived health status and VIP location
- PostgreSQL replication lag (alert if > 10 seconds)
- Synapse process health and federation state
- Disk space on both nodes (full disk is the most common cause of database failure)
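The Patroni and replication-lag checks above can be automated against Patroni's REST API. The sketch below parses a JSON payload shaped like the `/cluster` endpoint's response; the sample payload is illustrative, and Patroni reports lag in its own units, so a strictly time-based threshold (seconds) would need `pg_stat_replication` instead.

```python
import json

# Illustrative response shape; the real payload comes from e.g.
# GET http://node1:8008/cluster on either Patroni node.
SAMPLE = json.dumps({
    "members": [
        {"name": "node1", "role": "leader",  "state": "running"},
        {"name": "node2", "role": "replica", "state": "running", "lag": 0},
    ]
})

def cluster_health(payload: str, max_lag: int = 10 * 1024 * 1024):
    """Return (leader_name, alerts) for a Patroni /cluster payload.

    Alerts fire when there is not exactly one leader, a member is not
    running, or a replica's reported lag exceeds `max_lag`.
    """
    members = json.loads(payload)["members"]
    leaders = [m["name"] for m in members if m["role"] == "leader"]
    alerts = []
    if len(leaders) != 1:
        alerts.append(f"expected exactly one leader, found {len(leaders)}")
    for m in members:
        if m["state"] != "running":
            alerts.append(f"{m['name']} is {m['state']}")
        if m.get("lag", 0) > max_lag:
            alerts.append(f"{m['name']} lag {m['lag']} exceeds limit")
    return (leaders[0] if leaders else None), alerts
```

A cron job or monitoring agent would fetch the endpoint on both nodes and raise an alarm whenever `alerts` is non-empty.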
Sizing for a Vessel Communication Deployment
For a vessel with 200–500 crew:
- Two small-form-factor servers (e.g., Intel NUC or HPE ProLiant MicroServer class hardware) are adequate
- 16–32 GB RAM per node
- 500 GB SSD RAID1 per node
- 1 GbE network interface