(QUIC) Solana Validator Failovers

June 17th 2025

SOL Strategies Team

SOL Strategies manages multiple Solana validators in a fast-paced context where every ~400ms counts and uptime is critical. To this end, validator operators often maintain an active, voting instance and a passive, backup, non-voting instance. Given a slot time of ~400ms, a failover from an active to a passive validator should ideally take less than this amount of time. We have automated this process to take as little as 10ms on some of our validator active/passive pairs using a peer-to-peer QUIC-based solution we are creatively calling solana-validator-failover.

In essence, solana-validator-failover allows a passive validator to initiate a failover from an active one by starting a bi-directional QUIC stream over which they coordinate an identity and file transfer in as little as 10ms. Because a GIF is an easier watch than a wall of text, this is what a failover looks like from a passive validator’s point of view taking over from its active counterpart:

And this is what it looks like from an active validator’s point of view:

The Failover Process

The passive validator’s primary function is that of a backup in case things go sideways on its active counterpart, although it is most often used to temporarily assume an active role while its peer is upgraded. Failovers are thus not an uncommon occurrence. Switching between validators requires a few basic coordinated steps that must be done as quickly as possible to avoid downtime and missed rewards. These can be summarized as:

1. Active validator sets its identity to a passive one

2. Transfer of the validator’s tower file (a vote history ledger of sorts) to the passive validator

3. Passive validator sets its identity to the active, voting one

Automating this simple three-step process is often tackled with SSH-based tools such as SCP or rsync. While these are certainly fast in most contexts, our testing on our (geographically) closest validator pairs showed the process taking 600ms at best and 1–1.3s at worst, well over a single ~400ms slot.

Architecture

Our failover solution uses a QUIC-based protocol, chosen for its:

  • Low latency (UDP-based)
  • Connection multiplexing
  • Built-in security
  • Stream prioritization

Initiating a failover with solana-validator-failover begins by ensuring validators are synced and healthy, querying their gossip state and deducing their current role: active or passive. When a validator is passive, it starts a QUIC failover server and waits for a connection from its active peer. The active validator starts a QUIC client that connects to the passive validator’s failover server.
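As a rough illustration of the role check described above (not the tool's actual code), the sketch below assumes gossip state is available as a list of entries with `pubkey` and `gossip` ("ip:port") fields, similar in shape to the JSON-RPC `getClusterNodes` response. The function name `deduce_role` and the idea of matching on the node's public IP are assumptions made for this example.

```python
from typing import Iterable, Mapping


def deduce_role(gossip_nodes: Iterable[Mapping], local_ip: str, active_pubkey: str) -> str:
    """Return 'active' or 'passive' for the node advertising local_ip in gossip.

    Hypothetical helper: a node is considered active when the identity it
    advertises in gossip equals the pair's active (voting) identity.
    """
    for node in gossip_nodes:
        gossip_addr = node.get("gossip") or ""      # may be missing or null
        ip = gossip_addr.rsplit(":", 1)[0]          # strip the port
        if ip == local_ip:
            return "active" if node["pubkey"] == active_pubkey else "passive"
    # Not being visible in gossip is treated as "not healthy enough to fail over".
    raise LookupError(f"{local_ip} not found in gossip")


if __name__ == "__main__":
    nodes = [
        {"pubkey": "ActiveIdentity111", "gossip": "203.0.113.10:8001"},
        {"pubkey": "PassiveIdentity222", "gossip": "203.0.113.20:8001"},
    ]
    print(deduce_role(nodes, "203.0.113.20", "ActiveIdentity111"))
```

A passive result would lead the node to start the failover server, while an active result would lead it to start the client.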

Over the established bi-directional QUIC stream, the validator pair exchanges information about each other, including their intended post-failover identities, public IPs and private DNS names. This information is used to verify that the active identity moves from the active to the passive validator, and that the two passive identities are distinct, so that no identity is duplicated during the brief period in which both validators hold passive identities.
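The checks on the exchanged information might look something like the following. This is a minimal sketch under stated assumptions: the `PeerInfo` fields and the `validate_failover_plan` function are hypothetical names for illustration and do not reflect the tool's actual message format.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PeerInfo:
    """Hypothetical shape of what each peer sends over the stream."""
    public_ip: str
    hostname: str         # private DNS name
    active_pubkey: str    # identity this peer considers the voting one
    passive_pubkey: str   # identity this peer uses while not voting


def validate_failover_plan(active: PeerInfo, passive: PeerInfo) -> List[str]:
    """Return a list of problems; an empty list means the plan looks sane."""
    problems = []
    if active.active_pubkey != passive.active_pubkey:
        problems.append("peers disagree on the active identity")
    if active.passive_pubkey == passive.passive_pubkey:
        # Both nodes are briefly passive at once, so these must differ.
        problems.append("passive identities must be distinct")
    actives = {active.active_pubkey, passive.active_pubkey}
    if actives & {active.passive_pubkey, passive.passive_pubkey}:
        problems.append("a passive identity equals the active identity")
    if active.public_ip == passive.public_ip:
        problems.append("peers report the same public IP")
    return problems
```

An empty result would let the failover proceed to the operator's review; any reported problem would be a reason to abort before identities are touched.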

To avoid failing over while the active validator is due to be a network leader, the active validator holds the failover until no leader slots are scheduled for its identity in the next 5 minutes. It then waits until the estimated start of the network’s next ~400ms slot to give the failover the best chance of keeping up.
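The timing arithmetic behind this hold can be sketched as follows, assuming a fixed 400ms slot (real slot times vary) and a leader schedule given as a plain list of slot numbers. All names here are hypothetical and chosen for this example only.

```python
SLOT_MS = 400        # nominal slot time; actual slots drift around this value
LOOKAHEAD_S = 300    # the 5-minute window mentioned above


def slots_in_lookahead(lookahead_s: int = LOOKAHEAD_S, slot_ms: int = SLOT_MS) -> int:
    """Number of nominal slots covered by the lookahead window."""
    return (lookahead_s * 1000) // slot_ms


def is_clear_of_leader_slots(current_slot: int, leader_slots, lookahead_s: int = LOOKAHEAD_S) -> bool:
    """True when none of our leader slots fall inside the lookahead window."""
    horizon = current_slot + slots_in_lookahead(lookahead_s)
    return not any(current_slot <= s <= horizon for s in leader_slots)


def ms_until_next_slot(now_ms: int, slot_start_ms: int, slot_ms: int = SLOT_MS) -> int:
    """Estimated wait until the next slot boundary, given a known slot start time."""
    elapsed = (now_ms - slot_start_ms) % slot_ms
    return (slot_ms - elapsed) % slot_ms


if __name__ == "__main__":
    print(slots_in_lookahead())                       # 750 slots in 5 minutes
    print(is_clear_of_leader_slots(1000, [1200]))     # False: leader slot inside window
    print(ms_until_next_slot(1_000_150, 1_000_000))   # 250
```

At 400ms per slot, a 5-minute window spans 750 slots, so the active validator would hold the failover whenever any of its leader slots falls within the next 750 slots.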

QUIC Failover

Once a connection is established between the validator pair, a failover summary is presented to the operator on the passive one for review.

By default, solana-validator-failover performs a dry run in which only the tower file transfer takes place and identity changes are mocked on both ends. Since the tower file transfer makes up the bulk of the time taken, this gives a good indication of the expected total failover time under current network conditions.

Once confirmed by the operator, the passive validator instructs the active one to begin the failover process. The active validator does so by setting its identity to passive and streaming its tower file to the passive, soon-to-be-active validator.

On full receipt of the tower file, the passive validator sets its identity to active and assumes the active role, completing the failover process.
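The ordering of these steps, and how a dry run differs from a real failover, can be sketched in a single process as below. In practice the steps run on two hosts coordinated over the QUIC stream; here each step is a caller-supplied function, and `run_failover` and its parameter names are hypothetical.

```python
import time
from typing import Callable, Dict


def run_failover(
    set_active_to_passive: Callable[[], None],
    transfer_tower: Callable[[], None],
    set_passive_to_active: Callable[[], None],
    dry_run: bool = True,
) -> Dict[str, float]:
    """Run the three steps in order and return per-step timings in milliseconds.

    In a dry run the identity changes are replaced with no-ops and only the
    tower transfer is actually performed.
    """
    timings: Dict[str, float] = {}

    def timed(name: str, step: Callable[[], None]) -> None:
        start = time.perf_counter()
        step()
        timings[name] = (time.perf_counter() - start) * 1000

    def mocked() -> None:
        pass

    timed("active_to_passive_ms", mocked if dry_run else set_active_to_passive)
    timed("tower_transfer_ms", transfer_tower)
    timed("passive_to_active_ms", mocked if dry_run else set_passive_to_active)
    timings["total_ms"] = sum(timings.values())
    return timings
```

Keeping the identity steps behind a flag that defaults to a dry run means an operator has to opt in explicitly before any identity is changed.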

After the failover is complete, the newly active validator summarizes the time taken to transfer identities and the tower file, and then begins monitoring the post-failover state. Monitoring involves querying the network for the active identity’s vote credits and rank over a period of time. Post-failover, rank is ideally unchanged and vote credits are expected to keep rising, which indicates that the newly active validator is voting successfully.
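A post-failover health check along these lines could be expressed as follows. This is only a sketch: it assumes the monitor has already collected `(vote_credits, rank)` samples in chronological order, and `check_post_failover` is a hypothetical name.

```python
from typing import Dict, List, Tuple


def check_post_failover(samples: List[Tuple[int, int]], baseline_rank: int) -> Dict[str, bool]:
    """Evaluate (vote_credits, rank) samples taken after the failover.

    credits_rising: credits never decrease and end higher than they started.
    rank_unchanged: the latest rank equals the pre-failover rank.
    """
    credits = [c for c, _ in samples]
    never_decreasing = all(later >= earlier for earlier, later in zip(credits, credits[1:]))
    credits_rising = len(credits) >= 2 and never_decreasing and credits[-1] > credits[0]
    rank_unchanged = bool(samples) and samples[-1][1] == baseline_rank
    return {"credits_rising": credits_rising, "rank_unchanged": rank_unchanged}


if __name__ == "__main__":
    healthy = [(1000, 42), (1016, 42), (1032, 42)]
    print(check_post_failover(healthy, baseline_rank=42))
```

Flat vote credits across several samples would suggest the new active validator is not voting and that the failover needs attention.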

Summary

By building on QUIC, which runs over UDP, we are able to achieve sub-slot failovers, maintaining high levels of uptime in a secure way while keeping up with the demands of modern validator operations.
