Failover - Replicated Registry

Assumption: The primary site has streaming replication set up to the secondary site using PostgreSQL replication for "zero loss" failover. If the primary site is lost, failover is a manual process that requires intervention by the NIC technical team.

  • Step 1. ssh to the failover server.
  • Step 2. Promote the read-only replica to read-write so it can take over as a production registry.
    sudo su - postgres
    /usr/lib/postgresql/17/bin/pg_ctl promote -D /var/lib/postgresql/17/main/
    After running the command above you should see "waiting for server to promote.... done" followed by "server promoted".
    exit (to leave the postgres user's shell)
    sudo pg_ctlcluster 17 main restart

  • Step 3. Edit the resin.xml config file to update the host name - if required.
  • Step 4. Start resin - /etc/init.d/resin start
  • Step 5. Start nginx - sudo systemctl start nginx
  • Step 6. Change the A / AAAA records for the registry to the IP of the failover server.
  • Step 7. Open the required ports (80, 443, 700, 53) using the firewall appliance and/or the OS firewall.
  • Step 8. Once the A / AAAA records are updated, renew the SSL / Let's Encrypt certificates if required.
  • Step 9. In the BIND config file, uncomment the notify and allow-transfer (AXFR) statements, then start BIND.
  • Step 10. Log in to the portal and, on the Login and Security page, reset the EPP server to listen on 127.0.0.1 (this clears the IPs stored in the db from the old primary). Then restart EPP, go to the EPP configuration page and set EPP to listen on the correct (new server) IPs.
  • Step 11. Contact the TLD DNS providers and ask them to pull the zones from the new secondary IP.
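
The promotion in Step 2 and the DNS change in Step 6 can be double-checked from the shell. The following is a minimal sketch, not part of the registry software: the function names are illustrative, and it assumes psql and dig are installed on the failover server and that the user can sudo to postgres. pg_is_in_recovery() is a built-in PostgreSQL function that returns f once a server has been promoted.

```shell
# Map the output of "SELECT pg_is_in_recovery();" (t or f) to a role name.
recovery_role() {
  case "$1" in
    f) echo "primary" ;;   # not in recovery: the server accepts writes
    t) echo "standby" ;;   # still in recovery: not promoted
    *) echo "unknown" ;;   # empty or unexpected output, e.g. psql failed
  esac
}

# Ask the local cluster for its state; succeed only if it is a primary.
check_promoted() {
  role=$(recovery_role "$(sudo -u postgres psql -At -c 'SELECT pg_is_in_recovery();' postgres)")
  echo "local PostgreSQL role: $role"
  [ "$role" = "primary" ]
}

# Succeed if the IP in $1 is one of the newline-separated addresses in $2.
ip_in_list() {
  printf '%s\n' "$2" | grep -Fxq -- "$1"
}

# Succeed if the A record for host $1 currently resolves to IP $2.
check_dns() {
  ip_in_list "$2" "$(dig +short A "$1")"
}
```

For example, run check_promoted after Step 2, and check_dns [registry host name] [failover IP] after Step 6. Both return a non-zero exit status when the check fails.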

If you have an RDAP server, re-sync the RDAP db from the new primary.

Failover Preparations

  • Every time you log in to the production registry, check the replication status monitor page to make sure the replicas are in sync. If not, re-sync them (db icon top right, header row).
  • Every time you update the ROOT.war on the primary SRS, copy the ROOT.war to the failover server. Note: On the failover server resin and nginx are stopped until the failover is promoted - only postgres is running while the server is in standby status.
  • Make sure the CoCCA support key is on the failover server.
  • If you have an SSL certificate / jks that is signed by a CA, make sure it is on the failover server.
  • Make sure you maintain updated / valid contact details (email / phone) for third party DNS providers so they can be easily contacted if required.
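
The replication check in the first bullet can also be made from the command line on the primary. This is a hedged sketch rather than a replacement for the monitor page: the function names are made up for illustration, and it assumes the user can sudo to postgres. pg_stat_replication is PostgreSQL's built-in view of connected replicas.

```shell
# On the primary: print one line per connected replica in the form
# address|state|replay lag in bytes. No output means no replica is connected.
replica_status() {
  sudo -u postgres psql -At -F '|' -c \
    "SELECT client_addr, state, pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)
       FROM pg_stat_replication;" postgres
}

# Read replica_status lines on stdin. Succeed only if at least one replica
# is listed and every listed replica is in the "streaming" state.
all_streaming() {
  awk -F '|' '{ n++; if ($2 != "streaming") bad++ }
              END { exit (n > 0 && bad == 0) ? 0 : 1 }'
}
```

Usage: replica_status | all_streaming || echo "replica missing or not streaming"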

Revert to Standby status after a failover test

On the Standby server

/etc/init.d/resin stop
sudo pg_ctlcluster 17 main stop
sudo rm -rf /var/lib/postgresql/17/main
sudo pg_basebackup -h [primary IP] -p 5432 -U [replica_username] -R -P -v -C --slot=[descriptive_slot_name] -D /var/lib/postgresql/17/main/

Note: When you re-sync you need to use a different slot name. You can see the existing slot name in postgresql.conf (primary_slot_name). Slot names may contain only lower-case letters, numbers and underscores.
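
Because every re-sync needs a fresh slot name, one option is to derive the name from the current date and time. The sketch below shows that idea under stated assumptions: the function names and the default "standby" prefix are hypothetical. PostgreSQL accepts only lower-case letters, digits and underscores in a slot name (up to 63 characters), so a name containing hyphens is rejected.

```shell
# Succeed if $1 is usable as a replication slot name: 1 to 63 characters,
# lower-case letters, digits and underscores only.
valid_slot_name() {
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z0-9_]{1,63}$'
}

# Print a name such as standby_20250131_1405. $1 is an optional prefix.
new_slot_name() {
  printf '%s_%s\n' "${1:-standby}" "$(date +%Y%m%d_%H%M)"
}
```

Usage: slot=$(new_slot_name failover) and then pass --slot="$slot" to pg_basebackup. Use the same value for primary_slot_name afterwards.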

Check that the postgres user owns the data folder and has the required permissions:
sudo chmod -R 700 /var/lib/postgresql/17/main/
sudo chgrp -R postgres /var/lib/postgresql/17/main/
sudo chown -R postgres /var/lib/postgresql/17/main/
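Make sure the standby.signal file exists. pg_basebackup -R normally creates it; if it is missing, create it as an empty file:
sudo -u postgres touch /var/lib/postgresql/17/main/standby.signal

Note: Only the presence of standby.signal matters, its contents are ignored. The old standby_mode setting was removed in PostgreSQL 12 and must not be added to postgresql.conf.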

sudo nano /etc/postgresql/17/main/postgresql.conf
max_connections = 300
primary_conninfo = 'host=[primary IP or host name] user=replica_username password=********'

* Update the primary slot name to the slot name used when you did the re-sync.

primary_slot_name = 'descriptive_slot_name'
hot_standby = on
max_standby_archive_delay = 30s
max_standby_streaming_delay = 30s
wal_receiver_create_temp_slot = on
wal_receiver_status_interval = 10s
hot_standby_feedback = on
wal_receiver_timeout = 60s
wal_retrieve_retry_interval = 5s
recovery_min_apply_delay = 0
sudo pg_ctlcluster 17 main restart
/etc/init.d/resin start
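
After the restart it is worth confirming that the server really is back in standby mode and receiving WAL from the primary. A minimal sketch, assuming the user can sudo to postgres on the standby; the function names are illustrative. pg_is_in_recovery() and the pg_stat_wal_receiver view are built into PostgreSQL.

```shell
# On the standby: print "in_recovery|receiver_status", for example
# "t|streaming". The status is empty when no WAL receiver is running.
standby_state() {
  sudo -u postgres psql -At -F '|' -c \
    "SELECT pg_is_in_recovery(), (SELECT status FROM pg_stat_wal_receiver);" postgres
}

# Succeed only if the state string in $1 describes a standby that is
# in recovery and streaming from the primary.
standby_healthy() {
  [ "$1" = "t|streaming" ]
}
```

Usage: standby_healthy "$(standby_state)" || echo "standby is not streaming"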

On Primary - Drop old / unused Slots

sudo -u postgres psql postgres

  • To check the existing / in-use slots on the primary:
    # select * from pg_replication_slots;
  • To drop a slot, run this command on the primary db:
    # select pg_drop_replication_slot('replica_slot_name');
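
To find candidates for dropping, the slot list can be filtered for slots that no replica is currently using. This is a sketch under the same assumptions as the commands above (run on the primary, sudo access to postgres); the function names are hypothetical. It only lists names, so each slot can be reviewed before it is dropped by hand.

```shell
# On the primary: print "slot_name|active" for every replication slot.
list_slots() {
  sudo -u postgres psql -At -F '|' -c \
    "SELECT slot_name, active FROM pg_replication_slots;" postgres
}

# Read list_slots output on stdin and print only the names of slots
# whose active flag is f (no replica connected to them).
inactive_slots() {
  awk -F '|' '$2 == "f" { print $1 }'
}
```

Usage: list_slots | inactive_slots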