Algorand MainNet Relay Guidance

Version 1.2

Overview

We expect to have 3 supported distribution channels for releases:

  1. update.sh / tarball
  2. Debian package (recommended on Debian-based distributions)
  3. RPM package (recommended on Fedora-based distributions)

The packages will be hosted on a public repo to enable APT and YUM installation and upgrade.  The tarball will continue to be hosted on S3 for manual / automated upgrades.

Whichever installation and upgrade mechanism you select, it's important that you have a bullet-proof process to ensure minimum downtime when upgrading.  We recommend having a test environment to validate upgrades using your own process -- we will generally make upgrades available early for validation prior to release.

The software will continue to have 4 main components:

  1. The binaries folder <bindir> - where algod, kmd, goal, etc. reside. We assume that this folder is added to your path (this is done automatically if you are using a DEB/RPM package)
  2. The data folder <datadir> - where the genesis.json file resides, as well as the kmd data folder and the ledger data folder, and the config.json file. If you are using RPM/DEB packages, <datadir>=/var/lib/algorand
  3. The ledger folder (subfolder of <datadir>) - where the blockchain and accounts databases live, as well as active Participation Key files (*.partkey)
  4. The kmd folder (subfolder of <datadir>) - only relevant when root keys live on the machine (which should not be the case for relay nodes)

For relays, which are always Archival, the ledger folder is the only folder with unbounded growth; it will continue to grow as the blockchain grows, until the platform evolves to shard or otherwise distribute the storage. The estimated maximum growth of the ledger itself (ledger.block.sqlite) is ~6TB/year at peak transaction volume. The account database (ledger.tracker.sqlite) can also grow without bound if malicious users decide to spam the network with unique account addresses; we estimate its worst-case growth at ~3TB/year, though absent malicious attacks, something on the order of 10GB-100GB is more reasonable.

The ledger directory should be on a fast storage medium, as the database is in the critical path for block-to-block processing, though we have optimized to avoid completely blocking progress on database writes.

Software Installation, Configuration, and Registration

Software Installation

Three options:

Debian package

Follow “install using the Debian package” on https://developer.algorand.org/docs/installing-ubuntu

RPM package

Follow “install using the RPM package” on https://developer.algorand.org/docs/installing-other-linux-distros-staging

Update.sh / tarball

Using an unprivileged user (do not run a node as root):

update.sh -i -c stable -p <bindir> -d <datadir> -n

Replace <bindir> and <datadir> with two folders of your choice that the current user can write to.

You then need to set up automatic updates: https://developer.algorand.org/docs/configure-auto-update 

The remainder of this guide assumes that your binaries folder <bindir> is in your PATH.
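For a tarball install, you can add <bindir> to your PATH in your shell profile. A minimal sketch, assuming a hypothetical install location:

```shell
# Hypothetical <bindir>; adjust to wherever you pointed update.sh's -p flag.
BINDIR="$HOME/algorand/bin"
# Append to PATH for the current shell; add this line to ~/.profile to persist it.
export PATH="$PATH:$BINDIR"
# Quick sanity check that the folder is now on PATH:
case ":$PATH:" in
  *":$BINDIR:"*) echo "bindir on PATH" ;;
  *)             echo "bindir missing" ;;
esac
```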

Enable as a service:

You should configure algod to run as a service.  If you installed the deb or rpm package, this was done for you.  Otherwise refer to the guidance here for details (adjust accordingly).
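For a tarball install, a systemd unit along these lines works. The user, paths, and restart policy below are assumptions to adjust for your environment (the DEB/RPM packages install an equivalent unit for you); the demo writes the file to the current directory:

```shell
# Write a minimal systemd unit sketch for algod. User, paths, and restart
# policy are assumptions -- adjust them to your install.
cat > algorand.service <<'EOF'
[Unit]
Description=Algorand daemon (algod)
After=network.target

[Service]
User=algorand
ExecStart=/home/algorand/node/algod -d /home/algorand/node/data
Restart=always
RestartSec=5s

[Install]
WantedBy=multi-user.target
EOF
# To install: sudo cp algorand.service /etc/systemd/system/
#             sudo systemctl enable --now algorand
```

Running algod directly under systemd (rather than `goal node start`, which forks and exits) keeps the service manager in charge of the process lifetime.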

Unregistering your relay from TestNet (very rare case, skip if you did a fresh install)

If your relay node was registered on TestNet, you must first unregister it by sending an email to testnet-team@algorand.com

Switching from TestNet to MainNet (very rare case, skip if you did a fresh install)

If you installed your relay node before and are connected to TestNet, you need to switch to MainNet using the following instructions:

  1. Stop your node.
  2. Wipe your data folder (except config.json.example).
  3. Copy <bindir>/genesisfiles/mainnet/genesis.json to the data folder.
  4. Set up the config.json file in the data folder properly, according to the Configuration section below.
  5. In your data folder, run grep v1.0 genesis.json and verify you get a match such as "id": "v1.0", -- if not, go back to step 3 and copy this file instead: <bindir>/genesisfiles/genesis/mainnet/genesis.json
  6. Start your node.
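The verification in step 5 can be scripted. This sketch fabricates a minimal stand-in genesis.json purely to demonstrate the check; on a real node, run the grep against the genesis.json in your data folder:

```shell
# Demo only: create a stand-in genesis.json (the real one comes from
# <bindir>/genesisfiles as described in step 3).
cat > genesis.json <<'EOF'
{
  "id": "v1.0",
  "network": "mainnet"
}
EOF
# The step-5 check: a MainNet genesis file must contain "id": "v1.0".
if grep -q '"id": "v1.0"' genesis.json; then
  echo "MainNet genesis OK"
else
  echo "wrong genesis -- copy the alternate file from step 3 instead"
fi
```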

Configuration

Incoming Connections

Enable incoming connections (we are using port 4160 by convention on MainNet)
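Incoming connections are enabled by setting "NetAddress" in <datadir>/config.json. A minimal sketch (the demo writes a file in the current directory; in practice, merge the key into your existing config.json rather than replacing the whole file):

```shell
# Demo: a minimal config.json enabling incoming connections on port 4160.
# In practice, merge "NetAddress" into your existing <datadir>/config.json.
cat > config.json <<'EOF'
{
  "NetAddress": ":4160"
}
EOF
grep '"NetAddress"' config.json
```

Remember to also open TCP 4160 in your firewall or security group, and restart algod after editing config.json.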

Testing your configuration:
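A hedged sketch of the connectivity checks; the commands are shown as comments since they require a live relay, and relay.example.com is a placeholder for your relay's DNS name:

```shell
# On the relay host -- algod should show a LISTEN entry on the relay port:
#   ss -tln | grep ':4160'
# From a remote machine -- the port should be reachable through your firewall:
#   nc -z -v -w 5 relay.example.com 4160
note="run the commented checks against your own relay"
echo "$note"
```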

Telemetry

Enable telemetry - we require all Relays to have telemetry enabled.
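Telemetry is configured with the diagcfg tool shipped in <bindir>. The hostname below is an assumption, and the commands are commented out since they modify node state (exact flags may vary by version):

```shell
# Give the relay a recognizable telemetry name, then enable telemetry:
#   diagcfg telemetry name -n "relay-01.example.com"
#   diagcfg telemetry enable
# Restart algod afterwards so the setting takes effect.
note="telemetry is configured via diagcfg; restart algod afterwards"
echo "$note"
```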

Metrics

If you plan to constantly monitor your relay for signs of problems yourself, you are not expected to enable metrics. Metrics should be enabled by anyone who does not have professional operations staff monitoring and maintaining their hardware.  We recommend using such a service, but you are not obligated to do so.

Even if you enable metrics, you are responsible for ensuring the availability and reliability of your node at all times. Some monitoring guidance is provided below.

If you want or need to provide metrics for the foundation to help monitor:
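Metrics can likewise be toggled with diagcfg; a sketch, with the commands commented out since they modify node state (exact subcommand flags may vary by version):

```shell
# Enable the metrics endpoint:
#   diagcfg metric enable
# Check the current setting:
#   diagcfg metric status
note="metrics are toggled via diagcfg"
echo "$note"
```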

Testing your configuration:

Basic Testing

Before registering your relay, please do the following basic testing. All the tests must pass. If you have any questions, send an email to support@algorand.foundation.

From the server, run:

From another computer on macOS / Linux, run:
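A hedged reconstruction of typical pre-registration checks; relay.example.com:4160 is a placeholder for your relay's DNS name and port, and the commands are commented out since they require a live relay:

```shell
# From the server:
#   goal node status -d "$DATADIR"   # the last committed round should be advancing
#   ss -tln | grep ':4160'           # algod should be listening on the relay port
# From another computer (macOS / Linux):
#   nc -z -v -w 5 relay.example.com 4160   # TCP reachability through the firewall
note="all checks must pass before registering"
echo "$note"
```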

Registration

Before registering, please do the basic testing above.

Register with the Algorand Foundation using the customized URL you received by email to get your relay listed. If you do not have a customized URL for registration, you may send an email to support@algorand.foundation. Be prepared to provide the following details:

Contact - Email Address

On behalf of - Who is the investor / provider

Instance Resources - VM?  CPU/RAM/SSD/WAN

DNS / IP name and port - DNS or IP (prefer DNS)

Telemetry GUID:Host - Can be discovered through telemetry if necessary

Monitoring and Updating

Optional Configuration with algoh

algoh is a tool shipped alongside algod that hosts the algod process for additional diagnostic benefits. Specifically, it captures details of any runtime errors that cause algod to exit abruptly; if this happens, algoh sends the details in a telemetry event.  It also monitors the algod status using the REST API, and if it detects a stall (> 120 seconds without progress) it captures internal algod state and sends the details to telemetry (and captures logs and uploads them).

algoh can be used by running it just as you would run algod, or you can specify `-H` when using goal node start. You're free to come up with whatever process works best for you if you choose to use algoh. We would appreciate it if you do, but realize that it complicates the monitoring and hosting process: algoh is not prepared to run as a service, so you'll want to monitor both the algoh and algod processes, and in case of problems kill them both and restart algoh. We can provide more guidance if you have questions.

Updating

We are hoping to eventually transition everyone over to using DEB and RPM packages for installation and upgrades. For now, we will continue to offer releases as tarballs available for manual or automatic download and installation.  We will provide a public communication channel (possibly Slack) over which we will communicate details about pending releases, and will use this channel to coordinate application of the update to reduce downtime.  We will also use the forums to communicate.  

Monitoring Guidance

You should monitor the normal system resources as you would for any production server; you should also monitor the specific Algorand processes in use (algod, kmd, and possibly algoh and node_exporter). You can also use `goal`, the REST API, or your own tools to process the node.log file, which contains a wealth of information and insight into the internal workings of algod. You can control the log verbosity by modifying the "BaseLoggerDebugLevel" value in config.json. The default is 4 (Info); level 5 (Debug) is extremely verbose and can impact performance, so it is not recommended.
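For example, a config.json fragment pinning the log verbosity (the demo writes a file in the current directory; in practice, merge the key into your existing <datadir>/config.json):

```shell
# Demo: config.json fragment setting the log verbosity to Info (level 4).
# Merge the key into your existing <datadir>/config.json rather than
# overwriting the whole file.
cat > config.json <<'EOF'
{
  "BaseLoggerDebugLevel": 4
}
EOF
grep '"BaseLoggerDebugLevel"' config.json
```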

Monitoring - What to Watch For

Alert Levels: 1 = Investigate soon; 2 = Investigate Immediately

Network Stall

Track time-per-block

If time > 15 seconds, alert level 1

If time > 1 minute, alert level 2

If during an update, ignore

If rolling average of last 10 blocks > 10 seconds, alert level 2
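The rolling-average rule can be sketched in a few lines of shell; the block times below are fabricated for illustration (in practice, derive them from node.log or the REST API):

```shell
# Hypothetical durations (seconds) of the last 10 blocks.
times="4 5 6 5 4 5 30 5 4 6"
# Compute the rolling average with awk.
avg=$(echo "$times" | tr ' ' '\n' | awk '{s+=$1; n++} END {printf "%.1f", s/n}')
echo "rolling average: ${avg}s"
# Alert level 2 if the average of the last 10 blocks exceeds 10 seconds.
alert=$(awk -v a="$avg" 'BEGIN { if (a > 10) print "alert level 2"; else print "ok" }')
echo "$alert"
```

Note that a single slow block (30s here) trips the per-block level-1 threshold without necessarily tripping the rolling-average rule.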

High CPU usage on node

Alert level 1 = 70%

Alert level 2 = 90%

Low Disk space on node

Alert level 1 = 70% used

Alert level 2 = 90% used
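A sketch of the disk-space check; the usage figure is hard-coded for illustration (in practice, take it from df on the filesystem holding <datadir>):

```shell
# Hypothetical used-space percentage; in practice:
#   used=$(df -P /path/to/datadir | awk 'NR==2 {gsub("%",""); print $5}')
used=82
# Map usage against the alert thresholds above.
if [ "$used" -ge 90 ]; then level="alert level 2"
elif [ "$used" -ge 70 ]; then level="alert level 1"
else level="ok"
fi
echo "disk ${used}% used -> $level"
```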

Algod process problems (algod exits and doesn't restart, or algod keeps restarting)

Alert level 2 (1 if Relay)

Panic / algoh-ErrorOutput telemetry events (Algorand monitors telemetry)

Alert level 1

Algoh-DeadManTriggered telemetry event (Algorand monitors telemetry)

Alert level 1

Error telemetry events (Algorand monitors telemetry)

Alert level 1

Heartbeat telemetry events stopped (Algorand monitors telemetry)

Alert level 1

After updates, ensure correct (latest) software version is installed

Active Attacks - learn the boundaries

Excessive network connections

Excessive network traffic?

Insufficient network connections

Excessive connection churn

High Blockchain Load (maybe not malicious)

New Account spam

Large blocks / many transactions

Excessive rejected transactions

 

DNS/SRV Problems

 

Diagnosis - failure modes: stalls, CPU overload, network traffic flooding

Each node is an isolated island of failure whose loss shouldn't affect the network

 

Network Stall

Identify node(s) not participating that should be

Check dashboards

Check telemetry

Use `carpenter` on any instance to see vote counts

Check telemetry for errors / panics, algoh ErrorOutput events

 

High CPU Usage

External Cause (attack?), high network too?

Check other nodes/relays - also high?

No External Cause (no attack - possibly software bug)

 

Low Disk Space

Expected or unexpected?

 

Investigating a single node

Is algod running?

Is algod repeatedly restarting?

Check CPU usage (e.g. `top`).

Ideally this is captured automatically so it doesn't need to be logged by hand (most APM tools will monitor this).

Capture logs (`goal logging send`).

Correct any obvious issues (kill processes, free disk space).

Restart algod, monitor (if it seems to be running, use `carpenter` to ensure it catches up / starts voting).

If it doesn't recover, RESET NODE:

Stop algod, delete <ledgerdir>/*.sqlite, and start algod again.  Do NOT do this on a participation node without first moving the *.partkey files out (move them back after restarting algod).
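As a shell sketch; the commands are commented out since they are destructive, and the ledger path is an assumption (on MainNet, <ledgerdir> is typically <datadir>/mainnet-v1.0):

```shell
# RESET NODE -- destructive; an archival relay will re-fetch the full ledger.
#   goal node stop -d "$DATADIR"
#   rm "$DATADIR"/mainnet-v1.0/*.sqlite   # <ledgerdir> location is an assumption
#   goal node start -d "$DATADIR"
# On a participation node, move the *.partkey files out of <ledgerdir> first,
# and move them back after algod restarts.
note="reset sketch above -- verify paths before running"
echo "$note"
```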

Post-Mortem

Check telemetry for clues.

Check logs.

 

Investigating concurrent failures

If it appears multiple nodes are not functioning properly, it's likely a software bug - either a fork or just corrupted state.

Figure out if all nodes are affected, or just some.

Check for malicious behavior (equivocating / bogus votes).

Capture logs from affected and unaffected nodes.

Restart a single affected node and see if the problem resolves.

If so, restart all affected nodes.

If not, RESET NODE and see if it recovers.

Apply correction to all affected nodes.

If normal operation doesn't resume and the network remains stalled, a software patch may be needed.

Communication

For the health of the network we expect information to flow freely between Relay runners, node runners, and Algorand.  If you experience any issues that are not isolated to your environment, please communicate with Algorand so we can be made aware and keep track of issues that arise. We will assist as necessary in diagnosing issues for the sake of keeping everything running smoothly. Likewise, Algorand will communicate incidents as they arise, in case they impact your environment or show up in your monitoring as well.


Change Log

Version | Date | Who | What
1.0 | 2019 06 11 | David Shoots | Creation
1.1 | 2019 06 13 |  | Update of installation instructions
1.2 | 2019 10 01 |  | Add RPM/DEB, add "Basic Testing" section, and minor changes