Algorand MainNet Relay Guidance

Version 1.2

Overview

We expect to have 3 supported distribution channels for releases:

  1. update.sh / tarball
  2. Debian package (recommended on Debian-based distributions)
  3. RPM package (recommended on Fedora-based distributions)

The packages will be hosted on a public repo to enable APT and YUM installation and upgrade.  The tarball will continue to be hosted on S3 for manual / automated upgrades.

Whichever installation and upgrade mechanism you select, it's important that you have a bullet-proof process to ensure minimum downtime when upgrading.  We recommend having a test environment to validate upgrades using your own process -- we will generally make upgrades available early for validation prior to release.

The software will continue to have 4 main components:

  1. The binaries folder <bindir> - where algod, kmd, goal, etc. reside. We assume that this folder is added to your path (this is done automatically if you are using a DEB/RPM package)
  2. The data folder <datadir> - where the genesis.json file resides, as well as the kmd data folder and the ledger data folder, and the config.json file. If you are using RPM/DEB packages, <datadir>=/var/lib/algorand
  3. The ledger folder (subfolder of <datadir>) - where the blockchain and accounts databases live, as well as active Participation Key files (*.partkey)
  4. The kmd folder (subfolder of <datadir>) - only relevant when root keys live on the machine (which should not be the case for relay nodes)

For relays, which are always Archival, the ledger folder is the only folder with unbounded growth; it will continue to grow as the blockchain grows, until the platform evolves to shard or otherwise distribute the storage. The estimated maximum growth of the ledger itself (ledger.block.sqlite) is ~6TB/year at peak transaction volume. The account database (ledger.tracker.sqlite) can also grow without bound if malicious users decide to spam the network with unique account addresses; we estimate its worst-case growth at ~3TB/year, though absent malicious attacks, something on the order of 10GB-100GB is more reasonable.

The ledger directory should be on a fast storage medium, as the database is in the critical path for block-to-block processing, though we have optimized to avoid completely blocking progress on database writes.

Software Installation, Configuration, and Registration

Software Installation

Three options:

Debian package

Follow “install using the Debian package” on https://developer.algorand.org/docs/installing-ubuntu

RPM package

Follow “install using the RPM package” on https://developer.algorand.org/docs/installing-other-linux-distros-staging

Update.sh / tarball

Using an unprivileged user (do not run a node as root):

update.sh -i -c stable -p <bindir> -d <datadir> -n

Replace <bindir> and <datadir> with two folders of your choice that the current user can write to.

You then need to set up automatic updates: https://developer.algorand.org/docs/configure-auto-update 

The remainder of this guide assumes that your binaries folder <bindir> is in your PATH.
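For a tarball install, you can add <bindir> to your PATH in your shell profile. A minimal sketch, assuming a hypothetical install location:

```shell
# Hypothetical <bindir>; adjust to wherever you pointed update.sh's -p flag.
BINDIR="$HOME/algorand/bin"
# Append to PATH for the current shell; add this line to ~/.profile to persist it.
export PATH="$PATH:$BINDIR"
# Quick sanity check that the folder is now on PATH:
case ":$PATH:" in
  *":$BINDIR:"*) echo "bindir on PATH" ;;
  *)             echo "bindir missing" ;;
esac
```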

Enable as a service:

You should configure algod to run as a service.  If you installed the deb or rpm package, this was done for you.  Otherwise refer to the guidance here for details (adjust accordingly).
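For a tarball install, a systemd unit along these lines works. The user, paths, and restart policy below are assumptions to adjust for your environment (the DEB/RPM packages install an equivalent unit for you); the demo writes the file to the current directory:

```shell
# Write a minimal systemd unit sketch for algod. User, paths, and restart
# policy are assumptions -- adjust them to your install.
cat > algorand.service <<'EOF'
[Unit]
Description=Algorand daemon (algod)
After=network.target

[Service]
User=algorand
ExecStart=/home/algorand/node/algod -d /home/algorand/node/data
Restart=always
RestartSec=5s

[Install]
WantedBy=multi-user.target
EOF
# To install: sudo cp algorand.service /etc/systemd/system/
#             sudo systemctl enable --now algorand
```

Running algod directly under systemd (rather than `goal node start`, which forks and exits) keeps the service manager in charge of the process lifetime.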

Unregistering your relay from TestNet (very rare case, skip if you did a fresh install)

If your relay node was registered on TestNet, you must first unregister it by sending an email to testnet-team@algorand.com

Switching from TestNet to MainNet (very rare case, skip if you did a fresh install)

If you installed your relay node before and are connected to TestNet, you need to switch to MainNet using the following instructions:

  1. Stop your node.
  2. Wipe your data folder (except config.json.example).
  3. Copy <bindir>/genesisfiles/mainnet/genesis.json to the data folder.
  4. Set up the config.json file in the data folder properly, according to the Configuration section below.
  5. In your data folder, run grep v1.0 genesis.json and verify you get a match such as "id": "v1.0", -- if not, go back to step 3 and copy this file instead: <bindir>/genesisfiles/genesis/mainnet/genesis.json
  6. Start your node.
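The verification in step 5 can be scripted. This sketch fabricates a minimal stand-in genesis.json purely to demonstrate the check; on a real node, run the grep against the genesis.json in your data folder:

```shell
# Demo only: create a stand-in genesis.json (the real one comes from
# <bindir>/genesisfiles as described in step 3).
cat > genesis.json <<'EOF'
{
  "id": "v1.0",
  "network": "mainnet"
}
EOF
# The step-5 check: a MainNet genesis file must contain "id": "v1.0".
if grep -q '"id": "v1.0"' genesis.json; then
  echo "MainNet genesis OK"
else
  echo "wrong genesis -- copy the alternate file from step 3 instead"
fi
```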

Configuration

Incoming Connections

Enable incoming connections (we are using port 4160 by convention on MainNet)
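Incoming connections are enabled by setting "NetAddress" in <datadir>/config.json. A minimal sketch (the demo writes a file in the current directory; in practice, merge the key into your existing config.json rather than replacing the whole file):

```shell
# Demo: a minimal config.json enabling incoming connections on port 4160.
# In practice, merge "NetAddress" into your existing <datadir>/config.json.
cat > config.json <<'EOF'
{
  "NetAddress": ":4160"
}
EOF
grep '"NetAddress"' config.json
```

Remember to also open TCP 4160 in your firewall or security group, and restart algod after editing config.json.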

Testing your configuration:
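A hedged sketch of the connectivity checks; the commands are shown as comments since they require a live relay, and relay.example.com is a placeholder for your relay's DNS name:

```shell
# On the relay host -- algod should show a LISTEN entry on the relay port:
#   ss -tln | grep ':4160'
# From a remote machine -- the port should be reachable through your firewall:
#   nc -z -v -w 5 relay.example.com 4160
note="run the commented checks against your own relay"
echo "$note"
```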

Telemetry

Enable telemetry - we require all Relays to have telemetry enabled.
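Telemetry is configured with the diagcfg tool shipped in <bindir>. The hostname below is an assumption, and the commands are commented out since they modify node state (exact flags may vary by version):

```shell
# Give the relay a recognizable telemetry name, then enable telemetry:
#   diagcfg telemetry name -n "relay-01.example.com"
#   diagcfg telemetry enable
# Restart algod afterwards so the setting takes effect.
note="telemetry is configured via diagcfg; restart algod afterwards"
echo "$note"
```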

Metrics

If you plan to constantly monitor your relay for signs of problems yourself, you are not expected to enable metrics. Metrics should be enabled by anyone who does not have professional operations staff monitoring and maintaining their hardware.  We recommend using such a service, but you are not obligated to do so.

Even if you enable metrics, you are responsible for ensuring the availability and reliability of your node at all times. Some monitoring guidance is provided below.

If you want or need to provide metrics for the foundation to help monitor:
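Metrics can likewise be toggled with diagcfg; a sketch, with the commands commented out since they modify node state (exact subcommand flags may vary by version):

```shell
# Enable the metrics endpoint:
#   diagcfg metric enable
# Check the current setting:
#   diagcfg metric status
note="metrics are toggled via diagcfg"
echo "$note"
```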

Testing your configuration:

Basic Testing

Before registering your relay, please do the following basic testing. All the tests must pass. If you have any questions, send an email to support@algorand.foundation.

From the server, run:

From another computer on macOS / Linux, run:
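A hedged reconstruction of typical pre-registration checks; relay.example.com:4160 is a placeholder for your relay's DNS name and port, and the commands are commented out since they require a live relay:

```shell
# From the server:
#   goal node status -d "$DATADIR"   # the last committed round should be advancing
#   ss -tln | grep ':4160'           # algod should be listening on the relay port
# From another computer (macOS / Linux):
#   nc -z -v -w 5 relay.example.com 4160   # TCP reachability through the firewall
note="all checks must pass before registering"
echo "$note"
```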

Registration

Before registering, please do the basic testing above.

Register with the Algorand Foundation using the customized URL you received by email to get your relay listed. If you do not have a customized URL for registration, you may send an email to support@algorand.foundation. Be prepared to provide the following details:

Contact - Email Address

On behalf of - Who is the investor / provider

Instance Resources - VM?  CPU/RAM/SSD/WAN

DNS / IP name and port - DNS or IP (prefer DNS)

Telemetry GUID:Host - Can be discovered through telemetry if necessary

Monitoring and Updating

Optional Configuration with algoh

algoh is a tool shipped alongside algod that hosts the algod process for additional diagnostic benefits. Specifically, it captures details of any runtime errors that cause algod to exit abruptly; if this happens, algoh sends the details in a telemetry event.  It also monitors the algod status using the REST API, and if it detects a stall (> 120 seconds without progress) it captures internal algod state and sends the details to telemetry (and captures logs and uploads them).

algoh can be used by running it just as you would run algod, or you can specify `-H` when using goal node start. You're free to come up with whatever process works best for you if you choose to use algoh. We would appreciate it if you do, but realize that it complicates the monitoring and hosting process: algoh is not prepared to run as a service, so you'll want to monitor both the algoh and algod processes, and in case of problems kill them both and restart algoh. We can provide more guidance if you have questions.

Updating

We are hoping to eventually transition everyone over to using DEB and RPM packages for installation and upgrades. For now, we will continue to offer releases as tarballs available for manual or automatic download and installation.  We will provide a public communication channel (possibly Slack) over which we will communicate details about pending releases, and will use this channel to coordinate application of the update to reduce downtime.  We will also use the forums to communicate.  

Monitoring Guidance

You should monitor the normal system resources as you would for any production server; you should also monitor the specific Algorand processes in use (algod, kmd, and possibly algoh and node_exporter). You can also use `goal`, the REST API, or your own tools to process the node.log file, which contains a wealth of information and insight into the internal workings of algod. You can control the log verbosity by modifying the "BaseLoggerDebugLevel" value in config.json. The default is 4 (Info); level 5 (Debug) is extremely verbose and can impact performance, so it is not recommended.
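For example, a config.json fragment pinning the log verbosity (the demo writes a file in the current directory; in practice, merge the key into your existing <datadir>/config.json):

```shell
# Demo: config.json fragment setting the log verbosity to Info (level 4).
# Merge the key into your existing <datadir>/config.json rather than
# overwriting the whole file.
cat > config.json <<'EOF'
{
  "BaseLoggerDebugLevel": 4
}
EOF
grep '"BaseLoggerDebugLevel"' config.json
```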

Monitoring - What to Watch For

Alert Levels: 1 = Investigate soon; 2 = Investigate Immediately

Network Stall

Track time-per-block

If time > 15 seconds, alert level 1

If time > 1 minute, alert level 2

If during an update, ignore

If rolling average of last 10 blocks > 10 seconds, alert level 2
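The rolling-average rule can be sketched in a few lines of shell; the block times below are fabricated for illustration (in practice, derive them from node.log or the REST API):

```shell
# Hypothetical durations (seconds) of the last 10 blocks.
times="4 5 6 5 4 5 30 5 4 6"
# Compute the rolling average with awk.
avg=$(echo "$times" | tr ' ' '\n' | awk '{s+=$1; n++} END {printf "%.1f", s/n}')
echo "rolling average: ${avg}s"
# Alert level 2 if the average of the last 10 blocks exceeds 10 seconds.
alert=$(awk -v a="$avg" 'BEGIN { if (a > 10) print "alert level 2"; else print "ok" }')
echo "$alert"
```

Note that a single slow block (30s here) trips the per-block level-1 threshold without necessarily tripping the rolling-average rule.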

High CPU usage on node

Alert level 1 = 70%

Alert level 2 = 90%

Low Disk space on node

Alert level 1 = 70% used

Alert level 2 = 90% used
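A sketch of the disk-space check; the usage figure is hard-coded for illustration (in practice, take it from df on the filesystem holding <datadir>):

```shell
# Hypothetical used-space percentage; in practice:
#   used=$(df -P /path/to/datadir | awk 'NR==2 {gsub("%",""); print $5}')
used=82
# Map usage against the alert thresholds above.
if [ "$used" -ge 90 ]; then level="alert level 2"
elif [ "$used" -ge 70 ]; then level="alert level 1"
else level="ok"
fi
echo "disk ${used}% used -> $level"
```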

Algod process problems (algod exits and doesn't restart, or algod keeps restarting)

Alert level 2 (1 if Relay)

Panic / algoh-ErrorOutput telemetry events (Algorand monitors telemetry)

Alert level 1

Algoh-DeadManTriggered telemetry event (Algorand monitors telemetry)

Alert level 1

Error telemetry events (Algorand monitors telemetry)

Alert level 1

Heartbeat telemetry events stopped (Algorand monitors telemetry)

Alert level 1

After updates, ensure correct (latest) software version is installed

Active Attacks - learn the boundaries

Excessive network connections

Excessive network traffic?

Insufficient network connections

Excessive connection churn

High Blockchain Load (maybe not malicious)

New Account spam

Large blocks / many transactions

Excessive rejected transactions

 

DNS/SRV Problems

 

Diagnosis - failure modes: stalls, CPU overload, network traffic flooding

Each node is an isolated island of failure whose loss shouldn't affect the network

 

Network Stall

Identify node(s) not participating that should be

Check dashboards

Check telemetry

Use `carpenter` on any instance to see vote counts

Check telemetry for errors / panics, algoh ErrorOutput events

 

High CPU Usage

External Cause (attack?), high network too?

Check other nodes/relays - also high?

No External Cause (no attack - possibly software bug)

 

Low Disk Space

Expected or unexpected?

 

Investigating a single node

Is algod running?

Is algod repeatedly restarting?

Check CPU usage (e.g. `top`).

Ideally this is captured automatically so it doesn't need to be logged by hand (most APM tools will monitor this).

Capture logs (`goal logging send`).

Correct any obvious issues (kill processes, free disk space).

Restart algod, monitor (if it seems to be running, use `carpenter` to ensure it catches up / starts voting).

If it doesn't recover, RESET NODE:

Stop algod, delete <ledgerdir>/*.sqlite, and start algod again.  Do NOT do this on a participation node without first moving the *.partkey files out (move them back after restarting algod).
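As a shell sketch; the commands are commented out since they are destructive, and the ledger path is an assumption (on MainNet, <ledgerdir> is typically <datadir>/mainnet-v1.0):

```shell
# RESET NODE -- destructive; an archival relay will re-fetch the full ledger.
#   goal node stop -d "$DATADIR"
#   rm "$DATADIR"/mainnet-v1.0/*.sqlite   # <ledgerdir> location is an assumption
#   goal node start -d "$DATADIR"
# On a participation node, move the *.partkey files out of <ledgerdir> first,
# and move them back after algod restarts.
note="reset sketch above -- verify paths before running"
echo "$note"
```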

Post-Mortem

Check telemetry for clues.

Check logs.

 

Investigating concurrent failures

If it appears multiple nodes are not functioning properly, it's likely a software bug - either a fork or just corrupted state.

Figure out if all nodes are affected, or just some.

Check for malicious behavior (equivocating / bogus votes).

Capture logs from affected and unaffected nodes.

Restart a single affected node and see if the problem resolves.

If so, restart all affected nodes.

If not, RESET NODE and see if it recovers.

Apply correction to all affected nodes.

If normal operation doesn't resume and the network remains stalled, a software patch may be needed.

Communication

For the health of the network we expect information to flow freely between Relay runners, node runners, and Algorand.  If you experience any issues that are not isolated to your environment, please communicate with Algorand so we can be made aware and keep track of issues that arise. We will assist as necessary in diagnosing issues for the sake of keeping everything running smoothly. Likewise, Algorand will communicate incidents as they arise, in case they impact your environment or show up in your monitoring as well.


Change Log

Version | Date | Who | What
1.0 | 2019 06 11 | David Shoots | Creation
1.1 | 2019 06 13 |  | Update of installation instructions
1.2 | 2019 10 01 |  | Add RPM/DEB, add "Basic Testing" section, and minor changes