Algorand MainNet Relay Guidance
Version 1.2
We expect to have 3 supported distribution channels for releases: Debian (DEB) packages, RPM packages, and a tarball (installed with update.sh).
The packages will be hosted on a public repo to enable APT and YUM installation and upgrade. The tarball will continue to be hosted on S3 for manual / automated upgrades.
Whichever installation and upgrade mechanism you select, it’s important that you have a bullet-proof process to ensure minimum downtime when upgrading. We recommend having a test environment to validate upgrades using your own process; we will generally make the upgrades available early for validation prior to release.
The software will continue to have 4 main components:
For relays, which are always Archival, the ledger folder is the only folder with unbounded growth; it will continue to grow as the blockchain grows, until the platform evolves to shard or otherwise distribute the storage. The estimated maximum growth of the ledger (ledger.block.sqlite) itself is ~6TB/year at peak transaction volume. The account database can also grow without bound if malicious users decide to spam the network with unique account addresses. We estimate that the worst-case account database (ledger.tracker.sqlite) growth is ~3TB/year; absent malicious attacks, something on the order of 10GB-100GB is more reasonable. The ledger directory should be on a fast storage medium, as the database is in the critical path for block-to-block processing, though we have optimized to avoid completely blocking progress on database writes.
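The growth figures above can be turned into a rough capacity estimate. The following is a minimal sketch only: the function names are hypothetical, the rates are the peak and worst-case estimates quoted above, and real growth depends on actual transaction volume.

```python
# Rough disk-capacity planning from the growth estimates quoted in this guide.
# Function names are illustrative; the rates are upper bounds, not forecasts.

LEDGER_TB_PER_YEAR = 6.0         # ledger.block.sqlite at peak transaction volume
TRACKER_TB_PER_YEAR_WORST = 3.0  # ledger.tracker.sqlite under account-spam attack


def growth_rate_tb_per_year(include_worst_case_accounts: bool = True) -> float:
    """Combined upper-bound growth rate of the ledger folder in TB/year."""
    rate = LEDGER_TB_PER_YEAR
    if include_worst_case_accounts:
        rate += TRACKER_TB_PER_YEAR_WORST
    return rate


def projected_usage_tb(current_tb: float, years: float,
                       include_worst_case_accounts: bool = True) -> float:
    """Upper-bound ledger folder size after the given number of years."""
    return current_tb + growth_rate_tb_per_year(include_worst_case_accounts) * years


def months_until_full(capacity_tb: float, current_tb: float,
                      include_worst_case_accounts: bool = True) -> float:
    """Months of headroom left if growth runs at the upper-bound rate."""
    per_month = growth_rate_tb_per_year(include_worst_case_accounts) / 12.0
    return (capacity_tb - current_tb) / per_month
```

For example, `months_until_full(10, 1)` gives the headroom of a 10TB volume that already holds 1TB, assuming both databases grow at their maximum estimated rates.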
Three options:
Debian package
Follow “install using the Debian package” on https://developer.algorand.org/docs/installing-ubuntu
RPM package
Follow “install using the RPM package” on https://developer.algorand.org/docs/installing-other-linux-distros-staging
Update.sh / tarball
Using an unprivileged user (do not run a node as root):
update.sh -i -c stable -p <bindir> -d <datadir> -n
Replace <bindir> and <datadir> with two folders of your choice that the current user can write to.
You then need to set up automatic updates: https://developer.algorand.org/docs/configure-auto-update
The remainder of this guide assumes that your binary dir <bindir> is in your PATH.
Enable as a service:
You should configure algod to run as a service. If you installed the deb or rpm package, this was done for you. Otherwise refer to the guidance here for details (adjust accordingly).
Unregistering your relay from TestNet (very rare case, skip if you did a fresh install)
If your relay node was registered on TestNet, you must first unregister it by sending an email to testnet-team@algorand.com
Switching from TestNet to MainNet (very rare case, skip if you did a fresh install)
If you installed your relay node before and are connected to TestNet, you need to switch to MainNet using the following instructions:
Incoming Connections
Enable incoming connections (we are using port 4160 by convention on MainNet)
Testing your configuration:
Telemetry
Enable telemetry - we require all Relays to have telemetry enabled.
Metrics
If you are planning to constantly monitor your relay for signs of problems, you are not expected to enable metrics. Metrics should be enabled by anyone who does not have a professional operations staff monitoring and maintaining their hardware. We recommend you use such a service, but you are not obligated to do so.
Even if you enable metrics, you are responsible for ensuring the availability and reliability of your node at all times. Some monitoring guidance is provided below.
If you want or need to provide metrics for the foundation to help monitor:
Testing your configuration:
Before registering your relay, please do the following basic testing. All the tests must pass. If you have any questions, send an email to support@algorand.foundation.
From the server, run:
From another computer on macOS / Linux, run:
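One way to check reachability from a second machine is a plain TCP connection attempt. The sketch below is illustrative only: the function name `can_connect` and the example hostname are assumptions, and port 4160 is the MainNet convention mentioned earlier. A successful connection shows that the port accepts connections, not that the relay is healthy.

```python
import socket


def can_connect(host: str, port: int = 4160, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and tries each returned address.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


if __name__ == "__main__":
    # "relay.example.com" is a placeholder; substitute your relay's DNS name.
    print(can_connect("relay.example.com", 4160))
```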
Before registering, please do the basic testing above.
Register with the Algorand Foundation using the customized URL you received by email to get your relay listed. If you do not have a customized URL for registration, you may send an email to support@algorand.foundation. Be prepared to provide the following details:
Contact | Email Address |
On behalf of | Who's the investor / provider |
Instance Resources | VM? CPU/RAM/SSD/WAN |
DNS / IP name and port | DNS or IP (prefer DNS) |
Telemetry GUID:Host | Can discover through telemetry if necessary |
algoh is a tool shipped alongside algod that hosts the algod process to provide additional diagnostics. Specifically, it captures details of any runtime errors that cause algod to exit abruptly; if this happens, the details are sent in a telemetry event. It also monitors the algod status using the REST API, and if it detects a stall (> 120 seconds without progress) it captures internal algod application state, sends the details to telemetry, and captures and uploads logs.
algoh can be run just as you would run algod, or you can specify `-H` when using goal node start to start algod. If you choose to use algoh, you’re free to come up with whatever process works best for you. We would appreciate it if you do, but we realize it complicates the monitoring and hosting process. The complication comes from algoh not being prepared to run as a service; you’ll want to monitor the algoh and algod processes and, in case of problems, kill them both and restart algoh. We can provide more guidance if you have questions.
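The stall rule described above (more than 120 seconds without progress) can be sketched as follows for operators who want a similar check in their own monitoring. This is not algoh's implementation: the class name is hypothetical, and the caller is assumed to supply the latest round number obtained from the node (for example via `goal` or the REST API) together with a timestamp.

```python
from dataclasses import dataclass
from typing import Optional

STALL_SECONDS = 120.0  # threshold quoted in this guide


@dataclass
class StallDetector:
    """Tracks the last round seen and when it last advanced."""
    stall_seconds: float = STALL_SECONDS
    last_round: Optional[int] = None
    last_progress_at: Optional[float] = None

    def observe(self, round_number: int, now: float) -> bool:
        """Record an observation; return True if the node looks stalled."""
        if self.last_round is None or round_number > self.last_round:
            # First observation, or the round advanced: reset the clock.
            self.last_round = round_number
            self.last_progress_at = now
            return False
        # No progress: stalled once the silence exceeds the threshold.
        return (now - self.last_progress_at) > self.stall_seconds
```

A poller would call `observe(round_number, time.monotonic())` every few seconds and raise an alert the first time it returns True.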
We are hoping to eventually transition everyone over to using DEB and RPM packages for installation and upgrades. For now, we will continue to offer releases as tarballs available for manual or automatic download and installation. We will provide a public communication channel (possibly Slack) over which we will communicate details about pending releases, and will use this channel to coordinate application of the update to reduce downtime. We will also use the forums to communicate.
You should be monitoring the normal system resources as you would for any production server; you should also monitor the specific Algorand processes in use (algod, kmd, and possibly algoh and node_exporter). You can also use `goal`, the REST API, or write your own tools to process the node.log file. The node.log file contains a wealth of information and insight into the internal workings of algod. You can also control the log verbosity by modifying the config.json’s "BaseLoggerDebugLevel" value. The default is 4 (Info). Level 5 is Debug and extremely verbose (it can impact performance, so it is not recommended).
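Changing the log verbosity amounts to editing one key in config.json. The following is a minimal sketch: the helper name `set_log_level` is hypothetical, the file path is supplied by the caller, and any other keys already present in the file are preserved.

```python
import json
from pathlib import Path

INFO_LEVEL = 4   # default, per this guide
DEBUG_LEVEL = 5  # extremely verbose; can impact performance


def set_log_level(config_path: str, level: int) -> dict:
    """Set BaseLoggerDebugLevel in config.json, keeping all other settings."""
    path = Path(config_path)
    # Start from the existing settings if the file is already there.
    config = json.loads(path.read_text()) if path.exists() else {}
    config["BaseLoggerDebugLevel"] = level
    path.write_text(json.dumps(config, indent=4) + "\n")
    return config
```

For example, `set_log_level("/path/to/datadir/config.json", INFO_LEVEL)` restores the default verbosity; the path shown is a placeholder.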
Alert Levels: 1 = Investigate soon; 2 = Investigate Immediately
Network Stall
Track time-per-block
If time > 15 seconds, alert level 1
If time > 1 minute, alert level 2
If during an update, ignore
If rolling average of last 10 blocks > 10 seconds, alert level 2
High CPU usage on node
Alert level 1 = 70%
Alert level 2 = 90%
Low Disk space on node
Alert level 1 = 70%
Alert level 2 = 90%
Algod process problems (algod exits and doesn't restart, or algod keeps restarting)
Alert level 2 (1 if Relay)
Panic / algoh-ErrorOutput telemetry events (Algorand monitors telemetry)
Alert level 1
Algoh-DeadManTriggered telemetry event (Algorand monitors telemetry)
Alert level 1
Error telemetry events (Algorand monitors telemetry)
Alert level 1
Heartbeat telemetry events stopped (Algorand monitors telemetry)
Alert level 1
After updates, ensure correct (latest) software version is installed
Active Attacks - learn boundaries
Excessive network connections
Excessive network traffic?
Insufficient network connections
Excessive connection churn
High Blockchain Load (maybe not malicious)
New Account spam
Large blocks / many transactions
Excessive rejected transactions
DNS/SRV Problems
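The time-per-block, CPU, and disk thresholds listed under Alert Levels can be sketched as a pair of small functions. The names are hypothetical, block times are assumed to be supplied in seconds with the most recent last, and the disk percentage is assumed to mean percent of space used. A return value of 0 means no alert.

```python
from typing import Sequence


def block_alert_level(block_times: Sequence[float],
                      during_update: bool = False) -> int:
    """Alert level (0, 1 or 2) from recent per-block times in seconds."""
    if during_update or not block_times:
        return 0  # slow blocks during an update are ignored
    latest = block_times[-1]
    level = 0
    if latest > 60:
        level = 2
    elif latest > 15:
        level = 1
    # Rolling average over the last 10 blocks, once 10 samples exist.
    window = list(block_times[-10:])
    if len(window) == 10 and sum(window) / 10 > 10:
        level = 2
    return level


def usage_alert_level(percent_used: float) -> int:
    """Alert level for CPU or disk usage expressed as a percentage."""
    if percent_used >= 90:
        return 2
    if percent_used >= 70:
        return 1
    return 0
```

Keeping the thresholds in one place makes it easier to adjust them once normal behavior for a given relay is known.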
Each node is an isolated island of failure whose loss shouldn’t affect network
Network Stall
Identify node(s) not participating that should be
Check dashboards
Check telemetry
Use `carpenter` on any instance to see vote counts
Check telemetry for errors / panics, algoh ErrorOutput events
High CPU Usage
External Cause (attack?), high network too?
Check other nodes/relays - also high?
No External Cause (no attack - possibly software bug)
Low Disk Space
Expected or unexpected?
Investigating a single node
Is algod running?
Is algod repeatedly restarting?
Check CPU usage (e.g. `top`).
We should capture this automatically if we can so we don't have to log it (most APM tools will monitor this).
Capture logs (`goal logging send`).
Correct any obvious issues (kill processes, free disk space).
Restart algod, monitor (if it seems to be running, use `carpenter` to ensure it catches up / starts voting).
If it doesn't recover, RESET NODE:
Stop algod, delete <ledgerdir>/*.sqlite, then start algod. Do NOT do this on a participation node without first moving the *.partkey files out (move them back after restarting algod).
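Because the reset is destructive, it can help to list what would be touched before doing anything. The sketch below is a dry run only and deletes nothing: the function name `plan_reset` is hypothetical, the ledger directory is supplied by the caller, and algod is assumed to be stopped before any real deletion takes place.

```python
from pathlib import Path
from typing import Dict, List


def plan_reset(ledger_dir: str) -> Dict[str, List[str]]:
    """List files a RESET NODE would affect, without changing anything."""
    directory = Path(ledger_dir)
    return {
        # Participation keys to move out before the reset and back afterwards.
        "move_out_first": sorted(p.name for p in directory.glob("*.partkey")),
        # Database files that the reset would delete.
        "delete": sorted(p.name for p in directory.glob("*.sqlite")),
    }
```

Reviewing the two lists first makes it harder to delete the databases on a participation node while the key files are still in place.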
Post-Mortem
Check telemetry for clues.
Check logs.
Investigating concurrent failures
If it appears multiple nodes are not functioning properly, it's likely a software bug - either a fork or just corrupted state.
Figure out if all nodes are affected, or just some.
Check for malicious behavior (equivocating / bogus votes).
Capture logs from affected and unaffected nodes.
Restart a single affected node and see if the problem resolves.
If so, restart all affected nodes.
If not, RESET NODE and see if it recovers.
Apply correction to all affected nodes.
If normal operation doesn't resume (the network remains stalled), a software patch may be needed.
For the health of the network we expect information to flow freely between Relay runners, node runners, and Algorand. If you experience any issues that are not isolated to your environment, please communicate with Algorand so we can be made aware and keep track of issues that arise. We will assist as necessary in diagnosing issues for the sake of keeping everything running smoothly. Likewise, Algorand will communicate incidents as they arise, in case they impact your environment or show up in your monitoring as well.
Version | Date | Who | What |
1.0 | 2019 06 11 | David Shoots | Creation |
1.1 | 2019 06 13 | | Update of installation instructions |
1.2 | 2019 10 01 | | Add RPM/DEB, add “Basic Testing” section, and minor changes |