🔗Behind the firewall
The Edge Network is made up of three device types: Stargates, Gateways, and Hosts. Hosts, the last of these, operate on such a variety of networks, with such varied performance and restrictions, that we have to be able to receive device measurements from them as and when the devices are online and able to report, regardless of limitations such as firewalls or patchy connectivity.
All Hosts run a background process called the Telemetry Client, which sidesteps these restrictions by broadcasting to the network’s telemetry receivers over HTTPS. Both HTTP on port 80 and HTTPS on port 443 are generally open in firewall egress rule tables, which is the basis on which the network runs its gRPC connections between Hosts and Gateways.
🔗Collecting the metrics
The Telemetry Client collects basic system measurements such as load averages, memory usage, disk space utilisation, and network activity. It also collects some other, less interesting statistics such as system fork count, active process count, and established TCP connection count.
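To make the idea concrete, here is a minimal sketch of how a client might gather a couple of these measurements using only the Python standard library. It is not the Telemetry Client's actual code: the function name `collect_sample` and the shape of the returned dictionary are assumptions made for illustration, and it only covers load averages and disk usage on a Unix-like host.

```python
import os
import shutil
import time


def collect_sample(path="/"):
    """Gather one snapshot of basic system measurements (hypothetical layout)."""
    # 1, 5 and 15 minute load averages; available on Unix-like systems only.
    load1, load5, load15 = os.getloadavg()
    # Total, used and free bytes for the filesystem containing `path`.
    disk = shutil.disk_usage(path)
    return {
        "timestamp": int(time.time()),
        "load": {"1m": load1, "5m": load5, "15m": load15},
        "disk": {"total": disk.total, "used": disk.used, "free": disk.free},
    }
```

Memory usage, network activity and the process and connection counts would need platform-specific sources (for example `/proc` on Linux), which are left out here to keep the sketch short.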
This data is then packaged up and sent securely to the network’s telemetry receiver servers, where it is verified and stored alongside the metrics of all other devices. These servers sit outside of the main network infrastructure, which is critical to remaining accessible during possible outages.
If a device is online but not connected to the Internet, then no data is recorded for that device. If a device is online but has a patchy connection, only the successfully broadcast metrics are stored for that device. This allows us to monitor both the network and individual devices accurately, showing a history of network condition for every second of every day.
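The behaviour described above, where failed broadcasts simply leave a gap, could be sketched roughly as follows. This is an illustration under assumptions rather than the real client: the receiver URL is a placeholder, and `https_send` and `report` are hypothetical names.

```python
import json
import urllib.request

# Hypothetical endpoint, for illustration only.
RECEIVER_URL = "https://telemetry.example.invalid/report"


def https_send(sample, url=RECEIVER_URL, timeout=5):
    """POST one sample as JSON over HTTPS; raises OSError on network failure."""
    request = urllib.request.Request(
        url,
        data=json.dumps(sample).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        response.read()


def report(samples, send=https_send):
    """Try to send each sample once and return how many were delivered."""
    delivered = 0
    for sample in samples:
        try:
            send(sample)
        except OSError:
            # Offline or patchy link: this sample is lost, leaving a gap.
            continue
        delivered += 1
    return delivered
```

Passing the `send` function in as a parameter keeps the drop-on-failure logic separate from the transport, so it can be exercised without a network connection.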
🔗Processing the data
Once the data is received by the network’s telemetry receiver servers, it is verified to ensure that it is from the correct device and has not been modified along the way. This is possible because each payload is signed with a per-device session secret, and the receiver checks that signature before accepting the data.
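One common way to sign a payload with a shared secret is an HMAC. The sketch below assumes HMAC-SHA256 over a canonical JSON encoding; the post does not say which scheme the network actually uses, so the algorithm, the envelope layout, and the names `sign_payload` and `verify_payload` are all assumptions.

```python
import hashlib
import hmac
import json


def sign_payload(payload, secret):
    """Serialise `payload` and attach an HMAC-SHA256 signature (assumed scheme)."""
    # Sorted keys and fixed separators give a stable byte representation.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    signature = hmac.new(secret, body.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"body": body, "signature": signature}


def verify_payload(envelope, secret):
    """Return True only if the signature matches the body for this secret."""
    expected = hmac.new(
        secret, envelope["body"].encode("utf-8"), hashlib.sha256
    ).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, envelope["signature"])
```

A receiver would look up the secret for the device that claims to have sent the payload and reject the payload if `verify_payload` returns `False`, which covers both a wrong sender and a modified body.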
The data is then stored in Prometheus, from which it is fed into a number of other services. One of these is Grafana, which we use for our internal interfaces and monitoring. At any one point we are able to see the total load and average load of the network, traffic and bandwidth statistics, and much more besides.
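As a rough illustration of what "total load and average load of the network" means, the following sketch aggregates the latest load value per device. In practice this kind of aggregation would be done by queries against Prometheus; the function `network_load` and its input format are hypothetical.

```python
from statistics import fmean


def network_load(latest_load_by_device):
    """Summarise the latest load value reported by each device."""
    loads = list(latest_load_by_device.values())
    if not loads:
        # No devices reporting: avoid calling fmean on an empty list.
        return {"devices": 0, "total": 0.0, "average": 0.0}
    return {"devices": len(loads), "total": sum(loads), "average": fmean(loads)}
```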
The screencap above shows a few metrics for a small portion of the testnet collected over a 24-hour period, with data collected every 5 seconds.
🔗What’s next?
With so much data, the possibilities for how to use and visualise it are near endless. As you may have seen, we recently launched the first iteration of the Edge Explorer, which shows you some cool statistics such as the number of online Stargates, Gateways, and Hosts, the size of the edge cache and edge storage, as well as a list of devices.
By utilising the data we collect from the telemetry service, we’re going to be able to display, in real time, the cumulative network load, memory and storage utilisation.
In addition to this we’ve already seen some great benefits in detecting and debugging issues on the testnet. The overview that the telemetry service layer provides us really is invaluable.
Thanks for reading! And stay tuned, as I’ll be writing much more about this stuff.