Important to the network's growth is its ability to accept an increasing number of Hosts without compromising the speed at which vital coordination components calculate and synchronise topology.
Initial builds of the network manifest distributor operated on geographically spread Consul servers, adjacent to the Stargate services. Whilst decentralised in the sense that they had no single point of failure, these nodes required consensus on all operations, including health checks, key/value updates and service metadata. Put simply, if a Host were to tell its nearest Consul server that it was up and healthy, the same data would need to be fully propagated before the write was considered successful. Operationally, this latency doesn’t scale well for our type of network, and we started to see some failing health checks (even whilst telemetry painted a far more serene landscape).
One of the largest single changes in the release was a shift away from globally propagated service data in favour of a multi-datacenter approach, more akin to your typical VPC. The services continue to register with their local Consul server, operating alongside the nearest Stargate, but that’s where the synchronisation stops. Consul offers a little flexibility in the way it connects to its peers, so whilst a multi-datacenter setup does not share service data globally, it does permit access by proxy, something the network relies on for rare but vital occasions such as global load balancing during peak traffic spikes.
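As a rough illustration of access by proxy, Consul's HTTP API accepts a `dc` query parameter that asks the local agent to forward a request to another datacenter. The sketch below only builds the URLs. The agent address, the `gateway` service name and the `eu-west` datacenter name are assumptions made for the example, not values taken from the network.

```python
from urllib.parse import urlencode

# Assumed address of the local Consul agent (8500 is Consul's default HTTP port).
CONSUL_ADDR = "http://127.0.0.1:8500"


def health_url(service, dc=None, passing_only=True):
    """Build a Consul health query URL, optionally targeting another datacenter."""
    params = {}
    if passing_only:
        # Only return instances whose health checks are passing.
        params["passing"] = "true"
    if dc is not None:
        # Ask the local agent to forward the query to the named datacenter.
        params["dc"] = dc
    url = f"{CONSUL_ADDR}/v1/health/service/{service}"
    query = urlencode(params)
    return f"{url}?{query}" if query else url


# Everyday case: only the local datacenter is consulted.
local_query = health_url("gateway")

# Rare case: reach into a remote datacenter (hypothetical name).
remote_query = health_url("gateway", dc="eu-west")
```

Without the `dc` parameter the query is answered from the local datacenter alone, which is what keeps routine lookups fast.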
When debugging the latency experienced by the single-datacenter configuration, the team discovered a number of performance improvements to the way service data is stored. Whilst the refactors were complex, the result can be simplified to two key areas:
The first health checks consisted of periodic writes to a key/value directory on a per-device basis. A pruning process existed to remove services that exceeded the maximum TTL.
The latest iteration of the health check process uses Consul's built-in health check methods, including gRPC standard health endpoints for Stargates and Gateways, and TTL checks for Hosts. This meant that we were able to completely remove the key/value data, which significantly reduced latency because of the way Consul propagates service health.
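To make the two check types concrete, here is a minimal sketch of the service definitions that could be sent to the Consul agent's `/v1/agent/service/register` endpoint. The `Check` fields (`TTL`, `GRPC`, `Interval`, `DeregisterCriticalServiceAfter`) are standard Consul fields. The service names, IDs, addresses and durations are assumptions for illustration and may differ from what the network actually uses.

```python
import json


def host_registration(service_id, address, port, ttl="30s"):
    """Service definition for a Host, with a TTL check the Host must keep refreshing."""
    return {
        "ID": service_id,
        "Name": "host",  # assumed service name
        "Address": address,
        "Port": port,
        "Check": {
            "TTL": ttl,
            # Consul deregisters the service itself once it has been critical
            # for this long, so no separate pruning process is needed.
            "DeregisterCriticalServiceAfter": "10m",
        },
    }


def stargate_registration(service_id, address, port, interval="10s"):
    """Service definition for a Stargate, probed via the gRPC health protocol."""
    return {
        "ID": service_id,
        "Name": "stargate",  # assumed service name
        "Address": address,
        "Port": port,
        "Check": {
            "GRPC": f"{address}:{port}",
            "Interval": interval,
        },
    }


def ttl_heartbeat_path(service_id):
    """Agent API path a Host would PUT to in order to mark its TTL check as passing.

    A service registered with a single unnamed check gets the check ID
    "service:<service id>" by default.
    """
    return f"/v1/agent/check/pass/service:{service_id}"


# Example payload as it would be serialised for the agent API.
payload = json.dumps(host_registration("host-1", "10.0.0.5", 8080))
```

The difference from the earlier design is who does the work: a Host sends a small heartbeat to its local agent, and Consul tracks expiry and propagates the resulting health state.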
Until recently, all devices wrote information such as their current connections and build digest to Consul key/value. Whilst it appeared to be a largely efficient method, benchmarking showed us that storing this data within the core Service metadata reduced the number of Consul requests made by other services.
If, for example, a Host wanted to find the closest operational Gateway it would first need to make a call for all services of type ‘Gateway’, individually requesting the KV metadata of each Gateway before an informed preference could be made. By moving this data to the Service itself, what could be a query of 1+n became simply a query of 1.
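The difference in request counts can be sketched with an in-memory stand-in for Consul that counts every call. Everything here is hypothetical: the `FakeConsul` class, the `connections` metadata key, the KV key layout and the "fewest connections" preference are assumptions chosen to illustrate the 1+n versus 1 pattern, not the network's actual selection logic.

```python
class FakeConsul:
    """In-memory stand-in for Consul that counts requests (illustration only)."""

    def __init__(self, services, kv):
        self.services = services
        self.kv = kv
        self.requests = 0

    def list_services(self, name):
        self.requests += 1
        return [s for s in self.services if s["Name"] == name]

    def kv_get(self, key):
        self.requests += 1
        return self.kv[key]


def make_consul():
    """Three Gateways whose connection counts exist both in KV and in service Meta."""
    counts = {"gw-a": "12", "gw-b": "3", "gw-c": "7"}
    services = [
        {"ID": gid, "Name": "gateway", "Meta": {"connections": n}}
        for gid, n in counts.items()
    ]
    kv = {f"gateway/{gid}/connections": n for gid, n in counts.items()}
    return FakeConsul(services, kv)


def pick_gateway_via_kv(consul):
    """Old approach: one service query, then one KV read per Gateway (1+n)."""
    gateways = consul.list_services("gateway")
    return min(
        gateways,
        key=lambda g: int(consul.kv_get(f"gateway/{g['ID']}/connections")),
    )["ID"]


def pick_gateway_via_meta(consul):
    """New approach: the metadata arrives with the service query (1)."""
    gateways = consul.list_services("gateway")
    return min(gateways, key=lambda g: int(g["Meta"]["connections"]))["ID"]


old = make_consul()
new = make_consul()
old_choice = pick_gateway_via_kv(old)    # old.requests is now 4 (1 + 3 Gateways)
new_choice = pick_gateway_via_meta(new)  # new.requests is now 1
```

Both functions pick the same Gateway. Only the number of round trips changes, and the gap widens with every Gateway added to the network.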
ACL tokens were initially created to offer distinct permissions to the three core services, with the Host application having the lowest level of trust and the lowest staking. The Host token had read-only access to non-sensitive information and a small amount of write access to its own part of the network manifest, making it almost impossible to use for evil.
A single ACL token for all Hosts does come with downsides, which we discovered when migrating to multi-DC on testnet. Firstly, Consul doesn’t allow ACL migration, which poses a risk when embedding a single token into the Host. Secondly, a shared token means that access for a single machine cannot be revoked.
The next release will include a refactored method for retrieving ACL tokens to use with Consul. This required the introduction of Vault, an accompanying service that extends Consul ACL with policy templates.
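One way Vault can issue a separate token per Host is its Consul secrets engine, which generates a Consul ACL token on request and ties it to a lease that can be revoked individually. Whether the release uses this exact mechanism is an assumption. The sketch below also assumes the engine is mounted at the default `consul/` path, and the `host` role name, Vault address and token values are placeholders. It builds the request and parses a response, and does not contact a server.

```python
import json
import urllib.request


def consul_creds_request(vault_addr, vault_token, role):
    """Build (but do not send) a request asking Vault for a fresh Consul ACL token."""
    return urllib.request.Request(
        f"{vault_addr}/v1/consul/creds/{role}",
        headers={"X-Vault-Token": vault_token},
        method="GET",
    )


def parse_consul_creds(body):
    """Extract the Consul token and its lease details from Vault's JSON response."""
    payload = json.loads(body)
    return {
        "token": payload["data"]["token"],
        # Revoking this lease invalidates just this one token.
        "lease_id": payload["lease_id"],
        "lease_duration": payload["lease_duration"],
    }


# Placeholder values for illustration only.
request = consul_creds_request("https://vault.example:8200", "example-vault-token", "host")

# A response body shaped like Vault's, with made-up values.
sample_body = json.dumps({
    "lease_id": "consul/creds/host/example-lease",
    "lease_duration": 3600,
    "data": {"token": "example-consul-token"},
})
creds = parse_consul_creds(sample_body)
```

Because each Host would hold its own short-lived token, revoking one machine's access no longer means rotating a token shared by the whole fleet.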
Founding Node status
As part of the network upgrade we updated Founding Nodes to the latest build of Host.
The new telemetry service in the platform also requires an update, but as a result of the permissions system in the previous build of Host, this cannot be performed automatically. This means that nodes will be reporting as “Status unavailable” within your dashboards.
Note that earnings and payouts for July are not affected, and where telemetry data is unavailable, usage approximations from Gateways will be applied.
What do you need to do?
We will need to either update your node’s telemetry system directly or send you an updated SD card to replace the one in your node.
Updating the node directly is the fastest route, but it does require a port forward to be set up on your router.
Please indicate your preference by completing the following form: https://forms.gle/Szm9jp3GSU4pVD6V8