Scale
Important in the networks growth is its ability to accept an increasing number of Hosts without compromising the speed in which vital coordination components calculate and synchronise topology.
Initial builds of the network manifest distributor operated on geographically spread Consul servers, adjacent to the Stargate services. Whilst decentralised in respect to their lack of a single point of failure, these nodes required consensus on all operations, including healthchecks, key/value updates and service metadata. Put simply, if a Host were to tell its nearest Consul server that it was up and healthy, the same data would need to be fully propagated before the write was considered successful. Operationally, this latency doesn't scale well for our type of network, and we started to see some failing health checks (even whilst telemetry painted a far more serene landscape).
One of the largest single changes in the release was a shift away from globally propagated service data in favour of a multi datacenter approach, more akin to your typical VPC. The services continue to register with their local Consul server, operating alongside the nearest Stargate, but that's where the syncronisation stops. Consul offers a little flexibility in the way it connects to its peers, so whilst a multi datacenter setup does not share service data globally it does permit access by proxy – something the network relies upon on rare but vital occasions, such as global load balancing during peak traffic spikes.
Performance
When debugging the latency experienced by the single data center configuration the team discovered a number of performance improvements to the way service data is stored. Whilst the refactors were complex, the result can be simplified to two key areas:
Health
The first health checks consisted of periodic writes to a key/value directory on a per-device basis. A pruning process existed to remove services that exceeded the maximum TTL.
The latest iteration of the health check process uses Consuls inbuilt healthcheck methods, including GRPC standard health endpoints for Stargates and Gateways, and TTL checks for Hosts. This meant that we were able to completely remove the key/value data which had a significant reduction in latency due to the way the Consul propagates service health.
Metadata
Until recently all devices wrote information such as their current connections and build digest to Consul key/value. Whilst it appeared to be largely an efficient method, benchmarking showed us that storing this data within the core Service metadata reduced the number of Consul requests by other services.
Updates
Last Updated:
September 2019