In a previous article we talked about our plans to move from ZooKeeper (the network manifest protocol originally in use in the network) to Consul, a more scalable and flexible network-wide manifest.
Architecturally speaking, the network manifest is relatively straightforward. All network services connect to Consul and, based on a distributed instruction set, decide which peers to interact with and which customer applications to run.
The move from ZooKeeper to Consul was predominantly driven by performance, but with an understanding that Consul opened up important opportunities that would be explored further at a later stage. I'll try to cover those now.
First approach: persist everything, worry later
We store the manifest in two places: Services, which is a catalogue of connected machines, and KV (_Key Value_), which is an index of objects structured like a filesystem.
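To make the split concrete, here is a minimal sketch of what a machine might send to each store. It only builds request bodies for Consul's HTTP API (`PUT /v1/agent/service/register` and `PUT /v1/kv/<key>`) and does not talk to a Consul agent. The function names, the example IDs and the `name/id/field` key layout are assumptions made for illustration, not the layout described in this article.

```python
import json


def service_registration(service_id, name, address, port):
    """Build a body for Consul's PUT /v1/agent/service/register endpoint."""
    return {"ID": service_id, "Name": name, "Address": address, "Port": port}


def kv_entry(service_name, service_id, field, value):
    """Build a (key, raw_value) pair for Consul's PUT /v1/kv/<key> endpoint.

    The key layout is hypothetical: it only shows how slash-separated
    keys make KV read like a filesystem.
    """
    key = "/".join([service_name, service_id, field])
    return key, json.dumps(value).encode("utf-8")


# Example: one Gateway registers itself and writes a metadata object.
registration = service_registration("gateway-1", "gateway", "203.0.113.10", 8080)
key, raw = kv_entry("gateway", "gateway-1", "metadata", {"region": "eu-west"})
```

Because keys are slash-separated, reading everything under a prefix such as `gateway/gateway-1/` behaves much like listing a directory.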
In the first iteration of our integration with Consul, each Host, Gateway or Stargate registered a Service on the network and wrote metadata to KV. We persisted this data - all of it - which meant that services all appeared to be online, even if they weren't, and their metadata was preserved indefinitely.
This posed a number of issues, including unbounded growth of the data footprint and the difficulty of distinguishing between connected and disconnected devices. To patch this problem we had all applications periodically write health data, including a last active timestamp, so that we could determine whether a machine was likely to still be connected. We then used this data to prune services that were no longer online. This was a fairly heavy-handed approach. It worked perfectly on testnet, but when we migrated it to mainnet we saw an impact on latency.
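The pruning step can be sketched as a pure function over the last active timestamps. This is an illustration under assumptions rather than the code we ran: the function name, the 30 second threshold and the service IDs are all hypothetical.

```python
from typing import Dict, List


def stale_services(last_active: Dict[str, float], now: float, max_age: float) -> List[str]:
    """Return the IDs whose last active timestamp is more than max_age seconds old."""
    return sorted(
        service_id
        for service_id, seen in last_active.items()
        if now - seen > max_age
    )


# Example: timestamps in seconds; anything silent for over 30 seconds is pruned.
last_active = {"host-1": 100.0, "gateway-1": 40.0, "stargate-1": 95.0}
to_prune = stale_services(last_active, now=110.0, max_age=30.0)
```

Every machine has to keep writing its timestamp and something has to keep scanning all of them, so the cost of this scheme grows with the size of the network.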
Those who have used Consul will know that it ships with a check component. After realising that the original approach had scalability issues, we moved to attaching health checks to services. There are two types of health check in use: 1. gRPC checks for Gateway and Stargate, which both require public IPs, so the Consul service can use the standard gRPC health check protocol and communicate directly with the device; and 2. TTL checks, which are useful when the device is not publicly reachable.
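As a rough sketch of how the two check types differ, the helper below picks a check definition based on whether the device is publicly reachable. The `GRPC`, `Interval` and `TTL` fields are part of Consul's check definition, and `/v1/agent/check/pass/<check_id>` is the agent endpoint a device calls to keep a TTL check passing. The function names, the default intervals and the example check ID are assumptions for illustration.

```python
def health_check(address, port, public, interval="10s", ttl="30s"):
    """Return a Consul check definition for a device.

    Publicly reachable devices get a gRPC check that Consul polls itself;
    everything else gets a TTL check that the device must refresh.
    """
    if public:
        return {"GRPC": f"{address}:{port}", "Interval": interval}
    return {"TTL": ttl}


def ttl_pass_path(check_id):
    """Path a device PUTs to on its local agent to mark a TTL check as passing."""
    return f"/v1/agent/check/pass/{check_id}"


# Example: a Gateway with a public IP versus a device without one.
gateway_check = health_check("203.0.113.10", 8080, public=True)
host_check = health_check(None, None, public=False)
heartbeat = ttl_pass_path("service:host-1")
```

With a gRPC check Consul does the polling, whereas with a TTL check the device has to report in before the TTL expires or the check turns critical.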
Last Updated:
July 2019




