
Simple steps to achieve single-digit microsecond clock sync in the cloud

Many of our customers in the financial services space are moving containerised services managed by Kubernetes to clouds, both public and private. It can be challenging to manage time synchronisation and monitor its accuracy in clouds, especially public ones. In this article, I show how dockerised Timebeat is a really cool solution to this problem and what accuracies are achievable. If you are subject to MiFID II or FINRA regulations then you have to do this. If not, then you really should do it anyway: verifying that something works well always beats assuming it does.


Background:


Timebeat Cloud is our PaaS offering: if a customer doesn't want to run our back-end database cluster and web front-end on-premises, we run them in the cloud on the customer's behalf. To provide this service we use Google Cloud and the Google Kubernetes Engine. On our website, we have a live demo of the Timebeat system which is deployed in gcloud, so this is an ideal opportunity to talk about how it is created. Below is a screenshot of what the demo site looks like:

[Screenshot: the Timebeat demo dashboard]

For the purposes of our demo site, we don't really need very accurate time (we just want to show off the Timebeat dashboards...:-)), so our source is the gcloud NTP service, which is received on the timebeat-gm01 node. From there we distribute time to the Kubernetes installation using the Timebeat PTP grandmaster functionality. If you are in the financial services space, gcloud's NTP service is probably not a sufficiently high-quality source of time and you'll want something better that can be traced to UTC. If you are deploying your own grandmasters, I'd recommend looking at the excellent Qg2 from our friends at Qulsar (it has a great price/quality ratio), or if you prefer to just get a PTP feed delivered and you already have your equipment in a reputable data centre, I recommend you look at something like the Equinix Precision Time service from our friends at Equinix.


Kubernetes Configuration:


There are good ways and bad ways to synchronise time on Kubernetes nodes. The bad way typically involves implementing time synchronisation directly on the bare metal. Doing so rather misses the point, and the benefits, of containerisation. It's much cooler to deploy time sync as a container as well.


In this example we'll rely on containerised Timebeat to do the job. If you have a Timebeat Enterprise license, we give you access to pull the image directly from our GitHub container registry (ghcr.io/timebeat-app/timebeat), so you don't have to create or manage the Timebeat docker images yourself.
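Pulling a private image from ghcr.io requires the cluster to authenticate to the registry. Below is a minimal sketch of the registry pull secret this typically involves; the secret name and token details are illustrative, not the actual demo setup:

```yaml
# Illustrative pull secret for the private Timebeat image on ghcr.io.
# In practice it is usually generated with:
#   kubectl create secret docker-registry ghcr-timebeat \
#     --docker-server=ghcr.io \
#     --docker-username=<github-user> \
#     --docker-password=<token with package read access>
apiVersion: v1
kind: Secret
metadata:
  name: ghcr-timebeat                 # illustrative name
type: kubernetes.io/dockerconfigjson
data:
  .dockerconfigjson: <base64-encoded docker config for ghcr.io>
```

The daemonSet's pod template then references this secret under imagePullSecrets, as shown in the sketch further down.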


Kubernetes operates with the concept of nodes. As we define replicaSets and daemonSets to run our business applications, Kubernetes schedules pods on nodes subject to a range of criteria and considerations, such as what else is running, how much memory is available, how much CPU is free, and so on. For the purposes of time synchronisation we just want to run a single pod on each node which synchronises the time and reports the normal Timebeat data to our centralised database. Our demo environment consists of just two nodes and one simulated grandmaster; more isn't required. In the demo front-end we track the performance of all nodes in our demo environment, just as a customer would in a production environment. To get Kubernetes to schedule a pod on each node (whether we have 1 node or 1,000 nodes) we use the Kubernetes concept of a daemonSet, which is specifically intended for this purpose.

Below we look at the Timebeat daemonSet definition:


[Screenshot: the Timebeat daemonSet definition]

Timebeat is very lightweight, so we can significantly limit the resources it is allowed to consume. Because the Timebeat pod has to synchronise the clock of the node it is running on, we run the pod as UID 0 and give it the SYS_TIME capability. Without these settings the pod would not be allowed to adjust the clock on the node.

The next thing to note is the parameters for Timebeat's connection to the Elasticsearch back-end database. We use client-specific PKI certificates for TLS 1.3 encryption and authentication, and on top of that permission-restricted user roles that allow only write operations on the database. This is secure enough that we are happy to send the information over the Internet to the Timebeat Cloud database.
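A minimal sketch of what the corresponding output section of timebeat.yml can look like, assuming Timebeat follows the standard Elastic Beats conventions for its Elasticsearch output (the hostname and certificate paths are illustrative, not the actual Timebeat Cloud values):

```yaml
# Illustrative output section of timebeat.yml (delivered to the pod via a configMap).
output.elasticsearch:
  hosts: ["https://elastic.example.timebeat.app:9200"]    # hypothetical endpoint
  ssl:
    certificate_authorities: ["/etc/timebeat/pki/ca.crt"] # CA used to verify the cluster
    certificate: "/etc/timebeat/pki/client.crt"           # client-specific certificate
    key: "/etc/timebeat/pki/client.key"                   # matching private key
```

The permission-restricted, write-only role is defined on the Elasticsearch side; the pod only needs the certificate material shown here.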


You can also see that we use three different configMaps to mount files into the pods:

  1. The PKI files (ca, crt and key)

  2. The Timebeat license file

  3. The Timebeat config file

This is a handy way of managing things so that the Timebeat image can maintain a key tenet of containerisation: statelessness. A sketch of how these pieces fit together in a daemonSet follows below.
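Drawing the elements above together, here is a minimal sketch of what such a daemonSet can look like. It illustrates the structure described in this article rather than reproducing the actual demo manifest; names, namespaces, mount paths, image tag and resource figures are assumptions:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: timebeat                               # illustrative name
  namespace: timesync                          # illustrative namespace
spec:
  selector:
    matchLabels:
      app: timebeat
  template:
    metadata:
      labels:
        app: timebeat
    spec:
      hostNetwork: true                        # assumption: lets the pod exchange PTP/NTP packets on the node's own interfaces
      imagePullSecrets:
        - name: ghcr-timebeat                  # the pull secret sketched earlier
      containers:
        - name: timebeat
          image: ghcr.io/timebeat-app/timebeat:latest    # tag is illustrative
          securityContext:
            runAsUser: 0                       # UID 0, needed to adjust the node clock
            capabilities:
              add: ["SYS_TIME"]                # grants permission to set the system time
          resources:
            limits:                            # Timebeat is lightweight; figures are illustrative
              cpu: 200m
              memory: 128Mi
          volumeMounts:
            - name: pki
              mountPath: /etc/timebeat/pki     # ca, crt and key from the PKI configMap
            - name: license
              mountPath: /etc/timebeat/license # the Timebeat license file
            - name: config
              mountPath: /etc/timebeat/timebeat.yml
              subPath: timebeat.yml            # the Timebeat config file
      volumes:                                 # the three configMaps keep the image itself stateless
        - name: pki
          configMap:
            name: timebeat-pki
        - name: license
          configMap:
            name: timebeat-license
        - name: config
          configMap:
            name: timebeat-config
```

Applying a manifest like this with kubectl apply is all it takes for Kubernetes to schedule one Timebeat pod per node, which leads to the verification step below.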


Verifying Operation:


After we apply the daemonSet definition above, Kubernetes spins up two pods on the two-node demo setup:

[Screenshot: the two Timebeat pods running on the demo nodes]

In the Timebeat front-end, the pods automatically appear in the inventory table:

[Screenshot: the Timebeat inventory table listing the new pods]

If more nodes are later added to the Kubernetes installation, each will automatically get a Timebeat pod scheduled on it because of the daemonSet definition. If Google moves our nodes in the middle of the night, everything starts up again automatically. Very handy.


Accuracy:


So what accuracy can you achieve in a public cloud? If we were using a time sync solution other than Timebeat (ptp4l, for instance), our view of the world would be very "black and white":

[Plot: clock offset as seen without Timebeat's transaction filtering]

This is not great and does not meet MiFID II or FINRA requirements.

But Timebeat is quite clever and takes a more "colourful" view of which PTP transactions to include when synchronising a clock and which ones to ignore:


[Plot: PTP transactions classified by Timebeat; only the selected (blue) samples are used for steering]

This means we can pass only the blue dots to the steering algorithm, and the result is clock synchronisation that complies with even the most stringent MiFID II and FINRA requirements:


[Plot: resulting clock offset, maintained within +/-25 µs]

In the plot above we can see that accuracy is maintained within +/-25 µs. A fair guess is that most of the instability is caused not by the servo, but by the Google NTP source. If a stable grandmaster source were used, performance would undoubtedly improve. This assumption looks sensible when we consider where error is introduced in the time chain, by looking at all time sources in aggregate form in the front-end:


[Screenshot: aggregate view of all time sources in the front-end]

As is evident, the most significant source of error is (unsurprisingly) Google's NTP server. I hope this was a helpful insight into how easy it is to deploy high-accuracy time synchronisation in a containerised cloud environment. If you want to know more about Timebeat and how you can achieve great accuracy in cloud deployments, please don't hesitate to get in touch with me or one of my colleagues. In the meantime, check out the Timebeat website and the live demo used to make this article.



