Introduction
Wool, eyes, pulled
More or less every day I see ads from clock vendors promising that so long as you buy a clock from one of three vendors, you are guaranteed to have nanosecond accuracy everywhere in your network. In fact, the more clocks the better as far as those vendors are concerned. And on top of that, you will for sure be MiFID II / FINRA CAT compliant…but is it true?
Similarly, switch vendors have been super keen to push the idea that when you do an equipment refresh (...or preferably before...) you must buy their PTP capable switches that can act as boundary or transparent clocks, because then great accuracy is surely guaranteed everywhere...right?
The Emperor's New Clothes
But in our real-world experience, when we work with really large businesses and examine how these two models work in practice, we find that they perform surprisingly poorly. It's just that nobody ever finds out, because the current architecture is fundamentally flawed. In this article, I’ll explain why those businesses, and (probably) yours too, do not and would never in a million years apply the principles of the single-parent / single-source distribution model anywhere else.
To be upfront with you: Of course, we want to sell you something too. But you can get 95% of our product for free on our website right now, and if you do, you will still be much better off than you are now. As our young company grows, we are coming to realise that what our customers really buy from us is not just software. It’s a relationship with a team of experts who have more skills in clock synchronisation than any end-user organisation - given the niche nature of the subject - could reasonably be expected to have.
What you do not realise is this: the thing that made me want to write this article comes down to two fundamental questions:
How many single points of failure in your network infrastructure would you be happy to have?
How many of these single points of failure would you be happy to leave completely unmonitored, so that in the event they did not work properly you would have no way of knowing?
If you work for a bank, hedge fund, prop trader, asset manager or similar, then the answer is undoubtedly a resounding “zero” to both questions. And yet, as I am about to explain, right now the answer to both questions is most likely: "the number of grand master clocks + the number of boundary clock capable switches" you have.
“Wow”, I can hear you say, “how come I didn’t know about this?” The answer is: because your vendor relies on the fact that you do not have the knowledge or equipment to discover this for yourself.
Smart people understand this is a real problem
These fundamental flaws are also the reason why none of the hyperscalers I can think of, the ones who actually need accurate time to manage state in large global distributed databases for instance, use this distribution model. Instead they are developing things like Huygens, Sundial and GNSS-capable PCI cards, all of which have profoundly different distribution characteristics from the single-source / single-parent model used in the financial services sector and elsewhere today.
At Timebeat we have built PTP+Squared, a revolutionary, ultra-secure peer-to-peer overlay network that dynamically sets up PTP unicast connections between servers (and, we hope, in the future also between switches to create a hybrid-hybrid model #cisco #arista). While this model relies on PTP (and PTP-capable hardware) between individual nodes, it has some profoundly different and much safer distribution characteristics.
So what is the problem actually?
There are two main problems with classic time distribution using PTP. They are:
The “best” master clock algorithm (BMCA) and
The single parent nature of boundary clocks and other participants that follow as a result.
Let’s examine a typical boundary clock style deployment like the one below
Yes, it really is that crazy
Think about the image above. If there is a problem with just a single element in this complicated chain of clocks and links between them, then everything downstream from it will be affected. Let me give some examples of how little is required to produce a major problem:
if a single switch-to-switch, clock-to-switch or switch-to-host connection has dissimilar ethernet transceivers, then an asymmetry is introduced which produces a large error, or
if an asymmetry is for whatever reason introduced in an intermediary switch or sometimes across a WAN between two clocks, or
if a single clock in the chain is off for whatever reason, it is impossible for anything downstream from it to know - this can happen if CPU load is high, if network traffic spikes, if the clock servo code is poorly written (and this is very much the case even with the largest vendors) or has a bug (we have real-world experience with this and have filed bug reports), or
if a server changes p-state, then system clock performance is affected, or
....for literally 100 other reasons....
If any of these things happen (and they happen frequently), then the boundary clock distribution model has no mechanism to either a) discover / alert or b) quantify the magnitude of the error.
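To make the asymmetry examples above concrete, here is a minimal sketch of the standard PTP delay request-response arithmetic. The function names and the numbers (500 ns one way, 300 ns the other) are my own illustrative assumptions, not measurements from any real network. It shows that the downstream clock computes an offset that is wrong by half the difference between the two path delays, and that nothing in the four timestamps reveals this.

```python
# Sketch of the PTP delay request-response calculation.
# All names and values here are illustrative assumptions.

def ptp_exchange(true_offset_ns, delay_ms_ns, delay_sm_ns,
                 t1=1_000_000, turnaround_ns=50_000):
    """Simulate one Sync / Delay_Req exchange and return (t1, t2, t3, t4).

    true_offset_ns: how far the downstream clock is ahead of the upstream clock
    delay_ms_ns:    real path delay upstream -> downstream
    delay_sm_ns:    real path delay downstream -> upstream
    """
    t2 = t1 + delay_ms_ns + true_offset_ns   # Sync received, downstream clock
    t3 = t2 + turnaround_ns                  # Delay_Req sent, downstream clock
    t4 = t3 - true_offset_ns + delay_sm_ns   # Delay_Req received, upstream clock
    return t1, t2, t3, t4


def estimated_offset(t1, t2, t3, t4):
    """Offset as the downstream clock computes it (assumes a symmetric path)."""
    return ((t2 - t1) - (t4 - t3)) / 2


def mean_path_delay(t1, t2, t3, t4):
    """Mean path delay as the downstream clock computes it."""
    return ((t2 - t1) + (t4 - t3)) / 2


if __name__ == "__main__":
    # Symmetric link: the computed offset matches the true offset.
    print(estimated_offset(*ptp_exchange(250, 400, 400)))   # 250.0

    # Perfectly synchronised clocks, but 200 ns of asymmetry on the link:
    # the downstream clock "sees" a 100 ns offset and steers itself wrong.
    stamps = ptp_exchange(0, 500, 300)
    print(estimated_offset(*stamps))                         # 100.0
    print(mean_path_delay(*stamps))                          # 400.0
```

In the second case the computed mean path delay (400 ns) looks perfectly plausible, which is exactly why a downstream clock with a single parent cannot tell a clean link from an asymmetric one.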
The single source problem
With traditional solutions (LinuxPTP / Cisco / Arista switches), all monitoring of boundary clocks in switches and host clocks in servers (subject to MiFID / FINRA as the case may be) takes this form:
A single upstream clock reports the time to a downstream server. The downstream server accepts whatever calculation its PTP subsystem produces. If the upstream clock is wrong the downstream clock will never know. Even if the upstream clock is right, but the link to the downstream clock introduces an asymmetry the downstream clock will be wrong and it will never know it.
Observe the classic model diagram below:
Notice in the diagram how everything has exactly one parent - one reference. What other IT system would be this badly designed? Nothing I can think of. But the Best Master Clock algorithm in PTP is that poorly designed.
The consequence is that when people start to use Timebeat's UTC verification service they discover that switches in the access layer (as depicted above) don't actually have the same time. They discover that adjacent servers don't have anywhere near the same time, even though every boundary clock switch in the network says accuracy is being maintained at a few nanoseconds.
Our real world experience is shocking
I know of one MSP who had deployed a completely separate clock distribution network where the vendor promised "sub-nanosecond" accuracy, but on inspection different parts of the infrastructure were microseconds apart - 10,000x higher error (4 orders of magnitude) than what was thought to exist.
Similarly, I know of several financial services customers - subject to MiFID II regs - who take time feeds from MSPs and who, by using a Timebeat reference source for verification, have discovered that the time they buy is not remotely close to accurate.
If you still don't understand the problem with single source / single parent distribution let me furnish you with this analogy:
Analogy: How your network currently works
You are walking down the street. You don't know the time. You ask a man "what time is it please?", the man responds "it's 15:06". If the man is incorrect, or if you did not hear him correctly, then you won't know the right time, and if you only ask that one man you won't ever realise that he was wrong or that you didn't hear him correctly.
Shockingly, a lot of networks are built this flimsily - even when they have multiple redundant switches and clocks - because even in those builds there is only ever a single parent / single source active at any given time. And each and every one of those might at any point be wrong and no one will ever know about it.
Hope is not a strategy
I have asked some of the largest companies in the finance and MSP space the following question: "If a boundary clock in a switch was introducing a 50 microsecond error, how would you find out about it?" I have never heard any other answer than "we expect that will not happen". That's not a time verification strategy. That's just hoping for the best.
Ok, I accept it is a biiiiiig problem! What is the solution?
The solution depends on your specific circumstances. In the first instance you might consider getting in touch with us to have a talk about your particular deployment. We've seen pretty much everything under the sun.
But to give you a flavour for what a typical solution looks like, let me show you the diagram below:
The benefit of this approach is that you get a multi-source, multi-parent view of the world. You are no longer asking a single entity for the time, you ask at least three - just like you would in real life if you wanted to be sure of the time. If one source is off, not only will you know, you will also know which one by comparing it with the other two sources. And because each server also distributes time to other servers for verification, this model allows you to find asymmetry problems no other solution in the market is currently able to find. The end result is a view of the world like the one below (from Timebeat's management platform, which of course can also alert / report based on the measurements) where you can compare, in real time, multiple independent sources distributed independently over different paths:
It's evident that when you compare independent sources, any deviation is immediately apparent.
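The "ask at least three" idea can be sketched in a few lines. To be clear, this is not Timebeat's actual algorithm: the function name, the source names and the 1 microsecond tolerance are all hypothetical, chosen only to illustrate why three independent sources let you single out the one that is wrong.

```python
# Illustrative sketch only: compare offsets measured against several
# independent sources and flag any source that disagrees with the median.
# Function name, source names and tolerance are hypothetical.
from statistics import median


def find_outliers(offsets_ns, tolerance_ns):
    """Return the names of sources whose offset is far from the median.

    offsets_ns:   dict of source name -> measured offset in nanoseconds
    tolerance_ns: largest disagreement with the median we accept
    """
    if len(offsets_ns) < 3:
        # With one source you cannot detect an error at all; with two you
        # can see a disagreement but not tell which source is wrong.
        raise ValueError("need at least three sources to single out a bad one")
    mid = median(offsets_ns.values())
    return sorted(name for name, off in offsets_ns.items()
                  if abs(off - mid) > tolerance_ns)


if __name__ == "__main__":
    measured = {"gm-a": 40, "gm-b": -25, "peer-c": 50_000}  # made-up values
    print(find_outliers(measured, tolerance_ns=1_000))      # ['peer-c']
```

With a single parent the same check is impossible, because there is nothing to compare against. That is the whole point of the man-in-the-street analogy earlier.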
It sounds cool! Is it very expensive?
No, it's not too bad actually. Odds are that if you switch to Timebeat your bill will go down not up.
Per DC / PoP: the Timebeat Enterprise verification service (managed server / GNSS clock) is about the price of a cross connect per month. Including GNSS clock and server!
Per server: the Timebeat Enterprise client license fee is very competitive. We promise that if you currently buy a commercial PTP product from a competitor, then we will save you at least 50% off your current bill. And the product you get is soooo much better too....
... our competitors think this pricing model is "aggressive"... :-)
I'm ready to talk, I want to see this in action, who can I contact?
My colleague Ian Gough, firstname.lastname@example.org, +44 7989 140 622
I'm not ready to talk, but I want to know more...
Ok, to learn more about Timebeat and PTP+Squared, here are some resources: and + a full article about PTP+Squared is in the works. Follow Timebeat for updates.