At CJC, we’ve applied the logic of heatmaps to events to provide a powerful understanding of system behaviours and top talkers.
Steve Moreton, Global Head of Product Management, CJC
In my previous post, I covered the importance of heatmaps, albeit with a focus on IT Metrics. Heatmaps are a proven technique to show dataset behaviour and provide clear visual cues to end-users. This post takes a deep dive into Events and how our IT analytics tool, mosaicOA, can provide a powerful understanding of system behaviours. First, a quick reminder of the difference between IT Metrics and Events:
An IT Metric is data from a server component, such as CPU utilisation changing between 1% and 100%. We store every update against a timestamp.
Example of IT Metrics
For a CPU, there can be up to 12 million metrics every day.
An event is created when an IT Metric passes a set threshold, such as 85%:
An example of an IT Metric creating an Event
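The relationship between IT Metrics and Events can be sketched in a few lines of Python. This is a minimal illustration rather than how mosaicOA works internally: the 85% warning threshold comes from the example above, while the 95% critical threshold, the `Event` type, and the choice to emit an event only when the severity changes are all assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Event:
    """Hypothetical event record: when it fired, how severe, and the metric value."""
    timestamp: int
    severity: str
    value: float


def classify(value: float, warning: float = 85.0, critical: float = 95.0) -> str:
    """Map a metric value to a severity using the (assumed) thresholds."""
    if value >= critical:
        return "Critical"
    if value >= warning:
        return "Warning"
    return "OK"


def events_from_metrics(
    samples: Iterable[Tuple[int, float]],
    warning: float = 85.0,
    critical: float = 95.0,
) -> List[Event]:
    """Emit an event only when the severity changes, not for every metric update."""
    events: List[Event] = []
    previous = "OK"  # assumed starting state
    for timestamp, value in samples:
        current = classify(value, warning, critical)
        if current != previous:
            events.append(Event(timestamp, current, value))
            previous = current
    return events


# Five CPU samples as (timestamp, percentage) pairs.
cpu_samples = [(0, 40.0), (1, 86.0), (2, 90.0), (3, 97.0), (4, 50.0)]
for event in events_from_metrics(cpu_samples):
    print(event)
```

Because most metric updates do not cross a threshold, far fewer events than metrics come out of a loop like this.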
An event will only happen when these set thresholds are reached. This means that the volume of events is dwarfed by that of IT Metrics. Nevertheless, a capital market infrastructure can generate 500 thousand to 1 million+ warnings and critical events per month. This is a vast amount of data to decipher and understand. Like IT Metrics, live events can be replaced by new information (rectifications/snooze) and, unless logged to a database, are lost forever. Without tracking that information, you have no way of understanding behaviour. If you are tracking it, you have a huge pile of data to sort and understand!
mosaicOA records events as well as IT Metrics. At CJC, we’ve applied the logic of heatmaps to events, which provides a powerful understanding of system behaviours and top talkers. Mosaic has always had powerful ways of searching through data (please see our case study on Critical Event Analysis) but to make a heatmap from events, we needed to reduce the datasets down into buckets. Our data science team broke the Events down into the following categories:
Data Buckets for Events
The mosaicOA front end sends a query naming the server(s) to the InfluxDB time-series database via the Event heatmap function. This is a simple matter of typing your server names into the mosaic front end, and regular expressions can be used. The front end will then retrieve each server's baseline (Hardware, CPU, Network, OS etc.) and application (Solace, TREP, Bloomberg, Exegy, exchange ticker plant etc.) components.
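As a rough illustration of selecting servers with a regular expression, the sketch below filters a list of server names in Python. The server names, the `select_servers` helper, and the choice to require a full match are all hypothetical; the actual query that mosaicOA sends to InfluxDB is not shown in this post.

```python
import re
from typing import Iterable, List


def select_servers(pattern: str, known_servers: Iterable[str]) -> List[str]:
    """Return the server names that match the whole pattern, sorted for display."""
    compiled = re.compile(pattern)
    return sorted(name for name in known_servers if compiled.fullmatch(name))


# Hypothetical inventory of server names.
inventory = ["adh-ldn-01", "adh-ldn-02", "adh-nyc-01", "ads-ldn-01"]

# All London ADH servers.
print(select_servers(r"adh-ldn-\d+", inventory))
```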
Heatmap showing Events for 1 Week across all ADH servers
The query will bring back rows of server names, with columns displaying every baseline/application component. The heatmap will show where each component has provided an OK, Warning or Critical message. This quickly identifies the servers that are causing the most issues, as well as trends across servers. The time period can be changed from minutes to months if required.
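The grid described above can be sketched as a nested dictionary with one row per server and one column per component. This is an assumption-laden illustration: the event tuple layout, the `build_heatmap` name, and the rule that a cell takes the worst severity seen in the time window are choices made for the sketch, not a description of mosaicOA's implementation.

```python
from typing import Dict, Iterable, Tuple

# Higher rank means more severe.
SEVERITY_RANK = {"OK": 0, "Warning": 1, "Critical": 2}

# Hypothetical event layout: (timestamp, server, component, severity).
EventRow = Tuple[int, str, str, str]


def build_heatmap(
    events: Iterable[EventRow], start: int, end: int
) -> Dict[str, Dict[str, str]]:
    """Build {server: {component: worst severity}} for events in [start, end)."""
    grid: Dict[str, Dict[str, str]] = {}
    for timestamp, server, component, severity in events:
        if not (start <= timestamp < end):
            continue  # outside the selected time period
        row = grid.setdefault(server, {})
        current = row.get(component, "OK")
        if SEVERITY_RANK[severity] > SEVERITY_RANK[current]:
            row[component] = severity
        else:
            row.setdefault(component, current)
    return grid


sample_events = [
    (10, "adh-01", "CPU", "Warning"),
    (20, "adh-01", "CPU", "Critical"),
    (30, "adh-01", "CPU", "OK"),
    (15, "adh-02", "Network", "OK"),
]
for server, row in build_heatmap(sample_events, start=0, end=100).items():
    print(server, row)
```

Changing `start` and `end` corresponds to changing the time period, whether that is minutes or months.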
To understand which component is being affected, users can hover over the column name and a prompt will appear:
Pointing to a column will reveal the component being affected – showing trends across servers
Clicking on the server name brings up all server details and event messages, with a breakdown of the error message ratios.
Clicking on a heatmap provides detailed visual and text breakdowns
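A breakdown of message ratios for a server can be computed along the following lines. This is a minimal sketch in which the `severity_ratios` helper and the sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def severity_ratios(severities: Iterable[str]) -> Dict[str, float]:
    """Return each severity's share of the total number of messages."""
    counts = Counter(severities)
    total = sum(counts.values())
    if total == 0:
        return {}  # no messages in the selected period
    return {severity: count / total for severity, count in counts.items()}


# Hypothetical messages recorded for one server.
messages = ["OK", "OK", "Warning", "Critical"]
print(severity_ratios(messages))
```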
Clicking or highlighting multiple servers includes them all in the breakdown analysis. Let’s have a look at a full event heatmap dashboard, with all these elements put together. In this case, the example is Solace.
Full Event Heatmap Dashboard
Take a look at this example of a global Solace environment:
Full Dashboard - server click will provide critical event breakdowns and details
In the dashboard above, each Solace appliance (13 in total) has 56 different baseline and application event categories. These appliances have generated almost 42K events in 1 week.
The heatmap arranges the 56 categories for each appliance into tiny, 15-pixel-wide boxes (13 × 56 = 728 boxes in total), moving the 42K events into the correct buckets so the user can immediately identify the top talkers and trends across regions and delve deeper into the server issues if required.
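The bucketing arithmetic and the idea of a top talker can be illustrated with a short Python sketch. The function names and sample events are hypothetical, and ranking servers by raw event count is an assumption about what "top talker" means here.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical event layout: (server, category).
ServerEvent = Tuple[str, str]


def grid_cells(servers: int, categories: int) -> int:
    """Number of heatmap boxes: one per server and category pair."""
    return servers * categories


def bucket_counts(events: Iterable[ServerEvent]) -> Counter:
    """Count how many events fall into each (server, category) box."""
    return Counter(events)


def top_talkers(events: Iterable[ServerEvent], n: int = 3) -> List[Tuple[str, int]]:
    """Rank servers by the total number of events they generated."""
    per_server = Counter(server for server, _ in events)
    return per_server.most_common(n)


# 13 appliances with 56 categories each, as in the dashboard described above.
print(grid_cells(13, 56))

sample = [("sol-1", "CPU"), ("sol-1", "CPU"), ("sol-1", "Network"), ("sol-2", "CPU")]
print(bucket_counts(sample))
print(top_talkers(sample, n=1))
```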
CJC engineers and our clients can easily separate their infrastructure by region, business unit or server technology type. They can look at the event behaviour over the last hour, day, weeks or months and easily spot servers which are causing more events than others, or trends across all appliances regionally or globally. The dashboard is split across infrastructure components such as Solace, Exegy, Exchanges, TREP and Bloomberg, and is a key feature of weekly and monthly scrums. Using event heatmaps has increased understanding of system behaviours and has led to improved monitoring. IT teams can focus attention on these areas to continually improve uptime. The goal is always to see a sea of green!
mosaicOA’s event heatmaps feature has proven to be a great tool for Engineers, IT Executives and Service Managers globally – especially during the global pandemic. As teams are now separated and still largely working from home, event heatmaps offer a powerful dashboard showing specific IT departments everything they need to see on a single screen, so nothing is missed.
At CJC, we also use this as a standard function of our Managed Service and reporting to clients. This is how we deliver a world-class Managed Service and incredible economies of scale. For more information, please contact firstname.lastname@example.org.