We’ve all heard (or lived through) scary estimates of the cost of IT downtime – $5,000 or even $8,000 per minute. But in today’s IT landscape, latency is the new downtime. Customer expectations now demand that we prioritize network performance as much as availability when solving for reliability. So how can enterprises ensure this supercharged reliability? Observability that starts at the network layer can help.
An estimated 80% of enterprises use a hybrid cloud deployment model. In hybrid cloud, ensuring reliability means understanding how distributed systems are performing from the network layer on up. Teams face new challenges when they try to apply tried and tested strategies for understanding system status and health to a network landscape that has undergone transformational change in the last decade. Traditional network monitoring systems (NMS), many of which were developed two decades ago, have struggled to adapt to the evolving landscape of enterprise infrastructure and keep up with technological advancements in monitoring. In an attempt to compensate, 92.6% of enterprises now use three or more tools to monitor, manage, and troubleshoot their networks.
The observability promise
The staggering volume of telemetry distributed systems produce can drive a lot of impactful insight, but the glut of raw data is only useful to teams who know what to do with it. The last decade has seen observability solutions built to offer quick access to deep insight through an interconnected view that ties current live operations, transactions, and user experiences to the superset of telemetry emitted by heterogeneous infrastructure components.
But beyond the buzzword, what makes network observability more useful than monitoring? Do all observability platforms drive proactive engineering and better experiences? The industry considers the MELT framework (Metrics, Events, Logs, and Traces) a comprehensive definition of observability. If you’re weighing traditional network monitoring against modern network observability solutions, this guide to how each pillar of MELT applies to network observability will help you evaluate:
Metrics
Historically, network monitoring has been confined to observing health and performance metrics such as up/down status and interface statistics. Traditional NMSs alert users as these metrics cross defined thresholds that signal potential problems. NMSs typically “pull” metrics from devices by polling them at predefined cadences using the SNMP protocol. This introduces a lag between the data available in the NMS and the present state of the network. The fact that SNMP timestamps reflect the time of ingestion into the NMS, not the time a potentially concerning event occurred, exacerbates this lag. Speeding up polling reduces the lag, but can burden systems and impact performance. Additionally, the reliability and performance of NMSs themselves can limit their ability to deliver valuable alerts and metrics in time to drive effective resolution; since the majority of NMSs are self-hosted and use legacy architectures, many end up under-resourced and subject to human error. As performance becomes increasingly critical, equipping infrastructure teams with fresh, accurate network metrics means confronting tough resource tradeoffs.
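The polling lag described above can be sketched with a small simulation (not a real SNMP client; the interval, event time, and state values are invented for illustration): a device changes state between polls, and because each sample is stamped at ingest time, the recorded event time can be off by up to one full polling interval.

```python
# Hypothetical sketch: models why polled metrics lag the network's true state.
# An interface goes down at t=7s; a poller on a 30s cadence stamps samples at
# ingest time, so the change is first visible well after it occurred.

def device_state(t: float) -> str:
    """Ground truth: interface goes down at t=7.0 seconds."""
    return "down" if t >= 7.0 else "up"

def poll(interval: float, horizon: float):
    """Sample device_state every `interval` seconds, stamping each sample
    with the poll (ingest) time -- the behavior typical of SNMP-based NMSs."""
    samples = []
    t = 0.0
    while t <= horizon:
        samples.append((t, device_state(t)))
        t += interval
    return samples

def detection_lag(samples, true_event_time: float, bad_state: str) -> float:
    """Gap between when the event happened and when a poll first saw it."""
    for ingest_time, state in samples:
        if state == bad_state:
            return ingest_time - true_event_time
    return float("inf")

lag = detection_lag(poll(interval=30.0, horizon=120.0), 7.0, "down")
print(f"event visible {lag:.0f}s after it occurred")  # prints 23s with a 30s poll
```

Shortening the interval shrinks the lag (a 5-second cadence cuts it to 3 seconds in this toy model) but multiplies polling load, which is exactly the resource tradeoff the paragraph describes.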
Events
Traditional NMSs handle the “E” for events via a mechanism called an SNMP trap. In a trap, a device sends a message to the NMS to signal a critical event, usually when something goes wrong. But traps’ weaknesses lie in the very area they are meant to solve for – reliability. Since they use UDP for transport, they have no visibility into whether their event notifications are received and cannot re-transmit them in case of failure – which is likely during a critical event. Traps also provide only a single data point, rather than the higher-resolution insight into trends that polling can provide.
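The fire-and-forget weakness of UDP-based traps is easy to demonstrate. This minimal sketch (a bare datagram standing in for a real SNMP trap PDU; the port and payload are arbitrary) shows that the sender learns nothing about delivery: the kernel accepts the bytes whether or not any collector is listening.

```python
# Minimal sketch of why SNMP traps are unreliable: they ride on UDP, a
# fire-and-forget transport. The sender gets no acknowledgment, so a trap
# lost on the wire (or sent to a dead NMS) vanishes silently.
import socket

def send_trap(dest: tuple, payload: bytes) -> int:
    """Emit a trap-like datagram. Returns bytes handed to the kernel --
    which says nothing about whether anyone received them."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        return s.sendto(payload, dest)

# Even with no listener on the destination port, sendto "succeeds":
sent = send_trap(("127.0.0.1", 49999), b"linkDown on eth0")
print(f"kernel accepted {sent} bytes; delivery is unknown")
```

A TCP-based transport would surface the failed delivery as a connection error; UDP gives the trap sender no such signal, which is why a trap lost during an outage is simply gone.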
But, given the resource constraints that polling can introduce, many of today’s highest-performing networks rely on streaming telemetry to gather granular, event-driven metrics and gain a dependable, real-time view of network health. Streaming telemetry’s subscription-based model provides an accurate view of the network for rapid remediation without SNMP’s polling overhead and imprecise event timestamps. In some modern systems, streaming telemetry even triggers automated remediation. Today, 41% of enterprises are using streaming network telemetry, and 42.9% more plan to implement it in the next 12 months. The top drivers of streaming telemetry adoption cited by IT leaders are more reliable and secure data collection, and improved tool and data collection scalability.
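The subscription model can be sketched as a simple publish/subscribe loop (this is an illustrative stand-in, not gNMI or any real telemetry protocol; the path names and timestamps are invented): the device pushes a sample the moment a subscribed path changes, stamped with the device's own clock rather than the collector's ingest time.

```python
# Hedged sketch of streaming telemetry's push model: instead of the collector
# polling, the device publishes a sample when a subscribed path changes,
# carrying a timestamp set at the source.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Sample:
    path: str         # e.g. "interfaces/eth0/oper-status"
    value: str
    device_ts: float  # timestamp set at the source, not at ingest

class TelemetryStream:
    def __init__(self):
        self._subs: dict[str, list] = {}

    def subscribe(self, path: str, callback: Callable[[Sample], None]):
        self._subs.setdefault(path, []).append(callback)

    def publish(self, sample: Sample):
        # Device-side: fires immediately on state change, no poll cycle.
        for cb in self._subs.get(sample.path, []):
            cb(sample)

received = []
stream = TelemetryStream()
stream.subscribe("interfaces/eth0/oper-status", received.append)
stream.publish(Sample("interfaces/eth0/oper-status", "down", device_ts=7.0))
# The collector sees the change immediately, with the true event time attached.
```

Compare this with the polling model: here the collector's copy of the event carries the moment it actually happened, and nothing was sampled for paths nobody subscribed to.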
Despite the promise of streaming telemetry, popular legacy NMSs such as SolarWinds have struggled to keep up with the newer protocols it uses and therefore don’t support streaming. To get real-time data from their most important gear, enterprises must deploy additional monitoring tools, achieving complete but fragmented infrastructure observability. Observability platforms do a better job of ingesting heterogeneous telemetry to provide that complete picture. However, very few observability platforms offer a network-first focus, and most cannot visualize and alert on both streaming telemetry and the critical on-prem and data center metrics that NMSs traditionally cover.
Logs
In network observability, the MELT “L” translates to flow logs, which capture detailed information about IP traffic between network interfaces. In today’s distributed architecture, quick remediation of network issues and proactive network resource optimization depends upon deep traffic understanding. Traffic flows fall outside the core expertise of traditional NMSs, but comprehensive network visibility platforms reach far beyond device metrics and events to include traffic insights – where it’s moving to and from, how it’s routed, and how flow has fluctuated in recent minutes, hours, or months. Traffic logs add rich, multi-dimensional data that contextualizes metrics, allowing network teams to quickly solve for “unknown unknowns” that threaten user experience.
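A flow log's value comes from aggregation. This sketch uses the 5-tuple-plus-byte-count shape common to NetFlow/IPFIX-style records (the IPs and byte counts are made up) to answer a basic traffic question that device metrics alone cannot: where is the volume actually going?

```python
# Illustrative sketch: flow records as 5-tuples plus byte counts, aggregated
# by destination to surface the heaviest receivers of traffic.
from collections import Counter

flows = [
    # (src_ip, dst_ip, src_port, dst_port, proto, bytes)
    ("10.0.1.5", "10.0.2.9", 51332, 443, "tcp", 820_000),
    ("10.0.1.5", "10.0.2.9", 51340, 443, "tcp", 410_000),
    ("10.0.3.7", "10.0.2.9", 40211, 443, "tcp", 120_000),
    ("10.0.1.5", "10.0.4.2", 51400,  53, "udp",   2_000),
]

def top_destinations(flows, n=2):
    """Sum bytes by destination IP and return the n heaviest receivers."""
    totals = Counter()
    for src, dst, sport, dport, proto, nbytes in flows:
        totals[dst] += nbytes
    return totals.most_common(n)

print(top_destinations(flows))  # 10.0.2.9 dominates at 1,350,000 bytes
```

The same records can be grouped by source, port, or protocol, which is what lets flow data answer the "unknown unknown" questions an interface counter never could.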
Traces
When it comes to monitoring distributed networks, MELT’s “T” for traces is perhaps better expressed as “T for time-intensive.” In network observability, traces correlate the traffic telemetry present in flow logs to string together an end-to-end picture of an application's functions and service interactions. This is a lot of information to dig into, whether teams use traditional command-line packet analysis utilities such as tcpdump, or invest in expensive and complex network tap hardware and packet analysis software. As a result, teams often avoid tracing in all but the most serious incidents in which they have identified an “unknown unknown” they can’t crack using other observability strategies.
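The correlation idea can be sketched by stitching individual hop records into an end-to-end path, chaining each flow's destination to the next flow's source. This is a deliberately simplified stand-in (real distributed tracing propagates trace IDs across services; the host names here are invented), but it shows the string-together step the paragraph describes.

```python
# Rough sketch of trace-style correlation over flow telemetry: reconstruct
# the path a request took by following src -> dst links between hosts.

hops = [
    ("lb-1",  "web-3"),   # (src host, dst host) per observed flow
    ("web-3", "api-7"),
    ("api-7", "db-2"),
]

def stitch_path(hops, start: str) -> list:
    """Follow src->dst links from `start` to reconstruct the request path.
    Assumes one outbound hop per host, for simplicity."""
    nxt = dict(hops)
    path = [start]
    while path[-1] in nxt:
        path.append(nxt[path[-1]])
    return path

print(stitch_path(hops, "lb-1"))  # ['lb-1', 'web-3', 'api-7', 'db-2']
```

Doing this by hand over packet captures is exactly the time-intensive work the section describes, which is why teams reserve it for the hardest incidents.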
Modern observability tools offer distributed tracing that makes this easier, but lack of automation, limited depth in frontend and distributed services, and architectural heterogeneity can make tracing impractical. In any case, teams should have low-friction access to rich, contextualized traffic telemetry so that packet-level analysis is rarely necessary.
Network observability beyond MELT
While the MELT framework covers a lot, it doesn’t encapsulate two critical components that drive high-quality user experiences: simulation and context enrichment. Synthetic testing simulates real-world user scenarios from global locations to get an accurate read on network performance as it impacts application performance and end user experiences. Network teams can automate synthetic tests to alert on problem scenarios or use them ad hoc after attempting a fix to test whether performance issues are resolved. Modern network observability platforms offer synthetic testing integrated with network monitoring, making this alert-fix-validate process seamless.
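The alert-fix-validate loop can be sketched as a small check function (a hypothetical illustration; the threshold and probe are invented, and a real synthetic test would issue an HTTP or DNS request from each test location): run a probe, compare measured latency against a threshold, and report pass or fail.

```python
# Hypothetical sketch of a synthetic check: measure latency via an injected
# probe and compare it to a service-level threshold. A breach would drive an
# alert; re-running the same check after a fix validates the remediation.
from typing import Callable

def synthetic_check(probe: Callable[[], float], slo_ms: float) -> dict:
    latency = probe()
    return {
        "latency_ms": latency,
        "ok": latency <= slo_ms,   # breach -> candidate for alerting
    }

# After attempting a fix, re-run the same check to confirm it worked:
result = synthetic_check(probe=lambda: 180.0, slo_ms=250.0)
print(result)  # {'latency_ms': 180.0, 'ok': True}
```

Scheduling this check from multiple global locations, rather than running it once, is what turns a spot measurement into the simulated user experience the paragraph describes.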
Data enrichment also accelerates troubleshooting and optimization work to improve experiences. Enriching network metrics and logs with business, application, and security context makes connecting the dots much faster. The best observability platforms do this enrichment automatically.
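Enrichment amounts to a join between raw telemetry and an asset inventory. In this sketch (the inventory, field names, and IPs are all invented for illustration), a bare flow record gains business and security context so that an alert reads as "payments-db, tier 1, PCI scope" rather than an anonymous IP address.

```python
# Sketch of context enrichment: annotate a raw flow record with application
# and security metadata from an asset inventory lookup.

inventory = {
    "10.0.2.9": {"app": "payments-db", "tier": 1, "pci_scope": True},
    "10.0.1.5": {"app": "checkout-web", "tier": 2, "pci_scope": False},
}

def enrich(flow: dict, inventory: dict) -> dict:
    """Attach known context for src/dst IPs; unknown hosts stay unlabeled."""
    out = dict(flow)
    out["src_ctx"] = inventory.get(flow["src_ip"], {})
    out["dst_ctx"] = inventory.get(flow["dst_ip"], {})
    return out

flow = {"src_ip": "10.0.1.5", "dst_ip": "10.0.2.9", "bytes": 820_000}
enriched = enrich(flow, inventory)
print(enriched["dst_ctx"]["app"])  # payments-db
```

Platforms that perform this lookup automatically at ingest time spare responders from doing the same join by hand during an incident.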
The shift to modern observability: rewards and challenges
The age of observability represents a transformative shift in network monitoring, moving from traditional methods towards a more dynamic, real-time, and integrated approach. Enterprises can gain deeper insights into their network's performance and reliability by consolidating legacy network monitoring into a modern observability system that brings together the network-layer components of the MELT framework and puts them in context. Strategic adoption of streaming telemetry, enhanced event handling, and sophisticated log and trace analysis are pivotal in achieving a comprehensive view of network health, thereby ensuring more reliable and effective IT operations.
However, even as network monitoring adopts critical observability practices, there are still hurdles to overcome. The recently released EMA Network Megatrends 2024 Report highlights that the biggest challenges that network operations face are a shortage of skilled personnel, insufficient budget, and tool fragmentation.
To learn more about the current state of enterprise network monitoring and observability, download the full EMA Network Megatrends 2024 Report today.