If you’ve been building or supporting applications for a while, you’ve probably experienced the uncomfortable postmortem meetings that inevitably follow significant service interruptions. You know how it works. There was a critical outage in one of your apps and it took the team an entire week to track down and fix the issue. Customers and revenue were lost.
Now you’re sitting in a large conference room with executives to discuss what happened and why.
It is supposed to be a blame free meeting but you’re a little nervous. You know you’re not monitoring everything. It’s a tradeoff. Logging can slow down apps and monitoring takes time away from delivering features and is expensive. The team built logging and monitoring for the “critical” apps. You trust your team, and your track record is solid—even with last month’s outage. This issue just happened to be on a service the team labeled tier 2, or “not critical.” Of course, since the outage, you’ve bumped it up to tier 1 where it belongs.
“So, tell me,” the VP of product says, “why are you monitoring your applications?”
You relax. This is going to be easier than you thought.
“Even in the best systems,” you say, “there will be problems, and problems cost money. So, we need to know as much as we can when problems occur. Without monitoring, we’ll know something is wrong, but we won’t know any details. And while we are searching for answers, we’ll be losing sales. Monitoring tells us exactly what went wrong and where to look.”
“I see,” says the VP of sales “So, you’re monitoring all your applications?”
“Yeah. We just happened to miss this one.”
“Why aren’t you monitoring every application—even the non-critical ones just to be safe?”
“It’s too expensive,” you say. “And there would be too much data—all those logs and metrics. We’d never be able to parse through it all.”
“So how do you choose which applications are critical?”
“It’s the important ones,” you say. “CRM, payroll, call center, the e-commerce platform…”
“OK,” the VP of product says. “But those are systems, not applications. Systems are made up of applications: authentication services, web apps, databases. Are you monitoring all those components?”
You start to feel a little uncomfortable. “Well,” you say, “yeah. Sort of. We’re monitoring most of them. But that’s a lot of applications.”
“What about your microservices? And your cloud APIs?”
“We can’t monitor everything,” you say. “We have budgets and time constraints.”
“And on the applications you are watching, what events and metrics are you monitoring?”
“Mostly just performance. Error rates, transactions per second, that kind of thing.”
“That’s it?” he asks.
You already know why it took you so long to track down your outage, and so do your VPs. It’s because you’ve sacrificed coverage for budgets and time. That’s how it’s always been.
But it doesn’t have to be that way.
Let’s step back and see what went wrong. We’ll look at how to identify your critical systems, why those systems are so complex and difficult to monitor, and how implementing a comprehensive log monitoring approach along with application performance monitoring (APM) can help you monitor all your systems with simplicity, ease, and minimal costs..
What Is a Critical Application?
First, let’s look at what makes an application critical. A great place to start is by defining your critical processes—flows required for the survival of your business. For example, a freight company delivering goods to supermarkets might consider their supply chain management process to be critical. An online hotel booking company might identify their reservation workflow to be critical. Critical processes make or break your business.
Once you identify critical processes, the next step is to define what critical applications power those critical processes. For the travel company, it might be the front-end application, a shopping cart application, a payment processing application, and the application that calls wholesaler APIs.
Then, once you identify the critical applications, you know what you need to monitor. If any of these parts stop working, your process is broken, and customers are lost. However, identifying all of these components is notoriously difficult, especially with the complexity of modern architecture.
Modern Applications Are Complex
Monitoring was easier years ago. Systems were monoliths that ran on user workstations and called databases directly. However, as the years passed, systems became more complicated, implementing a three-tier architecture: a database, an app, and the presentation layer. And more recently, complexity has multiplied with the introduction of microservices, hybrid and multi-cloud, and container and orchestration systems.
In a modern application, you might now need to monitor hundreds of microservices of your travel booking app, hybrid-cloud applications both on-prem and on public cloud, or multi-cloud communications between two vendors. And with modern DevOps, you might need to monitor containers and complicated orchestration systems such as Kubernetes, just to be sure your deployments are working.
These modern architectures might add reliability and scalability, but they also add abstraction, complexity, and more points of failures, and significantly increase monitoring needs. And even if you can monitor all these components, you now have a major problem: information overload. The sheer number of metrics and logs from your critical applications can be overwhelming. It’s impossible to continuously watch these metrics and logs.
Don’t Compromise Your Monitoring
As a result of all these applications creating all this information, teams compromise what they label critical. They may decide to monitor the slow query log, CPU, memory, and disk space of the database, but not the read/write throughput. They may decide to monitor the microservice that calls the bank API, but not the one that emails customers.
The result of this compromise? The uncomfortable postmortem meeting in the executive conference room. One day your application grinds to a halt. You check the app’s logs, the message queue, the alerts, but you can’t find the answer. Your queries are well-tuned and tested. Your DBA swears indexes are up-to-date. Yet your whole team is spending days trying to track down the issue.
However, if you had monitored all your truly critical apps, if you had watched all your event notifications and available metrics, you would have quickly found an answer. Your read/write throughput was higher than normal, causing locks on your table and slowing your queries, pushing the CPU to its limits. With thorough monitoring you would have quickly traced the problem back to the culprit: a recent update to an application causing a high volume of write requests.
A single missing metric or monitoring alert can cause days or even weeks of lost time and revenue.
Application Performance Management (APM)
This is why APM exists. APM complements your log management and alerting. It integrates with your applications to collect, process, analyze, and correlate information from all your components. Simply put, APM coupled with logs gives you a holistic view of all your applications, from the “bird’s-eye view” of a problem down to the root cause.
And if you use SolarWinds® Papertrail™ for your logs along with SolarWinds AppOptics™ as your APM tool, cost is no longer an issue. You won’t need to choose which logs to collect or which applications or metrics to monitor. With Papertrail and AppOptics, you can afford to monitor all the logs and all the metrics of all your critical applications, orchestration tools, and cloud platforms, in a simple and integrated solution.
SolarWinds Papertrail and AppOptics—Better Together
If you’ve been using SolarWinds Papertrail for a while, you know Papertrail makes it easy to aggregate logs from just about anything. And with the Papertrail event viewer, you can search all your logs from a central interface, see events in context, and even tail logs in real time. SolarWinds AppOptics adds a powerful APM solution to your logs. AppOptics allows you to create dashboards for all your applications, databases, servers, networks, and infrastructures all side by side. With a glance, you can tell you which layer is experiencing issues.
Then you can drill into specifics. AppOptics gives you a clear picture of application dependencies, allowing you to quickly see how microservices are connected, what the database is returning to users, what functions are being called by the API gateway, and more. Seamless integration with Papertrail enables you to move from infrastructure hosts or your transaction traces to the relevant log events associated with that host or services that make up a transaction trace with just one click.
You can also see your servers’ capacities and performance, and conduct trend analysis by comparing usage, capacity, and performance over time.
And importantly, with AppOptics and Papertrail, you can monitor all your applications, without sacrifice.
SolarWinds AppOptics and Papertrail are a new breed of combined APM and logging that’s simple and affordable. To see how AppOptics complements Papertrail, sign up for a free 14-day trial today.