A good application usually has some sort of logging to provide clues when something goes wrong. When it doesn’t, I have to spend time making sure I can get some basic information logged when an issue occurs. In those cases, I used to write files in /var/log for every application or scheduled task. But I came to a point where there were too many files to review when I had to troubleshoot. To simplify troubleshooting, I decided to change my approach to an easier and more convenient system. This post describes what I came up with.
As stated above, I had too many files to review when I had to troubleshoot. Additionally, each file required active management, which consumed even more time. I had to add each file to the rotation system and perform additional backups. In my new approach, I created a single repository of information where I logged every issue I could find. From business applications to administrative tasks, all the errors had to be logged in the same place.
I didn’t think about it too hard, as there was no need to reinvent the wheel. Syslog was my choice because it was already a standard for message logging and present in most Unix and Linux systems. It took me some time, but I migrated most of my scripts and had them send their output to the system log. They ended up looking similar to this one:
```sh
#!/bin/sh
# Timestamped PostgreSQL backup; report failures to Syslog.
BKDATE=$(date +"%d-%b-%Y-%H%M")
BKDIR=/home/dbbackups
cd "$BKDIR" || exit 1
pg_dump -U user -d database > db/database-$BKDATE.sql
tar -zcf /home/dbbackups/backup-$BKDATE.tar.gz db
[ "$?" != "0" ] && logger "$0 - PostgreSQL Backup failed" || :
```
In the example above, the key change is the last line, which invokes the logger command to send a message to Syslog whenever the database backup fails.
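The same pattern works for any scheduled task, so a small helper keeps the scripts short. This is a minimal sketch; the tag dbtask and the priority local0.err are illustrative choices, not values from my setup:

```sh
#!/bin/sh
# Run a command; if it exits non-zero, report it to Syslog.
# The tag "dbtask" and priority local0.err are illustrative assumptions.
run_logged() {
    "$@"
    status=$?
    if [ "$status" -ne 0 ]; then
        logger -t dbtask -p local0.err "$0 - '$1' failed with status $status"
    fi
    return "$status"
}
```

For example, `run_logged tar -zcf backup.tar.gz db` logs a message only when tar fails, and it preserves the command’s exit status for the caller.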
Additionally, I scheduled its execution by adding a call to a crontab file:
```sh
10 20 * * * /home/dbbackups/makebackups.sh 2>&1 | /usr/bin/logger -t psqlbackup
```
Whenever there was an issue with a custom script or with a cron job, I only had to review a single file (/var/log/messages). From this point, I started to configure each application to send logs to Syslog whenever possible. No more independent files, no more files per application.
I have small clients with just a couple of servers, but nowadays businesses often have more than a few, and you must be able to monitor each and every one of them (whether they’re physical, virtual, or both). If the business is small, it’s very likely that all servers have standard configurations, at least in terms of software. This means you only have to deal with one or two operating systems, likely with similar software versions. When I’m in charge, I make sure all servers are the same. Extra hardware may be limited to some additional network devices.
Later, as the business grows and the number of complex applications increases, monitoring and troubleshooting expands to include more potential sources of failure. Now the challenge isn’t about monitoring a single file in a couple of servers but monitoring files in every node, all the equipment in use by the application, the connectivity between nodes, security—you name it. The ecosystem evolves and becomes more diverse in terms of hardware and software.
For this scenario I use remote logging, Rsyslog specifically. This way I can send every message that arrives locally to a remote logging server, which aggregates logs for every node and every application. Now I have a better way to debug errors because I’ll know what fails, where, and when. I can monitor every application, every script, every process. I’m better prepared to troubleshoot.
Forward Data Configuration
Let’s look at basic Rsyslog configuration:
```
*.* action(type="omfwd" target="192.0.2.2" port="10514"
           protocol="tcp" action.resumeRetryCount="100"
           queue.type="linkedList" queue.size="10000")
*.info;mail.none;authpriv.none          /var/log/messages
authpriv.*                              /var/log/secure

mail.*                                  /var/log/maillog
cron.*                                  /var/log/cron
*.emerg                                 :omusrmsg:*
uucp,news.crit                          /var/log/spooler
local7.*                                /var/log/boot.log
```
- The first three lines forward all messages to 192.0.2.2 on port 10514 over TCP, retrying up to 100 times before discarding a message if the target is unreachable.
- The fourth line tells Rsyslog to save all messages of level info or higher (except mail and private authentication messages) to /var/log/messages.
- The fifth line designates a restricted access file.
- The seventh and eighth lines define location files for email and cron messages.
- The ninth line defines that everybody gets emergency messages.
- The tenth line saves uucp and news messages of level crit or higher in a special file.
- The eleventh line saves boot messages to boot.log.
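If the central server may be down for longer than a few retries, Rsyslog can spool queued messages to disk instead of discarding them. Here’s a sketch of a disk-assisted queue; the queue file name and size limits are illustrative values, not part of my original configuration:

```
*.* action(type="omfwd" target="192.0.2.2" port="10514" protocol="tcp"
           queue.type="linkedList" queue.size="10000"
           queue.filename="fwdq"
           queue.maxDiskSpace="100m"
           queue.saveOnShutdown="on"
           action.resumeRetryCount="-1")
```

Setting queue.filename is what enables disk assistance, and action.resumeRetryCount="-1" makes Rsyslog retry indefinitely rather than give up after a fixed count.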
Receive Data Configuration
On the receiving end, instead of the first three lines of the previous configuration, we have the following:
```
module(load="imtcp")
input(type="imtcp" port="514")
module(load="imudp")
input(type="imudp" port="514")
```
- The first line loads the module for the TCP protocol.
- The second line listens on TCP port 514 for incoming data.
- The third line loads the module for the UDP protocol.
- The fourth line listens on UDP port 514 for incoming data.
The rest remains the same to save syslog messages.
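To keep messages from different nodes apart on the receiver, a dynamic file template can split them by host. This is a sketch; the template name and the /var/log/remote directory are illustrative assumptions:

```
template(name="PerHostFile" type="string"
         string="/var/log/remote/%HOSTNAME%/messages")
*.* action(type="omfile" dynaFile="PerHostFile")
```

With this in place, each sending host gets its own messages file named after its hostname, which makes per-node troubleshooting easier.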
Another advantage of this configuration is that you can collect logs from hardware devices such as routers and firewalls, as long as they can send events in Syslog format.
Consolidating the different types of log messages from your different sources comes at a price. Still, I consider it worth paying for the following benefits:
- The ability to use different plugins for different log formats.
- The ability to save all input messages from your environment.
- High processing power to analyze your logs.
- Enough storage for current and historical data.
- The ability to use a cluster to save all your logs.
You’ll also need to monitor and maintain the logging infrastructure itself, unless you offload that work to a managed service.
Log Analysis and Alerts
Once you’re collecting data from your business applications, your servers’ resources, and even hardware devices, you need to build in some automation to know when to take action. For example, you need an alert for you and your team if your application is unreachable, if any of your cron jobs fail, or if your firewall detects an unauthorized login attempt. These events require immediate action, so you need real-time analysis to trigger alerts.
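As a rough sketch of that real-time piece, a small filter can watch the aggregated log and surface the lines that need action. The patterns and the tail pipeline below are illustrative assumptions, not a complete alerting system:

```sh
#!/bin/sh
# Print log lines that match failure patterns worth alerting on.
# The patterns are illustrative; adjust them to your own messages.
match_alert() {
    case "$1" in
        *"Backup failed"*|*unauthorized*|*error*)
            printf '%s\n' "$1"
            return 0 ;;
        *)  return 1 ;;
    esac
}

# Typical usage: follow the aggregated log and notify on matches:
#   tail -Fn0 /var/log/messages | while read -r line; do
#       match_alert "$line" && echo "$line" | mail -s "Log alert" admin@example.com
#   done
```

A real deployment would add rate limiting and deduplication so one flapping service doesn’t flood your inbox, but the shape is the same: match, then notify.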
You’ll also have information about your applications’ peak activity today, last week, and even last month or last year. Which application is failing more? The answer is there. You’ll be able to see suspicious network activity, traffic, and users and websites visited. You’ll know which custom process is failing. All the data is there for you, and your imagination is the limit to what you can learn from it. Historical data will provide a basis for learning, both human and machine, which can help you proactively troubleshoot and result in happier end users.
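Simple text tools already go a long way for that kind of historical question. For instance, here is a sketch that counts failure messages per day in an aggregated syslog file; the "failed" keyword and the syslog line layout are assumptions for illustration:

```sh
#!/bin/sh
# Count lines mentioning "failed" per day from syslog-format input
# ("Mon DD HH:MM:SS host tag: message") read on stdin.
count_failures() {
    grep 'failed' | awk '{print $1, $2}' | sort | uniq -c
}

# Usage: count_failures < /var/log/messages
```

Each output line is a count followed by the month and day, which is often enough to spot the script or host that fails most.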
I just showed you how I went from a collection of single log files with a small client to a whole logging and monitoring mechanism. A system like this allows you to monitor your processes, from simple cron jobs to software and hardware alerts, and to improve your troubleshooting capabilities. If you already have an advanced IT setup with plenty of diverse log sources, you may want to consider a cloud solution to reduce the amount of maintenance required. With that in mind, look at how SolarWinds® Papertrail™ can make your monitoring and analysis tasks easier. If you’ve already centralized your logs, just change the destination. And if you’re only beginning to aggregate them, you may save precious time.
This post was written by Juan Pablo Macias Gonzalez. Juan is a computer systems engineer with experience in back end, front end, databases, and systems administration.