The Apache HTTP server allows a system administrator to configure how it should log requests. This is good in terms of flexibility, but it’s horrid in terms of parsing: every installation can be different.
I was tasked with getting Apache logs into Graylog and discovered that $CUST has different Apache log formats even between Apache instances which run on a single machine. I certainly didn’t want to have to write extractors for all of those, and I can’t imagine people here wanting to maintain those …
People have tried submitting JSON directly from Apache, but I find that a bit cumbersome to write, and I have the feeling it’s brittle: an unexpected brace in the request (which ought to be possible) could render the JSON invalid.
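A brace inside a quoted JSON string turns out to be harmless on its own; escaping is where I'd expect trouble. As far as I know, Apache writes non-printable bytes in %r and in header values as \xhh, and that is not a valid JSON escape. Here is a small Python 3 sketch of the effect. The log line is made up for illustration, and is_valid_json is just a helper name I picked:

```python
import json

# Hypothetical line, as a JSON-shaped LogFormat might emit it when the
# request contains a raw 0x01 byte (assumed to be logged as \x01).
logged = r'{"request": "GET /a\x01b HTTP/1.1", "status": 400}'

def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False

print(is_valid_json(logged))                              # \x is not a JSON escape
print(is_valid_json('{"request": "GET /{x} HTTP/1.1"}'))  # braces alone are fine
```

One odd byte in a request or header is enough for the whole line to be rejected, whereas with key=value pairs it simply ends up inside a value.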
I settled on what I think is a much simpler and rather flexible format: a TAB-separated (\t) list of key=value pairs configured like this in httpd.conf:
LogFormat "clientaddr=%h\trequest=%r\tstatus=%s\toctets=%b\ttime=%t\truntime=%D\treferer=%{Referer}i\tuseragent=%{User-Agent}i\tinstance=nsd9" graylog
CustomLog "|/usr/local/apache-logger.py" graylog
The apache-logger program splits those up, adds fields required for GELF, and fires that off to a Graylog server configured with an appropriate GELF input.
#!/usr/bin/env python
# JPMens, March 2015 filter for special Apache log format to GELF

import sys
import json
import gelf     # https://github.com/jspaulding/gelf-python/blob/master/gelf.py
import socket
import fileinput
from geoip import open_database   # http://pythonhosted.org/python-geoip/

my_hostname = socket.gethostname()  # GELF "host" (i.e. source)

try:
    geodb = open_database('GeoLite2-City.mmdb')
except Exception:
    sys.exit("Cannot open GeoLite2-City database")

c = gelf.Client(server='192.168.1.133', port=10002)

def isnumber(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

for line in fileinput.input():
    parts = line.rstrip().split('\t')
    data = {}
    for p in parts:
        if '=' not in p:
            continue                # skip malformed fields instead of crashing
        key, value = p.split('=', 1)
        if isnumber(value):
            try:
                value = int(value)
            except ValueError:
                value = float(value)
        if value != '':             # drop empty values
            data[key] = value

    data['host'] = my_hostname      # overwrite with GELF source
    data['type'] = 'special'

    # The request line becomes the GELF short_message; its first word is the method
    request = str(data.pop('request', 'GET I dunno'))
    data['short_message'] = request
    data['method'] = request.split(' ', 1)[0]

    # Geolocation is best-effort: a failed lookup must not lose the log entry
    try:
        g = geodb.lookup(data['clientaddr'])
        if g is not None:
            data['country_code'] = g.country
    except Exception:
        pass

    # Never let a delivery problem kill the logger Apache is piping into
    try:
        c.log(json.dumps(data))
    except Exception:
        pass
Graylog effectively receives something like this (the Geo-location having been added by apache-logger):
{
    "clientaddr": "62.x.x.x",
    "host": "tiggr",
    "instance": "nsd9",
    "method": "GET",
    "country_code": "GB",
    "octets": 282,
    "referer": "-",
    "runtime": 501,
    "short_message": "GET /barbo HTTP/1.1",
    "status": 404,
    "time": "[20/Mar/2015:06:41:36 +0000]",
    "type": "special",
    "useragent": "curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.13.1.0 zlib/1.2.3 libidn/1.18 libssh2/1.2.2"
}
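For comparison, here is roughly what such a message looks like when built by hand to the letter of the GELF 1.1 spec, using only the Python 3 standard library. This is a sketch, not what the gelf module above does internally: to_gelf, encode_gelf and send_gelf are names I made up, and the sample values are illustrative. The spec wants version, host and short_message, and a leading underscore on every additional field (as far as I know Graylog strips that underscore again when it stores the field).

```python
import json
import socket
import zlib

def to_gelf(fields, host):
    """Build a GELF 1.1 message dict from parsed log fields."""
    msg = {
        'version': '1.1',                                   # required by GELF 1.1
        'host': host,                                       # required: the source
        'short_message': str(fields.get('request', '-')),   # required
    }
    for key, value in fields.items():
        if key in ('request', 'id'):    # request is used above; _id is not allowed
            continue
        msg['_' + key] = value          # additional fields get a "_" prefix
    return msg

def encode_gelf(msg):
    """Serialise and zlib-compress a message for a GELF UDP input."""
    return zlib.compress(json.dumps(msg).encode('utf-8'))

def send_gelf(msg, server, port):
    """Send one unchunked GELF datagram (only suitable for small messages)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(encode_gelf(msg), (server, port))
    finally:
        sock.close()

msg = to_gelf({'clientaddr': '192.0.2.7', 'request': 'GET /barbo HTTP/1.1',
               'status': 404, 'instance': 'nsd9'}, host='tiggr')
print(json.dumps(msg, sort_keys=True))
```

Calling send_gelf(msg, '192.168.1.133', 10002) would fire the message at the same input the script uses. Messages too large for a single datagram need GELF chunking, which this sketch leaves out.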
You’ll have noted that the LogFormat allows me to specify any number of fields (e.g. instance) and values.
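That pass-through behaviour is easy to check in isolation. Here is a minimal Python 3 sketch of just the parsing step, without the gelf and geoip modules. parse_line and coerce are names I made up for illustration, and vhost is a hypothetical extra field that the parser has never heard of:

```python
import math

def coerce(value):
    """Turn '404' into 404 and '0.5' into 0.5; leave other strings alone."""
    for convert in (int, float):
        try:
            number = convert(value)
        except ValueError:
            continue
        if math.isfinite(number):   # keep 'nan' and 'inf' as plain strings
            return number
    return value

def parse_line(line):
    """Split one TAB-separated key=value log line into a dict."""
    data = {}
    for part in line.rstrip('\r\n').split('\t'):
        key, sep, value = part.partition('=')
        if not sep or value == '':
            continue                # skip malformed or empty fields
        data[key] = coerce(value)
    return data

# Sample line; "vhost" is an extra field added only in the LogFormat
sample = ('clientaddr=192.0.2.7\trequest=GET /barbo HTTP/1.1\t'
          'status=404\toctets=282\tvhost=www.example.org\n')
fields = parse_line(sample)
print(fields['status'], fields['vhost'])   # 404 www.example.org
```

The new field shows up in the result without any change to the code, which is the whole point of the key=value format.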