After scratching the surface of Logstash (and my head) I wanted to understand a bit better how Logstash' filters work, so I set myself the task of parsing a file with a file input and using some of the filters to grab bits and pieces of it for further processing. I also ran into a few surprises...

Logstash flowchart

The input file contains lines of text; I'm interested in grabbing names from lines which begin with "Name:". So, for the following input file (tfile.in), I want to process lines like the second, which contains "Jane Jolie", split that into first and last names, and ignore the rest.

something or other: or not
Name: Jane Jolie uses logstash
Hello: JP

The first surprise is that Logstash's file input, while perfectly capable of reading the file, starts at EOF only: there seems to be no way of telling Logstash to process a file from the beginning when first starting up, even if the documented sincedb_path doesn't exist.

input {
    file {
        type => "tfile"
        path => [ "/tmp/tfile.in" ]
        format => "plain"
        add_field => [ "filetype", "tfile" ]
    }
    tcp {
        type => "tfile"
        port => '1514'
        format => 'plain'
        add_field => [ "filetype", "tfile" ]
    }
}

So, for testing, I either append lines to path (echo "something" >> tfile.in), or replace file by tcp (as shown above) and feed data to Logstash via TCP -- I found this easier to do, and the result is identical.

netcat -v -w 1 127.0.0.1 1514 < tfile.in

Recall from my earlier discussion of Logstash that a type follows an input as it processes through Logstash. There's probably some fancy terminology for this, but I think of it as the type being "attached" or associated with the message (line) just read by the corresponding Logstash input. The following filters all check type and handle messages of type "tfile" only; if I had any other input with a distinct type these filter would ignore that input (c.f. the flowchart above).

The grep filter I use here selects messages by matching the regular expression towards the @message field. (This filter is not strictly necessary -- the following grok filter produces the same result.) All messages filtered by grep have a tag added to them which I can use is subsequent filter matching rules. In other words: a message that passes this filter (i.e. is "grep"'ed, now has type equals "tfile" and a tag equals "grepped" associated with it.

filter {
    grep {
        type => "tfile"
        add_tag => "grepped"
        match => [ "@message", "^Name:" ]
    }

grok does the hard work: it uses either a pattern or a match (equivalent) to a) find the message lines I'm interested in, and b) split up the message line into two fields called "givenname" and "sn" respectively; in my example, the former will contain "Jane" and the latter "Jolie". Additionally I add a tag ("grokked") to the tags already associated with this message. (In other words, this one message now has two tags: "grepped" and "grokked" :-)

grok {
        type => "tfile"
        tags => [ "grepped" ]
# The following two are equivalent
#        pattern => [ "^Name:%{SPACE}%{USERNAME:givenname}%{SPACE}%{USERNAME:sn}%{GREEDYDATA}" ]
        match => [ "@message", "^Name:%{SPACE}%{USERNAME:givenname}%{SPACE}%{USERNAME:sn}%{GREEDYDATA}" ]
        add_tag    => "grokked"
    }

At this point, as the message continues to pass through Logstash's filters, it basically contains the following values. (I believe this is called a "json_event" in Logstash-terminology.)

{
    "@tags": [
        "grepped", 
        "grokked"
    ], 
    "@fields": {
        "filetype": [
            "tfile"
        ], 
        "givenname": [
            "Jane"
        ], 
        "sn": [
            "Jolie"
        ]
    }, 
    "@timestamp": "2012-08-09T19:39:26.331000Z", 
    "@source_path": "/client/127.0.0.1:53435", 
    "@source": "tcp://0.0.0.0:1514/client/127.0.0.1:53435", 
    "@source_host": "0.0.0.0", 
    "@message": "Name: Jane Jolie uses logstash\n", 
    "@type": "tfile"
}

Looking at the above message, I note:

  • The "@type" field was added during "input".
  • The "filetype" field we added during "input".
  • The "@tags" we set during grep and grok filters.
  • The "givenname" and "sn" fields we split out in the grok filter.
  • The other values were added by Logstash; in particular, Logstash created a "@timestamp" corresponding to the current time.

There are a few things I now want to change in the message, so I use the mutate filter to modify it. Here again, we choose to handle only messages with a particular type and tags, and I replace fields, add and remove tags, and add a field.

mutate {
        type => 'tfile'
        tags => [ 'grokked', 'grepped' ]

        uppercase => [ "@message" ]
        lowercase => [ 'givenname' ]
        replace => [ '@type', 'mutated' ]
        remove => [ 'filetype' ]
        remove_tag => [ 'grepped', 'grokked' ]
        add_tag => [ 'fumbled' ]
        add_field => [ 'origin', 'Bulk', 'nom', "%{sn}" ]
    }
}

(Note that lowercasing the givenname doesn't work -- I don't know why Jordan Sissel confirmed this to be a bug and has fixed it.)

Our original log file line now looks like this within Logstash:

{
    "@tags": [
        "fumbled"
    ], 
    "@fields": {
        "origin": [
            "Bulk"
        ], 
        "nom": [
            "Jolie"
        ], 
        "givenname": [
            "Jane"
        ], 
        "sn": [
            "Jolie"
        ]
    }, 
    "@timestamp": "2012-08-09T19:39:26.331000Z", 
    "@source_path": "/client/127.0.0.1:53462", 
    "@source": "tcp://0.0.0.0:1514/client/127.0.0.1:53462", 
    "@source_host": "0.0.0.0", 
    "@message": "NAME: JANE JOLIE USES LOGSTASH\n", 
    "@type": "mutated"
}

We have changed the message or json_event as follows:

  • The "grokked" and "grepped" tags have been removed and replaced with a tag called "fumbled".
  • We've added a field called "origin" and a field called "nom" (French for "name").
  • We've removed the "filetype" field we added during "input". Why? 'Cause it's cool. ;-)
  • Oh, and we've uppercased the "@message" field. (Again, pretty useless, because I won't be using that any longer, but I hope you get the idea...)

Enough transformations for now. Let's output what we have. As A test, I'll use the http output, in which I can format the output as I desire. The final result:

{
    "givenName": "Jane", 
    "surname": "Jolie",
    "time": "2012-08-09T19:39:26.331000Z"
}

So how did I follow what was happening while Logstash was applying filters? I used the stdout output of course, but I also added http output to POST (ahem: logstash expects that to be lowercased) to a Web server. (http typically sends out a JSON event as shown above; using the mapping function, I can reformat that to produce the JSON structure I want, as just above.)

output {
    http {
        type => 'mutated'
        tags => [ 'fumbled' ]
        http_method => 'post'
        url => 'http://127.0.0.1:8000/'
        mapping => [ "surname", "%{sn}", "givenName", "%{givenname}", "time", "%{@timestamp}" ]
    }

    file {
        type => "mutated"
        path => "tfile.out"
        flush_interval => 0
        message_format => "CN: %{givenname} %{sn}  (%{@timestamp})"
    }
}

The file output, which also catches messages of type "mutated" (named that way in the mutate filter we defined above) finally produces the result as defined by "message_format", and writes that into tfile.out:

CN: Jane Jolie  (2012-08-09T19:39:26.331000Z)

That about covers it, I believe.

For your amusement, this is the Web server built with Bottle: it dumps the received JSON to stdout:

#!/usr/bin/env python

from bottle import get, post, request, run
import StringIO
import json

@get('/')
def hello():
    return "Hola!"

@post('/')
def post():
    str = request.body
    params = json.loads(str.getvalue())
    print json.dumps(params, indent=4)

run(host='localhost', port=8000, debug=True)

Logstash's plug-ins, whilst powerful, are time-consuming to learn: documentation is not quite "optimal" and some of the plug-ins seem to be of a mediocre quality; to be fair: some are also marked as beta or experimental. Do to the rather long startup-time that Logstash (at least the monolithic thingy I'm currently using) requirs, testing is tedious: every change in configuration means a restart.

Apart from that, I'm really enjoying Logstash.

Further reading:

Comments

blog comments powered by Disqus