LogAnalysis
RE: [logs] regexless parsing, again? Sep 14 2007 09:41PM
Kinsley, Michael (michael kinsley sensage com) (1 replies)
Hello All -

I've been a lurker on this list for some time and thought I might
finally chime in.

Consider the following:
- You are receiving Web access logs from 2 different boxes (they
  stream to us over syslog)
- Each server is from a different vendor.
    * These happen to be vendors that both said: "Hey, we
      will follow the W3C standard for our access logs".

Can you devise a regular expression that can discriminate between vendor
x logs and vendor y?

Answer: Not without hard coding an IP Address or host name... and then
we would need to store this "Meta-Information" somewhere else... and
then we need a procedural language to go map and sort these results.

For all the beauty of regexes, they have the following shortcomings that
are critical in log analysis:

- They cannot consider anything that isn't contained directly in the
  message
- There is no branching
- Matching is binary (success | failure)
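
To make the first shortcoming concrete, here is a minimal sketch in
Python. The log line, the host names web01/web02 and the VENDOR_BY_HOST
table are all made up for illustration. One W3C-style pattern matches
both servers' output identically, so the vendor can only come from
metadata kept outside the message.

```python
import re

# Hypothetical W3C-style access line: date time client-ip method uri status
W3C_LINE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<c_ip>\S+) (?P<method>[A-Z]+) (?P<uri>\S+) (?P<status>\d{3})$"
)

# Assumed out-of-band "meta-information": which host runs which vendor.
VENDOR_BY_HOST = {"web01": "vendor_x", "web02": "vendor_y"}


def classify(source_host, message):
    """Return (vendor, fields). The vendor comes from the host, not the text."""
    match = W3C_LINE.match(message)
    if match is None:
        return None, {}
    return VENDOR_BY_HOST.get(source_host, "unknown"), match.groupdict()


# The same line text arrives from two different boxes.
line = "2007-09-14 21:41:00 10.0.0.5 GET /index.html 200"
vendor_a, fields_a = classify("web01", line)
vendor_b, fields_b = classify("web02", line)
```

The extracted fields are identical for both calls; only the host lookup
tells the two vendors apart.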

Erik G's suggestions are excellent. Imagine a system that could say:

- This looks 75% like a Cisco PIX message
    * we have %FAC-PRI-MNEMONIC we can extract
    * I recognize <SRC_IP> where
      <SRC_IP>:=src:\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}
        + since I found a src ip, it makes sense to look
          for <DST_IP>
- This falls under the IOS bucket (PIX would be the child of
  IOS in a tree structure)
- All messages from this server look alike (i.e. everything
  falls into the same bucket)
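
A rough sketch of that kind of graded, branching match, in Python. The
feature list, the sample message and the score() helper are invented
here, and the patterns follow the src: shorthand above rather than any
exact vendor format. It reports a fraction instead of a yes/no, and
only looks for a destination address once a source address was found.

```python
import re

IPV4 = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"

# (name, pattern) pairs. Illustrative only, not vendor-exact.
PIX_FEATURES = [
    ("tag", re.compile(r"%PIX-\d-\d+")),
    ("src_ip", re.compile(r"src:(" + IPV4 + ")")),
    ("dst_ip", re.compile(r"dst:(" + IPV4 + ")")),
    ("action", re.compile(r"\b(Deny|Built|Teardown)\b")),
]


def score(message, features):
    """Return (fraction of features found, extracted values)."""
    found = {}
    for name, pattern in features:
        # Branching: a destination is only worth looking for
        # after a source address has been recognized.
        if name == "dst_ip" and "src_ip" not in found:
            continue
        m = pattern.search(message)
        if m:
            # Use the capture group if the pattern has one, else the whole match.
            found[name] = m.group(m.lastindex or 0)
    return len(found) / len(features), found


# Made-up message: has a tag, an action and a source, but no destination.
msg = "%PIX-4-106023: Deny tcp src:10.1.1.1 by access-group"
confidence, extracted = score(msg, PIX_FEATURES)
```

With three of the four features present, this message scores 0.75
rather than simply failing to match.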

Furthermore, such an approach would be "trainable". As an admin I could
look at the system and say:

- Hey, you are seeing some things you don't quite recognize...
  Let me look.
- SuperAmazingLogAnalyzer, those are messages from a Catalyst
  6500 series switch.
    * it's like a PIX, only you will see %FWSM instead of
      %PIX
- That switch is running IOS version X.Y.Z
- Oh look, that Timestamp had an * in front of it... our FW
  admins aren't smart enough to sync to NTP :)
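
One way to picture the "trainable" part, as a sketch only: a tree of
buckets where teaching the system means adding a child node. The Bucket
class, the find_bucket() helper and the marker strings are hypothetical
names chosen for this example.

```python
class Bucket:
    """One node in a hypothetical classification tree."""

    def __init__(self, name, marker, parent=None):
        self.name = name
        self.marker = marker      # substring that must appear in the message
        self.children = []
        if parent is not None:
            parent.children.append(self)


def find_bucket(node, message):
    """Return the deepest bucket on a path whose markers all match."""
    if node.marker not in message:
        return None
    for child in node.children:
        hit = find_bucket(child, message)
        if hit is not None:
            return hit
    return node


ios = Bucket("IOS", "%")
pix = Bucket("PIX", "%PIX-", parent=ios)

unknown = "%FWSM-4-106023: Deny tcp src:10.1.1.1"
before = find_bucket(ios, unknown).name   # only the generic bucket matches

# "Training": the admin says these look like PIX, but tagged %FWSM.
fwsm = Bucket("FWSM", "%FWSM-", parent=ios)
after = find_bucket(ios, unknown).name
```

Before the admin steps in, the message lands in the generic IOS bucket;
afterwards it lands in the new FWSM bucket, with no change to the
existing PIX rule.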

You get the point.

In case you didn't notice from the email:
<DISCLAIMER>SenSageEmployee</DISCLAIMER>

-Michael

-----Original Message-----
From: loganalysis-bounces (at) loganalysis (dot) org
[mailto:loganalysis-bounces (at) loganalysis (dot) org] On Behalf Of E G
Sent: Friday, September 14, 2007 1:06 PM
To: Christina Noren
Cc: loganalysis (at) loganalysis (dot) org
Subject: Re: [logs] regexless parsing, again?

Totally. Recognizing a line as being, say, an ssh
syslog log line is a different matter than
"normalizing" that same data into a usable form on the
other end.

At the same time, once you know what a line is,
hopefully you've learned to "recognize" its type based
on some method which also helps you break the data
down to make it easier to normalize later.

Using XML for everything is one way to have those
pseudo structures made for you, but in its absence
there needs to be another method until some kind of
standard takes hold.

I tried to make a method that would break data into a
"header" token and a "context" token (the latter would
then be broken into sub tokens). Take syslog as an
easy example. All syslog messages should have the
standard syslog header on them. After this header is
the meat of the message, or "context". So if we can
recognize the header token, why do a linear search
with regex on the entire string to figure out what it
is (as is standard today) and then try to
normalize it based on what the regex captured? Just
classify it as a syslog message and send it down the
syslog branch to have the next token dealt with. Using
this tree method, you can reduce the search times
quite a bit.
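
As a rough illustration of the header/context split (not the code from
that project), here is a small Python sketch. It assumes a BSD-style
syslog header of "Mon DD HH:MM:SS host" with an optional <PRI> prefix.
The function names, the BRANCHES table and the sample line are made up.

```python
MONTHS = {"Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"}


def split_header(line):
    """Split a line into (header dict, context string) without any regex."""
    if line.startswith("<") and ">" in line:
        line = line.split(">", 1)[1]          # drop the optional <PRI>
    parts = line.split(None, 4)               # month, day, time, host, rest
    if len(parts) < 5 or parts[0] not in MONTHS:
        return None
    month, day, time, host, context = parts
    return {"timestamp": f"{month} {day} {time}", "host": host}, context


def handle_sshd(rest):
    return {"type": "sshd", "failed_login": rest.startswith("Failed password")}


def handle_unknown(rest):
    return {"type": "unknown"}


# Second-level branch, keyed on the first sub-token of the context.
BRANCHES = {"sshd": handle_sshd}


def parse(line):
    split = split_header(line)
    if split is None:
        return {"type": "not-syslog"}
    header, context = split
    tag, _, rest = context.partition(": ")
    program = tag.split("[", 1)[0]            # "sshd[1234]" -> "sshd"
    record = BRANCHES.get(program, handle_unknown)(rest)
    record.update(header)
    return record


result = parse("<38>Sep 14 13:06:01 mailhost sshd[1234]: "
               "Failed password for root from 10.0.0.9")
```

Each line is examined once at the header level and then handed to a
single branch, instead of being tried against every known pattern.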

I guess it's coding 101 really. A tree search is
faster than a linear search, and that's kind of what I
came up with.
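
For anyone who wants to see the tree idea in code, below is a minimal
sketch using a plain token trie in Python. It is an uncompressed trie
written for this example, not the structure from that project, and the
class names, labels and sample messages are invented.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # token -> TrieNode
        self.label = None    # classification, if a known pattern ends here


class TokenTrie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, tokens, label):
        """Register a leading token sequence and the bucket it maps to."""
        node = self.root
        for token in tokens:
            node = node.children.setdefault(token, TrieNode())
        node.label = label

    def classify(self, message):
        """Walk the message token by token; return the deepest label hit."""
        node = self.root
        best = None
        for token in message.split():
            node = node.children.get(token)
            if node is None:
                break
            if node.label is not None:
                best = node.label
        return best


trie = TokenTrie()
trie.add(["Failed", "password", "for"], "ssh-auth-failure")
trie.add(["Accepted", "password", "for"], "ssh-auth-success")
trie.add(["Connection", "closed", "by"], "ssh-disconnect")

kind = trie.classify("Failed password for root from 10.0.0.9 port 22")
```

The lookup cost depends on how many tokens the message shares with a
known path, not on how many patterns have been registered.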

For normalization, the only answer out there right now
is RegEx, especially if you want to add any kind of
intelligence to the data itself (i.e. "CODE 123"
translates to "PRINTER ON FIRE"). Adding in
intelligence is a big key, IMHO. I got tired of
looking up obscure Windows event codes, if you know
what I mean.
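
A tiny sketch of that "intelligence" step in Python: a regex pulls the
code out and a lookup table supplies the meaning. The CODE_MEANINGS
table and the normalize() function are hypothetical; real meanings
would have to come from vendor documentation.

```python
import re

# Hypothetical code table, seeded with the example above.
CODE_MEANINGS = {"123": "PRINTER ON FIRE"}

CODE = re.compile(r"\bCODE (\d+)\b")


def normalize(message):
    """Attach a human-readable meaning to a message carrying 'CODE <n>'."""
    m = CODE.search(message)
    if m is None:
        return {"raw": message}
    code = m.group(1)
    return {
        "raw": message,
        "code": code,
        "meaning": CODE_MEANINGS.get(code, "UNKNOWN CODE"),
    }


event = normalize("lp0: CODE 123 reported by device")
```

Unknown codes are kept and marked, so the table can be extended later
without losing the original data.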

Anyhow, I wanted to work on a tokenizing
format/engine, to replace the PCRE RegEx engine I had
been using, but my project never got that far.

And my disclaimer: I work at a SPAM company that
doesn't even collect its logs :-P

- Erik

--- Christina Noren <cfrln (at) cfrln (dot) com> wrote:

> I think you need to distinguish between the two different goals of
> parsing in order to have a productive discussion of this:
>
> 1) classifying log messages based on having a specific pattern -
> which the below approach does better than regexes - and which is
> realized somewhat similarly in Splunk's automatic event type
> classification feature.
> 2) pulling out and naming fields within the log data - which is also
> something where there are other possible approaches than regexes.
> However, even other approaches would necessarily rely on pattern
> matching of some sort in the absence of a self-describing log format
> whether XML, name/value pair, csv header, etc. (Splunk also guesses
> at fields for such well defined formats.) But the pattern matching
> could be less cryptic and smarter about the patterns that are in
> logs, such as Marcus mentions in his last post.
>
> Repeating Raffy's disclaimer, I also work at Splunk.
>
> Christina
>
> On Sep 13, 2007, at 5:12 PM, E G wrote:
>
> > Back when I worked at "another company" a few years
> > back I did a lot of research into this area.
> >
> > We looked at an approach that grouped logs together
> > based upon what we already knew about that type of log
> > source and how they are similar, rather than
> > "guessing" what each line was as it came in.
> >
> > This came about from doing quite a bit of statistical
> > analysis on raw log data. I noted quite a bit of
> > correlation from source to source (which in itself
> > isn't news), and this would allow us to
> > classify unknown data in some semi-intelligent way
> > and dump known entities in known "buckets".
> >
> > Working with some people who were much smarter than I,
> > I was able to create a reverse Patricia trie-like
> > structure. Think of it like when you're on your
> > BlackBerry and you're typing. It attempts to predict
> > the next letter and tries to complete the word you're
> > typing for you. The same logic can basically work in
> > reverse, where you use this trie structure to disassemble
> > a word, or string in our case. Once you reach an end
> > point on the trie, it leaves you with what the data
> > is, however you have decided to classify it.
> >
> > I hope that's understandable; I didn't want to write
> > out a book.
> >
> > Anyhow, my ideas didn't end up going anywhere. They
> > chose to stay with the RegEx "guessing" method, as
> > is the standard. I had a lot of the code I developed
> > after I left up on SourceForge for a while, but real
> > life took me away from it. I might be able to dig it
> > up if anyone is interested.
> >
> >
> > - Erik
> >
> >
> > --- "Marcus J. Ranum" <mjr (at) ranum (dot) com> wrote:
> >
> >> Anton Chuvakin wrote:
> >>> Anybody care to restart the discussion and see what the collective
> >>> wisdom of loganalysis can produce?
> >>
> >> I am coding on something regarding regexless parsing as we
> >> speak. ETA is unknown but certainly before Xmas. It will be
> >> open source but not GPL.
> >>
> >> mjr.
> >> _______________________________________________
> >> LogAnalysis mailing list
> >> LogAnalysis (at) loganalysis (dot) org
> >> http://www.loganalysis.org/mailman/listinfo/loganalysis

_______________________________________________
LogAnalysis mailing list
LogAnalysis (at) loganalysis (dot) org
http://www.loganalysis.org/mailman/listinfo/loganalysis




 
