LogAnalysis
Re: Re: [logs] regexless parsing, again? Sep 15 2007 08:40PM
Marcus J. Ranum (mjr ranum com) (1 replies)
Re: Re: [logs] regexless parsing, again? Sep 16 2007 05:42AM
Tom Le (dottom gmail com) (1 replies)
On 9/15/07, Marcus J. Ranum <mjr (at) ranum (dot) com [email concealed]> wrote:

| Tom Le wrote:
| >No you don't. You only have to try the # of times until you match.
|
| If you do that, then you have no way of detecting multiple matches
| on a single input line. Which raises the question of "how do you know
| which one is right?" If you don't, then you have to worry about the
| ordering of your match rules and that's absolute in(s)anity.

Good point - this is implementation specific. We deal with this at rule
creation, guaranteeing that each regex hit will match a unique log event.
There are some corner cases but there are also ways to deal with them. The
only time when a log message could match multiple regex rules (in our case)
is when it's intentional (i.e. setting priority of one rule over the other;
to distinguish between similar log events between different app/OS versions
which match legacy regex rules; etc). I'd love to discuss specifics if
anyone has a problem they can share. We tend to not discuss specifics on
this list (a little ironic considering the nature of log analysis).

I am advocating an expert systems approach to defining regex rules. Not all
applications can do that and if you have to deal with more "vagueness" in
your regex parsing, then most definitely, you have to address the use case
of a single log message matching multiple rules (or multiple paths within a
tree).
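
To make the multiple-match case concrete, here is a minimal Python sketch
of checking a line against every rule and resolving intentional overlap by
priority. The rule names, patterns, and priority values are hypothetical
and chosen only for illustration; they are not the rules discussed in this
thread.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str            # hypothetical rule identifier
    pattern: re.Pattern  # compiled regex for this rule
    priority: int = 0    # higher wins when the overlap is intentional


# Hypothetical rules, for illustration only.
RULES = [
    Rule("sshd_failed_login",
         re.compile(r"sshd\[\d+\]: Failed password for (?P<user>\S+)"), 10),
    Rule("sshd_generic", re.compile(r"sshd\[\d+\]: "), 1),
    Rule("su_session_opened",
         re.compile(r"su: .*session opened for user (?P<user>\S+)"), 10),
]


def all_matches(line, rules=RULES):
    """Return every rule that matches the line (no early exit)."""
    return [r for r in rules if r.pattern.search(line)]


def classify(line, rules=RULES):
    """Return the single winning rule, or None if nothing matches.

    Raises ValueError when two rules of equal priority match, since that
    overlap was presumably not intended.
    """
    hits = all_matches(line, rules)
    if not hits:
        return None
    top = max(r.priority for r in hits)
    winners = [r for r in hits if r.priority == top]
    if len(winners) > 1:
        names = [r.name for r in winners]
        raise ValueError(f"ambiguous rules for line: {names}")
    return winners[0]
```

Trying every rule costs more than stopping at the first hit, but it is what
makes an unintended overlap visible instead of leaving the outcome to rule
ordering.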

| If you are so fortunate as to have log messages that are matchable
| in that way, yes. Snort log messages, for example, can be almost all
| matched by 3 or 4 regexes. But try HP printers. :) Just getting the
| quoting rules for HP printer log messages is enough to make a
| rational person think seriously about putting a gun to their head.

Agreed... if all vendor log messages were as easy as IDS and (most) firewall
messages, there wouldn't be a need for more advanced approaches. <Insert
some witty comment about job security here.>

The hard work will always be in defining the "ruleset" whether it is pure
regex or some other parsing/logic combination. The trick is building
something that has high performance and is easily maintainable. I'm
preaching to the choir here... but you all understand that you can't just
identify all 5000 possible Unix OS messages in one sitting. If it's an
iterative approach, how do you leverage work that has already been done, not
clobber legacy "rules" or "logic", deal with subtle changes from vendors,
etc.?
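
One possible way to approach the "don't clobber legacy rules" part, sketched
in Python under stated assumptions: each rule carries sample lines it must
keep matching, and a new rule is rejected if any existing sample stops
landing on exactly its own rule. The ruleset layout, rule names, and sample
lines are hypothetical. Intentional overlaps (priorities) are not handled
here and would need an explicit allowance.

```python
import re

# Hypothetical ruleset: rule name -> (pattern, samples the rule must keep matching).
RULESET = {
    "sshd_failed_login": (
        re.compile(r"sshd\[\d+\]: Failed password for \S+"),
        ["sshd[411]: Failed password for root from 10.0.0.5 port 22 ssh2"],
    ),
    "cron_job_start": (
        re.compile(r"CRON\[\d+\]: \(\S+\) CMD "),
        ["CRON[902]: (root) CMD (run-parts /etc/cron.hourly)"],
    ),
}


def matching_rules(line, ruleset):
    """Return the names of all rules whose pattern matches the line."""
    return [name for name, (pattern, _samples) in ruleset.items()
            if pattern.search(line)]


def regressions(ruleset):
    """List (sample, expected_rule, actual_rules) for each sample that no
    longer matches exactly its own rule."""
    broken = []
    for name, (_pattern, samples) in ruleset.items():
        for sample in samples:
            actual = matching_rules(sample, ruleset)
            if actual != [name]:
                broken.append((sample, name, actual))
    return broken


def add_rule(ruleset, name, pattern, samples):
    """Return a new ruleset with the rule added; refuse the change if it
    breaks any existing or new sample."""
    candidate = dict(ruleset)
    candidate[name] = (re.compile(pattern), samples)
    broken = regressions(candidate)
    if broken:
        raise ValueError(f"rule {name!r} breaks samples: {broken}")
    return candidate
```

The same sample corpus also helps with subtle vendor changes: a sample
taken from the new app/OS version that matches no rule, or the wrong one,
shows up as a regression.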

| > Another example: use regex to define your "matching rules" and then
| > convert from regex to DFA at implementation.
|
| Yes; that's "industrial strength lipstick."

Heh. Just making sure we didn't trivialize the fact that one can still
maintain the more traditional ways of building regex rules and still achieve
significant performance gains. Scale should be mentioned here. If you go
from parsing 1000 msgs/sec => 10,000 msgs/sec that might be great for you,
but insignificant for others. YMMV.
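
A rough Python illustration of keeping rules written as ordinary regexes
while changing what executes them: the rules are combined into a single
compiled alternation so each line is scanned once instead of once per rule.
Note that Python's re module is a backtracking engine, not a DFA, so this
only shows the "same rule definitions, different matcher" idea; an actual
regex-to-DFA conversion would need an automaton-based engine. Rule names
and patterns are hypothetical.

```python
import re

# Hypothetical rules written as ordinary regexes. Only non-capturing
# constructs are used inside, so the outer named group identifies the rule.
RULES = {
    "snort_alert": r"snort\[\d+\]: \[\d+:\d+:\d+\] ",
    "sshd_failed_login": r"sshd\[\d+\]: Failed password for \S+",
    "cron_job_start": r"CRON\[\d+\]: \(\S+\) CMD ",
}

# One alternation: (?P<snort_alert>...)|(?P<sshd_failed_login>...)|...
COMBINED = re.compile(
    "|".join(f"(?P<{name}>{pat})" for name, pat in RULES.items())
)


def classify(line):
    """Return the name of the rule that matched, or None."""
    m = COMBINED.search(line)
    return m.lastgroup if m else None
```

Whether this gains anything depends on the engine and the ruleset size, so
it is worth measuring msgs/sec before and after on real data.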

| I'm trying to understand your point. If I can summarize it, it appears
| to be: "No. Marcus, you're wrong. Regexes CAN still be used even
| though using them is awkward, brain-damaging, and ugly. If you are
| stubborn enough, there is no need to try to do better."

More like: "Marcus, you should separate discussion of regexes vs. other
parsing approaches into separate categories: performance, initial ruleset
development cost, and on-going maintenance."

Each discussion has its pros and cons with different cost(x) *
complexity(y) functions depending on what you're doing and the size of your
rulesets. I was just trying to explore a deeper level of discussion than
the usual 'regexes suck' or 'PCRE performance sucks' or 'maintaining 100,000
rules is ugly' type discussions.

Note, however, that I will reserve the right to use parts of your above
quote in the future. :)

| I'm guessing you must love Microsoft Windows, too.

Other than the fact that the constant stream of security vulnerabilities
keeps us in business, no.

Tom
_______________________________________________
LogAnalysis mailing list
LogAnalysis (at) loganalysis (dot) org [email concealed]
http://www.loganalysis.org/mailman/listinfo/loganalysis

RE: Re: [logs] regexless parsing, again? Sep 18 2007 03:06AM
Desai, Ashish (Ashish Desai fmr com) (1 replies)
Re: Re: [logs] regexless parsing, again? Sep 18 2007 10:24AM
Andrew Hay (andrewsmhay gmail com)


 

Copyright 2010, SecurityFocus