LogAnalysis
Re: [logs] regexless parsing, again? Sep 13 2007 10:51PM
Marcus J. Ranum (mjr ranum com) (1 replies)
Re: [logs] regexless parsing, again? Sep 14 2007 12:12AM
E G (bronc94583 yahoo com) (1 replies)
Re: [logs] regexless parsing, again? Sep 14 2007 07:30PM
Christina Noren (cfrln cfrln com) (1 replies)
Re: [logs] regexless parsing, again? Sep 14 2007 08:05PM
E G (bronc94583 yahoo com) (1 replies)
RE: [logs] regexless parsing, again? Sep 14 2007 09:41PM
Kinsley, Michael (michael kinsley sensage com) (1 replies)
Re: [logs] regexless parsing, again? Sep 14 2007 10:40PM
Christina Noren (cfrln cfrln com) (1 replies)
Re: Re: [logs] regexless parsing, again? Sep 15 2007 12:07AM
Michael Kinsley (michael kinsley sensage com) (3 replies)
Hello Christina -

Thank you for the excellent questions.

> why do you care which vendor created the web access log if it's the
> same format?

ANSWER1: There are a number of reasons.
- I want to separate data based on the device type.
- Large-scale customers often have device-specific admins.
- Large-scale customers often have region-specific admins (e.g. our
admins from Singapore should not see or have access to data coming
through New York or Zurich. Pretend you are a large multinational bank.)
- Not all sources are created equal. Some are more equal than others
(see Animal Farm for more information).
* If we were managing compliance for a large multinational,
segregating data based on its criticality or asset group would often
be desirable.

ANSWER2: It was a simplified example, designed to illustrate a point.

> and why on earth would you build a system that accepts syslog input
> without recording and being able to use the originating host's IP
> in other logic?

ANSWER: You wouldn't. You should always be able to use the host / IP
address - this is important information.

Please correct me if I am wrong, but we were talking about "regexless
parsing" and some of the pitfalls of using regular expressions.
I'm not thinking about how to solve the problem of managing logs for
an SMB. We are all capable of rolling our own solutions. We are
examining fundamental elements of log analysis and of transforming data
into information.

So I return to my original point:
- A REGEX lacks the ability to differentiate between two things that
look very similar.
- A REGEX can only answer YES or NO.
- A REGEX cannot branch or make decisions.
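
A minimal sketch of the point above, assuming a simplified W3C-style
line layout. The pattern W3C_LINE, the table VENDOR_BY_SOURCE, the
addresses, and the function classify are hypothetical names invented
for illustration. The regex can only say whether a line looks like an
access log; telling the vendors apart needs metadata and procedural
code outside the pattern.

```python
import re

# Hypothetical pattern for a simplified W3C-style access log line:
# date, time, client IP, method, URI, status.
W3C_LINE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<c_ip>\S+) (?P<method>[A-Z]+) (?P<uri>\S+) (?P<status>\d{3})$"
)

# Hypothetical out-of-band metadata: which vendor sits behind each
# syslog source address. This is the "Meta-Information" stored elsewhere.
VENDOR_BY_SOURCE = {"10.0.0.1": "vendor_x", "10.0.0.2": "vendor_y"}


def classify(source_ip, line):
    """Return the vendor for a line, or None if it is not an access log."""
    if W3C_LINE.match(line) is None:
        return None  # the regex's only contribution: a yes/no answer
    # The branching happens here, in ordinary code, not in the regex.
    return VENDOR_BY_SOURCE.get(source_ip, "unknown")


# The same line text arriving from two different boxes.
line = "2007-09-14 22:40:01 192.0.2.7 GET /index.html 200"
```

Both vendors' lines match the same pattern, so classify() tells them
apart only by looking up the originating host.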

* A system that addressed the above weaknesses of REGEX would be a
great step forward.
* The tree-like learning system proposed by Ginorio is an example of
such a system. Such trees are also used heavily and successfully in
areas like bioinformatics. Take a look at suffix trees if you are
really interested.
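
To make the idea of a branching tree concrete, here is a rough sketch
of a token prefix tree (a trie keyed on whitespace-separated tokens).
It is not Ginorio's proposal and it is not a suffix tree; it is a much
simpler illustration, and the names Node, WILDCARD, insert, and
classify, along with the sample patterns, are assumptions made up for
this example.

```python
class Node:
    def __init__(self):
        self.children = {}  # token -> Node
        self.label = None   # set on the node that ends a known pattern


WILDCARD = "*"  # hypothetical marker meaning "any single token"


def insert(root, tokens, label):
    """Add one token pattern to the tree and label its final node."""
    node = root
    for tok in tokens:
        node = node.children.setdefault(tok, Node())
    node.label = label


def classify(root, line):
    """Walk the tree one token at a time; return a label or None."""
    node = root
    for tok in line.split():
        # Prefer an exact token match, then fall back to the wildcard.
        # There is no backtracking in this simple sketch.
        nxt = node.children.get(tok) or node.children.get(WILDCARD)
        if nxt is None:
            return None
        node = nxt
    return node.label


root = Node()
insert(root, ["sshd", "Accepted", "password", "for", WILDCARD], "ssh_login_ok")
insert(root, ["sshd", "Failed", "password", "for", WILDCARD], "ssh_login_failed")
```

Each token chooses a branch, so two lines that look very similar end
up with different labels as soon as they diverge, and the result is a
label rather than a bare yes or no.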

Cheers.

-Michael

On Sep 14, 2007, at 3:40 PM, Christina Noren wrote:

> why do you care which vendor created the web access log if it's the
> same format?
>
> and why on earth would you build a system that accepts syslog input
> without recording and being able to use the originating host's IP
> in other logic?
>
> On Sep 14, 2007, at 2:41 PM, Kinsley, Michael wrote:
>
>> Consider the following:
>> - You are receiving Web access logs from 2 different boxes (they
>> stream to us over syslog)
>> - Each server is from a different vendor.
>> *These happen to be vendors that both said: "Hey, we
>> will follow the W3C standard for our access logs".
>>
>> Can you devise a regular expression that can discriminate between
>> vendor x logs and vendor y?
>>
>> Answer: Not without hard coding an IP Address or host name... and
>> then we would need to store this "Meta-Information" somewhere
>> else... and then we need a procedural language to go map and sort
>> these results.
>

_______________________________________________
LogAnalysis mailing list
LogAnalysis (at) loganalysis (dot) org
http://www.loganalysis.org/mailman/listinfo/loganalysis

Re: Re: [logs] regexless parsing, again? Sep 15 2007 05:59AM
Tom Le (dottom gmail com)
Re: Re: [logs] regexless parsing, again? Sep 15 2007 05:33AM
E G (bronc94583 yahoo com)
Re: Re: [logs] regexless parsing, again? Sep 15 2007 02:25AM
cfrln cfrln com


 
