Focus on IDS
Re: Intrusion Detection Evaluation Datasets
On Mar 12, 2009, at 8:40 AM, Zow Terry Brugger wrote:
> I see a lot of people saying (correctly) that advanced (non-signature
> based) NIDS can't be researched until we have good evaluation
> datasets, and I see a lot of people ignoring them and doing it anyway.
> Is anyone (else) actually working on fixing the data problem?
There are a number of things about the framing of this discussion that
are bugging me (I come at this from the perspective of having spent
quite a bit of time on both the research and the commercial sides of
the field).
For one, the nature of the intrusion detection problem is very
dynamic. Ten+ years ago, the biggest problem was interactive
attacks. Five years ago, the biggest headache for organizations was
automated random scanning (RS) worms. Today, RS worms have become much
less of a big deal, and most of the action is attacks on clients
primarily via the web, and the resulting remote control of systems via
bots. These are very different problems requiring pretty different
approaches. And in another five years, I'm sure the main problem will
be something else again. So the main nuisances on the wire keep
changing, and any dataset is necessarily going to get stale very
quickly. In particular, quite a lot of staleness will set in between
a hypothetical graduate student starting and finishing a
thesis.
Secondly, I think there's an assumption lurking implicitly in the
search for datasets that the appropriate focus for research is the
inference algorithm, much as the machine learning community does:
get a fixed data set, and then try all kinds of inference algorithms
to see what works best. For our problem set, I don't think that's a
great way of doing things. For us, the main focus is "What are the
bad guys doing now?" and "What features do we need to detect what they
are now doing?" Usually, if you have good features with high
discrimination, most algorithms can be tweaked to do ok. If you don't
have good features, no inference algorithm will save you. And if you
have good features today, they'll be a lot less useful in a couple of
years and new ones will have to be invented.
I think there's a lot of contribution that researchers can continue to
make in this field. But you can't think of it as
discovering timeless principles or something - this is much too
applied a field. It's about figuring out what's happening on the wire
*now*, and what can be done about it.
So forget looking for a dataset. Look for a wire. Do whatever it
takes to get your institution to let you sniff the egress link - it's
just about guaranteed to have plenty of attacks on it. Build, or
adapt, some software to look at the packets with respect to some
problem that interests you and that seems like a currently rising
challenge. Spend a lot of time manually poring over the packets to
figure out what is going on, and label your own data. You need to get
your hands dirty. If you look at the most influential, highly cited
researchers (Todd Heberlein, Vern Paxson, etc.), their influential
contributions were always driven by actually trying to detect attacks
on real networks. In the end, intrusion detection is about detecting
intrusions, just like the name says. Any amount of theoretical or
algorithmic sophistication is a waste of time unless it directly
contributes to that goal, and no amount of sophistication will be very
exciting if it only improves the detection of five-year-old attacks
(this is not to say that technical sophistication is not required for
current problems - I believe it is).
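The "build, or adapt, some software to look at the packets" step can start very small. What follows is only a minimal sketch, under some loud assumptions: the capture is a classic libpcap file (not pcapng) with microsecond timestamps, the link type is Ethernet without VLAN tags, and only IPv4 is of interest. The function names `read_pcap_records` and `tally_flows` are made up for this illustration. It does nothing clever; it just counts packets per (source, destination, protocol) so there is a starting point for deciding which traffic to pore over and label by hand.

```python
import socket
import struct
from collections import Counter


def read_pcap_records(data):
    """Yield raw packet bytes from a classic pcap file held in memory."""
    magic = struct.unpack("<I", data[:4])[0]
    if magic == 0xA1B2C3D4:
        endian = "<"   # file was written little-endian
    elif magic == 0xD4C3B2A1:
        endian = ">"   # file was written big-endian
    else:
        raise ValueError("not a classic microsecond-resolution pcap file")
    offset = 24  # skip the 24-byte global header
    while offset + 16 <= len(data):
        # Per-record header: ts_sec, ts_usec, captured length, original length
        _sec, _usec, incl_len, _orig_len = struct.unpack(
            endian + "IIII", data[offset:offset + 16])
        offset += 16
        yield data[offset:offset + incl_len]
        offset += incl_len


def tally_flows(data):
    """Count packets per (src_ip, dst_ip, ip_protocol), IPv4 over Ethernet only."""
    counts = Counter()
    for pkt in read_pcap_records(data):
        if len(pkt) < 34:  # 14-byte Ethernet header + 20-byte minimal IPv4 header
            continue
        ethertype = struct.unpack("!H", pkt[12:14])[0]
        if ethertype != 0x0800:  # skip anything that is not IPv4
            continue
        proto = pkt[23]                      # IPv4 protocol field
        src = socket.inet_ntoa(pkt[26:30])   # IPv4 source address
        dst = socket.inet_ntoa(pkt[30:34])   # IPv4 destination address
        counts[(src, dst, proto)] += 1
    return counts
```

Something like `tally_flows(open("capture.pcap", "rb").read())` would be the typical call, where `capture.pcap` is a placeholder for whatever your own sniffer wrote out. The real work, deciding what each of those flows actually is, stays manual.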
I think the problem of producing regular timely datasets that can be
safely published is probably just about intractable, even if one of
the funding agencies were to step up to try and fill the shoes DARPA
long ago left behind. Synthetic datasets would not be that
interesting, and since most attacks are now inside packet content, the
challenge of reliably anonymizing the data while not affecting the
traffic materially would be just about impossible (what algorithm is
going to sanitize every single web developer's cookie format, for
example? How could one be sure that obfuscated JavaScript didn't
contain any personal information?).
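To make the asymmetry concrete, here is a rough sketch of the easy half of sanitization. It assumes a packet has already been reduced to a dict with `src`, `dst` and `payload` keys (a hypothetical record layout, as are the function names), and it swaps addresses for keyed-hash pseudonyms. This is plain HMAC truncation, not a prefix-preserving scheme. The point is the last field: the header side is a few lines, while the payload passes through untouched because there is no general rule to apply to it.

```python
import hashlib
import hmac
import socket


def pseudonymize_ipv4(addr, key):
    """Map a dotted-quad address to a stable pseudonym via a keyed hash."""
    digest = hmac.new(key, socket.inet_aton(addr), hashlib.sha256).digest()
    # Keep the first 4 bytes so the result still looks like an IPv4 address.
    return socket.inet_ntoa(digest[:4])


def sanitize(record, key):
    """Pseudonymize header fields; the payload is returned unchanged."""
    return {
        "src": pseudonymize_ipv4(record["src"], key),
        "dst": pseudonymize_ipv4(record["dst"], key),
        # Nothing here knows which bytes of an arbitrary cookie or script
        # are personal, so the content goes out exactly as it came in.
        "payload": record["payload"],
    }
```

The same address always maps to the same pseudonym under one key, so flows stay linkable, but a session cookie in `payload` is exactly as identifying after `sanitize` as before it.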
Stuart Staniford.