<div dir="ltr">Thanks a lot for sharing. <div>One of the problems I am facing is not having enough real data. I can create simulated data, but my algorithm overfits to it.<br><div>The second problem is that I am not sure which factors (called features in ML terms) are useful for pattern detection.</div><div>Some of the factors I could think of are:</div><div>1. Memory used</div><div>2. CPU usage</div><div>3. Shared memory</div><div>4. vmstat counters</div><div>5. Message queue sizes</div><div><br></div><div>Regards,</div><div>Prathamesh</div><div><br></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Oct 9, 2019 at 2:28 PM Valdis Klētnieks <<a href="mailto:valdis.kletnieks@vt.edu">valdis.kletnieks@vt.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Wed, 09 Oct 2019 01:23:28 -0700, prathamesh naik said:<br>
> I want to work on a project that can predict kernel process<br>
> crashes or even user-space process crashes (or memory usage spikes) using<br>
> machine learning algorithms. <br>
<br>
This sounds like it's isomorphic to the Turing Halting Problem, and there are<br>
plenty of other good reasons to think that predicting a process crash is, in<br>
general, somewhere between "very difficult" and "impossible".<br>
<br>
Even "memory usage spikes" are going to be a challenge.<br>
<br>
Consider a program that's doing an in-memory sort. Your machine has 16 gig of<br>
memory, and 2 gig of swap. It's known that the sort algorithm requires 1.5G of<br>
memory for each gigabyte of input data.<br>
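Plugging in those numbers makes the break-even points concrete (a quick sketch of the arithmetic, using only the figures from the example above):

```python
# Figures from the example above (all hypothetical).
ram_gb = 16             # physical memory
swap_gb = 2             # swap space
mem_per_input_gb = 1.5  # sort needs 1.5 GB of memory per 1 GB of input

# Input larger than this exhausts RAM + swap entirely (likely crash/OOM).
max_input_gb = (ram_gb + swap_gb) / mem_per_input_gb
print(max_input_gb)        # 12.0

# Thrashing starts earlier, once the working set spills past RAM alone.
thrash_input_gb = ram_gb / mem_per_input_gb
print(round(thrash_input_gb, 2))   # 10.67
```

So the same program sails through a 10 GB input, thrashes at 11 GB, and dies outright past 12 GB; the outcome hinges entirely on the input size, which the predictor can't see.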
<br>
Does the system start to thrash, or crash entirely, or does the sort complete<br>
without issues? There's no way to make a prediction without knowing the size<br>
of the input data. And if you're dealing with something like <br>
<br>
grep &lt;regexp&gt; file | predictable-memory-sort<br>
<br>
where 'file' is a logfile *much* bigger than memory....<br>
<br>
You can see where this is heading...<br>
<br>
Bottom line: I'm pretty convinced that in the general case, you can't do much<br>
better than current monitoring systems already do: Look at free space, look at<br>
the free space trendline for the past 5 minutes or whatever, and issue an alert<br>
if the current trend indicates exhaustion in under 15 minutes.<br>
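That kind of trendline check fits in a few lines (a minimal illustration, not anyone's production tooling; the sample data, 60-second interval, and 15-minute threshold are made up):

```python
def minutes_until_exhaustion(free_mb_samples, interval_s=60):
    """Fit a least-squares line to recent free-memory samples and
    extrapolate to where free memory hits zero.
    Returns None if memory is flat or trending upward."""
    n = len(free_mb_samples)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(free_mb_samples) / n
    # Slope in MB per sample.
    slope = sum((x - mean_x) * (y - mean_y)
                for x, y in zip(xs, free_mb_samples)) / sum((x - mean_x) ** 2 for x in xs)
    if slope >= 0:
        return None  # not heading toward exhaustion
    intercept = mean_y - slope * mean_x
    samples_to_zero = -intercept / slope      # where the fitted line crosses zero
    samples_left = samples_to_zero - (n - 1)  # measured from the newest sample
    return samples_left * interval_s / 60

# Five samples, one per minute: free memory dropping ~400 MB/min.
samples = [4000, 3600, 3200, 2800, 2400]
eta = minutes_until_exhaustion(samples)
if eta is not None and eta < 15:
    print(f"ALERT: memory exhaustion predicted in {eta:.0f} minutes")
```

The interesting open question is not the extrapolation itself but how to pick the window and threshold, which is where the paragraph below comes in.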
<br>
Now, what *might* be interesting is seeing if machine learning across multiple<br>
events is able to suggest better values than 5 and 15 minutes, to provide a<br>
best tradeoff between issuing an alert early enough that a sysadmin can take<br>
action, and avoiding issuing early alerts that turn out to be false alarms.<br>
<br>
The problem there is that getting enough data on actual production systems<br>
will be difficult, because sysadmins usually don't leave sub-optimal configuration<br>
settings in place so you can gather data.<br>
<br>
And data gathered for machine learning on an intentionally misconfigured test<br>
system won't be applicable to other machines.<br>
<br>
Good luck, this problem is a lot harder than it looks....<br>
</blockquote></div>