<div dir="ltr">Thanks a lot for sharing. <div>One of the problems I am facing is not having enough real data. I can create simulated data, but my algorithm overfits to it.<br><div>The second problem is that I am not sure which factors (called features in ML terms) are useful for pattern creation.</div><div>Some of the factors I could think of were:</div><div>1. Memory used</div><div>2. CPU</div><div>3. shared memory</div><div>4. vmstat</div><div>5. message queue sizes</div><div><br></div><div>Regards,</div><div>Prathamesh</div><div><br></div><div><br></div><div><br></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Oct 9, 2019 at 2:28 PM Valdis Klētnieks &lt;<a href="mailto:valdis.kletnieks@vt.edu">valdis.kletnieks@vt.edu</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Wed, 09 Oct 2019 01:23:28 -0700, prathamesh naik said:<br>

>             I want to work on project which can predict kernel process<br>

> crash or even user space process crash (or memory usage spikes) using<br>

> machine learning algorithms. <br>

<br>

This sounds like it's isomorphic to the Turing Halting Problem, and there are<br>

plenty of other good reasons to think that predicting a process crash is, in<br>

general, somewhere between "very difficult" and "impossible".<br>

<br>

Even "memory usage spikes" are going to be a challenge.<br>

<br>

Consider a program that's doing an in-memory sort. Your machine has 16 gig of<br>

memory, and 2 gig of swap.  It's known that the sort algorithm requires 1.5G of<br>

memory for each gigabyte of input data.<br>
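<br>
To make those numbers concrete, here is a rough sketch of the arithmetic. It assumes the 1.5x ratio holds exactly and that nothing else on the machine uses memory, both of which are simplifications; the function name is made up for illustration.<br>

```python
def sort_outcome(input_gb, ram_gb=16.0, swap_gb=2.0, ratio=1.5):
    """Classify a sort by where its working set lands (illustrative only)."""
    need_gb = input_gb * ratio           # memory the sort needs for this input
    if need_gb <= ram_gb:
        return "fits in RAM"
    if need_gb <= ram_gb + swap_gb:
        return "spills into swap, likely thrashing"
    return "exceeds RAM plus swap"

# Hypothetical input sizes, chosen to land in each of the three regions.
for size in (8, 11, 13):
    print(size, "GB input:", sort_outcome(size))
```

The whole calculation hinges on input_gb, which is exactly the value a monitoring tool does not have in advance.<br>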

<br>

Does the system start to thrash, or crash entirely, or does the sort complete<br>

without issues?  There's no way to make a prediction without knowing the size<br>

of the input data.  And if you're dealing with something like <br>

<br>

grep &lt;regexp&gt; file | predictable-memory-sort<br>

<br>

where 'file' is a logfile *much* bigger than memory....<br>

<br>

You can see where this is heading...<br>

<br>

Bottom line:  I'm pretty convinced that in the general case, you can't do much<br>

better than current monitoring systems already do: Look at free space, look at<br>

the free space trendline for the past 5 minutes or whatever, and issue an alert<br>

if the current trend indicates exhaustion in under 15 minutes.<br>
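<br>
A minimal sketch of that kind of trendline check, assuming free-space samples are already being collected as (minute, free_bytes) pairs. The function names and the sample values are hypothetical, and a plain least-squares line stands in for whatever a real monitoring system actually fits.<br>

```python
def minutes_to_exhaustion(samples):
    """Estimate minutes until free space reaches zero.

    samples: list of (minute, free_bytes) pairs covering the look-back
    window. Returns None when free space is flat or growing.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_f = sum(f for _, f in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    # Ordinary least-squares slope and intercept of free space over time.
    slope = sum((t - mean_t) * (f - mean_f) for t, f in samples) / var_t
    if slope >= 0:
        return None
    intercept = mean_f - slope * mean_t
    zero_t = -intercept / slope          # time at which the fitted line hits zero
    return max(0.0, zero_t - samples[-1][0])

def should_alert(samples, horizon_min=15):
    """Alert when the fitted trend predicts exhaustion within horizon_min."""
    eta = minutes_to_exhaustion(samples)
    return eta is not None and eta < horizon_min

# Made-up 5-minute window: free space dropping by 100 units per minute.
window = [(0, 1000), (1, 900), (2, 800), (3, 700), (4, 600)]
print(minutes_to_exhaustion(window), should_alert(window))
```

The two tunable numbers are the length of the window passed in and horizon_min, which correspond to the 5 and 15 minutes above.<br>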

<br>

Now, what *might* be interesting is seeing if machine learning across multiple<br>

events is able to suggest better values than 5 and 15 minutes, to provide a<br>

best tradeoff between issuing an alert early enough that a sysadmin can take<br>

action, and avoiding issuing early alerts that turn out to be false alarms.<br>
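<br>
One way to frame that tuning is sketched below. It is a brute-force grid search rather than a real learning algorithm, and everything in it is an assumption for illustration: the two synthetic traces, the candidate values, the 5-minute lead a sysadmin is assumed to need, and the relative cost of a missed event versus a false alarm.<br>

```python
from itertools import product

def eta_minutes(window):
    """Minutes to zero from the latest sample, using the least-squares
    slope over evenly spaced (one per minute) free-space samples."""
    n = len(window)
    mean_t = (n - 1) / 2
    mean_f = sum(window) / n
    var_t = sum((t - mean_t) ** 2 for t in range(n))
    slope = sum((t - mean_t) * (f - mean_f) for t, f in enumerate(window)) / var_t
    if slope >= 0:
        return None
    return window[-1] / -slope

def first_alert(trace, window_min, horizon_min):
    """Minute index at which the alert would first fire, or None."""
    for t in range(window_min - 1, len(trace)):
        eta = eta_minutes(trace[t - window_min + 1 : t + 1])
        if eta is not None and eta < horizon_min:
            return t
    return None

def cost(traces, window_min, horizon_min, lead_needed=5, miss_cost=10, false_cost=1):
    """Total penalty over all traces: late or absent alerts count as misses,
    alerts on traces that never ran out count as false alarms."""
    total = 0
    for samples, exhausted_at in traces:
        fired = first_alert(samples, window_min, horizon_min)
        if exhausted_at is None:
            total += false_cost if fired is not None else 0
        elif fired is None or exhausted_at - fired < lead_needed:
            total += miss_cost
    return total

def best_settings(traces, windows=(2, 5, 10), horizons=(5, 15, 30)):
    """Pick the (window, horizon) pair with the lowest total cost."""
    return min(product(windows, horizons), key=lambda wh: cost(traces, *wh))

# Synthetic traces, NOT real measurements: (samples, minute of exhaustion or None).
traces = [
    ([100 - 2 * t for t in range(51)], 50),          # steady leak, runs out at minute 50
    ([100] * 10 + [95, 90, 85] + [100] * 10, None),  # brief dip that recovers
]
print(best_settings(traces))
```

The search itself is the easy part. It is only as good as the traces fed to it, so it needs recorded events from systems that resemble the ones it will run on.<br>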

<br>

The problem there is that getting enough data on actual production systems<br>

will be difficult, because sysadmins usually don't leave sub-optimal configuration<br>

settings in place so you can gather data.<br>

<br>

And data gathered for machine learning on an intentionally misconfigured test<br>

system won't be applicable to other machines.<br>

<br>

Good luck, this problem is a lot harder than it looks....<br>

</blockquote></div>