Predicting Process crash / Memory utilization using machine learning

prathamesh naik prathamesh.naik20 at gmail.com
Wed Oct 9 19:40:45 EDT 2019


Thanks a lot for sharing.
One of the problems I am facing is not having enough real data. I can
generate simulated data, but my model ends up overfitting to it.
The second problem is that I am not sure which factors (called features in
ML terms) are useful for learning the patterns.
Some of the factors I could think of were (a rough sampling sketch follows
the list):
1. Memory used
2. CPU
3. shared memory
4. vmstat
5. message queue sizes
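
To make those concrete, here is a rough sketch of how I was thinking of
sampling them on Linux. It assumes the psutil package plus the usual /proc
files (/proc/vmstat and /proc/sysvipc/msg); collect_sample() is just a
placeholder name, not code I already have running:

    import time
    import psutil  # third-party package, assumed installed

    def read_vmstat():
        """Parse /proc/vmstat into a dict of counters."""
        stats = {}
        with open("/proc/vmstat") as f:
            for line in f:
                key, value = line.split()
                stats[key] = int(value)
        return stats

    def msg_queue_bytes():
        """Sum the bytes currently queued in all System V message queues."""
        total = 0
        try:
            with open("/proc/sysvipc/msg") as f:
                header = f.readline().split()
                cbytes = header.index("cbytes")
                for line in f:
                    total += int(line.split()[cbytes])
        except FileNotFoundError:
            pass  # SysV IPC not exposed on this kernel
        return total

    def collect_sample():
        """One feature vector per sampling interval."""
        vm = psutil.virtual_memory()
        vmstat = read_vmstat()
        return {
            "ts": time.time(),
            "mem_used": vm.used,
            "mem_available": vm.available,
            "cpu_percent": psutil.cpu_percent(interval=None),
            "shared_mem": getattr(vm, "shared", 0),  # Linux-only field
            "pgfault": vmstat.get("pgfault", 0),
            "pswpin": vmstat.get("pswpin", 0),
            "pswpout": vmstat.get("pswpout", 0),
            "msgq_bytes": msg_queue_bytes(),
        }

Sampling something like this every few seconds would at least give me a real
time series to experiment with instead of simulated data.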

Regards,
Prathamesh




On Wed, Oct 9, 2019 at 2:28 PM Valdis Klētnieks <valdis.kletnieks at vt.edu>
wrote:

> On Wed, 09 Oct 2019 01:23:28 -0700, prathamesh naik said:
> >             I want to work on a project which can predict kernel process
> > crashes or even user space process crashes (or memory usage spikes) using
> > machine learning algorithms.
>
> This sounds like it's isomorphic to the Turing halting problem, and there's
> plenty of other good reasons to think that predicting a process crash is, in
> general, somewhere between "very difficult" and "impossible".
>
> Even "memory usage spikes" are going to be a challenge.
>
> Consider a program that's doing an in-memory sort.  Your machine has 16 GB
> of memory and 2 GB of swap.  It's known that the sort algorithm requires
> 1.5 GB of memory for each gigabyte of input data.
>
> Does the system start to thrash, crash entirely, or does the sort complete
> without issues?  There's no way to make a prediction without knowing the
> size of the input data: at 12 GB of input the sort already needs 18 GB,
> which is all the RAM plus swap combined.  And if you're dealing with
> something like
>
> grep <regexp> file | predictable-memory-sort
>
> where 'file' is a logfile *much* bigger than memory....
>
> You can see where this is heading...
>
> Bottom line:  I'm pretty convinced that in the general case, you can't do
> much better than current monitoring systems already do: look at free space,
> look at the free-space trendline for the past 5 minutes or whatever, and
> issue an alert if the current trend indicates exhaustion in under 15 minutes.
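>
> (Purely as an illustration of what I mean, here's an untested Python sketch
> of that check; the 5- and 15-minute values are the same arbitrary constants,
> and the function names are made up:)
>
>     import time
>     from collections import deque
>
>     WINDOW = 5 * 60      # look back over the last 5 minutes of samples
>     HORIZON = 15 * 60    # alert if exhaustion is predicted within 15 minutes
>
>     samples = deque()    # (timestamp, free_bytes) pairs
>
>     def record(free_bytes, now=None):
>         """Keep a sliding window of free-memory samples."""
>         now = time.time() if now is None else now
>         samples.append((now, free_bytes))
>         while samples and now - samples[0][0] > WINDOW:
>             samples.popleft()
>
>     def seconds_until_exhaustion():
>         """Fit a least-squares line to the window and extrapolate to zero.
>         Returns None if free memory is flat or growing."""
>         if len(samples) < 2:
>             return None
>         n = len(samples)
>         mean_t = sum(t for t, _ in samples) / n
>         mean_f = sum(f for _, f in samples) / n
>         num = sum((t - mean_t) * (f - mean_f) for t, f in samples)
>         den = sum((t - mean_t) ** 2 for t, _ in samples)
>         slope = num / den if den else 0.0   # bytes per second
>         if slope >= 0:
>             return None
>         return samples[-1][1] / -slope
>
>     def should_alert():
>         eta = seconds_until_exhaustion()
>         return eta is not None and eta < HORIZON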
>
> Now, what *might* be interesting is seeing if machine learning across
> multiple events is able to suggest better values than 5 and 15 minutes, to
> provide the best tradeoff between issuing an alert early enough that a
> sysadmin can take action, and avoiding early alerts that turn out to be
> false alarms.
>
> The problem there is that getting enough data on actual production systems
> will be difficult, because sysadmins usually don't leave sub-optimal
> configuration settings in place so you can gather data.
>
> And data gathered for machine learning on an intentionally misconfigured
> test system won't be applicable to other machines.
>
> Good luck, this problem is a lot harder than it looks....
>

