Recovering Linux system from hung state via software

Wed Dec 4 03:24:59 EST 2013

On Wed, Dec 4, 2013 at 4:13 PM, Mandeep Sandhu
<mandeepsandhu.chd at gmail.com> wrote:

> > Assume one mother process is monitoring 10 child processes. Inside each
> > child process, simply set up a periodic (e.g., every 5 seconds)
> > mechanism to toggle a binary variable through some IPC means. It will
> > be reset when the mother process goes around checking the status of all
> > the variables; if it is not reset, that implies the particular process
> > might be hung. The mother can wait further, or continue checking the
> > other processes. At the end of checking ALL the processes, if
> > everything is OK, it should feed the kernel watchdog timer. If the
> > kernel watchdog timer is not reset, the kernel module will then reboot
> > the system (i.e., the reboot comes from the kernel module).
>
> Hold on! Why should we reboot the whole system if only some of these
> processes are misbehaving?!?! Why should other processes suffer due
> to this? Wouldn't it be better to just kill the erroneous process (like
> most OSes do anyway, e.g. "Force Quit" in Ubuntu, or Chrome tabs)?
>
>
In many COTS software systems, the behavior of each process depends heavily
on the others: some of them talk to hardware, while others just process the
intermediate data. When something goes wrong, it is difficult to diagnose
the fault in real time (i.e., with a self-diagnosis mechanism), which is why
fault logging is important and is always done on flash or hard disk rather
than on a temporary filesystem. So it is better to reboot. Yes, not every
process needs to trigger a reboot, so design it with care. E.g., an Apache
server can always afford to be killed and restarted.
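
To make the "design it with care" point concrete, here is a rough sketch
(in Python, for brevity) of one supervision round. It is only an
illustration under assumptions, not a definitive implementation: the names
Child, heartbeat and supervise_once, and the "critical" flag, are all made
up for this example. In a real system the flags would live in shared memory
or another IPC channel, and feed_watchdog would be something like a write
to /dev/watchdog; here both are stand-ins passed in by the caller.

```python
from dataclasses import dataclass


@dataclass
class Child:
    name: str
    critical: bool       # hypothetical flag: a hang here justifies a reboot
    alive: bool = False  # set by the child's periodic heartbeat


def heartbeat(child):
    """Called periodically by the child (stand-in for the IPC toggle)."""
    child.alive = True


def supervise_once(children, feed_watchdog, restart):
    """Run one checking round; return True if the watchdog was fed."""
    # A child whose flag is still clear has missed its heartbeat period.
    hung = [c for c in children if not c.alive]

    # Clear every flag so the next round requires a fresh heartbeat.
    for c in children:
        c.alive = False

    # Non-critical children (e.g. a web server) are simply restarted.
    for c in hung:
        if not c.critical:
            restart(c.name)

    # If any critical child is hung, do NOT feed the watchdog; the
    # kernel watchdog will then expire and reboot the system.
    if any(c.critical for c in hung):
        return False

    feed_watchdog()
    return True
```

The mother process would call supervise_once() in a loop with a period
longer than the children's heartbeat interval and shorter than the watchdog
timeout. Which processes count as critical is exactly the design decision
mentioned above.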

> Or are these processes the only ones running on the system?
>
> -mandeep
>

-- 
Regards,
Peter Teoh