Determining patch impact on a specific config

Nicholas Mc Guire der.herr at hofr.at
Wed Aug 17 14:48:53 EDT 2016


On Wed, Aug 17, 2016 at 07:34:02PM +0200, Greg KH wrote:
> On Wed, Aug 17, 2016 at 04:50:30PM +0000, Nicholas Mc Guire wrote:
> > > But you aren't comparing that to the number of changes that are
> > > happening in a "real" release.  If you do that, you will see the
> > > subsystems that never mark things for stable, which you totally miss
> > > here, right?
> > 
> > we are not looking at the run-up to 4.4 here - we are looking at
> > the fixes that go into 4.4.1 and later, and for those we look at all
> > commits in linux-stable. So that should cover ALL subsystems
> > for which bugs were discovered and fixed (either in 4.4.X or
> > ported from other 4.X findings).
> 
> No, because (see below)...
> 
> > > For example, where are the driver subsystems that everyone relies on
> > > that are changing upstream, yet have no stable fixes?  What about the
> > > filesystems that even more people rely on, yet have no stable fixes?
> > > Are those code bases just so good and solid that there are no bugs to be
> > > fixed?  (hint, no...)
> > 
> > that is not what we are claiming - the model here is that
> > operation is uncovering bugs and that the critical bugs are being
> > fixed in stable releases. That there are more fixes and lots
> > of cleanups that go into stable is clear, but with respect to
> > the usability of the kernel we do assume that if a bug in
> > driver X is found that results in this driver being unusable
> > or destabilizing the kernel, it would be fixed in the stable
> > releases as well (which is also visible in the close to 50% of
> > fixes being in drivers) - now if that assumption is overly
> > naive then you are right - and the assessment will not hold
> 
> No, that's not how bugs normally get found/fixed. They aren't found in
> older kernels for the most part, they are found in the "latest" kernel
> and then sometimes tagged that they should be backported.
> 
> All of the automated testing/debugging that we have going on to fix
> issues are on the latest kernel release, not the older releases.  We

Well, our QA Farm at OSADL does do long-term testing of specific versions.
Carsten Emde calls this freeze-and-grow: stick to one specific
version and "monitor" HEAD, but don't jump to each new release. In fact
I would assume that an HA system would be better off with some simplistic
kernel version selection like:
  1) is it an LTS ?
  2) did it make it into some mainstream distro ?
  3) how many bugs surfaced over time in those distros ?
  4) did it make it through a few sublevels that show decreasing trends ?
  ...
All the automated build-bots/kernelci etc. are nice but they
do not replace field-data - we need both.
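The selection checklist above can be sketched as a simple scoring function; the weights, field names, and candidate data below are all hypothetical illustrations, not an OSADL tool:

```python
# Sketch of the kernel-selection checklist; weights, field names and
# candidate data are all hypothetical, not an OSADL tool.

def score_kernel(info):
    """Score a candidate kernel version against the four criteria."""
    score = 0
    if info.get("is_lts"):                 # 1) is it an LTS?
        score += 4
    if info.get("in_mainstream_distro"):   # 2) picked up by a distro?
        score += 2
    # 3) fewer bugs surfaced in those distros is better
    score -= min(info.get("distro_bugs", 0), 3)
    # 4) a decreasing fix trend over a few sublevels
    fixes = info.get("fixes_per_sublevel", [])
    if len(fixes) >= 3 and fixes[-1] < fixes[0]:
        score += 2
    return score

candidates = {
    "4.4": {"is_lts": True, "in_mainstream_distro": True,
            "distro_bugs": 1, "fixes_per_sublevel": [120, 90, 60]},
    "4.6": {"is_lts": False, "in_mainstream_distro": False,
            "distro_bugs": 0, "fixes_per_sublevel": [80, 85]},
}
best = max(candidates, key=lambda v: score_kernel(candidates[v]))
```

In practice the inputs (distro bug counts, fix counts per sublevel) would come from the distro trackers and git meta-data mentioned above.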

> might get lucky and get bug reports from a good distro like Debian or
> Fedora that is running the latest stable kernel, but usually those
> reports are "backport these fixes to the stable kernel please" as the
> fixes have already been made by the community.
> 
> But this does depend on the subsystem/area as well.  Some arches don't
> even test every kernel, they only wake up once a year and start sending
> patches in.  Some subsystems are the same (scsi was known for this...)
> So things are all over the place.

well that means some of the possible issues are known - and some might
be backed by meta-data - whether that is much of an incentive to fix the
process I do not know, let's see.

> 
> Also, you have the subsystems and arches that are just quiet for
> long stretches of time (like scsi used to be), where patches would queue
> up for many releases before they finally got merged.  Some arches only
> send patches in every other year for things that are more complex than
> build-breakage bugs because they just don't care.
> 
> So please, don't assume that the patches I apply to a LTS kernel are due
> to someone noticing it in that kernel.  It's almost never the case, but
> of course there are exceptions.
> 
> Again, I think you are trying to attribute a pattern to something that
> doesn't have it, based on how I have been seeing kernel development work
> over the years.
> 
> > > So because of that, you can't use the information about what I apply to
> > > stable trees as an indication that those are the only parts of the
> > > kernel that have bugs to be fixed.
> > 
> > so a discovered critical bug found in 4.7 that also is found
> > to apply to say 4.4.14 would *not* be fixed in 4.4.15 stable 
> > release ? 
> 
> Maybe, depends on the subsystem.  I know some specific ones that the
> answer to that would be no.  And that's the subsystem maintainers
> choice, I can't tell him to do extra work just because I decided to
> maintain a specific kernel version for longer than expected.

oops.. ok, that's bad - that messes up a bit what we had been expecting.
Then we will need to include monitoring of HEAD, basically to fix by
backporting on our own in case it is not done. Seems that I was
a bit naive on that one.

> 
> > > > > So be careful about what you are trying to measure, it might just be not
> > > > > what you are assuming it is...
> > > > 
> > > > An R^2 of 0.76 does indicate that the commits with Fixes: tags in the 4.4 series
> > > > represent the overall stable fixes quite well.
> > > 
> > > "overall stable fixes".  Not "overall kernel fixes", two very different
> > > things, please don't confuse the two.
> > 
> > I'm not - we are looking at stable fixes, not kernel fixes - the
> > reason for that simply being that for kernel fixes it is not
> > possible to say if they are bug-fixes or optimizations/enhancements
> > - at least not in any automated way.
> 
> I agree, it's hard, if not impossible to do that :)

...well then I'll chicken out and go for the meager but possible.

> 
> > The focus on stable dot releases and their fixes was chosen 
> >  * because it is manageable
> >  * because we assume that critical bugs discovered will be fixed
> >  * and because there are no optimizations or added features 
> 
> The first one makes this easier for you, the second and third are not
> always true.  There have been big patchsets get merged into longterm
> stable kernel releases that were done because they were "optimizations"
> and the maintainer of that subsystem and I discussed it and deemed it
> was a valid thing to accept.  This happens every 6 months or so if you
> look closely.  The mm subsystem is known for this :)

so major mm subsystem optimizations will go in in the middle of an
LTS between "random" sublevel releases ? At least for 4.4-4.4.13 I was not
able to pin-point such a change (based on files-changed/lines-added/removed) -
could you point me to one or the other ? It would help to see why we missed it.

> 
> And as for #2, again, I'm at the whim of the subsystem maintainer to
> mark the patches as such.  And again, this does not happen for all
> subsystems.
> 
> > > And because of that, I would state that "overall stable fixes" number
> > > really doesn't mean much to a user of the kernel.
> > 
> > It does for those that are using some LTS release and it says 
> > something about the probability of a bug in a stable release
> > being detected. Or would you say that 4.4.13 is not to be
> > expected to be better off than 4.4.1 ?
> 
> Yes, I would hope that it is better, otherwise why would I have accepted
> the patches to create that kernel?  :)
> 
> But you can't make the claim that all bugs that are being found are
> being added to the stable kernels, and especially not the lts kernels.
>

I'm not making such a claim - we are just trying to estimate residual
bugs in the kernel for a given (defined) configuration based on the
git meta-data. We know that the kernel has bugs - but we can classify
their severity, estimate their distribution, estimate the residual bugs
and from that estimate the overall criticality of the kernel in a quantitative
way (with modeled/quantified uncertainty).
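As a rough illustration of the R^2 figure discussed above - how well the Fixes:-tagged commits per sublevel track the total stable commits - the calculation looks like this; the per-sublevel counts below are invented (the real analysis was done in R over linux-stable meta-data):

```python
# How well do Fixes:-tagged commits per sublevel track the total
# stable commits? The counts below are invented for illustration;
# the actual analysis ran in R over linux-stable git meta-data.

def r_squared(xs, ys):
    """Coefficient of determination of a simple linear fit of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)

# hypothetical counts for sublevels 4.4.1 .. 4.4.6
fixes_tagged  = [40, 55, 30, 70, 45, 60]
total_commits = [100, 160, 110, 180, 140, 150]
r2 = r_squared(fixes_tagged, total_commits)   # ~0.83 for this data
```

An R^2 close to 1 would mean the tagged subset is an almost perfect proxy for the whole stable fix stream.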
 
> > From the data we have
> > looked at so far: life-time of a bug in -stable as well as with 
> > respect to the discovery rate of bugs in sublevel releases
> > it seems clear that the reliability of the kernel over
> > sublevel releases is increasing and that this can be utilized
> > to select a kernel version more suitable for HA or critical
> > systems based on trending/analysis.
> 
> That's good, I'm glad we aren't regressing.  But the only way you can be
> sure to get all fixes is to always use the latest kernel release.
> That's all the kernel developers will ever guarantee.
> 
> "critical" and HA systems had better be updating to newer kernel
> releases as they have all of the fixes in it that they need.  There
> shouldn't be any "fear" of changing to a new kernel any more than they
> should fear moving to a new .y stable release.

That would be nice - but it's not doable - not as soon as you need a
certification for such a system. Dot releases have the key advantage of not
including feature changes or significant redesign - so the testing and
field-data as well as analysis (like ftrace campaigns/code-coverage/LTP/etc.)
stay valid to a large extent. From what we have been reviewing for mainstream
hardware, we also did not see cases where backporting was *not* happening,
at least for a quite constrained/small configuration.

> 
> > > Over time, more people are using the "fixes:" tag, but then that messes
> > > with your numbers because you can't compare the work we did this year
> > > with the work we did last year.
> > 
> > sure, why not ? You must look at relative usage and correlation
> > of the tags - currently about 36% of the stable commits in the
> > dot-releases (sublevels) are a usable basis - if the use of
> > Fixes: increases, all the better - it just means we are moving
> > towards an R^2 of 1 - results stay comparable, it just means
> > that the confidence intervals for the current data are wider
> > than for the data of next year.
> 
> Depends on how you describe "confidence" levels, but sure, I'll take
> your word for it :)

We are trying to put numeric values on artefacts of development so that they
are comparable - and by confidence here we do mean formal confidence
levels (p-values, AIC values, hypothesis/significance testing at
defined levels) - trying to get away from "gut-feeling" only - but let's see
if we do better or just produce "formalized gut-feeling"...
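To make those formal confidence levels concrete: a 95% normal-approximation interval around the 36% Fixes: proportion can be sketched as below; the commit counts are hypothetical:

```python
# Sampling uncertainty on the "36% of stable commits carry a Fixes:
# tag" figure: a 95% normal-approximation interval for a binomial
# proportion. The commit counts are hypothetical.

import math

def proportion_ci(successes, n, z=1.96):
    """95% normal-approximation CI for a binomial proportion."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

low, high = proportion_ci(successes=360, n=1000)
# the interval narrows as more maintainers adopt Fixes: tags (larger n)
```

This is the sense in which wider Fixes: adoption makes this year's and next year's numbers comparable: the point estimate stays on the same scale, only the interval shrinks.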

> 
> > > Also, our rate of change has increased, and the number of stable patches
> > > being tagged has increased, based on me going around and kicking
> > > maintainers.  Again, because of that you can't compare year to year at
> > > all.
> > 
> > why not ? We are not selecting a specific class of bugs in any
> > way - the Fixes are fairly randomly distributed across the
> > effective fixes in stable - it may be a bit biased because some
> > maintainer does not like Fixes: tags and her subsystem is
> > significantly more complex/more buggy/better tested/etc. than
> > the average subsystem - so we would get a bit of a bias into it
> > all - but that does not invalidate the results.
> > You can ask the voters in 3 states who they will elect president
> > and this will give you a less accurate result than if you ask in
> > all 50 states, but if you factor that uncertainty into the
> > result it's perfectly valid and stays comparable to other results.
> >
> > I'm not saying that you can simply compare numeric values for
> > 2016 with those from 2017, but you can compare the trends and
> > the expectations if you model uncertainties.
> 
> Ok, fair enough.  As long as we continue to do better I'll be happy.
> 
> > Note that we have a huge advantage here - we can make predictions
> > from models - say predict 4.4.16 and then actually check our models
> 
> That's good, and is what I've been telling people that they should be
> doing for a long time.  Someone actually went and ran regression tests
> on all 3.10.y kernel releases and found no regressions for their
> hardware platform.  That's a good thing to see.

and just like you can do that on code to detect null pointers or whatnot,
you can do it on development processes to detect systematic problems there -
like infinite fix cycles or an accumulation of fix-fix commits indicating a
possibly broken design rather than "just" broken code.
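A minimal sketch of harvesting such fix-fix chains, assuming the Fixes: references have already been parsed out of the git log into (commit, fixed-commit) pairs; the shas below are placeholders:

```python
# Detecting "fix-of-a-fix" chains from git meta-data: a commit whose
# Fixes: target was itself a fix. Input is assumed to be pre-parsed
# (sha, fixed_sha) pairs, e.g. harvested from `git log` output; the
# shas here are placeholders.

def fix_chain_depths(fix_pairs):
    """Map each fixing commit to its chain depth (1 = fixes ordinary code)."""
    fixes = dict(fix_pairs)              # sha -> sha it claims to fix
    depth = {}
    def chain_len(sha):
        if sha not in fixes:
            return 0                     # target was not itself a fix
        if sha not in depth:
            depth[sha] = 1 + chain_len(fixes[sha])
        return depth[sha]
    return {sha: chain_len(sha) for sha in fixes}

pairs = [("c3", "c2"), ("c2", "c1"), ("c1", "a0"), ("d1", "b0")]
depths = fix_chain_depths(pairs)
# c3 -> c2 -> c1 -> a0 gives c3 a depth of 3: a candidate design smell
```

Deep or repeatedly growing chains in one subsystem would be exactly the kind of process-level signal discussed here.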

> 
> > Now if there are really significant changes, like the task struct
> > being redone, then that may have a large impact and the assumption
> > that the convoluted parameter "sublevel" is describing a more or
> > less stable development might be less correct - it will not be
> > completely wrong - and consequently the prediction quality will
> > suffer - but does that invalidate the approach ?
> 
> I don't know, you tell me :)

ok - will answer that by - say end of 2017 +/- 1 year with a probability of...

> 
> > > There's also the "bias" of the long-term and stable maintainer to skew
> > > the patches they review and work to get applied based on _why_ they are
> > > maintaining a specific tree.  I know I do that for the trees I maintain,
> > > and know the other stable developers do the same.  But those reasons are
> > > different, so you can't compare what is done to one tree vs. another one
> > > very well at all because of that bias.
> > 
> > If the procedures applied do not "jump" but evolve then bias is
> > not an issue - you can find many factors that will increase the
> > uncertainty of any such prediction - but if the parameters, which
> > are all convoluted - be it by personal preferences of maintainers,
> > selection of a specific FS in mainline distributions, etc. - still
> > represent the overall development, and as long as your bias, as you
> > called it, does not flip-flop from 4.4.6 to 4.4.7, we do not care
> > too much.
> 
> Ok, but I don't think the users of those kernels will like that, as you
> can't represent bias in your numbers and perhaps a whole class of users
> is being ignored for one specific LTS release.  Then they would get no
> bugfixes for their areas :(

I think we are looking at it from different angles here - our intent is to
uncover high-level faults in the development life-cycle - things that start
going off track, like fix-rates going up or complexity metrics jumping, bug
ages changing statistically significantly. E.g. ext4 has kind of popped out
as a problem case - we can't yet really say much about why, but from what I've
been looking at it seems that the problem goes all the way back to the initial
release as a copy of ext3 rather than a clean re-implementation/re-design.
(This conclusion may well be wrong - it's based on the observation that
 ext4 stable fixes are seemingly not stabilizing.)

So yes - we might be missing a whole subsystem or arch - we can not
do much about that - but we can detect some types of high-level faults
in the development and possibly address them by tools (like static code
checkers or git meta-data harvesting, etc.) or fixes to the process, and
in this sense it will benefit those, for us, hidden users.
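The "seemingly not stabilizing" observation can be made testable with a simple trend fit over per-sublevel fix counts; the counts below are invented purely to show the two patterns:

```python
# Making "seemingly not stabilizing" testable: least-squares slope of
# per-sublevel fix counts for a subsystem. Counts are invented; a
# negative slope means the fix rate is settling, a flat or positive
# one is the warning sign described above.

def trend_slope(counts):
    """Least-squares slope of counts over sublevel index 0..n-1."""
    n = len(counts)
    mx = (n - 1) / 2
    my = sum(counts) / n
    num = sum((x - mx) * (y - my) for x, y in enumerate(counts))
    den = sum((x - mx) ** 2 for x in range(n))
    return num / den

settling = [20, 15, 12, 9, 7, 5]    # fix rate decreasing over sublevels
worrying = [8, 10, 9, 12, 11, 13]   # the ext4-like pattern
```

A significance test on the slope (rather than just its sign) would be the next step before flagging a subsystem.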

> 
> > > > > > Some early results where presented at ALS in Japan on July 14th
> > > > > > but this still needs quite a bit of work.
> > > > > 
> > > > > Have a pointer to that presentation?
> > > > >
> > > > They probably are somewhere on the ALS site - but I just dropped
> > > > them to our web-server at
> > > >   http://www.opentech.at/Statistics.pdf and
> > > >   http://www.opentech.at/TechSummary.pdf
> > > > 
> > > > This is quite a rough summary - so if anyone wants the actual data
> > > > or R commands used - let me know - no issue with sharing this and having
> > > > people tell me that I'm totally wrong :)
> > > 
> > > Interesting, I'll go read them when I get the chance.
> > > 
> > > But I will make a meta-observation, it's "interesting" that people go
> > > and do analysis of development processes like this, yet never actually
> > > talk to the people doing the work about how they do it, nor how they
> > > could possibly improve it based on their analysis.
> > 
> > I do talk to the people - I've been doing this quite a bit - one of
> > the reasons for hopping over to ALS was precisely that. We have been
> > publishing our stuff all along, including any findings, patches,
> > etc.
> 
> What long-term stable kernel maintainer have you talked to?
> 
> Not me :)

I actually did talk to you in Duesseldorf at the LISI session (I think
that was its name) about LTS kernels for safety-related automotive systems,
but I did not discuss the statistical stuff at that time as it was not
really ready yet. But as noted before - first we need solid data
so that we can actually reasonably uncover high-level issues (like
type mismatches and the whole linux kernel type system/mess - developers
or subsystems with particular issues like lack of Reviewed-by tags or
whatever).

> 
> > BUT: I'm not going to go to LinuxCon and claim that I know how
> >      to do better - not based on the preliminary data we have now.
> >
> > Once we think we have something solid - I'll be most happy to sit
> > down and listen.
> > 
> > > 
> > > We aren't just people to be researched, we can change if asked.
> > > And remember, I _always_ ask for help with the stable development
> > > process, I have huge areas that I know need work to improve, just no one
> > > ever provides that help...
> > 
> > And we are doing our best to support that - be it by documentation
> > fixes, compliance analysis, type safety analysis and appropriate
> > patches I've been pestering maintainers with.
> 
> You have?  As a subsystem maintainer I haven't seen anything like this,
> I guess no one relies on my subsystems :)

Actually a number of them went to you and showed up in stable review
patch series in the past. Nothing wild and big - just cleanup
patches (type fixes, completion API fixes, doc fixes) - and some did
go into backports like 3.14-stable (from 3.19, I think).
At least I do have a dozen or so "Applied, thanks" from your email address
or your "friendly semi-automated patch-bot".


> 
> > But you do have to give us the time to have SOLID data first
> > and NOT rush conclusions - as you pointed out here yourself,
> > some of the assumptions we are making might well be wrong, so
> > what kind of suggestions do you expect here ?
> >  First get the data
> >   -> make a model
> >    -> deduce your analysis/sample/experiments
> >     -> write it all up and present it to the community
> >      -> get the feedback and fix the model
> > and if after that some significant findings are left - THEN
> > we will show up at LinuxCon and try to find someone to listen
> > to what we think we have to say...
> 
> No need to go to LinuxCon, email works.  And lots of us go to much
> better conferences as well (Plumbers, Kernel Recipes, FOSDEM, etc.) :)
>
Well, early findings were presented this year at FOSDEM, and
at Plumbers in Duesseldorf we presented the certification
approach as well - but that was still at a very early stage,
in the context of the SIL2LinuxMP certification project at
OSADL (http://www.osadl.org/SIL2).

We are more than happy to present findings and rake in some more rants ...

if e-mail is preferred - all the better.

thx!
hofrat



More information about the Kernelnewbies mailing list