Fwd: Custom Linux Kernel Scheduler issue

Thu Nov 24 13:05:47 EST 2016

So, I ran perf on my host and the results looked far more plausible. The top
consumers of time were all atomics and some function called sse3, which I
believe is a super fast memcpy implementation provided by the arch. In
addition, all the highest time consumers are within my image: it stayed out of
the kernel as designed, and it used additional extensions and features.

I just thought of something: what if there is some kind of page size
difference between my host and my Linux kernel causing the performance
problems?
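
One quick way to rule the page size question in or out is to print what each
kernel reports and compare. A minimal sketch in C, assuming only POSIX
sysconf(); the function names are made up for illustration and are not from
the code under test:

```c
#include <stdio.h>
#include <unistd.h>

/* Return the base page size in bytes as reported by the running kernel,
 * or -1 if sysconf() fails. */
long page_size_bytes(void)
{
    return sysconf(_SC_PAGESIZE);
}

/* Print the page size so the line can be compared between host and target. */
void print_page_size(void)
{
    printf("page size: %ld bytes\n", page_size_bytes());
}
```

Calling print_page_size() from a small test binary on both the host and the
buildroot image gives two numbers to compare. If they match, the base page
size is probably not the difference, though huge page settings would still be
a separate thing to check.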

On Nov 24, 2016 11:33 AM, "Kenneth Adam Miller" <kennethadammiller at gmail.com>
wrote:
>
> On Thu, Nov 24, 2016 at 11:13 AM, Greg KH <greg at kroah.com> wrote:
> > On Thu, Nov 24, 2016 at 10:31:18AM -0500, Kenneth Adam Miller wrote:
> >> On Nov 24, 2016 2:18 AM, "Greg KH" <greg at kroah.com> wrote:
> >> >
> >> > On Thu, Nov 24, 2016 at 02:01:41AM -0500, Kenneth Adam Miller wrote:
> >> > > Hello,
> >> > >
> >> > >
> >> > > I have a scheduler issue in two different respects:
> >> > >
> >> > > 1) I have a process that is supposed to tight loop, and it is being
> >> > > given very very little time on the system. I don't want that - I want
> >> > > those who would use the processor to be given the resources to run as
> >> > > fast as they each can.
> >> >
> >> > What is causing it to give up its timeslice?  Is it waiting for I/O?
> >> > Doing something else to sleep?
> >>
> >> It's multithreaded, so it reads in a loop in one thread and writes in
> >> another thread. What I saw when I ran strace on it is each process
> >> would run for too long- the program is designed to try and stay out of
> >> the kernel on each side, so it checks some shared variables before it
> >> ever goes.
> >
> > So locking/cpu contention for those "shared variables" perhaps?
>
> I don't think that could possibly be it, because the shared variables
> are controlled by atomics. It's just some memory operation to check to
> see if it needs to go to the kernel, as in is there more data in the
> shm region for me to read? If not, I'll go wait on this OS semaphore.
> It's lightning fast on my host machine.
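
The check-then-wait pattern described above can be sketched roughly as
follows. This is not the actual library code: the struct layout, the names
(shm_channel, channel_publish, channel_take) and the "reader_waiting" flag are
all assumptions made for illustration, and it assumes exactly one reader and
one writer. It uses C11 atomics for the fast path and a POSIX process-shared
semaphore for the slow path.

```c
#include <errno.h>
#include <semaphore.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical header that would live at the start of the shm region. */
struct shm_channel {
    atomic_size_t bytes_ready;    /* data published but not yet consumed */
    atomic_bool   reader_waiting; /* reader intends to block on the semaphore */
    sem_t         wakeup;         /* slow path: process-shared semaphore */
};

int channel_init(struct shm_channel *ch)
{
    atomic_init(&ch->bytes_ready, 0);
    atomic_init(&ch->reader_waiting, false);
    return sem_init(&ch->wakeup, 1, 0); /* pshared = 1, initial count 0 */
}

/* Writer side: publish n bytes; post the semaphore only if the reader
 * announced that it is about to sleep. */
int channel_publish(struct shm_channel *ch, size_t n)
{
    atomic_fetch_add(&ch->bytes_ready, n);
    if (atomic_exchange(&ch->reader_waiting, false))
        return sem_post(&ch->wakeup);
    return 0;
}

/* Block on the semaphore, retrying if a signal interrupts the wait. */
static void wait_for_post(sem_t *sem)
{
    while (sem_wait(sem) == -1 && errno == EINTR)
        ;
}

/* Reader side: return the number of bytes available, sleeping only when
 * the shared counter says there is nothing to read. */
size_t channel_take(struct shm_channel *ch)
{
    for (;;) {
        size_t n = atomic_exchange(&ch->bytes_ready, 0);
        if (n > 0)
            return n; /* fast path: no system call */

        /* Announce the intent to sleep, then re-check to avoid a lost wakeup. */
        atomic_store(&ch->reader_waiting, true);
        n = atomic_exchange(&ch->bytes_ready, 0);
        if (n > 0) {
            /* If the writer already cleared the flag, a post is on its way;
             * absorb it so the semaphore count stays balanced. */
            if (!atomic_exchange(&ch->reader_waiting, false))
                wait_for_post(&ch->wakeup);
            return n;
        }
        wait_for_post(&ch->wakeup); /* slow path: sleep in the kernel */
    }
}
```

In a layout like this the reader and writer still bounce the cache line
holding bytes_ready between cores, so atomics alone do not rule out
contention; they only rule out blocking in the kernel on the fast path.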
>
> >
> >> > > 2) I am seeing with perf that the maximum overhead at each section
> >> > > does not sum up to be more than 15 percent. Total, probably something
> >> > > like 18% of cpu time is used, and my binary has rocketed in slowness
> >> > > from about 2 seconds or less total to several minutes.
> >> >
> >> > What changed to make things slower?  Did you change kernel versions or
> >> > did you change something in your userspace program?
> >> >
> >>
> >> The kernel versions specifically couldn't have anything to do with it,
> >> but they are different kernels. The test runs in less than 2 seconds on
> >> my host. When I copy it to our custom linux, it takes minutes for it
> >> to run. I think it's some extra setting that we're missing while
> >> building the kernel, and I don't know what that is. I got a huge
> >> improvement when I changed the multicore scheduling to allow
> >> preemption "(desktop)" but there's still a problem as I've described
> >> with one of the processes not using the core as it should.
> >
> > What do you mean by "custom linux"?  Is this the exact same hardware as
> > your machine?  Or different?  If so, what is different?  What is
> > different between the different kernel versions you are using?  Does the
> > perf output look different from running on the two different machines?
> > If so, where?
>
> I am building with buildroot a linux that is meant to be really
> stripped down and only have the things we want. In my case, what
> the bzImage sees is either what QEMU gives it or what it sees in our
> dedicated hardware, which is just an off the shelf i7 and other stuff you
> get at a market - nothing custom in the sense you are thinking. Custom as
> in, roll your own linux.
>
> The kernel versions between my host and the target are 3.13.x and
> 3.14.5x; they don't change so much, and certainly don't affect
> performance on their own. I'm missing some setting or something with
> how I'm configuring or building linux.
>
> I haven't had a chance to run perf on my host. I can't find what
> ubuntu package it is just yet, but I will search for it in a minute. I
> have to go somewhere and will be right back immediately.
>
> >
> > Have you changed the priority levels of your application at all?  Have
> > you thought about just forcing your app to a specific CPU and getting
> > the kernel off of that CPU in order so that the kernel isn't even an
> > option here at all (Linux allows you to do this, details are somewhere
> > in the documentation, sorry, can't remember off the top of my head...)
> >
>
> No, that may be it or help though. I thought that binding an
> application to a particular cpu had something to do with affinity and
> that there was some C api for it or something. That would work for our
> particular scenario, and we've even talked about it, I just don't know
> how to do it yet.
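
For reference, the C API in question on Linux is sched_setaffinity(2). A
minimal sketch, assuming glibc and the cpu_set_t macros from <sched.h>; the
helper names pin_to_cpu and allowed_on_cpu are made up for this example:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* needed for cpu_set_t and the CPU_* macros in glibc */
#endif
#include <sched.h>

/* Restrict the calling thread to a single CPU.
 * Returns 0 on success, -1 with errno set on failure. */
int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set); /* pid 0 = caller */
}

/* Return 1 if the calling thread may run on the given CPU,
 * 0 if it may not, and -1 if the affinity mask cannot be read. */
int allowed_on_cpu(int cpu)
{
    cpu_set_t set;
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    return CPU_ISSET(cpu, &set) ? 1 : 0;
}
```

Calling pin_to_cpu() early in each thread keeps that thread on one core.
Affinity alone does not stop the kernel from scheduling other tasks there;
keeping other work off the core is usually done separately, for example with
the isolcpus= boot parameter. The taskset utility applies the same affinity
call to an unmodified binary.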
>
> > But really, you should track down what the differences are between your
> > two machines/environments, as something is different that is causing the
> > slow down.
>
> True - the kernel configuration is most suspect based on everything I
> know. My host and the target we're building for both have modern
> hardware that is well supported by linux. I'm thinking
> it absolutely must have something to do with the way I've built linux.
>
> >
> > You haven't even said what kernel version you are using, and if you have
> > any of your own kernel patches in those kernels.
> >
>
> The target hardware is running 3.14.5x, and there aren't any kernel
> patches at this time; I've disabled grsec while in the process of
> narrowing down what the problem is.
>
> >> > > I think that
> >> > > the linux scheduler isn't scheduling it, because this process is just
> >> > > some unit tests that double as benchmarks in that they shm_open a file
> >> > > and write into it with memcpy's.
> >> >
> >> > Are you sure that I/O isn't happening here like through swap or
> >> > something else?
> >> >
> >>
> >> Well, we're using tmpfs and don't have a disk in the machine, but I
> >> will say this process is using a lot of the address space. One
> >> problem here is that the kernel has more ram than it thinks it does,
> >
> > What do you mean, is this a hardware issue?
>
> I don't think it's hardware; we're using this proprietary software
> beneath the linux kernel, but it's still ram of course. I can't say
> too too much, but what I can say is that while how much linux thinks
> it has could be affecting how it behaves, on our end we have the
> resources and can just change the configuration to make sure that
> linux sees and has enough ram. So that we can test on our end, and
> indeed we will.
>
> >
> >> but what I want to emphasize is that I haven't changed the program to
> >> allocate any more than it was previously. I'm not sure if that's a
> >> kernel change or some setting, but it went from 85% to 98%.
> >
> > What exactly went up by 17%?
>
> Consider the process that I was talking about that is meant to tight
> loop and burn on a core to be the "end product process". This is
> different from the test benchmarks that I was explaining run so
> poorly.
>
> >
> >> The reason
> >> why is that there is a large latency even without that big program in
> >> there; I can't run my standalone tests in qemu without it also taking
> >> minutes. I understand qemu has to emulate, and that it's not just a
> >> VM, but I'm going from host CPU to guest, and the settings are the
> >> same.
> >
> > That doesn't really make much sense, why is qemu even in the picture
> > here?  And no, qemu doesn't always emulate things, that depends on the
> > hardware you are running it on, and what type of image you are running
> > on it.
>
> Well, when I'm not at work, I have to be able to run the bzImage on
> something, and I don't have a dedicated machine. So I run it in QEMU.
>
> >
> >> > What does perf say is taking all of your time?
> >>
> >> When I ran perf what it appeared to indicate is that the largest
> >> consumer of time was my library, which should be right in either
> >> scenario because it should stay out of the kernel as I've designed
> >> it. In addition, the work takes place there anyway, so that's right.
> >> What's not right is the fact that the largest percent of time used is
> >> around 15%, and all the others combined don't add up to anything near
> >> 100.
> >
> > So perhaps you have other processes running on the machine that you are
> > not noticing that are taking up the time slices?  Are you _sure_ nothing
> > else is running?
>
> I'm certain that there are other processes alive, but they are not
> using the CPU. This process is the only one running. I even gave qemu
> "-smp 4" because I want it to behave as close as possible to what it
> would if it were just on the host.
>
> >
> > Basically, you have a bunch of variables, and haven't been very specific
> > with what really is changing, or even being used here, so there's not
> > much specific that I can think of at the moment.
> >
> > thanks,
> >
> > greg k-h