Current and correct CPU clock and asm("cpuid")

Peter Senna Tschudin peter.senna at gmail.com
Tue Oct 4 09:34:14 EDT 2011


Hi Peter,

The post:

http://aufather.wordpress.com/2010/09/08/high-performance-time-measuremen-in-linux

is a very good reference. The author is clever, and it answers my
questions. Thanks! :-)

I also found an old-school document from 1997 by Intel, which was
proud of its brand-new Pentium II. See:

http://www.ccsl.carleton.ca/~jamuir/rdtscpm1.pdf

Section 3 describes how to deal with out-of-order execution and L1 cache effects.
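
For the record, here is a minimal sketch of how I understand that advice,
assuming GCC or Clang on x86. It is not the code from the Intel document.
The names getticks_serialized() and getticks_overhead() are made up for
this example. A bare asm("cpuid") does not tell the compiler that CPUID
overwrites EAX, EBX, ECX and EDX, so the sketch declares them. The non-x86
branch is only an assumption so that the file still builds elsewhere.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

typedef uint64_t ticks;

#if defined(__x86_64__) || defined(__i386__)
/* CPUID is a serializing instruction: earlier instructions must complete
 * before RDTSC executes. CPUID overwrites EAX, EBX, ECX and EDX, so all
 * four appear as operands to keep the compiler informed. */
static inline ticks getticks_serialized(void)
{
    uint32_t eax = 0, ebx, ecx = 0, edx, lo, hi;

    __asm__ __volatile__("cpuid"
                         : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                         :
                         : "memory");
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    (void)ebx;
    (void)edx;
    return ((ticks)hi << 32) | lo;
}
#else
/* Assumption for non-x86 builds: use monotonic nanoseconds as "ticks". */
static inline ticks getticks_serialized(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (ticks)ts.tv_sec * 1000000000ull + (ticks)ts.tv_nsec;
}
#endif

/* Smallest back-to-back difference over several runs. The early runs warm
 * the caches; the minimum approximates the cost of the measurement itself,
 * which can then be subtracted from real measurements. */
static ticks getticks_overhead(int runs)
{
    ticks best = (ticks)-1;

    assert(runs > 0);
    for (int i = 0; i < runs; i++) {
        ticks t0 = getticks_serialized();
        ticks t1 = getticks_serialized();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```

Usage would be something like: call getticks_overhead(100) once, then
subtract the result from each (tickEnd - tickBegin) difference.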

Thanks!

Peter


On Mon, Oct 3, 2011 at 9:57 PM, Peter Teoh <htmldeveloper at gmail.com> wrote:
> On Tue, Oct 4, 2011 at 2:57 AM, Peter Senna Tschudin
> <peter.senna at gmail.com> wrote:
>> Hi Peter,
>>
>> Thanks for the reply. I've realized that I don't need to convert
>> the raw tick count into something like seconds, because I'm only
>> interested in comparing the counts with each other.
>>
>> Is it safe to say that if I skip the division by
>> CPU_THOUSAND_HZ, I get the number of clock cycles that were "spent"
>> between the calls to getticks() (including some for getticks() itself)?
>>
>> Please see below.
>>
>> Thank you!
>>
>> Peter
>>
>> On Mon, Oct 3, 2011 at 1:17 PM, Peter Teoh <htmldeveloper at gmail.com> wrote:
>>> Why not put a sleep(1) here, like this:
>>>
>>>>        ticks tickBegin, tickEnd;
>>>>        tickBegin = getticks();
>>>>
>>>
>>> sleep(1);
>>>
>>>>
>>>>        tickEnd = getticks();
>>>>        double time = (double)(tickEnd - tickBegin) / CPU_THOUSAND_HZ;
>>>>
>>> Then you know that it is reading the TSC values over 1 second. Running
>>> the same program on different systems will give different "time"
>>> values; you then divide by the value measured on THAT system, so that
>>> the same program on any system ends up with the same tick
>>> difference, which in our present case is "1". After this
>>> "normalization" you can measure any time difference on your system,
>>> and the maximum achievable resolution is of course 1 sec. Is that
>>> what you wanted?
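
A rough sketch of this calibration idea, assuming Linux with GCC or Clang.
Instead of trusting the sleep to last exactly one second, it times the
sleep with clock_gettime(CLOCK_MONOTONIC) and divides the tick difference
by the measured duration. estimate_ticks_per_second() is a made-up name,
and the non-x86 branch is only an assumption so the file builds elsewhere.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

typedef uint64_t ticks;

#if defined(__x86_64__) || defined(__i386__)
/* Plain RDTSC read, as in cycle.h. */
static inline ticks getticks(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((ticks)hi << 32) | lo;
}
#else
/* Assumption for non-x86 builds: use monotonic nanoseconds as "ticks". */
static inline ticks getticks(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (ticks)ts.tv_sec * 1000000000ull + (ticks)ts.tv_nsec;
}
#endif

/* Seconds elapsed between two timespec values. */
static double elapsed_seconds(const struct timespec *a,
                              const struct timespec *b)
{
    return (double)(b->tv_sec - a->tv_sec)
         + (double)(b->tv_nsec - a->tv_nsec) / 1e9;
}

/* Count ticks across a sleep whose real length is measured, not assumed. */
static double estimate_ticks_per_second(long sleep_ms)
{
    struct timespec req, t0, t1;
    ticks c0, c1;

    assert(sleep_ms > 0);
    req.tv_sec = sleep_ms / 1000;
    req.tv_nsec = (sleep_ms % 1000) * 1000000L;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    c0 = getticks();
    nanosleep(&req, NULL);
    c1 = getticks();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (double)(c1 - c0) / elapsed_seconds(&t0, &t1);
}
```

The result is only meaningful if the TSC rate stays constant during the
sleep, which is exactly the concern raised elsewhere in this thread.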
>>
>> That sounds like a great idea, but:
>>  - Could dynamic clock rates and multiple CPU cores interfere with your proposal?
>>  - How precisely does sleep actually sleep for 1 second?
>>  - I hope the CPU's out-of-order execution mechanism is defeated by
>> your proposal and runs the instructions in the order we expect
>> (tickBegin -> sleep -> tickEnd). How can we be sure that the
>> instructions ran in the correct order?
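
On the multiple-cores point: CPUID does not bind a process to a CPU. On
Linux that is done with sched_setaffinity(). A small sketch, assuming
glibc; pin_to_cpu() is a made-up helper name:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Restrict the calling thread to a single CPU so that both TSC reads come
 * from the same core. Returns 0 on success, -1 on failure (errno is set). */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;

    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}
```

Calling pin_to_cpu(sched_getcpu()) before the first getticks() keeps the
measurement on whichever core the process is already running on.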
>>
>>>
>>> BTW, modern OSes do not use the TSC any more, but yes, your assembly
>>> can still access and read it. The OS usually reads from the HPET
>>> (which is how sleep(1) calculates time differences), and here is a
>>> link on reading the HPET:
>>>
>>> http://www.fftw.org/cycle.h
>>
>> Looking at cycle.h, I found this familiar code (it starts on line 216):
>>
>> /*----------------------------------------------------------------*/
>> /*
>>  * X86-64 cycle counter
>>  */
>>
>> static __inline__ ticks getticks(void)
>> {
>>     unsigned a, d;
>>     asm volatile("rdtsc" : "=a" (a), "=d" (d));
>>     return ((ticks)a) | (((ticks)d) << 32);
>> }
>>
>
> Oh no, you are right. I just re-quoted the link from the wiki, which
> describes it as 'code to read the high-resolution timer on many CPUs
> and compilers'... OK, RDTSC is nevertheless a high-resolution timer
> as well.
>
>> The code in cycle.h is so similar to the code I was using that I
>> guess both were written by the same author. I got the code
>> I'm using from the paper at:
>> http://people.virginia.edu/~chg5w/page3/assets/MeasuringUnix.pdf
>>
>>>
>>> And query the OS via:
>>>
>>> cat /sys/devices/system/clocksource/clocksource0/*
>>> hpet acpi_pm
>>> hpet
>>>
>
> The above is from my x86 Ubuntu 10.04 laptop.
>
>>> and you can see from the above that "tsc" is missing from my system
>>> (the Linux kernel is 2.6.35-22).
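
The same query can be made from C. A minimal sketch, assuming the usual
sysfs files available_clocksource and current_clocksource in that
directory; read_sysfs_line() is a made-up helper name:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define CLOCKSOURCE_DIR "/sys/devices/system/clocksource/clocksource0/"

/* Read the first line of a file into buf, without the trailing newline.
 * Returns 0 on success, -1 if the file cannot be opened or read. */
static int read_sysfs_line(const char *path, char *buf, size_t len)
{
    FILE *f;

    assert(buf != NULL && len > 0);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fgets(buf, (int)len, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}
```

Passing CLOCKSOURCE_DIR "current_clocksource" as the path should give the
clock source the kernel is using, for example "hpet" or "tsc".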
>>>
>>> For the TSC, I am not sure what the highest resolution is, but in
>>> a modern SoC with a 600 MHz core speed (speaking of the PowerPC
>>> http://en.wikipedia.org/wiki/PowerPC_e500), the fastest execution is
>>> 600 million instructions per second, assuming one instruction
>>> per clock. With this kind of speed, the TSC is very bad for measuring
>>> time differences.
>>
>> This is my mistake. I did not tell you that my tests will run only on the x86 arch.
>>
>
> Sorry to you too... I forgot to mention that my hpet output is from
> the x86 arch. Anyway, the TSC is nevertheless a valid timer; after some
> research, I found its resolution is as good as the HPET's:
>
> http://aufather.wordpress.com/2010/09/08/high-performance-time-measuremen-in-linux/
>
> But it did highlight lots of risks with TSC.
>
> And reading further:
>
> http://stackoverflow.com/questions/3388134/rdtsc-accuracy-across-cpu-cores
>
> http://stackoverflow.com/questions/3835111/whats-the-most-accurate-way-of-measuring-elapsed-time-in-a-modern-pc
>
> http://the-b.org/Linux_timers
>
> Beware of something called "constant TSC" or "invariant TSC", and of the
> overflow time (the link above gives it for all the timers except the
> TSC): if your duration is longer than that, the timer will have
> wrapped around before the end and given you inaccurate figures.
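
To put numbers on the overflow time, a small sketch. The 3 GHz rate below
is only an illustrative figure, and seconds_until_wrap() and elapsed32()
are made-up names. A 64-bit counter at 3 GHz takes roughly 195 years to
wrap, while a 32-bit counter at 1 GHz wraps after about 4.3 seconds.

```c
#include <assert.h>
#include <stdint.h>

/* Seconds until a counter of the given width wraps at the given rate. */
static double seconds_until_wrap(unsigned bits, double hz)
{
    double counts = 1.0;

    assert(bits > 0 && bits <= 64 && hz > 0.0);
    for (unsigned i = 0; i < bits; i++)
        counts *= 2.0;              /* counts = 2^bits */
    return counts / hz;
}

/* Unsigned subtraction is modulo 2^32, so the difference is still right
 * if the counter wrapped at most once between the two reads. */
static uint32_t elapsed32(uint32_t begin, uint32_t end)
{
    return end - begin;
}
```

For example, elapsed32(0xFFFFFFF0u, 0x10u) gives 0x20: sixteen counts up
to the wrap and sixteen after it.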
>
>>>
>>> On Mon, Oct 3, 2011 at 9:27 AM, Peter Senna Tschudin
>>> <peter.senna at gmail.com> wrote:
>>>> Dear list members,
>>>>
>>>> I'm following:
>>>>
>>>> http://people.virginia.edu/~chg5w/page3/assets/MeasuringUnix.pdf
>>>>
>>>> And I'm trying to measure the execution time of simple operations with RDTSC.
>>>>
>>>> See the code below:
>>>>
>>>> #include <stdio.h>
>>>> #define CPU_THOUSAND_HZ 800000
>>>> typedef unsigned long long ticks;
>>>> static __inline__ ticks getticks(void) {
>>>>        unsigned a, d;
>>>>        asm("cpuid");
>>>>        asm volatile("rdtsc" : "=a" (a), "=d" (d));
>>>>        return (((ticks)a) | (((ticks)d) << 32));
>>>> }
>>>>
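>>>> int main(void) {
>>>>        ticks tickBegin, tickEnd;
>>>>        tickBegin = getticks();
>>>>
>>>>        // code to time
>>>>
>>>>        tickEnd = getticks();
>>>>        /* Cast first: integer division would truncate the result. */
>>>>        double time = (double)(tickEnd - tickBegin) / CPU_THOUSAND_HZ;
>>>>
>>>>        /* %e matches double; %Le expects long double. */
>>>>        printf("%e\n", time);
>>>>        return 0;
>>>> }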
>>>>
>>>> How can the C code detect the correct value for CPU_THOUSAND_HZ? The
>>>> problems I see are:
>>>>  - The information must be collected for the CPU that will run
>>>> the process. On Core i7 processors, different cores can run at
>>>> different clock speeds at the same time.
>>>>  - If the clock changes while the process is running, what should
>>>> it do? When is the best time to collect the clock speed?
>>>>
>>>> The authors of the paper are not sure about the effects of
>>>> "asm("cpuid");" Does it ensure that the entire process will run on the
>>>> same CPU, and will serialize it avoiding out of order execution by the
>>>> CPU?
>>>>
>>>> Thank you very much! :-)
>>>>
>>>> Peter
>>>>
>>>>
>>>> --
>>>> Peter Senna Tschudin
>>>> peter.senna at gmail.com
>>>> gpg id: 48274C36
>>>>
>>>> _______________________________________________
>>>> Kernelnewbies mailing list
>>>> Kernelnewbies at kernelnewbies.org
>>>> http://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies
>>>>
>>>
>>>
>>>
>>> --
>>> Regards,
>>> Peter Teoh
>>>
>>
>>
>>
>> --
>> Peter Senna Tschudin
>> peter.senna at gmail.com
>> gpg id: 48274C36
>>
>
>
>
> --
> Regards,
> Peter Teoh
>



-- 
Peter Senna Tschudin
peter.senna at gmail.com
gpg id: 48274C36


