Module vs Kernel main performance

Tue May 29 19:50:35 EDT 2012

Hi,

I am working on the x86_64 architecture. I profiled a Linux kernel
module with oprofile and noticed that a large share of cycles is spent
in the copy_from_user call. I compared the same flow from the kernel
proper and noticed that, even at higher data throughput, far fewer
cycles are spent in copy_from_user: the kernel proper uses about 1/8 of
the cycles the module does. (A user process keeps sending data, like
iperf.)

I used the perf tool to gather some statistics. For the call from kernel proper:

185,719,857,837 cpu-cycles            #    3.318 GHz                    [90.01%]
 99,886,030,243 instructions          #    0.54  insns per cycle        [95.00%]
  1,696,072,702 cache-references      #   30.297 M/sec                  [94.99%]
    786,929,244 cache-misses          #   46.397 % of all cache refs    [95.00%]
 16,867,747,688 branch-instructions   #  301.307 M/sec                  [95.03%]
     86,752,646 branch-misses         #    0.51% of all branches        [95.00%]
  5,482,768,332 bus-cycles            #   97.938 M/sec                  [20.08%]
   55967.269801 cpu-clock
   55981.842225 task-clock            #    0.933 CPUs utilized

and for the call from the kernel module:

  9,388,787,678 cpu-cycles            #    1.527 GHz                    [89.77%]
  1,706,203,221 instructions          #    0.18  insns per cycle        [94.59%]
    551,010,961 cache-references      #   89.588 M/sec                  [94.73%]
    369,632,492 cache-misses          #   67.083 % of all cache refs    [95.18%]
    291,358,658 branch-instructions   #   47.372 M/sec                  [94.68%]
     10,291,678 branch-misses         #    3.53% of all branches        [95.01%]
    582,651,999 bus-cycles            #   94.733 M/sec                  [20.55%]
    6112.471585 cpu-clock
    6150.490210 task-clock            #    0.102 CPUs utilized
            367 page-faults           #    0.000 M/sec
            367 minor-faults          #    0.000 M/sec
              0 major-faults          #    0.000 M/sec
         25,770 context-switches      #    0.004 M/sec
             23 cpu-migrations        #    0.000 M/sec

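So the CPU is evidently stalling while it copies data in the module
case (0.18 vs 0.54 instructions per cycle), and the cache miss rate is
higher (67% vs 46% of cache references). My question is: is there a
difference between calling copy_from_user from the kernel proper and
calling it from an LKM?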
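
For anyone who wants to re-check the ratios, here is a small Python
sketch that recomputes them from the raw counters in the two perf runs
above. The dictionary and function names are my own, picked only for
illustration. The two runs have very different task-clock totals (about
56 s vs about 6 s), so I compare only per-run ratios, not absolute
counts. The counters are also multiplexed (the bracketed percentages),
so treat the results as estimates.

```python
# Raw counter values copied from the two perf runs in this post.
KERNEL_PROPER = {
    "cpu-cycles": 185_719_857_837,
    "instructions": 99_886_030_243,
    "cache-references": 1_696_072_702,
    "cache-misses": 786_929_244,
    "branch-instructions": 16_867_747_688,
    "branch-misses": 86_752_646,
}

MODULE = {
    "cpu-cycles": 9_388_787_678,
    "instructions": 1_706_203_221,
    "cache-references": 551_010_961,
    "cache-misses": 369_632_492,
    "branch-instructions": 291_358_658,
    "branch-misses": 10_291_678,
}


def derived(counters):
    """Return per-run ratios that do not depend on how long the run lasted."""
    return {
        # Instructions retired per cycle; a low value suggests stalls.
        "ipc": counters["instructions"] / counters["cpu-cycles"],
        # Inverse view: average cycles needed per instruction.
        "cpi": counters["cpu-cycles"] / counters["instructions"],
        # Share of cache references that missed, in percent.
        "cache_miss_pct": 100.0 * counters["cache-misses"]
        / counters["cache-references"],
        # Share of branches that were mispredicted, in percent.
        "branch_miss_pct": 100.0 * counters["branch-misses"]
        / counters["branch-instructions"],
    }


if __name__ == "__main__":
    for name, counters in (("kernel proper", KERNEL_PROPER), ("module", MODULE)):
        d = derived(counters)
        print(
            f"{name:14s} ipc={d['ipc']:.2f} cpi={d['cpi']:.2f} "
            f"cache-miss={d['cache_miss_pct']:.2f}% "
            f"branch-miss={d['branch_miss_pct']:.2f}%"
        )
```

The ratios match what perf printed. By this measure the module path
needs roughly three times as many cycles per instruction as the kernel
proper path, which is consistent with stalls on memory but does not by
itself prove that copy_from_user is the cause.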