Question : Interpretation of load average vs. CPU utilization

Hi,

I have a server with 16 CPU cores (4 sockets, each quad-core) and a multi-threaded batch program (32 threads) that is expected to be mainly CPU-bound.

When I check CPU usage, it shows only about 50%:
Cpu(s): 55.2% us,  2.0% sy,  0.0% ni, 42.0% id,  0.0% wa,  0.0% hi,  0.7% si


So that shows a very under-utilized system.

However load average is about 13:
load average: 13.78, 13.21, 13.41


13 out of 16 CPU cores is about 80%, which does not match the CPU usage above.

That was from top.
Could you please explain where my interpretation is wrong?

I checked with sar as well:

$ sar -p
09:20:01 AM       CPU     %user     %nice   %system   %iowait     %idle
09:00:02 AM       all     49.68      0.00      2.46      0.01     47.85
09:10:01 AM       all     49.73      0.00      2.63      0.01     47.64
09:20:01 AM       all     49.71      0.00      2.50      0.01     47.78
09:30:01 AM       all     51.00      0.00      2.72      0.01     46.27

$ sar -q
09:20:01 AM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
09:00:02 AM        15      1218     13.58     13.23     13.31
09:10:01 AM        13      1218     13.99     14.21     13.72
09:20:01 AM        11      1212     14.19     13.74     13.66
09:30:01 AM        13      1216     13.86     13.79     13.60


I have one more question: runq-sz is defined as the number of processes waiting for run time. Does that mean that at 9:30 there were 13 processes waiting for an available CPU while the CPU was 46% idle? Or is runq-sz included in the load average, meaning that I have 13 processes running on the CPUs, in which case I should expect CPU usage of about 80%?

Thanks,
Franck.

Answer : Interpretation of load average vs. CPU utilization

1) The load is calculated in different ways,
but this link presents it nicely: http://www.teamquest.com/resources/gunther/display/5/index.htm
The load average is the sum of the run queue length and the number of jobs currently running on the CPUs.

So, taking the last two numbers of each triple as the run queue length and the number of running jobs:
(30, 0, 15) load is 15 = 0 + 15
(30, 8, 7) load is 15 = 8 + 7
(30, 8, 6) load is 14 = 8 + 6
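
The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the quoted model, not how any kernel actually computes the value: the function names are made up for this example, and it assumes the second and third numbers of each triple are the run queue length and the number of running jobs (the first number is not used). The last line simply redoes the "13 among 16" estimate from the question with the 1-minute value reported by top.

```python
def load_from_state(run_queue: int, running: int) -> int:
    """Load under the quoted model: jobs waiting plus jobs running."""
    return run_queue + running


def load_per_core(load_avg: float, cores: int) -> float:
    """Load average divided by the number of cores."""
    return load_avg / cores


# Triples from the examples above; the first value is ignored (assumption).
samples = [(30, 0, 15), (30, 8, 7), (30, 8, 6)]
loads = [load_from_state(rq, run) for _, rq, run in samples]
print(loads)  # [15, 15, 14]

# 1-minute load average from top, on the 16-core server from the question.
print(round(load_per_core(13.78, 16), 2))  # 0.86
```

Note that a load per core of 0.86 counts jobs that are waiting as well as jobs that are running, so under this model it is not directly comparable to the %user + %system figures.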

2) Yes.
In addition: it is quite complicated and happens really fast in real life.
I recommend you read Solaris Internals (http://www.solarisinternals.com/) and the accompanying book.
It is written for Solaris, but in the end all kernels are based on the same ideas with different implementations.

3) Yes, 30,000 is a large number of syscalls.
You can use the "strace" command to get an idea of which syscalls are called most often.
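
As a rough sketch of what "which syscall is called most often" could look like in practice, the Python snippet below tallies syscall names from lines in the usual strace output format. The sample lines are invented for illustration (they are not from the server in the question), and the regular expression is a simplification that only looks at the name before the opening parenthesis. strace can also produce a summary table by itself with its -c option.

```python
import re
from collections import Counter

# Optional "[pid N] " or "N " prefix, then the syscall name before "(".
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_][a-z0-9_]*)\(")


def count_syscalls(lines):
    """Count syscall names; lines that are not calls (signals, resumed) are skipped."""
    counts = Counter()
    for line in lines:
        match = SYSCALL_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical strace-style lines, made up for this example.
sample = [
    'futex(0x7f3a5c0, FUTEX_WAIT_PRIVATE, 0, NULL) = 0',
    '[pid  4242] read(3, "abc", 4096) = 3',
    '4243 futex(0x7f3a5c0, FUTEX_WAKE_PRIVATE, 1) = 1',
    '[pid  4242] write(1, "ok\\n", 3) = 3',
    '--- SIGCHLD {si_signo=SIGCHLD} ---',
    'futex(0x7f3a5c0, FUTEX_WAIT_PRIVATE, 0, NULL) = 0',
]

for name, n in count_syscalls(sample).most_common():
    print(name, n)
```

With real data, the lines would come from a file written by strace rather than from a hard-coded list.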

It looks like there is software called kerneltrap that might help you (I searched for "dtrace linux" on Google).