Topping top in Solaris 8 with prstat

Article

Topping top in Solaris 8 with prstat

By Tom Kincaid and Andrei Dorofeev, March 2001

Introduction

This article describes how the prstat utility can be used to provide views of a system’s activity and resource consumption. In addition, identifying activity and resource consumption of individual processes and threads is discussed. The advantages of prstat over top are identified. Examples of commonly encountered performance problems, and solutions that utilize prstat, are included.

This article will help:

Solaris developers understand how their applications are consuming system resources and how the applications are spending time. With this understanding, developers can identify and correct resource leaks in their applications and gain an understanding of how their applications can be changed to perform better on the Solaris platform.
Solaris system administrators understand how prstat can be used to identify the resources consumption taking place on their systems. With this understanding of resource consumption, system administrators can identify system performance problems and correct them.

Different Views of the System

When users see a noticeable drop in system performance, it is almost always because a resource is being exhausted. The resources most commonly exhausted are CPU, memory, and disks. The Solaris operating environment has many tools for identifying which resources is being overtaxed. Some common Solaris tools for monitoring a system’s usage of memory, CPU, and disks are vmstat, mpstat, and iostat respectively. In addition, the Solaris performance tool perfmeter can be used to graphically monitor system resource utilization on systems that have the appropriate graphic ability.

Note: On some Solaris servers, windowing systems are not available. For this reason, the subject of viewing resource utilization graphically will not be presented in any detail in this article.

Once you have identified which resource is being exhausted, you can use prstat to identify which processes causing it.

Finding Processes that are Using Up the CPU

Suppose you are using a Solaris system to write a document in StarOffice and you notice that it is taking a long time to process large spreadsheet calculations and other tasks.

If you suspect that the system is behaving poorly because the CPU resources are being overtaxed, a quick way to get some kernel statistics on CPU usage is to use the mpstat command.

The mpstat 5 5 command will print the CPU statistics 5 times at 5 second intervals. The following is a sample output of this command.

kincaid/tartan 49 $ mpstat 5 5

CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl

0 1 0 0 345 224 589 220 0 0 0 799 29 1 0 70

0 1 0 0 302 200 752 371 0 0 0 1191 99 1 0 0

0 0 0 0 341 221 767 375 0 0 0 1301 100 0 0 0

0 0 0 0 411 256 776 378 0 0 0 1313 99 1 0 0

0 0 0 0 382 241 738 363 0 0 0 1163 97 3 0 0

kincaid/tartan 50 $

In the output sample above, 4 of the 5 samples have CPU 0 with a combined user time and sytem time at 100 and idle time at 0 (column headings usr, sys, idl). This indicates that the CPU is completely consumed on this system.

perfmeter output

After gathering the data from perfmeter and mpstat, which indicates that the system CPU resources are overtaxed, you can use prstat to identify which processes are consuming the CPU resources. The prstat -s cpu -n 5 command is used to list the five processes that are consuming the most CPU resources. The -s cpu flag tells prstat to sort the output by CPU usage. The -n 5 flag tells prstat to restrict the output to the top five processes.

kincaid/tartan 43 $ prstat -s cpu -n 5

`PID`	`USERNAME`	`SIZE`	`RSS`	`STATE`	`PRI`	`NICE`	`TIME`	`CPU`	`PROCESS/NLWP`
`13974`	`kincaid`	`888K`	`432K`	`run`	`40`	`0`	`36:14.51`	`67%`	`cpuhog/1`
`27354`	`kincaid`	`2216K`	`1928K`	`run`	`31`	`0`	`314:48.51`	`27%`	`server/5`
`14690`	`root`	`136M`	`46M`	`sleep`	`59`	`0`	`0:00.59`	`2.3%`	`Xsun/1`
`14797`	`kincaid`	`9192K`	`7496K`	`sleep`	`59`	`0`	`0:00.10`	`0.9%`	`dtwm/8`
`14851`	`kincaid`	`24M`	`14M`	`sleep`	`48`	`0`	`0:00.03`	`0.3%`	`netscape/1`
`Total: 97 processes, 190 lwps, load averages: 2.18, 2.15, 2.11`

In the example output, there is a process named cpuhog that is consuming the majority (67 %) of the CPU cycles. If this process is killed using the kill -9 13974 command, and the mpstat command is subsequently repeated, the output shows that the CPU is idle the majority of the time. The office application that was tested becomes very responsive and the spreadsheet calculations complete faster.

Note: The -s cpu option is the default setting for prstat. Therefore, if the intent is to sort output by CPU usage, specifying the -s cpu option is not necessary. For the purpose of this article, the -s cpu option is set to distinguish it from other ways of sorting the output produced by prstat.

mpstat 5 5

CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl

0 1 0 0 348 225 577 219 0 0 0 780 32 1 0 67

0 1 0 0 307 206 353 104 0 0 0 433 0 0 0 100

0 0 0 0 304 203 347 103 0 0 0 359 0 0 0 100

0 0 0 0 303 203 347 104 0 0 0 368 0 0 0 100

0 0 0 0 318 211 349 103 0 0 0 369 0 0 0 100

kincaid/tartan 53 $

Determining User Consumption

Occassionally there are many small processes, each of which consume a small piece of the CPU. On a system such as a computing server that is shared by many users, prstat can be used to determine which user (as opposed to which processes) is consuming the most resources. If the user consuming the most resources on the system can be identified, it is possible to move at least part of the work to another machine. To have prstat report statistics about resource consumption by user, add the -a option to the prstat command line.

Adding the -a option to any prstat command will identify how many processes each user is using, what percent of the CPUs, and how much memory, they are using on a system, as shown above. The command prstat -s cpu -a -n 8 asks for the top 8 processes consuming the CPU and a list of resource consumption statistics for each user.

The output below shows that user larry is consuming the most CPU resources.

kincaid/tartan 43 $ prstat -s cpu -a -n 8

`PID`	`USERNAME`	`SIZE`	`RSS`	`STATE`	`PRI`	`NICE`	`TIME`	`CPU`	`PROCESS/NLWP`
`17005`	`larry`	`888K`	`432K`	`run`	`21`	`0`	`0:03.15`	`38%`	`cpuhog/1`
`17015`	`larry`	`888K`	`432K`	`run`	`21`	`0`	`0:03.06`	`36%`	`cpuhog/1`
`17175`	`larry`	`944K`	`872K`	`run`	`24`	`0`	`0:00.37`	`5.7%`	`find/1`
`16911`	`moe`	`944K`	`872K`	`sleep`	`58`	`0`	`0:00.48`	`3.3%`	`find/1`
`16915`	`moe`	`944K`	`872K`	`sleep`	`59`	`0`	`0:00.43`	`3.3%`	`find/1`
`17849`	`curly`	`944K`	`872K`	`run`	`31`	`0`	`0:00.00`	`3.0%`	`find/1`
`16472`	`root`	`132M`	`42M`	`sleep`	`59`	`0`	`0:01.00`	`0.9%`	`Xsun/1`
`16827`	`kincaid`	`6864K`	`4704K`	`sleep`	`48`	`0`	`0:00.05`	`0.4%`	`dtterm/1`

`NPROC`	`USERNAME`	`SIZE`	`RSS`	`MEMORY`	`TIME`	`CPU`
`7`	`larry`	`7504K`	`5656K`	`0.6%`	`0:06.58`	`80%`
`8`	`moe`	`8248K`	`6800K`	`0.7%`	`0:01.31`	`6.6%`
`3`	`curly`	`3336K`	`2832K`	`0.3%`	`0:00.00`	`3.0%`
`34`	`root`	`213M`	`95M`	`9.5%`	`0:03.05`	`1.0%`
`78`	`kincaid`	`433M`	`294M`	`30%`	`0:00.38`	`0.7%`
`Total: 132 processes, 218 lwps, load averages: 3.90, 4.29, 2.45`

kincaid/tartan 44 $

Looking at Processor Sets

The Solaris operating environment is used to host applications on multi-processor machines. Frequently, system administrators create what is known as a processor set. A processor set is a group of CPUs to which tasks can be assigned. For instance, consider a machine containing four CPUs that is being used to run a web server and an application server. One approach to partitioning the system CPU resources between these two tasks is to create two processors with two CPUs each, and to bind the web server to one processor set and the application server to the other processor set. This avoids having one server consume all the CPU resources on the machine.

Included below is a set of commands you can use to create two processor sets.

# psrset -c 0
created processor set 1
processor 0: was not assigned, now 1

# psrset -a 1 1
processor 1: was not assigned, now 1

# psrset -c 2
created processor set 2
processor 2: was not assigned, now 2

# psrset
user processor set 1: processors 0 1
user processor set 2: processor 2

After creating processor sets, prstat commands can be restricted to retrieve the process statistics for processes bound to a specific processor set. This is accomplished with the -C option. For example, prstat -C 1 -s cpu will report all the process activity for processes bound to processor set 1, and sort the results by CPU usage. This is extremely useful for identifying which processes are running on what processor sets.

Evaluating the CPU column for processor sets is also useful for determining how busy a processor set is, which then determines what processor set to assign future computing tasks to.

Note: The CPU column of prstat always reports the percentage of system CPU resources a process is consuming and not the percentage of CPU resources of a processor or a processor set, even if the -C option is specified on the command line. For example, if there is a two-processor set on a four-processor system and a prstat -C is executed on the processor set, since the processor set has 50% of the system’s CPUs, the total percentages of the CPU column will not exceed 50%.

Identifying Virtual Memory Usage

When the active computing tasks on a system require memory resources that exceed the physical memory available on the machine, the system moves units of memory called pages to disk to make memory available for an active section of a program. This is known as paging. When a system starts paging, there will be a significant drop in performance. A good tool for determining if a system is paging is sar. The sar -g command will yield the paging statistics for a given system.

A common cause for system paging is a process or group of processes that are using a majority of the system’s memory. prstat is an excellent tool for identifying which processes are consuming the majority of a system’s memory. Use prstat -s size, which is similar to the previous command, but which sorts prstat output by size instead of by CPU usage.

The following output illustrates sar -g for a system that is paging at a very high rate.

sar -g 5 5

SunOS tartan 5.8 Generic_108528-01 sun4u 02/12/01

`13:20:37`	`pgout/s`	`ppgout/s`	`pgfree/s`	`pgscan/s`	`%ufs_ipf`
`13:20:42`	`39.92`	`538.72`	`670.26`	`1147.31`	`0.00`
`13:20:47`	`36.60`	`483.80`	`515.40`	`353.80`	`0.00`
`13:20:52`	`40.20`	`508.20`	`632.00`	`1125.20`	`0.00`
`13:20:57`	`35.80`	`462.60`	`580.40`	`1141.60`	`0.00`
`13:21:02`	`0.00`	`0.00`	`0.00`	`0.00`	`0.00`
`Average`	`30.51`	`398.72`	`479.69`	`753.74`	`0.00`

Once it has been determined that the system performance drop off is a result of heavy paging activity, the next step is to determine which processes have introduced the increase. Also, any time scanning occurs (as indicated by the column pgscan/s in the above output) there is a memory shortage on the system. It is not easy to identify all the reasons for paging, but identifying the processes that are consuming the most virtual memory is a good start. To view the process consuming the most virtual memory, use the command prstat with the -s size option. The command prstat -s size -n 5 provides the top five processes on a system in terms of virtual memory consumption. Included below is the output from the prstat -s size -n 5 command on the system on which the above sar command was run.

tartan% prstat -s size -n 5

`PID`	`USERNAME`	`SIZE`	`RSS`	`STATE`	`PRI`	`NICE`	`TIME`	`CPU`	`PROCESS/NLWP`
`21307`	`kincaid`	`1001M`	`616M`	`run`	`2`	`0`	`0:01.16`	`32%`	`memhog/1`
`16472`	`root`	`138M`	`43M`	`sleep`	`59`	`0`	`0:17.28`	`1.2%`	`Xsun/1`
`18133`	`kincaid`	`92M`	`31M`	`sleep`	`49`	`0`	`0:01.42`	`0.0%`	`soffice.bin/9`
`16574`	`kincaid`	`44M`	`24M`	`sleep`	`49`	`0`	`0:10.37`	`0.2%`	`.netscape.bin/1`
`16674`	`kincaid`	`36M`	`25M`	`sleep`	`49`	`0`	`0:00.08`	`0.0%`	`sdtperfmeter/1`
`Total: 130 processes, 220 lwps, load averages: 0.51, 0.36, 0.23`

One process in the example is using over 1000 megabytes of virtual memory. The system only has 1 gigabyte of physical memory total. The process with ID 21307, memhog, is most likely the process that is slowing down the system.

After the kill -9 21307 command is issued to terminate the process on the system, the performance returns to normal and repeating the sar command shows that all paging and scanning have ceased, as shown in the following output.

tartan% sar -g 5 5

SunOS tartan 5.8 Generic_108528-01 sun4u 02/12/01

`13:20:02`	`pgout/s`	`ppgout/s`	`pgfree/s`	`pgscan/s`	`%ufs_ipf`
`13:20:07`	`0.00`	`0.00`	`0.00`	`0.00`	`0.00`
`13:20:12`	`0.00`	`0.00`	`0.00`	`0.00`	`0.00`
`13:20:17`	`0.00`	`0.00`	`0.00`	`0.00`	`0.00`
`13:20:22`	`0.00`	`0.00`	`0.00`	`0.00`	`0.00`
`13:20:27`	`0.00`	`0.00`	`3.80`	`590.60`	`0.00`
`Average`	`0.00`	`0.00`	`0.76`	`118.07`	`0.00`

An In-depth Look at Individual Processes

The previous section decribed how prstat can be used to identify processes that are causing system performance issues. This section describes how to examine the details of individual processes using prstat.

The techniques presented in this section are useful for debugging certain classes of bugs and problems frequently encountered when developing or running server applications.

prstat, like top, periodically updates the screen display with a new set of statistics for processes. There are two command line options that can be used to focus in on specific processes and then monitor the process for a longer period.

Focusing on Specific Processes

To limit the statistics reported by prstat to a specific process or a set of processes, use the -p option followed by the list of process IDs. Process IDs can be obtained with the ps -ef command. This can be a useful way to gather information on a running process that you suspect has some performance issues.

Redirecting the Output of `prstat`

By default, prstat will update a section of a terminal screen each time it reports new statistics about the processes it is monitoring. However, if the objective is to determine if a process is leaking resources or to learn how the process behaves over time, it is useful to have multiple sets of data reports from prstat for comparison.

If you redirect the output of prstat to a file, each set of statistics produced by prstat will be preserved in the file.

Tracking Down Resource Leaks

Consider the following scenario. Suppose you suspect that the server application received from a software vendor contains a slow memory leak. To illustrate the problem to the vendor, run a prstat command that is limited to the process ID of the server application and gathers the statistics every 15 seconds. The following code example shows how you can do this.

kincaid/tartan 74 $ server &

[1] 2423
kincaid/tartan 75 $
kincaid/tartan 56 $ prstat -p 2443 15 > server.out &
kincaid/tartan 57 $ tail -f server.out
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
  2443 kincaid    32M 1008K sleep   58    0   0:00.00 0.0% server/1
Total: 1 processes, 1 lwps, load averages: 0.00, 0.01, 0.01
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
  2443 kincaid    47M 1248K sleep   58    0   0:00.00 0.0% server/1
Total: 1 processes, 1 lwps, load averages: 0.00, 0.01, 0.01
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
  2443 kincaid    62M 1488K sleep   58    0   0:00.00 0.0% server/1
Total: 1 processes, 1 lwps, load averages: 0.00, 0.01, 0.01
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
  2443 kincaid    77M 1728K sleep   58    0   0:00.00 0.0% server/1
Total: 1 processes, 1 lwps, load averages: 0.00, 0.01, 0.01
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
  2443 kincaid    92M 1968K sleep   58    0   0:00.00 0.0% server/1
Total: 1 processes, 1 lwps, load averages: 0.00, 0.01, 0.01
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
  2443 kincaid   107M 2208K sleep   58    0   0:00.00 0.0% server/1

From the data gathered by prstat, it is clear that the process is growing by roughly 15 meg. every 15 seconds. In addition, the resident set size of the process is growing by roughly 40K every 15 seconds as well. While there may be explanations for this other than a memory leak in the server application, the data, like that shown in the code example, should raise strong suspicions about there being a memory leak in the application.

Here is an example of a how prstat can be used to observe a Java server application leaking threads. prstat always lists the number of lwps (threads) in each process.

kincaid/tartan 47 $ prstat -p 3221 > threads.out &

[1] 3235
kincaid/tartan 48 $ tail -f threads.out
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
  3221 kincaid    29M   11M run     58    0   0:00.00  32% java/43
Total: 1 processes, 43 lwps, load averages: 1.29, 1.20, 1.18
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
  3221 kincaid    29M   11M sleep   58    0   0:00.00  36% java/45
Total: 1 processes, 45 lwps, load averages: 1.44, 1.23, 1.19
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
  3221 kincaid    29M   11M sleep   58    0   0:00.00  38% java/48
Total: 1 processes, 48 lwps, load averages: 1.43, 1.23, 1.19
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
  3221 kincaid    29M   11M run     44    0   0:00.00  45% java/50
Total: 1 processes, 50 lwps, load averages: 1.65, 1.29, 1.21
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
  3221 kincaid    29M   11M sleep   58    0   0:00.00  52% java/52
Total: 1 processes, 52 lwps, load averages: 2.03, 1.37, 1.24
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
  3221 kincaid    29M   11M run     48    0   0:00.00  55% java/55

This output shows the number of lwps increasing over time. Hence, it is possible that the server application is leaking threads.

Getting the Resource Statistics for Each Thread Within a Process.

You can also use prstat to find other resource leaks. A very nice feature of prstat is that by using the -L switch, prstat will report statistics for each thread of a process.

This has a variety of uses, including:

The ability to see how the balance of the work is being distributed across the pool by viewing the CPU usage of each thread.
The option to further narrow down resource leaks to individual threads and not just a process. For example you can determine which thread is leaking memory.

To illustrate, the following output shows the resource statistics for each thread of a server application.

^C
kincaid/tartan 49 $
kincaid/tartan 49 $ prstat -L -p 3295

`PID`	`USERNAME`	`SIZE`	`RSS`	`STATE`	`PRI`	`NICE`	`TIME`	`CPU`	`PROCESS/LWPID`
`3295`	`kincaid`	`28M`	`10M`	`sleep`	`38`	`0`	`0:00.01`	`2.1%`	`java/16`
`3295`	`kincaid`	`28M`	`10M`	`sleep`	`55`	`0`	`0:00.01`	`1.9%`	`java/17`
`3295`	`kincaid`	`28M`	`10M`	`sleep`	`48`	`0`	`0:00.01`	`1.8%`	`java/15`
`3295`	`kincaid`	`28M`	`10M`	`sleep`	`58`	`0`	`0:00.01`	`1.8%`	`java/23`
`3295`	`kincaid`	`28M`	`10M`	`sleep`	`52`	`0`	`0:00.01`	`1.7%`	`java/12`
`3295`	`kincaid`	`28M`	`10M`	`sleep`	`48`	`0`	`0:00.01`	`1.6%`	`java/22`
`3295`	`kincaid`	`28M`	`10M`	`sleep`	`58`	`0`	`0:00.01`	`1.5%`	`java/13`
`3295`	`kincaid`	`28M`	`10M`	`sleep`	`58`	`0`	`0:00.01`	`1.5%`	`java/14`
`3295`	`kincaid`	`28M`	`10M`	`sleep`	`48`	`0`	`0:00.01`	`1.4%`	`java/19`
`3295`	`kincaid`	`28M`	`10M`	`sleep`	`48`	`0`	`0:00.01`	`1.4%`	`java/18`
`3295`	`kincaid`	`28M`	`10M`	`sleep`	`38`	`0`	`0:00.01`	`1.4%`	`java/21`
`3295`	`kincaid`	`28M`	`10M`	`sleep`	`58`	`0`	`0:00.01`	`1.3%`	`java/24`
`3295`	`kincaid`	`28M`	`10M`	`sleep`	`58`	`0`	`0:00.01`	`1.2%`	`java/20`
`3295`	`kincaid`	`28M`	`10M`	`sleep`	`58`	`0`	`0:00.00`	`0.0%`	`java/1`
`3295`	`kincaid`	`28M`	`10M`	`sleep`	`58`	`0`	`0:00.00`	`0.0%`	`java/11`
`3295`	`kincaid`	`28M`	`10M`	`sleep`	`0`	`0`	`0:00.00`	`0.0%`	`java/10`
`3295`	`kincaid`	`28M`	`10M`	`sleep`	`59`	`0`	`0:00.00`	`0.0%`	`java/9`
`3295`	`kincaid`	`28M`	`10M`	`sleep`	`0`	`0`	`0:00.00`	`0.0%`	`java/8`
`3295`	`kincaid`	`28M`	`10M`	`sleep`	`0`	`0`	`0:00.00`	`0.0%`	`java/7`
`3295`	`kincaid`	`28M`	`10M`	`sleep`	`59`	`0`	`0:00.00`	`0.0%`	`java/6`
`3295`	`kincaid`	`28M`	`10M`	`sleep`	`58`	`0`	`0:00.00`	`0.0%`	`java/5`
`Total: 1 processes, 24 lwps, load averages: 1.30, 1.22, 1.21`

kincaid/tartan 50 $

Getting the micro statistics on a process

To examine further what a process is doing, you can use the -m option to have prstat print out the micro statistics of the process. When you specify the -m option, a different set of columns is presented by prstat that represent how the process spent its time since the last sample was taken. The following table lists the micro statistic column headings and their meanings.

*Column Heading*	*Meaning*
`USR`	The percentage of time the process has spent in user mode
`SYS`	The percentage of time the process has spent in system mode
`TRP`	The percentage of time the process has spent in processing system traps
`DFL`	The percentage of time the process has spent processing data page faults
`LCK`	The percentage of time the process has spent waiting for user locks
`SLP`	The percentage of time the process has spent sleeping
`TFL`	The percentage of time the process has spent processing text page faults

Following is sample output from using the -m option on the server application in the previous section.

prstat -m -p 3295
     PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP
  3295 kincaid  0.8 0.0 0.0 0.0 0.0 0.0  99 0.0  22   4  25   0 java/24

Explaining the meaning and implications of all these columns is beyond the scope of this paper. However, as you become familiar with operating system concepts and Solaris internals, the ability to monitor the micro statistics of an individual process or a group of processes becomes a very powerful tool, especially when trying to identify performance-related problems or issues. In addition, the -m option can be used with the -L option so the micro statistics of each thread of a process can be monitored.

Other Solaris Performance Tools

As mentioned previously, prstat can be used with other Solaris performance monitoring tools to strengthen a developer’s arsenal for debugging and diagnosing system performance issues. Some of the other Solaris tools that are worth reading about are: sar, iostat, netstat, and mpstat. All of these are great tools for identifying system performance issues. prstat is a nice complement to these tools because it helps further identify the processes and threads that may be responsible for system performance problems.

Conclusion

prstat is a great addition to the Solaris tool set. It is standard, beginning with Solaris 2.8. Developers no longer need to track down a version of top with each new release of Solaris. prstat has all of the most commonly used features of top plus several very powerful features not found in top.

March 2001

Macro 32 Ramblings

Mind Archive