Using HW Performance Counters via perf_event Linux kernel calls

Nina Engelhardt Merten Sach Sean Halle

1 Motivation and Purpose

Several projects in the AES group require measurements of various things happening inside the CPU cores and inter-core communication network. One approach is to use the Time Stamp Counter (TSC) to measure intervals. However, in the course of doing so, apparent inconsistencies have arisen. Also, the TSC is limited, as it only reports clock cycles, and reports all clock cycles, including time spent in the kernel and time when the thread recording is swapped out.

The desired functionality includes measuring:

  • Time (excluding swap-out periods and system calls, and adjusted for frequency changes)
  • Instructions executed (excluding kernel instructions)
  • Cache events
  • Counter values read at specific locations in the VMS code

After a survey of approaches to using the hardware performance counters built into x86 processors, we chose the perf_event system, which is built into the Linux kernel.

2 Performance Counters for Linux

Since version 2.6.31, the “performance counters” subsystem has been included in the mainline Linux kernel. It is a unified interface allowing

  • machine-independent access to a set of predefined performance counters that, depending on the hardware’s capabilities, are either implemented by hardware performance counters or estimated by sampling,
  • machine-specific access through “raw” mode, where the hardware counter configuration bits are given directly.

Usually this system is used via the “perf” tool, which works similarly to other profiling tools, by launching a given command as a child process and measuring its progress. However, if more specific measurements are needed, the performance counters can be set up and measurements made at specific points in the application code.

2.1 Prerequisites

  • Kernel compiled with CONFIG_PERF_COUNTERS=y (the option is named CONFIG_PERF_EVENTS from kernel 2.6.32 onward)
  • /proc/sys/kernel/perf_event_paranoid contains -1 (for access by non-privileged users)
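
The second prerequisite can be checked from a program before any counter is opened. The following is a minimal sketch, not part of the original setup; the function name read_paranoid_level is a hypothetical helper introduced here for illustration.

```c
#include <stdio.h>

/* Hypothetical helper: reads the integer stored in a file such as
 * /proc/sys/kernel/perf_event_paranoid.
 * Returns 0 and stores the value in *level on success.
 * Returns -1 and leaves *level untouched on failure. */
int read_paranoid_level(const char *path, int *level)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    int value;
    int ok = (fscanf(f, "%d", &value) == 1);
    fclose(f);

    if (!ok)
        return -1;
    *level = value;
    return 0;
}
```

A caller would pass "/proc/sys/kernel/perf_event_paranoid" as the path and print a warning when the call fails or when the stored level is greater than -1.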

2.2 Usage

A counter is set up with a number of properties, which are set via the contents of struct perf_event_attr, defined in <linux/perf_event.h>. Of particular interest is the field config, which contains the bit pattern indicating the kind of counter to set up. This can be either one of the predefined counter types, such as PERF_COUNT_HW_INSTRUCTIONS, or a raw value [?].
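
To make the two kinds of config value concrete, here is a minimal sketch of filling in the struct for a predefined counter and for a raw one. The helper names are hypothetical. The raw encoding shown (event select in bits 0-7, unit mask in bits 8-15) is an assumption that holds for common x86 event-select registers; the processor manual for the target CPU remains the authority.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>

/* Zero the whole struct first, then set only the fields that are needed. */
static void init_attr(struct perf_event_attr *attr, uint32_t type,
                      uint64_t config)
{
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->type = type;
    attr->config = config;
}

/* Predefined counter, e.g. PERF_COUNT_HW_INSTRUCTIONS. */
void init_predefined_attr(struct perf_event_attr *attr, uint64_t hw_event_id)
{
    init_attr(attr, PERF_TYPE_HARDWARE, hw_event_id);
}

/* Raw counter. ASSUMPTION: x86-style layout, with the event select in
 * bits 0-7 and the unit mask in bits 8-15 of config. */
void init_raw_attr(struct perf_event_attr *attr, uint8_t event_select,
                   uint8_t unit_mask)
{
    init_attr(attr, PERF_TYPE_RAW,
              ((uint64_t)unit_mask << 8) | (uint64_t)event_select);
}
```

Either helper leaves every other field at zero, so flags such as disabled or exclude_kernel would still have to be set by the caller before the struct is handed to the system call.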

The syscall itself also allows choosing process vs core behavior. The counter can be attached to a specific process, or it can be attached to a core. It can also be attached to both. When attached to a process, but not a core, the kernel will stop the counter when the process loses its current core and start it again the next time it gets scheduled, on whichever core it gets scheduled to. When attached to a process and a core, the counter only counts when the process is scheduled on the target core. And when attached just to a core, the counter logs events regardless of which process is running.

Thanks to these features, it is possible to track all events from a process (and only that process), no matter how it happens to be scheduled; or it is possible to separately track the usage of each core by a process.
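
The attachment choices map onto the pid and cpu arguments of the system call. The sketch below is an illustration under stated assumptions rather than the code used in the project: the enum, classify_attachment and open_counter are hypothetical names, and the convention that -1 means "not attached" for either argument is taken from the perf_event_open man page.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* glibc provides no wrapper for this system call, so one is defined here. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu,
                            int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

enum attach_mode {
    ATTACH_PROCESS_ANY_CORE,  /* follows the process wherever it is scheduled */
    ATTACH_PROCESS_ONE_CORE,  /* counts only while the process is on that core */
    ATTACH_CORE_ANY_PROCESS,  /* counts whatever runs on that core */
    ATTACH_INVALID            /* neither a process nor a core was given */
};

/* pid == 0 means the calling process; -1 means "no process" or "no core". */
enum attach_mode classify_attachment(pid_t pid, int cpu)
{
    if (pid >= 0 && cpu == -1)
        return ATTACH_PROCESS_ANY_CORE;
    if (pid >= 0 && cpu >= 0)
        return ATTACH_PROCESS_ONE_CORE;
    if (pid == -1 && cpu >= 0)
        return ATTACH_CORE_ANY_PROCESS;
    return ATTACH_INVALID;
}

/* Opens an ungrouped counter; returns the file descriptor or -1. */
int open_counter(struct perf_event_attr *attr, pid_t pid, int cpu)
{
    if (classify_attachment(pid, cpu) == ATTACH_INVALID)
        return -1;
    return (int)perf_event_open(attr, pid, cpu, -1, 0UL);
}
```

For instance, open_counter(&attr, 0, -1) would track the calling process across all cores, while one call per core with open_counter(&attr, 0, core) would yield separate per-core figures for the same process.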

2.3 Details of format

There are two files of interest. The first, linux/perf_event.h, defines struct perf_event_attr. It is the actual code compiled into the kernel, so it is always the most up to date. The second, design.txt, gives a full explanation of how the counter system works, including details of the bit formats of particular fields. Although design.txt describes an older version that has fewer features, what it does describe is still accurate. perf_event.h contains newer fields that design.txt does not explain. These fields must be set to 0 to indicate that they are not being used. In fact, everything that is not set to a specific value must be set to 0, even fields that the documentation claims are not used. design.txt can be found at: https://github.com/torvalds/linux/blob/master/tools/perf/design.txt

The config field selects the events recorded by the hardware counter. It takes either one of the predefined constants or, in raw mode, a bit pattern specified by user code that is written directly into the hardware counter configuration register. For the non-raw case, design.txt gives the meaning of the bit fields.

struct perf_event_attr has several fields that design.txt does not explain. It is important to zero these out: even fields marked as “reserved” must contain zero, otherwise setting up the counter fails. The field “type” of perf_event_attr is not documented there either, but a good guess is that it corresponds to the “perf event types” in design.txt. When it is set to zero, which is the value of PERF_TYPE_HARDWARE, the counters seem to work.

2.4 Things that don’t work

  • Some combinations of parameters to the system call don’t work, even though the explanations in design.txt seem mostly accurate, if a bit odd in places. We happened to find one combination that works, shown in the sample code in Section 2.5, and have not changed it since.
  • When pinning threads, try making the system call that creates the counter file handles before pinning the threads to the cores.
  • Counting all tasks on all CPUs is not possible.
  • We have not yet gotten the group counter functionality to work. It is supposed to be possible to set up a group of counters whose values are all returned together. Supposedly, when creating the file handle attached to a counter, one can pass the handle of a different counter to attach to. First create a group leader, passing “-1” as the file handle to attach to. Then, for the other counters, pass the file descriptor returned for the leader. This did not work for us: it gave the error message “fail”.
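
For reference, the group setup described in the last item can be sketched as follows. This is an untested illustration of the documented mechanism, not a fix confirmed on our systems, and all function and struct names are hypothetical. It assumes read_format is set to PERF_FORMAT_GROUP with no other format bits, in which case a read on the leader returns the number of counters followed by one value per counter. The perf_event_open man page also states that pinned applies only to group leaders, so the sketch leaves it clear on the member; whether that relates to the failure reported above is not known.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#define GROUP_SIZE 2

/* Layout of a read() on the leader with read_format == PERF_FORMAT_GROUP. */
struct group_reading {
    uint64_t nr;                  /* number of counters in the group */
    uint64_t values[GROUP_SIZE];  /* one value per counter, leader first */
};

/* Opens one hardware counter for the calling thread on any core. */
static int open_hw_counter(uint64_t config, int group_fd, int is_leader)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = config;
    attr.read_format = PERF_FORMAT_GROUP;
    attr.disabled = is_leader ? 1 : 0;  /* members follow the leader's state */
    attr.pinned = is_leader ? 1 : 0;    /* ASSUMPTION: leader only, see above */
    attr.exclude_hv = 1;
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, group_fd, 0UL);
}

/* Opens cycles as leader and instructions as member.
 * Returns 0 on success, -1 on failure. */
int open_cycles_instrs_group(int fds[GROUP_SIZE])
{
    fds[0] = open_hw_counter(PERF_COUNT_HW_CPU_CYCLES, -1, 1);
    if (fds[0] < 0)
        return -1;
    fds[1] = open_hw_counter(PERF_COUNT_HW_INSTRUCTIONS, fds[0], 0);
    if (fds[1] < 0) {
        close(fds[0]);
        return -1;
    }
    return 0;
}

/* Reads every counter of the group with a single read() on the leader. */
int read_group(int leader_fd, struct group_reading *out)
{
    ssize_t n = read(leader_fd, out, sizeof(*out));
    if (n != (ssize_t)sizeof(*out) || out->nr != GROUP_SIZE)
        return -1;
    return 0;
}
```

Because the leader is created disabled, the group would still need to be started, for example with ioctl(fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP), before read_group returns meaningful values.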

2.5 Sample code

Here is sample code for using the perf counters:

 //setup performance counters
 //needs <linux/perf_event.h>, <sys/syscall.h>, <sys/prctl.h>,
 //<unistd.h>, <string.h> and <stdio.h>
 struct perf_event_attr hw_event;
 memset(&hw_event,0,sizeof(hw_event)); //unused fields must be zero
 hw_event.type = PERF_TYPE_HARDWARE;
 hw_event.size = sizeof(hw_event);
 hw_event.disabled = 1; /* start disabled, enabled by prctl below */
 hw_event.freq = 0; /* use sample period, not frequency */
 hw_event.inherit = 1; /* children inherit it */
 hw_event.pinned = 1; /* must always be on PMU */
 hw_event.exclusive = 0; /* may share the PMU with other groups */
 hw_event.exclude_user = 0; /* user-space events are counted */
 hw_event.exclude_kernel = 0; /* kernel events are counted too */
 hw_event.exclude_hv = 1; /* do not count hypervisor events */
 hw_event.exclude_idle = 0; /* count when idle as well */
 hw_event.mmap = 0; /* do not include mmap data */
 hw_event.comm = 0; /* do not include comm data */
 for( i = 0; i < NUM_CORES; i++ )
  {
    hw_event.config = PERF_COUNT_HW_CPU_CYCLES; //cycles (value 0)
    cycles_counter_fd[i] = syscall(__NR_perf_event_open, &hw_event,
     0,//pid_t pid,
     i,//int cpu,
     -1,//int group_fd,
     0//unsigned long flags
     );
    if (cycles_counter_fd[i]<0)
     {
       fprintf(stderr,"On core %d: ",i);
       perror("Failed to open cycles counter");
     }
    hw_event.config = PERF_COUNT_HW_INSTRUCTIONS; //instrs (value 1)
    instrs_counter_fd[i] = syscall(__NR_perf_event_open, &hw_event,
     0,//pid_t pid,
     i,//int cpu,
     -1,//int group_fd,
     0//unsigned long flags
     );
    if (instrs_counter_fd[i]<0)
     {
       fprintf(stderr,"On core %d: ",i);
       perror("Failed to open instrs counter");
     }
  }
 prctl(PR_TASK_PERF_EVENTS_ENABLE);

For example, hw_event.type = PERF_TYPE_HARDWARE; sets a constant that tells the kernel to use hardware counters instead of software sampling, as explained in design.txt.

2.6 Integration in VMS

VMS Slave Virtual Processors (VPs) loop through a number of stages during execution. These stages are controlled by the Master VPs, so we set up counters in the master environment and insert counter reads between each stage.
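
A counter read between two stages could look like the sketch below. It is an illustration only and not the actual VMS code: stage_sample, stage_begin, stage_end and stage_delta are hypothetical names, and it assumes the counter was opened with the default read_format, so that each read on the file descriptor returns a single 64-bit count.

```c
#include <stdint.h>
#include <unistd.h>

/* Reads the current 64-bit value of a counter file descriptor.
 * Returns 0 on success, -1 if the read fails or is short. */
int read_counter(int fd, uint64_t *count)
{
    uint64_t value;
    ssize_t n = read(fd, &value, sizeof(value));
    if (n != (ssize_t)sizeof(value))
        return -1;
    *count = value;
    return 0;
}

/* Hypothetical record of one stage: counter value at entry and at exit. */
struct stage_sample {
    uint64_t start;
    uint64_t end;
};

/* Called just before a stage begins. */
int stage_begin(int fd, struct stage_sample *s)
{
    return read_counter(fd, &s->start);
}

/* Called just after the stage has finished. */
int stage_end(int fd, struct stage_sample *s)
{
    return read_counter(fd, &s->end);
}

/* Events counted during the stage. */
uint64_t stage_delta(const struct stage_sample *s)
{
    return s->end - s->start;
}
```

With one stage_sample per stage and per counter, the master could record cycles and instructions for each stage by passing the matching entry of cycles_counter_fd or instrs_counter_fd from the sample code.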