Using the Performance Visualization Tool

Getting Started

The visualization tool started life as a small script that just printed some stats, so it is not production ready. However, it is usable for those willing to dive into the details. Before you can use it, performance counters must be enabled, as detailed on the page: using performance counters.

After performance counters are enabled and working, you'll need a build of the language that is instrumented to collect the measurements. The measurements are enabled or disabled via compiler switches (see the file "VMS_defs__turn_on_and_off.h" for all such switches). One project with the instrumentation and switches already turned on is the SSR matrix multiply project: SSR matrix multiply project revision with measurements turned on

This project requires you to create a folder called "counters" in the run directory, where it saves three trace files per run. (It fails if it can't find the folder, or if the folder contains more than 255 files... it's not exactly production-quality code.)
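If you want to guard against that failure before launching a run, a minimal sketch of the setup step might look like this (the function name is mine; only the "counters" folder name and the 255-file limit come from the project):

```python
import os

def prepare_counters_dir(run_dir="."):
    """Create the 'counters' folder the instrumented run expects.

    The 255-file cap mirrors the limit mentioned above; archive or
    delete old trace files before you hit it.
    """
    counters = os.path.join(run_dir, "counters")
    os.makedirs(counters, exist_ok=True)
    n_files = len(os.listdir(counters))
    if n_files > 255:
        raise RuntimeError(
            "counters/ holds %d files; the run will fail -- "
            "clean out old traces first" % n_files)
    return counters
```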

Collecting Measurements and Generating Graphs

After measurements are collected, a post-processing Python script generates a graphical representation in SVG format. The script lives in the 2__runs_and_data repository, under scripts/ucc_and_loop_graph_treatment/parse_loop_graph.py. To work with this SSR project, you want revision (6d03033aca59).

The script takes two command-line arguments, which are the names of the trace files output during the run: the first is the one called "LoopGraph.x" and the second "Counter.x.csv", where x is whatever number was available when the file was created. (The numbers can occasionally get desynchronized if a failure happens somewhere during writing of the trace files, so check the console output of the run, which says which files were created.) There's also a "lazy mode": if both files are in the current directory, have those canonical names, and share the same x, you can avoid some typing and just run "parse_loop_graph.py x".
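The argument handling described above can be sketched as follows (this is a hypothetical helper to illustrate the two calling conventions, not the script's actual code):

```python
def resolve_trace_args(argv):
    """Resolve parse_loop_graph.py's arguments to the two trace files.

    Two explicit file names, or a single number x in "lazy mode",
    which expands to the canonical names LoopGraph.x and Counter.x.csv.
    """
    if len(argv) == 2:
        loop_graph, counter_csv = argv
    elif len(argv) == 1 and argv[0].isdigit():
        x = argv[0]
        loop_graph = "LoopGraph.%s" % x
        counter_csv = "Counter.%s.csv" % x
    else:
        raise SystemExit(
            "usage: parse_loop_graph.py LoopGraph.x Counter.x.csv | x")
    return loop_graph, counter_csv
```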

The script creates a file with the visual representation of the run, called LoopGraph.x.svg, or just x.svg in lazy mode (watch what you leave lying around in the folder; it overwrites existing files), and prints out some stats. Keep an eye on the line reporting the difference between expected and actual execution time; for a decent-sized matrix the percentage displayed shouldn't exceed 5%. If you're seeing larger differences, it's usually because some thread(s) got interrupted by the OS; let me know and I'll switch you over to a version that uses timestamps instead.
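The sanity check on that line amounts to a relative-difference test; a sketch of the check you'd apply by eye (names and the 5% default are illustrative):

```python
def timing_discrepancy_ok(expected, actual, tolerance_pct=5.0):
    """Return True when expected vs. actual execution time differ by
    no more than tolerance_pct percent (5% for a decent-sized matrix,
    per the text above)."""
    pct = abs(expected - actual) / expected * 100.0
    return pct <= tolerance_pct
```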

As of this writing, the measurement of cache behavior isn't very detailed. It only collects L1 data cache read misses during work (i.e. excluding runtime overhead) and then displays the block in a shade between blue and red, depending on the cache-miss-per-instruction ratio (blue = low, red = high).
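A blue-to-red blend like that is typically a linear interpolation on the miss ratio; here is a minimal sketch, where the saturation point max_ratio is an assumed illustrative value, not the tool's actual constant:

```python
def miss_ratio_color(miss_ratio, max_ratio=0.05):
    """Map an L1-D read-miss-per-instruction ratio to an SVG fill
    color, blending linearly from blue (low) to red (high).

    Ratios at or above max_ratio saturate to pure red.
    """
    t = min(max(miss_ratio / max_ratio, 0.0), 1.0)
    red = int(255 * t)
    blue = int(255 * (1.0 - t))
    return "#%02x00%02x" % (red, blue)
```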

Linking Unit Numbers on Graph to Lines of Code

When looking at the constraint graph, each box has numbers on it, such as "(333, 2)", which identify a particular unit of work. The first number identifies the virtual processor; the second counts the number of times that VP has been assigned to a core. So the instructions executed in VP "333" between assignment "2" and assignment "3" make up the trace of one work-unit, and the combination of 333 and 2 is used as a unique identifier of that work-unit.

Now, to figure out what lines of code were executed in that trace, the work-unit identifier has to be connected back to the code. The tool doesn't currently have an automated way to do this. Instead, it has to be done by hand, using a bit of intuition.

The starting point for establishing the linkage is the LoopGraph file (see above). It has a line starting with "unit" for each unit, containing the identifier and the address of the instruction where that unit started. So if you're looking for unit (333,2), open the LoopGraph file and search for "unit,333,2" (no spaces); that should find the corresponding line, which will say something like

  ...
  unit,333,2,0x40c5c3,0
  ...

and then you know that unit (333,2) starts executing at 0x40c5c3. If you also search for "unit,333,3", the start pointer for that entry is the suspend pointer for the previous unit, so if you find

  unit,333,3,0x401562,1

then you know that unit (333,2) executed code from 0x40c5c3 to 0x401562. If those aren't in the same function, you won't necessarily know how execution got from one to the other, but it's a start. Right now there's no easy way to get the file and line of code that an address corresponds to, so you'll have to run objdump --syms on the binary, which gives you something like this:

  ...
  0000000000405585 g     F .text	000000000000007d              makeHist_helper
  0000000000408d22 g     F .text	0000000000000127              readCASQ
  00000000006122f0 g     O .bss 	0000000000000008              dot_file
  000000000040c587 g     F .text	00000000000000b6              readPrivQ
  0000000000406051 g     F .text	0000000000000401              printHist
  00000000004021dd g     F .text	0000000000000046              VMS__throw_exception
  00000000006121a8 g       *ABS*	0000000000000000              _edata
  ...

except going on forever. If you search for 40c5c3, you'll see that the closest preceding address is readPrivQ's, so it's probably in there somewhere.
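The "closest preceding symbol" lookup can be automated with a sorted list and a binary search. A sketch, keyed to the objdump listing format shown above (the helper names are mine):

```python
import bisect

def parse_objdump_syms(objdump_output):
    """Parse `objdump --syms` output into a sorted (address, name)
    list, keeping only .text (code) symbols."""
    syms = []
    for line in objdump_output.splitlines():
        parts = line.split()
        if len(parts) >= 2 and ".text" in parts:
            try:
                addr = int(parts[0], 16)
            except ValueError:
                continue
            syms.append((addr, parts[-1]))
    syms.sort()
    return syms

def nearest_symbol(syms, addr):
    """Return the symbol with the greatest address <= addr,
    i.e. the function the address most likely falls inside."""
    i = bisect.bisect_right([a for a, _ in syms], addr) - 1
    return syms[i][1] if i >= 0 else None
```

Running this over the full objdump output and asking for 0x40c5c3 picks out readPrivQ, matching the by-hand search described above.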

gdb probably has functions for resolving addresses, so if one of you can figure out how to use those instead, that'd be awesome!
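In the meantime, binutils' addr2line can resolve an address straight to function and file:line, provided the binary was built with debug info (-g). A sketch of wrapping it from Python (the helper names are mine; the addr2line flags are real):

```python
import subprocess

def addr2line_cmd(binary, addr):
    # Build the addr2line invocation: -f adds the function name,
    # -e names the binary to read debug info from.
    return ["addr2line", "-f", "-e", binary, hex(addr)]

def addr_to_line(binary, addr):
    """Resolve a code address to (function, "file:line").

    Requires addr2line on PATH and a binary built with -g; without
    debug info it prints "??" / "??:0" instead.
    """
    out = subprocess.run(addr2line_cmd(binary, addr),
                         capture_output=True, text=True, check=True)
    func, location = out.stdout.splitlines()[:2]
    return func, location
```

gdb can do the same interactively with "info line *0x40c5c3" on a loaded binary, which may be handier when you already have it open.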