Great meeting. We got a lot of the "same page" stuff done: talked through the visualization a bit to understand it, and walked through the wiki pages to get familiar with what information is missing about the visualization tool, and where the info about the performance-counter work lives.

Outcomes: 1) Looks like kernel 3.5 is needed to support the cache-miss hardware counters on the Ivy Bridge architecture. So, the next step will be upgrading the kernel to get the counters working.

2) Asked Nina for help on how to connect the numbers on the boxes to lines of code.

3) Asked Nina for help on why the boxes are so small (probably because the counters aren't working).

4) Talked through the approach that proto-runtime takes for assignment, and ideas for the assigner (see below).

5) Carole and Shoaib mentioned focusing on getting the performance counters working, and writing micro-benchmarks that let them verify that the visualizer displays what they think it does.

6) Sean mentioned creating a skeleton assigner, which includes the basic data structures for the approach we talked about.

The approach we talked about

Talked through proto-runtime's approach to assignment: it is hardware-availability driven. So, the assigner is triggered at the point hardware becomes available. It then looks at the work available and picks what work to assign to that hardware.

This means that the assigner is written so that it receives a hardware opportunity as input, and calculates a "goodness" measure for each available work-unit. It then assigns the best work-unit to that given hardware.
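
This shape can be sketched as a small skeleton (a hypothetical illustration, not proto-runtime's actual API; the names WorkUnit, HardwareOpportunity, goodness, and assign are made up, and the scoring is a placeholder):

```python
from dataclasses import dataclass

@dataclass
class WorkUnit:
    name: str
    consumed_chunks: list   # names of the data-chunks this unit reads

@dataclass
class HardwareOpportunity:
    core_id: int            # the core that just became available

def goodness(unit, opp):
    # Placeholder score: fewer chunks to move = better. A real version
    # would use the cache-state tracking described in these notes.
    return -len(unit.consumed_chunks)

def assign(opp, ready_units):
    # Triggered when hardware becomes available: score every ready
    # work-unit for this core and hand the best one to it.
    best = max(ready_units, key=lambda u: goodness(u, opp))
    ready_units.remove(best)
    return best
```

The key point the skeleton captures is the direction of the trigger: hardware availability drives the call into assign(), rather than work arrival driving a search for hardware.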

Went over the ideas about the data-tracking. Will track data movement through caches, and calculate "goodness" of a work-unit based on expected time to move consumed data to the given core.

Talked through how a named data-chunk travels through the cache hierarchy and "steps on" each cache in its path, starting at the core that created the data and going to the core that consumes it.

Once the assigner has chosen a work-unit to assign to the given core, record each cache that the input data travels through, starting at the place the data travels from and ending at the given core. Make a heuristic to simplify the questions of when the data travels and how much travels (i.e., if the L1 cache is calculated to hold 50% of a data-chunk, then only 50% travels through L2 from main memory -- this is at least the second time data from that chunk has traveled through L2, so what portion is expected to be in L2? Use micro-benchmarks to measure the behaviors and build the heuristic).
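
The recording step might look like the following sketch. The topology here (per-core L1/L2, one shared L3) is an assumption for illustration; real recording would follow the machine's actual hierarchy, and the function names are made up:

```python
def caches_on_path(src_core, dst_core):
    # Caches the chunk "steps on" on its way to the destination core,
    # under the assumed per-core L1/L2 + shared L3 topology.
    if src_core == dst_core:
        return []                           # already local, nothing moves
    return ["L3_shared",
            "L2_core%d" % dst_core,
            "L1_core%d" % dst_core]

def record_transfer(chunk, size, src_core, dst_core, landings):
    # landings: cache-name -> ordered list of (chunk, size) landing events
    for cache in caches_on_path(src_core, dst_core):
        landings.setdefault(cache, []).append((chunk, size))
```

A refinement, per the 50% example above, would be to record only the portion of the chunk that actually travels through each level rather than the full size.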

One idea for managing this is to keep, for each cache, a list of the data-chunks that have landed on it, in the order they landed. When the assigner records data landing on a cache, it adds the landing event to that cache's list. The list keeps the order of data-chunks landing on the cache, along with the size of each. Might also keep a separate array of the data-chunks in the list, with a running percentage for each, and prune chunks that fall below a target percentage.
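
A minimal sketch of that per-cache bookkeeping, under an assumed (made-up) decay model where newly landed bytes displace older residents proportionally; micro-benchmarks would have to calibrate the real behavior:

```python
class CacheTracker:
    def __init__(self, cache_size, prune_below=0.05):
        self.cache_size = cache_size
        self.prune_below = prune_below
        self.landings = []      # (chunk_name, size), in landing order
        self.resident = {}      # chunk_name -> running fraction still in cache

    def land(self, chunk, size):
        # Assumed heuristic: new bytes evict old residents proportionally.
        evict_frac = min(1.0, size / self.cache_size)
        for name in list(self.resident):
            self.resident[name] *= (1.0 - evict_frac)
            if self.resident[name] < self.prune_below:
                del self.resident[name]     # prune negligible chunks
        self.landings.append((chunk, size))
        # Fraction of the new chunk that fits in this cache at all:
        self.resident[chunk] = min(1.0, self.cache_size / size)
```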

When we want to calculate the percentage of a given data-chunk that resides in a given cache, find that chunk-name in the list and look at all the chunks that have stepped on the cache after it. Create a heuristic to estimate the percent of the target chunk still in the cache, given the cache size, the cache type (direct-mapped vs. number of set-ways), and the chunks that have stepped on the cache after the target chunk passed through.
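
One possible form for that heuristic (the exponential-decay curve and the single assoc_factor scaling for associativity are assumptions, not measured behavior; micro-benchmarks would calibrate the real curve):

```python
import math

def fraction_resident(target, landings, cache_size, assoc_factor=1.0):
    # landings: ordered (chunk_name, size) events, latest last.
    try:
        idx = max(i for i, (name, _) in enumerate(landings) if name == target)
    except ValueError:
        return 0.0                  # chunk never landed on this cache
    # Bytes that stepped on the cache after the target chunk passed through:
    bytes_after = sum(size for _, size in landings[idx + 1:])
    # Decay in units of "cache-fulls" of later traffic; assoc_factor > 1
    # models set-associative caches retaining data longer than direct-mapped.
    return math.exp(-bytes_after / (cache_size * assoc_factor))
```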

Calculate the "goodness" of a candidate work-unit. For each data-chunk that unit consumes, calculate the expected percent of the chunk that resides in each cache. Given the expected percent, use standard cache equations to calculate the expected time spent waiting for data transfer (percent missing, times chunk-size divided by line-size, times the time to fetch a line).
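
That wait-time formula in code form (the 64-byte line size and the per-line fetch time are placeholder assumptions, to be replaced by measured numbers):

```python
LINE_SIZE = 64          # bytes per cache line (typical value, assumed)
LINE_FETCH_NS = 100.0   # assumed time to fetch one line from the next level

def expected_wait_ns(chunk_size, fraction_resident):
    # percent missing * chunk-size / line-size * time-per-line
    fraction_missing = 1.0 - fraction_resident
    lines_to_fetch = (fraction_missing * chunk_size) / LINE_SIZE
    return lines_to_fetch * LINE_FETCH_NS

def unit_goodness(chunks):
    # chunks: (size, fraction_resident) pairs for each consumed data-chunk.
    # Negate the summed wait so that a higher score means a better candidate.
    return -sum(expected_wait_ns(size, frac) for size, frac in chunks)
```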

Sum the expected wait-time across the data-chunks of each work-unit candidate. That gives a measure of the "goodness" of choosing that work-unit for the current hardware-opportunity. This is where the interesting work comes in: defining an assignment-choosing algorithm that will deliver the best performance.

(Shoaib mentioned defining several alternatives, and perhaps using measurements of the application to choose which is best.)