VMS.SchedulersDataTrackingMeeting1 History


November 21, 2012, at 05:49 PM by 24.130.186.152 -
Changed lines 11-12 from:
4) Talked through the approach of proto-runtime to assignment and ideas for the assigner (see below).
to:
4) Talked through the approach that proto-runtime takes for assignment, and ideas for the assigner (see below).
Changed lines 15-16 from:
6) Sean mentioned that he may create a skeleton assigner, which includes the basic data structures for the approach we talked about.
to:
6) Sean mentioned creating a skeleton assigner, which includes the basic data structures for the approach we talked about.
Changed lines 20-28 from:
Went over the ideas about the data-tracking. Will track data movement through caches, and choose placement of work based on the expected time to move consumed data to the core doing the work.

Talked through how a named data-chunk travels through the cache hierarchy, and "steps on" each cache in its path, starting from the core that created the data and going to the core that consumes it.

At the time of assignment of work to a core, record each cache that the input data travels through in the hardware, starting from the source. (Make a heuristic to simplify the issues of when source data gets written out of the source cache and travels up to main memory.) Each time a landing is recorded, add the event to a list. The list holds the order of data landing on a cache (with the size of the data). At the point we want to calculate the percentage of a given data chunk that resides in a given cache, find that chunk-name in the list, and look at all the chunks that have stepped on the cache after it. Create a heuristic to estimate the percent, given the cache size, the cache type (direct-mapped vs. number of set-ways), and the chunks that have stepped on the cache after the target chunk passed through.

While the assigner is considering what work to put where, it picks a unit of work to focus on. For that unit, for each data-chunk it consumes, calculate the expected percent of the chunk that resides in each cache. Given the expected percent, use standard cache equations to calculate the expected time spent waiting for data transfer (percent missing, times chunk-size divided by line-size, times the time to get a line).

Sum up the expected wait-time for each work-unit candidate. That gives a measure of "goodness" of choosing that work-unit for the current hardware-opportunity. This is where the interesting work comes in: defining an assignment-choosing algorithm that will deliver the best performance.
to:
This means that the assigner is written so that it receives a hardware opportunity as input and calculates a "goodness" measure for each available work-unit. It then assigns the best work-unit to that given hardware.
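
A minimal sketch of that shape, in C. Every name here (Assigner, WorkUnit, CoreSlot, expected_wait_time) is invented for illustration, not proto-runtime's actual API:

```c
#include <stddef.h>
#include <float.h>

typedef struct Assigner Assigner;   /* holds the data-tracking state       */
typedef struct WorkUnit WorkUnit;   /* one schedulable unit of work        */
typedef struct CoreSlot CoreSlot;   /* the hardware that just became free  */

/* Expected time the core would stall waiting for this unit's input data
   to arrive (sketched further below); lower wait means better "goodness". */
double expected_wait_time( Assigner *self, WorkUnit *unit, CoreSlot *core );

/* Entry point the runtime calls when a core becomes available: score every
   ready work-unit against that core and hand back the best one.           */
WorkUnit *assign_to_opportunity( Assigner *self, WorkUnit **ready,
                                 size_t numReady, CoreSlot *freeCore )
{
    WorkUnit *best = NULL;
    double bestWait = DBL_MAX;
    for( size_t i = 0; i < numReady; i++ ) {
        double wait = expected_wait_time( self, ready[i], freeCore );
        if( wait < bestWait ) { bestWait = wait; best = ready[i]; }
    }
    return best;    /* the runtime then dispatches this unit onto freeCore */
}
```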

Went over the ideas about the data-tracking. Will track data movement through caches, and calculate the "goodness" of a work-unit based on the expected time to move consumed data to the given core.

Talked through how a named data-chunk travels through the cache hierarchy, and "steps on" each cache in its path, starting at the core that created the data and going to the core that consumes it.
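
To make "steps on" concrete, here is one way the path could be enumerated, assuming a machine with private L1/L2 per core and a shared L3 (the topology, CacheID, and cache_path are all assumptions for illustration):

```c
#define MAX_PATH 8

typedef enum { LEVEL_L1, LEVEL_L2, LEVEL_L3 } CacheLevel;

/* Identifies one physical cache: its level plus the core that owns it
   (ownerCore is -1 for the shared L3).                                 */
typedef struct { CacheLevel level; int ownerCore; } CacheID;

/* List the caches a data-chunk "steps on" while moving from the core
   that created it to the core that will consume it.  The creator's own
   L1/L2 were already stepped on at the time the data was written.      */
size_t cache_path( int fromCore, int toCore, CacheID path[MAX_PATH] )
{
    size_t n = 0;
    if( fromCore == toCore )
        return 0;                                  /* data already local  */
    path[n++] = (CacheID){ LEVEL_L3, -1 };         /* drains to shared L3 */
    path[n++] = (CacheID){ LEVEL_L2, toCore };     /* fills consumer's L2 */
    path[n++] = (CacheID){ LEVEL_L1, toCore };     /* ...and its L1       */
    return n;
}
```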

Once the assigner has chosen a work-unit to assign to the given core, record each cache that the input data travels through, starting at the place the data travels from and ending at the given core. Make a heuristic to simplify the issues of when the data travels and how much travels (i.e., if the L1 cache is calculated to hold 50% of a data-chunk, then only 50% travels through L2 from main memory -- and this is at least the second time data from that chunk has traveled through L2, so what portion is expected to be in L2 already? Use micro-benchmarks to measure the behaviors and build the heuristic).
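
A sketch of that recording step, reusing the hypothetical types above. The "how much travels" rule here (only the bytes not already resident travel onward) is a crude stand-in for the micro-benchmark-derived heuristic:

```c
/* Estimated fraction of chunkName already sitting in the given cache,
   0.0 to 1.0 (the heuristic itself is sketched further below).         */
double resident_fraction( Assigner *self, CacheID cache, int chunkName );

/* Append one landing-event (chunkName, bytes) to the cache's list.     */
void record_landing( Assigner *self, CacheID cache, int chunkName,
                     size_t bytes );

/* After a work-unit is assigned, walk the path its input chunk takes
   and record how much of the chunk lands on each cache along the way.
   Placeholder rule: whatever portion a cache already holds does not
   need to travel through the levels behind it.                         */
void record_chunk_movement( Assigner *self, int chunkName, size_t chunkBytes,
                            CacheID *path, size_t pathLen )
{
    size_t travelling = chunkBytes;
    for( size_t i = 0; i < pathLen; i++ ) {
        double already = resident_fraction( self, path[i], chunkName );
        travelling = (size_t)( (double)travelling * (1.0 - already) );
        record_landing( self, path[i], chunkName, travelling );
    }
}
```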

One idea for managing this is to keep, for each cache, a list of the data-chunks that have landed on it, in the order they landed. At the point the assigner records data landing on a cache, add the landing-event to that cache's list. The list keeps the order of data-chunks landing on the cache, with the size of each. Might also keep a separate array of the data-chunks in the list, with a running percentage for each, and prune chunks that fall below a target percentage.
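
One possible shape for that bookkeeping (the struct layout and the pruning threshold are made up for illustration):

```c
#define PRUNE_BELOW_PCT 0.05   /* drop chunks estimated under 5% residency */

/* One landing-event: a chunk of a given size touched this cache. */
typedef struct LandingEvent {
    int                  chunkName;
    size_t               bytes;
    struct LandingEvent *next;      /* the list is kept in landing order */
} LandingEvent;

/* Per-cache tracking state. */
typedef struct {
    LandingEvent *oldest, *newest;  /* ordered list of landing-events     */
    size_t        cacheBytes;       /* capacity, for the percent estimate */
    int           numWays;          /* associativity, for the heuristic   */
    /* Optional running summary: estimated residency per chunk, with
       entries pruned once they fall below PRUNE_BELOW_PCT.               */
    int    *chunkNames;
    double *runningPct;
    size_t  numTracked;
} CacheTrack;
```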

At the point we want to calculate the percentage of a given data-chunk that resides in a given cache, find that chunk-name in the list, and look at all the chunks that have stepped on the cache after it. Create a heuristic to estimate the percent of the target chunk still in the cache, given the cache size, the cache type (direct-mapped vs. number of set-ways), and the chunks that have stepped on the cache after the target chunk passed through.
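
A placeholder version of that heuristic, using the CacheTrack sketch above: count the bytes that have stepped on the cache after the target chunk, and assume residency falls off linearly with them (a real version would fold in numWays and the micro-benchmark measurements):

```c
/* Estimate the fraction (0.0 - 1.0) of chunkName still in this cache. */
double estimate_resident_pct( CacheTrack *ct, int chunkName )
{
    /* Find the target chunk's landing-event in the ordered list. */
    LandingEvent *ev = ct->oldest;
    while( ev != NULL && ev->chunkName != chunkName )
        ev = ev->next;
    if( ev == NULL )
        return 0.0;                    /* never landed, or already pruned */

    /* Total the bytes that have stepped on the cache after it. */
    size_t bytesSince = 0;
    for( LandingEvent *e = ev->next; e != NULL; e = e->next )
        bytesSince += e->bytes;

    /* Crude linear model: each later byte evicts one earlier byte. */
    if( bytesSince >= ct->cacheBytes )
        return 0.0;
    return 1.0 - (double)bytesSince / (double)ct->cacheBytes;
}
```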

Calculate the "goodness" of a candidate work-unit. For each data-chunk that unit consumes, calculate expected percent of the chunk that resides in each cache. Given the expected percent, use standard cache eqns to calculate expected time spent waiting for data transfer (percent missing times chunk-size divided by line-size times time to get a line).

Sum up the expected wait-time for each work-unit candidate. That gives a measure of "goodness" of choosing that work-unit for the current hardware-opportunity. This is where the interesting work comes in: defining an assignment-choosing algorithm that will deliver the best performance.
November 21, 2012, at 05:17 PM by 24.130.186.152 -
Added lines 1-36:

Great meeting. We got a lot of the "same page" stuff done: talked through the visualization a bit, to understand it, and walked through the wiki pages, to get familiar with what information is missing about the visualization tool and where the info about the performance-counter stuff is.

Outcomes:
1) Looks like kernel 3.5 is needed to support the cache-miss hardware counters on the Ivy Bridge architecture. So, the next step will be upgrading the kernel to get the counters working.

2) Asked Nina for help on how to connect the numbers on the boxes to lines of code.

3) Asked Nina for help on why the boxes are so small (probably because the counters are not working).

4) Talked through the approach of proto-runtime to assignment and ideas for the assigner (see below).

5) Carole and Shoaib mentioned focusing on getting the performance counters working, and writing micro-benchmarks that let them verify that the visualizer displays what they think it does.

6) Sean mentioned that he may create a skeleton assigner, which includes the basic data structures for the approach we talked about.

===== The approach we talked about =====
Talked through the approach of proto-runtime to assignment: it is hardware-availability driven. So, the assigner is triggered at the point hardware becomes available. It then looks at the work available and picks what work to assign to that hardware.

Went over the ideas about the data-tracking. Will track data movement through caches, and choose placement of work based on the expected time to move consumed data to the core doing the work.

Talked through how a named data-chunk travels through the cache hierarchy, and "steps on" each cache in its path, starting from the core that created the data and going to the core that consumes it.

At the time of assignment of work to a core, record each cache that the input data travels through in the hardware, starting from the source. (Make a heuristic to simplify the issues of when source data gets written out of the source cache and travels up to main memory.) Each time a landing is recorded, add the event to a list. The list holds the order of data landing on a cache (with the size of the data). At the point we want to calculate the percentage of a given data chunk that resides in a given cache, find that chunk-name in the list, and look at all the chunks that have stepped on the cache after it. Create a heuristic to estimate the percent, given the cache size, the cache type (direct-mapped vs. number of set-ways), and the chunks that have stepped on the cache after the target chunk passed through.

While the assigner is considering what work to put where, it picks a unit of work to focus on. For that unit, for each data-chunk it consumes, calculate the expected percent of the chunk that resides in each cache. Given the expected percent, use standard cache equations to calculate the expected time spent waiting for data transfer (percent missing, times chunk-size divided by line-size, times the time to get a line).

Sum up the expected wait-time for each work-unit candidate. That gives a measure of "goodness" of choosing that work-unit for the current hardware-opportunity. This is where the interesting work comes in: defining an assignment-choosing algorithm that will deliver the best performance.

(Shoaib mentioned defining several alternative assigners, and perhaps using measurements of the application to choose which is best.)
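
One lightweight way to structure those alternatives (purely a sketch, reusing the invented types from the first code sketch): keep each candidate behind a common function-pointer signature and let the measurements pick.

```c
/* Each alternative assigner scores work-units differently but shares
   the signature of assign_to_opportunity from the first sketch.        */
typedef WorkUnit *(*AssignerFn)( Assigner *, WorkUnit **, size_t, CoreSlot * );

typedef struct {
    AssignerFn fn;
    double     measuredRunTime;  /* filled in by measuring the application */
} AssignerAlt;

/* Return whichever alternative has measured fastest so far. */
AssignerFn best_assigner( AssignerAlt *alts, size_t numAlts )
{
    size_t best = 0;
    for( size_t i = 1; i < numAlts; i++ )
        if( alts[i].measuredRunTime < alts[best].measuredRunTime )
            best = i;
    return alts[best].fn;
}
```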