A conversation between Sean and Carole-Jean Wu of Princeton:
What about something that tracks data somehow, to detect affinity of work units?
Can you give more context for this?
It would be nice if the scheduler (the assigner of work to cores) could identify blocks of data and know which blocks a given unit of work accesses. It could then track which cache each block currently resides in: when a work-unit is assigned to a core, the blocks attached to that work-unit are marked as residing in that core's cache. Something along those lines...
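As a rough illustration of that idea, here is a minimal sketch in C. The names (WorkUnit, AffinityTable, affinity_record_assignment) are hypothetical and not part of VMS or SSR; the point is just a scheduler-side table that records, for each data block, the core whose cache last touched it, updated whenever a work-unit is assigned:

    /* Hypothetical sketch (not the VMS/SSR API): a scheduler-side table that
     * records, for each data block, the core whose cache last held it.
     * When a work-unit is assigned to a core, every block that unit declares
     * is marked as resident in that core's cache. */
    #include <stdint.h>
    #include <stddef.h>

    #define MAX_BLOCKS_PER_UNIT 16

    typedef struct {
        int32_t block_ids[MAX_BLOCKS_PER_UNIT];  /* data blocks this unit reads/writes */
        size_t  num_blocks;
    } WorkUnit;

    typedef struct {
        int32_t *block_owner;   /* block_owner[b] = core whose cache last held block b, or -1 */
        size_t   num_blocks;
    } AffinityTable;

    /* Record that 'unit' has been assigned to 'core': its blocks are now
     * assumed to reside in that core's cache. */
    static void affinity_record_assignment(AffinityTable *t, const WorkUnit *unit, int32_t core)
    {
        for (size_t i = 0; i < unit->num_blocks; i++) {
            int32_t b = unit->block_ids[i];
            if (b >= 0 && (size_t)b < t->num_blocks)
                t->block_owner[b] = core;
        }
    }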
Just think what it would do for reducing cache misses, or, on a supercomputer, for reducing the amount of time spent waiting for data to transfer. If some way could be worked out to calculate the probability that the data accessed by a candidate work-unit already resides in a target core's cache, then the optimal placement of work onto cores could be quickly searched for, or calculated, inside the scheduler. For memory-limited applications with predictable access patterns, this could have a major impact on performance.
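Continuing the same hypothetical sketch, that "probability the data is already resident" could be approximated as the fraction of a candidate work-unit's blocks last placed on the target core, and the scheduler could simply pick the core with the highest score. Load balance and other scheduling concerns are ignored here; this only illustrates the affinity calculation:

    /* Estimate, for each core, how much of a candidate unit's data is already
     * resident there, and choose the core with the highest overlap. */
    static double affinity_score(const AffinityTable *t, const WorkUnit *unit, int32_t core)
    {
        if (unit->num_blocks == 0) return 0.0;
        size_t hits = 0;
        for (size_t i = 0; i < unit->num_blocks; i++) {
            int32_t b = unit->block_ids[i];
            if (b >= 0 && (size_t)b < t->num_blocks && t->block_owner[b] == core)
                hits++;
        }
        return (double)hits / (double)unit->num_blocks;  /* crude "probability resident" */
    }

    /* Pick the core (0..num_cores-1) whose cache most likely holds the unit's data. */
    static int32_t choose_core(const AffinityTable *t, const WorkUnit *unit, int32_t num_cores)
    {
        int32_t best = 0;
        double best_score = -1.0;
        for (int32_t c = 0; c < num_cores; c++) {
            double s = affinity_score(t, unit, c);
            if (s > best_score) { best_score = s; best = c; }
        }
        return best;
    }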
Here's a VMS project that has a blocked version of matrix multiply, written in the SSR language, for which a visualization tool is available that shows the cache behavior of each scheduled unit of work: SSR Matrix Mult with instrumentation