CS7290 Advanced Microarchitecture

Fall 2017


Reading Papers

Please install the web localizer to access the papers

Modeling

Power

[MCP] Sheng Li, Jung Ho Ahn, Richard D. Strong, Jay B. Brockman, Dean M. Tullsen, and Norman P. Jouppi. 2009. McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 42)
[PGPU] Sunpyo Hong and Hyesoon Kim. An integrated GPU power and performance model. In Proceedings of the 37th annual international symposium on Computer architecture (ISCA '10).

Front-end and Branch Predictors

Scheduler

[LLS] Eric Borch, Srilatha Manne, Joel Emer, and Eric Tune. 2002. Loose Loops Sink Chips. In Proceedings of the 8th International Symposium on High-Performance Computer Architecture (HPCA '02). IEEE Computer Society, Washington, DC, USA, 299-.
[MAC] Ilhyun Kim and Mikko H. Lipasti. 2003. Macro-op Scheduling: Relaxing Scheduling Loop Constraints. In Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture (MICRO 36). IEEE Computer Society, Washington, DC, USA, 277-.
[CYL] D. Ernst, A. Hamel and T. Austin, "Cyclone: a broadcast-free dynamic instruction scheduler with selective replay," Computer Architecture, 2003. Proceedings. 30th Annual International Symposium on, 2003, pp. 253-262.

Cache Optimizations

[UCP] Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches. Moinuddin K. Qureshi and Yale N. Patt. MICRO'06.
[AIP] Moinuddin K. Qureshi, Aamer Jaleel, Yale N. Patt, Simon C. Steely, and Joel Emer. 2007. Adaptive insertion policies for high performance caching. In Proceedings of the 34th annual international symposium on Computer architecture (ISCA '07).
Aamer Jaleel, Kevin B. Theobald, Simon C. Steely, Jr., and Joel Emer. 2010. High performance cache replacement using re-reference interval prediction (RRIP). In Proceedings of the 37th annual international symposium on Computer architecture (ISCA '10)
[TAP] Jaekyu Lee and Hyesoon Kim, "TAP: A TLP-aware cache management policy for a CPU-GPU heterogeneous architecture," High Performance Computer Architecture (HPCA), 2012
[ACC] Alaa R. Alameldeen and David A. Wood. 2004. Adaptive Cache Compression for High-Performance Processors. In Proceedings of the 31st annual international symposium on Computer architecture (ISCA '04)
[COM] Alaa R. Alameldeen and David A. Wood. 2004. Adaptive Cache Compression for High-Performance Processors. In Proceedings of the 31st annual international symposium on Computer architecture (ISCA '04)
[COM2] Magnus Ekman and Per Stenstrom. 2005. A Robust Main-Memory Compression Scheme. In Proceedings of the 32nd annual international symposium on Computer Architecture (ISCA '05)

Coherence

[CSB] Culler and Singh, Parallel Computer Architecture Chapter 5.1 (pp 269 – 283), Chapter 5.3 (pp 291 – 305)
[PH] P&H, Computer Organization and Design Chap 5.8
[HP] H&P, Computer Architecture, Chap 5.4-5.6
[COH1] Andreas Moshovos. 2005. RegionScout: Exploiting Coarse Grain Sharing in Snoop-Based Coherence. In Proceedings of the 32nd annual international symposium on Computer Architecture (ISCA '05).
Daehoon Kim, Jeongseob Ahn, Jaehong Kim, and Jaehyuk Huh. 2010. Subspace snooping: filtering snoops with operating system support. In Proceedings of the 19th international conference on Parallel architectures and compilation techniques (PACT '10).

TLB

[TLB1] Gokul B. Kandiraju and Anand Sivasubramaniam. 2002. Going the distance for TLB prefetching: an application-driven study. In Proceedings of the 29th annual international symposium on Computer architecture (ISCA '02). IEEE Computer Society, Washington,
[TLB2] Abhishek Bhattacharjee and Margaret Martonosi. 2010. Inter-core cooperative TLB for chip multiprocessors. In Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems (ASPLOS XV).
[TLB3] Shekhar Srikantaiah and Mahmut Kandemir. 2010. Synergistic TLBs for High Performance Address Translation in Chip Multiprocessors. In Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO '43)
[TLB4] Abhishek Bhattacharjee, Daniel Lustig, and Margaret Martonosi. 2011. Shared last-level TLBs for chip multiprocessors. In Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture (HPCA '11).
[COLT] Binh Pham, Viswanathan Vaidyanathan, Aamer Jaleel, and Abhishek Bhattacharjee. 2012. CoLT: Coalesced Large-Reach TLBs. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-45)
[EVBS] Arkaprava Basu, Jayneel Gandhi, Jichuan Chang, Mark D. Hill, and Michael M. Swift. 2013. Efficient virtual memory for big memory servers. In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA '13).

Non-conventional architecture

[DFA1] Dennis and Misunas, “A Preliminary Architecture for a Basic Data Flow Processor,” ISCA 1974.
[DFA2] Arvind and Nikhil, “Executing a Program on the MIT Tagged-Token Dataflow Architecture,” IEEE TC 1990
[ATU] Micron Automata Processor

GPU architectures

[BGPU] “Performance analysis and tuning for GPGPUs.” Synthesis Lectures on Computer Architecture, Morgan & Claypool