Cache-adaptive Matrix Multiplication

Apivich Hemachandra, Tran Phong

CS5234 Algorithms at Scale

National University of Singapore, Singapore

Abstract
In this project, we empirically investigate the performance of several matrix multiplication algorithms under the cache-adaptive model. We simulate a two-level memory hierarchy and measure how each algorithm performs in the external-memory model when the cache changes size during execution. We then test the algorithms on a real machine by counting the page faults each algorithm incurs as cache availability varies. Both the simulation and the real-machine experiments demonstrate that cache-adaptive algorithms are more stable as the cache profile varies.
Theory and Algorithm
Formally, the cache-adaptive model can be seen as an extension of the disk-access model (DAM), in which costs are incurred only when the algorithm reads or writes information on disk [1]. The cache-adaptive model is the case where the size of the cache is allowed to change during the execution of the program. The intent of the model is to mimic real-world situations where the cache available to a program is not fixed, but is divided among competing processes according to their needs.
We can specify the memory profile of a cache as a function \( m : \mathbb{N} \to \mathbb{N} \), where \( m(t) \) is defined to be the number of cache blocks the cache has at the \(t\)-th cache miss of the algorithm, and let the total cache size be \( M(t) = B \cdot m(t) \), where \( B \) is the cache block size [1].
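For concreteness, a small illustrative example (the numbers here are ours, not from [1]): with block size \( B = 4 \) words and \( m(t) = 8 \) blocks available at the \(t\)-th miss,
$$ M(t) = B \cdot m(t) = 4 \cdot 8 = 32 \text{ words}. $$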
Definition 1 (Cache-adaptivity [1]). Informally, an algorithm is cache-adaptive if it remains optimal, up to constant-factor resource augmentation, on every memory profile; see [1] for the formal definition.
The cache-adaptive algorithm is shown below.
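Below is a minimal Python sketch of the recursive divide-and-conquer (cache-oblivious) matrix multiplication, which [1] shows is optimally cache-adaptive. The function name, the in-place accumulation into C, and the assumption that n is a power of two are our simplifications, not necessarily the exact code used in our experiments.

```python
def rec_matmul(A, B, C, n, ar=0, ac=0, br=0, bc=0, cr=0, cc=0):
    """Accumulate the product of two n-by-n sub-blocks of A and B into C.

    (ar, ac), (br, bc), (cr, cc) are the top-left corners of the
    sub-blocks of A, B, and C; n is assumed to be a power of two.
    """
    if n == 1:
        # Base case: a single scalar multiply-accumulate.
        C[cr][cc] += A[ar][ac] * B[br][bc]
        return
    h = n // 2
    # 2x2 block decomposition C_ij += sum_k A_ik * B_kj:
    # eight recursive sub-multiplications on half-size blocks.
    for i in (0, 1):
        for j in (0, 1):
            for k in (0, 1):
                rec_matmul(A, B, C, h,
                           ar + i * h, ac + k * h,
                           br + k * h, bc + j * h,
                           cr + i * h, cc + j * h)


# Example usage: every entry of C should equal 1 * 2 * n = 8 for n = 4.
n = 4
A = [[1] * n for _ in range(n)]
B = [[2] * n for _ in range(n)]
C = [[0] * n for _ in range(n)]
rec_matmul(A, B, C, n)
```

Because the recursion eventually reaches sub-problems that fit in whatever cache happens to be available at that moment, the algorithm needs no knowledge of the (changing) cache size, which is what makes it a natural candidate for the cache-adaptive model.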

[1] Bender, M. A., Ebrahimi, R., Fineman, J. T., Ghasemiesfeh, G., Johnson, R., and McCauley, S. Cache-adaptive algorithms. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms (USA, 2014), SODA '14, Society for Industrial and Applied Mathematics, pp. 958–971.

Experiments
Figure 1. Cache Structure.
1. The class Block represents a block of data in both Memory and Cache. It is a list of integers, each 4 bytes in size. We exploit Python's object references: since Cache and Memory hold references to the same Block object, writing to a Block through the Cache also updates its contents in Memory.
2. The class Memory acts as the main memory storing the data. It has only two main functions. The first, read_block_from_memory(), returns the Block at a requested address. The second, allocate(), initializes the memory with the matrix data used for the multiplication. Matrices are stored in row-major order.
3. Our Cache class is the linchpin of the simulator. It communicates with Memory through the CPU, transferring data as Blocks.
4. We implemented three main eviction policies: LRUPolicy (least recently used), LFUPolicy (least frequently used), and FIFOPolicy (first-in first-out). LRUPolicy evicts the least recently used Block; it maintains an OrderedDict (which has a doubly linked list as its underlying data structure), moving a block's index to the end every time the block is accessed and removing indices from the front when the cache is full (see the sketch after this list). LFUPolicy evicts the least frequently used Block; we use a hashmap to maintain the access frequency of each block index, and in the eviction phase we loop through the hashmap to find and remove the index with the lowest frequency. FIFOPolicy evicts blocks in the order they were added, again using an OrderedDict.
5. CPU contains three important functions. The first, read(), reads a Block from the Cache; if the Block is not in the cache, it is fetched from Memory and stored in the Cache. The second, write(), writes a byte to a Block in the Cache; if the Block is not present, the function requests it from Memory, updates it, and writes it to the Cache. The final function, change_cache_size(), serves the cache-adaptive setting: it updates the size of the cache every time a cache miss occurs, according to a specified memory profile.
6. This class is responsible for changing the cache's size every time a cache miss occurs, as in [1]. The memory profile is as follows: $$ m(t) = \begin{cases} m & \text{if } t \leq m^{5/2} \\ 2m^{5/2} - t & \text{if } m^{5/2} < t < 2m^{5/2} - B \\ B & \text{if } t \geq 2m^{5/2} - B \end{cases} $$ where \(t\) indexes the \(t\)-th cache miss and \( m = c_1 B \), with \(c_1\) an arbitrary constant speed augmentation; here we choose \(c_1 = 1\). (A sketch of this profile as code appears after this list.)
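As referenced in item 4, here is a minimal sketch of an LRU eviction policy built on OrderedDict. The class and method names (LRUPolicy, touch) and the capacity handling are illustrative assumptions, not necessarily the exact interface of our simulator.

```python
from collections import OrderedDict

class LRUPolicy:
    """Least-recently-used eviction over block indices (illustrative sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity    # number of blocks the cache may hold
        self.order = OrderedDict()  # block index -> None, kept in access order

    def touch(self, block_index):
        """Record an access: move the block to the most-recently-used end."""
        if block_index in self.order:
            self.order.move_to_end(block_index)
        else:
            self.order[block_index] = None
        # Evict from the least-recently-used (front) end while over capacity.
        while len(self.order) > self.capacity:
            self.order.popitem(last=False)
```

And, as referenced in item 6, a sketch of the memory profile as a plain function (the name memory_profile is ours; a real implementation would also round the decreasing phase down to an integer block count):

```python
def memory_profile(t, B, c1=1):
    """Number of cache blocks available at the t-th cache miss (item 6)."""
    m = c1 * B
    if t <= m ** 2.5:
        return m                   # constant phase: m blocks
    elif t < 2 * m ** 2.5 - B:
        return 2 * m ** 2.5 - t    # linearly decreasing phase
    else:
        return B                   # floor phase: B blocks
```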
For the results of cache-adaptive algorithms compared to other algorithms (e.g., naive, cache-aware, and cache-oblivious) for matrix multiplication, please see the report.
Report

Link to the report

GitHub Code

Link to the GitHub code

Discussion

For any questions regarding this project, please contact me or post a comment below, and I will respond shortly.