# Concurrent average memory access time

Jump to: navigation, search
This article was considered for deletion at Wikipedia on August 2 2015. This is a backup of Wikipedia:Concurrent_average_memory_access_time. All of its AfDs can be found at Wikipedia:Special:PrefixIndex/Wikipedia:Articles_for_deletion/Concurrent_average_memory_access_time, the first at Wikipedia:Wikipedia:Articles_for_deletion/Concurrent_average_memory_access_time. Purge

Concurrent Average Memory Access Time (C-AMAT) is an extension of average memory access time (AMAT). It considers concurrent memory accesses and in doing so provides an accurate metric and a design and optimization tool for modern memory systems. Today's computing memory systems are organized as a memory hierarchy, where data locality is a key factor in memory performance. AMAT is the conventional tool to measure and analyze locality's influence on data access time. AMAT can be used recursively to measure the impact of locality throughout all layers of the memory hierarchy and is widely used in memory system design and optimization. However, modern memory systems are not only hierarchical but also advanced in recent years to support concurrent data accesses at each layer of the memory hierarchy. Concurrent data accesses allow for the processor to receive more data in a smaller amount of time compared to sequential accesses. Because of this, concurrent data accesses have become increasingly more important as memory systems are highly utilized for big data applications. AMAT does not consider concurrent memory accesses, rather it assumes that memory accesses are sequential. C-AMAT integrates data concurrency into AMAT and unifies the influence of data locality and concurrency into one formula. C-AMAT can be applied recursively to each layer of the memory hierarchy and reduces to AMAT when there exists no data concurrency.

C-AMAT introduces two new parameters, hit concurrency and miss concurrency. It also introduces the notion of a pure miss, a miss which contains at least one pure miss cycle, where a pure miss cycle is a miss cycle which does not overlap with a hit cycle. Using pure misses, C-AMAT redefines miss rate and average miss penalty of AMAT to pure miss rate and pure average miss penalty. C-AMAT reveals two important facts. 1) Reducing the number of cache misses is not important for performance improvement but reducing the number of pure misses is. 2) Improving data locality may not always improve performance, thus optimizations should focus on providing a balanced improvement in both data locality and concurrency. C-AMAT is a practical and effective tool to reach such a balanced design. Like AMAT, C-AMAT finds its applications in memory system design and optimization, system configuration, FPGA utilization, and task scheduling. In turn, task scheduling and memory optimization further influence algorithm design and general system software design. Concurrent Average Memory Access Time was proposed by Xian-He Sun and Dawei Wang in the IEEE Computers Society's sponsored journal Computer.[1]

## Definition and Formulation

AMAT depends on three terms, hit latency, miss rate, and miss penalty. It is calculated by taking the product of miss rate and miss penalty and adding it to the hit latency. C-AMAT is formulated similar to AMAT, but considers concurrent hit and concurrent pure miss accesses. Quantitatively, C-AMAT is defined as the total memory access cycles (the total number of cycles executed in which there is at least one outstanding memory reference), represented as $T_{MemCycle}$, divided by the total number of memory accesses, represented as $C_{MemAcc}$:

$C\mbox{-}AMAT = \frac{T_{MemCycle}}{C_{MemAcc}}$

Concurrency is implicit in the above equation. When several memory accesses coexist during the same cycle, $T_{MemCycle}$ only increases by one but $C_{MemAcc}$ increases by the amount of accesses which overlap. Thus, $T_{MemCycle}$ is calculated in overlapping mode, to account for concurrency in the modern memory hierarchy.

In addition, $T_{MemCycle}$ only includes clock cycles which have memory access activities; thus cycles without any memory references are excluded. This is related to another research work, which defines Access Per Cycle (APC).[2][3] APC is a performance metric to measure concurrent memory system performance. C-AMAT is the reciprocal of APC:

$C\mbox{-}AMAT=\frac{1}{APC}=\frac{T_{MemCycle}}{C_{MemAcc}}$

In order to gain meaningful insights from the C-AMAT model, C-AMAT is extended to the following:

$C\mbox{-}AMAT = \frac{H}{C_H} + pMR \cdot \frac{pAMP}{C_M}$

C-AMAT now includes five parameters: H(hit latency), C$_{\text{H}}$ (hit concurrency), pMR (pure miss rate), pAMP (pure average miss penalty), and C$_{\text{M}}$(average pure miss concurrency). A pure miss is a miss which contains at least one miss cycle that does not have any hit access activity. From the above formula, as C$_{\text{H}}$ increases, access time will decrease, and as C$_{\text{M}}$ increases access time decreases as well. Thus concurrency can lower the access time, but as pMR and pAMP increase access time will increase. C-AMAT then shows the performance advantages that can be achieved as concurrency increases as well as how pure miss reduction, which is described in the next section, can improve overall performance.

C-AMAT redefines pMR as the number of pure misses divided by the total number of accesses. pAMP is the average number of pure miss cycles per miss access. In contrast to AMAT's MR and AMP, C-AMAT calculates miss rate and average miss penalty in terms of pure misses.

## Pure Miss

A pure miss cycle is simply a miss cycle which does not overlap with a hit cycle. Thus a pure miss, introduced by C-AMAT, is a miss that contains at least one pure miss cycle. The intuition behind pure miss is based on the fact that not all cache misses will cause processor stall, but rather only pure misses. C-AMAT is measured in terms of one core. Thus, even though an access may miss during a cycle, there can be another coexisting hit during the same cycle which the core can use to continue processing without waiting for the miss to be fetched. In this way, the amount of time a core must wait for data depends on how many pure misses exist, not traditional misses. Pure miss and C-AMAT, together, focus on the importance of concurrency in designing computer architecture and algorithms.

## Simple Example

File:C-AMAT Calculation.png
C-AMAT Calculation Example for Concurrent Accesses

The figure on the right provides an example of how to calculate C-AMAT. The figure shows five accesses all of which are to the same layer of the memory hierarchy. Each access has a 3 cycle cache hit latency. If the access is a miss, there is an additional miss penalty but the value of the penalty (cycles) is uncertain. Access 1,2, and 5 are hit accesses. Access 3 has a three cycle miss penalty while Access 4 has a one cycle miss penalty. When considering the access concurrency, only Access 3 contains two pure miss cycles. Though Access 4 has 1 miss cycle, this cycle is not a pure miss cycle because it overlaps with the hit cycles of Access 5. Therefore, the miss rate of the five accesses is 0.2 according to the new definition of concurrent pure miss rate, instead of 0.4 of the conventional, non-concurrent version. The intuition behind omitting cycles which completely overlap with hit accesses is that these miss accesses will not limit processor performance. This is because the processor can continue generating memory accesses while waiting for the missed data to return from lower memory hierarchies. Using this figure, C-AMAT is 1.6 cycles per access, whereas AMAT is 3.8 cycles per access. From the point of view of the processor, it will receive missing data every 1.6 cycles, not 3.8.

## Recursive Expression

C-AMAT can be extended to multiple layers of the memory hierarchy. This allows accurate performance analysis throughout all layers of the memory hierarchy.

$C\mbox{-}AMAT_1 = \frac{H_1}{C_{H_1}}+pMR_1 \cdot \eta_1 \cdot C\mbox{-}AMAT_2$

Where,

$C\mbox{-}AMAT_1 = \frac{H_1}{C_{H_1}} + pMR_1 \cdot \frac{pAMP_1}{C_{M_1}}$

$C\mbox{-}AMAT_2 = \frac{H_2}{C_{H_2}} + pMR_2 \cdot \frac{pAMP_2}{C_{M_2}}$

$\eta_1 = \frac{pAMP_1}{AMP_1} \cdot \frac{C_{m_1}}{C_{M_1}}$

$pMR_1 \cdot \eta_1$ is the concurrency contribution in reducing average memory access delay at the L$_{\text{1}}$ cache level. $C_m$ is the miss concurrency, not pure miss concurrency. The number of misses which occurs in the L$_{\text{2}}$ cache is $C_m$, while the performance of the L$_{\text{1}}$ cache only depends on $C_M$, which is pure miss concurrency. This recurrence relation can be extended to the other layers of the memory hierarchy in a similar manner.

## Concurrent Data Accesses

Concurrent data accesses occur when there are multiple accesses being serviced in the same memory cycle. In modern memory systems, there can be multiple accesses occurring simultaneously in the same and different layer(s) of a memory hierarchy. As opposed to sequential accesses, where each access has to wait for the previous accesses to complete before accessing the memory hierarchy. Concurrent data accesses minimize an access's wait time by allowing each memory layer to serve multiple accesses in parallel. For example in a single modern cache, there can be multiple outstanding cache misses and multiple concurrent cache hits in the same memory cycle. If this cache could only process accesses sequentially, each access would be queued until the miss or hit access completes. The amount of accesses a memory layer can serve in parallel depends on hardware characteristics, such as instruction level parallelism. Data concurrency is a common research field in cache and memory optimization and is the focus of many industrial and academic researchers.

## References

1. [1] Sun X H, Wang D. Concurrent Average Memory Access Time. IEEE Computer, 2014, 47(5):74-80
2. Xian-He Sun and Dawei Wang, "APC: A Performance Metric for Memory Systems", ACM SIGMETRICS Performance Evaluation Rev., vol 40, no. 2. 2012, pp. 125-130.
3. Wang, D, Sun, X H (2014) APC: A novel memory metric and measurement methodology for modern memory system. IEEE Transactions on Computers 63: pp. 1626-1639.