Proposal of a New Cache Coherence Protocol to Optimize CPU Time through Cache Simulation

Cache coherence is a critical issue that increasingly affects the performance of a multicore processor as the number of cores on chip multiprocessors grows and shared-memory programs run on these processors. "Snoopy protocols" and "directory-based protocols" are the two families of protocols used to achieve coherence between caches. Their main objective is to keep the data values in the caches of a multicore processor consistent and valid, so that a read of a memory address through any cache returns the latest data written to that address. In this paper, a new protocol is designed to solve the cache coherence problem by combining the two coherence schemes, snooping and directory, based on the states of the MESI protocol. MESI is a version of the snooping cache protocol based on the four states (Modified, Exclusive, Shared, Invalid) that a block in cache memory can take. The proposed protocol has the same states as the MESI protocol, but differs in placing the directory inside a shared cache instead of main memory, making the processor more efficient by reducing the gap between the fast CPU and the slow main memory.


INTRODUCTION
Shared memory is a hardware feature supported by many modern computer systems and multicore chips. Each of the processor cores in a shared-memory system may read and write a single address space [1]. In designing such a shared system, however, one of the most important problems arises: the coherence problem. It results when caches are placed between the processors and main memory in recent computers to relieve the contention caused by simultaneous attempts to access shared memory, which would otherwise degrade performance [2].
Two hardware-based protocols are used to solve the coherence problem that appears in the shared-memory system of a multicore processor whose caches can store multiple copies of memory blocks simultaneously: "Snooping protocols" and "Directory-based protocols" [3,4,5].

Memory Hierarchy
The basic idea for overcoming the growing gap between a fast CPU and a slow RAM is to use a hierarchy of memories, as in figure (1). Each level is faster, more expensive, and smaller the closer it is to the CPU, so that the CPU can be fed with the required data [6].

Protocols for Cache Coherence
Two hardware-based protocol families are used to keep the caches of multiprocessor systems coherent:

Snooping Protocol
To solve the cache coherence problem with a snoopy protocol, the central bus is used as a "broadcast medium", which makes bus transactions visible to all caches [11,12]. As a result, the cache controllers of all processors can observe all memory accesses (figure 2) [5,7]. These protocols are called "update-based protocols" when the update is performed directly by the cache controllers, and "invalidation-based protocols" when the cache block matching the memory block is invalidated, so that main memory must supply the updated value on the next read [3].
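The MESI states used by these snooping protocols can be sketched as a small next-state function. This is an illustrative model only, not the paper's simulator code: the event set is simplified, and the names (`nextState`, `sharedLine`) are assumptions.

```cpp
#include <cassert>

// The four MESI states a cache block can take.
enum class Mesi { Modified, Exclusive, Shared, Invalid };

// Local processor events and events snooped from the shared bus.
enum class Event { ProcRead, ProcWrite, BusRead, BusWrite };

// Next-state function for one cache line. `sharedLine` models the bus
// "shared" signal: true if another cache also holds the block.
Mesi nextState(Mesi s, Event e, bool sharedLine) {
    switch (e) {
    case Event::ProcRead:
        if (s == Mesi::Invalid)                    // read miss: fetch the block
            return sharedLine ? Mesi::Shared : Mesi::Exclusive;
        return s;                                  // read hits keep the state
    case Event::ProcWrite:
        return Mesi::Modified;                     // write after invalidating others
    case Event::BusRead:
        if (s == Mesi::Modified || s == Mesi::Exclusive)
            return Mesi::Shared;                   // remote read: demote to Shared
        return s;
    case Event::BusWrite:
        return Mesi::Invalid;                      // remote write: invalidate our copy
    }
    return s;
}
```

The `BusRead` case models the write-backs discussed later: a remote read of a Modified line forces the owner to surrender the dirty data before demoting to Shared.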

Directory Based Protocol
Directory-based schemes scale better than snooping because they do not depend on a shared bus for communication. The directory, which can be central or distributed, keeps the state of every memory block shared between processors; the cache controller then uses point-to-point messages to look up the directory, instead of observing a shared broadcast, to obtain a memory block's state [3,10] (figure 3). Although directory-based protocols will likely have to be employed for the multicore architectures of the future, directories have drawbacks: storage overhead, frequent indirections, and a greater proneness to design bugs.
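The directory organization described above can be sketched as one entry per memory block, holding a coherence state plus a bit vector of sharing cores. The struct names and the 8-core limit are illustrative assumptions, not the paper's data structures.

```cpp
#include <bitset>
#include <cassert>
#include <vector>

constexpr int kCores = 8;  // illustrative core count

enum class DirState { Uncached, Shared, Modified };

// One directory entry per memory block: state plus sharer bit vector.
struct DirEntry {
    DirState state = DirState::Uncached;
    std::bitset<kCores> sharers;   // which cores hold a copy
};

struct Directory {
    std::vector<DirEntry> entries;
    explicit Directory(int blocks) : entries(blocks) {}

    // A core reads a block: record it as a sharer instead of broadcasting.
    void read(int block, int core) {
        DirEntry& e = entries[block];
        e.sharers.set(core);
        e.state = DirState::Shared;
    }

    // A core writes a block: point-to-point invalidations go only to the
    // recorded sharers, then the writer becomes the sole owner.
    void write(int block, int core) {
        DirEntry& e = entries[block];
        e.sharers.reset();
        e.sharers.set(core);
        e.state = DirState::Modified;
    }
};
```

The sharer vector is also the source of the storage overhead noted above: every memory block needs one bit per core, whether or not it is ever shared.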

Measuring Cache Performance
CPU time elapses in executing the program as well as in waiting on memory, so CPU time is calculated as in the following equations [6,7]:
- Miss: the referenced information is not in the cache and must be read from main memory.
- Hit time: how long it takes data to be sent from the cache to the processor. This is usually fast, on the order of 1-5 clock cycles at level 1, 10-20 clock cycles at level 2, 30-40 clock cycles at level 3, and 50-100 clock cycles at main memory.
- Miss penalty: the time to copy data from main memory to the cache. This often requires dozens of clock cycles (at least).
- Hit ratio: the percentage of time the data is found in the higher cache.
- Miss ratio: the percentage of misses, equal to (100 - hit ratio).
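The metrics above can be computed directly; a minimal sketch, assuming the standard single-level formula AMAT = hit time + miss ratio x miss penalty (the function names are illustrative):

```cpp
#include <cassert>

// Hit ratio as a percentage: hits over total accesses.
double hitRatio(int hits, int accesses) {
    return 100.0 * hits / accesses;
}

// Miss ratio is simply the complement of the hit ratio.
double missRatio(int hits, int accesses) {
    return 100.0 - hitRatio(hits, accesses);
}

// Average Memory Access Time for a single cache level:
// AMAT = hit time + (miss ratio) * (miss penalty).
double amat(double hitTime, double missRatioPct, double missPenalty) {
    return hitTime + (missRatioPct / 100.0) * missPenalty;
}
```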

The Experiment Result Using DEV C++ Language Binary Representation
Binary representation is a necessary preprocessing step used to convert decimal addresses into binary addresses in order to obtain the tag, index, and offset of each binary address, so as to facilitate the work of the mapping algorithm. In the proposed protocol, main memory uses 8 bits to represent an address, so the memory has 256 addresses, as in table 1.
Journal, Vol. 34, Part (B), No. 6
These steps are repeated to simulate the caches at level 2 and level 3; the difference is in the number of cache lines and the tag, which are specified initially.
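The preprocessing step above can be sketched in C++ (the language the experiment uses): split an 8-bit address into offset, index, and tag fields by masking and shifting. The field widths passed in are illustrative assumptions; the paper's simulator fixes its own widths per cache level.

```cpp
#include <cassert>
#include <cstdint>

// The three fields extracted from a binary address.
struct Fields { uint8_t tag, index, offset; };

// Split an 8-bit address: the lowest `offsetBits` bits are the offset,
// the next `indexBits` bits select the cache line, the rest is the tag.
Fields splitAddress(uint8_t addr, int offsetBits, int indexBits) {
    Fields f;
    f.offset = addr & ((1u << offsetBits) - 1);
    f.index  = (addr >> offsetBits) & ((1u << indexBits) - 1);
    f.tag    = addr >> (offsetBits + indexBits);
    return f;
}
```

For example, with a 2-bit offset and 3-bit index, address 10110110 in binary splits into tag 101, index 101, and offset 10.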

Proposal protocol Results
Before applying the proposed protocol, a binary function is used to convert the addresses of the input sample program into binary addresses, and then other functions are used to obtain the tag, index, and offset from each binary address. Table 2 lists the binary addresses used in the sample program given in table 3. The results of applying the proposed protocol to the sample program are listed in table 7. Initially, all states of the input sample program are invalid and all values equal zero.

Cache Performance Result
Cache performance can be measured by counting a program's execution cycles, which include the cache hit time and the memory stall cycles that result from cache misses. Suppose that, depending on the clock speed of the central processor, it takes: 7 ns to access data in the L1 cache, 17 ns in the L2 cache, 30 ns in the L3 cache, and 80 ns in main memory. The hit and miss ratios are calculated both according to the addresses and according to the proposed protocol:
1- In using the proposed protocol
When the proposed protocol in figure 7 is applied to the sample program, hit and miss ratios appear at the level 1 caches only, because the input uses only 3 addresses and all of them fall in the same cache line. The hit and miss ratios are calculated from table 3 as follows:
Hit ratio at L1 = (no. of hits in level 1 / total no. of addresses) * 100 = (5/20) * 100 = 25%
Miss ratio at L1 = 100 - hit ratio = 100 - 25 = 75%
Hit and miss ratios at level 2 and level 3 do not exist.
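The single-level arithmetic above extends to a multi-level AMAT; a sketch under the assumption that each miss falls through to the next level, using the 7/17/30/80 ns access times quoted in the text (the function name and the pass-through miss ratios for L2/L3 are illustrative).

```cpp
#include <cassert>

// Three-level AMAT: each level's miss ratio (as a fraction, 0..1)
// forwards the access to the next, slower level.
// AMAT = t1 + m1*(t2 + m2*(t3 + m3*tMem))
double amat3(double t1, double m1, double t2, double m2,
             double t3, double m3, double tMem) {
    return t1 + m1 * (t2 + m2 * (t3 + m3 * tMem));
}
```

With the sample program's 75% L1 miss ratio and, as an assumption, all misses falling through L2 and L3 to main memory, this gives 7 + 0.75 * (17 + 30 + 80) = 102.25 ns.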

2- In using the addresses
The first address is a miss and all the other addresses are hits, because the same line contains all the addresses: when this line is fetched, the first address is taken from main memory and the remaining addresses then appear in the cache. So the hit and miss ratios are calculated as follows:
Hit ratio at L1 = (19/20) * 100 = 95%
Miss ratio at L1 = 100 - 95 = 5%
Hit ratio = (hit ratio in protocol + hit ratio in addresses) / 2 = (25 + 95) / 2 = 60%
Miss ratio = (miss ratio in protocol + miss ratio in addresses) / 2 = (75 + 5) / 2 = 40%
Finally, the Average Memory Access Time (AMAT) is applied as in equation 11.

The Comparison between MESI and the Proposed Protocol
- In MESI cache coherence protocols, the directory that keeps track of shared data is located in main memory, but in the proposed protocol the directory is located in the shared level 3 cache. As a result, efficiency is increased by reducing the gap between the fast CPU and the slow main memory.
- The write-through and write-back operations have been moved from main memory into the level 3 shared cache, so the disadvantage of write-through, using more memory bandwidth, is reduced, and the disadvantage of write-back, leaving main memory inconsistent with the cache, is also reduced. The differences between MESI and the proposed protocol when running the sample program shown in table 3 are as follows: steps 3, 5, 6, 11, 13, 14, 17, and 19 write back the address lines of a previously Modified state to main memory as a result of a remote read, and step 10 writes back a Modified address line to main memory as a result of a remote write. In the proposed protocol, these steps instead update the directory at level 3 rather than accessing main memory.

Conclusion and Future Works
A new idea is proposed in this research to achieve cache coherency. The reason behind developing the coherence protocol is that such a protocol strongly affects the efficiency of the processor in multi-core computer systems.
In future work, the number of caches at level 1 and level 2 will be increased, one of the states will be modified, and the associativity used by the mapping algorithm will be increased. All these ideas are proposed in order to reduce accesses to main memory.