How is the Cache organized, and how does it work?

My habit is to write about whatever topic comes to mind, which means I often start a series without finishing it. I have been busy with work lately and did not update the Cache series; my apologies. Thanks to many readers, the Cache articles have been well received, and today I bring the second installment.

We often see caches described as 2-way or 4-way. What does that mean? And how do Cache and memory addresses correspond? Today we will introduce exactly that.

In the CPU, the Cache sits at the top of the memory hierarchy, with main memory and external storage below it.

So how does the CPU access the Cache? Much as with memory, the CPU microarchitecture assigns addresses to the Cache at every level, but these addresses are not observable by software. For simplicity, we ignore the differences between the Cache levels and discuss access to the Cache in the general sense.

Cache composition and access methods

In most modern processors, the Cache is divided into many lines (Cache Lines). Cache Line sizes range from 16 bytes to 128 bytes; 64 bytes is the most common, so from here on we assume a 64B Cache Line. Suppose we have a 512KB Cache: it can be divided into 512KB / 64B = 8192 Cache Lines.
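To make the arithmetic concrete, here is a minimal C sketch. The 512KB and 64B figures come from the example above; everything else, including the sample address, is illustrative:

```c
#include <stdio.h>
#include <stdint.h>

#define CACHE_SIZE (512 * 1024)  /* 512KB Cache from the example */
#define LINE_SIZE  64            /* 64B Cache Line               */

int main(void) {
    /* 512KB / 64B = 8192 Cache Lines */
    printf("lines: %d\n", CACHE_SIZE / LINE_SIZE);

    uint64_t addr = 0x12345678;  /* arbitrary sample address */
    /* addr / 64 identifies which 64B block of memory this is; */
    /* addr % 64 is the byte offset inside that line.          */
    printf("block: %#llx, offset: %llu\n",
           (unsigned long long)(addr / LINE_SIZE),
           (unsigned long long)(addr % LINE_SIZE));
    return 0;
}
```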

So how does an address map into the Cache? We know that addresses come in several kinds: logical addresses (virtual addresses), linear addresses, and physical addresses. The basic process of translating a virtual address into a physical address is shown in the following figure, which I will not elaborate on here:

The page table is used to translate linear addresses into physical addresses. A page table consists of many items, each called a page table entry. The entire page table is maintained by the operating system and kept in memory (or on disk). From the previous article (Where is L1, L2, L3 Cache?) we know that a memory access takes hundreds of clock cycles, so walking the in-memory page table on every address translation would waste a great deal of time. To speed this up, modern computers introduce the Translation Lookaside Buffer (TLB). The TLB can be regarded as a Cache for the page table: the CPU consults the TLB on every address translation, and on a TLB hit it does not need to fetch the page table from memory.
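The flow can be sketched in a few lines of C. This is a toy model, not real MMU code: the direct-mapped TLB, its 64-entry size, the 4KB page size, and the walk_page_table() helper are all illustrative assumptions:

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_SIZE   4096   /* common 4KB pages (assumption)  */
#define TLB_ENTRIES 64     /* illustrative TLB size          */

typedef struct {
    uint64_t vpn;   /* virtual page number   */
    uint64_t pfn;   /* physical frame number */
    bool     valid;
} TlbEntry;

static TlbEntry tlb[TLB_ENTRIES];

/* Hypothetical page-table walk: costs hundreds of cycles,
 * because the page table itself lives in memory. */
extern uint64_t walk_page_table(uint64_t vpn);

uint64_t translate(uint64_t vaddr) {
    uint64_t vpn = vaddr / PAGE_SIZE;
    TlbEntry *e = &tlb[vpn % TLB_ENTRIES];  /* direct-mapped TLB for simplicity */

    if (e->valid && e->vpn == vpn)          /* TLB hit: fast path */
        return e->pfn * PAGE_SIZE + vaddr % PAGE_SIZE;

    uint64_t pfn = walk_page_table(vpn);    /* TLB miss: slow walk, then refill */
    e->vpn = vpn; e->pfn = pfn; e->valid = true;
    return pfn * PAGE_SIZE + vaddr % PAGE_SIZE;
}
```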

So what is the relationship between the TLB and the Cache? A TLB hit is effectively a precondition for a useful Cache hit: if the TLB misses, the TLB entry must be refilled, which is very expensive, and the benefit of any Cache hit is largely lost. Only on a TLB hit is the physical address obtained quickly enough for a Cache hit to pay off. As shown below:

N-Way Set-Associative Cache

From the preceding section we know that once the physical address is available, we can determine whether the Cache hits. But there are many possible ways to map a physical address onto a Cache address. Let's look at two extreme examples:

  • Direct Mapping

Direct mapping means that each address maps to exactly one fixed Cache Line. Take the 512KB Cache from earlier. Suppose we have 1GB of physical memory; we divide it into 8192 blocks of 128KB each. The direct-mapping rule is that all addresses within a given 128KB block can only be mapped to that block's corresponding Cache Line. The hit-check algorithm is then very simple: divide the address by 128KB, and the quotient is the Cache Line index; if the tag stored in that line matches the rest of the address, it is a hit, otherwise a miss. This is simple and efficient, and the check is very fast. The scheme is illustrated below:

Image courtesy of HardwareSecrets
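The hit check can be sketched in C using the example's numbers (512KB Cache, 8192 lines, 1GB memory in 128KB blocks). The Line structure and names are mine for illustration, not a literal description of the hardware:

```c
#include <stdint.h>
#include <stdbool.h>

#define LINE_SIZE  64
#define NUM_LINES  8192           /* 512KB / 64B        */
#define BLOCK_SIZE (128 * 1024)   /* 1GB / 8192 = 128KB */

typedef struct {
    bool     valid;
    uint32_t tag;   /* which 64B line of the 128KB block is cached */
} Line;

static Line cache[NUM_LINES];

bool direct_mapped_hit(uint64_t addr) {
    uint32_t index = addr / BLOCK_SIZE;               /* quotient = line index      */
    uint32_t tag   = (addr % BLOCK_SIZE) / LINE_SIZE; /* position inside the block  */
    return cache[index].valid && cache[index].tag == tag;
}
```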

This looks very efficient, so what is the drawback? The Cache miss rate is extremely high. Because of the correlation and locality of data, most of the data needed at the same time sits close together in memory, and under this mapping nearby addresses compete for the same Cache Line. The Cache is therefore frequently swapped in and out, causing thrashing. An improved approach is to interleave the mapping, but Cache misses remain high and unevenly distributed.

  • Fully Associative Mapping

Fully associative mapping means that any address can be placed in any Cache Line. No Cache Line sits cold while others thrash, and the Cache miss rate drops a great deal, but this brings another problem: the overhead of deciding whether the Cache hits is very high, since confirming a miss requires searching the entire Cache, as sketched below.
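A sketch of that whole-Cache search, under the same illustrative assumptions (here the tag is simply the address's 64B-aligned block number). Real designs compare all tags in parallel with dedicated comparators rather than looping, which is exactly why the hardware cost is so high:

```c
#include <stdint.h>
#include <stdbool.h>

#define LINE_SIZE 64
#define NUM_LINES 8192   /* 512KB / 64B */

typedef struct {
    bool     valid;
    uint64_t tag;   /* the address's 64B-aligned block number */
} FaLine;

static FaLine fa_cache[NUM_LINES];

/* Any line may hold any address, so on a miss every single
 * tag must be examined before we can be sure. */
bool fully_associative_hit(uint64_t addr) {
    uint64_t tag = addr / LINE_SIZE;
    for (int i = 0; i < NUM_LINES; i++)
        if (fa_cache[i].valid && fa_cache[i].tag == tag)
            return true;
    return false;
}
```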

Direct mapping was popular when Caches were first invented, but as Caches developed, its problems were exposed; at the same time, full associativity is not a good solution either. So people settled for a compromise, looking for a balance between the two extremes, and proposed N-way set-associative mapping:

  • N-Way Set-Associative Mapping

In an n-way scheme, the Cache is divided into n groups (ways), and every address corresponds to one Cache Line in each group. This means an address can be mapped to any of n Cache Lines. Let's take 4 ways as an example and divide the Cache into 4 groups:

The same 1GB of memory is partitioned accordingly: with 4 ways, each way holds 8192 / 4 = 2048 Cache Lines, so the memory is divided into 2048 blocks of 512KB each:

Image courtesy of HardwareSecrets

As you can see, this is very similar to direct mapping. The difference is that an address now selects a group of Cache Lines: if the first line in the group does not match, the next one is checked, up to the last line of the group. This combines the advantages of both extremes.
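Following the 4-way example above (4 ways of 2048 lines each, memory in 512KB blocks), the lookup can be sketched like this; the types and names are again illustrative:

```c
#include <stdint.h>
#include <stdbool.h>

#define LINE_SIZE  64
#define NUM_WAYS   4
#define NUM_SETS   2048           /* 8192 lines / 4 ways     */
#define BLOCK_SIZE (512 * 1024)   /* 1GB / 2048 sets = 512KB */

typedef struct {
    bool     valid;
    uint32_t tag;   /* which 64B line of the 512KB block is cached */
} WayLine;

static WayLine sa_cache[NUM_SETS][NUM_WAYS];

/* The address picks exactly one group; only the n lines in
 * that group need checking, not the whole Cache. */
bool set_associative_hit(uint64_t addr) {
    uint32_t set = addr / BLOCK_SIZE;
    uint32_t tag = (addr % BLOCK_SIZE) / LINE_SIZE;
    for (int way = 0; way < NUM_WAYS; way++)
        if (sa_cache[set][way].valid && sa_cache[set][way].tag == tag)
            return true;
    return false;
}
```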

Real-world choices

In N-way set-associative mapping, n = 1 gives direct mapping, and n = the number of Cache Lines gives fully associative mapping. We know from the above that neither extreme is good, so the best n lies somewhere in between. What value exactly should be chosen? That is complicated: it depends on the speed and size of the Cache, the speed of memory, the clock frequency, and so on. In many cases it is an empirical value, the result of extensive pre-silicon experiments.

Let's look at the associativity of real-world L1, L2, and L3 caches:

[Tables: per-level Cache associativity for AMD and Intel processors]

From this we can see that the associativity chosen differs between Cache levels and is fine-tuned between processor generations. Intel sticks to powers of 2, while AMD sometimes uses an odd number of ways.

The next article will explain how Cache size affects performance, and why we cannot simply build a CPU out of nothing but Cache. Stay tuned.

Other articles in the Cache series:

Where is L1, L2, L3 Cache?

Why does the Cache have so many levels? Why is each level bigger than the last? Is a bigger Cache always better?

You are welcome to follow this column. Scan the QR code below with WeChat to join the WeChat public account "UEFIBlog", where the latest articles appear first. Contributions to this column and the public account are also welcome!

Scan the QR code with WeChat to join the UEFIBlog public account.

References:

[1]: How The Cache Memory Works, Hardware Secrets