L1 Data Cache = 64 KB. 32 B/line, 16-WAY.
L1 Instruction Cache = 64 KB. 32 B/line, 16-WAY.
L2 Cache = 128 KB. 32 B/line, 4-WAY.
Note about the L1 data cache: an access takes two extra clocks if the memory operand is not in the last-accessed way of the data cache. One way of the L1 covers 4 KB (64 KB / 16 ways).
L1 TLB size = 16 entries (fully associative). Miss penalty = 3 cycles.
L2 TLB size = 64 entries. Miss penalty = 5 cycles.
DTE cache (PDE cache) size = 12 entries (covers 48 MB: each PDE maps 4 MB).
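The latencies below come from dependent-load (pointer-chase) tests. A minimal sketch of the method in C, assuming rdtsc access and a 32 B stride (the exact benchmark behind these numbers may differ; for RAM-sized working sets the chain order should be randomized to defeat DRAM page locality):

```c
/* Pointer-chase latency sketch (illustrative, not the original benchmark).
 * Build: gcc -O2 -o chase chase.c */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void) {
    size_t size = 64 * 1024;            /* working set; vary to walk the table below */
    size_t step = 32 / sizeof(void *);  /* one hop per 32 B cache line */
    size_t n = size / 32;
    void **buf = malloc(size);
    for (size_t i = 0; i < n; i++)      /* circular chain, one hop per line */
        buf[i * step] = &buf[((i + 1) % n) * step];
    void **p = buf;
    const long iters = 10 * 1000 * 1000;
    uint64_t t0 = rdtsc();
    for (long i = 0; i < iters; i++)
        p = (void **)*p;                /* each load depends on the previous one */
    uint64_t t1 = rdtsc();
    printf("%.2f cycles/load (p=%p)\n", (double)(t1 - t0) / iters, (void *)p);
    return 0;
}
```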
Size | Latency | Description |
---|---|---|
4 KB | 3 | L1-TLB hit + L1-Cache way-1 hit |
64 KB | 5 | + 2 (L1-Cache way-1 miss -> L1-Cache hit) |
128 KB | 16 | + 8 (L1-Cache miss -> L2-Cache hit) + 3 (L1-TLB miss) |
256 KB | 16 + 70 ns | + RAM access |
48 MB | 21 + 70 ns | + 5 (L2-TLB miss) |
... | 25 + 70 ns | + 4 (DTE-Cache miss -> cache hit) |
Latencies are cumulative: e.g. the 48 MB row is 3 + 2 + 8 + 3 + 5 = 21 cycles plus the ~70 ns RAM access.
There is no TLB for 4 MB pages!
4 MB pages go through the 4 KB-page TLB (16 entries), with a miss penalty of 4 cycles. (A huge-page test sketch follows the table below.)
4M PTE cache = 4 entries (covers 16 MB: 4 entries × 4 MB).
Size | Latency | Description |
---|---|---|
4 KB | 3 | TLB hit + L1-Cache way-1 hit |
64 KB | 5 | + 2 (L1-Cache way-1 miss -> L1-Cache hit) |
128 KB | 17 | + 8 (L1-Cache miss -> L2-Cache hit) + 4 (TLB miss) |
16 MB | 17 + 70 ns | + RAM access |
... | 21 + 70 ns | + 4 (4M-PTE-Cache miss -> cache hit) |
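To repeat the chase with 4 MB pages on Linux, the buffer can be backed by huge pages. A sketch assuming MAP_HUGETLB support and huge pages preallocated via /proc/sys/vm/nr_hugepages (the original test may have mapped large pages differently):

```c
/* Huge-page-backed buffer for the 4 MB-page chase (Linux sketch). */
#define _GNU_SOURCE
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t size = 16 * 1024 * 1024;
    void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap(MAP_HUGETLB)"); return 1; }
    /* ... build the pointer chain in buf and chase it as above ... */
    munmap(buf, size);
    return 0;
}
```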
8-byte boundary cross penalty = 1 cycle.
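For example, a 4-byte load at offset 5 of an 8-byte-aligned buffer spans bytes 5..8 and crosses the boundary between bytes 7 and 8. An illustrative sketch:

```c
#include <stdint.h>
#include <string.h>

static uint8_t buf[16] __attribute__((aligned(8)));

uint32_t crossing_load(void) {
    uint32_t v;
    memcpy(&v, buf + 5, sizeof v);  /* 4-byte load at 8n+5: crosses an
                                       8-byte boundary -> +1 cycle here */
    return v;
}
```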
Read B/W:
L2 B/W (4 Bytes stride) = 740 MB/s
L2 B/W (32 Bytes stride) = 1270 MB/s (13 cycles per cache line)
RAM B/W (4 Bytes stride) = 312 MB/s
RAM B/W (32 Bytes stride) = 390 MB/s
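A sketch of the strided-read loop behind such figures (buffer size, repetition count, and the covered-bytes reporting convention are assumptions):

```c
/* Strided read bandwidth sketch.  Reports MB/s over the covered region. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    size_t size = 8 * 1024 * 1024;   /* larger than L2 -> RAM-bound */
    size_t stride = 32;              /* use 4 for the 4-byte-stride case */
    int reps = 16;
    volatile uint8_t *buf = malloc(size);
    for (size_t i = 0; i < size; i++) buf[i] = (uint8_t)i;  /* touch all pages */
    unsigned sink = 0;
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++)
        for (size_t i = 0; i < size; i += stride)
            sink += buf[i];          /* one read per stride */
    double s = (double)(clock() - t0) / CLOCKS_PER_SEC;
    printf("%.0f MB/s covered (sink=%u)\n", reps * (double)size / s / 1e6, sink);
    return 0;
}
```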
Prefetch mode can be enabled via an MSR. In prefetch mode the CPU loads one additional cache line from RAM for each access. This can increase sequential read speed, but latency for random accesses increases.
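On Linux, MSRs can be accessed through /dev/cpu/N/msr (the msr module, root required). The register index and bit below are placeholders, not the real Geode LX prefetch control; consult the AMD Geode LX databook for the actual register:

```c
/* MSR access sketch (Linux, root, "modprobe msr").
 * GEODE_PREFETCH_MSR / PREFETCH_ENABLE_BIT are HYPOTHETICAL placeholders;
 * look up the real prefetch-control MSR and bit in the Geode LX databook. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define GEODE_PREFETCH_MSR  0x00000000u   /* placeholder MSR index */
#define PREFETCH_ENABLE_BIT (1ull << 0)   /* placeholder bit */

int main(void) {
    int fd = open("/dev/cpu/0/msr", O_RDWR);
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }
    uint64_t val;
    if (pread(fd, &val, 8, GEODE_PREFETCH_MSR) != 8) { perror("pread"); return 1; }
    val |= PREFETCH_ENABLE_BIT;                    /* set the enable bit */
    if (pwrite(fd, &val, 8, GEODE_PREFETCH_MSR) != 8) { perror("pwrite"); return 1; }
    close(fd);
    return 0;
}
```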
The latency of ADD r32, [m32] is 3 cycles, so the real L1 data cache load latency is below 3 cycles (part of those 3 cycles is the ALU add itself).
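That figure can be approximated with a dependent chain of ADD r32, [m32] instructions; a sketch using GCC inline assembly (loop overhead adds a small error):

```c
/* Dependent ADD r32, [m32] chain; expect ~3 cycles per add on Geode LX. */
#include <stdint.h>
#include <stdio.h>

static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void) {
    uint32_t mem = 0, acc = 0;
    const long iters = 1000000;
    uint64_t t0 = rdtsc();
    for (long i = 0; i < iters; i++)
        __asm__ __volatile__(        /* 4 adds, each dependent on the last */
            "addl %1, %0\n\t"
            "addl %1, %0\n\t"
            "addl %1, %0\n\t"
            "addl %1, %0"
            : "+r"(acc) : "m"(mem));
    uint64_t t1 = rdtsc();
    printf("%.2f cycles per ADD r32,[m32] (acc=%u)\n",
           (double)(t1 - t0) / (4.0 * iters), acc);
    return 0;
}
```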
Branch misprediction penalty = 8 cycles, which matches the 8-stage integer pipeline below.
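The penalty can be estimated by timing a branch on random data versus the same branch on constant data; a sketch (with optimization the compiler may turn the branch into a conditional move, so check the generated assembly):

```c
/* Branch on random vs. predictable data to expose the mispredict cost. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv) {
    enum { N = 1 << 20 };
    int reps = 100;
    uint8_t *data = malloc(N);
    srand(1);
    for (int i = 0; i < N; i++)       /* pass any argument for predictable data */
        data[i] = (argc > 1) ? 1 : (uint8_t)(rand() & 1);
    unsigned sum = 0;
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++)
        for (int i = 0; i < N; i++)
            if (data[i])              /* ~50% mispredicted when data is random */
                sum += (unsigned)i;
    double s = (double)(clock() - t0) / CLOCKS_PER_SEC;
    printf("%.2f ns/branch (sum=%u)\n", s * 1e9 / ((double)reps * N), sum);
    return 0;
}
```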
AMD Geode LX Instruction Latencies (Everest)
Integer pipeline (from AMD docs):
# | Name | Description |
---|---|---|
1 | Instruction Prefetch | Raw instruction data is fetched from the instruction memory cache. |
2 | Instruction Pre-decode | Prefix bytes are extracted from the raw instruction data. This decode looks ahead to the next instruction, and the bubble can be squashed if the pipeline stalls downstream.
3 | Instruction Decode | Performs full decode of instruction data. Indicates instruction length back to the Prefetch Unit, allowing the Prefetch Unit to shift the appropriate number of bytes to the beginning of the next instruction. |
4 | Instruction Queue | FIFO containing decoded x86 instructions. Allows Instruction Decode to proceed even if the pipeline is stalled downstream. Register reads for data operand address calculations are performed during this stage. |
5 | Address Calculation #1 | Computes linear address of operand data (if required) and issues request to the Data Memory Cache. Microcode can take over the pipeline and inject a micro-box here if multi-box instructions require additional data operands. |
6 | Address Calculation #2 | Operand data (if required) is returned and set up for the Execution stage, with no bubbles if there was a data cache hit. Segment limit checking is performed on the data operand address. The mROM is read for setup of the Execution Unit.
7 | Execution Unit | Register and/or data memory operands are fed through the Arithmetic Logic Unit (ALU) for arithmetic or logical operations. The mROM always fires for the first instruction box down the pipeline. Microcode can take over the pipeline and insert additional boxes here if the instruction requires multiple Execution Unit stages to complete.
8 | Writeback | Results of the Execution Unit stages are written to the register file or to data memory. |