AMD Zen
AMD Ryzen 7 1700X (Zen), 3.9 GHz (XFR), 14 nm. RAM: 32 GB DDR4-2600 (PC4-20800, dual channel)
- L1 Data cache = 32 KB, 64 B/line, 8-WAY, write-back, ECC
- L1 Instruction cache = 64 KB, 64 B/line, 4-WAY, 32 bytes/cycle fetch.
- L2 cache = 512 KB, 64 B/line, 8-WAY, write-back. 32 bytes L2->L1 datapath.
- L3 cache = 8 MB (per 4 cores), 64 B/line, 16-WAY. Victim cache, filled by L2 evictions.
  4 slices interleaved by low-order address bits; tag stored in each line (AMD docs).
- L0 ITLB: 8 entries, all page sizes
- L1 ITLB: 64 entries, fully associative, all page sizes
- L2 ITLB: 512 entries, 8-way, 4 KB/2 MB pages (no 1 GB pages)
- Micro-tags for IC & OP cache
- Next Address Logic: when no branches are identified in the current fetch block,
  the next-address logic calculates the starting address of the next sequential
  64-byte fetch block. This calculation is performed every cycle to support the
  64-byte-per-cycle fetch bandwidth of the op cache. (See the sketch below.)
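A minimal C sketch of that calculation (the helper name is ours; the mask follows from the 64-byte block size):

    #include <stdint.h>

    /* With no branches in the current fetch block, the next fetch target
     * is simply the start of the following 64-byte-aligned block. */
    static inline uint64_t next_fetch_block(uint64_t pc) {
        return (pc & ~63ULL) + 64;  /* align down to 64 B, step forward */
    }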
- 2 branches per BTB entry (if both branches are in the same 64-byte line).
- L0 BTB: 4 forward-taken and 4 backward-taken branches; predicts with 0 bubbles (no CALLs/RETs).
- L1 BTB: 256 entries; creates 1 bubble if its prediction differs from the L0 BTB.
- L2 BTB: 4096 entries; creates 4 bubbles if its prediction differs from the L1 BTB.
- Return stack: 31 entries (2 × 15 entries in dual-thread mode).
- Indirect Target Array: 512 entries.
- The conditional branch predictor uses a global history scheme that tracks
  previously executed branches. Global history is not updated for not-taken branches.
  Conditional branches that are never taken are not marked in the BTBs.
  After its first taken occurrence, a conditional branch is predicted as always-taken.
- Instruction Byte Queue (IBQ): 20 entries (2 × 10 entries in dual-thread mode), 16 bytes/entry.
- Pick window: 32 bytes, aligned on a 16-byte boundary.
- Only the first pick slot (of 4) can pick instructions longer than eight bytes.
- Fetch: 4 instructions per cycle.
- Fused to a single MOP: a CMP or TEST instruction immediately followed by a
  conditional jump (Agner). Fusion happens at the dispatch stage, not at the
  decode stage. (Illustrated below.)
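A hedged example of C code that produces the fusible pattern (the function is ours; exact codegen depends on the compiler, so the asm in the comment is only illustrative):

    /* The hot path compiles to a CMP immediately followed by a conditional
     * jump -- the pair Zen can fuse into a single MOP at dispatch. */
    long count_below(const long *a, long n, long limit) {
        long c = 0;
        for (long i = 0; i < n; i++) {
            if (a[i] < limit)   /* e.g. cmp rcx, rdx ; jge .skip  <- fusible */
                c++;
        }
        return c;
    }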
- Op Cache (OC): 2048 ops; 32 sets, 8-WAY, 8 ops/line.
- Up to 8 sequential instructions ending in the same 64-byte aligned memory
  region may be cached together in one entry.
- 8 32-bit immediates/displacements per entry (64-bit immediates/displacements take two slots).
- The machine can only transition from IC mode to OC mode at a branch target.
- L1 Data cache tags contain a microtag (utag), a hash of the virtual address.
  Microtags select the cache way for reads using the VA alone, before the PA
  arrives from the TLB. On a utag mispredict, a fill request to the L2 cache is
  initiated and the utag is updated when L2 responds to the fill request.
  (A hypothetical sketch follows.)
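A toy sketch of utag way prediction. The actual hash AMD uses is undocumented; hash8() below is a placeholder assumption, not the real function:

    #include <stdint.h>

    #define L1D_WAYS 8

    static uint8_t hash8(uint64_t va) {          /* assumption: toy hash */
        return (uint8_t)(va ^ (va >> 13) ^ (va >> 29));
    }

    /* Pick a way from the VA alone, before the TLB supplies the PA. */
    static int predict_way(uint64_t va, const uint8_t utags[L1D_WAYS]) {
        uint8_t u = hash8(va);
        for (int w = 0; w < L1D_WAYS; w++)
            if (utags[w] == u)
                return w;   /* predicted way */
        return -1;          /* mispredict/miss: L2 fill, then utag update */
    }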
- uOP Queue: 72 entries (?)
- Retire queue: 192 entries (24 × 8 in single-thread; 2 × 12 × 8 in SMT).
- The integer physical register file (PRF) consists of 168 registers,
with up to 38 per thread mapped to architectural state or microarchitectural temporary state.
The remaining registers are available for out-of-order renames.
- MOV elimination.
- INT scheduler queues: 6 × 14 entries
- 4 ALUs, 2 AGUs
- FP scheduler: 36-entry (micro-ops).
- 4 FPU pipes.
- 2x 16B loads + 1x 16B store per cycle
- Load queue: 72-entry
- Store queue: 44-entry
- LS unit can track up to 22 outstanding in-flight cache misses.
- Zen uses address bits 11:0 to determine STLF (store-to-load forwarding) eligibility (see the check below).
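A small sketch of the check implied by that statement (the function name is ours); the classic 4K-aliasing consequence falls straight out of it:

    #include <stdbool.h>
    #include <stdint.h>

    /* Only bits 11:0 are compared, so addresses that differ by a multiple
     * of 4 KB are indistinguishable to this check (4K-aliasing stalls). */
    static bool stlf_same_offset(uintptr_t store_addr, uintptr_t load_addr) {
        return (store_addr & 0xFFF) == (load_addr & 0xFFF);
    }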
L1 Data Cache Latency:
- 4 cycles for simple access via pointer
- 4 cycles for (base_reg + displacement) (AMD DOCs)
- 4 cycles for (base_reg + index_reg) (AMD DOCs)
- 5 cycles for access with complex address calculation (size_t n, *p; n = p[n]), as in the sketch below.
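A minimal dependent-load (pointer-chasing) loop of the kind behind these numbers; timing code and chain randomization (needed in practice to defeat the prefetchers) are omitted for brevity:

    #include <stddef.h>
    #include <stdlib.h>

    int main(void) {
        enum { N = 1024 };                 /* 8 KB chain: fits in the 32 KB L1D */
        size_t *p = malloc(N * sizeof *p);
        for (size_t i = 0; i < N; i++)
            p[i] = (i + 1) % N;            /* sequential chain for brevity */
        size_t n = 0;
        for (long k = 0; k < 100000000L; k++)
            n = p[n];                      /* complex form: base + scaled index */
        free(p);
        return (int)n;                     /* keep the chain live */
    }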
- L2 Cache Latency = 17 cycles (Ryzen 1xxx)
- L2 Cache Latency = 12 cycles (Ryzen 2xxx / Threadripper / Epyc)
- RAM Latency = 40 cycles + 90 ns (Ryzen 1xxx)
CCX L3:
Ryzen 1xxx: L3 Cache Latency (random access):
- 40 cycles : average latency for a core reading from any L3 slice
- 37 cycles : core reads from the nearest L3 slice
- 43 cycles : core reads from the farthest L3 slice
Ryzen 2xxx: L3 Cache Latency (random access):
- 35 cycles : average latency for a core reading from any L3 slice
- 32 cycles : core reads from the nearest L3 slice
- 38 cycles : core reads from the farthest L3 slice
CCX L3: latency penalty for a core reading from the different L3 slices:
+4c
Core-0 Slice-0 ====== Slice-2 Core-2
|| ||
+2c || || +2c
|| ||
Core-1 Slice-1 ====== Slice-3 Core-3
+4c
Note: these penalties are totals that include both the data request and the data response;
the one-way hop latency is therefore half of these values. (Checked arithmetically below.)
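A quick arithmetic check that the per-hop penalties reproduce the measured Ryzen 1xxx numbers, assuming the penalties compose additively (37/39/41/43 cycles for the four slices as seen from Core-0, averaging 40):

    #include <stdio.h>

    int main(void) {
        int base = 37, sum = 0;                  /* nearest slice from Core-0 */
        for (int s = 0; s < 4; s++) {
            int vert = s & 1, horiz = (s >> 1) & 1;
            int lat = base + 2 * vert + 4 * horiz;
            printf("slice %d: %d cycles\n", s, lat);
            sum += lat;
        }
        printf("average: %d cycles\n", sum / 4); /* -> 40, as measured */
        return 0;
    }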
Infinity Fabric
Infinity Fabric links in Threadripper and Epyc on the path from a CCX to a memory controller:
Local Memory access: CCX - xC - xM - MemCtl
Remote Memory access with 1 hop: CCX - xC - x6 - x3 - CAKE --- CAKE - x3 - x6 - xP - xM - MemCtl
Remote Memory access with 2 hops (short): CCX - xC - x6 - x3 - CAKE --- CAKE - x3 - CAKE --- CAKE - x3 - x6 - xP - xM - MemCtl
Remote Memory access with 2 hops (long): CCX - xC - x6 - x3 - CAKE --- CAKE - x3 - x6 - x3 - CAKE --- CAKE - x3 - x6 - xP - xM - MemCtl
- xC - switch for CCX
- xM - switch for memory controllers
- x6 - main switch for external infinity fabric links
- x3 - additional switch for 3 external infinity fabric links (2 IFOP and 1 IFIS)
- xP - intermediate switch from x6 to xM.
Estimated latencies in Infinity Fabric clock cycles (1200/1333/1467/1600 MHz):
- Each switch hop costs 2-3 cycles in one direction; latency can be higher if the distance between switches is large.
- 40 cycles - latency for IFOP (on package) (CAKE --- CAKE), total for request and response.
- 66-71 cycles - total overhead for access to remote-die memory in the same socket, compared to local memory access.
- 120 cycles - latency for IFIS (between sockets) (CAKE --- CAKE), total for request and response.
- 150 cycles - total overhead for access to remote-die memory in another socket (1 hop).
- 200 cycles - total overhead for access to remote-die memory in another socket (2 hops). (A rough cross-check follows.)
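A back-of-the-envelope cross-check of the 66-71 cycle same-socket overhead: relative to the local path, a 1-hop remote access adds five extra switch traversals each way (x6, x3, x3, x6, xP) plus one IFOP link. This decomposition is our estimate, not an AMD figure:

    #include <stdio.h>

    int main(void) {
        int ifop = 40;       /* CAKE --- CAKE, request + response */
        int extra_hops = 5;  /* x6, x3, x3, x6, xP (one direction) */
        for (int c = 2; c <= 3; c++)
            printf("hop cost %d: overhead ~ %d cycles\n",
                   c, ifop + 2 * extra_hops * c);  /* 60..70 vs measured 66-71 */
        return 0;
    }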
1 GB pages (64-bit)
- 1 GB Data TLB L1: 64 entries (by specification); tests show 32 entries. Miss penalty = 13 cycles. Parallel miss: ? cycles per access
Size Latency Increase Description
32 K 4
64 K 11 7 + 13 (L2)
128 K 14 3
256 K 16 2
512 K 17 1
1 M 29 12 + 23 (L3)
2 M 35 6
4 M 37 2
8 M 39 + 5 ns 2 + 5 ns
16 M 40 + 48 ns 1 + 43 ns + 90 ns (RAM)
32 M 40 + 70 ns 22 ns
64 M 40 + 81 ns 11 ns
128 M 40 + 86 ns 5 ns
256 M 40 + 88 ns 2 ns
512 M 40 + 89 ns 1 ns
1024 M 40 + 90 ns 1 ns
2 MB pages (32-bit)
- 2 MB Data TLB L1: 64 entries, fully associative. Miss penalty = 8 cycles. Parallel miss: ? cycles per access
- 2 MB Data TLB L2: 1536 entries, 12-way. Miss penalty = ? cycles. Parallel miss: ? cycles per access
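Reproducing the 2 MB (and 1 GB) page measurements requires a huge-page buffer; a minimal Linux sketch, assuming hugetlbfs pages have been reserved beforehand (e.g. via vm.nr_hugepages):

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 64UL << 20;  /* 64 MB test buffer */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }
        /* ... run the pointer-chasing measurement over buf ... */
        munmap(buf, len);
        return 0;
    }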
Size Latency Increase Description
32 K 4
64 K 11 7 + 13 (L2)
128 K 14 3
256 K 16 2
512 K 17 1
1 M 29 12 + 23 (L3)
2 M 35 6
4 M 37 2
8 M 39 + 5 ns 2 + 5 ns
16 M 40 + 48 ns 1 + 43 ns + 90 ns (RAM)
32 M 40 + 70 ns 22 ns
64 M 40 + 81 ns 11 ns
128 M 40 + 86 ns 5 ns
256 M 44 + 88 ns 4 + 2 ns + 8 (L1 TLB miss)
512 M 46 + 89 ns 2 + 1 ns
1024 M 47 + 90 ns 1 + 1 ns
4 KB pages (64-bit)
- Data TLB L1: 64 entries, fully associative. Miss penalty = 8 cycles. Parallel miss: 1 cycle per access
- Data TLB L2: 1536 entries, 12-way. Miss penalty = 34 ? cycles. Parallel miss: 18 ? cycles per access (read from L3)
- AMD: 2 page table walkers handle L2 TLB misses.
- AMD: PDE cache = 1536 entries ? (same as Data TLB L2). Miss penalty = ? cycles.
- AMD: PDC cache = 64 entries (PML4Es, PDPEs). Miss penalty = ? cycles. (The walk levels these caches cover are sketched below.)
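For reference, the address decomposition those walkers and paging-structure caches operate on, per standard x86-64 4-level paging (the example VA is arbitrary):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint64_t va = 0x00007f1234567000ULL;     /* arbitrary example VA */
        unsigned pml4 = (va >> 39) & 0x1FF;      /* PML4E -- cached in PDC */
        unsigned pdpt = (va >> 30) & 0x1FF;      /* PDPE  -- cached in PDC */
        unsigned pd   = (va >> 21) & 0x1FF;      /* PDE   -- PDE cache */
        unsigned pt   = (va >> 12) & 0x1FF;      /* PTE   -- Data TLB */
        unsigned off  = (unsigned)(va & 0xFFF);  /* byte offset in 4 KB page */
        printf("PML4=%u PDPT=%u PD=%u PT=%u off=0x%x\n", pml4, pdpt, pd, pt, off);
        return 0;
    }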
Size Latency Increase Description
32 K 4
64 K 11 7 + 13 (L2)
128 K 14 3
256 K 16 2
512 K 20 4 + 8 (L1 TLB miss)
1 M 35 15 + 23 (L3)
2 M 42 7
4 M 45 3
8 M 63 + 5 ns 18 + 5 ns + 34 ? (L2 TLB miss)
16 M 72 + 48 ns 9 + 43 ns + 90 ns (RAM)
32 M 82 + 70 ns 10 + 22 ns
64 M 87 + 81 ns 5 + 11 ns
128 M 97 + 86 ns 10 + 5 ns
256 M 109 + 88 ns 10 + 2 ns
512 M 113 + 89 ns 4 + 1 ns
1024 M 125 + 90 ns 12 + 1 ns
MISC
- Branch misprediction penalty = 19 cycles (mOp cache hit ?)
- Branch history table: 2K entries or more (measured with 8-branch code); about half of branches mispredict with 16K branches.
- L1 Data Reading: 32-bytes range cross penalty = 1 cycle
- L1 Data Reading: 4096-bytes range cross - no additional penalty
- L1 B/W (Parallel Random Read) = 0.5 cycles per access
- L2->L1 B/W (Parallel Random Read) = 2.0 cycles per cache line
- L2->L1 B/W (Read, 32-64 bytes step) = 2.0 cycles per cache line
- L2 Write (Write, 64 bytes step) = 2.1 cycles per write (cache line)
- L3->L1 B/W (Parallel Random Read) = 2.2 cycles per access
- L3->L1 B/W (Parallel Random Read from one L3 slice) = 4.7 cycles per access
- L3->L1 B/W (Read, 32-64 bytes step) = 2.7 cycles per cache line
- L3 Write (Write, 64 bytes step) = 2.9 cycles per write
- RAM Read B/W (Parallel Random Read) = 4.6 ns / read (128 ? bytes read)
- RAM Read B/W (Read, 8 Bytes step) = 20 GB/s
- RAM Read B/W (Read, 32 Bytes step) = 28 GB/s
- RAM Read B/W (Read, 64 Bytes step - pointer chasing) = 17 GB/s (HW prefetch)
- RAM Write B/W (Write, 8 Bytes step, full) = 9 GB/s
- RAM Write B/W (Write, 64 Bytes step) = 11 GB/s
Links
Zen at Wikipedia
Zen at Wikichip