AMD Zen
AMD Ryzen 7 1700X (Zen), 3.9 GHz (XFR), 14 nm. RAM: 32 GB DDR4-2600 (PC4-20800, dual channel)
- L1 Data cache = 32 KB, 64 B/line, 8-WAY, write-back, ECC
- L1 Instruction cache = 64 KB, 64 B/line, 4-WAY, 32 bytes/cycle fetch.
- L2 cache = 512 KB, 64 B/line, 8-WAY, write-back. 32 bytes L2->L1 datapath.
- L3 cache = 8 MB (per 4 cores), 64 B/line, 16-WAY. Victim cache, filled by L2 evictions.
  4 slices interleaved by low-order address bits; tag stored in each line (AMD docs).
- L0 ITLB: 8 entries, all page sizes
- L1 ITLB: 64 entries, fully associative, all page sizes
- L2 ITLB: 512 entries, 8-way, 4 KB/2 MB pages (no 1 GB pages)
- Micro-tags for IC & OP cache
- Next Address Logic: when no branches are identified in the current fetch block,
  the next-address logic calculates the starting address of the next sequential
  64-byte fetch block. This calculation is performed every cycle to support the
  64-byte-per-cycle fetch bandwidth of the op cache. (See the sketch below.)
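A minimal C sketch of that calculation (the helper name is ours; the mask follows from the 64-byte block size):

    #include <stdint.h>

    /* With no branches in the current fetch block, the next fetch target
     * is simply the start of the following 64-byte-aligned block. */
    static inline uint64_t next_fetch_block(uint64_t pc) {
        return (pc & ~63ULL) + 64;  /* align down to 64 B, step forward */
    }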
- 2 branches per BTB entry (if both branches are in the same 64-byte line).
- L0 BTB: 4 forward-taken and 4 backward-taken branches; predicts with 0 bubbles (no CALLs/RETs).
- L1 BTB: 256 entries; creates 1 bubble if its prediction differs from the L0 BTB.
- L2 BTB: 4096 entries; creates 4 bubbles if its prediction differs from the L1 BTB.
- Return stack: 31 entries (2 × 15 entries in dual-thread mode).
- Indirect Target Array: 512 entries.
- The conditional branch predictor uses a global history scheme that tracks
  previously executed branches. Global history is not updated for not-taken branches.
  Conditional branches that are never taken are not marked in the BTBs.
  After its first taken occurrence, a conditional branch is predicted as always-taken.
- Instruction Byte Queue (IBQ): 20 entries (2 × 10 entries in dual-thread mode), 16 bytes/entry.
- Pick window: 32 bytes, aligned on a 16-byte boundary.
- Only the first pick slot (of 4) can pick instructions longer than eight bytes.
- Fetch: 4 instructions per cycle.
- Fused to a single MOP: a CMP or TEST instruction immediately followed by a
  conditional jump (Agner). Fusion happens at the dispatch stage, not at the
  decode stage. (Illustrated below.)
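A hedged example of C code that produces the fusible pattern (the function is ours; exact codegen depends on the compiler, so the asm in the comment is only illustrative):

    /* The hot path compiles to a CMP immediately followed by a conditional
     * jump -- the pair Zen can fuse into a single MOP at dispatch. */
    long count_below(const long *a, long n, long limit) {
        long c = 0;
        for (long i = 0; i < n; i++) {
            if (a[i] < limit)   /* e.g. cmp rcx, rdx ; jge .skip  <- fusible */
                c++;
        }
        return c;
    }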
- Op Cache (OC): 2048 ops; 32 sets, 8-WAY, 8 ops/line.
- Up to 8 sequential instructions ending in the same 64-byte aligned memory
  region may be cached together in one entry.
- 8 32-bit immediates/displacements per entry (64-bit immediates/displacements take two slots).
- The machine can only transition from IC mode to OC mode at a branch target.
- L1 Data cache tags contain a microtag (utag), a hash of the virtual address.
  Microtags select the cache way for reads using the VA alone, before the PA
  arrives from the TLB. On a utag mispredict, a fill request to the L2 cache is
  initiated and the utag is updated when L2 responds to the fill request.
  (A hypothetical sketch follows.)
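A toy sketch of utag way prediction. The actual hash AMD uses is undocumented; hash8() below is a placeholder assumption, not the real function:

    #include <stdint.h>

    #define L1D_WAYS 8

    static uint8_t hash8(uint64_t va) {          /* assumption: toy hash */
        return (uint8_t)(va ^ (va >> 13) ^ (va >> 29));
    }

    /* Pick a way from the VA alone, before the TLB supplies the PA. */
    static int predict_way(uint64_t va, const uint8_t utags[L1D_WAYS]) {
        uint8_t u = hash8(va);
        for (int w = 0; w < L1D_WAYS; w++)
            if (utags[w] == u)
                return w;   /* predicted way */
        return -1;          /* mispredict/miss: L2 fill, then utag update */
    }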
- uOP Queue: 72 entries (?)
- Retire queue: 192 entries (24 × 8 in single-thread; 2 × 12 × 8 in SMT).
- The integer physical register file (PRF) consists of 168 registers,
with up to 38 per thread mapped to architectural state or microarchitectural temporary state.
The remaining registers are available for out-of-order renames.
- MOV elimination.
- INT scheduler queues: 6 × 14 entries
- 4 ALUs, 2 AGUs
- FP scheduler: 36-entry (micro-ops).
- 4 FPU pipes.
- 2x 16B loads + 1x 16B store per cycle
- Load queue: 72-entry
- Store queue: 44-entry
- LS unit can track up to 22 outstanding in-flight cache misses.
- Zen uses address bits 11:0 to determine STLF (store-to-load forwarding) eligibility (see the check below).
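A small sketch of the check implied by that statement (the function name is ours); the classic 4K-aliasing consequence falls straight out of it:

    #include <stdbool.h>
    #include <stdint.h>

    /* Only bits 11:0 are compared, so addresses that differ by a multiple
     * of 4 KB are indistinguishable to this check (4K-aliasing stalls). */
    static bool stlf_same_offset(uintptr_t store_addr, uintptr_t load_addr) {
        return (store_addr & 0xFFF) == (load_addr & 0xFFF);
    }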
L1 Data Cache Latency:
- 4 cycles for simple access via pointer
- 4 cycles for (base_reg + displacement) (AMD DOCs)
- 4 cycles for (base_reg + index_reg) (AMD DOCs)
- 5 cycles for access with complex address calculation (size_t n, *p; n = p[n]), as in the sketch below.
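A minimal dependent-load (pointer-chasing) loop of the kind behind these numbers; timing code and chain randomization (needed in practice to defeat the prefetchers) are omitted for brevity:

    #include <stddef.h>
    #include <stdlib.h>

    int main(void) {
        enum { N = 1024 };                 /* 8 KB chain: fits in the 32 KB L1D */
        size_t *p = malloc(N * sizeof *p);
        for (size_t i = 0; i < N; i++)
            p[i] = (i + 1) % N;            /* sequential chain for brevity */
        size_t n = 0;
        for (long k = 0; k < 100000000L; k++)
            n = p[n];                      /* complex form: base + scaled index */
        free(p);
        return (int)n;                     /* keep the chain live */
    }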
- L2 Cache Latency = 17 cycles (Ryzen 1xxx)
- L2 Cache Latency = 12 cycles (Ryzen 2xxx / Threadripper / Epyc)
- RAM Latency = 40 cycles + 90 ns (Ryzen 1xxx)
CCX L3:
Ryzen 1xxx: L3 Cache Latency (random access):
- 40 cycles : average latency for a core reading from any L3 slice
- 37 cycles : core reads from the nearest L3 slice
- 43 cycles : core reads from the farthest L3 slice
Ryzen 2xxx: L3 Cache Latency (random access):
- 35 cycles : average latency for a core reading from any L3 slice
- 32 cycles : core reads from the nearest L3 slice
- 38 cycles : core reads from the farthest L3 slice
CCX L3: latency penalty for a core reading from the different L3 slices:
+4c
Core-0 Slice-0 ====== Slice-2 Core-2
|| ||
+2c || || +2c
|| ||
Core-1 Slice-1 ====== Slice-3 Core-3
+4c
Note: these penalties are totals that include both the data request and the data response;
the one-way hop latency is therefore half of these values. (Checked arithmetically below.)
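A quick arithmetic check that the per-hop penalties reproduce the measured Ryzen 1xxx numbers, assuming the penalties compose additively (37/39/41/43 cycles for the four slices as seen from Core-0, averaging 40):

    #include <stdio.h>

    int main(void) {
        int base = 37, sum = 0;                  /* nearest slice from Core-0 */
        for (int s = 0; s < 4; s++) {
            int vert = s & 1, horiz = (s >> 1) & 1;
            int lat = base + 2 * vert + 4 * horiz;
            printf("slice %d: %d cycles\n", s, lat);
            sum += lat;
        }
        printf("average: %d cycles\n", sum / 4); /* -> 40, as measured */
        return 0;
    }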
Infinity Fabric
Infinity Fabric links in Threadripper and Epyc on the path from a CCX to a memory controller:
Local Memory access: CCX - xC - xM - MemCtl
Remote Memory access with 1 hop: CCX - xC - x6 - x3 - CAKE --- CAKE - x3 - x6 - xP - xM - MemCtl
Remote Memory access with 2 hops (short): CCX - xC - x6 - x3 - CAKE --- CAKE - x3 - CAKE --- CAKE - x3 - x6 - xP - xM - MemCtl
Remote Memory access with 2 hops (long): CCX - xC - x6 - x3 - CAKE --- CAKE - x3 - x6 - x3 - CAKE --- CAKE - x3 - x6 - xP - xM - MemCtl
- xC - switch for CCX
- xM - switch for memory controllers
- x6 - main switch for external infinity fabric links
- x3 - additional switch for 3 external infinity fabric links (2 IFOP and 1 IFIS)
- xP - intermediate switch from x6 to xM.
Estimated latencies in Infinity Fabric clock cycles (1200/1333/1467/1600 MHz):
- Each switch hop costs 2-3 cycles in one direction; latency can be higher if the distance between switches is large.
- 40 cycles - latency for IFOP (on package) (CAKE --- CAKE), total for request and response.
- 66-71 cycles - total overhead for access to remote-die memory in the same socket, compared to local memory access.
- 120 cycles - latency for IFIS (between sockets) (CAKE --- CAKE), total for request and response.
- 150 cycles - total overhead for access to remote-die memory in another socket (1 hop).
- 200 cycles - total overhead for access to remote-die memory in another socket (2 hops). (A rough cross-check follows.)
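A back-of-the-envelope cross-check of the 66-71 cycle same-socket overhead: relative to the local path, a 1-hop remote access adds five extra switch traversals each way (x6, x3, x3, x6, xP) plus one IFOP link. This decomposition is our estimate, not an AMD figure:

    #include <stdio.h>

    int main(void) {
        int ifop = 40;       /* CAKE --- CAKE, request + response */
        int extra_hops = 5;  /* x6, x3, x3, x6, xP (one direction) */
        for (int c = 2; c <= 3; c++)
            printf("hop cost %d: overhead ~ %d cycles\n",
                   c, ifop + 2 * extra_hops * c);  /* 60..70 vs measured 66-71 */
        return 0;
    }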
1 GB pages (64-bit)
- 1 GB Data TLB L1: 64 entries (by specification); tests show 32 entries. Miss penalty = 13 cycles. Parallel miss: ? cycles per access
Size Latency Increase Description
32 K 4
64 K 11 7 + 13 (L2)
128 K 14 3
256 K 16 2
512 K 17 1
1 M 29 12 + 23 (L3)
2 M 35 6
4 M 37 2
8 M 39 + 5 ns 2 + 5 ns
16 M 40 + 48 ns 1 + 43 ns + 90 ns (RAM)
32 M 40 + 70 ns 22 ns
64 M 40 + 81 ns 11 ns
128 M 40 + 86 ns 5 ns
256 M 40 + 88 ns 2 ns
512 M 40 + 89 ns 1 ns
1024 M 40 + 90 ns 1 ns
2 MB pages (32-bit)
- 2 MB Data TLB L1: 64 entries, fully associative. Miss penalty = 8 cycles. Parallel miss: ? cycles per access
- 2 MB Data TLB L2: 1536 entries, 12-way. Miss penalty = ? cycles. Parallel miss: ? cycles per access
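Reproducing the 2 MB (and 1 GB) page measurements requires a huge-page buffer; a minimal Linux sketch, assuming hugetlbfs pages have been reserved beforehand (e.g. via vm.nr_hugepages):

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 64UL << 20;  /* 64 MB test buffer */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }
        /* ... run the pointer-chasing measurement over buf ... */
        munmap(buf, len);
        return 0;
    }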
Size Latency Increase Description
32 K 4
64 K 11 7 + 13 (L2)
128 K 14 3
256 K 16 2
512 K 17 1
1 M 29 12 + 23 (L3)
2 M 35 6
4 M 37 2
8 M 39 + 5 ns 2 + 5 ns
16 M 40 + 48 ns 1 + 43 ns + 90 ns (RAM)
32 M 40 + 70 ns 22 ns
64 M 40 + 81 ns 11 ns
128 M 40 + 86 ns 5 ns
256 M 44 + 88 ns 4 + 2 ns + 8 (L1 TLB miss)
512 M 46 + 89 ns 2 + 1 ns
1024 M 47 + 90 ns 1 + 1 ns
4 KB pages (64-bit)
- Data TLB L1: 64 entries, fully associative. Miss penalty = 8 cycles. Parallel miss: 1 cycle per access
- Data TLB L2: 1536 entries, 12-way. Miss penalty = 34 ? cycles. Parallel miss: 18 ? cycles per access (read from L3)
- AMD: 2 page table walkers handle L2 TLB misses.
- AMD: PDE cache = 1536 entries ? (same as Data TLB L2). Miss penalty = ? cycles.
- AMD: PDC cache = 64 entries (PML4Es, PDPEs). Miss penalty = ? cycles. (The walk levels these caches cover are sketched below.)
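For reference, the address decomposition those walkers and paging-structure caches operate on, per standard x86-64 4-level paging (the example VA is arbitrary):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint64_t va = 0x00007f1234567000ULL;     /* arbitrary example VA */
        unsigned pml4 = (va >> 39) & 0x1FF;      /* PML4E -- cached in PDC */
        unsigned pdpt = (va >> 30) & 0x1FF;      /* PDPE  -- cached in PDC */
        unsigned pd   = (va >> 21) & 0x1FF;      /* PDE   -- PDE cache */
        unsigned pt   = (va >> 12) & 0x1FF;      /* PTE   -- Data TLB */
        unsigned off  = (unsigned)(va & 0xFFF);  /* byte offset in 4 KB page */
        printf("PML4=%u PDPT=%u PD=%u PT=%u off=0x%x\n", pml4, pdpt, pd, pt, off);
        return 0;
    }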
Size Latency Increase Description
32 K 4
64 K 11 7 + 13 (L2)
128 K 14 3
256 K 16 2
512 K 20 4 + 8 (L1 TLB miss)
1 M 35 15 + 23 (L3)
2 M 42 7
4 M 45 3
8 M 63 + 5 ns 18 + 5 ns + 34 ? (L2 TLB miss)
16 M 72 + 48 ns 9 + 43 ns + 90 ns (RAM)
32 M 82 + 70 ns 10 + 22 ns
64 M 87 + 81 ns 5 + 11 ns
128 M 97 + 86 ns 10 + 5 ns
256 M 109 + 88 ns 10 + 2 ns
512 M 113 + 89 ns 4 + 1 ns
1024 M 125 + 90 ns 12 + 1 ns
MISC
- Branch misprediction penalty = 19 cycles (mOp cache hit ?)
- Branch history table: 2K entries or more (measured with 8-branch code); about half of branches mispredict with 16K branches.
- L1 Data Reading: 32-bytes range cross penalty = 1 cycle
- L1 Data Reading: 4096-bytes range cross - no additional penalty
- L1 B/W (Parallel Random Read) = 0.5 cycles per access
- L2->L1 B/W (Parallel Random Read) = 2.0 cycles per cache line
- L2->L1 B/W (Read, 32-64 bytes step) = 2.0 cycles per cache line
- L2 Write (Write, 64 bytes step) = 2.1 cycles per write (cache line)
- L3->L1 B/W (Parallel Random Read) = 2.2 cycles per access
- L3->L1 B/W (Parallel Random Read from one L3 slice) = 4.7 cycles per access
- L3->L1 B/W (Read, 32-64 bytes step) = 2.7 cycles per cache line
- L3 Write (Write, 64 bytes step) = 2.9 cycles per write
- RAM Read B/W (Parallel Random Read) = 4.6 ns / read (128 ? bytes read)
- RAM Read B/W (Read, 8 Bytes step) = 20 GB/s
- RAM Read B/W (Read, 32 Bytes step) = 28 GB/s
- RAM Read B/W (Read, 64 Bytes step - pointer chasing) = 17 GB/s (HW prefetch)
- RAM Write B/W (Write, 8 Bytes step, full) = 9 GB/s
- RAM Write B/W (Write, 64 Bytes step) = 11 GB/s
Links
Zen at Wikipedia
Zen at Wikichip