AMD Zen2
AMD 3800X (Zen2), 7 nm. RAM: 32 GB DDR4-3200 16-18-18-38-56-1T (dual channel)
- CCD = 74 mm², 3.8 BT (billion transistors)
- sIOd = 416 mm², 8.34 BT
- cIOd = 125 mm², 2.09 BT
- L1 Data cache = 32 KB, 64 B/line, 8-way, write-back, ECC. Two 256-bit loads and one 256-bit store per cycle.
- L1 Instruction cache = 32 KB, 64 B/line, 8-way. 32-byte fetch per cycle.
- L2 cache = 512 KB, 64 B/line, 8-way, write-back. 32-byte L2->L1 datapath.
- L3 cache = 16 MB per CCX (4 cores, 4 MB per core), 64 B/line, 16-way, write-back. Victim cache for L2.
- I TLB L1: 64 entries, fully associative, all page sizes
- I TLB L2: 512 entries, 8-way, 4 KB/2 MB pages; 1 GB pages use 2 MB entries.
- Micro-tags for IC & OP cache
- L0 BTB : 16 entries: 8 forward and 8 backward taken branches
- L1 BTB : 512 entries, creates 1 bubble if prediction differs from L0 BTB
- L2 BTB : 7168 entries, creates 4 bubbles
- Return stack: 31 entries (2 × 15 entries in dual-thread mode)
- Indirect Target Array: 1024 entries
- Instruction Byte Queue (IBQ): 20 entries (2 × 10 in dual-thread mode), 16 bytes/entry
- Pick window: 32 bytes, aligned on a 16-byte boundary.
- Only the first pick slot (of 4) can pick instructions greater than eight bytes in length.
- Fetch: 4 instructions
- Fused into a single MOP: a CMP or TEST instruction immediately followed by a conditional jump.
(Agner: fusion happens at the dispatch stage, not at the decode stage.)
- Op Cache (OC): 4096 entries, up to 8 instructions/entry, 8-way, 64 sets.
Entry limits: 8 instructions, 8 32-bit immediates/displacements (64-bit immediates/displacements take two slots),
4 microcoded instructions. An OC entry terminates at the end of a 64-byte aligned memory region.
- Dispatch: 6 ops/cycle
- Retire unit: receives 6 macro-ops/cycle, holds 224 macro-ops, retires 8 macro-ops/cycle
- 4 ALUs, 3 AGUs
- ALU schedulers: 4 × 16 entries
- AGU scheduler: 1 × 28 entries
- Reorder buffer: 224 entries
- Integer register file: 180 registers
- FP scheduler: 36 micro-op entries
- 4 FPU pipes
- 2x 32B loads + 1x 32B store per cycle
- Load queue: 44 entries
- Store queue: 48 entries
- LS unit can track up to 22 outstanding in-flight cache misses.
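Those 22 in-flight misses bound the achievable memory-level parallelism. A minimal C sketch (helper names are mine, not from any benchmark source) that builds several independent pointer chains and chases them round-robin, so several misses overlap instead of serializing:

```c
#include <stdlib.h>

#define CHAINS 8          /* independent chains; Zen2 tracks up to 22 misses */
#define NODES  4096       /* nodes per chain */

/* Build one pointer chain: slot i stores the index of the next slot,
 * forming a single cycle through all `nodes` slots in shuffled order. */
static void build_chain(size_t *chain, size_t nodes) {
    size_t *order = malloc(nodes * sizeof *order);
    for (size_t i = 0; i < nodes; i++) order[i] = i;
    for (size_t i = nodes - 1; i > 0; i--) {      /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < nodes; i++)
        chain[order[i]] = order[(i + 1) % nodes];
    free(order);
}

/* Chase all chains round-robin: each outer step issues CHAINS mutually
 * independent loads, so up to CHAINS cache misses are in flight at once. */
static size_t chase_parallel(size_t *const *chains, size_t steps) {
    size_t cur[CHAINS] = {0};
    size_t sum = 0;
    for (size_t s = 0; s < steps; s++)
        for (int c = 0; c < CHAINS; c++) {
            cur[c] = chains[c][cur[c]];
            sum += cur[c];
        }
    return sum;
}
```

Timing `chase_parallel` with growing CHAINS should show near-linear speedup over a single chain until the miss-tracking limit is approached.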
L1 Data Cache Latency:
- 4 cycles for simple access via pointer
- 5 cycles for access with complex address calculation (size_t n, *p; n = p[n]).
- AMD DOC: 7- or 8-cycle FPU load-to-use latency.
- L2 Cache Latency = 12 cycles
- L3 Cache Latency = 38 cycles
- RAM Latency = 38 cycles + 66 ns
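Latencies like these come from dependent-load (pointer-chasing) chains, where each load address depends on the previous load. A simplified C sketch of the method (function name is mine; a line-stride chain is used for brevity, while a real run randomizes the chain to defeat the hardware prefetcher):

```c
#define _POSIX_C_SOURCE 199309L
#include <stdlib.h>
#include <time.h>

/* Measure dependent-load latency over a working set of `bytes` bytes.
 * Each slot holds the index of the next slot one cache line away, so
 * the loads form one long dependency chain: time / steps ~ latency. */
static double chase_ns_per_load(size_t bytes, size_t steps) {
    size_t n = bytes / sizeof(size_t);
    size_t *a = malloc(n * sizeof *a);
    for (size_t i = 0; i < n; i++)       /* stride by one 64 B cache line */
        a[i] = (i + 64 / sizeof(size_t)) % n;
    volatile size_t cur = 0;             /* volatile keeps the chain live */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t s = 0; s < steps; s++) cur = a[cur];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    free(a);
    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / steps;
}
```

With a 16 KB working set the result should be near the 4-cycle L1 figure; growing `bytes` past each cache size reproduces the steps in the tables below.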
Note: Zen2 under Windows 10 can probably use Page Table Entry (PTE) Coalescing.
The benchmark results show that TLB misses start at blocks 4 times larger
than expected for 4 KB pages, so each entry in the L1 DTLB and L2 TLB probably covers 16 KB of data (4 coalesced 4 KB pages).
It is not known how this is implemented, or whether the OS (Windows/Linux) needs special support for it.
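The "4 times larger" observation is a TLB-reach argument: with 4 coalesced pages per entry, each entry covers 16 KB instead of 4 KB. A trivial helper to make the arithmetic explicit (names are mine):

```c
/* TLB reach = number of entries times the bytes each entry covers.
 * A 64-entry DTLB reaches 256 KB with plain 4 KB pages, but 1 MB if
 * each entry covers 4 coalesced pages (16 KB) -- 4x, matching the
 * observed shift in where misses begin. */
static unsigned long tlb_reach(unsigned long entries,
                               unsigned long bytes_per_entry) {
    return entries * bytes_per_entry;
}
```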
1 GB pages (64-bit)
- 1GB Data TLB L1: 64 entries, fully associative. Miss Penalty = ? cycles. Parallel miss: ? cycles per access
- PDE cache: page-directory entries (PDEs) used to speed up table walks.
- 1-GB pages are smashed into 2-MB pages in Data TLB L2: 2048 entries, 16-way.
Size Latency Increase Description
32 K 4
64 K 8 4 + 8 (L2)
128 K 10 2
256 K 11 1
512 K 12 1
1 M 25 13 + 26 (L3)
2 M 32 7
4 M 35 3
8 M 37 2
16 M 38 + 4 ns 1 + 4 ns
32 M 38 + 40 ns 36 ns + 66 ns (RAM)
64 M 38 + 55 ns 15 ns
128 M 38 + 62 ns 7 ns
256 M 38 + 63 ns 2 ns
512 M 38 + 65 ns 1 ns
1024 M 38 + 66 ns 1 ns
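The table mixes core cycles and DRAM nanoseconds. To compare rows in a single unit, convert cycles at an assumed core clock (the 3800X boosts to about 4.5 GHz, but the exact clock during these runs is not stated, so the GHz value below is a parameter, not a measurement):

```c
/* Convert a "cycles + ns" latency entry into total nanoseconds,
 * given an assumed core clock in GHz. */
static double total_ns(double cycles, double extra_ns, double ghz) {
    return cycles / ghz + extra_ns;
}
```

For example, at an assumed 4.0 GHz the 1024 M row (38 cycles + 66 ns) totals 75.5 ns.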
2 MB pages (32-bit)
- 2MB Data TLB L1: 64 entries, fully associative. Miss Penalty = 7 cycles. Parallel miss: ? cycles per access
- 2MB Data TLB L2: 2048 entries, 16-way. Miss Penalty = ? cycles. Parallel miss: ? cycles per access
Size Latency Increase Description
32 K 4
64 K 8 4 + 8 (L2)
128 K 10 2
256 K 11 1
512 K 12 1
1 M 25 13 + 26 (L3)
2 M 32 7
4 M 35 3
8 M 37 2
16 M 38 + 4 ns 1 + 4 ns
32 M 38 + 41 ns 37 ns + 66 ns (RAM)
64 M 38 + 55 ns 16 ns
128 M 38 + 62 ns 7 ns
256 M 42 + 63 ns 4 + 2 ns + 7 (L1 TLB miss)
512 M 44 + 65 ns 2 + 1 ns
1024 M 45 + 66 ns 1 + 1 ns
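Reproducing the 2 MB-page rows requires backing the test buffer with explicit huge pages. On Linux that is `mmap` with `MAP_HUGETLB`; a sketch with a fallback to normal pages (helper names are mine; the huge-page pool must first be reserved, e.g. via `vm.nr_hugepages`):

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

#define HUGE_2M (2UL * 1024 * 1024)

/* Round a size up to a multiple of the 2 MB huge-page size. */
static size_t round_up_2m(size_t n) {
    return (n + HUGE_2M - 1) & ~(HUGE_2M - 1);
}

/* Try to back a buffer with explicit 2 MB pages (Linux MAP_HUGETLB);
 * fall back to normal 4 KB pages if no huge pages are reserved. */
static void *alloc_maybe_huge(size_t n) {
    size_t len = round_up_2m(n);
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED)                  /* e.g. vm.nr_hugepages == 0 */
        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

On Windows the equivalent is `VirtualAlloc` with `MEM_LARGE_PAGES`, which needs the SeLockMemoryPrivilege.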
4 KB pages mode (64-bit)
- Data TLB L1: works as 192 entries (about 800 KB of memory). ?-assoc. Miss penalty = 6-7? cycles. Parallel miss: ? cycles per access
- Data TLB L2: 2048 entries, 16-way. Miss penalty = 54? cycles. Parallel miss: 18? cycles per access (read from L3)
- AMD: 2 page table walkers to handle L2 TLB misses.
- AMD: PDE cache = ? entries (possibly shared with Data TLB L2). Miss penalty = ? cycles.
- AMD: PDC cache = 64 entries (PML4Es, PDPEs). Miss penalty = ? cycles.
Size Latency Increase Description
32 K 4
64 K 8 4 + 8 (L2)
128 K 10 2
256 K 11 1
512 K 12 1
1 M 26 14 + 26 (L3)
2 M 37 11 + 7 (L1 TLB miss)
4 M 41 4
8 M 43 2
16 M 44 + 6 ns 1 + 6 ns
32 M 45 + 41 ns 1 + 35 ns + 66 ns (RAM)
64 M 73 + 55 ns 28 + 21 ns + 56 (L2 TLB miss)
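The 4 KB-page penalties reflect the x86-64 4-level table walk: an L2 TLB miss can take up to four dependent memory accesses (PML4 -> PDPT -> PD -> PT), which is what the PDC/PDE caches above shortcut. A sketch of how a virtual address splits into the four 9-bit walk indices plus a 12-bit page offset:

```c
#include <stdint.h>

/* x86-64 4-level walk for 4 KB pages: bits 47..12 of the virtual
 * address are four 9-bit table indices; bits 11..0 are the offset
 * within the page.  level 3 = PML4, 2 = PDPT, 1 = PD, 0 = PT. */
static unsigned walk_index(uint64_t va, int level) {
    return (unsigned)((va >> (12 + 9 * level)) & 0x1FF);
}

static unsigned page_offset(uint64_t va) {
    return (unsigned)(va & 0xFFF);
}
```

With 2 MB pages the walk stops at the PD level (three accesses), and with 1 GB pages at the PDPT (two), which is one reason large pages cut the miss penalties in the tables above.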
MISC
- Branch misprediction penalty = 18 cycles (mOp cache hit ?)
- L1 Data Reading: 64-bytes range cross penalty = 1 cycle
- L1 Data Reading: 4096-bytes range cross - no additional penalty
- AMD DOC: Stores have two different alignment boundaries. The alignment boundary for accessing TLB and tags is 64 bytes,
and the alignment boundary for writing data to the cache or memory system is 32 bytes. Throughput
for misaligned loads and stores is half that of aligned loads and stores since a misaligned load or store
requires two cycles to access the data cache (versus a single cycle for aligned loads and stores).
- L1 B/W (Parallel Random Read) = 0.5 cycles per one access
- L2->L1 B/W (Parallel Random Read) = 2.2 cycles per cache line
- L2->L1 B/W (Read, 32-64 bytes step) = 2.0 cycles per cache line
- L2 Write (Write, 64 bytes step) = 2.0 cycles per write (cache line)
- L3->L1 B/W (Parallel Random Read) = ~3.0 cycles per one access
- L3->L1 B/W (Read, 32-64 bytes step) = 2.65 cycles per cache line
- L3 Write (Write, 64 bytes step) = 2.7 cycles per write
- RAM Read B/W (Parallel Random Read) = ~6.5 ns / read (128? bytes read)
- RAM Read B/W (Read, 8 Bytes step) = 23 GB/s
- RAM Read B/W (Read, 16-32 Bytes step) = 31 GB/s
- RAM Read B/W (Read, 64 Bytes step - pointer chasing) = 18 GB/s (HW prefetch)
- RAM Write B/W (Write, 8-64 Bytes step, full) = 16 GB/s
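The streaming read figures can be reproduced with a simple sequential-sum sketch (function name is mine; hardware prefetchers make this the best case, unlike the pointer-chasing reads above):

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Sequential-read bandwidth: sum a buffer with an 8-byte step and
 * report GB/s.  The sum is returned so the loop cannot be elided. */
static double read_gbs(size_t bytes, uint64_t *sum_out) {
    size_t n = bytes / sizeof(uint64_t);
    uint64_t *a = malloc(bytes);
    for (size_t i = 0; i < n; i++) a[i] = i;   /* touch pages up front */
    struct timespec t0, t1;
    uint64_t sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < n; i++) sum += a[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    free(a);
    *sum_out = sum;
    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    return bytes / sec / 1e9;
}
```

The buffer must be several times larger than the 16 MB L3 to measure RAM rather than cache bandwidth.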
Links
- Zen 2 at Wikipedia
- Zen 2 at WikiChip