AMD Zen2
AMD 3800X (Zen2), 7 nm. RAM: 32 GB DDR4-3200 16-18-18-38-56-1T (dual channel)
- CCD = 74 mm², 3.8 BT (billion transistors)
- sIOd = 416 mm², 8.34 BT
- cIOd = 125 mm², 2.09 BT
- L1 Data cache = 32 KB, 64 B/line, 8-way, write-back, ECC. Two 256-bit loads and one 256-bit store per cycle.
- L1 Instruction cache = 32 KB, 64 B/line, 8-way. 32-byte fetch per cycle.
- L2 cache = 512 KB, 64 B/line, 8-way, write-back. 32-byte L2->L1 datapath.
- L3 cache = 16 MB per CCX (4 cores, 4 MB per core), 64 B/line, 16-way, write-back. Victim cache for L2.
- I TLB L1: 64 entries, fully associative, all page sizes
- I TLB L2: 512 entries, 8-way, 4 KB/2 MB pages; 1 GB pages use 2 MB entries.
- Micro-tags for IC & OP cache
- L0 BTB : 16 entries: 8 forward and 8 backward taken branches
- L1 BTB : 512 entries, creates 1 bubble if prediction differs from L0 BTB
- L2 BTB : 7168 entries, creates 4 bubbles
- Return stack: 31 entries (2 × 15 entries in dual-thread mode)
- Indirect Target Array: 1024 entries
- Instruction Byte Queue (IBQ): 20 entries (2 × 10 in dual-thread mode), 16 bytes/entry
- Pick window: 32 bytes, aligned on a 16-byte boundary.
- Only the first pick slot (of 4) can pick instructions greater than eight bytes in length.
- Fetch: 4 instructions
- Fused into a single MOP: a CMP or TEST instruction immediately followed by a conditional jump.
(Agner: fusion happens at the dispatch stage, not at the decode stage.)
- Op Cache (OC): 4096 entries, up to 8 instructions/entry, 8-way, 64 sets.
Entry limits: 8 instructions, 8 32-bit immediates/displacements (64-bit immediates/displacements take two slots),
4 microcoded instructions. An OC entry terminates at the end of a 64-byte aligned memory region.
- Dispatch: 6 ops/cycle
- Retire unit: receives 6 macro-ops/cycle, holds 224 macro-ops, retires 8 macro-ops/cycle
- 4 ALUs, 3 AGUs
- ALU schedulers: 4 × 16 entries
- AGU scheduler: 1 × 28 entries
- Reorder buffer: 224 entries
- Integer register file: 180 registers
- FP scheduler: 36 micro-op entries
- 4 FPU pipes
- 2x 32B loads + 1x 32B store per cycle
- Load queue: 44 entries
- Store queue: 48 entries
- LS unit can track up to 22 outstanding in-flight cache misses.
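Those 22 in-flight misses bound the achievable memory-level parallelism. A minimal C sketch (helper names are mine, not from any benchmark source) that builds several independent pointer chains and chases them round-robin, so several misses overlap instead of serializing:

```c
#include <stdlib.h>

#define CHAINS 8          /* independent chains; Zen2 tracks up to 22 misses */
#define NODES  4096       /* nodes per chain */

/* Build one pointer chain: slot i stores the index of the next slot,
 * forming a single cycle through all `nodes` slots in shuffled order. */
static void build_chain(size_t *chain, size_t nodes) {
    size_t *order = malloc(nodes * sizeof *order);
    for (size_t i = 0; i < nodes; i++) order[i] = i;
    for (size_t i = nodes - 1; i > 0; i--) {      /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < nodes; i++)
        chain[order[i]] = order[(i + 1) % nodes];
    free(order);
}

/* Chase all chains round-robin: each outer step issues CHAINS mutually
 * independent loads, so up to CHAINS cache misses are in flight at once. */
static size_t chase_parallel(size_t *const *chains, size_t steps) {
    size_t cur[CHAINS] = {0};
    size_t sum = 0;
    for (size_t s = 0; s < steps; s++)
        for (int c = 0; c < CHAINS; c++) {
            cur[c] = chains[c][cur[c]];
            sum += cur[c];
        }
    return sum;
}
```

Timing `chase_parallel` with growing CHAINS should show near-linear speedup over a single chain until the miss-tracking limit is approached.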
L1 Data Cache Latency:
- 4 cycles for simple access via pointer
- 5 cycles for access with complex address calculation (size_t n, *p; n = p[n]).
- AMD DOC: 7- or 8-cycle FPU load-to-use latency.
- L2 Cache Latency = 12 cycles
- L3 Cache Latency = 38 cycles
- RAM Latency = 38 cycles + 66 ns
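Latencies like these come from dependent-load (pointer-chasing) chains, where each load address depends on the previous load. A simplified C sketch of the method (function name is mine; a line-stride chain is used for brevity, while a real run randomizes the chain to defeat the hardware prefetcher):

```c
#define _POSIX_C_SOURCE 199309L
#include <stdlib.h>
#include <time.h>

/* Measure dependent-load latency over a working set of `bytes` bytes.
 * Each slot holds the index of the next slot one cache line away, so
 * the loads form one long dependency chain: time / steps ~ latency. */
static double chase_ns_per_load(size_t bytes, size_t steps) {
    size_t n = bytes / sizeof(size_t);
    size_t *a = malloc(n * sizeof *a);
    for (size_t i = 0; i < n; i++)       /* stride by one 64 B cache line */
        a[i] = (i + 64 / sizeof(size_t)) % n;
    volatile size_t cur = 0;             /* volatile keeps the chain live */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t s = 0; s < steps; s++) cur = a[cur];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    free(a);
    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / steps;
}
```

With a 16 KB working set the result should be near the 4-cycle L1 figure; growing `bytes` past each cache size reproduces the steps in the tables below.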
Note: Zen2 under Windows 10 can probably use Page Table Entry (PTE) Coalescing.
The benchmark results show that TLB misses start at blocks 4 times larger
than expected for 4 KB pages, so each entry in the L1 DTLB and L2 TLB probably covers 16 KB of data (4 coalesced 4 KB pages).
It is not known how this is implemented, or whether the OS (Windows/Linux) needs special support for it.
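The "4 times larger" observation is a TLB-reach argument: with 4 coalesced pages per entry, each entry covers 16 KB instead of 4 KB. A trivial helper to make the arithmetic explicit (names are mine):

```c
/* TLB reach = number of entries times the bytes each entry covers.
 * A 64-entry DTLB reaches 256 KB with plain 4 KB pages, but 1 MB if
 * each entry covers 4 coalesced pages (16 KB) -- 4x, matching the
 * observed shift in where misses begin. */
static unsigned long tlb_reach(unsigned long entries,
                               unsigned long bytes_per_entry) {
    return entries * bytes_per_entry;
}
```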
1 GB pages (64-bit)
- 1GB Data TLB L1: 64 entries, fully associative. Miss Penalty = ? cycles. Parallel miss: ? cycles per access
- PDE cache: page-directory entries (PDEs) used to speed up table walks.
- 1-GB pages are smashed into 2-MB pages in Data TLB L2: 2048 entries, 16-way.
Size Latency Increase Description
32 K 4
64 K 8 4 + 8 (L2)
128 K 10 2
256 K 11 1
512 K 12 1
1 M 25 13 + 26 (L3)
2 M 32 7
4 M 35 3
8 M 37 2
16 M 38 + 4 ns 1 + 4 ns
32 M 38 + 40 ns 36 ns + 66 ns (RAM)
64 M 38 + 55 ns 15 ns
128 M 38 + 62 ns 7 ns
256 M 38 + 63 ns 2 ns
512 M 38 + 65 ns 1 ns
1024 M 38 + 66 ns 1 ns
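The table mixes core cycles and DRAM nanoseconds. To compare rows in a single unit, convert cycles at an assumed core clock (the 3800X boosts to about 4.5 GHz, but the exact clock during these runs is not stated, so the GHz value below is a parameter, not a measurement):

```c
/* Convert a "cycles + ns" latency entry into total nanoseconds,
 * given an assumed core clock in GHz. */
static double total_ns(double cycles, double extra_ns, double ghz) {
    return cycles / ghz + extra_ns;
}
```

For example, at an assumed 4.0 GHz the 1024 M row (38 cycles + 66 ns) totals 75.5 ns.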
2 MB pages (32-bit)
- 2MB Data TLB L1: 64 entries, fully associative. Miss Penalty = 7 cycles. Parallel miss: ? cycles per access
- 2MB Data TLB L2: 2048 entries, 16-way. Miss Penalty = ? cycles. Parallel miss: ? cycles per access
Size Latency Increase Description
32 K 4
64 K 8 4 + 8 (L2)
128 K 10 2
256 K 11 1
512 K 12 1
1 M 25 13 + 26 (L3)
2 M 32 7
4 M 35 3
8 M 37 2
16 M 38 + 4 ns 1 + 4 ns
32 M 38 + 41 ns 37 ns + 66 ns (RAM)
64 M 38 + 55 ns 16 ns
128 M 38 + 62 ns 7 ns
256 M 42 + 63 ns 4 + 2 ns + 7 (L1 TLB miss)
512 M 44 + 65 ns 2 + 1 ns
1024 M 45 + 66 ns 1 + 1 ns
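Reproducing the 2 MB-page rows requires backing the test buffer with explicit huge pages. On Linux that is `mmap` with `MAP_HUGETLB`; a sketch with a fallback to normal pages (helper names are mine; the huge-page pool must first be reserved, e.g. via `vm.nr_hugepages`):

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

#define HUGE_2M (2UL * 1024 * 1024)

/* Round a size up to a multiple of the 2 MB huge-page size. */
static size_t round_up_2m(size_t n) {
    return (n + HUGE_2M - 1) & ~(HUGE_2M - 1);
}

/* Try to back a buffer with explicit 2 MB pages (Linux MAP_HUGETLB);
 * fall back to normal 4 KB pages if no huge pages are reserved. */
static void *alloc_maybe_huge(size_t n) {
    size_t len = round_up_2m(n);
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED)                  /* e.g. vm.nr_hugepages == 0 */
        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

On Windows the equivalent is `VirtualAlloc` with `MEM_LARGE_PAGES`, which needs the SeLockMemoryPrivilege.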
4 KB pages mode (64-bit)
- Data TLB L1: works as 192 entries (about 800 KB of memory). ?-assoc. Miss penalty = 6-7? cycles. Parallel miss: ? cycles per access
- Data TLB L2: 2048 entries, 16-way. Miss penalty = 54? cycles. Parallel miss: 18? cycles per access (read from L3)
- AMD: 2 page table walkers to handle L2 TLB misses.
- AMD: PDE cache = ? entries (possibly shared with Data TLB L2). Miss penalty = ? cycles.
- AMD: PDC cache = 64 entries (PML4Es, PDPEs). Miss penalty = ? cycles.
Size Latency Increase Description
32 K 4
64 K 8 4 + 8 (L2)
128 K 10 2
256 K 11 1
512 K 12 1
1 M 26 14 + 26 (L3)
2 M 37 11 + 7 (L1 TLB miss)
4 M 41 4
8 M 43 2
16 M 44 + 6 ns 1 + 6 ns
32 M 45 + 41 ns 1 + 35 ns + 66 ns (RAM)
64 M 73 + 55 ns 28 + 21 ns + 56 (L2 TLB miss)
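The 4 KB-page penalties reflect the x86-64 4-level table walk: an L2 TLB miss can take up to four dependent memory accesses (PML4 -> PDPT -> PD -> PT), which is what the PDC/PDE caches above shortcut. A sketch of how a virtual address splits into the four 9-bit walk indices plus a 12-bit page offset:

```c
#include <stdint.h>

/* x86-64 4-level walk for 4 KB pages: bits 47..12 of the virtual
 * address are four 9-bit table indices; bits 11..0 are the offset
 * within the page.  level 3 = PML4, 2 = PDPT, 1 = PD, 0 = PT. */
static unsigned walk_index(uint64_t va, int level) {
    return (unsigned)((va >> (12 + 9 * level)) & 0x1FF);
}

static unsigned page_offset(uint64_t va) {
    return (unsigned)(va & 0xFFF);
}
```

With 2 MB pages the walk stops at the PD level (three accesses), and with 1 GB pages at the PDPT (two), which is one reason large pages cut the miss penalties in the tables above.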
MISC
- Branch misprediction penalty = 18 cycles (mOp cache hit ?)
- L1 Data Reading: 64-bytes range cross penalty = 1 cycle
- L1 Data Reading: 4096-bytes range cross - no additional penalty
- AMD DOC: Stores have two different alignment boundaries. The alignment boundary for accessing TLB and tags is 64 bytes,
and the alignment boundary for writing data to the cache or memory system is 32 bytes. Throughput
for misaligned loads and stores is half that of aligned loads and stores since a misaligned load or store
requires two cycles to access the data cache (versus a single cycle for aligned loads and stores).
- L1 B/W (Parallel Random Read) = 0.5 cycles per one access
- L2->L1 B/W (Parallel Random Read) = 2.2 cycles per cache line
- L2->L1 B/W (Read, 32-64 bytes step) = 2.0 cycles per cache line
- L2 Write (Write, 64 bytes step) = 2.0 cycles per write (cache line)
- L3->L1 B/W (Parallel Random Read) = ~3.0 cycles per one access
- L3->L1 B/W (Read, 32-64 bytes step) = 2.65 cycles per cache line
- L3 Write (Write, 64 bytes step) = 2.7 cycles per write
- RAM Read B/W (Parallel Random Read) = ~6.5 ns / read (128? bytes read)
- RAM Read B/W (Read, 8 Bytes step) = 23 GB/s
- RAM Read B/W (Read, 16-32 Bytes step) = 31 GB/s
- RAM Read B/W (Read, 64 Bytes step - pointer chasing) = 18 GB/s (HW prefetch)
- RAM Write B/W (Write, 8-64 Bytes step, full) = 16 GB/s
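The streaming read figures can be reproduced with a simple sequential-sum sketch (function name is mine; hardware prefetchers make this the best case, unlike the pointer-chasing reads above):

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Sequential-read bandwidth: sum a buffer with an 8-byte step and
 * report GB/s.  The sum is returned so the loop cannot be elided. */
static double read_gbs(size_t bytes, uint64_t *sum_out) {
    size_t n = bytes / sizeof(uint64_t);
    uint64_t *a = malloc(bytes);
    for (size_t i = 0; i < n; i++) a[i] = i;   /* touch pages up front */
    struct timespec t0, t1;
    uint64_t sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < n; i++) sum += a[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    free(a);
    *sum_out = sum;
    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    return bytes / sec / 1e9;
}
```

The buffer must be several times larger than the 16 MB L3 to measure RAM rather than cache bandwidth.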
Links
- Zen 2 at Wikipedia
- Zen 2 at WikiChip