Intel Skylake
Intel i7-6700 (Skylake), 4.0 GHz (Turbo Boost), 14 nm. RAM: 16 GB, dual-channel DDR4-2400 CL15 (PC4-19200).
- L1 Data cache = 32 KB, 64 B/line, 8-way.
- L1 Instruction cache = 32 KB, 64 B/line, 8-way.
- L2 cache = 256 KB, 64 B/line, 4-way.
- L3 cache = 8 MB, 64 B/line, 16-way.
- L1 Data Cache Latency = 4 cycles for simple access via pointer
- L1 Data Cache Latency = 5 cycles for access with complex address calculation (size_t n, *p; n = p[n]).
- L2 Cache Latency = 12 cycles
- L3 Cache Latency = 42 cycles (core 0) (i7-6700 Skylake 4.0 GHz)
- L3 Cache Latency = 38 cycles (i7-7700K 4 GHz, Kaby Lake)
- RAM Latency = 42 cycles + 51 ns (i7-6700 Skylake)
Note: L2 cache latency may be 11 cycles in some cases, but a dependency-chain workload shows 12 cycles.
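A dependency-chain workload of the kind mentioned in the note can be sketched with pointer chasing, where each load address depends on the previous load (the `n = p[n]` pattern shown above). This is a minimal sketch and not necessarily the method used for the numbers on this page; the names `build_cycle`, `chase` and `ns_per_load` are hypothetical, and the timer is the portable but coarse `clock()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Fill p[0..n-1] with one random cycle (Sattolo's algorithm), so that
   following n = p[n] from index 0 visits every element exactly once
   before returning to 0. Any j < i keeps the single-cycle property;
   rand() is used only for simplicity and is a weak random source. */
static void build_cycle(size_t *p, size_t n)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        p[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;   /* j in [0, i-1], never i itself */
        size_t t = p[i];
        p[i] = p[j];
        p[j] = t;
    }
}

/* Perform `steps` dependent loads and return the final index.
   Returning the index keeps the compiler from removing the loop. */
static size_t chase(const size_t *p, size_t steps)
{
    size_t n = 0;
    while (steps--)
        n = p[n];
    return n;
}

/* Average CPU time per dependent load, in nanoseconds. */
static double ns_per_load(const size_t *p, size_t steps, size_t *sink)
{
    clock_t t0 = clock();
    *sink = chase(p, steps);
    clock_t t1 = clock();
    return (double)(t1 - t0) / CLOCKS_PER_SEC * 1e9 / (double)steps;
}
```

Running `ns_per_load` on arrays of increasing size should show steps roughly where the working set outgrows each cache level. The sketch packs eight `size_t` indices into each 64-byte line, so a careful benchmark would probably use one element per line and a finer timer.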
1 GB pages
- Data TLB: 4 entries, 4-way.
- L2 TLB: 16 entries, 4-way.
2 MB pages mode (64-bit Windows)
- Data TLB: 32 items (4-way). Miss Penalty = 9 cycles. Parallel miss: 1 cycle per access
- L2 TLB: 1536 items (12-way). Miss Penalty = ? cycles.
Size      Latency       Increase    Description
  32 K    4
  64 K    8             4           + 8 (L2)
 128 K    10            2
 256 K    11            1
 512 K    27            16          + 30 (L3)
   1 M    34            7
   2 M    38            4
   4 M    40            2
   8 M    42            2
  16 M    42 + 28 ns    28 ns       + 51 ns (RAM)
  32 M    42 + 41 ns    13 ns
  64 M    42 + 46 ns    5 ns
 128 M    47 + 49 ns    5 + 3 ns    + 9 (TLB miss)
 256 M    49 + 51 ns    2 + 2 ns
 512 M    50 + 51 ns    1 + 0 ns
1024 M    51 + 51 ns    1 + 0 ns
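Entries such as `42 + 28 ns` mix core cycles with nanoseconds. To compare rows on one scale, the cycle part can be converted using the core clock. The sketch below assumes the core ran at the 4.0 GHz Turbo Boost frequency stated at the top of the page for the whole measurement; the names `CORE_GHZ`, `total_ns` and `total_cycles` are illustrative.

```c
#include <assert.h>

/* Assumed core clock in GHz (cycles per nanosecond): the 4.0 GHz
   Turbo Boost figure given for the i7-6700 on this page. */
#define CORE_GHZ 4.0

/* Convert a "C cycles + N ns" entry into total nanoseconds. */
static double total_ns(double cycles, double ns, double ghz)
{
    assert(ghz > 0.0);
    return cycles / ghz + ns;
}

/* The same entry expressed entirely in core cycles. */
static double total_cycles(double cycles, double ns, double ghz)
{
    assert(ghz > 0.0);
    return cycles + ns * ghz;
}
```

Under that assumption, the RAM latency of 42 cycles + 51 ns is `total_ns(42, 51, CORE_GHZ)` = 61.5 ns, or `total_cycles(42, 51, CORE_GHZ)` = 246 cycles.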
4 KB pages mode (64-bit Windows)
- Data TLB L1: 64 items. 4-way. Miss penalty = 9 cycles. Parallel miss: 1 cycle per access
- Data TLB L2 (STLB): 1536 items. 12-way. Miss penalty = 17 ? cycles. Parallel miss: 14 cycles per access
- PDE cache = ? items. Miss penalty = ? cycles.
Size      Latency       Increase     Description
  32 K    4
  64 K    8             4            + 8 (L2)
 128 K    10            2
 256 K    11            1
 512 K    32            21           + 30 (L3) + 9 (L1 TLB miss)
   1 M    41            9
   2 M    46            5
   4 M    49            3
   8 M    60 + 4 ns     11 + 4 ns    + 17 (L2 TLB miss)
  16 M    66 + 28 ns    6 + 24 ns    + 51 ns (RAM)
  32 M    68 + 41 ns    2 + 13 ns
...
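The rows where TLB-miss terms first appear line up with the TLB reach, that is, the number of entries times the page size. With 4 KB pages the 64-entry L1 data TLB covers 256 KB and the 1536-entry STLB covers 6 MB, which fits the penalties showing up at the 512 K and 8 M rows. With 2 MB pages the 32-entry data TLB covers 64 MB, and the TLB-miss term appears at 128 M. A minimal sketch of that arithmetic, with the hypothetical helper `tlb_reach_bytes`:

```c
#include <assert.h>
#include <stdint.h>

/* Memory covered by one TLB level: number of entries times page size. */
static uint64_t tlb_reach_bytes(uint64_t entries, uint64_t page_bytes)
{
    assert(entries > 0 && page_bytes > 0);
    return entries * page_bytes;
}
```

For example, `tlb_reach_bytes(64, 4096)` gives 262144 bytes (256 KB) and `tlb_reach_bytes(1536, 4096)` gives 6291456 bytes (6 MB).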
MISC
- Branch misprediction penalty = 16.5 cycles on average (if µop cache hit).
- Branch misprediction penalty = 19-20 cycles (if µop cache miss).
- 64-byte (cache line) boundary crossing penalty = 7 cycles
- 4096-byte (page) boundary crossing: no additional penalty
- L1 B/W (Parallel Random Read) = 0.5 cycles per one access
- L2->L1 B/W (Parallel Random Read) = 1.8 - 2.2 cycles per cache line
- L2->L1 B/W (Read, 64-byte step) = 1.6 - 2.0 cycles per cache line
- L2 Write (Write, 64-byte step) = 3.5 cycles per write (cache line)
- L3->L1 B/W (Parallel Random Read) = 5.0 cycles per cache line
- L3->L1 B/W (Read, 64-byte step) = 3.7 cycles per cache line
- L3 Write (Write, 64-byte step) = 6.1 cycles per write (cache line)
- RAM Read B/W (Parallel Random Read) = 5.9 ns / cache line = 10800 MB/s
- RAM Read B/W (Read, 16-64 byte step) = 26000 MB/s
- RAM Read B/W (Read, 32-byte step, pointer chasing) = 16500 MB/s
- RAM Write B/W (Write, 8-byte step) = 17800 MB/s
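The per-cache-line figures above can be turned into bandwidth by dividing the 64-byte line size by the time per line. The sketch below uses decimal MB/s (10^6 bytes per second), which matches the 5.9 ns to roughly 10800 MB/s conversion in the list. For figures given in cycles it assumes a 4.0 GHz core clock. The helper names are illustrative.

```c
#include <assert.h>

#define LINE_BYTES 64.0   /* cache line size listed for all cache levels */

/* Bandwidth in MB/s (10^6 bytes/s) from nanoseconds per cache line.
   Bytes per nanosecond equals GB/s; multiplying by 1000 gives MB/s. */
static double mbps_from_ns_per_line(double ns_per_line)
{
    assert(ns_per_line > 0.0);
    return LINE_BYTES / ns_per_line * 1000.0;
}

/* Bandwidth in MB/s from core cycles per cache line at a clock in GHz. */
static double mbps_from_cycles_per_line(double cycles_per_line, double ghz)
{
    assert(cycles_per_line > 0.0 && ghz > 0.0);
    return mbps_from_ns_per_line(cycles_per_line / ghz);
}
```

For instance, `mbps_from_ns_per_line(5.9)` is about 10847 MB/s, in line with the rounded 10800 MB/s above, and 5.0 cycles per line at an assumed 4.0 GHz corresponds to 51200 MB/s.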