Intel Haswell
Intel i7-4770 (Haswell), 3.4 GHz (Turbo Boost off), 22 nm. RAM: 32 GB (PC3-12800 cl11 cr2).
- L1 Data cache = 32 KB, 64 B/line, 8-way.
- L1 Instruction cache = 32 KB, 64 B/line, 8-way.
- L2 cache = 256 KB, 64 B/line, 8-way.
- L3 cache = 8 MB, 64 B/line.
- L1 Data Cache Latency = 4 cycles for simple access via pointer
- L1 Data Cache Latency = 5 cycles for access with complex address calculation (size_t n, *p; n = p[n]).
- L2 Cache Latency = 12 cycles
- L3 Cache Latency = 36 cycles (3.4 GHz i7-4770)
- L3 Cache Latency = 43 cycles (1.6 GHz E5-2603 v3)
- L3 Cache Latency = 58 cycles (core9) - 66 cycles (core5) (3.6 GHz E5-2699 v3 - 18 cores)
- RAM Latency = 36 cycles + 57 ns (3.4 GHz i7-4770)
- RAM Latency = 62 cycles + 100 ns (3.6 GHz E5-2699 v3 dual)
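The RAM figures above mix two units ("cycles + ns"). A minimal sketch of how they could be folded into a single number in nanoseconds, assuming the cycle part scales with the core clock quoted on the same line and the ns part is a fixed memory-side delay (the function name `latency_ns` is hypothetical, used only for illustration):

```python
def latency_ns(cycles: float, extra_ns: float, core_ghz: float) -> float:
    """Convert a latency quoted as 'cycles + ns' into nanoseconds.

    Assumption: the cycle component scales with the core clock and the
    ns component is a fixed delay that does not depend on it.
    """
    # 1 cycle at f GHz lasts 1/f ns.
    return cycles / core_ghz + extra_ns


# Figures taken from the list above.
i7_l3 = latency_ns(36, 0, 3.4)       # L3 hit, i7-4770
i7_ram = latency_ns(36, 57, 3.4)     # RAM, i7-4770
xeon_ram = latency_ns(62, 100, 3.6)  # RAM, E5-2699 v3 dual

print(f"i7-4770 L3 : {i7_l3:.1f} ns")
print(f"i7-4770 RAM: {i7_ram:.1f} ns")
print(f"E5-2699 RAM: {xeon_ram:.1f} ns")
```

Under that assumption the i7-4770 RAM latency comes to roughly 67.6 ns and the dual E5-2699 v3 to roughly 117.2 ns.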
1 GB pages
2 MB pages mode (64-bit Windows)
- Data TLB: 32 items (4-way). Miss Penalty = 8 cycles. Parallel miss: 1 cycle per access
- L2 TLB: 1024 items (8-way). Miss Penalty = 12 ? cycles. Parallel miss: 22 cycles per access
Size     Latency       Increase    Description
32 K     4
64 K     8             4           + 8 (L2)
128 K    10            2
256 K    11            1
512 K    24            13          + 24 (L3)
1 M      30            6
2 M      33            3
4 M      35            2
8 M      36 + 6 ns     1 + 6 ns
16 M     36 + 34 ns    28 ns       + 57 ns (RAM)
32 M     36 + 48 ns    14 ns
64 M     36 + 54 ns    6 ns
128 M    40 + 56 ns    4 + 2 ns    + 8 (TLB miss)
256 M    42 + 57 ns    2 + 1 ns
512 M    43 + 57 ns    1 + 0 ns
1024 M   44 + 57 ns    1 + 0 ns
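The Description column can be read as a running sum: each "+ N" entry is the extra cost of falling through to the next level. A small sketch of that reading for the 2 MB pages table, using only the contributions listed there (the names `STEPS_2MB` and `cumulative` are illustrative, not from the source):

```python
# (level, added cycles, added ns), from the Description column of the
# 2 MB pages table; the first entry is the L1 hit latency itself.
STEPS_2MB = [
    ("L1", 4, 0),
    ("L2", 8, 0),
    ("L3", 24, 0),
    ("RAM", 0, 57),
    ("TLB miss", 8, 0),
]


def cumulative(steps):
    """Return (level, total cycles, total ns) after each contribution."""
    cycles = 0
    ns = 0
    totals = []
    for name, add_cycles, add_ns in steps:
        cycles += add_cycles
        ns += add_ns
        totals.append((name, cycles, ns))
    return totals


for name, cycles, ns in cumulative(STEPS_2MB):
    print(f"{name:9s} {cycles} cycles + {ns} ns")
```

The running totals line up with the headline figures: 12 cycles for L2, 36 cycles for L3, 36 cycles + 57 ns for RAM, and 44 cycles + 57 ns for the largest working set, where the data TLB also misses.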
4 KB pages mode (64-bit Windows)
- Data TLB L1: 64 items. 4-way. Miss penalty = 8 cycles. Parallel miss: 1 cycle per access
- Data TLB L2 (STLB): 1024 items. 8-way. Miss penalty = 12 ? cycles. Parallel miss: 22 cycles per access
- PDE cache = 32? items. Miss penalty = ? cycles.
Size     Latency       Increase     Description
32 K     4
64 K     8             4            + 8 (L2)
128 K    10            2
256 K    11            1
512 K    28            17           + 24 (L3) + 8 (L1 TLB miss)
1 M      36            8
2 M      40            4
4 M      42            2
8 M      48 + 6 ns     6 + 6 ns     + 9 (L2 TLB miss)
16 M     51 + 34 ns    3 + 28 ns    + 57 ns (RAM)
32 M     52 + 48 ns    1 + 14 ns
64 M     53 + 54 ns    1 + 6 ns
128 M                               + 9? (PDE cache miss) + 19? (Page walk to L3)
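The working-set sizes at which TLB misses first show up in the two tables can be cross-checked against TLB reach (entries times page size). The sketch below assumes the tables step through power-of-two sizes from 32 K to 1024 M; `tlb_reach` and `first_size_beyond` are hypothetical helper names:

```python
KB = 1024
MB = 1024 * KB

# Working-set sizes used by the tables: 32 K, 64 K, ... 1024 M.
TABLE_SIZES = [32 * KB * 2**i for i in range(16)]


def tlb_reach(entries: int, page_bytes: int) -> int:
    """Bytes of memory a TLB can map without missing."""
    return entries * page_bytes


def first_size_beyond(reach_bytes: int):
    """Smallest table size that no longer fits in the given reach."""
    for size in TABLE_SIZES:
        if size > reach_bytes:
            return size
    return None  # reach covers every size in the table


# 4 KB pages: 64-entry L1 DTLB and 1024-entry STLB.
l1_4k = tlb_reach(64, 4 * KB)      # 256 KB
l2_4k = tlb_reach(1024, 4 * KB)    # 4 MB
# 2 MB pages: 32-entry DTLB and 1024-entry L2 TLB.
l1_2m = tlb_reach(32, 2 * MB)      # 64 MB
l2_2m = tlb_reach(1024, 2 * MB)    # 2 GB

print(first_size_beyond(l1_4k) // KB, "K")   # first L1 TLB miss, 4 KB pages
print(first_size_beyond(l2_4k) // MB, "M")   # first L2 TLB miss, 4 KB pages
print(first_size_beyond(l1_2m) // MB, "M")   # first TLB miss, 2 MB pages
print(first_size_beyond(l2_2m))              # L2 TLB never misses in range
```

This matches where the Description column places the penalties: L1 TLB miss at 512 K and L2 TLB miss at 8 M with 4 KB pages, and the TLB miss at 128 M with 2 MB pages. With 2 MB pages the L2 TLB reach (2 GB) exceeds the largest size measured, which would explain why no second-level miss appears in that table.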
MISC
- Branch misprediction penalty = 15.0 cycles (if µop cache hit).
- Branch misprediction penalty = 18-20 cycles (if µop cache miss).
- 64-byte boundary (cache line) crossing penalty = 5 cycles
- 4096-byte boundary (page) crossing penalty = 28 cycles
- L1 B/W (Parallel Random Read) = 0.5 cycles per access
- L2->L1 B/W (Parallel Random Read) = 2.3 cycles per cache line
- L2->L1 B/W (Read, 64 bytes step) = 2.2 cycles per cache line
- L2 Write (Write, 64 bytes step) = 6.1 cycles per write (cache line)
- L3->L1 B/W (Parallel Random Read) = 5.0 cycles per cache line (3.4 GHz i7-4770)
- L3->L1 B/W (Parallel Random Read) = 7.3 cycles per cache line (3.6 GHz E5-2699 v3 - 18 cores)
- L3->L1 B/W (Read, 64 bytes step) = 4.7 cycles per cache line (3.4 GHz i7-4770)
- L3->L1 B/W (Read, 64 bytes step) = 6.3 cycles per cache line (3.6 GHz E5-2699 v3 - 18 cores)
- L3 Write (Write, 64 bytes step) = 8.4 cycles per write (cache line)
- RAM Read B/W (Parallel Random Read) = 8 ns / cache line = 7100 MB/s
- RAM Read B/W (Read, 8-64 Bytes step) = 17500 MB/s
- RAM Read B/W (Read, 32 Bytes step - pointer chasing) = 14500 MB/s
- RAM Write B/W (Write, 4-64 Bytes step) = 11000 MB/s
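The cache bandwidth figures above are given in cycles per 64-byte line, while the RAM figures are in MB/s. A rough conversion sketch follows; it assumes decimal units (1 MB = 10^6 bytes) and, for lines that do not name a CPU, the 3.4 GHz i7-4770 clock. The helper names are illustrative only:

```python
LINE_BYTES = 64  # cache line size from the spec list


def cycles_per_line_to_gbs(cycles_per_line: float, core_ghz: float) -> float:
    """Bandwidth in GB/s for one cache line every `cycles_per_line` cycles."""
    # bytes/cycle * cycles/ns = bytes/ns, which equals GB/s (decimal).
    return LINE_BYTES / cycles_per_line * core_ghz


def ns_per_line_to_mbs(ns_per_line: float) -> float:
    """Bandwidth in MB/s for one cache line every `ns_per_line` ns."""
    return LINE_BYTES / ns_per_line * 1000.0


def mbs_to_ns_per_line(mbs: float) -> float:
    """Inverse of ns_per_line_to_mbs: ns spent per cache line."""
    return LINE_BYTES / mbs * 1000.0


# Assumed clock: 3.4 GHz (i7-4770).
print(f"L2->L1 sequential: {cycles_per_line_to_gbs(2.2, 3.4):.1f} GB/s")
print(f"L3->L1 sequential: {cycles_per_line_to_gbs(4.7, 3.4):.1f} GB/s")
print(f"RAM at 8 ns/line : {ns_per_line_to_mbs(8):.0f} MB/s")
print(f"7100 MB/s implies: {mbs_to_ns_per_line(7100):.1f} ns/line")
```

Under these assumptions 2.2 cycles per line is about 98.9 GB/s. Exactly 8 ns per line would give 8000 MB/s, so the quoted 7100 MB/s suggests the "8 ns" figure is rounded (closer to 9 ns per line), or that a different MB convention was used.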