Intel Westmere
- L1 Data cache = 32 KB. 64 B/line, 8-WAY. (Write-Allocate?)
- L1 Instruction cache = 32 KB. 4-WAY. ? B/line
- L2 cache = 256 KB. 64 B/line, 8-WAY
- Load Buffers = 48 items
- Store Buffers = 32 items
- Line fill buffers (LFB) = 10
- RS = 36 items
- ROB = 128 items
- L1 Data Cache Latency = 4 cycles
- L2 Data Cache Latency = 10 cycles
- 64-bytes range cross penalty = 4 cycles.
- 4096-bytes range cross penalty = 20 cycles.
- Branch misprediction penalty = 15-16 cycles.
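The two range-cross penalties above apply when a single access straddles a 64-byte cache line or a 4096-byte page. A minimal sketch of that check follows. The function names are hypothetical, and it assumes the two penalties do not add up (a page cross is charged 20 cycles, not 24), which the figures above do not state.

```python
CACHE_LINE = 64   # bytes, the 64-byte range from the list above
PAGE_4K = 4096    # bytes, the 4096-byte range from the list above


def crosses(addr: int, size: int, boundary: int) -> bool:
    """True if the access [addr, addr + size) spans a multiple of `boundary`."""
    return addr // boundary != (addr + size - 1) // boundary


def split_penalty(addr: int, size: int) -> int:
    """Extra cycles for one access, using the penalties listed above.

    Assumption: a page cross costs 20 cycles in total rather than 20 + 4.
    """
    if crosses(addr, size, PAGE_4K):
        return 20
    if crosses(addr, size, CACHE_LINE):
        return 4
    return 0


if __name__ == "__main__":
    for addr in (0, 56, 60, 4092):
        print(f"8-byte access at {addr}: +{split_penalty(addr, 8)} cycles")
```

An 8-byte access at offset 56 ends exactly at the line boundary and pays nothing, while the same access at offset 60 spills into the next line.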
Intel Xeon X5650 (Westmere-EP)
Intel Xeon X5650 (Westmere-EP), 32 nm, 6 cores, 1.17 B transistors, 248 mm2. LGA-1366, 95 W
- 2666 MHz (Turbo-Boost off), tested in that mode
- 3060 MHz (Turbo-Boost on)
- L3 cache = 12 MB. 64 B/line, 16-WAY
- L3 Cache Latency = 40 cycles for cores 3,5
- L3 Cache Latency = 42 cycles for cores 0,1,2,4
- L3 Cache Latency = 46 cycles for cores 0,1,2,4 (for Turbo-Boost 3060 MHz)
- RAM Latency = 40 cycles + 67 ns (RAM connected to this CPU)
- RAM Latency = 40 cycles + 105 ns (RAM connected to another CPU)
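The size, line length and associativity figures above determine how many sets each cache has: sets = size / (line size * ways). The sketch below works that out for the three levels. It assumes plain modulo indexing with one monolithic array per cache, which is a simplification for the L3, and the helper name is hypothetical.

```python
KB = 1024
MB = 1024 * KB


def num_sets(size_bytes: int, line_bytes: int, ways: int) -> int:
    """Number of sets in a set-associative cache."""
    return size_bytes // (line_bytes * ways)


# Geometry taken from the lists above.
l1d_sets = num_sets(32 * KB, 64, 8)
l2_sets = num_sets(256 * KB, 64, 8)
l3_sets = num_sets(12 * MB, 64, 16)

if __name__ == "__main__":
    print("L1D sets:", l1d_sets)
    print("L2 sets: ", l2_sets)
    print("L3 sets: ", l3_sets)
```

With 64 sets of 64 bytes, addresses 4 KB apart map to the same L1D set, so more than 8 such lines cannot stay in L1D at once.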
1 GB pages mode (64-bit Linux)
- There are no dedicated TLB items that cover full 1 GB pages.
- 1 GB pages TLB size = 32 items. Miss penalty = 23 cycles. Parallel miss: 28 cycles per access.
Part of that miss penalty can be hidden behind the RAM access.
Size      Latency       Increase     Description
32 K      4
64 K      7             3            + 6 (L2)
128 K     9             2
256 K     9             0
512 K     26            17           + 30 (L3)
1 M       33            7
2 M       37            4
4 M       38            1
8 M       39            1
16 M      40 + 15 ns    1 + 15 ns    + 67 ns (RAM)
32 M      40 + 44 ns    + 29 ns
64 M      40 + 54 ns    + 12 ns
128 M     52 + 61 ns    12 + 7 ns    + 23 (TLB miss)
256 M     58 + 64 ns    6 + 3 ns
512 M     61 + 66 ns    3 + 2 ns
1024 M    63 + 67 ns    2 + 1 ns
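Latencies in this table mix core cycles with nanoseconds (for example "40 + 15 ns"). To compare rows they can be folded into a single unit. The sketch below assumes the 2666 MHz clock of the tested mode (Turbo-Boost off); the function names are hypothetical.

```python
FREQ_GHZ = 2.666  # 2666 MHz, Turbo-Boost off (the tested mode)


def total_ns(cycles: float, ns: float = 0.0, freq_ghz: float = FREQ_GHZ) -> float:
    """Convert a 'cycles + ns' latency into nanoseconds."""
    return cycles / freq_ghz + ns


def total_cycles(cycles: float, ns: float = 0.0, freq_ghz: float = FREQ_GHZ) -> float:
    """Convert a 'cycles + ns' latency into core cycles."""
    return cycles + ns * freq_ghz


if __name__ == "__main__":
    # Local RAM latency: 40 cycles + 67 ns
    print(f"local RAM:  {total_ns(40, 67):.1f} ns, {total_cycles(40, 67):.0f} cycles")
    # Remote RAM latency: 40 cycles + 105 ns
    print(f"remote RAM: {total_ns(40, 105):.1f} ns, {total_cycles(40, 105):.0f} cycles")
```

At this clock 40 cycles is about 15 ns, so the local RAM figure comes to roughly 82 ns, or a little under 220 core cycles.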
- L1 B/W (Parallel Random Read) = 1 cycle per one access
- L2->L1 B/W (Parallel Random Read) = 4 cycles per cache line
- L2->L1 B/W (Read, 64 bytes step) = 3.8 cycles per cache line
- L2->L1 B/W (Read, 64 bytes step - pointer chasing) = 5.8 cycles per cache line
- L2 Write (Write, 64 bytes step) = 6.6 cycles per write (cache line)
- L3->L1 B/W (Parallel Random Read) = 5.2 cycles per cache line
- L3->L1 B/W (Read, 64 bytes step) = 5.5 cycles per cache line
- L3->L1 B/W (Read, 64 bytes step - pointer chasing) = 10.5 cycles per cache line
- L3 Write (Write, 64 bytes step) = 9.8 cycles per write (cache line)
- RAM Read B/W (Parallel Random Read) = 13.5 ns / cache line = 4700 MB/s
- RAM Read B/W (Read, 8 Bytes step) = 5800 MB/s
- RAM Read B/W (Read, 64 Bytes step) = 6100 MB/s
- RAM Read B/W (Read, 64 Bytes step - pointer chasing) = 4600 MB/s
- RAM Write B/W (Write, 4-64 Bytes step) = 2500 MB/s
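The bandwidth figures above are quoted either as time per 64-byte cache line or as MB/s. The sketch below converts between the two, using decimal megabytes, which is what reproduces the 13.5 ns = 4700 MB/s pairing. It also adds a rough Little's-law estimate of how much one core could read if each of its 10 line fill buffers held one outstanding miss for the full local RAM latency. That estimate is an assumption for illustration, not a measured figure, and it ignores hardware prefetching.

```python
FREQ_GHZ = 2.666   # tested clock, Turbo-Boost off
LINE_BYTES = 64


def mb_s_from_ns_per_line(ns_per_line: float) -> float:
    """Bytes per ns equals GB/s; multiply by 1000 for decimal MB/s."""
    return LINE_BYTES / ns_per_line * 1000.0


def mb_s_from_cycles_per_line(cycles_per_line: float, freq_ghz: float = FREQ_GHZ) -> float:
    """Same conversion for figures quoted in core cycles per cache line."""
    return mb_s_from_ns_per_line(cycles_per_line / freq_ghz)


def littles_law_bound_mb_s(outstanding_lines: int, latency_ns: float) -> float:
    """Rough bound: lines in flight * line size / latency."""
    return outstanding_lines * LINE_BYTES / latency_ns * 1000.0


if __name__ == "__main__":
    print(f"13.5 ns/line    -> {mb_s_from_ns_per_line(13.5):.0f} MB/s")
    print(f"3.8 cycles/line -> {mb_s_from_cycles_per_line(3.8):.0f} MB/s")
    ram_latency_ns = 40 / FREQ_GHZ + 67  # 40 cycles + 67 ns, local RAM
    print(f"10 LFBs bound   -> {littles_law_bound_mb_s(10, ram_latency_ns):.0f} MB/s")
```

The measured RAM read figures (4600 to 6100 MB/s) sit below the estimated bound of about 7800 MB/s, which is consistent with the number of outstanding misses limiting single-core reads.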
2 MB pages mode (64-bit Linux)
- 2 MB pages TLB size = 32 items. Miss penalty = 16 cycles. Parallel miss: 21 cycles per access.
- RAM Write B/W (Write, 4-64 Bytes step) = 2600-2900 MB/s
- latency table is similar to table for 1 GB pages
4 KB pages mode (64-bit Linux)
- Data TLB L1 size = 64 items. 4-way. Miss penalty = 7 cycles. Parallel miss: 2 cycles per access.
- TLB L2 size = 512 items. 4-way ? Miss penalty = 9 cycles. Parallel miss: 21 cycles per access.
- PDE Cache = 32 items (2 MB regions). ?-way. Miss penalty = 10 cycles.
Size      Latency       Increase     Description
32 K      4
64 K      7             3            + 6 (L2)
128 K     9             2
256 K     12            3
512 K     31            19           + 30 (L3) + 7 (L1 TLB miss)
1 M       41            10
2 M       45            4
4 M       52            7            + 9 (L2 TLB miss)
8 M       55 + 1 ns     3 + 1 ns
16 M      56 + 19 ns    1 + 18 ns    + 67 ns (RAM)
32 M      56 + 44 ns    25 ns
64 M      66 + 54 ns    10 + 10 ns
128 M     76 + 61 ns    10 + 7 ns    + 10 (PDE cache miss) + 30 (Page walk to L3)
256 M     86 + 65 ns    10 + 4 ns
512 M     91 + 66 ns    5 + 1 ns
1024 M    94 + 67 ns    2 + 1 ns
2048 M    96 + 75 ns    2 + 8 ns
4096 M    96 + 91 ns    16 ns        + ? ns (Page walk to RAM)
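The rows where the 4 KB table annotates a TLB or PDE cache miss line up with the reach of each structure, that is, entries times the region size each entry covers. The sketch below computes that reach from the sizes listed for 4 KB pages and finds the first test size (doubling from 32 K, as in the table) that exceeds it. The helper names are hypothetical.

```python
KB = 1024
MB = 1024 * KB


def tlb_reach(entries: int, region_bytes: int) -> int:
    """Bytes of address space covered when every entry maps one region."""
    return entries * region_bytes


def first_size_beyond(reach_bytes: int, start: int = 32 * KB) -> int:
    """First working-set size in the doubling series that no longer fits."""
    size = start
    while size <= reach_bytes:
        size *= 2
    return size


l1_reach = tlb_reach(64, 4 * KB)    # Data TLB L1: 64 items, 4 KB pages
l2_reach = tlb_reach(512, 4 * KB)   # TLB L2: 512 items, 4 KB pages
pde_reach = tlb_reach(32, 2 * MB)   # PDE cache: 32 items, 2 MB regions

if __name__ == "__main__":
    for name, reach in (("L1 TLB", l1_reach), ("L2 TLB", l2_reach), ("PDE cache", pde_reach)):
        print(f"{name}: reach {reach // KB} KB, "
              f"misses start at {first_size_beyond(reach) // KB} KB")
```

The results (512 K, 4 M and 128 M) are the same rows that carry the "L1 TLB miss", "L2 TLB miss" and "PDE cache miss" notes in the table.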
Intel i5
Intel i5-650, Clarkdale, Westmere, 32 nm, 81 mm2, 382 M Transistors + GPU / RAM controller (45 nm, 177 M Transistors, 114 mm2).
GIGABYTE H55M-S2, Intel H55 (IbexPeak DH), Dual-Channel, 2 * 2048 MB PC3-10600 666.7 MHz DDR3 Kingston, 9-9-9-24, external graphics card.
- L3 cache = 4 MB. 64 B/line, ?-WAY
4 KB pages mode (64-bit Windows, 64-bit soft)
- L1 Read with (L1 TLB miss -> L2 TLB hit) = 2 cycles per read (throughput)
- L2 Read with (L2 TLB miss) doesn't allow similar parallel accesses.
- L2->L1 B/W (Parallel Random Read) = 4 cycles per cache line
- L2->L1 B/W (Read, 64 bytes step) = 3.71 cycles per cache line
- L2 Write (Write, 64 bytes step) = 6.70 cycles per write (cache line)
- L3->L1 B/W (Parallel Random Read) = 5.90 cycles per cache line
- L3->L1 B/W (Read, 64 bytes step) = 5.75 cycles per cache line
- L3 Write (Write, 64 bytes step) = 10.40 cycles per write (cache line)
- RAM Read B/W (Parallel Random Read) = 12 ns / cache line = 5300 MB/s
- RAM Read B/W (Read, 8 Bytes step) = 8400 MB/s
- RAM Read B/W (Read, 64 Bytes step) = 9860 MB/s
- RAM Read B/W (Read, 64 Bytes step - pointer chasing) = 7100 MB/s
- RAM Write B/W (Write, 4-64 Bytes step) = 6600 MB/s
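For context, the measured RAM figures can be compared with the theoretical peak of the memory configuration listed for this system (dual-channel DDR3 at 666.7 MHz, 8-byte bus per channel, two transfers per clock). The sketch below is only that arithmetic. The function name is hypothetical, and it assumes the memory actually ran at the listed clock.

```python
def ddr_peak_mb_s(io_clock_mhz: float, channels: int, bus_bytes: int = 8) -> float:
    """Theoretical DDR peak: two transfers per clock, `bus_bytes` per transfer."""
    return io_clock_mhz * 2 * bus_bytes * channels


peak = ddr_peak_mb_s(666.7, channels=2)   # PC3-10600, dual-channel

# Measured values from the list above, in MB/s.
measured = {
    "read, 64 bytes step": 9860,
    "read, pointer chasing": 7100,
    "write": 6600,
}

efficiency = {name: value / peak for name, value in measured.items()}

if __name__ == "__main__":
    print(f"theoretical peak: {peak:.0f} MB/s")
    for name, frac in efficiency.items():
        print(f"{name}: {frac:.0%} of peak")
```

The best sequential read figure is a little under half of the theoretical peak.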
Links
Westmere at Wikipedia