Intel Sandy Bridge
Configuration
Intel i3-2120 (Sandy Bridge), 3.3 GHz, 32 nm. RAM: 16 GB (4 x 4GB), PC3-10700 (667 MHz) 9-9-9-24-2T.
- L1 Data cache = 32 KB. 64 B/line, 8-WAY. (Write-Allocate?). Two 16-byte read ports + one 16-byte store port.
- L1 Instruction cache = 32 KB. 8-WAY. 64 B/line
- L2 Cache = 256 KB. 64 B/line, 8-WAY
- L3 Cache = 3 MB. 64 B/line
- mOp cache: 1.5K mOps, 8-WAY, 6 mOps / line. Up to 3 lines of 6 mOps each for each aligned and contiguous 32-byte block of code (Agner).
- Instruction fetch/decode throughput: 16 bytes/clock from the ICache, 32 bytes/clock from the mOp cache (Agner).
- A mOp cache line is assigned to a specific 32-byte block of code.
- An instruction that generates multiple mOps cannot be split between two mOp cache lines.
- An unconditional jump or call always ends a mOp cache line.
- The same piece of code can have multiple entries in the mOp cache if it has multiple jump entries.
- Each entry in the mOp cache has 32 bits of storage space for address and data bits.
- L1 Data Cache Latency = 4 cycles for simple access via pointer
- L1 Data Cache Latency = 5 cycles for access with complex address calculation (size_t n, *p; n = p[n]).
- L2 Cache Latency = 12 cycles
- L3 Cache Latency = 27.85 cycles
- RAM Latency = 28 cycles + 49 ns (for open RAM page). RAM page size = 16 KB?
- RAM Latency = 28 cycles + 56 ns (for random RAM page).
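The RAM figures above mix two units: a cycle part that scales with the core clock and a nanosecond part that does not. A minimal Python sketch of converting them to a single unit, assuming the 3.3 GHz clock from the configuration line (the helper names `total_ns` and `total_cycles` are hypothetical and not part of the original notes):

```python
CPU_GHZ = 3.3  # core clock from the configuration line (cycles per nanosecond)

def total_ns(cycles, extra_ns=0.0, ghz=CPU_GHZ):
    """Convert a 'cycles + ns' latency into nanoseconds."""
    return cycles / ghz + extra_ns

def total_cycles(cycles, extra_ns=0.0, ghz=CPU_GHZ):
    """Convert a 'cycles + ns' latency into core cycles."""
    return cycles + extra_ns * ghz

# (name, cycle part, ns part) taken from the latency list above
levels = [
    ("L1", 4, 0),
    ("L2", 12, 0),
    ("L3", 27.85, 0),
    ("RAM, open page", 28, 49),
    ("RAM, random page", 28, 56),
]

for name, cyc, ns in levels:
    print(f"{name:18s} {total_ns(cyc, ns):6.1f} ns  {total_cycles(cyc, ns):6.1f} cycles")
```

At a different core clock only the cycle part changes, which is presumably why the two parts are reported separately.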
2 MB pages mode (64-bit Windows)
- Data TLB: 32 entries. 4-WAY, Miss Penalty = 16 cycles. Parallel miss: 20 cycles per access
- PDPTE cache: 4 entries (cover 4 GB). Miss Penalty = 18 cycles.
- PML4 cache: ? entries.
Size      Latency (cycles)   Increase      Description
  32 K    4
  64 K    8                  4             + 8 (L2)
 128 K    10                 2
 256 K    11                 1
 512 K    20                 9             + 16 (L3)
   1 M    24                 4
   2 M    26                 2
   4 M    27 + 18 ns         1 + 18 ns     + 56 ns (RAM)
   8 M    28 + 38 ns         1 + 20 ns
  16 M    28 + 47 ns         9 ns
  32 M    28 + 52 ns         5 ns
  64 M    28 + 54 ns         2 ns
 128 M    36 + 55 ns         8 + 1 ns      + 16 (TLB miss)
 256 M    40 + 56 ns         4 + 1 ns
 512 M    42 + 56 ns         2
1024 M    43 + 56 ns         1
2048 M    44 + 56 ns         1
4096 M    44 + 56 ns         0
8192 M    53 + 56 ns         9             + 18 (PDPTE cache miss)
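The rows where the table reports a TLB miss and a PDPTE cache miss line up with how much memory each structure can cover. A small sketch of that "reach" arithmetic, assuming every entry maps a distinct region and that one PDPTE maps 1 GB (consistent with the "4 entries (cover 4 GB)" note above); the function name `reach` is hypothetical:

```python
KB, MB, GB = 1 << 10, 1 << 20, 1 << 30

def reach(entries, bytes_per_entry):
    """Bytes of address space covered when each entry maps a distinct region."""
    return entries * bytes_per_entry

dtlb_reach = reach(32, 2 * MB)   # 32-entry data TLB with 2 MB pages
pdpte_reach = reach(4, 1 * GB)   # 4-entry PDPTE cache, 1 GB per entry (assumption)

print("Data TLB reach:   ", dtlb_reach // MB, "MB")
print("PDPTE cache reach:", pdpte_reach // GB, "GB")
```

Under these assumptions the data TLB covers 64 MB and the PDPTE cache covers 4 GB, so the first table rows that exceed them are 128 M and 8192 M, which is where the extra penalties appear.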
4 KB pages mode (64-bit Windows)
- Data TLB L1 size = 64 items. 4-WAY. Miss penalty = 7 cycles. Parallel miss: 1 cycle per access
- TLB L2 size = 512 items. 4-WAY. Miss penalty = 10 cycles. Parallel miss: 21 cycles per access
- Instruction TLB L1 size = 64 items per thread (128 per core). 4-WAY
- PDE cache = 32 items?
Size      Latency (cycles)   Increase      Description
  32 K    4
  64 K    8                  4             + 8 (L2)
 128 K    10                 2
 256 K    11                 1
 512 K    24                 13            + 16 (L3), + 7 (L1 TLB miss)
   1 M    30                 6
   2 M    32                 2
   4 M    39 + 18 ns         7 + 18 ns     + 56 ns (RAM), + 10 (L2 TLB miss)
   8 M    44 + 38 ns         5 + 20 ns
  16 M    49 + 47 ns         5 + 9 ns
  32 M    51 + 52 ns         2 + 5 ns
  64 M    60 + 54 ns         9 + 2 ns
 128 M    69 + 55 ns         9 + 1 ns      + 18 (PDE cache miss), + 16 (page walk to L3)
 256 M    76 + 57 ns         7 + 2 ns
 512 M    79 + 70 ns         3 + 13 ns
1024 M    79 + 86 ns         0 + 16 ns     + 56 ns (page walk to RAM)
2048 M    79 + 93 ns         0 + 7 ns
4096 M    79 + 103 ns        0 + 10 ns
8192 M    88 + 107 ns        9 + 4 ns      + 18 (PDPTE cache miss)
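The Increase column is the row-to-row difference of the Latency column, with the cycle part and the nanosecond part taken separately. A hedged sketch that recomputes it for a few rows of the table above; the parser and the names `parse_latency` and `increases` are illustrative assumptions, not the tool that produced these numbers:

```python
import re

def parse_latency(text):
    """Split a latency such as '39 + 18 ns' into (cycles, ns)."""
    m = re.fullmatch(r"\s*(\d+)(?:\s*\+\s*(\d+)\s*ns)?\s*", text)
    if m is None:
        raise ValueError(f"unrecognised latency: {text!r}")
    return int(m.group(1)), int(m.group(2) or 0)

def increases(rows):
    """Return (size, cycle_delta, ns_delta) for every row after the first."""
    result = []
    prev = parse_latency(rows[0][1])
    for size, latency in rows[1:]:
        cur = parse_latency(latency)
        result.append((size, cur[0] - prev[0], cur[1] - prev[1]))
        prev = cur
    return result

# A few consecutive rows copied from the 4 KB pages table
rows_4k = [
    ("2 M", "32"),
    ("4 M", "39 + 18 ns"),
    ("8 M", "44 + 38 ns"),
    ("16 M", "49 + 47 ns"),
]

for size, d_cycles, d_ns in increases(rows_4k):
    print(f"{size:>5}: +{d_cycles} cycles, +{d_ns} ns")
```

The same check can be run over every row of either table to catch transcription slips in the Increase column.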
MISC
- Branch misprediction penalty = 14 cycles (if mOp cache is used).
- Branch misprediction penalty = 17-18 cycles (if mOp cache miss, and L1 cache hit).
- Penalty for an access crossing a 64-byte boundary = 5 cycles
- Penalty for an access crossing a 4096-byte boundary = 24 cycles
- L1 B/W (Parallel Random Read) = 0.54 cycles per one access
- L2->L1 B/W (Parallel Random Read) = 2.50 cycles per cache line
- L2->L1 B/W (Read, 64 bytes step) = 2.10 cycles per cache line
- L2 Write (Write, 64 bytes step) = 6.70 cycles per write (cache line)
- L3->L1 B/W (Parallel Random Read) = 4.65 cycles per cache line
- L3->L1 B/W (Read, 64 bytes step) = 4.92 cycles per cache line
- L3 Write (Write, 64 bytes step) = 9.00 cycles per write (cache line)
- RAM Read B/W (Parallel Random Read) = 8.3 ns / cache line = 7700 MB/s
- RAM Read B/W (Read, 8-64 Bytes step) = 16000 MB/s
- RAM Write B/W (Write, 4-64 Bytes step) = 9200 MB/s
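The throughput figures above are given either as cycles per 64-byte cache line or as nanoseconds per line. A minimal sketch of turning them into bytes per second, assuming the 3.3 GHz core clock from the configuration line and decimal units (1 MB = 10^6 bytes, which appears to be what the 7700 MB/s figure uses); the function names are hypothetical:

```python
CPU_HZ = 3.3e9    # core clock from the configuration line
LINE_BYTES = 64   # cache line size

def gb_per_s_from_cycles(cycles_per_line, hz=CPU_HZ, line=LINE_BYTES):
    """Throughput in GB/s when one cache line moves every `cycles_per_line` cycles."""
    return line * hz / cycles_per_line / 1e9

def mb_per_s_from_ns(ns_per_line, line=LINE_BYTES):
    """Throughput in MB/s when one cache line moves every `ns_per_line` nanoseconds."""
    return line / (ns_per_line * 1e-9) / 1e6

print(f"L2->L1 sequential read: {gb_per_s_from_cycles(2.10):.1f} GB/s")
print(f"L3->L1 sequential read: {gb_per_s_from_cycles(4.92):.1f} GB/s")
print(f"RAM parallel random read: {mb_per_s_from_ns(8.3):.0f} MB/s")
```

The RAM result comes out close to the 7700 MB/s listed above, with the small difference due to rounding in the source figure.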