Intel P6 (Pentium II, Pentium III)
- L1 Data cache = 16 KB. 32 B/line, 4-WAY. (Write-Allocate)
- L1 Instruction cache = 16 KB. 32 B/line, 4-WAY.
- L2 cache size = 256 KB / 512 KB. 32 B/line, ?-WAY
- 4 KB pages data TLB size = 64 entries (4-WAY). Miss penalty = 5 cycles
- 4 MB pages data TLB size = 8 entries. Miss penalty = 6 cycles
- PDE cache size = 2 entries, covering 8 MB (or 4 MB in PAE mode)
- 32-byte range cross penalty = 9 cycles
- L1 B/W (Parallel Random Read) = 1 cycle per access
- The TLB miss handler services only one request at a time.
- A write to L2 or RAM performs both a read and a write (Write-Allocate in L1 cache).
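As a worked example of the L1 figures above, the following sketch derives the number of sets and the address bit split from the listed size, line size, and associativity. The helper name `cache_geometry` is illustrative and not part of the original notes; it assumes all sizes are powers of two.

```python
def cache_geometry(size_bytes, line_bytes, ways):
    """Return (sets, offset_bits, index_bits) for a set-associative cache.

    Assumes size, line size, and set count are powers of two.
    """
    sets = size_bytes // (line_bytes * ways)
    offset_bits = line_bytes.bit_length() - 1   # log2(line size)
    index_bits = sets.bit_length() - 1          # log2(number of sets)
    return sets, offset_bits, index_bits


# P6 L1 data cache: 16 KB, 32 B/line, 4-way
sets, offset_bits, index_bits = cache_geometry(16 * 1024, 32, 4)
print(sets, offset_bits, index_bits)  # 128 5 7
```

With 128 sets of 32 bytes, addresses that are a multiple of 4 KB apart map to the same set, so more than four such lines cannot stay in L1 at once.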
Intel Pentium III-S
Intel Pentium III-S 1400 MHz (133 * 10.5), Tualatin (130 nm), FSB 133 MHz, 32.2 W, dual CPU motherboard,
ServerWorks ServerSet III LE, 3 x 1 GB PC133 ECC CL3.
- L2 cache size = 512 KB, 32 B/line, ?-WAY. Latency = 8 cycles.
4 KB pages mode
Size | Latency (cycles) | Description |
16 K | 3 | TLB + L1 |
256 K | 8 | +5 (L2) |
512 K | 13 | +5 (TLB miss -> L1 cache) |
8 M | 13 + 140 ns | + RAM |
~128 M | 21 + 140 ns | +3 (PDE cache miss) + 5 (TLB miss -> L2 cache) |
... | 21 + 280 ns | + RAM (TLB miss -> RAM) |
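The Latency column mixes a cycle count with a RAM time in nanoseconds. A minimal sketch of folding both into one unit follows, assuming the first term is in core clock cycles and the core runs at the stated 1400 MHz. The helper `total_latency` is hypothetical.

```python
def total_latency(cycles, ns, core_mhz):
    """Combine a 'cycles + ns' table entry into (total_cycles, total_ns)."""
    cycles_per_ns = core_mhz / 1000.0
    total_cycles = cycles + ns * cycles_per_ns
    total_ns = cycles / cycles_per_ns + ns
    return total_cycles, total_ns


# Pentium III-S at 1400 MHz, the "21 + 280 ns" row
cyc, ns = total_latency(21, 280, 1400)
print(f"{cyc:.0f} cycles, {ns:.0f} ns")  # 413 cycles, 295 ns
```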
4 MB pages mode
Size | Latency (cycles) | Description |
16 K | 3 | TLB + L1 |
512 K | 8 | +5 (L2) |
32 M | 8 + 140 ns | + RAM |
... | 14 + 140 ns | +6 (TLB miss) |
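The size boundaries at which the TLB miss penalty first appears in the two tables above line up with the TLB reach implied by the P6 entry counts (64 entries for 4 KB pages, 8 entries for 4 MB pages). A small sketch of that arithmetic, with `tlb_reach` as an illustrative helper:

```python
KB = 1024
MB = 1024 * KB


def tlb_reach(entries, page_bytes):
    """Memory covered by a fully populated TLB without taking a miss."""
    return entries * page_bytes


reach_4k = tlb_reach(64, 4 * KB)   # 4 KB pages, 64-entry data TLB
reach_4m = tlb_reach(8, 4 * MB)    # 4 MB pages, 8-entry data TLB
print(reach_4k // KB, "KB")  # 256 KB
print(reach_4m // MB, "MB")  # 32 MB
```

Working sets up to 256 KB (4 KB pages) or 32 MB (4 MB pages) therefore show no TLB miss term, and larger ones do.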
- 4096-byte range cross penalty = 86 cycles
- L2->L1 B/W (Parallel Random Read) = 2.67 cycles per cache line (32 bytes)
- L2->L1 B/W (Read, 32 bytes step) = 3.6 cycles per cache line
- L2->L1 B/W (Read, 32 bytes step, pointer-chasing) = 8 cycles per cache line (NO hardware prefetch to L1)
- L2 Write (Write, 32 bytes step) = 9.50 cycles per write (cache line)
- RAM Read B/W (Parallel Random Read) = 70 ns / cache line = 460 MB/s
- RAM Read B/W (Read, 4 Bytes step) = 640 MB/s
- RAM Read B/W (Read, 32 Bytes step) = 1020 MB/s
- RAM Read B/W (Read, 4 Bytes step, pointer-chasing) = 608 MB/s
- RAM Read B/W (Read, 32 Bytes step, pointer-chasing) = 670 MB/s
- RAM Write B/W (Write, 4-32 Bytes step) = 213 MB/s with 1 RAM module in use.
- RAM Write B/W (Write, 4-32 Bytes step) = 276 MB/s with 3 RAM modules in use.
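The bandwidth figures in this list follow from the time per 32-byte cache line. The sketch below shows the conversion from nanoseconds per line and from core cycles per line. Both helper names are illustrative, and MB/s is taken to mean 10^6 bytes per second, which is an assumption about the original notes.

```python
LINE_BYTES = 32  # P6 cache line size


def bw_from_ns_per_line(ns_per_line, line_bytes=LINE_BYTES):
    """Bandwidth in MB/s (10^6 bytes/s) given time per cache line in ns."""
    return line_bytes / ns_per_line * 1000.0


def bw_from_cycles_per_line(cycles_per_line, core_mhz, line_bytes=LINE_BYTES):
    """Bandwidth in MB/s given core cycles per cache line and core clock."""
    return line_bytes * core_mhz / cycles_per_line


print(round(bw_from_ns_per_line(70)))          # 457, listed as ~460 MB/s
print(bw_from_cycles_per_line(2.67, 1400))     # L2->L1 at 1400 MHz, MB/s
```

The second call puts the L2->L1 parallel random read rate at roughly 16.8 GB/s for the 1400 MHz part.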
Intel Celeron (Tualatin)
Intel Celeron 1200 MHz (100*12), Tualatin (130 nm), SDRAM PC-100 2-2-2-5-7.
- L2 cache size = 256 KB, 32 B/line, 8-WAY
4 KB pages mode
Size | Latency (cycles) | Description |
16 K | 3 | TLB + L1 |
256 K | 8 | +5 (L2) |
8 M | 13 + 150 ns | +5 (TLB miss) + RAM |
~64 M | 21 + 150 ns | +3 (PDE cache miss) + 5 (TLB miss -> L2 cache) |
... | 21 + 300 ns | + RAM (TLB miss -> RAM) |
4 MB pages mode
Size | Latency (cycles) | Description |
16 K | 3 | TLB + L1 |
256 K | 8 | +5 (L2) |
32 M | 8 + 150 ns | + RAM |
... | 14 + 150 ns | +6 (TLB miss) |
- RAM Read B/W (Read, 4 Bytes step) = 600 MB/s
- RAM Read B/W (Read, 32 Bytes step) = 780 MB/s
- RAM Read B/W (Read, 4 Bytes step, pointer-chasing) = 210 MB/s
- RAM Read B/W (Read, 32 Bytes step, pointer-chasing) = 260 MB/s
- RAM Write B/W (Write, 4-32 Bytes step) = 210 MB/s
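The "pointer-chasing" rows describe dependent loads, where the address of each load comes from the result of the previous one, so misses cannot overlap. The notes do not show how the chain was built. The sketch below is one plausible construction for a fixed-step chain, assuming 4-byte elements (32-bit pointers). It only illustrates the access pattern; a real measurement would need a compiled language and a timer.

```python
def build_strided_chain(size_bytes, step_bytes, elem_bytes=4):
    """Index chain where each slot points step_bytes further on, wrapping."""
    n = size_bytes // elem_bytes
    step = step_bytes // elem_bytes
    return [(i + step) % n for i in range(n)]


def chase(nxt, steps, start=0):
    """Follow the chain; every step depends on the previous load's result."""
    i = start
    for _ in range(steps):
        i = nxt[i]
    return i


# 1 KB buffer, 32-byte step: one slot per cache line, 32 lines per lap
chain = build_strided_chain(1024, 32)
print(chase(chain, 1))    # 8: the next slot is 32 bytes (8 elements) ahead
print(chase(chain, 32))   # 0: one full lap returns to the start
```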
Pentium II
Dual Pentium II 350 MHz (100*3.5), Deschutes (250 nm), core: 113 mm².
L2 512 KB @ 175 MT/s.
- L2 Cache: 512 KB. Latency = 22 cycles.
- RAM Latency = 22 + 100 ns (for open RAM page)
- RAM Latency = 22 + 140 ns (for new RAM page). RAM page size = 4 KB?
4 KB pages, Linux
Size | Latency (cycles) | Increase | Description |
16 K | 3 | | |
32 K | 13 | 10 | + 19 (L1 cache miss) |
64 K | 18 | 5 | |
128 K | 20 | 2 | |
256 K | 22 | 2 | |
512 K | 25 | 3 | + 5 (TLB miss) |
1 M | 26 + 70 ns | 1 + 70 ns | + 140 ns (RAM) |
2 M | 27 + 105 ns | 1 + 35 ns | |
4 M | 27 + 123 ns | 18 ns | |
8 M | 32 + 137 ns | 5 + 14 ns | + 19 (Page walk to L2 cache) ? |
16 M | 38 + 160 ns | 6 + 23 ns | + 3 (PDE cache miss) |
32 M | 44 + 172 ns | 6 + 12 ns | |
64 M | 47 + 175 ns | 3 + 3 ns | |
128 M | 49 + 175 ns | 2 | |
- 4096-byte range cross penalty = 77 cycles
- L2->L1 B/W (Parallel Random Read) = 10 cycles per cache line (32 bytes) = 3.2 bytes/cycle (1.12 GB/s at 350 MHz)
- L2->L1 B/W (Read, 32 bytes step) = 10 cycles per cache line (32 bytes) = 3.2 bytes/cycle (1.12 GB/s at 350 MHz)
- L2->L1 B/W (Read, 32 bytes step, pointer-chasing) = 22 cycles per cache line (NO hardware prefetch to L1)
- L2 Write (Write, 32 bytes step) = 36 cycles per write (32-byte cache line). Write-Allocate.
- RAM Read B/W (Parallel Random Read) = 77 ns / cache line = 410 MB/s
- RAM Read B/W (Read, 4 Bytes step) = 290 MB/s
- RAM Read B/W (Read, 32 Bytes step) = 425 MB/s
- RAM Read B/W (Read, 4-16 Bytes step, pointer-chasing) = 143 MB/s
- RAM Read B/W (Read, 32 Bytes step, pointer-chasing) = 200 MB/s
- RAM Write B/W (Write, 4-32 Bytes step) = 180 MB/s. Write-Allocate
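The range cross penalties are read here as the cost of a single access that straddles a 32-byte cache line boundary or a 4096-byte page boundary. That reading is an interpretation of the notes, not something they state. A short check for whether a given access crosses such a boundary, with `crosses_boundary` as an illustrative helper:

```python
def crosses_boundary(addr, size, boundary):
    """True if the access [addr, addr + size) spans a multiple of boundary."""
    return addr // boundary != (addr + size - 1) // boundary


print(crosses_boundary(30, 4, 32))      # True: bytes 30..33 span two cache lines
print(crosses_boundary(28, 4, 32))      # False: bytes 28..31 stay in one line
print(crosses_boundary(4094, 4, 4096))  # True: spans two 4 KB pages
```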
Pipeline
Branch misprediction penalty = 9 cycles.
# | In-Order | Out-of-Order |
1 | ICache | |
2 | ILD | |
3 | Decode1/Rotate | |
4 | Decode2 | |
5 | Decode3 | |
6H | RAT | |
6L | ROB Write?; RS Psrc/Pdsts write | |
7H | ROB/RRF read | Ready: RS Pdst-CAM match |
7L | RS data write | |
8H | | RS data Read |
8L | | ByPass |
9H | | Execute |
9L | | RS/ROB writeback |
10H | Retire-ROB Read | |
10L | Ip -1 | |
11H | Ip -2 | |
11L | Retire-RRF Write | |
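To put the 9-cycle branch misprediction penalty quoted for this pipeline in context, the sketch below estimates the average cycles per instruction lost to mispredictions. The branch fraction and misprediction rate are hypothetical inputs chosen for illustration, not measurements from these notes.

```python
def mispredict_cpi(branch_fraction, mispredict_rate, penalty_cycles=9):
    """Extra cycles per instruction lost to branch mispredictions."""
    return branch_fraction * mispredict_rate * penalty_cycles


# Hypothetical workload: 20% branches, 5% of them mispredicted
extra = mispredict_cpi(0.20, 0.05)
print(f"{extra:.2f} extra cycles per instruction")  # 0.09 extra cycles per instruction
```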