IBM PowerPC 970
Configuration
IBM PowerPC 970FX.
- L1 Data cache = 32 KB:
- 128 B/line
- 2-WAY, LRU.
- EA index, RA tags
- Write-through, no-write-allocate
- 3-ports: 2 reads and 1 write every cycle (no banking).
- 2-cycle load-use penalty for FXU loads
- Dedicated 32-byte reload interface from L2 cache
- L1 Instruction cache = 64 KB:
- 128 B/line (4 * 32-byte sectors)
- Direct-mapped
- EA index, RA tags
- Dedicated 32-byte read/write interface from L2 cache with a critical sector first reload policy
- One read or one write per cycle
- 5 additional predecode bits per word to aid in fast decoding and group formation
- Parity protected with a force invalidate and reload on parity error
- L2 cache = 512 KB:
- 128 B/line
- 8-WAY, LRU.
- RA-indexed, RA-tagged.
- Write-back and write-allocate
- One read port, one write port (one read or write per cycle)
- Inclusive of L1 D-cache, not inclusive of L1 I-cache
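The cache parameters above determine each cache's set count and address split (sets = size / (line * ways)). A quick sanity check in Python (an illustrative sketch, not from the source):

```python
def cache_geometry(size_bytes, line_bytes, ways):
    """Return (sets, index_bits, offset_bits) for a set-associative cache."""
    sets = size_bytes // (line_bytes * ways)
    index_bits = sets.bit_length() - 1      # log2(sets); sets is a power of two
    offset_bits = line_bytes.bit_length() - 1
    return sets, index_bits, offset_bits

# L1 D: 32 KB, 128 B/line, 2-way  -> 128 sets
l1d = cache_geometry(32 * 1024, 128, 2)
# L1 I: 64 KB, 128 B/line, direct-mapped -> 512 sets
l1i = cache_geometry(64 * 1024, 128, 1)
# L2:   512 KB, 128 B/line, 8-way -> 512 sets
l2 = cache_geometry(512 * 1024, 128, 8)
```

Note that for the L1 D-cache, index + offset = 7 + 7 = 14 bits, which exceeds the 12-bit offset of a 4 KB page; the index therefore uses untranslated EA bits above the page boundary, which is consistent with the EA-index/RA-tag organization listed above.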
- 4-entry, 128-byte, instruction prefetch queue above the I-cache; hardware-initiated prefetches
- Fetch a 32-byte aligned block of eight instructions per cycle
- Prediction for up to two branches per cycle
- Branch prediction
- Scan all 8 fetched instructions for branches each cycle
- Predict up to 2 branches per cycle
- 3-table prediction structure - global / local / selector (16K entries x 1-bit each)
- 16-entry link stack for address prediction (with stack recovery)
- 32-entry count cache for address prediction (indexed by the address of Branch Conditional to Count Register (bcctr) instructions)
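The global/local/selector arrangement above is a tournament scheme: two independent predictors plus a selector table that learns which one to trust per branch. A minimal sketch with 16K one-bit entries per table; the index hashes and history length here are illustrative assumptions, not the 970's actual functions:

```python
SIZE = 16 * 1024  # 16K one-bit entries per table, as in the 970

class TournamentPredictor:
    """Sketch of a global/local/selector predictor with 1-bit entries."""
    def __init__(self):
        self.local = [0] * SIZE      # last outcome, indexed by branch address
        self.global_ = [0] * SIZE    # last outcome, indexed by address ^ history
        self.selector = [0] * SIZE   # 0 = trust local table, 1 = trust global
        self.history = 0             # global branch-history register

    def _idx_local(self, pc):
        return (pc >> 2) % SIZE

    def _idx_global(self, pc):
        return ((pc >> 2) ^ self.history) % SIZE

    def predict(self, pc):
        l = self.local[self._idx_local(pc)]
        g = self.global_[self._idx_global(pc)]
        return g if self.selector[self._idx_local(pc)] else l

    def update(self, pc, taken):
        li, gi = self._idx_local(pc), self._idx_global(pc)
        l, g = self.local[li], self.global_[gi]
        # When the tables disagree, move the selector toward the correct one.
        if l != g:
            self.selector[li] = 1 if g == int(taken) else 0
        self.local[li] = int(taken)
        self.global_[gi] = int(taken)
        self.history = ((self.history << 1) | int(taken)) % SIZE
```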
- Instruction decode and preprocessing
- 3-cycle pipeline to decode and preprocess instructions
- Dedicated dataflow for cracking one instruction into two internal operations
- Microcoded templates for longer emulation sequences of internal operations
- All internal operations expanded into 86-bit internal form to simplify subsequent processing and
explicitly expose register dependencies for all register pools
- Dispatch groups (up to 5 instructions: 4 + branch) formulated along with inter-instruction dependence masks
- 8-entry * 16 bytes instruction fetch buffer (up to 8 instructions in and 5 instructions out during each cycle)
- Instruction dispatch, sequencing, and completion control
- 4 dispatch buffers, which can hold up to 4 dispatch groups when the global completion table (GCT) is full
- 20-entry global completion table
- Instruction queuing resources
- 2 * 18-entry issue queues for fixed-point and load/store instructions
- 2 * 10-entry issue queues for floating-point instructions
- 12-entry issue queue for branch instructions
- 10-entry issue queue for CR-logical instructions
- 16-entry issue queue for vector permute instructions
- 20-entry issue queue for vector ALU instructions and vector stores
- Support for up to 16 predicted branches in flight
- Prediction support for branch direction and branch addresses
- In-order dispatch of up to five operations into the distributed issue queue structure
- Out-of-order issue of up to 10 operations into 10 execution pipelines:
- 2 load or store operations
- 2 fixed-point register-register operations
- 2 floating-point operations
- 1 branch operation
- 1 Condition Register operation
- 1 vector permute operation
- 1 vector ALU operation
- Register renaming
- Up to 215 instructions in flight.
- Up to 16 instructions in the instruction fetch unit (fetch buffer and overflow buffer)
- Up to 32 instructions in the instruction fetch buffer in the instruction decode unit
- Up to 35 instructions in three decode pipe stages and four dispatch buffers
- Up to 100 instructions in the inner-core (after dispatch)
- Up to 32 stores queued in the store queue (STQ) (available for forwarding)
- Fast, selective flush of incorrect speculative instructions and results
- Specific focus on storage latency management
- Out-of-order and speculative issue of load operations
- Support for up to 8 outstanding L1 cache line misses
- Hardware-initiated instruction prefetching from L2 cache
- Software-initiated data stream prefetching with support for up to 8 active streams
- Critical-word forwarding (critical sector first)
- New branch processing: prediction hints on branch instructions
- 32-entry store queue logically above the D-cache (real address based; content-addressable memory [CAM] structure).
Store addresses and store data can be supplied on different cycles;
supports store-to-load forwarding
- 32-entry load reorder queue (real address based; CAM structure)
- 8-entry load miss queue (LMQ) (real address based). Keeps track of loads that have missed in the L1 D-cache.
Allows a second load from the same cache line to merge onto a single entry.
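The store queue's role can be sketched as follows: a CAM lookup on the load address, with the youngest older store winning and forwarding its data. Real hardware matches real addresses and handles partial overlap, access sizes, and ordering violations (via the load reorder queue), none of which is modeled here; the 32-entry capacity mirrors the figure above:

```python
class StoreQueue:
    """Sketch of a CAM-style store queue with store-to-load forwarding."""
    CAPACITY = 32

    def __init__(self):
        self.entries = []  # (addr, data) pairs in program order

    def store(self, addr, data):
        # A full queue stalls dispatch in real hardware.
        assert len(self.entries) < self.CAPACITY, "store queue full"
        self.entries.append((addr, data))

    def load(self, addr, memory):
        # CAM lookup: youngest matching older store forwards its data.
        for a, d in reversed(self.entries):
            if a == addr:
                return d
        return memory.get(addr, 0)    # no match: read the cache/memory

    def commit_oldest(self, memory):
        # Retire the oldest store to the memory hierarchy.
        a, d = self.entries.pop(0)
        memory[a] = d
```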
- Virtual Address (VA): 65 bits.
- Real Address (RA): 42 bits.
- 128-entry D-ERAT, 2-WAY
- 128-entry I-ERAT, 2-WAY
- 64-entry SLB, fully associative. An SLB miss results in an interrupt and a software reload of the SLB
- Page Table - A page table is a hardware-accessed data structure in main storage that is
maintained by the operating system. Page-table entries (PTEs) provide VA-to-RA translations.
Pages are protected areas of real memory. There is one page table per logical partition.
- Page Sizes: 4 KB or 16 MB
- Number of Page Tables: with hypervisor, one page table per logical partition; without hypervisor, one page table
- Table Structure: hashed page table(s) (HTAB) in memory.
PTE size is 16 bytes. A hash function translates VA bits (excluding the bits inside the page) to a PTEG index.
- HTAB min size = 256 KB.
- Primary Hash: XOR of two parts of the VA; 256 MB partial aliasing.
The hash value is used as the index, so a PTE does not need to contain the full VPN.
- Secondary Hash: one's complement of the Primary Hash
- PTE: virtual page number: 42 bits (65 - 12 - 11);
- PTE: real page number: 30 bits (42 - 12);
- PTE group (PTEG): 8 PTEs (128-byte in one cache line). PTE search sequence:
- Primary PTEG: PTE[0], PTE[2], PTE[4], PTE[6].
- Primary PTEG: PTE[1], PTE[3], PTE[5], PTE[7].
- Secondary PTEG: PTE[0], PTE[2], PTE[4], PTE[6].
- Secondary PTEG: PTE[1], PTE[3], PTE[5], PTE[7].
- Raise exception.
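The probe order above (primary PTEG even slots, then odd; then the secondary PTEG likewise) can be sketched as a generator. The exact VA field widths the 970 hashes come from the PowerPC architecture; hashing the 4 KB virtual page number as two halves here is an illustrative assumption:

```python
def pteg_search_order(va, htab_entries):
    """Yield (pteg_index, slot) pairs in the documented HTAB probe order.
    htab_entries = number of PTEGs (a power of two)."""
    vpn = va >> 12                              # drop the 12-bit page offset
    primary = ((vpn >> 11) ^ (vpn & 0x7FF)) % htab_entries
    secondary = (~primary) % htab_entries       # one's complement of the hash
    for pteg in (primary, secondary):
        for slot in (0, 2, 4, 6, 1, 3, 5, 7):   # even slots first, then odd
            yield pteg, slot
    # if no probe matched, the hardware raises an exception (page fault)
```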
- TLB size = 1024 items (unified). 4-WAY.
Hardware-based reload (from the L2 cache interface in order to ensure no L1 D-cache impact)
4 KB pages mode (Linux)
Size   | Latency      | Description
32 K   | 5            | ERAT + L1
512 K  | 13           | +8 (L2)
4 MB   | 27 + 170 ns  | +14 (ERAT miss -> TLB hit) + 170 ns (RAM)
...    | 27 + 170 ns  | +170 ns (TLB miss)
- 64-byte range cross penalty = 27 cycles.
- L2->L1 B/W (Parallel Random Read) = 4 cycles per cache line
- L2->L1 B/W (128 bytes step) = 5 cycles per cache line
- RAM Read B/W (Parallel Random Read) = 80 ns per cache line
- RAM Read B/W (4 Bytes step) = 1700 MB/s
- RAM Read B/W (128 Bytes step) = 2900 MB/s
- RAM Read B/W (128 Bytes step - pointer chasing) = 2100 MB/s
- RAM Write B/W (Linear) = 1100 MB/s
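The per-line latencies and MB/s figures above are two views of the same quantity and can be cross-checked with a little arithmetic (assuming MB means 10^6 bytes, which is a guess about the measurement units):

```python
def line_bw_mb_s(line_bytes, ns_per_line_val):
    """Convert a per-cache-line fetch time (ns) into sustained bandwidth (MB/s)."""
    return line_bytes / (ns_per_line_val * 1e-9) / 1e6

def ns_per_line(line_bytes, mb_s):
    """Inverse: time (ns) to transfer one cache line at a given bandwidth."""
    return line_bytes / (mb_s * 1e6) * 1e9

# 80 ns per 128-byte line (parallel random read) -> 1600 MB/s
random_read = line_bw_mb_s(128, 80)
# 2900 MB/s linear read -> roughly 44 ns per 128-byte line
linear_ns = ns_per_line(128, 2900)
```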
Pipeline
Branch misprediction penalty = 16 cycles.
Integer pipeline:
- 16 stages for most fixed-point register-to-register operations
- 18 stages for most load and store operations (assuming an L1 D-cache hit)
- 21 stages for most floating-point operations
- 19 stages for fixed-point, 22 stages for complex-fixed, and 25 stages for floating-point operations
in the vector arithmetic logic unit (VALU)
- 19 stages for vector permute operations
Links
PowerPC 970 at Wikipedia
IBM PowerPC 970FX RISC Microprocessor User's Manual