IBM PowerPC 970
Configuration
IBM PowerPC 970FX.
- L1 Data cache = 32 KB:
- 128 B/line
- 2-WAY, LRU.
- EA index, RA tags
- Write-through, no-write-allocate
- 3-ports: 2 reads and 1 write every cycle (no banking).
- 2-cycle load-use penalty for FXU loads
- Dedicated 32-byte reload interface from L2 cache
- L1 Instruction cache = 64 KB:
- 128 B/line (4 * 32-byte sectors)
- Direct-mapped
- EA index, RA tags
- Dedicated 32-byte read/write interface from L2 cache with a critical sector first reload policy
- One read or one write per cycle
- 5 additional predecode bits per word to aid in fast decoding and group formation
- Parity protected with a force invalidate and reload on parity error
- L2 cache = 512 KB:
- 128 B/line
- 8-WAY, LRU.
- RA-indexed, RA-tagged.
- Write-back and write-allocate
- One read port, one write port (one read or write per cycle)
- Inclusive of L1 D-cache, not inclusive of L1 I-cache
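The cache parameters above determine each cache's set count and address split (sets = size / (line * ways)). A quick sanity check in Python (an illustrative sketch, not from the source):

```python
def cache_geometry(size_bytes, line_bytes, ways):
    """Return (sets, index_bits, offset_bits) for a set-associative cache."""
    sets = size_bytes // (line_bytes * ways)
    index_bits = sets.bit_length() - 1      # log2(sets); sets is a power of two
    offset_bits = line_bytes.bit_length() - 1
    return sets, index_bits, offset_bits

# L1 D: 32 KB, 128 B/line, 2-way  -> 128 sets
l1d = cache_geometry(32 * 1024, 128, 2)
# L1 I: 64 KB, 128 B/line, direct-mapped -> 512 sets
l1i = cache_geometry(64 * 1024, 128, 1)
# L2:   512 KB, 128 B/line, 8-way -> 512 sets
l2 = cache_geometry(512 * 1024, 128, 8)
```

Note that for the L1 D-cache, index + offset = 7 + 7 = 14 bits, which exceeds the 12-bit offset of a 4 KB page; the index therefore uses untranslated EA bits above the page boundary, which is consistent with the EA-index/RA-tag organization listed above.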
- 4-entry, 128-byte, instruction prefetch queue above the I-cache; hardware-initiated prefetches
- Fetch a 32-byte aligned block of eight instructions per cycle
- Prediction for up to two branches per cycle
- Branch prediction
- Scan all 8 fetched instructions for branches each cycle
- Predict up to 2 branches per cycle
- 3-table prediction structure - global / local / selector (16K entries x 1-bit each)
- 16-entry link stack for address prediction (with stack recovery)
- 32-entry count cache for address prediction (indexed by the address of Branch Conditional to Count Register (bcctr) instructions)
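The global/local/selector arrangement above is a tournament scheme: two independent predictors plus a selector table that learns which one to trust per branch. A minimal sketch with 16K one-bit entries per table; the index hashes and history length here are illustrative assumptions, not the 970's actual functions:

```python
SIZE = 16 * 1024  # 16K one-bit entries per table, as in the 970

class TournamentPredictor:
    """Sketch of a global/local/selector predictor with 1-bit entries."""
    def __init__(self):
        self.local = [0] * SIZE      # last outcome, indexed by branch address
        self.global_ = [0] * SIZE    # last outcome, indexed by address ^ history
        self.selector = [0] * SIZE   # 0 = trust local table, 1 = trust global
        self.history = 0             # global branch-history register

    def _idx_local(self, pc):
        return (pc >> 2) % SIZE

    def _idx_global(self, pc):
        return ((pc >> 2) ^ self.history) % SIZE

    def predict(self, pc):
        l = self.local[self._idx_local(pc)]
        g = self.global_[self._idx_global(pc)]
        return g if self.selector[self._idx_local(pc)] else l

    def update(self, pc, taken):
        li, gi = self._idx_local(pc), self._idx_global(pc)
        l, g = self.local[li], self.global_[gi]
        # When the tables disagree, move the selector toward the correct one.
        if l != g:
            self.selector[li] = 1 if g == int(taken) else 0
        self.local[li] = int(taken)
        self.global_[gi] = int(taken)
        self.history = ((self.history << 1) | int(taken)) % SIZE
```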
- Instruction decode and preprocessing
- 3-cycle pipeline to decode and preprocess instructions
- Dedicated dataflow for cracking one instruction into two internal operations
- Microcoded templates for longer emulation sequences of internal operations
- All internal operations expanded into 86-bit internal form to simplify subsequent processing and
explicitly expose register dependencies for all register pools
- Dispatch groups (up to 5 instructions: 4 + branch) formulated along with inter-instruction dependence masks
- 8-entry * 16 bytes instruction fetch buffer (up to 8 instructions in and 5 instructions out during each cycle)
- Instruction dispatch, sequencing, and completion control
- 4 dispatch buffers, which can hold up to 4 dispatch groups when the global completion table (GCT) is full
- 20-entry global completion table
- Instruction queuing resources
- 2 * 18-entry issue queues for fixed-point and load/store instructions
- 2 * 10-entry issue queues for floating-point instructions
- 12-entry issue queue for branch instructions
- 10-entry issue queue for CR-logical instructions
- 16-entry issue queue for vector permute instructions
- 20-entry issue queue for vector ALU instructions and vector stores
- Support for up to 16 predicted branches in flight
- Prediction support for branch direction and branch addresses
- In-order dispatch of up to five operations into the distributed issue queue structure
- Out-of-order issue of up to 10 operations into 10 execution pipelines:
- 2 load or store operations
- 2 fixed-point register-register operations
- 2 floating-point operations
- 1 branch operation
- 1 Condition Register operation
- 1 vector permute operation
- 1 vector ALU operation
- Register renaming
- Up to 215 instructions in flight.
- Up to 16 instructions in the instruction fetch unit (fetch buffer and overflow buffer)
- Up to 32 instructions in the instruction fetch buffer in the instruction decode unit
- Up to 35 instructions in three decode pipe stages and four dispatch buffers
- Up to 100 instructions in the inner-core (after dispatch)
- Up to 32 stores queued in the store queue (STQ) (available for forwarding)
- Fast, selective flush of incorrect speculative instructions and results
- Specific focus on storage latency management
- Out-of-order and speculative issue of load operations
- Support for up to 8 outstanding L1 cache line misses
- Hardware-initiated instruction prefetching from L2 cache
- Software-initiated data stream prefetching with support for up to 8 active streams
- Critical-word forwarding (critical sector first)
- New branch processing: prediction hints on branch instructions
- 32-entry store queue logically above the D-cache (real address based; content-addressable memory [CAM] structure).
Store addresses and store data can be supplied on different cycles;
supports store-to-load forwarding
- 32-entry load reorder queue (real address based; CAM structure)
- 8-entry load miss queue (LMQ) (real address based). Keeps track of loads that have missed in the L1 D-cache.
Allows a second load from the same cache line to merge onto a single entry.
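The store queue's role can be sketched as follows: a CAM lookup on the load address, with the youngest older store winning and forwarding its data. Real hardware matches real addresses and handles partial overlap, access sizes, and ordering violations (via the load reorder queue), none of which is modeled here; the 32-entry capacity mirrors the figure above:

```python
class StoreQueue:
    """Sketch of a CAM-style store queue with store-to-load forwarding."""
    CAPACITY = 32

    def __init__(self):
        self.entries = []  # (addr, data) pairs in program order

    def store(self, addr, data):
        # A full queue stalls dispatch in real hardware.
        assert len(self.entries) < self.CAPACITY, "store queue full"
        self.entries.append((addr, data))

    def load(self, addr, memory):
        # CAM lookup: youngest matching older store forwards its data.
        for a, d in reversed(self.entries):
            if a == addr:
                return d
        return memory.get(addr, 0)    # no match: read the cache/memory

    def commit_oldest(self, memory):
        # Retire the oldest store to the memory hierarchy.
        a, d = self.entries.pop(0)
        memory[a] = d
```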
- Virtual Address (VA): 65 bits.
- Real Address (RA): 42 bits.
- 128-entry D-ERAT, 2-WAY
- 128-entry I-ERAT, 2-WAY
- 64-entry SLB, fully associative. An SLB miss results in an interrupt and a software reload of the SLB
- Page Table - A page table is a hardware-accessed data structure in main storage that is
maintained by the operating system. Page-table entries (PTEs) provide VA-to-RA translations.
Pages are protected areas of real memory. There is one page table per logical partition.
- Page Sizes: 4 KB or 16 MB
- Number of Page Tables: with hypervisor, one page table per logical partition; without hypervisor, one page table
- Table Structure: hashed page table(s) (HTAB) in memory.
PTE size is 16 bytes. A hash function translates VA bits (excluding the bits inside the page) to a PTEG index.
- HTAB min size = 256 KB.
- Primary Hash: XOR of two parts of the VA; 256 MB partial aliasing.
The hash value is used as the index, so a PTE does not need to contain the full VPN.
- Secondary Hash: one's complement of the Primary Hash
- PTE: virtual page number: 42 bits (65 - 12 - 11);
- PTE: real page number: 30 bits (42 - 12);
- PTE group (PTEG): 8 PTEs (128-byte in one cache line). PTE search sequence:
- Primary PTEG: PTE[0], PTE[2], PTE[4], PTE[6].
- Primary PTEG: PTE[1], PTE[3], PTE[5], PTE[7].
- Secondary PTEG: PTE[0], PTE[2], PTE[4], PTE[6].
- Secondary PTEG: PTE[1], PTE[3], PTE[5], PTE[7].
- Raise exception.
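The probe order above (primary PTEG even slots, then odd; then the secondary PTEG likewise) can be sketched as a generator. The exact VA field widths the 970 hashes come from the PowerPC architecture; hashing the 4 KB virtual page number as two halves here is an illustrative assumption:

```python
def pteg_search_order(va, htab_entries):
    """Yield (pteg_index, slot) pairs in the documented HTAB probe order.
    htab_entries = number of PTEGs (a power of two)."""
    vpn = va >> 12                              # drop the 12-bit page offset
    primary = ((vpn >> 11) ^ (vpn & 0x7FF)) % htab_entries
    secondary = (~primary) % htab_entries       # one's complement of the hash
    for pteg in (primary, secondary):
        for slot in (0, 2, 4, 6, 1, 3, 5, 7):   # even slots first, then odd
            yield pteg, slot
    # if no probe matched, the hardware raises an exception (page fault)
```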
- TLB size = 1024 items (unified). 4-WAY.
Hardware-based reload (from the L2 cache interface in order to ensure no L1 D-cache impact)
4 KB pages mode (Linux)
Size   | Latency      | Description
32 K   | 5            | ERAT + L1
512 K  | 13           | +8 (L2)
4 MB   | 27 + 170 ns  | +14 (ERAT miss -> TLB hit) + 170 ns (RAM)
...    | 27 + 170 ns  | +170 ns (TLB miss)
- 64-byte range cross penalty = 27 cycles.
- L2->L1 B/W (Parallel Random Read) = 4 cycles per cache line
- L2->L1 B/W (128 bytes step) = 5 cycles per cache line
- RAM Read B/W (Parallel Random Read) = 80 ns per cache line
- RAM Read B/W (4 Bytes step) = 1700 MB/s
- RAM Read B/W (128 Bytes step) = 2900 MB/s
- RAM Read B/W (128 Bytes step - pointer chasing) = 2100 MB/s
- RAM Write B/W (Linear) = 1100 MB/s
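The per-line latencies and MB/s figures above are two views of the same quantity and can be cross-checked with a little arithmetic (assuming MB means 10^6 bytes, which is a guess about the measurement units):

```python
def line_bw_mb_s(line_bytes, ns_per_line_val):
    """Convert a per-cache-line fetch time (ns) into sustained bandwidth (MB/s)."""
    return line_bytes / (ns_per_line_val * 1e-9) / 1e6

def ns_per_line(line_bytes, mb_s):
    """Inverse: time (ns) to transfer one cache line at a given bandwidth."""
    return line_bytes / (mb_s * 1e6) * 1e9

# 80 ns per 128-byte line (parallel random read) -> 1600 MB/s
random_read = line_bw_mb_s(128, 80)
# 2900 MB/s linear read -> roughly 44 ns per 128-byte line
linear_ns = ns_per_line(128, 2900)
```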
Pipeline
Branch misprediction penalty = 16 cycles.
Integer pipeline:
- 16 stages for most fixed-point register-to-register operations
- 18 stages for most load and store operations (assuming an L1 D-cache hit)
- 21 stages for most floating-point operations
- 19 stages for fixed-point, 22 stages for complex-fixed, and 25 stages for floating-point operations
in the vector arithmetic logic unit (VALU)
- 19 stages for vector permute operations
Links
PowerPC 970 at Wikipedia
IBM PowerPC 970FX RISC Microprocessor User's Manual