Sony Cell
Configuration
Sony Cell. 256 MB XDR DRAM (400 MHz).
235M-250M transistors, 235 mm2, 90 nm CMOS SOI, 8 metal layers, FC-PBGA.
Power Processor Element (PPE)
- L1 Data cache = 32 KB:
- 128 B/line, physically implemented with 32-byte sectors.
- 4-WAY, pseudo-LRU replacement policy.
- Write-through, no-write-allocate.
- 8-entry miss queue, shared between threads
- 16-entry store queue, shared between threads
- Nonblocking (hit-under-miss): later loads can hit out of order while a load miss is outstanding.
- EA-indexed, RA-tagged.
- Supports snooping.
- One read port, one write port (one read or write per cycle)
- Contents fully included in the L2 cache
- Parity protected.
- If it is later determined that the load missed the L1 DCache, any instructions that are dependent on the
load are flushed, refetched, and held at dispatch until the load data has been returned.
- Load-miss queue (LMQ) (shared by 2 threads)
- Instruction prefetch request queue (IPFQ) 2 entries. (shared by 2 threads)
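The 4-way pseudo-LRU replacement mentioned above is commonly implemented as a 3-bit binary tree per set (one root bit plus one bit per pair of ways). The page doesn't give the exact PPE circuit, so this is a generic tree-PLRU sketch:

```python
class TreePLRU4:
    """3-bit tree pseudo-LRU for one 4-way set.
    Each bit points toward the less-recently-used half of its subtree."""

    def __init__(self):
        # bits[0] = root (0 -> left pair is LRU side, 1 -> right pair),
        # bits[1] = left pair (ways 0/1), bits[2] = right pair (ways 2/3).
        self.bits = [0, 0, 0]

    def touch(self, way):
        """On an access, flip the path bits to point AWAY from `way`."""
        self.bits[0] = 1 if way < 2 else 0
        if way < 2:
            self.bits[1] = 1 - (way & 1)
        else:
            self.bits[2] = 1 - (way & 1)

    def victim(self):
        """Follow the bits to the pseudo-least-recently-used way."""
        if self.bits[0] == 0:
            return 0 + self.bits[1]
        return 2 + self.bits[2]
```

Touching ways 0, 1, 2, 3 in order leaves way 0 as the victim, as true LRU would; after then touching way 0 the tree evicts way 2 rather than way 1, which is the "pseudo" part of the policy.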
- L1 Instruction cache = 32 KB:
- 128 B/line
- 2-WAY
- EA index, RA tags
- Parity protected (1 parity bit per byte).
- 3-cycle latency: 1) address decode; 2) array access; 3) parity checking, data select, and way select.
- Reads 16 bytes / cycle.
- Out-of-order completion of cache-miss load instructions (otherwise in-order execution).
- Contents not guaranteed to be included in the L2 cache
- L2 cache = 512 KB: PowerPC Processor Storage Subsystem:
- 128 B/line,
- 128-byte coherence granularity
- 8-WAY, can be programmed to operate as True binary LRU,
pseudo-LRU with replacement management table (RMT) locking, or direct-mapped.
- RA-indexed, RA-tagged.
- Write-back and write-allocate
- One read port, one write port (one read or write per cycle)
- MERSI with a separate directory for snoops.
- Made up of 4 identical 1024*8*140 macros.
- Clocked at half the global clock rate.
- ECC
- Way select occurs before array access, so only 1/8 of any macro is activated at once.
- 140-bit words can be written in a pipelined fashion in two cycles,
and 280-bit double words can be read in three to four cycles
(three for the first 140-bit word, one more cycle for the next 140-bit word).
- There is a reload/store-miss queue, castout queue, store queue, and snoop intervention/push queue.
- single-port read/write interface to main storage that
supports 8 software-managed data-prefetch streams.
- includes the contents of the L1 data cache, but is not guaranteed to contain
the contents of the L1 instruction cache
- provides fully coherent symmetric multiprocessor (SMP) support.
- 32-byte load port: shared by MMU, L1 Icache, L1 Dcache,
- 16-byte store port: shared by MMU and L1 Dcache.
- interface between the PPSS and EIB supports 16-byte load and 16-byte store buses.
- 4-entry, 128-byte snoop intervention/push queue
- 8-entry store queue (STQ) in the L2, 64 bytes per entry.
- 6-entry reload miss queue (RMS), one 128-byte cache line per entry.
- 6-entry castout queue, one 128-byte cache line per entry.
- Data is returned from the L2 in 32-byte beats on four consecutive cycles.
The first cycle contains the critical section of data, which is sent directly to the register file.
- 4 instructions fetched per cycle.
- 2 * (4 KB 2-bit BHT with 6-bit global history) (1 per thread).
- 2 * (4 instructions * 5 entries, Instruction buffer - IBuf) (1 per thread)
- 2 * (4 entries Link stack) (1 per thread)
- 2-issue, In-order, 2-threads (SMT) 64-bit Power core.
- Units: L/S, ALU, BRANCH; 2 * VMX / 2 * FPU.
- 2 64-bit float operations / cycle using a scalar-fused multiply-add instruction (6.4 GFLOPS / 3.2 GHz).
- 8 32-bit float operations / cycle using a vector fused-multiply-add instruction (25.6 GFLOPS / 3.2 GHz).
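Both peak figures above are just flops-per-cycle times clock rate, with a fused multiply-add counted as 2 flops per lane. A quick arithmetic check (nothing here is measured; it only reproduces the stated numbers):

```python
clock_hz = 3.2e9  # 3.2 GHz

# Scalar FPU: one 64-bit fused multiply-add per cycle = 2 flops/cycle.
scalar_gflops = 2 * 1 * clock_hz / 1e9

# VMX: one 4-wide single-precision FMA per cycle = 8 flops/cycle.
vector_gflops = 2 * 4 * clock_hz / 1e9

print(scalar_gflops, vector_gflops)  # 6.4 25.6
```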
- 8 Synergistic Processor Elements (SPE)
- Dual XDRAM Memory controller: 25.6 GB/s @ 3.2 GHz.
- It can load 32 bytes and store 16 bytes, independently and memory-coherently, per processor cycle.
- 11 FO4 delay per pipeline stage.
- Effective Address (EA): 64 bits: low 28 bits = offset in segment; high 36 bits = ESID (effective segment ID).
- Virtual Address (VA): 65 bits (15 high bits of full 80-bit PowerPC VA are zeros).
- Real Address (RA): 42 bits.
- EA -> RA Translation:
- EA -> ERAT hit -> RA
- EA -> ERAT miss -> SLB -> VA -> TLB (or Page-Table lookup) -> RA.
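The two paths above can be sketched as follows. Only the field widths (28-bit segment offset, 4-KB ERAT granularity, 12-bit page offset) come from this page; the function name and the dict-based ERAT/SLB/TLB/page-table stand-ins are illustrative, not the hardware structures:

```python
def translate(ea, erat, slb, tlb, page_table):
    """EA -> RA on the PPE: fast path through the ERAT, slow path through
    the SLB (EA -> VA) and then the TLB or page table (VA -> RA).
    The four mappings are hypothetical dict stand-ins for hardware arrays."""
    page = ea >> 12                        # ERAT caches 4-KB translations
    if page in erat:                       # ERAT hit: done
        return (erat[page] << 12) | (ea & 0xFFF)
    esid = ea >> 28                        # high 36 bits of the EA
    vsid = slb[esid]                       # SLB: ESID -> 37-bit VSID
    va = (vsid << 28) | (ea & 0xFFFFFFF)   # 65-bit virtual address
    vpn = va >> 12
    rpn = tlb[vpn] if vpn in tlb else page_table[vpn]  # TLB, else table walk
    erat[page] = rpn                       # refill the ERAT
    return (rpn << 12) | (ea & 0xFFF)
```

The first lookup for an address takes the slow path and refills the ERAT; a repeat lookup for the same 4-KB block then hits the ERAT directly.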
- Effective-to-Real-Address Translation (ERAT) Buffers.
Each ERAT entry contains recent EA-to-RA translations for a 4-KB block of main storage,
even if this block of storage is translated using a large page
- 64-entry D-ERAT, shared by 2 threads (2 way x 32).
- 64-entry I-ERAT, shared by 2 threads (2 way x 32).
- Segment Lookaside Buffer (SLB) - Two unified (instruction and data), 64-entry caches, one
per PPE thread, that provide EA-to-VA translations.
Segments are protected areas of virtual memory.
- 2 * 64-entry (64 per thread).
- 2^37 segments of 256 MB each.
- VSID uses only 37 bits.
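The segment sizing above is self-consistent with the 65-bit VA; a one-line arithmetic check:

```python
# 256-MB segments -> 28-bit in-segment offset; the 37-bit VSID selects the segment.
offset_bits = (256 * 2**20).bit_length() - 1   # 28
vsid_bits = 37
print(offset_bits, offset_bits + vsid_bits)    # 28 65 -> matches the 65-bit VA
```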
- Page Table - A page table is a hardware-accessed data structure in main storage that is
maintained by the operating system. Page-table entries (PTEs) provide VA-to-RA translations.
Pages are protected areas of real memory. There is one page table per logical partition.
- Page Sizes: 4 KB, plus two of the following large-page sizes: 64 KB, 1 MB, 16 MB
- Number of Page Tables: With hypervisor : one page table per logical partition. Without hypervisor: one page table
- Table Structure: HTAB Hashed page table(s) in memory.
PTE size is 16-bytes: Hash function translates VA bits (excluding bits inside page) to PTEG index.
- HTAB min size = 256 KB.
- Primary Hash: XOR of two parts of the VA (256-MB partial aliasing).
This hash value is used as the index, so a PTE doesn't need to contain the full VPN.
- Secondary Hash: the one's complement of the Primary Hash.
- PTE: virtual page number: 42 bits (65 - 12 - 11);
- PTE: real page number: 30 bits (42 - 12);
- PTE group (PTEG): 8 PTEs (128-byte in one cache line). PTE search sequence:
- Primary PTEG: PTE[0], PTE[2], PTE[4], PTE[6].
- Primary PTEG: PTE[1], PTE[3], PTE[5], PTE[7].
- Secondary PTEG: PTE[0], PTE[2], PTE[4], PTE[6].
- Secondary PTEG: PTE[1], PTE[3], PTE[5], PTE[7].
- Raise exception.
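The search sequence above, together with the primary/secondary hash, sketches out as follows. The hash field split is simplified (the real PowerPC hash XORs specific VSID and page-index fields), and `htab` is a hypothetical stand-in for the in-memory table:

```python
def hash_primary(vpn):
    # Primary hash: XOR of two parts of the virtual page number
    # (a simplification of the architected field split).
    return (vpn >> 16) ^ (vpn & 0xFFFF)

def find_pte(vpn, htab):
    """Search order from the list above: even slots of the primary PTEG,
    then its odd slots, then the same for the secondary PTEG.
    htab is a hypothetical list of PTEGs; each PTEG is a list of 8
    (vpn, rpn) entries or None. len(htab) must be a power of two."""
    mask = len(htab) - 1
    primary = hash_primary(vpn) & mask
    secondary = ~hash_primary(vpn) & mask   # one's complement of primary hash
    for pteg in (primary, secondary):
        for slot in (0, 2, 4, 6, 1, 3, 5, 7):
            pte = htab[pteg][slot]
            if pte is not None and pte[0] == vpn:
                return pte[1]
    raise LookupError("translation fault")  # no PTE found: raise exception
```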
- TLB size = 1024 items (unified, shared by 2 threads). 4-WAY.
3-bit Pseudo-LRU replacement policy.
It stores recently accessed PTEs.
A hash function translates some VPN bits (a subset of the VA) to the 8-bit TLB index (set or row), per page size:
- 4 KB: low 4 VPN bits, (next 4 bits) ^ (bits 24-27 of VA);
- 64 KB: low 4 bits, (next 4 bits) ^ (next 4 bits);
- 1 MB: low 8 bits
- 16 MB: low 8 bits
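The per-size index rules above can be written out explicitly. The bit positions are interpreted from that list (LSB-first numbering and the exact "bits 24-27 of VA" placement are assumptions):

```python
def tlb_index(va, page_shift):
    """8-bit TLB set index (256 sets = 1024 entries / 4 ways).
    Low 4 index bits come from the low VPN bits; the high 4 bits are
    the next VPN bits, XORed per the page-size rules listed above."""
    vpn = va >> page_shift
    if page_shift == 12:                                 # 4 KB pages
        lo = vpn & 0xF
        hi = ((vpn >> 4) & 0xF) ^ ((va >> 24) & 0xF)     # ^ VA bits 24-27
        return (hi << 4) | lo
    if page_shift == 16:                                 # 64 KB pages
        lo = vpn & 0xF
        hi = ((vpn >> 4) & 0xF) ^ ((vpn >> 8) & 0xF)     # ^ next 4 VPN bits
        return (hi << 4) | lo
    return vpn & 0xFF                                    # 1 MB / 16 MB pages
```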
4 KB pages mode (Linux):

Size  | Latency     | Description
32 K  | 5           | ERAT + L1
256 K | 41          | +36 (L2)
512 K | 65          | +24 (ERAT miss -> TLB hit)
4 MB  | 65 + 120 ns | +120 ns (RAM)
...   | 65 + 240 ns | +120 ns (TLB miss)
- 32-byte boundary cross penalty = 55 cycles. When such a misaligned load or store first
attempts to access the L1 DCache, the misalignment is detected and the pipeline is flushed.
The flushed load or store is then refetched, converted to microcode at the decode stage,
and split into the appropriate loads or stores, as well as any instructions needed to merge
the values together into a single register.
- L2->L1 B/W (Parallel Random Read) = 10 cycles per cache line
- L2->L1 B/W (128 bytes step) = 42 cycles per cache line
- L2 Write (linear) = 2.36 cycles per 4 bytes
- L2 Write (128 bytes step) = 14 cycles per write (cache line)
- RAM Read B/W (Parallel Random Read) = 50 ns per cache line = 2500 MB/s
- RAM Read B/W (8 Bytes step) = 700 MB/s
- RAM Read B/W (128 Bytes step) = 960 MB/s
- RAM Read B/W (128 Bytes step - pointer chasing) = 960 MB/s
- RAM Write B/W (Linear) = 3400 MB/s
Pipeline
Branch misprediction penalty = 24 cycles.
Integer pipeline:

#  | Description | Stage | Stage2
1  | Cache       | IC1   |
2  |             | IC2   |
3  |             | IC3   |
4  |             | IC4   |
5  | Buffer      | IB1   | BP1
6  |             | IB2   | BP2
7  | Decode      | ID1   | BP3
8  |             | ID2   | BP4
9  |             | ID3   |
10 | Issue       | IS1   |
11 |             | IS2   |
12 |             | IS3   |
13 | Delay       | .     | RF1
14 |             | .     | RF2
15 |             | .     | MEM1
16 | Register    | RF1   | MEM2
17 |             | RF2   | MEM3
18 | Execute     | EX1   | MEM4
19 |             | EX2   | MEM5
20 |             | EX3   | .
21 |             | EX4   | .
22 |             | EX5   | .
23 | Writeback   | WB    | WB
- FPU 64-bit instruction latency: 10 cycles.
- Integer instruction latency: 2 cycles.
- Integer MUL latency: 11? cycles.
- Memory Load latency: 5 cycles.
Links
Cell at Wikipedia
Cell Broadband Engine resource center at IBM
Cell Broadband Engine downloads at Sony