IBM POWER8
Configuration: IBM Power System S822: IBM POWER8 3690 MHz,
2 sockets (2 chips per DCM (socket), 5 cores per chip, 8 threads per core), 160 GB (DDR3 1600MHz).
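
The topology in the configuration line multiplies out to the machine totals. A minimal sketch of that arithmetic follows; the variable names are chosen here for illustration and do not come from any tool.

```python
# Topology of the tested S822, taken from the configuration line above.
sockets = 2
chips_per_socket = 2      # two chips per DCM (socket)
cores_per_chip = 5
threads_per_core = 8      # SMT8
ram_gb = 160

cores = sockets * chips_per_socket * cores_per_chip
hw_threads = cores * threads_per_core

print(f"{cores} cores, {hw_threads} hardware threads")
print(f"{ram_gb / cores:.0f} GB of RAM per core")
```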
- 362 mm² (6-core chip version), 22 nm, 15 layers, Cu, SOI
- L1 Data cache = 64 KB, 128 B/line, 8-WAY
- L1 Instruction cache = 32 KB, 8-WAY
- L2 cache = 512 KB per core, 128 B/line, 8-WAY
- L3 local cache (Fast-L3 Region cache) = 8 MB (eDRAM), 128 B/line, 8-WAY
- L3 cache = (8 MB * 5) per chip (eDRAM), made up of the per-core local L3 regions, including those of other cores, 128 B/line
- L4: Off chip: 16 MB memory buffer chip per channel, 8 chips per socket.
- RAM: Up to 8 high speed channels, each running up to 9.6 Gb/s for up to 230 GB/s sustained
- RAM: Up to 32 total DDR ports yielding 410 GB/s peak at the DRAM
- On-chip accelerators for encryption, compression, and random number generation
- POWER8 provides 8 SMT hardware threads per core and can be configured to run in SMT8, SMT4, SMT2, or SMT1 (ST) mode.
- Fetch: 8 instructions
- 16-entry link stack
- 256-entry count cache
- 8-wide in-order instruction dispatch
- 16 execution units:
  - 2 Fixed-point units
  - 2 Load/store units (can also execute simple fixed-point operations)
  - 2 Load units (can also execute simple fixed-point operations)
  - 4 Double-precision floating-point units
  - 2 Vector units, 128-bit VMX/AltiVec
  - 1 Crypto unit (AES)
  - 1 Branch unit
  - 1 Condition register unit
  - 1 Decimal floating-point unit
- Hardware data prefetching with 16 independent data streams.
- Issue queue (UQ): 4 * 16-entry queues
- 4x16 B L1<->Core reads or 1x16 B writes per cycle
- 64 B L2 -> L1 data bus and 16 B L2 <- L1
- 32 B L3 -> L4 data bus and 32 B L3 <- L4
- 10 Wide Issue, Out of Order Execution.
- D-ERAT : 48-entry : fully associative
- D-ERAT L2 : (128/256)-entry (the tests do not reveal this structure)
- L1 Data Cache Latency = 3 cycles for simple access via pointer
- L1 Data Cache Latency = 5 cycles for access with complex address calculation (size_t n, *p; n = p[n]).
- L2 Cache Latency = 12 cycles
- L3 Cache Latency = 27 cycles
- L3-REMOTE Cache Latency = 130 cycles
- RAM Latency = 27 cycles + 80 ns (including L3-GLOBAL access)
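
The latency figures above are for dependent loads of the form n = p[n]. One common way to build such a chain is to lay out a single random cycle over the buffer, so that every load depends on the previous one and a prefetcher cannot guess the next address. The sketch below uses Sattolo's algorithm for this. It is an assumption about the method, not the benchmark's actual code, and the function names are hypothetical. Python timings are dominated by interpreter overhead, so it illustrates only the access pattern, not the cycle counts.

```python
import random

def make_chain(n, seed=0):
    """Return p such that following i = p[i] visits all n slots in one cycle."""
    rng = random.Random(seed)
    p = list(range(n))
    # Sattolo's algorithm: swap each slot with a strictly earlier one,
    # which yields a permutation consisting of a single n-cycle.
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)      # 0 <= j < i
        p[i], p[j] = p[j], p[i]
    return p

def chase(p, steps, start=0):
    """Follow the chain; each index depends on the previously loaded value."""
    i = start
    for _ in range(steps):
        i = p[i]
    return i

def cycle_length(p, start=0):
    """Count steps until the chain returns to its starting slot."""
    i, count = p[start], 1
    while i != start:
        i = p[i]
        count += 1
    return count
```

Varying n moves the working set through the cache levels, which is how a size-versus-latency listing like the one on this page is typically produced.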
64 KB page mode (64-bit Linux)
- D-ERAT : 48-entry (fully associative) (covers 3 MB). Miss Penalty = 11 cycles. Parallel miss: 20 cycles per access
- ? D-ERAT L2 : (128/256)-entry : (covers ? MB) Miss Penalty = ? cycles.
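
The 3 MB coverage figure follows directly from the entry count and the page size. A minimal sketch of that arithmetic is below; `erat_reach` is a hypothetical helper name, and the 4 KB case is included only for comparison.

```python
KB = 1024
MB = 1024 * KB

def erat_reach(entries, page_size):
    """Bytes of address space covered by a fully populated ERAT."""
    return entries * page_size

reach_64k = erat_reach(48, 64 * KB)   # 64 KB pages, as measured here
reach_4k = erat_reach(48, 4 * KB)     # 4 KB pages, for comparison

print(reach_64k // MB, "MB covered with 64 KB pages")
print(reach_4k // KB, "KB covered with 4 KB pages")
```

This is consistent with the D-ERAT miss penalty first appearing at the 4 M working-set size, the first measured size above 3 MB.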
Size   Latency (cycles)  Increase     Description
64 K   3
128 K  8                 5            + 9 (L2)
256 K  10                2
512 K  12                2
1 M    20                8            + 15 (L3)
2 M    24                4
4 M    28                4            + 11 (D-ERAT miss)
8 M    33 + 5 ns         5 + 5 ns
16 M   36 + 19 ns        3 + 14 ns    + 80 ns (L3-G, RAM)
32 M   38 + 31 ns        2 + 12 ns
64 M   38 + 44 ns        + 13 ns
128 M  38 + 63 ns        + 19 ns
256 M  66 + 74 ns        28 + 11 ns   + 52 (translation cache miss)
512 M  78 + 81 ns        12 + 7 ns
1 G    86 + 89 ns        8 + 8 ns
2 G    90 + 99 ns        4 + 10 ns
4 G    90 + 109 ns       + 10 ns      + 80 ns (RAM)
8 G    90 + 128 ns       + 19 ns
16 G   90 + 177 ns       + 49 ns
32 G   90 + 228 ns       + 51 ns      + 80 ns (RAM)
64 G   90 + 300 ns       + 72 ns
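
The size/latency listing above mixes core cycles with nanoseconds. To compare rows on one scale, the cycle part can be converted at the 3690 MHz clock from the configuration line. The sketch below does this for a few rows; `total_ns` is a hypothetical helper, and it assumes the core runs at exactly the nominal clock.

```python
CLOCK_GHZ = 3.69   # 3690 MHz, from the configuration line (nominal clock assumed)

def total_ns(cycles, extra_ns=0.0, clock_ghz=CLOCK_GHZ):
    """Convert a 'cycles + ns' latency entry to nanoseconds."""
    return cycles / clock_ghz + extra_ns

# (size, cycles, ns) taken from rows of the listing above
rows = [
    ("64 K", 3, 0),
    ("512 K", 12, 0),
    ("4 M", 28, 0),
    ("128 M", 38, 63),
    ("64 G", 90, 300),
]

for size, cycles, ns in rows:
    print(f"{size:>6}: {total_ns(cycles, ns):6.1f} ns")
```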
- 128-byte range crossing penalty = 13 cycles
- Page range crossing penalty (4 KB pages) = 41 cycles
- Branch misprediction penalty = 19 cycles.
- Execution Latency = 2 cycles for simple dependent integer instructions !!!
- L1 B/W (Parallel Random Read) = 0.60 cycles per access
- L2 -> L1 B/W (Parallel Random Read) = 2.1 cycles per cache line (128 bytes)
- L2 -> L1 B/W (Read, 128 bytes step) = 2.2 cycles per cache line (128 bytes)
- L2 -> L1 B/W (Read, 128 bytes step, pointer chasing) = 13 cycles per cache line (128 bytes)
- L3-Local -> L1 B/W (Parallel Random Read) = 6 cycles per cache line (128 bytes)
- L3-Local -> L1 B/W (Read, 128 bytes step) = 6 cycles per cache line (128 bytes)
- L3-Local -> L1 B/W (Read, 128 bytes step, pointer chasing) = 23 cycles per cache line (128 bytes)
- L3-Global -> L1 B/W (Parallel Random Read) = 30 cycles per read
- L3-Global -> L1 B/W (Read, 128 bytes step) = 16 cycles per cache line (128 bytes)
- L3-Global -> L1 B/W (Read, 128 bytes step, pointer chasing) = 32 cycles per cache line (128 bytes)
- RAM Read B/W (Parallel Random Read) = 14 ns per read
- RAM Read B/W (Read, 8 Bytes step, pointer chasing) = 9 GB/s
- RAM Read B/W (Read, 64 Bytes step, pointer chasing) = 15 GB/s
- RAM Read B/W (Read, 8 Bytes step) = 18 GB/s
- RAM Read B/W (Read, 128 Bytes step) = 20 GB/s
- RAM Write B/W (8 Bytes step) = 25 GB/s
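
The cache figures above are given in cycles per 128-byte line, while the RAM figures are in GB/s. The sketch below converts both to GB/s for comparison. It assumes the nominal 3690 MHz clock, 1 GB = 10^9 bytes, and, for the parallel random RAM read, that each read transfers one full 128-byte line; the helper names are hypothetical.

```python
CLOCK_HZ = 3.69e9   # 3690 MHz, nominal clock assumed
LINE = 128          # cache line size in bytes

def gb_per_s_from_cycles(cycles_per_line, clock_hz=CLOCK_HZ, line_bytes=LINE):
    """Throughput implied by a cycles-per-line figure, in GB/s (10^9 bytes)."""
    return line_bytes / cycles_per_line * clock_hz / 1e9

def gb_per_s_from_ns(ns_per_line, line_bytes=LINE):
    """Throughput implied by a ns-per-line figure; bytes per ns equals GB/s."""
    return line_bytes / ns_per_line

print(f"L2 -> L1, 2.1 cycles/line:       {gb_per_s_from_cycles(2.1):6.1f} GB/s")
print(f"L3-Local -> L1, 6 cycles/line:   {gb_per_s_from_cycles(6):6.1f} GB/s")
print(f"L3-Global -> L1, 16 cycles/line: {gb_per_s_from_cycles(16):6.1f} GB/s")
print(f"RAM random read, 14 ns/read:     {gb_per_s_from_ns(14):6.1f} GB/s")
```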