ARM Cortex-A15

Samsung Exynos 5250

Samsung Exynos 5250 1.7 GHz, dual-core ARM Cortex-A15 + ARM Mali-T604 GPU, 32 nm HKMG. 2 GB DDR3L-1600 32-bit, 2-port, 12.8 GB/s. Samsung Chromebook (Samsung XE303C12).
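
The 12.8 GB/s figure is consistent with the memory configuration quoted above. A minimal sketch of the arithmetic, assuming "DDR3L-1600" means 1600 mega-transfers per second per port and that both 32-bit ports transfer in parallel (the function name is illustrative, not from the source):

```python
def peak_dram_bandwidth_gbs(transfers_per_sec, bus_width_bits, ports):
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits // 8
    return transfers_per_sec * bytes_per_transfer * ports / 1e9

# Assumed reading of the spec line: 1600 MT/s, 32-bit bus, 2 ports.
exynos_5250_peak = peak_dram_bandwidth_gbs(1600e6, 32, 2)
print(f"{exynos_5250_peak:.1f} GB/s")
```

This is a theoretical ceiling; the measured RAM read and write figures further down are well below it.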

4 KB page mode

L1 Data cache = 32 KB, 64 B/line, 2-WAY, LRU
L2 Cache = 1 MB, 64 B/line, 16-WAY
L1 Data TLB (for loads): 32 entries, fully associative
L2 TLB: 512 entries, 4-WAY
PDE Cache: 16 entries (one entry per 1 MB of virtual space)
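
The sizes above fix how many sets each cache has and how much address space each translation structure covers with 4 KB pages. A small sketch of that arithmetic (helper names are illustrative):

```python
KB = 1024
MB = 1024 * KB

def cache_sets(size_bytes, line_bytes, ways):
    """Number of sets in a set-associative cache."""
    return size_bytes // (line_bytes * ways)

def reach(entries, bytes_per_entry):
    """Address space covered by a translation structure."""
    return entries * bytes_per_entry

l1d_sets = cache_sets(32 * KB, 64, 2)    # L1 data cache
l2_sets = cache_sets(1 * MB, 64, 16)     # L2 cache
l1_tlb_reach = reach(32, 4 * KB)         # L1 data TLB, 4 KB pages
l2_tlb_reach = reach(512, 4 * KB)        # L2 TLB, 4 KB pages
pde_reach = reach(16, 1 * MB)            # PDE cache, 1 MB per entry

print(l1d_sets, l2_sets)
print(l1_tlb_reach // KB, "KB,", l2_tlb_reach // MB, "MB,", pde_reach // MB, "MB")
```

The three reach values (128 KB, 2 MB, 16 MB) are the working-set sizes beyond which each structure starts to miss.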

L1 Data Cache Latency = 4 cycles
L2 Cache Latency = 21 cycles
RAM Latency = 21 cycles + 110 ns
L1 TLB Miss Penalty = 12 cycles
L2 TLB Miss Penalty = 30 cycles
PDE Cache Miss Penalty = 20 cycles

Size	Latency (cycles)	Increase (cycles)	Description
32 K	4	0	TLB + L1
64 K	13	9	+ 17 (L2)
128 K	17	4	+ 17 (L2)
256 K	25	8	+ 12 (L1 TLB miss)
512 K	29	4
1 M	31	2
2 M	32 + 55 ns	1 + 55 ns	+ 110 ns (RAM)
4 M	48 + 82 ns	16 + 27 ns	+ 30 (L2 TLB miss)
8 M	56 + 96 ns	8 + 14 ns
16 M	59 + 103 ns	4 + 7 ns
32 M	71 + 108 ns	12 + 5 ns	+ 20 (PDE cache miss)
64 M	77 + 115 ns	6 + 7 ns	+ 20 (PDE cache miss)
128 M	80 + 125 ns	3 + 10 ns	+ 100 ns (page walk to RAM)
256 M	82 + 140 ns	2 + 15 ns
512 M	83 + 160 ns	1 + 20 ns
1024 M	83 + 185 ns	25 ns
2048 M	83 + 210 ns	25 ns
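
The shape of the table can be approximated from the cache sizes and penalties listed earlier. The sketch below is a simplified model, not the method used to produce the measurements: it assumes uniformly random accesses, so that a structure covering R bytes misses with probability 1 - R/W for a working set of W bytes, and it ignores the page walk to RAM.

```python
KB = 1024
MB = 1024 * KB

L1_HIT_CYCLES = 4
# (bytes covered, miss penalty in cycles), taken from the figures above.
CYCLE_PENALTIES = [
    (32 * KB, 17),    # L1 data cache miss, served by L2 (21 - 4 cycles)
    (128 * KB, 12),   # L1 TLB miss (32 entries x 4 KB)
    (2 * MB, 30),     # L2 TLB miss (512 entries x 4 KB)
    (16 * MB, 20),    # PDE cache miss (16 entries x 1 MB)
]
L2_SIZE = 1 * MB
RAM_PENALTY_NS = 110

def miss_fraction(covered, working_set):
    """Assumed miss probability for uniformly random accesses."""
    return max(0.0, 1.0 - covered / working_set)

def expected_latency(working_set):
    """Return (cycles, ns) for a random access within working_set bytes."""
    cycles = L1_HIT_CYCLES + sum(
        penalty * miss_fraction(covered, working_set)
        for covered, penalty in CYCLE_PENALTIES
    )
    ns = RAM_PENALTY_NS * miss_fraction(L2_SIZE, working_set)
    return cycles, ns

for size in (64 * KB, 256 * KB, 4 * MB, 32 * MB):
    cycles, ns = expected_latency(size)
    print(f"{size // KB:6d} K  {cycles:5.1f} cycles + {ns:5.1f} ns")
```

Under these assumptions the model stays within about 1 cycle and 2 ns of the measured rows from 32 K to 32 M. For larger sizes it underestimates the ns part, because the page walk to RAM is not modeled.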

64-byte range crossing penalty = 1 cycle
The CPU can process several L1 TLB misses concurrently.
The CPU apparently cannot process several L2 TLB misses concurrently (not confirmed).
L2 TLB miss penalty is 10-30 cycles (possibly 10 cycles for a page walk served from L1 cache and 30 cycles for one served from L2 cache).
L1 B/W (Parallel Random Read) = 1 cycle per access
L2->L1 B/W (Parallel Random Read) = 9 cycles per cache line, 12.0 GB/s
L2->L1 B/W (Read, 4 bytes step) = 2 cycles per access, 3.4 GB/s
L2->L1 B/W (Read, 64 bytes step) = 8 cycles per cache line, 13.6 GB/s
L2->L1 B/W (Read, 64 bytes step, pointer-chasing) = 20 cycles per cache line, 5.4 GB/s
L2->L1 B/W (Read, 128+ bytes step, pointer-chasing) = 23 cycles per cache line
RAM Read B/W (Parallel Random Read) = 40 ns / cache line (including L2 TLB miss)
RAM Read B/W (Read, 4 Bytes step) = 2.6 GB/s
RAM Read B/W (Read, 64 Bytes step) = 4.2 GB/s
RAM Read B/W (Read, 8-64 Bytes step, pointer-chasing) = 33 ns per cache line, 1.9 GB/s (hardware prefetch)
RAM Read B/W (Read, 128+ Bytes step, pointer-chasing) = 95 ns per cache line, 0.7 GB/s (no hardware prefetch)
L2 Write (64 Bytes step) = 8 cycles per write (64-byte cache line), 13.6 GB/s
RAM Write (4 Bytes step) = 1.1 cycles per write, 6.0 GB/s
RAM Write (64 Bytes step) = 38 ns per write (64-byte cache line), 1.6 GB/s
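
The GB/s figures above follow from the per-access timings at the 1.7 GHz clock. A short sketch of both conversions (function names are illustrative; 1 GB is taken as 1e9 bytes, so bytes per nanosecond equals GB/s):

```python
CLOCK_GHZ = 1.7   # Exynos 5250 core clock from the system description
LINE_BYTES = 64   # cache line size

def gbs_from_cycles(bytes_per_access, cycles_per_access, clock_ghz=CLOCK_GHZ):
    """Bandwidth in GB/s when each access takes a given number of cycles."""
    return bytes_per_access * clock_ghz / cycles_per_access

def gbs_from_ns(bytes_per_access, ns_per_access):
    """Bandwidth in GB/s when each access takes a given number of ns."""
    return bytes_per_access / ns_per_access

print(f"{gbs_from_cycles(LINE_BYTES, 8):.2f}")   # L2->L1, 64 bytes step
print(f"{gbs_from_cycles(LINE_BYTES, 20):.2f}")  # L2->L1, pointer-chasing
print(f"{gbs_from_cycles(4, 2):.2f}")            # L2->L1, 4 bytes step
print(f"{gbs_from_ns(LINE_BYTES, 33):.2f}")      # RAM, with hardware prefetch
print(f"{gbs_from_ns(LINE_BYTES, 95):.2f}")      # RAM, no hardware prefetch
```

Some listed values differ slightly from this arithmetic (9 cycles per line gives about 12.1 GB/s, 38 ns per line about 1.7 GB/s), which suggests the timings or the GB/s values were rounded or truncated.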

ARM Cortex-A15 core

L1 Data cache = 32 KB. 2-Way, LRU. 64 B/line, PIPT. ECC protection per 32 bits. Maximum of 16 outstanding misses with 6 linefill buffers.
L1 Instruction cache = 32 KB, 2-way, LRU, 64 B/line, PIPT. Parity protection per 16 bits.
L2 cache = 512 KB - 4 MB, 64 B/line, 16-way:
- Sequential TAG and Data RAM access.
- Programmable RAM latencies.
- 4 independent Tag banks handle multiple requests in parallel
- Integrated Snoop Control Unit into L2 pipeline
- Direct data transfer line migration supported from cpu to cpu
- Full AMBA4 system coherency support on 128-bit master interface
- 64/128 bit AXI3 slave interface for ACP
- Full ECC capability
- Automatic data prefetching into L2 cache for load streaming
L1 Data TLB: 2 separate 32-entry fully-associative TLBs, one for data loads and one for stores. Each TLB caches Virtual Address (VA) to Physical Address (PA) mappings at 4 KB granularity only. If the page tables map a memory region at a granularity larger than 4 KB, the TLB allocates only one mapping, for the particular 4 KB region to which the current access corresponds.
L1 instruction TLB: 32-entry fully-associative structure. 4 KB granularity of VA to PA mapping.
L2 TLB: 512 entries, 4-way. Supports all the VMSAv7 page sizes of 4 KB, 64 KB, 1 MB and 16 MB, in addition to the LPAE page sizes of 2 MB and 1 GB. Includes walking cache structures.
1 TB physical addressing
Loop buffer: 32 entries, up to 2 forward branches and 1 backward branch. Completely disables the Fetch and Decode stages of the pipeline.
ECC on L1 and L2. Single error correct, 2 error detect. Protects 32 bits for L1, 64 bits for L2.
128-bit AMBA 4 interface with coherency extensions.
Fetch: 128-bit datapath (4-8 instructions). Full support for unaligned fetch address.
Decode: 3 instructions
NEON/VFP issue: 2 instructions (out-of-order)
Integer issue: 4 instructions (out-of-order).
Load/store issue: 1 load + 1 store. 128-bit datapath. (Partial out-of-order).
microBTB: 64 entries, fully associative. Caches taken branches only. Overruled by the main predictor when they disagree.
BTB (Branch Target Buffer): ? entries
GHB (Global History Buffer): 3 arrays: Taken array, Not taken array, and Selector.
Indirect predictor: 256-entry BTB indexed by XOR of history and address. Multiple target addresses allowed per address.
Return Stack: ?
2 Register rename tables: for ARM and Extended (NEON) registers.
Result queue: queue of renamed register results pending update to the register file. Shared by both ARM and Extended register results.
Dispatch Unit:
- 40-entry Commit queue in the dispatch unit
- Speculative result queues:
  - 128 entry main result queue
  - 24 entry Flag result queue for holding speculative flag register updates
Execution Clusters:
- Simple cluster: 2 ALUs, 2 shifters (in parallel, includes v6-SIMD)
- Complex cluster: NEON and Floating Point:
  - Dual issue queues of 8 entries each.
  - 2 operations per cycle.
  - Includes support for quad FMAC per cycle.
- Branch cluster: All operations that have the PC as a destination
- Multiply and Divide cluster: All ARM multiply and Integer divide operations.
- Load/Store cluster:
  - 1 Load and 1 Store executed per cycle.
  - Loads issue out-of-order but cannot bypass stores
  - Stores issue in order, but only require address sources to issue.
  - 16 entry issue queue (for ARM and NEON/memory operation)
  - 4 stage load pipeline
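
For the ECC item listed above (single error correct, 2 error detect, over 32 bits for L1 and 64 bits for L2), the storage overhead can be estimated. The source does not say which code the Cortex-A15 uses; the sketch below assumes a standard extended Hamming (SECDED) code, and the function name is illustrative:

```python
def secded_check_bits(data_bits):
    """Check bits needed by an extended Hamming (SECDED) code.

    Assumption: extended Hamming code; the actual Cortex-A15 code is not
    stated in the source.
    """
    r = 0
    # Smallest r with 2**r >= data_bits + r + 1 allows single-error correction.
    while (1 << r) < data_bits + r + 1:
        r += 1
    # One extra overall parity bit adds double-error detection.
    return r + 1

for width in (32, 64):
    check = secded_check_bits(width)
    print(f"{width} data bits: {check} check bits, "
          f"{100 * check / width:.1f}% overhead")
```

Under that assumption, protecting 64-bit words costs proportionally less storage than protecting 32-bit words.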

Pipeline

Integer pipeline:

#	Name	Stage
1	F0	Fetch
2	F1
3	F2
4	F3
5	F4
6	D0	Decode Rename Dispatch
7	D1
8	D2
9	D3
10	D4
11	D5
12	D6
13	E0	Execute
14	E1
15	E2
