ARM Cortex-A15
Samsung Exynos 5250
Samsung Exynos 5250 1.7 GHz, dual-core ARM Cortex-A15 + ARM Mali-T604 GPU, 32 nm HKMG.
2 GB DDR3L-1600 32-bit, 2-port, 12.8 GB/s. Samsung Chromebook (Samsung XE303C12).
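The 12.8 GB/s figure matches the usual peak-bandwidth arithmetic for DDR memory: transfers per second times bus width times number of ports. A minimal sketch, assuming "2-port" means two independent 32-bit channels; the function name is illustrative only:

```python
def peak_bandwidth_gbs(mega_transfers_per_s, bus_width_bits, ports):
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = bus_width_bits // 8
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * ports / 1e9

# DDR3L-1600: 1600 MT/s, 32-bit bus, 2 ports (assumed to be 2 channels)
print(peak_bandwidth_gbs(1600, 32, 2))  # 12.8
```

This is a theoretical ceiling, not a measured value.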
4 KB pages mode
- L1 Data cache = 32 KB, 64 B/line, 2-WAY, LRU
- L2 Cache = 1 MB, 64 B/line, 16-WAY
- L1 Data TLB (for loads): 32 entries, fully associative
- L2 TLB: 512 entries, 4-WAY
- PDE Cache: 16 entries (one entry per 1 MB of virtual space)
- L1 Data Cache Latency = 4 cycles
- L2 Cache Latency = 21 cycles
- RAM Latency = 21 cycles + 110 ns
- L1 TLB Miss Penalty = 12 cycles
- L2 TLB Miss Penalty = 30 cycles
- PDE Cache Miss Penalty = 20 cycles
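The entry counts above translate into an address range ("reach") that each translation structure can cover with 4 KB pages. The sketch below does that arithmetic and also converts the mixed "cycles + ns" RAM latency to nanoseconds at the 1.7 GHz clock. The helper names are illustrative, and the reach figures assume every entry maps a distinct region.

```python
KB, MB = 1024, 1024 * 1024
PAGE = 4 * KB                      # 4 KB page mode

def reach_bytes(entries, bytes_per_entry):
    """Address range covered when every entry maps a distinct region."""
    return entries * bytes_per_entry

l1_tlb_reach = reach_bytes(32, PAGE)    # 32 entries x 4 KB
l2_tlb_reach = reach_bytes(512, PAGE)   # 512 entries x 4 KB
pde_reach = reach_bytes(16, 1 * MB)     # 16 entries x 1 MB

def cycles_to_ns(cycles, ghz=1.7):
    """Convert core cycles to nanoseconds at the given clock."""
    return cycles / ghz

ram_latency_ns = cycles_to_ns(21) + 110  # "21 cycles + 110 ns"

print(l1_tlb_reach // KB, "KB")          # 128 KB
print(l2_tlb_reach // MB, "MB")          # 2 MB
print(pde_reach // MB, "MB")             # 16 MB
print(round(ram_latency_ns, 1), "ns")    # 122.4 ns
```

Working sets larger than 128 KB, 2 MB and 16 MB would therefore be expected to start missing the L1 TLB, L2 TLB and PDE cache respectively.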
Size | Latency (cycles) | Increase (cycles) | Description |
32 K | 4 | 0 | TLB + L1 |
64 K | 13 | 9 | + 17 (L2) |
128 K | 17 | 4 |
256 K | 25 | 8 | + 12 (L1 TLB miss) |
512 K | 29 | 4 |
1 M | 31 | 2 |
2 M | 32 + 55 ns | 1 + 55 ns | + 110 ns (RAM) |
4 M | 48 + 82 ns | 16 + 27 ns | + 30 (L2 TLB miss) |
8 M | 56 + 96 ns | 8 + 14 ns |
16 M | 59 + 103 ns | 3 + 7 ns |
32 M | 71 + 108 ns | 12 + 5 ns | + 20 (PDE cache miss) |
64 M | 77 + 115 ns | 6 + 7 ns |
128 M | 80 + 125 ns | 3 + 10 ns | + 100 ns (page walk to RAM) |
256 M | 82 + 140 ns | 2 + 15 ns |
512 M | 83 + 160 ns | 1 + 20 ns |
1024 M | 83 + 185 ns | 25 ns |
2048 M | 83 + 210 ns | 25 ns |
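One way to read the Description column is as a running sum of penalties: each annotated row adds one more miss cost to a random access. The sketch below accumulates those annotations. The tuple layout and function name are illustrative; the numbers are copied from the table.

```python
# (size where the penalty first appears, extra cycles, extra ns, source)
steps = [
    ("32 K", 4, 0, "TLB + L1"),
    ("64 K", 17, 0, "L2"),
    ("256 K", 12, 0, "L1 TLB miss"),
    ("2 M", 0, 110, "RAM"),
    ("4 M", 30, 0, "L2 TLB miss"),
    ("32 M", 20, 0, "PDE cache miss"),
    ("128 M", 0, 100, "page walk to RAM"),
]

def cumulative(steps):
    """Running totals of cycles and nanoseconds after each penalty."""
    total_cycles = total_ns = 0
    out = []
    for size, cyc, ns, _source in steps:
        total_cycles += cyc
        total_ns += ns
        out.append((size, total_cycles, total_ns))
    return out

for size, cyc, ns in cumulative(steps):
    print(f"{size:>6}: {cyc} cycles + {ns} ns")
```

The final total, 83 cycles + 210 ns, equals the 2048 M row. Intermediate rows sit below the running sum, presumably because only part of the accesses miss when the working set is just past a threshold.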
- Penalty for an access that crosses a 64-byte boundary = 1 cycle
- CPU can process several L1 TLB misses concurrently.
- CPU can't process several L2 TLB misses concurrently?
- L2 TLB miss penalty is 10-30 cycles (maybe 10 / 30 cycles for a page walk that hits in the L1 / L2 cache?).
- L1 B/W (Parallel Random Read) = 1 cycle per one access
- L2->L1 B/W (Parallel Random Read) = 9 cycles per cache line, 12.0 GB/s
- L2->L1 B/W (Read, 4 bytes step) = 2 cycles per access, 3.4 GB/s
- L2->L1 B/W (Read, 64 bytes step) = 8 cycles per cache line, 13.6 GB/s
- L2->L1 B/W (Read, 64 bytes step, pointer-chasing) = 20 cycles per cache line, 5.4 GB/s
- L2->L1 B/W (Read, 128+ bytes step, pointer-chasing) = 23 cycles per cache line
- RAM Read B/W (Parallel Random Read) = 40 ns / cache line (including L2 TLB miss)
- RAM Read B/W (Read, 4 Bytes step) = 2.6 GB/s
- RAM Read B/W (Read, 64 Bytes step) = 4.2 GB/s
- RAM Read B/W (Read, 8-64 Bytes step, pointer-chasing) = 33 ns per cache line, 1.9 GB/s (hardware prefetch)
- RAM Read B/W (Read, 128+ Bytes step, pointer-chasing) = 95 ns per cache line, 0.7 GB/s (no hardware prefetch)
- L2 Write (64 Bytes step) = 8 cycles per write (64-byte cache line), 13.6 GB/s
- RAM Write (4 Bytes step) = 1.1 cycle per write, 6.0 GB/s
- RAM Write (64 Bytes step) = 38 ns per write (64-byte cache line), 1.6 GB/s
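The GB/s figures in this list follow from the per-line costs: bytes per access times the clock in GHz divided by cycles, or bytes divided by nanoseconds. A small sketch that recomputes several of them, assuming the 1.7 GHz clock and 1 GB = 10^9 bytes; the function names are illustrative:

```python
LINE_BYTES = 64   # cache line size
CLOCK_GHZ = 1.7   # Exynos 5250 core clock

def gbs_from_cycles(nbytes, cycles, ghz=CLOCK_GHZ):
    """GB/s when `nbytes` are transferred every `cycles` core cycles."""
    return nbytes * ghz / cycles

def gbs_from_ns(nbytes, ns):
    """GB/s when `nbytes` are transferred every `ns` nanoseconds."""
    return nbytes / ns

# (label, computed GB/s, GB/s quoted in the list)
rows = [
    ("L2->L1 random, 9 cycles/line", gbs_from_cycles(LINE_BYTES, 9), 12.0),
    ("L2->L1 4-byte step, 2 cycles", gbs_from_cycles(4, 2), 3.4),
    ("L2->L1 64-byte step, 8 cycles", gbs_from_cycles(LINE_BYTES, 8), 13.6),
    ("L2->L1 chase, 20 cycles/line", gbs_from_cycles(LINE_BYTES, 20), 5.4),
    ("RAM chase, 33 ns/line", gbs_from_ns(LINE_BYTES, 33), 1.9),
    ("RAM chase, 95 ns/line", gbs_from_ns(LINE_BYTES, 95), 0.7),
]
for label, computed, quoted in rows:
    print(f"{label:32s} {computed:5.2f} GB/s (quoted {quoted})")
```

For these rows the recomputed values agree with the quoted ones to within 0.1 GB/s; the quoted figures appear to be rounded or truncated.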
ARM Cortex-A15 core
- L1 Data cache = 32 KB. 2-Way, LRU. 64 B/line, PIPT. ECC protection per 32 bits.
Maximum of 16 outstanding misses with 6 linefill buffers.
- L1 Instruction cache = 32 KB, 2-way, LRU, 64 B/line, PIPT. Parity protection per 16 bits.
- L2 cache = 512 KB - 4 MB, 64 B/line, 16-way:
- Sequential TAG and Data RAM access.
- Programmable RAM latencies.
- 4 independent Tag banks handle multiple requests in parallel
- Integrated Snoop Control Unit into L2 pipeline
- Direct data transfer line migration supported from cpu to cpu
- Full AMBA4 system coherency support on 128-bit master interface
- 64/128 bit AXI3 slave interface for ACP
- Full ECC capability
- Automatic data prefetching into L2 cache for load streaming
- L1 Data TLB: 2 separate 32-entry fully-associative TLBs that are used for
data loads and stores, respectively.
TLB caches entries at the 4KB granularity of Virtual Address (VA) to Physical Address (PA)
mapping only. If the page tables map the memory region to a larger granularity than 4K,
it only allocates one mapping for the particular 4K region to which the current access corresponds.
- L1 instruction TLB: 32-entry fully-associative structure. 4KB granularity of VA to PA
- L2 TLB: 512 entries, 4-way. Supports all the VMSAv7 page sizes of
4K, 64K, 1MB and 16MB in addition to the LPAE page sizes of 2MB and 1GB.
walking cache structures.
- 1 TB physical addressing
- Loop buffer: 32-entry, up to 2 fwd branches and 1 backwards branch, Completely disables Fetch and Decode stages of pipeline.
- ECC on L1 and L2. Single error correct, 2 error detect. Protects 32 bits for L1, 64 bits for L2.
- 128-bit AMBA 4 interface with coherency extensions.
- Fetch: 128-bit datapath (4-8 instructions). Full support for unaligned fetch address.
- Decode: 3 instructions
- NEON/VFP issue: 2 instructions (out-of-order)
- Integer issue: 4 instructions (out-of-order).
- Load/store issue: 1 load + 1 store. 128-bit datapath. (Partial out-of-order).
- microBTB: 64 entry. Fully associative, Caches taken branches only, Overruled by main predictor when they disagree.
- BTB (Branch Target Buffer): ? entries
- GHB (Global history Buffer): 3 arrays: Taken array, Not taken array, and Selector.
- Indirect predictor: 256 entry BTB indexed by XOR of history and address, Multiple Target addresses allowed per address.
- Return Stack: ?
- 2 Register rename tables: for ARM and Extended (NEON) registers.
- Result queue: Queue of renamed register results pending update to the register file, Shared for both ARM and Extended register results.
- Dispatch Unit:
- 40-entry Commit queue in the dispatch unit
- Speculative result queues:
- 128 entry main result queue
- 24 entry Flag result queue for holding speculative flag register updates
- Execution Clusters:
- Simple cluster: 2 ALUs, 2 shifters (in parallel, includes v6-SIMD)
- Complex cluster: NEON and Floating Point:
- Dual issue queues of 8 entries each.
- 2 operations per cycle.
- Includes support for quad FMAC per cycle.
- Branch cluster: All operations that have the PC as a destination
- Multiply and Divide cluster: All ARM multiply and Integer divide operations.
- Load/Store cluster:
- 1 Load and 1 Store executed per cycle.
- Loads issue out-of-order but cannot bypass stores
- Stores issue in order, but only require address sources to issue.
- 16 entry issue queue (for ARM and NEON/memory operation)
- 4 stage load pipeline
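The cache parameters listed above determine the number of sets and how an address splits into offset and index bits. A minimal sketch of that standard set-associative arithmetic; the function name is illustrative, and it assumes power-of-two sizes:

```python
KB, MB = 1024, 1024 * 1024

def cache_geometry(size_bytes, ways, line_bytes=64):
    """Return (sets, offset_bits, index_bits) for a set-associative cache."""
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1   # log2 of a power of two
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits

print(cache_geometry(32 * KB, 2))    # L1 caches: (256, 6, 8)
print(cache_geometry(512 * KB, 16))  # smallest L2: (512, 6, 9)
print(cache_geometry(4 * MB, 16))    # largest L2: (4096, 6, 12)
```

For the L1 caches, offset plus index bits span 14 address bits, which is more than the 12-bit offset of a 4 KB page, so the set index depends on translated address bits. That is consistent with the PIPT organization stated above.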
Pipeline
Integer pipeline:
# | Name | Stage |
1 | F0 | Fetch |
2 | F1 |
3 | F2 |
4 | F3 |
5 | F4 |
6 | D0 | Decode, Rename, Dispatch |
7 | D1 |
8 | D2 |
9 | D3 |
10 | D4 |
11 | D5 |
12 | D6 |
13 | E0 | Execute |
14 | E1 |
15 | E2 |