ARM Cortex-A9
Samsung Exynos 4210: dual-core Cortex-A9 at 1200 MHz,
two 32-bit 800 Mbps memory ports, LPDDR2/DDR2/DDR3 (6.4 GB/s peak).
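The 6.4 GB/s figure follows from the port count, port width and per-pin data rate. A minimal sketch of that arithmetic, assuming 1 GB = 10^9 bytes and that "800 Mbps" is the per-pin transfer rate (the function and constant names are illustrative only):

```python
# Peak DRAM bandwidth implied by the interface figures quoted above.
PORTS = 2                  # memory ports
PORT_WIDTH_BITS = 32       # width of each port
TRANSFERS_PER_SEC = 800e6  # assumed: 800 Mbps per data pin

def peak_bandwidth_bytes_per_sec(ports, width_bits, transfers_per_sec):
    """Bytes per second if every pin transfers on every cycle."""
    return ports * (width_bits // 8) * transfers_per_sec

peak = peak_bandwidth_bytes_per_sec(PORTS, PORT_WIDTH_BITS, TRANSFERS_PER_SEC)
print(f"{peak / 1e9:.1f} GB/s")  # prints "6.4 GB/s"
```

This is a theoretical ceiling; the measured RAM bandwidths listed further down are well below it.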
- L1 Data cache = 32 KB. 4-WAY, 32 B/line, Physically Indexed, Physically Tagged.
Two 32-byte linefill buffers and one 32-byte eviction buffer.
A 4-entry, 64-bit merging store buffer.
- Automatic prefetcher that monitors cache misses; it can track and prefetch two independent data streams.
- L1 Instruction cache = 32 KB, 4-WAY, Virtually Indexed, Physically Tagged, 64-bit accesses.
- L2 cache = 1 MB. 32 B/line, ?-WAY
- 2*ALU, LS, MUL
- BTAC: 512 entries, 2-WAY.
- GHB (Global History Buffer): 4K entries, 2-bit
- Instruction buffer (<64 bytes): short loops run from it without accessing the instruction cache.
- PRF (Physical Register File): 56 x 32-bit.
- Return Stack: 8 items ?
- Store buffer: 4 x 64-bit slots with data merging capability.
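The L1 data cache parameters above (32 KB, 4-way, 32 B/line) determine the number of sets and how an address splits into offset and index bits. A short sketch of that derivation, assuming power-of-two sizes; the helper name is hypothetical:

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (sets, offset_bits, index_bits) for a set-associative cache.

    Assumes size, ways and line size are powers of two.
    """
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2(line size)
    index_bits = sets.bit_length() - 1         # log2(number of sets)
    return sets, offset_bits, index_bits

# L1 data cache: 32 KB, 4-way, 32 bytes per line
sets, offset_bits, index_bits = cache_geometry(32 * 1024, 4, 32)
print(sets, offset_bits, index_bits)  # prints "256 5 8"
```

Each way therefore covers 256 sets x 32 bytes = 8 KB.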
4 KB pages mode
- Micro TLB Data (L1 TLB): 32 entries (8 in first revision ?), fully associative.
- Micro TLB Instr. (L1 TLB): 32 entries (8 in first revision ?), fully associative.
- Main TLB (L2 TLB): 128 entries, 2-WAY. + fully-associative lockable array of 4 elements.
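The TLB sizes translate into a "reach", the amount of memory that can be addressed without a TLB miss. A minimal sketch for 4 KB pages (the function name is illustrative):

```python
PAGE_BYTES = 4 * 1024  # 4 KB pages mode

def tlb_reach_kb(entries, page_bytes=PAGE_BYTES):
    """Memory covered by a TLB with the given number of entries, in KB."""
    return entries * page_bytes // 1024

micro_reach = tlb_reach_kb(32)          # micro TLB: 32 entries
main_reach = tlb_reach_kb(128)          # main TLB: 128 entries
main_plus_lock = tlb_reach_kb(128 + 4)  # including the 4 lockable entries

print(micro_reach, main_reach, main_plus_lock)  # prints "128 512 528"
```

A reach of 128 KB and roughly 512 KB appears consistent with the latency table, where the extra TLB-miss steps first show up at the 256 K and 1 M working-set sizes.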
Size | Latency (cycles) | Description |
32 K | 4 | TLB + L1 |
64 K | 23 | + 19 (L2) |
128 K | | |
256 K | 30 | + 7 (L1 TLB miss) |
512 K | | |
1 M | 37 | + 7 (L2 TLB miss) |
... | 37 + 110 ns | + 110 ns (RAM) |
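The table can be read as a step function of working-set size. The sketch below turns it into an estimate in nanoseconds at 1200 MHz. It assumes that the blank rows (128 K, 512 K) share the latency of the row above them, which the table does not state, and the names are hypothetical:

```python
CLOCK_HZ = 1200e6  # core clock from the header
RAM_NS = 110       # extra RAM latency from the last table row

# (largest working set in KB, load latency in cycles)
# Assumption: blank table rows inherit the value of the preceding row.
STEPS = [(32, 4), (128, 23), (512, 30), (1024, 37)]

def load_latency_ns(working_set_kb):
    """Estimated load-to-use latency in ns for a given working-set size."""
    for limit_kb, cycles in STEPS:
        if working_set_kb <= limit_kb:
            return cycles / CLOCK_HZ * 1e9
    # Beyond L2: 37 cycles of cache and TLB overhead plus the RAM access.
    return 37 / CLOCK_HZ * 1e9 + RAM_NS

for kb in (32, 64, 256, 1024, 8192):
    print(f"{kb:>5} K: {load_latency_ns(kb):6.1f} ns")
```

At this clock one cycle is about 0.83 ns, so the 110 ns RAM term corresponds to 132 cycles and dominates everything else in the table.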
Data prefetcher monitors only RAM misses. It doesn't prefetch data from L2 cache.
- Penalty for an access that crosses a 4-byte boundary = 1 cycle
- Penalty for an access that crosses an 8-byte boundary = 6 cycles
- The CPU can handle TLB misses in parallel (at least two accesses at a time).
- L1 B/W (Parallel Random Read) = 1 cycle per access
- L2->L1 B/W (Parallel Random Read) = 7 cycles per cache line
- L2->L1 B/W (Read, 32 bytes step) = 8.7 cycles per cache line
- L2 Write (Sequential) = 1 cycle per 4 bytes.
- L2 Write (Write, 32 bytes step) = 11.5 cycles per write (cache line); write-allocate to L1 is probably enabled
- RAM Read B/W (Parallel Random Read) = 68 ns / cache line = 470 MB/s
- RAM Read B/W (Read, 4 Bytes step) = 890 MB/s
- RAM Read B/W (Read, 32 Bytes step) = 1010 MB/s
- RAM Write B/W (Sequential, or 4 bytes step) = 1600 MB/s
- RAM Write B/W (32 bytes step) = 725 MB/s; write-allocate is probably enabled
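The per-line figures in this list convert to bandwidth using the 32-byte line size and the 1200 MHz clock. A rough sketch, assuming 1 MB = 10^6 bytes (which matches the 68 ns to 470 MB/s conversion above); the helper names are illustrative:

```python
LINE_BYTES = 32    # cache line size
CLOCK_HZ = 1200e6  # core clock

def mb_per_sec_from_ns(ns_per_line):
    """Bandwidth in MB/s when one cache line arrives every ns_per_line."""
    return LINE_BYTES / (ns_per_line * 1e-9) / 1e6

def mb_per_sec_from_cycles(cycles_per_line):
    """Bandwidth in MB/s when one cache line arrives every cycles_per_line."""
    return LINE_BYTES * CLOCK_HZ / cycles_per_line / 1e6

ram_random = mb_per_sec_from_ns(68)       # about 470.6 MB/s
l2_random = mb_per_sec_from_cycles(7)     # about 5486 MB/s
l2_stride = mb_per_sec_from_cycles(8.7)   # about 4414 MB/s

print(f"RAM random read: {ram_random:.1f} MB/s")
print(f"L2->L1 random read: {l2_random:.0f} MB/s")
print(f"L2->L1 32-byte step: {l2_stride:.0f} MB/s")
```

On these numbers the best measured RAM read rate (1010 MB/s) is roughly 16% of the 6.4 GB/s interface peak.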
Pipeline
Branch misprediction penalty = 11 cycles.
Integer pipeline:
# | Name | Stage |
1 | Fe1 | Fetch |
2 | Fe2 | Fetch |
3 | Fe3 | Fetch |
4 | De1 | Decode |
5 | De2 | Decode |
6 | Re | Rename |
7 | Iss | Issue |
8 | Ex | Execute |
9 | WB | WriteBack |
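The 11-cycle misprediction penalty quoted above can be put in context with a simple expected-cost estimate. This is only a sketch: the 5% misprediction rate is a hypothetical input, not a measurement from this page, and the names are illustrative.

```python
PENALTY_CYCLES = 11  # branch misprediction penalty from this section
CLOCK_HZ = 1200e6    # core clock

def penalty_ns(cycles=PENALTY_CYCLES, clock_hz=CLOCK_HZ):
    """Wall-clock cost of one mispredicted branch, in ns."""
    return cycles / clock_hz * 1e9

def avg_overhead_cycles(mispredict_rate, cycles=PENALTY_CYCLES):
    """Average extra cycles per branch for a given misprediction rate."""
    return mispredict_rate * cycles

one_miss = penalty_ns()                 # about 9.2 ns
per_branch = avg_overhead_cycles(0.05)  # hypothetical 5% miss rate

print(f"{one_miss:.1f} ns per misprediction")
print(f"{per_branch:.2f} extra cycles per branch at a 5% miss rate")
```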
Links
ARM Cortex-A9 at Wikipedia
ARM Cortex-A9 at arm.com