ARM Cortex-A9

Samsung Exynos 4210: dual-core Cortex-A9, 1200 MHz, two 32-bit 800 Mbps LPDDR2/DDR2/DDR3 ports (6.4 GB/s).
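
The 6.4 GB/s figure follows from the port count, port width, and per-pin data rate. A minimal sketch of that arithmetic; the function name is illustrative only, and 1 GB is assumed to mean 10^9 bytes:

```python
def peak_bandwidth_gb_s(ports, width_bits, mbps_per_pin):
    """Theoretical peak DRAM bandwidth in GB/s (assuming 1 GB = 10**9 bytes)."""
    # Total bits per second across all ports and all data pins.
    bits_per_second = ports * width_bits * mbps_per_pin * 1_000_000
    # Convert bits to bytes, then bytes to GB.
    return bits_per_second / 8 / 1e9


# Two 32-bit ports at 800 Mbps per pin, as listed for the Exynos 4210.
exynos_4210_peak = peak_bandwidth_gb_s(ports=2, width_bits=32, mbps_per_pin=800)
print(f"{exynos_4210_peak:.1f} GB/s")
```

This is a theoretical ceiling; the measured RAM figures further down are well below it.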

L1 Data cache = 32 KB. 4-WAY, 32 B/line, Physically Indexed, Physically Tagged. Two 32-byte linefill buffers and one 32-byte eviction buffer. A 4-entry, 64-bit merging store buffer.
Automatic prefetcher monitors cache misses; it can track and prefetch two independent data streams.
L1 Instruction cache = 32 KB, 4-WAY, Virtually Indexed, Physically Tagged, 64-bit accesses.
L2 cache = 1 MB. 32 B/line, ?-WAY
2*ALU, LS, MUL
BTAC: 512 entries, 2-WAY.
GHB (Global History Buffer): 4K entries, 2-bit
Instruction buffer (<64 bytes): short loops can run from it without accessing the instruction cache.
PRF (Physical Register File): 56 x 32-bit.
Return Stack: 8 items ?
Store buffer: 4 x 64-bit slots with data merging capability.
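
The L1 data cache parameters above (32 KB, 4-way, 32 B/line) determine the number of sets and the address bits used for indexing. A small sketch of that derivation, assuming power-of-two sizes; the helper name and the returned dictionary keys are illustrative only. The L2 cache is left out because its associativity is not known here.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Derive set count and address-bit split for a set-associative cache.

    Assumes size, ways, and line size are all powers of two.
    """
    sets = size_bytes // (ways * line_bytes)
    return {
        "sets": sets,
        "way_bytes": size_bytes // ways,            # bytes covered by one way
        "offset_bits": line_bytes.bit_length() - 1,  # log2(line size)
        "index_bits": sets.bit_length() - 1,         # log2(number of sets)
    }


# L1 data cache: 32 KB, 4-way, 32 bytes per line.
l1d = cache_geometry(32 * 1024, 4, 32)
print(l1d)
```

For these inputs, offset plus index bits come to 13, one more than the 12-bit offset of a 4 KB page, so one index bit lies above the page offset.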

4 KB pages mode

Micro TLB Data (L1 TLB): 32 entries (8 in first revision ?), fully associative.
Micro TLB Instr. (L1 TLB): 32 entries (8 in first revision ?), fully associative.
Main TLB (L2 TLB): 128 entries, 2-WAY. + fully-associative lockable array of 4 elements.
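
TLB reach (the amount of memory that can be mapped without a TLB miss) follows directly from the entry counts above and the 4 KB page size. A minimal sketch; the function name is illustrative only, and the 32-entry micro TLB size is used rather than the uncertain 8-entry figure:

```python
PAGE_BYTES = 4 * 1024  # 4 KB pages, as stated above


def tlb_reach_kb(entries, page_bytes=PAGE_BYTES):
    """Memory covered by a TLB with the given number of entries, in KB."""
    return entries * page_bytes // 1024


micro_tlb_reach_kb = tlb_reach_kb(32)        # each micro TLB (data or instruction)
main_tlb_reach_kb = tlb_reach_kb(128)        # main TLB, 2-way part only
main_tlb_total_kb = tlb_reach_kb(128 + 4)    # including the 4 lockable entries

print(micro_tlb_reach_kb, main_tlb_reach_kb, main_tlb_total_kb)
```

With 4 KB pages the main TLB covers about half of the 1 MB L2 cache, so random accesses over the whole L2 can be expected to include TLB misses.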

Data prefetcher monitors only RAM misses; it does not prefetch data from the L2 cache.

4-byte boundary crossing penalty = 1 cycle
8-byte boundary crossing penalty = 6 cycles
CPU can handle TLB misses in parallel (at least two accesses at a time).
L1 B/W (Parallel Random Read) = 1 cycle per access
L2->L1 B/W (Parallel Random Read) = 7 cycles per cache line
L2->L1 B/W (Read, 32 bytes step) = 8.7 cycles per cache line
L2 Write (Sequential) = 1 cycle per 4 bytes.
L2 Write (Write, 32 bytes step) = 11.5 cycles per write (one per cache line); write allocate to L1 is probably enabled
RAM Read B/W (Parallel Random Read) = 68 ns / cache line = 470 MB/s
RAM Read B/W (Read, 4 Bytes step) = 890 MB/s
RAM Read B/W (Read, 32 Bytes step) = 1010 MB/s
RAM Write B/W (Sequential, or 4 bytes step) = 1600 MB/s
RAM Write B/W (32 bytes step) = 725 MB/s; write allocate is probably enabled
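
The figures above mix cycles, nanoseconds, and MB/s. A short sketch of the conversions between them, assuming the measurements were taken at the 1200 MHz clock listed at the top and that 1 MB = 10^6 bytes; all function names are illustrative only:

```python
CLOCK_MHZ = 1200   # assumed clock for all measurements
LINE_BYTES = 32    # cache line size


def mb_s_from_ns(bytes_moved, ns):
    """Bandwidth in MB/s when bytes_moved bytes take ns nanoseconds."""
    # bytes per ns equals GB/s; multiply by 1000 for MB/s.
    return bytes_moved / ns * 1000


def mb_s_from_cycles(bytes_moved, cycles, clock_mhz=CLOCK_MHZ):
    """Bandwidth in MB/s when bytes_moved bytes take the given cycle count."""
    # clock_mhz is millions of cycles per second, so the result is in MB/s.
    return bytes_moved * clock_mhz / cycles


def cycles_from_ns(ns, clock_mhz=CLOCK_MHZ):
    """Convert a duration in nanoseconds to core clock cycles."""
    return ns * clock_mhz / 1000


ram_random_mb_s = mb_s_from_ns(LINE_BYTES, 68)      # 68 ns per cache line
ram_random_cycles = cycles_from_ns(68)              # same figure in cycles
l2_random_mb_s = mb_s_from_cycles(LINE_BYTES, 7)    # 7 cycles per cache line
l2_seq_write_mb_s = mb_s_from_cycles(4, 1)          # 1 cycle per 4 bytes

print(f"RAM random read: {ram_random_mb_s:.0f} MB/s, {ram_random_cycles:.1f} cycles/line")
print(f"L2->L1 random read: {l2_random_mb_s:.0f} MB/s")
print(f"L2 sequential write: {l2_seq_write_mb_s:.0f} MB/s")
```

The first conversion reproduces the 470 MB/s figure for parallel random RAM reads (up to rounding).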

Branch misprediction penalty = 11 cycles.
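
To put the 11-cycle penalty in context, it can be converted to time and to an average per-branch overhead. A rough sketch, assuming the 1200 MHz clock listed at the top; the 5% misprediction rate is a hypothetical input, not a measurement, and the function names are illustrative only:

```python
MISPREDICT_PENALTY_CYCLES = 11
CLOCK_MHZ = 1200  # assumed core clock


def mispredict_cost_ns(penalty_cycles=MISPREDICT_PENALTY_CYCLES, clock_mhz=CLOCK_MHZ):
    """Time lost to one mispredicted branch, in nanoseconds."""
    return penalty_cycles * 1000 / clock_mhz


def avg_branch_overhead_cycles(miss_rate, penalty_cycles=MISPREDICT_PENALTY_CYCLES):
    """Average extra cycles per branch for a given misprediction rate (0..1)."""
    return miss_rate * penalty_cycles


cost_ns = mispredict_cost_ns()
overhead = avg_branch_overhead_cycles(0.05)  # hypothetical 5% miss rate
print(f"{cost_ns:.2f} ns per misprediction, {overhead:.2f} extra cycles per branch")
```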

Integer pipeline: