ARM Cortex-A9
Samsung Exynos 4210: dual-core Cortex-A9, 1200 MHz,
two 32-bit memory ports, 800 Mbps per pin, LPDDR2/DDR2/DDR3 (6.4 GB/s peak).
- L1 Data cache = 32 KB. 4-WAY, 32 B/line, Physically Indexed,  Physically Tagged.
    Two 32-byte linefill buffers and one 32-byte eviction buffer. 
    A 4-entry, 64-bit merging store buffer.
 - Automatic data prefetcher: monitors cache misses and can track and prefetch two independent data streams.
 - L1 Instruction cache = 32 KB, 4-WAY, Virtually Indexed, Physically Tagged, 64-bit accesses.
 - L2 cache = 1 MB. 32 B/line, ?-WAY 
 - 2*ALU, LS, MUL
 - BTAC: 512 entries, 2-WAY.
 - GHB (Global History Buffer): 4K entries, 2-bit
 - Instruction loop buffer (<64 bytes): short loops can run from it without accessing the instruction cache.
 - PRF (Physical Register File): 56 x 32-bit.
 - Return Stack: 8 items ?
 - Store buffer: 4 x 64-bit slots with data merging capability. 
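The L1 data cache parameters above (32 KB, 4-way, 32 B/line) fix the number of sets and the way an address splits into tag, index and offset. A minimal sketch of that arithmetic follows; the function names are illustrative and not taken from any ARM documentation.

```python
KB = 1024


def cache_geometry(size_bytes, ways, line_bytes):
    """Return (number of sets, bytes per way) for a set-associative cache."""
    lines = size_bytes // line_bytes   # total cache lines
    sets = lines // ways               # lines are grouped into sets of `ways`
    way_bytes = sets * line_bytes      # address range covered by one way
    return sets, way_bytes


def split_address(addr, sets, line_bytes):
    """Split an address into (tag, set index, byte offset within the line)."""
    offset = addr % line_bytes
    index = (addr // line_bytes) % sets
    tag = addr // (line_bytes * sets)
    return tag, index, offset


# L1 data cache figures from the list above: 32 KB, 4-way, 32 B/line.
l1d_sets, l1d_way_bytes = cache_geometry(32 * KB, ways=4, line_bytes=32)
print("L1D sets:", l1d_sets, "bytes per way:", l1d_way_bytes)
print("tag/index/offset of 0x12345:", split_address(0x12345, l1d_sets, 32))
```

With these figures each way covers 8 KB, so two addresses that differ by a multiple of 8 KB map to the same set.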
 
4 KB pages mode
- Micro TLB Data (L1 TLB): 32 entries (8 in first revision ?), fully associative.
 - Micro TLB Instr. (L1 TLB): 32 entries (8 in first revision ?), fully associative.
 - Main TLB (L2 TLB): 128 entries, 2-WAY. + fully-associative lockable array of 4 elements.
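The TLB sizes above determine how much memory can be touched without a TLB miss ("TLB reach"). A small sketch of the calculation for 4 KB pages; it ignores the 4 lockable main TLB entries, and the function name is illustrative.

```python
PAGE_BYTES = 4 * 1024  # 4 KB pages mode


def tlb_reach(entries, page_bytes=PAGE_BYTES):
    """Bytes of address space covered when every TLB entry maps one page."""
    return entries * page_bytes


micro_tlb_reach = tlb_reach(32)    # micro TLB (L1 TLB), 32 entries
main_tlb_reach = tlb_reach(128)    # main TLB (L2 TLB), 128 entries

print("micro TLB reach:", micro_tlb_reach // 1024, "KB")
print("main TLB reach:", main_tlb_reach // 1024, "KB")
```

The results (128 KB and 512 KB) give the working-set sizes beyond which L1 TLB misses and L2 TLB misses should start to appear.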
 
 | Size  | Latency (cycles) | Description         |
 | 32 K  | 4                | TLB + L1            |
 | 64 K  | 23               | + 19 (L2)           |
 | 128 K | 23               | + 19 (L2)           |
 | 256 K | 30               | + 7 (L1 TLB miss)   |
 | 512 K | 30               | + 7 (L1 TLB miss)   |
 | 1 M   | 37               | + 7 (L2 TLB miss)   |
 | ...   | 37 + 110 ns      | + 110 ns (RAM)      |
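The latency table can be read as a step function of working-set size. The sketch below encodes it that way, under two assumptions: sizes with no latency listed (128 K, 512 K) share the value of the row above, and the 110 ns RAM term is converted to cycles at the 1200 MHz clock. The names are illustrative.

```python
KB = 1024
CLOCK_MHZ = 1200      # core clock from the header
RAM_EXTRA_NS = 110    # extra RAM latency from the last table row

# (largest working-set size in bytes, latency in cycles)
LATENCY_STEPS = [
    (32 * KB, 4),      # TLB + L1
    (128 * KB, 23),    # + 19 (L2)
    (512 * KB, 30),    # + 7 (L1 TLB miss)
    (1024 * KB, 37),   # + 7 (L2 TLB miss)
]


def ns_to_cycles(ns, clock_mhz=CLOCK_MHZ):
    """Convert nanoseconds to core cycles at the given clock."""
    return ns * clock_mhz / 1000.0


def estimated_latency_cycles(working_set_bytes):
    """Look up the load latency (in cycles) for a working-set size."""
    for limit, cycles in LATENCY_STEPS:
        if working_set_bytes <= limit:
            return float(cycles)
    # Beyond 1 MB the access goes to RAM: 37 cycles plus 110 ns.
    return 37 + ns_to_cycles(RAM_EXTRA_NS)


for size_kb in (16, 64, 256, 1024, 4096):
    print(size_kb, "KB ->", estimated_latency_cycles(size_kb * KB), "cycles")
```

At 1200 MHz the 110 ns term is 132 cycles, so a RAM access costs roughly 169 cycles in this model.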
The data prefetcher reacts only to misses that go to RAM. It does not prefetch data from the L2 cache into L1.
- Crossing a 4-byte boundary: 1-cycle penalty
 - Crossing an 8-byte boundary: 6-cycle penalty
 - The CPU can handle TLB misses in parallel (at least two accesses at a time).
 - L1 B/W (Parallel Random Read) = 1 cycle per access
 - L2->L1 B/W (Parallel Random Read) = 7 cycles per cache line
 - L2->L1 B/W (Read, 32 bytes step) = 8.7 cycles per cache line
 - L2 Write (Sequential) = 1 cycle per 4 bytes. 
 - L2 Write (Write, 32 bytes step) = 11.5 cycles per write (cache line), probably because write-allocate to L1 is enabled
 - RAM Read B/W (Parallel Random Read) = 68 ns / cache line = 470 MB/s
 - RAM Read B/W (Read, 4 Bytes step) = 890 MB/s
 - RAM Read B/W (Read, 32 Bytes step) = 1010 MB/s
 - RAM Write B/W (Sequential, or 4 bytes step) = 1600 MB/s
 - RAM Write B/W (32 bytes step) = 725 MB/s, probably because write-allocate is enabled
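The bandwidth figures above mix units (cycles per cache line, ns per cache line, MB/s). A small sketch of the conversions, assuming 32-byte lines, the 1200 MHz clock, and MB = 10^6 bytes; it also derives the 6.4 GB/s theoretical peak from the port configuration in the header. Function names are illustrative.

```python
CLOCK_HZ = 1_200_000_000   # 1200 MHz core clock
LINE_BYTES = 32            # cache line size


def mb_per_s_from_ns(ns_per_line, line_bytes=LINE_BYTES):
    """Bandwidth in MB/s when one cache line takes `ns_per_line` ns."""
    return line_bytes / ns_per_line * 1000.0


def mb_per_s_from_cycles(cycles_per_line, line_bytes=LINE_BYTES,
                         clock_hz=CLOCK_HZ):
    """Bandwidth in MB/s when one cache line takes `cycles_per_line` cycles."""
    return line_bytes / cycles_per_line * clock_hz / 1e6


def peak_dram_mb_per_s(ports=2, width_bits=32, mbps_per_pin=800):
    """Theoretical peak: ports x bus width x per-pin rate, in MB/s."""
    return ports * width_bits * mbps_per_pin / 8


ram_random = mb_per_s_from_ns(68)      # RAM parallel random read
l2_random = mb_per_s_from_cycles(7)    # L2->L1 parallel random read
peak = peak_dram_mb_per_s()

print("RAM random read: %.0f MB/s" % ram_random)
print("L2->L1 random read: %.0f MB/s" % l2_random)
print("DRAM peak: %.0f MB/s" % peak)
print("1010 MB/s is %.1f%% of peak" % (1010 / peak * 100))
```

The 68 ns per line figure works out to about 470 MB/s, matching the list, and the best measured RAM read rate (1010 MB/s) is roughly 16% of the theoretical peak.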
 
Pipeline
Branch misprediction penalty = 11 cycles.
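To put the 11-cycle penalty in context, the sketch below converts it to time at 1200 MHz and estimates the total cycles lost for a given branch count and misprediction rate. The branch count and rate in the example are hypothetical inputs, not measurements.

```python
CLOCK_MHZ = 1200
MISPREDICT_PENALTY_CYCLES = 11   # from the line above


def cycles_to_ns(cycles, clock_mhz=CLOCK_MHZ):
    """Convert core cycles to nanoseconds at the given clock."""
    return cycles * 1000.0 / clock_mhz


def mispredict_overhead_cycles(branches, mispredict_rate,
                               penalty=MISPREDICT_PENALTY_CYCLES):
    """Expected cycles lost to mispredicted branches."""
    return branches * mispredict_rate * penalty


# Hypothetical workload: one million branches, 5% mispredicted.
lost = mispredict_overhead_cycles(1_000_000, 0.05)
print("penalty per misprediction: %.2f ns" % cycles_to_ns(MISPREDICT_PENALTY_CYCLES))
print("cycles lost: %.0f (%.1f us)" % (lost, cycles_to_ns(lost) / 1000.0))
```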
Integer pipeline:
 | # | Name | Stage     |
 | 1 | Fe1  | Fetch     |
 | 2 | Fe2  | Fetch     |
 | 3 | Fe3  | Fetch     |
 | 4 | De1  | Decode    |
 | 5 | De2  | Decode    |
 | 6 | Re   | Rename    |
 | 7 | Iss  | Issue     |
 | 8 | Ex   | Execute   |
 | 9 | WB   | WriteBack |