AMD Jaguar
AMD A8-6410 (Puma/Beema), 4 cores, 1800 MHz, 28 nm, 8 GB DDR3-1600.
- L1 Data cache = 32 KB. 64 B/line, 8-WAY. Parity Protected,
Write-back.
One 128-bit load and one 128-bit store per cycle.
Hardware prefetcher.
- L1 Instruction cache = 32 KB, 64 B/line, 2-WAY, Parity Protected.
32 bytes are fetched in a cycle. On misses, the L1 requests the
missed line and 1 or 2 sequential lines (prefetches).
- L2 cache size = 2 MB. 64 B/line, 16-WAY. ECC protected, Write-back.
Shared by up to 4 cores.
L2 cache is inclusive of the L1 caches in the cores.
The L2 to L1 data path is 16 bytes wide;
critical data within a cache line is forwarded first.
4 512-Kbyte banks: bits 7:6 of the cache line address determine banks.
- 4 KB pages DATA L1 TLB size = 40 items, full-assoc. Miss penalty = 6 cycles.
- 2 MB pages DATA L1 TLB size = 8 items, full-assoc.
- 4 KB pages DATA L2 TLB size = 512 items, 4-WAY. Miss penalty = 26 cycles?
- 2 MB pages DATA L2 TLB size = 256 items, 2-WAY.
- 4 KB pages Instruction L1 TLB size = 32 items, full-assoc.
- 2 MB pages Instruction L1 TLB size = 8 items, full-assoc.
- 4 KB pages Instruction L2 TLB size = 512 items, 4-WAY.
- Page Directory Cache (PDC): 16 entries.
- The page table walker supports 1-Gbyte pages by smashing the page into a
2-Mbyte window, and returning a 2-Mbyte TLB entry.
In legacy mode, 4-Mbyte entries are also supported by returning a smashed 2-Mbyte TLB entry.
- L1 BTB: 1024 entries; a sparse branch predictor that maps up to the first
two branches per instruction cache line (64 bytes).
- The L2 BTB is a dense branch predictor and contains 1024 branch entries,
mapped as up to an additional 2 branches per 8 byte instruction chunk,
if located in the same 64-byte aligned block.
- return address stack (RAS): 16-entry.
- Execution units: 2 ALU, 1 LOAD, 1 STORE, 2 FPU.
- ALU scheduler: 20-entry
- Address generation unit (AGU) scheduler: 12-entry
- The floating-point retire queue: 44 floating-point micro-ops.
- LS unit: 20-entry store queue
- L1 Data Cache Latency = 3 cycles for simple access via pointer
- L1 Data Cache Latency = 4 cycles for access with complex address calculation (size_t n, *p; n = p[n]).
- L2 Cache Latency = 26 cycles
- RAM Latency = 26 cycles + 100 ns
2 MB pages mode (64-bit Windows)
- Data TLB L1 size = 8 items. full assoc. Miss penalty = 6 cycles. Parallel miss: 2 cycles per access
- Data TLB L2 size = 256 items. 4-WAY. Miss penalty = ? cycles. Parallel miss: ? cycles per access
Size     Latency        Increase     Description
32 K     3
64 K     15             12           + 23 (L2)
128 K    21             6
256 K    24             3
512 K    25             1
1 M      25             0
2 M      26             1
4 M      26 + 55 ns     + 55 ns      + 100 ns (RAM)
8 M      26 + 80 ns     + 25 ns
16 M     26 + 92 ns     + 12 ns
32 M     29 + 98 ns     3 + 6 ns     + 6 (L1 TLB miss)
64 M     31 + 100 ns    2 + 2 ns
128 M    32 + 100 ns    1
256 M    32 + 100 ns
512 M    32 + 100 ns
4 KB pages mode (64-bit Windows)
- Data TLB L1 size = 40 items. full assoc. Miss penalty = 6 cycles. Parallel miss: 2 cycles per access
- Data TLB L2 size = 512 items. 4-WAY. Miss penalty = 28? cycles. Parallel miss: 28? cycles per access
Size     Latency        Increase     Description
32 K     3
64 K     15             12           + 23 (L2)
128 K    21             6
256 K    26             5            + 6 (L1 TLB miss)
512 K    29             3
1 M      30             1
2 M      32             2
4 M      49 + 55 ns     17 + 55 ns   + 100 ns (RAM) + 28 (L2 TLB miss)
8 M      59 + 80 ns     10 + 25 ns
16 M     70 + 92 ns     11 + 12 ns
32 M     76 + 98 ns     6 + 5 ns
64 M     84 + 100 ns    8 + 2 ns
MISC
- Branch misprediction penalty = 15-16 cycles
- 16-bytes range cross penalty = 2 cycles
- L1 B/W (Parallel Random Read) = 1 cycle per access
- L2->L1 B/W (Parallel Random Read) = 8 cycles per cache line
- L2->L1 B/W (Read, 64 bytes step) = 8 cycles per cache line
- L2 Write (Write, 64 bytes step) = 10 cycles per write (cache line)
- RAM Read B/W (Parallel Random Read) = 19 ns / access
- RAM Read B/W (Read, 64 Bytes step) = 6800 MB/s
- RAM Write B/W (Write, 4-64 Bytes step) = 3600 MB/s
Links
AMD Jaguar at Wikipedia
Software Optimization Guide for AMD Family 16h Processors.