Qualcomm Krait 300
Qualcomm Krait 300 (MSM8230AB, Snapdragon 400), 2 cores, 1728 MHz, 28 nm, Samsung Galaxy S4 Mini,
1.5 GB (32-bit LPDDR2).
- L0 Data cache = 4 KB. 64 B/line, direct mapped
- L0 Instruction cache = 4 KB.
- L1 Data cache = 16 KB. 64 B/line, 4-way
- L1 Instruction cache = 16 KB, 4-way
- L2 Cache = 1 MB, 128 B/line, 8-way. Each core has fast access only to 512 KB of L2 cache.
- L0 Data Cache Latency = 3 cycles
- L1 Data Cache Latency = 6 cycles
- L2 Cache Latency = 36 cycles = 12 cycles + 14 ns
- RAM Latency = 36 cycles + 110 ns
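The latency figures above mix core cycles and nanoseconds. A minimal sketch of how they can be put on a single scale, assuming the cycle part scales with the 1728 MHz core clock from the header and the ns part is clock-independent (the function names are illustrative, not from the source):

```python
CORE_MHZ = 1728  # core clock taken from the header line


def to_cycles(cycles, ns=0.0, mhz=CORE_MHZ):
    """Express a 'cycles + ns' latency purely in core cycles."""
    return cycles + ns * mhz / 1000.0


def to_ns(cycles, ns=0.0, mhz=CORE_MHZ):
    """Express a 'cycles + ns' latency purely in nanoseconds."""
    return cycles * 1000.0 / mhz + ns


# L2 latency: 12 cycles + 14 ns, quoted as 36 cycles
print(round(to_cycles(12, 14)))   # 36
# RAM latency: 36 cycles + 110 ns
print(round(to_ns(36, 110), 1))   # 130.8
```

Under these assumptions the two forms of the L2 figure agree, and the RAM latency comes to roughly 131 ns at full clock.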
4 KB page mode (Android / 32-bit Linux)
- Data TLB L1: 32 entries. Miss penalty = 5 cycles. L1 TLB misses are not handled in parallel.
- Data TLB L2: 128 entries. Miss penalty = 65 cycles (possibly 2 accesses to L2 cache). L2 TLB misses are not handled in parallel.
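The TLB sizes above determine how large a working set can be before TLB misses start to show in the latency table. A small sketch of that arithmetic, assuming each TLB entry maps exactly one 4 KB page and that the tested sizes are the powers of two from 4 KB to 512 MB (helper names are hypothetical):

```python
PAGE_KB = 4  # 4 KB page mode


def tlb_reach_kb(entries, page_kb=PAGE_KB):
    """Memory covered by a TLB when each entry maps one page."""
    return entries * page_kb


def first_size_beyond_reach_kb(entries, sizes_kb, page_kb=PAGE_KB):
    """Smallest tested working-set size that no longer fits in the TLB."""
    reach = tlb_reach_kb(entries, page_kb)
    return next(s for s in sizes_kb if s > reach)


sizes_kb = [4 << i for i in range(18)]  # 4 KB, 8 KB, ..., 512 MB

print(tlb_reach_kb(32), first_size_beyond_reach_kb(32, sizes_kb))
print(tlb_reach_kb(128), first_size_beyond_reach_kb(128, sizes_kb))
```

The resulting sizes (256 KB for the L1 TLB, 1 MB for the L2 TLB) are the rows where the latency table annotates "L1 TLB miss" and "L2 TLB miss".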
Size    Latency         Increase      Description
4 K     3
8 K     4               1             + 3 (L1)
16 K    5               1
32 K    22              17            + 30 (L2)
64 K    29              7
128 K   33              4
256 K   38              5             + 5 (L1 TLB miss)
512 K   41              3
1 M     74 + 21 ns      33 + 21 ns    + 65 (L2 TLB miss)
2 M     90 + 57 ns      16 + 36 ns    + 110 ns (RAM)
4 M     98 + 84 ns      8 + 27 ns
8 M     102 + 99 ns     4 + 15 ns
16 M    104 + 106 ns    2 + 7 ns
32 M    105 + 113 ns    1 + 7 ns
64 M    106 + 123 ns    1 + 10 ns
128 M   106 + 135 ns    12 ns
256 M   106 + 160 ns    25 ns
512 M   106 + 190 ns    30 ns         + 110 ns (RAM) (Page walk to RAM)
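The Increase column in the table is the difference between each row's latency and the previous row's, taken separately for the cycle part and the ns part. A short sketch that recomputes it from the Latency column (the tuple layout and function names are illustrative choices, not from the source):

```python
# (size, cycles, ns) copied from the Latency column of the table
LATENCY = [
    ("4 K", 3, 0), ("8 K", 4, 0), ("16 K", 5, 0), ("32 K", 22, 0),
    ("64 K", 29, 0), ("128 K", 33, 0), ("256 K", 38, 0), ("512 K", 41, 0),
    ("1 M", 74, 21), ("2 M", 90, 57), ("4 M", 98, 84), ("8 M", 102, 99),
    ("16 M", 104, 106), ("32 M", 105, 113), ("64 M", 106, 123),
    ("128 M", 106, 135), ("256 M", 106, 160), ("512 M", 106, 190),
]


def increases(rows):
    """Row-to-row latency difference as (size, delta_cycles, delta_ns)."""
    out = []
    for (_, c0, n0), (size, c1, n1) in zip(rows, rows[1:]):
        out.append((size, c1 - c0, n1 - n0))
    return out


def fmt(cycles, ns):
    """Format a difference in the table's 'cycles + ns' notation."""
    parts = []
    if cycles:
        parts.append(str(cycles))
    if ns:
        parts.append(f"{ns} ns")
    return " + ".join(parts) or "0"


for size, d_cycles, d_ns in increases(LATENCY):
    print(f"{size:>6}  {fmt(d_cycles, d_ns)}")
```

The large jumps (32 K, 1 M, 2 M) line up with the rows annotated with a new level of the hierarchy in the Description column.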
MISC
- Branch misprediction penalty = 12 cycles
- 4-byte range crossing penalty = 1 cycle
- 64-byte range crossing penalty = 2 cycles
- L0 B/W (Parallel Random Read) = 1 cycle per access
- L1->L0 B/W (Parallel Random Read) = 3 cycles per L0 cache line (64 bytes)
- L1->L0 B/W (Read, 64-byte step) = 3 cycles per L0 cache line (64 bytes)
- L1->L0 B/W (Read, 64-byte step, pointer chasing) = 6 cycles per L0 cache line (64 bytes)
- L0/L1/L2 Write (Write, 4-8 byte step) = 1 cycle per write
- L0/L1/L2 Write (Write, 64-byte step, 4-byte write) = 5 cycles per write
- L2->L0 B/W (Parallel Random Read) = 6.5 cycles per L0 cache line (64 bytes)
- L2->L0 B/W (Read, 64-byte step) = 7 cycles per L0 cache line (64 bytes)
- L2->L0 B/W (Read, 64-byte step, pointer chasing) = 9 cycles per L0 cache line (64 bytes)
- RAM Read B/W (Parallel Random Read) = 47 ns per L2 cache line (128 bytes) = 2700 MB/s (blocked by TLB miss handling)
- RAM Read B/W (Read, 4-byte step) = 1960 MB/s (Thumb-2)
- RAM Read B/W (Read, 128-byte step) = 4400 MB/s
- RAM Read B/W (Read, 4-byte step, pointer chasing) = 580 MB/s
- RAM Read B/W (Read, 128-byte step, pointer chasing) = 3750 MB/s
- RAM Write B/W (Write, 4-byte step) = 4000 MB/s
- RAM Write B/W (Write, 8-64 byte step, 4-byte write) = 1800 MB/s (for step block)
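The bandwidth entries in this list are given in three units: cycles per cache line, ns per cache line, and MB/s. A rough sketch for converting between them, assuming 1 MB = 10^6 bytes and the 1728 MHz core clock from the header (both are assumptions; the function names are illustrative):

```python
CORE_MHZ = 1728  # core clock taken from the header line


def mb_per_s_from_ns(bytes_moved, ns):
    """MB/s (1 MB = 10**6 bytes) when bytes_moved take ns nanoseconds."""
    return bytes_moved / ns * 1000.0


def mb_per_s_from_cycles(bytes_moved, cycles, mhz=CORE_MHZ):
    """MB/s when bytes_moved take the given number of core cycles."""
    return bytes_moved * mhz / cycles


def ns_per_block(bytes_moved, mb_per_s):
    """Time in ns to move bytes_moved at the given MB/s rate."""
    return bytes_moved / mb_per_s * 1000.0


# 47 ns per 128-byte L2 line (RAM parallel random read)
print(round(mb_per_s_from_ns(128, 47)))
# 3 cycles per 64-byte L0 line (L1->L0)
print(mb_per_s_from_cycles(64, 3))
# 580 MB/s pointer chasing, expressed as ns per 128-byte line
print(round(ns_per_block(128, 580), 1))
```

With these assumptions, 47 ns per 128-byte line works out to about 2723 MB/s, which is consistent with the quoted 2700 MB/s after rounding.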
Decoding and Execution problems
The Krait core has performance problems decoding and executing some
instruction sequences. For example, dependency chains of simple ALU
instructions incur stalls.
ISA       Cycles / group   Instruction group (in sequence)
ARM32     1.64             eor r0, r0, r1;
ARM32     1.33             eor r0, r0, r1; eor r2, r2, r3;
ARM32     1.12             eor r0, r0, r1; eor r2, r2, r3; eor r4, r4, r5;
Thumb     1.50             eor r0, r1;
Thumb     1.00             eor r0, r1; eor r2, r3;
Thumb     1.74             eor r0, r1; eor r2, r1; eor r3, r1;
Thumb-2   1.00             eor r0, r0, r1; nop; nop;
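To compare the rows of the table directly, the cycles-per-group figures can be divided by the number of instructions in each group. A minimal sketch, assuming every semicolon-terminated item (including nop) counts as one instruction; the names are illustrative:

```python
# (ISA, cycles per group, instruction group) copied from the table
GROUPS = [
    ("ARM32", 1.64, "eor r0, r0, r1;"),
    ("ARM32", 1.33, "eor r0, r0, r1; eor r2, r2, r3;"),
    ("ARM32", 1.12, "eor r0, r0, r1; eor r2, r2, r3; eor r4, r4, r5;"),
    ("Thumb", 1.50, "eor r0, r1;"),
    ("Thumb", 1.00, "eor r0, r1; eor r2, r3;"),
    ("Thumb", 1.74, "eor r0, r1; eor r2, r1; eor r3, r1;"),
    ("Thumb-2", 1.00, "eor r0, r0, r1; nop; nop;"),
]


def instruction_count(group):
    """Number of semicolon-terminated instructions in a group."""
    return sum(1 for part in group.split(";") if part.strip())


def cycles_per_instruction(cycles, group):
    """Average cycles per instruction for one row of the table."""
    return cycles / instruction_count(group)


for isa, cycles, group in GROUPS:
    cpi = cycles_per_instruction(cycles, group)
    print(f"{isa:<8} {cpi:.3f}  {group}")
```

Read this way, a single dependent eor costs more than one cycle per instruction in both ARM32 and Thumb, while adding independent instructions to the group lowers the per-instruction cost in most rows.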