Apple A9 (Twister), 1850 MHz, RAM: 2 GB. iPhone SE.
Notes:
arm32: The instruction "ldr r1, [r2, r3, lsl #2]" is slow (6 cycles). The Apple A9 is probably not optimized for arm32 code.
arm64: CLANG can produce the slow instruction "ldr w1, [x2, w3, uxtw #2]" (6 cycles on A9) for array accesses of the form v = array[uint32_index]. Read more about it under "CLANG uxtw->lsl hack".
Size    Latency        Increase      Description
64 K    3
128 K   10             7             + 13 (L2)
256 K   13             3
512 K   15             2
1 M     16             1
2 M     16             0
4 M     29 +   7 ns    13 +  7 ns    + 29 + 24 ns (L3) + 7 (L1 TLB miss)
8 M     41 +  35 ns    12 + 28 ns
16 M    47 +  91 ns     6 + 56 ns    + 136 ns (RAM)
32 M    64 + 125 ns    17 + 34 ns    + 29 (L2 TLB miss)
64 M    73 + 142 ns     9 + 17 ns
128 M   77 + 150 ns     4 +  8 ns
256 M   79 + 155 ns     2 +  5 ns
512 M   80 + 158 ns     1 +  3 ns
Notes:
7z b : MIPS values are normalized to the Intel Core 2 CPU.
7z b -mm=* : MIPS and Effectiveness values are normalized to the AMD K8 CPU.
## iOS 10.2
## vanilla 16.04 + {CpuArch.h,7zCrcOpt.c,XzCrc64.c,XzCrc64Opt.c,Sha1.c,Sha256.c,Aes.c} from 17.00
## + __builtin_bswap{16,32,64} + CrcUpdateT8
# clang-4 -arch arm64 -mcpu=cyclone -O3
7z b -mmt1
7-Zip (a) [64] 16.04 : Copyright (c) 1999-2016 Igor Pavlov : 2016-10-04
p7zip Version 16.04-hash17+crct8 (locale=C,Utf16=off,HugeFiles=on,64 bits,2 CPUs LE)
LE
CPU Freq: 504 909 1180 1350 1613 1805 1824 1834 1838
RAM size: 2009 MB, # CPU hardware threads: 2
RAM usage: 435 MB, # Benchmark threads: 1
Compressing | Decompressing
Dict Speed Usage R/U Rating | Speed Usage R/U Rating
KiB/s % MIPS MIPS | KiB/s % MIPS MIPS
22: 2558 99 2520 2489 | 29396 99 2530 2510
23: 2365 100 2423 2411 | 28876 99 2519 2500
24: 2246 99 2429 2415 | 28215 100 2489 2477
25: 2168 99 2489 2476 | 27281 99 2445 2428
---------------------------------- | ------------------------------
Avr: 99 2465 2448 | 99 2496 2479
Tot: 99 2480 2463
7z b -mmt2
7-Zip (a) [64] 16.04 : Copyright (c) 1999-2016 Igor Pavlov : 2016-10-04
p7zip Version 16.04-hash17+crct8 (locale=C,Utf16=off,HugeFiles=on,64 bits,2 CPUs LE)
LE
CPU Freq: 499 898 1188 1389 1641 1833 1845 1835 1842
RAM size: 2009 MB, # CPU hardware threads: 2
RAM usage: 441 MB, # Benchmark threads: 2
Compressing | Decompressing
Dict Speed Usage R/U Rating | Speed Usage R/U Rating
KiB/s % MIPS MIPS | KiB/s % MIPS MIPS
22: 5505 172 3116 5356 | 57155 197 2471 4880
23: 5087 164 3160 5183 | 55978 198 2450 4846
24: 5123 169 3254 5508 | 54363 198 2415 4773
25: 5043 169 3405 5759 | 52711 198 2373 4692
---------------------------------- | ------------------------------
Avr: 169 3234 5452 | 198 2427 4797
Tot: 183 2830 5125
7z b -mmt4
7-Zip (a) [64] 16.04 : Copyright (c) 1999-2016 Igor Pavlov : 2016-10-04
p7zip Version 16.04-hash17+crct8 (locale=C,Utf16=off,HugeFiles=on,64 bits,2 CPUs LE)
LE
CPU Freq: 513 916 1196 1407 1650 1830 1843 1844 1843
RAM size: 2009 MB, # CPU hardware threads: 2
RAM usage: 882 MB, # Benchmark threads: 4
Compressing | Decompressing
Dict Speed Usage R/U Rating | Speed Usage R/U Rating
KiB/s % MIPS MIPS | KiB/s % MIPS MIPS
22: 6141 195 3060 5974 | 55483 197 2405 4734
23: 5930 197 3064 6042 | 54274 197 2380 4696
24: 5723 196 3136 6154 | 52653 197 2347 4622
25: 5590 197 3248 6383 | 51315 197 2314 4567
---------------------------------- | ------------------------------
Avr: 196 3127 6138 | 197 2361 4655
Tot: 197 2744 5397
7z b -mm=* -mmt1
7-Zip (a) [64] 16.04 : Copyright (c) 1999-2016 Igor Pavlov : 2016-10-04
p7zip Version 16.04-hash17+crct8 (locale=C,Utf16=off,HugeFiles=on,64 bits,2 CPUs LE)
LE
CPU Freq: 492 889 1184 1388 1641 1832 1844 1835 1843
RAM size: 2009 MB, # CPU hardware threads: 2
RAM usage: 225 MB, # Benchmark threads: 1
Method Speed Usage R/U Rating E/U Effec
KiB/s % MIPS MIPS % %
CPU 100 1845 1843
CPU 100 1840 1844
CPU 100 1845 1843 100 100
LZMA:x1 13026 99 4794 4762 260 258
29111 100 2378 2370 129 129
LZMA:x5:mt1 2250 100 2822 2812 153 153
28349 100 2401 2391 130 130
LZMA:x5:mt2 5152 169 3811 6436 207 349
28425 99 2411 2398 131 130
Deflate:x1 30030 100 3820 3813 207 207
90040 100 2803 2798 152 152
Deflate:x5 9445 100 3655 3637 198 197
89790 100 2794 2787 152 151
Deflate:x7 3352 100 3722 3715 202 202
90721 99 2833 2815 154 153
Deflate64:x5 8757 100 3797 3784 206 205
89858 100 2820 2811 153 153
BZip2:x1 5101 100 3094 3082 168 167
24657 100 2682 2673 146 145
BZip2:x5 4354 100 3643 3634 198 197
20641 100 4065 4051 221 220
BZip2:x5:mt2 7782 192 3383 6495 184 352
32565 173 3692 6391 200 347
BZip2:x7 1314 100 3417 3405 185 185
20490 100 4029 4018 219 218
PPMD:x1 3942 100 4092 4077 222 221
3259 100 3845 3839 209 208
PPMD:x5 2575 99 4388 4364 238 237
2175 100 4089 4076 222 221
Delta:4 530765 100 3273 3261 178 177
347619 100 2142 2136 116 116
BCJ 1214521 100 4999 4975 271 270
1218895 100 4982 4993 270 271
AES256CBC:1 113274 100 2792 2784 152 151
116539 100 2868 2864 156 155
AES256CBC:2
CRC32:1 177702 100 1298 1294 70 70
CRC32:4 602622 100 1348 1345 73 73
CRC32:8 902418 100 1227 1224 67 66
CRC64 531344 100 1091 1088 59 59
SHA256 166773 100 3410 3402 185 185
SHA1 376511 100 3532 3524 192 191
BLAKE2sp 243593 100 5384 5359 292 291
CPU 100 1833 1829
------------------------------------------------------
Tot: 110 3129 3491 173 189
7z b -mm=* -mmt2
7-Zip (a) [64] 16.04 : Copyright (c) 1999-2016 Igor Pavlov : 2016-10-04
p7zip Version 16.04-hash17+crct8 (locale=C,Utf16=off,HugeFiles=on,64 bits,2 CPUs LE)
LE
CPU Freq: 503 901 1194 1390 1646 1833 1825 1845 1844
RAM size: 2009 MB, # CPU hardware threads: 2
RAM usage: 450 MB, # Benchmark threads: 2
Method Speed Usage R/U Rating E/U Effec
KiB/s % MIPS MIPS % %
CPU 198 1791 3552
CPU 198 1793 3557
CPU 198 1790 3539 101 200
LZMA:x1 25117 197 4653 9182 263 519
56757 198 2335 4622 132 261
LZMA:x5:mt1 4710 197 2982 5885 169 333
54343 198 2318 4583 131 259
LZMA:x5:mt2 5700 196 3626 7121 205 402
54251 198 2316 4575 131 259
Deflate:x1 58068 198 3730 7373 211 417
175370 198 2757 5449 156 308
Deflate:x5 18566 198 3613 7149 204 404
175615 198 2760 5452 156 308
Deflate:x7 6505 198 3646 7208 206 407
176203 197 2774 5468 157 309
Deflate64:x5 17054 197 3732 7370 211 416
175343 198 2774 5485 157 310
BZip2:x1 9878 198 3020 5968 171 337
48153 198 2636 5220 149 295
BZip2:x5 7959 198 3361 6643 190 375
32866 197 3267 6451 185 365
BZip2:x5:mt2 7599 197 3220 6342 182 358
28021 197 2794 5500 158 311
BZip2:x7 2396 194 3205 6209 181 351
32661 197 3258 6405 184 362
PPMD:x1 7656 198 4008 7919 226 447
6326 198 3771 7450 213 421
PPMD:x5 4779 197 4109 8099 232 458
3992 197 3800 7482 215 423
Delta:4 1032456 198 3200 6343 181 358
583218 196 1829 3583 103 202
BCJ 2360323 197 4900 9668 277 546
2379895 198 4930 9748 279 551
AES256CBC:1 218206 198 2707 5363 153 303
227263 198 2817 5585 159 316
AES256CBC:2
CRC32:1 346198 198 1273 2520 72 142
CRC32:4 1175244 198 1323 2623 75 148
CRC32:8 1759603 198 1203 2386 68 135
CRC64 1033592 198 1070 2117 60 120
SHA256 326090 198 3354 6652 190 376
SHA1 727875 198 3448 6813 195 385
BLAKE2sp 474195 198 5261 10432 297 589
CPU 198 1793 3552
------------------------------------------------------
Tot: 197 3013 5945 170 336
CLANG uxtw->lsl hack:
If C code uses a 32-bit unsigned integer variable as an index to access an array:
v = array[uint32_index];
CLANG 3.7 / 4.0 can produce an instruction like this:
ldr w1, [x2, w3, uxtw #2] // 6 cycles on Apple A9
That instruction can be replaced with a similar one:
ldr w1, [x2, x3, lsl #2] // 4 cycles on Apple A9
These instructions are not strictly equivalent: uxtw zero-extends the low 32 bits of w3, while the lsl form uses all 64 bits of x3, so the results differ if the upper 32 bits of x3 are not zero. Nevertheless, all 7-Zip benchmark tests work correctly after the hack.
Some 7-Zip benchmarks run 10-20% faster on Apple A9 after the hack, and the average gain is 2-3%.
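The difference between the two addressing forms can be modelled in plain C. This is only an illustrative sketch: the function names (addr_uxtw, addr_lsl, load_u32_index) are hypothetical and are not part of 7-Zip or CLANG. It models the address arithmetic of the two ldr forms, not their timing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Address formed by "ldr w1, [x2, w3, uxtw #2]":
   only the low 32 bits of the index register are used,
   zero-extended to 64 bits and then scaled by 4. */
uint64_t addr_uxtw(uint64_t base, uint64_t index_reg)
{
    return base + ((uint64_t)(uint32_t)index_reg << 2);
}

/* Address formed by "ldr w1, [x2, x3, lsl #2]":
   all 64 bits of the index register are used and scaled by 4. */
uint64_t addr_lsl(uint64_t base, uint64_t index_reg)
{
    return base + (index_reg << 2);
}

/* The C source pattern from the text: a 32-bit unsigned index.
   The count parameter is an assumption added here for a bounds check. */
uint32_t load_u32_index(const uint32_t *array, size_t count, uint32_t index)
{
    assert(index < count);
    return array[index];
}
```

With a clean index register (upper 32 bits zero) both functions return the same address, which is the case where the replacement is safe. If the upper half of the register holds stale bits, addr_lsl moves away from addr_uxtw, so the replacement depends on the compiler having already zeroed those bits.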
# clang-4 -arch arm64 -mcpu=cyclone -O3
# sed -i -e '/\(st\|ld\)r.*[xw].*x.*w.* uxtw #/ {s/w\([^,]*\), uxtw/x\1, lsl/}'
# -e '/\(st\|ld\)rb.*w[^,]*, uxtw\]/ {s/w\([^,]*\), uxtw/x\1/}' *.s
7z b -mmt1
7-Zip (a) [64] 16.04 : Copyright (c) 1999-2016 Igor Pavlov : 2016-10-04
p7zip Version 16.04-hash17+crct8-lsl-v3 (locale=C,Utf16=off,HugeFiles=on,64 bits,2 CPUs LE)
LE
CPU Freq: 495 884 1180 1380 1641 1831 1822 1843 1844
RAM size: 2009 MB, # CPU hardware threads: 2
RAM usage: 435 MB, # Benchmark threads: 1
Compressing | Decompressing
Dict Speed Usage R/U Rating | Speed Usage R/U Rating
KiB/s % MIPS MIPS | KiB/s % MIPS MIPS
22: 2594 99 2543 2524 | 30455 99 2614 2600
23: 2380 99 2438 2426 | 30022 100 2612 2599
24: 2264 100 2446 2435 | 29202 100 2575 2564
25: 2188 100 2511 2499 | 28167 100 2519 2507
---------------------------------- | ------------------------------
Avr: 99 2485 2471 | 100 2580 2567
Tot: 99 2532 2519
7z b -mm=* -mmt1
7-Zip (a) [64] 16.04 : Copyright (c) 1999-2016 Igor Pavlov : 2016-10-04
p7zip Version 16.04-hash17+crct8-lsl-v3 (locale=C,Utf16=off,HugeFiles=on,64 bits,2 CPUs LE)
LE
CPU Freq: 505 904 1197 1394 1646 1833 1843 1837 1843
RAM size: 2009 MB, # CPU hardware threads: 2
RAM usage: 225 MB, # Benchmark threads: 1
Method Speed Usage R/U Rating E/U Effec
KiB/s % MIPS MIPS % %
CPU 100 1841 1839
CPU 100 1845 1843
CPU 100 1843 1843 100 100
LZMA:x1 13202 99 4868 4826 264 262
29969 100 2445 2440 133 132
LZMA:x5:mt1 2270 100 2843 2836 154 154
29049 100 2461 2450 134 133
LZMA:x5:mt2 5236 170 3839 6542 208 355
29242 100 2473 2466 134 134
Deflate:x1 30416 100 3863 3862 210 210
93075 100 2903 2892 158 157
Deflate:x5 9489 100 3668 3654 199 198
93232 100 2906 2894 158 157
Deflate:x7 3382 100 3764 3748 204 203
93939 100 2920 2915 158 158
Deflate64:x5 8769 100 3803 3790 206 206
92910 100 2908 2906 158 158
BZip2:x1 5265 100 3193 3181 173 173
25094 99 2736 2720 148 148
BZip2:x5 4513 100 3773 3766 205 204
20853 100 4110 4093 223 222
BZip2:x5:mt2 7996 192 3482 6674 189 362
32607 170 3766 6400 204 347
BZip2:x7 1347 100 3503 3491 190 189
21122 100 4143 4142 225 225
PPMD:x1 3959 100 4108 4095 223 222
3279 100 3872 3861 210 210
PPMD:x5 2578 100 4388 4370 238 237
2197 100 4132 4117 224 223
Delta:4 530891 100 3275 3262 178 177
356465 100 2193 2190 119 119
BCJ 1204041 99 4970 4932 270 268
1221370 100 5009 5003 272 271
AES256CBC:1 128023 100 3146 3146 171 171
131847 100 3251 3240 176 176
AES256CBC:2
CRC32:1 224780 100 1637 1636 89 89
CRC32:4 698401 100 1563 1559 85 85
CRC32:8 994911 100 1353 1349 73 73
CRC64 631082 100 1295 1292 70 70
SHA256 167135 100 3426 3410 186 185
SHA1 374018 100 3512 3501 191 190
BLAKE2sp 242960 100 5326 5345 289 290
CPU 100 1846 1843
------------------------------------------------------
Tot: 110 3196 3567 176 194
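The gain from the hack can be checked by comparing Rating values of the two "7z b -mm=* -mmt1" runs (before and after the sed replacement). The helper below is a hypothetical sketch for that arithmetic; the name gain_percent is not taken from 7-Zip.

```c
#include <assert.h>

/* Relative change of a benchmark rating, in percent.
   before: rating (MIPS) from the run without the hack
   after:  rating (MIPS) from the run with the hack */
double gain_percent(double before, double after)
{
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}
```

For example, gain_percent(3491, 3567) on the Tot rows gives about 2.2%, gain_percent(2784, 3146) on AES256CBC:1 gives about 13%, and gain_percent(1294, 1636) on CRC32:1 gives about 26%.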