Intel i7-3770 (Ivy Bridge), 3.4 GHz (Turbo Boost off), 22 nm. RAM: 4 GB (Single PC3-12800 10-10-10-28).
Cache latency (in cycles) for reads by different cores from different L3 slices, with additional ALU ops between loads:
ALU OPs                            0   1   2   3   4   5   6   7   8
L1                                 4   5   5   5   5   5   5   5   5
L2                                12  12  12  12  13  12  12  12  12
L3 core 0,3                       30  30  30  31  30  30  30  30  30
L3 core 1,2                       29  29  29  30  29  29  29  29  29
core-N slice-N                    26  27  26  27  26  27  26  27  26
core-0 slice-1 / core-1 slice-0   28  31  30  29  28  29  28  29  28
core-0 slice-2                    32  33  32  33  32  33  32  34  32
core-0 slice-3                    34  33  34  35  34  33  34  33  34
core-1 slice-2                    32  31  30  31  30  29  30  29  30
core-1 slice-3                    32  33  32  33  32  33  32  33  32
The total L3 iteration latency is always an even number when the ALU ops are included.
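The even-total observation can be checked against the "core-N slice-N" row of the latency table. The following is a minimal sketch, assuming that the table reports latency with the ALU ops excluded and that each ALU op adds exactly one cycle to the load-to-load iteration; the function name is illustrative only.

```python
# Measured latencies (cycles) from the "core-N slice-N" row,
# indexed by the number of ALU ops between loads (0..8).
CORE_N_SLICE_N = [26, 27, 26, 27, 26, 27, 26, 27, 26]


def iteration_totals(latencies):
    """Add the ALU op count back to each reported latency.

    Assumption: each ALU op contributes one cycle to the iteration.
    """
    return [latency + ops for ops, latency in enumerate(latencies)]


totals = iteration_totals(CORE_N_SLICE_N)
all_even = all(total % 2 == 0 for total in totals)
print(totals, all_even)
```

Under these assumptions the reported value alternates between 26 and 27 because the total is rounded up to the next even cycle count.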
L3 Latency penalty for reading from different L3 Slices:
Core-0 =##= Slice-0
              || 2c
Core-1 =##= Slice-1
              || 4c
Core-2 =##= Slice-2
              || 2c
Core-3 =##= Slice-3

Note: the large latency between Slice-1 and Slice-2 may be an effect of slice polarity, where some structures operate with a 2-cycle period.
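The penalties can be combined with the local-slice latency to estimate the latency for any core/slice pair. This is a rough sketch, assuming the per-link penalties simply add up along the path between slices and taking 26 cycles (the core-N slice-N value at 0 ALU ops) as the base; the names are illustrative.

```python
BASE_LOCAL = 26          # core-N slice-N latency at 0 ALU ops (cycles)
HOP_PENALTY = [2, 4, 2]  # extra cycles for links slice 0-1, 1-2, 2-3


def estimated_l3_latency(core, l3_slice):
    """Estimate the L3 latency in cycles for a core reading from a slice.

    Assumption: penalties of all links between the two positions are summed.
    """
    lo, hi = sorted((core, l3_slice))
    return BASE_LOCAL + sum(HOP_PENALTY[lo:hi])


for target in range(4):
    print("core-0 slice-%d: %d" % (target, estimated_l3_latency(0, target)))
```

For core-0 this reproduces the 0-ALU-op column of the table (28, 32, 34). For core-1 slice-2 the model gives 30, which matches the table only from 2 ALU ops onward; the measured value at 0 ALU ops is 32.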
To read data from a required slice, we use the hash (XOR) functions for the L3 slice number, computed from physical address bits, as described in [1].
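The general shape of such an XOR hash can be sketched as follows. This is an illustration only: the bit positions below are hypothetical placeholders, not the functions published in [1], and the helper names are made up for this example. With 4 slices, the slice number has two bits, each the XOR (parity) of a selected set of physical address bits.

```python
def parity(value):
    """Return the XOR of all bits of value (1 if an odd number are set)."""
    return bin(value).count("1") & 1


def mask_from_bits(bits):
    """Build an integer mask with the given bit positions set."""
    mask = 0
    for bit in bits:
        mask |= 1 << bit
    return mask


# HYPOTHETICAL bit positions, chosen only to make the example runnable.
# Substitute the real per-bit lists from [1] before using this for anything.
PLACEHOLDER_BITS_O0 = (17, 18, 20)
PLACEHOLDER_BITS_O1 = (18, 19, 21)


def slice_index(paddr, bits_o0=PLACEHOLDER_BITS_O0, bits_o1=PLACEHOLDER_BITS_O1):
    """Map a physical address to a slice number 0..3 via two XOR functions."""
    o0 = parity(paddr & mask_from_bits(bits_o0))
    o1 = parity(paddr & mask_from_bits(bits_o1))
    return (o1 << 1) | o0


print(slice_index(1 << 18))
```

To target one slice from one core, a benchmark would keep only the addresses whose computed slice number equals the desired one. That requires knowing the physical address of each buffer line.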
Note: the L3 cache (LLC) in Sandy Bridge uses a pseudo-LRU replacement policy, but the LLC replacement policy in Ivy Bridge looks like random replacement.
Size      Latency       Increase     Description
32 K      4
64 K      8             4            + 8 (L2)
128 K     10            2
256 K     11            1
512 K     21            10           + 18 (L3)
1 M       26            5
2 M       28            2
4 M       29            1
8 M       30            1
16 M      30 + 27 ns    27 ns        + 53 ns (RAM)
32 M      30 + 40 ns    13 ns
64 M      30 + 47 ns    7 ns
128 M     38 + 50 ns    8 + 3 ns     + 16 (TLB miss)
256 M     42 + 52 ns    4 + 2 ns
512 M     44 + 53 ns    2 + 1 ns
1024 M    45 + 53 ns    1
2048 M    46 + 53 ns    1
Size      Latency       Increase     Description
32 K      4
64 K      8             4            + 8 (L2)
128 K     10            2
256 K     14            4
512 K     25            11           + 18 (L3) + 7 (L1 TLB miss)
1 M       31            6
2 M       34            3
4 M       41            7            + 9 (L2 TLB miss)
8 M       44            3
16 M      45 + 27 ns    1 + 27 ns    + 53 ns (RAM)
32 M      46 + 40 ns    1 + 13 ns
64 M      49 + 47 ns    3 + 7 ns
128 M     64 + 50 ns    15 + 3 ns    + 9 (PDE cache miss) + 19 (Page walk to L3)
256 M     69 + 52 ns    5 + 2 ns     +
512 M     76 + 53 ns    7 + 1 ns
1024 M    84 + 53 ns    12
2048 M    94 + 53 ns    10
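The mixed "cycles + ns" entries in the two tables can be converted to a single figure in nanoseconds using the 3.4 GHz clock stated at the top. A minimal sketch, assuming the core clock stays fixed at 3.4 GHz (Turbo Boost off); the function name is illustrative.

```python
CPU_GHZ = 3.4  # core clock from the system description (Turbo Boost off)


def total_ns(cycles, extra_ns=0.0, ghz=CPU_GHZ):
    """Convert a 'cycles + ns' latency entry to nanoseconds."""
    return cycles / ghz + extra_ns


# The "16 M" entry "30 + 27 ns" and the "2048 M" entry "94 + 53 ns".
print(round(total_ns(30, 27), 1))
print(round(total_ns(94, 53), 1))
```

The same function applies to pure cycle counts by leaving `extra_ns` at its default, for example `total_ns(4)` for the L1 latency.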
Branch misprediction penalty = 14 cycles.
[1]: Reverse Engineering Intel Last-Level Cache Complex Addressing Using Performance Counters. Maurice, 2015