Intel i7-3770 (Ivy Bridge), 3.4 GHz (Turbo Boost off), 22 nm. RAM: 4 GB (Single PC3-12800 10-10-10-28).
Cache latency in cycles, including reads from different L3 slices by different cores, with 0 to 8 additional ALU ops between loads:
  0   1   2   3   4   5   6   7   8   ALU OPs   
  4   5   5   5   5   5   5   5   5   L1
 12  12  12  12  13  12  12  12  12   L2
            
 30  30  30  31  30  30  30  30  30   L3 core 0,3
 29  29  29  30  29  29  29  29  29   L3 core 1,2
                                
 26  27  26  27  26  27  26  27  26   core-N slice-N
 28  31  30  29  28  29  28  29  28   core-0 slice-1 / core-1 slice-0
 32  33  32  33  32  33  32  34  32   core-0 slice-2
 34  33  34  35  34  33  34  33  34   core-0 slice-3
                                
 32  31  30  31  30  29  30  29  30   core-1 slice-2
 32  33  32  33  32  33  32  33  32   core-1 slice-3
For a fixed core/slice pair, the total L3 iteration latency (load latency plus the ALU ops between loads) is always an even number.
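The parity claim can be checked directly against the per-slice rows of the table above. The sketch below is illustrative only: it copies those rows by hand, adds the ALU op count to each latency, and lists any total that comes out odd. The names `ROWS` and `odd_totals` are made up for this example.

```python
# Per-slice latency rows copied from the table above (cycles),
# one value per ALU op count 0..8.
ALU_OPS = range(9)

ROWS = {
    "core-N slice-N":                  [26, 27, 26, 27, 26, 27, 26, 27, 26],
    "core-0 slice-1 / core-1 slice-0": [28, 31, 30, 29, 28, 29, 28, 29, 28],
    "core-0 slice-2":                  [32, 33, 32, 33, 32, 33, 32, 34, 32],
    "core-0 slice-3":                  [34, 33, 34, 35, 34, 33, 34, 33, 34],
    "core-1 slice-2":                  [32, 31, 30, 31, 30, 29, 30, 29, 30],
    "core-1 slice-3":                  [32, 33, 32, 33, 32, 33, 32, 33, 32],
}


def odd_totals(rows):
    """Return (row name, ALU ops, total) for every odd total latency."""
    return [
        (name, ops, latency + ops)
        for name, latencies in rows.items()
        for ops, latency in zip(ALU_OPS, latencies)
        if (latency + ops) % 2 == 1
    ]


if __name__ == "__main__":
    for name, ops, total in odd_totals(ROWS):
        print(f"{name}: {ops} ALU ops -> total {total} (odd)")
```

With the rows as printed, the only entry flagged is core-0 slice-2 at 7 ALU ops (34 + 7 = 41); every other per-slice total is even.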
L3 Latency penalty for reading from different L3 Slices:
Core-0 =##= Slice-0
        || 2c
Core-1 =##= Slice-1
        || 4c
Core-2 =##= Slice-2
        || 2c
Core-3 =##= Slice-3  
Note: the larger penalty between Slice-1 and Slice-2 may be an effect of slice
polarity, where some structures operate with a 2-cycle period.
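As a rough sketch, the diagram can be read as an additive model: start from the local core-N slice-N latency (26 cycles at 0 ALU ops) and add the penalty of every hop crossed on the way to the target slice. The additive reading, and the names `HOP_PENALTY`, `LOCAL_LATENCY` and `l3_latency`, are assumptions made for this example, not something the measurements state directly.

```python
# Penalty in cycles for each hop between adjacent slices, read off the diagram:
# slice 0 <-> 1 = 2c, slice 1 <-> 2 = 4c, slice 2 <-> 3 = 2c.
HOP_PENALTY = [2, 4, 2]

# Latency for core-N reading slice-N at 0 ALU ops, from the table.
LOCAL_LATENCY = 26


def l3_latency(core: int, slice_: int) -> int:
    """Estimate L3 latency in cycles under the assumed additive hop model."""
    lo, hi = sorted((core, slice_))
    return LOCAL_LATENCY + sum(HOP_PENALTY[lo:hi])


if __name__ == "__main__":
    for core in range(4):
        print(core, [l3_latency(core, s) for s in range(4)])
```

For core 0 this gives 26, 28, 32 and 34 cycles for slices 0 to 3, matching the 0 ALU op column of the table. For core-1 slice-2 the model gives 30 cycles, while the table shows 32 at 0 ALU ops and 29 to 30 at higher ALU op counts.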
To read data from a specific slice, we compute the L3 slice number with the XOR hash functions over physical address bits described in [1].
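The general shape of such a hash can be sketched as follows: each bit of the slice number is the XOR (parity) of a selected set of physical address bits. The bit sets in `PLACEHOLDER_MASKS` below are arbitrary placeholders chosen only to make the example runnable; they are not the functions from [1], which should be taken from that paper. The helper names are likewise hypothetical.

```python
def mask_from_bits(bits):
    """Build an integer mask with the given bit positions set."""
    mask = 0
    for b in bits:
        mask |= 1 << b
    return mask


def parity(x: int) -> int:
    """XOR of all bits of x (1 if an odd number of bits are set)."""
    return bin(x).count("1") & 1


def slice_index(phys_addr: int, masks) -> int:
    """Slice number whose bit i is the parity of phys_addr & masks[i]."""
    index = 0
    for i, mask in enumerate(masks):
        index |= parity(phys_addr & mask) << i
    return index


# PLACEHOLDER bit sets, NOT the real Ivy Bridge hash: two output bits
# select one of four slices.
PLACEHOLDER_MASKS = [mask_from_bits([6, 8]), mask_from_bits([7, 9])]

if __name__ == "__main__":
    for addr in (0x0, 0x40, 0x80, 0xC0, 0x140):
        print(hex(addr), "-> slice", slice_index(addr, PLACEHOLDER_MASKS))
```

With the real masks substituted, a benchmark can filter candidate addresses by `slice_index` so that every load in a test loop targets the same slice.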
Note: the L3 cache (LLC) in Sandy Bridge uses a pseudo-LRU replacement policy, but the LLC replacement policy in Ivy Bridge looks like random replacement.
  Size    Latency       Increase    Description
  32 K     4
  64 K     8             4          + 8 (L2)
 128 K    10             2
 256 K    11             1
 512 K    21            10          + 18 (L3)
   1 M    26             5
   2 M    28             2
   4 M    29             1
   8 M    30             1
  16 M    30 + 27 ns    27 ns       + 53 ns (RAM)
  32 M    30 + 40 ns    13 ns
  64 M    30 + 47 ns     7 ns
 128 M    38 + 50 ns     8 + 3 ns   + 16 (TLB miss)
 256 M    42 + 52 ns     4 + 2 ns
 512 M    44 + 53 ns     2 + 1 ns
1024 M    45 + 53 ns     1
2048 M    46 + 53 ns     1

  Size    Latency       Increase     Description
  32 K     4
  64 K     8             4           + 8 (L2)
 128 K    10             2
 256 K    14             4
 512 K    25            11           + 18 (L3) + 7 (L1 TLB miss)
   1 M    31             6
   2 M    34             3
   4 M    41             7           + 9 (L2 TLB miss)
   8 M    44             3
  16 M    45 + 27 ns     1 + 27 ns   + 53 ns (RAM)
  32 M    46 + 40 ns     1 + 13 ns
  64 M    49 + 47 ns     3 + 7 ns
 128 M    64 + 50 ns    15 + 3 ns    + 9 (PDE cache miss) + 19 (Page walk to L3)
 256 M    69 + 52 ns     5 + 2 ns    +
 512 M    76 + 53 ns     7 + 1 ns
1024 M    84 + 53 ns    12
2048 M    94 + 53 ns    10
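Entries such as "30 + 27 ns" mix two units. Assuming they mean 30 core cycles plus 27 ns, and taking the 3.4 GHz clock stated for this machine, the total can be converted to nanoseconds with a small helper. The function name `total_ns` and the sample values are chosen for illustration only.

```python
CORE_GHZ = 3.4  # core clock from the system description (Turbo Boost off)


def total_ns(cycles: float, extra_ns: float = 0.0, ghz: float = CORE_GHZ) -> float:
    """Convert a 'cycles + ns' latency entry to nanoseconds.

    One cycle at `ghz` GHz lasts 1 / ghz nanoseconds.
    """
    return cycles / ghz + extra_ns


if __name__ == "__main__":
    # A few entries taken from the tables above: (size, cycles, ns).
    for size, cycles, ns in [("8 M", 30, 0), ("16 M", 30, 27), ("2048 M", 94, 53)]:
        print(f"{size:>7}: {total_ns(cycles, ns):.1f} ns")
```

For example, 30 cycles + 27 ns works out to about 35.8 ns, and 94 cycles + 53 ns to about 80.6 ns.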
Branch misprediction penalty = 14 cycles.
[1]: Reverse Engineering Intel Last-Level Cache Complex Addressing Using Performance Counters. Maurice, 2015