MIPS 74K
Atheros AR9344 (MIPS 74K), 560MHz, 128 MB (16-bit DDR2-667D x 2). TP-Link WDR3600. 
-  L1 Data cache = 32 KB. 32 B/line. 4-way. Write allocate
-  DTLB size = 32 items (2 pages per item).
-  L1 Data cache latency = 4 cycles.
-  The MIPS ISA doesn't support complex addressing modes in the LOAD instruction. The latency for a LOAD from an integer array (n=p[n]) is 7 cycles.
-  RAM Latency = 4 cycles + 155 ns (32 cycles + 100 ns ?)
-  DTLB miss penalty  = 40 cycles + 100 ns ?
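Latency figures like the ones in this list are typically obtained with a chain of dependent loads of the form n = p[n]. The following is a minimal sketch of such a loop in C, not the benchmark actually used for these numbers. The names build_chain, chase and ns_per_load are hypothetical, and the sketch assumes a POSIX system with clock_gettime.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Small xorshift PRNG so the sketch does not depend on RAND_MAX. */
static uint32_t xorshift32(uint32_t *s)
{
    uint32_t x = *s;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *s = x;
}

/* Fill p[0..n-1] with a random single-cycle permutation (Sattolo's
   algorithm), so that n = p[n] visits every element before repeating. */
static void build_chain(uint32_t *p, size_t n, uint32_t seed)
{
    assert(n > 1);
    uint32_t s = seed ? seed : 1u;          /* xorshift state must be nonzero */
    for (size_t i = 0; i < n; i++)
        p[i] = (uint32_t)i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = xorshift32(&s) % i;      /* j in [0, i-1] keeps one cycle */
        uint32_t t = p[i];
        p[i] = p[j];
        p[j] = t;
    }
}

/* Follow the chain for `steps` dependent loads; returns the final index. */
static uint32_t chase(const uint32_t *p, uint32_t start, size_t steps)
{
    uint32_t n = start;
    while (steps--)
        n = p[n];
    return n;
}

/* Average nanoseconds per dependent load. The final index is stored in
   *sink so the compiler cannot drop the loop. */
static double ns_per_load(const uint32_t *p, size_t steps, uint32_t *sink)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    *sink = chase(p, 0, steps);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)steps;
}
```

To reproduce a latency-versus-size curve, one would allocate arrays from 32 KB up to 64 MB, call build_chain on each, and report ns_per_load with a step count several times larger than the element count.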
 4 KB pages 
  Size        Latency        Increase     Description
  32 K     4                              TLB + L1
  64 K     4 +  80 ns           80 ns     + 150 ns RAM
 128 K     4 + 120 ns           40 ns
 256 K     4 + 140 ns           20 ns          
 512 K    24 + 200 ns      20 + 60 ns     + 40 + 100 ns (TLB miss)
   1 M    34 + 225 ns      10 + 25 ns
   2 M    39 + 237 ns       5 + 12 ns               
   4 M    42 + 246 ns       3 +  9 ns               
   8 M    44 + 260 ns       2 + 14 ns     
  16 M    44 + 290 ns           30 ns     + ??? ns (Page walk)
  32 M    44 + 340 ns           50 ns     
  64 M    44 + 370 ns           30 ns     
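The table entries mix two units. Reading "24 + 200 ns" as 24 CPU cycles plus 200 ns (the same notation as "RAM Latency = 4 cycles + 155 ns"), the total can be converted to a single unit using the 560 MHz clock of the tested AR9344. A small sketch in C; the helper names entry_to_ns and entry_to_cycles are hypothetical.

```c
#include <assert.h>

/* Clock of the tested AR9344, taken from the header line: 560 MHz. */
#define CPU_MHZ 560.0

/* Total latency in ns for a table entry written as "<cycles> + <ns> ns". */
static double entry_to_ns(double cycles, double ns)
{
    assert(cycles >= 0.0 && ns >= 0.0);
    return cycles * 1000.0 / CPU_MHZ + ns;
}

/* The same latency expressed in CPU cycles. */
static double entry_to_cycles(double cycles, double ns)
{
    assert(cycles >= 0.0 && ns >= 0.0);
    return cycles + ns * CPU_MHZ / 1000.0;
}
```

For example, entry_to_ns(24, 200) is about 242.9 ns, and entry_to_cycles(4, 155) is about 90.8 cycles.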
 16 KB pages 
  Size        Latency        Increase     Description
  32 K     4                              TLB + L1
  64 K     4 +  80 ns           80 ns     + 155 ns RAM
 128 K     4 + 120 ns           40 ns
 256 K     4 + 140 ns           20 ns          
 512 K     4 + 150 ns           10 ns
   1 M     4 + 155 ns            5 ns
   2 M    24 + 207 ns      20 + 52 ns     + 40 + 100 ns (TLB miss)               
   4 M    34 + 230 ns      10 + 23 ns               
   8 M    39 + 243 ns       5 + 13 ns     
  16 M    43 + 248 ns       3 +  5 ns     
  32 M    44 + 259 ns       2 + 11 ns     
  64 M    44 + 294 ns           35 ns     + ??? ns (Page walk)
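The point where the TLB miss penalty first shows up in each table follows from the TLB reach: 32 entries with 2 pages per entry cover 256 KB with 4 KB pages and 1 MB with 16 KB pages, so the penalty appears at the next size up (512 K and 2 M). A worked version of that arithmetic in C; tlb_reach is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes that can be mapped without a TLB miss:
   entries * pages per entry * page size. */
static size_t tlb_reach(size_t entries, size_t pages_per_entry, size_t page_size)
{
    assert(entries > 0 && pages_per_entry > 0 && page_size > 0);
    return entries * pages_per_entry * page_size;
}
```

tlb_reach(32, 2, 4096) gives 262144 bytes (256 KB), and tlb_reach(32, 2, 16384) gives 1048576 bytes (1 MB).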
- 4-byte range crossing penalty = 320 cycles
- CPU can't process several TLB misses concurrently.
- L1 B/W (Parallel Random Read) = 1 cycle per one access
- RAM Read B/W (Parallel Random Read) = 44 ns / cache line. (720 MB/S)
- RAM Read B/W (Read, 4 Bytes step) = 200 MB/s
- RAM Read B/W (Read, 32 Bytes step) = 860 MB/s
- RAM Read B/W (Read, 32 Bytes step, pointer-chasing) = 260 MB/s (no hardware prefetch) 
- RAM Write (4 Bytes step) = 220 MB/s
- RAM Write (32 Bytes step) = 120 ns per write. Write Allocate? 270 MB/s (32-byte cache line)
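The bandwidth figures in this list that are given both as time per access and as MB/s can be cross-checked with simple arithmetic, taking 1 MB as 10^6 bytes. A short sketch in C; mb_per_s and ns_per_access are hypothetical helper names.

```c
#include <assert.h>

/* Bandwidth in MB/s (10^6 bytes per second) when `bytes` are
   transferred every `ns` nanoseconds. */
static double mb_per_s(double bytes, double ns)
{
    assert(bytes > 0.0 && ns > 0.0);
    return bytes / ns * 1000.0;
}

/* Inverse: nanoseconds per access of `bytes` at a given MB/s rate. */
static double ns_per_access(double bytes, double mbps)
{
    assert(bytes > 0.0 && mbps > 0.0);
    return bytes * 1000.0 / mbps;
}
```

mb_per_s(32, 44) is about 727 MB/s, close to the quoted 720 MB/s for parallel random reads, and mb_per_s(32, 120) is about 267 MB/s, close to the quoted 270 MB/s for 32-byte-step writes.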
Branch misprediction penalty = 10 cycles.
Cache aliasing problem (32 KB data cache, 4-way, 4 KB pages): 
There is some penalty for data cache accesses if there is 
uninitialized data in the cache (data from another process?). 
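The notes do not explain the mechanism behind this penalty. What can be stated is the standard arithmetic for a virtually indexed cache: a 32 KB 4-way cache has 8 KB per way, so with 4 KB pages one index bit lies above the page offset and a physical page can land at two different index positions ("colors"). With 16 KB pages there is only one. A sketch of that calculation in C; page_colors and color_of are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of page colors in a virtually indexed cache: how many distinct
   index positions one physical page can occupy depending on the virtual
   address it is mapped at. A result of 1 means aliasing cannot occur. */
static size_t page_colors(size_t cache_size, size_t ways, size_t page_size)
{
    assert(ways > 0 && page_size > 0 && cache_size % ways == 0);
    size_t way_size = cache_size / ways;
    return way_size > page_size ? way_size / page_size : 1;
}

/* Color of a virtual address: the index bits above the page offset. */
static size_t color_of(uintptr_t vaddr, size_t cache_size, size_t ways,
                       size_t page_size)
{
    return (size_t)(vaddr / page_size)
         % page_colors(cache_size, ways, page_size);
}
```

page_colors(32768, 4, 4096) is 2 and page_colors(32768, 4, 16384) is 1. Two virtual mappings of the same physical page can alias only if their colors differ.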
MIPS 74K
-  L1 Caches
   -  4-way set associative
   -  32-byte cache line size
   -  Virtually indexed, physically tagged
   -  Cache line locking support
   -  Up to 4 outstanding I-cache misses
   -  Virtual tag based hit prediction in data cache
   -  Up to 4 unique outstanding D-cache misses and 9 total load misses
   -  Writeback and write-through support in data cache
   -  Non-blocking data cache prefetches
-  L1 Data cache:
   -  Cache Protocols: uncached, write-back (with write-allocate), write-through (without write-allocate).
   -  Data cache misses are non-blocking and up to 4 may be outstanding.
   -  The tag array also has a virtual address portion, which is used to compare against the virtual address being accessed and generate a data cache hit prediction.
   -  64- or 128-bit wide access to the data cache
-  L1 Instruction cache:
   -  128-bit wide access to the instruction cache
   -  Instruction cache tag and data access are staggered across 2 cycles, with up to 4 instructions fetched per cycle.
-  Instruction Fetch Unit
   -  4-instruction fetch per cycle
   -  8-entry Return Prediction Stack
   -  Combined Majority Branch Predictor using three 256-entry Branch History Tables (BHT)
   -  64-entry (4-way) jump register cache to predict target for indirect jumps
   -  Hardware prefetching of the next 1 or 2 sequential cache lines on a miss.
   -  In the MIPS16e mode, the IFU takes an additional 3 stages to recode and expand the compressed code.
-  Combined majority branch predictor using three 256-entry BHT; 8-entry return prediction stack
  
-  Dual Out-of-Order Instruction Issue
   -  12-stage ALU fetch and execution pipe. The latency of the ALU operation is 1 or 2 cycles.
   -  13-stage AGEN fetch and execution pipe. AGEN pipe executes load/store and control transfer instructions
   -  Common 2-stage graduation pipe
   -  32 (18 ALU, 14 AGEN) completion buffers hold execution results until instructions are graduated in program order
   -  12-entry Instruction Buffer to decouple the instruction fetch from execution. Up to 4 instructions can be written into this buffer, but a maximum of 2 instructions can be read from this buffer by the IDU.
   -  Up to 4 instructions issued per cycle in 74Kf core with dual issue FPU
-  Programmable Memory Management Unit
   -  16/32/48/64 dual-entry, dual-ported TLB shared by Instruction and Data MMU
   -  4-entry ITLB (4KB, 16KB page size)
   -  4K, 16K, 64K, 256K, 1M, 4M, 16M, 64M, 256M byte page size supported in JTLB
-  TLB: 2 virtual pages (odd and even) per entry. Dual-ported TLB shared by Instruction and Data MMU.
-  4-entry ITLB (4KB, 1MB page size)
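"2 virtual pages (odd and even) per entry" means that each TLB entry maps a pair of adjacent pages, and one address bit just above the page offset selects which half of the pair is used. A simplified sketch in C of how a virtual address splits into a pair number and an even/odd selector. The tlb_pair type and tlb_pair_of function are hypothetical names, and the sketch ignores ASIDs, page masks and the actual hardware field layout.

```c
#include <assert.h>
#include <stdint.h>

/* Which even/odd page pair a virtual address belongs to, and which half. */
typedef struct {
    uint32_t pair;   /* number of the two-page pair (address bits above the selector) */
    unsigned odd;    /* 0 = even page of the pair, 1 = odd page */
} tlb_pair;

/* page_shift is log2(page size), e.g. 12 for 4 KB or 14 for 16 KB pages. */
static tlb_pair tlb_pair_of(uint32_t vaddr, unsigned page_shift)
{
    assert(page_shift >= 12 && page_shift < 31);
    tlb_pair r;
    r.pair = vaddr >> (page_shift + 1);
    r.odd  = (vaddr >> page_shift) & 1u;
    return r;
}
```

With 4 KB pages, addresses 0x2000 and 0x3000 fall into the same pair (pair 1, even and odd halves), so both are covered by one entry, while 0x4000 needs a different entry.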
Integer pipeline:
 | Unit | # | Stage | Name | Description |
 | Fetch (IFU) | 1 | IT | Instruction Tag Read | I-cache tag arrays accessed; Branch History Table and JRC accessed; ITLB address translation performed; instruction watch and EJTAG break compares done |
 | Fetch (IFU) | 2 | ID | Instruction Data Read | I-cache data array accessed; tag compare; I-cache hit detection |
 | Fetch (IFU) | 3 | IS | Instruction Select | Way select; target calculation start |
 | Fetch (IFU) | 4 | IB | Instruction Buffer | Instruction Buffer write; target calculation done |
 | Decode & Dispatch (IDU) | 5 | DD | Decode | Access Rename Map, get source register availability to resolve source dependency; decode instructions and assign pipe and instruction identifier; check execution resources |
 | Decode & Dispatch (IDU) | 6 | DR | Rename | Update Rename Map at destination register to resolve output dependency; send instruction information to Graduation Unit (GRU); send instruction to Decode and Dispatch Queue (DDQ) |
 | Decode & Dispatch (IDU) | 7 | DS | Select for Dispatch | Check for operand and resource availability and mark valid instructions as ready for dispatch; select 1 out of 8 (6-entry DDQ + 2 staging registers) ready instructions in each ALU and AGEN pipe independently |
 | Decode & Dispatch (IDU) | 8 | DM | Instruction Mux | Read out the selected instruction from the previous stage and update the selection information; generate controls for source-operand bypass mux; ALU pipe will start premuxing operands based on the selected instruction; AGEN pipe will start reading source operands from Register File and Completion Buffers |
 | ALU | 9 | AF | ALU Register File Read |  |
 | ALU | 10 | AM | ALU Operand Mux |  |
 | ALU | 11 | AC | ALU Compute |  |
 | ALU | 12 | AB | ALU Results Bypass |  |
 | Graduation Unit (GRU) | 13 | WB | Writeback |  |
 | Graduation Unit (GRU) | 14 | WC | Graduation Complete |  |