ICT Loongson (Godson)

ICT Loongson 2F (ST STLS2F01) (Godson-2)

ICT Loongson 2F (800 MHz) (90 nm) + 1024 MB of DDR2. Lemote YeeLoong 8089 notebook.

Cache

16 KB pages mode

Size     Latency        Increase    Description
64 K     5                          TLB + L1-Cache
512 K    20             + 15        (L1-Cache miss -> L2-Cache hit)
2 M      20 + 130 ns    + 130 ns    RAM
...      20 + 260 ns    + 130 ns    (TLB miss)
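
The table mixes two units. A minimal sketch of converting each row to nanoseconds, assuming that the unitless numbers are core clock cycles at the 800 MHz clock of the test system and that the first number in each Latency entry is the total cycle count (both are readings of the table, not statements from it):

```python
# Convert the Loongson 2F latency rows to nanoseconds.
# Assumption: unitless latencies are core clock cycles at 800 MHz.

CLOCK_MHZ = 800  # clock of the test system listed above


def cycles_to_ns(cycles, clock_mhz=CLOCK_MHZ):
    """Duration of `cycles` core clock cycles, in nanoseconds."""
    return cycles * 1000.0 / clock_mhz


# (label, cycles, extra nanoseconds), copied from the Latency column
rows = [
    ("L1 hit", 5, 0),
    ("L2 hit", 20, 0),
    ("RAM", 20, 130),
    ("RAM + TLB miss", 20, 260),
]

totals_ns = {label: cycles_to_ns(cycles) + extra for label, cycles, extra in rows}

for label, total in totals_ns.items():
    print(f"{label:15s} {total:7.2f} ns")
# L1 hit 6.25 ns, L2 hit 25.00 ns, RAM 155.00 ns, RAM + TLB miss 285.00 ns
```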

ICT Loongson 3A (Godson-3)

ICT Loongson 3A (900 MHz), DDR3 SDRAM, 1333 MHz (PC3-10666), 8-8-8-24.

ICT Loongson 3: 7-metal 65 nm CMOS, 425 M transistors, 14.240 mm x 12.205 mm.

GS464 core

16 KB pages mode

Size     Latency        Increase    Description
64 K     5                          TLB + L1-Cache
2 M      35             + 30        (L1-Cache miss -> L2-Cache hit)
4 M      35 + 180 ns    + 180 ns    (TLB miss)
...      35 + 199 ns    + 19 ns     (RAM)

32 MB pages mode

Size     Latency        Increase    Description
64 K     5                          TLB + L1-Cache
4 M      35             + 30        (L1-Cache miss -> L2-Cache hit)
...      35 + 19 ns     + 19 ns     (RAM)
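
The TLB-miss row appears only in the 16 KB table. One way to see why is to estimate the TLB reach for each page size. The sketch below assumes a 64-entry TLB in which each entry maps an even/odd pair of pages, as in MIPS-style TLBs. Neither number comes from the measurements above, so treat both as hypothetical parameters.

```python
KIB = 1024
MIB = 1024 * KIB

# Assumptions, not taken from the tables above:
TLB_ENTRIES = 64      # hypothetical entry count
PAGES_PER_ENTRY = 2   # MIPS-style even/odd page pair per entry


def tlb_reach_bytes(page_size, entries=TLB_ENTRIES, pages_per_entry=PAGES_PER_ENTRY):
    """Memory that can be addressed without a TLB miss."""
    return entries * pages_per_entry * page_size


reach_16k = tlb_reach_bytes(16 * KIB)
reach_32m = tlb_reach_bytes(32 * MIB)

print(reach_16k // MIB, "MiB with 16 KB pages")   # 2 MiB
print(reach_32m // MIB, "MiB with 32 MB pages")   # 4096 MiB
```

Under these assumptions a 16 KB page size covers 2 MiB, which would be consistent with TLB misses appearing for working sets above 2 M, while 32 MB pages cover far more than any size tested.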

Integer Pipeline

The execution latency of simple dependent integer instructions is 2 cycles.

Branch misprediction penalty = 10 cycles.
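
A rough sketch of what these two numbers mean for throughput. The 2-cycle latency and 10-cycle penalty are from the text above; the base CPI, branch fraction and misprediction rate in the example are made-up illustrative values, and the additive CPI formula is a common first-order model, not a measurement of this core.

```python
DEPENDENT_OP_LATENCY = 2   # cycles, from the text
MISPREDICT_PENALTY = 10    # cycles, from the text


def chain_cycles(n_ops, latency=DEPENDENT_OP_LATENCY):
    """Cycles for n_ops simple integer ops, each waiting on the previous result."""
    return n_ops * latency


def cpi_with_branches(base_cpi, branch_fraction, mispredict_rate,
                      penalty=MISPREDICT_PENALTY):
    """First-order model: every mispredicted branch adds `penalty` cycles."""
    return base_cpi + branch_fraction * mispredict_rate * penalty


# Hypothetical workload: 20% branches, 5% of them mispredicted.
example_cpi = cpi_with_branches(base_cpi=1.0, branch_fraction=0.2,
                                mispredict_rate=0.05)

print(chain_cycles(100))        # 200
print(round(example_cpi, 3))    # 1.1
```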

# Name Description
1 Fetch The instruction cache and TLB are read, according to the contents of the program counter (PC). Four new instructions are sent to the instruction register (IR) if the instruction fetch is a TLB hit and a cache hit.
2 Pre-Decode Branch instructions are found and their branch directions are dynamically predicted.
3 Decode The four instructions in IR are decoded in the internal format and sent to the register renaming module.
4 Register Rename A new physical register is allocated for each logical destination register, and each logical source register is renamed to the latest physical register allocated for the same logical register. Dependencies among the four instructions mapped in the same cycle are also checked. The renamed instructions are latched to be sent to the reservation stations and queues in the next cycle.
5 Dispatch Renamed instructions are dispatched to the fixed- or floating-point reservation station to be executed, and are sent to the reorder queue for in-order graduation. Associated instructions are also sent to the branch queue and memory queue. Each empty entry of the reservation stations and queues selects among the four dispatched instructions in this cycle.
6 Issue One instruction with all required operands ready is selected from the fixed- or floating-point reservation station for each functional unit. When there are multiple instructions ready for the same functional unit, the oldest one is selected. Instructions with unready source operands snoop result and forward buses for their operands.
7 Register Read The issued instruction reads its source operands from the physical register file and is sent to the associated functional units. It may also get the data directly from one of the result buses if its source register number matches the destination register number of the result bus.
8 Execution Instructions are executed according to their type, and execution results are written back to the register file. The result buses also feed the reservation stations for snooping and the register mapping table, notifying it that the associated physical register is ready.
9 Commit Up to four instructions can be committed in program order per cycle. Committed instructions are sent to the register mapping module to confirm the mapping of their destination registers and release the old ones. They are also sent to the memory access queue to allow committed store instructions to write to cache or memory.
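
The Register Rename stage (4) can be illustrated with a small software model. This is a simplified sketch, not the GS464 hardware: the `Renamer` class, its register counts and the `(dest, src1, src2)` instruction format are all hypothetical, and every instruction is assumed to have a destination.

```python
class Renamer:
    def __init__(self, num_logical, num_physical):
        # Initially logical register i lives in physical register i.
        self.map_table = list(range(num_logical))
        self.free_list = list(range(num_logical, num_physical))

    def rename_group(self, group):
        """Rename a group of (dest, src1, src2) instructions mapped together.

        Walking the group in program order lets a later instruction see the
        mapping created by an earlier one in the same group.
        """
        renamed = []
        for dest, *srcs in group:
            # Sources use the latest mapping, read before dest is remapped.
            phys_srcs = [self.map_table[s] for s in srcs]
            old_phys = self.map_table[dest]      # released later, at commit
            new_phys = self.free_list.pop(0)     # real hardware stalls if empty
            self.map_table[dest] = new_phys
            renamed.append((new_phys, phys_srcs, old_phys))
        return renamed


r = Renamer(num_logical=4, num_physical=8)
out = r.rename_group([(1, 2, 3), (2, 1, 1), (1, 1, 2)])
print(out)
# [(4, [2, 3], 1), (5, [4, 4], 2), (6, [4, 5], 4)]
```

The third tuple element is the previous mapping of the destination, which the Commit stage (9) would release once the instruction graduates.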

Only one branch instruction can be decoded in one cycle.
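
The Issue stage (6) in the table above selects the oldest ready instruction per functional unit and wakes up waiting instructions by snooping results. A minimal sketch of that policy follows; the `Entry` fields, the unit names and the list-based reservation station are illustrative assumptions, not the actual hardware structures.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    age: int                                       # smaller = older in program order
    unit: str                                      # target functional unit
    waiting_on: set = field(default_factory=set)   # physical regs not yet ready


def wakeup(station, result_reg):
    """Snoop a result bus: mark result_reg ready in every waiting entry."""
    for e in station:
        e.waiting_on.discard(result_reg)


def select(station):
    """Per functional unit, issue the oldest entry with all operands ready."""
    chosen = {}
    for e in station:
        if e.waiting_on:
            continue
        if e.unit not in chosen or e.age < chosen[e.unit].age:
            chosen[e.unit] = e
    for e in chosen.values():
        station.remove(e)
    return chosen


station = [Entry(3, "alu0"), Entry(1, "alu0", {7}),
           Entry(2, "alu0"), Entry(4, "fpu0", {7})]

first = select(station)    # age 1 is older but still waits on register 7
wakeup(station, 7)         # register 7 appears on a result bus
second = select(station)

print(first["alu0"].age)                          # 2
print(second["alu0"].age, second["fpu0"].age)     # 1 4
```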

Memory Pipeline

# Description
1 The address is calculated and the TLB CAM is searched to form the index into the TLB RAM.
2 The TLB RAM is accessed in parallel with the cache RAM. Tag comparison is also performed at this stage, but value selection according to the tag comparison result is delayed to the next cycle.
3 The access value is formed according to the tag comparison result from the previous stage; memory access exception bits are also formed at this stage. The value is then sent to the memory access queue, where dynamic memory disambiguation and memory forwarding are performed.
4 The results are written back when ready.
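
Memory forwarding in stage 3 can be sketched as a load searching the queue for older stores to the same address. This is a simplified model under stated assumptions: all older store addresses are already known, accesses are the same size and aligned, and the `(age, address, value)` tuple layout is hypothetical.

```python
def load_value(address, load_age, store_queue, memory):
    """Return the value a load should observe.

    store_queue holds (age, address, value) for stores not yet written to
    cache; a smaller age means earlier in program order.
    """
    older_matches = [s for s in store_queue
                     if s[0] < load_age and s[1] == address]
    if older_matches:
        # Forward from the youngest store that is still older than the load.
        return max(older_matches, key=lambda s: s[0])[2]
    # No pending store to this address: read cache/memory.
    return memory.get(address, 0)


memory = {0x100: 1, 0x200: 2}
store_queue = [(1, 0x100, 10), (3, 0x100, 30), (6, 0x100, 60)]

print(load_value(0x100, 5, store_queue, memory))   # 30, store of age 6 is younger
print(load_value(0x200, 5, store_queue, memory))   # 2, nothing to forward
```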

ICT Loongson 3B (Godson-3B)

The Godson-3B is an 8-core high-performance processor implemented in a 65 nm CMOS LP/GP mixed process with 7 layers of Cu metallization. It contains 582.6 M transistors in a 299.8 mm² area. Its highest frequency is 1.05 GHz. Its peak performance is 128/256 GFLOPS for double/single precision at 40 W power consumption.
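
A few figures can be derived from these numbers by plain arithmetic. The per-core result assumes the peak GFLOPS figure is quoted at the 1.05 GHz maximum clock, which the text does not state; at a different clock the implied value changes accordingly.

```python
# All inputs are taken from the Godson-3B description above.
CORES = 8
FREQ_GHZ = 1.05
PEAK_DP_GFLOPS = 128.0
PEAK_SP_GFLOPS = 256.0
POWER_W = 40.0
TRANSISTORS_M = 582.6
AREA_MM2 = 299.8

dp_gflops_per_watt = PEAK_DP_GFLOPS / POWER_W
sp_gflops_per_watt = PEAK_SP_GFLOPS / POWER_W

# Assumption: peak is reached at FREQ_GHZ on all cores.
dp_flops_per_core_cycle = PEAK_DP_GFLOPS / (CORES * FREQ_GHZ)

density_m_per_mm2 = TRANSISTORS_M / AREA_MM2

print(f"{dp_gflops_per_watt:.1f} DP GFLOPS/W")                    # 3.2
print(f"{sp_gflops_per_watt:.1f} SP GFLOPS/W")                    # 6.4
print(f"{dp_flops_per_core_cycle:.2f} DP FLOPs per core cycle")   # 15.24
print(f"{density_m_per_mm2:.2f} M transistors per mm^2")          # 1.94
```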

Links

Loongson at Wikipedia

ST STLS2F01 Loongson 2F UserGuide