ICT Loongson (Godson)

Loongson 2F
Loongson 3A

ICT Loongson 2F (ST STLS2F01) (Godson-2)

ICT Loongson 2F (800 MHz) (90 nm) + 1024 MB of DDR2. Lemote YeeLoong 8089 notebook.

4-way superscalar
Execution units: ALU1, ALU2, MEM, FALU1, FALU2.
9-bit global history register (GHR).
2K-entry (or 4K-entry?) pattern history table (PHT). Each PHT entry has a 2-bit saturating up/down counter.
The 16-entry BTB predicts the target PC of the jump register instruction. Each BTB entry contains the PC and target PC of the jump register instruction. Besides, a 2-bit saturating up/down counter is associated with each BTB entry. On replacement, entries with counter values 0 or 1 will be replaced prior to others.
4-entry return address stack (RAS).
64-entry physical register file (64-bit) for fixed-point.
64-entry physical register file (64-bit) for floating-point.
16-entry reservation station (RS) for fixed-point and memory instructions.
16-entry reservation station (RS) for floating-point instructions.
64-entry reorder queue.
8-entry branch queue.
24-entry memory access queue.
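
The 2-bit saturating counters and the 9-bit global history register listed above can be illustrated with a toy model. This is only a sketch: the gshare-style XOR index, the initial counter value, and the class names are assumptions made for illustration (the real index function is not given here), and the PHT size uses the 2K figure.

```python
class SaturatingCounter2:
    """2-bit saturating up/down counter: values 0..3, predicts taken when >= 2."""

    def __init__(self, value=1):
        self.value = value

    def predict_taken(self):
        return self.value >= 2

    def update(self, taken):
        # Count up on taken, down on not taken, clamped to the 2-bit range.
        if taken:
            self.value = min(3, self.value + 1)
        else:
            self.value = max(0, self.value - 1)


class ToyPredictor:
    GHR_BITS = 9        # from the list above
    PHT_ENTRIES = 2048  # the list gives 2K (possibly 4K)

    def __init__(self):
        self.ghr = 0
        self.pht = [SaturatingCounter2() for _ in range(self.PHT_ENTRIES)]

    def _index(self, pc):
        # ASSUMPTION: gshare-style hash of PC and history, not the documented scheme.
        return ((pc >> 2) ^ self.ghr) % self.PHT_ENTRIES

    def predict(self, pc):
        return self.pht[self._index(pc)].predict_taken()

    def update(self, pc, taken):
        self.pht[self._index(pc)].update(taken)
        # Shift the branch outcome into the 9-bit global history.
        self.ghr = ((self.ghr << 1) | int(taken)) & ((1 << self.GHR_BITS) - 1)
```

Training the model on a branch that is always taken fills the history with ones, after which the selected counter saturates at 3 and the branch is predicted taken.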

Cache

L1 Data cache = 64 KB. 32 B/line, 4-way, 64-bit read / write / refill, virtually indexed and physically tagged, write-back, non-blocking (16 outstanding). The replacement policy is random, but hardware avoids replacing the same block twice in a row. Single-port RAM is used for both tag and data. STLS2F01 allows loads and write-backs of stores to proceed simultaneously, provided they access different banks, to reduce cache access conflicts. When a cache port conflict does occur among refills, loads (stores read only the tag array) and write-backs of stores (which write cache data only), refills have the highest priority and write-backs of stores the lowest.
L1 Instruction cache = 64 KB. 32 B/line, 4-way. 128-bit read, 64-bit write. Non-blocking (2 outstanding).
L2 cache size = 512 KB. 32 B/line, 4-way, physically indexed and tagged. Random replacement algorithm. Write-back. Non-blocking (8 outstanding). Critical word first.
Page size can be configured from 4 KB to 16 MB, in steps of 4x (4 KB, 16 KB, 64 KB, ..., 16 MB).
40-bit virtual address and 40-bit physical address.
Joint TLB size = 64 entries, fully associative. 2 virtual pages (odd and even) per entry. JTLB covers 128 pages.
Instruction TLB size = 8 entries. When an ITLB miss occurs, a randomly selected ITLB entry is filled from the joint TLB. If the joint TLB has no match either (TLB miss), an exception is taken and software refills the TLB from the page table resident in memory. Software can overwrite a selected TLB entry or use a hardware mechanism to write into a random entry.
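
The cache and TLB figures above imply a few derived numbers that are easy to check. The helper names below are hypothetical and the code is plain arithmetic on the listed parameters, not a model of the hardware.

```python
KB = 1024
MB = 1024 * KB


def cache_sets(size_bytes, line_bytes, ways):
    """Number of sets in a set-associative cache."""
    return size_bytes // (line_bytes * ways)


def bytes_per_way(size_bytes, ways):
    """Bytes covered by the index and line-offset bits (one way)."""
    return size_bytes // ways


def page_sizes():
    """Configurable page sizes: 4 KB to 16 MB, each 4x the previous."""
    sizes = []
    size = 4 * KB
    while size <= 16 * MB:
        sizes.append(size)
        size *= 4
    return sizes


def jtlb_reach(page_bytes, entries=64, pages_per_entry=2):
    """Memory covered by the joint TLB without a miss."""
    return entries * pages_per_entry * page_bytes


print(cache_sets(64 * KB, 32, 4))         # sets in each L1 cache
print(bytes_per_way(64 * KB, 4) // KB)    # KB per L1 way
print(jtlb_reach(16 * KB) // MB)          # JTLB reach in MB with 16 KB pages
```

With 16 KB pages, one L1 way (16 KB) is exactly one page, so the virtual index bits lie inside the page offset. The JTLB reach at that page size is 2 MB, which matches the size at which RAM latency appears in the 16 KB pages table.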

16 KB pages mode

Size	Latency	Description
64 K	5	TLB + L1-Cache
512 K	20	+ 15 (L1-Cache miss -> L2-Cache hit)
2 M	20 + 130 ns	+ 130 ns RAM
...	20 + 260 ns	+ 130 ns (TLB miss)
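
The table mixes cycles and nanoseconds. A small conversion sketch, assuming the 800 MHz core clock quoted for this chip (function names are hypothetical), puts both on one scale:

```python
def cycles_to_ns(cycles, clock_mhz):
    """Convert core cycles to nanoseconds."""
    return cycles * 1000.0 / clock_mhz


def ns_to_cycles(ns, clock_mhz):
    """Convert nanoseconds to core cycles."""
    return ns * clock_mhz / 1000.0


def total_ns(cycles, extra_ns, clock_mhz):
    """Total latency of a row written as '<cycles> + <extra> ns'."""
    return cycles_to_ns(cycles, clock_mhz) + extra_ns


CLOCK_MHZ = 800  # Loongson 2F clock from the text

print(total_ns(20, 130, CLOCK_MHZ))   # RAM access, in ns
print(total_ns(20, 260, CLOCK_MHZ))   # RAM access plus TLB miss, in ns
print(ns_to_cycles(130, CLOCK_MHZ))   # 130 ns expressed in core cycles
```

At 800 MHz a cycle is 1.25 ns, so the 20-cycle L2 hit is 25 ns and each 130 ns step is 104 core cycles.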

4-byte range crossing penalty = 641 cycles
L2 B/W Read (parallel read) = 6 cycles per cache line
RAM B/W Read (4-byte stride) = 420 MB/s
RAM B/W Read (32-byte stride) = 1200 MB/s
RAM B/W Write (4-byte stride) = 280 MB/s
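
To compare the L2 figure (cycles per line) with the RAM figures (MB/s), both can be converted to common units. This sketch assumes the 800 MHz clock and 32-byte line size stated earlier, and takes 1 MB as 10^6 bytes, which is an assumption about how the measurements were reported.

```python
def line_rate_mb_s(line_bytes, cycles_per_line, clock_mhz):
    """Bandwidth in MB/s (MB = 10**6 bytes) when one line moves every N cycles."""
    # bytes/cycle * cycles/microsecond = bytes/microsecond = MB/s
    return line_bytes / cycles_per_line * clock_mhz


def bytes_per_cycle(mb_s, clock_mhz):
    """Average bytes transferred per core cycle for a given MB/s figure."""
    return mb_s / clock_mhz


CLOCK_MHZ = 800  # Loongson 2F clock from the text

print(round(line_rate_mb_s(32, 6, CLOCK_MHZ)))   # L2 read bandwidth in MB/s
print(bytes_per_cycle(1200, CLOCK_MHZ))          # RAM read, 32-byte stride
print(bytes_per_cycle(420, CLOCK_MHZ))           # RAM read, 4-byte stride
```

Under these assumptions the L2 read rate is roughly 4.3 GB/s, against 1.5 bytes per cycle for the best RAM read figure.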

ICT Loongson 3A (Godson-3)

ICT Loongson 3A (900 MHz), DDR3 SDRAM, 1333MHz (PC3-10666), 8-8-8-24.

ICT Loongson 3: 7-metal 65-nm CMOS, 425 MTransistors, 14.240mm x 12.205mm.

GS464 core

X86 binary translation optimization
MIPS64, 200+ more instructions for X86 binary translation and media acceleration
48-bit VA and PA, 128-bit memory access
4 x L2 cache blocks of 4 MB total cache
4-issue 64-bit superscalar OOO pipeline
2 fixed-point, 2 FP, and 1 memory unit
64KB ICache and 64KB DCache, 4-way
64-entry TLB, 16-entry ITLB
Directory-based cache-coherence
Parity check for ICache, ECC for DCache
GS464V: additional multi-purpose cores (GStera). They consist of 8 to 16 multiply-accumulate (MAC) units, a very large register file, and an AXI interface.

16 KB pages mode

Size	Latency	Description
64 K	5	TLB + L1-Cache
2 M	35	+ 30 (L1-Cache miss -> L2-Cache hit)
4 M	35 + 180 ns	+ 180 ns (TLB miss)
...	35 + 199 ns	+ 19 ns (RAM)

32 MB pages mode

Size	Latency	Description
64 K	5	TLB + L1-Cache
4 M	35	+ 30 (L1-Cache miss -> L2-Cache hit)
...	35 + 19 ns	+ 19 ns (RAM)
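
The two page-size tables can be read as a step function from working-set size to access latency. The sketch below encodes them that way for the 900 MHz Loongson 3A. Treating each row as a hard threshold is a simplification (measured curves change gradually), and the names are hypothetical.

```python
import math

KB = 1024
MB = 1024 * KB

# (upper size limit in bytes, latency in cycles, extra latency in ns)
TIERS_16KB_PAGES = [
    (64 * KB, 5, 0.0),
    (2 * MB, 35, 0.0),
    (4 * MB, 35, 180.0),
    (math.inf, 35, 199.0),
]
TIERS_32MB_PAGES = [
    (64 * KB, 5, 0.0),
    (4 * MB, 35, 0.0),
    (math.inf, 35, 19.0),
]


def latency_ns(working_set_bytes, tiers, clock_mhz=900):
    """Look up the latency row for a working-set size and convert it to ns."""
    for limit, cycles, extra_ns in tiers:
        if working_set_bytes <= limit:
            return cycles * 1000.0 / clock_mhz + extra_ns
    raise ValueError("tiers must end with an unbounded row")


# A 3 MB working set fits in L2 but exceeds the JTLB reach with 16 KB pages.
print(round(latency_ns(3 * MB, TIERS_16KB_PAGES), 1))
print(round(latency_ns(3 * MB, TIERS_32MB_PAGES), 1))
```

The comparison shows why the large-page table has no TLB-miss row: for the same working set only the L2 latency remains.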

4-byte range crossing penalty = 492 cycles
L1 B/W (Parallel Random Read) = 1 cycle per access
L2->L1 B/W (Parallel Random Read) = 6 cycles per cache line (32 bytes)
L2->L1 B/W (Read, 32 bytes step) = 6 cycles per cache line (32 bytes)
L2 Write (Write, 64 bytes step) = 6 cycles per write (cache line)
RAM B/W Read (Parallel Random Read) can be slower than individual Random reads.
RAM B/W Read (4-byte step) = 300 MB/s
RAM B/W Read (64-byte step) = 1700 MB/s
RAM B/W Write (4-byte step) = 180 MB/s

Integer Pipeline

Execution latency of simple dependent integer instructions is 2 cycles!

Branch misprediction penalty = 10 cycles.
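
A rough feel for what these two numbers cost can be had from simple arithmetic. The function names, the branch fraction and the misprediction rate below are hypothetical inputs chosen only for illustration; they are not measurements.

```python
def dependent_chain_cycles(n_ops, latency=2):
    """Cycles for a chain of n_ops instructions, each waiting on the previous one."""
    return n_ops * latency


def mispredict_cpi(branch_fraction, mispredict_rate, penalty=10):
    """Extra cycles per instruction caused by branch mispredictions."""
    return branch_fraction * mispredict_rate * penalty


# 100 back-to-back dependent integer operations
print(dependent_chain_cycles(100))

# Hypothetical workload: 20% branches, 5% of them mispredicted
print(round(mispredict_cpi(0.20, 0.05), 3))
```

With a 2-cycle latency, a fully dependent integer chain runs at half an instruction per cycle no matter how wide the machine is.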

#	Name	Description
1	Fetch	The instruction cache and TLB are read, according to the contents of the program counter (PC). Four new instructions are sent to the instruction register (IR) if the instruction fetch is a TLB hit and a cache hit.
2	Pre-Decode	Branch instructions are found and their branch directions are dynamically predicted.
3	Decode	The four instructions in IR are decoded in the internal format and sent to the register renaming module.
4	Register Rename	A new physical register is allocated for each logical destination register, and each logical source register is renamed to the latest physical register allocated for the same logical register. Inter-instruction dependencies among the four instructions mapped in the same cycle are also checked. The renamed instructions are latched to be sent to the reservation stations and queues in the next cycle.
5	Dispatch	Renamed instructions are dispatched to the fixed- or floating-point reservation station to be executed, and are sent to the reorder queue for in-order graduation. Branch and memory instructions are also sent to the branch queue and memory queue, respectively. Each empty entry of the reservation stations and queues selects among the four instructions dispatched in this cycle.
6	Issue	One instruction with all required operands ready is selected from the fixed- or floating-point reservation station for each functional unit. When there are multiple instructions ready for the same functional unit, the oldest one is selected. Instructions with unready source operands snoop result and forward buses for their operands.
7	Register Read	The issued instruction reads its source operands from the physical register file and is sent to the associated functional units. It may also get the data directly from one of the result buses if its source register number matches the destination register number of the result bus.
8	Execution	Instructions are executed according to their type, and execution results are written back to the register file. Result buses are also sent to the reservation stations for snooping and to the register mapping table to signal that the associated physical register is ready.
9	Commit	Up to four instructions can be committed in program order per cycle. Committed instructions are sent to the register mapping module to confirm the mapping of their destination registers and release the old ones. They are also sent to the memory access queue to allow committed store instructions to write cache or memory.
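
The Register Rename and Commit stages in the table can be sketched as a map table plus a free list. This is a minimal illustration, not the actual design: the 32 logical registers, the identity initial mapping and the class name are assumptions, and only the 64-entry physical file comes from the text.

```python
from collections import deque


class RenameTable:
    """Toy logical-to-physical register map with a free list."""

    def __init__(self, n_logical=32, n_physical=64):
        # ASSUMPTION: logical register i starts mapped to physical register i.
        self.map = list(range(n_logical))
        self.free = deque(range(n_logical, n_physical))

    def rename(self, dest, srcs):
        """Rename one instruction; returns (new_phys_dest, phys_srcs, old_phys_dest)."""
        # Sources read the latest mapping before the destination is remapped.
        phys_srcs = [self.map[s] for s in srcs]
        if not self.free:
            raise RuntimeError("no free physical register; rename would stall")
        new_phys = self.free.popleft()
        old_phys = self.map[dest]
        self.map[dest] = new_phys
        return new_phys, phys_srcs, old_phys

    def commit(self, old_phys):
        """At commit, the previous mapping of the destination is released."""
        self.free.append(old_phys)


table = RenameTable()
first = table.rename(dest=1, srcs=[2, 3])    # r1 = r2 op r3
second = table.rename(dest=4, srcs=[1, 1])   # r4 = r1 op r1, sees the new r1
table.commit(first[2])                       # first instruction commits
print(first, second)
```

Renaming the instructions of a group one after another, as done here, is the software equivalent of the inter-instruction dependency check: the second instruction picks up the physical register just allocated by the first.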

Only one branch instruction can be decoded in one cycle.

Memory Pipeline

#	Description
1	The address is calculated and the CAM of the TLB is searched to form the index of the TLB RAM.
2	The TLB RAM is accessed in parallel with the cache RAM. Tag compare is also performed at this stage, but value selection according to the tag compare result is delayed to the next cycle.
3	The access value is formed according to the tag compare result of the previous stage; memory access exception bits are also formed at this stage. The value is then sent to the memory access queue, where dynamic memory disambiguation and memory forwarding are performed.
4	The results are written back when ready.
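
The memory forwarding mentioned in stage 3 can be illustrated with a very small model of a store queue. It is a sketch under strong assumptions: addresses of all older stores are already known, accesses are the same size, and only exact address matches forward. The function name and data layout are hypothetical.

```python
def forward_load(load_addr, older_stores):
    """Return (hit, value) for a load, given older stores in program order.

    older_stores is a list of (address, value) pairs, oldest first.
    The youngest older store to the same address supplies the value.
    """
    for addr, value in reversed(older_stores):
        if addr == load_addr:
            return True, value
    # No matching store in the queue: the load takes its value from the cache.
    return False, None


queue = [(0x100, 1), (0x200, 2), (0x100, 3)]
print(forward_load(0x100, queue))   # value comes from the youngest matching store
print(forward_load(0x300, queue))   # no match, cache value is used
```

Real hardware also has to handle partial overlaps and stores whose addresses are not yet computed, which is the disambiguation part and is not modeled here.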

The Godson-3B processor is an 8-core high-performance processor implemented in a 65 nm CMOS LP/GP mixed process with 7 layers of Cu metallization. It contains 582.6M transistors in a 299.8 mm² area. The highest frequency of Godson-3B is 1.05 GHz. Its peak performance is 128/256 GFLOPS for double/single precision at 40 W power consumption.
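
The quoted Godson-3B figures can be broken down per core and per watt. The text does not say which clock the peak number assumes, so the clock is left as a parameter. The function names are hypothetical and the code is only arithmetic on the quoted values.

```python
def flops_per_cycle_per_core(peak_gflops, cores, clock_ghz):
    """Floating-point operations per cycle per core implied by a peak figure."""
    return peak_gflops / (cores * clock_ghz)


def gflops_per_watt(peak_gflops, watts):
    """Peak energy efficiency."""
    return peak_gflops / watts


# Double precision, 8 cores, at the quoted maximum clock and at a round 1.0 GHz
print(round(flops_per_cycle_per_core(128, 8, 1.05), 2))
print(flops_per_cycle_per_core(128, 8, 1.0))
print(round(gflops_per_watt(128, 40), 2))
```

At 1.05 GHz the double-precision figure works out to about 15.2 operations per cycle per core; a round 16 results only if the peak is taken at 1.0 GHz.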

Links

Loongson at Wikipedia

ST STLS2F01 Loongson 2F UserGuide