So, I have been sinking a lot of time recently into my ISA / Emulator / C Compiler / (now) CPU project. This has run a bit out of control, having now consumed about the last year of my life.
The base ISA from which it is derived is the SuperH SH4 ISA, probably best known for the Sega Dreamcast, and mostly used in various pieces of consumer electronics. It is a RISC, but one which uses fixed 16-bit instruction words rather than the 32 bits more typical of RISC; SH also later inspired the Thumb ISA.
A drawback of this ISA, however, is that it sometimes needs longer-than-ideal instruction sequences to perform operations. My workaround has become an ISA I call BJX1 (I have yet to come up with an actual name; it could be backronymed to “BetaJeX” or something, but it originally meant “BGB J-eXtensions”), which inherits the basic SH4 ISA (albeit with some instruction forms marked as “deprecated”) but adds a few escape-coded extension blocks (a major one being 8Exx, which single-handedly implements the bulk of the BJX1-32 ISA).
Extensions mostly include expanded MOV forms, expanded immediate values, and some “Reg, Reg, Reg” and “Reg, Imm, Reg” forms for arithmetic operations.
Similarly, there is the BJX1-64 ISA, which is closely related but expands the ISA to 64 bits. This includes some additional instruction forms, and also the CCxx and CExx blocks (much of the Cxxx block was dropped in BJX1-64 and reused for operations which are “more useful”). Thus far, these blocks are mostly used to encode instruction forms which extend the ISA to 32 GPRs, in a form I am calling 64A (a 64B variant exists which sticks with 16 GPRs, and similarly lacks CCxx and CExx). There is a penalty involved in using R16..R31, but they are helpful for functions with a lot of register pressure. These blocks don’t exist in the 32-bit ISA (where they instead encode some rarely-used “@(GBR, disp)” operations).
Effectively, the 8Exx, CExx, and CC0e blocks work as prefixes, essentially modifying the normal 16-bit instruction words:
- 8Exx: Adds or extends immediate for most instructions.
- MOV forms either get an 8-bit displacement (“MOV.L @Rm, Rn” => “MOV.L @(Rm, disp8s), Rn”), or transform into @(Rm, Ro, disp) forms (“MOV.W @(Rm, R0), Rn” => “MOV.W @(Rm, Ro, disp4), Rn”).
- Many arithmetic operators get an 8-bit immediate, and a 3-register block is added.
- Values which use an immediate or displacement generally get a bigger immediate or displacement, and a few blocks are repurposed.
- “BT disp16”, “BRA/N disp16”, …
- “ADD #imm16s, Rn”
- “LDSH16 #imm16s, Rn” replaces “MOV.W @(PC,disp), Rn”; Encodes “Rn=(Rn<<16)+Imm;” This is mostly used for constructing larger inline constants.
- Some other additions include LEA instructions and similar, which are useful for pointer arithmetic.
- CExx: Does something similar to the 8Exx block, but gives up a few bits for extended GPRs, so similar operations generally have a 4 or 6-bit immediate vs an 8-bit immediate.
- Some of the MOV forms simply become @(Rm, Ro) instead of @(Rm, Ro, disp4).
- Size-independent single-register operations keep the full width immediate (typically 16 bits), but simply encode a register in the range of R16..R31 instead of R0..R15.
- “SHADQ/SHLDQ Rm, Imm, Rn” was split into multiple alternate instruction forms (SHALQ/SHARQ/SHLLQ/SHLRQ), as 6 bits is only sufficient to encode the shift amount itself, leaving no room to also encode the shift direction.
- CC0e: Functions essentially like an x86-64 REX prefix.
- e=Qnst/Qnmz; Q=QWord; nst adds a bit to each register; nmz adds a bit to Rn and Rm, with z usually indicating whether an applicable DWord operator will sign- or zero-extend the result.
- More subtly modifies some other operations, so CC00 isn’t exactly NOP.
- TBD: Define CC00 as simply padding a 16-bit opcode to 32 bits?…
- CC3e: Implements basically a set of 3R arithmetic operators over all 32 GPRs. This one isn’t technically a prefix, as the following 16-bit word is special-purpose, using all 16 bits (onst: a 4-bit operator followed by Rn, Rs, and Rt).
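To make the LDSH16 description above concrete, here is a minimal C sketch of its stated semantics (“Rn=(Rn<<16)+Imm;”). The `ldsh16` helper name is mine, and the note about pre-adjusting for a signed low half is my own inference from the immediate being signed; the actual compiler may handle it differently:

```c
#include <stdint.h>

/* Stated semantics of "LDSH16 #imm16s, Rn": Rn = (Rn << 16) + imm.
 * Chained after an initial constant load, this builds up larger
 * inline constants 16 bits at a time. */
static int64_t ldsh16(int64_t rn, int16_t imm)
{
    return (rn << 16) + imm;
}

/* Example: constructing 0x12345678.
 *   rn = 0x1234;              (initial 16-bit constant load)
 *   rn = ldsh16(rn, 0x5678);  ("LDSH16 #0x5678, Rn" -> 0x12345678)
 *
 * Because the immediate is signed, a low half with bit 15 set
 * sign-extends, so (presumably) the preceding value gets pre-adjusted
 * by +1 to compensate, as with the MIPS lui/addiu idiom:
 *   ldsh16(0x1235, (int16_t)0x89AB) == 0x123489AB
 */
```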
Code density doesn’t quite reach the levels of Thumb2 as of yet, but gets “reasonably close”. Whether any of this still has a point, given the existence of RISC-V’s RV32C and RV64C, remains to be seen (these seem to be in the same general area regarding code density).
A lot of the time in recent months has been thrown at the C compiler: trying to make the ISA variants work, finding bugs, and improving the quality of the generated code. This is still not fully solved, as bugs remain which have yet to be located and fixed (particularly in the 64-bit ISAs).
In some ways, Verilog proves challenging. In a basic sense, it is a language like many others, but one with many awkward restrictions. You can write stuff, and it will work in simulation. Synthesis, however, becomes a problem, with lots of little things conspiring to eat up the FPGA’s budget. Little things like the relative length vs element width of an array, how the array is accessed, whether the array maps nicely onto the size of the internal Block-RAM, whether values are used directly or delayed to a subsequent clock cycle, whether multiple copies of an operator are used vs forcing multiple paths through a single operator, … can have fairly significant effects on the cost of synthesizing the design.
Like, in C, one can use memory more freely, and C really doesn’t care when and where one uses shifts and multiplies. Verilog cares, and you will be judged for how often you multiply numbers, and the bits involved in the input and products of the multiplication, etc. It is like “That huge and ugly casez() block with hundreds of patterns; yeah that is fine. But as for that integer multiply working with 128 bit quantities; I am not happy about this…”. Often, things which work well in VL would be horrible in C, and things one takes for granted in C are very expensive in VL.
For example, BJX1-64 has a DMULS.Q operation, which takes two 64-bit signed quantities and produces a 128-bit result. Trying to do this directly, one gets some big ugly mass of LUTs and DSP units which proceeds to eat up a scary chunk of the total budget. But, assuming I keep DMULS.Q, there is some incentive to find a cheaper way to do the multiply (such as a state machine built from narrower multiplies, spreading the work over several clock cycles).
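As a sketch of the “narrower multiplies” idea, here is the usual C formulation of a 64x64-to-128-bit signed multiply built from four 32x32-to-64-bit partial products, which is roughly the arithmetic such a state machine would compute one narrow multiply per step. The function names are mine, and this shows only the math, not the actual sequencing logic:

```c
#include <stdint.h>

/* Unsigned 64x64 -> 128-bit multiply from four 32x32 -> 64-bit
 * partial products (the kind of narrow multiply that could be
 * issued one per clock cycle). */
static void umul64wide(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo;   /* the four narrow partial products */
    uint64_t p1 = a_lo * b_hi;
    uint64_t p2 = a_hi * b_lo;
    uint64_t p3 = a_hi * b_hi;

    /* carry out of the low 64 bits when summing the middle terms */
    uint64_t cy = ((p0 >> 32) + (uint32_t)p1 + (uint32_t)p2) >> 32;

    *lo = p0 + (p1 << 32) + (p2 << 32);
    *hi = p3 + (p1 >> 32) + (p2 >> 32) + cy;
}

/* Signed version (DMULS.Q-style): do the unsigned multiply, then
 * correct the high half for each negative input. */
static void smul64wide(int64_t a, int64_t b, int64_t *hi, uint64_t *lo)
{
    uint64_t uhi;
    umul64wide((uint64_t)a, (uint64_t)b, &uhi, lo);
    if (a < 0) uhi -= (uint64_t)b;
    if (b < 0) uhi -= (uint64_t)a;
    *hi = (int64_t)uhi;
}
```

For instance, `smul64wide(-1, -1, ...)` yields hi = 0, lo = 1.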
Similar sorts of things happen if one tries to implement memory as a long skinny array vs shorter, wider arrays. Despite having the same total capacity, the longer array may be notably more expensive. Access to memory arrays also follows a similar principle to Highlander: “There can be only one!” You are allowed one read and one write to an array, not per mutually exclusive control path (that would make it too easy), but in total. Fail at this, and even a relatively modest-sized array will try to eat the budget.
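The “shorter, wider arrays” point can be illustrated in C: back a byte-addressed memory with a 32-bit-wide array, so every access is one full-width read (and at most one write) of a single array, with the sub-word selected afterward. This is only an analogy, with hypothetical names and sizes, but it is the access shape that maps well onto a single wide Block-RAM port:

```c
#include <stdint.h>

/* 4 KB of storage, organized as a short, wide (32-bit) array
 * rather than a long, skinny byte array. */
static uint32_t mem[1024];

static uint8_t mem_read_byte(uint32_t addr)
{
    uint32_t word = mem[(addr >> 2) & 1023];      /* the one array read */
    return (uint8_t)(word >> ((addr & 3) * 8));   /* little-endian byte select */
}

static void mem_write_byte(uint32_t addr, uint8_t val)
{
    uint32_t idx  = (addr >> 2) & 1023;
    uint32_t sh   = (addr & 3) * 8;
    uint32_t word = mem[idx];                     /* one read...       */
    word &= ~(0xFFu << sh);                       /* merge in the byte */
    word |= (uint32_t)val << sh;
    mem[idx] = word;                              /* ...and one write  */
}
```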
But, it would be nice to have an implementation of my BJX1 ISA in hardware. Never mind if, at this point, it seems like I have little idea what I am doing.
But, all of this has been alongside me writing sci-fi, and I am now technically an author. This is even if, at the time of this writing, I only have about $3 to my name from ebook sales. Like, “Woo, several people have bought it!” But, of these, how many will read it? How many will think it isn’t crap? Will it sell enough to “actually matter”? …