A lot of stuff happened over the course of the year. Development on BJX2 continues.
While not exactly the same as it was (various breaking changes happened along the way, including a few major ones to the ISA encoding), the addition of Jumbo and SIMD, …
So, BJX2 has mostly stabilized on a 3-wide VLIW design known as WEX-3W (Wide Execute, 3-Wide), and many other ISA features have been built around this assumption. As-is, both Jumbo and SIMD implicitly assume the existence of WEX.
Jumbo allows for larger immediate fields in various instructions. It consists of a “Jumbo Prefix” (FEjjjjjj), which carries an additional 24 bits of payload and is used alongside an instruction which takes an immediate. This expands 9 and 10 bit immediate fields to 33 bits (the MSB serves as a sign-extension for the remaining bits; so it can also encode the 32-bit unsigned range). Using two jumbo prefixes will expand these immediates to 57 bits, or, if used with an Imm16 instruction, will produce a 64-bit immediate.
This means a full-width 64-bit constant can be loaded in 1 cycle using 12 bytes of memory, or a 32-bit constant can be loaded using 8 bytes.
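As a rough sketch of the width arithmetic described above (illustration only): `jumbo_imm_bits`, `jumbo_insn_bytes`, and `sext` are hypothetical helper names, and nothing here reflects how the payload bits are actually arranged in the encoding. It assumes only that each prefix is 4 bytes and contributes 24 bits to the immediate.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: immediate width after adding `prefixes` jumbo
 * prefixes (24 payload bits each) to a base immediate field. */
static int jumbo_imm_bits(int base_bits, int prefixes)
{
    assert(base_bits > 0 && prefixes >= 0);
    return base_bits + 24 * prefixes;
}

/* Hypothetical helper: total encoding size in bytes, assuming a 32-bit
 * base instruction and 32-bit prefixes. */
static int jumbo_insn_bytes(int prefixes)
{
    assert(prefixes >= 0);
    return 4 * (1 + prefixes);
}

/* Sign-extend the low `bits` bits of `v` to 64 bits. */
static int64_t sext(uint64_t v, int bits)
{
    assert(bits >= 1 && bits <= 64);
    if (bits == 64)
        return (int64_t)v;
    uint64_t sign = (uint64_t)1 << (bits - 1);
    v &= (sign << 1) - 1;               /* keep only the low `bits` bits */
    return (int64_t)((v ^ sign) - sign);
}
```

With a 33-bit signed field, `sext(0xFFFFFFFF, 33)` stays positive while `sext(0x1FFFFFFFF, 33)` gives -1, which is why a single prefix covers both the signed and unsigned 32-bit ranges.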
This did come with the loss of the original 48-bit encoding space. It is possible that 48-bit encodings could be reintroduced in a new encoding space (a jumbo prefix followed by a 16-bit instruction). But this hasn’t really been done yet, and doesn’t mix well with how my C compiler implements WEX. Instead, one will likely need to live with the slightly worse code density of 64-bit Jumbo encodings.
Another addition was 128-bit SIMD operations (the GSVX extension). Unlike several other ISAs, these do not add a separate register space, but instead reuse the GPR space. In this case, the GPRs are paired together and used as a logical 128-bit register. In terms of notation, these are expressed as Xn, where X8 is the pair of R8 and R9, X10 is R10 and R11, … These logical Xn registers are only allowed for even register numbers. Xn register numbers with the LSB set are reserved for future expansion (most likely the implementation of a larger register set).
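The pairing rule can be written down as a small sketch. The type and function names here are made up for illustration, and the sketch deliberately does not say which GPR of the pair holds the low or high half, nor how many registers exist.

```c
#include <assert.h>
#include <stdbool.h>

/* The two 64-bit GPR numbers backing a logical Xn register.
 * Hypothetical type, not taken from any BJX2 header. */
struct xreg_pair {
    int first;   /* Rn     */
    int second;  /* R(n+1) */
};

/* Xn is only defined for even n; odd n is reserved for now. */
static bool xreg_is_valid(int n)
{
    return n >= 0 && (n & 1) == 0;
}

/* Map Xn to its GPR pair, e.g. X8 -> (R8, R9). */
static struct xreg_pair xreg_to_gprs(int n)
{
    assert(xreg_is_valid(n));
    struct xreg_pair p = { n, n + 1 };
    return p;
}
```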
There is support for both Packed Integer and Floating Point SIMD. For packed integer, the ALU is subdivided (as opposed to using multiple ALUs). For FP SIMD, the vector elements are fed through the FPU using pipelining.
In terms of implementation, these ‘X’ instructions use multiple lanes internally (treating the 64-bit 6R+3W register file as a 128-bit 3R+1W register file), so typically may only be expressed as scalar operations (with a partial exception for Jumbo prefixes). Many may work by also combining units across multiple lanes into a single larger virtual unit (this is the case for most of the ALUX instructions). Otherwise, most of these instructions take the form of normal 32-bit (4-byte) instruction encodings.
Another more recent addition is the GFPX (or FPUX) extension, which extends the FPU ops to 128 bits, using the IEEE-754 Binary128 format. In the current implementation, this doesn’t support the full precision of the format (doing so would be quite expensive), so it instead uses an intermediate precision (at the time of this writing, 72 mantissa bits for FMUL, and 87 for FADD). The exact precision will be implementation dependent. This is also being called the “Long Double” extension, partly as it is primarily intended for C’s “long double” type.
Another related extension is ALUX, which adds various 128-bit integer operations (ADD/SUB/AND/OR/XOR/CMP/SHAD/…). Most of the basic integer ops work by joining the ALUs across two lanes together, passing carry bits and similar between them, allowing them to function as if they were a single larger ALU.
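A behavioral sketch of the carry-chaining idea, in C rather than HDL: two 64-bit adds are linked by passing the carry-out of the low lane into the high lane. The names `u128`, `alu_add64`, and `alux_add128` are invented for this example and are not the actual hardware interface.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t lo, hi; } u128;   /* hypothetical 128-bit value */

/* One 64-bit ALU lane: add with carry-in, report carry-out. */
static uint64_t alu_add64(uint64_t a, uint64_t b, unsigned cin, unsigned *cout)
{
    assert(cin <= 1);
    uint64_t s = a + b;
    unsigned c1 = s < a;        /* carry out of a + b */
    uint64_t r = s + cin;
    unsigned c2 = r < s;        /* carry out of adding the carry-in */
    *cout = c1 | c2;
    return r;
}

/* Two lanes joined into one wider adder via the carry bit. */
static u128 alux_add128(u128 a, u128 b)
{
    unsigned c_lo, c_hi;
    u128 r;
    r.lo = alu_add64(a.lo, b.lo, 0, &c_lo);
    r.hi = alu_add64(a.hi, b.hi, c_lo, &c_hi);
    (void)c_hi;                 /* carry out of bit 127 is dropped here */
    return r;
}
```

SUB and CMP would presumably follow the same pattern with a borrow, while AND/OR/XOR need no cross-lane signal at all.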
The shift instructions got a little more complicated, and involved replacing the original 64-bit barrel shift design with a logarithmic funnel shift. In this case, each shift unit produces a 64-bit output, but may use 128 bits on the input side (internally, this expands to 192 bits to allow for things like sign and zero extension of the 128-bit value). Using two of these units in parallel allows for performing a 128-bit shift. Despite the larger input size of the shift unit, the overall cost of the new shift unit is roughly break-even with the old shift unit.
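To make the "two funnel units" idea concrete, here is a behavioral model of a 128-bit right shift built from two 64-bit-output funnel shifts. It only models the result, not the logarithmic stage structure, and the names (`funnel_shr64`, `shr128`) and the padded word array are assumptions of this sketch.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t lo, hi; } u128;   /* hypothetical 128-bit value */

/* One funnel unit: take the 128-bit input hi:lo, shift right by n (0..63),
 * and return the low 64 bits of the result. */
static uint64_t funnel_shr64(uint64_t hi, uint64_t lo, unsigned n)
{
    assert(n < 64);
    if (n == 0)
        return lo;                          /* avoid an undefined shift by 64 */
    return (lo >> n) | (hi << (64 - n));
}

/* 128-bit right shift (logical or arithmetic) using two funnel units. */
static u128 shr128(u128 a, unsigned s, int arithmetic)
{
    assert(s < 128);
    /* Extension word: all ones for a negative arithmetic shift, else zero. */
    uint64_t ext = (arithmetic && (a.hi >> 63)) ? UINT64_MAX : 0;
    /* Extended input {ext, hi, lo}, padded with one more ext word so both
     * units can index uniformly. */
    uint64_t w[4] = { a.lo, a.hi, ext, ext };
    unsigned word = s >> 6, bit = s & 63;
    u128 r;
    r.lo = funnel_shr64(w[word + 1], w[word], bit);      /* unit 0 */
    r.hi = funnel_shr64(w[word + 2], w[word + 1], bit);  /* unit 1 */
    return r;
}
```

Each unit sees only a 128-bit window of the extended value, which matches the description of a 64-bit output drawn from a wider input.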
Most ALUX instructions lack immediate forms. This isn’t seen as a huge issue though, as even without immediate forms, they are still a big improvement over needing to use function calls. Well, and also a 128-bit constant load would typically be done as a pair of 64-bit constant loads, so it may potentially cost around 24 bytes to load a 128-bit constant.
Though, ironically, one of the main likely use-cases for ALUX would be for more efficient implementation of Binary128 operations in software.
Some time ago, I had also started working on an “OpenGL Style” software rasterizer on BJX2. Ironically, it does things like alpha blending, color interpolation, and bilinear filtering pretty well, but suffers from a pretty big problem in terms of memory bandwidth.
In this case, I did end up adding a few misc instructions to support a few custom compressed-texture formats (which I am calling UTX1 and UTX2), which aim to do something similar to DXT1/BC1 and DXT5/BC3 at roughly half the effective size. UTX1 uses a 12-bit RGB center value, 4-bit Y-delta, and 16 selector bits. As such, UTX1 can encode non-transparent textures at 2bpp, with “roughly passable” image quality (I suspect unusable performance would be a bigger problem than the textures looking kinda ugly). UTX2 is similar to DXT1, but uses a pair of RGB555 or RGB444A3 endpoints instead. Here, RGB444A3 uses the same layout as RGB555, but takes the LSB of each component and uses it as an Alpha value (xrrr-ragg-ggab-bbba).
The two endpoint MSB’s select how to interpolate the block:
- 00=Interpolated Opaque (RGB555), (00=A, 01=2/3A+1/3B, 10=1/3A+2/3B, 11=B);
- 01=Translucent (RGB444A3), With two selector bits (Color, Alpha);
- 10=Reserved for now;
- 11=Interpolated Alpha (RGB444A3), Alpha interpolated along with RGB.
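The endpoint layout and mode selection above might be unpacked roughly as follows. This is a sketch under assumptions the text does not settle: that endpoint A's MSB is the high bit of the 2-bit mode, and that the three alpha bits are ordered R-LSB (high), G-LSB, B-LSB (low). All names are hypothetical, and the fields are returned raw rather than scaled to 8 bits.

```c
#include <stdint.h>

enum utx2_mode {
    UTX2_OPAQUE       = 0,  /* 00: interpolated opaque (RGB555)      */
    UTX2_TRANSLUCENT  = 1,  /* 01: translucent (RGB444A3)            */
    UTX2_RESERVED     = 2,  /* 10: reserved for now                  */
    UTX2_INTERP_ALPHA = 3   /* 11: interpolated alpha (RGB444A3)     */
};

struct utx2_color { unsigned r, g, b, a; };  /* raw field values */

/* ASSUMPTION: endpoint A supplies the high mode bit, endpoint B the low. */
static enum utx2_mode utx2_block_mode(uint16_t ep_a, uint16_t ep_b)
{
    return (enum utx2_mode)((((ep_a >> 15) & 1) << 1) | ((ep_b >> 15) & 1));
}

/* RGB555: xrrr-rrgg-gggb-bbbb */
static struct utx2_color utx2_unpack_rgb555(uint16_t ep)
{
    struct utx2_color c;
    c.r = (ep >> 10) & 31;
    c.g = (ep >> 5) & 31;
    c.b = ep & 31;
    c.a = 7;                /* treated as fully opaque on a 3-bit scale */
    return c;
}

/* RGB444A3: xrrr-ragg-ggab-bbba */
static struct utx2_color utx2_unpack_rgb444a3(uint16_t ep)
{
    struct utx2_color c;
    c.r = (ep >> 11) & 15;
    c.g = (ep >> 6) & 15;
    c.b = (ep >> 1) & 15;
    /* ASSUMPTION: alpha bit order is R-LSB, G-LSB, B-LSB (high to low). */
    c.a = (((ep >> 10) & 1) << 2) | (((ep >> 5) & 1) << 1) | (ep & 1);
    return c;
}
```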
The UTX2 format can serve as a rough approximation for DXT5 at roughly the same size as DXT1. Quality will be a little less where alpha is used, but typically color fidelity is less important in blended areas.
Unlike DXT1 and DXT5, these formats assume the images are stored in Morton order (Z order), including at the level of block pixels. This is mostly because it is useful for a rasterizer (albeit it only deals well with square power-of-2 textures, but this use case dominates in an OpenGL style renderer). It is assumed that UTX2 images would likely be transcoded from DXT1 or DXT5 during texture upload, rather than UTX2 being used directly by programs.
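For reference, a Morton index can be computed by interleaving the bits of the X and Y coordinates. The sketch below uses the common shift-and-mask bit-spreading trick; it assumes X goes in the even bits and Y in the odd bits, which may not match the convention actually used by the rasterizer. The function names are made up here.

```c
#include <stdint.h>

/* Spread the low 16 bits of v so they occupy the even bit positions. */
static uint32_t part1by1(uint32_t v)
{
    v &= 0x0000FFFF;
    v = (v | (v << 8)) & 0x00FF00FF;
    v = (v | (v << 4)) & 0x0F0F0F0F;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

/* Morton (Z-order) index for texel (x, y).
 * ASSUMPTION: x occupies the even bits, y the odd bits. */
static uint32_t morton2d(uint32_t x, uint32_t y)
{
    return part1by1(x) | (part1by1(y) << 1);
}
```

For a square power-of-2 texture, the same interleave covers both levels at once: the low 4 bits of the index select the pixel within a 4x4 block and the remaining bits select the block, which is presumably part of why this layout suits a rasterizer.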
Have now sort of gotten GLQuake working on this rasterizer as well, though as of yet its performance still falls a bit short of “usable”. I had started work on a new memory bus design which should hopefully be able to improve performance, but there is still a bit of work needed for this part. In this case, GLQuake is more limited by memory bandwidth than by computational cost.
Also, in all of this, TestKern has started to mutate into something resembling a Unix-style OS (at least in terms of filesystem, and rudimentary implementation of parts of the POSIX API), though it is still a bit of a stretch to call it a “real OS”. As of yet, it still lacks things like dynamic linking, multiple processes, or a preemptive task scheduler. These may be worked on some more, but haven’t been a high priority. Support for things like virtual memory is still in a half-implemented state.
Binaries in TestKern still use the PEL4 format, which is essentially a modified LZ compressed PE/COFF variant.