BJX2 ISA Continues, but for how long?…

So, I have still been working on the BJX2 ISA, but things seem to be slowing down: New features are becoming less frequent, and progressively more of my “effort budget” is being taken up by things like looking for and trying to fix long-standing bugs (and trying to fix bugs without adding new ones, which is itself a problem).

Virtual Memory: I have reached a stage where virtual memory kinda works, but still falls short of being sufficiently reliable to “call it good”. One achievement this year is, at least, that I am now more-or-less able to run stuff using pagefile-backed virtual memory. However, beyond this, “TestKern” is still a far cry from what would be considered a “good” (or even “usable”) OS (functionality and reliability are lacking even vs MS-DOS era standards).

I did at least get virtual memory and pagefile support mostly working, which is something. Its general reliability is still unproven, but “it is probably better than nothing”.

The ISA has gained a few features: Partial RISC-V (RV64IM) support as an alternate mode, though it is still not well tested. It is also unclear how much sense it makes to run standalone RISC-V code on top of my CPU core, as its design isn’t really “optimal” for use as a RISC-V core.

At some point, I did add an XGPR extension, which can expand the GPR space to 64 registers, though with some awkwardness in terms of encoding rules and limitations (using these registers has an adverse effect on orthogonality; so they end up more as a “break glass in case of emergency” kind of feature).

I did eventually add integer and floating-point divide instructions, though they are not particularly high performance (around 120 cycles for a Binary64 FDIV; 68 cycles for Int64 DIV; and 36 cycles for Int32 DIV). This was alongside adding 64-bit integer multiply (also kinda slow), as well as a few optional BCD helper instructions, which can potentially be used for faster binary-to-decimal conversion.

Another experimental feature was an LDTEX instruction, which performs a compressed texture load (in 3L/1T), which can be useful for 3D rendering tasks. The base register encodes the texture type and resolution in the high order bits, with the index register giving the texture coordinates. The current implementation is limited to Morton-order textures (raster order textures will require using a more traditional approach for now).

I have also added a faster “low precision SIMD” unit, where SIMD operations on Binary16 and reduced-precision Binary32 (S.E8.F16.P7, where P=Pad/Ignored) can now be handled in 3 cycles (3L/1T). This goes along with some additional Binary16 SIMD helper instructions, and some instructions to help with working with vectors of FP8 values (E4.F4 and E4.F3.S).
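As a rough illustration, the reduced-precision Binary32 format can be modeled in software by clearing the low 7 mantissa bits of a normal Binary32 value (the “P7” pad field). This is a hypothetical helper for illustration, not anything from the actual implementation:

```c
#include <stdint.h>
#include <string.h>

/* Truncate a Binary32 value to the reduced S.E8.F16.P7 form by
   clearing the low 7 mantissa bits (the "pad/ignored" field).
   Hypothetical software model; the hardware presumably just
   ignores these bits internally. */
static float f32_truncate_lowprec(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);   /* type-pun via memcpy */
    bits &= 0xFFFFFF80u;              /* keep sign + 8 exp + 16 mantissa bits */
    memcpy(&f, &bits, sizeof bits);
    return f;
}
```

Values representable in 16 mantissa bits pass through unchanged; others lose their low-order precision.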

I guess my ISA in general does pretty OK with software OpenGL rasterization and JPEG decoding. But it lacks anything that is a strong selling point, more so given that it is limited to relatively low clock speeds (and “general case” performance still isn’t all that high).

A lot of compiler work has gone on as well, including fixing a few longstanding bugs, but the compiler is seemingly an endless source of bugs.

Interest in and traffic to the project now seem to be less than in the past.

Stuff continues for now, but for how long?…

BJX2 Status Update (2021)

A lot of stuff happened over the course of the year. Development on BJX2 continues.

While not exactly the same as it was (various breaking changes happened along the way, including a few major ones to the ISA encoding), development has continued, with the addition of Jumbo prefixes, SIMD, and more…

So, BJX2 has mostly stabilized on a 3-wide VLIW design known as WEX-3W (Wide Execute, 3-Wide), and many other ISA features have been built around this assumption. As-is, both Jumbo and SIMD implicitly assume the existence of WEX.

Jumbo allows for larger immediate fields in various instructions. It consists of a “Jumbo Prefix” (FEjjjjjj), which carries an additional 24 bits of payload and is used alongside an instruction which takes an immediate. This expands 9- and 10-bit immediate fields to 33 bits (the MSB serves as a sign extension for the remaining bits, so it can also encode the 32-bit unsigned range). Using two jumbo prefixes will expand these instructions to 57 bits, and if used with an Imm16 instruction, will produce a 64-bit immediate.

This means a full-width 64-bit constant can be loaded in 1 cycle, using 12 bytes of memory, or a 32-bit constant can be loaded using 8 bytes.
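As a sketch of the mechanics (illustrative only, not the actual decoder logic), composing a 33-bit immediate from a single Jumbo prefix plus an Imm9 instruction might look like:

```c
#include <stdint.h>

/* Compose a 33-bit immediate from a Jumbo prefix and an Imm9 field:
   the prefix contributes the 24 high-order bits, the base instruction
   the low 9, and the MSB (bit 32) sign-extends the result.
   Hypothetical decoder sketch. */
static int64_t jumbo_imm33(uint32_t jumbo24, uint32_t imm9)
{
    uint64_t v = ((uint64_t)(jumbo24 & 0xFFFFFF) << 9) | (imm9 & 0x1FF);
    if (v & ((uint64_t)1 << 32))              /* sign-extend from bit 32 */
        v |= ~(((uint64_t)1 << 33) - 1);
    return (int64_t)v;
}
```

With bit 32 clear, the remaining 32 bits cover the full unsigned 32-bit range, matching the description above.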

This did come with the loss of the original 48-bit encoding space. It is possible that 48-bit encodings could be reintroduced in a new encoding space (a jumbo prefix followed by a 16-bit instruction). But this hasn’t really been done yet, and doesn’t mix well with how my C compiler implements WEX. Instead, one will likely need to live with the slightly worse code density of 64-bit Jumbo encodings.

Another addition was 128-bit SIMD operations (the GSVX extension). Unlike several other ISAs, these do not add a separate register space, but instead reuse the GPR space. In this case, the GPRs are paired together and used as a logical 128-bit register. In terms of notation, these are expressed as Xn, where X8 is the pair of R8 and R9, X10 is R10 and R11, and so on. These logical Xn registers are only allowed for even register numbers; Xn registers with the LSB set are reserved for future expansion (most likely the implementation of a larger register set).

There is support for both Packed Integer and Floating Point SIMD. For packed integer, the ALU is subdivided (as opposed to using multiple ALUs). For FP SIMD, the vector elements are fed through the FPU using pipelining.

In terms of implementation, these ‘X’ instructions use multiple lanes internally (treating the 64-bit 6R+3W register file as a 128-bit 3R+1W register file), so they may typically only be encoded as scalar operations (with a partial exception for Jumbo prefixes). Many also work by combining units across multiple lanes into a single larger virtual unit (this is the case for most of the ALUX instructions). Otherwise, most of these instructions take the form of normal 32-bit (4-byte) instruction encodings.

Another more recent addition is the GFPX (or FPUX) extension, which extends the FPU ops to 128 bits, using the IEEE-754 Binary128 format. The current implementation doesn’t support the full precision of the format (doing so would be quite expensive), so it instead uses an intermediate precision (at the time of this writing, 72 mantissa bits for FMUL, and 87 for FADD). The exact precision will be implementation dependent. This is also being called the “Long Double” extension, partly as it is primarily intended for C’s “long double” type.

Another related extension is ALUX, which adds various 128-bit integer operations (ADD/SUB/AND/OR/XOR/CMP/SHAD/…). Most of the basic integer ops work by joining the ALUs across two lanes together, passing carry bits and similar between them, allowing them to function as if they were a single larger ALU.
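The lane-joining trick can be modeled in C roughly as follows (a software sketch; in hardware the carry bit is passed between the two lane ALUs directly rather than being recomputed):

```c
#include <stdint.h>

/* 128-bit add formed from two 64-bit lane adds chained by a carry bit,
   mirroring how ALUX joins two lanes into one wide virtual ALU. */
typedef struct { uint64_t lo, hi; } u128;

static u128 add128(u128 a, u128 b)
{
    u128 r;
    r.lo = a.lo + b.lo;
    uint64_t carry = (r.lo < a.lo);  /* carry out of the low lane */
    r.hi = a.hi + b.hi + carry;      /* carry in to the high lane */
    return r;
}
```

SUB works the same way with a borrow, and the bitwise ops (AND/OR/XOR) need no cross-lane communication at all.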

The shift instructions got a little more complicated, and involved replacing the original 64-bit barrel shift design with a logarithmic funnel shift. In this case, each shift unit produces a 64-bit output, but may use 128 bits on the input side (internally, this expands to 192 bits to allow for things like sign and zero extension of the 128-bit value). Using two of these units in parallel allows for performing a 128-bit shift. Despite the larger input size, the overall cost of the new shift unit is roughly break-even with the old one.
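A rough software model of one such funnel-shift unit (illustrative only, not the RTL): a 128-bit input window (hi:lo) is shifted right by n, producing a 64-bit result. A full 128-bit shift can then be built from two such units operating in parallel on overlapping windows.

```c
#include <stdint.h>

/* Funnel right-shift: 128-bit input (hi:lo), 64-bit output.
   Bits shifted out of 'lo' are refilled from 'hi'. */
static uint64_t funnel_shr(uint64_t hi, uint64_t lo, unsigned n)
{
    n &= 63;
    if (n == 0)                      /* avoid undefined hi << 64 */
        return lo;
    return (lo >> n) | (hi << (64 - n));
}
```

Setting hi to 0 gives a plain logical shift, and setting it to a sign replication of lo gives an arithmetic shift, which is roughly where the sign/zero extension inputs come in.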

Most ALUX instructions lack immediate forms. This isn’t seen as a huge issue though: even without immediate forms, they are still a big improvement over needing to use function calls. Also, loading a 128-bit constant could potentially cost around 24 bytes; typically, it would be done as a pair of 64-bit constant loads.

Though, ironically, one of the main likely use-cases for ALUX would be for more efficient implementation of Binary128 operations in software.

Some time ago, I had also started working on an “OpenGL style” software rasterizer on BJX2. Ironically, it does things like alpha blending, color interpolation, and bilinear filtering pretty well, but suffers from a pretty big problem in terms of memory bandwidth.

In this case, I did end up adding a few misc instructions to support a few custom compressed-texture formats (which I am calling UTX1 and UTX2), which aim to do something similar to DXT1/BC1 and DXT5/BC3 at roughly half the effective size. UTX1 uses a 12-bit RGB center value, a 4-bit Y-delta, and 16 selector bits. As such, UTX1 can encode non-transparent textures at 2bpp, with “roughly passable” image quality (I suspect unusable performance would be a bigger problem than the textures looking kinda ugly). UTX2 is similar to DXT1, but uses a pair of RGB555 or RGB444A3 endpoints instead. RGB444A3 uses the same layout as RGB555, but takes the LSB of each component and uses it as an alpha value (xrrr-ragg-ggab-bbba).

The two endpoint MSB’s select how to interpolate the block:

  • 00=Interpolated Opaque (RGB555): 00=A, 01=2/3A+1/3B, 10=1/3A+2/3B, 11=B;
  • 01=Translucent (RGB444A3), with two selector bits (Color, Alpha);
  • 10=Reserved for now;
  • 11=Interpolated Alpha (RGB444A3), alpha interpolated along with RGB.

The UTX2 format can serve as a rough approximation for DXT5 at roughly the same size as DXT1. Quality will be a little less where alpha is used, but typically color fidelity is less important in blended areas.
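As an illustration of the endpoint format, a hypothetical decoder for an RGB444A3 endpoint (following the xrrr-ragg-ggab-bbba layout described above) might look like:

```c
#include <stdint.h>

/* Decode an RGB444A3 endpoint: same bit layout as RGB555, but the LSB
   of each 5-bit component is repurposed as an alpha bit, giving 4-bit
   RGB components and a 3-bit alpha. Sketch of the described layout. */
static void rgb444a3_decode(uint16_t v,
    uint8_t *r, uint8_t *g, uint8_t *b, uint8_t *a)
{
    *r = (v >> 11) & 0x0F;                   /* high 4 bits of red field   */
    *g = (v >>  6) & 0x0F;                   /* high 4 bits of green field */
    *b = (v >>  1) & 0x0F;                   /* high 4 bits of blue field  */
    *a = (uint8_t)((((v >> 10) & 1) << 2) |  /* alpha from the three LSBs  */
                   (((v >>  5) & 1) << 1) |
                   ( (v       ) & 1));
}
```

Since the layout matches RGB555, an opaque decoder can treat the same 16 bits as 5:5:5 color and simply ignore the alpha interpretation.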

Unlike DXT1 and DXT5, these formats assume the images are stored in Morton order (Z order), including at the level of block pixels. This is mostly because it is useful for a rasterizer (albeit one which only deals well with square power-of-2 textures, but this use case dominates in an OpenGL style renderer). It is assumed that UTX2 images would likely be transcoded from DXT1 or DXT5 during texture upload, rather than UTX2 being used directly by programs.
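For reference, computing a Morton-order index from texel coordinates is a standard bit-interleave; a typical implementation (generic, not specific to BJX2) looks like:

```c
#include <stdint.h>

/* Interleave the bits of (x, y) into a Morton (Z-order) index:
   x bits land in the even positions, y bits in the odd positions.
   Handles square power-of-two textures up to 65536x65536. */
static uint32_t morton_interleave(uint16_t x, uint16_t y)
{
    uint32_t zx = x, zy = y;
    zx = (zx | (zx << 8)) & 0x00FF00FF;
    zx = (zx | (zx << 4)) & 0x0F0F0F0F;
    zx = (zx | (zx << 2)) & 0x33333333;
    zx = (zx | (zx << 1)) & 0x55555555;
    zy = (zy | (zy << 8)) & 0x00FF00FF;
    zy = (zy | (zy << 4)) & 0x0F0F0F0F;
    zy = (zy | (zy << 2)) & 0x33333333;
    zy = (zy | (zy << 1)) & 0x55555555;
    return zx | (zy << 1);
}
```

The appeal for a rasterizer is locality: nearby texels stay nearby in memory, which helps cache behavior during bilinear fetches.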

I have now sort of gotten GLQuake working on this rasterizer as well, though as of yet its performance still falls a bit short of “usable”. I had started work on a new memory bus design which should hopefully be able to improve performance, but there is still a bit of work needed for this part. In this case, GLQuake is more limited by memory bandwidth than by computational cost.

Also, in all of this, TestKern has started to mutate into something resembling a Unix-style OS (at least in terms of filesystem, and rudimentary implementation of parts of the POSIX API), though it is still a bit of a stretch to call it a “real OS”. As of yet, it still lacks things like dynamic linking, multiple processes, or a preemptive task scheduler. These may be worked on some more, but haven’t been a high priority. Support for things like virtual memory and similar is still in a half-implemented state.

Binaries in TestKern still use the PEL4 format, which is essentially a modified LZ compressed PE/COFF variant.

BJX2 Status, Testkern “OS”

OK, so it has been a while. BJX2 has changed significantly from what it was a year ago. It has changed from being a Scalar RISC style ISA, to a RISC/VLIW hybrid. Many aspects of the core ISA and C ABI have since changed, FPRs no longer exist (the FPU has been merged into GPR space), etc…

A lot of this was intended to help improve the performance I could hope to get from it, while other changes were to reduce cost (it turned out, for example, to be cheaper to put the FPU in GPR land than to have a separate set of FPRs, dedicated load/store mechanisms, …). Parts of the original FPU were either dropped, or were essentially folded back into the integer path and ALU.

The initially mostly unoccupied encoding space is now mostly full, as the addition of explicitly parallel encodings and predicated instructions effectively ate up nearly all of the remaining top-level encoding space.

The basic aspects of the instruction encoding survived, so in terms of its encodings it is still basically a RISC with variable 16- or 32-bit instructions. The differences are that groups of 32-bit instructions can now be encoded so as to execute in parallel; instructions may be encoded to execute or not execute based on the state of the SR.T flag; and some otherwise-illegal encodings may be used to encode compound (“jumbo”) operations with 32-bit or (in a few rare cases) 64-bit immediate fields, etc…

Further changes on this scale are unlikely, and changes to the ISA have dropped off significantly as things have been gradually “maturing”.

I now have a functioning FPGA implementation as well, with a core able to run at 50MHz on the Arty S7 and Nexys A7 boards. It turned out to be too much of an issue trying to keep it passing timing at 100MHz, so I ended up choosing to focus on a slightly more capable core, albeit one running at a lower clock speed.

One issue that has become apparent though is that memory bandwidth is a problem, and a serious one for my current implementation. Things like Doom are now mostly usable, but usable framerates in Quake and similar remain elusive. These programs seem to depend heavily on memory bandwidth, and with slow memory operations, performance suffers.

 

Interesting aspects of the design IMO are that this is one of the few ISAs to have both scalar encodings as well as explicitly parallel VLIW-like encodings. Many other ISAs, such as ARM or RISC-V, are based around a scalar design and as such would rely on superscalar mechanisms to be able to execute more than one instruction at a time. Superscalar, however, adds a lot of complexity, and so would put the possibility of a “simple” yet moderately high performance core out of reach.

Another common form of VLIW is to pack a series of instructions into a large fixed-size block, with any unused lanes being filled with NOPs. This is a bad situation if insufficient parallelism exists to fill all of the lanes, in which case code-density suffers. Because my ISA does not use fixed-sized bundles, it is not necessary to waste space encoding a lot of NOPs.

Some of these features are optional: a scalar subset of the ISA exists which may still be used on the wider cores, and to some extent VLIW-like code may be used on a scalar-only implementation (just that it will be executed as scalar code). It is required that code be generated in such a way that it will execute as expected with both scalar and parallel execution.

 

I also have an “OS” of sorts which is gradually forming as part of this project. For now, I am calling this project “Testkern”; at the time of this writing it is technically vaguely DOS-like, with aspirations of rising to the level of a vaguely Unix-like OS. There are some “task state” structures in place, but it doesn’t yet have a functioning task scheduler, so it is not yet a multi-tasking OS (work is still ongoing here).

As can be noted, it is for now mostly using the FAT filesystem and a PE/COFF variant (PEL4) for binaries, with PEL4 omitting the MZ stub and optionally compressing the binary with an LZ4 variant (reducing the time needed to read it from an SD card).

At present, it uses a single large address space, with a special ABI (PBO) designed to allow multiple logical processes to exist in a shared address space. This works somewhat differently from the ELF FDPIC ABI.

With ELF FDPIC, pretty much anything accessing global variables or calling functions needs to go through the GOT, leading to a lot of double-indirection. Calling functions is reasonably involved, as it is necessary to save and restore the current GOT as well as load the target GOT, then load and branch to the target address.

In PBO, this is mostly avoided by having the ‘.data’ and ‘.bss’ sections addressed via a displacement relative to GBR (Global Base Register), with GBR taking the place of the GOT. Calls within an image do not need to save, restore, or reload GOT, and so proceed mostly as they would in absolute or base-relocatable images. This somewhat reduces overhead in the common case.
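The difference in access patterns can be sketched roughly as follows (function names here are illustrative, not part of either ABI):

```c
#include <stdint.h>

/* FDPIC-style access: load a pointer from the GOT, then dereference
   it — two dependent memory accesses per global. */
static int fdpic_load_global(void *const *got, int got_slot)
{
    return *(const int *)got[got_slot];
}

/* PBO-style access: '.data'/'.bss' live at fixed displacements from
   GBR, so a global is reached with a single GBR-relative load. */
static int pbo_load_global(const char *gbr, int disp)
{
    return *(const int *)(gbr + disp);
}
```

Both reach the same variable, but the PBO form avoids the extra pointer load on every access, which is the “double-indirection” overhead mentioned above.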

The main exception then, is exported functions and calls through function pointers, which need to deal with reloading the correct GBR (either by the called function or through a thunk). At negative addresses relative to GBR, there is a small table of pointers to every other valid GBR within the process, and it is then merely necessary to reload GBR from the index assigned to the current program image (this part is currently handled using base-relocations to patch the index at load time).

From the perspective of the caller though, the function pointer is called as a raw function pointer, and small leaf functions don’t necessarily have to deal with any of this if they don’t make use of global variables.

 

The plan though is that using a shared address space will not entirely give up on memory protection, but rather that a different mechanism may be used. Instead of using protection rings (User/Supervisor) and separate address spaces, a variation on the UID/GID model may be applied to memory pages, which allows separate logical processes to have their own pages. If another process attempts to access these pages (without the corresponding UGID permissions in its keyring), it will raise an exception. Since the keyring is a register, it can be updated on task switch (consequently changing which pages can be accessed) without requiring the caches or TLBs to be flushed (thus reducing the overhead associated with task switches).

If there is a match between the TLB entry for a page being accessed, and one of the entries in the current keyring, access may be granted according to the access mode specified for the page in question (following similar rules to those for Unix-style file permissions). There is not currently an equivalent of ACL checking for memory pages, but the current mechanism should hopefully be sufficient.
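The check can be sketched roughly as follows (field widths, names, and the keyring size here are all hypothetical):

```c
#include <stdint.h>

/* Sketch of the keyring check: a TLB entry carries the page's UGID
   and an access mode; access is granted if any key in the current
   keyring register matches the UGID, with the mode bits then checked
   Unix-style by the caller. */
enum { KEYRING_SLOTS = 4 };

typedef struct {
    uint16_t ugid;   /* owner of the page              */
    uint8_t  mode;   /* allowed access bits, rwx-style */
} tlb_entry_t;

static int keyring_allows(const tlb_entry_t *tlbe,
                          const uint16_t keyring[KEYRING_SLOTS])
{
    for (int i = 0; i < KEYRING_SLOTS; i++)
        if (keyring[i] == tlbe->ugid)
            return tlbe->mode;   /* matched: return allowed mode    */
    return 0;                    /* no match: access fault would be */
}                                /* raised by the hardware          */
```

In hardware this comparison happens in parallel across the keyring slots during TLB lookup, which is why it adds little latency.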

Note that this doesn’t add any real latency (and not even all that much in terms of FPGA resource costs) compared with a traditional User/Supervisor mechanism. The way it is implemented also doesn’t add any real space overhead in the page tables (page-table entries remain 64 bits); however, this mechanism isn’t currently available for page tables with 32-bit entries.

Note that the UGID doesn’t correspond to the “current user” in the login sense; rather, UGIDs may be better compared with groups of PIDs (Process IDs), with kernel facilities and services having special permissions. It is likely that UGID memory permissions will be distinct from the UIDs and GIDs used to access files.

 

As for when I will post on here again, I don’t know…

 

Mystery world of IRL sensory perception.

There are things I notice sometimes which are either curious or seemingly slightly outside of what should be normal human experience (nevermind if this is possibly just insane ramblings).

 

Colors:

I seem to have slightly unusual color vision, but not in an obvious way. I am able to see all the normal RGB colors, and on a monitor I can see a world with all its colors.

In normal life, I see a world with colors as well. But there is an issue: the colors don’t quite match up. So, cameras, monitors, … all seemingly conspire to show a world just a little different from the one I see IRL.

It goes as far as, if I look at a rainbow caused by light diffraction, I actually see a different rainbow. At the spot where cyan/azure would normally be, I instead see sort of a dull “anomalous gray” color. So, the rainbow jumps from green, to this anomaly color, to blue, and then two “slightly different” violets (which I will call “pink” and “yellow” violet).

Or:

  • Computer Rainbow: Red, Yellow, Green, Cyan, Blue, Violet
  • IRL Rainbow: Red, Yellow, Green, AnomGray, Blue, PinkViolet, YellowViolet

Of these, the anomalous gray and the two violets have no direct equivalent on computers (nor an obvious location in RGB space). Despite cyan being absent from the rainbow, I can still see it in most cases (though, things which “should be” this color may appear to be other colors).

The anomalous gray is difficult to describe, as there isn’t really much else that looks like it. If anything, it is like “grayness” was a color in its own right, similar to, but somehow different from, the gray one might see on a monitor. It is generally a bit duller if compared with the other colors. It is sort of like the color things are when one is outside at night.

Pink-violet is similar to the traditional violet color, but has a slightly more pinkish tinge. It can also mix with gray and white.

Yellow-violet is also violet, but seems to be opposed to anom-gray (similar to how blue and yellow don’t get along; if gray is like blue, this color would be like yellow). As with anom-gray, it also seems like it sits somewhere outside the normal RGB space.

 

Interestingly, I can put on sunglasses (ex: UV400), and all this changes. The gray band in the rainbow disappears (reverting to cyan), and pretty much everything past the blue band disappears. The world through sunglasses then looks a lot more like the world seen by cameras, where things are almost the same, but the colors are a lot more desaturated. Similarly, objects which may be blue or purple normally may turn entirely black with sunglasses on (these objects also typically appear black on camera).

I suspect maybe cyan may reappear on the rainbow because anom-gray can’t really exist by itself unless it can contrast itself with violet?…

I really don’t know how this works in a color-theory sense, I don’t even have good names for these colors.

 

Some things change around as well. In my normal vision, wool is often green (or sometimes yellow or red), but I just don’t see wool as gray. So, for example, if a suit looks gray, it is usually polyester or rayon or something; and if it is wool, it is typically yellow or green.

Similarly, normal glass usually has a greenish tint, and things like tonic water are yellow. Similarly, UV filter glass (from halogen lamps) is pretty obviously yellow.

In general, people also look a little darker IRL than they do on camera: pretty much everyone looks brown if they are standing in the sunlight, but appears lighter under artificial light sources. Well, excluding halogen lighting, where often pretty much everyone looks brown and the whole area is covered in a violet haze (the violet colors are almost always slightly unfocused and hazy).

In general though, I am also pretty used to the way things look on monitors and TV, so probably can’t complain too much about it.

It is like averaging red and green together. Despite the loss of color depth the world still looks mostly about the same; just imagine this is how things look through cameras “in-general”. Like if one were in a world where one could see colors normally, but all of the displays were in shades of blues and yellows…

 

It is worth noting, this is just how it has always been for me, but otherwise my vision functions mostly normally (I can still focus on things, etc).

Not sure of a specific cause; I suspect possibly some sort of genetic anomaly or something, or maybe just psychology or delusion (no means exists to produce any sort of physical evidence of these effects).

I can’t say for certain that the effects aren’t just psychological or imaginary, or people can believe I am just making stuff up or whatever if they want, either way…