Making a CPU, SH4+BJX1

So, I had been sinking a lot of time recently into my ISA / Emulator / C Compiler / (now) CPU project. This has run a bit out of control, having now consumed about the last year of my life.

The base ISA, from which it is derived, is the SuperH SH4 ISA, probably best known from the Sega Dreamcast, and otherwise used in various pieces of consumer electronics. It is a RISC, but one which uses fixed 16-bit instruction words rather than the 32 bits more typical of RISC; SH also later inspired the Thumb ISA.

A drawback of this ISA, however, is that it sometimes needs longer-than-ideal instruction sequences to perform operations. My workaround to this has become an ISA I call BJX1 (yet to come up with an actual name, could backronym it to “BetaJeX” or something; originally meant “BGB J-eXtensions”), which inherits the basic SH4 ISA (albeit with some instruction forms marked as “deprecated”), but adds a few escape-coded extension blocks (a major one being 8Exx, which single-handedly implements the bulk of the BJX1-32 ISA).

Extensions mostly include expanded MOV forms, expanded immediate values, and some “Reg, Reg, Reg” and “Reg, Imm, Reg” forms for arithmetic operations.

Similarly, there is the BJX1-64 ISA, which is closely related, but expands the ISA to 64 bits. This includes some additional instruction forms, and now also the CCxx and CExx blocks (much of the Cxxx block was dropped in BJX1-64, and was reused for operations which are “more useful”). Thus far, these blocks are mostly used to encode instruction forms which extend the ISA to 32 GPRs, in a form I am calling 64A (a 64B variant exists which stays with using 16 GPRs, and similarly lacks CCxx and CExx). There is a penalty involving use of R16..R31, but they are helpful for functions with a lot of register pressure. These blocks don’t exist in the 32-bit ISA (where these instead encode some rarely-used “@(GBR, disp)” operations).

Effectively, the 8Exx, CExx, and CC0e blocks work as prefixes, essentially modifying the normal 16-bit instruction words:

  • 8Exx: Adds or extends immediate for most instructions.
    • MOV forms either get an 8-bit displacement (“MOV.L @Rm, Rn” => “MOV.L @(Rm, disp8s), Rn”), or transform into @(Rm, Ro, disp) forms (“MOV.W @(Rm, R0), Rn” => “MOV.W @(Rm, Ro, disp4), Rn”).
    • Many arithmetic operators get an 8-bit immediate, and a 3-register block is added.
    • Values which use an immediate or displacement generally get a bigger immediate or displacement, and a few blocks are repurposed.
      • “BT disp16”, “BRA/N disp16”, …
      • “ADD #imm16s, Rn”
      • “LDSH16 #imm16s, Rn” replaces “MOV.W @(PC,disp), Rn”; Encodes “Rn=(Rn<<16)+Imm;” This is mostly used for constructing larger inline constants.
    • Some other additions include LEA instructions and similar, which are useful for pointer arithmetic.
  • CExx: Does something similar to the 8Exx block, but gives up a few bits for extended GPRs, resulting in similar operations generally having a 4- or 6-bit immediate vs an 8-bit immediate.
    • Some of the MOV forms simply become @(Rm, Ro) instead of @(Rm, Ro, disp4).
    • Size-independent single-register operations keep the full width immediate (typically 16 bits), but simply encode a register in the range of R16..R31 instead of R0..R15.
    • “SHADQ/SHLDQ Rm, Imm, Rn” was split into multiple alternate instruction forms (SHALQ/SHARQ/SHLLQ/SHLRQ), as 6 bits is only sufficient to encode the shift amount (leaving no room to encode the shift type).
  • CC0e: Functions essentially similar to an x86-64 REX prefix.
    • e=Qnst/Qnmz; Q=QWord; n/s/t each add a bit to a register field; in the nmz form, n and m add a bit to Rn and Rm, with z usually indicating whether an applicable DWord operator will use sign or zero extension on the result.
    • More subtly modifies some other operations, so CC00 isn’t exactly NOP.
      • TBD: Define CC00 as simply padding a 16-bit opcode to 32 bits?…
  • CC3e: Implements basically a set of 3R arithmetic operators over all 32 GPRs. This one isn’t technically a prefix, as the following word is a special-purpose encoding using all 16 bits (onst: a 4-bit operator followed by Rn, Rs, and Rt).
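As an aside, the LDSH16 mechanism makes larger inline constants a short sequence. Here is a small C model of the semantics described above (the helper names are mine, not part of the ISA), including the compensation needed when bit 15 of the low half is set (since the immediate is signed):

```c
#include <stdint.h>

/* Model of the LDSH16 semantics: Rn = (Rn << 16) + sign-extended imm16.
 * Function names here are mine, not part of the ISA. */
static uint32_t ldsh16(uint32_t rn, int16_t imm)
{
    return (rn << 16) + (uint32_t)(int32_t)imm;
}

/* Building an arbitrary 32-bit constant: seed the high half (eg, via a
 * MOV or prior LDSH16), adding 1 when the low half is "negative" (bit
 * 15 set) to compensate for the signed add, then one LDSH16 supplies
 * the low 16 bits. */
static uint32_t build_const32(uint32_t value)
{
    int16_t  lo = (int16_t)(value & 0xFFFF);
    uint32_t hi = ((value >> 16) + ((value >> 15) & 1)) & 0xFFFF;
    return ldsh16(hi, lo);
}
```

So a chain of two such steps (or a MOV plus one LDSH16) covers any 32-bit value without a memory load.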

Code density doesn’t quite reach the levels of Thumb2 as of yet, but gets “reasonably close”. Now, whether any of this has a point, given the existence of RISC-V’s RV32C and RV64C, remains to be seen (these seem to be in the same general area regarding code density).

A lot of the time in recent months has mostly been thrown at the C compiler, trying to make the ISA variants work, find bugs, and improve the quality of the generated code. This is still not fully solved, as bugs remain which have yet to be located and fixed (particularly in the 64-bit ISAs).


In some ways, Verilog proves challenging. In a basic sense, it is a language like many others, but it is a language with many awkward restrictions. You can write stuff, and it will work in simulation. However, synthesis becomes a problem, with lots of little things which conspire to eat up the FPGA’s budget. Little things like the relative length vs element width in an array, how the array is accessed, whether the array matches nicely with the size of the internal Block-RAM, when values are used directly vs delayed to a subsequent clock cycle, whether multiple versions of an operator are used vs forcing multiple paths through a single operator, … can have fairly significant effects on the cost of synthesizing the design.

Like, in C, one can use memory more freely, and C really doesn’t care when and where one uses shifts and multiplies. Verilog cares, and you will be judged for how often you multiply numbers, and the bits involved in the input and products of the multiplication, etc. It is like “That huge and ugly casez() block with hundreds of patterns; yeah that is fine. But as for that integer multiply working with 128 bit quantities; I am not happy about this…”. Often, things which work well in VL would be horrible in C, and things one takes for granted in C are very expensive in VL.

For example, BJX1-64 has a DMULS.Q operation, which takes two 64-bit signed quantities and produces a 128 bit result. So, trying to do it directly, one gets some big ugly mass of LUTs and DSP units which proceeds to eat up a scary chunk of the total budget. But, assuming I keep DMULS.Q, there is some incentive to find a cheaper way to do the multiply (such as a state machine built from narrower multiplies, spreading the multiply over several clock cycles).
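The idea of spreading the multiply over narrower units can be sketched in C as four 32×32→64 partial products, which a state machine might issue one per cycle. This is a sketch of the decomposition only, not the actual Verilog:

```c
#include <stdint.h>

/* 64x64->128 signed multiply built from four 32x32->64 partial
 * products (the schoolbook decomposition a multi-cycle state machine
 * could walk through). */
static void dmuls_q(int64_t a, int64_t b, uint64_t *lo, int64_t *hi)
{
    uint64_t ua = (uint64_t)a, ub = (uint64_t)b;
    uint64_t a0 = ua & 0xFFFFFFFFu, a1 = ua >> 32;
    uint64_t b0 = ub & 0xFFFFFFFFu, b1 = ub >> 32;

    uint64_t p00 = a0 * b0;   /* each is one 32x32->64 multiply */
    uint64_t p01 = a0 * b1;
    uint64_t p10 = a1 * b0;
    uint64_t p11 = a1 * b1;

    uint64_t mid = (p00 >> 32) + (p01 & 0xFFFFFFFFu) + (p10 & 0xFFFFFFFFu);
    *lo = (mid << 32) | (p00 & 0xFFFFFFFFu);

    uint64_t uh = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
    /* correct the unsigned high half for signed inputs */
    if (a < 0) uh -= ub;
    if (b < 0) uh -= ua;
    *hi = (int64_t)uh;
}
```

Whether this actually beats the one-big-multiplier version in FPGA cost depends on how many DSP slices the narrow multiplies map onto.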

Similar sorts of things happen if one tries to implement memory as a long skinny array vs shorter, wider arrays. Despite being the same total capacity, the longer array may be notably more expensive. Access to memory arrays also follows a similar principle to Highlander: “There can be only one!” You are allowed one read and one write to an array, not per mutually-exclusive control path (this would make it too easy), but in total. Fail at this, and even a relatively modest-sized array will try to eat the budget.

But, it would be nice to have an implementation of my BJX1 ISA in hardware. Never mind if it seems at this point like I have little idea what I am doing.


Alongside all this, I have been writing sci-fi, and am now technically an author. This is even if, at the time of this writing, I only have about $3 to my name from ebook sales. Like, “Woo, several people have bought it!” But, of these, how many will read it? How many will think it isn’t crap? Will it sell enough to “actually matter”? …



Misc projects: Fast Compressors.

How to introduce this? These are something one works on, they serve a purpose, but don’t get much attention.


I have a design for a format I had called FeLZ32, which is basically “like LZ4 but decodes faster”. How? It basically dispenses with bytes; the entire format is based around working with 32-bit DWORDs. So, the compressed data is DWORDs, and the output is also seen as DWORDs. The DWORDs are normally 32-bit aligned, with no fuss over trailing bytes.

Otherwise, FeLZ32 is sort of similar to LZ4: there is a Tag word, which encodes both the match (length/distance) as well as the number of raw/uncompressed DWORDs to copy to the output (before copying the matched symbols). The format of the Tag is fairly simple: [15:0]=distance; [22:16]=match length; [29:23]=raw length; [31:30]=alignment. There are special cases, namely for long runs of raw/unencoded symbols, and an end-of-data case (Creatively, encoded as 0x00000000).
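Unpacking the tag is just shifts and masks; a sketch in C using the layout above (the struct and function names are mine):

```c
#include <stdint.h>

/* Unpack a FeLZ32 tag word per the layout above:
 * [15:0]=distance, [22:16]=match length, [29:23]=raw length,
 * [31:30]=alignment. */
typedef struct {
    uint32_t dist, mlen, rawlen, align;
} FelzTag;

static FelzTag felz_unpack_tag(uint32_t tag)
{
    FelzTag t;
    t.dist   =  tag        & 0xFFFFu;  /* [15:0]  */
    t.mlen   = (tag >> 16) & 0x7Fu;    /* [22:16] */
    t.rawlen = (tag >> 23) & 0x7Fu;    /* [29:23] */
    t.align  = (tag >> 30) & 0x3u;     /* [31:30] */
    return t;
}
```

An all-zero tag naturally decodes to all-zero fields, which meshes with using 0x00000000 as the end-of-data case.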

( Sadly, some of the shifts are a little awkward, eg, for SuperH assembler. Normal SH only provides immediate shifts of 1/2/8/16 bits, so X>>7 requires shifting 1 bit left and 8 bits right, or the use of SHLD where available. )

FeLZ32 wiki entry on GitHub.

So, basically, it is like LZ4, but in-general doesn’t compress quite as well (depends on content, for some things it fares a little better). In turn, LZ4 doesn’t usually compress as well as Deflate, so same sort of thing. Each is generally faster, but at the expense of compression.

There are some types of content where it does better, but here are results for some random stuff I had lying around (not really ideal cases for FeLZ32 either, but alas):


As noted, my PC (an AMD FX-8350 with 32GB of DDR3 1600) achieves a “general” memcpy speed of around 3.4 GB/s ( I previously had been using a Phenom II which peaked at ~ 1.8GB/s ).

As noted, FeLZ32 was typically getting decode speeds of around 2 to 3 GB/sec, which seems pretty solid. Encoding doesn’t reliably beat LZ4 though, which is sad ( encoding mostly consists of searching for LZ matches, so its speed is pretty dependent on how much “effort” the encoder puts into the search, and the efficiency of the matching strategy. )

I guess a possible “research path” could be a format which can alternate between an FeLZ32-like mode and a more LZ4 like mode.


BTIC4B (History)

There was also BTIC4B, which is a speed-oriented image/video codec, following along with my prior “BTIC family” designs.

For the most part, members of the BTIC family were focused on speed, and go back originally to a format I had called BTIC1C, which was basically a hacked version of the Apple RPZA format with a Deflate-based backend glued on (which I had also tuned for speed).

In this format, blocks consisted of a pair of colors in RGB555 format, followed by 32 bits of pixel data representing a 4×4 block (at 2 bits/pixel). Anyone who has looked at DXT1 probably gets the idea (but RPZA happened roughly 5 years earlier). The MSB of each color endpoint was used to invoke other magic, namely skipping runs of blocks, or doing runs of a flat color.
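Decoding such a block is cheap. A rough sketch in C follows; note the 1/3 and 2/3 midpoint interpolation here is an assumption borrowed from DXT1-style codecs, so the actual RPZA/BTIC1C rules differ in details:

```c
#include <stdint.h>

/* Sketch of a color-cell block decode: two RGB555 endpoints plus 32
 * bits of 2bpp pixel data, each 2-bit index selecting among the two
 * endpoints and two interpolated midpoints. Illustrative only. */
static void decode_cell_4x4(uint16_t cA, uint16_t cB,
                            uint32_t pix, uint16_t out[16])
{
    uint16_t pal[4];
    int i, c;
    pal[0] = cA;
    pal[3] = cB;
    /* midpoints: interpolate each 5-bit channel at 1/3 and 2/3 */
    for (c = 1; c <= 2; c++) {
        int r = (((cA >> 10) & 31) * (3 - c) + ((cB >> 10) & 31) * c) / 3;
        int g = (((cA >>  5) & 31) * (3 - c) + ((cB >>  5) & 31) * c) / 3;
        int b = (( cA        & 31) * (3 - c) + ( cB        & 31) * c) / 3;
        pal[c] = (uint16_t)((r << 10) | (g << 5) | b);
    }
    for (i = 0; i < 16; i++)
        out[i] = pal[(pix >> (2 * i)) & 3];
}
```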

There were a few failed attempts to replace it: BTIC1D, which used a conceptually similar design but was YUV420, and had unacceptably poor decode speeds combined with lackluster compression. There was BTIC1F, which added the concept of using a YUV color-vector in place of discrete endpoints, but ran into awkward design issues.

The next semi-successful member was BTIC1G, which was basically a stripped down design intended mostly for some robots I was building. This had a different set of requirements, namely a fast/simple encoder which could work incrementally and stream the video as small spans of blocks (via UDP datagrams). This version was basically straight bit-twiddly.

With BTIC1G, I had also discovered a way to more quickly classify and encode block formats, namely measuring some properties of the pixel block and feeding them through some weighting and a weighted decision tree (in some contexts these things are passed off as “neural nets”). This allowed determining in real-time what the most likely ideal representation was for a block. ( Some parts of the design process may have also been assisted via genetic algorithms… )

Later, I started work on a design I called BTIC1H, which was originally intended for the same use-case as 1G, and basically replaced the bit-twiddly blocks with Adaptive Rice Coding (and a trick I had called SMTF, which can mimic the behavior of Huffman coding but without needing Huffman tables or the severe performance overheads associated with Adaptive Huffman and Arithmetic Coding). It also reintroduced some concepts from 1D and 1F, namely the use of (now optional) YUV420 blocks, and the use of delta-coded color vectors.

Elsewhere, I had begun work on a format called BTIC4A, which was focused initially on fixed-size blocks and packing pixels into blocks, in a manner conceptually similar to BC7; just with 8×8 pixel blocks in either 256-bit, 384-bit, or 512-bit forms.

The 4A effort was quickly mutated into BTIC4B, which combined 4A’s block formats with a bitstream format more or less copy/pasted from 1H. Most of the pixel-data is copied as-is into the bitstream. Decoding consists then of reconstructing the blocks from the bitstream, and handing them to a block-decoder.

As a side effect of the bigger 8×8 blocks though, there was only about 1/4 as much overhead from things like entropy coding, allowing it to go faster, and there was also an improvement in terms of compression due to the reduction of side-data for each block. Many blocks use subsampled chroma, ranging from 4:2:0 up to giving the whole 8×8 block a single chroma value (only luma varying).

Likewise, vectors were labeled as (Y, U, V, Dy, Du, Dv), with Y/U/V as a central color, and Dy/Du/Dv being the range relative to the center. Dy is always positive, but Du and Dv may be negative. Unused components may be skipped, and the vector deltas may be based on a difference from a vector predicted via a Paeth predictor (similar to PNG). The vector is then transformed (when decoded) into a pair of color endpoints (in YUV space).
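For reference, the Paeth predictor (as defined for PNG filtering) picks whichever of the three neighbors is closest to left+up−upleft:

```c
#include <stdlib.h>

/* The Paeth predictor, as in PNG: a=left, b=up, c=up-left. */
static int paeth(int a, int b, int c)
{
    int p  = a + b - c;          /* initial estimate */
    int pa = abs(p - a);
    int pb = abs(p - b);
    int pc = abs(p - c);
    if (pa <= pb && pa <= pc) return a;
    if (pb <= pc) return b;
    return c;
}
```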

In chroma-subsampled blocks, the U/V components are interpolated to form the U/V for a given pixel sub-block, which is then unpacked into possible pixels based on Y; Y then selects the pixels. In non-subsampled blocks, the endpoints are converted into colors and used directly (thus Y may also determine the chroma value). Though, some blocks may ignore Du/Dv, interpolating only along the Y axis.

Things are pretty variable, but will note that on my AMD-FX I am often seeing decode speeds in the area of 400-600 megapixels per second for a single threaded decoder, which is plenty fast to be usable. With multiple threads, gigapixel speeds are possible (however, per-thread speed drops sharply as the number of threads increases). I have a Dual Xeon E5410 box (a discarded rack server) which, while initially slower, scales a bit better (I have seen decode speeds of around 4 gigapixels/second on this machine).

But, at its core, the “transform” ultimately isn’t that much different from RPZA and CRAM; we have often 1 or 2 bits per pixel (though 3 or 4 are possible). There are variable block sizes (Y subsampling), and various levels of chroma subsampling.

The Y block has various sizes (none; 2×2, 4×2, 2×4, 4×4, 8×4, 4×8, 8×8) which map to the logical 8×8 output block, and the UV chroma blocks have various sizes (none; 2×2, 2×4, 4×4, or 4×8); these in turn come in certain combinations, with various combinations of pixel bit-depths (to fit within a size budget).

No real fancy math, just good ol’ bit twiddly.


AdRice and SMTF

AdRice was another nifty trick I ran across. In traditional Rice coding, the ‘k’ value is determined up-front, so two passes are typically used: one pass determines the k values, the other does the actual encoding.

In AdRice, a nifty trick is used, namely, if the Q value (length of the unary prefix) is 0, then k is made smaller; if 1 it remains the same; and if >1 it gets bigger. Starting k is then ‘something sane’, and the rest takes care of itself.

To avoid epic slowness, it is possible to use a lookup table, where both the coded symbol and new k value are held (both the ‘k’ and matched bit pattern are used as the table-index). The range of ‘k’ can also be constrained, and a length-limiting escape case can be used, such that the whole process can be driven by lookup tables (as in a typical Huffman coder).
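A minimal (non-table-driven) decoder for this adaptation rule might look like the following; the bit-reader and the exact clamp range for k are my assumptions:

```c
#include <stdint.h>
#include <stddef.h>

/* MSB-first bit reader (names mine). */
typedef struct { const uint8_t *buf; size_t pos; } BitReader;

static int get_bit(BitReader *br)
{
    int bit = (br->buf[br->pos >> 3] >> (7 - (br->pos & 7))) & 1;
    br->pos++;
    return bit;
}

/* Adaptive Rice decode: unary prefix Q (1s terminated by a 0), then k
 * suffix bits. Adaptation per the rule above: Q==0 shrinks k, Q==1
 * leaves it alone, Q>1 grows it (clamped to 0..16, my choice). */
static uint32_t adrice_decode(BitReader *br, int *k)
{
    uint32_t q = 0, v;
    int i;
    while (get_bit(br)) q++;
    v = q << *k;
    for (i = *k - 1; i >= 0; i--)
        v |= (uint32_t)get_bit(br) << i;
    if (q == 0)     { if (*k > 0)  (*k)--; }
    else if (q > 1) { if (*k < 16) (*k)++; }
    return v;
}
```

The table-driven version amounts to precomputing this function over all (k, bit-pattern) pairs.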

Unlike Huffman, AdRice has the advantage of allowing encoding/decoding without needing to drag around a Huffman table (potentially a big cost with small payloads; and cases where data may be encoded/decoded in pieces or in no particular order).

Though, Rice may have an obvious limitation: Normally it only works effectively at encoding values following a geometric distribution.

A possible few workarounds exist:

  • SMTF: If symbol is small, swap towards front, else rotate space and swap the symbol to the new front (the symbol that “falls off the end” goes into the “hole” left previously).
  • STF: Swap towards front (if N>0: swap N with N-1).
  • STF2: Swap towards front, but at a fixed ratio (eg: swap N with 7*N/8).

The exact swapping strategy varies, and which is ideal depends on the data. But, these basically tend to put symbols in order of descending probability (and thus make the Rice coding happy), and aren’t too terribly expensive (SMTF is usually implemented via masking off a rover+index into a buffer; rotating the space consists of decrementing the rover).

STF tends to result in a closest-to-optimal distribution, but suffers from the problem of adapting slowly; STF2 is faster adapting than STF but prone to a limitation that alternating patterns of symbols may get “stuck” swapping with each other (rather than advancing towards the front); …
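The STF and STF2 update steps are tiny; a sketch (the array holds the symbol ranking, and the function names are mine):

```c
/* STF: swap the just-coded symbol one step toward the front. */
static void stf_update(unsigned char *ranks, int n)
{
    if (n > 0) {
        unsigned char t = ranks[n];
        ranks[n] = ranks[n - 1];
        ranks[n - 1] = t;
    }
}

/* STF2: swap toward the front at a fixed ratio (here n -> 7n/8, as in
 * the example above), adapting faster at the cost of the "stuck pair"
 * issue mentioned. */
static void stf2_update(unsigned char *ranks, int n)
{
    int m = (7 * n) / 8;
    unsigned char t = ranks[n];
    ranks[n] = ranks[m];
    ranks[m] = t;
}
```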

The combination of AdRice and SMTF, while not quite as fast or as good as Static Huffman, generally works pretty decently as a substitute when an adaptive coding is needed or when payloads are small enough that the cost of passing around and initializing the Huffman tables outweighs the costs of AdRice and SMTF.


Retroactive: Doing It C Style.

Probably a bit of an unpopular stance, but I mostly prefer C, and even then I tend towards a subset of C. It isn’t that I am afraid of C++, like it is some paradoxical and incomprehensible thing; rather, what it has to offer doesn’t seem to offset the costs of using it, and many of the often-touted selling points of C++ come with pretty serious drawbacks (such as producing bloated executable files or taking absurdly longer to compile than “just doing it the C way”).

So, some advantages of C or a C-like approach to C++:

  • Faster compile times.
  • Typically smaller binaries.
  • (Typically) better performance (excluding doing something stupid).
  • (C++) Freedom to pick and choose which features are actually relevant to the task at hand (avoiding “using features because they are there” or “‘Modern C++’ style demands doing this thing in this way”).

And, some advantages of plain C:

  • Doesn’t depend on having a C++ compiler (more relevant to small targets than mainstream ones).
  • The language is more within reach for a “mere mortal” to write a parser or compiler for it.
  • The ABI is more standardized (less subject to the whim of the compiler or which version of the compiler one is using, and with far fewer edge cases).
  • The code is frequently more reusable, as it can be used in both C and C++ projects.

One common criticism against C is its lack of “OO”, but this seems to mistake a lot of superficial language details for the underlying concepts. A class in C++ isn’t that much functionally different from a struct in C, and most things that can be done with a C++ class can also be done, without all that much more effort or code, in C using structs. It is more a case of there being none of the syntactic sugar, and thus the code looks different / less pretty (“foo->Bar()” becoming one of “FooLib_FooClass_Bar(foo)”, “foo->vt->Bar(foo)”, or “(*foo)->Bar(foo)”). There is no operator overloading, so one uses function calls, and no templates, so one either uses macros or does the code inline (this being itself a tradeoff; this sort of thing is harder and more limited, but on the other hand, even in C++, overuse of templates is one of the major factors leading to bloat and slow compile times).
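For concreteness, the “foo->vt->Bar(foo)” pattern looks something like this (all names here are hypothetical):

```c
/* A struct of function pointers standing in for a C++ vtable. */
typedef struct FooClass FooClass;
typedef struct FooClass_vt {
    int (*Bar)(FooClass *self);
} FooClass_vt;

struct FooClass {
    FooClass_vt *vt;   /* "vtable" pointer, set up at init */
    int x;
};

/* A "virtual method" implementation. */
static int FooClass_Bar_impl(FooClass *self) { return self->x * 2; }

static FooClass_vt fooclass_vt = { FooClass_Bar_impl };

/* The "constructor". */
static void FooClass_init(FooClass *foo, int x)
{
    foo->vt = &fooclass_vt;
    foo->x = x;
}
```

A subclass would then provide its own vt struct with some entries swapped out, giving dynamic dispatch without compiler support.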

A difference though is in terms of problem solving, as C does a lot less hand-holding, and even when it does, things provided even by the standard library may not be the best way to do them: C string functions can get slow, but frequently this is a sign of “you aren’t doing it right”. Functions like “qsort()” are lame, but it is usually more a tradeoff between “simple but slow” (ex: half-assing a bubble sort) and “fast but more effort” (writing out a specialized sort function, or finding a non-sort solution); “std::sort” may well give a better sort than “qsort()”, but OTOH, expecting qsort to be the end-all solution to sorting in C is sort-of missing the point.

Actually, if one gets “serious” in C, they may well come to realize that even a fair portion of the C library is ancillary anyway (ie, the parts which aren’t about using host OS facilities, and aren’t builtin functions typically handled by the compiler itself).

Similarly, when working on small embedded targets, one may not have a lot of the graces that come with a hosted environment on a relatively fast and powerful CPU. On most higher-end 32-bit architectures, things are usually pretty sane (apart from the smaller amount of memory available).

However, on 8 and 16 bit targets, a lot of this goes out the window, and being limited to C is probably the lesser of the issues. For example, one might have to deal with issues such as:

  • Lack of a hardware multiplier, making multiply expensive.
  • Lack of the ability to divide by a non-constant value.
    • Because there is neither a hardware divider nor the resources to fake it in software.
  • Shifts may also be done in software.
    • Adding a value to itself is basically a left shift, right?…
  • You only have 16-bit integer types, but potentially 20 or 24 bit addresses.
    • And potentially only the ability to add positive 16b displacements to pointers.
    • Say, other ops require composing/decomposing pointers with macros.
  • Want FPU? Too bad, not gonna happen.
  • Here is to hoping the C dialect itself doesn’t have funky limitations.
    • Say, only 1D arrays are supported, or there is an inability to directly copy structs by value, …
    • Or, if it only accepts K&R declaration syntax.
      • Though, as of 2017, I feel these are pretty solidly dead now.
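On such targets, a multiply typically ends up as a shift-and-add routine along these lines (a generic sketch; note the “a += a” standing in for a left shift, per the point above):

```c
#include <stdint.h>

/* Software shift-and-add multiply, the sort of routine one ends up
 * with when there is no hardware multiplier: add 'a' into the result
 * for each set bit of 'b', doubling 'a' (== left shift) each step. */
static uint16_t mul16(uint16_t a, uint16_t b)
{
    uint16_t r = 0;
    while (b) {
        if (b & 1)
            r += a;
        a += a;      /* a <<= 1, via an add */
        b >>= 1;     /* shift-by-1 is usually available */
    }
    return r;
}
```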


2bpp Framebuffer and NTSC Composite

So, status (besides class):
Had recently started messing with Verilog (for if/when I can eventually get an FPGA; still limited mostly to simulation), and have thus far written a partial implementation of a CPU core implementing my BJX1 ISA in it (an extended 64-bit version of the SuperH ISA).

For part of it, was also working on a display subsystem (intending to work from a small framebuffer in SRAM and then produce composite output).

Was able to experimentally get passable image quality from around 2 bits/pixel via a color-cell scheme. This has pretty comparable memory use to CGA in 4-color mode; but with higher image quality.

One sub-mode uses a pair of color endpoints from a 64-color RGB colorspace. Image quality isn’t great, but is reasonably passable. Here, each R/G/B component has 4 brightness levels, with an ability to get intermediates, sort of, by dithering.

Another sub-mode uses YUVD colors encoded in a way similar to ADPCM. Basically, there is a YUV centroid with D=Ymax-Ymin (ie, the difference in Y between the brighter and darker value). These are stored as a delta from the prior vector, with 3 bits/component encoding the delta, which may dynamically adapt to fit the range of the component values. There is an escape case to allow colors which fall too far outside the current range to be encoded as an absolute YUVD centroid, or alternatively as a pair of 64-color RGB endpoints.
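Reconstructing the Y endpoints from a YUVD centroid is then trivial; a small sketch (the rounding choice here is my assumption, not necessarily what the actual mode does):

```c
/* A YUVD centroid: Y/U/V center plus D = Ymax - Ymin. */
typedef struct { int y, u, v, d; } Yuvd;

/* Recover the bright/dark Y endpoints around the center, keeping the
 * invariant ymax - ymin == d. */
static void yuvd_endpoints(Yuvd c, int *ymin, int *ymax)
{
    *ymin = c.y - c.d / 2;
    *ymax = *ymin + c.d;
}
```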

The reason for this is mostly to be able to fit the screen buffer into SRAM in an FPGA, and not “eat it” with an overly large framebuffer. As-is, at 2 bits/pixel, 320×200 requires 16kB, and 640×400 would require 64kB.

This is mostly because most of the rest of the SRAM (around 200kB in the FPGAs I was looking at) is needed for other purposes; like a CPU cache, … Alternatives would require either a mechanism to stream in a framebuffer from external DRAM, or require generating interrupts to have the CPU fill in scanlines.


Image of a Dog (Original, apart from scaling to 640×480):



Image of a dog, but encoded at 2 bits/pixel:


It doesn’t look great, but you can’t expect too much from 2-bits / pixel.

If I added a full 64kB (to support 640×400) I could possibly have a higher quality sub-mode for 320×200 display.

Another possibility is to have 32kB of VRAM, which would allow a 2bpp 640×200 mode, or a higher quality 4bpp 320×200 mode (probably with either a pair of RGB555 endpoints or a YUVDyuv scheme).

Less certain at the moment is whether this will support a text submode; the specifics are unclear (dedicated text cells, or drawing the text via color-cell graphics, with each glyph as a 2×2 grid of cells).


This would feed into the next part:
I am considering doing the video output mostly as composite (ex: NTSC over an RCA cable), and so was messing around with trying to modulate and demodulate NTSC in software (and maybe later do a version of the modulator in Verilog; ADD/EDIT: have done a modulator in Verilog).


Dog Image (of experimental NTSC output, done via 100MHz PWM):


In this test, the NTSC is done via PWM (or DPWM); in this case turning an output pin (simulated in this case) on and off on a 100MHz clock in a pattern to approximate an analog signal. This produces output at a rate of ~100Mbps, and thus a frame of video requires around 417kB (ironically, smaller than most of the output images saved as PNG).
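The PWM bit-pattern generation can be thought of as a first-order delta-sigma loop: accumulate the desired analog level each 100MHz tick, and emit a 1 whenever the accumulator overflows, so the density of 1s approximates the level. A minimal sketch (my formulation, not the actual code):

```c
/* One PWM/delta-sigma step: 'level' in 0..max is the desired analog
 * value; the carried error 'err' decides when to emit a 1, so over
 * time the fraction of 1s approaches level/max. */
static int pwm_step(int level, int max, int *err)
{
    *err += level;
    if (*err >= max) {
        *err -= max;
        return 1;
    }
    return 0;
}
```

The RC network on the output pin then acts as the low-pass filter that turns the 1s-density back into a voltage.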

For actual output, this would be fed into a voltage divider and capacitor to turn this into a “true” analog signal (would need to reduce from 3.3V to 1.0V and also smooth out a lot of the high-frequency noise). This is pretty similar to how audio output is done on most modern computers, as DACs are more expensive and people don’t really hear the difference; though using a DAC would only need a signal at around 14MHz rather than 100MHz (and higher video quality would similarly require a higher PWM frequency). This frequency is nowhere near high enough to be able to directly produce VHF modulated output though (would need to PWM the pin at around 240-300MHz for this).

Color makes things a little more complicated, and a lot of the effort in implementing the NTSC demodulator was mostly going into trying to make the QAM demodulation work. Color involves basically magic of making ~3 analog signals coexist in the same place at the same time, with the components mostly not interfering with each other.

The Y component is transmitted via the signal’s DC bias, and the UV/IQ components via phase angles (I at 0 degrees and Q at 90 degrees); with a non-modulated 3.579545MHz “colorburst” signal being sent every scanline to help align the decoder with the encoder’s signal. These are all handled together as a single combined PWM signal. Here, YUV can be fudged into being YIQ (by slightly rotating the signal), so it basically works.

In the case of the image, the decoder was given a copy of the PWM’ed output bitstream, and it recreates an output image from this. In an actual hardware implementation, this role would be served by plugging the cable into a TV or some other similar display with support for a composite input. Though, I would need some sort of decoder to test a Verilog simulation of the display subsystem.

Though, I don’t know yet if it will work with an actual TV, as information on NTSC specifics is a bit fragmentary and it was annoyingly difficult to piece together via web-searches. So, it is possible I missed something important and the generated signal isn’t actually compatible with real TVs, who knows?… ( Could maybe consider writing a basic spec on this. )

Why not VGA? Fewer wires, mostly. Though, on the modulation side, VGA (and component) would be both simpler and probably produce higher output quality than composite (basically by not needing to modulate multiple signals together). VGA would require wiring up a VGA connector though (and a lot more filters); component would have the drawback of being less commonly supported by low-cost displays.


Other Thoughts:

  • Noticed after posting this, that I could improve image quality by 8-bit aligning the PWM bits for the NTSC test, likely because the decoder samples 8 or 16 bits at a time (shouldn’t matter for a real TV).
  • Though more complex, a hybrid DAC design (say, with 2 or 3 high-order bits and the low-order bits as PWM) could likely be able to give better video quality. Though, this would require more complex electronics.
  • Note that the colorburst clock doesn’t align sufficiently closely with an integer fraction of 100MHz. As a workaround, it is possible to accumulate 150137 once per clock cycle and use bits 21:17 of the accumulator as a clock phase (150137 ≈ 3.579545/100×2^22, which approximates the ratio of the frequencies relative to a power of 2; the choice of 2^22 in this case being based mostly on “number magic”).
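In C terms, the accumulator trick from the last point looks like this:

```c
#include <stdint.h>

/* Phase accumulator for the 3.579545MHz colorburst on a 100MHz clock:
 * 150137 ~= (3.579545/100) * 2^22, so adding it each cycle makes the
 * 22-bit accumulator wrap at ~3.579545MHz; bits 21:17 then give a
 * 32-step carrier phase. */
static uint32_t burst_step(uint32_t *acc)
{
    *acc = (*acc + 150137u) & 0x3FFFFFu;  /* accumulate mod 2^22 */
    return (*acc >> 17) & 0x1Fu;          /* bits 21:17 as phase */
}
```

The same structure maps directly onto a Verilog always-block with a 22-bit register.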

But, yeah, work in progress.


ADD: Experimental 4-bit PCM

Dog Image, NTSC 100MHz 4-bit PCM w/ dither and rotating colorburst (45 degrees every half-frame), and an average over several frames.


By itself, the 4bpp PCM does give considerably better results than the PWM (much less noise), though even so there was still a slight ripple pattern apparently due to the colorburst signal (the slightly noticeable diagonal lines in the original NTSC image, but more obvious here than with the PWM, due to the lower overall noise).

The signal seems to align so that nominally the carrier is at the same phase in the same places in the frame. A workaround seems to be to average the past two frames and rotate the signal by 45 degrees per half-frame, so that for two consecutive versions of a scanline the carrier is out of phase by 90 degrees, and the ripples mostly cancel out. In this case, the image is encoded via multiple consecutive frames.

Though, it also appears that rotating the carrier somehow mostly defeats the effects of CGA style artifact colors (instead, these tests mostly lose saturation).

I am not entirely sure how actual TVs would respond to a rotating colorburst, but at least theoretically the TV will be able to realign with the colorburst each frame.


Forgot about this…

So, what happened? I wrote a few posts, then for a while forgot I had set up a blog.

So what has been going on?

Class, basically doing Machining stuff, and learning CNC stuff, because, that is what sort of jobs “actually exist” around where I live.


Spent a good number of months working on an emulator and a C compiler targeting a variation of the SuperH ISA. Started fiddling with doing an implementation of the ISA in Verilog. Not like it is “actually useful” or anything. Working on stuff like this, one becomes aware both of the limitations of the ISA and, doing stuff in Verilog, some of the challenges which exist for trying to implement an ISA in hardware, sorta (this exists in a gray area between that of code and hardware, writing stuff in a language which can nominally run on an FPGA, or theoretically be turned into a real chip if enough money is thrown at the problem).


So, what is Verilog:

It is sorta like C, if you were doing everything using branch-free logic, with a few control-flow constructs thrown in the mix; and a convenient bit-vector notation (counterbalanced by an ugly syntax for declaring numeric values; begin/end keywords; …). It is also in some ways behaviorally sort-of like ladder logic, though at least thankfully, much less terrible (the way combinatorial blocks behave is reminiscent of a scan in a ladder program; though luckily we are spared from everything being presented via the abstraction of relays switching things on and off between a pair of power rails).

One thing that becomes more obvious here is the distinction between division and most other arithmetic operators: Most operators can be implemented fairly easily; but division is one of those operators that generally requires an iterative implementation in both the integer and FPU case.

Integer division is possible in one of several ways, but I am not going to go into them in much depth.
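For reference, one classic hardware-friendly option is a bit-serial restoring divider, producing one quotient bit per step (sketched here in C; this is just the textbook algorithm, not necessarily what any particular core uses):

```c
#include <stdint.h>

/* Restoring division: shift one dividend bit at a time into a running
 * remainder; whenever the remainder reaches the divisor, subtract and
 * set the corresponding quotient bit. In hardware this maps to one
 * compare/subtract per cycle (32 cycles for a 32-bit divide). */
static uint32_t div_restoring(uint32_t n, uint32_t d, uint32_t *rem)
{
    uint32_t q = 0, r = 0;
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1);   /* shift in next dividend bit */
        if (r >= d) {                    /* "restore" step avoided by the test */
            r -= d;
            q |= 1u << i;
        }
    }
    if (rem) *rem = r;
    return q;
}
```

This is roughly why integer division tends to be iterative and multi-cycle, where add/shift/logic ops are single-cycle.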

FPU division is at least a little easier, in that one can hack together a passable approximation relatively cheaply (mostly using subtraction and shifts or similar). But, then a problem comes up if one wants “accurate” division. In a software FPU, one can be like, “well, I will just use a fixed-point integer divide and some bit-twiddling”. Here, not so much (if one has any concern for gate budget).

There is always Newton-Raphson, but this also costs gates (for a probably one-off special purpose FMA unit), and one would still basically need to follow an initial approximation with some number of “refinement” operations.
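As a rough illustration of the Newton-Raphson approach (in C rather than Verilog; the exponent-flip initial estimate and the iteration count here are illustrative, not a description of actual hardware, which would more likely seed the iteration from a small lookup table):

```c
#include <stdint.h>

/* Newton-Raphson reciprocal sketch: compute a/b as a*(1/b).
 * Assumes b is positive and normal (no denormal/NaN/inf handling). */
static float nr_divide(float a, float b)
{
    /* Initial estimate: if b = m*2^E with m in [1,2), guess x0 = 2^(-E-1),
     * so that b*x0 lands in [0.5,1) and the iteration converges. */
    union { float f; uint32_t i; } u = { b };
    uint32_t e = (u.i >> 23) & 0xFF;          /* biased exponent */
    u.i = (uint32_t)(253 - e) << 23;          /* negate/adjust the exponent */
    float x = u.f;

    /* Refinement: x' = x*(2 - b*x). The relative error squares each
     * step, so five iterations saturate single precision even from
     * this very crude starting point. */
    for (int i = 0; i < 5; i++)
        x = x * (2.0f - b * x);

    return a * x;
}
```

Each refinement step is effectively an FMA-shaped operation, which is where the gate cost mentioned above comes from.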

The other alternatives are having a floating-point divide operator which has lackluster precision, or is comparatively slow (if emulated in firmware).


Interestingly, one feature I got backlash over, namely the addition of x86-like memory addressing, was comparably fairly cheap and simple to implement, and has the advantage of somewhat reducing the length of instruction sequences when working with structures, arrays, or when doing pointer-arithmetic.

Basically, in SuperH, memory addressing ops look like this:

MOV.B  //Load or Store a Byte
MOV.W  //Load or Store a Word (16-bit)
MOV.L  //Load or Store an Integer/Long (32-bit)
FMOV.S //Load/Store a Float (or half of a Double)

With several types of addressing available (basic/common forms):
MOV.L @Rm, Rn  //Load from Rm into Rn
MOV.L @Rm+, Rn //Load from Rm into Rn, with a post-increment
MOV.L @(Rm, disp4), Rn //Load from Rm with 4 bit displacement
MOV.L @(R0, Rm), Rn //Load from Rm, indexed by R0
MOV.L Rm, @-Rn //Store into Rn with pre-decrement

A useful special case:
MOV.L @(PC,disp8), Rn  //Load a constant

And, some not so useful special cases:
MOV.L @(GBR, disp8), R0 //What do we use this one for, exactly?

Looks all well and good, until you want to load from an array. Spot that @(R0, Rm) addressing form? Yeah, you’ll be seeing that one a lot (pretty much any time you access an array or a structure field which falls outside of a very limited displacement range). That 4-bit displacement form is frequently not even sufficient to access local variables on the stack. Similarly, because lots of operations need to go through R0, it is extremely prone to being stomped.

Similarly, MOV.B and MOV.W implement fewer forms, and fewer still for floats or doubles (the SH FPU is probably worthy of its own topic).

This means typically a lot more constant loads, explicit shifts, arithmetic ops, … as part of doing a memory load or store operation.

There are workarounds though:

SH-2A MOV I-Forms (added later, with the SH-2A ISA):
MOV.L @(Rm, disp12), Rn  //Oh, amazing, a whole 12 bits for the displacement

BJX1 MOV I-Forms
MOV.L @(Rm, Ro, disp8s), Rn  //Address is computed as Rm+(Ro+disp)*4

And this gold nugget:
LEA.L @(Rm, Ro, disp4), Rn //Calculate address and store into Rn

MOV.Q  //Load/Store Quadword

These are 32-bit instructions, and are not entirely without drawbacks, but can generally “pay for themselves” in terms of simultaneously reducing code size and (potentially) helping with performance. This is mostly because, if you can replace 3 or 4 16-bit instructions with a single 32-bit instruction form, this is typically a savings.

What about all these instruction-forms for MOV.L and friends; won’t this lead to absurdly complex load/store and address-calculation machinery? Not really.

In Verilog, one can sort of just implement the general case and special-case things, leading to a side effect: all of the memory addressing is done internally via @(Rm, Ro, disp) calculations. But how, given that some operations lack a scaled index or a displacement, or have an unscaled displacement?

Partial solution: What can happen in the decoder:

MOV.L @Rm, Rn         => MOV.L @(Rm, ZZR, 0), Rn
MOV.L @Rm+, Rn        => MOV.L @(Rm, ZZR, 0), Rn; LEA.L @(Rm, ZZR, 1), Rm
MOV.L @(Rm, disp), Rn => MOV.L @(Rm, ZZR, disp), Rn
MOV.L @(R0, Rm), Rn   => MOV.L @(Rm, R0, 0), Rn
MOV.L @(GBR, Rm), Rn  => MOV.L @(GBR, Rm, 0), Rn

With a few special notes:

  • ZZR is a special Null Register, and is treated here as if it were always 0.
  • ZZR may overlap with R15/SP, which may not be used as a scaled index.
  • R0 is not scaled if used as an index.
  • Some specifics here are left to the discretion of the implementation.
  • Operations inside the core use wider register numbering than in the ISA.

Basically seems workable.
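The canonicalization above can be sketched in C terms (register numbering, the ZZR index, and the per-form scale handling here are illustrative assumptions, not the actual decoder):

```c
#include <stdint.h>

/* Sketch of the decoder canonicalization described above: every memory
 * op is reduced to one @(base, index, disp) form, with an internal null
 * register (ZZR) that always reads as zero. */

#define REG_ZZR 63  /* hypothetical internal null-register index */

typedef struct {
    uint8_t base;   /* base register */
    uint8_t index;  /* index register, or REG_ZZR */
    int32_t disp;   /* displacement, in index units */
    uint8_t scale;  /* applied to (index + disp); 1 for unscaled forms */
} MemAddr;

/* MOV.L @Rm, Rn          => @(Rm, ZZR, 0)    */
static MemAddr dec_simple(uint8_t rm)
{ return (MemAddr){ rm, REG_ZZR, 0, 4 }; }

/* MOV.L @(Rm, disp), Rn  => @(Rm, ZZR, disp) */
static MemAddr dec_disp(uint8_t rm, int32_t disp)
{ return (MemAddr){ rm, REG_ZZR, disp, 4 }; }

/* MOV.L @(R0, Rm), Rn    => @(Rm, R0, 0); per the notes above,
 * R0 is not scaled when used as an index. */
static MemAddr dec_r0_index(uint8_t rm)
{ return (MemAddr){ rm, 0, 0, 1 }; }

/* The address unit then needs only one computation path. */
static uint32_t calc_addr(const uint32_t *regs, MemAddr a)
{
    uint32_t idx = (a.index == REG_ZZR) ? 0 : regs[a.index];
    return regs[a.base] + (idx + (uint32_t)a.disp) * a.scale;
}
```

In hardware the equivalent is a mux selecting zero for ZZR, rather than an actual register read.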

Well, there was also 3-address integer arithmetic and similar, which is supported by the ISA, but which hasn’t had as significant an impact (generally, cases where two-address forms are applicable are more common, as very often arithmetic is either modifying a value, or the lifetime of one of the input values ends just as the new value is being created).

Similarly, there doesn’t seem to be much reason to deviate from load/store, and if anything there is the added drawback of potentially needing to deal with read/modify/write to memory (which is avoided if operations take place in registers).

Why make post-increment / pre-decrement operations take 2 cycles? Well, the main reason has to do with an inability to (efficiently) write to multiple registers in a single cycle, which would be needed if both were done in a single operation. However, one can fairly safely read from 3 or 4 registers at a time.

The registers are internally split into multiple banks to avoid exceeding these limits.

  • A bank for the low halves of registers.
  • A bank for the high halves of registers.
  • More banks for system and control registers.

The handling of 32-bit registers is special in that, in BJX1, it is possible for some operations to address the high halves of 64-bit GPR registers. This is handled internally via some multiplexing magic. But, splitting them up allows the halves to be accessed in parallel without posing as much of an issue for a 3 read + 1 write limit.

For related reasons, the FPU register space is internally permuted, and structured similarly to the GPRs:

  • F0L-F16L: Low half of Double Registers
  • F0H-F16H: High half of Double Registers
  • F0L=FR1, F1L=XF1, F2L=FR3, F3L=XF3, …
  • F0H=FR0, F1H=XF0, F2H=FR2, F3H=XF2, …

So, in this scheme, Double is the natural value size, though this should be mostly invisible to code.


But, still to be seen if I can make a working core out of all this…


Trade-offs in Entropy Coding

Well, I suspect no one takes me all that seriously about compression, partly because I tend to use slightly unusual designs, and tend to focus more on speed than on ratio.

There are a variety of entropy coders, with different trade-offs:

  • Static Huffman
    • Basically the “gold standard” of entropy coders which use a normal bitstream.
    • Pros:
      • Has generally pretty fast encode/decode speeds.
      • Theoretically optimal for a bitstream.
    • Cons:
      • Requires transmitting / initializing / … a Huffman table (non-zero cost).
      • For chaotic streams (such as small messages over UDP), ensuring that the decoder has a matching table can become problematic.
      • More so the space cost of a Huffman table may be a problem for small messages, as can the constant overhead of initializing the tables.
  • Adaptive Huffman
    • Usually unfavorable as it tends to be very slow vs most other options.
    • Need to adjust trees per-symbol.
    • Does not favor the use of a lookup-table driven decoder.
  • Rice Coding
    • Useful, but largely forgotten.
    • Pros
      • Very little constant overhead.
      • Is easy to make adaptive, allowing a simple single-pass encoder.
      • Can achieve similar speeds to static Huffman.
    • Cons
      • Natively only matches certain distributions.
      • Absent length-limiting hacks, has the potential for excessively long symbols.
    • Tricky Hacks
      • Length limiting to avoid extremely long symbols.
      • Reorganizing symbols can allow for more Huffman-like behavior.
        • Dynamic symbol swapping can be fast, effective, and is adaptive.
        • Will generally call this category “Rice+STF”.
  • Bitwise Range Coding
    • Compresses good, but not particularly fast.
    • Pros:
      • Adaptive
      • Can beat Huffman variants regarding compression.
      • Still faster than Adaptive Huffman.
      • Can be similarly flexible as a raw bitstream.
    • Cons
      • Slower than Huffman and Rice
      • Slower than a raw bitstream (if used as a bitstream).
    • Tricky Hacks
      • Can be stuck on the back-end of a Huffman or Rice coder for improving compression (slightly) with less speed cost than using it directly.
      • Sadly, the compression gains of using it over a raw bitstream are frequently not high enough to make its speed costs worthwhile.
  • Symbol Oriented Range Coding
    • Compresses good, also not exactly fast.
    • Cons (vs Bitwise range coding)
      • Doesn’t really compress any better
      • Much more complex and often slower
      • Much less flexible about how it is used.
  • ANS and friends
    • A sort of magic unicorn I don’t really understand that well.
    • Pros
      • Speed can be more competitive with that of Static Huffman.
      • Compression is more like that of a Range Coder.
    • Cons
      • I personally can’t really make much sense of how it works.
      • Requires encoding symbols in reverse order.
      • Seems to have some restrictions similar to symbol-oriented range coders.
      • Its high speeds only seem to materialize on certain hardware in my past tests.
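As a concrete point of reference for the Rice-coding entry above, here is a minimal fixed-parameter Rice encoder/decoder sketch in C (the adaptive-k and STF parts are omitted, and the bit I/O is a toy MSB-first stream made up for illustration):

```c
#include <stdint.h>

/* Toy MSB-first bitstream over a caller-provided (zeroed) buffer. */
typedef struct { uint8_t *buf; int pos; } BitStream;

static void put_bit(BitStream *s, int b)
{
    if (b) s->buf[s->pos >> 3] |= 0x80 >> (s->pos & 7);
    s->pos++;
}

static int get_bit(BitStream *s)
{
    int b = (s->buf[s->pos >> 3] >> (7 - (s->pos & 7))) & 1;
    s->pos++;
    return b;
}

/* Rice code with parameter k: emit (v>>k) one-bits, a terminating
 * zero bit, then the low k bits of v verbatim. */
static void rice_encode(BitStream *s, uint32_t v, int k)
{
    uint32_t q = v >> k;
    while (q--) put_bit(s, 1);
    put_bit(s, 0);
    for (int i = k - 1; i >= 0; i--)
        put_bit(s, (v >> i) & 1);
}

static uint32_t rice_decode(BitStream *s, int k)
{
    uint32_t q = 0, v;
    while (get_bit(s)) q++;      /* count the unary prefix */
    v = q << k;
    for (int i = k - 1; i >= 0; i--)
        v |= (uint32_t)get_bit(s) << i;
    return v;
}
```

Note there is no table to transmit or initialize; the unary prefix is also where the “excessively long symbols” problem comes from absent length limiting.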

Relative Tradeoffs:

  • Between Huffman and Range Coding
    • Have only rarely seen ratio differences much over about 10%
    • Cases like high symbol skew (where range coding fares well) can usually be addressed more cheaply via other means.
    • A slowdown of typically 2-4x isn’t usually worth a 5-10% gain in compression.
  • Between Huffman and Rice+STF
    • Huffman generally compresses better and is faster than Rice+STF past a certain message size.
    • Rice+STF can allow for a simpler encoder/decoder though, and is faster for small payloads.
      • It actually originated as a cheaper alternative to Deflate-style Huffman-coded Huffman tables.
      • In some cases, the space savings of not transmitting a Huffman table with the data may offset its less-than-optimal coding, putting it ahead.
  • Huffman or Rice + Bitwise Range Coder Backend
    • Can regain some of the compression advantages of Range coding.
    • Has less steep of an impact on codec speed
      • Given the range coder deals only with the already-compressed data.
    • A drawback is that one now has the code complexity of a Huffman or Rice coder in addition to the Range coder, plus potentially other unique funkiness (such as interleaving multiple bitstreams to allow for multiple RC contexts, etc).
  • Not using any entropy coding
    • Typically the fastest option.
    • Typically the worst compression
      • Because, you know, raw bytes.
    • May still be a good option where speed is a lot more important than compression ratio.
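For the bitwise range coder referenced above, a minimal sketch along the lines of the LZMA-style binary coder (single adaptive 11-bit context; buffer handling is simplified and all names are made up for illustration):

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint64_t low;        /* low end of the current interval (with carry room) */
    uint32_t range;      /* width of the current interval */
    uint8_t  cache;      /* pending output byte (for carry propagation) */
    int      cache_size;
    uint8_t *out;
    size_t   pos;
} RcEnc;

static void rc_enc_init(RcEnc *rc, uint8_t *out)
{
    rc->low = 0; rc->range = 0xFFFFFFFFu;
    rc->cache = 0; rc->cache_size = 1;
    rc->out = out; rc->pos = 0;
}

/* Emit the top byte of low, propagating any carry into pending bytes. */
static void rc_shift_low(RcEnc *rc)
{
    if ((uint32_t)rc->low < 0xFF000000u || (rc->low >> 32) != 0) {
        uint8_t carry = (uint8_t)(rc->low >> 32);
        do {
            rc->out[rc->pos++] = (uint8_t)(rc->cache + carry);
            rc->cache = 0xFF;
        } while (--rc->cache_size);
        rc->cache = (uint8_t)(rc->low >> 24);
    }
    rc->cache_size++;
    rc->low = (uint32_t)rc->low << 8;
}

/* prob: 11-bit adaptive probability of the bit being 0 (init 1024). */
static void rc_encode_bit(RcEnc *rc, uint16_t *prob, int bit)
{
    uint32_t bound = (rc->range >> 11) * *prob;
    if (!bit) { rc->range = bound; *prob += (2048 - *prob) >> 5; }
    else { rc->low += bound; rc->range -= bound; *prob -= *prob >> 5; }
    while (rc->range < (1u << 24)) { rc->range <<= 8; rc_shift_low(rc); }
}

static size_t rc_enc_flush(RcEnc *rc)
{
    for (int i = 0; i < 5; i++) rc_shift_low(rc);
    return rc->pos;
}

typedef struct { uint32_t range, code; const uint8_t *in; size_t pos; } RcDec;

static void rc_dec_init(RcDec *rc, const uint8_t *in)
{
    rc->range = 0xFFFFFFFFu; rc->code = 0; rc->in = in; rc->pos = 0;
    for (int i = 0; i < 5; i++)  /* first byte is a dummy */
        rc->code = (rc->code << 8) | rc->in[rc->pos++];
}

static int rc_decode_bit(RcDec *rc, uint16_t *prob)
{
    uint32_t bound = (rc->range >> 11) * *prob;
    int bit;
    if (rc->code < bound) { rc->range = bound; *prob += (2048 - *prob) >> 5; bit = 0; }
    else { rc->code -= bound; rc->range -= bound; *prob -= *prob >> 5; bit = 1; }
    while (rc->range < (1u << 24)) {
        rc->range <<= 8;
        rc->code = (rc->code << 8) | rc->in[rc->pos++];
    }
    return bit;
}
```

The per-bit multiply, compare, and renormalization loop here is the concrete source of the speed gap vs Huffman/Rice noted above.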


It can be noted that most of my older designs had used static Huffman, but many of my newer ones have used Rice+STF variants. Generally, the relative speed, simplicity, and flexibility have been enough to offset the arguably “bad” compression.

If the total ratio difference is maybe 20% between a “bad” Rice-coded backend and a much slower range-coded backend, but the result is a 400% speed difference, is it worthwhile? Probably depends on whether size or speed is more relevant.

Both may save significantly over the raw uncompressed data, which is usually where most of the gains lie. It often seems more sensible to invest a lot of the time/effort budget in areas which have a more direct effect on overall compression (such as any filtering/quantization/transform steps in image/video compression, or the effectiveness of the LZ matching and similar for LZ77 compressors).

For example, in an image codec, one may find it may well matter a lot more what type of DC predictor is used, or how deltas or coefficients are represented, than the specific choice of entropy coder back-end or similar.

Though, granted, I remain sort of an outcast here.


Brief History of BGBScript

Earlier Efforts

Well, once I was young, and early on spent a lot of time messing around with Quake. Quake had a scripting language, known as QuakeC, which was fairly limited but made it a lot easier to tweak things in-game, at least compared with the use of C for scripting in Quake II.

Eventually, I was working on more of my own projects, and came across a language known as Lisp. I had tried using some of the existing scripting VMs I could find, which IIRC was a standalone variant of Emacs Lisp, and the SCM / Guile implementations of Scheme.

After running into frustrations with these VMs, and noting the general simplicity of the language, I decided to do my own VM. It wasn’t very good, having relied on a sort of modified S-Expressions for interpretation; later I wrote a backend which spit out C. This project, however, turned into a mess, and Scheme’s syntax and language design weren’t particularly usable, so I later ended up abandoning it.

At first, I thought maybe I was done messing with scripting VMs, but started to face the issue that writing code purely in C is limiting: even if only a small amount of code ever ends up as scripts, the parts that do make a difference. Around this time, I had also become basically familiar with HTML and JavaScript, and had recently done an implementation of XML-RPC. So, I did a nasty hack: I built an interpreter for a JavaScript knock-off on top of XML-RPC, which parsed into an XML tree and then directly interpreted the XML expressions (this was around mid 2004). This first VM suffered from horrible performance, and basically spewed garbage at an incredible rate, so before too long I wrote a replacement.

The next iteration of the language was partially redesigned (in 2006), and basically copy-pasted a lot of parts from my Scheme interpreter, but made use of a bytecode interpreter. This VM was a lot faster, but suffered some of the problems of my original Scheme interpreter, for example the relative pain of trying to deal with a precise garbage collector from C land.

The next VM (2009) was basically a lot of the machinery from the 2006 VM, re-merged with parts of the 2004 VM, namely using a more C friendly memory manager. In 2010 and 2011 I started adding a lot of features I had sort of copied from ActionScript 3.0, as well as some features from Java and C#.

The bytecode used was originally a dynamically-typed stack-oriented bytecode, but opcode type annotations were added after the fact via prefixes. The VM then used an awkward mix of static and dynamic typing.

Had also been running side efforts trying to do natively compiled and 3AC-compiled code, but my early efforts here were very hit or miss (my first real code generator collapsed into being unmaintainable, …). A lot of optimizations from my more efficient 3AC interpreters, and a few other VM designs, ended up being back-ported to the stack-machine VM, so there ended up being no huge pressure to replace it with something different.


First BGBTech Engine

Around this time, I also wrote the first BGBTech Engine by pulling together parts from my various previous projects, with BGBScript as the primary scripting VM in BGBTech. The engine was developed out of a 3D modeller and mapper I had previously written for the Quake 2 engine, mostly as by that time QuArK had decayed into an almost unusable mess, and couldn’t really be easily extended to deal with heavily modified games or engines.

Its performance was overall pretty bad; early on I felt it was doing well when framerates were in the double digits, and I later ended up switching mostly to a Minecraft-style terrain system after noting that my mapping skills were pretty bad.

Most of the horrible performance was due to me early on not really understanding OpenGL that well, and doing lots of things in ways which were pretty but not all that fast, then noting that they scaled poorly to larger scenes. A lot of the performance gains came mostly from figuring out how to draw things more efficiently, and from cutting things back to limit their performance costs.

By around 2014, progress on this engine was pretty much stalled.


Formation of BGBScript 2

Likewise, I had gone off to some side projects, mostly working on trying to build robots, so I put a lot of this on hold.

At this point I ran into a problem that my older BGBScript VM was too heavyweight for use in my robot projects, and wasn’t really well suited to low-latency use-cases. Had started on a redesign of the language, which I called BGBScript-RT (for BGBScript Real-Time). It was made much more like C in many areas, switching from garbage collection to the use of manual memory management.

In the meantime, needing something more immediate, I had used a variation of C compiled to a 3AC Bytecode. The bytecode VM was originally designed as part of a new faster experimental backend for the BGBScript VM, but it was easier to get C working on it. While the VM was pretty fast, it had a number of more subtle design issues.

Partly as a result, the newer BGBScript RT/2 VM had made use of a statically typed stack-machine, more like that in the JVM. This initial form of the VM stalled out with the general stall of my robot projects.


BGBTech 2 Engine

Around late 2015, with my robot-building efforts generally stalled, I ended up starting on a reboot of my 3D engine. My initial thinking was that I could take what I had learned from my prior engine, and probably reboot it into a much faster and better-engineered engine.

Since it quickly became apparent I needed scripting, I ended up reviving the BGBScript 2 VM, and worked a lot on beating the compiler/VM into a usable form. This was along with working a lot on stuff relevant to “engine architecture”.

This Script VM has ended up a fair bit bigger and more complicated than my original vision, but a lot of this was for things relevant to a game engine, and to performance: A simpler VM with a smaller/simpler ISA either comes at the cost of time and complexity in the back-end, or of worse performance. Generally, getting better performance comes to some extent at the cost of VM complexity.

As of the beginning of 2017, while it hasn’t really caught up to the old engine in many areas, and it is debatable whether the “better engineering” part is actually paying off, I at least seem to be making progress (and my new engine is much less prone to running out of memory or having awkward GC stalls).


Bytecode Tradeoffs

Over time, it can be noted that different ways of expressing operations have different tradeoffs.


  • Dynamically Typed Stack Machine
    • Simplest to design/implement
      • Front-end mostly ignores types.
    • In the naive case, is also the slowest
      • Determining types and operations at runtime does not come for free.
    • Getting decent performance requires a very complicated back-end.
      • Back-end needs to deal with type inference and JIT
    • FWIW: A few VMs (such as Parrot) have a Dynamically-Typed Register IR
  • Implicit-Type Statically-Typed Stack Machine
    • Used, for example, in the .NET CLR.
    • Most elegant design on paper
    • Reliance on propagating types via the stack is problematic for interpreters
      • Sensible if the VM is JIT only
  • Explicitly-Typed Stack Machine
    • An example of this is the JVM
    • Requires use of more opcodes and/or prefixes to express types.
      • Less elegant design due to lots of repeated operations per-type.
      • May seem less orthogonal as operations may only exist for certain types.
    • Is a decent design for an interpreter.
      • Explicit type information means back-end doesn’t need to figure this out.
      • May also be easier for a simplistic JIT.
  • Statically-Typed Register-IR (Non-SSA)
    • An example of this would include the Dalvik VM
    • Can allow for a faster interpreter than what is typical from a stack machine.
    • A drawback is that things can also get more complicated.
      • Compiler also needs to deal with the management of temporary variables.
      • No convenient way to implicitly pass arguments to functions.
  • Statically-Typed Register-IR (SSA)
    • An example of this is LLVM BitCode
    • Better suited to compiler analysis than interpretation
      • The additional temporary registers and phi operations are counter-productive for an interpreter.
      • Though, sub-scripting rather than duplicating registers is an option.
        • In the subscripted case, identity is shared, and phi is made implicit.
    • Is good for knowing register lifetime and may aid in register allocation.
      • Can know more easily when the lifespan of a register has ended.


In its current form, the BS2 VM uses a mix of an explicitly typed stack machine with some number of register ops mixed in. The use of hybrid register-ops can in some cases be used to reduce the length of instruction sequences, and thus improve performance for an interpreter, at the cost of a larger ISA.

There is some possible debate as to whether this is better or worse than a purely register IR would have been.

Seen another way, the stack can be thought of as a collection of implicit temporaries which are accessed in LIFO order. Within the interpreter (if using threaded code), there is no need to maintain a stack pointer at runtime (the stack pointer only really exists when decoding functions). A side effect of this is needing some non-executed instructions which primarily serve to keep the stack aligned around labels and branches.

For example, compiling “z=25*x+y;” with a pure stack model:

LDI L0   //(=>S)=x
LDC.I 25 //(=>S)=25
MULI     //(=>S)=(S=>)*(S=>)
LDI L1   //(=>S)=y
ADDI     //(=>S)=(S=>)+(S=>)
STI L2   //z=(S=>)

Or, with some additional “hybrid” opcodes:

MULISLC L0, 25  //(=>S)=x*25
ADDILSL L2, L1  //z=(S=>)+y

Which in this case, can match the density of a pure register form.


Typically, the bytecode is internally converted into a sort of threaded code, where instructions are decoded into structures with a function pointer, and instructions are executed by calling the function pointers.

These instructions are then stored in arrays, and on the C side, unrolled loops can be used to execute a sequence of instructions via such arrays. Execution then primarily works via a main trampoline loop, which calls function pointers holding the unrolled loops for executing instructions.
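A stripped-down sketch of this sort of threaded-code dispatch in C (ignoring the unrolled-loop batching; all names and the instruction set are made up for illustration):

```c
#include <stdint.h>
#include <stddef.h>

/* Each decoded instruction is a small struct holding a function pointer
 * plus operand fields; a trampoline loop calls through the pointers, so
 * there is no opcode switch at runtime. */
typedef struct VMOp VMOp;
typedef struct { int32_t regs[256]; VMOp *ip; } VMCtx;

struct VMOp {
    VMOp *(*run)(VMCtx *ctx, VMOp *op);  /* returns the next op, or NULL */
    int32_t a, b, c;                     /* operand fields */
};

/* regs[c] = regs[a] * immediate(b) */
static VMOp *op_mul_imm(VMCtx *ctx, VMOp *op)
{ ctx->regs[op->c] = ctx->regs[op->a] * op->b; return op + 1; }

/* regs[c] = regs[a] + regs[b] */
static VMOp *op_add(VMCtx *ctx, VMOp *op)
{ ctx->regs[op->c] = ctx->regs[op->a] + ctx->regs[op->b]; return op + 1; }

static VMOp *op_halt(VMCtx *ctx, VMOp *op)
{ (void)ctx; (void)op; return NULL; }

/* Main trampoline loop. */
static void vm_run(VMCtx *ctx)
{
    VMOp *op = ctx->ip;
    while (op)
        op = op->run(ctx, op);
}
```

For example, “z=25*x+y;” would decode into a mul_imm, an add, and a halt/return op, with the trampoline just chasing the returned pointers.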

A simplistic JIT strategy for this is to emit machine-code blobs which contain, in the general case, function calls to the embedded function pointers. For operations handled directly by the JIT, it may instead emit a blob of instructions for performing the operation. Extra fancy: one can add a makeshift register allocator, so values may be cached in CPU registers rather than using memory loads/stores.

But, if the bytecode is kept friendly to interpreters, an interpreter is a good fallback in case a JIT is unavailable or too expensive to support on the target. Directly interpreting the bytecode can also make sense if, for whatever reason, threaded code is not a good option, or if the code will probably only ever be executed once or twice.