New ISA, CNC stuff.

CNC Stuff

Recently, have undertaken some projects partly outside my usual coding stuff. In particular, got a manual mill (G0704) and then retrofitted it with a CNC kit (the FlashCut kit). It basically worked, as in the hardware went on easily enough and is basically functional. There was an issue with the Z axis pulley not staying on, as it screws onto the threads on the end of the shaft and is held by a set-screw against those threads. I ended up needing to ‘Loctite’ the threads (or, more accurately, “apply an off-brand blue thread-locking compound, and ignore the use of a genericized trademark”).

Was having an issue with it, in that I couldn’t really get (or keep) the machine holding position while using the motor drivers internal to the 2.5A / 5A “Compact Microstepping Controller” box (which runs 3 fairly large steppers off a single 5 amp power brick, similar to a largish laptop-style power brick). Tried various options (using half-stepping rather than microstepping, which was at least a little more reliable, etc.), and ended up making the rapid speeds and acceleration fairly slow.

Eventually got some external motor drivers (wired up to the box, which luckily has a DB25 output for the motor signals), with a 48V 10A power supply, and the results are a little more impressive (as in, now it doesn’t just randomly lose its position). Interestingly, the motors run a lot cooler and quieter with the external drivers, despite holding position better and seemingly having more output torque (a bit of a mystery here). Ended up settling on running the motors at 2.84A/phase, which seemed like a good balance (between making the motors sufficiently powerful and not overheating stuff), though peak power could go outside the rated 10A for the supply (maybe should have gotten a bigger one).

Well, at least the software and the controller box’s output signals work reasonably reliably.

The kit cost a bit much to be worth this much hassle, though. When a kit like this costs $k, you would hope it is better and more reliable and that everything “just works”.

For this project, also went and did a more makeshift CNC conversion of a rotary table, using a NEMA17 motor, some custom-made mounting hardware and pulleys, and a vacuum cleaner belt as the drive belt. Also used a knurling tool on the pulleys to try to minimize belt slip. Added another driver, and now the machine also has an ‘A’ axis.

 

CNC Lathe

Was also planning to do a retrofit of a lathe, but go a little more custom. Planning on replacing the main spindle motor with a big stepper (1800 oz-in NEMA34), and getting a particularly large stepper driver for it (100V, 8A per phase). This way I can use it both as a spindle and for C axis operations. I would use steppers for the X/Z axes similar to those on the mill (some 570 oz-in NEMA23 motors).

This project would differ some in that I am planning on going more custom on the CNC controller/software side. Current plan is to run some custom CNC software on a Raspberry Pi, with another controller generating the motor signals. While the RasPi could generate signals directly, the periodic interruptions due to Linux are enough to potentially result in lost steps with some of the faster moving motors (such as the spindle).

Did do a mock-up of some MSP430 code for doing motor controls, which basically works but limits motor speed some and has slow-ish (dial-up like) speeds for sending SPI commands due to the overall limited performance of the MSP430.

While I could technically get a faster microcontroller, another potentially interesting solution is to go full custom with an FPGA as well. In this case, some of the low-level motor control would be done directly in Verilog, with a customized microcontroller used for some of the software parts (the motor control part effectively would be done via an MMIO device). Ended up designing a new ISA for this.
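
As a rough illustration of the MMIO idea (a hedged sketch; the register layout, addresses, and names here are all hypothetical, as the real interface is still TBD), the software side of commanding a motor move might look something like this in C:

#include <stdint.h>

/* Hypothetical MMIO register block for one stepper axis.
   Base address, register offsets, and fields are made up for illustration. */
#define MOTOR_BASE   0x0000F000u
#define MOTOR_X      ((volatile uint32_t *)(uintptr_t)MOTOR_BASE)

enum {
    MREG_STEP_RATE = 0,   /* steps per second, written by software */
    MREG_TARGET    = 1,   /* target position, in steps             */
    MREG_POSITION  = 2,   /* current position, read-only           */
    MREG_STATUS    = 3    /* bit 0 = step generator busy           */
};

/* Command the axis to move to an absolute position at a given rate,
   then poll until the Verilog-side step generator reports idle. */
static void motor_move_to(volatile uint32_t *m, uint32_t rate, int32_t target)
{
    m[MREG_STEP_RATE] = rate;
    m[MREG_TARGET]    = (uint32_t)target;
    while (m[MREG_STATUS] & 1u)
        ;   /* busy-wait; a real setup might sleep or use an interrupt */
}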

BSR1 (or BtSR1) ISA

I decided to break with my prior SuperH-based ISAs for this, though technically my B32V ISA variant would also have been fairly applicable.

BSR1 and B32V have a few points in common:

  • Both use fixed-width 16-bit instructions.
    • Though certain instruction pairs “may” be decoded as larger instructions.
      • With some restrictions, both cases are treated as functionally equivalent.
  • Both use 16x 32-bit GPRs (and a very similar C ABI).
  • Both currently omit the use of a barrel shifter (TBD: may decide to keep shifter).
  • Both include a limited form of the Load/Shift mechanism from my BJX1 ISA.
    • Though the mechanism itself has many functional differences.
  • Neither includes a hardware division operator.
    • Things like integer division are emulated in software.

But, there are also points of difference:

  • BSR1 is a new ISA, partly influenced by both SH and MSP430.
  • BSR1 replaces (R0, Rn) memory access with (Rn, DLR).
    • Drops @-Rn and @Rm+ addressing.
    • Adds (PC, DLR) addressing.
      • Also adds a (PC, DLR_i4) special case to help with code density.
    • This reduces clashes with R0 being used as a GPR.
  • Many operations with immediate fields are gone to free up more encoding space.
    • Most immediate values have been moved to a DLR Load/Shift mechanism.
    • Some cases still exist in the name of code density.
  • Instruction encoding is different/incompatible.
    • BSR1 is mostly OOnm, OOii, or OOnO
    • B32V is mostly OnmO, OnOO, … with a SH-derived ISA.
  • BSR1 will currently focus on smaller address spaces.
    • My current prototype core design assumes the use of a 16-bit address space.
    • Current design is focusing on a 32kB ROM space with 8kB of RAM.
    • Technically, ISA can handle larger address spaces without too much issue.

The operation of the BSR1 ISA revolves a fair bit around the DLR register, which in the compiler is partially conflated with the SH MACL register. This register is also (like SH) used as a multiplier output, but (unlike SH) there is no MAC operation. DLR is generally treated as highly volatile, and takes over much of the role R0 served as a “stomping ground” register. As a result, R0 is freed up to be used as a GPR, which simplifies a lot of compiler logic that no longer needs special cases depending on whether or not R0 is available. By extension, DLR is always assumed to be available, and in most cases it is undefined both whether DLR preserves a value and whether the value in DLR will even be defined following many operations (a processor could choose to treat several smaller operations as a single larger operation, and in doing so need not necessarily update DLR in the expected way). Similarly, a processor can choose to execute the operations sequentially (as 16-bit instruction words) and still get the desired result.

So, operations may load an immediate or displacement value into DLR, and then an operation may use this value. Changing the exact sequence may allow a larger or smaller immediate or displacement to be encoded as needed.

For a small subset of the ISA, a signed 17-bit displacement may be represented in 32 bits:

  • MOV.x Rm, (PC, disp17s)
  • MOV.x (PC, disp17s), Rn
  • BRA disp17s
  • BSR disp17s

This allows accessing the whole 16-bit address space fairly easily, and extends reasonably cleanly to 25 or 33 bits (in a 48 or 64 bit instruction sequence). This is achieved via instructions which can load 13 bits into DLR, with the following opcode supplying an additional 4 bits. Many other instructions use DLR directly, which allows a 13-bit immediate to be encoded in an instruction pair (or 21 bits via a triple).
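
To make the arithmetic concrete (the exact field placement here is just for illustration, not the real bit layout), a signed 17-bit displacement splits naturally into a 13-bit chunk loaded into DLR and a 4-bit chunk carried in the following opcode:

#include <stdint.h>
#include <assert.h>

/* Split a signed 17-bit displacement into a 13-bit DLR-load payload
   (the high bits) and a 4-bit field in the following opcode (the low
   bits). Field placement is illustrative; assumes arithmetic right shift. */
static void split_disp17(int32_t disp, int32_t *dlr13, uint32_t *op4)
{
    assert(disp >= -(1 << 16) && disp < (1 << 16));   /* fits in 17 bits */
    *dlr13 = disp >> 4;            /* signed high 13 bits */
    *op4   = (uint32_t)disp & 0xF; /* low 4 bits */
}

/* What the decoder would effectively compute: DLR*16 + imm4. */
static int32_t join_disp17(int32_t dlr13, uint32_t op4)
{
    return dlr13 * 16 + (int32_t)(op4 & 0xF);
}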

In some tests, code density seems to be roughly comparable to similar code compiled on the MSP430, which should hopefully allow getting a reasonable amount of code into a 32kB ROM or similar.

From the perspective of writing ASM code or generally working with the ISA, it seems to be a little less painful than SH (and B32V), which is sort of an improvement I guess. This is due mostly to the instruction set being a little more orthogonal and being able to gloss over a little more in the assembler/emitter stage.

Development is ongoing, and this is still at a fairly early stage, but results thus far seem to look promising.

Making a CPU, SH4+BJX1

So, I had been sinking a lot of time recently into my ISA / Emulator / C Compiler / (now) CPU project. This has run a bit out of control, having now consumed about the last year of my life.

The base ISA from which it is derived is the SuperH SH4 ISA, probably most well known from the Sega Dreamcast, and mostly used in various pieces of consumer electronics. It is a RISC, but one which uses fixed 16-bit instruction words rather than the 32 bits more typical of RISC; SH later inspired the Thumb ISA.

A drawback of this ISA, however, is that it sometimes needs longer-than-ideal instruction sequences to perform operations. My workaround to this has become an ISA I call BJX1 (yet to come up with an actual name; could backronym it to “BetaJeX” or something; originally meant “BGB J-eXtensions”), which inherits the basic SH4 ISA (albeit with some instruction forms marked as “deprecated”), but adds a few escape-coded extension blocks (a major one being 8Exx, which single-handedly implements the bulk of the BJX1-32 ISA).

Extensions mostly include expanded MOV forms, expanded immediate values, and some “Reg, Reg, Reg” and “Reg, Imm, Reg” forms for arithmetic operations.

Similarly, there is the BJX1-64 ISA, which is closely related, but expands the ISA to 64 bits. This includes some additional instruction forms, and now also the CCxx and CExx blocks (much of the Cxxx block was dropped in BJX1-64, and was reused for operations which are “more useful”). Thus far, these blocks are mostly used to encode instruction forms which extend the ISA to 32 GPRs, in a form I am calling 64A (a 64B variant exists which stays with using 16 GPRs, and similarly lacks CCxx and CExx). There is a penalty involving use of R16..R31, but they are helpful for functions with a lot of register pressure. These blocks don’t exist in the 32-bit ISA (where these instead encode some rarely-used “@(GBR, disp)” operations).

Effectively, the 8Exx, CExx, and CC0e blocks work as prefixes, essentially modifying the normal 16-bit instructions words:

  • 8Exx: Adds or extends immediate for most instructions.
    • MOV forms either get an 8-bit displacement (“MOV.L @Rm, Rn” => “MOV.L @(Rm, disp8s), Rn”), or transform into @(Rm, Ro, disp) forms (“MOV.W @(Rm, R0), Rn” => “MOV.W @(Rm, Ro, disp4), Rn”).
    • Many arithmetic operators get an 8-bit immediate, and a 3-register block is added.
    • Values which use an immediate or displacement generally get a bigger immediate or displacement, and a few blocks are repurposed.
      • “BT disp16”, “BRA/N disp16”, …
      • “ADD #imm16s, Rn”
      • “LDSH16 #imm16s, Rn” replaces “MOV.W @(PC,disp), Rn”; Encodes “Rn=(Rn<<16)+Imm;” This is mostly used for constructing larger inline constants.
    • Some other additions include LEA instructions and similar, which are useful for pointer arithmetic.
  • CExx: Does similar to the 8Exx block, but takes off a few bits for extended GPRs, resulting in similar operations generally having a 4 or 6-bit immediate vs an 8 bit immediate.
    • Some of the MOV forms simply become @(Rm, Ro) instead of @(Rm, Ro, disp4).
    • Size-independent single-register operations keep the full width immediate (typically 16 bits), but simply encode a register in the range of R16..R31 instead of R0..R15.
    • “SHADQ/SHLDQ Rm, Imm, Rn” was split into multiple alternate instruction forms (SHALQ/SHARQ/SHLLQ/SHLRQ), as 6-bits is only sufficient to encode the shift amount.
  • CC0e: Functions essentially similar to an x86-64 REX prefix.
    • e=Qnst/Qnmz; Q=QWord, nst add a bit to each register; nmz adds a bit to Rn and Rm, with z usually indicating whether an applicable DWord operator will use sign or zero extension on the result.
    • More subtly modifies some other operations, so CC00 isn’t exactly NOP.
      • TBD: Define CC00 as simply padding a 16-bit opcode to 32 bits?…
  • CC3e: Implements basically a set of 3R arithmetic operators over all 32 GPRs. This one isn’t technically a prefix as the following block is special-purpose using all 16 bits (onst, or a 4-bit operator followed by Rn, Rs, and Rt).

Code density doesn’t quite reach the levels of Thumb2 as of yet, but gets “reasonably close”. Whether any of this has a point, given the existence of RISC-V’s RV32C and RV64C, remains to be seen (these seem to be in the same general area regarding code density).

A lot of the time in recent months has mostly been thrown at the C compiler, trying to make the ISA variants work, find bugs, and improve the quality of the generated code. This is still not fully solved, as bugs remain which have yet to be located and fixed (particularly in the 64-bit ISAs).

 

In some ways, Verilog proves challenging. In a basic sense, it is a language like many others, but it is a language with many awkward restrictions. You can write stuff, and it will work in simulation. However, synthesis becomes a problem, with lots of little things which conspire to eat up the FPGA’s budget. Little things like the relative length vs element width in an array, how the array is accessed, whether the array matches nicely with the size of the internal Block-RAM, when values are used directly vs delayed to a subsequent clock cycle, whether multiple versions of an operator are used vs forcing multiple paths through a single operator, … can have fairly significant effects on the cost of synthesizing the design.

Like, in C, one can use memory more freely, and C really doesn’t care when and where one uses shifts and multiplies. Verilog cares, and you will be judged for how often you multiply numbers, and the bits involved in the input and products of the multiplication, etc. It is like “That huge and ugly casez() block with hundreds of patterns; yeah that is fine. But as for that integer multiply working with 128 bit quantities; I am not happy about this…”. Often, things which work well in VL would be horrible in C, and things one takes for granted in C are very expensive in VL.

For example, BJX1-64 has a DMULS.Q operation, which takes two 64-bit signed quantities and produces a 128 bit result. So, trying to do it directly, one gets some big ugly mass of LUTs and DSP units which proceeds to eat up a scary chunk of the total budget. But, assuming I keep DMULS.Q, there is some incentive to find a cheaper way to do the multiply (such as a state machine built from narrower multiplies, spreading the multiply over several clock cycles).
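
For example, the usual trick (a sketch of the general idea in C, not my actual Verilog) is to build the wide multiply out of four 32x32-to-64-bit partial products, which maps more naturally onto a multi-cycle state machine:

#include <stdint.h>

/* 64x64 -> 128 unsigned multiply from four 32x32 -> 64 partial products.
   A signed version can be layered on top by fixing up the high word.
   In hardware, each partial product could be one step of a state machine. */
static void mul64to128(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo;   /* bits   0..63  */
    uint64_t p1 = a_lo * b_hi;   /* bits  32..95  */
    uint64_t p2 = a_hi * b_lo;   /* bits  32..95  */
    uint64_t p3 = a_hi * b_hi;   /* bits  64..127 */

    uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;

    *lo = (mid << 32) | (uint32_t)p0;
    *hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
}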

Similar sorts of things happen if one tries to implement memory as a long skinny array vs shorter, wider arrays. Despite being the same total capacity, the longer array may be notably more expensive. Accesses to memory arrays also sort of follow a similar principle to Highlander: “There can be only one!” You are allowed one read and one write to an array, not per mutually exclusive control path (that would make it too easy), but in total. Exceed this, and even a relatively modest-sized array will try to eat the budget.

But, it would be nice to be able to have an implementation of my BJX1 ISA in hardware. Nevermind if it seems at this point like I have little idea what I am doing.

 

But, this has been alongside my writing sci-fi, and I am now technically an author. This is even if, at the time of this writing, I only have about $3 to my name from ebook sales. Like, “Woo, several people have bought it!” But, of these, how many will read it? How many will think it isn’t crap? Will it sell enough to “actually matter”? …

 

Misc projects: Fast Compressors.

How to introduce this? These are something one works on, they serve a purpose, but don’t get much attention.

FeLZ32

I have a design for a format I had called FeLZ32, which is basically “like LZ4 but decodes faster”. How? It basically dispenses with bytes; the entire format is based around working with 32-bit DWORDs. So, the compressed data is DWORDs, and the output is also seen as DWORDs. The DWORDs are normally 32-bit aligned, and there is no fuss with trailing bytes.

Otherwise, FeLZ32 is sort of similar to LZ4: there is a Tag word, which encodes both the match (length/distance) as well as the number of raw/uncompressed DWORDs to copy to the output (before copying the matched symbols). The format of the Tag is fairly simple: [15:0]=distance; [22:16]=match length; [29:23]=raw length; [31:30]=alignment. There are special cases, namely for long runs of raw/unencoded symbols, and an end-of-data case (Creatively, encoded as 0x00000000).
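
In C terms, pulling the fields back out of a Tag word is just shifts and masks per the layout above (special cases like long raw runs and the end-of-data tag are omitted here):

#include <stdint.h>

/* Decode an FeLZ32 tag word: [15:0]=distance, [22:16]=match length,
   [29:23]=raw length, [31:30]=alignment. Special cases (long raw runs,
   the 0x00000000 end-of-data tag) are not handled in this sketch. */
typedef struct {
    uint32_t dist;    /* match distance, in DWORDs    */
    uint32_t mlen;    /* match length, in DWORDs      */
    uint32_t rlen;    /* raw/literal count, in DWORDs */
    uint32_t align;   /* alignment bits               */
} FelzTag;

static FelzTag felz_decode_tag(uint32_t tag)
{
    FelzTag t;
    t.dist  =  tag        & 0xFFFF;
    t.mlen  = (tag >> 16) & 0x7F;
    t.rlen  = (tag >> 23) & 0x7F;
    t.align = (tag >> 30) & 0x3;
    return t;
}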

( Sadly, some of the shifts are a little awkward, e.g. for SuperH assembler. Normal SH only provides immediate shifts of 1/2/8/16 bits; so X>>7 requires shifting 1 bit left and 8 bits right in this case, or the use of SHLD. )

FeLZ32 wiki entry on GitHub.

So, basically, it is like LZ4, but in-general doesn’t compress quite as well (depends on content, for some things it fares a little better). In turn, LZ4 doesn’t usually compress as well as Deflate, so same sort of thing. Each is generally faster, but at the expense of compression.

There are some types of content where it does better, but here are results for some random stuff I had lying around (not really ideal cases for FeLZ32 either, but alas):

(Image: felz_vs_lz4_1, FeLZ32 vs LZ4 results)

As noted, my PC (an AMD FX-8350 with 32GB of DDR3 1600) achieves a “general” memcpy speed of around 3.4 GB/s ( I previously had been using a Phenom II which peaked at ~ 1.8GB/s ).

As noted, FeLZ32 was typically getting decode speeds of around 2 to 3 GB/sec, which seems pretty solid. Encoding doesn’t reliably beat LZ4 though, which is sad ( encoding mostly consists of searching for LZ matches, so its speed is pretty dependent on how much “effort” the encoder puts into the search, and the efficiency of the matching strategy. )

I guess a possible “research path” could be a format which can alternate between an FeLZ32-like mode and a more LZ4 like mode.

 

BTIC4B (History)

There was also BTIC4B, which is a speed-oriented image/video codec, following along with my prior “BTIC family” designs.

For the most part, members of the BTIC family were focused on speed, and go back originally to a format I had called BTIC1C, which was basically a hacked version of the Apple RPZA format with a Deflate-based backend glued on (which I had also tuned for speed).

In this format, blocks consisted of a pair of colors in RGB555 format, followed by 32 bits of pixel data representing a 4×4 block (at 2 bits/pixel). Anyone who has looked at DXT1 probably gets the idea (but this happened roughly 5 years earlier). The MSBs of the color endpoints were used to invoke other magic, namely skipping runs of blocks, or doing runs of a flat color.
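
A minimal sketch of decoding one such block (assuming DXT1-style interpolation between the two endpoints; the actual palette derivation in RPZA/BTIC1C differs in the details):

#include <stdint.h>

/* Expand an RGB555 endpoint to 8-bit-per-channel RGB. */
static void rgb555_to_rgb24(uint16_t c, uint8_t rgb[3])
{
    rgb[0] = (uint8_t)(((c >> 10) & 0x1F) * 255 / 31);
    rgb[1] = (uint8_t)(((c >>  5) & 0x1F) * 255 / 31);
    rgb[2] = (uint8_t)(( c        & 0x1F) * 255 / 31);
}

/* Decode a 4x4 color-cell block: two RGB555 endpoints plus 32 bits of
   2-bit selectors. The 4-entry palette here uses 1/3, 2/3 interpolation
   (DXT1 style); real formats vary in how the middle colors are derived. */
static void decode_block(uint16_t c0, uint16_t c1, uint32_t pix,
                         uint8_t out[16][3])
{
    uint8_t pal[4][3], a[3], b[3];
    int i, j;

    rgb555_to_rgb24(c0, a);
    rgb555_to_rgb24(c1, b);
    for (j = 0; j < 3; j++) {
        pal[0][j] = a[j];
        pal[1][j] = (uint8_t)((2 * a[j] + b[j]) / 3);
        pal[2][j] = (uint8_t)((a[j] + 2 * b[j]) / 3);
        pal[3][j] = b[j];
    }

    for (i = 0; i < 16; i++) {
        int sel = (pix >> (i * 2)) & 3;
        for (j = 0; j < 3; j++)
            out[i][j] = pal[sel][j];
    }
}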

There were a few failed attempts to replace it: BTIC1D, which used a conceptually similar design but was YUV420, and had unacceptably poor decode speeds combined with lackluster compression. There was BTIC1F, which added the concept of using a YUV color-vector in place of discrete endpoints, but ran into awkward design issues.

The next semi-successful member was BTIC1G, which was basically a stripped down design intended mostly for some robots I was building. This had a different set of requirements, namely a fast/simple encoder which could work incrementally and stream the video as small spans of blocks (via UDP datagrams). This version was basically straight bit-twiddly.

With BTIC1G, I had also discovered a way to more quickly classify and encode block formats, namely measuring some properties of the pixel block and feeding them through some weighting and a weighted decision tree (in some contexts these things are passed off as “neural nets”). This allowed determining in real-time what the most likely ideal representation was for a block. ( Some parts of the design process may have also been assisted using genetic algorithms as well… ).

Then later, started work on a design I called BTIC1H, which was originally intended for the same use-case as 1G, and basically replaced the bit-twiddly blocks with Adaptive Rice Coding (and a trick I had called SMTF; which can mimic the behavior of Huffman coding but without needing Huffman tables or the severe performance overheads associated with Adaptive Huffman and Arithmetic Coding). It had also reintroduced some concepts from 1D and 1F, namely the use of (now optional) YUV420 blocks, and the use of delta-coded color vectors.

Elsewhere, I had begun work on a format called BTIC4A, which was focused initially on fixed-size blocks and packing pixels into blocks, in a manner conceptually similar to BC7; just with 8×8 pixel blocks in either 256-bit, 384-bit, or 512-bit forms.

The 4A effort was quickly mutated into BTIC4B, which combined 4A’s block formats with a bitstream format more or less copy/pasted from 1H. Most of the pixel-data is copied as-is into the bitstream. Decoding consists then of reconstructing the blocks from the bitstream, and handing them to a block-decoder.

As a side effect of the bigger 8×8 blocks though, there was only about 1/4 as much overhead from things like entropy coding, allowing it to go faster, and there was also an improvement in terms of compression due to the reduction of side-data for each block. Many blocks use subsampled chroma, ranging from 4:2:0 up to giving the whole 8×8 block a single chroma value (only luma varying).

Likewise, vectors were labeled as (Y, U, V, Dy, Du, Dv), with Y/U/V as a central color, and Dy/Du/Dv being the range relative to the center. Dy is always positive, but Du and Dv may be negative. Unused components may be skipped, and the vector deltas may be based on a difference from a vector predicted via a Paeth predictor (similar to PNG). The vector is then transformed (when decoded) into a pair of color endpoints (in YUV space).
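
Read naively (this is just one plausible reading of the center-plus-range idea, not the exact BTIC4B math), the expansion back to a pair of endpoints would be something like:

#include <stdint.h>

typedef struct { int y, u, v; } Yuv;

/* Expand a (Y,U,V, Dy,Du,Dv) center-plus-range vector into a pair of
   YUV endpoints. Illustrative only; the real format may clamp, scale,
   or skip components as described above. */
static void vec_to_endpoints(int y, int u, int v, int dy, int du, int dv,
                             Yuv *hi, Yuv *lo)
{
    hi->y = y + dy;  hi->u = u + du;  hi->v = v + dv;
    lo->y = y - dy;  lo->u = u - du;  lo->v = v - dv;
}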

In chroma-subsampled blocks, the U/V components are interpolated to form the U/V for a given pixel sub-block, which is then unpacked into possible pixels based on Y; Y then selects the pixels. In non-subsampled blocks, the endpoints are converted into colors and used directly (thus Y may also determine the chroma value). Though, some blocks may ignore Du/Dv, with interpolation only along Y.

Things are pretty variable, but will note that on my AMD-FX I am often seeing decode speeds in the area of 400-600 megapixels per second for a single threaded decoder, which is plenty fast to be usable. With multiple threads, gigapixel speeds are possible (however, per-thread speed drops sharply as the number of threads increases). I have a Dual Xeon E5410 box (a discarded rack server) which, while initially slower, scales a bit better (I have seen decode speeds of around 4 gigapixels/second on this machine).

But, at its core, the “transform” ultimately isn’t that much different from RPZA and CRAM; we often have 1 or 2 bits per pixel (though 3 or 4 are possible). There are variable block sizes (Y subsampling), and various levels of chroma subsampling.

The Y block has various sizes (none; 2×2, 4×2, 2×4, 4×4, 8×4, 4×8, 8×8) which map to the logical 8×8 output block, and the UV chroma blocks have sizes (none; 2×2, 2×4, 4×4, or 4×8); these in turn come in certain combinations, with various combinations of pixel bit-depths (to fit within a size budget).

No real fancy math, just good ol’ bit twiddly.

 

AdRice and SMTF

AdRice was another nifty trick I ran across. In traditional Rice coding, the ‘k’ value is determined up-front, so two passes are typically used: one pass determines the k values, the other does the actual encoding.

In AdRice, a nifty trick is used, namely, if the Q value (length of the unary prefix) is 0, then k is made smaller; if 1 it remains the same; and if >1 it gets bigger. Starting k is then ‘something sane’, and the rest takes care of itself.
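
As a minimal sketch of just the adaptation rule (the bit-reader here is a made-up helper, and length-limiting plus the table-driven decode are left out):

#include <stdint.h>

/* Toy bit reader (MSB-first), just to make the example self-contained. */
typedef struct { const uint8_t *buf; uint32_t pos; } BitReader;
static int read_bit(BitReader *br)
{
    int b = (br->buf[br->pos >> 3] >> (7 - (br->pos & 7))) & 1;
    br->pos++;
    return b;
}

/* Decode one adaptive-Rice symbol. The unary prefix length Q also drives
   the adaptation: Q==0 shrinks k, Q==1 keeps it, Q>1 grows it. */
static uint32_t adrice_decode(BitReader *br, int *k)
{
    uint32_t q = 0, r = 0, i, val;

    while (read_bit(br))                 /* unary prefix: Q one-bits, then a zero */
        q++;
    for (i = 0; i < (uint32_t)*k; i++)   /* k-bit remainder */
        r = (r << 1) | (uint32_t)read_bit(br);
    val = (q << *k) | r;

    if (q == 0 && *k > 0) (*k)--;        /* short prefix: shrink k */
    else if (q > 1 && *k < 24) (*k)++;   /* long prefix: grow k */

    return val;
}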

To avoid epic slowness, it is possible to use a lookup table, where both the coded symbol and new k value are held (both the ‘k’ and matched bit pattern are used as the table-index). The range of ‘k’ can also be constrained, and a length-limiting escape case can be used, such that the whole process can be driven by lookup tables (as in a typical Huffman coder).

Unlike Huffman, AdRice has the advantage of allowing encoding/decoding without needing to drag around a Huffman table (potentially a big cost with small payloads; and cases where data may be encoded/decoded in pieces or in no particular order).

Though, Rice may have an obvious limitation: Normally it only works effectively at encoding values following a geometric distribution.

A possible few workarounds exist:

  • SMTF: If symbol is small, swap towards front, else rotate space and swap the symbol to the new front (the symbol that “falls off the end” goes into the “hole” left previously).
  • STF: Swap towards front (if N>0: swap N with N-1).
  • STF2: Swap towards front, but at a fixed ratio (eg: swap N with 7*N/8).

The exact swapping strategy varies, and what is ideal is variable. But, these basically tend to put symbols in an order of descending probability (and thus make the Rice coding happy), and aren’t too terribly expensive (SMTF is usually implemented via masking off a rover+index into a buffer; rotating the space consists of decrementing the rover).

STF tends to result in a closest-to-optimal distribution, but suffers from the problem of adapting slowly; STF2 adapts faster than STF but is prone to a limitation where alternating patterns of symbols may get “stuck” swapping with each other (rather than advancing towards the front); …
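
A minimal sketch of the STF/STF2 updates on a rank table (the sizes and the 7*N/8 ratio follow the description above; the SMTF rover variant is left out):

#include <stdint.h>

#define NSYM 256

/* rank[i] = symbol currently at rank i; pos[sym] = that symbol's rank.
   The coder transmits ranks (Rice coded), then nudges the symbol forward. */
typedef struct {
    uint8_t rank[NSYM];
    uint8_t pos[NSYM];
} StfModel;

static void stf_init(StfModel *m)
{
    int i;
    for (i = 0; i < NSYM; i++) { m->rank[i] = (uint8_t)i; m->pos[i] = (uint8_t)i; }
}

/* Swap the symbols currently at ranks a and b. */
static void stf_swap(StfModel *m, int a, int b)
{
    uint8_t sa = m->rank[a], sb = m->rank[b];
    m->rank[a] = sb; m->rank[b] = sa;
    m->pos[sa] = (uint8_t)b; m->pos[sb] = (uint8_t)a;
}

/* STF: swap with the immediately preceding rank. */
static void stf_update(StfModel *m, int sym)
{
    int n = m->pos[sym];
    if (n > 0) stf_swap(m, n, n - 1);
}

/* STF2: swap towards the front at a fixed ratio (here 7*N/8). */
static void stf2_update(StfModel *m, int sym)
{
    int n = m->pos[sym];
    if (n > 0) stf_swap(m, n, 7 * n / 8);
}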

The combination of AdRice and SMTF, while not quite as fast or as good as Static Huffman, generally works pretty decently as a substitute when an adaptive coding is needed or when payloads are small enough that the cost of passing around and initializing the Huffman tables outweighs the costs of AdRice and SMTF.

 

Retroactive: Doing It C Style.

Probably a bit of an unpopular stance, but I mostly prefer C, and even then, I tend towards a subset of C. It isn’t that I am afraid of C++, like it is some paradoxical and incomprehensible thing, but rather that what it has to offer doesn’t seem to offset the costs of using it, and a lot of the often-touted selling points of C++ come with pretty serious drawbacks of their own (such as producing bloated executable files or taking absurdly longer to compile than “just doing it the C way”).

So, some advantages of C or a C-like approach to C++:

  • Faster compile times.
  • Typically smaller binaries.
  • (Typically) better performance (excluding doing something stupid).
  • (C++) Freedom to pick and choose which features are actually relevant to the task at hand (avoiding “using features because they are there” or “‘Modern C++’ style demands doing this thing in this way”).

And, some advantages of plain C:

  • Doesn’t depend on having a C++ compiler (more relevant to small targets than mainstream ones).
  • The language is more within reach for a “mere mortal” to write a parser or compiler for it.
  • The ABI is more standardized (less subject to the whim of the compiler or which version of the compiler one is using, and with far fewer edge cases).
  • The code is frequently more reusable, as it can be used in both C and C++ projects.

One common criticism against C is its lack of “OO”, but this seems to mistake a lot of superficial language details for the underlying concepts. A class in C++ isn’t all that functionally different from a struct in C, and most things that can be done with a C++ class can also be done, without all that much more effort or code, in C using structs. It is more a case of there being none of the syntactic sugar, so the code looks different / less pretty (“foo->Bar()” becoming one of “FooLib_FooClass_Bar(foo)”, “foo->vt->Bar(foo)”, or “(*foo)->Bar(foo)”). There is no operator overloading, so one uses function calls, and no templates, so one either uses macros or writes the code inline (this being itself a tradeoff; this sort of thing is harder and more limited, but on the other hand, even in C++, overuse of templates is one of the major factors leading to bloat and slow compile times).
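
For the “foo->vt->Bar(foo)” case, a minimal sketch of what this tends to look like (the names here are just for illustration):

#include <stdio.h>

/* A struct of function pointers serves as the "vtable". */
typedef struct FooClass FooClass;
typedef struct FooVtab {
    void (*Bar)(FooClass *self);
} FooVtab;

struct FooClass {
    const FooVtab *vt;
    int value;
};

static void FooClass_Bar(FooClass *self)
{
    printf("Bar: value=%d\n", self->value);
}

static const FooVtab foo_vtab = { FooClass_Bar };

int main(void)
{
    FooClass foo = { &foo_vtab, 42 };
    foo.vt->Bar(&foo);   /* the C spelling of foo->Bar() */
    return 0;
}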

A difference, though, is in terms of problem solving, as C does a lot less hand-holding, and even when it does, the things provided by the standard library may not be the best way to do them: C string functions can get slow, but frequently that is a sign of “you aren’t doing it right”. Functions like “qsort()” are lame, but it is usually more a tradeoff between “simple but slow” (ex: half-assing a bubble sort) and “fast but more effort” (writing out a specialized sort function, or finding a non-sort solution); “std::sort” may well give a better sort than “qsort()”, but OTOH, expecting qsort to be the end-all solution to sorting in C is sort-of missing the point.

Actually, if one gets “serious” in C, they may well come to realize that even a fair portion of the C library is ancillary anyways (ex: the parts which aren’t about either using host OS facilities or are builtin functions typically handled by the compiler itself).

Similarly, if working on small embedded targets, they may not have a lot of the graces that come with a hosted environment on a relatively fast and powerful CPU. On most higher-end 32-bit architectures, things are usually pretty sane (apart from the smaller amount of memory available).

However, on 8 and 16 bit targets, a lot of this goes out the window, and being limited to C is probably the lesser of the issues. For example, one might have to deal with issues such as:

  • Lack of a hardware multiplier, making multiply expensive.
  • Lack of the ability to divide by a non-constant value.
    • Because there is neither a hardware divider nor the resources to fake it in software.
  • Shifts may also be done in software.
    • Adding a value to itself is basically a left shift, right?…
  • You only have 16-bit integer types, but potentially 20 or 24 bit addresses.
    • And potentially only the ability to add positive 16b displacements to pointers.
    • Say, other ops require composing/decomposing pointers with macros (see the sketch after this list).
  • Want FPU? Too bad, not gonna happen.
  • Here is to hoping the C dialect itself doesn’t have funky limitations.
    • Say, there are only 1D arrays, or an inability to directly copy structs by value, …
    • Or, if it only accepts K&R declaration syntax.
      • Though, as of 2017, I feel these are pretty solidly dead now.
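
For the pointer composing/decomposing point above, something along these lines is typical (a hedged sketch for a hypothetical 16-bit target with 24-bit addresses; real toolchains each have their own spelling for this):

#include <stdint.h>

/* Hypothetical "far pointer" for a 16-bit target with 24-bit addresses:
   kept in a 32-bit integer and split into a bank byte plus a 16-bit
   offset via macros, since no native pointer type spans the space. */
typedef uint32_t farptr_t;

#define FP_MAKE(bank, ofs)  (((farptr_t)(bank) << 16) | (uint16_t)(ofs))
#define FP_BANK(p)          ((uint8_t)((p) >> 16))
#define FP_OFS(p)           ((uint16_t)(p))
#define FP_ADD(p, n)        ((farptr_t)((p) + (uint32_t)(n)))

/* Example: step a far pointer to the next record; the carry into the
   bank byte happens automatically in the 32-bit add. */
static farptr_t next_record(farptr_t p, uint16_t record_size)
{
    return FP_ADD(p, record_size);
}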

 

2bpp Framebuffer and NTSC Composite

So, status (besides class):
Had recently started messing with Verilog (for if/when I can eventually get an FPGA; still limited mostly to simulation), and have thus far written a partial implementation of a CPU core implementing my BJX1 ISA in it (an extended 64-bit version of the SuperH ISA).

For part of it, was also working on a display subsystem (intending to work from a small framebuffer in SRAM and then produce composite output).

Was able to experimentally get passable image quality from around 2 bits/pixel via a color-cell scheme. This has pretty comparable memory use to CGA in 4-color mode; but with higher image quality.

One sub-mode uses a pair of color endpoints from a 64-color RGB colorspace. Image quality isn’t great, but is reasonably passable. Here, each R/G/B component has 4 brightness levels, with an ability to get intermediates, sort of, by dithering.

Another sub-mode uses YUVD colors encoded in a way similar to ADPCM. Basically, there is a YUV centroid with D=Ymax-Ymin (the difference in Y between the brighter and darker value). These are stored as a delta from the prior vector, with 3 bits/component encoding the delta, which may dynamically adapt to fit the range of the component values. There is an escape case to allow colors which fall too far outside the current range to be encoded as an absolute YUVD centroid; or alternatively as a pair of 64-color RGB endpoints.

The reason for this is mostly to be able to fit the screen buffer into SRAM in an FPGA, and not “eat it” with an overly large framebuffer. As-is, 320×200 at 2 bpp requires 16kB, and 640×400 would require 64kB.

This is mostly because most of the rest of the SRAM (around 200kB in the FPGAs I was looking at) is needed for other purposes; like a CPU cache, … Alternatives would require either a mechanism to stream in a framebuffer from external DRAM, or require generating interrupts to have the CPU fill in scanlines.

 

Image of a Dog (Original, apart from scaling to 640×480):

(Image: DE3D1_org0)

 

Image of a dog, but encoded at 2 bits/pixel:

(Image: DE3D1_2bpp1)

It doesn’t look great, but you can’t expect too much from 2-bits / pixel.

If I added a full 64kB (to support 640×400) I could possibly have a higher quality sub-mode for 320×200 display.

Another possibility is to have 32kB of VRAM, which would allow a 2bpp 640×200 mode, or a higher quality 4bpp 320×200 mode (probably with either a pair of RGB555 endpoints or a YUVDyuv scheme).

Less certain ATM is whether this will support a text submode; the specifics are unclear (dedicated text cells, or drawing the text via color-cell graphics, with each glyph as a 2×2 grid of cells).

 

This would feed into the next part:
I am considering doing the video output mostly as composite (ex: NTSC over an RCA cable), and so was messing around with trying to modulate and demodulate NTSC in software (and maybe later do a version of the modulator in Verilog; ADD/EDIT: have done a modulator in Verilog).

 

Dog Image (of experimental NTSC output, done via 100MHz PWM):

(Image: DE3D1_ntsc_clr2)

In this test, the NTSC is done via PWM (or DPWM); in this case turning an output pin (simulated in this case) on and off on a 100MHz clock in a pattern to approximate an analog signal. This produces output at a rate of ~100Mbps, and thus a frame of video requires around 417kB (ironically, smaller than most of the output images saved as PNG).

For actual output, this would be fed into a voltage divider and capacitor to turn this into a “true” analog signal (would need to reduce from 3.3V to 1.0V and also smooth out a lot of the high-frequency noise). This is pretty similar to how audio output is done on most modern computers, as DACs are more expensive and people don’t really hear the difference; though using a DAC would only need a signal at around 14MHz rather than 100MHz (and higher video quality would similarly require a higher PWM frequency). This frequency is nowhere near high enough to be able to directly produce VHF modulated output though (would need to PWM the pin at around 240-300MHz for this).

Color makes things a little more complicated, and a lot of the effort in implementing the NTSC demodulator was mostly going into trying to make the QAM demodulation work. Color involves basically magic of making ~3 analog signals coexist in the same place at the same time, with the components mostly not interfering with each other.

The Y component is transmitted via the signal’s DC bias, and the UV/IQ components via phase angles (I at 0 degrees and Q at 90 degrees); with a non-modulated 3.579545MHz “colorburst” signal being sent every scanline to help align the decoder with the encoder’s signal. These are all handled together as a single combined PWM signal. Here, YUV can be fudged into being YIQ (by slightly rotating the signal), so it basically works.
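
In rough code form (a sketch of the standard composite math, not my actual modulator), each active-video sample of the combined signal is just luma plus the I/Q terms riding on the subcarrier:

#include <math.h>

/* One sample of a composite scanline (active video only): luma as the
   baseline level, chroma as I/Q modulated onto the 3.579545 MHz
   subcarrier. Sync, blanking, and the colorburst itself are not shown. */
static double composite_sample(double y, double i, double q,
                               double t /* seconds into the scanline */)
{
    const double fsc = 3579545.0;                 /* NTSC color subcarrier, Hz */
    double w = 2.0 * 3.14159265358979323846 * fsc * t;
    return y + i * cos(w) + q * sin(w);
}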

In the case of the image, the decoder was given a copy of the PWM’ed output bitstream, and it recreates an output image from this. In an actual hardware implementation, this role would be served by plugging the cable into a TV or some other similar display with support for a composite input. Though, I would need some sort of decoder to test a Verilog simulation of the display subsystem.

Though, I don’t know yet if it will work with an actual TV, as information on NTSC specifics is a bit fragmentary and it was annoyingly difficult to piece together via web-searches. So, it is possible I missed something important and the generated signal isn’t actually compatible with real TVs, who knows?… ( Could maybe consider writing a basic spec on this. )

Why not VGA? Less wires, mostly. Though, on the modulation side VGA (and component) would be both simpler and probably produce higher output quality than composite (basically by not needing to modulate multiple signals together). VGA would require wiring up a VGA connector though (and a lot more filters); Component would have the drawback of being less commonly supported by low-cost displays.

 

Other Thoughts:

  • Noticed after posting this, that I could improve image quality by 8-bit aligning the PWM bits for the NTSC test, likely because the decoder samples 8 or 16 bits at a time (shouldn’t matter for a real TV).
  • Though more complex, a hybrid DAC design (say, with 2 or 3 high-order bits and the low-order bits as PWM) could likely be able to give better video quality. Though, this would require more complex electronics.
  • Note that the colorburst clock doesn’t align sufficiently closely with an integer fraction of 100MHz. As a workaround, it is possible to accumulate 150137 once per clock cycle and use bits (21:17) of the accumulator as a clock phase (3.579545/100*2^22, which approximates the ratio of the frequencies relative to a power of 2; the choice of 2^22 in this case being based mostly on “number magic”).
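
For reference, the accumulator trick from the last bullet looks like this in C form (the Verilog equivalent is basically just a 22-bit adder; the constants are as described above):

#include <stdint.h>

/* Phase accumulator approximating the 3.579545 MHz colorburst from a
   100 MHz clock: add 150137 ~= (3.579545/100)*2^22 each cycle and take
   bits [21:17] as a 5-bit phase within the subcarrier cycle. */
static uint32_t acc = 0;

static unsigned colorburst_phase_step(void)
{
    acc = (acc + 150137u) & 0x3FFFFFu;   /* wrap at 2^22 */
    return (acc >> 17) & 0x1Fu;          /* 5-bit phase, 0..31 */
}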

But, yeah, work in progress.

 

ADD: Experimental 4-bit PCM

Dog Image, NTSC 100MHz 4-bit PCM w/ dither and rotating colorburst (45 degrees every half-frame), and an average over several frames.

(Image: DE3D1_ntsc_clr3)

By itself, the 4-bit PCM does give considerably better results than the PWM (much less noise), though even then there was still a slight ripple pattern, apparently due to the colorburst signal (the slightly noticeable diagonal lines in the original NTSC image, but more obvious here than with the PWM due to the lower overall noise).

The signal seems to align so that nominally the carrier is at the same phase in the same places in the frame. A workaround seems to be to average the past two frames and rotate the signal by 45 degrees per half-frame, so that for two consecutive versions of a scanline the carrier is out of phase by 90 degrees, and the ripples mostly cancel out. In this case, the image is encoded via multiple consecutive frames.

Though, it also appears that rotating the carrier somehow mostly defeats the effects of CGA style artifact colors (instead, these tests mostly lose saturation).

I am not entirely sure how actual TVs would respond to a rotating colorburst, but at least theoretically the TV will be able to realign with the colorburst each frame.

 

Forgot about this…

So, what happened? I wrote a few posts, then for a while forgot I had set up a blog.

So what has been going on?

Class, basically doing Machining stuff, and learning CNC stuff, because, that is what sort of jobs “actually exist” around where I live.

 

Spent a good number of months working on an emulator and a C compiler targeting a variation of the SuperH ISA. Started fiddling with doing an implementation of the ISA in Verilog. Not like it is “actually useful” or anything. Working on stuff like this, one becomes aware both of the limitations of the ISA and, doing stuff in Verilog, of some of the challenges which exist when trying to implement an ISA in hardware, sorta (this exists in a gray area between code and hardware: writing stuff in a language which can nominally run on an FPGA, or theoretically be turned into a real chip if enough money is thrown at the problem).

 

So, what is Verilog:

It is sorta like C, if you were doing everything using branch-free logic, with a few control-flow constructs thrown in the mix; and a convenient bit-vector notation (counterbalanced by an ugly syntax for declaring numeric values; begin/end keywords; …). It is also in some ways behaviorally sort-of like ladder logic, though at least thankfully, much less terrible (the way combinatorial blocks behave is reminiscent of a scan in a ladder program; though luckily we are spared from everything being presented via the abstraction of relays switching things on and off between a pair of power rails).

One thing that becomes more obvious here is the distinction between division and most other arithmetic operators: Most operators can be implemented fairly easily; but division is one of those operators that generally requires an iterative implementation in both the integer and FPU case.

Integer division is possible in one of several ways, but not going to go into them.

FPU division is at least a little easier, in that one can hack together a passable approximation relatively cheaply (mostly using subtraction and shifts or similar). But, then a problem comes up if one wants “accurate” division. In a software FPU, one can be like, “well, I will just use a fixed-point integer divide and some bit-twiddling”. Here, not so much (if one has any concern for gate budget).

There is always Newton-Raphson, but this also costs gates (for a probably one-off special purpose FMA unit), and one would still basically need to follow an initial approximation with some number of “refinement” operations.
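
For reference, the iteration in question (a plain C sketch of the math; a hardware version would seed the guess from a small lookup table and use fixed-point FMA-like steps):

/* Newton-Raphson reciprocal refinement: x <- x*(2 - d*x) roughly doubles
   the number of correct bits per step; a/d is then a*recip(d). Assumes d
   has been range-reduced to [1,2) (i.e. the mantissa of a normalized
   float), so a fixed initial guess converges. */
static float nr_divide(float a, float d)
{
    float x = 2.0f / 3.0f;   /* crude fixed guess for d in [1,2) */
    int i;
    for (i = 0; i < 4; i++)
        x = x * (2.0f - d * x);
    return a * x;
}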

The other alternatives are having a floating-point divide operator which has lackluster precision, or is comparatively slow (if emulated in firmware).

 

Interestingly, one feature I got backlash over, namely the addition of x86-like memory addressing, was comparably fairly cheap and simple to implement, and has the advantage of somewhat reducing the length of instruction sequences when working with structures, arrays, or when doing pointer-arithmetic.

Basically, in SuperH, memory addressing ops were basically like this:

MOV.B  //Load or Store a Byte
MOV.W  //Load or Store a Word (16-bit)
MOV.L  //Load or Store an Integer/Long (32-bit)
FMOV.S //Load/Store a Float (or half of a Double)

With several types of addressing available (basic/common forms):
MOV.L @Rm, Rn  //Load from Rm into Rn
MOV.L @Rm+, Rn //Load from Rm into Rn, with a post-increment
MOV.L @(Rm, disp4), Rn //Load from Rm with 4 bit displacement
MOV.L @(R0, Rm), Rn //Load from Rm indexed by R0
...
MOV.L Rm, @-Rn //Store into Rn with pre-decrement
...

A useful special case:
MOV.L @(PC,disp8), Rn  //Load a constant

And, some not so useful special cases:
MOV.L @(GBR, disp8), R0 //What do we use this one for, exactly?

Looks all well and good, until you want to load from an array. Spot that @(R0, Rn) addressing form? Yeah, you’ll be seeing that one a lot (pretty much any time you access an array or a structure field which falls outside of a very limited displacement range). That 4-bit displacement form is frequently not even sufficient to access local variables on the stack. Similarly, because lots of operations need to go through R0, it is extremely prone to being stomped.

Similarly, MOV.B and MOV.W implement fewer forms, and fewer still for floats or doubles (the SH FPU is probably worthy of its own topic).

This means typically a lot more constant loads, explicit shifts, arithmetic ops, … as part of doing a memory load or store operation.

There are workarounds though:

SH-2A MOV I-Forms (were added later on with the SH-2A ISA):
MOV.L @(Rm, disp12), Rn  //Oh, amazing, a whole 12 bits for the displacement

BJX1 MOV I-Forms
MOV.L @(Rm, Ro, disp8s), Rn  //Address is computed as Rm+(Ro+disp)*4
...

And this gold nugget:
LEA.L @(Rm, Ro, disp4), Rn //Calculate address and store into Rn

Also:
MOV.Q  //Load/Store Quadword

These are 32-bit instructions, and are not entirely without drawbacks, but can generally “pay for themselves” in terms of simultaneously reducing code size and (potentially) helping with performance. This is mostly because, if you can displace 3 or 4 16-bit instructions with a single 32-bit instruction form, this is typically a savings.

What about all these instruction-forms for MOV.L and friends; won’t this lead to an absurdly complex load/store and address calculation machinery? Not really.

In Verilog, one can sort of just use the general case and special-case things, leading to a side effect: all of the memory addressing is internally done as @(Rm, Ro, disp) calculations. But how, when some operations lack a scaled index or a displacement, or the displacement is unscaled?

Partial solution: What can happen in the decoder:

MOV.L @Rm, Rn         => MOV.L @(Rm, ZZR, 0), Rn
MOV.L @Rm+, Rn        => MOV.L @(Rm, ZZR, 0), Rn; LEA.L @(Rm, ZZR, 1), Rm
MOV.L @(Rm, disp), Rn => MOV.L @(Rm, ZZR, disp), Rn
MOV.L @(R0, Rm), Rn   => MOV.L @(Rm, R0, 0), Rn
MOV.L @(GBR, Rm), Rn  => MOV.L @(GBR, ZZR, 0), Rn

With a few special notes:

  • ZZR is a special Null Register, and is treated here as if it were always 0.
  • ZZR may overlap with R15/SP, which may not be used as a scaled index.
  • R0 is not scaled if used as an index.
  • Some specifics here are left to the discretion of the implementation.
  • Operations inside the core use wider register numbering than in the ISA.

Basically seems workable.

Well, there was also 3-address integer arithmetic and similar, which is supported by the ISA, but as can be noted hasn’t had as significant an impact (generally, cases where two-address forms are applicable are more common, as very often arithmetic is either modifying a value, or the lifetime of one of the values ends just as the new value is being created).

Similarly, there doesn’t seem to be much reason to deviate from load/store, and if anything there is the added drawback of potentially needing to deal with read/modify/write to memory (which is avoided if operations take place in registers).

Why make post-increment / pre-decrement operations take 2 cycles? Well, main reason mostly has to do with an inability to (efficiently) write to multiple registers in a single cycle, which could potentially happen if both are done in a single operation. However, one can fairly safely read from 3 or 4 registers at a time.

The registers are internally split into multiple banks to avoid exceeding these limits.

  • A bank for the low halves of registers.
  • A bank for the high halves of registers.
  • More banks for system and control registers.

The handling of 32-bit registers is special in that, in BJX1, it is possible for some operations to address the high halves of 64-bit GPR registers. This is handled internally via some multiplexing magic. But, splitting them up allows the halves to be accessed in parallel without posing as much of an issue for a 3 read + 1 write limit.

For related reasons, the FPU register space is internally permutated, and structured similarly to the GPRs:

  • F0L-F16L: Low half of Double Registers
  • F0H-F16H: High half of Double Registers
  • F0L=FR1, F1L=XF1, F2L=FR3, F3L=XF3, …
  • F0H=FR0, F1H=XF0, F2H=FR2, F3H=XF2, …

So, in this scheme, Double is the natural value size, though this should be mostly invisible to code.

 

But, still to be seen if I can make a working core out of all this…

 

Trade-offs in Entropy Coding

Well, I suspect no one takes me all that seriously about compression, partly because I tend to use slightly unusual designs, and tend to focus more on speed than on ratio.

There are a variety of entropy coders, with different trade-offs:

  • Static Huffman
    • Basically the “gold standard” of entropy coders which use a normal bitstream.
    • Pros:
      • Has generally pretty fast encode/decode speeds.
      • Theoretically optimal for a bitstream.
    • Cons:
      • Requires transmitting / initializing / … a Huffman table (non-zero cost).
      • For chaotic streams (such as small messages over UDP), ensuring that the decoder has a matching table can become problematic.
      • More so the space cost of a Huffman table may be a problem for small messages, as can the constant overhead of initializing the tables.
  • Adaptive Huffman
    • Usually unfavorable as it tends to be very slow vs most other options.
    • Need to adjust trees per-symbol.
    • Does not favor the use of a lookup-table driven decoder.
  • Rice Coding
    • Useful, but largely forgotten.
    • Pros
      • Very little constant overhead.
      • Is easy to make adaptive, allowing a simple single-pass encoder.
      • Can achieve similar speeds to static Huffman.
    • Cons
      • Natively only matches certain distributions.
      • Absent length-limiting hacks, has the potential for excessively long symbols.
    • Tricky Hacks
      • Length limiting to avoid extremely long symbols.
      • Reorganizing symbols can allow for more Huffman-like behavior.
        • Dynamic symbol swapping can be fast, effective, and is adaptive.
        • Will generally call this category “Rice+STF”.
  • Bitwise Range Coding
    • Compresses good, but not particularly fast.
    • Pros:
      • Adaptive
      • Can beat Huffman variants regarding compression.
      • Still faster than Adaptive Huffman.
      • Can be similarly flexible as a raw bitstream.
    • Cons
      • Slower than Huffman and Rice
      • Slower than a raw bitstream (if used as a bitstream).
    • Tricky Hacks
      • Can be stuck on the back-end of a Huffman or Rice coder for improving compression (slightly) with less speed cost than using it directly.
      • Sadly, the compression gains of using it over a raw bitstream are frequently not high enough to make its speed costs worthwhile.
  • Symbol Oriented Range Coding
    • Compresses good, also not exactly fast.
    • Cons (vs Bitwise range coding)
      • Doesn’t really compress any better
      • Much more complex and often slower
      • Much less flexible about how it is used.
  • ANS and friends
    • A sort of magic unicorn I don’t really understand that well.
    • Pros
      • Speed can be more competitive with that of Static Huffman.
      • Compression is more like that of a Range Coder.
    • Cons
      • I personally can’t really make much sense of how it works.
      • Requires encoding symbols in reverse order.
      • Seems to have some restrictions similar to symbol-oriented range coders.
      • Its high speeds only seem to materialize on certain hardware in my past tests.

Relative Tradeoffs:

  • Between Huffman and Range Coding
    • Have only rarely seen ratio differences much over about 10%
    • Cases like high symbol skew (where range coding fares well) can usually be addressed more cheaply via other means.
    • A slowdown of typically 2-4x isn’t usually worth a 5-10% gain in compression.
  • Between Huffman and Rice+STF
    • Huffman generally compresses better and is faster than Rice+STF past a certain message size.
    • Rice+STF can allow for a simpler encoder/decoder though, and is faster for small payloads.
      • Its origins actually came about as a cheaper alternative to Deflate-style Huffman-coded Huffman tables.
      • In some cases, the space savings of not transmitting a Huffman table with the data may offset its less-than-optimal coding, putting it ahead.
  • Huffman or Rice + Bitwise Range Coder Backend
    • Can regain some of the compression advantages of Range coding.
    • Has less steep of an impact on codec speed
      • Given the range coder deals only with the already-compressed data.
    • A drawback is that now one has both the code complexity of a Huffman or Rice coder in addition to the Range coder, in addition to potentially other unique funkiness (such as interleaving multiple bitstreams to allow for multiple RC contexts, etc).
  • Not using any entropy coding
    • Typically the fastest option.
    • Typically the worst compression
      • Because, you know, raw bytes.
    • May still be a good option where speed is a lot more important than compression ratio.

 

It can be noted that most of my older designs had used static Huffman, but many of my newer ones have used Rice+STF variants. Generally the relative speed, simplicity, and flexibility has been enough to offset the arguably “bad” compression.

If the total ratio difference is maybe 20% between a “bad” Rice-coded backend, and a much slower range-coded backend, but the result is a 400% speed difference, is it worthwhile? Probably depends whether size or speed is more relevant.

Both may save significantly over the raw uncompressed data, which is usually where most of the gains lie. It often seems more sensible to invest a lot of the time/effort budget in areas which have a more direct effect on overall compression (such as any filtering/quantization/transform steps in image/video compression, or the effectiveness of the LZ matching and similar for LZ77 compressors).

For example, in an image codec, one may find it may well matter a lot more what type of DC predictor is used, or how deltas or coefficients are represented, than the specific choice of entropy coder back-end or similar.

Though, granted, I remain sort of an outcast here.