Author: bgbtechblog

Misc projects: Fast Compressors.

How to introduce this? These are the sort of things one works on; they serve a purpose, but don't get much attention.

FeLZ32

I have a design for a format I had called FeLZ32, which is basically "like LZ4 but decodes faster". How? It basically dispenses with bytes; the entire format is based around working with 32-bit DWORDs. So, the compressed data is DWORDs, and the output is also seen as DWORDs. The DWORDs are normally 32-bit aligned, with no fuss over trailing bytes.

Otherwise, FeLZ32 is sort of similar to LZ4: there is a Tag word, which encodes both the match (length/distance) as well as the number of raw/uncompressed DWORDs to copy to the output (before copying the matched symbols). The format of the Tag is fairly simple: [15:0]=distance; [22:16]=match length; [29:23]=raw length; [31:30]=alignment. There are special cases, namely for long runs of raw/unencoded symbols, and an end-of-data case (Creatively, encoded as 0x00000000).
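
For illustration, a rough sketch of what a decode loop for this sort of tag layout could look like in C (the long-run and alignment cases are glossed over, and the names are made up; this is not the reference decoder):

/* Hypothetical sketch of a FeLZ32-style decode loop; the field layout
   follows the tag description above, but the long-run and alignment
   cases are skipped (assumptions, not the actual format spec). */
#include <stdint.h>

uint32_t *felz32_decode_sketch(const uint32_t *src, uint32_t *dst)
{
    for (;;) {
        uint32_t tag = *src++;
        if (tag == 0)                        /* end-of-data */
            break;

        uint32_t dist = tag & 0xFFFF;        /* [15:0]  match distance (DWORDs) */
        uint32_t mlen = (tag >> 16) & 0x7F;  /* [22:16] match length (DWORDs)   */
        uint32_t rlen = (tag >> 23) & 0x7F;  /* [29:23] raw length (DWORDs)     */

        while (rlen--)                       /* copy raw DWORDs from the input  */
            *dst++ = *src++;

        const uint32_t *ms = dst - dist;     /* copy match from earlier output  */
        while (mlen--)
            *dst++ = *ms++;
    }
    return dst;
}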

( Sadly, some of the shifts are a little awkward, e.g., for SuperH assembler. Normal SH only provides immediate shifts of 1/2/8/16 bits, so X>>7 requires shifting 1 bit left and 8 bits right, or the use of SHLD. )

FeLZ32 wiki entry on GitHub.

So, basically, it is like LZ4, but in general doesn't compress quite as well (depends on content; for some things it fares a little better). In turn, LZ4 doesn't usually compress as well as Deflate, so same sort of thing. Each is generally faster, but at the expense of compression.

There are some types of content where it does better, but here are results for some random stuff I had lying around (not really ideal cases for FeLZ32, but alas):

(Image: felz_vs_lz4_1)

As noted, my PC (an AMD FX-8350 with 32GB of DDR3 1600) achieves a “general” memcpy speed of around 3.4 GB/s ( I previously had been using a Phenom II which peaked at ~ 1.8GB/s ).

As noted, FeLZ32 was typically getting decode speeds of around 2 to 3 GB/sec, which seems pretty solid. Encoding doesn’t reliably beat LZ4 though, which is sad ( encoding mostly consists of searching for LZ matches, so its speed is pretty dependent on how much “effort” the encoder puts into the search, and the efficiency of the matching strategy. )

I guess a possible “research path” could be a format which can alternate between an FeLZ32-like mode and a more LZ4 like mode.

 

BTIC4B (History)

There was also BTIC4B, which is a speed-oriented image/video codec, following along with my prior “BTIC family” designs.

For the most part, members of the BTIC family were focused on speed, and go back originally to a format I had called BTIC1C, which was basically a hacked version of the Apple RPZA format with a Deflate-based backend glued on (which I had also tuned for speed).

In this format, blocks consisted of a pair of colors in RGB555 format, followed by 32 bits of pixel data representing a 4×4 block (at 2 bits/pixel). Anyone who has looked at DXT1 probably gets the idea (but, this happened roughly 5 years earlier). The MSB of the color endpoints was used to invoke other magic, namely skipping runs of blocks, or doing runs of a flat color.

There were a few failed attempts to replace it: BTIC1D, which used a conceptually similar design but was YUV420, and had unacceptably poor decode speeds combined with lackluster compression. There was BTIC1F, which added the concept of using a YUV color-vector in place of discrete endpoints, but ran into awkward design issues.

The next semi-successful member was BTIC1G, which was basically a stripped down design intended mostly for some robots I was building. This had a different set of requirements, namely a fast/simple encoder which could work incrementally and stream the video as small spans of blocks (via UDP datagrams). This version was basically straight bit-twiddly.

With BTIC1G, I had also discovered a way to more quickly classify and encode block formats, namely measuring some properties of the pixel block and feeding them through some weighting and a weighted decision tree (in some contexts these things are passed off as "neural nets"). This allowed determining in real-time what the most likely ideal representation was for a block. ( Some parts of the design process may have also been assisted using genetic algorithms as well… ).

Then later, started work on a design I called BTIC1H, which was originally intended for the same use-case as 1G, and basically replaced the bit-twiddly blocks with Adaptive Rice Coding (and a trick I had called SMTF; which can mimic the behavior of Huffman coding but without needing Huffman tables or the severe performance overheads associated with Adaptive Huffman and Arithmetic Coding). It had also reintroduced some concepts from 1D and 1F, namely the use of (now optional) YUV420 blocks, and the use of delta-coded color vectors.

Elsewhere, I had begun work on a format called BTIC4A, which was focused initially on fixed-size blocks and packing pixels into blocks, in a manner conceptually similar to BC7, just with 8×8 pixel blocks in either 256-bit, 384-bit, or 512-bit forms.

The 4A effort was quickly mutated into BTIC4B, which combined 4A’s block formats with a bitstream format more or less copy/pasted from 1H. Most of the pixel-data is copied as-is into the bitstream. Decoding consists then of reconstructing the blocks from the bitstream, and handing them to a block-decoder.

As a side effect of the bigger 8×8 blocks though, there was only about 1/4 as much overhead from things like entropy coding, allowing it to go faster, and there was also an improvement in terms of compression due to the reduction of side-data for each block. Many blocks use subsampled chroma, ranging from 4:2:0 up to giving the whole 8×8 block a single chroma value (only luma varying).

Likewise, vectors were labeled as (Y, U, V, Dy, Du, Dv), with Y/U/V as a central color, and Dy/Du/Dv being the range relative to the center. Dy is always positive, but Du and Dv may be negative. Unused components may be skipped, and the vector deltas may be based on a difference from a vector predicted via a Paeth predictor (similar to PNG). The vector is then transformed (when decoded) into a pair of color endpoints (in YUV space).
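
As a rough sketch of how such a vector might expand back into endpoints (the clamping and exact component ranges here are assumptions, not the actual BTIC4B math):

/* Sketch: expand a (Y,U,V, Dy,Du,Dv) color vector into two YUV endpoints
   as center +/- range. Ranges and clamping are illustrative assumptions. */
typedef struct { int y, u, v; } yuv_t;

static int clamp255(int x) { return x < 0 ? 0 : (x > 255 ? 255 : x); }

static void vec_to_endpoints(int y, int u, int v, int dy, int du, int dv,
                             yuv_t *epA, yuv_t *epB)
{
    epA->y = clamp255(y - dy);  epB->y = clamp255(y + dy);   /* Dy >= 0   */
    epA->u = clamp255(u - du);  epB->u = clamp255(u + du);   /* Du,Dv may */
    epA->v = clamp255(v - dv);  epB->v = clamp255(v + dv);   /* be signed */
}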

In chroma subsampled blocks, the U/V components are interpolated to form the U/V for a given pixel sub-block, which is then unpacked into possible pixels based on Y; the per-pixel index then selects among these. In non-subsampled blocks, the endpoints are converted into colors and used directly (thus the index may also determine the chroma value). Though, some blocks may ignore Du/Dv, with interpolation only varying along the Y axis.

Things are pretty variable, but will note that on my AMD-FX I am often seeing decode speeds in the area of 400-600 megapixels per second for a single threaded decoder, which is plenty fast to be usable. With multiple threads, gigapixel speeds are possible (however, per-thread speed drops sharply as the number of threads increases). I have a Dual Xeon E5410 box (a discarded rack server) which, while initially slower, scales a bit better (I have seen decode speeds of around 4 gigapixels/second on this machine).

But, at its core, the "transform" ultimately isn't that much different from RPZA and CRAM; we often have 1 or 2 bits per pixel (though 3 or 4 are possible). There are variable block-sizes (Y subsampling), and various levels of chroma-subsampling.

The Y block has various sizes (none; 2×2, 4×2, 2×4, 4×4, 8×4, 4×8, 8×8) which map to the logical 8×8 output block, and the UV chroma blocks have sizes (none; 2×2, 2×4, 4×4, or 4×8); these in turn come in certain combinations, with various combinations of pixel bit-depths (to fit within a size budget).

No real fancy math, just good ol’ bit twiddly.

 

AdRice and SMTF

AdRice was another nifty trick I ran across. In traditional Rice Coding, the 'k' value is determined up-front, so two passes are typically used to encode: one pass determines the k values, the other does the actual encoding.

In AdRice, a nifty trick is used, namely, if the Q value (length of the unary prefix) is 0, then k is made smaller; if 1 it remains the same; and if >1 it gets bigger. Starting k is then ‘something sane’, and the rest takes care of itself.
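
In rough C terms, a decode step could look something like this (the bit-reader helpers, the k bounds, and the unary-prefix convention are assumptions; the adaptation rule is as described above):

/* Sketch of an adaptive Rice decode step. read_bit()/read_bits() are
   assumed bitstream helpers; K_MIN/K_MAX bound the adaptation. */
extern int read_bit(void);
extern int read_bits(int n);

enum { K_MIN = 0, K_MAX = 12 };

static int adrice_decode(int *k)
{
    int q = 0;
    while (read_bit())                     /* unary prefix (Q) */
        q++;
    int val = (q << *k) | read_bits(*k);   /* quotient and k-bit remainder */

    if (q == 0)      { if (*k > K_MIN) (*k)--; }  /* shrink k */
    else if (q > 1)  { if (*k < K_MAX) (*k)++; }  /* grow k   */
    return val;
}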

To avoid epic slowness, it is possible to use a lookup table, where both the coded symbol and new k value are held (both the ‘k’ and matched bit pattern are used as the table-index). The range of ‘k’ can also be constrained, and a length-limiting escape case can be used, such that the whole process can be driven by lookup tables (as in a typical Huffman coder).

Unlike Huffman, AdRice has the advantage of allowing encoding/decoding without needing to drag around a Huffman table (potentially a big cost with small payloads; and cases where data may be encoded/decoded in pieces or in no particular order).

Though, Rice may have an obvious limitation: Normally it only works effectively at encoding values following a geometric distribution.

A possible few workarounds exist:

  • SMTF: If symbol is small, swap towards front, else rotate space and swap the symbol to the new front (the symbol that “falls off the end” goes into the “hole” left previously).
  • STF: Swap towards front (if N>0: swap N with N-1).
  • STF2: Swap towards front, but at a fixed ratio (eg: swap N with 7*N/8).

The exact swapping strategy varies, and what is ideal depends on the data. But these basically tend to put symbols in order of descending probability (and thus make the Rice coding happy), and aren't too terribly expensive (SMTF is usually implemented via masking a rover+index into a buffer; rotating the space consists of decrementing the rover).
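
A minimal sketch of such an SMTF update (table size, the "small index" threshold, and naming are made-up values for illustration):

/* Sketch of an SMTF update over a circular symbol table. The table size
   (256) and near-threshold (16) are assumptions for illustration. */
#define SMTF_MASK  255
#define SMTF_NEAR  16

static unsigned char smtf_tab[SMTF_MASK + 1];
static unsigned      smtf_rover;

static int smtf_fetch(int idx)
{
    int sym = smtf_tab[(smtf_rover + idx) & SMTF_MASK];

    if (idx < SMTF_NEAR) {
        if (idx > 0) {                       /* small index: swap towards front */
            int j0 = (smtf_rover + idx)     & SMTF_MASK;
            int j1 = (smtf_rover + idx - 1) & SMTF_MASK;
            smtf_tab[j0] = smtf_tab[j1];
            smtf_tab[j1] = (unsigned char)sym;
        }
    } else {                                 /* large index: rotate, swap to new front */
        smtf_rover = (smtf_rover - 1) & SMTF_MASK;
        int jf = smtf_rover & SMTF_MASK;              /* new front slot             */
        int jh = (smtf_rover + idx + 1) & SMTF_MASK;  /* hole where the symbol was  */
        smtf_tab[jh] = smtf_tab[jf];                  /* "fell off the end" symbol  */
        smtf_tab[jf] = (unsigned char)sym;
    }
    return sym;
}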

STF tends to result in a closest-to-optimal distribution, but suffers from the problem of adapting slowly; STF2 is faster adapting than STF but prone to a limitation that alternating patterns of symbols may get “stuck” swapping with each other (rather than advancing towards the front); …

The combination of AdRice and SMTF, while not quite as fast or as good as Static Huffman, generally works pretty decently as a substitute when an adaptive coding is needed or when payloads are small enough that the cost of passing around and initializing the Huffman tables outweighs the costs of AdRice and SMTF.

 


Retroactive: Doing It C Style.

Probably a bit of an unpopular stance, but I mostly prefer C, and even then, I tend towards a subset of C. It isn't that I am afraid of C++, like it is some paradoxical and incomprehensible thing, but rather, what it has to offer doesn't seem to offset the costs of using it, and when using C++, a lot of its often-touted selling points come with pretty serious drawbacks (such as producing bloated executable files or taking absurdly longer to compile than "just doing it the C way").

So, some advantages of C or a C-like approach to C++:

  • Faster compile times.
  • Typically smaller binaries
  • (Typically) better performance (excluding doing something stupid).
  • (C++) Freedom to pick and choose which features are actually relevant to the task at hand (avoiding “using features because they are there” or “‘Modern C++’ style demands doing this thing in this way”).

And, some advantages of plain C:

  • Doesn’t depend on having a C++ compiler (more relevant to small targets than mainstream ones).
  • The language is more within reach for a “mere mortal” to write a parser or compiler for it.
  • The ABI is more standardized (less subject to the whim of the compiler or which version of the compiler one is using, and with far fewer edge cases).
  • The code is frequently more reusable, as it can be used in both C and C++ projects.

One common criticism against C is its lack of "OO", but this seems to mistake a lot of superficial language details for the underlying concepts. A class in C++ isn't that much functionally different from a struct in C, and most things that can be done with a C++ class can also be done, without all that much more effort or code, in C using structs. It is more a case of there being none of the syntactic sugar, and thus the code looks different / less pretty ("foo->Bar()" becoming one of "FooLib_FooClass_Bar(foo)", "foo->vt->Bar(foo)", or "(*foo)->Bar(foo)"). There is no operator overloading, so one uses function calls, and no templates, so one either uses macros or writes the code inline (this being itself a tradeoff; this sort of thing is harder and more limited, but on the other hand, even in C++, overuse of templates is one of the major factors leading to bloat and slow compile times).
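
For what it's worth, a generic sketch of the "foo->vt->Bar(foo)" flavor (not any particular library's API, just the usual struct-plus-vtable pattern):

/* Sketch of "OO in plain C": a struct plus a vtable of function pointers. */
#include <stdio.h>

typedef struct FooClass_s FooClass;

typedef struct {
    void (*Bar)(FooClass *self);
} FooClass_vt;

struct FooClass_s {
    const FooClass_vt *vt;
    int x;
};

static void FooClass_Bar_impl(FooClass *self)
    { printf("Bar: x=%d\n", self->x); }

static const FooClass_vt fooclass_vt = { FooClass_Bar_impl };

int main(void)
{
    FooClass foo = { &fooclass_vt, 42 };
    foo.vt->Bar(&foo);          /* the C spelling of foo->Bar() */
    return 0;
}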

A difference though is in terms of problem solving, as C does a lot less hand-holding, and even where it does, the things provided by the standard library may not be the best way to do them: C string functions can get slow, but frequently that is a sign of "you aren't doing it right". Functions like "qsort()" are lame, but it is usually more of a tradeoff between "simple but slow" (ex: half-assing a bubble sort) and doing something "fast but more effort" (writing out a specialized sort function, or finding a non-sort solution); "std::sort" may well give a better sort than "qsort()", but OTOH, expecting qsort to be the end-all solution to sorting in C is sort-of missing the point.

Actually, if one gets "serious" in C, they may well come to realize that even a fair portion of the C library is ancillary anyways (ex: the parts which aren't either wrappers over host OS facilities or builtin functions typically handled by the compiler itself).

Similarly, small embedded targets may not have a lot of the graces that come with a hosted environment on a relatively fast and powerful CPU. On most higher-end 32-bit architectures, things are usually pretty sane (apart from the smaller amount of memory available).

However, on 8 and 16 bit targets, a lot of this goes out the window, and being limited to C is probably the lesser of the issues. For example, one might have to deal with issues such as:

  • Lack of a hardware multiplier, making multiply expensive (a software shift-add multiply is sketched after this list).
  • Lack of the ability to divide by a non-constant value.
    • Because there is neither a hardware divider nor the resources to fake it in software.
  • Shifts may also be done in software.
    • Adding a value to itself is basically a left shift, right?…
  • You only have 16-bit integer types, but potentially 20 or 24 bit addresses.
    • And potentially only the ability to add positive 16b displacements to pointers.
    • Say, other ops require composing/decomposing pointers with macros.
  • Want FPU? Too bad, not gonna happen.
  • Here is to hoping the C dialect itself doesn’t have funky limitations.
    • Say, only 1D arrays, or an inability to directly copy structs by value, …
    • Or, if it only accepts K&R declaration syntax.
      • Though, as of 2017, I feel these are pretty solidly dead now.
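
Regarding the "no hardware multiplier" point above, a software multiply in this situation is typically the classic shift-and-add loop, something like (an illustrative sketch, not code for any particular target):

/* Shift-and-add multiply sketch for a target with no hardware multiplier:
   16x16 -> 32 bit, adding the shifted multiplicand for each set bit. */
#include <stdint.h>

uint32_t mul16x16(uint16_t a, uint16_t b)
{
    uint32_t acc = 0, av = a;
    while (b) {
        if (b & 1)
            acc += av;      /* add current shifted copy of 'a' */
        av <<= 1;           /* "adding a value to itself" == shift left */
        b >>= 1;
    }
    return acc;
}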

 

2bpp Framebuffer and NTSC Composite

So, status (besides class):
Had recently started messing with Verilog (for if/when I can eventually get an FPGA, still limited mostly to simulation), and have thus far written a partial implementation of a CPU core implementing my BJX1 ISA in it (an extended 64-bit version of the SuperH ISA).

For part of it, was also working on a display subsystem (intending to work from a small framebuffer in SRAM and then produce composite output).

Was able to experimentally get passable image quality from around 2 bits/pixel via a color-cell scheme. This has pretty comparable memory use to CGA in 4-color mode; but with higher image quality.

One sub-mode uses a pair of color endpoints from a 64-color RGB colorspace. Image quality isn't great, but is reasonably passable. Here, each R/G/B component has 4 brightness levels, with an ability to get intermediates, sort-of, by dithering.

Another sub-mode uses YUVD colors encoded in a way similar to ADPCM. Basically, there is a YUV centroid with D=Ymax-Ymin (the difference in Y between the brighter and darker value). They are stored as a delta from the prior vector, with 3 bits/component encoding the delta, which may dynamically adapt to fit the range of the component values. There is an escape case to allow colors which fall too far outside the current range to be encoded as an absolute YUVD centroid; or alternatively as a pair of 64-color RGB endpoints.

The reason for this is mostly to be able to fit the screen buffer into SRAM in an FPGA, and not "eat it" with an overly large framebuffer. As-is, 320×200 would require 16kB, and 640×400 would require 64kB.

This is mostly because most of the rest of the SRAM (around 200kB in the FPGAs I was looking at) is needed for other purposes; like a CPU cache, … Alternatives would require either a mechanism to stream in a framebuffer from external DRAM, or require generating interrupts to have the CPU fill in scanlines.

 

Image of a Dog (Original, apart from scaling to 640×480):

(Image: DE3D1_org0)

 

Image of a dog, but encoded at 2 bits/pixel:

(Image: DE3D1_2bpp1)

It doesn’t look great, but you can’t expect too much from 2-bits / pixel.

If I added a full 64kB (to support 640×400) I could possibly have a higher quality sub-mode for 320×200 display.

Another possibility is to have 32kB of VRAM, which would allow a 2bpp 640×200 mode, or a higher quality 4bpp 320×200 mode (probably with either a pair of RGB555 endpoints or a YUVDyuv scheme).

Less certain ATM is whether this will support a text submode; the specifics are unclear (dedicated text-cells, or drawing the text cells via color-cell graphics, with each glyph drawn as a 2×2 grid of cells).

 

This would feed into the next part:
I am considering doing the video output mostly as composite (ex: NTSC over an RCA cable), and so was messing around with trying to modulate and demodulate NTSC in software (and maybe later do a version of the modulator in Verilog; ADD/EDIT: have done a modulator in Verilog).

 

Dog Image (of experimental NTSC output, done via 100MHz PWM):

(Image: DE3D1_ntsc_clr2)

In this test, the NTSC is done via PWM (or DPWM); here a (simulated) output pin is turned on and off on a 100MHz clock in a pattern that approximates an analog signal. This produces output at a rate of ~100Mbps, and thus a frame of video requires around 417kB (ironically, smaller than most of the output images saved as PNG).

For actual output, this would be fed into a voltage divider and capacitor to turn this into a “true” analog signal (would need to reduce from 3.3V to 1.0V and also smooth out a lot of the high-frequency noise). This is pretty similar to how audio output is done on most modern computers, as DACs are more expensive and people don’t really hear the difference; though using a DAC would only need a signal at around 14MHz rather than 100MHz (and higher video quality would similarly require a higher PWM frequency). This frequency is nowhere near high enough to be able to directly produce VHF modulated output though (would need to PWM the pin at around 240-300MHz for this).

Color makes things a little more complicated, and a lot of the effort in implementing the NTSC demodulator was mostly going into trying to make the QAM demodulation work. Color involves basically magic of making ~3 analog signals coexist in the same place at the same time, with the components mostly not interfering with each other.

The Y component is transmitted via the signal’s DC bias, and the UV/IQ components via phase angles (I at 0 degrees and Q at 90 degrees); with a non-modulated 3.579545MHz “colorburst” signal being sent every scanline to help align the decoder with the encoder’s signal. These are all handled together as a single combined PWM signal. Here, YUV can be fudged into being YIQ (by slightly rotating the signal), so it basically works.
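
In back-of-the-envelope terms, generating one composite sample then boils down to something like the following (gains, setup levels, sync, and the actual burst insertion are all omitted; this is only the "Y plus modulated chroma" part):

/* Sketch: luma plus chroma modulated onto the ~3.579545MHz subcarrier.
   Levels/gains are unspecified here; purely illustrative. */
#include <math.h>

static double composite_sample(double y, double i, double q, double t)
{
    double w = 2.0 * 3.14159265358979 * 3579545.0;   /* subcarrier, rad/s       */
    return y + i * cos(w * t) + q * sin(w * t);      /* I at 0 deg, Q at 90 deg */
}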

In the case of the image, the decoder was given a copy of the PWM’ed output bitstream, and it recreates an output image from this. In an actual hardware implementation, this role would be served by plugging the cable into a TV or some other similar display with support for a composite input. Though, I would need some sort of decoder to test a Verilog simulation of the display subsystem.

Though, I don’t know yet if it will work with an actual TV, as information on NTSC specifics is a bit fragmentary and it was annoyingly difficult to piece together via web-searches. So, it is possible I missed something important and the generated signal isn’t actually compatible with real TVs, who knows?… ( Could maybe consider writing a basic spec on this. )

Why not VGA? Fewer wires, mostly. Though, on the modulation side VGA (and component) would be both simpler and probably produce higher output quality than composite (basically by not needing to modulate multiple signals together). VGA would require wiring up a VGA connector though (and a lot more filters); Component would have the drawback of being less commonly supported by low-cost displays.

 

Other Thoughts:

  • Noticed after posting this, that I could improve image quality by 8-bit aligning the PWM bits for the NTSC test, likely because the decoder samples 8 or 16 bits at a time (shouldn’t matter for a real TV).
  • Though more complex, a hybrid DAC design (say, with 2 or 3 high-order bits and the low-order bits as PWM) could likely be able to give better video quality. Though, this would require more complex electronics.
  • Note that the colorburst clock doesn't align sufficiently closely with an integer fraction of 100MHz; as a workaround, it is possible to accumulate 150137 (once per clock cycle) and use bits (21:17) of the accumulator as a clock phase (3.579545/100*2^22, which approximates the ratio of the frequencies relative to a power of 2; the choice of 2^22 in this case being based mostly on "number magic").
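
A tiny sketch of that accumulator trick (assuming a 100MHz sample clock and a 5-bit phase, per the numbers above):

/* Phase accumulator: add 150137 per 100MHz clock, take bits [21:17] as a
   32-step subcarrier phase (150137 ~= 3.579545/100 * 2^22). */
#include <stdint.h>

static uint32_t cb_acc;

static unsigned cb_phase_step(void)
{
    cb_acc += 150137;
    return (cb_acc >> 17) & 31;   /* bits 21:17 -> phase 0..31 */
}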

But, yeah, work in progress.

 

ADD: Experimental 4-bit PCM

Dog Image, NTSC 100MHz 4-bit PCM w/ dither and rotating colorburst (45 degrees every half-frame), and an average over several frames.

(Image: DE3D1_ntsc_clr3)

By itself, the 4bpp PCM does give considerably better results than the PWM (much less noise), though even so there was still a slight ripple pattern apparently due to the colorburst signal (the slightly noticeable diagonal lines in the original NTSC image, but more obvious here than with the PWM due to the lower overall noise).

The signal seems to align so that nominally the carrier is at the same phase in the same places in the frame. A workaround seems to be to average the past two frames and rotate the signal by 45 degrees per half-frame, so that for two consecutive versions of a scanline the carrier is out of phase by 90 degrees, and the ripples mostly cancel out. In this case, the image is encoded via multiple consecutive frames.

Though, it also appears that rotating the carrier somehow mostly defeats the effects of CGA style artifact colors (instead, these tests mostly lose saturation).

I am not entirely sure how actual TVs would respond to a rotating colorburst, but at least theoretically the TV will be able to realign with the colorburst each frame.

 

Forgot about this…

So, what happened? I wrote a few posts, then for a while forgot I had set up a blog.

So what has been going on?

Class, basically doing Machining stuff, and learning CNC stuff, because that is the sort of job that "actually exists" around where I live.

 

Spent a good number of months working on an emulator and a C compiler targeting a variation of the SuperH ISA. Started fiddling with doing an implementation of the ISA in Verilog. Not like it is "actually useful" or anything. Working on stuff like this, one becomes aware both of the limitations of the ISA, and, doing stuff in Verilog, of some of the challenges which exist for trying to implement an ISA in hardware, sorta (this exists in a gray area between code and hardware, writing stuff in a language which can nominally run on an FPGA, or theoretically be turned into a real chip if enough money is thrown at the problem).

 

So, what is Verilog:

It is sorta like C, if you were doing everything using branch-free logic, with a few control-flow constructs thrown in the mix; and a convenient bit-vector notation (counterbalanced by an ugly syntax for declaring numeric values; begin/end keywords; …). It is also in some ways behaviorally sort-of like ladder logic, though at least thankfully, much less terrible (the way combinatorial blocks behave is reminiscent of a scan in a ladder program; though luckily we are spared from everything being presented via the abstraction of relays switching things on and off between a pair of power rails).

One thing that becomes more obvious here is the distinction between division and most other arithmetic operators: Most operators can be implemented fairly easily; but division is one of those operators that generally requires an iterative implementation in both the integer and FPU case.

Integer division is possible in one of several ways, but not going to go into them.

FPU division is at least a little easier, in that one can hack together a passable approximation for relatively cheap (mostly using subtraction and shifts or similar). But, then a problem comes up if one wants "accurate" division. In a software FPU, one can be like, "well, I will just use a fixed-point integer divide and some bit-twiddly". Here, not so much (if one has any concern for gate budget).

There is always Newton-Raphson, but this also costs gates (for a probably one-off special purpose FMA unit), and one would still basically need to follow an initial approximation with some number of “refinement” operations.
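
For reference, the usual Newton-Raphson refinement for a reciprocal looks something like this (each step roughly doubles the number of correct bits, given a decent starting approximation):

/* Newton-Raphson reciprocal refinement: x' = x*(2 - d*x).
   x0 is an initial approximation of 1/d (from a table or bit hack). */
static float nr_divide(float n, float d, float x0)
{
    float x = x0;
    x = x * (2.0f - d * x);   /* refinement step 1 */
    x = x * (2.0f - d * x);   /* refinement step 2 */
    return n * x;             /* n/d ~= n * (1/d)  */
}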

The other alternatives are having a floating-point divide operator which has lackluster precision, or is comparatively slow (if emulated in firmware).

 

Interestingly, one feature I got backlash over, namely the addition of x86-like memory addressing, was comparatively cheap and simple to implement, and has the advantage of somewhat reducing the length of instruction sequences when working with structures, arrays, or when doing pointer arithmetic.

Basically, in SuperH, memory addressing ops were like this:

MOV.B  //Load or Store a Byte
MOV.W  //Load or Store a Word (16-bit)
MOV.L  //Load or Store an Integer/Long (32-bit)
FMOV.S //Load/Store a Float (or half of a Double)

With several types of addressing available (basic/common forms):
MOV.L @Rm, Rn  //Load from Rm into Rn
MOV.L @Rm+, Rn //Load from Rm into Rn, with a post-increment
MOV.L @(Rm, disp4), Rn //Load from Rm with 4 bit displacement
MOV.L @(R0, Rm), Rn //Load from Rm indexed by R0
...
MOV.L Rm, @-Rn //Store into Rn with pre-decrement
...

A useful special case:
MOV.L @(PC,disp8), Rn  //Load a constant

And, some not so useful special cases:
MOV.L @(GBR, disp8), R0 //What do we use this one for, exactly?

Looks all well and good, until you want to load from an array. Spot that @(R0, Rm) addressing form? Yeah, you'll be seeing that one a lot (pretty much any time you access an array or a structure field which falls outside of a very limited displacement range). That 4-bit displacement form is frequently not even sufficient to access local variables on the stack. Similarly, because lots of operations need to go through R0, it is extremely prone to being stomped.

Similarly, MOV.B and MOV.W implement fewer forms, and fewer still for floats or doubles (the SH FPU is probably worthy of its own topic).

This means typically a lot more constant loads, explicit shifts, arithmetic ops, … as part of doing a memory load or store operation.

There are workarounds though:

SH-2A MOV I-Forms (were added later on with the SH-2A ISA):
MOV.L @(Rm, disp12), Rn  //Oh, amazing, a whole 12 bits for the displacement

BJX1 MOV I-Forms
MOV.L @(Rm, Ro, disp8s), Rn  //Address is computed as Rm+(Ro+disp)*4
...

And this gold nugget:
LEA.L @(Rm, Ro, disp4), Rn //Calculate address and store into Rn

Also:
MOV.Q  //Load/Store Quadword

These are 32-bit instructions, and are not entirely without drawbacks, but can generally "pay for themselves" in terms of simultaneously reducing code size and (potentially) helping with performance. This is mostly because, if you can replace 3 or 4 16-bit instructions with a single 32-bit instruction form, this is typically a savings.

What about all these instruction-forms for MOV.L and friends; won't this lead to absurdly complex load/store and address calculation machinery? Not really.

In Verilog, one can sort of just use the general case and special-case things, leading to a side effect: all of the memory addressing is done internally as @(Rm, Ro, disp) calculations. But how, when some operations lack a scaled index, lack a displacement, or have an unscaled displacement?

Partial solution: What can happen in the decoder:

MOV.L @Rm, Rn         => MOV.L @(Rm, ZZR, 0), Rn
MOV.L @Rm+, Rn        => MOV.L @(Rm, ZZR, 0), Rn; LEA.L @(Rm, ZZR, 1), Rm
MOV.L @(Rm, disp), Rn => MOV.L @(Rm, ZZR, disp), Rn
MOV.L @(R0, Rm), Rn   => MOV.L @(Rm, R0, 0), Rn
MOV.L @(GBR, Rm), Rn  => MOV.L @(GBR, ZZR, 0), Rn

With a few special notes:

  • ZZR is a special Null Register, and is treated here as if it were always 0.
  • ZZR may overlap with R15/SP, which may not be used as a scaled index.
  • R0 is not scaled if used as an index.
  • Some specifics here are left to the discretion of the implementation.
  • Operations inside the core use wider register numbering than in the ISA.

Basically seems workable.

Well, there was also 3-address integer arithmetic and similar, which is supported by the ISA, but as can be noted hasn't had as significant an impact (generally, cases where two-address forms are applicable are more common, as very often arithmetic is either modifying a value, or the lifetime of one of the values ends just as the new value is being created).

Similarly, there doesn’t seem to be much reason to deviate from load/store, and if anything there is the added drawback of potentially needing to deal with read/modify/write to memory (which is avoided if operations take place in registers).

Why make post-increment / pre-decrement operations take 2 cycles? Well, the main reason mostly has to do with an inability to (efficiently) write to multiple registers in a single cycle, which could potentially happen if both are done in a single operation. However, one can fairly safely read from 3 or 4 registers at a time.

The registers are internally split into multiple banks to avoid exceeding these limits.

  • A bank for the low halves of registers.
  • A bank for the high halves of registers.
  • More banks for system and control registers.

The handling of 32-bit registers is special in that, in BJX1, it is possible for some operations to address the high halves of 64-bit GPR registers. This is handled internally via some multiplexing magic. But, splitting them up allows the halves to be accessed in parallel without posing as much of an issue for a 3 read + 1 write limit.

For related reasons, the FPU register space is internally permutated, and structured similarly to the GPRs:

  • F0L-F15L: Low half of Double Registers
  • F0H-F15H: High half of Double Registers
  • F0L=FR1, F1L=XF1, F2L=FR3, F3L=XF3, …
  • F0H=FR0, F1H=XF0, F2H=FR2, F3H=XF2, …

So, in this scheme, Double is the natural value size, though this should be mostly invisible to code.

 

But, still to be seen if I can make a working core out of all this…

 

Trade-offs in Entropy Coding

Well, I suspect no one takes me all that seriously about compression, partly because I tend to use slightly unusual designs, and tend to focus more on speed than on ratio.

There are a variety of entropy coders, with different trade-offs:

  • Static Huffman
    • Basically the “gold standard” of entropy coders which use a normal bitstream.
    • Pros:
      • Has generally pretty fast encode/decode speeds.
      • Theoretically optimal for a bitstream.
    • Cons:
      • Requires transmitting / initializing / … a Huffman table (non-zero cost).
      • For chaotic streams (such as small messages over UDP), ensuring that the decoder has a matching table can become problematic.
      • More so the space cost of a Huffman table may be a problem for small messages, as can the constant overhead of initializing the tables.
  • Adaptive Huffman
    • Usually unfavorable as it tends to be very slow vs most other options.
    • Need to adjust trees per-symbol.
    • Does not favor the use of a lookup-table driven decoder.
  • Rice Coding
    • Useful, but largely forgotten.
    • Pros
      • Very little constant overhead.
      • Is easy to make adaptive, allowing a simple single-pass encoder.
      • Can achieve similar speeds to static Huffman.
    • Cons
      • Natively only matches certain distributions.
      • Absent length-limiting hacks, has the potential for excessively long symbols.
    • Tricky Hacks
      • Length limiting to avoid extremely long symbols.
      • Reorganizing symbols can allow for more Huffman-like behavior.
        • Dynamic symbol swapping can be fast, effective, and is adaptive.
        • Will generally call this category "Rice+STF".
  • Bitwise Range Coding
    • Compresses well, but not particularly fast.
    • Pros:
      • Adaptive
      • Can beat Huffman variants regarding compression.
      • Still faster than Adaptive Huffman.
      • Can be similarly flexible as a raw bitstream.
    • Cons
      • Slower than Huffman and Rice
      • Slower than a raw bitstream (if used as a bitstream).
    • Tricky Hacks
      • Can be stuck on the back-end of a Huffman or Rice coder for improving compression (slightly) with less speed cost than using it directly.
      • Sadly, the compression gains of using it over a raw bitstream are frequently not high enough to make its speed costs worthwhile.
  • Symbol Oriented Range Coding
    • Compresses well, also not exactly fast.
    • Cons (vs Bitwise range coding)
      • Doesn’t really compress any better
      • Much more complex and often slower
      • Much less flexible about how it is used.
  • ANS and friends
    • A sort of magic unicorn I don’t really understand that well.
    • Pros
      • Speed can be more competitive with that of Static Huffman.
      • Compression is more like that of a Range Coder.
    • Cons
      • I personally can’t really make much sense of how it works.
      • Requires encoding symbols in reverse order.
      • Seems to have some restrictions similar to symbol-oriented range coders.
      • Its high speeds only seem to materialize on certain hardware in my past tests.

Relative Tradeoffs:

  • Between Huffman and Range Coding
    • Have only rarely seen ratio differences much over about 10%
    • Cases like high symbol skew (where range coding fares well) can usually be addressed more cheaply via other means.
    • A slowdown of typically 2-4x isn’t usually worth a 5-10% gain in compression.
  • Between Huffman and Rice+STF
    • Huffman generally compresses better and is faster than Rice+STF past a certain message size.
    • Rice+STF can allow for a simpler encoder/decoder though, and is faster for small payloads.
      • Its origins actually came about as a cheaper alternative to Deflate-style Huffman-coded Huffman tables.
      • In some cases, the space savings of not transmitting a Huffman table with the data may offset its less-than-optimal coding, putting it ahead.
  • Huffman or Rice + Bitwise Range Coder Backend
    • Can regain some of the compression advantages of Range coding.
    • Has less steep of an impact on codec speed
      • Given the range coder deals only with the already-compressed data.
    • A drawback is that now one has both the code complexity of a Huffman or Rice coder in addition to the Range coder, in addition to potentially other unique funkiness (such as interleaving multiple bitstreams to allow for multiple RC contexts, etc).
  • Not using any entropy coding
    • Typically the fastest option.
    • Typically the worst compression
      • Because, you know, raw bytes.
    • May still be a good option where speed is a lot more important than compression ratio.

 

It can be noted that most of my older designs had used static Huffman, but many of my newer ones have used Rice+STF variants. Generally the relative speed, simplicity, and flexibility has been enough to offset the arguably “bad” compression.

If the total ratio difference is maybe 20% between a "bad" Rice-coded backend and a much slower range-coded backend, but the result is a 400% speed difference, is it worthwhile? Probably depends on whether size or speed is more relevant.

Both may save significantly over the raw uncompressed data, which is usually where most of the gains lie. It often seems more sensible to invest a lot of the time/effort budget in areas which have a more direct effect on overall compression (such as any filtering/quantization/transform steps in image/video compression, or the effectiveness of the LZ matching and similar for LZ77 compressors).

For example, in an image codec, one may find it may well matter a lot more what type of DC predictor is used, or how deltas or coefficients are represented, than the specific choice of entropy coder back-end or similar.

Though, granted, I remain sort of an outcast here.

 

Brief History of BGBScript

Earlier Efforts

Well, once I was young, and early on spent a lot of time messing around with Quake. Quake had a scripting language, known as QuakeC, which was fairly limited but made it a lot easier to tweak things in-game, at least compared with the use of C for scripting in Quake II.

Eventually, I was working on more of my own projects, and came across a language known as Lisp. I had tried using some of the existing scripting VMs I could find, which IIRC was a standalone variant of Emacs Lisp, and the SCM / Guile implementations of Scheme.

After running into frustrations with these VMs, and noting the general simplicity of the language, I decided to do my own VM. It wasn't very good, having relied on a sort of modified S-Expressions for interpretation; later I wrote a backend which spit out C. This project, however, turned into a mess, and Scheme's syntax and language design wasn't particularly usable, so I later ended up abandoning it.

At first, I thought maybe I was done messing with scripting VMs, but started to face the issue that writing code purely in C is limiting, given that even if only a small amount of code ever ends up as scripts, the parts that do make a difference. Around this time, I had also become basically familiar with HTML and JavaScript, and had recently done an implementation of XML-RPC. So, I did a nasty hack: I built an interpreter for a JavaScript knock-off on top of XML-RPC, which parsed into an XML tree and then directly interpreted the XML expressions (this was around mid 2004). This first VM suffered from horrible performance, and basically spewed garbage at an incredible rate, so in not too long I wrote a replacement.

The next iteration of the language was partially redesigned (in 2006), and basically copy-pasted a lot of parts from my Scheme interpreter, but made use of a bytecode interpreter. This VM was a lot faster, but suffered some of the problems of my original Scheme interpreter, for example the relative pain of trying to deal with a precise garbage collector from C land.

The next VM (2009) was basically a lot of the machinery from the 2006 VM, re-merged with parts of the 2004 VM, namely using a more C friendly memory manager. In 2010 and 2011 I started adding a lot of features I had sort of copied from ActionScript 3.0, as well as some features from Java and C#.

The bytecode used was originally a dynamically-typed stack-oriented bytecode, but opcode type annotations were added after the fact via prefixes. The VM then used an awkward mix of static and dynamic typing.

Had also been running side efforts trying to do natively compiled and 3AC compiled code, but my early efforts here were very hit or miss (my first real code-generator collapsed into being unmaintainable, …). A lot of optimizations for my more efficient 3AC interpreters, and a few other VM designs, ended up back-ported to the stack machine VM, so there ended up being not a huge pressure to replace it with something different.

 

First BGBTech Engine

Around this time, I also wrote the first BGBTech Engine by pulling together parts from my various previous projects, so BGBScript was the primary scripting VM in BGBTech. The engine was developed out of a combination of a 3D modeller and mapper I had previously written mostly for the Quake 2 engine, mostly as by that time QuArK had decayed into an almost unusable mess, and couldn’t really be easily extended to deal with heavily modified games or engines.

Its performance was overall pretty bad, with me early on feeling it was doing well when framerates were in the double digits, and I later ended up switching over mostly to a Minecraft-style terrain system after noting that my mapping skills were pretty bad.

Most of the horrible performance was due to me early on not really understanding OpenGL that well, and doing lots of things in ways which were pretty but not all that fast, then noting that they scaled poorly to larger scenes. A lot of the performance gains were mostly figuring out how to draw things more efficiently, and cutting things back to limit their performance costs.

By around 2014, progress on this engine was pretty much stalled.

 

Formation of BGBScript 2

Likewise, I had gone off to some side projects, mostly working on trying to build robots, so I put a lot of this on hold.

At this point I ran into a problem that my older BGBScript VM was too heavyweight for use in my robot projects, and wasn’t really well suited to low-latency use-cases. Had started on a redesigned version of the language I had called BGBScript-RT (for BGBScript Real-Time). It was redesigned to be much more like C in many areas, switching from garbage collection to the use of manual memory management.

In the meantime, needing something more immediate, I had used a variation of C compiled to a 3AC Bytecode. The bytecode VM was originally designed as part of a new faster experimental backend for the BGBScript VM, but it was easier to get C working on it. While the VM was pretty fast, it had a number of more subtle design issues.

Partly as a result, the newer BGBScript RT/2 VM had made use of a statically typed stack-machine, more like that in the JVM. This initial form of the VM stalled out with the general stall of my robot projects.

 

BGBTech 2 Engine

Around late 2015, and the general stall of my robot-building efforts, I ended up starting on a reboot of my 3D engine. My initial thinking was that I could take what I learned from my prior engine, and probably reboot it into a much faster and better engineered engine.

Since it quickly became apparent I needed scripting, I ended up reviving the BGBScript 2 VM, and worked a lot on beating the compiler/VM into a usable form. This was along with working a lot on stuff relevant to “engine architecture”.

This Script VM has ended up a fair bit bigger and more complicated than my original vision, but a lot of this was for things relevant to a game engine, and to performance: a simpler VM with a smaller / simpler ISA either comes at the cost of time or complexity in the back-end, or of worse performance. Generally, getting better performance comes to some extent at the cost of VM complexity.

As of the beginning of 2017, while it hasn’t really caught up to the old engine in many areas, and it is debatable whether the “better engineering” part is actually paying off, I at least seem to be making progress (and my new engine is much less prone to running out of memory or having awkward GC stalls).

 

Bytecode Tradeoffs

Over time, it can be noted that different ways of expressing operations have different tradeoffs.


  • Dynamically Typed Stack Machine
    • Simplest to design/implement
      • Front-end mostly ignores types.
    • The naive case is also the slowest
      • Determining types and operations at runtime does not come for free.
    • Getting decent performance requires a very complicated back-end.
      • Back-end needs to deal with type inference and JIT
    • FWIW: A few VMs (such as Parrot) have a Dynamically-Typed Register IR
  • Implicit-Type Statically-Typed Stack Machine
    • Used, for example, in the .NET CLR.
    • Most elegant design on paper
    • Reliance on propagating types via the stack is problematic for interpreters
      • Sensible if the VM is JIT only
  • Explicitly-Typed Stack Machine
    • An example of this is the JVM
    • Requires use of more opcodes and/or prefixes to express types.
      • Less elegant design due to lots of repeated operations per-type.
      • May seem less orthogonal as operations may only exist for certain types.
    • Is a decent design for an interpreter.
      • Explicit type information means back-end doesn’t need to figure this out.
      • May also be easier for a simplistic JIT.
  • Statically-Typed Register-IR (Non-SSA)
    • An example of this would include the Dalvik VM
    • Can allow for a faster interpreter than what is typical from a stack machine.
    • A drawback is that things can also get more complicated.
      • Compiler also needs to deal with the management of temporary variables.
      • No convenient way to implicitly pass arguments to functions.
  • Statically-Typed Register-IR (SSA)
    • An example of this is LLVM BitCode
    • Better suited for compiler analysis than interpretation
      • The additional temporary registers and phi operations are counter-productive for an interpreter.
      • Though, sub-scripting rather than duplicating registers is an option.
        • In the subscripted case, identity is shared, and phi is made implicit.
    • Is good for knowing register lifetime and may aid in register allocation.
      • Can know more easily when the lifespan of a register has ended.

 

In its current form, the BS2 VM uses an explicitly typed stack machine with some number of register ops mixed in. The use of hybrid register-ops can in some cases reduce the length of instruction sequences, and thus improve performance for an interpreter, at the cost of a larger ISA.

There is some possible debate as to whether this is better or worse than a purely register IR would have been.

Seen another way, the stack can be thought of as a collection of implicit temporaries which are accessed in LIFO order. Within the interpreter (if using threaded-code), there is no need to maintain a stack-pointer at runtime (the stack pointer then only really exists when decoding functions). A side-effect of this is needing some non-executed instructions which primarily serve the purpose of keeping the stack aligned around labels and branches.

For example, compiling "z=25*x+y;" with a pure stack model:

LDI L0   //(=>S)=x
LDC.I 25 //(=>S)=25
MULI     //(=>S)=(S=>)*(S=>)
LDI L1   //(=>S)=y
ADDI     //(=>S)=(S=>)+(S=>)
STI L2   //z=(S=>)

Or, with some additional “hybrid” opcodes:

MULISLC L0, 25  //(=>S)=x*25
ADDILSL L2, L1  //z=(S=>)+y

Which in this case, can match the density of a pure register form.

 

Typically, the bytecode is internally converted into a sort of threaded code, where instructions are decoded into structures with a function pointer, and instructions are executed by calling the function pointers.

These instructions are then stored in arrays, and on the C side, unrolled loops can be used to execute a sequence of instructions via such arrays. Execution then primarily works via a main trampoline loop, which calls function pointers holding the unrolled loops for executing instructions.
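
A highly simplified sketch of that threaded-code arrangement (names and structure here are made up; the real VM has a lot more going on):

/* Threaded-code sketch: each decoded instruction holds a handler function
   pointer plus operands; a trampoline loop calls through the pointers. */
typedef struct VMOp_s  VMOp;
typedef struct VMCtx_s VMCtx;

struct VMOp_s {
    VMOp *(*run)(VMCtx *ctx, VMOp *op);   /* returns the next instruction */
    int   a, b, c;
};

struct VMCtx_s {
    int regs[256];
    int running;
};

static VMOp *op_addi(VMCtx *ctx, VMOp *op)    /* example handler: Rc = Ra + Rb */
{
    ctx->regs[op->c] = ctx->regs[op->a] + ctx->regs[op->b];
    return op + 1;
}

static VMOp *op_halt(VMCtx *ctx, VMOp *op)    /* example handler: stop the loop */
{
    ctx->running = 0;
    return op;
}

static void vm_run(VMCtx *ctx, VMOp *op)
{
    ctx->running = 1;
    while (ctx->running)
        op = op->run(ctx, op);    /* trampoline: call the next handler */
}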

A simplistic JIT strategy for this is to emit machine code blobs which contain, in the general case, function calls to the embedded function pointers. For operations handled directly by the JIT, it may instead emit a blob of instructions for executing the operation. If one gets extra fancy, one can add a makeshift register allocator, so values may be cached in CPU registers rather than using memory loads/stores.

But, if the bytecode is kept friendly to interpreters, an interpreter is a good fallback in case a JIT is not available or is too expensive to support on the target, and directly interpreting bytecode operations can be good if for whatever reason threaded-code is not a good option, or if the code will probably only ever be executed maybe once or twice.

 

On a path to moderately fast image/video compression.

Well, probably many here are familiar with the DCT transform, the aged and battle hardened star of JPEG and MPEG. Some may also be familiar with the “rising star” known as DWT, and its “will it ever appear in a usable mainstream video codec?” thing. Maybe one can then go on about the trade-offs between various new/shiny arithmetic and range coders, so much better than ye olde static Huffman.

But… I am not here to talk about these.

Rather, I am going to go back a little further, to a since forgotten era. Early on, before we had enough video memory to store 32-bit RGB(A) values for every pixel, how was it done? No, not 8-bit palette-based graphics, earlier still.

There was a time when things were basically 1 bit per pixel. This may be remembered via old display adapters, like the Hercules, which only did monochrome. Others, such as CGA, had multiple modes, where every pixel could be one of a few small palette of colors. On other systems, people had the idea of using a pair of colors for each block of pixels, where each bit would select between one of the two colors.

Say, hypothetically, one had 8 bits for a pair of 4-bit colors, and with 16 bits of pixel data one can do a 4×4 grid of pixels, in full color, in 24 bits. Alternatively, one can do 8×8 pixels in 72 bits.

Now, moving forwards a few years, we have an era where people had new shiny PCs with monitors running at 640×480, with 256 colors, and a 25 or 33 MHz processor. People were playing video on these things, but how? No, it wasn’t MPEG, it was typically something a lot simpler: You have a pair of 8-bit colors, and 16 bits to select between them. So, here, it is 4×4 pixels in 32 bits. This works out to about 2 bits/pixel, but since often many parts of the image wont change, special bit patterns could be used to encode skipping over blocks of pixels. This codec? MS-CRAM, or MS Video 1.

Elsewhere, there was a company known as Apple, who was working on a system known as QuickTime. As part of this, they developed another codec, known as “Apple Video” or “RPZA” (its FOURCC values being “RPZA” and “azpr”). Its design differed slightly: it used a pair of RGB555 pixels (16 bits each, the MSB was special), and 32 bits with 2 bits per pixel. Instead of simply selecting one of the colors, it was possible to select a pixel interpolated between these: 0 would give you ColorA, 3 would give ColorB, 1 would give 2/3A+1/3B, and 2 would give 1/3A+2/3B. Likewise, it was possible to skips runs of blocks, or indicate that a run of blocks was all a single color. For anyone familiar with DXT or BCn texture compression, it is basically the same basic idea.

One could also see the RPZA image as a sort of stream of commands:

  • 00-7F: Start of a block of pixels, 64 bits for 16 pixels.
    • Well, technically you could also do a block of 16x 15bit pixels (256 bits).
    • Which one it was depended on the MSB of the following RGB555 color.
  • 80-9F: Skips 1-32 blocks
    • Whatever pixel data exists here is retained from the prior frame.
    • The low 5 bits of the byte give the length of the run of blocks.
  • A0-BF: Span of 1-32 blocks using a single flat RGB555 color.
  • C0-DF: Span of 1-32 blocks sharing a single pair of colors.
  • E0-FF: Reserved
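
In code terms, dispatching on those command bytes is basically just range checks on the opcode (a sketch; the reading of the trailing color/pixel data is left out here):

/* Sketch: classify an RPZA-style command byte per the ranges above.
   Only the opcode ranges and run-length extraction are shown. */
#include <stdint.h>

typedef enum { CMD_BLOCK, CMD_SKIP, CMD_FLAT, CMD_SHARED, CMD_RESERVED } rpza_cmd;

static rpza_cmd rpza_classify(uint8_t op, int *run)
{
    *run = (op & 0x1F) + 1;                        /* run length, 80-DF cases */
    if (op < 0x80) { *run = 1; return CMD_BLOCK; } /* explicit pixel block    */
    if (op < 0xA0) return CMD_SKIP;                /* keep prior-frame blocks */
    if (op < 0xC0) return CMD_FLAT;                /* run of one flat color   */
    if (op < 0xE0) return CMD_SHARED;              /* run sharing a color pair*/
    return CMD_RESERVED;                           /* E0-FF: reserved         */
}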

Then, later along, there were new upstarts like Cinepak, etc.

But what happened? Well, a little later on, MPEG and friends gained a foothold on PCs, and use cases for video were reduced down to mostly watching movies/anime/… and the rise of online streaming, which put a premium on the bitrate. Likewise, the early FMV games were by and large a steaming pile of crap, and in a short time people largely forgot that this earlier era of video codecs even existed.

But, there is a little detail in this: decoding an MPEG-style codec requires a bit of CPU power, but "we got a 1GHz CPU to play a 640×480 video, so who cares?". With the rising resolutions of subsequent years, and clock speeds hitting a limit, some modern video codecs remain playable at HD resolutions mostly by the grace of dedicated graphics hardware (and people debate their "codec complexity" largely in terms of die space required on an ASIC or similar, but everyone has a GPU, right?…).

But, a simple question arises: What happens if we try to decode one of those older early-90s video codecs on a modern PC? Well, in simple terms, it goes "really damn fast" (if written sensibly, the video decoding can potentially reach "memcpy" speeds). But… They also tend to take a lot of space and deliver less than impressive image quality.

Some years ago (~ 2011 IIRC), while faced with this issue, I ended up hacking some features onto one of these codecs (RPZA). Most didn’t work out well (or were fairly specialized, like adding an Alpha channel), but a few stood out: 23 bit colors (for “looks less like ass than RGB555”), and the use of an external Deflate-based compressor (which didn’t slow it down too drastically, but could result in potentially significant savings in terms of frame sizes). I had dubbed this effort (“BTIC1C”, being 1C as it came after a few prior failed attempts at compressing S3TC). Some hybrid 1C video exists where JPEG was used for I-Frames but 1C was used for P-Frames, however, this hack was prone to numerous issues. BTIC1C was mostly developed originally for a mix of real-time screen capture and for streaming video into textures (used for most of the animated texture effects in the original BGBTech 3D engine).

After that was an effort called BTIC1D, which used YUV and 4:2:0 chroma subsampling (via using separate interpolation for Y, U, and V). This effort was initially deemed a failure as its compression sucked pretty hard and its speed was barely much faster than what I could get from a DCT based design (as with DCT based codecs, it inherited the serious drawback of needing a per-pixel YCbCr->RGB conversion). There were also 1E and 1F, which basically had design issues bad enough that they never got fully implemented (though 1F did develop a nifty trick I call YUVD, which had replaced the use of a pair of colors with a singular vector).

Now, it was 2014, and I was off messing with new things, namely writing code on a Raspberry Pi, screwing around with electronics wanting to build some robots. I went and built my sadly thus far only “successful” robot: a wheeled cardboard box with a webcam on it (so I could drive it around and see what it was facing). I wanted to stream video from the webcam back to my laptop, and initially underestimated the RasPi’s processing power (I also wanted a design which could work over UDP and allowed me to encode video data between the motor PWM pulses).

So, at the time, I created something relatively simple: an RPZA-like codec I called BTIC1G, but instead of RGB555, I used a YUVD color vector. Why? Mostly because the webcam gave me video in a format known as YUY2, and because, given how my block-encoder logic generally worked, I generally ended up with a midpoint color and a 'max-avg' gamma value. So, YUV encodes the center YUV, and D encoded the Y difference between the lightest and darkest pixel. Interpolation then, effectively, only applied to Y.

Pixel data was good old 32-bits and was 4×4 with 2 bits/pixel for interpolation, working out to 64 bits for a complete pixel block. To improve bitrate, there were also flat color blocks, and downsampled 2×2 blocks. Bits were shaved off of the YUV values to allow encoding the various block types.

Noted that it was plenty fast for the use-case, but I was sort of wishing for better quality. To stream 640×480 over UDP over a crappy 802.11b dongle basically resulted in not particularly great image quality.

After the relative "success" of my rolling cardboard box, thinking maybe I would build other fancier robots, I had begun on a successor to 1G, which I had called 1H. It retained a design intended to be incrementally encoded and streamed via UDP packets, but differed mostly in that it added entropy coding, and a few more features. Facing the issue that there isn't really a good way to do static Huffman over UDP, I went instead with Rice coding (mostly for the YUVD color vector deltas and command tags, with pixel data as raw bits).

Sadly, my robot building efforts had fizzled out, but I had in the process made a working prototype of 1H for the PC, made usable as a VFW codec. Then made the observation that it basically did better in most regards compared with my older 1C codec (better quality/bitrate, …).

Later on, ended up expanding it to include 4:2:0 blocks based on those from 1D, and expanding YUVD to YUVDyuv (where YUV represents the center point of a bounding-box in YUV space, and Dyuv encodes the size of this box). This was to address an issue of ugly artifacts in some cases, most obvious with "My Little Pony" and "Rainbow Dash": Whenever her character was on screen, her hair would cause severe blocking artifacts as there is no way to express the sharp color contrasts in the YUVD scheme (an RGB color pair fares a little better, but is by no means immune to this issue, and I suspect is a big part of why "partitions" exist in BC7 / BPTC). Since the 4:2:0 blocks are basically limited mostly to emergency cases, one can largely sidestep their relatively steep performance cost (and needing around 50% more pixel bits, say, 32 bits for Y, and 8 bits each for U and V).

Additionally, some block types use 3 bit interpolation for added quality (0=A, 1=6/7A+1/7B, …, 6=1/7A+6/7B, 7=B), and others only use 1 bit per pixel, etc. So, 48 or 64 bits of pixel data may be used in some cases, in addition to maybe 4, 8, 16, or the standard 32 bits.

A lot of the encoding choices can be made via a “weighted decision tree” where one takes various values derived from the pixel data, the quality level, …, then this is evaluated to give a choice for how the block should be packed into a blob of bits. For example, if we “see” that the block is basically a flat color, we don’t really need full resolution or explicit chroma information, …
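
A hand-wavy sketch of what such a decision step can look like (the metrics, weights, and thresholds here are entirely made up; the real encoder uses more inputs and more outcomes):

/* Sketch: combine a few block metrics with weights and threshold the
   result to pick a block representation. Values are illustrative only. */
typedef enum { BLK_FLAT, BLK_LOWDETAIL, BLK_FULL } blk_kind;

static blk_kind classify_block(int y_range, int uv_range, int quality)
{
    int score = 5 * y_range + 3 * uv_range - 2 * quality;  /* weighted score */

    if (score < 16) return BLK_FLAT;        /* basically a flat color        */
    if (score < 96) return BLK_LOWDETAIL;   /* subsampled / fewer pixel bits */
    return BLK_FULL;                        /* full-resolution block         */
}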

In the decoder, it basically decodes the blocks from the bitstream, into an internal fixed-size block format. To get the pixel data, the logic for the block-type is invoked, which may generate a small array of pixel colors, and the pixel data will select which colors are copied into the output buffer. ( Note that the YUV colorspace is not necessarily YCbCr, as more typically a modified RCT or GDbDr is used instead due to being faster to convert to/from RGB, but raw RGB suffers in that it isn’t really friendly to chroma subsampling. )

In another side branch, a project existed (within a standards group I will leave unnamed), where they expressed wanting: CBR, ~ 2-4 bpp, HDR, 4:2:2, and "low complexity". Seeing this, I noted this was pretty close to what was doable with the internal block formats I had used in my prior codecs. Main difference was that I went from a block size of 4×4 to 8×8, as this decreases the relative cost of the block header (say, rather than a fixed 128 bit block, one has a 512 bit block, but the header does not necessarily increase in size, and by having 256/384/512 bit blocks, it was possible to have different bitrates between 2 and 4 bits, with the block size controlling which types of blocks were allowed, though potentially for many use cases only a single type of block would be used depending on the associated raster format).

Had observed that pretty much no one cared, later realizing their thinking for CBR and “low complexity” was more along the lines of “throw a modified JPEG2000 at the problem” (the computational complexity of JP2K is pretty drastic vs BTIC, but they were thinking in terms of an ASIC, rather than doing it on the CPU, and achieving CBR mostly by fiddling the quality levels and truncating the image to a specific size). After realizing that there was little reconciliation of design goals, I went away.

This initial design was dubbed BTIC4A (2x and 3x were mostly failed lines, 2x being DCT/WHT based designs, and 3x being some attempts to hybridize blocky VQ and transforms, the 2x line failing to compete well with codecs like XviD and lacking a strong use-case in the original BGBTech engine and for still images not offering much gain over normal JPEG, and 3x designs tending to turn out too complicated to be worth the effort needed to implement them).

Some parts were adapted from 1H onto 4A to yield a design I had called BTIC4B, and I had developed tricks to allow multi-threaded encoding/decoding for individual frames (cutting them up into “slices” which could be encoded/decoded in parallel). Some of these were back-ported to 1H.

Thus far, haven't really used 4B for video, but it does pretty well at still image compression (in terms of Q/bpp and speed), so could probably make for a decent video codec (though it is more complicated, as in all this there ended up being around 32 different block layouts, as there are a lot more ways to subsample an 8×8 pixel block vs a 4×4 block).

Though, as-is, BTIC1H can already reach gigapixel/second speeds in some cases, so the immediate need for something faster isn't huge.

However, for static images, on my new FX-8350, given I am seeing damn near 600 megapixels/second per thread for single-threaded operation (~ 2.4 GB/s of RGBA pixel data), I suspect it could probably reach some pretty crazy fast speeds.

On my older Phenom II, it tended to run into a hard limit of around 400 Mpix/sec per thread (~ 1.6 GB/s of pixel data), which roughly matched what one could get doing a memcpy operation.

But, I expect few people to care, since for the “standard” use-cases (watching movies and online video streaming), it doesn’t have much to offer (ex: 1080p needs ~ 62 Mpix/sec, not Gpix/sec). I guess if we assume that video serves no other purpose than this, it is hard to see the point.