Author: bgbtechblog

2bpp Framebuffer and NTSC Composite

So, status (besides class):
Had recently started messing with Verilog (for if/when I can eventually get an FPGA; still limited mostly to simulation), and have thus far written a partial implementation of a CPU core implementing my BJX1 ISA (an extended 64-bit version of the SuperH ISA).

For part of it, was also working on a display subsystem (intending to work from a small framebuffer in SRAM and then produce composite output).

Was able to experimentally get passable image quality from around 2 bits/pixel via a color-cell scheme. This has pretty comparable memory use to CGA in 4-color mode; but with higher image quality.

One sub-mode uses a pair of color endpoints from a 64-color RGB colorspace. Image quality isn’t great, but is reasonably passable. Here, each R/G/B component has 4 brightness levels, with an ability to get intermediates, sort-of, by dithering.

Another sub-mode uses YUVD colors encoded in a way similar to ADPCM. Basically, there is a YUV centroid with D=Ymax-Ymin (i.e., the difference in Y between the brighter and darker value). These are stored as a delta from the prior vector, with 3 bits/component encoding the delta, which may dynamically adapt to fit the range of the component values. There is an escape case to allow colors which fall too far outside the current range to be encoded as an absolute YUVD centroid, or alternatively as a pair of 64-color RGB endpoints.
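
The general idea of the adaptation (not the actual bit layout; the names and exact step/escape rules here are made up for illustration) is roughly like this, in C:

/* Hypothetical sketch of ADPCM-style delta coding for one YUVD component.   */
/* A 3-bit code encodes -4..+3 steps; the step size grows when the code      */
/* saturates and shrinks when it idles (initialize val=0, step>=1).          */
/* The real format also has an escape to an absolute centroid, omitted here. */

typedef struct { int val; int step; } AdpcmCh;

/* Encode: pick the 3-bit code whose reconstruction lands nearest 'target'. */
int adpcm_enc(AdpcmCh *ch, int target)
{
    int d = target - ch->val;
    int q = d / ch->step;                   /* crude quantize */
    if (q < -4) q = -4;
    if (q >  3) q =  3;
    ch->val += q * ch->step;                /* decoder-visible reconstruction */
    if (q == -4 || q == 3) ch->step *= 2;   /* range too small: grow the step */
    else if (q == 0 && ch->step > 1) ch->step /= 2;   /* shrink when idle */
    return q & 7;                           /* 3-bit code */
}

/* Decode mirrors the same state updates from the 3-bit code. */
int adpcm_dec(AdpcmCh *ch, int code)
{
    int q = (code & 7) - ((code & 4) ? 8 : 0);   /* sign-extend 3 bits */
    ch->val += q * ch->step;
    if (q == -4 || q == 3) ch->step *= 2;
    else if (q == 0 && ch->step > 1) ch->step /= 2;
    return ch->val;
}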

The reason for this is mostly to be able to fit the screen buffer into SRAM on an FPGA, and not “eat it” with an overly large framebuffer. As-is, at 2 bits/pixel, 320×200 would require 16kB, and 640×400 would require 64kB.

This is mostly because most of the rest of the SRAM (around 200kB in the FPGAs I was looking at) is needed for other purposes; like a CPU cache, … Alternatives would require either a mechanism to stream in a framebuffer from external DRAM, or require generating interrupts to have the CPU fill in scanlines.

 

Image of a Dog (Original, apart from scaling to 640×480):

DE3D1_org0

 

Image of a dog, but encoded at 2 bits/pixel:

DE3D1_2bpp1

It doesn’t look great, but you can’t expect too much from 2-bits / pixel.

If I added a full 64kB (to support 640×400) I could possibly have a higher quality sub-mode for 320×200 display.

Another possibility is to have 32kB of VRAM, which would allow a 2bpp 640×200 mode, or a higher quality 4bpp 320×200 mode (probably with either a pair of RGB555 endpoints or a YUVDyuv scheme).

Less certain ATM is whether this will support a text submode; the specifics are unclear (dedicated text-cells, or drawing text via the color-cell graphics, with each glyph drawn as a 2×2 grid of cells).

 

This would feed into the next part:
I am considering doing the video output mostly as composite (ex: NTSC over an RCA cable), and so was messing around with trying to modulate and demodulate NTSC in software (and maybe later do a version of the modulator in Verilog; ADD/EDIT: have done a modulator in Verilog).

 

Dog Image (of experimental NTSC output, done via 100MHz PWM):

DE3D1_ntsc_clr2

In this test, the NTSC is done via PWM (or DPWM); in this case turning an output pin (simulated in this case) on and off on a 100MHz clock in a pattern to approximate an analog signal. This produces output at a rate of ~100Mbps, and thus a frame of video requires around 417kB (ironically, smaller than most of the output images saved as PNG).

For actual output, this would be fed into a voltage divider and capacitor to turn this into a “true” analog signal (would need to reduce from 3.3V to 1.0V and also smooth out a lot of the high-frequency noise). This is pretty similar to how audio output is done on most modern computers, as DACs are more expensive and people don’t really hear the difference; though using a DAC would only need a signal at around 14MHz rather than 100MHz (and higher video quality would similarly require a higher PWM frequency). This frequency is nowhere near high enough to be able to directly produce VHF modulated output though (would need to PWM the pin at around 240-300MHz for this).
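
For reference, the basic bit-generation idea (turning a desired analog level into a 1-bit pattern whose average tracks that level) can be sketched in C roughly like this; the actual Verilog isn’t necessarily structured this way:

/* Hypothetical first-order delta-sigma / PDM style bit generator:     */
/* called once per 100MHz bit clock, returns whether the pin is high.  */
/* The running accumulator makes the average duty cycle track level16. */

static unsigned pwm_acc = 0;

int next_pwm_bit(unsigned level16)    /* level16: desired level, 0..65535 */
{
    pwm_acc += level16;
    if (pwm_acc >= 65536) { pwm_acc -= 65536; return 1; }   /* pin high */
    return 0;                                               /* pin low  */
}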

Color makes things a little more complicated, and a lot of the effort in implementing the NTSC demodulator was mostly going into trying to make the QAM demodulation work. Color involves basically magic of making ~3 analog signals coexist in the same place at the same time, with the components mostly not interfering with each other.

The Y component is transmitted via the signal’s DC bias, and the UV/IQ components via phase angles (I at 0 degrees and Q at 90 degrees); with a non-modulated 3.579545MHz “colorburst” signal being sent every scanline to help align the decoder with the encoder’s signal. These are all handled together as a single combined PWM signal. Here, YUV can be fudged into being YIQ (by slightly rotating the signal), so it basically works.
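
In simplified form, a single composite sample during the active picture works out to roughly the following (C sketch; this ignores sync/blanking levels, IRE scaling, and the exact burst-phase convention, with the burst itself being just the bare subcarrier sent during the back porch):

#include <math.h>

#define NTSC_FSC 3579545.0    /* NTSC color subcarrier, Hz */

double composite_sample(double y, double i, double q, double t)
{
    double w = 2.0 * 3.14159265358979 * NTSC_FSC * t;
    return y + i * cos(w) + q * sin(w);   /* luma plus QAM'ed chroma */
}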

In the case of the image, the decoder was given a copy of the PWM’ed output bitstream, and it recreates an output image from this. In an actual hardware implementation, this role would be served by plugging the cable into a TV or some other similar display with support for a composite input. Though, I would need some sort of decoder to test a Verilog simulation of the display subsystem.

Though, I don’t know yet if it will work with an actual TV, as information on NTSC specifics is a bit fragmentary and it was annoyingly difficult to piece together via web-searches. So, it is possible I missed something important and the generated signal isn’t actually compatible with real TVs, who knows?… ( Could maybe consider writing a basic spec on this. )

Why not VGA? Fewer wires, mostly. Though, on the modulation side VGA (and component) would be both simpler and probably produce higher output quality than composite (basically by not needing to modulate multiple signals together). VGA would require wiring up a VGA connector though (and a lot more filters); component would have the drawback of being less commonly supported by low-cost displays.

 

Other Thoughts:

  • Noticed after posting this, that I could improve image quality by 8-bit aligning the PWM bits for the NTSC test, likely because the decoder samples 8 or 16 bits at a time (shouldn’t matter for a real TV).
  • Though more complex, a hybrid DAC design (say, with 2 or 3 high-order bits and the low-order bits as PWM) could likely be able to give better video quality. Though, this would require more complex electronics.
  • Note that the colorburst clock doesn’t align sufficiently closely with an integer fraction of 100MHz. As a workaround, it is possible to accumulate 150137 (once per clock-cycle) and use bits (21:17) of the accumulator as a clock-phase (150137 ≈ 3.579545/100*2^22, which approximates the ratio of the frequencies relative to a power-of-2; the choice of 2^22 in this case being based mostly on “number magic”). A small sketch of this follows below.
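
Roughly, as a C model of that accumulator (the real thing being a register and adder in Verilog):

/* 150137 ~= (3.579545 / 100) * 2^22, so the low 22 bits of the accumulator */
/* wrap at (approximately) the colorburst rate.                             */

unsigned int sc_acc = 0;

unsigned int next_subcarrier_phase(void)   /* call once per 100MHz cycle */
{
    sc_acc += 150137;
    return (sc_acc >> 17) & 31;            /* bits 21:17 -> 5-bit phase */
}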

But, yeah, work in progress.

 

Forgot about this…

So, what happened? I wrote a few posts, then for a while forgot I had set up a blog.

So what has been going on?

Class, basically doing Machining stuff, and learning CNC stuff, because, that is what sort of jobs “actually exist” around where I live.

 

Spent a good number of months working on an emulator and a C compiler targeting a variation of the SuperH ISA. Started fiddling with doing an implementation of the ISA in Verilog. Not like it is “actually useful” or anything. Working on stuff like this, one becomes aware both of the limitations of the ISA and, in doing stuff in Verilog, of some of the challenges which exist in trying to implement an ISA in hardware, sorta (this exists in a gray area between that of code and hardware: writing stuff in a language which can nominally run on an FPGA, or theoretically be turned into a real chip if enough money is thrown at the problem).

 

So, what is Verilog:

It is sorta like C, if you were doing everything using branch-free logic, with a few control-flow constructs thrown in the mix; and a convenient bit-vector notation (counterbalanced by an ugly syntax for declaring numeric values; begin/end keywords; …). It is also in some ways behaviorally sort-of like ladder logic, though at least thankfully, much less terrible (the way combinatorial blocks behave is reminiscent of a scan in a ladder program; though luckily we are spared from everything being presented via the abstraction of relays switching things on and off between a pair of power rails).

One thing that becomes more obvious here is the distinction between division and most other arithmetic operators: Most operators can be implemented fairly easily; but division is one of those operators that generally requires an iterative implementation in both the integer and FPU case.

Integer division is possible in one of several ways, but not going to go into them.

FPU division is at least a little easier, in that one can hack together a passable approximation for relatively cheap (mostly using subtraction and shifts or similar). But, then a problem comes up if one wants “accurate” division. In a software FPU, one can be like, “well, I will just use a fixed-point integer divide and some bit-twiddly”. Here, not so much (if one has any concern for gate budget).

There is always Newton-Raphson, but this also costs gates (for a probably one-off special purpose FMA unit), and one would still basically need to follow an initial approximation with some number of “refinement” operations.
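
The refinement step itself is cheap to express; a C sketch of the idea (recip_approx() being a hypothetical stand-in for whatever cheap shift/subtract approximation seeds it):

/* Newton-Raphson refinement for 1/d: x = x*(2 - d*x).            */
/* Each step roughly doubles the number of correct bits.          */

double recip_nr(double d, double x0, int steps)
{
    double x = x0;                  /* initial approximation of 1/d */
    int i;
    for (i = 0; i < steps; i++)
        x = x * (2.0 - d * x);
    return x;
}

/* a/b might then be computed as a * recip_nr(b, recip_approx(b), 2). */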

The other alternatives are having a floating-point divide operator which has lackluster precision, or is comparatively slow (if emulated in firmware).

 

Interestingly, one feature I got backlash over, namely the addition of x86-like memory addressing, was comparatively cheap and simple to implement, and has the advantage of somewhat reducing the length of instruction sequences when working with structures or arrays, or when doing pointer arithmetic.

Basically, in SuperH, the memory addressing ops are like this:

MOV.B  //Load or Store a Byte
MOV.W  //Load or Store a Word (16-bit)
MOV.L  //Load or Store an Integer/Long (32-bit)
FMOV.S //Load/Store a Float (or half of a Double)

With several types of addressing available (basic/common forms):
MOV.L @Rm, Rn  //Load from Rm into Rn
MOV.L @Rm+, Rn //Load from Rm into Rn, with a post-increment
MOV.L @(Rm, disp4), Rn //Load from Rm with 4 bit displacement
MOV.L @(R0, Rm), Rn //Load from Rm indexed by R0
...
MOV.L Rm, @-Rn //Store into Rn with pre-decrement
...

A useful special case:
MOV.L @(PC,disp8), Rn  //Load a constant

And, some not so useful special cases:
MOV.L @(GBR, disp8), R0 //What do we use this one for, exactly?

Looks all well and good, until you want to load from an array. Spot that @(R0, Rm) addressing form? Yeah, you’ll be seeing that one a lot (pretty much any time you access an array or a structure field which falls outside of a very limited displacement range). That 4-bit displacement form is frequently not even sufficient to access local variables on the stack. Similarly, because lots of operations need to go through R0, it is extremely prone to being stomped.

Similarly, MOV.B and MOV.W implement fewer forms, and fewer still for floats or doubles (the SH FPU is probably worthy of its own topic).

This means typically a lot more constant loads, explicit shifts, arithmetic ops, … as part of doing a memory load or store operation.

There are workarounds though:

SH-2A MOV I-Forms (were added later on with the SH-2A ISA):
MOV.L @(Rm, disp12), Rn  //Oh, amazing, a whole 12 bits for the displacement

BJX1 MOV I-Forms
MOV.L @(Rm, Ro, disp8s), Rn  //Address is computed as Rm+(Ro+disp)*4
...

And this gold nugget:
LEA.L @(Rm, Ro, disp4), Rn //Calculate address and store into Rn

Also:
MOV.Q  //Load/Store Quadword

These are 32-bit instructions, and are not entirely without drawbacks, but can generally “pay for themselves” in terms of simultaneously reducing code size and (potentially) helping with performance. This is mostly because, if you can replace 3 or 4 16-bit instructions with a single 32-bit instruction form, this is typically a savings.

What about all these instruction-forms for MOV.L and friends; won’t this lead to an absurdly complex load/store and address calculation machinery? Not really.

In Verilog, one can sort of just use the general case and special-case things, leading to a side effect: all of the memory addressing is done internally as @(Rm, Ro, disp) calculations. But how? Some operations lack a scaled index, lack a displacement, or have an unscaled displacement.

Partial solution: What can happen in the decoder:

MOV.L @Rm, Rn         => MOV.L @(Rm, ZZR, 0), Rn
MOV.L @Rm+, Rn        => MOV.L @(Rm, ZZR, 0), Rn; LEA.L @(Rm, ZZR, 1), Rm
MOV.L @(Rm, disp), Rn => MOV.L @(Rm, ZZR, disp), Rn
MOV.L @(R0, Rm), Rn   => MOV.L @(Rm, R0, 0), Rn
MOV.L @(GBR, Rm), Rn  => MOV.L @(GBR, ZZR, 0), Rn

With a few special notes:

  • ZZR is a special Null Register, and is treated here as if it were always 0.
  • ZZR may overlap with R15/SP, which may not be used as a scaled index.
  • R0 is not scaled if used as an index.
  • Some specifics here are left to the discretion of the implementation.
  • Operations inside the core use wider register numbering than in the ISA.

Basically seems workable.
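
As a rough C model of the unified address calculation (names made up; the real logic lives in the Verilog decode/AGU path, and the special cases from the notes above, like R0 not being scaled or ZZR overlapping SP, are glossed over):

#define ZZR 16   /* hypothetical internal register number for the null register */

unsigned long long agu_calc(const unsigned long long *regs,
                            int rm, int ro, int disp, int scale)
{
    unsigned long long base  = regs[rm];
    long long          index = (ro == ZZR) ? 0 : (long long)regs[ro];
    return base + (unsigned long long)((index + disp) * scale);
}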

Well, there was also 3-address integer arithmetic and similar, which is supported by the ISA, but as can be noted it hasn’t had as significant an impact (generally, cases where two-address forms are applicable are more common, as very often arithmetic is either modifying a value, or the lifetime of one of the values ends just as the new value is being created).

Similarly, there doesn’t seem to be much reason to deviate from load/store, and if anything there is the added drawback of potentially needing to deal with read/modify/write to memory (which is avoided if operations take place in registers).

Why make post-increment / pre-decrement operations take 2 cycles? Well, the main reason mostly has to do with an inability to (efficiently) write to multiple registers in a single cycle, which could potentially happen if both are done in a single operation. However, one can fairly safely read from 3 or 4 registers at a time.

The registers are internally split into multiple banks to avoid exceeding these limits.

  • A bank for the low halves of registers.
  • A bank for the high halves of registers.
  • More banks for system and control registers.

The handling of 32-bit registers is special in that, in BJX1, it is possible for some operations to address the high halves of 64-bit GPR registers. This is handled internally via some multiplexing magic. But, splitting them up allows the halves to be accessed in parallel without posing as much of an issue for a 3 read + 1 write limit.

For related reasons, the FPU register space is internally permuted, and structured similarly to the GPRs:

  • F0L-F16L: Low half of Double Registers
  • F0H-F16H: High half of Double Registers
  • F0L=FR1, F1L=XF1, F2L=FR3, F3L=XF3, …
  • F0H=FR0, F1H=XF0, F2H=FR2, F3H=XF2, …

So, in this scheme, Double is the natural value size, though this should be mostly invisible to code.

 

But, still to be seen if I can make a working core out of all this…

 

Trade-offs in Entropy Coding

Well, I suspect no one takes me all that seriously about compression, partly because I suspect I tend to use slightly unusual designs, and tend to focus more on speed than on ratio.

There are a variety of entropy coders, with different trade-offs:

  • Static Huffman
    • Basically the “gold standard” of entropy coders which use a normal bitstream.
    • Pros:
      • Has generally pretty fast encode/decode speeds.
      • Theoretically optimal for a bitstream.
    • Cons:
      • Requires transmitting / initializing / … a Huffman table (non-zero cost).
      • For chaotic streams (such as small messages over UDP), ensuring that the decoder has a matching table can become problematic.
      • Moreover, the space cost of a Huffman table may be a problem for small messages, as can the constant overhead of initializing the tables.
  • Adaptive Huffman
    • Usually unfavorable as it tends to be very slow vs most other options.
    • Need to adjust trees per-symbol.
    • Does not favor the use of a lookup-table driven decoder.
  • Rice Coding
    • Useful, but largely forgotten.
    • Pros
      • Very little constant overhead.
      • Is easy to make adaptive, allowing a simple single-pass encoder.
      • Can achieve similar speeds to static Huffman.
    • Cons
      • Natively only matches certain distributions.
      • Absent length-limiting hacks, has the potential for excessively long symbols.
    • Tricky Hacks
      • Length limiting to avoid extremely long symbols.
      • Reorganizing symbols can allow for more Huffman-like behavior.
        • Dynamic symbol swapping can be fast, effective, and is adaptive.
        • Will generally call this category “Rice+STF” (a minimal sketch of plain Rice coding, without the STF part, follows after this list).
  • Bitwise Range Coding
    • Compresses good, but not particularly fast.
    • Pros:
      • Adaptive
      • Can beat Huffman variants regarding compression.
      • Still faster than Adaptive Huffman.
      • Can be similarly flexible as a raw bitstream.
    • Cons
      • Slower than Huffman and Rice
      • Slower than a raw bitstream (if used as a bitstream).
    • Tricky Hacks
      • Can be stuck on the back-end of a Huffman or Rice coder for improving compression (slightly) with less speed cost than using it directly.
      • Sadly, the compression gains of using it over a raw bitstream are frequently not high enough to make its speed costs worthwhile.
  • Symbol Oriented Range Coding
    • Compresses good, also not exactly fast.
    • Cons (vs Bitwise range coding)
      • Doesn’t really compress any better
      • Much more complex and often slower
      • Much less flexible about how it is used.
  • ANS and friends
    • A sort of magic unicorn I don’t really understand that well.
    • Pros
      • Speed can be more competitive with that of Static Huffman.
      • Compression is more like that of a Range Coder.
    • Cons
      • I personally can’t really make much sense of how it works.
      • Requires encoding symbols in reverse order.
      • Seems to have some restrictions similar to symbol-oriented range coders.
      • Its high speeds only seem to materialize on certain hardware in my past tests.
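
For reference, plain Rice coding (without the STF or length-limiting parts) is about this simple; in this C sketch, emit_bit() is a stand-in for whatever bitstream writer is used, and the unary convention (1s terminated by a 0, vs 0s terminated by a 1) is just a choice:

void emit_bit(int b);   /* hypothetical bitstream writer */

/* Rice-encode a non-negative value with parameter k. */
void rice_encode(unsigned val, int k)
{
    unsigned q = val >> k;
    unsigned i;
    for (i = 0; i < q; i++)     /* quotient in unary: q 1-bits... */
        emit_bit(1);
    emit_bit(0);                /* ...terminated by a 0 */
    for (i = k; i > 0; i--)     /* remainder: the low k bits, MSB first */
        emit_bit((val >> (i - 1)) & 1);
}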

Relative Tradeoffs:

  • Between Huffman and Range Coding
    • Have only rarely seen ratio differences much over about 10%
    • Cases like high symbol skew (where range coding fares well) can usually be addressed more cheaply via other means.
    • A slowdown of typically 2-4x isn’t usually worth a 5-10% gain in compression.
  • Between Huffman and Rice+STF
    • Huffman generally compresses better and is faster than Rice+STF past a certain message size.
    • Rice+STF can allow for a simpler encoder/decoder though, and is faster for small payloads.
      • Its origins actually came about as a cheaper alternative to Deflate-style Huffman-coded Huffman tables.
      • In some cases, the space savings of not transmitting a Huffman table with the data may offset its less-than-optimal coding, putting it ahead.
  • Huffman or Rice + Bitwise Range Coder Backend
    • Can regain some of the compression advantages of Range coding.
    • Has less steep of an impact on codec speed
      • Given the range coder deals only with the already-compressed data.
    • A drawback is that one now has the code complexity of a Huffman or Rice coder in addition to the Range coder, plus potentially other unique funkiness (such as interleaving multiple bitstreams to allow for multiple RC contexts, etc).
  • Not using any entropy coding
    • Typically the fastest option.
    • Typically the worst compression
      • Because, you know, raw bytes.
    • May still be a good option where speed is a lot more important than compression ratio.

 

It can be noted that most of my older designs had used static Huffman, but many of my newer ones have used Rice+STF variants. Generally the relative speed, simplicity, and flexibility have been enough to offset the arguably “bad” compression.

If the total ratio difference is maybe 20% between a “bad” Rice-coded backend, and a much slower range-coded backend, but the result is a 400% speed difference, is it worthwhile? Probably depends whether size or speed is more relevant.

Both may save significantly over the raw uncompressed data, which is usually where most of the gains lie. It often seems more sensible to invest a lot of the time/effort budget in areas which have a more direct effect on overall compression (such as any filtering/quantization/transform steps in image/video compression, or the effectiveness of the LZ matching and similar for LZ77 compressors).

For example, in an image codec, one may find it may well matter a lot more what type of DC predictor is used, or how deltas or coefficients are represented, than the specific choice of entropy coder back-end or similar.

Though, granted, I remain sort of an outcast here.

 

Brief History of BGBScript

Earlier Efforts

Well, once, when I was young, I spent a lot of time messing around with Quake. Quake had a scripting language, known as QuakeC, which was fairly limited but made it a lot easier to tweak things in-game, at least compared with the use of C for scripting in Quake II.

Eventually, I was working on more of my own projects, and came across a language known as Lisp. I had tried using some of the existing scripting VMs I could find, which IIRC was a standalone variant of Emacs Lisp, and the SCM / Guile implementations of Scheme.

After running into frustrations with these VMs, and noting the general simplicity of the language, decided to do my own VM. It wasn’t very good, having relied on a sort of modified S-Expressions for interpretation; later I wrote a backend which spit out C. This project, however, turned into a mess, and Scheme’s syntax and language design wasn’t particularly usable for my purposes, so I later ended up abandoning it.

At first, I thought maybe I was done messing with scripting VMs, but started to face the issue that writing code purely in C is limiting: even if only a small amount of code ever ends up as scripts, the parts that do make a difference. Around this time, I had also become basically familiar with HTML and JavaScript, and had recently done an implementation of XML-RPC. So, I did a nasty hack: I built an interpreter for a JavaScript knock-off on top of XML-RPC, which parsed into an XML tree and then directly interpreted the XML expressions (this was around mid 2004). This first VM suffered from horrible performance, and basically spewed garbage at an incredible rate, so before too long I wrote a replacement.

The next iteration of the language was partially redesigned (in 2006), and basically copy-pasted a lot of parts from my Scheme interpreter, but made use of a bytecode interpreter. This VM was a lot faster, but suffered some of the problems of my original Scheme interpreter, for example the relative pain of trying to deal with a precise garbage collector from C land.

The next VM (2009) was basically a lot of the machinery from the 2006 VM, re-merged with parts of the 2004 VM, namely using a more C friendly memory manager. In 2010 and 2011 I started adding a lot of features I had sort of copied from ActionScript 3.0, as well as some features from Java and C#.

The bytecode used was originally a dynamically-typed stack-oriented bytecode, but type annotations were added on after the fact via opcode prefixes. The VM then used an awkward mix of static and dynamic typing.

Had also been running side efforts trying to do natively compiled and 3AC compiled code, but my early efforts here were very hit or miss (my first real code-generator collapsed into being unmaintainable, …). A lot of optimizations for my more efficient 3AC interpreters, and a few other VM designs, ended up back-ported to the stack machine VM, so there ended up not being a huge pressure to replace it with something different.

 

First BGBTech Engine

Around this time, I also wrote the first BGBTech Engine by pulling together parts from my various previous projects, so BGBScript was the primary scripting VM in BGBTech. The engine grew out of a 3D modeller and mapper I had previously written mostly for the Quake 2 engine, largely because by that time QuArK had decayed into an almost unusable mess, and couldn’t really be easily extended to deal with heavily modified games or engines.

Its performance was overall pretty bad; early on, I felt it was doing well when framerates were in the double digits. I later ended up switching over mostly to a Minecraft-style terrain system after noting that my mapping skills were pretty bad.

Most of the horrible performance was due to me early on not really understanding OpenGL that well, and doing lots of things in ways which were pretty but not all that fast, then noting that they scaled poorly to larger scenes. A lot of the performance gains were mostly figuring out how to draw things more efficiently, and cutting things back to limit their performance costs.

By around 2014, progress on this engine was pretty much stalled.

 

Formation of BGBScript 2

Likewise, I had gone off to some side projects, mostly working on trying to build robots, so I put a lot of this on hold.

At this point I ran into a problem that my older BGBScript VM was too heavyweight for use in my robot projects, and wasn’t really well suited to low-latency use-cases. Had started on a redesigned version of the language I had called BGBScript-RT (for BGBScript Real-Time). It was redesigned to be much more like C in many areas, switching from garbage collection to the use of manual memory management.

In the meantime, needing something more immediate, I had used a variation of C compiled to a 3AC Bytecode. The bytecode VM was originally designed as part of a new faster experimental backend for the BGBScript VM, but it was easier to get C working on it. While the VM was pretty fast, it had a number of more subtle design issues.

Partly as a result, the newer BGBScript RT/2 VM had made use of a statically typed stack-machine, more like that in the JVM. This initial form of the VM stalled out with the general stall of my robot projects.

 

BGBTech 2 Engine

Around late 2015, with the general stall of my robot-building efforts, I ended up starting on a reboot of my 3D engine. My initial thinking was that I could take what I learned from my prior engine, and probably reboot it into a much faster and better-engineered engine.

Since it quickly became apparent I needed scripting, I ended up reviving the BGBScript 2 VM, and worked a lot on beating the compiler/VM into a usable form. This was along with working a lot on stuff relevant to “engine architecture”.

This Script VM has ended up a fair bit bigger and more complicated than my original vision, but a lot of this was for things relevant to a game engine, and to performance: a simpler VM with a smaller / simpler ISA either costs time and complexity in the back-end, or costs performance. Generally, getting better performance comes to some extent at the cost of VM complexity.

As of the beginning of 2017, while it hasn’t really caught up to the old engine in many areas, and it is debatable whether the “better engineering” part is actually paying off, I at least seem to be making progress (and my new engine is much less prone to running out of memory or having awkward GC stalls).

 

Bytecode Tradeoffs

Over time, it can be noted that different ways of expressing operations have different tradeoffs.


  • Dynamically Typed Stack Machine
    • Simplest to design/implement
      • Front-end mostly ignores types.
    • Naive case, is also the slowest
      • Determining types and operations at runtime does not come for free.
    • Getting decent performance requires a very complicated back-end.
      • Back-end needs to deal with type inference and JIT
    • FWIW: A few VMs (such as Parrot) have a Dynamically-Typed Register IR
  • Implicit-Type Statically-Typed Stack Machine
    • Used, for example, in the .NET CLR.
    • Most elegant design on paper
    • Reliance on propagating types via the stack is problematic for interpreters
      • Sensible if the VM is JIT only
  • Explicitly-Typed Stack Machine
    • An example of this is the JVM
    • Requires use of more opcodes and/or prefixes to express types.
      • Less elegant design due to lots of repeated operations per-type.
      • May seem less orthogonal as operations may only exist for certain types.
    • Is a decent design for an interpreter.
      • Explicit type information means back-end doesn’t need to figure this out.
      • May also be easier for a simplistic JIT.
  • Statically-Typed Register-IR (Non-SSA)
    • An example of this would include the Dalvik VM
    • Can allow for a faster interpreter than what is typical from a stack machine.
    • A drawback is that things can also get more complicated.
      • Compiler also needs to deal with the management of temporary variables.
      • No convenient way to implicitly pass arguments to functions.
  • Statically-Typed Register-IR (SSA)
    • An example of this is LLVM BitCode
    • Better suited to compiler analysis than interpretation
      • The additional temporary registers and phi operations are counter-productive for an interpreter.
      • Though, sub-scripting rather than duplicating registers is an option.
        • In the subscripted case, identity is shared, and phi is made implicit.
    • Is good for knowing register lifetime and may aid in register allocation.
      • Can know more easily when the lifespan of a register has ended.

 

In its current form, the BS2 VM uses a mix of an explicitly typed stack machine with some number of register ops mixed in. The use of hybrid register-ops can in some cases be used to reduce the length of instruction sequences, and thus improve performance for an interpreter, at the cost of a larger ISA.

There is some possible debate as to whether this is better or worse than a purely register IR would have been.

Seen another way, the stack can be thought of as a collection of implicit temporaries which are accessed in LIFO order. Within the interpreter (if using threaded-code), there is no need to maintain a stack-pointer at runtime (the stack pointer then only really exists when decoding functions). A side-effect of this is needing some non-executed instructions which primarily serve the purpose of keeping the stack aligned around labels and branches.

For example, compiling “z=25*x+y;” with a pure stack model:

LDI L0   //(=>S)=x
LDC.I 25 //(=>S)=25
MULI     //(=>S)=(S=>)*(S=>)
LDI L1   //(=>S)=y
ADDI     //(=>S)=(S=>)+(S=>)
STI L2   //z=(S=>)

Or, with some additional “hybrid” opcodes:

MULISLC L0, 25  //(=>S)=x*25
ADDILSL L2, L1  //z=(S=>)+y

Which in this case, can match the density of a pure register form.

 

Typically, the bytecode is internally converted into a sort of threaded code, where instructions are decoded into structures with a function pointer, and instructions are executed by calling the function pointers.

These instructions are then stored in arrays, and on the C side, unrolled loops can be used to execute a sequence of instructions via such arrays. Execution then primarily works via a main trampoline loop, which calls function pointers holding the unrolled loops for executing instructions.
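
Roughly, the structures involved look something like the following (a simplified C sketch; the names are made up, and details like branches, traps, and the actual operand layout are glossed over):

typedef struct VMOp_s     VMOp;
typedef struct VMThread_s VMThread;

struct VMOp_s {
    void (*run)(VMThread *ctx, VMOp *op);   /* handler for this decoded op */
    int a, b, c;                            /* operand fields */
};

typedef VMOp **(*VMTraceFn)(VMThread *ctx, VMOp **ops);

struct VMThread_s {
    long long  regs[256];
    VMOp     **ops;        /* current position in the decoded op array */
    VMTraceFn  next;       /* function which executes the next run of ops */
    int        running;
};

/* One "unrolled" trace body: runs 4 ops and returns the new position. */
VMOp **vm_run_trace4(VMThread *ctx, VMOp **ops)
{
    ops[0]->run(ctx, ops[0]);
    ops[1]->run(ctx, ops[1]);
    ops[2]->run(ctx, ops[2]);
    ops[3]->run(ctx, ops[3]);
    return ops + 4;
}

/* Main trampoline: keep calling whichever trace function is current.   */
/* Branch/call handlers would update ctx->ops and ctx->next as needed.  */
void vm_run(VMThread *ctx)
{
    while (ctx->running)
        ctx->ops = ctx->next(ctx, ctx->ops);
}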

A simplistic JIT strategy for this is to emit machine code blobs which contain, in the general cases, function calls to the embedded function pointers. For operations handled directly by the JIT, it may instead emit a blob of instructions for executing its operation. Extra fancy is one can add a makeshift register allocator, so values may be cached in CPU registers rather than using memory loads/stores.

But, if the bytecode is kept friendly to interpreters, an interpreter is a good fallback in case a JIT is not available or is too expensive to support on the target, and directly interpreting bytecode operations can be good if for whatever reason threaded-code is not a good option, or if the code will probably only ever be executed maybe once or twice.

 

On a path to moderately fast image/video compression.

Well, probably many here are familiar with the DCT transform, the aged and battle hardened star of JPEG and MPEG. Some may also be familiar with the “rising star” known as DWT, and its “will it ever appear in a usable mainstream video codec?” thing. Maybe one can then go on about the trade-offs between various new/shiny arithmetic and range coders, so much better than ye olde static Huffman.

But… I am not here to talk about these.

Rather, I am going to go back a little further, to a since forgotten era. Early on, before we had enough video memory to store 32-bit RGB(A) values for every pixel, how was it done? No, not 8-bit palette-based graphics, earlier still.

There was a time when things were basically 1 bit per pixel. This may be remembered via old display adapters, like the Hercules, which only did monochrome. Others, such as CGA, had multiple modes, where every pixel could be one of a small palette of colors. On other systems, people had the idea of using a pair of colors for each block of pixels, where each bit would select one of the two colors.

Say, hypothetically, one had 8 bits for a pair of 4-bit colors, and with 16 bits of pixel data one can do a 4×4 grid of pixels, in full color, in 24 bits. Alternatively, one can do 8×8 pixels in 72 bits.

Now, moving forwards a few years, we have an era where people had new shiny PCs with monitors running at 640×480, with 256 colors, and a 25 or 33 MHz processor. People were playing video on these things, but how? No, it wasn’t MPEG, it was typically something a lot simpler: You have a pair of 8-bit colors, and 16 bits to select between them. So, here, it is 4×4 pixels in 32 bits. This works out to about 2 bits/pixel, but since often many parts of the image won’t change, special bit patterns could be used to encode skipping over blocks of pixels. This codec? MS-CRAM, or MS Video 1.

Elsewhere, there was a company known as Apple, who was working on a system known as QuickTime. As part of this, they developed another codec, known as “Apple Video” or “RPZA” (its FOURCC values being “RPZA” and “azpr”). Its design differed slightly: it used a pair of RGB555 pixels (16 bits each, the MSB was special), and 32 bits with 2 bits per pixel. Instead of simply selecting one of the colors, it was possible to select a pixel interpolated between these: 0 would give you ColorA, 3 would give ColorB, 1 would give 2/3A+1/3B, and 2 would give 1/3A+2/3B. Likewise, it was possible to skip runs of blocks, or indicate that a run of blocks was all a single color. For anyone familiar with DXT or BCn texture compression, it is basically the same basic idea.
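
For reference, decoding one such block is about this simple (a C sketch of the general idea; the exact rounding and bit/byte order vary between the real codecs):

typedef struct { unsigned char r, g, b; } RGB;

static RGB rgb555_to_rgb(unsigned c)
{
    RGB p;
    p.r = ((c >> 10) & 31) << 3;
    p.g = ((c >>  5) & 31) << 3;
    p.b = ( c        & 31) << 3;
    return p;
}

static RGB mix3(RGB a, RGB b, int wa, int wb)   /* (a*wa + b*wb) / 3 */
{
    RGB p;
    p.r = (a.r * wa + b.r * wb) / 3;
    p.g = (a.g * wa + b.g * wb) / 3;
    p.b = (a.b * wa + b.b * wb) / 3;
    return p;
}

/* Decode a 4x4 block: two RGB555 endpoints plus 32 bits of 2-bit selectors. */
void decode_block4x4(unsigned colorA, unsigned colorB,
                     unsigned pixbits, RGB dst[16])
{
    RGB pal[4];
    int i;
    pal[0] = rgb555_to_rgb(colorA);
    pal[3] = rgb555_to_rgb(colorB);
    pal[1] = mix3(pal[0], pal[3], 2, 1);       /* 2/3 A + 1/3 B */
    pal[2] = mix3(pal[0], pal[3], 1, 2);       /* 1/3 A + 2/3 B */
    for (i = 0; i < 16; i++)
        dst[i] = pal[(pixbits >> (i * 2)) & 3];
}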

One could also see the RPZA image as a sort of stream of commands:

  • 00-7F: Start of a block of pixels, 64 bits for 16 pixels.
    • Well, technically you could also do a block of 16x 15bit pixels (256 bits).
    • Which one it is depends on the MSB of the following RGB555 color.
  • 80-9F: Skips 1-32 blocks
    • Whatever pixel data exists here is retained from the prior frame.
    • The low 5 bits of the byte give the length of the run of blocks.
  • A0-BF: Span of 1-32 blocks using a single flat RGB555 color.
  • C0-DF: Span of 1-32 blocks sharing a single pair of colors.
  • E0-FF: Reserved

Then, later along, there were new upstarts like Cinepak, etc.

But what happened? Well, a little later on, MPEG and friends gained a foothold on PCs, and use cases for video were reduced down to mostly watching movies/anime/… and the rise of online streaming, which put a premium on bitrate. Likewise, the early FMV games were by and large a steaming pile of crap, and in a short time people mostly forgot that this earlier era of video codecs even existed.

But, there is a little detail in this: decoding an MPEG style codec requires a bit of CPU power, but “we got a 1GHz CPU to play a 640×480 video, so who cares?”. In the rising resolutions of subsequent years, and hitting a limit to clock speeds, some modern video codecs remain mostly playable at HD resolutions mostly by the grace of dedicated graphics hardware (and people debate their “codec complexity” largely in terms of die space required on an ASIC or similar, but everyone has a GPU, right?…).

But, a simple question arises: What happens if we try to decode one of those older early 90s video codecs on a modern PC? Well, in simple terms, it goes “really damn fast” (if written sensibly, the video decoding can potentially reach “memcpy” speeds). But… They also tend to take a lot of space and deliver less than impressive image quality.

Some years ago (~ 2011 IIRC), while faced with this issue, I ended up hacking some features onto one of these codecs (RPZA). Most didn’t work out well (or were fairly specialized, like adding an Alpha channel), but a few stood out: 23 bit colors (for “looks less like ass than RGB555”), and the use of an external Deflate-based compressor (which didn’t slow it down too drastically, but could result in potentially significant savings in terms of frame sizes). I had dubbed this effort “BTIC1C” (being 1C as it came after a few prior failed attempts at compressing S3TC). Some hybrid 1C video exists where JPEG was used for I-Frames but 1C was used for P-Frames; however, this hack was prone to numerous issues. BTIC1C was mostly developed originally for a mix of real-time screen capture and for streaming video into textures (used for most of the animated texture effects in the original BGBTech 3D engine).

After that was an effort called BTIC1D, which used YUV and 4:2:0 chroma subsampling (via using separate interpolation for Y, U, and V). This effort was initially deemed a failure as its compression sucked pretty hard and its speed was barely much faster than what I could get from a DCT based design (as with DCT based codecs, it inherited the serious drawback of needing a per-pixel YCbCr->RGB conversion). There were also 1E and 1F, which basically had design issues bad enough that they never got fully implemented (though 1F did develop a nifty trick I call YUVD, which had replaced the use of a pair of colors with a singular vector).

Now, it was 2014, and I was off messing with new things, namely writing code on a Raspberry Pi, screwing around with electronics wanting to build some robots. I went and built my sadly thus far only “successful” robot: a wheeled cardboard box with a webcam on it (so I could drive it around and see what it was facing). I wanted to stream video from the webcam back to my laptop, and initially underestimated the RasPi’s processing power (I also wanted a design which could work over UDP and allowed me to encode video data between the motor PWM pulses).

So, at the time, I created something relatively simpler: an RPZA-like codec I called BTIC1G, but instead of RGB555, I used a YUVD color vector. Why? Mostly because the webcam gave me video in a format known as YUY2, and because, given how my block-encoder logic generally worked, I generally ended up with a midpoint color and a ‘max-avg’ gamma value. So, YUV encodes the center YUV, and D encodes the Y difference between the lightest and darkest pixel. Interpolation then, effectively, only applies to Y.
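
Deriving the vector from a block of pixels is cheap; roughly (a C sketch of the general idea, ignoring the bit-shaving and the per-pixel interpolation decisions):

typedef struct { int y, u, v; } YUV;
typedef struct { int y, u, v, d; } YUVD;

/* YUV is the block's center point, D the spread in luma between the */
/* lightest and darkest pixel.                                        */
YUVD block_yuvd(const YUV *px, int n)
{
    int i, ymin = px[0].y, ymax = px[0].y;
    long su = 0, sv = 0;
    YUVD c;
    for (i = 0; i < n; i++) {
        if (px[i].y < ymin) ymin = px[i].y;
        if (px[i].y > ymax) ymax = px[i].y;
        su += px[i].u;
        sv += px[i].v;
    }
    c.y = (ymin + ymax) / 2;   /* luma midpoint */
    c.u = (int)(su / n);       /* average chroma */
    c.v = (int)(sv / n);
    c.d = ymax - ymin;         /* luma spread */
    return c;
}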

Pixel data was good old 32-bits and was 4×4 with 2 bits/pixel for interpolation, working out to 64 bits for a complete pixel block. To improve bitrate, there were also flat color blocks, and downsampled 2×2 blocks. Bits were shaved off of the YUV values to allow encoding the various block types.

Noted that it was plenty fast for the use-case, but I was sort of wishing for better quality. To stream 640×480 over UDP over a crappy 802.11b dongle basically resulted in not particularly great image quality.

After the relative “success” of my rolling cardboard box, thinking maybe I would build other fancier robots, I had begun on a successor to 1G, which I had called 1H. It retained a design intended to be incrementally encoded and streamed via UDP packets, but differed mostly in that it added entropy coding, and a few more features. Facing an issue that there isn’t really a good way to do static Huffman over UDP, I went instead with Rice coding (mostly for the YUVD color vector deltas and command tags, with the pixel data as raw bits).

Sadly, my robot building efforts had fizzled out, but I had in the process made a working prototype of 1H for the PC, made usable as a VFW codec. Then made the observation that it basically did better in most regards compared with my older 1C codec (better quality/bitrate, …).

Later on, ended up expanding it to include 4:2:0 blocks based on those from 1D, and expanding YUVD to YUVDyuv (where YUV represents the center point of a bounding-box in YUV space, and Dyuv encodes the size of this box). This was to address an issue of ugly artifacts in some cases, most obvious with “My Little Pony” and “Rainbow Dash”: Whenever her character was on screen, her hair would cause severe blocking artifacts as there is no way to express the sharp color contrasts in the YUVD scheme (an RGB color pair fares a little better, but is by no means immune to this issue, and I suspect is a big part of why “partitions” exist in BC7 / BPTC). Since the 4:2:0 blocks are basically limited mostly to emergency cases, one can largely sidestep their relatively steep performance cost (and needing around 50% more pixel bits, say, 32 bits for Y, and 8 bits each for U and V).

Additionally, some block types use 3 bit interpolation for added quality (0=A, 1=6/7A+1/7B, …, 6=1/7A+6/7B, 7=B), and others only use 1 bit per pixel, etc. So, 48 or 64 bits of pixel data may be used in some cases, in addition to maybe 4, 8, 16, or the standard 32 bits.
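
In code form, building the palette for one of the 3-bit interpolated block types is basically this (sketch; the endpoint type and names are made up):

typedef struct { int r, g, b; } Col;

/* Index j in 0..7 mixes (7-j)/7 of endpoint A with j/7 of endpoint B. */
void build_pal8(Col a, Col b, Col pal[8])
{
    int j;
    for (j = 0; j < 8; j++) {
        pal[j].r = (a.r * (7 - j) + b.r * j) / 7;
        pal[j].g = (a.g * (7 - j) + b.g * j) / 7;
        pal[j].b = (a.b * (7 - j) + b.b * j) / 7;
    }
}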

A lot of the encoding choices can be made via a “weighted decision tree” where one takes various values derived from the pixel data, the quality level, …, then this is evaluated to give a choice for how the block should be packed into a blob of bits. For example, if we “see” that the block is basically a flat color, we don’t really need full resolution or explicit chroma information, …

In the decoder, it basically decodes the blocks from the bitstream, into an internal fixed-size block format. To get the pixel data, the logic for the block-type is invoked, which may generate a small array of pixel colors, and the pixel data will select which colors are copied into the output buffer. ( Note that the YUV colorspace is not necessarily YCbCr, as more typically a modified RCT or GDbDr is used instead due to being faster to convert to/from RGB, but raw RGB suffers in that it isn’t really friendly to chroma subsampling. )
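
For reference, an RCT-style transform looks about like this (here the JPEG2000 RCT specifically; the “modified RCT or GDbDr” variants used in my codecs differ in details but are in the same spirit; this also assumes arithmetic right-shift on negative values, i.e. floor division, as on typical compilers):

void rct_fwd(int r, int g, int b, int *y, int *u, int *v)
{
    *y = (r + 2 * g + b) >> 2;   /* luma-ish */
    *u = b - g;                  /* Db */
    *v = r - g;                  /* Dr */
}

void rct_inv(int y, int u, int v, int *r, int *g, int *b)
{
    int gg = y - ((u + v) >> 2);   /* exactly recovers G */
    *g = gg;
    *r = v + gg;
    *b = u + gg;
}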

In another side branch, a project existed (within a standards group I will leave unnamed), where they expressed wanting: CBR, ~ 2-4 bpp, HDR, 4:2:2, and “low complexity”. Seeing this, I noted this was pretty close to what was doable with the internal block formats I had used in my prior codecs. Main difference was that I went from a block size of 4×4 to 8×8, as this increases the relative cost of the block header (say, rather than a fixed 128 bit block, one has a 512 bit block, but the header does not necessarily increase in size, and by having 256/384/512 bit blocks, it was possible to have different bitrates between 2 and 4 bits, with the block size controlling which types of blocks were allowed, though potentially for many use cases only a single type of block would be used depending on the associated raster format).

Had observed that pretty much no one cared, later realizing their thinking for CBR and “low complexity” was more along the lines of “throw a modified JPEG2000 at the problem” (the computational complexity of JP2K is pretty drastic vs BTIC, but they were thinking in terms of an ASIC, rather than doing it on the CPU, and achieving CBR mostly by fiddling the quality levels and truncating the image to a specific size). After realizing that there was little reconciliation of design goals, I went away.

This initial design was dubbed BTIC4A (2x and 3x were mostly failed lines, 2x being DCT/WHT based designs, and 3x being some attempts to hybridize blocky VQ and transforms, the 2x line failing to compete well with codecs like XviD and lacking a strong use-case in the original BGBTech engine and for still images not offering much gain over normal JPEG, and 3x designs tending to turn out too complicated to be worth the effort needed to implement them).

Some parts were adapted from 1H onto 4A to yield a design I had called BTIC4B, and I had developed tricks to allow multi-threaded encoding/decoding for individual frames (cutting them up into “slices” which could be encoded/decoded in parallel). Some of these were back-ported to 1H.

Thus far, haven’t really used 4B for video, but it does pretty good at still image compression (in terms of Q/bpp and speed), so could probably make for a decent video codec (though it is more complicated, as in all this there ended up being around 32 different block layouts, as there are a lot more ways to subsample an 8×8 pixel block vs a 4×4 block).

Though, as-is, BTIC1H can already reach gigapixel/second speeds in some cases, so the immediate need for something faster isn’t all that pressing.

However, for static images, on my new FX-8350, given I am seeing damn near 600 megapixels/second per thread for single-threaded operation (~ 2.4 GB/s of RGBA pixel data), I suspect it could probably reach some pretty crazy speeds.

On my older Phenom II, it tended to run into a hard limit of around 400 Mpix/sec per thread (~ 1.6 GB/s of pixel data), which roughly matched what one could get doing a memcpy operation.

But, I expect few people to care, since for the “standard” use-cases (watching movies and online video streaming), it doesn’t have much to offer (ex: 1080p needs ~ 62 Mpix/sec, not Gpix/sec). I guess if we assume that video serves no other purpose than this, it is hard to see the point.

Trying this out.

Well, I am a programmer. I am not really a new programmer, but I have also never been employed as one. For, there is this issue: I write stuff on my own. In this world, no degree and no prior work experience means no job. That leaves me starting classes for CNC Machining; at least then maybe someone will hire me (and I can have some semblance of income).

I have written two 3D engines, an older engine (BGBTech) which grew organically but became a bulky, slow, and difficult to maintain mess. I have a newer 3D engine (BGBTech2), which I am trying to make a more cleanly engineered piece of technology, taking lessons I have learned in my earlier years and trying to turn them into something a little more polished (well, by my standards, by more general standards it is more like a pile of feces given a shiny coating of acrylic floor wax). But, it regularly shows its lack of maturity, as I haven’t yet re-implemented many things from the old engine, and bugs/crashes are hardly an uncommon occurrence.

GitHub profile, for those feeling brave.

I have custom codecs, which go reasonably fast (in more recent iterations, I am getting speeds measurable in gigapixels/second for multi-threaded encoding/decoding, on the CPU). This is not via some epic optimization, so much as by doing a bit of retro-90s engineering. Though it is mostly taken for granted: we now have around 20 years of people using progressively more expensive designs in the name of being “better”, and then people are like, “well, you gotta use the GPU and/or dedicated hardware for this stuff”.

But, what if one sacrifices a little compression, and a little elegance, making something that looks like a monster which crawled out of 1993 and then grew some Cthulhu like appendages? Well, it isn’t pretty, but it can be fast.

I also have scripting languages for my 3D engines. The older engine used a language which started off resembling JavaScript but later on incorporated a lot more features (and a syntax more like ActionScript3). The newer engine used a reboot of the language, which more or less took the old language design and moved it over to a statically-typed core with manual memory management, in a way being sort of like if a mongrel of C and C# had a mixture of AS3 and Java style syntax.

It is not a language design meant to inspire, or to upset the status quo, it is a language to hopefully get “real work” done in, and to hopefully not perform like garbage. In some sense, conventional language design works well here. However, it is a scripting language in the sense that, at least at present, its dominant use-case is being loaded directly in the form of source code.

In a way, it is at odds with its “compiled” siblings which assume an explicit compilation step (even if they are to be JIT compiled from bytecode), and dynamic languages which are more frequently loaded from source but have performance which ranges between lackluster and dismal (“Wow, it is so fast! It is only 30 or 45 times slower than what it would have been had we written it in C!”). How about a scripting language where, for code which resembles C, it isn’t too drastically slower? Like, maybe we can have a language design which isn’t at odds with making it fast?

Granted, explaining much about these things may be a matter for another time.

Maybe next I will try to address some of these topics individually.