Tag: programming

On a path to moderately fast image/video compression.

Well, probably many here are familiar with the DCT transform, the aged and battle hardened star of JPEG and MPEG. Some may also be familiar with the “rising star” known as DWT, and its “will it ever appear in a usable mainstream video codec?” thing. Maybe one can then go on about the trade-offs between various new/shiny arithmetic and range coders, so much better than ye olde static Huffman.

But… I am not here to talk about these.

Rather, I am going to go back a little further, to a since forgotten era. Early on, before we had enough video memory to store 32-bit RGB(A) values for every pixel, how was it done? No, not 8-bit palette-based graphics, earlier still.

There was a time when things were basically 1 bit per pixel. This may be remembered via old display adapters, like the Hercules, which only did monochrome. Others, such as CGA, had multiple modes, where every pixel could be one of a few small palette of colors. On other systems, people had the idea of using a pair of colors for each block of pixels, where each bit would select between one of the two colors.

Say, hypothetically, one had 8 bits for a pair of 4-bit colors, and with 16 bits of pixel data one can do a 4×4 grid of pixels, in full color, in 24 bits. Alternatively, one can do 8×8 pixels in 72 bits.

Now, moving forwards a few years, we have an era where people had new shiny PCs with monitors running at 640×480, with 256 colors, and a 25 or 33 MHz processor. People were playing video on these things, but how? No, it wasn’t MPEG, it was typically something a lot simpler: You have a pair of 8-bit colors, and 16 bits to select between them. So, here, it is 4×4 pixels in 32 bits. This works out to about 2 bits/pixel, but since often many parts of the image wont change, special bit patterns could be used to encode skipping over blocks of pixels. This codec? MS-CRAM, or MS Video 1.

Elsewhere, there was a company known as Apple, who was working on a system known as QuickTime. As part of this, they developed another codec, known as “Apple Video” or “RPZA” (its FOURCC values being “RPZA” and “azpr”). Its design differed slightly: it used a pair of RGB555 pixels (16 bits each, the MSB was special), and 32 bits with 2 bits per pixel. Instead of simply selecting one of the colors, it was possible to select a pixel interpolated between these: 0 would give you ColorA, 3 would give ColorB, 1 would give 2/3A+1/3B, and 2 would give 1/3A+2/3B. Likewise, it was possible to skips runs of blocks, or indicate that a run of blocks was all a single color. For anyone familiar with DXT or BCn texture compression, it is basically the same basic idea.

One could also see the RPZA image as a sort of stream of commands:

  • 00-7F: Start of a block of pixels, 64 bits for 16 pixels.
    • Well, technically you could also do a block of 16x 15bit pixels (256 bits).
    • Which is was depended on the MSB of the following RGB555 color.
  • 80-9F: Skips 1-32 blocks
    • Whatever pixel data exists here is retained from the prior frame.
    • The low 5 bits of the byte give the length of the run of blocks.
  • A0-BF: Span of 1-32 blocks using a single flat RGB555 color.
  • C0-DF: Span of 1-32 blocks sharing a single pair of colors.
  • E0-FF: Reserved

Then, later along, there were new upstarts like Cinepak, etc.

But what happened? Well, a little later on, MPEG and friends gained a foothold on PC’s, and use cases for video were reduced down to mostly watching movies/anime/… and the rise of online streaming, which put a premium on the bitrate. Likewise, the early FMV games were by and large a steaming pile of crap, and in a short time people by-and-large forgot that this earlier era of video codecs even existed.

But, there is a little detail in this: decoding an MPEG style codec requires a bit of CPU power, but “we got a 1GHz CPU to play a 640×480 video, so who cares?”. In the rising resolutions of subsequent years, and hitting a limit to clock speeds, some modern video codecs remain mostly playable at HD resolutions mostly by the grace of dedicated graphics hardware (and people debate their “codec complexity” largely in terms of die space required on an ASIC or similar, but everyone has a GPU, right?…).

But, a simple question arises? What happens if we try to decode one of those older early 90s video codecs on a modern PC? Well, in simple terms, it goes “really damn fast” (if written sensibly, the video decoding can potentially reach “memcpy” speeds). But… They also tend to take a lot of space and deliver less than impressive image quality.

Some years ago (~ 2011 IIRC), while faced with this issue, I ended up hacking some features onto one of these codecs (RPZA). Most didn’t work out well (or were fairly specialized, like adding an Alpha channel), but a few stood out: 23 bit colors (for “looks less like ass than RGB555”), and the use of an external Deflate-based compressor (which didn’t slow it down too drastically, but could result in potentially significant savings in terms of frame sizes). I had dubbed this effort (“BTIC1C”, being 1C as it came after a few prior failed attempts at compressing S3TC). Some hybrid 1C video exists where JPEG was used for I-Frames but 1C was used for P-Frames, however, this hack was prone to numerous issues. BTIC1C was mostly developed originally for a mix of real-time screen capture and for streaming video into textures (used for most of the animated texture effects in the original BGBTech 3D engine).

After that was an effort called BTIC1D, which used YUV and 4:2:0 chroma subsampling (via using separate interpolation for Y, U, and V). This effort was initially deemed a failure as its compression sucked pretty hard and its speed was barely much faster than what I could get from a DCT based design (as with DCT based codecs, it inherited the serious drawback of needing a per-pixel YCbCr->RGB conversion). There were also 1E and 1F, which basically had design issues bad enough that they never got fully implemented (though 1F did develop a nifty trick I call YUVD, which had replaced the use of a pair of colors with a singular vector).

Now, it was 2014, and I was off messing with new things, namely writing code on a Raspberry Pi, screwing around with electronics wanting to build some robots. I went and built my sadly thus far only “successful” robot: a wheeled cardboard box with a webcam on it (so I could drive it around and see what it was facing). I wanted to stream video from the webcam back to my laptop, and initially underestimated the RasPi’s processing power (I also wanted a design which could work over UDP and allowed me to encode video data between the motor PWM pulses).

So, at the time, I created something relatively simpler: an RPZA like codec I called BTIC1G, but instead of RGB555, I used a YUVD color vector. Why? Mostly because the webcam gave me video in a format known as YUY2, and because given how my block-encoder logic generally worked, I generally ended up with a midpoint color, and a ‘max-avg’ gamma value. So, YUV encodes the center YUV, and D encoded the Y difference between the lightest and darkest pixel. Interpolation then, effectively, only applied to Y.

Pixel data was good old 32-bits and was 4×4 with 2 bits/pixel for interpolation, working out to 64 bits for a complete pixel block. To improve bitrate, there were also flat color blocks, and downsampled 2×2 blocks. Bits were shaved off of the YUV values to allow encoding the various block types.

Noted that it was plenty fast for the use-case, but I was sort of wishing for better quality. To stream 640×480 over UDP over a crappy 802.11b dongle basically resulted in not particularly great image quality.

After the relative “success” of my rolling cardboard box, thinking maybe I would build other fancier robots, I had begun on a successor to 1G, which I had called 1H. It retained a design intended to be incrementally encoded and streamed via UDP packets, but differed mostly in that it added entropy coding, and a few more features. Facing an issue that there isn’t really a good way to do static Huffman over UDP, I went instead with Rice coding (mostly for the YUVD color vector deltas and command tags, with pixel data is raw bits).

Sadly, my robot building efforts had fizzled out, but I had in the process made a working prototype of 1H for the PC, made usable as a VFW codec. Then made the observation that it basically did better in most regards compared with my older 1C codec (better quality/bitrate, …).

Later on, ended up expanding it to include 4:2:0 blocks based on those from 1D, and expanding YUVD to YUVDyuv (where YUV represents the center point of a bounding-box in YUV space, and Dyuv encodes the size of this box). This was to address an issue of ugly artifacts in some cases, most obvious with “My Little Pony” and “Rainbow Dash”: Whenever her character was on screen, her hair would cause severe blocking artifacts as there is no way to express the sharp color contrasts in the YUVD scheme (an RGB color pair fairs a little better, but is by no means immune to this issue, and I suspect is a big part of why “partitions” exist in BC7 / BPTC). Since the 4:2:0 blocks are basically limited mostly to emergency cases, one can largely sidestep their relatively steep performance cost (and needing around 50% more pixel bits, say, 32 bits for Y, and 8 bits each for U and V).

Additionally, some block types use 3 bit interpolation for added quality (0=A, 1=6/7A+1/7B, …, 6=1/7A+6/7B, 7=B), and others only use 1 bit per pixel, etc. So, 48 or 64 bits of pixel data may be used in some cases, in addition to maybe 4, 8, 16, or the standard 32 bits.

A lot of the encoding choices can be made via a “weighted decision tree” where one takes various values derived from the pixel data, the quality level, …, then this is evaluated to give a choice for how the block should be packed into a blob of bits. For example, if we “see” that the block is basically a flat color, we don’t really need full resolution or explicit chroma information, …

In the decoder, it basically decodes the blocks from the bitstream, into an internal fixed-size block format. To get the pixel data, the logic for the block-type is invoked, which may generate a small array of pixel colors, and the pixel data will select which colors are copied into the output buffer. ( Note that the YUV colorspace is not necessarily YCbCr, as more typically a modified RCT or GDbDr is used instead due to being faster to convert to/from RGB, but raw RGB suffers in that it isn’t really friendly to chroma subsampling. )

In another side branch, a project existed (within a standards group I will leave unnamed), where they expressed wanting: CBR, ~ 2-4 bpp, HDR, 4:2:2, and “low complexity”. Seeing this, I noted this was pretty close to what was doable with the internal block formats I had used in my prior codecs. Main difference was that I went from a block size of 4×4 to 8×8, as this increases the relative cost of the block header (say, rather than a fixed 128 bit block, one has a 512 bit block, but the header does not necessarily increase in size, and by having 256/384/512 bit blocks, it was possible to have different bitrates between 2 and 4 bits, with the block size controlling which types of blocks were allowed, though potentially for many use cases only a single type of block would be used depending on the associated raster format).

Had observed that pretty much no one cared, later realizing their thinking for CBR and “low complexity” was more along the lines of “throw a modified JPEG2000 at the problem” (the computational complexity of JP2K is pretty drastic vs BTIC, but they were thinking in terms of an ASIC, rather than doing it on the CPU, and achieving CBR mostly by fiddling the quality levels and truncating the image to a specific size). After realizing that there was little reconciliation of design goals, I went away.

This initial design was dubbed BTIC4A (2x and 3x were mostly failed lines, 2x being DCT/WHT based designs, and 3x being some attempts to hybridize blocky VQ and transforms, the 2x line failing to compete well with codecs like XviD and lacking a strong use-case in the original BGBTech engine and for still images not offering much gain over normal JPEG, and 3x designs tending to turn out too complicated to be worth the effort needed to implement them).

Some parts were adapted from 1H onto 4A to yield a design I had called BTIC4B, and I had developed tricks to allow multi-threaded encoding/decoding for individual frames (cutting them up into “slices” which could be encoded/decoded in parallel). Some of these were back-ported to 1H.

Thus far, haven’t really used 4B for video, but it does pretty good at still image compression (in terms of Q/bpp and speed), so could probably make for a decent video codec (though it is more complicated, as in all this there ended up being around 32 different block layouts, as there are a lot more ways to subsample an 8×8 pixel block vs a 4×4 block).

Though, as-is, BTIC1H can already reach gigapixel/second speeds in some cases, the immediate for something faster isn’t a huge need.

However, for static images, on my new FX-8350, given I am seeing damn near 600 megapixels/thread for single-threaded operation (~ 2.4 GB/s of RGBA pixel data), I suspect it could probably reach some pretty crazy fast speeds.

On my older Phenom II, it tended to run into a hard limit of around 400 Mpix/sec per thread (~ 1.6 GB/s of pixel data), which roughly matched what one could get doing a memcpy operation.

But, I expect few people to care, since for the “standard” use-cases (watching movies and online video streaming), it doesn’t have much to offer (ex: 1080p needs ~ 62 Mpix/sec, not Gpix/sec). I guess if we assume that video serves no other purpose than this, it is hard to see the point.