How to introduce this? These are the sorts of things one works on: they serve a purpose, but don’t get much attention.
I have a design for a format I had called FeLZ32, which is basically “like LZ4 but decodes faster”. How? It dispenses with bytes; the entire format is based around working with 32-bit DWORDs. So, the compressed data is DWORDs, and the output is also seen as DWORDs. The DWORDs are normally 32-bit aligned, with no fuss over trailing bytes.
Otherwise, FeLZ32 is broadly similar to LZ4: there is a Tag word, which encodes both the match (length/distance) and the number of raw/uncompressed DWORDs to copy to the output (before copying the matched symbols). The format of the Tag is fairly simple: [15:0]=distance; [22:16]=match length; [29:23]=raw length; [31:30]=alignment. There are special cases, namely for long runs of raw/unencoded symbols, and an end-of-data case (creatively encoded as 0x00000000).
( Sadly, some of the shifts are a little awkward, eg, for SuperH assembler. Plain SH only provides immediate shifts of 1/2/8/16 bits, so X>>7 requires shifting 1 bit left and 8 bits right, or the use of SHLD. )
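As a sketch, unpacking the Tag fields described above might look like the following (the function name and example values are my own illustration, not from the actual reference decoder):

```python
# Unpack a FeLZ32 tag DWORD per the field layout given above:
# [15:0]=distance, [22:16]=match length, [29:23]=raw length,
# [31:30]=alignment. A tag of 0x00000000 is the end-of-data marker.

def unpack_tag(tag):
    """Split a 32-bit tag DWORD into its fields."""
    dist  = tag & 0xFFFF          # [15:0]  match distance (in DWORDs)
    mlen  = (tag >> 16) & 0x7F    # [22:16] match length
    rlen  = (tag >> 23) & 0x7F    # [29:23] raw length
    align = (tag >> 30) & 0x3     # [31:30] alignment bits
    return dist, mlen, rlen, align

# Example: distance=5, match length=3, raw length=2, alignment=0
tag = 5 | (3 << 16) | (2 << 23)
print(unpack_tag(tag))   # (5, 3, 2, 0)
```

Note the >>16 and >>23 shifts: these are exactly the ones that get awkward on an ISA with a limited set of immediate shift amounts.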
So, basically, it is like LZ4, but in-general doesn’t compress quite as well (depends on content, for some things it fares a little better). In turn, LZ4 doesn’t usually compress as well as Deflate, so same sort of thing. Each is generally faster, but at the expense of compression.
There are some types of content where it does better, but the results here are for some random stuff I had lying around (not really ideal cases for FeLZ32 either, but alas):
As noted, my PC (an AMD FX-8350 with 32GB of DDR3-1600) achieves a “general” memcpy speed of around 3.4 GB/s (I had previously been using a Phenom II, which peaked at ~1.8 GB/s).
Against this, FeLZ32 was typically getting decode speeds of around 2 to 3 GB/sec, which seems pretty solid. Encoding doesn’t reliably beat LZ4 though, which is sad (encoding mostly consists of searching for LZ matches, so its speed depends heavily on how much “effort” the encoder puts into the search, and on the efficiency of the matching strategy).
I guess a possible “research path” could be a format which can alternate between an FeLZ32-like mode and a more LZ4-like mode.
There was also BTIC4B, which is a speed-oriented image/video codec, following along with my prior “BTIC family” designs.
For the most part, members of the BTIC family were focused on speed, going back originally to a format I had called BTIC1C, which was basically a hacked version of the Apple RPZA format with a Deflate-based backend glued on (which I had also tuned for speed).
In this format, blocks consisted of a pair of colors in RGB555 format, followed by 32 bits of pixel data representing a 4×4 block (at 2 bits/pixel). Anyone who has looked at DXT1 probably gets the idea (but this happened roughly 5 years earlier). The MSB of each color endpoint was used to invoke other magic, namely skipping runs of blocks, or doing runs of a flat color.
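A rough sketch of decoding one such block follows. The 4-entry palette with 2/3 and 1/3 interpolation points follows the usual DXT1-style convention; the exact interpolation BTIC1C used may differ, so treat this as illustrative rather than the actual decoder:

```python
# Decode an RPZA/BTIC1C-style block: two RGB555 endpoint colors plus
# 32 bits of 2-bit pixel selectors covering a 4x4 block.

def rgb555_to_rgb(c):
    r = (c >> 10) & 0x1F
    g = (c >> 5) & 0x1F
    b = c & 0x1F
    # Expand 5-bit channels to 8 bits by replicating the high bits.
    return ((r << 3) | (r >> 2), (g << 3) | (g >> 2), (b << 3) | (b >> 2))

def decode_block(c0, c1, bits):
    a, b = rgb555_to_rgb(c0), rgb555_to_rgb(c1)
    # 4-entry palette: the two endpoints plus two interpolated colors.
    pal = [a,
           tuple((2 * x + y) // 3 for x, y in zip(a, b)),
           tuple((x + 2 * y) // 3 for x, y in zip(a, b)),
           b]
    pixels = []
    for i in range(16):                 # 4x4 block, 2 bits per pixel
        sel = (bits >> (2 * i)) & 0x3
        pixels.append(pal[sel])
    return pixels
```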
There were a few failed attempts to replace it: BTIC1D, which used a conceptually similar design but was YUV420, and had unacceptably poor decode speeds combined with lackluster compression; and BTIC1F, which added the concept of using a YUV color-vector in place of discrete endpoints, but ran into awkward design issues.
The next semi-successful member was BTIC1G, which was basically a stripped down design intended mostly for some robots I was building. This had a different set of requirements, namely a fast/simple encoder which could work incrementally and stream the video as small spans of blocks (via UDP datagrams). This version was basically straight bit-twiddly.
With BTIC1G, I had also discovered a way to more quickly classify and encode block formats, namely measuring some properties of the pixel block and feeding them through some weighting and a weighted decision tree (in some contexts these things are passed off as “neural nets”). This allowed determining in real-time the most likely ideal representation for a block. (Some parts of the design process may have also been assisted using genetic algorithms…)
Then, later, I started work on a design I called BTIC1H, which was originally intended for the same use-case as 1G, and basically replaced the bit-twiddly blocks with Adaptive Rice Coding (and a trick I had called SMTF, which can mimic the behavior of Huffman coding without needing Huffman tables or the severe performance overheads associated with Adaptive Huffman and Arithmetic Coding). It also reintroduced some concepts from 1D and 1F, namely the use of (now optional) YUV420 blocks, and the use of delta-coded color vectors.
Elsewhere, I had begun work on a format called BTIC4A, which was focused initially on fixed-size blocks and packing pixels into blocks, in a manner conceptually similar to BC7, just with 8×8 pixel blocks in 256-bit, 384-bit, or 512-bit forms.
The 4A effort was quickly mutated into BTIC4B, which combined 4A’s block formats with a bitstream format more or less copy/pasted from 1H. Most of the pixel-data is copied as-is into the bitstream. Decoding consists then of reconstructing the blocks from the bitstream, and handing them to a block-decoder.
As a side effect of the bigger 8×8 blocks, though, there was only about 1/4 as much overhead from things like entropy coding, allowing it to go faster; there was also an improvement in compression due to the reduction of per-block side-data. Many blocks use subsampled chroma, ranging from 4:2:0 up to giving the whole 8×8 block a single chroma value (only luma varying).
Likewise, vectors were labeled as (Y, U, V, Dy, Du, Dv), with Y/U/V as a central color, and Dy/Du/Dv being the range relative to the center. Dy is always positive, but Du and Dv may be negative. Unused components may be skipped, and the vector deltas may be based on a difference from a vector predicted via a Paeth predictor (similar to PNG). The vector is then transformed (when decoded) into a pair of color endpoints (in YUV space).
In chroma-subsampled blocks, the U/V components are interpolated to form the U/V for a given pixel sub-block, which is then unpacked into possible pixels based on Y; Y then selects the pixels. In non-subsampled blocks, the endpoints are converted into colors and used directly (thus Y may also determine the chroma value). Though, some blocks may ignore Du/Dv, interpolating only along the Y axis.
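As a sketch, the vector-to-endpoint reconstruction described above might look like the following (the function name is mine, and clamping components to 0..255 is my assumption about the component range):

```python
# Turn a BTIC4B-style color vector (Y, U, V, Dy, Du, Dv) into a pair
# of YUV endpoints: Y/U/V is the central color, and Dy/Du/Dv give the
# range around it (Dy always positive, Du/Dv possibly negative).

def clamp(x):
    return max(0, min(255, x))

def vector_to_endpoints(y, u, v, dy, du, dv):
    lo = (clamp(y - dy), clamp(u - du), clamp(v - dv))
    hi = (clamp(y + dy), clamp(u + du), clamp(v + dv))
    return lo, hi
```

Note that a negative Du or Dv simply flips which endpoint gets the larger chroma value, which is why those components are allowed to be signed.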
Things are pretty variable, but I will note that on my AMD FX I am often seeing decode speeds in the area of 400-600 megapixels per second for a single-threaded decoder, which is plenty fast to be usable. With multiple threads, gigapixel speeds are possible (however, per-thread speed drops sharply as the number of threads increases). I have a dual Xeon E5410 box (a discarded rack server) which, while initially slower, scales a bit better (I have seen decode speeds of around 4 gigapixels/second on this machine).
But, at its core, the “transform” ultimately isn’t that much different from RPZA and CRAM: there are often 1 or 2 bits per pixel (though 3 or 4 are possible), variable block sizes (Y subsampling), and various levels of chroma subsampling.
The Y block has various sizes (none; 2×2, 4×2, 2×4, 4×4, 8×4, 4×8, 8×8) which map to the logical 8×8 output block, and the UV chroma blocks have sizes (none; 2×2, 2×4, 4×4, or 4×8); these in turn come in certain combinations, with various pixel bit-depths (to fit within a size budget).
No real fancy math, just good ol’ bit twiddly.
AdRice and SMTF
AdRice was another nifty trick I ran across. In traditional Rice coding, the ‘k’ value is determined up-front, so two passes are typically used: one pass determines the k values, the other does the actual encoding.
In AdRice, a nifty trick is used, namely, if the Q value (length of the unary prefix) is 0, then k is made smaller; if 1 it remains the same; and if >1 it gets bigger. Starting k is then ‘something sane’, and the rest takes care of itself.
To avoid epic slowness, it is possible to use a lookup table, where both the coded symbol and new k value are held (both the ‘k’ and matched bit pattern are used as the table-index). The range of ‘k’ can also be constrained, and a length-limiting escape case can be used, such that the whole process can be driven by lookup tables (as in a typical Huffman coder).
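A toy sketch of the adaptation rule follows. The bit-level I/O here is deliberately naive (a list of bits) rather than the table-driven form just described, and the starting k and the k range are arbitrary "something sane" choices of mine:

```python
# Adaptive Rice coding per the rule above: after each symbol, if the
# unary prefix Q was 0 then k shrinks, if Q == 1 it stays the same,
# and if Q > 1 it grows. Encoder and decoder apply the same rule, so
# they stay in sync without any side-band information.

def adrice_encode(values, k=4, kmin=0, kmax=12):
    bits = []
    for v in values:
        q, rem = v >> k, v & ((1 << k) - 1)
        bits += [1] * q + [0]                         # unary prefix Q
        bits += [(rem >> (k - 1 - i)) & 1 for i in range(k)]  # k-bit remainder
        if q == 0 and k > kmin:                       # adapt k
            k -= 1
        elif q > 1 and k < kmax:
            k += 1
    return bits

def adrice_decode(bits, count, k=4, kmin=0, kmax=12):
    out, pos = [], 0
    for _ in range(count):
        q = 0
        while bits[pos] == 1:       # read unary prefix (Q)
            q += 1
            pos += 1
        pos += 1                    # skip the terminating 0 bit
        rem = 0
        for _ in range(k):          # read k-bit remainder
            rem = (rem << 1) | bits[pos]
            pos += 1
        out.append((q << k) | rem)
        if q == 0 and k > kmin:     # same adaptation rule as encoder
            k -= 1
        elif q > 1 and k < kmax:
            k += 1
    return out
```

A quick round-trip (encode then decode) is an easy way to sanity-check that the adaptation stays in sync on both sides.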
Unlike Huffman, AdRice has the advantage of allowing encoding/decoding without needing to drag around a Huffman table (potentially a big cost with small payloads; and cases where data may be encoded/decoded in pieces or in no particular order).
Though, Rice may have an obvious limitation: Normally it only works effectively at encoding values following a geometric distribution.
A possible few workarounds exist:
- SMTF: If symbol is small, swap towards front, else rotate space and swap the symbol to the new front (the symbol that “falls off the end” goes into the “hole” left previously).
- STF: Swap towards front (if N>0: swap N with N-1).
- STF2: Swap towards front, but at a fixed ratio (eg: swap N with 7*N/8).
The exact swapping strategy varies, and what is ideal depends on the content. But these basically tend to put symbols in order of descending probability (and thus make the Rice coding happy), and aren’t too terribly expensive (SMTF is usually implemented by masking a rover+index into a buffer; rotating the space consists of decrementing the rover).
STF tends to result in a closest-to-optimal distribution, but suffers from the problem of adapting slowly; STF2 is faster adapting than STF but prone to a limitation that alternating patterns of symbols may get “stuck” swapping with each other (rather than advancing towards the front); …
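The three swap variants above can be sketched as follows. These operate on a symbol-order table directly (index = rank, value = symbol) for clarity; the "near the front" threshold in SMTF is my own guess at a sensible cutoff, and a real implementation would use a rotating rover into a circular buffer rather than physically shifting the list:

```python
# Swap-toward-front variants. In each case `n` is the current rank of
# the symbol just coded; lower ranks code as smaller (cheaper) values.

def stf(tbl, n):
    # STF: swap rank n with rank n-1.
    if n > 0:
        tbl[n], tbl[n - 1] = tbl[n - 1], tbl[n]

def stf2(tbl, n):
    # STF2: swap rank n toward a fixed-ratio position (here 7*n/8).
    m = (7 * n) // 8
    tbl[n], tbl[m] = tbl[m], tbl[n]

def smtf(tbl, n, near=4):
    # SMTF: if the symbol is near the front, swap toward front;
    # otherwise rotate the space by one and swap the symbol to the new
    # front, with the symbol that "falls off the end" landing in the
    # hole the moved symbol left behind.
    if n < near:
        stf(tbl, n)
    elif n == len(tbl) - 1:
        tbl.insert(0, tbl.pop())    # last symbol rotates straight to front
    else:
        tbl.insert(0, tbl.pop())    # rotate space by one
        # symbol (now at n+1) to new front; old last element into the hole
        tbl[0], tbl[n + 1] = tbl[n + 1], tbl[0]
```

With a circular buffer and a rover, the rotation in `smtf` is just a decrement of the rover, which is what keeps it cheap in practice.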
The combination of AdRice and SMTF, while not quite as fast or as good as Static Huffman, generally works pretty decently as a substitute when an adaptive coding is needed or when payloads are small enough that the cost of passing around and initializing the Huffman tables outweighs the costs of AdRice and SMTF.