BT FastVQ

General:
 * Info about performance tradeoffs for blocky VQ.
 * For sake of simplicity here, I will assume 4x4 blocks.

Representation of VQ blocks:
 * Two or more color endpoints
 * May be stored directly.
 * Ex: RGBa, RGBb
 * May be stored center/side or similar.
 * R,G,B,D: RGBa=(R+D,G+D,B+D), RGBb=(R-D,G-D,B-D)
 * Y,U,V,D: Ya=Y-(D>>1), Yb=Ya+D, YUVa=(Ya, U, V), YUVb=(Yb, U, V)
 * Y,U,V,Dy,Du,Dv:
 * YUVa=(Y-(Dy>>1), U-(Du>>1), V-(Dv>>1)), YUVb=YUVa+(Dy, Du, Dv)
 * YxYxB bits of pixel data.
 * B is typically 1-3 bits.
 * 1: selects one endpoint or another
 * 2: selects either endpoint, or one of two interpolated endpoints.
 * 3: similar, but there are 6 interpolated points.
 * There may be 'partitions' or other special features.

Naive Approach
Faced with options, many people seem to try to use brute force strategies for the encoder, systematically trying each possibility and picking the "best" result. This is very slow, so is not what I am describing.

In the name of speed, a simple filter approach is needed, and heuristics need to drive choices. This isn't as easy as brute force, but can give acceptable results.

General Approach
The general strategy for encoding a block looks like this:
 * Input filtering.
 * Transform RGB or YUV input into an intermediate form.
 * Selection Heuristics
 * Use filtered input to make block encoding choices.
 * Unnecessary if there is only a single output format.
 * Encode Block
 * Emit the bits or bytes for the chosen format.

Wherever possible, avoid use of floating-point or division, as these operations are slow:
 * Int<->FP conversion is rather expensive, and can kill a block converter.
 * Division may be potentially rather slow:
 * Some x86 processors have notoriously slow idiv instructions.
 * May be implemented as library code on embedded targets.
 * For cases where it is needed, the calculations may be cached in LUTs.

Input Filters
Multiple input filter strategies exist, depending on tradeoffs being made and the nature of the output formats.

For formats where RGB endpoints are used from RGB input, it may make sense to cache representative endpoints to use in the output calculations.

Generally, YUV-like math is used here a lot as it is better suited for many calculations than working directly in RGB. Y is a very useful value, and drives most of the other calculations. U and V hold color, and can be abused a lot more in calculations with less visible effect.

For RGB endpoints, it may make sense to use output color endpoints which are a weighted average of the calculated endpoint with the representative endpoints for this axis.

The calculated endpoint represents what the bottom-range pixel "should" look like according to the color-space, and the representative endpoints show pixel values which are actually in the image (if we assume pixels of similar Y values tend to also have similar colors).

In other cases (where only Y is used), weighted averages of the representative endpoints and average color may be used as the color endpoints.

Single Pass Y Filter
Simple and reasonably fast.

Single Pass Input Filter (Y):
 * Loop over pixel data.
 * May be done per-pixel, or for groups of pixels.
 * Calculate a Y value
 * Such as: Y=(R+2G+B)>>2;
 * Check against current min and max Y.
 * Select Y value (and possibly the RGB point) if it is a better option.
 * Selected RGB points would represent the min and max RGB.
 * Cache Y value for each pixel.
 * (Optional) Accumulate and calculate average RGB (N/A for YUV).
 * Alternate: Assume: RGBavg=(RGBmin+RGBmax)/2;

Example (4:2:0 YUV output from RGBx input): if((xstride==4) && !(tflip&1)) {       for(i=0; i<2; i++) for(j=0; j<2; j++) {           p0=i<<1; p1=p0+1; p2=(j<<3); pxy2=pxy+((p0<<2)+(j<<1)); rgb0=rgb+p0*ystride+p2;       rgb1=rgb+p1*ystride+p2; cr0=rgb0[0]; cg0=rgb0[1]; cb0=rgb0[2]; cr1=rgb0[4]; cg1=rgb0[5]; cb1=rgb0[6]; cr2=rgb1[0]; cg2=rgb1[1]; cb2=rgb1[2]; cr3=rgb1[4]; cg3=rgb1[5]; cb3=rgb1[6]; cr=(cr0+cr3)>>1; cg=(cg0+cg3)>>1; cb=(cb0+cb3)>>1; cy0=((cg0<<2)+3*cr0+cb0+4)>>3;   cy1=((cg1<<2)+3*cr1+cb1+4)>>3; cy2=((cg2<<2)+3*cr2+cb2+4)>>3;   cy3=((cg3<<2)+3*cr3+cb3+4)>>3; cu=((2*cb-cr-cg)>>2)+128;       cv=((2*cr-cb-cg)>>2)+128; pxy2[0]=cy0;   pxy2[1]=cy1;    pxy2[4]=cy2;    pxy2[5]=cy3; l0=(cy0cy3)?cy0:cy3;   l3=(cy1>cy2)?cy1:cy2; mcy=(l0l3)?l2:l3; k=i*2+j; pmcy[k]=mcy; pncy[k]=ncy; pxu[k]=cu; pxv[k]=cv; }       return; }

If dealing with 32-bit RGBx input, use approximate YUV conversion to produce YUV 4:2:0 output.
 * xstride=pixel size
 * ystride=scanline size
 * rgb=RGBx input
 * pxy=pixel Y output (4x4 block)
 * pxu/pxv=pixel U/V output (2x2 block)
 * pmcy/pncy=pixel min/max for each 2x2 sub-block.

This is for dealing with a block format which may uses 4:2:0 chroma sub-sampled YUV.

The loop can be generalized to deal with more range of inputs, but generalizing the loop tends to make it slower (the xstride and tflip mask is used to control the specifics of the filter, such as 24 bit and BGR formats, which are separate not-shown loops).

Alternately, a higher-fidelity YUV conversion could be used: cy0=(77*cr0 + 150*cg0 + 29*cb0 + 127)>>8; cy1=(77*cr1 + 150*cg1 + 29*cb1 + 127)>>8; cy2=(77*cr2 + 150*cg2 + 29*cb2 + 127)>>8; cy3=(77*cr3 + 150*cg3 + 29*cb3 + 127)>>8; cr=(cr0+cr3)>>1; cg=(cg0+cg3)>>1; cb=(cb0+cb3)>>1; cu=((- 43*cr - 85*cg +128*cb + 127)>>8)+128; cv=(( 128*cr -107*cg - 21*cb + 127)>>8)+128;

However, such a color transform is a bit more expensive.
 * For extra cost, you can also average all 4 pixels for calculating U and V.
 * Or, hell, calculate YUV for all 4 pixels and then average the U and V values.
 * "Aint nobody got time for that!"

Single Pass CMY Filter
Better quality for some output formats than a simple Y filter. It is slower but may be more accurate.

Mostly an option when:
 * The blocks may store a pair of color endpoints, rather than RGBD or YUVD.
 * Partitions may be done this way.

Single Pass Input Filter (CMY):
 * Calculate C,M,Y values in parallel:
 * C=(R+3G+4B)>>3; //Cyan Axis
 * M=(4R+G+3B)>>3; //Magenta Axis
 * Y=(3R+4G+B)>>3; //Yellow Axis
 * Check against current min/max for each axis.
 * Select values and endpoints for axes independently.
 * Pick highest contrast axis as primary.
 * Potentially use second place axis if relevant.

Simple Partitions:
 * Start with a zero bitmask.
 * Loop over pixels.
 * Subtract Primary Axis Y from secondary axis Y.
 * Set bit in bitmask if ((Yb-Ya)>0) or alternatively (abs(Yb-Ya)>Eb).
 * Eb would be an epsilon calculated from endpoints.
 * Feed bitmask through a lookup table to get partition number.
 * May need to precompute LUT to map each bitmask to its closest partition.
 * The block encoding step would use the partition number to know which axis to use when encoding each pixel.

Triple Pass Filter
Triple Pass Input Filter:
 * Slow but fairly accurate.
 * Viable for offline/batch use.
 * First Pass: Find average of pixels.
 * Second Pass: Calculate 'mass' vector.
 * Relative to the center point
 * dx=Px-Ox; dy=Py-Oy; dz=Pz-Oz;
 * V=V+(dx^2, dy^2, dz^2)
 * or, use a matrix:
 * M=M+[
 * [ dx*dx, dx*dy, dx*dz ],
 * [ dy*dx, dy*dy, dy*dz ],
 * [ dz*dx, dz*dy, dz*dz ] ]
 * Normalize vector or find and normalize largest vector from matrix.
 * Matrix allows more flexibility or the choice of multiple axes.
 * Loosely similar in premise to the CMY filter.
 * Third Pass:
 * Loop over pixels, calculate Y and select according to prior axis (or axes).
 * Y calculations are essentially a fixed-point dot product.

Block-Type Selection Heuristics
A bit too big of a topic for here.

General idea is once you have the data from the input filter, information about the Y range, U/V ranges, ... can be evaluated to select the most appropriate block format. Some other info may be calculated from the values generated by the input filter, such as averages (range midpoints), or figuring out what the final endpoint colors will be.

General ideas:
 * If value ranges are small, choose the "cheapest" option.
 * If high contrast, favor blocks with more bits for interpolation.
 * If low contrast, favor blocks with fewer pixel bits and better color bits.
 * Contrast between primary and secondary axis effects selection blocks:
 * Low contrast, favor a single-axis block format;
 * High contrast, favor a multi-axis block format.
 * Ex: ones with more chroma resolution, or ones using partitions.

Output Block
Encoding Output Block (for the chosen block format):
 * Emit color endpoint info;
 * Calculate range scales and similar;
 * Example:
 * Calculate Ysc=(32768-K)/(Ymax-Yavg+1)
 * Subject to being optimized with a LUT.
 * K is a fine-tuning constant (ex: 3072 .. 6144).
 * Convert Y values and similar into output bits.
 * It may make sense to fully unroll this loop.

Example Code (Simple 4x4x2 block, emitting output bits): p0=idxtab[((pxy[ 0]-acy)*l1+l3a)>>13]; p1=idxtab[((pxy[ 1]-acy)*l1+l3b)>>13]; p2=idxtab[((pxy[ 2]-acy)*l1+l3a)>>13]; p3=idxtab[((pxy[ 3]-acy)*l1+l3b)>>13]; block[4]=(p0<<6)|(p1<<4)|(p2<<2)|p3; p0=idxtab[((pxy[ 4]-acy)*l1+l3b)>>13]; p1=idxtab[((pxy[ 5]-acy)*l1+l3a)>>13]; p2=idxtab[((pxy[ 6]-acy)*l1+l3b)>>13]; p3=idxtab[((pxy[ 7]-acy)*l1+l3a)>>13]; block[5]=(p0<<6)|(p1<<4)|(p2<<2)|p3; p0=idxtab[((pxy[ 8]-acy)*l1+l3a)>>13]; p1=idxtab[((pxy[ 9]-acy)*l1+l3b)>>13]; p2=idxtab[((pxy[10]-acy)*l1+l3a)>>13]; p3=idxtab[((pxy[11]-acy)*l1+l3b)>>13]; block[6]=(p0<<6)|(p1<<4)|(p2<<2)|p3; p0=idxtab[((pxy[12]-acy)*l1+l3b)>>13]; p1=idxtab[((pxy[13]-acy)*l1+l3a)>>13]; p2=idxtab[((pxy[14]-acy)*l1+l3b)>>13]; p3=idxtab[((pxy[15]-acy)*l1+l3a)>>13]; block[7]=(p0<<6)|(p1<<4)|(p2<<2)|p3;

Bit-packed representations may be similar, except you may emit blocks of bits instead of storing bytes bytes.

Where, in this example:
 * l1=Ysc;
 * l3a/l3b=biases (65536 +/- 1024).
 * Two biases are used for dithering purposes.
 * idxtab is an index table used to map fixpoint values to output values.
 * An index is optional if the indices map linearly and values are adjusted.
 * This can potentially make it a little faster.
 * For some formats, it may make sense to select between two index tables.
 * They may save 1 bit via endpoint ordering.
 * The index table used depends on this bit.