Author |
Topic: LZ77 Compression (Read 578 times) |
|
AltBas
Full Member
member is offline

JB 1.01 8086 4.77 MHz 640K Win 1.0 quad 160K drives

Posts: 347
|
 |
LZ77 Compression
« Thread started on: May 17th, 2011, 10:09am » |
|
This is my take on LZ77 compression. LZ77 coding looks in a buffer (sliding window) of previously read characters for duplicates of the current search characters, replacing the duplicate characters with an offset to the characters and the number of characters to take. Technically, the offset should be from the end of the buffer, but I used the beginning of the buffer to cut out a step of converting to/from LEN(Buffer$) - Offset. This is a simple encoding scheme that is "tuneable". I chose 12 bits of offset (a sliding window of 4096 bytes) and 3 bits length (8 characters) biased by 3 (because we don't replace less than 3 characters) because this seems to give the best general compression. Other values that fit in the bit coding size are 11 bits Offset and 4 bits Length (2048 byte buffer / 16+3 length bytes copy) and 10 bit Offset and 5 bits Length (1024 byte buffer / 32+3 byte Length). The more redundant the data source, the smaller the buffer you can use, and larger Copy Length. The first part is an "8 Bit Lite" version that remaps ASCII char > 128 into unused characters. The second post of this thread is a full 8 bit compression, but it doesn't compress as much as the "Lite" version. It also shows how to encode a multiple choice run Length. Aren't bits wonderful?
Code Part I - 7 Bit LZ77 Compression Code:'*** LZ_77.Jb - An LZ77 Implementation ***
'* Code is in the public domain *
'* Marker of File is chr(22) - "L" xor "Z" = 22 *
'* Followed by File Type byte *
'* More info at http://en.wikipedia.org/wiki/LZ77_and_LZ78 *
GLOBAL k4K
dim Pwr(31), PwrM1(31), Cnt(255), FList$(1,5), Runs(4)
k4K = 4096 '* Sliding window constant *
for a = 0 to 31 '* Powers of 2 and bitmasks *
Pwr(a) = 2^a : PwrM1(a) = Pwr(a) - 1
next a
k4K = 4096 '* Sliding window constant *
EnCode = 1
Src$ = "contest.txt"
open Src$ for input as #1
Buf$ = input$(#1, LOF(#1))
CLOSE #1
if EnCode = 1 then
call CompressLZ Src$, Dst$
else
call ExpandLZ Src$, Dst$
end if
END
'======================================================================
'*** Determine if a File Exists ***
'======================================================================
FUNCTION FileExist(Mask$)
FILES "", Mask$, FList$()
FileExist = val(FList$(0,0))
END FUNCTION
'======================================================================
'*** Determine which Procedure to Compress File with ***
'======================================================================
SUB CompressLZ Src$, Dst$
if FileExist(Src$) = 0 then
print "File ";Src$; " not found." : exit sub
end if
open Src$ for input as #1 '*** File ID Section ***
Buf$ = input$(#1,LOF(#1)) : LB = len(Buf$)
close #1
for a = 1 to LB '* Determine character distribution ***
Tmp = asc(mid$(Buf$,a,1))
Cnt(Tmp) = Cnt(Tmp) + 1
next a
for a = 0 to 127
if Cnt(a) > 0 then LoCnt = LoCnt + 1
next a
for a = 128 to 255
if Cnt(a) > 0 then HiCnt = HiCnt + 1
next a
if HiCnt = 0 then
FID = 127 '* 7 bits, w/ 0 bit marker *
else
'* Can remap many char if HiCnt + LoCnt < 128 *
if HiCnt + LoCnt < 128 then
FID = 128 '* 7 bit re-map *
Tmp = 1
for a = 128 to 255 '* Re-map HiChars around LoChar *
if Cnt(a) > 0 then
DO while Cnt(Tmp) > 0
Cnt(Tmp) = Tmp '* "Re-map" LoChar to self Look-Up *
Tmp = Tmp + 1 '* Find unused "Char" *
LOOP
Cnt(Tmp) = a : Cnt(a) = Tmp : Tmp = Tmp + 1
end if
next a
ReMapCnt = Tmp - 1
else '*** 8 Bit File ***
FID = 255
end if
end if
select case FID
case 127 : Size = EncodeLZ7( Dst$, FID, Buf$, 0)
case 128 : Size = EncodeLZ7( Dst$, FID, Buf$, ReMapCnt)
case 255 : Size = EncodeLZ8( Dst$, FID, Buf$)
end select
END SUB
'======================================================================
'*** Determine which Procedure to Expand File with ***
'======================================================================
SUB ExpandLZ Src$, Dst$
if FileExist(Src$) = 0 then
print "File ";Src$; " not found." : exit sub
end if
redim Cnt(255)
open Src$ for input as #1
Buf$ = input$(#1,LOF(#1)) : LB = len(Buf$)
close #1
if asc(left$(Buf$,1)) = 22 then
select case asc(mid$(Buf$, 2, 1))
case 127 : Buf$ = mid$(Buf$,3) '* Only LoAsc char *
Expand = DecodeLZ7( Dst$, Buf$, 0)
case 128 : ReMapCnt = asc(mid$(Buf$,3,1)) '* Re-mapped HiAsc *
for a = 1 to ReMapCnt
Cnt(a) = asc(mid$(Buf$,a+3,1))
next a
Buf$ = mid$(Buf$,a+3)
Expand = DecodeLZ7( Dst$, Buf$, ReMapCnt)
case 255 : Buf$ = mid$(Buf$,3) '* Binary file *
Expand = DecodeLZ8( OutFile$, Src$)
end select
end if
END SUB
'======================================================================
'*** Encode 7 Bit LZ, possibly with Character Re-Map ***
'======================================================================
FUNCTION EncodeLZ7( OutFile$, FID, BYREF Buf$, ReMapCnt)
LB = len(Buf$)
open OutFile$ for output as #1
Hdr$ = chr$(22); chr$(FID)
if ReMapCnt > 0 then
Hdr$ = Hdr$; chr$(ReMapCnt)
for a = 1 to ReMapCnt : Hdr$ = Hdr$; chr$(Cnt(a)) : next a
end if
print #1, Hdr$; '* Suppress CR/LF *
ptr = 4
Out$ = mid$(Buf$, 1, 3) '* Initial un-compressible string *
Wndw$ = Out$
DO
C$ = mid$(Buf$, ptr, 3) '* Get 3 chars to scan for *
if len(C$) = 3 then
optr = ptr
ptr = ptr + 2
pWndw = 1
for a = 0 to 7 '* Look for string matching C$ in Wndw$ *
p = instr(Wndw$, C$, pWndw)
if p = 0 then
if a > 0 then '* Prev match, but not this char *
LZ = 32768 + (pWndw * 8) + (len(C$) - 4)
'* UnFound C$ has extra char, don't count it *
ptr = ptr - 1 '* Back up ptr from miss *
end if
exit for
end if
pWndw = p
if ptr < len(Buf$) AND len(C$) < 10 then
'* Only replacing 3 bytes or more, so bias length info *
'* 000 is 3 char; 111 is 10 char *
ptr = ptr + 1
C$ = C$ + mid$(Buf$, ptr, 1)
else
LZ = 32768 + (pWndw * 8) + (len(C$) - 3) '* Maxed out *
exit for
end if
next a
if LZ > 0 then
Tmp = (LZ AND 7) + 3
Out$ = Out$ + chr$(int(LZ/256)) + chr$(LZ AND 255)
Wndw$ = Wndw$ + mid$(Buf$, optr, Tmp)
LZ = 0
else
if asc(C$) > 127 then
Out$ = Out$ + chr$(Cnt(asc(C$)))
else
Out$ = Out$ + left$(C$,1)
end if
Wndw$ = Wndw$ + left$(C$,1)
ptr = optr
end if
Wndw$ = right$(Wndw$, k4K) '* Shift window right *
if len(Out$) > k4K then print #1, Out$; : Out$ = ""
else
Out$ = Out$ + C$
exit do '* EOF - Exit *
end if
ptr = ptr + 1
LOOP until ptr > LB
print #1, Out$; '* Suppress CR/LF *
EncodeLZ7 = LOF(#1)
CLOSE #1
END FUNCTION
'======================================================================
'*** Decode 7 Bit LZ, possibly with Character Re-Map ***
'======================================================================
FUNCTION DecodeLZ7( OutFile$, BYREF Src$, ReMapCnt)
'* FID & ReMap$ read prior to entry, not in Src$ buffer *
open OutFile$ for output as #1
Buf$ = left$(Src$, 3) '* 1st 3 char not compressed *
LB = len(Src$)
Wndw$ = Buf$
ptr = 4
DO
C$ = mid$(Src$, ptr, 1)
if asc(C$) > 127 then '* If HiBit set, then LZ code *
ptr = ptr + 1
LZ = (asc(C$) * 256) + asc(mid$(Src$, ptr, 1))
Offset = (int(LZ / 8) AND (k4K-1)) '* Loc to copy from *
Cnt = (LZ AND 7) + 3 '* # Char to copy *
C$ = mid$(Wndw$, Offset, Cnt)
else
if asc(C$) <= ReMapCnt then C$ = chr$(Cnt(asc(C$)))
'* Look-up real HiAsc char *
end if
Buf$ = Buf$ + C$
Wndw$ = right$(Buf$, k4K) '* Shift window (if able) *
if len(Buf$) > 8192 then
print #1, left$(Buf$, k4K);
Buf$ = mid$(Buf$, k4K+1)
end if
ptr = ptr + 1
LOOP until ptr > LB
print #1, Buf$;
DecodeLZ7 = LOF(#1)
CLOSE #1
END FUNCTION - AltBas
|
| « Last Edit: May 18th, 2011, 2:41pm by AltBas » |
Logged
|
|
|
|
AltBas
Full Member
member is offline

JB 1.01 8086 4.77 MHz 640K Win 1.0 quad 160K drives

Posts: 347
|
 |
Re: LZ77 Compression
« Reply #1 on: May 17th, 2011, 10:11am » |
|
Code Part II - 8 Bit LZ77 Compression Put this together with Part I Code for a simple LZ77 encoder. The multiple choice run Length encoding was inspired by reading about LZMA compression on Wikipedia. There are many variations in the LZ family. Code:
'* Code is in the public domain *
'======================================================================
'*** Encode 8 Bit LZ ***
'======================================================================
FUNCTION EncodeLZ8( OutFile$, FID, BYREF Buf$)
LB = len(Buf$)
open OutFile$ for output as #1
'* Hdr$ = chr$(22); chr$(FID)
print #1, chr$(22); chr$(FID); '* Suppress CR/LF *
ptr = 4
Out$ = mid$(Buf$, 1, 3) '* Initial un-compressible string *
Wndw$ = Out$
DO
C$ = mid$(Buf$, ptr, 3) '* Get 3 chars to scan for *
if len(C$) = 3 then
optr = ptr
ptr = ptr + 2
pWndw = 1
for a = 0 to 71 '* Look for string matches *
p = instr(Wndw$, C$, pWndw)
if p = 0 then
if a > 0 then '* Prev match, but not this char *
C$ = left$(C$, len(C$)-1) '* UnFound C$ has extra char *
ptr = ptr - 1 '* Back up ptr *
gosub [SetCode]
end if
exit for
end if
pWndw = p
if ptr < LB AND a < 71 then
ptr = ptr + 1
C$ = C$ + mid$(Buf$, ptr, 1) '* Get another char *
else
gosub [SetCode] '* Max run length or EOB, exit loop *
exit for
end if
next a
if LZ > 0 then
if Bits = 18 then
Tmp = (LZ AND 7) + 3 '* Adjust buffer ptr *
else
Tmp = (LZ AND 63) + 11
end if
Wndw$ = Wndw$ + mid$(Buf$, optr, Tmp)
else
if asc(C$) > 127 then
Bits = 9
LZ = asc(C$) + 256 '* Set Bit 8 for HiAsc *
else
Bits = 8
LZ = asc(left$(C$,1))
end if
Wndw$ = Wndw$ + left$(C$,1)
ptr = optr
end if
call PushStack LZ, Bits, Stk, TOS
LZ = 0
DO while TOS > 7
Out$ = Out$ + chr$(PopStack(8, Stk, TOS))
LOOP
Wndw$ = right$(Wndw$, k4K) '* Shift window right *
if len(Out$) > k4K then print #1, Out$; : Out$ = ""
else
'* EOF - Exit *
for a = 1 to len(C$) '* Len is 1 or 2, so no run *
call PushStack asc(mid$(C$,a,1)), 8, Stk, TOS
Out$ = Out$ + chr$(PopStack(8, Stk, TOS))
next a
DO while TOS > 7
Out$ = Out$ + chr$(PopStack(8, Stk, TOS))
LOOP
exit do
end if
ptr = ptr + 1
LOOP until ptr > LB-2
if Stk > 0 or TOS > 0 then '* Must ALWAYS 0 TOS when encoding *
call PushStack 0, 8-TOS, Stk, TOS
Out$ = Out$ + chr$(PopStack(8, Stk, TOS))
end if
print #1, Out$; '* Suppress CR/LF *
EncodeLZ8 = LOF(#1)
CLOSE #1
EXIT FUNCTION
[SetCode]
if a < 9 then '* C$ is already 3 char when a = 0 *
Bits = 18
LZ = 2^17 + (pWndw * 16) + 0 + (len(C$) - 3)
'* 10 dddddddddddd 0 lll = 18 bits *
else
Bits = 21
LZ = 2^20 + (pWndw * 128) + 64 + (len(C$) - 11)
'* 10 dddddddddddd 1 llllll = 21 bits *
end if
RETURN
END FUNCTION
'======================================================================
'*** Decode 8 Bit LZ ***
'======================================================================
FUNCTION DecodeLZ8( OutFile$, BYREF Src$)
open OutFile$ for output as #1
Buf$ = left$(Src$, 3) '* 1st 3 char not compressed *
LB = len(Src$)
Wndw$ = Buf$
ptr = 4
DO
DO while (ptr <= LB) and (TOS < 21)
call PushStack asc(mid$(Src$, ptr, 1)), 8, Stk, TOS
ptr = ptr + 1
LOOP
gosub [PopBytes]
Wndw$ = right$(Buf$, k4K) '* Shift window (if able) *
if len(Buf$) > 8192 then
print #1, left$(Buf$, k4K);
Buf$ = mid$(Buf$, k4K+1)
end if
LOOP until ptr > LB
if Stk > 0 or TOS > 7 then
if TOS < 8 then call PushStack 0, 8-TOS, Stk, TOS
gosub [PopBytes]
end if
print #1, Buf$;
DecodeLZ8 = LOF(#1)
CLOSE #1
EXIT FUNCTION
[PopBytes]
if PopStack( 1, Stk, TOS) = 0 then
C$ = chr$(PopStack( 7, Stk, TOS)) '* 7 bit char 0bbb bbbb *
else
if PopStack( 1, Stk, TOS) = 0 then '* LZ code *
Offset = PopStack( 12, Stk, TOS)
if PopStack( 1, Stk, TOS) = 0 then
Cnt = PopStack( 3, Stk, TOS) + 3
else
Cnt = PopStack( 6, Stk, TOS) + 11
end if
C$ = mid$(Wndw$, Offset, Cnt)
else '* 8 bit char *
C$ = chr$(PopStack( 7, Stk, TOS) + 128) '* 8 bit 1 1bbb bbbb *
end if
end if
Buf$ = Buf$ + C$
RETURN
END FUNCTION
'======================================================================
' Put a value on the Bit Stack
'======================================================================
SUB PushStack Char, NumBits, BYREF Stk, BYREF TOS
'* Char is value to put on stack *
'* NumBits is the bit size *
'* Stk = Stack variable *
'* TOS = Top of stack pointer *
if Char<Pwr(NumBits) then
Stk=(Stk * Pwr(NumBits)) '* Push up stack *
Stk=Stk OR Char '* Add Char to the stack *
TOS=TOS+NumBits
end if
END SUB
'======================================================================
' Get a value from the Bit Stack
'======================================================================
FUNCTION PopStack(NumBits, BYREF Stk, BYREF TOS)
'* NumBits is the bit size *
'* Stk = Stack variable *
'* TOS = Top of stack pointer *
if NumBits<=TOS then
TOS=TOS-NumBits
PopStack=int(Stk / Pwr(TOS))
Stk=Stk AND PwrM1(TOS)
else
PopStack=-1 '* Error ! *
end if
END FUNCTION
- AltBas
|
| « Last Edit: May 17th, 2011, 10:20am by AltBas » |
Logged
|
|
|
|
tomc
Senior Member
member is offline


Posts: 505
|
 |
Re: LZ77 Compression
« Reply #2 on: May 17th, 2011, 11:50am » |
|
I find every compression system to be fascinating !
|
|
Logged
|
|
|
|
tomc
Senior Member
member is offline


Posts: 505
|
 |
Re: LZ77 Compression
« Reply #3 on: May 17th, 2011, 5:06pm » |
|
I have wrote the following LZW routine ===> http://www.dspguide.com/ch27/5.htm and run out of table entries around record 10,000. Another article informed me that LZW 'proper' does not address the table overflow problen. I know there are other articles on how to either flush the table (starting anew) or dropping 'bad' entries in favor of others (which is probably better). . Can anyone point me to a better link? . For now I am not interested in speed enhancements, only in compression and table modification.
Thanks !
|
|
Logged
|
|
|
|
|