Gunzip for MSX

By Grauw

Enlighted (8190)

18-10-2015, 00:54

Lately I’ve been working on a gzip (.gz) decompression implementation for MSX, which as of yesterday is functional. Although I’ve been developing it for VGMPlay, I’ve implemented it as a stand-alone “gunzip” utility.

The gunzip project page is here:

https://bitbucket.org/grauw/gunzip

The latest binary can be downloaded here:

https://bitbucket.org/grauw/gunzip/downloads

And of course it is open source and liberally licensed, so you can reuse the code in any way you want.

By Grauw

Enlighted (8190)

09-11-2015, 20:50

I’m pretty happy with the end result because not only does it successfully decompress gzip files, it is also pretty efficient in doing so. Here are some test results with the first disk of Aleste 2, comparing various decompression formats for MSX:

Test on Philips NMS 8245 MSX2 (openMSX) with Sunrise IDE:

  • gunzip 1.0 (244129 bytes): 133s
  • gunzip 1.1 (244129 bytes): 87s
  • pmext 2.22 (247552 bytes): 619s
  • lhext 1.33 (252614 bytes): 278s
  • tunzip 0.91 (247414 bytes): 341s

Test on Panasonic FS-A1GT turboR (openMSX) with Sunrise IDE:

  • gunzip 1.0 (244129 bytes): 26s
  • gunzip 1.1 (244129 bytes): 18s
  • pmext 2.22 (247552 bytes): 127s
  • lhext 1.33 (252614 bytes): 49s
  • tunzip 0.91 (247414 bytes): 46s

Note that this is just a test with a single file, but it shows the general trend. I’m pretty happy to see this result; it looks like I made the right choice in the approach I took to decoding the Huffman tree.

Skipping the checksum verification further improves speed by 25% (turboR: 20s / 13s, MSX2: 100s / 55s), but I’ll let the code settle for a while before I expose that as a command line option.

Finally, I’ve tested with a fairly large set of gzip files, but if you encounter any errors, please send me the gz file to test with and I will fix it.

[Edit: Added gunzip 1.1 performance results.]

By Grauw

Enlighted (8190)

18-10-2015, 04:00

For those interested in a little development background,

The good decompression performance comes primarily from code generation: I generate the Huffman decoding tree as a series of Branch objects which each decode one bit. Each consists of a “fetch bit” call followed by a jump to either another branch or a symbol handler. PC implementations often use lookup tables for optimisation, but since bit shifting by arbitrary amounts is expensive on the Z80 CPU and registers are scarce (I use all of them in the Huffman decoding loop), I think this is probably the best approach here.
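
To illustrate the idea, a generated branch node could look something like this (a minimal sketch, not the actual generated code; the labels are hypothetical and Reader_ReadBit stands in for the bit fetch, assumed to return the next input bit in the carry flag):

Branch_3A:
	call Reader_ReadBit	; fetch one bit from the input stream
	jp c,Branch_3B		; bit 1: continue with the next branch object
	jp Symbol_Literal	; bit 0: leaf reached, jump to the symbol handler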

[ Aside: some detail on gzip’s deflate algorithm (also used by zip and png): it consists of two layers, Huffman coding and LZ coding. Huffman encodes bytes as codes of variable bit length, e.g. if “A” occurs 100 times and “Q” just once, A might get the binary code 001 while Q gets 0000001. Next comes the LZ coding, which basically says either “write this value” or “copy n values that were written earlier”; for example “go back 3 and copy 6” turns ABC into ABCABCABC. For the Z80, the Huffman part is the most challenging. ]

An additional important practice for fast performance is that my input and output buffers are 256-byte aligned, so that when reading or writing bytes I can use quick 8-bit value increments and checks, and only for every 256th value read or written do I need to check whether I crossed the end of the buffer. When that happens it executes the loading and saving of data and the CRC32 calculation, which operate on 4K or 32K blocks of data.
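
As an illustration, writing one byte to such a buffer could look roughly like this (a sketch rather than the actual gunzip code; OutputBufferEnd and OutputBufferFlush are hypothetical names):

WriteByte:
	ld (hl),a		; store the decoded byte
	inc l			; cheap 8-bit increment thanks to the 256-byte alignment
	ret nz			; still inside the current 256-byte page
	inc h			; crossed a page boundary: advance the high byte
	ld a,h
	cp OutputBufferEnd / 256	; reached the end of the buffer?
	call z,OutputBufferFlush	; if so: save the block, update the CRC32, reset HL
	ret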

When I first functionally completed the implementation yesterday afternoon, these two things combined resulted in a running time of 37 seconds to decompress the Aleste 2 disk image, which already outperformed the existing compression tools for MSX. (Except for BitBuster-based compression used in a lot of games and SofaPak.)

Then I started optimising. I postponed this until the algorithm was fully implemented and the program functional, to stay focused on the functional goal and prevent myself from getting lost in messy code. I also measured every optimisation, and reverted them if they weren’t worth it. Performance optimisation generally does not make code better, just faster.

First I inlined the bit reading which reduced the running time to 29 seconds (22% faster). After that I did some more optimisations which improved speed by 4% (method variation with other registers), 0.8% (inlined method), 5% (unrolled loop), 3% (self-modifying code) and 1.5% (same), and then I ended up at the current 26s.

I focused these optimisations on the execution hotspots which actually used up the most CPU cycles. There are lots of places elsewhere in the program which could be optimised, e.g. the Huffman branch objects are allocated on the heap, which is expensive and does lots of integrity checking, whereas I could have used stack allocation. I also do a lot of checks and assertions for cases which should not normally occur. But when it does not yield any real performance gain, it’s better to keep the code readable, maintainable and flexible.

To summarise it in a few simple rules of thumb: 1. choose efficient data structures, 2. don’t optimise until the very end, 3. only optimise the hot code (and ignore the rest), and 4. measure all your performance improvements.

Another fun fact, I first implemented gunzip in JavaScript to familiarise myself with the algorithm before diving into Z80 assembly code.

As I mentioned, I started working on this gunzip implementation because I wanted to add support to VGMPlay for directly playing back VGZ files, so to keep the loading time short for the user it needs to decompress as quickly as possible. Eventually I want to try decompressing during playback to reduce the loading time even further, which means the decompression needs to be fast enough to stay ahead of the music playback position.

By Louthrax

Prophet (2067)

18-10-2015, 03:34

I had the same 256-byte aligned buffers trick (to compare only 1 register) in SofaPak :)

The .gz decompression speed you have here is amazing compared to previous software, and it handles an existing format!
So .gz does not handle multiple files, right (and only contains data, no headers or anything)? Have you tested how good the compression ratio is compared to .zip / .pma?

Also, I was wondering how different the .gz compression format is from the .zip DEFLATE method? If it's not so different, I have all the code to handle "zip-like" headers, long file names (with 8.3 renaming) and sub-directories for SofaPak. I was thinking about creating a fast unzipper with long file name & directory support for MSX (an improved version of tunzip?).

By Grauw

Enlighted (8190)

18-10-2015, 04:02

Yeah, gzip just compresses one file; for multiple files it’s often combined with tar. It does contain a small header and trailer with the original filename, timestamp, CRC and size, but not much more than that.
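
For reference, a gzip member is laid out roughly like this (per RFC 1952; which optional fields are present depends on the flag byte):

	1F 8B   magic bytes
	08      compression method (deflate)
	FLG     flag bits (FTEXT, FHCRC, FEXTRA, FNAME, FCOMMENT)
	MTIME   modification time, 4 bytes little-endian
	XFL OS  extra flags and originating OS
	...     optional fields, e.g. the original filename (zero-terminated)
	...     deflate-compressed data
	CRC32   CRC-32 of the uncompressed data, 4 bytes
	ISIZE   uncompressed size modulo 2^32, 4 bytes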

Compression ratio is comparable between all of them (I listed them above). For best gzip compression ratio use Google’s Zopfli, it saves a few % compared to the GNU implementation.

Gzip uses DEFLATE, just like zip and png, so it’s algorithmically identical. Feel free to reuse the code in any way you wish :).

By ricbit

Champion (436)

18-10-2015, 06:16

That's a cool implementation! When I was writing the Shalom translation I wrote a Huffman decoder in Hi-Tech C using a little bit of asm:

https://github.com/ricbit/Oldies/tree/master/2001-08-mis

I was however more interested in the encoding than in the decoding. Traditional Huffman has one symbol per character, and my plan was to have symbols for a variable number of characters. For instance you could have the symbol 001 for "A" but also 0101 for "THE" if the word "the" was very common, which is the case for the corpus of a text-rich game like Shalom.

Encoding this way, however, is NP-hard, so I had to resort to some heuristics to get a good covering of the text (I used heuristics based on the MIS algorithm for FPGA boolean logic optimization).

By bore

Expert (115)

18-10-2015, 11:57

The UPX packer uses a neat optimization for the bit-shifter that you might be able to borrow.
Instead of having a bit counter you can use a termination bit.
When loading a new byte and shifting out the first bit you set the most significant bit as a terminator.
When shifting out bits normally you do a logical shift to shift in zeroes.
This means the Z flag is guaranteed not to be set until you shift out the termination bit.

Reader_ReadBitInline: MACRO
	srl b	; Shift out bit, shift in zero
	call z,Reader_ReadBitInline_NextByte
	ENDM

Reader_ReadBitInline_NextByte:
	ld c,a
	call Reader_Read
	scf	; Set carry to terminator bit
	rra	; Shift in terminator bit, shift out first bit
	ld b,a
	ld a,c
	ret

It saves a dec from the bit-shifter and adds a few cycles to the byte reader.
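
For comparison, a conventional counter-based reader might look something like this (a hypothetical sketch; Reader_Read is assumed to return the next input byte in A). The dec on every bit is what the termination bit gets rid of:

Reader_ReadBitCounted:
	srl c			; shift the next bit out of C into carry
	dec b			; one bit fewer left in the current byte
	ret nz
	push af			; preserve the bit just produced
	call Reader_Read	; fetch the next input byte
	ld c,a
	ld b,8			; reset the bit counter
	pop af
	ret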

By Manuel

Ascended (15529)

18-10-2015, 14:22

Great stuff, Grauw! :) We waited over 20 years for someone to implement an unzipper for MSX, and then an alternative comes along within a year!

By Grauw

Enlighted (8190)

18-10-2015, 15:47

Bore, that’s a great one! Really excited about that trick, as it’s smack in the middle of the primary loop. Using a terminator in some way did cross my mind, but I could not think of a way to actually use it. That’s pretty clever! I’ll give it a try right away. Thanks a lot :).

By bore

Expert (115)

18-10-2015, 16:14

I don't know how far you want to go in optimizing the outer code, but if I understood what Reader_ReadBitsInline_2 to _4 are supposed to do, you should be able to flip the rotation direction around and save a few cycles there too.

Reader_ReadBitsInline_2:
	xor a
	Reader_ReadBitInline
	rra			; 0-------
	Reader_ReadBitInline
	rla			; -------1, bit 0 to carry
	rla			; ------10
	ret

Reader_ReadBitsInline_3:
	xor a
	Reader_ReadBitInline
	rra			; 0-------
	Reader_ReadBitInline
	rra			; 10------
	Reader_ReadBitInline
	rla			; 0------2, bit 1 to carry
	rla			; ------21, bit 0 to carry
	rla			; -----210
	ret

Reader_ReadBitsInline_4:
	xor a
	Reader_ReadBitInline
	rra			; 0-------
	Reader_ReadBitInline
	rra			; 10------
	Reader_ReadBitInline
	rra			; 210-----
	Reader_ReadBitInline
	rla			; 10-----3, bit 2 to carry
	rla			; 0-----32, bit 1 to carry
	rla			; -----321, bit 0 to carry
	rla			; ----3210
	ret

By Grauw

Enlighted (8190)

18-10-2015, 16:45

Grauw wrote:

Bore, that’s a great one! Really excited about that trick, as it’s smack in the middle of the primary loop. Using a terminator in some way did cross my mind, but I could not think of a way to actually use it. That’s pretty clever! I’ll give it a try right away. Thanks a lot :).

It improves performance by 3% :). Running time on the turboR went down to 25s, and to 130s on the MSX2. Very nice!

It’s also nice that it frees up a register, gives me a little more flexibility.
