Gunzip for MSX

Page 9/11
2 | 3 | 4 | 5 | 6 | 7 | 8 | | 10 | 11

By Prodatron

Paragon (1788)

Prodatron's picture

11-11-2015, 04:31

Grauw wrote:

Hahaha, nice. Don’t worry, I’m more or less finished with the work on gunzip now Smile. I’ll probably make a few more changes as I integrate it in VGMPlay but they won’t be essential to pick up, I don’t expect any major performance improvements anymore.

Ok, these are good news Smile In the past I was thinking about a table-look-up approach like described in the /doc/gzip-algorithm.txt document instead of the "executable huffman tree" methode. First that sounded faster, but after making some calculations it seems that it's not possible to beat your "executable tree" approach.

Louthrax wrote:

I was wondering if you had an idea of the performance improvement not using IX/IY (I do not expect too much here)? Also, did that reduce the code size ? I kept Grauw's code the less modified as possible, in order to update easily, but as I just have a small amount of memory left, I might be interested in doing the same.

As some parts are still missing I can't give you any good numbers right now. I hope to be able to make some real tests later this week.

Grauw wrote:

[...]followed by a 5K heap[...]The classes you need to care about Inflate, Alphabet, Branch, FixedAlphabets, DynamicAlphabets, Reader, Writer, FileReader, FileWriter, NullWriter and crctable.asm. Those currently add up to a little under 8K.

My corresponding code is currently 9K including the heap (which is just a static part here), but I didn't add FileReader, FileWriter, NullWriter and the CRC32 stuff yet (crctable.asm is missing as well - this file doesn't seem the be included in the repository, where can I find it?). So you have to add 1K for the CRC32table + the missing routines. At the end it will be probably a little bit smaller.
My code is currently blowed-up a little bit to save a JUMP in the Inflate_WriteLiteral, (some) Inflate_CopyLength and Inflate_CopyDistance routines. If it turns out that this only has a minimal impact on the performance it is possible to make the code shorter again (~ -1,2KB).

By Louthrax

Prophet (2093)

Louthrax's picture

11-11-2015, 11:00

Prodatron wrote:

crctable.asm is missing as well - this file doesn't seem the be included in the repository, where can I find it?).

It's generated during the build process using an external Java tool you have to install. If you don't want to go for that you can grab my version from the SofaUnZip sources (I released them with version 1.0). It's located in the ovh.c file but should be easy to cast to Z80 assembly code.

By Prodatron

Paragon (1788)

Prodatron's picture

11-11-2015, 11:16

Thank you very much! Smile

By wouter_

Champion (418)

wouter_'s picture

11-11-2015, 20:21

Now that Grauw has published a more or less final version, I think I can publish my version as well. As Grauw mentioned already a few times, we've been working together, exchanging ideas and source code and copying (or not) each other's patches.

Preview: here are the results of the "Aleste test" using my version: 78s on Z80 and 17s on R800. 46s / 12s without CRC validation. So this version on Z80 is 9s (about 10%) faster compared to Grauw's last version. Another test with mostly incompressible data is over 2x faster (and can still be improved further if desired).

The most important (remaining) changes between both versions are:
* I removed all heap allocations, instead I statically allocate the buffers.
* Because of this I have registers DE and HL free in the main decompression loop (they were pointers to the generated code, but now that code lives on a fixed address). That allowed to replace usages of IX/IY (slow) with DE/HL (fast). And IX/IY become free for other (minor) optimizations. Also buffer boundary checks can use simple 'CP n' instructions instead of 'CP (IX+n)'.
* I moved away from the object-oriented coding style. For example I replaced member variables with statically allocated global variables. This got rid of almost all usages of IX/IY.
* I optimized also non-speed critical code, but for code-size instead of speed.
* I put all code in one file and converted to gen80 compatible assembler syntax (actually I used compass instead of gen80 to verify this). This does make the code less elegant in some places, but it allows development using a wider range of tools (e.g. on MSX itself). This also removes the dependency on the javascript crc32 LUT generator.

The main disadvantage of this version is that it's less flexible:
* It's no longer possible to have multiple decompression states simultaneously in memory. But this is not needed for a (g)unzip tool. (I doubt any MSX program really needs this).
* It's a bit harder (though also not too difficult) to adapt this code to decompress to memory or to VRAM instead of to a file. Especially if within the same program you need to decompress to either a file or to memory. But this is also not needed for a (g)unzip tool.

The advantages are:
* It's faster.
* Code is smaller: 7kB versus 11kB. (But it's not a 100% fair comparison, Grauw's version still contains some test code that can easily be removed). I did sacrifice some code size for extra speed, so if code size is an issue it's possible to trade ~1kB for 1-2% performance.
* The required buffer sizes are about the same in both versions, but I did try to overlap buffers that are not simultaneously used.

You can find my version here:
https://github.com/m9710797/msx-gunzip
Or a direct link to the (single file) source code:
https://github.com/m9710797/msx-gunzip/blob/master/src/gunzi...

It's a fun project to work on. Feel free to join. Maybe you can find some additional opportunities to optimize performance and/or code size.

By syn

Paragon (1920)

syn's picture

11-11-2015, 22:55

Wow I'm impressed with the quick progress on the various tools, its like speed records are being shattered on every new release.

I also love it that one thing leads to another, like a snowball that gets bigger and bigger (in this case faster and faster). Imagine if GuyveR800 could have released TuNzIp 10 years ago as intended!

By Manuel

Ascended (15829)

Manuel's picture

11-11-2015, 23:03

He would have had to release it as open source to get that effect, though...

By Prodatron

Paragon (1788)

Prodatron's picture

12-11-2015, 02:47

This whole project started by Grauw is becoming more and more surprising and crazy.
@Wouter, that's really impressive! Did we have that before on the MSX? Running Naked in a Field of Flowers Very soon there will be 5 different deflate tools for our machine! Big smile
My approach is quite similiar to yours, the main difference is not to use the 2nd register set.

The register usage for the main loop is as follows:
; c = bit reader state
; ix = InputBufPos
; hl = OutputBufPos
; iy = length of the to-be-copied block
; de = -distance of the to-be-copied block

For additional optimizations I moved stuff like "CopySmallDist", "CopyBigDist", "CopySetLength" into the "CopyDist??", "CopyLen??" routines, which saves a JUMP (I don't have remaining registers for JP(xx) commands).
The same is for WriteLit??, each has the ld (hl),n:inc l:jp nz,LiteralTree:jp ... included directly, which saves one LD a,n and one JUMP. The additional code is about 1,3KB in my case.
I am using IX for the InputBufPos, which is slower, but on the other hand I save several EXX commands.

Your way of unrolling the LDIs in the CopyAndNext part is quite cool as well. I was already thinking about how to do that, so that copying short blocks won't be slower because of some overhead during preparation. Unfortunately there is no register left for a dynamic jump on my side, so I am still thinking about an alternative way.

Anyway this all was a nice kind of "Z80 sport challenge"! Tongue

By Grauw

Ascended (8515)

Grauw's picture

12-11-2015, 04:00

Louthrax wrote:
Prodatron wrote:

crctable.asm is missing as well - this file doesn't seem the be included in the repository, where can I find it?).

It's generated during the build process using an external Java tool you have to install.

It’s generated by the tools/gencrctable.js script written for for node.js (JavaScript, not Java). You have to install node.js, as mentioned in the README and as invoked by the Makefile Smile.

By Grauw

Ascended (8515)

Grauw's picture

12-11-2015, 03:54

syn wrote:

I also love it that one thing leads to another, like a snowball that gets bigger and bigger (in this case faster and faster). Imagine if GuyveR800 could have released TuNzIp 10 years ago as intended!

I like your observation, but I think you’re wrong about the origin of this snowball effect.

I started work on Gunzip in early April this year, to be able to provide VGZ file playback support in VGMPlay. This was before Tunzip was released or announced, so in no way is gunzip’s development resulting from that project.

Rather, Manuel is so very right, the snowball effect as you call it is a result of the decision to release gunzip as open source. I think it’s a great showcase of how open source can benefit the MSX community. (I remember a recent discussion thread around this topic where not everyone was convinced… ;)) This kind of collaboration and synergy is really great, I love it, and I hope seeing it will encourage more people to release their software’s source into the open.

By Prodatron

Paragon (1788)

Prodatron's picture

12-11-2015, 03:50

Yes, I remember this thread. And thanks to this project I am convinced now Tongue

Page 9/11
2 | 3 | 4 | 5 | 6 | 7 | 8 | | 10 | 11