Easy support for compressed files #10
Comments
I am always sceptical about solutions that invoke external binaries like gzip, bzip2, xz. There are tons of pitfalls there (e.g. think about PATH handling), and portability suffers, esp. to Windows. On the other hand, it is very, very easy to use zlib to read gzip files without using any external binaries, and I suspect the same is possible for bzip2 and xz. So I'd rather see us start using those libraries (optionally, that is). Letting IO_File handle compressed files automatically has pros and cons. It can be super convenient, but one has to be careful not to introduce strange problems for users. At the very least, I think there should be a way to force IO to not decompress on the fly (e.g. there could be a "mode" that forces that). Secondly, how exactly do you think the "transparent" access should be implemented? E.g. suppose I do IO_File("foo"), what happens? One way is this: open the file and check whether it looks like a gzip file; if so, provide transparent decompression via zlib; if not, try the next compression method; and so on, until finally we fall back to regular access. But one can also do it the way GAP already does it for Read("foo"): if there is no file "foo", then look for a file "foo.gz" and try to open that, and so on. If there are both foo and foo.gz, you pick "foo". If "foo" is a directory and "foo.gz" is a file, well, you should error out, and so on. Of course, it's not quite as clear what happens if there are foo.gz, foo.bz2 and foo.xz around, but one could just say that this is the user's fault... |
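To make the first option above concrete, here is a minimal sketch of content-based detection. It is written in Python purely for illustration and is not IO or GAP code; the function name `open_maybe_compressed` and the `raw` flag (standing in for the proposed "mode" that disables decompression) are hypothetical.

```python
import bz2
import gzip
import lzma

# Leading "magic" bytes of each container format, paired with a stdlib opener.
MAGIC = (
    (b"\x1f\x8b", gzip.open),      # gzip
    (b"BZh", bz2.open),            # bzip2
    (b"\xfd7zXZ\x00", lzma.open),  # xz
)


def open_maybe_compressed(path, raw=False):
    """Open `path` for binary reading, decompressing on the fly if it looks compressed.

    raw=True is the hypothetical "mode" that forces plain, undecoded access.
    """
    if not raw:
        # Peek at the first few bytes to see whether a known format matches.
        with open(path, "rb") as probe:
            head = probe.read(6)
        for magic, opener in MAGIC:
            if head.startswith(magic):
                return opener(path, "rb")
    # Nothing matched (or raw access was requested): fall back to regular access.
    return open(path, "rb")
```

A caller would read from the returned object exactly as from a plain file. The file name plays no role in this variant, only the leading bytes do.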
While it might be easy to call the gz/bz2/xz libraries directly, it won't be as easy as it currently is to shell out to them, which is one line using IO_BufferedFile. However, I see your point. If we were to add support, the best idea (seeing as GAP already links against zlib) would be to extend core GAP's compressed file support, and then provide hooks to let IO access those libraries (which may actually be as easy as just calling them after checking they are present; I haven't tried to see what happens). In terms of the handling, personally I would have had IO_File("foo") read only foo, while IO_File("foo.gz") would automatically decompress the gzip file. However, it might be nice to align with GAP's native handling. With multiple names, a warning is probably the best idea. I'll have a look into what GAP does internally with compression libraries first. |
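The name-based lookup discussed in the last two comments could look roughly like the sketch below. It is illustrative Python, not IO or GAP code; the function name `resolve_compressed_name` and the suffix search order are assumptions, and the warning for multiple candidates follows the suggestion made just above.

```python
import os
import warnings

# Assumed search order; the thread does not fix one.
SUFFIXES = (".gz", ".bz2", ".xz")


def resolve_compressed_name(name):
    """Return the path that should actually be opened for `name`."""
    if os.path.isfile(name):
        return name  # an exact match always wins over foo.gz and friends
    if os.path.isdir(name):
        raise IsADirectoryError(f"{name!r} is a directory")
    candidates = [name + s for s in SUFFIXES if os.path.isfile(name + s)]
    if not candidates:
        raise FileNotFoundError(name)
    if len(candidates) > 1:
        # Several compressed variants exist: warn, then take the first one.
        warnings.warn(
            f"several compressed variants of {name!r} exist; using {candidates[0]!r}"
        )
    return candidates[0]
```

The returned path would then be handed to whatever opener matches its suffix.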
Update: GAP currently just shells out to the gz command when compressing / uncompressing gz files (I was unaware of this). |
I spoke (briefly) to Steve about this -- he didn't think the shelling out to gzip had ever caused any problems (although of course it might be that not enough people had used them), while each extra library GAP tried to use caused huge amount of horrible pain, so that's one data point. |
Yes, GAP itself currently execs a gzip process. One drawback of that approach is that it doesn't really work on Windows... Anyway, the two approaches do not have to be mutually exclusive: if io was linked against zlib, use that; if not, try to fall back to executing gzip; if that fails (e.g. because gzip is not available), generate an error. And of course this wouldn't have to be implemented at first: if things are done right, it should be completely irrelevant for users which exact method is used to access the compressed file. |
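The fallback chain described above (library first, then an external gzip, then an error) can be sketched as follows. This is illustrative Python rather than IO code: the function name `gunzip_bytes` and its `use_library` and `gzip_binary` parameters are hypothetical, and the `try`/`except ImportError` merely stands in for a configure-time check of whether zlib is available.

```python
import shutil
import subprocess

try:
    import zlib  # stands in for "io was linked against zlib"
except ImportError:
    zlib = None


def gunzip_bytes(path, use_library=True, gzip_binary="gzip"):
    """Return the decompressed contents of the gzip file at `path`."""
    if use_library and zlib is not None:
        with open(path, "rb") as f:
            # wbits = 16 + MAX_WBITS makes zlib expect a gzip header and trailer.
            return zlib.decompress(f.read(), 16 + zlib.MAX_WBITS)
    # No library support: try to fall back to an external gzip executable.
    exe = shutil.which(gzip_binary)
    if exe is None:
        raise RuntimeError("no zlib support and no gzip executable found")
    done = subprocess.run([exe, "-dc", path], check=True, stdout=subprocess.PIPE)
    return done.stdout
```

Callers see the same return value whichever branch runs, which is the point made above: the access method stays an implementation detail.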
To clarify: With a proper build process, the configure script would detect whether zlib etc. are present, and we could compile the relevant code conditionally. Thus no "pain" should be incurred. |
Out of interest, do you know why the forking doesn't work on Windows? (I haven't looked to see if it might be fixable). I also don't know how much of IO's functionality works on Windows, never tried. Might be worth trying to link into zlib and friends. Hopefully the awful pain of linking to gmp is just because gmp is special, and will not happen so much with other libraries. |
Most of the IO package's functionality works on Windows - GAP for Windows is distributed with IO binaries compiled on Cygwin. I can't give a complete account, but the stuff needed by the SCSCP package is fully functional. |
I think fork on Windows is an inherent problem. Windows doesn’t like processes to be started except from the beginning.
On 24 Apr 2014, at 22:03, Christopher Jefferson wrote:
|
One must not use fork on Windows, even with cygwin. |
As Max N. says. :-). In fact I already have plans "in my drawer" for a modification of GAP itself to use Windows APIs instead of fork(), which I had intended for e.g. a GSoC student project... This could also be expanded to IO. Unfortunately, while on the one hand it would be a perfect GSoC project (relatively small and well-defined task, documentation for everything exists, requires no deep prior knowledge), at the same time it's also not very "sexy" ("improve a package for an obscure math project"... ;-). Still, it'd be worth a shot for GSoC 2015 :-). And using zlib / libz is indeed not that hard or messy or even awful. In fact, even using GMP is not awful, IMHO: I always build GAP with a system-wide GMP, not its own. As far as I can tell, what made using GMP "awful" for GAP is the fact that various GMP versions are buggy and thus we chose to bundle our own to make GAP easier to build for users who don't know how to install or update a system library (and I guess also for the sake of Windows users ;-). That said, zlib is quite small and it could even be bundled with GAP or IO (but please, let's not put a copy of its source code into the repository, that's IMHO the wrong way to go about it). Anyway, as I said, we could still first implement IO_CompressedFile using pipes, unportably, to get the design straight, and then add the rest gradually. The zlib API is super-easy, so a short afternoon should be enough for it; I am not familiar with the corresponding bzip2 / xz libraries, but I have hopes that they won't be much harder. The next logical step, BTW, would be to allow transparent access to .tar, .tar.gz, .zip etc. files. We did that for ScummVM, but to do this right and pain-free, one should first come up with a better path abstraction than what GAP and IO currently offer. (This would at the same time also help to improve portability between POSIX systems and Windows).... Again, I have some plans for that "in my drawer"...
Perhaps we can discuss this at some point? |
Certainly there are all kind of core cleanups we could do in GAP, from the small (let's get rid of all the macro functions for example), to the larger :) Would be happy to talk about that with you at some point. I'll neaten up a simple IO_CompressedFile for now, and we'll think about doing something more exciting later. |
Just as an experiment, I decided to try IO_Fork on Windows 8, and indeed it does fail horribly. Therefore, independent of this, it might be nice to try adding IO_Spawn (and perhaps switch GAP internals to use spawn where appropriate). While this won't help when someone really wants a forked GAP, my experience is that Cygwin's support of spawn is much better, and spawn is also much closer to Windows' CreateProcess. Unless anyone voices a serious problem with spawn (someone might have already tried it in GAP, for example), I think I'll try experimenting with spawn and see how I go. |
Sure. For the record: With zlib / libz, decoding a file can be done using inflateInit2() / inflate() / inflateEnd(). For example, I once implemented a C++ stream class for ScummVM which takes another stream and decompresses its content on the fly; see the code for class GZipReadStream here. For liblzma (the code driving xz, a compression format that often reaches a much better compression ratio), I believe lzma_stream_decoder() (or perhaps lzma_auto_decoder() if one also wants to support the legacy .lzma format) / lzma_code() / lzma_end() are the corresponding functions. For bzip2, similar APIs exist. With this, one can implement stream classes that "wrap" other streams to provide on-the-fly decoding of compressed data not just in files, but also from other sources, e.g. downloaded from an HTTP server, without having to put it into a file first. |
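As a rough illustration of the "wrap another stream" idea, the sketch below uses Python's standard-library bindings to the same three libraries rather than the C APIs named above. The helper names `make_decoder` and `iter_decompressed` are hypothetical; `zlib.decompressobj` plays the role of the inflateInit2() / inflate() / inflateEnd() cycle, and `lzma.FORMAT_AUTO` accepts both .xz and legacy .lzma input.

```python
import bz2
import io
import lzma
import zlib


def make_decoder(kind):
    """Create an incremental decoder object for the given format name."""
    if kind == "gz":
        # 16 + MAX_WBITS selects gzip framing instead of raw zlib framing.
        return zlib.decompressobj(16 + zlib.MAX_WBITS)
    if kind == "bz2":
        return bz2.BZ2Decompressor()
    if kind == "xz":
        return lzma.LZMADecompressor(format=lzma.FORMAT_AUTO)
    raise ValueError(f"unknown compression kind: {kind!r}")


def iter_decompressed(stream, kind, chunk_size=4096):
    """Yield decompressed chunks read from any object that offers read(n)."""
    decoder = make_decoder(kind)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        out = decoder.decompress(chunk)
        if out:
            yield out
        if decoder.eof:  # end of the compressed stream reached
            break
    # Only zlib decoder objects have flush(); the others emit everything eagerly.
    tail = decoder.flush() if hasattr(decoder, "flush") else b""
    if tail:
        yield tail
```

Because the source only needs a `read(n)` method, the same generator works for a file, an in-memory buffer such as `io.BytesIO`, or a network response object.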
There are some functions in the semigroup package which automatically invoke gzip, bzip2 or xz as appropriate when users read or write files with extensions gz, bz and xz. It would be useful to have this support built into IO rather than require repeating it everywhere.
The only question (in my mind) is whether this should be done in IO_File automatically, or whether there should be a new IO_CompressedFile. Personally I'm tempted to add it to IO_File; would anyone ever really want to read raw compressed binary files in GAP? :) (Of course such people can also use IO_Open.)
I'm happy to work on this, just thought I'd check on interest first.