Easy support for compressed files #10

Open
ChrisJefferson opened this issue Apr 16, 2014 · 14 comments

Comments

@ChrisJefferson
Member

There are some functions in the semigroup package which automatically invoke gzip, bzip2 or xz as appropriate when users read or write files with extensions gz, bz and xz. It would be useful to have this support built into IO rather than having to repeat it everywhere.

The only question (in my mind) is whether this should be done in IO_File automatically, or whether there should be a new IO_CompressedFile. Personally I'm tempted to add it to IO_File; would anyone ever really want to read raw compressed binary files in GAP? :) (Of course, such people can also use IO_Open.)

I'm happy to work on this, just thought I'd check on interest first.

@fingolfin
Member

I am always sceptical about solutions that invoke external binaries like gzip, bzip2, xz. There are tons of pitfalls there (e.g. think about PATH handling), and portability suffers, especially to Windows.

On the other hand, it is very, very easy to use zlib to read gzip files without using any external binaries, and I suspect the same is possible for bzip2 and xz. So I'd rather see us start using those libraries (optionally, that is).

Letting IO_File handle compressed files automatically has pros and cons. It can be super convenient, but one has to be careful to not introduce strange problems for users. At the very least, I think there should be a way to force IO to not decompress on the fly (e.g. there could be a "mode" that forces that).

Secondly, how exactly do you think the "transparent" access should be implemented? E.g. suppose I do IO_File("foo"), what happens?

One way is this: open the file, check if it looks like a gzip file; if so, provide transparent decompression via zlib, if not, look at the next compression method; and so on, until finally we fallback to regular access.
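
To make the first approach concrete, here is a minimal sketch in Python (not GAP or IO code; Python is used only because its standard library wraps zlib, libbz2 and liblzma). The function name `open_sniffed` and the `MAGIC` table are hypothetical; the magic byte sequences are the ones defined by the gzip, bzip2 and xz formats.

```python
import bz2
import gzip
import lzma

# Leading "magic" bytes of each container format, paired with an opener
# that decompresses transparently.
MAGIC = (
    (b"\x1f\x8b", gzip.open),        # gzip
    (b"BZh", bz2.open),              # bzip2
    (b"\xfd7zXZ\x00", lzma.open),    # xz
)


def open_sniffed(path):
    """Open `path` for binary reading, decompressing if a known magic matches."""
    with open(path, "rb") as f:
        head = f.read(6)             # the longest magic above is 6 bytes
    for magic, opener in MAGIC:
        if head.startswith(magic):
            return opener(path, "rb")
    return open(path, "rb")          # no match: fall back to regular access
```

A "raw" mode that skips the sniffing step would cover the case where someone really wants the compressed bytes.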

But one can do it also like GAP does it already for Read("foo"): If there is no file "foo", then look if there is a file "foo.gz", and try to open that, and so on. If there are both foo and foo.gz, you pick "foo". If "foo" is a directory, and "foo.gz" is a file, well, you should error out, and so on.

Of course, it's not quite as clear what happens if there are foo.gz, foo.bz2 and foo.xz around, but one could just say that this is the user's fault...
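
The name-probing variant could look roughly like the following Python sketch (again not GAP code). The function name `resolve`, the suffix order and the choice to emit a warning when several compressed variants exist are all assumptions made for illustration.

```python
import os
import warnings

# Order in which compressed variants are tried (an arbitrary choice here).
SUFFIXES = (".gz", ".bz2", ".xz")


def resolve(name):
    """Return the path that should actually be opened for `name`."""
    if os.path.isfile(name):
        return name                               # the plain file always wins
    found = [name + s for s in SUFFIXES if os.path.isfile(name + s)]
    if os.path.isdir(name) and found:
        # "foo" is a directory but "foo.gz" is a file: refuse to guess.
        raise IsADirectoryError(f"{name} is a directory, but {found[0]} exists")
    if not found:
        raise FileNotFoundError(name)
    if len(found) > 1:
        warnings.warn(f"several compressed variants of {name}: {found}")
    return found[0]
```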

@ChrisJefferson
Member Author

While it might be easy to call gz/bz2/xz directly, it won't be as easy as it currently is to shell out to them, as it's one line using IO_BufferedFile.

However, I see your point. If we were to add support, the best idea (seeing as GAP already links against zlib) would be to extend core GAP's compressed file support, and then provide hooks to let IO access those libraries (which may actually be as easy as just calling them after checking they are present; I haven't tried to see what happens).

In terms of the handling, personally I would have had IO_File("foo") read only foo, while IO_File("foo.gz") would automatically uncompress the gz. However, it might be nice to align with GAP's native handling.
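
As a rough illustration of the extension-only rule, here is a Python sketch (not IO code). `open_by_extension` and `OPENERS` are hypothetical names, and the library openers stand in for whatever backend IO would end up using.

```python
import bz2
import gzip
import lzma
import os

# Only the file name decides; the file contents are never inspected.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def open_by_extension(path, mode="rb"):
    """Open `path`, (de)compressing only if its extension asks for it."""
    ext = os.path.splitext(path)[1].lower()
    return OPENERS.get(ext, open)(path, mode)
```

With this rule `open_by_extension("foo")` never touches `foo.gz`, which keeps the behaviour predictable at the cost of differing from GAP's `Read`.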

With multiple names, a warning is probably the best idea.

I'll have a look into what GAP does internally with compression libraries first.

@ChrisJefferson
Member Author

Update: GAP currently just shells out to the gz command when compressing / uncompressing gz files (I was unaware of this).

@ChrisJefferson
Member Author

I spoke (briefly) to Steve about this -- he didn't think the shelling out to gzip had ever caused any problems (although of course it might be that not enough people had used it), while each extra library GAP tried to use caused a huge amount of horrible pain, so that's one data point.

@fingolfin
Member

Yes, GAP itself currently execs a gzip process. One drawback of that approach is that it doesn't really work on Windows...

Anyway, the two approaches do not have to be mutually exclusive: If io was linked against zlib, use that; if not, try to fall back to executing gzip; if that fails (e.g. because gzip is not available), generate an error. And of course this wouldn't have to be implemented at first: If things are done right, it should be completely irrelevant for users which exact method is used to access the compressed file.
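
A minimal Python sketch of that fallback chain, under the assumption that "linked against zlib" can be modelled by whether the `zlib` module imports. The function name `gunzip_bytes` and its `use_library` switch are hypothetical and exist only so that both paths can be exercised.

```python
import shutil
import subprocess

try:
    import zlib                      # stands in for "io was linked against zlib"
except ImportError:
    zlib = None


def gunzip_bytes(data, use_library=True):
    """Decompress gzip data; the caller never sees which backend was used."""
    if use_library and zlib is not None:
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        return zlib.decompress(data, wbits=16 + zlib.MAX_WBITS)
    exe = shutil.which("gzip")       # fall back to an external gzip on PATH
    if exe is None:
        raise RuntimeError("no zlib support and no gzip executable found")
    done = subprocess.run([exe, "-dc"], input=data,
                          stdout=subprocess.PIPE, check=True)
    return done.stdout
```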

@fingolfin
Member

To clarify: With a proper build process, the configure script would detect whether zlib etc. are present, and we could compile the relevant code conditionally. Thus no "pain" should be incurred.

@ChrisJefferson
Member Author

Out of interest, do you know why the forking doesn't work on Windows? (I haven't looked to see if it might be fixable). I also don't know how much of IO's functionality works on Windows, never tried.

Might be worth trying to link into zlib and friends. Hopefully the awful pain of linking to gmp is just because gmp is special, and will not happen so much with other libraries.

@olexandr-konovalov
Member

Most of the IO package's functionality works on Windows - GAP for Windows is distributed with IO binaries compiled on Cygwin. I can't give a complete account, but the stuff needed by the SCSCP package is fully functional.

@stevelinton

I think fork on Windows is an inherent problem. Windows doesn’t like processes to be started except from the beginning.
The Cygwin implementation is a horrible hack (at one stage it involves sending a long jump buffer through inter-process comms)
but still doesn’t work reliably on more recent versions of Windows.

Steve

On 24 Apr 2014, at 22:03, Christopher Jefferson wrote:

> Out of interest, do you know why the forking doesn't work on Windows? (I haven't looked to see if it might be fixable). I also don't know how much of IO's functionality works on Windows, never tried.
>
> Might be worth trying to link into zlib and friends. Hopefully the awful pain of linking to gmp is just because gmp is special, and will not happen so much with other libraries.

@neunhoef

One must not use fork on Windows, even with cygwin.
However, there is a completely different API using CreateProcess, and it is possible to use pipes between the old and the new process (I can point you to sample code in ArangoDB, however compiled with Visual Studio). So in principle, this could be done. Another problem of course is that under Windows, chances are that gzip isn't installed, which is, in my opinion, why the libz approach is better in the long run. Using libz should be relatively straightforward these days if we bundle a version with GAP. We do this in ArangoDB as well.

@fingolfin
Member

As Max N. says. :-). In fact I already have plans "in my drawer" for a modification of GAP itself to use Windows APIs instead of fork(), which I had intended for e.g. a GSoC student project... This could also be expanded to IO. Unfortunately, while on the one hand it would be a perfect GSoC project (relatively small and well-defined task, documentation for everything exists, requires no deep pre-knowledge), at the same time it's also not very "sexy" ("improve a package for an obscure math project"... ;-). Still, it'd be worth a shot for GSoC 2015 :-).

And using zlib / libz is indeed not that hard or messy or even awful. In fact, even using GMP is not awful, IMHO: I always build GAP with a system-wide GMP, not its own. As far as I can tell, what made using GMP "awful" for GAP is the fact that various GMP versions are buggy and thus we chose to bundle our own to make GAP easier to build for users who don't know how to install or update a system library (and I guess also for the sake of Windows users ;-). That said, zlib is quite small and it could even be bundled with GAP or IO (but please, let's not put a copy of its source code into the repository, that's IMHO the wrong way to go about it).

Anyway, as I said, we could still first implement IO_CompressedFile using pipes, and unportable, to get the design straight, and then add the rest gradually. The zlib API is super-easy, so a short afternoon should be enough for it; I am not familiar with the corresponding bzip2 / xz libraries, but I have hopes that they won't be much harder.

The next logical step, BTW, would be to allow transparent access to .tar, .tar.gz, .zip etc. files. We did that for ScummVM, but to do this right and painfree, one should first come up with a better path abstraction than what GAP and IO currently offer. (This would at the same time also help to improve portability between POSIX systems and Windows).... Again, I have some plans for that "in my drawer"... Perhaps we can discuss this at some point?

@ChrisJefferson
Member Author

Certainly there are all kinds of core cleanups we could do in GAP, from the small (let's get rid of all the macro functions for example), to the larger :) Would be happy to talk about that with you at some point.

I'll neaten up a simple IO_CompressedFile for now, and we'll think about doing something more exciting later.

@ChrisJefferson
Member Author

Just as an experiment, I decided to try IO_Fork on Windows 8, and indeed it does fail horribly.

Therefore, independent of this, it might be nice to try adding IO_Spawn (and perhaps switch GAP internals to use spawn where appropriate). While this won't help when someone really wants a forked GAP, my experience is that Cygwin's support of spawn is much better, and also spawn is much closer to Windows' CreateProcess. Unless anyone vocalises a serious problem with spawn (someone might have already tried it in GAP for example), I think I'll try experimenting with spawn and see how I go.
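
To show the spawn-style model (start a fresh program and talk to it over pipes, rather than cloning the running process), here is a small Python sketch. It is not a proposal for the IO_Spawn interface; `filter_through` and `CHILD` are hypothetical names. Python's `subprocess` module is used because it is built on CreateProcess on Windows.

```python
import subprocess
import sys


def filter_through(argv, data):
    """Start a fresh child process and pass `data` through its stdin/stdout."""
    proc = subprocess.Popen(argv, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    out, _ = proc.communicate(data)   # feeds stdin, drains stdout, then waits
    if proc.returncode != 0:
        raise RuntimeError(f"{argv[0]} exited with status {proc.returncode}")
    return out


# Hypothetical child used for the demonstration: a Python one-liner that
# gunzips stdin to stdout, so the example does not depend on a gzip binary.
CHILD = [
    sys.executable, "-c",
    "import sys, gzip; "
    "sys.stdout.buffer.write(gzip.decompress(sys.stdin.buffer.read()))",
]
```

Calling `filter_through(CHILD, compressed_bytes)` returns the decompressed bytes; replacing `CHILD` with `["gzip", "-dc"]` gives the external-binary variant where gzip is installed.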

@fingolfin
Member

Sure.

For the record: With zlib / libz, decoding a file can be done using inflateInit2() / inflate() / inflateEnd(). For example, I once implemented a C++ stream class for ScummVM which takes another stream, and on-the-fly decompresses its content, see the code for class GZipReadStream here.
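
The same init / inflate / end pattern can be sketched with Python's `zlib` module, which wraps the C library (this is an illustration, not the ScummVM class or IO code). `gunzip_chunks` is a hypothetical name; the `wbits` value plays the role of the windowBits argument of inflateInit2().

```python
import zlib


def gunzip_chunks(chunks):
    """Yield decompressed pieces from an iterable of gzip-compressed chunks."""
    # 16 + MAX_WBITS selects gzip framing (header and trailer).
    d = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
    for chunk in chunks:
        out = d.decompress(chunk)    # feed input as it arrives
        if out:
            yield out
    tail = d.flush()                 # release anything still buffered
    if tail:
        yield tail
    if not d.eof:
        raise EOFError("gzip stream ended before its end marker")
```

Because input is fed chunk by chunk, the whole compressed file never has to be in memory at once.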

For liblzma (the code driving xz, a compression format that often reaches a much better compression ratio), I believe lzma_stream_decoder() (or perhaps lzma_auto_decoder() if one also wants to support the legacy .lzma format) / lzma_code() / lzma_end() are the corresponding functions.
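
For comparison, a Python sketch of the xz case using the standard `lzma` module, which wraps liblzma. `unxz_chunks` is a hypothetical name; `FORMAT_AUTO` is the module's auto-detecting mode, which accepts both .xz and legacy .lzma input.

```python
import lzma


def unxz_chunks(chunks):
    """Yield decompressed pieces from an iterable of .xz or .lzma chunks."""
    # FORMAT_AUTO detects .xz versus the legacy .lzma container.
    d = lzma.LZMADecompressor(format=lzma.FORMAT_AUTO)
    for chunk in chunks:
        out = d.decompress(chunk)
        if out:
            yield out
    if not d.eof:
        raise EOFError("compressed stream ended before its end marker")
```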

For bzip2, similar APIs exist.

With this, one can implement stream classes that "wrap" other streams to provide on-the-fly decoding of compressed data not just in files, but also from other sources, e.g. downloaded from a HTTP server, without having to put it into a file first.
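
A minimal Python sketch of such a wrapping stream, assuming only that the source has a `read(n)` method and the decompressor has a `decompress(bytes)` method. The class name `DecompressingReader` and its parameters are hypothetical; bzip2 is used in the usage note, but any of the incremental decompressors would fit.

```python
import bz2
import io


class DecompressingReader(io.RawIOBase):
    """Read-only stream that decompresses whatever `source.read()` returns."""

    def __init__(self, source, decompressor, chunk_size=8192):
        self._source = source          # any object with read(n) -> bytes
        self._dec = decompressor       # any object with decompress(bytes)
        self._chunk_size = chunk_size
        self._pending = b""            # decompressed bytes not yet handed out

    def readable(self):
        return True

    def readinto(self, buf):
        # Pull compressed input until some output exists or input runs out.
        while not self._pending:
            raw = self._source.read(self._chunk_size)
            if not raw:
                return 0               # underlying stream exhausted
            self._pending = self._dec.decompress(raw)
        n = min(len(buf), len(self._pending))
        buf[:n] = self._pending[:n]
        self._pending = self._pending[n:]
        return n
```

Wrapping it as `io.BufferedReader(DecompressingReader(src, bz2.BZ2Decompressor()))` gives the usual `read()` and `readline()` methods, and `src` can be an open file, an `io.BytesIO`, or a network response object.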
