-
Notifications
You must be signed in to change notification settings - Fork 786
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Do we need to download images to file? #31
Comments
Imho, it should be an option to not download to a file |
Still new to this project, but I think there is the configuration: Which sounds like it can be set to False for outright disabling of image downloading. Haven't actually tried this yet though. |
Not looking to disable image fetching - I need image fetching - just don't think we need to download to disk if we only use bytes,width and height. |
Actually you can disable image fetching if not needed. I do agree that I would be better not to download the whole image file if it's not needed. However, image extractor is not tested at all. I wish to have this part of goose tested before changing how it work to be sure the implemented solution works as well. |
@atlithorn could you source you piece of code ... it seems to be a Zope code right ? |
You're right, looks like Zope. I have been using this snippet for a while with no idea where it came from :) |
Hi all... This is also a Pain Point on my side, although rather from another perspective: right now there are no unit tests for the image stuff. I'd like to do Dependency Injections for http_client. In fact the variable http_client exists but isn't in use. So I think to do this:
In a later step, one could make the function fetch(...) return an image.Image object. So either changing the existing algorithm to HEAD or just implementing a custom http_client would become a piece of cake... Let me know what you think, I'm keen on creating a patch for this ;) Cheers, Philip |
Couldn't wait, I've created a pull request for the DI stuff ;) |
Hello, I don't see why we need to change the Goose.init constructor, nor why we need to pass the http client around if the variable is not used. xav |
Same here, as in my comment is the PR Le jeudi 8 août 2013, Xavier Grangier a écrit :
Benoit Calvez. |
+1 for the mocking/unit testing of the image stuff |
@psilva261 the mocking testing stuff is now implemented. Did you check it out ? |
@grangier yeah, looks great, I couldn't have imagined an easier way to write unit tests for that.. ;) I've also seen you already solved the problem with the mentioned page on timeforkids.com. so right now I don't have other issues for the top image extractor on the radar, but this could change any day... |
What was the timeforkids.com issue ? I don't remember. |
In this article: http://www.timeforkids.com/news/gaga-goo-goo-clusters/54591 The author's image (bottom right) was extracted instead of the large image above the article... I was writing a unit test but then I realized the correct image is already extracted |
@psilva261 Hum weird. In my case the correct image is not extracted. It is still the author's image. >>> import goose
>>> url = 'http://www.timeforkids.com/news/gaga-goo-goo-clusters/54591'
>>> g = goose.Goose()
>>> a = g.extract(url=url)
>>> a.top_image.src
'http://www.timeforkids.com/files/Web_Faye.jpg' Don't you have the same result ? |
Problem is with this article the main image is not in the article top node. The only image in the article top_node is the author's image |
Odd... In the project I work on, on my laptop the correct image is extracted. But on our server it seems still the author's image is extracted. |
@psilva261 i guess you don't have the same html on both side. |
To solve this issue, the only way it to grab all the images of the document and not only the ones on the detected top_node. Side effects would be that image extraction would take more time and results may be worst than expected. For instance we could have a large image that is used as a sprit for icons and goose will detect it as main image. |
I see this a lot too and the only solution I came up with was to add to Would sprite files be dangerous, can you do huge sprite images with just But I know what you're saying, the danger is images that are not connected Maybe some kind of scoring mechanism based on how close to the top node the Atli. On Sun, Aug 18, 2013 at 8:42 PM, Xavier Grangier
|
@atlithorn you are right sprites were a bad exemple as they usually loaded by the CSS file. I like the idea of scoring for images outside the top node .. But this means fetching all the document images. We could also use a configuration settings for a maximum parent node distance. |
I haded some more test file regarding KNOWN_IMG_DOM_NAMES and known-image-css.txt e008ac1 to solve @psilva261 issue you could add timeforkids.com^content to known-image-css.txt file This part of the code need to be refactor because knwon class are only taking in account parent node and not the image class it self. |
@grangier Alright, I'll send you a merge request soon. Ok ok |
Aren't we only looking for width,height and number of bytes?
A head request gives you the content-length and something like https://gist.github.com/atlithorn/6155288 for width and height could remove the need for /tmp access.
If you are interested I'll create a pull request.
Atli.
The text was updated successfully, but these errors were encountered: