Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do we need to download images to file? #31

Open
atlithorn opened this issue Aug 5, 2013 · 24 comments
Open

Do we need to download images to file? #31

atlithorn opened this issue Aug 5, 2013 · 24 comments

Comments

@atlithorn
Copy link

Aren't we only looking for width,height and number of bytes?

A head request gives you the content-length and something like https://gist.github.com/atlithorn/6155288 for width and height could remove the need for /tmp access.

If you are interested I'll create a pull request.

Atli.

@dzen
Copy link

dzen commented Aug 5, 2013

Imho, it should be an option to not download to a file

@kcolton
Copy link

kcolton commented Aug 5, 2013

Still new to this project, but I think there is the configuration:
enable_image_fetching

Which sounds like it can be set to False for outright disabling of image downloading. Haven't actually tried this yet though.

@atlithorn
Copy link
Author

Not looking to disable image fetching - I need image fetching - just don't think we need to download to disk if we only use bytes,width and height.

@grangier
Copy link
Owner

grangier commented Aug 5, 2013

Actually you can disable image fetching if not needed. I do agree that I would be better not to download the whole image file if it's not needed. However, image extractor is not tested at all. I wish to have this part of goose tested before changing how it work to be sure the implemented solution works as well.

@grangier
Copy link
Owner

grangier commented Aug 5, 2013

@atlithorn could you source you piece of code ... it seems to be a Zope code right ?

@atlithorn
Copy link
Author

You're right, looks like Zope. I have been using this snippet for a while with no idea where it came from :)

@psilva261
Copy link

Hi all... This is also a Pain Point on my side, although rather from another perspective: right now there are no unit tests for the image stuff. I'd like to do Dependency Injections for http_client.

In fact the variable http_client exists but isn't in use. So I think to do this:

  • turn network.HtmlFetcher into network.HttpClient
  • move the method fetch(...) from images.image.ImageUtils into network.HttpClient
  • actually use the http_client variable. Also change the constructor of Goose to init(self, config, http_client)

In a later step, one could make the function fetch(...) return an image.Image object. So either changing the existing algorithm to HEAD or just implementing a custom http_client would become a piece of cake...

Let me know what you think, I'm keen on creating a patch for this ;)

Cheers, Philip

@psilva261
Copy link

Couldn't wait, I've created a pull request for the DI stuff ;)

@grangier
Copy link
Owner

grangier commented Aug 8, 2013

Hello,

I don't see why we need to change the Goose.init constructor, nor why we need to pass the http client around if the variable is not used.

xav

@dzen
Copy link

dzen commented Aug 8, 2013

Same here, as in my comment is the PR

Le jeudi 8 août 2013, Xavier Grangier a écrit :

Hello,

I don't see why we need to change the Goose.init constructor, nor why
we need to pass the http client around if the variable is not used.

xav


Reply to this email directly or view it on GitHubhttps://github.com//issues/31#issuecomment-22336779
.

Benoit Calvez.

@psilva261
Copy link

+1 for the mocking/unit testing of the image stuff

@grangier
Copy link
Owner

@psilva261 the mocking testing stuff is now implemented. Did you check it out ?

@psilva261
Copy link

@grangier yeah, looks great, I couldn't have imagined an easier way to write unit tests for that.. ;) I've also seen you already solved the problem with the mentioned page on timeforkids.com. so right now I don't have other issues for the top image extractor on the radar, but this could change any day...

@grangier
Copy link
Owner

What was the timeforkids.com issue ? I don't remember.

@psilva261
Copy link

In this article: http://www.timeforkids.com/news/gaga-goo-goo-clusters/54591 The author's image (bottom right) was extracted instead of the large image above the article... I was writing a unit test but then I realized the correct image is already extracted

@grangier
Copy link
Owner

@psilva261 Hum weird. In my case the correct image is not extracted. It is still the author's image.

>>> import goose
>>> url = 'http://www.timeforkids.com/news/gaga-goo-goo-clusters/54591'
>>> g = goose.Goose()
>>> a = g.extract(url=url)
>>> a.top_image.src
'http://www.timeforkids.com/files/Web_Faye.jpg'

Don't you have the same result ?

@grangier
Copy link
Owner

Problem is with this article the main image is not in the article top node. The only image in the article top_node is the author's image

@psilva261
Copy link

Odd... In the project I work on, on my laptop the correct image is extracted. But on our server it seems still the author's image is extracted.

@grangier
Copy link
Owner

@psilva261 i guess you don't have the same html on both side.

@grangier
Copy link
Owner

To solve this issue, the only way it to grab all the images of the document and not only the ones on the detected top_node. Side effects would be that image extraction would take more time and results may be worst than expected. For instance we could have a large image that is used as a sprit for icons and goose will detect it as main image.

@atlithorn
Copy link
Author

I see this a lot too and the only solution I came up with was to add to
KNOWN_IMG_DOM_NAMES.

Would sprite files be dangerous, can you do huge sprite images with just
the img tag? That's usually done via css which goose would not catch
anyway, right?

But I know what you're saying, the danger is images that are not connected
to the article.

Maybe some kind of scoring mechanism based on how close to the top node the
image is? A big image just outside the top node would score higher than a
small image inside it?

Atli.

On Sun, Aug 18, 2013 at 8:42 PM, Xavier Grangier
[email protected]:

To solve this issue, the only way it to grab all the images of the
document and not only the ones on the detected top_node. Side effects would
be that image extraction would take more time and results may be worst than
expected. For instance we could have a large image that is used as a sprit
for icons and goose will detect it as main image.


Reply to this email directly or view it on GitHubhttps://github.com//issues/31#issuecomment-22838975
.

@grangier
Copy link
Owner

@atlithorn you are right sprites were a bad exemple as they usually loaded by the CSS file. I like the idea of scoring for images outside the top node .. But this means fetching all the document images. We could also use a configuration settings for a maximum parent node distance.

@grangier
Copy link
Owner

I haded some more test file regarding KNOWN_IMG_DOM_NAMES and known-image-css.txt e008ac1

to solve @psilva261 issue you could add timeforkids.com^content to known-image-css.txt file

This part of the code need to be refactor because knwon class are only taking in account parent node and not the image class it self.

@psilva261
Copy link

@grangier Alright, I'll send you a merge request soon. Ok ok

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants