Do we need to download images to file? #31

atlithorn · 2013-08-05T11:43:17Z

Aren't we only looking for width,height and number of bytes?

A head request gives you the content-length and something like https://gist.github.com/atlithorn/6155288 for width and height could remove the need for /tmp access.

If you are interested I'll create a pull request.

Atli.

dzen · 2013-08-05T12:54:23Z

Imho, it should be an option to not download to a file

kcolton · 2013-08-05T13:23:54Z

Still new to this project, but I think there is the configuration:
enable_image_fetching

Which sounds like it can be set to False for outright disabling of image downloading. Haven't actually tried this yet though.

atlithorn · 2013-08-05T13:31:10Z

Not looking to disable image fetching - I need image fetching - just don't think we need to download to disk if we only use bytes,width and height.

grangier · 2013-08-05T14:10:58Z

Actually you can disable image fetching if not needed. I do agree that I would be better not to download the whole image file if it's not needed. However, image extractor is not tested at all. I wish to have this part of goose tested before changing how it work to be sure the implemented solution works as well.

grangier · 2013-08-05T14:11:30Z

@atlithorn could you source you piece of code ... it seems to be a Zope code right ?

atlithorn · 2013-08-05T14:19:20Z

You're right, looks like Zope. I have been using this snippet for a while with no idea where it came from :)

psilva261 · 2013-08-08T15:20:49Z

Hi all... This is also a Pain Point on my side, although rather from another perspective: right now there are no unit tests for the image stuff. I'd like to do Dependency Injections for http_client.

In fact the variable http_client exists but isn't in use. So I think to do this:

turn network.HtmlFetcher into network.HttpClient
move the method fetch(...) from images.image.ImageUtils into network.HttpClient
actually use the http_client variable. Also change the constructor of Goose to init(self, config, http_client)

In a later step, one could make the function fetch(...) return an image.Image object. So either changing the existing algorithm to HEAD or just implementing a custom http_client would become a piece of cake...

Let me know what you think, I'm keen on creating a patch for this ;)

Cheers, Philip

psilva261 · 2013-08-08T16:33:12Z

Couldn't wait, I've created a pull request for the DI stuff ;)

grangier · 2013-08-08T16:39:30Z

Hello,

I don't see why we need to change the Goose.init constructor, nor why we need to pass the http client around if the variable is not used.

xav

dzen · 2013-08-08T17:22:07Z

Same here, as in my comment is the PR

Le jeudi 8 août 2013, Xavier Grangier a écrit :

Hello,

I don't see why we need to change the Goose.init constructor, nor why
we need to pass the http client around if the variable is not used.

xav

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/31#issuecomment-22336779
.

Benoit Calvez.

psilva261 · 2013-08-18T16:01:20Z

+1 for the mocking/unit testing of the image stuff

grangier · 2013-08-18T18:09:26Z

@psilva261 the mocking testing stuff is now implemented. Did you check it out ?

psilva261 · 2013-08-18T18:29:16Z

@grangier yeah, looks great, I couldn't have imagined an easier way to write unit tests for that.. ;) I've also seen you already solved the problem with the mentioned page on timeforkids.com. so right now I don't have other issues for the top image extractor on the radar, but this could change any day...

grangier · 2013-08-18T18:31:15Z

What was the timeforkids.com issue ? I don't remember.

psilva261 · 2013-08-18T18:43:31Z

In this article: http://www.timeforkids.com/news/gaga-goo-goo-clusters/54591 The author's image (bottom right) was extracted instead of the large image above the article... I was writing a unit test but then I realized the correct image is already extracted

grangier · 2013-08-18T19:00:55Z

@psilva261 Hum weird. In my case the correct image is not extracted. It is still the author's image.

>>> import goose
>>> url = 'http://www.timeforkids.com/news/gaga-goo-goo-clusters/54591'
>>> g = goose.Goose()
>>> a = g.extract(url=url)
>>> a.top_image.src
'http://www.timeforkids.com/files/Web_Faye.jpg'

Don't you have the same result ?

grangier · 2013-08-18T19:10:50Z

Problem is with this article the main image is not in the article top node. The only image in the article top_node is the author's image

psilva261 · 2013-08-18T19:38:31Z

Odd... In the project I work on, on my laptop the correct image is extracted. But on our server it seems still the author's image is extracted.

grangier · 2013-08-18T20:36:19Z

@psilva261 i guess you don't have the same html on both side.

grangier · 2013-08-18T20:42:58Z

To solve this issue, the only way it to grab all the images of the document and not only the ones on the detected top_node. Side effects would be that image extraction would take more time and results may be worst than expected. For instance we could have a large image that is used as a sprit for icons and goose will detect it as main image.

atlithorn · 2013-08-18T21:37:22Z

I see this a lot too and the only solution I came up with was to add to
KNOWN_IMG_DOM_NAMES.

Would sprite files be dangerous, can you do huge sprite images with just
the img tag? That's usually done via css which goose would not catch
anyway, right?

But I know what you're saying, the danger is images that are not connected
to the article.

Maybe some kind of scoring mechanism based on how close to the top node the
image is? A big image just outside the top node would score higher than a
small image inside it?

Atli.

On Sun, Aug 18, 2013 at 8:42 PM, Xavier Grangier
[email protected]:

To solve this issue, the only way it to grab all the images of the
document and not only the ones on the detected top_node. Side effects would
be that image extraction would take more time and results may be worst than
expected. For instance we could have a large image that is used as a sprit
for icons and goose will detect it as main image.

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/31#issuecomment-22838975
.

grangier · 2013-08-18T22:08:19Z

@atlithorn you are right sprites were a bad exemple as they usually loaded by the CSS file. I like the idea of scoring for images outside the top node .. But this means fetching all the document images. We could also use a configuration settings for a maximum parent node distance.

grangier · 2013-08-18T23:46:38Z

I haded some more test file regarding KNOWN_IMG_DOM_NAMES and known-image-css.txt e008ac1

to solve @psilva261 issue you could add timeforkids.com^content to known-image-css.txt file

This part of the code need to be refactor because knwon class are only taking in account parent node and not the image class it self.

psilva261 · 2013-08-19T15:28:18Z

@grangier Alright, I'll send you a merge request soon. Ok ok

grangier mentioned this issue Aug 8, 2013

Dependency Injection for http_client #32

Closed

psilva261 mentioned this issue Aug 22, 2013

Timeforkids image css #39

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Do we need to download images to file? #31

Do we need to download images to file? #31

atlithorn commented Aug 5, 2013

dzen commented Aug 5, 2013

kcolton commented Aug 5, 2013

atlithorn commented Aug 5, 2013

grangier commented Aug 5, 2013

grangier commented Aug 5, 2013

atlithorn commented Aug 5, 2013

psilva261 commented Aug 8, 2013

psilva261 commented Aug 8, 2013

grangier commented Aug 8, 2013

dzen commented Aug 8, 2013

psilva261 commented Aug 18, 2013

grangier commented Aug 18, 2013

psilva261 commented Aug 18, 2013

grangier commented Aug 18, 2013

psilva261 commented Aug 18, 2013

grangier commented Aug 18, 2013

grangier commented Aug 18, 2013

psilva261 commented Aug 18, 2013

grangier commented Aug 18, 2013

grangier commented Aug 18, 2013

atlithorn commented Aug 18, 2013

grangier commented Aug 18, 2013

grangier commented Aug 18, 2013

psilva261 commented Aug 19, 2013

Do we need to download images to file? #31

Do we need to download images to file? #31

Comments

atlithorn commented Aug 5, 2013

dzen commented Aug 5, 2013

kcolton commented Aug 5, 2013

atlithorn commented Aug 5, 2013

grangier commented Aug 5, 2013

grangier commented Aug 5, 2013

atlithorn commented Aug 5, 2013

psilva261 commented Aug 8, 2013

psilva261 commented Aug 8, 2013

grangier commented Aug 8, 2013

dzen commented Aug 8, 2013

psilva261 commented Aug 18, 2013

grangier commented Aug 18, 2013

psilva261 commented Aug 18, 2013

grangier commented Aug 18, 2013

psilva261 commented Aug 18, 2013

grangier commented Aug 18, 2013

grangier commented Aug 18, 2013

psilva261 commented Aug 18, 2013

grangier commented Aug 18, 2013

grangier commented Aug 18, 2013

atlithorn commented Aug 18, 2013

grangier commented Aug 18, 2013

grangier commented Aug 18, 2013

psilva261 commented Aug 19, 2013