Add support for semantic markup #99

jeffnappi · 2014-05-06T17:26:18Z

It would be fantastic to have the option to extract article data using Schema.org with a fallback to OpenGraph.

Example - http://www.wired.com/2014/05/star-wars-storyboards-video/

Wired makes effective use of schema.org as seen below:

<a itemprop="url headline name" href="http://www.wired.com/2014/05/star-wars-storyboards-video/" rel="bookmark" title="Permanent Link to Check Out Early Storyboards From the Original Star Wars Trilogy">Check Out Early Storyboards From the Original <em>Star Wars</em> Trilogy</a>
</h1>
<link itemprop="image" href="http://www.wired.com/wp-content/uploads/2014/05/star-wars-storyboards-feat.jpg" />
...
    <li class="entryDate"><time itemprop="datePublished" datetime="2014-05-06T06:30:56+00:00">05.06.14</time>&nbsp;&nbsp;&#124;&nbsp;&nbsp;</li>
...
<span itemprop="articleBody"><p><iframe width="660" height="371" src="//www.youtube.com/embed/8RlpNvUumy0" frameborder="0" allowfullscreen></iframe></p>
<p>Sure, everyone gets excited about <a href="http://www.wired.com/2014/05/jj-abrams-star-wars-video/" target="_blank">May the Fourth</a>
...
</p></span>

A minimal implementation could include:

Schema.org (http://schema.org/Article)
- headline
- author
- image
- datePublished
- articleBody
OpenGraph (http://ogp.me/)
- og:title
- og:image
- og:description
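
The field lists above could be wired together roughly as follows. This is only a sketch using the standard library's `html.parser`, not python-goose's actual API: the names `SemanticExtractor`, `extract_semantic`, `SCHEMA_PROPS`, and `OG_FALLBACK` are hypothetical, and the mapping of `og:title` to `headline` and `og:image` to `image` is an assumption about what "fallback" should mean here.

```python
from html.parser import HTMLParser

# Schema.org itemprop names taken from the list above (assumed minimal set).
SCHEMA_PROPS = {"headline", "author", "image", "datePublished", "articleBody"}
# Void elements have no closing tag, so they are never tracked as open.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Hypothetical mapping: which OpenGraph property backs up which Schema.org field.
OG_FALLBACK = {"headline": "og:title", "image": "og:image"}


class SemanticExtractor(HTMLParser):
    """Collects Schema.org itemprop values and OpenGraph meta tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.schema = {}
        self.og = {}
        self._stack = []  # open elements as (tag, itemprops, text chunks)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("property") or ""
        if tag == "meta" and prop.startswith("og:"):
            self.og.setdefault(prop, attrs.get("content") or "")
        props = [p for p in (attrs.get("itemprop") or "").split()
                 if p in SCHEMA_PROPS]
        # Prefer a machine-readable attribute when the element carries one.
        value = attrs.get("content") or attrs.get("datetime")
        if value is None and "image" in props:
            value = attrs.get("href") or attrs.get("src")
        if value is not None:
            for p in props:
                self.schema.setdefault(p, value)
            props = []  # already resolved, no need to collect text
        if tag not in VOID_TAGS:
            self._stack.append((tag, props, []))

    def handle_data(self, data):
        # Text belongs to every open element that is still waiting for a value.
        for _, props, chunks in self._stack:
            if props:
                chunks.append(data)

    def handle_endtag(self, tag):
        if not any(open_tag == tag for open_tag, _, _ in self._stack):
            return  # stray closing tag, such as the </h1> in the excerpt
        while self._stack:
            open_tag, props, chunks = self._stack.pop()
            text = " ".join("".join(chunks).split())
            if text:
                for p in props:
                    self.schema.setdefault(p, text)
            if open_tag == tag:
                break


def extract_semantic(html):
    """Return Schema.org fields, filling gaps from OpenGraph where possible."""
    parser = SemanticExtractor()
    parser.feed(html)
    parser.close()
    result = dict(parser.schema)
    for field, og_prop in OG_FALLBACK.items():
        if field not in result and parser.og.get(og_prop):
            result[field] = parser.og[og_prop]
    if parser.og.get("og:description"):
        result.setdefault("description", parser.og["og:description"])
    return result
```

Calling `extract_semantic(page_html)` returns a plain dict keyed by the Schema.org field names. The first value found for each field wins, and OpenGraph values are only consulted for fields that the microdata did not supply.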


jeffnappi · 2014-05-06T17:27:35Z

This was just a thought. I intend to implement this whether it becomes part of python-goose or not, but thought it would be good to open up a conversation about it.

Is this something that you would like to see added to python-goose?

grangier · 2014-05-07T06:11:24Z

Hello Jeff,

I have been thinking for a long time that the future of goose (version 2) should be based on extracting HTML5 semantic tags.

It seems obvious to me that most news sites now use <article> and other specific tags in their markup. Since this markup is increasingly used for SEO, relying on it could speed up text extraction and make it more reliable.
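
As a rough illustration of the `<article>`-based idea, the sketch below keeps only the text found inside `<article>` elements. It is standard-library Python and does not reflect how goose works today or how a version 2 would work; `ArticleTextParser`, `article_text`, and the `BLOCK_TAGS` set are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser

# Closing one of these tags ends a line in the extracted text (assumed set).
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6", "blockquote"}
# Content of these elements is never article text.
SKIP_TAGS = {"script", "style"}


class ArticleTextParser(HTMLParser):
    """Collects text that appears inside <article> elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._depth = 0  # how many <article> elements are currently open
        self._skip = 0   # how many <script>/<style> elements are open inside one
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self._depth += 1
        elif tag in SKIP_TAGS and self._depth:
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip:
            self._skip -= 1
        elif tag == "article" and self._depth:
            self._depth -= 1
        elif tag in BLOCK_TAGS and self._depth:
            self.chunks.append("\n")

    def handle_data(self, data):
        if self._depth and not self._skip:
            self.chunks.append(data)


def article_text(html):
    """Return the text inside <article> elements, or None if there is none."""
    parser = ArticleTextParser()
    parser.feed(html)
    parser.close()
    lines = (" ".join(line.split()) for line in "".join(parser.chunks).split("\n"))
    text = "\n".join(line for line in lines if line)
    return text or None
```

A `None` result signals that the page has no usable `<article>` element, which is the point where an extractor would presumably fall back to its existing heuristics.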

xav

mwjackson · 2014-07-24T10:07:54Z

What is the roadmap for v2? Is this something you'll realistically have time for?
