Wired makes effective use of schema.org as seen below:
<a itemprop="url headline name" href="http://www.wired.com/2014/05/star-wars-storyboards-video/" rel="bookmark" title="Permanent Link to Check Out Early Storyboards From the Original Star Wars Trilogy">Check Out Early Storyboards From the Original <em>Star Wars</em> Trilogy</a>
</h1>
<link itemprop="image" href="http://www.wired.com/wp-content/uploads/2014/05/star-wars-storyboards-feat.jpg" />
...
<li class="entryDate"><time itemprop="datePublished" datetime="2014-05-06T06:30:56+00:00">05.06.14</time> | </li>
...
<span itemprop="articleBody"><p><iframe width="660" height="371" src="//www.youtube.com/embed/8RlpNvUumy0" frameborder="0" allowfullscreen></iframe></p>
<p>Sure, everyone gets excited about <a href="http://www.wired.com/2014/05/jj-abrams-star-wars-video/" target="_blank">May the Fourth</a>
...
</p></span>
This was just a thought. I intend to implement this whether it becomes part of python-goose or not, but thought it would be good to open up a conversation about it.
Is this something that you would like to see added to python-goose?
For a long time I have been thinking that the future of goose (version 2) should be based on extraction from HTML5 semantic tags.
It seems obvious to me that most news sites now use <article> and other specific tags in their markup. Relying on them could speed up text extraction and make it more reliable, since this markup is increasingly used for SEO.
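To make the idea concrete, here is a rough sketch of text extraction driven by the `<article>` tag, using only the Python standard library. This is not how goose currently works: `ArticleTextParser`, `extract_article_text`, and the sets of skipped and block-level tags are hypothetical names and choices made up for illustration, and it assumes reasonably well-formed markup.

```python
from html.parser import HTMLParser


class ArticleTextParser(HTMLParser):
    """Collects visible text found inside <article> elements (illustrative only)."""

    # Assumption: these tags rarely hold article text, so their content is skipped.
    SKIPPED = {"script", "style", "nav", "aside", "footer"}
    # Assumption: closing one of these tags marks a word boundary in the output.
    BLOCK = {"p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.article_depth = 0  # > 0 while inside an <article>
        self.skip_depth = 0     # > 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_depth += 1
        elif self.article_depth and tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.article_depth:
            self.article_depth -= 1
        elif tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.BLOCK and self.article_depth and not self.skip_depth:
            self.chunks.append(" ")

    def handle_data(self, data):
        if self.article_depth and not self.skip_depth:
            self.chunks.append(data)


def extract_article_text(html):
    """Return whitespace-normalized text inside <article>, or None if there is none."""
    parser = ArticleTextParser()
    parser.feed(html)
    parser.close()
    text = " ".join("".join(parser.chunks).split())
    return text or None
```

A page without an `<article>` element yields `None`, which would be the signal to fall back to the existing heuristic extraction.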
It would be fantastic to have the option to extract article data using Schema.org with a fallback to OpenGraph.
Example - http://www.wired.com/2014/05/star-wars-storyboards-video/
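To give the conversation something concrete, here is a rough sketch of Schema.org extraction with an OpenGraph fallback, written against the standard library only. None of this is python-goose API: `ArticleMetadataParser`, `extract_article_metadata`, and the returned keys are hypothetical. It also makes one pragmatic assumption that deviates from the microdata rules: on an `<a>` carrying several `itemprop` names, `url` is read from `href` and the remaining names are read from the link text.

```python
from html.parser import HTMLParser

VOID_TAGS = {"meta", "link", "img", "br", "hr", "input", "source", "wbr"}
# Elements whose microdata value lives in an attribute rather than in text.
VALUE_ATTRS = {"meta": "content", "link": "href", "img": "src", "time": "datetime"}


class ArticleMetadataParser(HTMLParser):
    """Collects itemprop values and og:/article: meta properties (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.schema = {}
        self.opengraph = {}
        self._open = []  # open text captures: [tag, props, chunks, nesting depth]

    def _store(self, props, value):
        value = " ".join(value.split())
        if value:
            for prop in props:
                self.schema.setdefault(prop, value)  # first occurrence wins

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("property") or ""
        if tag == "meta" and prop.startswith(("og:", "article:")):
            self.opengraph.setdefault(prop, attrs.get("content") or "")
        if tag not in VOID_TAGS:
            # Track nesting so an inner <span> does not close an outer <span> capture.
            for capture in self._open:
                if capture[0] == tag:
                    capture[3] += 1
        props = (attrs.get("itemprop") or "").split()
        if tag == "a" and "url" in props:
            self._store(["url"], attrs.get("href") or "")
            props = [p for p in props if p != "url"]
        if not props:
            return
        if tag in VALUE_ATTRS and attrs.get(VALUE_ATTRS[tag]):
            self._store(props, attrs[VALUE_ATTRS[tag]])
        elif tag not in VOID_TAGS:
            self._open.append([tag, props, [], 0])

    def handle_endtag(self, tag):
        still_open = []
        for capture in self._open:
            if capture[0] != tag:
                still_open.append(capture)
            elif capture[3] > 0:
                capture[3] -= 1
                still_open.append(capture)
            else:
                self._store(capture[1], "".join(capture[2]))
        self._open = still_open

    def handle_data(self, data):
        for capture in self._open:
            capture[2].append(data)


def extract_article_metadata(html):
    """Prefer Schema.org microdata; fall back to OpenGraph where it has an equivalent."""
    parser = ArticleMetadataParser()
    parser.feed(html)
    parser.close()
    schema, og = parser.schema, parser.opengraph
    return {
        "title": schema.get("headline") or schema.get("name") or og.get("og:title"),
        "url": schema.get("url") or og.get("og:url"),
        "image": schema.get("image") or og.get("og:image"),
        "published": schema.get("datePublished") or og.get("article:published_time"),
        "text": schema.get("articleBody"),  # OpenGraph has no article body field
    }
```

Keeping the two vocabularies in separate dictionaries and merging them only at the end keeps the fallback order in one place, so adding another source later would not touch the parser.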
A minimal implementation could include: