Add support for semantic markup #99

Open
jeffnappi opened this issue May 6, 2014 · 3 comments

@jeffnappi
Contributor

It would be fantastic to have the option to extract article data using Schema.org with a fallback to OpenGraph.

Example - http://www.wired.com/2014/05/star-wars-storyboards-video/

Wired makes effective use of schema.org as seen below:

<a itemprop="url headline name" href="http://www.wired.com/2014/05/star-wars-storyboards-video/" rel="bookmark" title="Permanent Link to Check Out Early Storyboards From the Original Star Wars Trilogy">Check Out Early Storyboards From the Original <em>Star Wars</em> Trilogy</a>
</h1>
<link itemprop="image" href="http://www.wired.com/wp-content/uploads/2014/05/star-wars-storyboards-feat.jpg" />
...
    <li class="entryDate"><time itemprop="datePublished" datetime="2014-05-06T06:30:56+00:00">05.06.14</time>&nbsp;&nbsp;&#124;&nbsp;&nbsp;</li>
...
<span itemprop="articleBody"><p><iframe width="660" height="371" src="//www.youtube.com/embed/8RlpNvUumy0" frameborder="0" allowfullscreen></iframe></p>
<p>Sure, everyone gets excited about <a href="http://www.wired.com/2014/05/jj-abrams-star-wars-video/" target="_blank">May the Fourth</a>
...
</p></span>

A minimal implementation could include:
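
As a rough sketch of the idea (not python-goose code, and not its API), the following uses only the standard library's `html.parser` to read schema.org `itemprop` values and fall back to OpenGraph `<meta>` tags. The names `SemanticParser` and `extract_article` are hypothetical, the choice of fields is an assumption, and malformed markup or nested `itemscope` items are not handled. One pragmatic deviation from the microdata rules: for an `<a>` element like Wired's, `href` is stored only under `url`, and the link text is used for the remaining properties such as `headline`.

```python
from html.parser import HTMLParser

# Elements whose microdata value lives in an attribute rather than in text.
VALUE_ATTR = {"meta": "content", "link": "href", "img": "src", "time": "datetime"}
VOID = {"meta", "link", "img", "br"}          # never have a closing tag
BLOCK = {"p", "div", "br", "li", "h1", "h2", "h3", "h4", "h5", "h6"}


class SemanticParser(HTMLParser):
    """Hypothetical helper: collects itemprop values and og:/article: metas."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.itemprops = {}    # e.g. {"headline": "...", "datePublished": "..."}
        self.opengraph = {}    # e.g. {"og:title": "..."}
        self._captures = []    # open elements whose text content is the value

    def _store(self, props, value):
        for prop in props:
            self.itemprops.setdefault(prop, value)  # first occurrence wins

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        for cap in self._captures:
            if cap["tag"] == tag:
                cap["depth"] += 1          # same tag nested inside a capture
            if tag in BLOCK:
                cap["text"].append(" ")    # keep block boundaries as spaces
        prop = a.get("property") or ""
        if tag == "meta" and prop.startswith(("og:", "article:")):
            self.opengraph.setdefault(prop, a.get("content") or "")
        props = (a.get("itemprop") or "").split()
        if tag == "a" and "url" in props and a.get("href"):
            self._store(["url"], a["href"])
            props.remove("url")
        if not props:
            return
        attr = VALUE_ATTR.get(tag)
        if attr and a.get(attr) is not None:
            self._store(props, a[attr])
        elif tag not in VOID:
            self._captures.append(
                {"tag": tag, "props": props, "depth": 1, "text": []})

    def handle_data(self, data):
        for cap in self._captures:
            cap["text"].append(data)

    def handle_endtag(self, tag):
        for cap in list(self._captures):
            if tag in BLOCK:
                cap["text"].append(" ")
            if cap["tag"] != tag:
                continue
            cap["depth"] -= 1
            if cap["depth"] == 0:          # the capturing element itself closed
                self._captures.remove(cap)
                self._store(cap["props"], " ".join("".join(cap["text"]).split()))


def extract_article(html):
    """Prefer schema.org microdata; fall back to OpenGraph where it is missing."""
    parser = SemanticParser()
    parser.feed(html)
    parser.close()
    md, og = parser.itemprops, parser.opengraph

    def pick(*candidates):
        return next((value for value in candidates if value), None)

    return {
        "title": pick(md.get("headline"), md.get("name"), og.get("og:title")),
        "image": pick(md.get("image"), og.get("og:image")),
        "published": pick(md.get("datePublished"),
                          og.get("article:published_time")),
        "body": md.get("articleBody"),
    }
```

Calling `extract_article(page_html)` on markup like the Wired sample should return the headline, image URL, publication date and article body from the `itemprop` attributes, while a page that only carries `og:title` and `og:image` would still yield a title and image.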

@jeffnappi
Contributor Author

This was just a thought. I intend to implement this whether it becomes part of python-goose or not, but thought it would be good to open up a conversation about it.

Is this something that you would like to see added to python-goose?

@grangier
Owner

grangier commented May 7, 2014

Hello Jeff,

For a long time I have been thinking that the future of goose (version 2) would be based on extracting HTML5 semantic tags.

It seems obvious to me that most news sites now use <article> and other semantic tags in their markup. Since these tags are increasingly used for SEO, relying on them could make text extraction both faster and more reliable.
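
To illustrate what tag-based extraction might look like, here is a minimal standard-library sketch that keeps only the text inside `<article>` and drops some obviously non-content children. It is not how goose works today; the name `article_text` and the `SKIP` and `BLOCK` tag sets are assumptions made for the example.

```python
from html.parser import HTMLParser


class ArticleTextParser(HTMLParser):
    """Hypothetical helper: gathers text found inside <article> elements."""

    SKIP = {"script", "style", "nav", "aside", "footer"}   # assumed noise tags
    BLOCK = {"p", "div", "br", "li", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.article_depth = 0   # > 0 while inside at least one <article>
        self.skip_depth = 0      # > 0 while inside a skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_depth += 1
        elif self.article_depth:
            if tag in self.SKIP:
                self.skip_depth += 1
            elif tag in self.BLOCK:
                self.parts.append(" ")   # keep block boundaries as spaces

    def handle_endtag(self, tag):
        if tag == "article" and self.article_depth:
            self.article_depth -= 1
        elif tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif self.article_depth and tag in self.BLOCK:
            self.parts.append(" ")

    def handle_data(self, data):
        if self.article_depth and not self.skip_depth:
            self.parts.append(data)


def article_text(html):
    """Return whitespace-normalized <article> text, or None if there is none."""
    parser = ArticleTextParser()
    parser.feed(html)
    parser.close()
    text = " ".join("".join(parser.parts).split())
    return text or None
```

Returning `None` when no `<article>` is present would let a caller fall back to the existing heuristic extraction.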

xav

@mwjackson

What is the roadmap for v2? Is this something you'll realistically have time for?
