Wikipedia categories as NetworkX graphs for automatic content assessment.

marvel2010/wikipedia-content-assessment

Wikipedia Content Assessment

Goal

The goal of this project is to analyze the structure of the Wikipedia graph with respect to content quality. We conjecture that high-quality articles are disproportionately less likely to link to low-quality articles. If this is true, then information about the Wikipedia graph may be very helpful in predicting the quality of an article, which is an interesting machine learning task.

Content

Wikipedia articles are assessed by WikiProjects according to the content assessment guidelines. These guidelines place each article into one of seven categories: FA, A, GA, B, C, Start, Stub.

See this Wikipedia page for more details.
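For comparing articles programmatically, the seven assessment categories can be treated as an ordered scale. The snippet below is a minimal sketch, assuming the order listed above runs from highest to lowest quality. The names `QUALITY_CLASSES`, `QUALITY_RANK` and `is_higher_quality` are illustrative and not part of this repository.

```python
# Assessment categories as listed in this README.
# Assumption: the list is ordered from highest to lowest quality.
QUALITY_CLASSES = ["FA", "A", "GA", "B", "C", "Start", "Stub"]

# Hypothetical lookup table: category label -> integer rank (0 is the best).
QUALITY_RANK = {label: rank for rank, label in enumerate(QUALITY_CLASSES)}


def is_higher_quality(label_a: str, label_b: str) -> bool:
    """Return True if label_a is a strictly better category than label_b."""
    return QUALITY_RANK[label_a] < QUALITY_RANK[label_b]
```

An integer rank like this makes it straightforward to sort articles or to split them into "high" and "low" groups at a chosen threshold.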

Graph Structure

We represent all the articles in a given category as a directed graph. The vertices of the graph are the articles, and each arc represents a link from one article to another.

We use the Python package NetworkX to represent these graphs.
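To make the representation concrete, here is a minimal sketch of such a graph built directly with `networkx.DiGraph`, together with one possible way to probe the conjecture from the Goal section. It does not use this repository's classes. The node attribute name `quality`, the `HIGH`/`LOW` split, the function names and the four toy articles are all assumptions made for illustration, not real data.

```python
import networkx as nx

# Assumed split of the assessment categories into two groups.
HIGH = {"FA", "A", "GA"}
LOW = {"Start", "Stub"}


def build_toy_graph() -> nx.DiGraph:
    """Build a small, made-up article graph (not real Wikipedia data)."""
    graph = nx.DiGraph()
    # Each vertex is an article; its assessment is stored under the
    # hypothetical attribute name "quality".
    graph.add_node("P1", quality="FA")
    graph.add_node("P2", quality="GA")
    graph.add_node("P3", quality="Stub")
    graph.add_node("P4", quality="Start")
    # Each arc (u, v) means that article u links to article v.
    graph.add_edges_from([
        ("P1", "P2"), ("P2", "P1"), ("P1", "P3"),
        ("P3", "P4"), ("P4", "P3"), ("P3", "P1"),
    ])
    return graph


def low_quality_link_fraction(graph: nx.DiGraph, source_classes: set) -> float:
    """Fraction of links leaving articles in source_classes that point to LOW articles."""
    total = 0
    to_low = 0
    for source, target in graph.edges():
        if graph.nodes[source]["quality"] in source_classes:
            total += 1
            if graph.nodes[target]["quality"] in LOW:
                to_low += 1
    return to_low / total if total else 0.0


toy_graph = build_toy_graph()
high_fraction = low_quality_link_fraction(toy_graph, HIGH)
low_fraction = low_quality_link_fraction(toy_graph, LOW)
```

If the conjecture holds on real data, one would expect `high_fraction` to be noticeably smaller than `low_fraction`. The toy graph is constructed to show the computation, not to provide evidence.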

Example

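from wikipedia_category_graph import WikipediaCategoryGraph

# Create a graph object for the Wikipedia category "Extremal graph theory".
wikipedia_cat_graph_class_egt = WikipediaCategoryGraph("Extremal graph theory")

# Build the graph for this category (here with depth=1).
wikipedia_cat_graph_class_egt.construct_graph(depth=1)
# Print information about the nodes (articles) and edges (links).
wikipedia_cat_graph_class_egt.print_node_information()
wikipedia_cat_graph_class_egt.print_edge_information()
# Write the graph to a text file.
wikipedia_cat_graph_class_egt.graph_to_file("Extremal_graph_theory-depth1-full.txt")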
