Checking out the diff between master and idio master #2

Open
wants to merge 59 commits into
base: master
Commits
f35d960
Article Object Changes - Added the links start and end position in the
May 29, 2015
12ba0f6
Added JitPack Repo to the Build File.
Jun 1, 2015
cc09ed9
adding paragraphs with links
dav009 Jun 4, 2015
85ce42a
removing template from paragraph text
dav009 Jun 4, 2015
6142c69
fixing tests
dav009 Jun 4, 2015
4829460
fixing more tests
dav009 Jun 4, 2015
02d1046
adding paragraphWithLinks
dav009 Jun 16, 2015
36e3c87
handling empty anchors
dav009 Jun 22, 2015
651e416
removing unused imports
dav009 Jun 22, 2015
b4d804c
bumping jwpl
dav009 Jun 23, 2015
16d026b
adding tests
dav009 Jun 23, 2015
89d26cc
bumping jwpl versions in pom
dav009 Jun 23, 2015
53a4476
identation
dav009 Jun 23, 2015
6ab48f9
removing jars
dav009 Jun 23, 2015
ab55a8a
adding installed dep jars
dav009 Jun 23, 2015
1c743df
Filtering empty wikipedia Links
dav009 Jun 23, 2015
c8f2726
styling
dav009 Jun 23, 2015
1d80f5c
jsonpedia using spark
dav009 Jul 9, 2015
2594073
cleaning tabs
dav009 Jul 14, 2015
510a87e
adding new jpwl parser version
dav009 Jul 14, 2015
8cd03d2
better comment on included class
dav009 Jul 14, 2015
6b03f5a
tabs
dav009 Jul 14, 2015
5ee9d06
bumping version
dav009 Jul 14, 2015
ac3ebd5
fixing parallel processing
dav009 Jul 17, 2015
61b982c
updating readme
dav009 Jul 17, 2015
3646665
better names
dav009 Jul 17, 2015
5b571d0
deleting split util
dav009 Jul 17, 2015
1ff26f7
updating usage
dav009 Jul 17, 2015
1dd6537
saving space
dav009 Jul 17, 2015
16f29bd
outputing progress
dav009 Jul 17, 2015
1a46c50
commenting stopwatch
dav009 Jul 17, 2015
666c343
removing benchmarking
dav009 Jul 21, 2015
a7eb6e0
extracting annotations from tables & lists
dav009 Aug 17, 2015
7c1deaf
adding test files
dav009 Aug 17, 2015
0c5931c
updating jwpl parser
dav009 Aug 19, 2015
872b78c
adding test for uris with colons
dav009 Aug 19, 2015
7c70696
updating pom(updating jwpl)
dav009 Aug 19, 2015
ace1619
removing unused deps
dav009 Aug 19, 2015
76929dd
adding new JWPL parser
dav009 Aug 19, 2015
a0f5d8f
adding jwpl deps
dav009 Aug 21, 2015
11a0ab8
adding jwpl code
dav009 Aug 21, 2015
5457159
adding jwpl tests
dav009 Aug 24, 2015
c9cd21f
removing jwpl from mvn repo
dav009 Aug 24, 2015
d53cc83
adding base makefile
dav009 Aug 24, 2015
2a663f0
missing dep?
dav009 Aug 24, 2015
5ce83a7
adding travis ci
dav009 Aug 24, 2015
db7b555
using jitpack deps
dav009 Aug 25, 2015
48f59de
updating bliki change
dav009 Aug 25, 2015
b005fad
deleting lib folder
dav009 Aug 25, 2015
b3cbe46
fixing format
dav009 Aug 25, 2015
08df06e
Don't fail with NullPointerException
Aug 25, 2015
5eadf6b
disable travis emails
Aug 25, 2015
b2a5ac2
Split "...|..." greedily.
Aug 25, 2015
e368586
remove commented code
Aug 26, 2015
52fb73d
multiple |
Aug 26, 2015
b26007a
Fixing colon bug
dav009 Sep 3, 2015
b6c68df
adding NE
dav009 Sep 8, 2015
4fa3b6a
using isANE function
dav009 Sep 8, 2015
bb23b6b
Adding more NE
dav009 Sep 9, 2015
11 changes: 11 additions & 0 deletions .travis.yml
@@ -0,0 +1,11 @@
language: java
jdk:
- oraclejdk7
install: mvn clean test package
notifications:
hipchat:
rooms:
secure: HAIO6qjP1Os4yCduLwRfNrXP9K5v3hOpbXs/HOcoPzh0WACTNZUBJx8GPUWlcUJlzaxkUG05QC91dTn2kTyNteFE9xayKw347i7xywcYDttJdEL8M0agIkZowyrOfmiG+wFv/vCayZpf0T/MPYE1gDvFeP4yuP7CU0pdy1j0SRUkHvcpoXcx1OjbW7kMbiO1WedhzGZwrLWabw8UmNybvoyVSZmtBd4acRuOfzOfbynWoWL/9HD9jUoeCIbQeXWFxXGnQb22QpNWRYu7ZMH1ppkE/lI5zJtqZS5BjfEfK3tql4WaL72r4xO4uHkqi0xjUlP0iyQi4WW9G1PGFz6GmZSCrr04eyRQovhEaQcoces4+Q+uT/3pHplT12kE8y5wGTwYJlCfIRjGYN/uqfT6EEdl+8E4ZCA/o3t7rEHrRqxD98Pt+Q/y+GJxtYmGr4n+HAGCfa3BE4uij3N7EisO6mJ7NtMp5g8UGEvnqFV/eV46Urhoih4ILPhTkMKvX7PRxnSvrqE7toY9PXC5Ufkmt5TU7RmTNRT1cnH28MqlpeywWKNkNr8I6chfUCDNeEHT6Ckj00/l+CbQfNtQWQt7XApcCui0cxvTuZSrAKVzNTrWKcN/tMP5XyJ6t/bxBZAFDV4YHzK8fwzNHjz71sLGYukdvls4yLsEW74JqiQZ6Oo=
template:
- 'Build #%{build_number} (%{commit}) of %{repository_name}/%{branch} %{result} (%{duration}) %{build_url}'
email: false
83 changes: 53 additions & 30 deletions README.md
@@ -5,26 +5,49 @@ json-wikipedia ![json-wikipedia](https://dl.dropboxusercontent.com/u/4663256/tmp

- Please be aware that this tool does not work with the `multistream` dump.

#### Setup ####
What's different about this fork:

- Uses Apache Spark to speed up the conversion to JSON. The original json-wikipedia runs on a single thread.
- Fixes some issues with JWPL, which means less noisy extractions.
- Splits each article into paragraphs and returns a list of links with correct spans.
- Extracts links from each article's paragraphs with matching spans.
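
To make "links with matching spans" concrete, here is a minimal sketch of what such a link could look like. It assumes each link carries an `id`, an `anchor`, and start/end character offsets into its paragraph, as the commit messages and the schema suggest. The class, field, and method names (`LinkSpanSketch`, `Link`, `firstOccurrence`, `spansMatch`) are hypothetical and are not taken from the fork's code.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: class, field, and method names are hypothetical
// and are not taken from the fork's source code.
final class LinkSpanSketch {

    /** A link whose [start, end) character offsets index into its paragraph. */
    static final class Link {
        final String id;     // target page of the link
        final String anchor; // text displayed in the paragraph
        final int start;     // inclusive offset of the anchor
        final int end;       // exclusive offset of the anchor

        Link(String id, String anchor, int start, int end) {
            this.id = id;
            this.anchor = anchor;
            this.start = start;
            this.end = end;
        }
    }

    /** Builds a link for the first occurrence of the anchor in the paragraph. */
    static Link firstOccurrence(String paragraph, String id, String anchor) {
        int start = paragraph.indexOf(anchor);
        if (start < 0) {
            throw new IllegalArgumentException("anchor not found: " + anchor);
        }
        return new Link(id, anchor, start, start + anchor.length());
    }

    /** Returns true when every span selects exactly its anchor text. */
    static boolean spansMatch(String paragraph, List<Link> links) {
        for (Link link : links) {
            boolean inBounds = link.start >= 0
                    && link.start <= link.end
                    && link.end <= paragraph.length();
            if (!inBounds
                    || !paragraph.substring(link.start, link.end).equals(link.anchor)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String paragraph = "Leonardo was born in Vinci, near Florence.";
        List<Link> links = new ArrayList<Link>();
        links.add(firstOccurrence(paragraph, "Vinci", "Vinci"));
        links.add(firstOccurrence(paragraph, "Florence", "Florence"));
        System.out.println(spansMatch(paragraph, links));
    }
}
```

A span "matches" in this sketch when `paragraph.substring(start, end)` equals the anchor text, which is the property a consumer would rely on when mapping links back onto paragraph text.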


## Convert the Wikipedia XML to JSON

### The Docker way

Enjoy a dockerized image:

`docker run -v <LOCALPATH>:/mnt -i -t dav009/jsonwikipedia -input <PATHTOWIKI> -output <OUTPUTPATH> -lang <LANG> -action export-parallel`

compile the project running
For example, if my `english_wikipedia.dump` lives at `/david/data/english_wikipedia.dump`, I could run it as:

mvn assembly:assembly

the command will produce a JAR file containing all the dependencies the target folder.
`docker run -v /david/data:/mnt -i -t dav009/jsonwikipedia -input /mnt/english_wikipedia.dump -output /mnt/english_wikipedia.json -lang en -action export-parallel`

#### Convert the Wikipedia XML to JSON ####
Note that the output path is a path inside the Docker container. In the example above, the output path is part of the mounted volume, so the result will be available on the host machine.

java -cp target/json-wikipedia-1.0.0-jar-with-dependencies.jar it.cnr.isti.hpc.wikipedia.cli.MediawikiToJsonCLI -input wikipedia-dump.xml.bz -output wikipedia-dump.json[.gz] -lang [en|it]

or
### Doing it yourself

./scripts/convert-xml-dump-to-json.sh [en|it] wikipedia-dump.xml.bz wikipedia-dump.json[.gz]
1. Compile the project by running `mvn assembly:assembly`. The command produces a JAR file containing all the dependencies in the `target` folder.
2. Download Spark 1.3.1: http://www.apache.org/dyn/closer.cgi/spark/spark-1.3.1/spark-1.3.1.tgz
3. Download Wikipedia Dump ( https://dumps.wikimedia.org/backup-index.html )
4. Uncompress the Wikipedia Dump
5. Run:

produces in `wikipedia-dump.json` the JSON version of the dump ([here you can find an example](https://dl.dropboxusercontent.com/u/4663256/tmp/json-wikipedia-sample.json)). Each line of the file contains an article
of dump encoded in JSON. Each JSON line can be deserialized in an [Article](http://sassicaia.isti.cnr.it/javadocs/json-wikipedia/it/cnr/isti/hpc/wikipedia/article/Article.html) object,
which represents an
_enriched_ version of the wikitext page. The Article object contains:
SPARKFOLDER/bin/spark-submit --driver-memory 10G --class it.cnr.isti.hpc.wikipedia.cli.MediawikiToJsonCLI json-wikipedia-1.0.0-jar-with-dependencies.jar -input <PATHTODBPEDIADUMP> -output <PATHTONEWJSONPEDIA> -lang <LANG> -action export-parallel

This produces the JSON version of the dump in `<PATHTONEWJSONPEDIA>`.

You can also call Jsonpedia the usual way, but it will use a single thread to process the wiki:

java -cp target/json-wikipedia-1.0.0-jar-with-dependencies.jar it.cnr.isti.hpc.wikipedia.cli.MediawikiToJsonCLI -input wikipedia-dump.xml.bz -output wikipedia-dump.json[.gz] -lang [en|it] -action export

### What does Jsonpedia look like?

An example is available [here](https://dl.dropboxusercontent.com/u/4663256/tmp/json-wikipedia-sample.json). Each line of the file contains one article
of the dump encoded in JSON. Each JSON line can be deserialized into an [Article](http://sassicaia.isti.cnr.it/javadocs/json-wikipedia/it/cnr/isti/hpc/wikipedia/article/Article.html) object, which represents an _enriched_ version of the wikitext page. The Article object contains:


* the title (e.g., Leonardo Da Vinci);
@@ -45,18 +68,18 @@ _enriched_ version of the wikitext page. The Article object contains:
* a list of terms highlighted in the article;
* if present, the infobox.
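
Because each line holds one JSON-encoded article, the dump can be scanned with plain line-oriented I/O before any JSON parsing is involved. The following is a minimal sketch, assuming the dump is a single `.json` or `.json.gz` file. The class and method names (`JsonLinesSketch`, `countArticles`) are hypothetical and not part of json-wikipedia. The sketch only counts non-empty lines and does not deserialize them.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;

// Illustrative sketch only: names are hypothetical, not part of json-wikipedia.
final class JsonLinesSketch {

    /** Counts non-empty lines, i.e. JSON-encoded articles, in a dump file. */
    static long countArticles(Path dump) throws IOException {
        // Assumption: a ".gz" suffix means the file is gzip-compressed.
        boolean gzipped = dump.getFileName().toString().endsWith(".gz");
        long count = 0;
        try (InputStream raw = Files.newInputStream(dump);
             InputStream in = gzipped ? new GZIPInputStream(raw) : raw;
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.trim().isEmpty()) {
                    count++; // one article per line
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: JsonLinesSketch <dump.json[.gz]>");
            return;
        }
        System.out.println(countArticles(Paths.get(args[0])));
    }
}
```

To turn each line into an `Article`, a JSON library or the `RecordReader` shown in the Usage section is still needed. This sketch only illustrates the one-article-per-line layout.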

#### Usage ####
#### Usage

Once you have created (or downloaded) the JSON dump (say `wikipedia.json`), you can iterate over the articles of the collection
easily using this snippet:

RecordReader<Article> reader = new RecordReader<Article>(
"wikipedia.json",new JsonRecordParser<Article>(Article.class)
).filter(TypeFilter.STD_FILTER);
RecordReader<Article> reader = new RecordReader<Article>(
"wikipedia.json",new JsonRecordParser<Article>(Article.class)
).filter(TypeFilter.STD_FILTER);

for (Article a : reader) {
// do what you want with your articles
}
for (Article a : reader) {
// do what you want with your articles
}

You can also add some filters in order to iterate only on certain articles (in the example
we used only the standard type filter, which excludes meta pages e.g., Portal: or User: pages.).
@@ -67,38 +90,38 @@ of the [hpc-utils](http://sassicaia.isti.cnr.it/javadocs/hpc-utils) package.

In order to use these classes, you will have to install `json-wikipedia` in your maven repository:

mvn install
mvn install

and import the project in your new maven project adding the dependency:

<dependency>
<groupId>it.cnr.isti.hpc</groupId>
<dependency>
<groupId>it.cnr.isti.hpc</groupId>
<artifactId>json-wikipedia</artifactId>
<version>1.0.0</version>
</dependency>
</dependency>
#### Schema ####

```
|-- categories: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- description: string (nullable = true)
| | |-- anchor: string (nullable = true)
| | |-- id: string (nullable = true)
|-- externalLinks: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- description: string (nullable = true)
| | |-- anchor: string (nullable = true)
| | |-- id: string (nullable = true)
|-- highlights: array (nullable = true)
| |-- element: string (containsNull = false)
|-- infobox: struct (nullable = true)
| |-- description: array (nullable = true)
| |-- anchor: array (nullable = true)
| | |-- element: string (containsNull = false)
| |-- name: string (nullable = true)
|-- integerNamespace: integer (nullable = true)
|-- lang: string (nullable = true)
|-- links: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- description: string (nullable = true)
| | |-- anchor: string (nullable = true)
| | |-- id: string (nullable = true)
|-- lists: array (nullable = true)
| |-- element: array (containsNull = false)
@@ -119,7 +142,7 @@ and import the project in your new maven project adding the dependency:
| | | | |-- element: string (containsNull = false)
|-- templates: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- description: array (nullable = true)
| | |-- anchor: array (nullable = true)
| | | |-- element: string (containsNull = false)
| | |-- name: string (nullable = true)
|-- templatesSchema: array (nullable = true)
5 changes: 0 additions & 5 deletions lib/info/bliki/wiki/bliki-core/3.0.16/_maven.repositories

This file was deleted.


95 changes: 0 additions & 95 deletions lib/info/bliki/wiki/bliki-core/3.0.16/bliki-core-3.0.16.pom

This file was deleted.


3 changes: 0 additions & 3 deletions lib/info/bliki/wiki/bliki/3.0.16/_maven.repositories

This file was deleted.
