The aim of this project is to make analysing the contents of a Japanese ebook easy, and to streamline the process for non-technical users. You can analyse an ebook and see the following information:
- The length of the book in words
- The length of the book in characters
- The number of unique words used in the book
- The number of unique words that are only used once in the book
- The percentage of unique words that are only used once
- The number of unique characters used
- The number of unique characters that are only used once
- The percentage of unique characters that are only used once
- A list of all the words used in the book as well as how often they are used
- A list of all the characters used in the book as well as how often they are used
For text processing, we use MeCab.
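The statistics listed above can be sketched with a frequency counter. This is a minimal illustration, not the project's actual implementation: the function name `frequency_stats` and the sample word list are hypothetical, and the words are assumed to be already tokenised (in the project, MeCab does that step). How the real code treats punctuation and whitespace is not specified here, so this sketch simply counts whatever it is given.

```python
from collections import Counter


def frequency_stats(items):
    """Summarise how often each item (word or character) occurs.

    `items` is any iterable of hashable items, e.g. a list of words
    or a string (which is iterated character by character).
    """
    counts = Counter(items)
    total = sum(counts.values())          # length in words / characters
    unique = len(counts)                  # number of distinct items
    used_once = sum(1 for n in counts.values() if n == 1)
    percent_once = 100 * used_once / unique if unique else 0.0
    return {
        "total": total,
        "unique": unique,
        "used_once": used_once,
        "percent_used_once": percent_once,
        # (item, count) pairs, most frequent first
        "frequencies": counts.most_common(),
    }


# Hypothetical pre-tokenised input; a real run would get this from MeCab.
words = ["猫", "が", "好き", "猫", "が", "いる"]

word_stats = frequency_stats(words)
char_stats = frequency_stats("".join(words))

print(word_stats["total"], word_stats["unique"], word_stats["used_once"])
print(char_stats["total"], char_stats["unique"], char_stats["used_once"])
```

Using one function for both words and characters keeps the two sets of statistics consistent, since a string iterates character by character.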
Currently, the project is not deployed anywhere, so to use the service, you will need to follow the steps below in the development section to get the server running.
- Upload a .epub file containing Japanese text to the server
- The server will redirect you to a page showing you information about the ebook. You can then click the 'See more details' button to see all the generated data, including a list of all the words used together with the number of occurrences of each word, and the same for the characters.
- Clone repository:
git clone https://github.com/christofferaakre/japanese-ebook-analysis.git
- Make sure you have mecab set up on your system. See http://www.robfahey.co.uk/blog/japanese-text-analysis-in-python/ for a good guide on how to set it up. (Only required if you will actually upload ebooks or run the analyse_epub.py script, which you will not need to do to contribute to other parts of the app.)
- Install Python dependencies:
pip install -r requirements.txt
- Install other dependencies (these all need to be in your system path):
pandoc
- Run ./app.py to start the Flask dev server
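Before starting the server, it can help to confirm that the external tools are actually on your system path. The following is a small sketch, not part of the repository: the helper name `missing_tools` is hypothetical, and it assumes the executables are named `pandoc` and `mecab` on your system.

```python
import shutil


def missing_tools(tools):
    """Return the tools from `tools` that cannot be found on the system path."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Assumed executable names; adjust if they differ on your system.
missing = missing_tools(["pandoc", "mecab"])
if missing:
    print("Not found on PATH: " + ", ".join(missing))
else:
    print("All external tools found.")
```

If a tool is reported missing even though it is installed, the directory containing it is probably not included in your PATH.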
I'm very happy to receive any contributions! Before contributing, please have a look at CONTRIBUTING.md. To see what needs work, have a look at the repo's Issues and its Pull requests.
Feel free to submit your own issue or pull request about a new feature or anything else. When submitting a pull request, don't be afraid to modify any of the files; I'm not very attached to the coding style used in the repo.