1001 plots #112
Comments
I ended up using a lighter version of my PHP array; it was still more than 70 MB but reasonably usable. I was hoping to get more readable plots, but I fear the Markov chains were not sufficient this time. Anyway, I plan to update the README later and maybe try to generate another sample with the full array, although I doubt the results will be much better.
I have an (easy) idea to (maybe) improve my model without making it heavier: I'll let it run through a subpart of the corpus to train on words (as I did for my light array), and then run it through the rest of the corpus without adding the new words it encounters (adding new words is what makes the model much heavier after each pass). It will only increment the occurrence counts of already-known words, improving the statistical model without making it bigger. I hope to get more human-readable results this way.
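A minimal sketch of that two-pass idea in PHP (the function and variable names here are hypothetical, not taken from the author's repository):

```php
<?php
// Pass 1: learn the vocabulary and transition counts from a subset of
// the corpus. Every word encountered is added to the model.
function trainAndGrow(array &$model, array &$vocab, string $text): void {
    $words = preg_split('/\s+/', trim($text));
    for ($i = 0; $i < count($words) - 1; $i++) {
        [$cur, $next] = [$words[$i], $words[$i + 1]];
        $vocab[$cur] = true;
        $vocab[$next] = true;
        $model[$cur][$next] = ($model[$cur][$next] ?? 0) + 1;
    }
}

// Pass 2: run over the rest of the corpus, but only count transitions
// between words already in the vocabulary, so the statistics improve
// without any new words being added to the model.
function trainKnownOnly(array &$model, array $vocab, string $text): void {
    $words = preg_split('/\s+/', trim($text));
    for ($i = 0; $i < count($words) - 1; $i++) {
        [$cur, $next] = [$words[$i], $words[$i + 1]];
        if (isset($vocab[$cur]) && isset($vocab[$next])) {
            $model[$cur][$next] = ($model[$cur][$next] ?? 0) + 1;
        }
    }
}
```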
So I ended up using an even lighter version of the model, with 3000 plots used to learn the words and the rest used to refine the model. The result seems better: still not very readable, but sometimes funny. It reads like plots told by a child who has no proper grammar but a good enough vocabulary. Think of it that way and it can actually make some sense =) I also changed the length of the plots to between 50 and 250 words each. The result is here: http://louphole.com/divers/1001-plots.html
Not sure how much you've done with Markov chains before, but grammar quality is basically controlled by the (word) length of each phrase in your lookup hash. This is called the "order" in technical terms, at least according to Wikipedia. I looked over your code and it seems like your table is "word1" -> pick_random_of("word2", "word3", "word4"), which is essentially order 1. To get better results, your seed phrase "word1" should be a two- or three-word phrase, so the follow-up word makes more sense in context: instead of picking the next word based only on the word before it, pick it based on the previous two or three words, e.g. "word1 word2" -> array("word3", "word5"). If you're familiar with Perl at all, maybe give this a look over: I did a Markov Perl module for an entry a couple of years ago, and you can steal ideas from it.
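A sketch of the order-2 table described above, written in PHP for consistency with the author's code (names are illustrative, not from either repository):

```php
<?php
// Build an order-2 table: keys are two-word phrases, values are lists
// of follow-up words observed after that phrase in the corpus.
function buildOrder2Table(string $text): array {
    $words = preg_split('/\s+/', trim($text));
    $table = [];
    for ($i = 0; $i < count($words) - 2; $i++) {
        $key = $words[$i] . ' ' . $words[$i + 1];
        $table[$key][] = $words[$i + 2];
    }
    return $table;
}

// Generate text by sliding the two-word window forward, picking a
// random known follow-up word at each step.
// $seed must be a two-word phrase that occurs in the table.
function generate(array $table, string $seed, int $maxWords): string {
    $out = explode(' ', $seed);
    while (count($out) < $maxWords) {
        $key = $out[count($out) - 2] . ' ' . $out[count($out) - 1];
        if (!isset($table[$key])) {
            break; // dead end: no known continuation for this phrase
        }
        $out[] = $table[$key][array_rand($table[$key])];
    }
    return implode(' ', $out);
}
```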
I was aware of the order parameter for Markov chains, but the WikiPlots corpus contains many (many) proper nouns and I feared a higher order would bias the model into copying existing sentences. I also didn't have the time to test whether the results would be better. I did the same as you but in PHP; this is roughly the code I used for my model and text generation: https://github.com/WhiteFangs/WordBasedMarkov Thanks for your advice though!
My idea for this year is to generate 1001 plots, each around 50 words, and their titles using the WikiPlots dataset and simple Markov chains.
I didn't think I would find time in November to join this year's edition, but I found one available evening and started this. My handicap is that I planned to do this in only a few hours, using PHP (for a lot of not-very-good reasons).
Anyway, I started a few hours ago and struggled to build the statistical model for my Markov chain generator from a 220 MB text file containing all the plots, but I found a way (basically by cutting it into smaller files). Now I'm stuck with a >200 MB PHP array that I will try to use to generate the small plots. Let's hope it works; pray for my RAM.
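The author cut the corpus into smaller files; an alternative sketch of the same step would be to stream the file line by line so the raw text never has to fit in memory at once (the file name here is hypothetical):

```php
<?php
// Stream the plots file one line at a time and accumulate the order-1
// transition counts, instead of loading all 220 MB up front.
$model = [];
$handle = fopen('plots.txt', 'r'); // hypothetical corpus file name
if ($handle === false) {
    die("Cannot open corpus file\n");
}
while (($line = fgets($handle)) !== false) {
    $words = preg_split('/\s+/', trim($line));
    for ($i = 0; $i < count($words) - 1; $i++) {
        $model[$words[$i]][$words[$i + 1]] =
            ($model[$words[$i]][$words[$i + 1]] ?? 0) + 1;
    }
}
fclose($handle);
```

Note that streaming only avoids holding the raw text; the resulting PHP array itself would still be the >200 MB problem described above.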
I plan to release the array generation code as well as the text generation code (but not the full data, because it's a bit heavy and can be rebuilt from the dataset).