My script for finding books by looking at bookshelves of people who read similar books #36
#!/usr/bin/env perl
#<--------------------------------- MAN PAGE --------------------------------->|
=pod
=head1 NAME
bookfinder - finding books by looking at bookshelves of people who read similar books
=head1 PURPOSE
=over
=item * fetches books with 4 and 5 stars in your profile
=item * crawls reviews of these books to find users who also rated it 4 or 5 stars
=item * looks up the bookshelves of those users to see which books they rated 4 or 5 stars
=item * ranks books based on number of votes from these users
=item * also ranks users by number of books they have in common (min 3)
=item * gives special treatment to users who love the same books as you and also hate the same books as you (1 or 2 stars)
=back
=head1 SYNOPSIS
B<bookfinder.pl>
[B<-n> F<number>]
[B<-a> F<number>]
[B<-x> F<number>]
[B<-d> F<filename>]
[B<-u> F<number>]
[B<-c> F<numdays>]
[B<-o> F<filename>]
[B<-s> F<shelfname> ...]
[B<-i>]
F<goodloginmail> [F<goodloginpass>]
=head1 OPTIONS
Mandatory arguments to long options are mandatory for short options too.
=over 4
=item B<-n, --maxbooks>=F<number>
Max number of books in a user's bookshelf. Currently set to
500. People who have hundreds or thousands of books often
add more noise than signal to your results.
=item B<-x, --rigor>=F<numlevel>
we need to find members who rate the books of our authors,
though Goodreads just shows a few ratings.
We exploit ratings filters and the reviews-search to find more members:
level 1 = filters-based search of book-raters (max 5400 ratings) - default
level 2 = like 1 plus dict-search if >3000 ratings with stall-time of 2min
level n = like 1 plus dict-search with stall-time of n minutes
Rigor level 0 is useless here (latest readers only),
and 2+ (dict-search) has a bad cost/benefit ratio given hundreds of books.
=item B<-d, --dict>=F<filename>
default is F<./list-in/dict.lst>
=item B<-u, --userid>=F<number>
check another member instead of the one identified by the login-mail
and password arguments. You find the ID by looking at the shelf URLs.
=item B<-c, --cache>=F<numdays>
number of days to store and reuse downloaded data in F</tmp/FileCache/>,
default is 31 days. This helps with cheap recovery on a crash, power blackout
or pause, and when experimenting with parameters. Loading data from Goodreads
is a very time consuming process.
=item B<-o, --outfile>=F<filename>
name of the CSV file where we write the ranked books to, default is
F<./list-out/bookfinder-books.csv>
=item B<-i, --ignore-errors>
Don't retry on errors, just keep going.
Sometimes useful if a single Goodreads resource hangs over long periods
and you're okay with some values missing in your result.
This option is not recommended when you run the program unattended.
=item B<-?, --help>
show full man page
=back
=head1 FILES
F<./list-in/dict.lst>
F<./list-out/bookfinder-users.csv>
F<./list-out/bookfinder-books.csv>
F</tmp/FileCache/>
=head1 EXAMPLES
$ ./bookfinder.pl [email protected] MyPASSword
$ ./bookfinder.pl -c 31 -o myfile.csv [email protected] pass
=head1 REPORTING BUGS
Report bugs to <[email protected]> or use Github's issue tracker
L<https://github.com/andre-st/goodreads-toolbox/issues>
=head1 COPYRIGHT
This is free software. You may redistribute copies of it under the terms of
the GNU General Public License L<https://www.gnu.org/licenses/gpl.html>.
There is NO WARRANTY, to the extent permitted by law.
=head1 VERSION
2020-01-23 (Since 2018-06-22)
=cut
#<--------------------------------- 79 chars --------------------------------->|
use strict;
use warnings qw(all);
use locale;
use 5.18.0;
# Perl core:
use FindBin;
use lib "$FindBin::Bin/lib/";
use Time::HiRes qw(time tv_interval);
use POSIX qw(strftime floor locale_h);
use File::Spec; # Platform indep. directory separator
use IO::File;
use Getopt::Long;
use Pod::Usage;
# Third party:
use Text::CSV;
# Ours:
use Goodscrapes;
# ----------------------------------------------------------------------------
# Program configuration:
#
setlocale(LC_CTYPE, "en_US"); # GR dates all en_US
STDOUT->autoflush(1);
gsetopt(cache_days => 31);
our $TSTART = time();
our $MINCOMMON = 5;
our $MAXAUBOOKS = 100;
our $RIGOR = 1;
our $MAXBOOKS = 500;
our $DICTPATH = File::Spec->catfile($FindBin::Bin, 'list-in', 'dict.lst');
our $OUTPATH;
our @SHELVES;
our $USERID;
GetOptions('rigor|x=i' => \$RIGOR,
'dict|d=s' => \$DICTPATH,
'userid|u=s' => \$USERID,
'outfile|o=s' => \$OUTPATH,
'maxbooks|n=i' => \$MAXBOOKS,
'shelf|s=s' => \@SHELVES,
'ignore-errors|i' => sub {gsetopt(ignore_errors => 1);},
'cache|c=i' => sub {gsetopt(cache_days => $_[1]);},
'help|?' => sub {pod2usage(-verbose => 2);})
or pod2usage(1);
pod2usage(1) if !$ARGV[0];
glogin(usermail => $ARGV[0], # Login also allows to load 200 books in 1 request
userpass => $ARGV[1], # Asks pw if omitted
r_userid => \$USERID);
sub bookshelf {
my $id = shift;
my %books;
print "\nLooking up bookshelf of $id..";
greadshelf(from_user_id => $id,
ra_from_shelves => [ 'read' ],
rh_into => \%books,
# on_book => sub{},
on_progress => gmeter('books')
);
my (@good, @bad);
for my $book_id (keys %books) {
my $book = $books{$book_id};
#next unless $book->{title} =~ /Club/;
my $rating = $book->{user_rating} // 0; # 0 = not rated
push(@good, $book) if ($rating >= 4);
push(@bad, $book) if ($rating >= 1 && $rating <= 2); # unrated books are not "bad"
#warn("cannot find rating for $book->{title} of $id\n") unless ($rating >= 1);
}
return (\@good, \@bad);
}
sub bookgenres {
my $bid = shift;
my $html = Goodscrapes::_html(Goodscrapes::_book_url($bid));
my @genres;
while ($html =~ m[href="/genres/([\w-]+)"]g) {
push(@genres, $1);
}
return \@genres;
}
my ($su_good, $su_bad) = bookshelf($USERID);
my (%good_users, %good_books, %haters);
for my $b (@$su_good) {
print "\nLooking up reviews for $b->{title}..";
$b->{reviews} = {};
greadreviews(rh_for_book => $b,
rh_into => $b->{reviews},
rigor => $RIGOR,
dict_path => $DICTPATH,
on_progress => gmeter('memb'));
for my $rev (values %{$b->{reviews}}) {
my $u = $rev->{rh_user};
if ($rev->{rating} >= 4) {
$good_users{$u->{id}} = { 'votes' => (defined($good_users{$u->{id}}->{votes}) ? $good_users{$u->{id}}->{votes} : 0) + 1, 'user' => $u };
} elsif ($rev->{rating} >= 1 && $rev->{rating} <= 2) { # ignore reviews without a rating
$haters{$u->{id}} = { 'votes' => (defined($haters{$u->{id}}->{votes}) ? $haters{$u->{id}}->{votes} : 0) + 1, 'user' => $u };
}
}
}
for my $u (keys %good_users) {
$good_users{$u}->{'bad'} = defined($haters{$u}->{votes}) ? $haters{$u}->{votes} : 0;
}
printf("\nHere are your best users (out of %d users):\n", scalar keys %good_users);
my $filename = File::Spec->catfile($FindBin::Bin, 'list-out', "bookfinder-users.csv");
my $csv = Text::CSV->new({ binary => 1, eol => $/ }) or die "Failed to create a CSV handle: $!";
open my $fh, ">:encoding(utf8)", $filename or die "failed to create $filename: $!";
$csv->print($fh, [ 'uid', 'name', 'good_common', 'bad_common', 'total_common', 'total_books', 'ratio', 'url' ]);
for my $user_id (keys %good_users) {
my $userHash = $good_users{$user_id};
if (($user_id ne $USERID) && ($userHash->{votes} >= 2)) {
my $user = $userHash->{user};
my ($uBooks) = bookshelf($user_id); # list context: take the 4-5 star list; scalar context would yield the 1-2 star list
my $numBooks = scalar @$uBooks;
if (!$MAXBOOKS || ($numBooks <= $MAXBOOKS)) {
my $total = $userHash->{votes} + $userHash->{bad};
$csv->print($fh, [ $user->{id}, $user->{name}, $userHash->{votes}, $userHash->{bad}, $total, $numBooks, $numBooks > 0 ? $total / $numBooks : 0, "https://www.goodreads.com/review/list/$user_id?sort=rating" ]);
for my $gb (@$uBooks) {
$good_books{$gb->{id}} = { 'votes' => (defined($good_books{$gb->{id}}->{votes}) ? $good_books{$gb->{id}}->{votes} : 0) + 1, 'book' => $gb };
}
} else {
print "\nskipped books for $user_id: $numBooks > $MAXBOOKS\n";
}
}
}
close $fh or die "failed to close $filename: $!";
printf("\nHere are your best books (out of %d books):\n", scalar keys %good_books);
$OUTPATH = File::Spec->catfile($FindBin::Bin, 'list-out', "bookfinder-books.csv") if !$OUTPATH;
$csv = Text::CSV->new({ binary => 1, eol => $/ }) or die "Failed to create a CSV handle: $!";
open $fh, ">:encoding(utf8)", $OUTPATH or die "failed to create $OUTPATH: $!";
$csv->print($fh, [ 'bid', 'title', 'author', 'votes', 'avg_rating', 'num_ratings', 'genres', 'img_url' ]);
for my $bk (sort {$b->{votes} <=> $a->{votes}} values(%good_books)) {
if ($bk->{votes} > 1) {
my $b = $bk->{book};
my $genres = bookgenres($b->{id});
printf("%s with %d votes\n", $b->{title}, $bk->{votes});
$csv->print($fh, [ $b->{id}, $b->{title}, $b->{rh_author}->{name}, $bk->{votes}, $b->{avg_rating}, $b->{num_ratings}, join(', ', @$genres), $b->{img_url} ]);
}
}
close $fh or die "failed to close $OUTPATH: $!";
For this to work, a minor patch is needed:
$bk{ user_rating } = $row =~ /data-rating="(\d+)"/ ? ($1?$1:0) : 0;
I guess Goodreads has changed the HTML, so the user rating was always 0. The line above fixes it.
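As a rough illustration of what the patched expression does, here is a minimal, self-contained sketch. The sample rows and the helper name `user_rating` are made up for this example; the real Goodreads markup and the surrounding Goodscrapes code may look different.

```perl
use strict;
use warnings;

# Hypothetical shelf-row fragments (assumption: the real HTML may differ).
my @rows = (
    '<div class="stars" data-rating="4"></div>',   # rated 4 stars
    '<div class="stars" data-rating="0"></div>',   # explicitly unrated
    '<div class="stars"></div>',                   # attribute missing
);

# Same expression as in the patch: capture the digits from data-rating
# and fall back to 0 when the attribute is missing or holds "0".
sub user_rating {
    my ($row) = @_;
    return $row =~ /data-rating="(\d+)"/ ? ($1 ? $1 : 0) : 0;
}

print join(',', map { user_rating($_) } @rows), "\n";   # prints 4,0,0
```

A rating of 0 therefore means "no rating found", which is worth keeping in mind wherever the script compares ratings against a threshold.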
Hi San Kumar, thanks for sharing your script. I will definitely check this out over the course of the next week.
Super like
This is exactly what I've been looking for! Can it be run in Docker?
I haven't tried it but shouldn't be so hard. Just modify goodreads-toolbox
I added your script and the patch to the goodreads-toolbox directory and then modified the .dockerignore file to include the new script in the exceptions list, then rebuilt the container from my local drive instead of pointing to GitHub in the build command. However, it seems to have broken my bash prompt and I get "no such file or directory" when trying to run any of the scripts in the container. Oh well! I'm not a Linux programmer and have never messed around with Docker before until today. I realize this isn't a Docker help forum, however if you happen to have any tips I would love to hear them. Thank you for your awesome work on this! I hope the toolbox will be supported again one day and this can be added as an official script.
I think your
Thanks again for your help! For anyone who stumbles across this in future, here are all the steps I took to eventually get this working in Docker for Windows:
That's it!
Love this toolbox, but it was missing a feature for finding books by looking at the bookshelves of people who read similar books. So I wrote this small Perl script for that today.
Here is how it works:
fetches books with 4 and 5 stars in your profile
crawls reviews of these books to find users who also rated it 4 or 5 stars
looks up the bookshelves of those users to see which books they rated 4 or 5 stars
ranks books based on number of votes from these users
also ranks users by number of books they have in common (min 3)
also gives more votes to users who love the same books as you but also hate the same books as you (i.e. 1 or 2 star)
Output:
My Perl is a little rusty, so this isn't the best way to do it, but then the Perl motto is TIMTOWTDI, and it did produce some good output.
Let me know what you guys think. I will post the script in the next comment.
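To make the voting idea above concrete, here is a minimal offline sketch. It does not call Goodreads or Goodscrapes; the ratings, user names, and the helper `recommend` are invented for illustration, and unlike the actual script it skips books you have already rated and leaves out the extra weighting for shared dislikes.

```perl
use strict;
use warnings;

# Made-up sample data: my own ratings and other users' shelves
# (title => stars). Real data would come from Goodreads.
my %my_ratings = ('Dune' => 5, 'Emma' => 4, 'Ulysses' => 1);
my %shelves = (
    alice => { 'Dune' => 5, 'Emma' => 5, 'Hyperion' => 5, 'Neuromancer' => 4 },
    bob   => { 'Dune' => 4, 'Hyperion' => 4, 'Ulysses' => 5 },
    carol => { 'Emma' => 2, 'Hyperion' => 1 },
);

# Returns a hashref: title => number of like-minded users who rated it 4+.
sub recommend {
    my ($mine, $others, $min_common) = @_;
    my %votes;
    for my $user (sort keys %$others) {
        my $shelf = $others->{$user};
        # Number of books that both of us rated 4 or 5 stars
        my $common = grep { $mine->{$_} >= 4 && ($shelf->{$_} // 0) >= 4 }
                     keys %$mine;
        next if $common < $min_common;   # not like-minded enough
        for my $title (keys %$shelf) {
            next if exists $mine->{$title};            # already rated by me
            $votes{$title}++ if $shelf->{$title} >= 4; # one vote per user
        }
    }
    return \%votes;
}

my $votes = recommend(\%my_ratings, \%shelves, 1);
for my $title (sort { $votes->{$b} <=> $votes->{$a} || $a cmp $b } keys %$votes) {
    printf "%s: %d vote(s)\n", $title, $votes->{$title};
}
```

With this sample data, alice and bob each share at least one liked book with me, so "Hyperion" collects two votes and "Neuromancer" one, while carol's shelf is ignored. Raising the third argument of `recommend` tightens the "books in common" threshold.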