Scraping Google Scholar e-mail alerts
Google Scholar <http://scholar.google.com> offers e-mail alerts <http://googlesystem.blogspot.com/2010/05/email-alerts-for-google-scholar.html> <http://tieki83.blog106.fc2.com/blog-entry-142.html>, which are very helpful for keeping your bibliography up to date from day to day. Regrettably, however, the alerts arrive as HTML e-mails, a form poorly suited to automatic processing of large amounts of data. That is why we need a scraping tool to extract the data from the HTML files.
The Perl script "extract-scholar-a.pl" scrapes the HTML files attached to Google Scholar e-mail alerts. The output is tab-separated text. Each output line begins with the target URL, followed by the file type and then the text: the author, journal title, year of publication, and an excerpt from the source. HTML tags are removed. Decimal numeric character references (of the form &#NNN;) are decoded; named entities such as &amp; are left as they are. Note that the output also includes meaningless text fragments.
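Because the output mixes useful lines with meaningless fragments, a downstream script will probably want to filter it. Here is a minimal sketch of one way to do that, assuming that a useful line has exactly three tab-separated fields and starts with an http(s) URL. The name parse_line and the sample lines are hypothetical and made up for illustration; they are not part of extract-scholar-a.pl.

```perl
use strict;
use warnings;

# Split one output line into its fields.
# Returns nothing for lines that do not start with a URL,
# i.e. the meaningless fragments mentioned above.
# (parse_line is a hypothetical helper, not part of the script.)
sub parse_line {
    my ($line) = @_;
    chomp $line;
    my @field = split /\t/, $line, -1;    # -1 keeps empty trailing fields
    return unless @field == 3 && $field[0] =~ m{^https?://}i;
    return { url => $field[0], filetype => $field[1], text => $field[2] };
}

# Hypothetical sample lines, made up for illustration
my @sample = (
    "http://example.org/a.pdf\t[PDF]\tA sample title // A Author - Journal of Examples, 2013",
    "\tScholar Alert: some fragment",
);

for my $line (@sample) {
    my $entry = parse_line($line) or next;    # skip fragments
    print "$entry->{filetype} $entry->{url}\n";
}
```

The three-field check is only a heuristic: a line whose link has no "q" parameter comes out with fewer fields and is dropped here.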
Source
#!/usr/bin/perl

=head1 NAME

extract-scholar-a.pl

=cut

=head1 Require Perl 5.8

=head1 INPUT

HTML files (listed on the command line) by Google Scholar email alerts

=head1 OUTPUT

Tab-separated text (URL, file type, decoded text)

=cut

# *** marks technically critical code

use strict;
use warnings;

undef $/;          # slurp each input file whole
$\ = "\n";         # end every print with a newline
$" = $, = "\t";    # separate printed fields with tabs

while ( my $file = <> ) {
    my @entry  = split( /<a /, $file );    # *** Divide entries
    my $prefix = '';
    foreach my $e (@entry) {
        my ($href) = $e =~ /href\s*=\s*"([^"]*)"/i;    # *** Extract URL (with CGI GET parameters)
        $href = '' unless defined $href;               # chunk without a link
        my @para = split( /\&/, $href );               # *** Divide the parameters
        my @url  = grep { s/^q=// } @para;             # *** Extract the "q" parameter
        my $text = $e;
        $text =~ s/^[^>]+>//;                # drop the rest of the opening tag
        $text =~ s|</a>| // |g;              # mark the end of the link text
        $text =~ s/<[^>]+>/ /g;              # remove remaining HTML tags
        $text =~ s/\n/ /g;
        $text =~ s/&#(\d+);/ chr($1) /eg;    # decode numeric character references
        $text =~ s/…/ ... /g;
        $text =~ s/\s+/ /g;
        # A trailing label such as [PDF] belongs to the next entry.
        # Use $1 only when the match succeeds; otherwise it would be stale.
        my $next_filetype = ( $text =~ s/(\[[A-Z]+\])\s*$//i ) ? $1 : '';
        print @url, $prefix, $text;
        $prefix = $next_filetype;
    }
}

#------------- EOF
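The script takes the "q" parameter by splitting the whole href on "&" and matching "q=" at the start of a piece. That misses the parameter when it comes first (right after the "?") or when the separators are written as "&amp;", and it leaves any percent-encoding in place. Below is a hedged sketch of a more tolerant extraction, not the method used above. The helper name target_url and the sample href are assumptions made up for illustration.

```perl
use strict;
use warnings;

# Return the percent-decoded value of the "q" parameter in an href,
# or nothing if there is no such parameter.
# (target_url is a hypothetical helper, not part of the script.)
sub target_url {
    my ($href) = @_;
    $href =~ s/&amp;/&/gi;                           # undo HTML escaping of "&"
    my ($query) = $href =~ /\?([^#]*)/ or return;    # part after "?", before "#"
    for my $pair ( split /&/, $query ) {
        my ( $key, $value ) = split /=/, $pair, 2;
        next unless defined $value && $key eq 'q';
        # Decode %XX escapes; "+" is left alone so that URLs containing it survive
        $value =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
        return $value;
    }
    return;
}

# Sample href, made up for illustration
my $href = 'http://scholar.google.com/scholar_url?hl=en&amp;q=http%3A%2F%2Fexample.org%2Fa.pdf&amp;sa=X';
my $url  = target_url($href);
print defined $url ? $url : '(no q parameter)', "\n";
```

The decoded value is a byte string; if a target URL contains non-ASCII characters, it may need further decoding before it is mixed with other text.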
References
- Google Scholar <http://scholar.google.com>.
- Alex Chitu. 2010. "Email Alerts for Google Scholar". Google Operating System: unofficial news and tips about Google. <http://googlesystem.blogspot.com/2010/05/email-alerts-for-google-scholar.html>.
- ichi_tas. 2012. "[Google Scholar活用法] 最新論文のチェックをメールアラート機能で自動化する" ("[Making the most of Google Scholar] Automating checks for the latest papers with the e-mail alert feature"). My Scratch Pad. <http://tieki83.blog106.fc2.com/blog-entry-142.html>.