Derivazioni: Space for communication by TANAKA Sigeto


Scraping Google Scholar e-mail alerts

Google Scholar <http://scholar.google.com> offers email alerts <http://googlesystem.blogspot.com/2010/05/email-alerts-for-google-scholar.html> <http://tieki83.blog106.fc2.com/blog-entry-142.html>, which are very helpful for keeping a bibliography up to date. Regrettably, the alerts arrive as HTML e-mails, which makes them unsuitable for automatic processing of large amounts of data. That is why we need a scraping tool to extract the data from the HTML files.

The Perl script "extract-scholar-a.pl" scrapes the HTML files attached to Google Scholar email alerts and outputs tab-separated text. Each output line begins with the target URL, followed by text that includes the file type, the author, the journal title, the year of publication, and an excerpt from the source. HTML tags are removed, and some character references (decimal numeric references and &hellip;) are decoded. Note that the output also includes meaningless text fragments.
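One way to drop the meaningless fragments is to keep only the lines whose first field looks like a URL. The following is a minimal sketch, not part of the original script: the helper name is_entry and the sample lines are hypothetical, and it assumes that real entries start with "http://" or "https://" (possibly percent-encoded as "http%3A").

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return 1 if a tab-separated line begins with a URL-like first field.
# Assumption: entries start with http:// or https://, possibly
# percent-encoded (http%3A...); anything else is treated as a fragment.
sub is_entry {
    my ($line) = @_;
    my ($first) = split /\t/, $line, 2;
    return defined $first && $first =~ m{^https?(?::|%3A)}i ? 1 : 0;
}

# Hypothetical sample lines in the script's output format.
my @sample = (
    "http://example.org/paper.pdf\t[PDF]\tA sample title // A Author - Journal, 2013 ",
    "\t Google Scholar Alerts ",
    "[PDF]\t Cancel alert // ",
);

# Keep only the lines that look like real entries.
my @kept = grep { is_entry($_) } @sample;
print "$_\n" for @kept;
```

To use it as a filter on real output, the sample array could be replaced by lines read from standard input.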

Source


#!/usr/bin/perl

=head1 NAME

        extract-scholar-a.pl

=head1 REQUIREMENTS

        Perl 5.8

=head1 INPUT

        HTML files (listed on the command line)
        by Google Scholar email alerts

=head1 OUTPUT

        Tab-separated text (URL, file type, decoded text) 

=cut

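# Lines marked *** are the technically critical steps

undef $/ ;              # Slurp mode: read each input file as a single string
$\ ="\n";               # Append a newline to every print
$" = $,  ="\t";         # Use a tab between interpolated list elements and print arguments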

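while ( my $file = <> ) {                   # One whole file per iteration
    my @entry = split ( /<a / , $file );    # *** Divide entries
    my $prefix = '' ;                       # File-type tag carried over from the previous chunk
    foreach my $e (@entry) {
        my ( $href ) = $e =~ /href\s*=\s*"([^"]*)"/i;   # *** Extract URL (with CGI GET parameters)
        my @para = defined $href ? split( /\&amp;/ , $href ) : ();     # *** Divide the parameters
        my @url  = grep { s/^q=// } @para;              # *** Extract the "q" parameter

        my $text = $e ;
        $text =~ s/^[^>]+>// ;              # Drop the rest of the opening <a ...> tag
        $text =~ s|</a>| // |g ;            # Mark the end of the link text
        $text =~ s/<[^>]+>/ /g ;            # Remove the remaining HTML tags
        $text =~ s/\n/ /g ;
        $text =~ s/&#(\d+);/ chr($1) /eg ;  # Decode decimal character references
        $text =~ s/&hellip;/ ... /g ;
        $text =~ s/\s+/ /g ;                # Collapse runs of whitespace

        # A trailing tag such as [PDF] belongs to the next entry.
        # $1 is used only when the substitution actually matched.
        my $next_filetype = ( $text =~ s/(\[[A-Z]+\])\s*$//i ) ? $1 : '' ;

        print @url, $prefix, $text ;

        $prefix = $next_filetype;
    }
}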
#------------- EOF
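
The script above decodes only decimal character references and &hellip;. If more coverage is wanted, the decoding step could be extended along the following lines. This is a sketch under assumptions, not the author's code: the function name decode_entities_sketch and the small entity table are hypothetical and deliberately incomplete.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Small, deliberately incomplete table of named entities
# (an assumption about which ones are worth handling).
my %named = (
    amp    => '&',
    lt     => '<',
    gt     => '>',
    quot   => '"',
    nbsp   => ' ',
    hellip => '...',
);

# Decode decimal (&#65;), hexadecimal (&#x42;) and known named entities
# in a single pass; unknown named entities are left untouched.
sub decode_entities_sketch {
    my ($text) = @_;
    $text =~ s{&(?:#(\d+)|#[xX]([0-9A-Fa-f]+)|(\w+));}{
        defined $1 ? chr($1)
      : defined $2 ? chr( hex $2 )
      : exists $named{$3} ? $named{$3}
      : "&$3;"
    }ge;
    return $text;
}

print decode_entities_sketch('A &amp; B &#65;&#x42; &hellip; &unknown;'), "\n";
```

Doing everything in one substitution avoids decoding the same text twice (for example, "&amp;lt;" becomes "&lt;" and stays that way). For complete coverage, the CPAN module HTML::Entities and its decode_entities function are an alternative to a hand-written table.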

References

  1. Google Scholar <http://scholar.google.com>.
  2. Alex Chitu. 2010. "Email Alerts for Google Scholar". Google Operating System: unofficial news and tips about Google. <http://googlesystem.blogspot.com/2010/05/email-alerts-for-google-scholar.html>.
  3. ichi_tas. 2012. "[Google Scholar活用法] 最新論文のチェックをメールアラート機能で自動化する" (in Japanese: "[How to use Google Scholar] Automating checks for the latest papers with the email alert feature"). My Scratch Pad. <http://tieki83.blog106.fc2.com/blog-entry-142.html>.

