Parse PubMed database using their API + ruby + ActiveRecord 

In case any of you need to parse the PubMed database, here’s what I came up with as a first try. Run via a ruby script. You will need ruby, ActiveRecord (if you have rails setup, you should be good), and a few libraries — use gem install [x] for the libraries that you don’t have — if you have rails installed, you probably only need to get “xmlsimple”.

gem install xmlsimple

I used 3 tables in my mysql database: articles, authors, and a join table articles_authors

Here’s the schema:

DROP TABLE IF EXISTS `pubmed`.`articles`;
CREATE TABLE `pubmed`.`articles` (
`id` int(11) NOT NULL auto_increment,
`pubmed_id` int(11) NOT NULL,
`source` varchar(50) character set latin1 default NULL,
`title` varchar(255) character set latin1 default NULL,
`full_journal_name` varchar(255) character set latin1 default NULL,
`author_list` varchar(255) character set latin1 default NULL,
`pub_date` date default NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=34013 DEFAULT CHARSET=utf8;

DROP TABLE IF EXISTS `pubmed`.`articles_authors`;
CREATE TABLE `pubmed`.`articles_authors` (
`article_id` int(11) NOT NULL,
`author_id` int(11) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

DROP TABLE IF EXISTS `pubmed`.`authors`;
CREATE TABLE `pubmed`.`authors` (
`id` int(11) NOT NULL auto_increment,
`name` varchar(255) default NULL,
`pubmed_id` int(11) default NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=69002 DEFAULT CHARSET=latin1;

Here’s the ruby script:

# #!/usr/local/bin/ruby -w



# Require Files & Libs

require 'net/http'

require 'rubygems'

require 'active_record'

require 'xmlsimple'



# App Configuration Settings

empty_tables = true



# Search Settings

searchRelDate = 7

searchLimit = 100000

batch_size = 100

searchUrlBase = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmax=#{searchLimit}&"

summaryUrlBase = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&retmode=xml&id="



# Database Settings:

db_host    = "localhost"

db_user    = "root"

db_pass    = "password"

db_schema  = "pubmed"



# Header Output:

puts  "#"

puts  "# PubMed parsing tool."

dash = "========================================================================="



# Connect to DB using Active Record:

ActiveRecord::Base.establish_connection(:adapter=>"mysql", :database=>db_schema, :username=>db_user, :password=>db_pass, :host=>db_host)



# Define SearchResults model:

class Article < ActiveRecord::Base

  set_table_name "articles"

end



class Author < ActiveRecord::Base

  set_table_name "authors"

end



class ArticleAuthor  0

          property['Item'].each{|author| article[:authors] < "#{article_summary['Id']}",

        :source => article['Source'],

        :title => article['Title'],

        :full_journal_name => article['FullJournalName'],

        :pub_date => article['PubDate'],

        :author_list => article[:authors].join(', ')

      )

      # Add Authors to database:

      article[:authors].each{|author|

        # See if the author exists already:

        exists = Author.find_all_by_name("#{author}")    

        if exists.length == 0

          new_author = Author.create(

            :name => author,

            :pubmed_id => "#{article_summary['Id']}"

          )

        else

          new_author = exists[0]

        end

        # Add to join table:

        ArticleAuthor.create(

          :author_id =>  new_author.id,

          :article_id =>  new_article.id

        )

      } 

    }

  batch_current += 1

end

From there, you can run queries like this:

SELECT author_id, authors.name, count(author_id)
FROM articles_authors, authors
WHERE articles_authors.author_id = authors.id
GROUP BY author_id
ORDER BY count(author_id) desc

SELECT articles.pubmed_id, authors.name, articles.title, articles.source, articles.pub_date, articles.id
FROM articles_authors, authors, articles
WHERE articles_authors.author_id = authors.id and articles_authors.article_id = articles.id
and authors.name = “Li Y”
GROUP BY articles.pubmed_id

Advertisements