In case any of you need to parse the PubMed database, here’s what I came up with as a first try. Run via a ruby script. You will need ruby, ActiveRecord (if you have rails setup, you should be good), and a few libraries — use gem install [x] for the libraries that you don’t have — if you have rails installed, you probably only need to get “xmlsimple”.
gem install xmlsimple
I used 3 tables in my mysql database: articles, authors, and a join table articles_authors
Here’s the schema:
DROP TABLE IF EXISTS `pubmed`.`articles`;
CREATE TABLE `pubmed`.`articles` (
`id` int(11) NOT NULL auto_increment,
`pubmed_id` int(11) NOT NULL,
`source` varchar(50) character set latin1 default NULL,
`title` varchar(255) character set latin1 default NULL,
`full_journal_name` varchar(255) character set latin1 default NULL,
`author_list` varchar(255) character set latin1 default NULL,
`pub_date` date default NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=34013 DEFAULT CHARSET=utf8;DROP TABLE IF EXISTS `pubmed`.`articles_authors`;
CREATE TABLE `pubmed`.`articles_authors` (
`article_id` int(11) NOT NULL,
`author_id` int(11) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1;DROP TABLE IF EXISTS `pubmed`.`authors`;
CREATE TABLE `pubmed`.`authors` (
`id` int(11) NOT NULL auto_increment,
`name` varchar(255) default NULL,
`pubmed_id` int(11) default NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=69002 DEFAULT CHARSET=latin1;
Here’s the ruby script:
# #!/usr/local/bin/ruby -w
# Require Files & Libs
require 'net/http'
require 'rubygems'
require 'active_record'
require 'xmlsimple'
# App Configuration Settings
empty_tables = true
# Search Settings
searchRelDate = 7
searchLimit = 100000
batch_size = 100
searchUrlBase = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmax=#{searchLimit}&"
summaryUrlBase = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&retmode=xml&id="
# Database Settings:
db_host = "localhost"
db_user = "root"
db_pass = "password"
db_schema = "pubmed"
# Header Output:
puts "#"
puts "# PubMed parsing tool."
dash = "========================================================================="
# Connect to DB using Active Record:
ActiveRecord::Base.establish_connection(:adapter=>"mysql", :database=>db_schema, :username=>db_user, :password=>db_pass, :host=>db_host)
# Define SearchResults model:
class Article < ActiveRecord::Base
set_table_name "articles"
end
class Author < ActiveRecord::Base
set_table_name "authors"
end
class ArticleAuthor 0
property['Item'].each{|author| article[:authors] < "#{article_summary['Id']}",
:source => article['Source'],
:title => article['Title'],
:full_journal_name => article['FullJournalName'],
:pub_date => article['PubDate'],
:author_list => article[:authors].join(', ')
)
# Add Authors to database:
article[:authors].each{|author|
# See if the author exists already:
exists = Author.find_all_by_name("#{author}")
if exists.length == 0
new_author = Author.create(
:name => author,
:pubmed_id => "#{article_summary['Id']}"
)
else
new_author = exists[0]
end
# Add to join table:
ArticleAuthor.create(
:author_id => new_author.id,
:article_id => new_article.id
)
}
}
batch_current += 1
end
From there, you can run queries like this:
SELECT author_id, authors.name, count(author_id)
FROM articles_authors, authors
WHERE articles_authors.author_id = authors.id
GROUP BY author_id
ORDER BY count(author_id) descSELECT articles.pubmed_id, authors.name, articles.title, articles.source, articles.pub_date, articles.id
FROM articles_authors, authors, articles
WHERE articles_authors.author_id = authors.id and articles_authors.article_id = articles.id
and authors.name = “Li Y”
GROUP BY articles.pubmed_id