Monday, August 16, 2010

IEOR Tools Tutorial: Learning XML with R

I have been using R a lot in my work lately.  R (main site) is an open source statistical computing platform, but saying R is only used for statistics does not do it justice: I am finding it to be a really powerful platform for both statistical and optimization computing.  Lately I have been curious about how my blog's performance compares to other blogs, so I thought I would use this opportunity to show how I did the analysis in R.  I want to rank Operations Research blogs using the Alexa ranking system.  Unfortunately, Alexa does not have a search function for Operations Research blogs, so I am going to have to gather the information myself using R.

This R tutorial uses the XML package.  Packages extend R with specialized functionality that the base platform cannot accomplish on its own, and there are packages available for a wide variety of problem domains.


The first step is to load the XML package into the current R workspace.  If you do not have the XML package on your computer, you will first have to install it from the CRAN repositories.
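For anyone new to packages, the install and load steps look like this: a one-time install from a CRAN mirror, then a library() call in each new session.

```r
# One-time install from CRAN (only needed if XML is not already installed)
install.packages("XML")

# Load the package into the current session
library(XML)
```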


With the XML package loaded, the actual programming begins.  I need to save the pieces of the Alexa URLs into the workspace; once those variables are in place, I can move on to using the XML package to gather the information.
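As a minimal sketch of that step, here is one full Alexa URL built from its pieces (the blog name is just the first entry from the list below):

```r
# Base of every Alexa site-info page, plus one blog domain to look up
urlbeg <- "http://www.alexa.com/siteinfo/"
blog   <- "industrialengineertools.blogspot.com"

# paste() with sep="" concatenates the two strings into the final URL
url <- paste(urlbeg, blog, sep = "")
print(url)  # "http://www.alexa.com/siteinfo/industrialengineertools.blogspot.com"
```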

The main functions used from the XML package are htmlTreeParse, getNodeSet, and readHTMLTable.  htmlTreeParse grabs the HTML from the URL and stores it in an XML-readable format.  getNodeSet is a retrieval function that grabs only the data you specify; in this instance it looks for a table node, inside a div, whose id attribute equals siteStats.  readHTMLTable then takes the siteStats node and turns it into a table of data values.
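To see the three functions working together without hitting Alexa, here is a self-contained sketch against an inline HTML snippet; the snippet's layout is my own assumption, shaped to resemble the siteStats table.

```r
library(XML)

# Inline stand-in for an Alexa page: a div wrapping a table with id="siteStats"
html <- '<html><body><div><table id="siteStats">
<tr><td>Rank</td><td> 1,234,567 </td></tr>
</table></div></body></html>'

# asText=TRUE tells htmlTreeParse the input is a string, not a URL or file
doc <- htmlTreeParse(html, asText = TRUE, useInternalNodes = TRUE)

# XPath query: any table with id="siteStats" directly inside a div
nset <- getNodeSet(doc, "//div/table[@id='siteStats']")

# Convert the matched node into a data frame of its cell values
tab <- readHTMLTable(nset[[1]], stringsAsFactors = FALSE)
print(tab)
```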

While gathering the Alexa information with XML, I also have to massage the data into a readable structure, which requires some text string manipulation.  Notice the use of the functions strsplit and gsub to format the data.  All of this happens inside a for loop that does the XML retrieval and text formatting one URL at a time.  I also create a data frame to collect the relevant information into a readable table.
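The cleanup step can be illustrated on its own with a made-up rank string of the same shape as what comes back from the stats table:

```r
# A hypothetical raw cell value: the rank on the first line, extra text after
rankstr <- " 1,234,567 \n (extra text from the stats cell)"

# Split on the newline and keep only the first piece
rankparts <- strsplit(rankstr, "\n")[[1]]

# gsub strips the spaces, then the commas, before converting to numeric
rank <- as.numeric(gsub(",", "", gsub(" ", "", rankparts[1])))
print(rank)  # 1234567
```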

The following is the R code.

library(XML)

urlbeg <- "http://www.alexa.com/siteinfo/"

urllist <- c(
"industrialengineertools.blogspot.com",
"punkrockor.wordpress.com",
"thinkor.org",
"john-poppelaars.blogspot.com",
"bit-player.org",
"opsres.wordpress.com",
"orbythebeach.wordpress.com",
"spokutta.wordpress.com",
"engineered.typepad.com",
"bernoulli-on-business.blogspot.com",
"greenor.wordpress.com",
"fmwaves.kproductivity.com",
"blog.intechne.com",
"jimorlin.wordpress.com",
"jtonedm.com",
"mswd.wordpress.com",
"www.hakank.org",
"optandor.com",
"stochastix.wordpress.com",
"restart2.blogspot.com",
"scottaaronson.com",
"ateji.blogspot.com",
"geomblog.blogspot.com",
"ormsblog.com",
"wehart.blogspot.com",
"yetanothermathprogrammingconsultant.blogspot.com",
"annanagurney.blogspot.com",
"healthyalgorithms.wordpress.com",
"iaoreditor.blogspot.com",
"openresearch.wordpress.com",
"ornotes.blogspot.com",
"reflectionsonor.wordpress.com",
"arandomforest.com",
"analytics-magazine.com",
"hsimonis.wordpress.com",
"cpstandard.wordpress.com",
"blog.athico.com",
"dualnoise.blogspot.com",
"geneticargonaut.blogspot.com",
"john-raffensperger.blogspot.com",
"orforum.blog.informs.org",
"orinanobworld.blogspot.com",
"www.or-exchange.com",
"pomsblog.wordpress.com",
"research-reflections.blogspot.com",
"www.scienceofbetter.org",
"operationsroom.wordpress.com"
)


ORrank <- data.frame()

# Loop over the blogs, scraping the Alexa rank for each one
for (i in seq_along(urllist)) {
    url <- paste(urlbeg, urllist[i], sep = "")
    doc <- htmlTreeParse(url, useInternalNodes = TRUE)

    # Grab the table with id="siteStats" from the parsed page
    nset <- getNodeSet(doc, "//div/table[@id='siteStats']")
    tables <- lapply(nset, readHTMLTable)

    # The rank sits in the second column; strip whitespace and commas
    rankstr <- tables[[1]][2]
    rankstrdf <- strsplit(as.character(rankstr$V2), "\n")
    rank <- gsub(" ", "", rankstrdf[[1]][1])
    rank <- as.numeric(gsub(",", "", rank))  # becomes NA when no rank is found

    tmpdf <- data.frame(ORblog = urllist[i], AlexaRank = rank)
    ORrank <- rbind(ORrank, tmpdf)
}


ORrank <- ORrank[order(ORrank$AlexaRank),]
rownames(ORrank) <- 1:nrow(ORrank)
print(ORrank)

Here is the final output from the ORrank data frame.

                                             ORblog AlexaRank
1                          orforum.blog.informs.org    154736
2                                 scottaaronson.com    308410
3                                    bit-player.org   1444318
4                                   blog.athico.com   1484646
5                                       jtonedm.com   1504658
6                      operationsroom.wordpress.com   1631529
7                             geomblog.blogspot.com   1711672
8                                    www.hakank.org   1955830
9                           www.scienceofbetter.org   2550459
10                           engineered.typepad.com   2625563
11                         stochastix.wordpress.com   3002085
12                         punkrockor.wordpress.com   3303052
13                       openresearch.wordpress.com   3811636
14                           hsimonis.wordpress.com   4068033
15                        fmwaves.kproductivity.com   4281627
16                        annanagurney.blogspot.com   5047922
17                              www.or-exchange.com   6052089
18                                      thinkor.org   6134442
19                           analytics-magazine.com   6674061
20                  healthyalgorithms.wordpress.com   7373428
21                  john-raffensperger.blogspot.com   8516473
22                            greenor.wordpress.com   8666209
23                       orbythebeach.wordpress.com   9437585
24                                arandomforest.com  12225347
25                               mswd.wordpress.com  12571553
26                                blog.intechne.com  13784064
27                           spokutta.wordpress.com  15236071
28                         cpstandard.wordpress.com  19401625
29                     geneticargonaut.blogspot.com  20064295
30                                     ormsblog.com  21294575
31                              wehart.blogspot.com  22329286
32 yetanothermathprogrammingconsultant.blogspot.com  24431355
33                           dualnoise.blogspot.com  25165358
34                               ateji.blogspot.com  25304653
35                    reflectionsonor.wordpress.com  27537074
36             industrialengineertools.blogspot.com        NA
37                     john-poppelaars.blogspot.com        NA
38                             opsres.wordpress.com        NA
39               bernoulli-on-business.blogspot.com        NA
40                           jimorlin.wordpress.com        NA
41                                     optandor.com        NA
42                            restart2.blogspot.com        NA
43                          iaoreditor.blogspot.com        NA
44                             ornotes.blogspot.com        NA
45                       orinanobworld.blogspot.com        NA
46                           pomsblog.wordpress.com        NA
47                research-reflections.blogspot.com        NA

Not exactly the friendliest of formats, but it does the trick.  I hope this will help others who wish to use the powerful XML package with R.  I know I have definitely learned a lot about XML in the process.  I also found out that I have a lot more work to do with my blog.

Note:  If you are wondering where Michael Trick's blog is, there is a reason.  Unfortunately his blog, like a few others, lives on a sub-domain of a site not affiliated with the blog itself, which means Alexa cannot rank it against blogs hosted on a primary domain.  Still, everyone in the Operations Research community knows where Michael's blog ranks anyway.

8 comments:

Ryan said...

This looks pretty interesting. I might try to replicate what I've done (e.g., grabbing data from a subset of the 700+ pages) with this package to compare the time it takes to process.

larrydag said...

Let me know. Maybe I can do a follow-up post with your results, or you can blog about it and I will link to the follow-up.

Ryan said...

My initial tests are yielding a 10x increase in speed using the R package XML over using Python with BeautifulSoup. Pretty unbelievable -- I was expecting Python to be faster. I'll blog about it this weekend.

larrydag said...

Wow. That's incredible. I would never have guessed that. I can't wait to see your results.

Suresh said...

Thanks Larry, I have been looking for this for a long time.

Chris Polk said...

Good read.