This R tutorial uses the XML package. Packages extend R with functionality that the base platform cannot accomplish on its own, and CRAN hosts packages for a wide variety of problems.
The first step is to load the XML package into the current R workspace. If you do not have the XML package installed on your computer, you will first have to install it from the CRAN repositories.
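As a quick sketch, installing the package (a one-time step) and loading it into the session looks like this:

```r
# Install the XML package from CRAN (only needed once)
install.packages("XML")

# Load it into the current session
library(XML)
```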
Once the XML package is loaded, the real work begins. I first need to store in the workspace the base URL for Alexa's site-info pages along with the list of blog URLs. With those variables in place, I can use the XML package to gather the information.
The main functions used from the XML package are htmlTreeParse, getNodeSet, and readHTMLTable. htmlTreeParse fetches the HTML from the URL and parses it into an XML document. getNodeSet is a retrieval function that returns only the nodes you specify via an XPath expression; in this instance it looks for a table node, nested inside a div, whose id attribute equals siteStats. readHTMLTable then converts the siteStats node into a table of data values.
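To make that pipeline concrete, here is a minimal sketch for a single URL (the specific blog shown is just an example, and this assumes the page still contains a table with id siteStats):

```r
library(XML)

# Parse one Alexa site-info page into an XML document
url <- "http://www.alexa.com/siteinfo/punkrockor.wordpress.com"
doc <- htmlTreeParse(url, useInternalNodes = TRUE)

# XPath: find the <table id="siteStats"> nested inside a <div>
nset <- getNodeSet(doc, "//div/table[@id='siteStats']")

# Convert the matched node into a data frame of values
stats <- readHTMLTable(nset[[1]])
```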
While gathering the Alexa information with XML, I also have to format the data into a readable structure, which requires some text-string manipulation. Notice the use of the functions strsplit and gsub to clean up the raw values. All of this is performed in a for loop that handles the XML retrieval and text formatting one URL at a time. I've also created a data frame to collect all of the relevant information into a readable table.
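For example, the rank comes back from the table as a string with whitespace, thousands separators, and trailing text, and it has to be stripped down to a plain number. The sample string below is illustrative, not pulled from a real page:

```r
# Illustrative raw value as it might come out of the siteStats table
rankstr <- " 3,303,052\nRank in US"

# Keep only the text before the newline
rank <- strsplit(rankstr, "\n")[[1]][1]

# Drop spaces and thousands separators, then convert to numeric
rank <- gsub(" ", "", rank)
rank <- as.numeric(gsub(",", "", rank))
# rank is now 3303052
```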
The following is the R code.
library(XML)
urlbeg <- "http://www.alexa.com/siteinfo/"
urllist <- c(
"industrialengineertools.blogspot.com",
"punkrockor.wordpress.com",
"thinkor.org",
"john-poppelaars.blogspot.com",
"bit-player.org",
"opsres.wordpress.com",
"orbythebeach.wordpress.com",
"spokutta.wordpress.com",
"engineered.typepad.com",
"bernoulli-on-business.blogspot.com",
"greenor.wordpress.com",
"fmwaves.kproductivity.com",
"blog.intechne.com",
"jimorlin.wordpress.com",
"jtonedm.com",
"mswd.wordpress.com",
"www.hakank.org",
"optandor.com",
"stochastix.wordpress.com",
"restart2.blogspot.com",
"scottaaronson.com",
"ateji.blogspot.com",
"geomblog.blogspot.com",
"ormsblog.com",
"wehart.blogspot.com",
"yetanothermathprogrammingconsultant.blogspot.com",
"annanagurney.blogspot.com",
"healthyalgorithms.wordpress.com",
"iaoreditor.blogspot.com",
"openresearch.wordpress.com",
"ornotes.blogspot.com",
"reflectionsonor.wordpress.com",
"arandomforest.com",
"analytics-magazine.com",
"hsimonis.wordpress.com",
"cpstandard.wordpress.com",
"blog.athico.com",
"dualnoise.blogspot.com",
"geneticargonaut.blogspot.com",
"john-raffensperger.blogspot.com",
"orforum.blog.informs.org",
"orinanobworld.blogspot.com",
"www.or-exchange.com",
"pomsblog.wordpress.com",
"research-reflections.blogspot.com",
"www.scienceofbetter.org",
"operationsroom.wordpress.com"
)
ORrank <- data.frame()
for (i in seq_along(urllist)) {
  # Build the full Alexa site-info URL for this blog
  url <- paste(urlbeg, urllist[i], sep="")

  # Fetch and parse the page, then pull out the siteStats table
  doc <- htmlTreeParse(url, useInternalNodes=T)
  nset <- getNodeSet(doc, "//div/table[@id='siteStats']")
  tables <- lapply(nset, readHTMLTable)

  # The rank is the first line of the second column, with
  # spaces and thousands separators removed
  rankstr <- tables[[1]][2]
  rankstrdf <- strsplit(as.character(rankstr$V2), "\n")
  rank <- gsub(" ", "", rankstrdf[[1]][1])
  rank <- as.numeric(gsub(",", "", rank))

  # Append this blog's rank to the results
  tmpdf <- data.frame(ORblog=urllist[i], AlexaRank=rank)
  ORrank <- rbind(ORrank, tmpdf)

  # Clean up the temporary objects before the next iteration
  rm(url, doc, nset, tables, rankstr, rankstrdf, rank, tmpdf)
}
rm(i)

# Sort by Alexa rank and renumber the rows
ORrank <- ORrank[order(ORrank$AlexaRank),]
rownames(ORrank) <- 1:nrow(ORrank)
print(ORrank)
Here is the final output from the ORrank data frame.
ORblog AlexaRank
1 orforum.blog.informs.org 154736
2 scottaaronson.com 308410
3 bit-player.org 1444318
4 blog.athico.com 1484646
5 jtonedm.com 1504658
6 operationsroom.wordpress.com 1631529
7 geomblog.blogspot.com 1711672
8 www.hakank.org 1955830
9 www.scienceofbetter.org 2550459
10 engineered.typepad.com 2625563
11 stochastix.wordpress.com 3002085
12 punkrockor.wordpress.com 3303052
13 openresearch.wordpress.com 3811636
14 hsimonis.wordpress.com 4068033
15 fmwaves.kproductivity.com 4281627
16 annanagurney.blogspot.com 5047922
17 www.or-exchange.com 6052089
18 thinkor.org 6134442
19 analytics-magazine.com 6674061
20 healthyalgorithms.wordpress.com 7373428
21 john-raffensperger.blogspot.com 8516473
22 greenor.wordpress.com 8666209
23 orbythebeach.wordpress.com 9437585
24 arandomforest.com 12225347
25 mswd.wordpress.com 12571553
26 blog.intechne.com 13784064
27 spokutta.wordpress.com 15236071
28 cpstandard.wordpress.com 19401625
29 geneticargonaut.blogspot.com 20064295
30 ormsblog.com 21294575
31 wehart.blogspot.com 22329286
32 yetanothermathprogrammingconsultant.blogspot.com 24431355
33 dualnoise.blogspot.com 25165358
34 ateji.blogspot.com 25304653
35 reflectionsonor.wordpress.com 27537074
36 industrialengineertools.blogspot.com NA
37 john-poppelaars.blogspot.com NA
38 opsres.wordpress.com NA
39 bernoulli-on-business.blogspot.com NA
40 jimorlin.wordpress.com NA
41 optandor.com NA
42 restart2.blogspot.com NA
43 iaoreditor.blogspot.com NA
44 ornotes.blogspot.com NA
45 orinanobworld.blogspot.com NA
46 pomsblog.wordpress.com NA
47 research-reflections.blogspot.com NA
Not exactly the friendliest of formats, but it does the trick. I hope this will help others who wish to use the powerful XML package with R. I know I have definitely learned a lot about XML in the process. I also found out that I have a lot more work to do on my blog.
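If you want something a little friendlier, one option (a sketch, not part of the original script) is to add thousands separators to the ranks before printing, or to export the table for use elsewhere:

```r
# Pretty-print the ranks with thousands separators
# (assumes ORrank exists with an AlexaRank column)
ORrank$AlexaRank <- format(ORrank$AlexaRank, big.mark=",")
print(ORrank, right=FALSE)

# Or export the raw data frame to a CSV file
write.csv(ORrank, "ORrank.csv", row.names=FALSE)
```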
Note: If you are wondering where Michael Trick's blog is, there is a reason. Unfortunately, his blog (and some others) lives in a sub-domain of a URL not affiliated with the blog itself, which means Alexa cannot rank it against blogs on a primary domain. Yet everyone in the Operations Research community knows where Michael's blog ranks anyway.
8 comments:
This looks pretty interesting. I might try to replicate what I've done (e.g., grabbing data from a subset of the 700+ pages) with this package to compare the time it takes to process.
Let me know. Maybe I can do a follow up post with your results. Or you can blog about it and I will link to the followup.
My initial tests are yielding a 10x increase in speed using the R package XML over using python with beautifulsoup. Pretty unbelievable -- I was imagining python to be faster. I'll blog about it this weekend.
Wow. That's incredible. I would never have guessed that. I can't wait to see your results.
Thanks Larry, I have been looking for this for a long time.