Spreadsheets such as Microsoft Excel or the free Apache OpenOffice Calc offer statistical and graphing functions as a starting point. If you want to go beyond a spreadsheet and organize data into groups, plot many graphs at once, or build statistical models, then it's worth learning R, a rapidly growing statistical language. Modules let you integrate R with spreadsheets.
During a three-month break from writing this blog, I took a couple of online courses that used R:
- Johns Hopkins University's Coursera course "Computing For Data Analysis" (rebranded as "R Programming", with a new session starting May 5)
- Stanford University's online course, "Statistical Learning"
In the mid-1970s, John Chambers at Bell Laboratories created the S language for statistical analysis. It launched on the UNIX operating system in 1978. That's before VisiCalc, the first spreadsheet, which appeared for the Apple II in 1979.
In 1996, Robert Gentleman and Ross Ihaka from the University of Auckland announced their development of the R language, inspired by S. They attracted collaborators worldwide to build the powerful R software that's available today.
R is freely available, open-source software with many components. Today there are 5449 packages available on CRAN, the leading R software library, for everything from insightful graphics to biomedical data analysis to financial forecasting.
Palo Alto's Tibco offers a commercial version of S, called S+, as well as TERR (Tibco Enterprise Runtime for R). Revolution Analytics, headquartered in Mountain View, was founded in 2007 to sell R software and services to commercial users. Recently, analyst firm Gartner named the company an Advanced Analytics "Visionary". Gartner estimates advanced analytics to be a $2 billion market that spans a broad array of industries globally. Revolution Analytics lets you run R in Amazon's cloud on data sets as large as a terabyte.
Source: Bay Area useR Group logo
Last week, I attended the monthly Bay Area useR Group Meetup, for those interested in R. The April meeting was at Intuit. There were 7 excellent speakers, but I'm only going to write about one: Ram Narasimhan. Ram teaches R for UCSC Extension. He gave an amusing talk on analyzing weather data, and his blog shows several examples of how he plotted it.
It started a couple of years ago, when he moved to Silicon Valley and his wife thought the weather was better in Chicago than in Sunnyvale. Ram decided he'd write some R software to test this. As averages can be misleading (see, for example, Stanford Consulting Professor Sam Savage's short talk on the Flaw of Averages), Ram selected the minimum temperature, as long as it was over 50°F. He ended up writing a package, weatherData, to gather weather station data for different locations and date ranges.
City Mean, Maximum and Minimum Temperatures, by Month for Austin, Las Vegas, San Diego, San Francisco and Tampa (Source: http://ramnarasimhan.wordpress.com/tag/weather-analysis/)
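To see why an average on its own can mislead, here is a minimal sketch in base R. The two cities and all of their monthly temperatures are made up for illustration; this is not Ram's data or his method, and the 50°F threshold simply echoes the cutoff mentioned above. Both invented cities have the same yearly mean, yet they differ sharply in their minimum and in how many months stay above 50°F.

```r
# Hypothetical monthly temperatures (deg F) for two invented cities.
# These numbers are illustrative only, not real weather data.
steady  <- c(58, 60, 61, 62, 63, 64, 65, 65, 64, 62, 60, 56)
extreme <- c(30, 35, 48, 60, 72, 84, 90, 88, 78, 64, 50, 41)

temps <- data.frame(
  city  = rep(c("Steady", "Extreme"), each = 12),
  month = rep(month.abb, times = 2),
  tempF = c(steady, extreme)
)

# Mean, minimum and maximum for each city (one column per city)
summary_by_city <- sapply(
  split(temps$tempF, temps$city),
  function(x) c(mean = mean(x), min = min(x), max = max(x))
)
print(round(summary_by_city, 1))

# Number of months in which the temperature is above 50 deg F
months_over_50 <- tapply(temps$tempF > 50, temps$city, sum)
print(months_over_50)
```

The means match, so a comparison based on the mean alone would call the two cities equal; the minimum and the count of months above the threshold tell them apart.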
One weather data source is Weather Underground, which lists the personal weather station Mountain Shadows (KCAMOUNT15) in Mountain View. This station shows temperature, dew point, wind speed, pressure and precipitation starting in 2008. For an organizational site, Moffett Field's station KNUQ gives weather data on Weather Underground as far back as 19:00 on March 2, 1945. Comparing the two stations, you can see that temperatures can vary, even within a small town, given the marine effect near Moffett.
Playing with R gives insights into the challenges of using weather station data to see if Mountain View is warming or cooling. Which station should you use? Most days, at Moffett, the weather data is collected approximately once an hour, but other days there are more observations. Do you want to include night temperatures, or just day temperatures? Whereas climate models show the world in general is warming, climate is not the same as weather, which can jiggle around from day to day. You can read more about climate on NASA's website.
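The sampling questions above can be made concrete with a small sketch in base R. The readings below are invented (they are not Moffett Field data), and the definition of "daytime" as 07:00 to 19:00 is an assumption chosen only for illustration. The sketch compares three ways of reducing uneven observations to one number per day.

```r
# Hypothetical observations: timestamps and temperatures are made up.
# Note the burst of three readings around 14:00 on the first day.
obs <- data.frame(
  time = as.POSIXct(c("2005-04-01 02:00", "2005-04-01 09:00",
                      "2005-04-01 14:00", "2005-04-01 14:20",
                      "2005-04-01 14:40", "2005-04-01 22:00",
                      "2005-04-02 03:00", "2005-04-02 13:00",
                      "2005-04-02 21:00"), tz = "UTC"),
  tempF = c(48, 55, 66, 67, 67, 52, 47, 64, 51)
)

obs$date    <- as.Date(obs$time)
obs$hour    <- as.integer(format(obs$time, "%H"))
obs$daytime <- obs$hour >= 7 & obs$hour < 19   # assumed definition of "day"

# 1. Naive daily mean: every reading counts equally,
#    so hours with extra readings are over-weighted
naive_mean <- tapply(obs$tempF, obs$date, mean)

# 2. Average within each hour first, then average the hours of each day
hourly        <- aggregate(tempF ~ date + hour, data = obs, FUN = mean)
balanced_mean <- tapply(hourly$tempF, hourly$date, mean)

# 3. Daily mean of daytime readings only
day_only_mean <- tapply(obs$tempF[obs$daytime], obs$date[obs$daytime], mean)

print(round(cbind(naive_mean, balanced_mean, day_only_mean), 2))
```

On the first day the three answers disagree, because the extra afternoon readings pull the naive mean upward and dropping the night readings raises it further. On the second day, with one reading per hour, the naive and balanced means agree.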
If you want to learn about statistics, probability and survey sampling, then Stat Trek is a useful site. To get started with R, download the free RStudio development environment, which comes with training and help files. Type in a sum such as "5+3" and the answer "8" will appear, so at a minimum RStudio can be used as a calculator. To analyze the local weather data from Moffett Field from April 1st, 2005 to April 8th, 2005, open RStudio, install the package weatherData from CRAN, then type in:
library(weatherData)
mvweather <- getWeatherForDate("KNUQ", "2005-4-1", "2005-4-8")
summary(mvweather)
to get a summary of the temperature data, or you could type the name of the object on its own:
mvweather
to see a list of temperatures.
Being interactive, R provides a fun way to learn statistics and quickly visualize data.