The R UseR Conference 2011
University of Warwick, UK
15:59. useR! 2012 will be hosted by Vanderbilt University, Nashville, USA.
15:02 FINALLY, the last invited talk by Simon Urbanek (AT&T) on
“R Graphics: Supercharged – Recent Advances in Visualization and Analysis of Large Data in R“ Download Abstract here
- 85% of the seats in the large lecture hall are filled! >250 people
- New features in R graphics
- Real demo on a city map: how can a city be greener? Estimating the traffic.
- R graphics integrated with a real map (like Google Maps). Looks nice.
- Polygons with holes: polypath()
– a regular polygon() cannot create holes; polypath() can
- Most recent feature -> screen output control
- until now there was no way to tell R when to actually show graphics on the screen; dev.hold() and dev.flush() now control this
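The two features above can be tried with a minimal sketch in base R. The coordinates, colour and the null PDF device are illustrative assumptions, not taken from the talk; the inner square becomes a hole because the two sub-paths are separated by NA and filled with the "evenodd" rule.

```r
# Open a null PDF device so the sketch runs without a screen.
pdf(NULL)

plot.new()
plot.window(xlim = c(0, 10), ylim = c(0, 10), asp = 1)

# Outer square and inner square; NA starts a new sub-path.
outer_x <- c(1, 9, 9, 1)
outer_y <- c(1, 1, 9, 9)
hole_x  <- c(4, 6, 6, 4)
hole_y  <- c(4, 4, 6, 6)
path_x <- c(outer_x, NA, hole_x)
path_y <- c(outer_y, NA, hole_y)

# dev.hold() asks the device to postpone screen updates;
# dev.flush() releases the hold so the finished picture is shown at once.
hold_level <- dev.hold()
polypath(path_x, path_y, col = "darkgreen", rule = "evenodd")
flush_level <- dev.flush()

dev.off()
```

On devices that do not support holding, dev.hold() and dev.flush() simply do nothing, so the same code is safe to run everywhere.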
Challenges
- Data size increases
- Large RAM (>100 GB) and CPU power is affordable
- Visualisation needs to keep up
– rendering: the game industry provides solutions (OpenGL + GPUs)
– visualization methods for large data
-> interactivity (divide and conquer, shift of focus)
-> sufficient statistics, aggregation
Proposed solution
– Rendering speed -> use OpenGL back-ends for R devices (qtdevice, iPlots eXtreme)
– The example shows a big speed-up, from about 5 seconds to about 1 second
- About iPlots
- iPlots = interactive plots for data analysis: selection, highlighting, brushing …
– interactive change of plot parameters
– queries
– all essential plots (scatterplots, bar charts, histograms, parallel coordinates)
- Demo now. The interactivity looks nice, and the colours are similar to the new iPod colour range.
iPlot (credit: Ashley Ford's Facebook)
- Now we can query the selected points in the plot with
> which(selected(p))
– DEMO: PCA, where you can now select the outliers and find out what they are!
– DEMO: histogram whose parameters are changed interactively, with the plot updating accordingly
Conclusion
- R: rasterImage(), polypath(), dev.hold/flush()
- Large data requires fast graphics and interactivity
- OpenGL graphics devices (idev(), qtdevice, …)
- iPlots eXtreme: high-performance interactive graphics.
- Fast (C++, OpenGL: interactivity on > 1 million points)
- Efficient (no copying, reference semantics)
- Extensible (custom visuals, statistical objects, plots)
- CRAN release as “ix” expected next month
References with Link
END! Q&A now
##############################################
12:15 An algorithm for the computation of the power of Monte Carlo tests with guaranteed precision by Patrick Rubin-Delanchy. Download abstract here
- Statistical formulation. Data = N observable stream.
- Algorithm with finite time, an example of Permutation test for two Gaussian groups
- They have other “tricks” to reduce the effort: choice of N, hypothesis test on the remaining stream.
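For readers unfamiliar with the kind of test involved, here is a minimal sketch of a Monte Carlo permutation test for two Gaussian groups. It is not the speaker's power algorithm; the group sizes, the mean shift, the number of permutations and the helper name perm_test are all assumptions made for illustration.

```r
set.seed(1)

# Two Gaussian groups with a mean shift (sizes and shift are illustrative).
x <- rnorm(20, mean = 0)
y <- rnorm(20, mean = 1)

# Hypothetical helper: Monte Carlo permutation p-value for a difference in means.
perm_test <- function(x, y, n_perm = 999) {
  pooled <- c(x, y)
  n_x <- length(x)
  observed <- abs(mean(x) - mean(y))
  # Each Monte Carlo draw relabels the pooled data at random.
  perm_stats <- replicate(n_perm, {
    idx <- sample(length(pooled), n_x)
    abs(mean(pooled[idx]) - mean(pooled[-idx]))
  })
  # Add one to numerator and denominator so the p-value is never zero.
  (1 + sum(perm_stats >= observed)) / (n_perm + 1)
}

p_value <- perm_test(x, y)
```

The p-value itself is random because it depends on the sampled permutations, which is why the precision of quantities built on it, such as the power, needs separate care.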
##############################################
11:57am. Nonparametric Goodness-of-Fit Tests for Discrete Null Distributions by Taylor Arnold. Download abstract here
- Deals with the Kolmogorov-Smirnov test
- Discrete K-S test: not implemented in any of the major statistical computing packages, hence they aim to do it in R
- Decided to require that the discrete null distribution be specified via class stepfun
- After obtaining the test statistic, the p-value must be calculated
- Implementation is tedious but relatively straightforward
- Using Discrete Cramér-von Mises Test
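A rough base-R sketch of the idea: the discrete null is given as a stepfun, the K-S statistic is the largest gap between the empirical and null CDFs, and the p-value is approximated by simulation. The fair-die null, the sample size and the helper name ks_stat are assumptions, and the simulated p-value stands in for whatever calculation the authors implemented.

```r
set.seed(2)

# Hypothetical discrete null: a fair six-sided die, encoded as a step-function CDF.
support <- 1:6
null_cdf <- stepfun(support, c(0, cumsum(rep(1 / 6, 6))))

ks_stat <- function(x, cdf) {
  emp <- ecdf(x)
  # Both CDFs are right-continuous step functions, so the largest gap
  # is attained at one of their jump points.
  pts <- sort(unique(c(knots(cdf), knots(emp))))
  max(abs(emp(pts) - cdf(pts)))
}

x <- sample(support, 50, replace = TRUE)
d_obs <- ks_stat(x, null_cdf)

# Monte Carlo p-value: simulate the statistic under the null.
d_sim <- replicate(999, ks_stat(sample(support, 50, replace = TRUE), null_cdf))
p_value <- (1 + sum(d_sim >= d_obs)) / (999 + 1)
```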
##############################################
11:31am. The benchden Package: Benchmark Densities for Nonparametric Density Estimation by Henrike Weinert (Inference Session) download abstract here
- the package implements 28 benchmark densities
- from the stats package e.g., uniform, exponential
- normal mixtures (Marronite, claw)
- Support: compact e.g., infinite peak, uniform scale mixture or sawtooth
- Support: gaps e.g., Matterhorn, caliper and trimodal uniform
- Support: half line e.g., Maxwell, Pareto and inverse exponential
- Support: Real line e.g., logistic, double exponential
- Simulation study to compare two different bandwidth selectors for a kernel estimator
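A small simulation of this kind can be sketched in base R without the benchden package. The claw density is written out by hand here (half a standard normal plus five narrow normal spikes), and the two selectors compared, "nrd0" and "SJ", the sample size and the number of replications are assumptions; the notes do not say which selectors the talk used.

```r
set.seed(3)

# Claw density: 1/2 N(0,1) plus five spikes N(l/2 - 1, 0.1^2), each with weight 1/10.
dclaw <- function(x) {
  spikes <- sapply(0:4, function(l) dnorm(x, mean = l / 2 - 1, sd = 0.1) / 10)
  0.5 * dnorm(x) + rowSums(matrix(spikes, nrow = length(x)))
}
rclaw <- function(n) {
  comp <- sample(0:5, n, replace = TRUE, prob = c(0.5, rep(0.1, 5)))
  ifelse(comp == 0, rnorm(n), rnorm(n, mean = (comp - 1) / 2 - 1, sd = 0.1))
}

# Integrated squared error of a kernel estimate, approximated on a grid over [-3, 3].
ise <- function(x, bw) {
  est <- density(x, bw = bw, from = -3, to = 3, n = 512)
  step <- est$x[2] - est$x[1]
  sum((est$y - dclaw(est$x))^2) * step
}

# Repeat the comparison on fresh samples and average the errors.
results <- t(replicate(50, {
  x <- rclaw(500)
  c(nrd0 = ise(x, "nrd0"), SJ = ise(x, "SJ"))
}))
mean_ise <- colMeans(results)
```

Printing mean_ise shows which selector copes better with the narrow spikes for this particular setup.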
##############################################
11:15am. Density Estimation Packages in R by Henry Deng Download abstract here (Inference Session)
- Review of the packages on CRAN for density estimation
- Calculation speed on a random set of n normally distributed points: ash is the fastest and pendensity is the slowest
- Estimation accuracy using mean absolute error: results vary with the type of distribution and the number of data points, interesting.
- Additional idea: trade-off between speed and accuracy
- Well-performing packages seem to be long established in R, with frequent updates.
- Recommended packages: KernSmooth or ASH
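The speed and accuracy comparison can be imitated on a small scale. This sketch times base R's density() and, if it is installed, KernSmooth's bkde() on normal data, and scores each by mean absolute error against the true density on the evaluation grid. The sample size, grid and bandwidth choice (dpik) are assumptions, not the settings of the talk.

```r
set.seed(4)
x <- rnorm(1e4)
grid_range <- c(-4, 4)

# Mean absolute error between an estimate and the true N(0,1) density on its grid.
mae <- function(est) mean(abs(est$y - dnorm(est$x)))

# Base R kernel estimator.
time_base <- system.time(
  est_base <- density(x, from = grid_range[1], to = grid_range[2], n = 401)
)["elapsed"]
mae_base <- mae(est_base)

# KernSmooth's binned estimator, only if the package is available.
if (requireNamespace("KernSmooth", quietly = TRUE)) {
  time_ks <- system.time(
    est_ks <- KernSmooth::bkde(x, bandwidth = KernSmooth::dpik(x),
                               gridsize = 401L, range.x = grid_range)
  )["elapsed"]
  mae_ks <- mae(est_ks)
}
```

Timings on a single run are noisy, so a fair comparison would repeat this over several sample sizes and distributions.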
##############################################
18 Aug 2011, the last day (but by no means the least)
10:32am Binomial regression model by Merete Download abstract here
- Three different methods for extracting residuals: unstandardized, standardized and studentized.
- Exact deletion residuals, a new type of residual implemented in binomTools
- approx.deletion (rstudent) residual function
- Parallel histograms = exploratory version of the Hosmer-Lemeshow goodness-of-fit test (with fixed cut points)
- Half-normal plot: uses absolute residual values but is otherwise equivalent to a normal plot. Optimal simulated envelopes to support interpretation
- Profile likelihood from the MASS package -> returns and plots the profile likelihood root, not the profile likelihood
- Misc = grouping binary or completely grouped data based on specified data, empirical area under the ROC curve
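The residual types and the half-normal plot can be illustrated with base R alone, without binomTools. The simulated dose-response data, the coefficients and the variable names below are assumptions; rstudent() gives the approximate deletion residuals, while the exact deletion residuals mentioned above are specific to binomTools and are not reproduced here.

```r
set.seed(5)

# Simulated grouped binomial data (illustrative only).
dose <- rep(1:6, each = 5)
n_trials <- rep(20, length(dose))
successes <- rbinom(length(dose), n_trials, plogis(-2 + 0.6 * dose))

fit <- glm(cbind(successes, n_trials - successes) ~ dose, family = binomial)

res_raw  <- residuals(fit, type = "deviance")  # unstandardized deviance residuals
res_std  <- rstandard(fit)                     # standardized
res_stud <- rstudent(fit)                      # approximate deletion residuals

# Half-normal plot: sorted absolute residuals against half-normal quantiles.
n <- length(res_std)
half_q <- qnorm((n + seq_len(n)) / (2 * n + 1))
pdf(NULL)  # null device so the sketch runs without a screen
plot(half_q, sort(abs(res_std)),
     xlab = "Half-normal quantiles", ylab = "Sorted |standardized residual|")
dev.off()
```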
##############################################
6:12pm Interval before Conference Dinner!
5:33pm Missing Values in Principal Component Analysis (PCA) by Julie Josse download abstract
- Starting with… some theory
- Overfitting issues from missing values -> fixed by a shrinkage method
- Procedures in the missMDA package
- Step1: Estimation of number of dimensions
- Step2: Imputation of missing values by ‘imputePCA’ function
- Step3: PCA on the completed data set, ‘MIPCA’ function
– Iterative PCA: single imputation method
– A single value cannot reflect the variability of prediction
– Multiple imputation: generating plausible values for each missing value
- Supplementary projection via ‘plot()’ function
– Individual position (and variables) with other predictions
- Between imputation variability too!
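The single-imputation step can be sketched in base R as a plain iterative PCA: fill the gaps with column means, fit a low-rank PCA, replace the missing cells with the fitted values, and repeat until the imputations stop changing. This is an unregularized sketch under assumed toy data, not the missMDA implementation; the shrinkage step mentioned above is deliberately left out, and the function name iterative_pca_impute is hypothetical.

```r
set.seed(6)

# Low-rank toy data with a little noise; 10% of the cells set to missing.
scores <- matrix(rnorm(100 * 2), 100, 2)
loadings <- matrix(rnorm(2 * 6), 2, 6)
full <- scores %*% loadings + matrix(rnorm(600, sd = 0.1), 100, 6)
X <- full
miss <- sample(length(X), 60)
X[miss] <- NA

iterative_pca_impute <- function(X, ncp = 2, max_iter = 500, tol = 1e-8) {
  missing <- is.na(X)
  # Start from column means.
  filled <- X
  col_means <- colMeans(X, na.rm = TRUE)
  filled[missing] <- col_means[col(X)[missing]]
  for (i in seq_len(max_iter)) {
    centre <- colMeans(filled)
    centred <- sweep(filled, 2, centre)
    # Rank-ncp reconstruction from the singular value decomposition.
    s <- svd(centred, nu = ncp, nv = ncp)
    fitted <- s$u %*% diag(s$d[seq_len(ncp)], ncp) %*% t(s$v)
    fitted <- sweep(fitted, 2, centre, "+")
    change <- sum((filled[missing] - fitted[missing])^2)
    # Only the missing cells are updated; observed values stay untouched.
    filled[missing] <- fitted[missing]
    if (change < tol) break
  }
  filled
}

completed <- iterative_pca_impute(X)
rmse_pca <- sqrt(mean((completed[miss] - full[miss])^2))
```

Because the result is one completed table, it carries no information about how uncertain each imputed value is, which is the motivation for the multiple-imputation step.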
##############################################
5:06pm Here is the HIGHLIGHT. John Fox! on Tests for Multivariate Linear Models with the car Package Download abstract
- Discussion on fitting multivariate linear models (MLMs) in R with the lm function
- The anova function is flexible, but it calculates sequential (Type I) tests, and performing other common tests, especially for repeated-measures designs, is relatively inconvenient.
- The Anova function (with a capital A) in the car package (Fox and Weisberg, 2011) can perform partial (Type II or Type III) tests for the terms in a multivariate linear model, including simply specified multivariate and univariate tests for repeated-measures models.
- The linearHypothesis function in the car package can test arbitrary linear hypotheses for multivariate linear models, including models for repeated measures.
- Both the Anova and linearHypothesis functions return a variety of information useful in further computation on multivariate linear models.
- Now he’s demonstrating how to use ‘car’ package using the Anderson-Fisher Iris data
Correlation plot, basic box-plot
> mod.iris <- lm(cbind(Sepal.Length, Sepal.Width,
Petal.Length, Petal.Width) ~ Species, data=iris)
> (manova.iris <- Anova(mod.iris))
Type II MANOVA Tests: ...
> anova(mod.iris)
gave the same result using the default function
- Summary
>summary()
- Also handles repeated measures (a single repeated-measures factor). This can be handled with the anova function in R, but it is simpler to get common tests from the Anova and linearHypothesis functions in the car package.
- 21% of my MacBook battery left now, please no blackout any time soon.
- 5 mins left in this presentation!
- It’s done! Q&A now.
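A runnable version of the iris demo, as a sketch. The sequential tests use base R only; the car calls are wrapped in a check because the package may not be installed, and the particular hypothesis tested (versicolor equal to virginica on all four responses) is an illustrative choice rather than a record of what was shown on screen.

```r
mod.iris <- lm(cbind(Sepal.Length, Sepal.Width,
                     Petal.Length, Petal.Width) ~ Species, data = iris)

# Sequential (Type I) multivariate tests from base R.
seq.tests <- anova(mod.iris, test = "Pillai")

# Partial (Type II) tests and a linear hypothesis, only if car is installed.
if (requireNamespace("car", quietly = TRUE)) {
  manova.iris <- car::Anova(mod.iris)
  # Do versicolor and virginica have equal means on all four responses?
  lh <- car::linearHypothesis(mod.iris, "Speciesversicolor = Speciesvirginica")
}
```

With a single term in the model, the sequential and partial tests for Species coincide, which matches the note that both functions gave the same result.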
##############################################
4:47pm Multiple choice models: why not the same answer? A comparison among LIMDEP, R, SAS and Stata by Giuseppe Bruno Download Abstract
- Similar to my presentation, but focused on comparing R packages for choice models with proprietary software, and more technical!
##############################################
4:35pm Regression Models for Ordinal Data: Introducing R-package ordinal by Rune Haubo B. Christensen Download Abstract
- The package offers regression models for ordinal data.
- Provides various standard model fit indices.
- Extends the basic model with scale effects, nominal effects, random effects and structured thresholds.
- Future work: more flexible random effect structures and nested effects.
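To make the basic model concrete, here is a sketch of a cumulative logit model fitted by maximum likelihood with optim() in base R. It does not use the ordinal package; the simulated data, the true parameter values and the parameterisation of the thresholds are all assumptions made for illustration.

```r
set.seed(7)
n <- 500
x <- rnorm(n)

# Latent-variable formulation: y* = beta * x + logistic noise, cut at thresholds.
beta_true <- 1
cuts_true <- c(-1, 0.5)
latent <- beta_true * x + rlogis(n)
y <- cut(latent, c(-Inf, cuts_true, Inf), labels = FALSE)  # categories 1..3

neg_loglik <- function(par) {
  # par = (theta1, log(theta2 - theta1), beta); the log keeps the thresholds ordered.
  theta <- c(-Inf, par[1], par[1] + exp(par[2]), Inf)
  beta <- par[3]
  # P(Y = j) = F(theta_j - beta * x) - F(theta_{j-1} - beta * x), F = logistic CDF.
  p <- plogis(theta[y + 1] - beta * x) - plogis(theta[y] - beta * x)
  -sum(log(p))
}

fit <- optim(c(0, 0, 0), neg_loglik, method = "BFGS")
theta_hat <- c(fit$par[1], fit$par[1] + exp(fit$par[2]))
beta_hat <- fit$par[3]
```

The extensions listed above (scale effects, nominal effects, random effects, structured thresholds) all modify pieces of this likelihood, which is why a dedicated package is more convenient than hand-written code.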
##############################################
17 August 2011, 14:02. Invited talk on Modelling Three-dimensional Surfaces in R by Adrian Bowman, University of Glasgow.
- He is showing an application to people's faces.
- The three-dimensional point plot he is presenting looks pretty much like a photograph! Cool stuff again.
- Now how to model such data.
- Face3D research consortium: http://www.face3d.ac.uk/wiki/index.php/The_Face3D_project
Breast surgery/reconstruction
-> Identifying breast boundary
-> Begin at the landmark which represents the most prominent point
-> Identifying breast boundary by the point of maximum curvature.
-> Subsequent boundary points are now identified by rotation
-> Fit a principal curve to the identified points
-> Then decomposing asymmetry – surfaces. The components can also be examined for an individual patient
- Identifying curves
-> Surface curvature is one of the key issues in the area. We can measure the direction of the curve using this.
- Change point detection. There are many approaches.
- He is now showing an example of curve identification on lips, tracking where the lips meet! Such a curve changes dynamically and is not linear.
- Disclaimer! Image applications in R are not my cup of tea, so these notes may look weird!
- Principal components for faces now!
- Then, an application in orthognathic surgery. Comparing before and after!
– key issue is the prediction after the surgery
– Use CT scanning before the surgery, then use the data of your face to predict what is going to happen after the surgery
– Taking some measure of uncertainty into the model too.
- The last topic: magnetoencephalography (MEG)
– data could be very noisy in this case
– Showing typical dipole topographies on single dyad data
– Possible dipole? Result on a single-trial experiment using dynamic, multi-colour graphics looks nice!
– Results presented in terms of both pictures and graphs
– variation across trials -> all-trials dipole
– A visualization tool in the ‘rpanel’ package is a GUI one!
##############################################
4:40pm Just finished my presentation. Relax time!
4:41pm The presentation before mine was very interesting. It is about inventory but also deals with the bullwhip effect and supply chain performance. Nice one. The package creator is also a PhD student, from Brazil. Gotta tell my supervisor.
4:43pm Now in M02. Ortolani Millo is presenting “Integrating R and Excel for Automatic Business Forecasting”. It works as an add-in in Excel offering options to do the forecasting. ARIMA is in there too!
##############################################
2:46pm Nomograms for visualising relationships between variables by Jonathan Rougier
- He is showing how to use a nomogram by fitting a hand-drawn picture of a donkey
- See picture from David Smith http://yfrog.com/kgev6rvj
- using pynomo package see http://www.pynomo.org
##############################################
2:02pm: Design of Experiment (DoE) in R by Ulrike Grömping
- She is explaining Principles of DoE.
Block what you can and randomise what you cannot (Box, Hunter and Hunter 1978; 2005).
Randomisation: Balance out unknown influences.
- DoE in R: What is there?
– Task Views, thanks to Achim Zeileis
– started February 2008
– currently contains 37 R packages related to DoE
– Main purposes = pointer to existing functionality and support for synergies, to avoid duplicated work
– First package in 2000, conf.design (core), and a roughly exponential increase since 2004
- Key driver for her work on DoE in R
– Wanted free software solution for industrial experimentation
Most often-needed: fractional factorial 2-level designs (->FrF2)
– Also sometimes needed: orthogonal array
- Mission
– Free researchers’ and experimenters’ brains
– From intricate mathematical and/or programming tasks
– For thinking about the application problem
- Package suite for industrial DoE in R
– DoE.base
– FrF2
– DoE.wrapper. for wrapping existing functionality
- DoE available in Rcmdr (John Fox) as Rcmdr.Plugin.DoE <- So now it seems not too difficult for me!
- Call for activities
– Make R cover a broader range of DoE facilities
– Write a package, or contribute functionality to an existing package
– Try to stay close to existing structures
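The most often-needed design type, a two-level fractional factorial, can be sketched in base R to show what such packages automate. This does not use FrF2 or DoE.base; the factor names, the generator D = ABC and the seed are assumptions chosen for a small 8-run example.

```r
# Full 2^3 factorial in the base factors A, B, C (coded -1 / +1).
base_design <- expand.grid(A = c(-1, 1), B = c(-1, 1), C = c(-1, 1))

# Half fraction 2^(4-1): the fourth factor is aliased with the ABC interaction.
design <- transform(base_design, D = A * B * C)

# Randomise the run order to balance out unknown influences.
set.seed(8)
design <- design[sample(nrow(design)), ]
```

Four factors are studied in 8 runs instead of 16, at the price of confounding each main effect with a three-factor interaction. Choosing generators that keep such confounding harmless is the intricate part that a dedicated package takes off the experimenter's hands.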
##############################################
R Studio by J.J.Allaire
- RStudio = R coding tool available on Windows, Mac OS X, and Linux, and on the web
- Screenshots look very similar on every platform.
- Highlight = Extract function to re-run a chunk of code
- Conventional R history mechanism = save every command entered, searchable history, code navigation (in the next beta release)
- In 10 years it will be almost impossible to justify NOT using open source software.
- Future plan = make the capabilities of R more transparent and accessible
##############################################
9:58am Keynote by Brian D. Ripley
- A brief Timeline
- Prehistory – 1997
- JCGS paper submitted Mar 1995
- The earliest extant version seems to be Jun 1995 (456KB); 0.1 alpha (842KB)
- R 2.14.0 Oct 2011
- R 2.15.0 is scheduled in Mar 2012
- R 3.0.0 will be a major change, but there is no plan for this yet.
CRAN
- CRAN: 2 packages in 1997, >100 in 2001 and now ~3200 current packages
- ~80 successful submissions per week to CRAN
- 10,000 current packages for Christmas 2016?
- Infrastructure provided by wu.ac.at and Stefan Theußl
CRAN was replaced by ‘repos’ and tools were provided to support other repositories in 2004, but rather few public repositories have emerged.
The R Development Process
- R is run by the active members of its core team, who meet in person only every couple of years.
- The day-to-day business is by email. 3 in NZ, 1 India, 8 EU, 3 America
How do features get into R?
- R was principally developed for the benefit of the core team. Only they have votes.
- Most of what we have seen in R is there because core team members needed or wanted it, e.g., for research (esp. initially), for teaching (early 2000s), to develop R itself or to support other projects they were involved with.
Internationalization
- The core members are all native speakers of a Western European language which can be written in Latin-1.
- Japanese statisticians became interested in working in R.
The Future
- R is heavily dependent on a small group of altruistic people.
- They do feel that their contributions are not treated with respect.
- People need to trust the decisions of the core team.
Trend prediction
- Windows will remain out of step with other OSes.
- The number of packages will grow inexorably. Whereas they provide a wonderfully comprehensive test suite, they also provide a formidable barrier to change.
##############################################
8:45am Opening session now!
Interesting numbers
440 participants
41 countries
342 EU, 60 N America, 16 Oceania, 13 Asia, 5 Central and South America, 4 Africa
13 conference sponsors and exhibitors