Posts from the ‘R’ Category

Econometrics with R – Part 2

R is well suited to analysing data, so the data themselves are usually entered in another program first, such as a spreadsheet (MS Excel) or SPSS, and then imported into R for analysis. This can be done with the commands below.

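1. From a .csv file

Suppose you want to import the file data.csv,
where the first row of the data contains the variable names (the row.names argument below assumes the file has a column named id).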

data <- read.table("c:/data.csv",
header=TRUE,
sep=",",
row.names="id")
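As a minimal, self-contained sketch of step 1, the following writes a small made-up table to a temporary .csv file and reads it back with the same arguments. The column names (id, wage, educ), the values and the temporary path are illustrative assumptions, not part of any real data set.

```r
# Hypothetical example data; names and values are made up for illustration.
tmp <- tempfile(fileext = ".csv")
example <- data.frame(id   = c("a", "b", "c"),
                      wage = c(10.5, 12.0, 9.8),
                      educ = c(12, 16, 10))
write.csv(example, tmp, row.names = FALSE)

# header = TRUE: the first row holds the variable names.
# row.names = "id": use the "id" column as row labels rather than as a variable.
data <- read.table(tmp, header = TRUE, sep = ",", row.names = "id")
str(data)
```

Since read.csv() already defaults to header = TRUE and sep = ",", read.csv(tmp, row.names = "id") should give the same result with less typing.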

2. From MS Excel

Export the file to .csv format, then follow step 1.

3. From SPSS or PASW (IBM)

First export the file to .csv format
using the “Save as” command and choosing .csv as the file extension.
Then import the data into R as in step 1.

4. From Stata

library(foreign)
data <- read.dta("c:/data.dta")
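A small round-trip sketch of the Stata import, assuming the foreign package is available (it normally ships with R as a recommended package). The data frame and the temporary file are made up for illustration. Note that read.dta() is documented for older Stata file formats (versions 5 to 12), so files saved by newer Stata releases may need a different reader.

```r
library(foreign)

# Made-up data written to a temporary .dta file, only so the example can run.
tmp <- tempfile(fileext = ".dta")
example <- data.frame(wage = c(10.5, 12.0, 9.8),
                      educ = c(12L, 16L, 10L))
write.dta(example, tmp)

# Read the Stata file back into a data frame.
data <- read.dta(tmp)
str(data)
```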

Example: selecting variables to create a subset (sub_data) from the main data set (data)

Variables can be selected by clicking together with Shift (to select a range of variables) or Ctrl (to select them one at a time).

data <- data.frame(replicate(26,list(rnorm(5))))
names(data) <- LETTERS
sub_data <- data[select.list(names(data), multiple=TRUE)]
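select.list() needs an interactive session, so it will not work in a batch script. As a hedged alternative sketch, the same kind of subset can be built by naming the variables directly; the particular variables chosen here (A, C, Z and the range D to H) are arbitrary examples.

```r
set.seed(1)
data <- data.frame(replicate(26, rnorm(5)))
names(data) <- LETTERS

# Pick individual variables by name (the scripted analogue of Ctrl+click).
by_name <- data[c("A", "C", "Z")]

# Pick a contiguous range of variables (the scripted analogue of Shift+click).
first <- match("D", names(data))
last  <- match("H", names(data))
by_range <- data[first:last]

names(by_name)
names(by_range)
```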

Source: Adapted from an answer by Josh O’Brien on Stack Overflow to a question from Jeromy Anglim.



Econometrics with R – Part 1

Why I wrote this manual

I love R. R is great software for statistical analysis and for producing beautiful and reliable graphics from data. I initially taught myself R and found the learning curve steep when climbing it alone. However, I have used R successfully in my research thanks to the useR community. Whenever I have a problem in R, I can find an answer (usually more than one) freely available on the Internet, or I can email the package creator and get an answer with CODE in minutes. Hence I owe the useRs a great deal. Once I believed I could give something back, I thought that writing my own version of an R manual (from the viewpoint of a non-programmer and econometrician) would make a small contribution to the R ecosystem. I am not an expert in R, but I believe in its power and its potential for others in research or teaching. If you find any bugs or errors in this manual, please do not hesitate to let me know, and I will correct them ASAP.

Why econometricians should use R

  1. Free
    Anyone can use R without any financial cost.
  2. Reliable
    R has been accepted as a lingua franca for statistical analysis. It is widely and intensively used by both academics and practitioners.
  3. Available on Windows PC, Mac and Linux (with high compatibility)
    I like this one a lot because I use a Mac myself while my school uses Windows PCs.
    Using R means I have no problems moving an analysis between computers; it works on both Mac and Windows PC.
  4. Fresh and Up-to-date
    Because R is open to contributions, many people help develop it by writing packages. Many well-known researchers and professors write packages for new statistical tools, and these are free to use as well.
  5. Worldwide community of useRs
    There are many R users around the world, and many of them write manuals and tips on the Internet. If you have a problem, a search on the Internet will turn up plenty of answers. There are also R-bloggers, Twitter (#rstats) and Stack Overflow, where we can easily contact R users and developers and ask them questions.
  6. Reproducibility 
    R keeps a record of the commands used in an analysis, so we do not have to remember which buttons we pressed. This makes it easy to go back to old analyses and repeat them.
  7. Synergy with LaTeX
    R works efficiently with professional document preparation systems such as TeX or LaTeX through Sweave.
    For those who apply the same analysis again and again to newly arriving data, such as annual financial reports, using R saves a lot of time.

Continue to Part 2

Other posts about R

My presentation at The R useR! Conference 2011


At useR! 2011, I talked about using R (with the packages sem, lavaan and OpenMx) for Structural Equation Modeling, comparing it with commercial software, i.e., AMOS, LISREL and Mplus.

In this study, I compare R and the other software by running the same model of ‘Transaction Costs in Supply Chain’.

The following are the presentation slides, R code and the abstract.

Click here to download the slide.

The R Codes

  1. sem
  2. lavaan
  3. OpenMx

Abstract


useR! 2011 Live Blog


The R UseR Conference 2011
University of Warwick, UK

15:59. useR! 2012 will be hosted by Vanderbilt University, Nashville, USA.

15:02 FINALLY, the last invited talk, by Simon Urbanek (AT&T), on
R Graphics: Supercharged – Recent Advances in Visualization and Analysis of Large Data in R. Download Abstract here

  • 85% of seats in the great lecture hall are filled! >250 people
  • New features in R graphics
  • Real demo on a city map: how can a city be greener? Estimating the traffic
  • R graphics integrated with a real map (like Google Maps). Looks nice.
  • Polygons with holes: polypath()
    – regular polygon()s cannot create holes
  • Most recent feature -> screen output control
  • so far there was no way to tell R when to actually show graphics on the screen: now or only now???
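A rough sketch of the two features mentioned above, polypath() and dev.hold()/dev.flush(), using only base R. The coordinates and the choice of a temporary PDF file as the device are assumptions made so the example runs without a screen; on such a file device the hold/flush calls are expected to have no visible effect.

```r
# Draw into a temporary PDF so that no screen device is needed.
out <- tempfile(fileext = ".pdf")
pdf(out, width = 4, height = 4)

dev.hold()   # ask the device to delay drawing (a no-op where unsupported)
plot.new()
plot.window(xlim = c(0, 10), ylim = c(0, 10))

# One path made of two sub-paths separated by NA: outer square, inner square.
x <- c(1, 9, 9, 1, NA, 4, 6, 6, 4)
y <- c(1, 1, 9, 9, NA, 4, 4, 6, 6)

# With the even-odd fill rule the inner square stays unfilled, i.e. a hole.
polypath(x, y, col = "grey", border = "black", rule = "evenodd")
dev.flush()  # release the hold so the output is shown

dev.off()
```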

Challenges

  • Data size increases
  • Large RAM (>100 GB) and CPU power are affordable
  • Visualisation needs to keep up
    – rendering: the game industry provides solutions: OpenGL + GPUs
    – visualization methods for large data
    -> interactivity (divide and conquer, shift of focus)
    -> sufficient statistics, aggregation

Proposed solution
– Rendering speed -> use OpenGL back-ends for R devices (qtdevice, iPlots eXtreme)
– The example shows a much quicker speed, from 5 secs to just a sec

  • About iPlots
  • iPlots = interactive plots for data analysis: selection, highlighting, brushing …
    – interactive change of plot parameters
    – queries
    – all essential plots (scatterplots, bar charts, histograms, parallel coordinates)
  • Demo now. The interactivity seems nice; the colours look like the new iPod colour set.

    iPlot (credit: Ashley Ford's Facebook)

  • Now we can ask for the selected points in the graph with
    > which(selected(p))
    – DEMO: PCA, and now you can select those outliers and find out what they are!
    – DEMO: histogram with a changing parameter; the graph changes with it
Conclusion

  • R: rasterImage(), polypath(), dev.hold/flush()
  • Large data requires fast graphics and interactivity
  • OpenGL graphics devices (idev(), qtdevice, …)
  • iPlots eXtreme: high-performance interactive graphics.
  • Fast (C++, OpenGL: interactivity on > 1 million points)
  • Efficient (no copying, reference semantics)
  • Extensible (custom visuals, statistical objects, plots)
  • CRAN release as “ix” expected next month

References with Link

END! Q&A now

##############################################

12:15 An algorithm for the computation of the power of Monte Carlo tests with guaranteed precision by Patrick Rubin-Delanchy. Download abstract here

  • Statistical formulation. Data = N observable stream.
  • Algorithm with finite time; the example is a permutation test for two Gaussian groups
  • They have other “tricks” to reduce the effort: choice of N, hypothesis test on the remaining stream.

##############################################

11:57am. Nonparametric Goodness-of-Fit Tests for Discrete Null Distributions by Taylor Arnold. Download abstract here

  • Deals with the Kolmogorov-Smirnov test
  • Discrete K-S test: not implemented in any of the major statistical computing packages, hence they aim to do it in R
  • Decided to require that the discrete null distribution be specified via class stepfun
  • After obtaining the test statistic, the p-value must be calculated
  • Implementation is tedious but relatively straightforward
  • Also using the discrete Cramér-von Mises test
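To make the idea concrete, here is a base R sketch of a K-S test against a discrete null specified as a stepfun. It is not the implementation presented in the talk: the Binomial(5, 0.5) null, the sample size and the simulated data are assumptions, and the p-value is approximated by Monte Carlo simulation instead of being computed exactly.

```r
set.seed(42)

# Null distribution: Binomial(5, 0.5), given as a right-continuous step function.
support <- 0:5
F0 <- stepfun(support, c(0, pbinom(support, size = 5, prob = 0.5)))

# K-S statistic: largest gap between the empirical CDF and F0.
# Both are step functions that jump only on the support, so it is
# enough to compare them at the support points.
ks_stat <- function(x) max(abs(ecdf(x)(support) - F0(support)))

x <- rbinom(30, size = 5, prob = 0.5)   # illustrative data drawn from the null
d_obs <- ks_stat(x)

# Monte Carlo p-value: share of simulated null samples at least as extreme.
d_sim <- replicate(2000, ks_stat(rbinom(length(x), size = 5, prob = 0.5)))
p_value <- (1 + sum(d_sim >= d_obs)) / (1 + length(d_sim))

c(D = d_obs, p = p_value)
```

Calling ks.test() on such data would treat the null as continuous and warn about ties, which is the gap the talk set out to fill.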

##############################################

11:31am. The benchden Package: Benchmark Densities for Nonparametric Density Estimation by Henrike Weinert (Inference Session). Download abstract here

  • The package implements 28 benchmark densities
  • From the stats package, e.g., uniform, exponential
  • Normal mixtures (Marronite, claw)
  • Support: compact, e.g., infinite peak, uniform scale mixture or sawtooth
  • Support: gaps, e.g., Matterhorn, caliper and trimodal uniform
  • Support: half line, e.g., Maxwell, Pareto and inverse exponential
  • Support: real line, e.g., logistic, double exponential
  • Simulation study to compare two different bandwidth selectors for a kernel estimator

##############################################

11:15am. Density Estimation Packages in R by Henry Deng (Inference Session). Download abstract here

  • Review of packages on CRAN for density estimation
  • Calculation speed, on a random set of n normally distributed points: ash is the fastest and pendensity is the slowest
  • Estimation accuracy using mean absolute error: results vary with the type of distribution and the number of data points. Interesting.
  • Additional idea: trade-off between speed and accuracy
  • Well-performing packages seem to be long established in R, with frequent updates.
  • Recommended packages: KernSmooth or ASH
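The kind of comparison described above (speed and mean absolute error on normally distributed data) can be sketched as follows. This is not the speaker's benchmark: the sample size, grid size and bandwidth choices are assumptions, and the KernSmooth part only runs if that package is installed.

```r
set.seed(1)
x <- rnorm(1000)   # made-up sample from a standard normal

# Base R kernel density estimate on a 512-point grid.
t_base <- system.time(d_base <- density(x, n = 512))["elapsed"]
mae_base <- mean(abs(d_base$y - dnorm(d_base$x)))   # error against the true density

# KernSmooth (one of the recommended packages), if it is installed.
if (requireNamespace("KernSmooth", quietly = TRUE)) {
  h <- KernSmooth::dpik(x)   # plug-in bandwidth selector
  t_ks <- system.time(
    d_ks <- KernSmooth::bkde(x, bandwidth = h, gridsize = 512)
  )["elapsed"]
  mae_ks <- mean(abs(d_ks$y - dnorm(d_ks$x)))
  print(c(mae_base = mae_base, mae_KernSmooth = mae_ks))
  print(c(time_base = unname(t_base), time_KernSmooth = unname(t_ks)))
}
```

Timings on a single small sample are noisy, so a real comparison would repeat the estimate many times and over several distributions, as the talk did.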

 

##############################################

18 Aug 2011, The last day (but by no means the least)

10:32am Binomial regression model by Merete Download abstract here

  • Three different kinds of residuals from the extractor methods: unstandardized, standardized and studentized.
  • Exact deletion residuals, a new type of residual implemented in binomTools
  • approx.deletion (rstudent) residual function
  • Parallel histograms = exploratory version of the Hosmer-Lemeshow goodness-of-fit test (with fixed cut points)
  • Half-normal plot: uses absolute residual values but is otherwise equivalent to a normal plot. Optional simulated envelopes to support interpretation
  • Profile likelihood from the MASS package -> returns and plots the profile likelihood root, not the profile likelihood
  • Misc = tools to group binary or completely grouped data based on a specified data set, empirical area under the ROC curve
##############################################

6:12pm Interval before Conference Dinner!

5:33pm Missing Values in Principal Component Analysis (PCA) by Julie Josse download abstract

  • Starting with… theoretical stuff
  • Overfitting issues from missing values -> fixed by a shrinkage method
  • Procedures in the missMDA package
  • Step1: Estimation of the number of dimensions
  • Step2: Imputation of missing values by the ‘imputePCA’ function
  • Step3: PCA on the completed data set, ‘MIPCA’ function
    – Iterative PCA: single imputation method
    – A single value cannot reflect the variability of prediction
    – Multiple imputation: generating plausible values for each missing value
  • Supplementary projection via the ‘plot()’ function
    – Individual positions (and variables) with other predictions
  • Between-imputation variability too!
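The single-imputation step ("iterative PCA") can be sketched in base R as below. This is a simplified, unregularized version written for illustration, not the missMDA implementation: the function name impute_pca, the simulated data and the number of components are all assumptions, and no shrinkage is applied.

```r
set.seed(123)

# Made-up low-rank data plus noise, with 10% of the cells set to missing.
n <- 50; p <- 5
X_full <- outer(rnorm(n), rnorm(p)) + matrix(rnorm(n * p, sd = 0.1), n, p)
X <- X_full
miss <- sample(n * p, size = 25)
X[miss] <- NA

# Hypothetical helper: fill missing cells by alternating PCA fit and imputation.
impute_pca <- function(X, ncp = 1, max_iter = 500, tol = 1e-8) {
  na <- is.na(X)
  Xh <- X
  col_means <- colMeans(X, na.rm = TRUE)
  Xh[na] <- col_means[col(X)[na]]          # start from column means
  for (i in seq_len(max_iter)) {
    mu <- colMeans(Xh)
    s <- svd(sweep(Xh, 2, mu))             # PCA of the centred, completed data
    fit <- s$u[, 1:ncp, drop = FALSE] %*%
      diag(s$d[1:ncp], ncp) %*%
      t(s$v[, 1:ncp, drop = FALSE])        # rank-ncp reconstruction
    fit <- sweep(fit, 2, mu, "+")
    change <- sum((Xh[na] - fit[na])^2)
    Xh[na] <- fit[na]                      # only missing cells are updated
    if (change < tol) break
  }
  Xh
}

X_imp <- impute_pca(X, ncp = 1)
sqrt(mean((X_imp[miss] - X_full[miss])^2))  # error on the hidden cells
```

A single completed table like X_imp says nothing about how uncertain each filled-in value is, which is the motivation for the multiple-imputation step in the talk.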

##############################################

5:06pm Here is the HIGHLIGHT. John Fox! on Tests for Multivariate Linear Models with the car Package Download abstract

  • Discussion on fitting multivariate linear models (MLMs) in R with the lm function
  • The anova function is flexible, but it calculates sequential (Type I) tests, and performing other common tests, especially for repeated-measures designs, is relatively inconvenient.
  • The Anova function (with a capital A) in the car package (Fox and Weisberg, 2011) can perform partial (Type II or Type III) tests for the terms in a multivariate linear model, including simply specified multivariate and univariate tests for repeated-measures models.
  • The linearHypothesis function in the car package can test arbitrary linear hypotheses for multivariate linear models, including models for repeated measures.
  • Both the Anova and linearHypothesis functions return a variety of information useful in further computation on multivariate linear models
  • Now he’s demonstrating how to use the ‘car’ package with the Anderson-Fisher iris data
Correlation plot, basic box-plot
> mod.iris <- lm(cbind(Sepal.Length, Sepal.Width,
 Petal.Length, Petal.Width) ~ Species, data=iris)
> (manova.iris <- Anova(mod.iris))
Type II MANOVA Tests: ...
> anova(mod.iris)
gives exactly the same result with the default function
  • Summary
    >summary()
  • Also handles repeated measures, e.g. a single repeated-measures factor. This can be handled by the anova function in R, but it is simpler to get the common tests from the Anova and linearHypothesis functions in the car package.
  • 21% of my MacBook battery left now; please, no blackout any time soon.
  • 5 mins left in this presentation!
  • It’s done! Q&A now.
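For readers who want to rerun the demo, here is a runnable version based on the built-in iris data. The base R part always runs; the car part is guarded because that package may not be installed, and the printed output is left out here since it was not recorded in these notes.

```r
# Multivariate linear model: four responses, one factor.
mod.iris <- lm(cbind(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) ~ Species,
               data = iris)

# Base R: sequential (Type I) multivariate tests, Pillai's trace by default.
tab <- anova(mod.iris)
print(tab)

# car, if installed: partial (Type II) tests via Anova() with a capital A.
if (requireNamespace("car", quietly = TRUE)) {
  manova.iris <- car::Anova(mod.iris)
  print(manova.iris)
}
```

With a single term in the model, the sequential and partial tests for Species should agree, which matches the remark in the notes that both functions gave the same result.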

##############################################

4:47pm Multiple choice models: why not the same answer? A comparison among LIMDEP, R, SAS and Stata by Giuseppe Bruno Download Abstract

  • Similar to my presentation, but it focuses on comparing R packages for choice models with other proprietary software, and it is more technical!

##############################################

4:35pm Regression Models for Ordinal Data: Introducing R-package ordinal by Rune Haubo B. Christensen Download Abstract

  • The package offers the regression model for ordinal data.
  • Providing various standard model fit indices.
  • Extends the basic model with scale effects, nominal effects, random effects and structured thresholds.
  • Future work: more flexible random-effect structures and nested effects.

##############################################

17 August 2011, 14:02. Invited talk on Modelling Three-dimensional Surfaces in R by Adrian Bowman, University of Glasgow.

  • He is showing an application to people's faces.
  • The three-dimensional point graph he’s presenting is pretty much like a picture! Cool stuff again.
  • Now, how to model such data.
  • Face3D research consortium: http://www.face3d.ac.uk/wiki/index.php/The_Face3D_project
    Breast surgery/reconstruction
    -> Identifying the breast boundary
    -> Begin at the landmark which represents the most prominent point
    -> Identify the breast boundary by the point of maximum curvature.
    -> Subsequent boundary points are then identified by rotation
    -> Fit a principal curve to the single point
    -> Then decompose asymmetry of the surfaces. The components can also be examined for an individual patient
  • Identifying curves
    -> Surface curvature is one of the key issues in the area. We can measure the direction of the curve using this.
  • Change point detection. There are many approaches.
  • He is now showing an example of curve identification on lips, tracking where the lips meet! Such a curve changes dynamically and is not linear.
  • Disclaimer! Image applications in R are not my cup of tea, so these notes may look weird!
  • Principal components for faces now!
  • Then, an application in orthognathic surgery. Comparing before and after!
    – the key issue is the prediction after the surgery
    – Use CT scanning before the surgery, then use the data on your face to predict what is going to happen after the surgery
    – Taking some measure of uncertainty into the model too.
  • The last topic: Magnetoencephalography (MEG)
    – data could be very noisy in this case
    – Showing typical dipole topographies on single-dyad data
    – Possible dipole? The result of a single-trial experiment, shown dynamically in multiple colours, looks nice!
    – Results presented both as pictures and as graphs
    – variation across trials -> all-trials dipole
    – A visualization tool in the ‘rpanel’ package is a GUI one!

##############################################

4:40pm Just finished my presentation. Relax time!

4:41pm The presentation before mine was very interesting. It is about inventory but also deals with the Bullwhip Effect and Supply Chain Performance. Nice one. The package creator is also a PhD student, from Brazil. Gotta tell my supervisor.

4:43pm Now in M02. Ortolani Millo is presenting “Integrating R and Excel for Automatic Business Forecasting”. It works as an add-in in Excel offering options to do the forecasting. ARIMA is in there too!

##############################################

2:46pm Nomograms for visualising relationships between variables by Jonathan Rougier

  • He is showing how to use a nomogram by fitting a hand-drawn picture of a donkey
  • See picture from David Smith http://yfrog.com/kgev6rvj
  • using the pynomo package, see http://www.pynomo.org

##############################################


2:02pm: Design of Experiment (DoE) in R by  Ulrike Grömping

  • She is explaining the principles of DoE.
    Block what you can and randomise what you cannot (Box, Hunter and Hunter 1978; 2005).
    Randomisation: balance out unknown influences.
  • DoE in R: What is there?
    – Task Views, thanks to Achim Zeileis
    – started February 2008
    – currently contains 37 R packages related to DoE
    – Main purposes = point to existing functionality and support synergies; avoid double work
    – First package in 2000, conf.design (core), with a roughly exponential increase since 2004
  • Key driver for her work on DoE in R
    – Wanted a free software solution for industrial experimentation
    – Most often needed: fractional factorial 2-level designs (-> FrF2)
    – Also sometimes needed: orthogonal arrays
  • Mission
    – Free researchers’ and experimenters’ brains
    – from intricate mathematical and/or programming tasks
    – for thinking about the application problem
  • Package suite for industrial DoE in R
    – DoE.base
    – FrF2
    – DoE.wrapper, for wrapping existing functionality
  • DoE available in Rcmdr (John Fox) as RcmdrPlugin.DoE <- So now it seems not too difficult for me!
  • Call for activities
    – Make R cover a broader range of DoE facilities
    – Write a package, or contribute functionality to an existing package
    – Try to stay close to existing structures
##############################################


RStudio by J.J. Allaire

  • RStudio = R coding tool available on Windows, Mac OS X and Linux, and on the web
  • Screenshots look very similar on every platform.
  • Highlight = Extract Function, and re-running a chunk of code
  • Conventional R history mechanism = saves every command entered; searchable history; code navigation (in the next beta release)
  • In 10 years it will be almost impossible to justify NOT using open source software.
  • Future plan = make the capabilities of R more transparent and accessible

##############################################

9:58am Keynote by Brian D. Ripley

  • A brief Timeline
  • Prehistory – 1997
  • JCGS paper submitted Mar 1995
  • The earliest extant version seems to be Jun 1995 (456KB); 0.1 alpha (842KB)
  • R 2.14.0 Oct 2011
  • R 2.15.0 is scheduled in Mar 2012
  • R 3.0.0 will be a Major change but no plan for this yet.

CRAN

  • CRAN: 2 packages in 1997, >100 in 2001 and now ~3200 current packages
  • ~80 successful submissions per week to CRAN
  • 10,000 current packages for Christmas 2016?
  • Infrastructure provided by wu.ac.at and Stefan Theußl

In 2004 ‘CRAN’ was replaced by ‘repos’, and tools were provided to support other repositories, but rather few public repositories have emerged.

The R Development Process

  • R is run by the active members of its core team. They meet in person only every couple of years.
  • The day-to-day business is done by email. 3 in NZ, 1 in India, 8 in the EU, 3 in America

How do features get into R?

  • R was principally developed for the benefit of the core team. Only they have votes.
  • Most of what we have seen in R is there because core team members needed or wanted it, e.g., for research (esp. initially), for teaching (early 2000s), to develop R itself or to support other projects they were involved with.

Internationalization

  • The core members are all native speakers of a Western European language which can be written in Latin-1.
  • Japanese statisticians became interested in working in R.

The Future

  • R is heavily dependent on a small group of altruistic people.
  • They do feel that their contributions are not treated with respect.
  • People need to trust the decisions of the core team.

Trend prediction

  • Windows will remain out of step with other OSes.
  • The number of packages will grow inexorably. Whereas they provide a wonderfully comprehensive test suite, they also provide a formidable barrier to change.

##############################################

8:45am Opening session now!

Interesting numbers
440 participants
41 countries
342 EU, 60 N America, 16 Oceania, 13 Asia, 5 Central and South America, 4  Africa, 13 conference sponsors and exhibitor

R packages for Structural Equation Model: SEM with R


Structural Equation Modeling (SEM) was first implemented in a software package called LISREL. Since then, SEM has mainly been run in proprietary software, i.e., Mplus, AMOS, EQS, SAS and the new version of Stata (v.12).

However, you may also run SEM with great free software like R.

To the best of my knowledge, there are now five active packages that you can use to fit SEM. Here they are:

Main Packages (for fitting SEM models) 

  1. sem (John Fox, 2006)
    The first R package for SEM: “fit by maximum likelihood assuming multinormality, and single-equation estimation for observed-variable models by two-stage least squares.” It was also the first package I tried for running SEM in R, thanks to a very quick response from Prof. Fox to a question I emailed him.
    See Example of ‘sem’ package here.
  2. OpenMx (Boker et al, 2011)
    A very active package, contributed by experts in R and SEM, that “is free and open source software for use with R that allows estimation of a wide variety of advanced multivariate statistical models.”
    See Example of ‘OpenMx’ package here.
  3. lavaan (Yves Rosseel, 2012)
    A promising package for SEM. Its command language is similar to that of Mplus. Hence it is perhaps the most user-friendly package for SEM to date.
    See Example of ‘lavaan’ package here.
    Link to JSS paper
  4. semPLS (Armin Monecke, 2012)
    Fitting Structural Equation Models Using Partial Least Squares
    See: CRAN link, JSS paper
  5. plspm (Gaston Sanchez, 2012)
    R package dedicated to Partial Least Squares (PLS) methods (CRAN, plsmodeling.com)
    by Gaston Sanchez and Laura Trinchera
    A corresponding book titled “PLS Path Modeling with R” can be downloaded here.

My paper at useR! 2011 evaluated R packages against proprietary software, i.e., AMOS and LISREL.
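To give a feel for the model syntax, here is a small confirmatory factor analysis in lavaan. It assumes lavaan is installed (the code is skipped otherwise) and uses the HolzingerSwineford1939 example data that ships with the package; the three-factor model is the standard tutorial example, not the supply chain model from my paper.

```r
# Runs only if lavaan is installed; uses example data shipped with the package.
if (requireNamespace("lavaan", quietly = TRUE)) {
  # Each line defines a latent variable (left) measured by indicators (right).
  model <- '
    visual  =~ x1 + x2 + x3
    textual =~ x4 + x5 + x6
    speed   =~ x7 + x8 + x9
  '
  fit <- lavaan::cfa(model, data = lavaan::HolzingerSwineford1939)
  print(lavaan::fitMeasures(fit, c("chisq", "df", "cfi", "rmsea")))
}
```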

Today (30 May 2012), I was glad to find that there are also complementary packages for SEM in R, as follows.

Complementary packages

  • SEMplusR: Functions, examples and datasets to learn, use and teach Structural Equation Modeling (SEM)  [GitHub]
    by Pairach Piboonrungroj 
  • SEMModComp: Model Comparisons for SEM [CRAN link, Additional Documents]
    by  Roy Levy
  • semGOF: an add-on package which provides fourteen goodness-of-fit indices for structural equation models fitted with the ‘sem’ package. [CRAN]
    by Elena Bertossi 
  • stremo: Functions to help the process of learning structural equation modelling [CRAN link]
    by  Gustavo Carvalho, Marco Batalha, and Owen Petchey
  • FIAR: Functional Integration Analysis in R [CRAN link]
    by  Bjorn Roelstraete
  • semTools: Useful tools for structural equation modeling [CRAN link]
    by  Sunthud Pornprasertmanit, Patrick Miller, Alex Schoemann, Yves Rosseel
  • simsem: SIMulated Structural Equation Modeling [CRAN link]
    by  Sunthud Pornprasertmanit, Patrick Miller, Alexander Schoemann
  • pathmox: R package dedicated to segmentation trees in PLS Path Modeling [CRAN, plsmodeling.com]

Packages for SEM plotting and graphics

  • qgraph: Network representations of relationships in data [CRAN link]
    by  Sacha Epskamp, Angelique O. J. Cramer, Lourens J. Waldorp, Verena D. Schmittmann and Denny Borsboom
  • psych: Procedures for Psychological, Psychometric, and Personality Research [CRAN link]
    by William Revelle

Packages that link R with other software to fit SEM

  • Mplus
    Automating Mplus Model Estimation and Interpretation [CRAN link]
    by  Michael Hallquist
  • EQS
    R/EQS Interface [CRAN link]
    by  Patrick Mair and Eric Wu

More external resources on SEM in R

  • CRAN Task view on ‘Structural Equation Models, Factor Analysis, PCA’ in Psychometrics [url]
    by Patrick Mair
  • A tutorial on the use of sem package  [url]
    by William Revelle
  • A post on ‘Structural Equation Modeling in R‘  [url]
    by Jeromy Anglim