R Survey Analysis

Weighting Data
Inverse-Probability Weights and Inverse-Variance Weighting (link to survey package website)
The "survey" package contains most of what is needed to handle both inverse-probability-weighted and variance-weighted data. You create a weighted design "object" that is then used with special survey functions to perform statistics on the original dataframe. This does not create a new dataframe, but rather a "filter" through which the original data is processed before outputting results. This works great for GSS data and similarly weighted, clustered, and/or stratified samples. This package will output the correct standard errors and can output the design effects.
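As a minimal sketch of the workflow (the variable names `weight`, `psu`, and `stratum` are hypothetical placeholders for whatever your codebook specifies), creating and using a design object looks like:

```r
library(survey)

# df is your original dataframe; the design object wraps it without copying it.
#   ids     = clustering (PSU) variable; use ~1 if there is no clustering
#   strata  = stratification variable
#   weights = inverse-probability weight variable
des <- svydesign(ids = ~psu, strata = ~stratum, weights = ~weight,
                 data = df, nest = TRUE)

# All subsequent statistics go through the design object, not the raw dataframe
svymean(~age, des)               # weighted mean with design-correct standard error
svymean(~age, des, deff = TRUE)  # also report the design effect
```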

Check the survey website for more information; there are also several slide presentations on the capabilities of the survey package for analyzing survey data. It is a great package with strong tabulation abilities and graphics potential.

This <page> has a great introduction to the theory behind survey weighting. The biggest take-home message is that SPSS only does frequency weights and can't handle clustered samples (except in the Complex Samples module); Stata and R (with the survey package) allow you to use inverse probability weights, clustering, and stratification. It is imperative that you know what sort of weights your data uses and that you use those weights in calculating all your statistics. The other issue with SPSS is that it does not calculate standard errors correctly with weighted data. R is free and does most of the complex functions required.

Winship and Radbill point out the potential problems with using weights in OLS in this article. If you are performing regression on a weighted sample, you must consider the possibility that weighted OLS (WOLS) will produce incorrect estimates of the standard errors of the coefficients. The general procedure is to run both OLS and WOLS and conduct an F-test to see if the models differ substantially. If they don't differ, OLS is to be preferred. If they are significantly different, then the context will determine which model to use. A frequent source of the difference is an underspecified model that omits key interaction or non-linear terms. This can be detected by estimating an augmented equation that includes the weights times the independent variables. If this doesn't produce better estimates, then WOLS is the way to go, but the standard errors should be estimated with White's estimator.
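A rough sketch of that procedure in R (the `sandwich` and `lmtest` packages supply White's estimator; `y`, `x1`, `x2`, and `weight` are hypothetical variable names, not from any particular dataset):

```r
library(sandwich)  # White's heteroskedasticity-consistent estimator
library(lmtest)

ols  <- lm(y ~ x1 + x2, data = df)                    # unweighted OLS
wols <- lm(y ~ x1 + x2, data = df, weights = weight)  # weighted OLS

# Augmented equation: add the weight and weight-by-covariate interactions.
# A significant F-test for the added terms suggests the model is underspecified.
aug <- lm(y ~ x1 + x2 + weight + weight:x1 + weight:x2, data = df)
anova(ols, aug)

# If WOLS is retained, report White's standard errors for it
coeftest(wols, vcov = vcovHC(wols, type = "HC0"))
```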

Clustered samples pose a separate problem that I haven't fully sorted out, but the general idea is that in multistage clustered samples the errors will often not be independent of each other or distributed with equal variances. In clustered samples it is often unwise to use the sample weights, because doing so amplifies the non-independence. A better strategy is to create a survey design object that specifies the clusters but not the weights, and then use this object for the regressions.
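Under that strategy, the design object specifies only the clustering (again, `psu` is a placeholder name for your cluster identifier):

```r
library(survey)

# Clusters only: omit weights so the design captures the non-independence
# of observations within clusters without amplifying it
des_clust <- svydesign(ids = ~psu, data = df)

# Regression with cluster-adjusted standard errors via the design object
fit <- svyglm(y ~ x1 + x2, design = des_clust)
summary(fit)
```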

In terms of coping with complex survey design, I think the best rule of thumb is: if using vs. not using weights and/or clusters changes your substantive results, then you probably don't want to hang your hat (or your paper) on those results.

Important Survey Functions
#not all well documented in the survey package, but very useful when you don't want to use unweighted t-tests
svymean(~x, design) #weighted means
svyttest(x~group, design) #design-based t-tests
svyglm(y~x, design) #for OLS and logistic (add family="quasibinomial" for logistic)
svyby(~x, ~y, design, svymean) #cross tabs with row and column sums and proportions
svytable(~x+y, design) #weighted tables
svymean(~interaction(health,region), design, na.rm=TRUE) #proportions for each health-by-region combination