R-Data Management

Combining data frames, etc. with rbind and cbind
When combining data frames, it is often convenient to use cbind (bind columns), as in y=cbind(y1$x,y2$x2). However, when cbind is given plain vectors like these it builds a matrix, so factors are silently converted to their integer codes, making this a poor solution. For data frames, cbind simply calls data.frame(), so it is better just to use data.frame() directly, which keeps factors intact. rbind() on data frames has no such problem: it keeps factor columns as factors and merges their levels, so it is safe to use with factors! Here's an example, a clumsy one, but it gets the point across.
cd=data.frame(year, incomemem
cd2=data.frame(year, incomemem
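To make the factor issue concrete, here is a minimal sketch using made-up data. The objects y1, y2, and extra are hypothetical stand-ins, not the variables from the example above.

```r
# Hypothetical toy data; names are illustrative only
y1 <- data.frame(x  = factor(c("low", "high", "low")))
y2 <- data.frame(x2 = factor(c("yes", "no", "yes")))

# cbind() on bare vectors builds a matrix, so the factors
# collapse to their integer codes and the labels are lost
m <- cbind(y1$x, y2$x2)
is.numeric(m)

# data.frame() keeps each column as a factor
d <- data.frame(x = y1$x, x2 = y2$x2)
sapply(d, is.factor)

# rbind() on data frames keeps factors and merges their levels
extra <- data.frame(x = factor("medium"), x2 = factor("no"))
d2 <- rbind(d, extra)
levels(d2$x)
```

If you only need to glue columns together, calling data.frame() directly avoids having to remember which cbind behavior applies.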

Lattice Plots - Setting Themes to Black and White, Changing the Strip Background Color
Define a name for a lattice theme:
library(lattice)
ltheme <- canonical.theme(color = FALSE)      ## built-in B&W theme
ltheme$strip.background$col <- "transparent" ## change strip bg color to clear
lattice.options(default.theme = ltheme)      ## set as default
Greyscaled strips can be set this way:
strip.background <- trellis.par.get("strip.background")
trellis.par.set(strip.background = list(col = grey(7:1/8)))
Greyscaled plot symbols similarly by:
plot.symbol <- trellis.par.get("plot.symbol")
trellis.par.set(plot.symbol = list(col = grey(5/8)))
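If you would rather not change the session-wide defaults, a theme can also be passed to a single plot through the par.settings argument. This is a sketch only; the built-in mtcars data and the name bw are used purely for illustration.

```r
library(lattice)

# Build a black-and-white theme and clear the strip background
bw <- canonical.theme(color = FALSE)
bw$strip.background$col <- "transparent"

# Apply the theme to this one plot only, leaving the defaults untouched
p <- xyplot(mpg ~ wt | factor(cyl), data = mtcars, par.settings = bw)
# print(p) draws the plot on the current device
```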
Removing Unused Variables
R can be very memory intensive because it keeps every object in memory. Objects created by regressions and other functions often store a copy of the data they were fitted on, sometimes the full data frame. For this reason, it is best to drop variables in a data set that you are not going to use. This is done easily with the subset() function:
new.data=subset(old.data, select=c(var1, var2, var3))
Make sure to remove the old dataset once you no longer need it:
rm(old.data)
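A small sketch of the whole workflow, assuming a hypothetical old.data with one column you do not need. object.size() is used here only to show that the trimmed copy is smaller.

```r
# Hypothetical data frame standing in for old.data
old.data <- data.frame(var1 = 1:5, var2 = letters[1:5],
                       var3 = rnorm(5), unused = runif(5))

# Keep only the variables you plan to use
new.data <- subset(old.data, select = c(var1, var2, var3))
print(object.size(old.data))
print(object.size(new.data))

# Drop the original so it no longer occupies memory
rm(old.data)
invisible(gc())   # optionally ask R to release the freed memory now
```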

Selecting all columns in a dataframe that contain a certain string
R contains a grep() function, which operates much like the grep command in Unix. It has many uses; one very handy one is selecting all the variables in a data frame whose names contain a certain string. The example below uses the pattern "^X", which matches names that begin with X; drop the ^ to match X anywhere in the name. This is done by:

colswithxinname=data[ ,grep("^X", colnames(data))]
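The difference between an anchored and an unanchored pattern is easy to see on a toy example. The data frame and its column names below are hypothetical; drop = FALSE is added so the result stays a data frame even when only one column matches.

```r
# Hypothetical data frame; column names are illustrative
data <- data.frame(X1 = 1:3, X2 = 4:6, maxX = 7:9, y = 10:12)

# "^X" matches names that START with X
starts_with_x <- data[, grep("^X", colnames(data)), drop = FALSE]
names(starts_with_x)

# "X" matches names that contain X anywhere
contains_x <- data[, grep("X", colnames(data)), drop = FALSE]
names(contains_x)
```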

Creating cutpoints
Sometimes you want to take a continuous variable and turn it into a categorical variable. This is done with the "cut" command. For example, to turn age into 8 categories:

data$agecat <- cut(data$age,
                   breaks = c(-1, 24+10*0:6, Inf),
                   labels = c('Below 25', '25-34', '35-44',
                              '45-54', '55-64', '65-74', '75-84', '85+'))
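To check where the boundaries fall, it helps to run the same breaks on a few known values. The ages below are made up. By default cut() uses intervals that are closed on the right, so 24 lands in 'Below 25' and 25 lands in '25-34'.

```r
age <- c(18, 24, 25, 40, 67, 90)   # hypothetical ages

agecat <- cut(age,
              breaks = c(-1, 24 + 10 * 0:6, Inf),
              labels = c("Below 25", "25-34", "35-44", "45-54",
                         "55-64", "65-74", "75-84", "85+"))

# Pair each age with its category to verify the cutpoints
data.frame(age, agecat)
table(agecat)
```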

Nice Crosstabs and Tables in R
The standard R functions table() and xtabs() are very utilitarian and not terribly flexible. In particular, xtabs() doesn't calculate proportions for you; that takes a separate call to prop.table().
Enter: gmodels (http://cran.r-project.org/web/packages/gmodels/index.html) and the CrossTable function. This does everything that tabulate does in Stata and more, and it produces nice output. Very simple to use.
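A short comparison on the built-in mtcars data, as a sketch. The CrossTable() call runs only if gmodels is installed, and prop.chisq = FALSE is just one of its display options, chosen here to keep the output compact.

```r
# Base R: counts first, then proportions in a separate step
tab <- table(mtcars$cyl, mtcars$gear)
tab
prop.table(tab, margin = 1)   # row proportions; each row sums to 1

# gmodels: counts and row/column/total proportions in one table
if (requireNamespace("gmodels", quietly = TRUE)) {
  gmodels::CrossTable(mtcars$cyl, mtcars$gear, prop.chisq = FALSE)
}
```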

How to Name Data with Dimnames
dimnames() assigns names to the various dimensions of an array.
For instance:

x=array(0,c(2,3,2), dimnames=list(c("a","b"),c("a","b","c"),c("a","b")))
creates a 2x3x2 array with the dimnames specified by the list
[think of the list as one level above c(): it holds one c() vector of names per dimension]

To reference these dimension names:
dimnames(x)[[1]]    # all the names in the first dimension of the array
dimnames(x)[[1]][1] # the first name in the first dimension of the array
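Once an array has dimnames, you can index it by name instead of by position, and you can replace the names of a single dimension. A brief sketch; the replacement names "first" and "second" are arbitrary.

```r
x <- array(0, c(2, 3, 2),
           dimnames = list(c("a", "b"), c("a", "b", "c"), c("a", "b")))

dimnames(x)[[2]]          # names along the second dimension

# Index by name: row "a", column "c", layer "b"
x["a", "c", "b"] <- 5

# Rename the third dimension only
dimnames(x)[[3]] <- c("first", "second")
x["a", "c", "second"]
```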

Sorting and ordering data
The key thing to know is that sort() sorts a single vector directly. This will not work for data frames, as you would only reorder one variable and leave the rest of each row behind. For data frames, you have to use order(). order() returns a vector of row indices, not sorted data, so you use it to index the rows, as in data[order(data$x1), ].

#this makes a new data set with columns taken from the variables x1, x2, x3...xn

#you can add columns (say you have two columns and want to add another)

newdata=rbind(x1...xn) # this makes a data set with rows defined by x's

#NOTE: if there are ties in x1, you need to tell R how to break them; that's what a second argument such as x2 does, as in order(x1, x2).
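The difference between sort() and order(), and the role of a tie-breaking variable, can be sketched on a small made-up data frame. The names df, x1, and x2 are illustrative.

```r
df <- data.frame(x1 = c(2, 1, 2, 1), x2 = c("b", "d", "a", "c"))

# sort() returns the sorted values of one vector only
sort(df$x1)

# order() returns row indices; ties in x1 are broken by x2
idx <- order(df$x1, df$x2)
sorted <- df[idx, ]
sorted

# Descending on x1, ascending on x2 within ties
idx_desc <- order(-df$x1, df$x2)
df[idx_desc, ]
```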

Recoding Variables:
Dummy Variables:
# create dummy for blacks
mydata$dm.black <- ifelse(mydata$race=="black", 1,0) 
#creates a variable test as a recode of gss$att in one step
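One thing to watch with ifelse() is missing data: a missing race gives a missing dummy. The sketch below uses a hypothetical mydata and shows an alternative with %in% for cases where you want missing values coded as 0 instead.

```r
# Hypothetical data with one missing value
mydata <- data.frame(race = c("black", "white", NA, "hispanic"))

# ifelse() propagates NA: the third row stays missing
mydata$dm.black <- ifelse(mydata$race == "black", 1, 0)

# %in% never returns NA, so missing race is coded 0 here
mydata$dm.black0 <- as.numeric(mydata$race %in% "black")

mydata
```

Which of the two is right depends on whether a missing race should count as "not black" in your analysis.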

Creating Categorical Variables (a factor in R):
# another example: create 3 race categories
attach(mydata) #so we can call race below
mydata$race.3[race=="white"|race=="Asian"] <- "Whasian"
mydata$race.3[race =="black"] <- "Black"
mydata$race.3[race =="hispanic"] <- "Latino"
mydata$race.3 <- factor(mydata$race.3) #turn the labels into a factor
detach(mydata)
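The same recode can be written without attach() by using a named lookup vector. This is an alternative sketch, not the method used above; the object name recode.map and the toy data are hypothetical.

```r
# Hypothetical data
mydata <- data.frame(race = c("white", "black", "Asian", "hispanic", "black"))

# Named vector: old value -> new category
recode.map <- c(white = "Whasian", Asian = "Whasian",
                black = "Black", hispanic = "Latino")

# Look up each row's new label, then store it as a factor
mydata$race.3 <- factor(unname(recode.map[as.character(mydata$race)]))

table(mydata$race, mydata$race.3)
```

Values of race that are not in the lookup vector come out as NA, which makes unrecoded categories easy to spot.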

Useful tasks in recoding:
# get rid of NAs in data

#Make all NA's in a vector zeroes: a vectorized replacement does this without a loop
dm.ama3[is.na(dm.ama3)] <- 0
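For a whole data frame, two common approaches are dropping incomplete rows and zero-filling only the numeric columns. A sketch on made-up data; the names df, complete, and num are illustrative.

```r
# Hypothetical data frame with missing values in several columns
df <- data.frame(a = c(1, NA, 3, 4),
                 b = c(2, 5, NA, 6),
                 g = c("x", "y", "z", NA))

# Get rid of NAs: keep only rows with no missing values at all
complete <- na.omit(df)

# Or require completeness on selected columns only
keep <- complete.cases(df[, c("a", "b")])

# Replace NAs with zero in the numeric columns, leaving others alone
num <- sapply(df, is.numeric)
df[num][is.na(df[num])] <- 0
df
```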

drop empty categories in a factor:
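One way to do this, shown here as a sketch on a made-up factor, is droplevels(); re-running factor() on the subset has the same effect. After subsetting, a factor still remembers levels that no longer occur, which is why empty categories show up in tables.

```r
f <- factor(c("a", "b", "c"))
sub <- f[f != "c"]

levels(sub)               # "c" is still listed, though no value uses it
table(sub)                # shows a zero count for "c"

sub2 <- droplevels(sub)   # remove the unused level
levels(sub2)

levels(factor(sub))       # same result by rebuilding the factor
```

droplevels() also accepts a data frame, in which case it drops unused levels from every factor column.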