# Goal: "Dummy variables" in regression.

# Suppose you have this data:
people = data.frame(
  age =       c(21,62,54,49,52,38),
  education = c("college", "school", "none", "school", "college", "none"),
  education.code = c(  2,        1,      0,        1,         2,      0 )
# Here people$education is a string categorical variable and
# people$education.code is the same thing, with a numerical coding system.

# Note the structure of the dataset --
# The strings supplied for `education' have been treated (correctly) as
# a factor, but education.code is being treated as an integer and not as
# a factor.

# We want to do a dummy variable regression. Normally you would have:
#  1 Chosen college as the omitted category
#  2 Made a dummy for "none" named educationnone
#  3 Made a dummy for "school" named educationschool
#  4 Ran a regression like lm(age ~ educationnone + educationschool, people)
# But this is R. Things are cool:
lm(age ~ education, people)

# ! :-)
# When you feed him an explanatory variable like education, he does all
# these steps automatically. (He chose college as the omitted category).

# If you use an integer coding, then the obvious thing goes wrong --
lm(age ~ education.code, people)
# because he's thinking that education.code is an integer explanatory
# variable. So you need to:

lm(age ~ factor(education.code), people)
# (he choose a different omitted category)

# Alternatively, fix up the dataset --
people$education.code <- factor(people$education.code)
lm(age ~ education.code, people)

# Bottom line:
# Once the dataset has categorical variables correctly represented as factors, i.e. as
# doing OLS in R induces automatic generation of dummy variables while leaving one out:
lm(age ~ education, people)
lm(age ~ education.code, people)

# But what if you want the X matrix?
m <- lm(age ~ education, people)
# This is the design matrix that went into the regression m.