library(CRTConjoint)
This vignette walks through the example cases listed in the package for each of the main functions, with details on how to specify the inputs. The last section contains detailed steps on how to use CRTConjoint with the Amazon Web Services computing cluster.
To begin, we start with the main function, CRT_pval. All examples use the immigration choice conjoint experiment data from Hainmueller et al. (2014). We first load this data from the package.
data("immigrationdata")
Each row consists of a pair of immigrant candidates that were shown to a respondent. For example, the first row shows that the left immigrant candidate was a Male from Iraq who had a high school degree, while the right immigrant candidate was a Female from France with no formal education. The respondent who evaluated this task was a 20 year old college educated White Male, who voted for the left immigrant candidate.
In the first example, we aim to understand whether the candidate’s education matters for immigration preferences. To test this, CRT_pval requires users to specify all factors and respondent characteristics of interest in the formula input, along with the binary response. CRT_pval additionally requires us to specify which of the factors belong to the left and right profiles.
form = formula("Y ~ FeatEd + FeatGender + FeatCountry + FeatReason + FeatJob +
  FeatExp + FeatPlans + FeatTrips + FeatLang + ppage + ppeducat + ppethm + ppgender")
left = colnames(immigrationdata)[1:9]
right = colnames(immigrationdata)[10:18]

left; right
#> [1] "FeatEd" "FeatGender" "FeatCountry" "FeatReason" "FeatJob"
#> [6] "FeatExp" "FeatPlans" "FeatTrips" "FeatLang"
#> [1] "FeatEd_2" "FeatGender_2" "FeatCountry_2" "FeatReason_2"
#> [5] "FeatJob_2" "FeatExp_2" "FeatPlans_2" "FeatTrips_2"
#> [9] "FeatLang_2"
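The alignment visible in the printed names can also be checked programmatically. The following is a minimal sketch that assumes the right profile columns are named as the left columns plus a "_2" suffix, as in the output above; the shortened vectors are illustrative only.

```r
# Illustrative subset of the names printed above; the "_2" suffix
# convention is an assumption taken from that output.
left  <- c("FeatEd", "FeatGender", "FeatCountry")
right <- c("FeatEd_2", "FeatGender_2", "FeatCountry_2")

# TRUE only when both vectors have the same length and every right name
# equals the corresponding left name with "_2" appended.
aligned <- length(left) == length(right) && all(paste0(left, "_2") == right)
aligned
```

If a data set uses a different naming convention, the comparison would need to be adapted accordingly.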
Users can see that the left and right profile factors are aligned, i.e., the first entry is the education for both the left and right profiles. The function expects this alignment, so it is important that the two inputs list the factors in the same order. We also note that the formula only contains the factors for the left profile. This is sufficient because the algorithm uses the left and right inputs to include both the left and right profile attributes when testing the hypothesis. Lastly, we include all respondent characteristics (ppage, ppeducat, etc.) to boost power. We are now ready to run CRT_pval to test whether education matters.
education_test = CRT_pval(formula = form, data = immigrationdata, X = "FeatEd",
  left = left, right = right, non_factor = "ppage", B = 100, analysis = 2)
education_test$p_val
We again note that X = "FeatEd" is sufficient to specify which factor we are testing. Because the function assumes all attributes in conjoint experiments are of class factor, any variables that are not of class factor, for example respondent age (ppage), must be listed in the non_factor input. Furthermore, to save time we only run it for \(B = 100\) resamples. Lastly, we set analysis = 2 so that the function also reports the two strongest interactions that contributed to the observed test statistic. We note that this is purely for exploratory purposes.
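To find which columns need to be listed in non_factor, one can inspect the column classes. The sketch below uses a small toy data frame with hypothetical values rather than the package data.

```r
# Toy data frame standing in for a conjoint data set (hypothetical values).
toy <- data.frame(
  FeatEd = factor(c("No formal education", "College degree")),
  ppage  = c(20, 45)
)

# Names of all columns that are not of class factor.
non_factor_cols <- names(toy)[!vapply(toy, is.factor, logical(1))]
non_factor_cols
```

Only the non-factor columns that actually appear in the formula would be relevant for the non_factor input.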
The output contains not only the \(p\)-value but also the observed test statistic, all the resampled test statistics, etc. The function also displays a progress bar showing the percentage of resamples completed.
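For intuition on how an observed statistic and a set of resampled statistics translate into a \(p\)-value, the sketch below uses a common convention for resampling-based tests. Whether CRT_pval uses exactly this formula is an assumption, and the numbers are simulated placeholders rather than package output.

```r
set.seed(1)
obs_stat <- 2.1                # hypothetical observed test statistic
resampled_stats <- rnorm(100)  # hypothetical resampled statistics, B = 100
B <- length(resampled_stats)

# Proportion of statistics (counting the observed one) that are at least
# as large as the observed statistic; always lies in (0, 1].
p_val <- (1 + sum(resampled_stats >= obs_stat)) / (B + 1)
p_val
```

With \(B = 100\) the smallest attainable value under this convention is \(1/101\), which is one reason a larger \(B\) is preferable for final results.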
Since the immigrant’s education was uniformly sampled, we did not need to specify the design because the default design is uniform. However, some factors used a more complex design. One such factor was job (FeatJob). In this experiment, the candidate’s occupation could only be financial analyst, computer programmer, research scientist, or doctor if their education was equivalent to at least some level of college. Because this constrained uniform design is popular in conjoint experiments, the function can account for it as long as the user specifies the constraints. We now show, using the same example, how to specify this constraint.
constraint_randomization = list() # (Job has dependent randomization scheme)
constraint_randomization[["FeatJob"]] = c("Financial analyst", "Computer programmer",
  "Research scientist", "Doctor")
constraint_randomization[["FeatEd"]] = c(
  "Equivalent to completing two years of college in the US",
  "Equivalent to completing a graduate degree in the US",
  "Equivalent to completing a college degree in the US")
The constraint_randomization list is a list of length two. The first element contains the levels of job that can only be randomized with certain levels of education. The second element contains the levels of education that allow the aforementioned jobs to have positive probability. The listed levels are assumed to match the levels in the supplied data, and the names of the list are assumed to match the column names in the supplied data. We note that the user only has to supply the constraint for either the left or right factor; the function assumes the constrained randomization scheme is the same for both, i.e., the same for FeatJob_2 and FeatEd_2.
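Because the list names and levels must match the data, it can help to verify them before running the test. The following sketch uses a hypothetical helper, check_constraint, and a toy data frame with shortened, made-up level names; neither is part of the package.

```r
# Toy data with shortened, hypothetical level names.
toy <- data.frame(
  FeatJob = factor(c("Doctor", "Janitor", "Doctor")),
  FeatEd  = factor(c("College", "No formal education", "Graduate"))
)
constraint <- list(FeatJob = "Doctor", FeatEd = c("College", "Graduate"))

# Hypothetical helper: checks that every list name is a column and that
# every listed level exists among that column's factor levels.
check_constraint <- function(data, constraint) {
  stopifnot(all(names(constraint) %in% colnames(data)))
  vapply(names(constraint), function(nm) {
    all(constraint[[nm]] %in% levels(data[[nm]]))
  }, logical(1))
}
levels_ok <- check_constraint(toy, constraint)

# Restricted jobs should only co-occur with the allowed education levels.
restricted <- toy$FeatJob %in% constraint$FeatJob
design_ok <- all(toy$FeatEd[restricted] %in% constraint$FeatEd)
```

A FALSE entry in levels_ok would typically point to a spelling or whitespace mismatch between the list and the data.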
job_test = CRT_pval(formula = form, data = immigrationdata, X = "FeatJob",
  left = left, right = right, design = "Constrained Uniform",
  constraint_randomization = constraint_randomization, non_factor = "ppage", B = 100)
job_test$p_val
Once we have the constraint list, we pass it to the constraint_randomization input, after setting design = "Constrained Uniform". The package examples also show how to handle a nonuniform design (without constraints) and how to force a variable to be included as an interaction.
We provide three other CRT functions that take similar inputs but test regularity conditions often invoked in conjoint experiments. The first tests the profile order effect (CRT_profileordereffect), i.e., whether being shown on the left or the right has any impact on the response. The syntax for such a test is straightforward.
profileorder_test = CRT_profileordereffect(formula = form, data = immigrationdata,
  left = left, right = right, B = 100)
profileorder_test$p_val
Testing the carryover effect and the fatigue effect is slightly more complex. Testing the carryover effect requires resampling all the left and right factors. The default design = "Uniform" assumes that all factors were uniformly sampled. If that is not the case (as in the immigration example), we need to supply our own resamples. To do this, we build our own resampling function:
resample_func_immigration = function(x, seed = sample(c(0, 1000), size = 1), left_idx, right_idx) {
  set.seed(seed)
  df = x[, c(left_idx, right_idx)]
  variable = colnames(x)[c(left_idx, right_idx)]
  len = length(variable)
  resampled = list()
  n = nrow(df)
  for (i in 1:len) {
    var = df[, variable[i]]
    lev = levels(var)
    resampled[[i]] = factor(sample(lev, size = n, replace = TRUE))
  }

  resampled_df = data.frame(resampled[[1]])
  for (i in 2:len) {
    resampled_df = cbind(resampled_df, resampled[[i]])
  }
  colnames(resampled_df) = colnames(df)

  #escape persecution was dependently randomized
  country_1 = resampled_df[, "FeatCountry"]
  country_2 = resampled_df[, "FeatCountry_2"]
  i_1 = which((country_1 == "Iraq" | country_1 == "Sudan" | country_1 == "Somalia"))
  i_2 = which((country_2 == "Iraq" | country_2 == "Sudan" | country_2 == "Somalia"))

  reason_1 = resampled_df[, "FeatReason"]
  reason_2 = resampled_df[, "FeatReason_2"]
  levs = levels(reason_1)
  r_levs = levs[c(2,3)]

  reason_1 = sample(r_levs, size = n, replace = TRUE)
  reason_1[i_1] = sample(levs, size = length(i_1), replace = TRUE)

  reason_2 = sample(r_levs, size = n, replace = TRUE)
  reason_2[i_2] = sample(levs, size = length(i_2), replace = TRUE)

  resampled_df[, "FeatReason"] = reason_1
  resampled_df[, "FeatReason_2"] = reason_2

  #profession high skill fix
  educ_1 = resampled_df[, "FeatEd"]
  educ_2 = resampled_df[, "FeatEd_2"]
  i_1 = which((educ_1 == "Equivalent to completing two years of college in the US" |
    educ_1 == "Equivalent to completing a college degree in the US" |
    educ_1 == "Equivalent to completing a graduate degree in the US"))
  i_2 = which((educ_2 == "Equivalent to completing two years of college in the US" |
    educ_2 == "Equivalent to completing a college degree in the US" |
    educ_2 == "Equivalent to completing a graduate degree in the US"))

  job_1 = resampled_df[, "FeatJob"]
  job_2 = resampled_df[, "FeatJob_2"]
  levs = levels(job_1)
  # take out computer programmer, doctor, financial analyst, and research scientist
  r_levs = levs[-c(2,4,5, 9)]

  job_1 = sample(r_levs, size = n, replace = TRUE)
  job_1[i_1] = sample(levs, size = length(i_1), replace = TRUE)

  job_2 = sample(r_levs, size = n, replace = TRUE)
  job_2[i_2] = sample(levs, size = length(i_2), replace = TRUE)

  resampled_df[, "FeatJob"] = job_1
  resampled_df[, "FeatJob_2"] = job_2

  resampled_df[colnames(resampled_df)] = lapply(resampled_df[colnames(resampled_df)], factor )

  return(resampled_df)
}
This resampling function takes the data (x) as an input and, given the indexes of the left and right profile attributes, returns a completely new resampled data frame containing all the left and right attributes, with the same number of rows as x. As stated above, because testing the carryover effect requires resampling all the attributes B times, we must supply all the manually resampled data frames. To do this, we store them in a list of length B, where each element contains one resampled data frame of all the left and right attributes. We store this list in own_resamples.
carryover_df = immigrationdata
own_resamples = list()
B = 100
for (i in 1:B) {
  newdf = resample_func_immigration(carryover_df, left_idx = 1:9, right_idx = 10:18, seed = i)
  own_resamples[[i]] = newdf
}
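Before passing a list of resamples to the test, a quick consistency check can catch mistakes in a custom resampling function. The sketch below builds toy resamples from a small, hypothetical reference data frame and checks that each one has the same dimensions and column names as the reference; it is an illustration, not a check performed by the package.

```r
# Hypothetical reference data with two factor columns.
reference <- data.frame(
  a = factor(c("x", "y", "x")),
  b = factor(c("u", "u", "v"))
)

# Three toy resamples drawn uniformly from each column's levels.
resamples <- lapply(1:3, function(i) {
  set.seed(i)
  data.frame(
    a = factor(sample(levels(reference$a), nrow(reference), replace = TRUE)),
    b = factor(sample(levels(reference$b), nrow(reference), replace = TRUE))
  )
})

# TRUE for each resample whose dimensions and column names match.
same_shape <- vapply(resamples, function(r) {
  identical(dim(r), dim(reference)) &&
    identical(colnames(r), colnames(reference))
}, logical(1))
all(same_shape)
```

For the immigration example, the reference would be the left and right attribute columns of the data rather than this toy frame.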
Lastly, the carryover test requires a column that indicates the task evaluation number for each row. In the immigration experiment, each respondent rated five tasks and the data is sorted by respondent and task, i.e., the first five rows are the five tasks for the first respondent. Consequently, we can define a new column, task, that cycles through 1 to 5, and then run the main function. NOTE: it is important that the task variable has no missing tasks, e.g., a respondent who only rated four tasks while all other respondents rated five. If there is such a respondent, please drop them before using this function.
J = 5
carryover_df$task = rep(1:J, nrow(carryover_df)/J)

carryover_test = CRT_carryovereffect(formula = form, data = carryover_df, left = left,
  right = right, task = "task", supplyown_resamples = own_resamples, B = B)
carryover_test$p_val
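The requirement that no respondent has missing tasks can be checked when a respondent identifier is available. The sketch below assumes a hypothetical respondent column in a toy data frame and drops any respondent who did not complete exactly J tasks.

```r
# Toy data: respondent 2 completed only two of the J = 3 tasks.
toy <- data.frame(respondent = c(1, 1, 1, 2, 2, 3, 3, 3))
J <- 3

# Count the rows per respondent and flag those without exactly J tasks.
tasks_per_respondent <- table(toy$respondent)
incomplete <- names(tasks_per_respondent)[as.vector(tasks_per_respondent) != J]

# Keep only respondents with a complete set of tasks.
complete_df <- toy[!(toy$respondent %in% incomplete), , drop = FALSE]
nrow(complete_df)
```

After dropping incomplete respondents, the number of rows is a multiple of J, which is what the rep(1:J, ...) construction relies on.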
Lastly, to test the fatigue effect, we only need to additionally specify the respondent index. Analogous to the task column, we repeat each respondent index from 1 to 200 five times in a row and store the result in a new column called respondent.
fatigue_df = immigrationdata
fatigue_df$task = rep(1:J, nrow(fatigue_df)/J)
fatigue_df$respondent = rep(1:(nrow(fatigue_df)/J), each = J)

fatigue_test = CRT_fatigueeffect(formula = form, data = fatigue_df, left = left,
  right = right, task = "task", respondent = "respondent", B = 100)
fatigue_test$p_val
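The two index columns differ only in how rep is called. The toy sketch below, with three hypothetical respondents and five tasks each, shows the resulting pattern.

```r
J <- 5        # tasks per respondent
n_resp <- 3   # hypothetical number of respondents

# Task index cycles 1..J once per respondent.
task <- rep(1:J, n_resp)

# Respondent index repeats each value J times in a row.
respondent <- rep(1:n_resp, each = J)

head(cbind(respondent, task), 6)
```

This pattern is only valid when the rows are sorted by respondent and then by task, as described above for the immigration data.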
An optional argument for all CRT functions in this package is the num_cores input, which currently defaults to 2 cores. Although 2 cores may be sufficient for exploratory work with \(B = 100\), we recommend a much higher value of \(B\) for the final reported \(p\)-values. A typical Mac laptop will only support up to num_cores = 4, which may be unsatisfactory for researchers. This section is not applicable to researchers with their own computing cluster. For those with no easy access to a computing cluster, this section describes how to run our functions on the powerful computing clusters provided by Amazon Web Services (AWS).
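To see how many cores a given machine offers before choosing num_cores, one option is detectCores from the base parallel package. The sketch below leaves one core free for the rest of the system, which is a common habit rather than a package requirement.

```r
# Number of cores reported by the operating system (may be NA).
available <- parallel::detectCores()

# Leave one core free; fall back to a single core if detection fails.
num_cores <- if (is.na(available)) 1L else max(1L, available - 1L)
num_cores
```

The resulting value could then be passed to the num_cores input of any of the CRT functions.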
We will leverage the RStudio Server in AWS maintained and provided by: https://www.louisaslett.com/RStudio_AMI/. We will now list the steps needed to use RStudio on AWS.
We recommend that users write all the necessary code before starting on AWS. Since users are charged hourly, it is best to be able to run the code directly. With 50 cores, the runtime for a typical conjoint experiment should not exceed 5-10 minutes for a single \(p\)-value with \(B = 2000\), so we do not expect any user to incur substantial fees. For any questions, please do not hesitate to contact me at: daewoongham at g dot harvard dot edu