Heating
from the mlogit
package contains the data in R
format. The observations consist of single-family houses in California that were newly built and had central air-conditioning. The choice is among heating systems. Five types of systems are considered to have been possible:gc
),gr
),ec
),er
),hp
).There are 900 observations with the following variables:
idcase
gives the observation number (1-900),depvar
identifies the chosen alternative (gc
, gr
, ec
, er
, hp
),ic.alt
is the installation cost for the 5 alternatives,oc.alt
is the annual operating cost for the 5 alternatives,income
is the annual income of the household,agehed
is the age of the household head,rooms
is the number of rooms in the house,region
a factor with levels ncostl
(northern coastal region), scostl
(southern coastal region), mountn
(mountain region), valley
(central valley region).Note that the attributes of the alternatives, namely, installation cost and operating cost, take a different value for each alternative. Therefore, there are 5 installation costs (one for each of the 5 systems) and 5 operating costs. To estimate the logit model, the researcher needs data on the attributes of all the alternatives, not just the attributes for the chosen alternative. For example, it is not sufficient for the researcher to determine how much was paid for the system that was actually installed (ie., the bill for the installation). The researcher needs to determine how much it would have cost to install each of the systems if they had been installed. The importance of costs in the choice process (i.e., the coefficients of installation and operating costs) is determined through comparison of the costs of the chosen system with the costs of the non-chosen systems.
For these data, the costs were calculated as the amount the system would cost if it were installed in the house, given the characteristics of the house (such as size), the price of gas and electricity in the house location, and the weather conditions in the area (which determine the necessary capacity of the system and the amount it will be run.) These cost are conditional on the house having central air-conditioning. (That's why the installation cost of gas central is lower than that for gas room: the central system can use the air-conditioning ducts that have been installed.)
In a logit model, each variable takes a different value in each alternative. So, in our case, for example, we want to know the coefficient of installation cost in the logit model of system choice. The variable installation cost in the model actually consists of five variables in the dataset: ic.gc
, ic.gr
, ic.ec
, ic.er
and ic.hp
, for the installation costs of the five systems. In the current code, there are two variables in the logit model. The first variable is called ic
for installation cost. This variable consists of five variables in the dataset: ic.gc
in the first alternative, ic.gr
in the second alternative, etc.
library("mlogit")
data("Heating", package = "mlogit")
H <- dfidx(Heating, choice = "depvar", varying = c(3:12))
m <- mlogit(depvar ~ ic + oc | 0, H)
summary(m)
##
## Call:
## mlogit(formula = depvar ~ ic + oc | 0, data = H, method = "nr")
##
## Frequencies of alternatives:choice
## ec er gc gr hp
## 0.071111 0.093333 0.636667 0.143333 0.055556
##
## nr method
## 4 iterations, 0h:0m:0s
## g'(-H)^-1g = 1.56E-07
## gradient close to zero
##
## Coefficients :
## Estimate Std. Error z-value Pr(>|z|)
## ic -0.00623187 0.00035277 -17.665 < 2.2e-16 ***
## oc -0.00458008 0.00032216 -14.217 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-Likelihood: -1095.2
Yes, they are negative as expected, meaning that as the cost of a system rises (and the costs of the other systems remain the same) the probability of that system being chosen falls.
Yes, the t-statistics are greater than 1.96, which is the critical level for 95% confidence level.
apply(fitted(m, outcome = FALSE), 2, mean)
## ec er gc gr hp
## 0.10413057 0.05141477 0.51695653 0.24030898 0.08718915
Not very well. 63.67% of the sample chose
gc
(as shown at the top of the summary) and yet the estimated model gives an average probability of only 51.695%. The other alternatives are also fairly poorly predicted. We will find how to fix this problem in one of the models below.
\[ U = \beta_{ic} ic + \beta_{oc} oc \]
\[ dU = \beta_{ic} dic + \beta_{oc} doc = 0 \Rightarrow -\frac{dic}{doc}\mid_{dU=0}=\frac{\beta_{oc}}{\beta_{ic}} \]
coef(m)["oc"]/coef(m)["ic"]
## oc
## 0.7349453
The model implies that the decision-maker is willing to pay $.73 (ie., 73 cents) in higher installation cost in order to reduce annual operating costs by $1.
A $1 reduction in annual operating costs recurs each year. It is unreasonable to think that the decision-maker is only willing to pay only 73 cents as a one-time payment in return for a $1/year stream of saving. This unreasonable implication is another reason (along with the inaccurate average probabilities) to believe this model is not so good. We will find below how the model can be improved.
\(U=a LC\) where \(LC\) is lifecycle cost, equal to the sum of installation cost and the present value of operating costs: \(LC=IC+(1/r)OC\). Substituting, we have \(U=aIC + (a/r)OC\).
The models estimates \(a\) as \(-0.00623\) and \(a/r\) as \(-0.00457\). So \(r = a/(a/r)=-.000623/.00457 = 1.36\) or 136% discount rate. This is not reasonable, because it is far too high.
To impose this constraint, we create a lifecycle cost that embodies the constraint \(lcc=ic+oc/0.12\) and estimate the model with this variable.
H$lcc <- H$ic + H$oc / 0.12
mlcc <- mlogit(depvar ~ lcc | 0, H)
library("lmtest")
lrtest(m, mlcc)
## Likelihood ratio test
##
## Model 1: depvar ~ ic + oc | 0
## Model 2: depvar ~ lcc | 0
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 2 -1095.2
## 2 1 -1248.7 -1 306.93 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
qchisq(0.05, df = 1, lower.tail = FALSE)
## [1] 3.841459
We perform a likelihood ratio test. The \(\ln L\) for this constrained model is \(-1248.7\). The \(\ln L\) for the unconstrained model is \(-1095.2\). The test statistic is twice the difference in \(\ln L\): \(2(1248.7-1095.2)=307\). This test is for one restriction (ie a restiction on the relation of the coefficient of operating cost to that of installation cost.) We therefore compare \(307\) with the critical value of chi-squared with \(1\) degree of freedom. This critical value for 95% confidence is \(3.8\). Since the statistic exceeds the critical value, we reject the hypothesis that \(r=0.12\).
hp
to 0.mc <- mlogit(depvar ~ ic + oc, H, reflevel = 'hp')
summary(mc)
##
## Call:
## mlogit(formula = depvar ~ ic + oc, data = H, reflevel = "hp",
## method = "nr")
##
## Frequencies of alternatives:choice
## hp ec er gc gr
## 0.055556 0.071111 0.093333 0.636667 0.143333
##
## nr method
## 6 iterations, 0h:0m:0s
## g'(-H)^-1g = 9.58E-06
## successive function values within tolerance limits
##
## Coefficients :
## Estimate Std. Error z-value Pr(>|z|)
## (Intercept):ec 1.65884594 0.44841936 3.6993 0.0002162 ***
## (Intercept):er 1.85343697 0.36195509 5.1206 3.045e-07 ***
## (Intercept):gc 1.71097930 0.22674214 7.5459 4.485e-14 ***
## (Intercept):gr 0.30826328 0.20659222 1.4921 0.1356640
## ic -0.00153315 0.00062086 -2.4694 0.0135333 *
## oc -0.00699637 0.00155408 -4.5019 6.734e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-Likelihood: -1008.2
## McFadden R^2: 0.013691
## Likelihood ratio test : chisq = 27.99 (p.value = 8.3572e-07)
apply(fitted(mc, outcome = FALSE), 2, mean)
## hp ec er gc gr
## 0.05555556 0.07111111 0.09333333 0.63666667 0.14333333
Note that they match exactly: alternative-specific constants in a logit model insure that the average probabilities equal the observed shares.
wtp <- coef(mc)["oc"] / coef(mc)["ic"]
wtp
## oc
## 4.563385
r <- 1 / wtp
r
## oc
## 0.2191356
The decision-maker is willing to pay $4.56 for a $1 year stream of savings. This implies \(r = 0.22\). The decision-maker applies a 22% discount rate. These results are certainly more reasonable that in the previous model. The decision-maker is still estimated to be valuing saving somewhat less than would seem rational (ie applying a higher discount rate than seems reasonable). However, we need to remember that the decision-maker here is the builder. If home buyers were perfectly informed, then the builder would adopt the buyer's discount rate. However, the builder would adopt a higher discount rate if home buyers were not perfectly informed about (or believed) the stream of saving.
ec
-er
-gc
-gr
, with the constant for alternative hp
normalized to zero. Suppose you had included constants for alternatives ec
-er
-gc
-hp
, with the constant for alternative gr
normalized to zero. What would be the estimated coefficient of the constant for alternative gc
? Figure this out logically rather than actually estimating the model.We know that when the hp is left out, the constant for alternative
gc
is \(1.71074\) meaning that the average impact of unicluded factors is \(1.71074\) higher for alternativegc
than for alternative hp. Similarly, the constant for alternativegr
is \(0.30777\). Ifgr
were left out instead ofhp
, then all the constants would be relative to alternativegr
. The constant for alternativegc
would the be \(1.71074-.30777=1.40297\). That is, the average impact of unincluded factors is \(1.40297\) higher for altgc
than altgr
. Similarly for the other alternatives. Note the the constant for alt 5 would be \(0-.30777=-.3077\), sincehp
is normalized to zero in the model withhp
left out.
update(mc, reflevel = "gr")
##
## Call:
## mlogit(formula = depvar ~ ic + oc, data = H, reflevel = "gr", method = "nr")
##
## Coefficients:
## (Intercept):ec (Intercept):er (Intercept):gc (Intercept):hp
## 1.3505827 1.5451737 1.4027160 -0.3082633
## ic oc
## -0.0015332 -0.0069964
mi <- mlogit(depvar ~ oc + I(ic / income), H)
summary(mi)
##
## Call:
## mlogit(formula = depvar ~ oc + I(ic/income), data = H, method = "nr")
##
## Frequencies of alternatives:choice
## ec er gc gr hp
## 0.071111 0.093333 0.636667 0.143333 0.055556
##
## nr method
## 6 iterations, 0h:0m:0s
## g'(-H)^-1g = 1.03E-05
## successive function values within tolerance limits
##
## Coefficients :
## Estimate Std. Error z-value Pr(>|z|)
## (Intercept):er 0.0639934 0.1944893 0.3290 0.742131
## (Intercept):gc 0.0563481 0.4650251 0.1212 0.903555
## (Intercept):gr -1.4653063 0.5033845 -2.9109 0.003604 **
## (Intercept):hp -1.8700773 0.4364248 -4.2850 1.827e-05 ***
## oc -0.0071066 0.0015518 -4.5797 4.657e-06 ***
## I(ic/income) -0.0027658 0.0018944 -1.4600 0.144298
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-Likelihood: -1010.2
## McFadden R^2: 0.011765
## Likelihood ratio test : chisq = 24.052 (p.value = 5.9854e-06)
The model seems to get worse. The \(\ln L\) is lower (more negative) and the coefficient on installation cost becomes insignificant (t-stat below 2).
mi2 <- mlogit(depvar ~ oc + ic | income, H, reflevel = "hp")
The model implies that as income rises, the probability of heat pump rises relative to all the others (since income in the heat pump alt is normalized to zero, and the others enter with negative signs such that they are lower than that for heat pumps. Also, as income rises, the probability of gas room drops relative to the other non-heat-pump systems (since it is most negative).
Do these income terms enter significantly? No. It seems that income doesn't really have an effect. Maybe this is because income is for the family that lives in the house, whereas the builder made decision of which system to install.
lrtest(mc, mi2)
## Likelihood ratio test
##
## Model 1: depvar ~ ic + oc
## Model 2: depvar ~ oc + ic | income
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 6 -1008.2
## 2 10 -1005.9 4 4.6803 0.3217
waldtest(mc, mi2)
## Wald test
##
## Model 1: depvar ~ ic + oc
## Model 2: depvar ~ oc + ic | income
## Res.Df Df Chisq Pr(>Chisq)
## 1 894
## 2 890 4 4.6456 0.3256
scoretest(mc, mi2)
##
## score test
##
## data: depvar ~ oc + ic | income
## chisq = 4.6761, df = 4, p-value = 0.3222
## alternative hypothesis: unconstrained model
I'm not going to give what I consider my best model: your ideas on what's best are what matter here.
X <- model.matrix(mc)
alt <- idx(mc, 2)
chid <- idx(mc, 1)
eXb <- as.numeric(exp(X %*% coef(mc)))
SeXb <- tapply(eXb, chid, sum)
P <- eXb / SeXb[chid]
P <- matrix(P, ncol = 5, byrow = TRUE)
head(P)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.05107444 0.07035738 0.6329116 0.1877416 0.05791494
## [2,] 0.04849337 0.06420595 0.6644519 0.1558322 0.06701658
## [3,] 0.07440281 0.08716904 0.6387765 0.1439919 0.05565974
## [4,] 0.07264503 0.11879833 0.5657376 0.1879231 0.05489595
## [5,] 0.09223575 0.10238514 0.5670663 0.1561227 0.08219005
## [6,] 0.09228184 0.10466584 0.6366615 0.1152634 0.05112739
apply(P, 2, mean)
## [1] 0.07111111 0.09333333 0.63666666 0.14333334 0.05555556
This can be computed much more simply using the \Rf{fitted
function, with the \Ra{outcome
{fittedargument set to \Rv{FALSE
so that the probabilities for all the alternatives (and not only the chosen one) is returned.
apply(fitted(mc, outcome = FALSE), 2, mean)
## hp ec er gc gr
## 0.05555556 0.07111111 0.09333333 0.63666667 0.14333333
Hn <- H
Hn[idx(Hn, 2) == "hp", "ic"] <- 0.9 * Hn[idx(Hn, 2) == "hp", "ic"]
apply(predict(mc, newdata = Hn), 2, mean)
## hp ec er gc gr
## 0.06446230 0.07045486 0.09247026 0.63064443 0.14196814
We estimate the model with the actual costs. Then we change the costs and calculate probabilities with the new costs. The average probability is the predicted share for an alternative. At the original costs, the heat pump share is \(0.0555\) (ie, about 5.5%) This share is predicted to rise to \(0.0645\) (about 6.5%) when rebates are given.
ec
. We want to predict the potential market penetration of this technology. Note that there are now six alternatives: the original five alternatives plus this new one. Calculate the probability and predict the market share (i.e., the average probability) for all six alternatives, using the model that is estimated on the original five alternatives. (Be sure to use the original installation cost for heat pumps, rather than the reduced cost in exercise 7.) What is the predicted market share for the new technology? From which of the original five systems does the new technology draw the most customers?X <- model.matrix(mc)
Xn <- X[idx(mc, 2) == "ec",]
Xn[, "ic"] <- Xn[, "ic"] + 200
Xn[, "oc"] <- Xn[, "oc"] * 0.75
unchid <- unique(idx(mc, 1))
rownames(Xn) <- paste(unchid, 'new', sep = ".")
chidb <- c(chid, unchid)
X <- rbind(X, Xn)
X <- X[order(chidb), ]
eXb <- as.numeric(exp(X %*% coef(mc)))
SeXb <- as.numeric(tapply(eXb, sort(chidb), sum))
P <- eXb / SeXb[sort(chidb)]
P <- matrix(P, ncol = 6, byrow = TRUE)
apply(P, 2, mean)
## [1] 0.06311578 0.08347713 0.57145108 0.12855080 0.04977350
## [6] 0.10363170
The new technology captures a market share of 0.1036. That is, it gets slightly more than ten percent of the market.
It draws the same percent (about 10%) from each system. This means that it draws the most in absolute terms from the most popular system, gas central. For example, gas central drops from to \(0.637\) to \(0.571\); this is an absolute drop of \(0.637-0.571=0.065\) and a percent drop of \(0.065/0.637\) about 10%. Of the 10.36% market share that is attained by the new technology, 6.5% of it comes from gas central. The other systems drop by about the same percent, which is less in absolute terms.
The same percent drop for all systems is a consequence of the IIA property of logit. To me, this property seems unreasonable in this application. The new technology is a type of electric system. It seems reasonable that it would draw more from other electric systems than from gas systems. Models like nested logit, probit, and mixed logit allow more flexible, and in this case, more realistic substitution patterns.