Possum classification. The common brushtail possum of the Australia region is a bit cuter than its distant cousin, the American opossum. We consider 104 brushtail possums from two regions in Australia, where the possums may be considered a random sample from the population. The first region is Victoria, which is in the eastern half of Australia and traverses the southern coast. The second region consists of New South Wales and Queensland, which make up eastern and northeastern Australia. We use logistic regression to differentiate between possums in these two regions. The outcome variable, called population, takes value 1 when a possum is from Victoria and 0 when it is from New South Wales or Queensland. We consider five predictors: sex_male (an indicator for a possum being male), head_length, skull_width, total_length, and tail_length. Each variable is summarized in a histogram. The full logistic regression model and a reduced model after variable selection are summarized in the table.
[Figure: six histograms, each with frequency on the vertical axis: sex_male (0 = female, 1 = male); head_length (in mm); skull_width (in mm); total_length (in cm); tail_length (in cm); population (0 = not Victoria, 1 = Victoria).]
                Full Model                              Reduced Model
                Estimate        SE         Z  Pr(>|Z|)  Estimate        SE         Z  Pr(>|Z|)
(Intercept)      39.2349   11.5368      3.40    0.0007   33.5095    9.9053      3.38    0.0007
sex_male         -1.2376    0.6662     -1.86    0.0632   -1.4207    0.6457     -2.20    0.0278
head_length      -0.1601    0.1386     -1.16    0.2480
skull_width      -0.2012    0.1327     -1.52    0.1294   -0.2787    0.1226     -2.27    0.0231
total_length      0.6488    0.1531      4.24    0.0000    0.5687    0.1322      4.30    0.0000
tail_length      -1.8708    0.3741     -5.00    0.0000   -1.8057    0.3599     -5.02    0.0000
(a) Examine each of the predictors. Are there any outliers that are likely to have a very large influence on the logistic regression model?
(b) The summary table for the full model indicates that at least one variable should be eliminated when using the p-value approach for variable selection: head length. The second component of the table summarizes the reduced model following variable selection. Explain why the remaining estimates change between the two models.
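The p-value approach mentioned in part (b) can be sketched numerically. The snippet below is only an illustration, assuming the usual Wald test in which Z = Estimate / SE and the p-value is the two-sided tail area of a standard normal; the names `full_model` and `wald_test` are made up for this sketch, and the numbers are copied from the full-model column of the table.

```python
import math

# Estimates and standard errors copied from the full-model column of the table
# (the intercept is left out because it is not a candidate for removal).
full_model = {
    "sex_male": (-1.2376, 0.6662),
    "head_length": (-0.1601, 0.1386),
    "skull_width": (-0.2012, 0.1327),
    "total_length": (0.6488, 0.1531),
    "tail_length": (-1.8708, 0.3741),
}


def wald_test(estimate, se):
    """Return the Z statistic and its two-sided normal p-value."""
    z = estimate / se
    p = math.erfc(abs(z) / math.sqrt(2))  # same as 2 * (1 - Phi(|z|))
    return z, p


results = {name: wald_test(est, se) for name, (est, se) in full_model.items()}
for name, (z, p) in results.items():
    print(f"{name:13s} Z = {z:6.2f}  p = {p:.4f}")

# The predictor with the largest p-value is the first candidate for removal.
weakest = max(results, key=lambda name: results[name][1])
print("largest p-value:", weakest)
```

The recomputed Z and p-values should agree with the table up to rounding, and the largest p-value belongs to head_length, matching the variable dropped in the reduced model. The sketch does not refit the model, so it cannot show how the remaining estimates change; that would require the original possum data.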
Possum classification, Part II. A logistic regression model was proposed for classifying common brushtail possums into their two regions in Exercise 9.15. The outcome variable took value 1 if the possum was from Victoria and 0 otherwise.
                Estimate        SE         Z  Pr(>|Z|)
(Intercept)      33.5095    9.9053      3.38    0.0007
sex_male         -1.4207    0.6457     -2.20    0.0278
skull_width      -0.2787    0.1226     -2.27    0.0231
total_length      0.5687    0.1322      4.30    0.0000
tail_length      -1.8057    0.3599     -5.02    0.0000
(a) Write out the form of the model. Also identify which of the variables are positively associated with a possum being from Victoria when controlling for the other variables.
(b) Suppose we see a brushtail possum at a zoo in the US, and a sign says the possum had been captured in the wild in Australia, but it doesn’t say which part of Australia. However, the sign does indicate that the possum is male, its skull is about 63 mm wide, its tail is 37 cm long, and its total length is 83 cm. What is the reduced model’s computed probability that this possum is from Victoria? How confident are you in the model’s accuracy of this probability calculation?
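The arithmetic behind part (b) can be sketched as follows. This is a minimal illustration, assuming the standard logistic regression form in which the log-odds of being from Victoria is the intercept plus the sum of each coefficient times its predictor value, and the probability is recovered with the inverse logit. The names `coef`, `possum`, and `predict_probability` are hypothetical; the coefficients come from the reduced-model table and the measurements from the zoo sign.

```python
import math

# Reduced-model coefficients copied from the table.
intercept = 33.5095
coef = {
    "sex_male": -1.4207,
    "skull_width": -0.2787,   # per mm
    "total_length": 0.5687,   # per cm
    "tail_length": -1.8057,   # per cm
}

# Measurements reported on the zoo sign.
possum = {
    "sex_male": 1,
    "skull_width": 63,
    "total_length": 83,
    "tail_length": 37,
}


def predict_probability(intercept, coef, x):
    """Return (log-odds, probability) for one observation under a logistic model."""
    log_odds = intercept + sum(coef[name] * x[name] for name in coef)
    probability = 1.0 / (1.0 + math.exp(-log_odds))  # inverse logit
    return log_odds, probability


log_odds, p_victoria = predict_probability(intercept, coef, possum)
print(f"log-odds = {log_odds:.4f}")
print(f"P(Victoria) = {p_victoria:.4f}")
```

The log-odds come out near -5.08, which corresponds to a probability of roughly 0.006. The calculation only reports what the fitted model implies; it does not address how much that number should be trusted for a possum observed outside the original sample.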