Question
Math 204 Assignment 3

Deadline: 11.59pm, Thursday 2nd April, 2015

Submit via myCourses

Comp and Circumstance

A McGill statistics professor (name withheld for legal reasons) is being sued by a number of former

study participants. The volunteers are demanding compensation for being forced to follow a strict

diet of poutine, hamburgers, or fish and chips, drink large amounts of coffee and – worst of all – attend

statistics lectures. Each volunteer is claiming a different amount of compensation, with the figure

appearing to vary depending on which diet they were subscribed to, how much coffee they drank,

and how many statistics lectures they were forced to attend. Despite the seriousness of the situation,

the professor is determined to have a fun time, and so is interested in using statistical methods to

analyze these data.

In the zip file “Assignment3Files” you will find datasets with file names of the form xxxxxxxxx.csv.

You should find the dataset with your McGill ID number as the file name, and only use that particular

file for this assignment. If you can’t find your ID number in the list email me as soon as possible!

Note that each file will contain a unique dataset so make sure you use the correct one!

Your dataset contains five variables, with each row corresponding to one ‘volunteer’. The variables

are as follows:

• comp: the compensation being sought (in Canadian dollars).

• diet: the diet the subject was forced to follow, coded as “CAN”, “USA”, and “UK” for poutine,

hamburgers, and fish and chips, respectively.

• coff: the amount of coffee the subject was forced to drink (in millilitres).

• lect: the number of statistics lectures the subject was forced to attend.

• mood: the subject’s reported mood after completion of their diet, coded as “:)” and “:(” for

‘happy’ and ‘unhappy’, respectively.

For the purposes of this assignment, you may treat comp, coff and lect as continuous variables,

while diet and mood are categorical. You should use these data to answer the questions that follow.

You may assume that all necessary model assumptions are satisfied.
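As a rough illustration of the data layout described above, the following Python sketch parses a file with these five columns using only the standard library. The header row, the column order, and the three sample rows are assumptions made for illustration; they are not taken from any real dataset.

```python
import csv
import io

# Hypothetical sample in the layout described above (header row assumed).
SAMPLE = """comp,diet,coff,lect,mood
1520.50,CAN,750,12,:(
980.00,UK,300,5,:)
2210.75,USA,1200,20,:(
"""

def read_volunteers(handle):
    """Parse rows, converting the continuous variables to numbers."""
    rows = []
    for record in csv.DictReader(handle):
        rows.append({
            "comp": float(record["comp"]),
            "diet": record["diet"],        # categorical: CAN, USA or UK
            "coff": float(record["coff"]),
            "lect": int(record["lect"]),
            "mood": record["mood"],        # categorical: :) or :(
        })
    return rows

volunteers = read_volunteers(io.StringIO(SAMPLE))
print(len(volunteers), volunteers[0]["diet"])
```

To read a real file instead, the same function could be given a handle from open("123456789.csv", newline=""), where the file name is a hypothetical ID number.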

Late work will be penalized by loss of marks, and should be submitted to me by email.

The time of your submission will be determined by my McGill email client (rounded to the nearest

minute). Work submitted from 0000 to 0159 (inclusive) on Friday 3rd April will lose 1 mark. Work

submitted from 0200 to 0359 (inclusive) will lose 2 marks. This pattern will continue, with 1 mark

lost for every additional 2 hours, until your grade is a non-positive integer.
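The penalty rule above can be sketched as a small Python function. This is an informal reading of the rule, not an official calculator: it assumes the submission time has already been rounded to the nearest minute, and that the grade is floored at zero once the penalty exceeds it. The function names are hypothetical.

```python
from datetime import datetime

# Deadline from the assignment: 23:59 on Thursday 2 April 2015.
DEADLINE = datetime(2015, 4, 2, 23, 59)
MIDNIGHT = datetime(2015, 4, 3, 0, 0)

def marks_lost(submitted):
    """Marks deducted for a submission time already rounded to the minute."""
    if submitted <= DEADLINE:
        return 0
    minutes_late = int((submitted - MIDNIGHT).total_seconds() // 60)
    # 0000-0159 costs 1 mark, 0200-0359 costs 2, and so on.
    return minutes_late // 120 + 1

def final_grade(raw_marks, submitted):
    # Assumption: the grade is floored at zero once the penalty exceeds it.
    return max(0, raw_marks - marks_lost(submitted))

print(marks_lost(datetime(2015, 4, 3, 2, 0)))
```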

Part I: 13 marks

(a) Fit a linear regression model that supposes mean compensation is a linear combination of diet

followed, coffee drunk and lectures attended (without any interaction terms). Summarize the results

of your model in a manner similar to the following table. Your answer must clearly indicate each

parameter estimate and the p-value associated with a test of the null hypothesis that the true

parameter value is equal to zero against the alternative that it is not equal to zero. Note that your

table may vary slightly depending on what is chosen to act as the ‘baseline’ diet, and that some

marks are available for presentation. [4 marks]

Parameter     Estimate   p-value
Intercept
Diet (UK)
Diet (USA)
Coffee
Lectures
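For reference, the model in part (a) can be written out as follows. This is a sketch that assumes the Canadian diet is the baseline, as the table layout suggests; the symbols β and x are notation introduced here and may differ from the notation used in class.

```latex
\[
E(\text{comp}) = \beta_0 + \beta_1 x_{\text{UK}} + \beta_2 x_{\text{USA}}
               + \beta_3\,\text{coff} + \beta_4\,\text{lect},
\qquad
x_{\text{UK}} =
\begin{cases}
1 & \text{if diet is UK} \\
0 & \text{otherwise}
\end{cases}
\]
```

Here x_USA is defined in the same way for the USA diet, so β1 and β2 measure differences in mean compensation from the baseline diet with coffee and lectures held fixed.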

(b) According to your model, discuss how predicted compensation varies depending on diet. [3 marks]

Next, consider the following four models, where compensation is modeled as a linear combination of

the corresponding terms. Note that Model 1 is the model fit in part (a).

• Model 1: Diet, coffee, and lectures.

• Model 2: Diet.

• Model 3: Diet and lectures.

• Model 4: Diet, coffee, lectures, and an interaction between diet and coffee.

(c) Use backward selection based on the ANOVA F-test to choose the best model of these four. You

should summarize your results in a similar manner to the table below, making it clear at each stage

which models are being compared, which is the ‘complete’ model, and which is the ‘reduced’ model

(as per the terminology used in class), the sum of squared errors for both models, and the resulting

F statistic and p-value. Your table should be followed by a sentence stating clearly which

model is recommended by the procedure. [6 marks]

Step   Complete Model   Reduced Model   SSE_C   SSE_R   F   p-value
1
2
3
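The F statistic at each step can be computed from the two sums of squared errors. The sketch below uses the standard nested-model F-test, which is assumed to match the version taught in class; the function name and the numbers are hypothetical and for illustration only.

```python
def nested_f_statistic(sse_complete, sse_reduced, n, k_complete, k_reduced):
    """F statistic for comparing a reduced model nested in a complete model.

    k_complete and k_reduced count regression parameters, excluding the
    intercept. Returns the statistic with its two degrees of freedom.
    """
    extra_terms = k_complete - k_reduced   # numerator degrees of freedom
    error_df = n - (k_complete + 1)        # denominator degrees of freedom
    numerator = (sse_reduced - sse_complete) / extra_terms
    denominator = sse_complete / error_df
    return numerator / denominator, extra_terms, error_df

# Hypothetical numbers, for illustration only.
f_stat, df1, df2 = nested_f_statistic(
    sse_complete=100.0, sse_reduced=120.0, n=30, k_complete=4, k_reduced=3
)
print(f_stat, df1, df2)
```

The p-value is the upper-tail probability of an F distribution with (df1, df2) degrees of freedom. The standard library has no F distribution, so that step would need statistical software; one option is scipy.stats.f.sf if SciPy is available.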

Warning: this isn’t the end of the assignment – more statistical fun over the page!

Part II: 7 marks

In preparing his defense, the professor decides to investigate whether a volunteer’s mood was independent

of their diet. He hopes that, if he finds evidence against the null hypothesis of independence

between these variables, he can persuade the academic community (and a jury) that his research is

worthwhile and not, as one of the volunteers put it, “the work of a man driven insane by the power

of having a PhD in statistics”.

(d) Produce a table like the one below of the observed counts of mood and diet, along with row and

column totals. [2 marks]

              Diet
Mood     CAN   UK   USA   Total
:)
:(
Total
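A table of this shape can be tallied with the standard library, as in the Python sketch below. The (mood, diet) pairs are made up for illustration and do not come from any dataset.

```python
from collections import Counter

# Hypothetical (mood, diet) pairs, for illustration only.
observations = [
    (":)", "CAN"), (":(", "CAN"), (":)", "UK"),
    (":(", "UK"), (":(", "USA"), (":)", "CAN"),
]

counts = Counter(observations)   # missing cells count as zero
moods = [":)", ":("]
diets = ["CAN", "UK", "USA"]

row_totals = {m: sum(counts[(m, d)] for d in diets) for m in moods}
col_totals = {d: sum(counts[(m, d)] for m in moods) for d in diets}
grand_total = sum(row_totals.values())

# Print one row per mood, then the column totals.
for m in moods:
    print(m, [counts[(m, d)] for d in diets], row_totals[m])
print("Total", [col_totals[d] for d in diets], grand_total)
```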

(e) If the null hypothesis were true, how many of the subjects who followed the Canadian diet would

you expect to be happy? Give your answer to 1 decimal place. (You do not need to show your

working – a correct answer is worth full marks – but if your final answer is wrong you could be

awarded 1 mark for appropriate working.) [2 marks]
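Under independence, the expected count for a cell is its row total multiplied by its column total, divided by the overall total. A minimal Python sketch with hypothetical totals:

```python
def expected_count(row_total, column_total, grand_total):
    """Expected cell count under independence of the row and column variables."""
    return row_total * column_total / grand_total

# Hypothetical totals: 40 happy subjects, 30 on the Canadian diet, 90 overall.
print(round(expected_count(40, 30, 90), 1))
```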

(f) Carry out an appropriate Chi-squared test of the null hypothesis that mood and diet are independent,

summarizing the results of your analysis in 100 words or fewer. Note that your answer should

be written in complete sentences, such as for the conclusion of a report. (Hyphenated words, such

as p-value, count as one word, as do any numbers. If you exceed the word limit only the first 100

words of your answer will be marked.) [3 marks]
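The mechanics of the test can be sketched in Python as below, using a made-up table of counts. A table with 2 moods and 3 diets has (2 - 1)(3 - 1) = 2 degrees of freedom, and for exactly 2 degrees of freedom the upper-tail probability of the chi-squared distribution is exp(-x/2), so no statistical library is needed here. The function name is hypothetical.

```python
import math

def chi_squared_independence(table):
    """Pearson chi-squared statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Hypothetical 2 x 3 table of counts (rows: moods, columns: diets).
table = [[10, 20, 30],
         [20, 20, 20]]
stat, df = chi_squared_independence(table)

# With exactly 2 degrees of freedom the upper-tail probability is exp(-x/2).
p_value = math.exp(-stat / 2) if df == 2 else None
print(round(stat, 3), df, p_value)
```

For other table sizes the closed form does not apply, and the p-value would have to come from chi-squared tables or statistical software.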