• Type all your answers in a single MS Word document. Copy and paste the relevant Excel output reports, graphs, and/or tables into the same document. You need not retype the questions, but label your answers with the question numbers, e.g., 1)______
• Please keep the Excel file containing all of your analyses and results.
• You will be graded on the completeness and correctness of your answers. Simply submitting the Excel output, tables, and/or plots without any explanation will not be sufficient.
Paul Raymond, a math-savvy baseball manager, would like to apply his knowledge of statistics to develop a multiple regression model that forecasts the performance of starting pitchers for this baseball season. He intends to build the model from last season's baseball statistics. He understands that the initial variable selection is the most important aspect of developing a regression model: without good predictor variables, he won’t end up with a useful prediction equation.
Paul spent a considerable amount of time downloading last season's statistics for starting pitchers. He decided to include only starting pitchers who had pitched at least 100 innings during that season. The data for the 138 pitchers selected are presented in the Excel file “DS312Fall21 Excel assignment#2_data.xlsx”.
Paul decided that Earned Run Average (ERA) is the best indicator of performance, so he wants to develop a regression model to predict this variable. He chose the following six potential predictor variables:
WHIP: Number of walks plus hits given up per inning pitched
CMD: Command of pitches, the ratio strikeouts/walks
K/9: How many batters a pitcher strikes out per game (nine innings pitched)
HR/9: Opposition home runs per game (nine innings pitched)
OBA: Opposition batting average
THROWS: Right-handed pitcher (1) or left-handed pitcher (0)
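The rate statistics above can be illustrated with a short Python sketch. It assumes a pitcher's raw season counts are available; the function name `pitching_rates`, its parameter names, and the sample season line are hypothetical and are not taken from the data file. ERA is computed with its standard definition (earned runs allowed per nine innings pitched).

```python
def pitching_rates(innings, hits, walks, strikeouts, home_runs, earned_runs, at_bats):
    """Compute ERA and the rate-based predictors from raw season counts.

    All argument names are illustrative; the assignment's data file already
    contains the computed rates.
    """
    return {
        "ERA": 9 * earned_runs / innings,   # earned runs per nine innings
        "WHIP": (walks + hits) / innings,   # walks plus hits per inning pitched
        "CMD": strikeouts / walks,          # strikeout-to-walk ratio
        "K/9": 9 * strikeouts / innings,    # strikeouts per nine innings
        "HR/9": 9 * home_runs / innings,    # home runs allowed per nine innings
        "OBA": hits / at_bats,              # opposition batting average
    }

# Made-up season line, for illustration only.
rates = pitching_rates(innings=180, hits=162, walks=54, strikeouts=180,
                       home_runs=20, earned_runs=70, at_bats=675)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

THROWS is the only predictor that is not a rate; it is a 0/1 indicator (dummy) variable.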
QUESTIONS & TASKS:
1. Examine the correlation matrix. Which predictor variables are highly correlated with ERA? Are there potential correlations among the predictor variables? Explain in detail.
2. Based on the results from part 1, which predictor variable(s) should be excluded from the model? Use the remaining predictor variables to develop a multiple regression model to predict Earned Run Average (ERA). Present the Excel summary report and do the following tasks:
2.1. Conduct the t-test (using α = 0.01) to determine which predictor variables, if any, are significant. Provide the details of the test and comment on the results.
2.2. Conduct the F-test (using α = 0.01) to test the overall significance of the model. Provide the details of the test and comment on the results.
2.3. Comment on the values of R² and adjusted R².
3. Based on the results from part 2, if we want to have a simplified regression model with fewer predictor variables, which variables should be dropped? Explain clearly why.
4. Based on your conclusion from part 3, develop the simplified regression model with the predictor variables you decide to keep. Evaluate the model by conducting the t-test and F-test. Comment on the values of R² and adjusted R². Comment on how good this model is compared to the model in part 2.
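For readers who want to cross-check the quantities in the Excel summary report, the following Python sketch shows how the correlation coefficient, the coefficient t statistics, the overall F statistic, R², and adjusted R² are computed for a full and a simplified model. It is only an illustration under stated assumptions: the data are synthetic (generated with made-up coefficients, not read from the assignment file), the helper names `pearson`, `invert`, and `ols` are hypothetical, and only the Python standard library is used, so the statistics must still be compared with critical values from a t or F table (or Excel's T.INV.2T and F.INV.RT functions) at α = 0.01.

```python
import math
import random

def pearson(x, y):
    """Sample correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def invert(m):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    a = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        p = a[col][col]
        a[col] = [v / p for v in a[col]]
        for r in range(n):
            if r != col:
                f = a[r][col]
                a[r] = [v - f * w for v, w in zip(a[r], a[col])]
    return [row[n:] for row in a]

def ols(columns, y):
    """Least-squares fit of y on an intercept plus the given predictor columns."""
    n, k = len(y), len(columns)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = k + 1
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)] for a in range(p)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    inv = invert(xtx)
    beta = [sum(inv[a][b] * xty[b] for b in range(p)) for a in range(p)]
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in X]
    ybar = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # unexplained variation
    sst = sum((yi - ybar) ** 2 for yi in y)                 # total variation
    mse = sse / (n - k - 1)
    r2 = 1 - sse / sst
    return {
        "beta": beta,
        "t": [b / math.sqrt(mse * inv[j][j]) for j, b in enumerate(beta)],
        "F": ((sst - sse) / k) / mse,
        "r2": r2,
        "adj_r2": 1 - (1 - r2) * (n - 1) / (n - k - 1),
        "df": (k, n - k - 1),
    }

# Synthetic data: 138 made-up pitchers; the coefficients below are assumptions.
random.seed(0)
n = 138
whip = [random.gauss(1.3, 0.15) for _ in range(n)]
hr9 = [random.gauss(1.0, 0.3) for _ in range(n)]
throws = [float(random.random() < 0.7) for _ in range(n)]  # unrelated to ERA here
era = [-2.0 + 4.0 * w + 1.0 * h + random.gauss(0, 0.3) for w, h in zip(whip, hr9)]

print("corr(WHIP, ERA) = %.3f" % pearson(whip, era))
full = ols([whip, hr9, throws], era)
reduced = ols([whip, hr9], era)
for name, fit in (("full", full), ("reduced", reduced)):
    print(name, "R2=%.3f adjR2=%.3f F=%.1f df=%s"
          % (fit["r2"], fit["adj_r2"], fit["F"], fit["df"]))
    print("  t statistics (intercept first):", [round(t, 2) for t in fit["t"]])
```

Dropping a predictor can never raise R², so the comparison between the two models is usually made on adjusted R², which penalizes each extra predictor through the (n - 1)/(n - k - 1) factor.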