# R studio

ANLY 500 Laboratory #2 – Predictive Statistics

Evans Chapter 8 through and including Chapter 12

“Performance Lawn Equipment Case Study” from Evans, Business Analytics

Introduction .................................................................................................................................................. 1

Chapter 8 ....................................................................................................................................................... 2

Part 1: Defects after Delivery .................................................................................................................... 3

Step 1 ........................................................................................................................................................ 3

Step 2 ........................................................................................................................................................ 5

Part 2: Defects after Delivery Continued .................................................................................................. 6

Step 1 ........................................................................................................................................................ 6

Part 3: Employee Retention .................................................................................................................... 10

Step 1 ...................................................................................................................................................... 10

Part 4: Learning Associated with Adopting New Technology ................................................................. 14

Step 1 ...................................................................................................................................................... 14

Chapter 9 ..................................................................................................................................................... 18

Part 1: Mower Sales ................................................................................................................................ 18

Step 1 ...................................................................................................................................................... 18

Part 3: Production Costs ......................................................................................................................... 23

Step 1 ...................................................................................................................................................... 24

Introduction This laboratory follows the exercises in the book, specifically the Performance Lawn Equipment Case Study homework assigned exercises Chapters 8 through and including 12, except this laboratory requires that you use R to complete the exercises. That is, you should answer all questions in the textbook exercises and complete computations using R. Each laboratory in ANLY 500 will build on the laboratories you have completed before. So, you will want to continue to keep your work in the folder you set-up for Laboratory #1 so that you can refer

Page 2 of 25

back to previous laboratories if necessary. You will also continue to use the data files for ANLY 500 on Moodle.

Chapter 8 In reviewing the PLE data, Elizabeth Burke noticed that defects received from suppliers have decreased (DefectsAfterDelivery.csv data file). Upon investigation, she learned that in 2010, PLE experienced some quality problems due to an increasing number of defects in materials received from suppliers. The company instituted an initiative in August 2011 to work with suppliers to reduce these defects, to more closely coordinate deliveries, and to improve materials quality through reengineering supplier production policies. Elizabeth noted that the program appeared to reverse an increasing trend in defects; she would like to predict what might have happened had the supplier initiative not been implemented and how the number of defects might further be reduced in the near future.

In meeting with PLE’s human resources director, Elizabeth also discovered a concern about the high rate of turnover in its field service staff. Senior managers have suggested that the department look closer at its recruiting policies, particularly to try to identify the characteristics of individuals that lead to greater retention. However, in a recent staff meeting, HR managers could not agree on these characteristics. Some argued that years of education and grade point averages were good predictors. Others argued that hiring more mature applicants would lead to greater retention. To study these factors, the staff agreed to conduct a statistical study to determine the effect that years of education, college grade point average, and age when hired have on retention. A sample of 40 field service engineers hired 10 years ago was selected to determine the influence of these variables on how long each individual stayed with the company. Data are compiled in the Employee Retention worksheet.

Finally, as part of its efforts to remain competitive, PLE tries to keep up with the latest in production technology. This is especially important in the highly competitive lawn-mower line, where competitors can gain a real advantage if they develop more cost-effective means of production. The lawn-mower division therefore spends a great deal of effort in testing new technology. When new production technology is introduced, firms often experience learning, resulting in a gradual decrease in the time required to produce successive units. Generally, the rate of improvement declines until the production time levels off. One example is the production of a new design for lawn-mower engines.

To determine the time required to produce these engines, PLE produced 50 units on its production line; test results are given on the worksheet Engines in the database. Because PLE is

Page 3 of 25

continually developing new technology, understanding the rate of learning can be useful in estimating future production costs without having to run extensive prototype trials, and Elizabeth would like a better handle on this. Use techniques of regression analysis to assist her in evaluating the data in these three worksheets and reaching useful conclusions. Summarize your work in a formal report with all appropriate results and analyses.

Part 1: Defects after Delivery

Step 1 As we found before, most of the work is getting the data into the proper format to analyze. This will still be true, even for performing regression analyses. So, the first thing to do to look at what the numbers of defects would have been if nothing had changed is to get a month number column along with that part of the DefectsAfterDelivery.csv data that is of interest. It looks like to me that the number of defects continues to rise through October 2011. However, the solutions for the textbook only go through August 2011. You should decide what you think should be included in the data of interest for this question.

To get the number of the month in a first column and the number of defects for that month in a second column I:

1. combine the years 2010 and 2011 into a vector, then remove the last two values because I’m not including November or December 2011:

> defects.old.rate <- c(DefectsAfterDelivery$X2010, DefectsAfterDelivery$X 2011) > defects.old.rate <- defects.old.rate[1:22]

2. create a month number vector for the 22 months of interest:

> month.nr <- c(1:22)

3. combine the month numbers and the number of defects into a matrix with two columns and check the first few rows:

> defects.old.rate <- matrix(c(month.nr, defects.old.rate), ncol = 2) > head(defects.old.rate) [,1] [,2] [1,] 1 812 [2,] 2 810 [3,] 3 813 [4,] 4 823 [5,] 5 832

Page 4 of 25

[6,] 6 848

4. use the lm() function to do the regression and check the output:

> defects.old.rate.mod <- lm (defects.old.rate[,2] ~ defects.old.rate[,1]) > summary(defects.old.rate.mod) Call: lm(formula = defects.old.rate[, 2] ~ defects.old.rate[, 1]) Residuals: Min 1Q Median 3Q Max -14.440 -6.678 -1.944 6.976 22.572 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 817.4156 4.0677 200.952 < 2e-16 *** defects.old.rate[, 1] 1.3354 0.3097 4.312 0.000339 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 9.216 on 20 degrees of freedom Multiple R-squared: 0.4818, Adjusted R-squared: 0.4558 F-statistic: 18.59 on 1 and 20 DF, p-value: 0.0003394

5. plot the data in a scatter plot, add the regression line and check it:

> plot(defects.old.rate[,1], defects.old.rate[,2], xlab = "Month", ylab = "Number of Defects") > abline(defects.old.rate.mod, pch=2, col="red")

Page 5 of 25

Evans output in Excel is a bit different that the summary() output from R. One of the questions I have about the solutions is why Evans ran an ANOVA on this data. You should say something in your lab report about whether you think it was necessary to do analysis of variance in this case or not.

Step 2 In performing the regression analysis we always want to check and make sure that all the assumptions are true; linearity, normality of errors, equal variance or homoscedasticity, and independence. If we plot the residuals we see we are in trouble already. The plot of the residuals shown below has a pattern that to me looks cyclical. To create this plot I wrote another short R script. Note that to use the function xyplot() you will need to attach the lattice package again, i.e. library(lattice). The scrip and resulting plot are:

xyplot(resid(defects.old.rate.mod) ~ fitted(defects.old.rate.mod), xlab="Fitted Values", ylab="Residuals", main="Residual Diagnostic Plot", panel=function(x, y, ...) { panel.grid(h=-1, v=-1) panel.abline(h = 0) panel.xyplot(x, y, ...) } )

Page 6 of 25

I believe you should see some periodicity in mower related data, more units sold in the spring, less in the fall and so on. So I think the best idea would be to use a method amenable to seasonal cycles. Evans tries to use a 3rd order polynomial to fit this data. I believe that is just wrong. So, I won’t proceed with an analysis that matches the solutions now. I will resume this analysis after the technique is presented in Chapter 9.

Part 2: Defects after Delivery Continued

Step 1 If we want to look at the trend in Defects after Delivery since the change was made, as usual, we need to put the data of interest in the proper format. Since we included the year 2010 and January through October 2011 in the first regression analysis, we’ll use November-December 2011 and the remaining years in this analysis.

I use the same commands and logic as before for the data and create a scatter plot.

Looking at the scatterplot of the data it appears that it was several months after the change was made before there was a real difference in the number of defects. This is where I see what looks like a “gap”. But, we’ll leave that alone for now. Note that this plot is considerably different that the plot in Evans’ solutions. In Evans’ solutions all the months prior to and after the change was made are included in the plot. So, in Evans’ plot you see the increasing numbers of defects prior to the change, which we already looked at. Here is what I think the plot should look like:

Page 7 of 25

Now, we can perform a linear regression for this data rather than the 3rd order polynomial used in the solutions as follows:

> defects.new.rate.mod <- lm(defects.new.rate[,2] ~ defects.new.rate[,1]) > summary(defects.new.rate.mod) Call: lm(formula = defects.new.rate[, 2] ~ defects.new.rate[, 1]) Residuals: Min 1Q Median 3Q Max -45.357 -25.383 -0.792 19.539 57.397 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 884.432 9.621 91.93 <2e-16 *** defects.new.rate[, 1] -11.538 0.430 -26.83 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 29.07 on 36 degrees of freedom Multiple R-squared: 0.9524, Adjusted R-squared: 0.951 F-statistic: 719.8 on 1 and 36 DF, p-value: < 2.2e-16

The p-values for both the intercept and slope are very, very small. So, we reject the null hypothesis that there is no change in the number of defects and accept the alternative hypothesis that the change did make a difference in the number of defects.

Page 8 of 25

Here we see that by omitting the months prior to the change we find that a linear model explains just over 95% of the variance in the data. This is a very good result. The plot with the regression line is:

What I have done is a “piecewise” solution based on the information in the case study. I think Evans was very fortunate that the 3rd order polynomial regression fit the data so well. I don’t think I would perform this analysis using that model. You should comment in your lab report on what you believe is the most correct way to approach this problem.

I used the same R script as before to plot the residuals for this analysis. And, I see the same cyclical nature in the residuals as before. In fact, if you look for it you can see the cyclical nature of the data in the scatter plot. The plot of the residuals is:

Page 9 of 25

If I were really curious I would take an engineering approach and use a Fast Fourier Transform to determine what the periodicity is of this apparently cyclical data, perhaps about every 12 months... Or, perhaps not. For our analysis let’s complete a normal probability plot of the residuals to look at normality. The R commands are:

> defects.new.rate.stdres <- rstandard(defects.new.rate.mod) > qqnorm(defects.new.rate.stdres, ylab = "Standardized Residuals", xlab = "No rmal Scores", main = "Defects after Delivery") > qqline(defects.new.rate.stdres, pch=2, col="red")

And the normal probability plot looks like:

which looks pretty good. Evans uses the Durbin-Watson test to evaluate serial dependence. This test can be done in R with the dwtest() function in the lmtest package. So, if you haven’t already installed and attached the lmtest package you’ll need to do that. Output from the Durbin-Watson test can be interpreted using the following table:

Then, perform the Durbin-Watson test as follows:

> dwtest(defects.new.rate.mod) Durbin-Watson test data: defects.new.rate.mod

Page 10 of 25

DW = 0.42875, p-value = 1.29e-10 alternative hypothesis: true autocorrelation is greater than 0

Which indicates that autocorrelation does exist.

Part 3: Employee Retention

Step 1 This is a pretty straight forward look at the variables PLE has available with regard to employee retention, i.e. gender, number of years at PLE, number of years of education, college GPA, college graduate, employee age, and if the employee is local or not. We can do several things to analyze these data. We could look at correlations. And we could use ANOVA to assess any relationships between variables.

One thing Evans did was omit the categorical variables; gender, college graduate and local, so as not to have to deal with those. Evans also tested the remaining numeric variables to determine if he could eliminate any of those, e.g. if they were not statistically significant. Let’s take a fresh look at this and see what we find.

We can convert categorical variables, e.g. gender, to numeric variables in order to conduct an analysis using them in R relatively easily as follows:

1. convert to numeric value, i.e. female = 0, male = 1

> dGender <- as.numeric(EmployeeRetention$Gender) -1 > str(dGender) num [1:40] 0 1 1 0 0 1 1 1 0 1 ...

2. Create a plot to explore how gender is related to the number of years at PLE and view it

> boxplot(dGender, EmployeeRetention$YearsPLE, main = "Years at PLE by Gen der", xlab = "Gender (0=Female, 1=Male)", ylab = "Years at PLE")

Page 11 of 25

3. look at the mean and standard deviation of the categorical variable

> mean(dGender) [1] 0.675 > sd(dGender) [1] 0.4743416

4. create a matrix of categorical variables to look at how they might be correlated

> cat.employee.retention.variables <- matrix(c(dGender, dCollege.Grad, dLo cal), ncol = 3) > colnames(cat.employee.retention.variables) <- c("Gender", "College.Grad" , "Local") > str(cat.employee.retention.variables) num [1:40, 1:3] 0 1 1 0 0 1 1 1 0 1 ... - attr(*, "dimnames")=List of 2 ..$ : NULL ..$ : chr [1:3] "Gender" "College.Grad" "Local" > head(cat.employee.retention.variables) Gender College.Grad Local [1,] 0 1 1 [2,] 1 1 1 [3,] 1 1 0 [4,] 0 1 1 [5,] 0 1 1 [6,] 1 1 1 > cor(cat.employee.retention.variables) Gender College.Grad Local Gender 1.0000000 -0.1396011 -0.0566862 College.Grad -0.1396011 1.0000000 -0.1646599 Local -0.0566862 -0.1646599 1.0000000

Page 12 of 25

It is interesting. In conducting the analysis of the categorical variables I found that there were as many male employees at PLE as there were college graduates.

> length(which (dGender == 1)) [1] 27 > length(which (dCollege.Grad == 1)) [1] 27

I found this because the boxplots were identical. Coincidence? It does make the data analysis have strange results. So, let’s look at the numeric variables.

First, let’s put the variables we want to analyze in a matrix as follows:

> num.employee.retention.variables <- matrix(c(EmployeeRetention$YearsPLE, Em ployeeRetention$YrsEducation, EmployeeRetention$College.GPA, EmployeeRetentio n$Age), ncol = 4) > colnames(num.employee.retention.variables) <- c("YearsPLE", "YrsEducation", "College.GPA", "Age") > str(num.employee.retention.variables) num [1:40, 1:4] 10 10 10 10 9.6 8.5 8.4 8.4 8.2 7.9 ... - attr(*, "dimnames")=List of 2 ..$ : NULL ..$ : chr [1:4] "YearsPLE" "YrsEducation" "College.GPA" "Age"

Then, let’s see how/if these variables are correlated as follows:

> cor(num.employee.retention.variables) YearsPLE YrsEducation College.GPA Age YearsPLE 1.0000000 0.1796535 0.1774365 0.3766582 YrsEducation 0.1796535 1.0000000 0.5866539 0.4210289 College.GPA 0.1774365 0.5866539 1.0000000 0.2485216 Age 0.3766582 0.4210289 0.2485216 1.0000000

So, it looks like college GPA is correlated with years of education. To a lesser extent, age is correlated with years of education. The book uses 0.7 as a threshold for stating variables are correlated. You should note that other references use 0.5 as a threshold. The only time two variables are truly not correlated is if the correlation coefficient is in fact 0.0. However, we are conducting the correlation in part to assess any multicollinearity. Since the correlation is low we can state that multicollinearity is not an issue.

We can just do a linear regression of these variables to determine each independent variable’s effect on the dependent variable.

Start the linear regression as follows (remembering that the dependent variable or “y” is YearsPLE):

Page 13 of 25

> EmployeeRetention.mod <- lm(EmployeeRetention$YearsPLE ~ EmployeeRetention$ YrsEducation + EmployeeRetention$College.GPA + EmployeeRetention$Age) > summary(EmployeeRetention.mod) Call: lm(formula = EmployeeRetention$YearsPLE ~ EmployeeRetention$YrsEducation + EmployeeRetention$College.GPA + EmployeeRetention$Age) Residuals: Min 1Q Median 3Q Max -5.3299 -1.6122 -0.2433 1.8893 4.6312 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -2.73711 4.50415 -0.608 0.5472 EmployeeRetention$YrsEducation -0.06705 0.35516 -0.189 0.8513 EmployeeRetention$College.GPA 0.67998 1.18355 0.575 0.5692 EmployeeRetention$Age 0.29154 0.13504 2.159 0.0376 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 2.726 on 36 degrees of freedom Multiple R-squared: 0.1502, Adjusted R-squared: 0.07939 F-statistic: 2.121 on 3 and 36 DF, p-value: 0.1146

Which shows that Age is the only statistically significant variable when it comes to employee retention at PLE. Note that this is essentially the same as Evans’ solutions. We can plot and view the residuals to verify using the R script from before (but changing the name of the linear model to EmployeeRetention.mod):

Page 14 of 25

This may be alright, but shows that the residuals do not really have an equal variance across all fitted values. But moreover, based on R-squared this really isn’t a good model for the data. If we eliminate the variables that are not significant and run it again we get:

> EmployeeRetention.mod2 <- lm(EmployeeRetention$YearsPLE ~ EmployeeRetentio n$Age) > summary(EmployeeRetention.mod2) Call: lm(formula = EmployeeRetention$YearsPLE ~ EmployeeRetention$Age) Residuals: Min 1Q Median 3Q Max -4.9927 -1.6430 -0.1952 2.0328 4.8078 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -2.0149 3.0425 -0.662 0.5118 EmployeeRetention$Age 0.3003 0.1198 2.506 0.0166 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 2.666 on 38 degrees of freedom Multiple R-squared: 0.1419, Adjusted R-squared: 0.1193 F-statistic: 6.282 on 1 and 38 DF, p-value: 0.01659

Which still shows that Age is statistically significant but explains very little of the variation in the data.

Part 4: Learning Associated with Adopting New Technology

Step 1 Just to clean things up a bit and make it easier to work with the variables, after I imported the Engines.csv data I changed the column names to just “Sample” and “Time” as follows:

> colnames(Engines) <- c("Sample", "Time")

Then, just to see what’s going on I created a scatter plot of the data as follows:

> plot(Engines$Sample, Engines$Time)

Page 15 of 25

This doesn’t look exactly like the plot of the data in Evans’ solutions. But, it doesn’t look like a linear model will fit these data very well. It really looks like a logarithmic decay. Also, when I played around with the data a bit I found a better fit by just looking at the last 30 data points rather than the entire series. See what you think below:

> plot(Engines$Sample[20:50], log(Engines$Time[20:50]))

which is nearly linear. So, let’s try that and see what we get. To transform the data we’ll use the exponential as follows:

> m <- lm(log(Engines$Time[20:50]) ~ Engines$Sample[20:50]) > summary(m) Call:

Page 16 of 25

lm(formula = log(Engines$Time[20:50]) ~ Engines$Sample[20:50]) Residuals: Min 1Q Median 3Q Max -0.0052441 -0.0016683 -0.0004296 0.0020182 0.0052949 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 3.9837148 0.0019472 2045.84 <2e-16 *** Engines$Sample[20:50] -0.0033450 0.0000539 -62.05 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.002684 on 29 degrees of freedom Multiple R-squared: 0.9925, Adjusted R-squared: 0.9923 F-statistic: 3851 on 1 and 29 DF, p-value: < 2.2e-16

We find that both the intercept and the slope are statistically significant. And, from R-squared it looks like the model explains nearly all the variation in the data. Then last, running the R script again to plot the residuals we get:

which is not too bad.

So, if we use a 2nd order fit as Evans did in the solutions:

> m <- lm(Engines$Time ~ Engines$Sample + I(Engines$Sample^2)) > summary(m) Call: lm(formula = Engines$Time ~ Engines$Sample + I(Engines$Sample^2))

Page 17 of 25

Residuals: Min 1Q Median 3Q Max -1.2814 -0.5720 -0.1081 0.6119 3.9750 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 61.8301020 0.4114946 150.26 < 2e-16 *** Engines$Sample -0.7133564 0.0372222 -19.16 < 2e-16 *** I(Engines$Sample^2) 0.0082498 0.0007076 11.66 1.8e-15 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.9314 on 47 degrees of freedom Multiple R-squared: 0.9612, Adjusted R-squared: 0.9595 F-statistic: 581.9 on 2 and 47 DF, p-value: < 2.2e-16

which looks promising. But what about the residuals?

> plot(m, 1)

which really looks bad. This is a bit different than the plot of residuals that Evans shows in the solutions but is just as bad. Honestly, I’d go with the logarithmic decay model. The residuals look the best. If you do this makes the model for learning associated with adopting this new technology

Production Time (min) = 3.99 – 0.0034*log(Unit)

Page 18 of 25

Chapter 9 An important part of planning manufacturing capacity is having a good forecast of sales. Elizabeth Burke is interested in forecasting sales of mowers and tractors in each marketing region as well as industry sales to assess future changes in market share. She also wants to forecast future increases in production costs. Develop forecasting models for these data and prepare a report of your results with appropriate charts and output from R.

Part 1: Mower Sales Plot data for mower sales in each marketing region as well as industry sales. Then, establish a method for use in predicting mower sales.

Step 1 Import and plot the mower sales. As usual, the hardest part will be getting the data in the proper format. First, I’ll get rid of the “NA” column name for North America, just to get that out of the way.

> colnames(MowerUnitSales) <- c("Month", "Year", "NorthA", "SouthA", "Europe" , "Pacific", "China", "World")

Also, we need to put the month and year together into a “Date” variable. The paste() function will work for that.

> df <- data.frame(paste(MowerUnitSales$Month,MowerUnitSales$Year), MowerUnit Sales$NorthA, MowerUnitSales$SouthA, MowerUnitSales$Europe, MowerUnitSales$Pa cific, MowerUnitSales$China, MowerUnitSales$World) > str(df) 'data.frame': 60 obs. of 7 variables: $ paste.MowerUnitSales.Month..MowerUnitSales.Year.: Factor w/ 60 levels "Apr il 2010","April 2011",..: 21 16 36 1 41 31 26 6 56 51 ... $ MowerUnitSales.NorthA : int 6000 7950 8100 9050 9900 10200 8730 8140 6480 5990 ... $ MowerUnitSales.SouthA : int 200 220 250 280 310 300 280 250 230 220 ... $ MowerUnitSales.Europe : int 720 990 1320 1650 1 590 1620 1590 1560 1590 1320 ... $ MowerUnitSales.Pacific : int 100 120 110 120 130 120 140 130 130 120 ... $ MowerUnitSales.China : int 0 0 0 0 0 0 0 0 0 0 ... $ MowerUnitSales.World : int 7020 9280 9780 1110 0 11930 12240 10740 10080 8430 7650 ...

Page 19 of 25

However, I got the column names a bit messed up putting the data frame together. That is easy to fix…

> colnames(df) <- c("Date", "NorthA", "SouthA", "Europe", "Pacific", "China", "World") > str(df) 'data.frame': 60 obs. of 7 variables: $ Date : Factor w/ 60 levels "April 2010","April 2011",..: 21 16 36 1 41 3 1 26 6 56 51 ... $ NorthA : int 6000 7950 8100 9050 9900 10200 8730 8140 6480 5990 ... $ SouthA : int 200 220 250 280 310 300 280 250 230 220 ... $ Europe : int 720 990 1320 1650 1590 1620 1590 1560 1590 1320 ... $ Pacific: int 100 120 110 120 130 120 140 130 130 120 ... $ China : int 0 0 0 0 0 0 0 0 0 0 ... $ World : int 7020 9280 9780 11100 11930 12240 10740 10080 8430 7650 ...

Now plot the mower sales. The only tricky part is getting tick marks and labels on the x-axis and a grid if you want one. First, I set up a data object for the number of tick marks I wanted every 6 months, then one for the labels.

> c = seq(1, length(Date), 6) > ticks=c("Jan 2010", "Jun 2010", "Jan 2011", "Jun 2011", "Jan 2012", "Jun 20 12", "Jan 2013", "Jun 2013", "Jan 2014", "Jun 2014")

Then, I created the plot in separate pieces; 1) plot the data, 2) add the tick marks and labels, 3) add the y-axis grid, 4) add a custom grid to match my tick marks and labels for every 6 months of data:

> plot(df$NorthA, type = "b", xlab = "", xaxt="n", ylab = "North America", ma in = "Mower Sales by Region") > axis(1, at=c, labels=ticks, las=2) > grid(NA, NULL, lty = 2) > abline(v=seq(1, length(Date), 6), lty=2, col="lightgray")

And, here’s the plot:

Page 20 of 25

You can change the colors and everything in R. For this lab you might want to just do the basic plots to get done fastest. Each would be similar to this one. In fact, you can use this same process to plot the tractor sales too. Here is the plot for the mower sales for the Pacific region:

Evans plotted the industry sales data for multiple regions on the same plot. In fact, you could do that for both PLE’s sales data and the industry sales data. There are a couple ways to do this. The way we’ve used before is the par() function, i.e. par(new=TRUE), so that we can add to the existing plot. A second way to do the same thing is just to use the lines() function to add the second “line” to the same plot. Note that I’ve imported and cleaned up the data/column headings the same way as before. Evans plotted the North America, Europe and World regions together because they show seasonality without any trend. He plotted the South America and

Page 21 of 25

Pacific regions together because they show seasonality with a trend. However, on the scales Evans used it is hard to see a trend in the South America data. It is slightly negative.

Using the technique of inserting the command par(new=TRUE) each time you want to add another curve to a plot the North America, Europe and World data commands are:

> par(oma=c(5,0,0,0)) > plot(IndustryMowerTotalSales$NorthA, type = "b", xlab="", xaxt="n", ylab="S ales", main="Industry Mower Sales") > par(new=TRUE) > plot(IndustryMowerTotalSales$Europe, type = "b", col="green", xlab="", xaxt ="n", ylab="", yaxt="n") > par(new=TRUE) > plot(IndustryMowerTotalSales$World, type = "b", col="magenta", xlab="", xax t="n", ylab="", yaxt="n") > axis(1, at=c, labels=ticks, las=2) > grid(NA, NULL, lty = 2) > abline(v=seq(1, length(Date), 6), lty=2, col="lightgray") > par(xpd=NA) > legend(x=0, y=1, legend = text, text.width = max(sapply(text, strwidth)), c ol=plot_colors, lwd = 3, cex = 1, horiz = TRUE)

par(oma=c(…)) sets the outer margin area. Here I’ve set the bottom of the plot to have 5 line0s.

par(xpd=NA) tells R to put the next thing in the outer margin area, in this case the legend. I’ve used the parameter horiz=TRUE to make the legend print along the center (x=0) bottom (y=1). You can play with the colors, the size of the lines and font, etc. It is all very flexible.

And, here is the figure:

As you can see the data are very different for each of these regions. The data for the North America region is seasonal but does not have a trend. The data for the Pacific region is

Page 22 of 25

seasonal and does have a trend. So, you’ll need to decide what method to use to analyze the data by region; e.g. seasonal with no trend = Holt-Winter no trend option, seasonal with trend = Holt-Winter’s additive model, etc.

You should use the link: http://a-little-book-of-r-for-time- series.readthedocs.io/en/latest/src/timeseries.html to study using R for analyzing and making predictions using time-series data. It is absolutely the best reference I have found. For this lab, seasonal data with no trend, I used the following:

First I set up a data object with the mower sales data from the North America region and plotted it:

> demand <- ts(MowerUnitSales[,3], start = c(2010, 1), frequency=12) > plot(demand)

Here’s the plot:

which looks like the correct data. To use Holt Winters in R takes you just:

> hw <- HoltWinters(demand)

Then, you can use this to make the forecast using:

> forecast <- predict(hw, n.ahead = 12, prediction.interval = T, level = 0.95 )

Page 23 of 25

where the number of months ahead is 12 and the significance level is .05. To plot use the following command:

> plot(hw, forecast)

and, the plot looks like:

where you have the prediction and the upper and lower bounds. Pretty simple – right? The thing to keep in mind is to use the seasonal parameter with the HoltWinters() function and set to additive or multiplicative as required. The website I’ve given you talks about moving average and exponential smoothing.

Also, when you look through the webpage I’ve given you the link for above you’ll see that a more sophisticated approach is to decompose the data into its estimated trend, its estimated seasonal component and its estimated irregular (noise) component then work on each of these. That is beyond the scope of this coursework for me to require. However, should you be so inclined... There is also extra credit available.

Part 3: Production Costs Plot data for mower and tractor production costs. Then, establish a method for use in predicting future costs. Note that the file for this is UnitProductionCosts.csv

Page 24 of 25

Step 1 You should already be able to do this. You might want to spend a few minutes to see if you can before reading on. It really only requires developing a linear model for the data using the lm() function in R and then printing out the summary. This gives you the intercept and slope of the regression line and hence the equation to use for forecasting future production costs. Always remember, this is based (obviously based) on historical data and care should be taken in making predictions of future whatever using historical data…

What I did was plot up the data to see what I had. Then, I set-up a separate data object for the number of the months, 1 through and including 60. Then I used lm() and summary() as follows for tractor production costs:

> plot(prodCosts[,2]) > trctr <- lm(prodCosts$Tractor ~ monthID) > summary(trctr) Call: lm(formula = prodCosts$Tractor ~ monthID) Residuals: Min 1Q Median 3Q Max -39.172 -7.457 5.058 15.627 24.154 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1750.0441 5.6009 312.46 <2e-16 *** monthID 6.1658 0.1597 38.61 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 21.42 on 58 degrees of freedom Multiple R-squared: 0.9626, Adjusted R-squared: 0.9619 F-statistic: 1491 on 1 and 58 DF, p-value: < 2.2e-16 > abline(trctr, col="red", lwd=2)

And, here’s the plot:

Page 25 of 25

As usual I don’t get exactly the same coefficients for the intercept and slope as Evans did but the regression line plotted on the data looks fine.

- Introduction
- Chapter 8
- Part 1: Defects after Delivery
- Step 1
- Step 2
- Part 2: Defects after Delivery Continued
- Step 1
- Part 3: Employee Retention
- Step 1
- Part 4: Learning Associated with Adopting New Technology
- Step 1
- Chapter 9
- Part 1: Mower Sales
- Step 1
- Part 3: Production Costs
- Step 1