Given the recent rise of big data, there continues to be growing interest in the field of data science. One of the most basic, yet most useful tools for a data scientist is the linear regression model. Let's walk through the basics behind simple linear regression: a statistical model used to study the relationship between two variables.

In basic terms, simple linear regression is a modeling technique that enables us to summarize the trend between two variables. The adjective "simple" is not meant to demean linear regression models; it's a reference to the fact that simple linear regression only involves the analysis of two variables. One variable, often referred to as the predictor, explanatory, or independent variable, is denoted x. The other variable, commonly known as the response, outcome, or dependent variable, is denoted y. These terms are interchangeable; however, their use varies across different fields of study. You should note that there are also "multiple" linear regression models that involve the analysis of multiple predictor variables; however, we will not cover that topic in this blog post.

The easiest way to describe a simple linear regression model is in graphical terms. Suppose that we have a standard Cartesian graph plotted with several (x_i, y_i) data points. In building a simple linear regression model, we seek to determine the "best fitting line": the most appropriate line through the data. The equation for this best fitting line is defined as follows:

ŷ_i = β₀ + β₁x_i

where x_i denotes the observed predictor value for experimental unit i, y_i denotes the observed response value for experimental unit i, ŷ_i is the predicted response or "fitted" value for experimental unit i, and β₀ and β₁ are the coefficients of the best fitting line. More specifically, β₀ represents the y-intercept of the best fitting line and β₁ represents the slope of the best fitting line. Also note that an "experimental unit" is simply the object or person on which the specific measurement is made.

Remember that a best fitting line leads us to a predicted response value, ŷ_i, and predictions are not always entirely accurate. This leads to "prediction error" (also known as "residual error"), defined as follows for each experimental unit i:

e_i = y_i − ŷ_i

The line of best fit will be the one for which the n prediction errors (where n represents the total number of experimental units or data points) are as small as possible in some overall sense. For example, the "least squares criterion" is a common approach, which says that the regression coefficients β₀ and β₁ that define the best fitting line are those that minimize the sum of the squared prediction errors. It can be difficult to solve for these coefficients by hand, yet luckily most programming languages such as R will do it for us!

Example: Iris Dataset

Any student interested in studying data science should be exposed at some point to the Iris dataset: it's a built-in dataset in R that contains measurements (in centimeters) on four attributes (i.e., sepal length, sepal width, petal length, and petal width) for fifty flowers from each of three different species. Figure 1, included below, displays the first six rows of this dataset.

Graphing the petal length (cm) versus the petal width (cm) of all 150 experimental units included in the dataset results in the scatterplot shown below as Figure 2. Note that this plot includes the best fitting line, which you may recall can be obtained by finding the β₀ and β₁ coefficients that minimize the sum of the 150 squared prediction errors. The equation of this line of best fit is ŷ_i = 1.08 + 2.23x_i, where ŷ_i represents the predicted petal length of experimental unit i and x_i represents the observed petal width of experimental unit i. Note that all R code used to produce these two figures is included in Appendix A.

The plot included above as Figure 2 provides us with several useful insights. Based on this plot, there appears to be a positive linear association between the petal length and the petal width of Iris flowers (i.e., as the width of a flower's petal increases, its length increases as well). Using the coefficients from the line of best fit, we can make even more precise inferences. The slope coefficient, β₁, tells us that for each additional centimeter of petal width, the predicted petal length increases by approximately 2.23 cm. The intercept coefficient, β₀, tells us that if the petal width of an Iris flower is 0 cm, then we'd expect its petal length to be approximately 1.08 cm. Note that in some instances this coefficient does not prove to be very meaningful; such is the case here, since it would be impractical for the petal width of a flower to be 0 cm.
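To make the least squares discussion concrete, here is a minimal R sketch (not the Appendix A code itself) that fits this regression on the built-in iris dataset two ways: with the standard closed-form least squares formulas, and with R's lm() function.

```r
# Minimal sketch (not the Appendix A code): regress petal length
# on petal width using R's built-in iris dataset.

# Closed-form least squares estimates:
#   B1 = cov(x, y) / var(x),  B0 = mean(y) - B1 * mean(x)
x <- iris$Petal.Width
y <- iris$Petal.Length
b1 <- cov(x, y) / var(x)
b0 <- mean(y) - b1 * mean(x)
c(b0, b1)  # approximately 1.08 and 2.23

# The same coefficients via lm(), which minimizes the sum of the
# 150 squared prediction errors for us
fit <- lm(Petal.Length ~ Petal.Width, data = iris)
coef(fit)  # (Intercept) ≈ 1.08, Petal.Width ≈ 2.23

# Predicted petal length for a flower with petal width 1.5 cm:
# 1.08 + 2.23 * 1.5, i.e. roughly 4.43 cm
predict(fit, newdata = data.frame(Petal.Width = 1.5))
```

The prediction errors themselves are available as resid(fit), and sum(resid(fit)^2) is exactly the quantity the least squares criterion minimizes.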