Let's Try It #2: Importing/Setting Up Data for Linear Regression
by Ashur Baroutta
When it comes to machine learning, data is queen. It dictates the flow and reliability of the end results. Using data donated to the University of California, Irvine (UCI), we will use a linear regression model to predict a student's final grade based on a number of attributes.
After installing all the software packages mentioned in the previous blog post, Let's Try It #1, we import the CSV file provided by UCI. The following is the code used to achieve this.
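As a minimal sketch of this import step, assuming pandas is among the installed packages: the helper name `load_student_data` and the file name `student-por.csv` are assumptions made for illustration, and the `sep=";"` argument reflects that the UCI files separate fields with semicolons rather than commas.

```python
import pandas as pd


def load_student_data(path_or_buffer):
    """Read the UCI student performance CSV into a DataFrame.

    `path_or_buffer` can be a file path or an open file-like object.
    """
    # The UCI student files use ";" as the field separator, not ",".
    return pd.read_csv(path_or_buffer, sep=";")
```

Something like `data = load_student_data("student-por.csv")` followed by `print(data.shape)` would then show how many rows and columns were read, which is a quick check that the separator was handled correctly.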
The student performance data set includes data for 649 students and tracks 33 attributes for each one. Attributes include school attended, address, age, class failures, health, and absences, among others. This is what the data set looks like before we narrow it down to the specific attributes we will train on.
While it is organized, this is too much information to account for, and some of the attributes aren't in integer form. We could convert each of those answers to an integer, but for the purpose of this project we will focus on data already in integer form.
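One way to see which attributes are already in integer form is to check each column's dtype. This is only a sketch, assuming the data has been loaded into a pandas DataFrame; `integer_columns` is a hypothetical helper name.

```python
import pandas as pd


def integer_columns(frame):
    """Return the names of the columns that hold integer values."""
    return [
        name
        for name in frame.columns
        if pd.api.types.is_integer_dtype(frame[name])
    ]
```

Text answers such as the school name are read as strings, so they are left out of the returned list.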
The attribute we will be predicting is the final grade, G3 (#32 in the data set). The attributes we will train on to predict it are G1 (first-period grade), G2 (second-period grade), studytime, failures in previous classes, and absence count. The reasoning behind these choices: grade trends intuitively seem like a strong place to start, studytime reflects time spent covering class material, failures indicate work ethic or the quality of time spent, and absences suggest overall exposure to class material and dedication to academics.
Now that the attributes to train on have been selected, we want to trim the data set to cut out excess noise and use fewer resources. To do that, we reassign our data variable to a version of the data set that contains only the attributes mentioned above. The following is the code to do so.
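A sketch of the trimming step, assuming the data lives in a pandas DataFrame and uses the UCI column names `G1`, `G2`, `G3`, `studytime`, `failures`, and `absences`. The names `SELECTED_COLUMNS` and `trim_to_selected` are made up for this example.

```python
import pandas as pd

# The five attributes we train on, plus G3, the final grade we predict.
SELECTED_COLUMNS = ["G1", "G2", "G3", "studytime", "failures", "absences"]


def trim_to_selected(frame, columns=SELECTED_COLUMNS):
    """Return a copy of `frame` that keeps only the selected columns."""
    # .copy() keeps later changes from touching the full data set.
    return frame[list(columns)].copy()
```

After `data = trim_to_selected(data)`, calling `print(data.head())` shows the first five students with only these six columns.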
If we now print the first 5 students of the trimmed data set, you can see it's much easier to parse through.
Before any prediction is made, we will split the data into separate sections so that the regression model isn't just training on the entire data set. If it were, any prediction we tested against that same data would almost always be 100% accurate or close to it, since the model would already "know" all the information available. For this exercise, the portion of the data the model will learn on is going to be 15%.
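The split can be sketched as follows, assuming scikit-learn is installed and the trimmed data is a pandas DataFrame with `G3` as the label column. The function name, the `train_fraction` parameter, and the fixed `seed` are assumptions for illustration. The sketch follows the text literally and passes 15% as the training portion; if the 15% is instead meant as the held-out portion, pass `test_size=0.15` to `train_test_split` in place of `train_size`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_features_and_label(frame, label="G3", train_fraction=0.15, seed=0):
    """Split `frame` into training and testing arrays.

    Returns X_train, X_test, y_train, y_test, in that order.
    """
    # Everything except the label column is an input feature.
    X = frame.drop(columns=[label]).to_numpy()
    # The label column is the value we want to predict.
    y = frame[label].to_numpy()
    # Rows are shuffled before splitting; the seed makes the split repeatable.
    return train_test_split(X, y, train_size=train_fraction, random_state=seed)
```

The model is then fit on `X_train` and `y_train` only, and its accuracy is measured on `X_test` and `y_test`, which it has never seen.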
Now that the data is set up the way we want, we can start training and making predictions. In the next blog post we're going to explore just that, to see what prediction accuracy we can attain given the parameters above.
Cited
https://archive.ics.uci.edu/ml/datasets/Student+Performance