Instructions
Objective
Write a Python assignment that finds the best polynomial degree for a regression model.
Requirements and Specifications
Source Code
# STUDENT NAME, STUDENT NUMBER ( TO BE FILLED BY THE STUDENT )
# Advanced Data Analysis - Assignment 2
This notebook contains the **Assignment 2** of the Advanced Data Analysis course.
The topic of the assignment consists in performing linear regression on National Health and Nutrition Examination data.
### DEADLINE: 10-October-2021
The assignment is **individual**. You should submit your resolution on Moodle by the deadline. While doing this assignment, you can use or adapt any code from the lectures if you want.
Students have three grace days that they can use across all assignments and the group project, which allow them to deliver work late. Use these grace days carefully.
[//]: # ( We will be using LaTeX for formulas )
### Notebook Instructions
* You only need to deliver this notebook file ( note that a notebook file has the extension .ipynb ). Data files must not be submitted
* You don't need to create additional cells. Try to use the ones that are already available
* The notebook should be delivered with the outputs already available
# Dataset
The file children.csv contains two columns. The first column is the age of each child in
months, and the second the weight in Kg. The data is from the National Health and Nutrition Examination
Survey of 2017-2018 and represents a sample of children up to 24 months old.
The following code loads the children.csv file
# This code cell does not need to be changed
import os
import pandas as pd
#dataFileName = os.path.join( "../assignment2", "children.csv" )
dataFileName = "children.csv"
dataDF = pd.read_csv( dataFileName )
dataDF.head( )
# Assignment
In this assignment, we aim to predict the weight of children up to 24 months old based on their age.
from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures
import matplotlib.pyplot as plt
import os
from sklearn.metrics import mean_squared_error, r2_score
x = dataDF[['age']]
y = dataDF[['weight']]
## Question 1
In this question, we aim to identify the best polynomial degree.
### **1.a )** Find the best polynomial degree from 1 to 12 ( 6 points out of 20 ).
# Solve question here. Add a Markdown cell after this cell if you want to add some comment on your solution.
r2_scores = []
rmse_vals = []
min_rmse = 1e10
min_rmse_dg = -1
for nd in range( 1, 13 ): # from 1 to 12
    poly = PolynomialFeatures( degree = nd )
    X = poly.fit_transform( x )
    # Fit values
    model = linear_model.LinearRegression( )
    model.fit( X, y )
    y_new = model.predict( X )
    # Calculate RMS error
    rmse = ( mean_squared_error( y, y_new ) )**( 1/2 )
    r2_val = r2_score( y, y_new )
    print( f"The RMS error for a degree of {nd} is {rmse} and the R2 value is {r2_val}" )
    r2_scores.append( r2_val )
    rmse_vals.append( rmse )
    if rmse < min_rmse:
        min_rmse = rmse
        min_rmse_dg = nd
print( f"The min RMSE obtained was of {min_rmse} and it was for a polynomial of degree: {min_rmse_dg}" )
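So, from the RMS errors, the best fit for the data appears to be a polynomial of degree 10. This degree also gives the highest R2 coefficient. Note that both scores are computed on the same data used to fit the model, which tends to favor higher degrees, so this choice should be confirmed on held-out data.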
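The loop in 1.a builds its inputs with PolynomialFeatures, which expands a single feature x into the columns 1, x, x^2, and so on up to the chosen degree. The following minimal sketch shows that expansion on two made-up ages (illustrative values, not taken from children.csv):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two hypothetical ages in months (illustrative values, not from children.csv)
ages = np.array([[2.0], [3.0]])

# For one input feature, degree=3 produces the columns [1, x, x^2, x^3]
poly = PolynomialFeatures(degree=3)
expanded = poly.fit_transform(ages)
print(expanded)
# [[ 1.  2.  4.  8.]
#  [ 1.  3.  9. 27.]]
```

A linear regression fitted on these columns is therefore a polynomial in the original feature, which is why the degree can be tuned without changing the regression model itself.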
### **1.b )** Plot the results obtained ( for each degree the score obtained ) ( 2 points out of 20 ).
# Solve question here.
fig, axes = plt.subplots( nrows = 1, ncols = 2, figsize=( 8,8 ) )
axes[0].plot( range( 1, 13 ), rmse_vals )
axes[0].set_xlabel( 'Polynomial Degree' )
axes[0].set_ylabel( 'RMS Error' )
axes[0].grid( True )
axes[1].plot( range( 1,13 ), r2_scores )
axes[1].set_xlabel( 'Polynomial Degree' )
axes[1].set_ylabel( 'R2' )
axes[1].grid( True )
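### **1.c )** Why is the k-fold cross-validation approach important to evaluate the performance of predictive models? ( 1 point out of 20 )
K-fold cross-validation estimates how well a model performs on data it was not trained on. The dataset is split into k folds; the model is trained on k-1 folds and validated on the remaining one, and this is repeated so that every fold is used once for validation. Averaging the k scores gives a more reliable estimate than a single train/test split, which is especially useful when the dataset is small. It also helps to choose settings such as the polynomial degree, so that the selected model has good validation accuracy instead of only fitting the training data.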
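As an illustration of the procedure described in 1.c, here is a minimal sketch of choosing a polynomial degree with 5-fold cross-validation. It uses synthetic data as a stand-in for children.csv, and the names `age`, `weight`, `cv_rmse` and `best_degree`, the degree range and the number of folds are all assumptions made for the example:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the age/weight data (assumption: not the real survey values)
rng = np.random.default_rng(0)
age = rng.uniform(0, 24, size=200).reshape(-1, 1)
weight = 3.5 + 0.6 * age.ravel() - 0.012 * age.ravel() ** 2 + rng.normal(0, 0.8, size=200)

# The same shuffled 5-fold split is reused for every degree
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_rmse = {}
for degree in range(1, 7):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    # scikit-learn reports negated MSE so that larger scores are always better
    neg_mse = cross_val_score(model, age, weight, cv=cv, scoring="neg_mean_squared_error")
    cv_rmse[degree] = float(np.sqrt(-neg_mse.mean()))

best_degree = min(cv_rmse, key=cv_rmse.get)
print(cv_rmse)
print("Degree with the lowest cross-validated RMSE:", best_degree)
```

Because each score comes from folds the model did not see during fitting, a needlessly high degree is no longer rewarded automatically, unlike the training-set comparison.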
## Question 2 ( 10 points out of 20 )
Here, we aim to build a model to predict the weight of children based on their age.
### **2.a )** Using the best degree found, find the coefficients of the best curve ( 4 points out of 20 ).
# Solve question here. Add a Markdown cell after this cell if you want to add some comment on your solution.
# The coefficients are:
poly = PolynomialFeatures( degree = min_rmse_dg )
X = poly.fit_transform( x )
# Fit values
model = linear_model.LinearRegression( )
model.fit( X, y )
print( model.coef_ )
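To read the printed coefficients, it may help to see how they map back to a polynomial. The sketch below is a hedged illustration on noise-free synthetic data generated from a known quadratic (the names `x_demo`, `y_demo` and the degree are assumptions). With the default settings, the constant term is stored in `intercept_`, and the coefficient of the constant column added by PolynomialFeatures comes out as zero:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Noise-free synthetic data from a known quadratic: y = 4 + 0.5*x - 0.01*x^2
x_demo = np.linspace(0, 24, 50).reshape(-1, 1)
y_demo = 4.0 + 0.5 * x_demo.ravel() - 0.01 * x_demo.ravel() ** 2

poly = PolynomialFeatures(degree=2)
X_demo = poly.fit_transform(x_demo)
model = LinearRegression().fit(X_demo, y_demo)

# The first column of X_demo is the constant 1, so its coefficient is 0 here;
# the constant term is carried by intercept_ instead.
print("intercept:", model.intercept_)
print("coefficients:", model.coef_)

# Evaluate the polynomial by hand at x = 12 and compare with predict()
x0 = 12.0
manual = model.intercept_ + model.coef_[1] * x0 + model.coef_[2] * x0 ** 2
predicted = model.predict(poly.transform([[x0]]))[0]
print(manual, predicted)
```

In the notebook, y is a one-column DataFrame, so `model.coef_` has shape ( 1, degree + 1 ) rather than being one-dimensional as in this sketch.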
### **2.b )** Plot the train and test set and the model computed ( 3 points out of 20 )
# Solve question here. Add a Markdown cell after this cell if you want to add some comment on your solution.
# First, split into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( x, y, test_size = 0.3 ) # we use 70% of the dataset for training
# Now, create model again
poly = PolynomialFeatures( degree = min_rmse_dg )
X_train_transform = poly.fit_transform( X_train )
X_test_transform = poly.transform( X_test ) # reuse the transformer fitted on the training data
# Fit values
model = linear_model.LinearRegression( )
model.fit( X_train_transform, y_train )
# Plot train and test data
plt.figure( )
plt.scatter( X_train, y_train, label = 'Train Data', color = 'lightcoral' )
plt.scatter( X_test, y_test, label = 'Test Data', color = 'steelblue' )
# Plot the model predictions on the training inputs
y_predict = model.predict( X_train_transform )
plt.scatter( X_train, y_predict, label = 'Model', color = 'black' )
plt.legend( )
plt.grid( True )
plt.xlabel( 'Age' )
plt.ylabel( 'Weight' )
plt.show( )
### **2.c )** What is the mean squared error ( MSE ) on the test set? ( 1 point out of 20 )
# Solve question here. Add a Markdown cell after this cell if you want to add some comment on your solution.
# Predict the values in the test set
y_predict = model.predict( X_test_transform )
# Calculate the MSE and, for comparison with 1.a, the RMS error
mse = mean_squared_error( y_test, y_predict )
rmse = mse**( 1/2 )
print( f"The MSE in the test set is: {mse} ( RMSE: {rmse} )" )
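Since question 2.c asks for the MSE while question 1.a reports the RMS error, a small worked example of how the two relate may be useful. The values below are hypothetical weights in kg chosen only to make the arithmetic easy to follow:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical observed and predicted weights in kg (illustrative values only)
y_true = np.array([4.0, 6.0, 8.0, 10.0])
y_pred = np.array([5.0, 6.0, 7.0, 12.0])

# Squared errors are 1, 0, 1 and 4, so the mean is 6 / 4 = 1.5
mse_by_hand = np.mean((y_true - y_pred) ** 2)
mse = mean_squared_error(y_true, y_pred)

# The RMSE is the square root of the MSE and is expressed in kg, like the target
rmse = np.sqrt(mse)
print(mse_by_hand, mse, rmse)
```

The MSE is in squared units ( kg^2 ), so the RMSE is usually easier to interpret next to the data.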
## Question 3
### **3.a )** In the plot made in 2.b ), also represent the uncertainty of the model using different shades for the 95% and 99% confidence intervals ( 3 points out of 20 ). Discuss the results achieved.
import numpy as np
# Plot train and test data
plt.figure( )
y_predict = model.predict( X_train_transform )
error = np.abs( y_predict - y_train.values )
# Use a separate name so the feature DataFrame x is not overwritten
x_train_vals = [row[0] for row in X_train.values]
y2 = [y_predict[i][0] + error[i][0] for i in range( len( y_train ) )]
y1 = [y_predict[i][0] - error[i][0] for i in range( len( y_train ) )]
plt.scatter( x_train_vals, y1, label = 'Prediction - |error|' )
plt.scatter( x_train_vals, y2, label = 'Prediction + |error|' )
plt.scatter( X_train, y_predict, label = 'Model', color = 'black' )
#plt.fill_between( x_train_vals, y1, y2,
# alpha=0.5, edgecolor='#CC4F1B', facecolor='#FF9848' )
# Plot the model
plt.legend( )
plt.grid( True )
plt.xlabel( 'Age' )
plt.ylabel( 'Weight' )
plt.show( )
From the results obtained in figure 2.b ), we can see that the model fits well, since the trend is centered on the training and test data. Each point of the fitted curve lies approximately at the mean of the observed weights for that age. This suggests that the model is not overfitting: there are several y values for each x value, and the model predicts a single y value per x rather than trying to reproduce every oscillation in the data. Also, the RMSE obtained in part 2.c ) is relatively low, which indicates that the predictions are close to the observed values.
# Solve question here. Add a Markdown cell after this cell if you want to add some comment on your solution.
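One possible way to draw shaded 95% and 99% levels, as asked in 3.a, is sketched below. This is not the method used in the cell above: it is an approximation that assumes normally distributed residuals with constant spread and draws bands of plus or minus z times the residual standard deviation around the fitted curve. The data are synthetic, and `degree = 3` and all variable names are assumptions made for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the age/weight data (assumption: not the real survey values)
rng = np.random.default_rng(1)
age = rng.uniform(0, 24, size=150)
weight = 3.5 + 0.6 * age - 0.012 * age ** 2 + rng.normal(0, 0.8, size=150)

degree = 3  # hypothetical choice, used only for this sketch
poly = PolynomialFeatures(degree=degree)
X = poly.fit_transform(age.reshape(-1, 1))
model = LinearRegression().fit(X, weight)

# Residual standard deviation with n - (degree + 1) degrees of freedom
residuals = weight - model.predict(X)
sigma = np.sqrt(np.sum(residuals ** 2) / (len(age) - (degree + 1)))

# Predict on a sorted grid so the curve and the bands are drawn left to right
grid = np.linspace(age.min(), age.max(), 200)
y_grid = model.predict(poly.transform(grid.reshape(-1, 1)))

fig, ax = plt.subplots()
ax.scatter(age, weight, s=10, color='steelblue', label='Data')
ax.plot(grid, y_grid, color='black', label='Model')
# Standard normal z-values for two-sided 99% and 95% coverage
for z, level, alpha in [(2.576, '99%', 0.2), (1.96, '95%', 0.35)]:
    ax.fill_between(grid, y_grid - z * sigma, y_grid + z * sigma,
                    color='darkorange', alpha=alpha, label=f'{level} band')
ax.set_xlabel('Age')
ax.set_ylabel('Weight')
ax.legend()

# Sanity check: fraction of training points that fall inside the 95% band
coverage_95 = float(np.mean(np.abs(residuals) <= 1.96 * sigma))
print(sigma, coverage_95)
```

These bands describe where individual observations are expected to fall; a confidence interval for the fitted mean curve would be narrower and would also depend on the position along the x axis.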
y_predict.shape
x
X_train