Instructions
Objective
Write a program that builds a classification system in Python.
Requirements and Specifications
In this assignment you will solve a classification problem using Python.
The “LRS_Pre_Assessment_trimmed_rank.csv” physical fitness testing dataset is provided for use in this exercise. The data was collected from an active-duty squadron and the samples were deidentified at the point of collection. The independent variables include numeric and categorical data related to demographics, mental health surveys, fitness participation surveys, injury history surveys, physical performance measures, and body composition assessments. The dependent variable is whether or not the member passed their fitness test, and is titled APFT_1_is_pass. For this label, pass = 1, and fail = 0.
In a marked-up Jupyter notebook (*.ipynb), use the statsmodels Logit algorithm to predict the labels of the dataset:
- Break your code into logical chunks, using multiple “text” and “code” sections, similar to the examples given in class.
- Drop the “flight” column and one-hot-encode the “rank” & “gender” columns with df = pd.get_dummies(df, drop_first=True). drop_first is needed so the columns are linearly independent.
- Create a Data Understanding table using .describe() and include 3 Data Understanding visualizations such as a scatterplot, histogram, pairplot or correlation matrix.
- Based on your data understanding visualizations, perform data preparation transformations as required, such as normalization or log transform.
- In your Modeling section:
- Split your dataset into 70% train & 30% test.
- Include three variations on your modeling method. Variations could include adding, removing or transforming input variables.
- For the “best” variation, using the train dataset, create a ROC curve plot and calculate the accuracy, odds ratio, AUC, classification report and confusion matrix.
- For the “best” variation, using the test dataset, create a ROC curve plot and calculate the accuracy, odds ratio, AUC, classification report and confusion matrix.
- Include a text block related to Business/Mission Understanding:
- Review the week 2 “Binary Classification metric summary” file
- Mention what the majority class is (passing or failing the test), the % of datapoints in the majority class and whether or not the dataset is balanced.
- Discuss the penalty (if any) associated with a False Negative or False Positive
- Discuss the metrics (accuracy/f1/etc) that would be most appropriate for this problem based on the balance & penalties
- Include a summary text block:
- Your justification for the “best” variation in part 2.e.
- The contribution of the most important 2-3 input variables to your model, such as z or p test scores.
- A discussion of the performance metrics from part 2.e. Based on comparing the model performance on the train/test datasets, mention if any of the models overfit the data.
- Write in a formal writing style based on the Appendix B guidance, with the exception that references and citations are not required.
Upload your .ipynb notebook file to Canvas.
Source Code
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LogisticRegression
import seaborn as sns
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
## Load Data into a DataFrame
df = pd.read_csv('LRS_Pre_Assessment_trimmed_rank.csv')
df.head()
## Drop the 'Flight' column
df = df.drop(columns=['Flight'])
df.head()
## One-hot encode the 'Rank' and 'Gender' columns
# get_dummies encodes every remaining categorical column ('Rank' and 'Gender');
# drop_first=True keeps the dummy columns linearly independent
df = pd.get_dummies(df, drop_first=True)
df.head()
## Describe Dataset
### Show statistical description of the dataset
df.describe()
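The Business/Mission Understanding text block asks for the majority class and the percentage of datapoints it contains. A quick way to check the class balance of the label (a minimal sketch using the label column named in the assignment):
# percentage of records in each class; the larger value is the majority class
print(df['APFT_1_is_pass'].value_counts(normalize=True) * 100)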
### Show relation between Body Fat Percent and Muscle Percent
passed = df[df['APFT_1_is_pass'] == 1]
not_passed = df[df['APFT_1_is_pass'] == 0]
ax = passed.plot.scatter(x='BodyFatPerc', y='MusclePerc', label='Passed')
not_passed.plot.scatter(x='BodyFatPerc', y='MusclePerc', label='Not Passed', ax=ax, color='red')
plt.grid(True)
plt.show()
An interesting aspect of the figure above is that muscle percentage decreases as body fat percentage increases. Another thing to notice is that most of the people who failed the test have a higher body fat percentage (the points toward the bottom right).
## Show a bar plot of the number of people that passed and failed the test, grouped by gender
pd.crosstab(df.APFT_1_is_pass, df.Gender_M).plot(kind='bar', rot=0)
plt.grid(True)
plt.legend(['Female', 'Male'])  # crosstab columns are ordered Gender_M = 0 (female), then 1 (male)
plt.show()
We see that one gender dominates both outcomes: the same gender accounts for the majority of the people who passed the test and for the majority of those who failed it.
### Now plot a correlation map
fig, ax = plt.subplots(figsize=(10,10))
corr = df.corr()
sns.heatmap(corr)
plt.show()
## Normalize the data so all numeric values in the dataset are between 0 and 1
We will use Min-Max Normalization
df_norm = (df - df.min())/(df.max()-df.min())
df_norm.head()
## Split Dataset into 70% train, 30% test
train_df, test_df = train_test_split(df_norm, test_size=0.3, random_state=0)  # fixed random_state so the 70/30 split is reproducible
y_train = train_df['APFT_1_is_pass'].values
X_train = train_df.drop(columns = ['APFT_1_is_pass']).values
y_test = test_df['APFT_1_is_pass'].values
X_test = test_df.drop(columns = ['APFT_1_is_pass']).values
print(f"There are {len(y_train)} rows in the train dataset, and {len(y_test)} rows in the test dataset.")
## Function to compute accuracy
This function will compute accuracy given the real output and the predicted output
def calc_accuracy(y_real, y_pred):
    N = len(y_pred)
    # fraction of predictions that match the real labels
    return np.where(y_real == y_pred)[0].shape[0] / N
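A quick sanity check of the helper; the two arrays below are made up purely for illustration:
# three of the four predictions match the real labels, so this should print 0.75
print(calc_accuracy(np.array([1, 0, 1, 1]), np.array([1, 1, 1, 1])))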
## Model 1: Using all variables
Create a LogisticRegression model using all the variables in the dataset
model1 = LogisticRegression()
model1.fit(X_train, y_train)
# Now predict
y_pred1 = model1.predict(X_test)
# Print accuracy
accuracy1 = calc_accuracy(y_test, y_pred1)  # arguments ordered as (y_real, y_pred)
print("The accuracy of Model 1 is: {:.2f}%".format(accuracy1*100.0))
## Model 2: Removing the ORS_Total variable
From the correlation map shown before, we see that there is essentially no correlation (a coefficient close to zero) between the ***APFT_1_is_pass*** variable and the ***ORS_Total*** variable, so we will now remove the ***ORS_Total*** variable.
***ORS_Total*** is column index 1 in the X_train/X_test arrays, so we remove that column:
X_train2 = np.delete(X_train, 1, 1)
X_test2 = np.delete(X_test, 1, 1)
Create Model 2
model2 = LogisticRegression()
model2.fit(X_train2, y_train)
# Now predict
y_pred2 = model2.predict(X_test2)
# Print accuracy
accuracy2 = calc_accuracy(y_test, y_pred2)
print("The accuracy of Model 2 is: {:.2f}%".format(accuracy2*100.0))
We see that the accuracy did not change, so the removed column did not affect the output.
## Model 3: Removing all variables with an absolute correlation below 0.1
corr
We see that the variables whose absolute correlation with ***APFT_1_is_pass*** is below 0.1 are ORS_Total, PTSD_Score and Rank_SrEnlisted.
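As a cross-check, these columns can also be found programmatically instead of being read off the matrix (a sketch, assuming the corr matrix computed for the heatmap above):
# correlations of every column with the label; keep those with absolute value below 0.1
label_corr = corr['APFT_1_is_pass']
print(label_corr[label_corr.abs() < 0.1])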
X_train3 = np.delete(X_train, [1,2,9], 1)
X_test3 = np.delete(X_test, [1,2,9], 1)
model3 = LogisticRegression()
model3.fit(X_train3, y_train)
# Now predict
y_pred3 = model3.predict(X_test3)
# Print accuracy
accuracy3 = calc_accuracy(y_test, y_pred3)
print("The accuracy of Model 3 is: {:.2f}%".format(accuracy3*100.0))
### Removing variables with low correlation does not affect the model, so we keep the original model (Model 1) with all input variables
## ROC, AUC, accuracy and Confusion Matrix for Train Dataset
lr_probs = model1.predict_proba(X_train)
# keep probabilities for the positive outcome only
lr_probs = lr_probs[:, 1]
# calculate scores
lr_auc = roc_auc_score(y_train, lr_probs)
# summarize scores
print('Logistic: ROC AUC=%.3f' % (lr_auc))
lr_fpr, lr_tpr, _ = roc_curve(y_train, lr_probs)
# plot the roc curve for the model
plt.plot(lr_fpr, lr_tpr, marker='.', label='Logistic Regression')
# axis labels
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
# show the legend
plt.legend()
# show the plot
plt.show()
Confusion Matrix
y_pred_train = model1.predict(X_train)
conf_mat = confusion_matrix(y_train, y_pred_train)
print(conf_mat)
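The requirements also call for the accuracy, classification report and odds ratio on the train dataset. A minimal sketch of those calculations for Model 1, assuming the objects defined above (because the inputs were min-max normalized, each odds ratio corresponds to a full-range change in that variable):
from sklearn.metrics import classification_report

# accuracy and per-class precision/recall/f1 on the train dataset
print("Train accuracy: {:.2f}%".format(calc_accuracy(y_train, y_pred_train) * 100.0))
print(classification_report(y_train, y_pred_train))

# odds ratios: exponentiate the fitted logistic-regression coefficients
feature_names = df_norm.drop(columns=['APFT_1_is_pass']).columns
odds_ratios = pd.Series(np.exp(model1.coef_[0]), index=feature_names)
print(odds_ratios.sort_values(ascending=False))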
## ROC, AUC, accuracy and Confusion Matrix for Test Dataset
lr_probs = model1.predict_proba(X_test)
# keep probabilities for the positive outcome only
lr_probs = lr_probs[:, 1]
# calculate scores
lr_auc = roc_auc_score(y_test, lr_probs)
# summarize scores
print('Logistic: ROC AUC=%.3f' % (lr_auc))
lr_fpr, lr_tpr, _ = roc_curve(y_test, lr_probs)
# plot the roc curve for the model
plt.plot(lr_fpr, lr_tpr, marker='.', label='Logistic Regression')
# axis labels
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
# show the legend
plt.legend()
# show the plot
plt.show()
Confusion Matrix
y_pred_test = model1.predict(X_test)
conf_mat = confusion_matrix(y_test, y_pred_test)
print(conf_mat)
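The same accuracy and classification report, this time on the test dataset (using the classification_report imported above):
print("Test accuracy: {:.2f}%".format(calc_accuracy(y_test, y_pred_test) * 100.0))
print(classification_report(y_test, y_pred_test))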