Instructions
Objective
Write a program that builds a classification system in Python.
Requirements and Specifications
In this assignment you will solve a classification problem using Python.
The “LRS_Pre_Assessment_trimmed_rank.csv” physical fitness testing dataset is provided for use in this exercise. The data was collected from an active-duty squadron and the samples were deidentified at the point of collection. The independent variables include numeric and categorical data related to demographics, mental health surveys, fitness participation surveys, injury history surveys, physical performance measures, and body composition assessments. The dependent variable is whether or not the member passed their fitness test, and is titled APFT_1_is_pass. For this label, pass = 1, and fail = 0.
In a marked-up Jupyter notebook (*.ipynb), use the statsmodels Logit algorithm to predict the labels of the dataset:
- Break your code into logical chunks, using multiple “text” and “code” sections, similar to the examples given in class.
- Drop the “flight” column and one-hot-encode the “rank” & “gender” columns with df = pd.get_dummies(df, drop_first=True). drop_first is needed so the columns are linearly independent.
- Create a Data Understanding table using .describe() and include 3 Data Understanding visualizations such as a scatterplot, histogram, pairplot or correlation matrix.
- Based on your data understanding visualizations, perform data preparation transformations as required, such as normalization or log transform.
- In your Modeling section:
- Split your dataset into 70% train & 30% test.
- Include three variations on your modeling method. Variations could include adding, removing or transforming input variables.
- For the “best” variation, using the train dataset, create a ROC curve plot and calculate the accuracy, odds ratio, AUC, classification report and confusion matrix.
- For the “best” variation, using the test dataset, create a ROC curve plot and calculate the accuracy, odds ratio, AUC, classification report and confusion matrix.
- Include a text block related to Business/Mission Understanding:
- Review the week 2 “Binary Classification metric summary” file
- Mention what the majority class is (passing or failing the test), the % of datapoints in the majority class and whether or not the dataset is balanced.
- Discuss the penalty (if any) associated with a False Negative or False Positive
- Discuss the metrics (accuracy/f1/etc) that would be most appropriate for this problem based on the balance & penalties
- Include a summary text block:
- Your justification for the “best” variation in part 2.e.
- The contribution of the most important 2-3 input variables to your model, such as z or p test scores.
- A discussion of the performance metrics from part 2.e. Based on comparing the model performance on the train/test datasets, mention if any of the models overfit the data.
- Write in a formal writing style based on the Appendix B guidance, with the exception that references and citations are not required.
Upload your .ipynb notebook file to Canvas.
Source Code
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LogisticRegression
import seaborn as sns
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
## Load Data into a DataFrame
df = pd.read_csv('LRS_Pre_Assessment_trimmed_rank.csv')
df.head()
## Drop the 'Flight' column
df = df.drop(columns=['Flight'])
df.head()
## One-hot encode the 'Rank' and 'Gender' columns
# get_dummies encodes every remaining categorical column ('Rank' and 'Gender');
# drop_first=True keeps the dummy columns linearly independent
df = pd.get_dummies(df, drop_first=True)
df.head()
## Describe Dataset
### Show statistical description of the dataset
df.describe()
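The Business/Mission Understanding text block asks for the majority class and the percentage of datapoints it contains. A quick way to check the class balance of the label (a minimal sketch using the label column named in the assignment):
# percentage of records in each class; the larger value is the majority class
print(df['APFT_1_is_pass'].value_counts(normalize=True) * 100)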
### Show relation between Body Fat Percent and Muscle Percent
passed = df[df['APFT_1_is_pass'] == 1]
not_passed = df[df['APFT_1_is_pass'] == 0]
ax = passed.plot.scatter(x='BodyFatPerc', y='MusclePerc', label='Passed')
not_passed.plot.scatter(x='BodyFatPerc', y='MusclePerc', label='Not Passed', ax=ax, color='red')
plt.grid(True)
plt.show()
An interesting aspect of the figure above is that muscle percentage decreases as body fat percentage increases. Another thing to notice is that most of the people who failed the test have a higher body fat percentage (the points toward the bottom right).
## Show a bar plot of the number of people that passed and failed the test, grouped by gender
pd.crosstab(df.APFT_1_is_pass, df.Gender_M).plot(kind='bar', rot=0)
plt.grid(True)
plt.legend(['Female', 'Male'])  # crosstab columns are ordered Gender_M = 0 (female), then 1 (male)
plt.show()
We see that one gender dominates both outcomes: the same gender accounts for the majority of the people who passed the test and for the majority of those who failed it.
### Now plot a correlation map
fig, ax = plt.subplots(figsize=(10,10))
corr = df.corr()
sns.heatmap(corr)
plt.show()
## Normalize the data so all numeric values in the dataset are between 0 and 1
We will use Min-Max Normalization
df_norm = (df - df.min())/(df.max()-df.min())
df_norm.head()
## Split Dataset into 70% train, 30% test
train_df, test_df = train_test_split(df_norm, test_size=0.3, random_state=0)  # fixed random_state so the 70/30 split is reproducible
y_train = train_df['APFT_1_is_pass'].values
X_train = train_df.drop(columns = ['APFT_1_is_pass']).values
y_test = test_df['APFT_1_is_pass'].values
X_test = test_df.drop(columns = ['APFT_1_is_pass']).values
print(f"There are {len(y_train)} rows in the train dataset, and {len(y_test)} rows in the test dataset.")
## Function to compute accuracy
This function will compute accuracy given the real output and the predicted output
def calc_accuracy(y_real, y_pred):
    N = len(y_pred)
    # fraction of predictions that match the real labels
    return np.where(y_real == y_pred)[0].shape[0] / N
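A quick sanity check of the helper; the two arrays below are made up purely for illustration:
# three of the four predictions match the real labels, so this should print 0.75
print(calc_accuracy(np.array([1, 0, 1, 1]), np.array([1, 1, 1, 1])))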
## Model 1: Using all variables
Create a LogisticRegression model using all the variables in the dataset
model1 = LogisticRegression()
model1.fit(X_train, y_train)
# Now predict
y_pred1 = model1.predict(X_test)
# Print accuracy
accuracy1 = calc_accuracy(y_test, y_pred1)  # arguments ordered as (y_real, y_pred)
print("The accuracy of Model 1 is: {:.2f}%".format(accuracy1*100.0))
## Model 2: Removing the ORS_Total variable
From the correlation map shown before, we see that there is essentially no correlation (a coefficient close to zero) between the ***APFT_1_is_pass*** variable and the ***ORS_Total*** variable, so we will now remove the ***ORS_Total*** variable.
***ORS_Total*** is column index 1 in the X_train/X_test arrays, so we remove that column:
X_train2 = np.delete(X_train, 1, 1)
X_test2 = np.delete(X_test, 1, 1)
Create Model 2
model2 = LogisticRegression()
model2.fit(X_train2, y_train)
# Now predict
y_pred2 = model2.predict(X_test2)
# Print accuracy
accuracy2 = calc_accuracy(y_test, y_pred2)
print("The accuracy of Model 2 is: {:.2f}%".format(accuracy2*100.0))
We see that the accuracy did not change, so the removed column did not affect the output.
## Model 3: Removing all variables with an absolute correlation below 0.1
corr
We see that the variables whose absolute correlation with ***APFT_1_is_pass*** is below 0.1 are ORS_Total, PTSD_Score and Rank_SrEnlisted.
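As a cross-check, these columns can also be found programmatically instead of being read off the matrix (a sketch, assuming the corr matrix computed for the heatmap above):
# correlations of every column with the label; keep those with absolute value below 0.1
label_corr = corr['APFT_1_is_pass']
print(label_corr[label_corr.abs() < 0.1])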
X_train3 = np.delete(X_train, [1,2,9], 1)
X_test3 = np.delete(X_test, [1,2,9], 1)
model3 = LogisticRegression()
model3.fit(X_train3, y_train)
# Now predict
y_pred3 = model3.predict(X_test3)
# Print accuracy
accuracy3 = calc_accuracy(y_test, y_pred3)
print("The accuracy of Model 3 is: {:.2f}%".format(accuracy3*100.0))
### Removing variables with low correlation does not affect the model, so we keep the original model (Model 1) with all input variables
## ROC, AUC, accuracy and Confusion Matrix for Train Dataset
lr_probs = model1.predict_proba(X_train)
# keep probabilities for the positive outcome only
lr_probs = lr_probs[:, 1]
# calculate scores
lr_auc = roc_auc_score(y_train, lr_probs)
# summarize scores
print('Logistic: ROC AUC=%.3f' % (lr_auc))
lr_fpr, lr_tpr, _ = roc_curve(y_train, lr_probs)
# plot the roc curve for the model
plt.plot(lr_fpr, lr_tpr, marker='.', label='Logistic Regression')
# axis labels
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
# show the legend
plt.legend()
# show the plot
plt.show()
Confusion Matrix
y_pred_train = model1.predict(X_train)
conf_mat = confusion_matrix(y_train, y_pred_train)
print(conf_mat)
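The requirements also call for the accuracy, classification report and odds ratio on the train dataset. A minimal sketch of those calculations for Model 1, assuming the objects defined above (because the inputs were min-max normalized, each odds ratio corresponds to a full-range change in that variable):
from sklearn.metrics import classification_report

# accuracy and per-class precision/recall/f1 on the train dataset
print("Train accuracy: {:.2f}%".format(calc_accuracy(y_train, y_pred_train) * 100.0))
print(classification_report(y_train, y_pred_train))

# odds ratios: exponentiate the fitted logistic-regression coefficients
feature_names = df_norm.drop(columns=['APFT_1_is_pass']).columns
odds_ratios = pd.Series(np.exp(model1.coef_[0]), index=feature_names)
print(odds_ratios.sort_values(ascending=False))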
## ROC, AUC, accuracy and Confusion Matrix for Test Dataset
lr_probs = model1.predict_proba(X_test)
# keep probabilities for the positive outcome only
lr_probs = lr_probs[:, 1]
# calculate scores
lr_auc = roc_auc_score(y_test, lr_probs)
# summarize scores
print('Logistic: ROC AUC=%.3f' % (lr_auc))
lr_fpr, lr_tpr, _ = roc_curve(y_test, lr_probs)
# plot the roc curve for the model
plt.plot(lr_fpr, lr_tpr, marker='.', label='Logistic Regression')
# axis labels
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
# show the legend
plt.legend()
# show the plot
plt.show()
Confusion Matrix
y_pred_test = model1.predict(X_test)
conf_mat = confusion_matrix(y_test, y_pred_test)
print(conf_mat)
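The same accuracy and classification report, this time on the test dataset (using the classification_report imported above):
print("Test accuracy: {:.2f}%".format(calc_accuracy(y_test, y_pred_test) * 100.0))
print(classification_report(y_test, y_pred_test))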