Tip of the day
News
Instructions
Objective
Write a python homework program to implement weather prediction.
Requirements and Specifications
Create a weather predictive system with the help of machine learning.
Source Code
import numpy as np # linear algebraimport pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)import osfor dirname, _, filenames in os.walk('/kaggle/input'):for filename in filenames:print(os.path.join(dirname, filename))Loading and previewing the dataset.weatherdata = pd.read_csv('../input/weather-dataset-rattle-package/weatherAUS.csv')weatherdata.head(10)List the columns of the dataframe and load the shape of the dataframe.print(weatherdata.columns)print("Shape of the dataframe: ", weatherdata.shape)Loading the descriptive statistic summary of the dataframe.weatherdata.describe()Inspecting the data types of each column in the dataframe.weatherdata.dtypesVisualizing the correlation of each column in heatmap.import seaborn as snsimport matplotlib.pyplot as pltsns.set(rc = {'figure.figsize':(15,8)})corrplot = sns.heatmap(weatherdata.corr(), cmap = 'YlGnBu', annot = True)plt.show()From the above heatmap, we can see that the most positive correlation occurs between variables MaxTemp and Temp3pm, and the most negative correlation occurs between variables Sunshine and Cloud3pm.We want to make sure that there is no missing data in the columns containing numeric-type data.We need to inspect how many NAs in the columns containing numeric-type data.# Count how many null values in columns containing numeric datanumeric_cols = ['MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine', 'WindGustSpeed', 'WindSpeed9am', \'WindSpeed3pm', 'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', \'Temp9am', 'Temp3pm']na_sum_numeric_cols = []for n in numeric_cols:na_sum_numeric_cols.append(weatherdata[n].isnull().sum())for i in range(len(numeric_cols)):print("Sum of NAs in", numeric_cols[i], ":", na_sum_numeric_cols[i])We will check whether the numeric columns have outliers using Seaborn boxplot, to determine how we will remove the NA values.fig, ax = plt.subplots(len(numeric_cols), 1, figsize = (20, 65))fig.suptitle("Distribution of Outliers")for n in numeric_cols:sns.boxplot(x = weatherdata[n], data = weatherdata, palette = "crest", ax = ax[numeric_cols.index(n)], width = 0.4)ax[numeric_cols.index(n)].set_title("")plt.show()From the above boxplots, we can find there are many numeric columns containing outliers. Hence, we want to remove the NAs by replacing the NA values with the median of each column.for n in numeric_cols:weatherdata[n].fillna(value = weatherdata[n].median(), inplace = True)weatherdata.head(10)Making sure that there is no NA value left in each column.for n in numeric_cols:print('Amount of null values in column', n, ":", weatherdata.shape[0] - len(weatherdata[n]))We do the same to the columns containing categorical values, namely WindGustDir, WindDir9am, and WindDir3pm. However, we will use the 'ffill' method, since we can't replace NAs with either median, mode, or mean of categorical values.categorical_cols = ["WindGustDir", "WindDir9am", "WindDir3pm"]na_sum_categorical_cols = []# Count the number of NAs in categorical columnsfor col in categorical_cols:na_sum_categorical_cols.append(weatherdata[col].isnull().sum())for i in range(len(categorical_cols)):print("Sum of NAs in", categorical_cols[i], ":", na_sum_categorical_cols[i])# Replacing NA with 'ffill' method and check whether all NAs have been replacedfor col in categorical_cols:weatherdata[col].fillna(method = 'ffill', inplace = True)print('Amount of null values in column', col, "after preprocessing is:", \weatherdata.shape[0] - len(weatherdata[col]))We check how many NA values exist in RainToday.print("The number of NAs in RainToday is", weatherdata['RainToday'].isna().sum(), \"while the total number of rows in RainToday is", len(weatherdata['RainToday']))print("The ratio between NAs and non-NAs is", weatherdata['RainToday'].isna().sum()/len(weatherdata['RainToday']))Since NA values constitute only 2.24% of all rows, we may drop the NAs in RainToday.weatherdata.dropna(subset = ["RainToday"], inplace = True)print(weatherdata.shape)We do the same to RainTomorrow.print("The number of NAs in RainTomorrow is", weatherdata['RainTomorrow'].isna().sum(), \"while the total number of rows in RainToday is", len(weatherdata['RainTomorrow']))print("The ratio between NAs and non-NAs is", weatherdata['RainTomorrow'].isna().sum()/len(weatherdata['RainTomorrow']))weatherdata.dropna(subset = ['RainTomorrow'], inplace = True)print(weatherdata.shape)Finding the most minimum and maximum temperature in degrees Celcius.print("The most minimum temperature recorded is", weatherdata['MinTemp'].min(), "degrees Celcius")print("The most maximum temperature recorded is", weatherdata['MaxTemp'].max(), "degrees Celcius")Finding the largest amount of rainfall recorded in a day.print("The largest amount of rainfall recorded in a day is", weatherdata['Rainfall'].max(), "mm")Now let's get started with training and testing data.We notice that we have columns containing categorical data. Hence, we need to apply label encoding to "convert" the categorical values into numerical values.from sklearn import preprocessinglabel_encoder = preprocessing.LabelEncoder()for col in categorical_cols:weatherdata[col] = label_encoder.fit_transform(weatherdata[col])print(weatherdata[col].unique())We do the same for Date, Location, RainToday and Rain Tomorrow.for col in ['Date', 'Location', 'RainToday', 'RainTomorrow']:weatherdata[col] = label_encoder.fit_transform(weatherdata[col])print(weatherdata[col].unique())weatherdata.head(10)Let's start splitting the dataset into training and test sets.X = weatherdata.drop(columns = ['RainTomorrow'], axis=1)y = weatherdata['RainTomorrow']from sklearn.model_selection import train_test_splitX_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)Checking the shapes of training and test sets.print ('Training set:', X_train.shape, y_train.shape)print ('Test set:', X_test.shape, y_test.shape)We will test some machine learning algorithms, namely Decision Tree and Naive Bayes.We start with Decision Tree first.We will also use hyperparameter tuning using RandomizedSearchCV to find and use the best parameters, that may result in a good-performance model.from sklearn.tree import DecisionTreeClassifierfrom scipy.stats import randintfrom sklearn.model_selection import RandomizedSearchCVfrom sklearn.metrics import classification_report, confusion_matrixparam_dist = {"max_depth": [3, None], "max_features": randint(1, 9), "min_samples_leaf": randint(1, 9), \"criterion": ["gini", "entropy"]}# Instantiate the classifierweathertree = DecisionTreeClassifier()# Instantiate the hyperparameter tuning algorithmweathertree_cv = RandomizedSearchCV(weathertree, param_dist, cv = 5)# Train the training dataweathertree_cv.fit(X_train, y_train)# Print the tuned parameters and scoreprint("Tuned Decision Tree Parameters: {}".format(weathertree_cv.best_params_))print("Best score is {}\n".format(weathertree_cv.best_score_))# Inspecting the accuracy of the fine-tuned modelprint("Confusion matrix: \n {}".format(confusion_matrix(y_true=y_test, y_pred=weathertree_cv.predict(X_test))))print("\nClassification report: \n \n {}".format(classification_report(y_test, y_pred=weathertree_cv.predict(X_test))))print("Accuracy using Decision Tree: ", metrics.accuracy_score(y_test, weathertree_cv.predict(X_test)))We can see that, using Decision Tree, the fine-tuned model has 80.22% accuracy.We will try classifying using Naive Bayes.from sklearn.naive_bayes import GaussianNBweathernb = GaussianNB()weathernb.fit(X_train, y_train)# Inspecting the metrics of the modelprint("Confusion matrix: \n {}".format(confusion_matrix(y_true=y_test, y_pred=weathernb.predict(X_test))))print("\nClassification report: \n \n {}".format(classification_report(y_test, y_pred=weathernb.predict(X_test))))print("\nAccuracy using GaussianNB: ", metrics.accuracy_score(y_test, weathernb.predict(X_test)))From the above, we can see that, using Gaussian Naive Bayes, the accuracy is slightly better than Decision Tree at 80.30%.We will try classifying using Logistic Regression.from sklearn.linear_model import LogisticRegressionc_space = np.logspace(-5, 8, 15)param_grid = {'C': c_space}weatherLR = LogisticRegression(solver = 'liblinear')weatherLR_cv = RandomizedSearchCV(weatherLR, param_grid, cv = 5)weatherLR_cv.fit(X_train, y_train)# Print the tuned parameters and scoreprint("Tuned Logistic Regression Parameters: {}".format(weatherLR_cv.best_params_))print("Best score is {}".format(weatherLR_cv.best_score_))# Inspecting the metrics of the modelprint("\nConfusion matrix: \n {}".format(confusion_matrix(y_true=y_test, y_pred=weatherLR_cv.predict(X_test))))print("\nClassification report: \n \n {}".format(classification_report(y_test, y_pred=weatherLR_cv.predict(X_test))))print("\nAccuracy using Logistic Regression: ", metrics.accuracy_score(y_test, weatherLR_cv.predict(X_test)))From the above, we may conclude that Logistic Regression with C = 163789.37 performs better than the other two algorithms, with the accuracy of 84%.
Related Samples
Discover our Python Assignment Samples for clear, detailed solutions to programming tasks. These examples cover essential topics such as loops, functions, data manipulation, and algorithmic problems. Perfect for students looking to enhance their Python skills with practical, educational resources designed to aid understanding and improve academic performance.
Python
Word Count
4091 Words
Writer Name:Walter Parkes
Total Orders:2387
Satisfaction rate:
Python
Word Count
2184 Words
Writer Name:Dr. Jesse Turner
Total Orders:2365
Satisfaction rate:
Python
Word Count
6429 Words
Writer Name:Dr. Olivia Campbell
Total Orders:753
Satisfaction rate:
Python
Word Count
5883 Words
Writer Name:Dr. David Adam
Total Orders:489
Satisfaction rate:
Python
Word Count
5849 Words
Writer Name:Dr. Nicholas Scott
Total Orders:642
Satisfaction rate:
Python
Word Count
6162 Words
Writer Name:Professor Liam Mitchell
Total Orders:932
Satisfaction rate:
Python
Word Count
3640 Words
Writer Name:Professor Daniel Mitchell
Total Orders:465
Satisfaction rate:
Python
Word Count
4343 Words
Writer Name:Prof. Jackson Ng
Total Orders:627
Satisfaction rate:
Python
Word Count
7272 Words
Writer Name:Dr. Jennifer Carter
Total Orders:879
Satisfaction rate:
Python
Word Count
4577 Words
Writer Name:Dr. Sophia Nguyen
Total Orders:900
Satisfaction rate:
Python
Word Count
4145 Words
Writer Name:Dr. Isabella Scott
Total Orders:754
Satisfaction rate:
Python
Word Count
4193 Words
Writer Name:Dr. Isabella Scott
Total Orders:754
Satisfaction rate:
Python
Word Count
4165 Words
Writer Name:Dr. Ashley
Total Orders:685
Satisfaction rate:
Python
Word Count
4176 Words
Writer Name:Prof. Jackson Ng
Total Orders:627
Satisfaction rate:
Python
Word Count
3922 Words
Writer Name:Dr. Chloe Mitchell
Total Orders:957
Satisfaction rate:
Python
Word Count
4091 Words
Writer Name:Glenn R. Maguire
Total Orders:714
Satisfaction rate:
Python
Word Count
4747 Words
Writer Name:Dr. Sophia Nguyen
Total Orders:900
Satisfaction rate:
Python
Word Count
4594 Words
Writer Name:Dr. Samantha Benet
Total Orders:812
Satisfaction rate:
Python
Word Count
6716 Words
Writer Name:Dr. Isabella Scott
Total Orders:754
Satisfaction rate:
Python
Word Count
4347 Words
Writer Name:Prof. James Harper
Total Orders:664
Satisfaction rate: