13.1 Project 1: Predicting House Prices with Regression
In this project, we will develop a machine learning model to predict house prices. This is a common real-world application of regression, a supervised learning method. We will use the Boston Housing dataset, which contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts.
13.1.1 Problem Statement
The goal of this project is to build a model that can predict the median value of owner-occupied homes in Boston, given a set of features such as crime rate, average number of rooms per dwelling, and others.
13.1.2 Dataset
The dataset used in this project comes from the UCI Machine Learning Repository. This data was collected in 1978 and each of the 506 entries represents aggregate information about 14 features of homes from various suburbs located in Boston.
The features can be summarized as follows:
- CRIM: This is the per capita crime rate by town
- ZN: This is the proportion of residential land zoned for lots larger than 25,000 sq.ft.
- INDUS: This is the proportion of non-retail business acres per town.
- CHAS: This is the Charles River dummy variable (this is equal to 1 if tract bounds river; 0 otherwise)
- NOX: This is the nitric oxides concentration (parts per 10 million)
- RM: This is the average number of rooms per dwelling
- AGE: This is the proportion of owner-occupied units built prior to 1940
- DIS: This is the weighted distances to five Boston employment centers
- RAD: This is the index of accessibility to radial highways
- TAX: This is the full-value property-tax rate per $10,000
- PTRATIO: This is the pupil-teacher ratio by town
- B: This is calculated as 1000(Bk - 0.63)², where Bk is the proportion of people of African American descent by town
- LSTAT: This is the percentage of the population considered lower status
- MEDV: This is the median value of owner-occupied homes in $1000s
13.1.3 Implementation
Step 1
Let's start by loading the dataset and removing the non-essential features.
    # Import libraries necessary for this project
    import numpy as np
    import pandas as pd
    from sklearn.model_selection import ShuffleSplit

    # Import supplementary visualizations code visuals.py
    import visuals as vs

    # Pretty display for notebooks (IPython/Jupyter magic; only valid in a notebook)
    %matplotlib inline

    # Load the Boston housing dataset
    data = pd.read_csv('housing.csv')
    prices = data['MEDV']

    # Keep only the features used in this project
    features = data[['RM', 'LSTAT', 'PTRATIO']]

    # Success
    print("Boston housing dataset has {} data points with {} variables each.".format(*data.shape))

Code breakdown:
The first two statements import NumPy, which provides numerical computing routines, and pandas, which provides data structures and data analysis tools. The third imports the ShuffleSplit class from scikit-learn, which generates randomized train/test splits and is used later for cross-validation. The fourth imports the supplementary visualization code from the visuals.py file. The %matplotlib inline magic makes plots render inside the notebook. The next statement loads the Boston housing dataset from the housing.csv file. The prices variable holds the target, MEDV, the median value of owner-occupied homes in thousands of dollars. The features variable holds the three columns used in this project, RM, LSTAT, and PTRATIO; the remaining columns are left out. The final statement prints the number of data points and variables in the loaded data.
The dataset is now split into features and the target variable. The features 'RM', 'LSTAT', and 'PTRATIO' give us quantitative information about each data point. The target variable, 'MEDV', is the variable we seek to predict.
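The feature/target separation can be sketched on a small stand-in table. The DataFrame below is hypothetical (its values are made up and it is not housing.csv); it only assumes that the file has columns named RM, LSTAT, PTRATIO, and MEDV, as the text describes. Checking for the expected columns first is an optional safeguard, not part of the original walkthrough.

```python
import pandas as pd

# Hypothetical miniature stand-in for housing.csv; the values are made up
# purely to illustrate the column selection.
data = pd.DataFrame({
    "RM": [6.5, 5.9, 7.1, 6.0],
    "LSTAT": [5.0, 14.2, 3.1, 10.5],
    "PTRATIO": [15.3, 20.1, 14.7, 18.9],
    "MEDV": [24.0, 17.5, 33.2, 20.1],
})

feature_columns = ["RM", "LSTAT", "PTRATIO"]

# Fail early if the table does not have the expected layout.
missing = [c for c in feature_columns + ["MEDV"] if c not in data.columns]
if missing:
    raise KeyError(f"Missing expected columns: {missing}")

features = data[feature_columns]  # 2-D table of inputs (DataFrame)
prices = data["MEDV"]             # 1-D target (Series)

print(features.shape, prices.shape)
```

Selecting the columns by name keeps the feature order explicit, which matters later when new rows are passed to the model for prediction.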
Next, we will calculate some descriptive statistics about the Boston housing prices.
    import numpy as np
    import pandas as pd

    # Load the Boston housing dataset
    data = pd.read_csv('housing.csv')
    prices = data['MEDV']

    # Minimum price of the data
    minimum_price = np.min(prices)

    # Maximum price of the data
    maximum_price = np.max(prices)

    # Mean price of the data
    mean_price = np.mean(prices)

    # Median price of the data
    median_price = np.median(prices)

    # Standard deviation of prices of the data
    std_price = np.std(prices)

    # Show the calculated statistics
    print("Statistics for Boston housing dataset:\n")
    print("Minimum price: ${}".format(minimum_price))
    print("Maximum price: ${}".format(maximum_price))
    print("Mean price: ${}".format(mean_price))
    print("Median price: ${}".format(median_price))
    print("Standard deviation of prices: ${:.2f}".format(std_price))

Code breakdown:
The code first imports NumPy and pandas and reloads the dataset so that the block can run on its own. The prices variable holds the median home values from the Boston housing dataset. The code then uses the NumPy functions np.min(), np.max(), np.mean(), np.median(), and np.std() to calculate the minimum, maximum, mean, median, and standard deviation of the prices, respectively. Finally, the code prints the calculated statistics.
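One detail worth knowing when reading the standard deviation: np.std() uses the population formula by default (ddof=0), whereas the pandas Series.std() method uses the sample formula (ddof=1). The sketch below compares the two on made-up numbers, not on the housing data.

```python
import numpy as np
import pandas as pd

# Made-up values, used only to compare the two conventions.
prices = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
values = prices.to_numpy()

population_std = np.std(values)       # NumPy default: ddof=0, divides by N
sample_std = prices.std()             # pandas default: ddof=1, divides by N - 1
matched_std = np.std(values, ddof=1)  # NumPy using the pandas convention

print("Population std (ddof=0): {:.4f}".format(population_std))
print("Sample std (ddof=1):     {:.4f}".format(sample_std))
print("NumPy with ddof=1:       {:.4f}".format(matched_std))
```

With several hundred data points the two values differ only slightly, but the difference explains why the same column can report different standard deviations in different tools.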
We can form some hypotheses about the data before modeling. For example, we would expect homes with more rooms (higher 'RM' value) to be worth more, neighborhoods with a larger share of lower-status residents (higher 'LSTAT' value) to be worth less, and neighborhoods with a higher pupil-teacher ratio (higher 'PTRATIO' value) to be worth less.
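One simple way to check expectations like these is to look at the correlation of each feature with the target. The sketch below uses a hypothetical table whose values were invented to follow the expected trends, so it demonstrates the technique only and says nothing about the real dataset.

```python
import pandas as pd

# Hypothetical values chosen to follow the expected trends; NOT real data.
data = pd.DataFrame({
    "RM": [5.0, 5.5, 6.0, 6.5, 7.0, 7.5],
    "LSTAT": [25.0, 20.0, 14.0, 10.0, 6.0, 3.0],
    "PTRATIO": [21.0, 20.5, 19.0, 18.0, 16.0, 15.0],
    "MEDV": [14.0, 17.0, 20.5, 24.0, 30.0, 36.0],
})

# Pearson correlation of every feature with the target column.
correlations = data.corr()["MEDV"].drop("MEDV")
print(correlations)
```

A positive value supports "higher feature, higher price" and a negative value supports the opposite. Applied to the actual data, the same two lines would show whether each expectation holds.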
Next, we will split the data into training and testing subsets.
    # Import libraries necessary for this project
    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Load the Boston housing dataset
    data = pd.read_csv('housing.csv')
    prices = data['MEDV']
    features = data[['RM', 'LSTAT', 'PTRATIO']]

    # Success
    print("Boston housing dataset has {} data points with {} variables each.".format(*data.shape))

    # Shuffle and split the data into training and testing subsets
    X_train, X_test, y_train, y_test = train_test_split(
        features, prices, test_size=0.2, random_state=42)

    # Success
    print("Training and testing split was successful.")

Code breakdown:
The code first imports the train_test_split function from the sklearn.model_selection module. Next, the code defines two variables, features and prices, which contain the features and prices of the Boston housing dataset, respectively. The code then uses the train_test_split function to shuffle the data and split it into training and testing subsets. The test_size parameter specifies that 20% of the data should be used for testing, and the random_state parameter fixes the seed of the random shuffle so that the same split is produced on every run. Finally, the code prints a message indicating that the training and testing split was successful.
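The effect of test_size and random_state can be seen on a toy array. The data below is made up; the point is only that two calls with the same seed return identical splits and that 20% of the rows end up in the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two calls with the same seed.
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=0.2, random_state=42)

print("Train size:", len(X_train_a), "Test size:", len(X_test_a))
print("Same test rows both times:", np.array_equal(y_test_a, y_test_b))
```

Omitting random_state would still shuffle the data, but the split would change from run to run, which makes results harder to compare.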
We will then train a model using the decision tree algorithm. To ensure that we are producing an optimized model, we will train the model using the grid search technique to optimize the 'max_depth' parameter for the decision tree.
    # Import the tools used to build and tune the model
    from sklearn.metrics import make_scorer
    from sklearn.model_selection import GridSearchCV, ShuffleSplit
    from sklearn.tree import DecisionTreeRegressor

    def fit_model(X, y):
        """Tune 'max_depth' of a decision tree regressor using grid search."""
        # Create cross-validation sets from the training data
        cv_sets = ShuffleSplit(n_splits=10, test_size=0.20, random_state=0)

        # Create a decision tree regressor object
        regressor = DecisionTreeRegressor()

        # Create a dictionary for the parameter 'max_depth' with a range from 1 to 10
        params = {'max_depth': list(range(1, 11))}

        # Transform 'performance_metric' into a scoring function using 'make_scorer'
        # NOTE: 'performance_metric' must be defined beforehand; it takes the true
        # and predicted values and returns a score (higher is better)
        scoring_fnc = make_scorer(performance_metric)

        # Create the grid search cv object --> GridSearchCV()
        grid = GridSearchCV(estimator=regressor, param_grid=params,
                            scoring=scoring_fnc, cv=cv_sets)

        # Fit the grid search object to the data to compute the optimal model
        grid = grid.fit(X, y)

        # Return the optimal model after fitting the data
        return grid.best_estimator_

Code breakdown:
The code first imports make_scorer from sklearn.metrics, GridSearchCV and ShuffleSplit from sklearn.model_selection, and DecisionTreeRegressor from sklearn.tree. Next, the code defines a function called fit_model(), which takes two arguments, X and y, representing the training features and the target values, respectively. Inside the function, a ShuffleSplit object called cv_sets generates 10 random shuffled splits of the training data, each holding out 20% of the data for validation. The code then creates a DecisionTreeRegressor object called regressor and a dictionary called params, which maps the parameter name max_depth to the list of values from 1 to 10. The make_scorer() function wraps performance_metric, which must be defined separately, into a scoring function called scoring_fnc that is used to evaluate each candidate model. Finally, the code creates a GridSearchCV object called grid from the regressor, params, scoring_fnc, and cv_sets objects. Fitting grid to the data trains and scores a tree for every max_depth value on every split, and the best-scoring model is returned from fit_model() as grid.best_estimator_.
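The fit_model() function relies on a performance_metric function that the walkthrough does not show. A minimal sketch is given below, assuming the coefficient of determination (R²) is the intended metric; that choice is an assumption made here for illustration, and any function with the same signature that returns a higher-is-better score would work with make_scorer().

```python
from sklearn.metrics import make_scorer, r2_score

def performance_metric(y_true, y_predict):
    """Return the R^2 score between true and predicted values.

    Assumption: R^2 is used as the metric; the original text does not say.
    """
    return r2_score(y_true, y_predict)

# Quick check on a handful of made-up values.
score = performance_metric([3, -0.5, 2, 7], [2.5, 0.0, 2, 8])
print("R^2 on the toy values: {:.3f}".format(score))

# Wrap the metric so that GridSearchCV can use it for scoring.
scoring_fnc = make_scorer(performance_metric)
```

An R² of 1 means the predictions match the targets exactly, 0 means the model does no better than always predicting the mean, and negative values mean it does worse than that.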
Finally, we will make predictions on new sets of input data.
    # Assume reg is the trained model obtained from fit_model

    # Produce a matrix for client data
    # Each row lists the features in training order: [RM, LSTAT, PTRATIO]
    client_data = [[5, 17, 15],  # Client 1
                   [4, 32, 22],  # Client 2
                   [8, 3, 12]]   # Client 3

    # Show predictions
    for i, price in enumerate(reg.predict(client_data)):
        print("Predicted selling price for Client {}'s home: ${:,.2f}".format(i + 1, price))

Code breakdown:
The code first creates a matrix called client_data, with one row per client and one value per feature, listed in the same order as the columns used for training. It then calls reg.predict() to predict the selling price for each client and uses enumerate() to pair each predicted price with a client number. Finally, it prints the predicted selling price for each client.
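The walkthrough holds out a test set but does not score the tuned model on it. The sketch below ties the steps together and adds that check. It runs on synthetic data generated with NumPy (not the Boston dataset), uses scikit-learn's built-in "r2" scorer in place of performance_metric, and fixes random_state on the tree for repeatability; all three are assumptions made so the example runs on its own.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, ShuffleSplit, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (NOT the Boston dataset): three made-up features in
# roughly the ranges of RM, LSTAT, and PTRATIO, plus a noisy linear target.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([
    rng.uniform(4, 9, n),    # stand-in for RM
    rng.uniform(2, 35, n),   # stand-in for LSTAT
    rng.uniform(12, 22, n),  # stand-in for PTRATIO
])
y = 10 * X[:, 0] - 0.8 * X[:, 1] - 1.5 * X[:, 2] + rng.normal(0, 2, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Grid search over max_depth, mirroring the structure of fit_model().
grid = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": list(range(1, 11))},
    scoring="r2",
    cv=ShuffleSplit(n_splits=10, test_size=0.20, random_state=0),
)
grid.fit(X_train, y_train)
reg = grid.best_estimator_

# Score the tuned model on data it has never seen.
best_depth = reg.get_params()["max_depth"]
test_r2 = r2_score(y_test, reg.predict(X_test))
print("Selected max_depth:", best_depth)
print("R^2 on held-out test set: {:.3f}".format(test_r2))

# Predict for new inputs, one row per client.
client_data = [[5, 17, 15], [4, 32, 22], [8, 3, 12]]
predictions = reg.predict(client_data)
for i, value in enumerate(predictions):
    print("Prediction for client {}: {:.2f}".format(i + 1, value))
```

Because the cross-validation splits are drawn only from the training data, the test-set score gives an estimate of how the selected depth generalizes that is independent of the tuning process.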
This project provides a practical application of machine learning in a real-world setting. It demonstrates how to use regression to predict house prices based on various features. The code provided can be used as a starting point for further exploration and experimentation.