Introduction

This dataset was featured in the 1983 American Statistical Association Exposition and was later used by R. Quinlan to experiment with combining instance-based and model-based learning. Quinlan's paper was presented in 1993 at a machine learning conference held at the University of Massachusetts.

The original dataset has nine attributes and 406 instances. A modified version has 398 instances, but I used the original for this project. The first six attributes are numerical and the last three are categorical. Model year should be label encoded, because a larger gap in model year should plausibly correspond to a larger difference in the results. There is very little documentation about what origin represents. It is discrete, ranges from 1 to 3, and should probably be one-hot encoded, as should car name. In the end, both car name and origin were removed from the analyses: there are 312 distinct car names, which have little impact, and since the meaning of origin could not be determined, it should not influence the results.
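
As a rough illustration of the two encodings mentioned above, the following sketch label encodes a model-year column and one-hot encodes an origin column. The arrays are made-up toy values, not rows from the dataset, and the variable names are hypothetical. The toy years are consecutive, so differences between codes match differences between years.

```python
import numpy as np

# Toy values for illustration only; these are not rows from the dataset.
model_year = np.array([70, 71, 70, 73, 72])
origin = np.array([1, 3, 2, 1, 3])

# Label encoding: map each distinct year to its rank, preserving order.
years, year_codes = np.unique(model_year, return_inverse=True)

# One-hot encoding: one indicator column per distinct origin value.
origins, origin_codes = np.unique(origin, return_inverse=True)
origin_one_hot = np.eye(len(origins), dtype=int)[origin_codes]

print(year_codes)
print(origin_one_hot)
```

Label encoding keeps a single ordered column, while one-hot encoding avoids implying any order among the three origin values.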

I've driven a variety of vehicles throughout my life, and I always find it interesting how manufacturers design them with different priorities in mind. The car I currently drive gets pretty good gas mileage on the highway, but not nearly as good in Bozeman. Meanwhile, my partner's car gets almost as good gas mileage in Bozeman as it does on the highway. It will be interesting to see how different vehicle metrics affect each other.

Miles per gallon is the focus of the dataset, so I think it will be the most interesting attribute. Horsepower, model year, and weight should also interact strongly with mpg, so I expect them to be the most important of the remaining attributes.

Functions

Analyses

Preprocessing

There were two versions of this dataset on the UCI Machine Learning Repository: the original, and one with the instances that had a null mpg removed. I chose the original dataset and filled in the null values with the mean of the corresponding attribute.
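
A minimal sketch of mean imputation, assuming the missing values have already been read in as NaN. The column below is a made-up toy example, not the actual mpg data, and the variable names are hypothetical.

```python
import numpy as np

# Toy mpg column with missing entries marked as NaN (made-up values).
mpg = np.array([18.0, np.nan, 26.0, 31.0, np.nan, 25.0])

# Mean of the observed values only; nanmean skips NaN entries.
observed_mean = np.nanmean(mpg)

# Replace each missing value with that mean.
mpg_filled = np.where(np.isnan(mpg), observed_mean, mpg)

print(observed_mean)
print(mpg_filled)
```

Filling with the mean leaves the attribute's mean unchanged, although it does shrink the attribute's variance somewhat.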

What is the multivariate mean of the numerical data matrix (where categorical data have been converted to numerical values)?
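
The multivariate mean is the vector of per-attribute means. The sketch below shows the computation on a small made-up matrix. The values and the column order are assumptions for illustration and are not results from the dataset.

```python
import numpy as np

# Toy data matrix: rows are cars, columns are attributes (made-up values).
# Hypothetical column order: mpg, cylinders, weight.
D = np.array([
    [18.0, 8.0, 3500.0],
    [26.0, 4.0, 2200.0],
    [31.0, 4.0, 2000.0],
    [25.0, 6.0, 2700.0],
])

# The multivariate mean is the vector of column means.
mean_vector = D.mean(axis=0)
print(mean_vector)
```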

What is the covariance matrix of the numerical data matrix (where categorical data have been converted to numerical values)?
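
One way to compute the sample covariance matrix is to center the data matrix and take the scaled inner product of the centered matrix with itself. This is a sketch on made-up toy values, assuming the unbiased n - 1 normalization. It is not necessarily how the numbers in this report were produced.

```python
import numpy as np

# Toy data matrix (made-up values). Hypothetical columns: mpg, cylinders, weight.
D = np.array([
    [18.0, 8.0, 3500.0],
    [26.0, 4.0, 2200.0],
    [31.0, 4.0, 2000.0],
    [25.0, 6.0, 2700.0],
])

n = D.shape[0]

# Center each column by subtracting its mean.
centered = D - D.mean(axis=0)

# Sample covariance: inner products of centered columns, divided by n - 1.
cov = centered.T @ centered / (n - 1)
print(cov)
```

The result should agree with `np.cov(D, rowvar=False)`, which uses the same n - 1 normalization by default.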

Choose 5 pairs of attributes that you think could be related. Create scatter plots of all 5 pairs and include these in your report, along with a description and analysis that summarizes why these pairs of attributes might be related, and how the scatter plots do or do not support this intuition.

mpg vs cylinders

The number of cylinders appears to be negatively related to miles per gallon. This makes sense because each cylinder requires some amount of gasoline in order to fire, so a car with more cylinders has to pump more gas to supply all of them.

mpg vs horsepower

Horsepower also appears to be related to mpg. This makes sense because horsepower is a measure of the power that an engine can output. In order to get more power, more fuel needs to be used, which would decrease mpg. The scatterplot shows a fairly strong relationship.

mpg vs model year

The scatter plot shows that mpg is related to model year. There is a lot of variation within each year, but mpg trends upward over time. This makes sense because car manufacturers produce more efficient cars as new advancements are made.

mpg vs weight

Mpg appears to be related to weight. It takes more energy to move a heavier object than a lighter one. The data also reflect city driving, where starts and stops are more common, so a heavier car pays that energy cost again every time it accelerates from a stop.

acceleration vs weight

These two are much less related than the other attribute pairs above. One would think that heavier cars would be harder to accelerate, but these vehicles are likely paired with more powerful engines.

Which range-normalized numerical attributes have the greatest sample covariance? What is their sample covariance? Create a scatter plot of these range-normalized attributes.
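
A sketch of how this could be computed: rescale each attribute to [0, 1], take the sample covariance matrix, and look for the largest off-diagonal entry. The data below are made up, the attribute names are placeholders, and "greatest" is assumed to mean the largest signed value rather than the largest magnitude.

```python
import numpy as np

# Toy data (made-up values), with hypothetical attribute names.
names = ["mpg", "cylinders", "weight"]
D = np.array([
    [18.0, 8.0, 3500.0],
    [26.0, 4.0, 2200.0],
    [31.0, 4.0, 2000.0],
    [25.0, 6.0, 2700.0],
])

# Range normalization: map each column to [0, 1].
# Assumes no column is constant, so every range is nonzero.
mins = D.min(axis=0)
ranges = D.max(axis=0) - mins
R = (D - mins) / ranges

# Sample covariance of the normalized attributes.
cov = np.cov(R, rowvar=False)

# Visit each unordered pair of distinct attributes once (upper triangle).
i_idx, j_idx = np.triu_indices(len(names), k=1)
best = np.argmax(cov[i_idx, j_idx])
pair = (names[i_idx[best]], names[j_idx[best]])
pair_cov = cov[i_idx[best], j_idx[best]]

print(pair, pair_cov)
```

A scatter plot of the two selected columns of `R` would then show the pair on a common [0, 1] scale.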

Which Z-score-normalized numerical attributes have the greatest correlation? What is their correlation? Create a scatter plot of these Z-score-normalized attributes.

Which Z-score-normalized numerical attributes have the smallest correlation? What is their correlation? Create a scatter plot of these Z-score-normalized attributes.
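
For both Z-score questions, one approach is to standardize each attribute and use the fact that the sample covariance of standardized attributes equals their correlation. The sketch below uses made-up data and placeholder names. It assumes "smallest" means the most negative correlation, which is only one possible reading; the alternative would be the correlation closest to zero.

```python
import numpy as np

# Toy data (made-up values), with hypothetical attribute names.
names = ["mpg", "cylinders", "weight"]
D = np.array([
    [18.0, 8.0, 3500.0],
    [26.0, 4.0, 2200.0],
    [31.0, 4.0, 2000.0],
    [25.0, 6.0, 2700.0],
])

# Z-score normalization using the sample standard deviation (ddof=1).
Z = (D - D.mean(axis=0)) / D.std(axis=0, ddof=1)

# For z-scored columns, the sample covariance is the correlation matrix.
corr = Z.T @ Z / (Z.shape[0] - 1)

# Compare each unordered pair of distinct attributes once.
i_idx, j_idx = np.triu_indices(len(names), k=1)
values = corr[i_idx, j_idx]
hi, lo = np.argmax(values), np.argmin(values)

greatest = (names[i_idx[hi]], names[j_idx[hi]], values[hi])
smallest = (names[i_idx[lo]], names[j_idx[lo]], values[lo])

print(greatest)
print(smallest)
```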

How many pairs of features have correlation greater than or equal to 0.5?

How many pairs of features have negative sample covariance?
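
Both counting questions can be answered by scanning the upper triangle of the relevant matrix, so that each unordered pair is counted once and the diagonal is ignored. This is a sketch on made-up data. The counts it prints describe the toy matrix only, not the dataset.

```python
import numpy as np

# Toy data matrix (made-up values). Hypothetical columns: mpg, cylinders, weight.
D = np.array([
    [18.0, 8.0, 3500.0],
    [26.0, 4.0, 2200.0],
    [31.0, 4.0, 2000.0],
    [25.0, 6.0, 2700.0],
])

corr = np.corrcoef(D, rowvar=False)
cov = np.cov(D, rowvar=False)

# Indices of the strict upper triangle: one entry per unordered pair.
i_idx, j_idx = np.triu_indices(D.shape[1], k=1)

# Pairs whose correlation is at least 0.5.
n_high_corr = int(np.sum(corr[i_idx, j_idx] >= 0.5))

# Pairs whose sample covariance is negative.
n_negative_cov = int(np.sum(cov[i_idx, j_idx] < 0))

print(n_high_corr, n_negative_cov)
```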

What is the total variance of the data?

What is the total variance of the data, restricted to the five features that have the greatest sample variance?
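
The total variance is the sum of the per-attribute sample variances, which is also the trace of the covariance matrix. The sketch below computes it on made-up data and optionally restricts the sum to the k attributes with the largest variance. `total_variance` and its `k` parameter are hypothetical names introduced for this illustration.

```python
import numpy as np

def total_variance(D, k=None):
    """Sum of per-attribute sample variances; if k is given, only the k largest."""
    variances = D.var(axis=0, ddof=1)
    if k is not None:
        # Sort in descending order and keep the k largest variances.
        variances = np.sort(variances)[::-1][:k]
    return float(variances.sum())

# Toy data matrix (made-up values). Hypothetical columns: mpg, cylinders, weight.
D = np.array([
    [18.0, 8.0, 3500.0],
    [26.0, 4.0, 2200.0],
    [31.0, 4.0, 2000.0],
    [25.0, 6.0, 2700.0],
])

print(total_variance(D))       # all attributes
print(total_variance(D, k=2))  # only the two highest-variance attributes
```

For the question above, the call would use `k=5` on the full numerical data matrix. On unnormalized data the result tends to be dominated by whichever attribute has the largest scale, as the weight column does in this toy matrix.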