Introduction

https://archive.ics.uci.edu/ml/datasets/Glass+Identification

Glass is always a fun substance, and it is everywhere around us. I think it is interesting to know a little more about the composition of such a common substance and how different physical properties lend themselves to different use cases. There are 9 numerical attributes and a 10th categorical attribute that represents the type of glass. There are no missing values in this data set.

There are six different types of glass in this dataset, so I would expect there to be six clusters. If the clusters are distinct, it would mean the types of glass have distinct characteristics. If the clusters are not distinct, the types of glass are quite similar, so the type of glass may not matter as much as its application or use. I'm not sure whether the glass characteristics are distinct; there may be a lot of variance. I think there will be between three and six clusters. The clusters should not be the same size: glass types five and six have fewer instances than types one, two, or seven.

Functions

Analyses

8. Use sklearn’s PCA implementation to linearly transform the data to two dimensions. Create a scatter plot of the data, with the x-axis corresponding to coordinates of the data along the first principal component, and the y-axis corresponding to coordinates of the data along the second principal component. Does it look like there are clusters in these two dimensions? If so, how many would you say there are?

It looks like there are one or two definite clusters, though there could be a couple of smaller ones along the line.
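A minimal sketch of the projection behind this plot, assuming the 9 numerical attributes have already been loaded into an array and standardized. Synthetic data from make_blobs stands in here for the real glass measurements (which would come from the UCI file linked above), so the shapes run end to end.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Stand-in for the 214 glass samples x 9 numerical attributes.
X, _ = make_blobs(n_samples=214, n_features=9, centers=3, random_state=0)
X = StandardScaler().fit_transform(X)  # PCA is scale-sensitive

# Linearly transform the data down to two dimensions.
coords = PCA(n_components=2).fit_transform(X)

# coords[:, 0] and coords[:, 1] are the scatter-plot axes:
#   plt.scatter(coords[:, 0], coords[:, 1])
print(coords.shape)
```

With real data, the standardization step matters: the oxide weight percentages are on very different scales, and unscaled PCA would be dominated by the largest-variance attribute.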

9. Use sklearn’s PCA implementation to linearly transform the data, without specifying the number of components to use. Create a plot with r, the number of components (i.e., dimensionality), on the x-axis, and f(r), the fraction of total variance captured in the first r principal components, on the y-axis. Based on this plot, choose a number of principal components to reduce the dimensionality of the data. Report how many principal components will be used as well as the fraction of total variance captured using this many components.
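One way to compute f(r) and pick r, sketched on synthetic stand-in data; the 95% variance threshold is an assumption, not something the question specifies.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Stand-in for the standardized glass attributes.
X, _ = make_blobs(n_samples=214, n_features=9, centers=3, random_state=0)
X = StandardScaler().fit_transform(X)

# No n_components argument: PCA keeps all 9 components.
pca = PCA().fit(X)

# f[r-1] is the fraction of total variance in the first r components.
f = np.cumsum(pca.explained_variance_ratio_)

# Smallest r capturing at least 95% of the variance (assumed threshold).
r = int(np.searchsorted(f, 0.95) + 1)
print(r, f[r - 1])  # plot f against range(1, 10) to visualize the elbow
```

The plot for the question is then just `plt.plot(range(1, 10), f)`.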

10. For both the original and the reduced-dimensionality data obtained using PCA in question 9, do the following: Experiment with a range of values for the number of clusters, k, that you pass as input to the k-means function, to find clusters in the chosen data set. Use at least 5 different values of k. For each value of k, report the value of the objective function for that choice of k.
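A sketch of the k-means sweep on one of the two data sets (the same loop would be repeated on the PCA-reduced data). The range of k values is an assumption; sklearn's KMeans exposes the objective function value (sum of squared distances to the nearest centroid) as `inertia_`. Synthetic data again stands in for the glass attributes.

```python
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Stand-in for the original (or PCA-reduced) glass data.
X, _ = make_blobs(n_samples=214, n_features=9, centers=3, random_state=0)
X = StandardScaler().fit_transform(X)

# Try at least 5 values of k and record the objective for each.
inertias = {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_

for k, obj in inertias.items():
    print(f"k={k}: objective={obj:.2f}")
```

Plotting these values against k gives the usual elbow curve; the objective always decreases as k grows, so the interesting quantity is where the decrease levels off.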

11. For both the original and the reduced-dimensionality data obtained using PCA in question 9, do the following: Experiment with a range of values for the minpts and epsilon input parameters to the DBSCAN function to find clusters in the chosen data set. First keep epsilon fixed and try out a range of different values for minpts. Then keep minpts fixed, and try a range of values for epsilon. Use at least 5 values of epsilon and at least 5 values of minpts. Report the number of clusters found for each (minpts, epsilon) pair tested.
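The two DBSCAN sweeps might look like the sketch below, again on synthetic stand-in data. The specific epsilon and minpts values are assumptions to be tuned against the real data; in sklearn, minpts is the `min_samples` parameter, and points labeled -1 are noise, so they are excluded from the cluster count.

```python
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

# Stand-in for the original (or PCA-reduced) glass data.
X, _ = make_blobs(n_samples=214, n_features=9, centers=3, random_state=0)
X = StandardScaler().fit_transform(X)

def n_clusters(labels):
    # Cluster labels are 0, 1, 2, ...; -1 marks noise points.
    return len(set(labels) - {-1})

results = {}  # (minpts, epsilon) -> number of clusters found

# Sweep 1: epsilon fixed, vary minpts.
for minpts in [3, 5, 8, 12, 20]:
    labels = DBSCAN(eps=0.5, min_samples=minpts).fit_predict(X)
    results[(minpts, 0.5)] = n_clusters(labels)

# Sweep 2: minpts fixed, vary epsilon.
for eps in [0.3, 0.5, 0.7, 1.0, 1.5]:
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    results[(5, eps)] = n_clusters(labels)

for (minpts, eps), n in sorted(results.items()):
    print(f"minpts={minpts}, eps={eps}: {n} clusters")
```

Note that sensible epsilon values depend heavily on whether the data was standardized and on its dimensionality, so the ranges for the original 9-dimensional data and the PCA-reduced data would likely differ.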

12. (EC) Create a plot of clustering precision for each value of k used in question 10, each value of epsilon used in question 11, and each value of minpts used in question 11, for both the original and reduced-dimensionality data.