Python: Problem Set 10

Problem Set 10 uses K-Means clustering to group counties with similar characteristics, based on 14 demographic features such as income, housing, education, farm acres, age, population, population change and several others.

The first task is to measure the sum of the squared error (SSE): the sum, over all points, of the squared distance between each point and the center of its cluster.
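As a rough illustration of that measurement, here is a minimal sketch in plain Python. It assumes each cluster is simply a list of numeric feature vectors and that distance is Euclidean; the function names are made up for this example and are not taken from the problem set's own code.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Component-wise mean of a non-empty list of feature vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def sum_squared_error(clusters):
    """Total SSE, where `clusters` is a list of lists of feature vectors."""
    total = 0.0
    for points in clusters:
        center = centroid(points)
        total += sum(squared_distance(p, center) for p in points)
    return total


# Toy data (not the county dataset): two clusters in two dimensions.
clusters = [[[0.0, 0.0], [2.0, 0.0]], [[5.0, 5.0]]]
sse = sum_squared_error(clusters)
print(sse)
```

A cluster holding a single point contributes nothing to the total, since that point is its own center.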

Next, we look at the effect of increasing the number of clusters (K). If the number of clusters were equal to the number of points in the dataset, every point would be its own cluster center, so SSE would be zero and we would be finished. This, however, tells us little (if anything) about the relationship between points, so choosing a good value of K is important.

In the county dataset, SSE nearly always decreases as the number of clusters (k-vals) increases.
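The same trend can be reproduced on toy data. The sketch below is a bare-bones k-means written for this illustration only; it is not the problem set's implementation, and the six points, the fixed seed and the iteration count are all arbitrary assumptions. It runs k-means for every K from 1 up to the number of points and records the resulting SSE.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]


def k_means(points, k, iterations=20, seed=0):
    """Simple k-means; returns (centers, sse)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            groups[nearest].append(p)
        # Recompute each center; keep the old one if its cluster is empty.
        centers = [centroid(g) if g else c for g, c in zip(groups, centers)]
    sse = sum(min(squared_distance(p, c) for c in centers) for p in points)
    return centers, sse


# Toy data: two well-separated blobs of three points each.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [9.0, 9.0], [9.0, 10.0], [10.0, 9.0]]
sse_by_k = {k: k_means(points, k)[1] for k in range(1, len(points) + 1)}
for k, sse in sse_by_k.items():
    print(k, round(sse, 3))
```

When K equals the number of points, every point starts as its own center and the SSE is exactly zero. Because the starting centers are chosen at random, a larger K can occasionally give a slightly worse SSE than a smaller one, which fits the "nearly always" above.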

The final part of the problem is to weight some features so that the counties cluster more tightly on the poverty rate without using that feature in the measurement. Weighting all features except poverty results in an SSE of about 44,000 at 125 clusters. Weighting for income and high school graduation rate, however, gave a noticeably lower SSE of about 31,000.
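One way to picture feature weighting is as a weight on each term of the distance calculation. The sketch below assumes that form (the problem set may apply its weights differently), and the three feature names and values are hypothetical. A weight of zero removes a feature from the distance entirely, which is how poverty can be left out of the measurement while still being stored with each county.

```python
def weighted_squared_distance(p, q, weights):
    """Squared distance with a per-feature weight on each squared difference."""
    return sum(w * (a - b) ** 2 for a, b, w in zip(p, q, weights))


# Hypothetical feature order: income, high school graduation rate, poverty.
county_a = [0.5, 0.8, 0.10]
county_b = [0.5, 0.8, 0.30]

all_features = [1.0, 1.0, 1.0]     # every feature counts equally
ignore_poverty = [1.0, 1.0, 0.0]   # poverty is excluded from the distance

d_all = weighted_squared_distance(county_a, county_b, all_features)
d_no_poverty = weighted_squared_distance(county_a, county_b, ignore_poverty)
print(d_all, d_no_poverty)
```

These two counties differ only in poverty, so they look identical to the clustering once that feature has zero weight. Whether clusters built this way end up tight on poverty then depends on how well the remaining weighted features track it.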

sum squared error