Monthly Archives: August 2015

Python: Problem Set 10

Problem Set 10 uses K-Means clustering to group counties into similar groups based on 14 demographic features such as income, housing, education, farm acreage, age, population, population change, and several others.

The first task is to measure the sum of squared error (SSE): for each cluster, the sum of the squared distances between the points in the cluster and the center of the cluster.
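A minimal sketch of that computation (the function name and the points-as-tuples format are my own, not the problem set's):

```python
def sum_squared_error(points, centroid):
    """Sum of squared Euclidean distances from each point to its cluster center."""
    return sum(
        sum((p - c) ** 2 for p, c in zip(point, centroid))
        for point in points
    )

# Two 2-D points around center (0, 0): squared distances are 1 and 4.
# sum_squared_error([(1, 0), (0, 2)], (0, 0)) -> 5
```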

Next, we look at the effects of increasing the number of initial clusters (K). If the number of clusters equaled the number of points in the dataset, the SSE would be zero and we would be finished. This, however, tells us little (if anything) about the relationships between points, so choosing a good value of K is important.

In the county dataset, we see that as the number of clusters (K) increases, the SSE is nearly always on a decreasing trend.

The final part of the problem is to weight certain features so as to cluster more tightly on poverty rate without using that feature in the distance measurement. Weighting all features except poverty ends at an SSE of about 44,000 with 125 clusters, but weighting only income and high school graduation rate gives a noticeable decrease, to about 31,000.
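Feature weighting can be sketched as scaling each squared difference by a per-feature weight, where a weight of zero drops that feature from the distance entirely. The function below is illustrative, not the assignment's code:

```python
def weighted_sse(points, centroid, weights):
    """SSE with each feature's squared difference scaled by a per-feature weight.
    A weight of 0 removes that feature from the distance entirely."""
    return sum(
        sum(w * (p - c) ** 2 for p, c, w in zip(point, centroid, weights))
        for point in points
    )

# Zero out the second feature (say, poverty rate) so it cannot drive clustering:
# weighted_sse([(1, 9), (0, 9)], (0, 0), (1, 0)) -> 1
```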

[Figure: sum squared error]

Python: Problem Set 9

The topic this week is optimization and clustering. This problem is less intensive than last week's, but it demonstrates the trade-off between finding an optimal solution (time-intensive) and an adequate solution to an optimization problem.

The point here is to optimize a student's schedule given two factors, workload and value, as defined in a given dictionary. E.g. in {'6.00': (10, 1)}, the class 6.00 is worth ten times its estimated workload. Naturally, there must be a constraint on our optimization, in this case the student's desired maximum workload; e.g. a maximum of 12 might return {'6.00': (10, 1), '6.01': (5, 4), '6.02': (5, 6)}. With a total workload of 11, no additional course could be added to the schedule without exceeding the maximum.

The initial optimizations went quickly, using three possible sorting functions: by work, by value, or by value/work ratio. The brute-force best-possible function was trickier. I first used the itertools.permutations method to calculate all possible solutions, intending to loop through them and pick one. This died on short_subject_list.txt, which has just 20 courses. My function works well enough on eight items, but I will have to check the staff solution to see how we are expected to handle 20.
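The three greedy sorts can be expressed as one function taking a sort key. The course dictionary below reuses the example from above, plus a made-up low-value course 6.03; the function name is mine:

```python
def greedy_schedule(subjects, max_work, key):
    """Greedily add courses in sorted order until max_work would be exceeded.
    `subjects` maps course name -> (value, work)."""
    schedule, load = {}, 0
    for name, (value, work) in sorted(subjects.items(), key=key):
        if load + work <= max_work:
            schedule[name] = (value, work)
            load += work
    return schedule

# 6.03 is a hypothetical extra course for illustration.
subjects = {'6.00': (10, 1), '6.01': (5, 4), '6.02': (5, 6), '6.03': (1, 8)}

# Least work first:
by_work = greedy_schedule(subjects, 12, key=lambda kv: kv[1][1])
# Highest value-to-work ratio first:
by_ratio = greedy_schedule(subjects, 12, key=lambda kv: -kv[1][0] / kv[1][1])
# Both pick 6.00, 6.01 and 6.02 here, for a total workload of 11.
```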

EDIT: OCW has a solution for PS9, but it is for a completely different assignment. Still, I realized I need combinations of courses (unordered subsets), not permutations (orderings) of each possible schedule. That makes a sample size of twenty courses manageable: 2^20 subsets instead of 20! orderings.
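The combinations approach can be sketched like this: enumerate subsets of every size and keep the feasible subset with the highest total value. This is an illustrative version, not the staff solution:

```python
from itertools import combinations

def brute_force_schedule(subjects, max_work):
    """Examine every subset of courses (combinations of all sizes) and keep
    the feasible one with the highest total value: 2**n subsets to check,
    instead of n! orderings with permutations."""
    best, best_value = {}, 0
    names = list(subjects)
    for r in range(1, len(names) + 1):
        for combo in combinations(names, r):
            work = sum(subjects[n][1] for n in combo)
            value = sum(subjects[n][0] for n in combo)
            if work <= max_work and value > best_value:
                best = {n: subjects[n] for n in combo}
                best_value = value
    return best

# 6.03 is a hypothetical extra course for illustration.
subjects = {'6.00': (10, 1), '6.01': (5, 4), '6.02': (5, 6), '6.03': (1, 8)}
best = brute_force_schedule(subjects, 12)  # drops only the low-value 6.03
```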

Python: Problem Set 8

Problem set 8 built on the virus and patient model created in PS7. The main task is comparing patients on different medication regimens. In part 2, we compare the results of patients prescribed ‘guttagonal’ against a control group that receives no medication.
[Figure: guttagonal graph]

In part five, we see very similar-looking graphs for two groups of patients: group A receives one prescription followed, after a delay, by a second, while group B receives two antivirals simultaneously. What the graphs do not show is that all 100 patients in group A are virus free by the end of the trial, while group B still has 20-25 patients with single-digit virus counts.

Python: Problem Set 7

Problem 7 simulates the growth of a virus in a subject. Because each virus particle has some odds of reproducing and some odds of clearing itself, the population varies greatly up and down. In this simulation, the graph shows that the virus does not exceed 15% of host cells. This is due to the inverse relationship between the virus population density and the likelihood of reproduction: self.maxBirthProb * (1 - popDensity). This would not be so bad for the virus if it were not for the 5% probability that each virus particle clears itself.
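The reproduction check amounts to a single Bernoulli trial against that product; a minimal sketch, with function and parameter names of my own choosing:

```python
import random

def virus_reproduces(max_birth_prob, pop_density):
    """One Bernoulli trial: a particle reproduces with probability
    maxBirthProb * (1 - popDensity), so a fuller host means fewer births."""
    return random.random() < max_birth_prob * (1 - pop_density)

# At full density the probability is 0; at zero density it is max_birth_prob.
```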

[Figure: virus simulation]

Problem 7b computationally shows the odds of rolling a Yahtzee: five six-sided dice all landing on the same number. Mathematically we know this is 6/7776, the number of desired outcomes over the number of possible outcomes. By running a number of trials (20) with increasing numbers of dice rolls, we can see that as the rolls exceed 100K and grow toward 2^20, the mean of the resulting ratios comes very close to the number we expect, 0.00077.
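A sketch of one such trial, comparing a simulated ratio against the exact probability (the function, seed, and roll count are my own choices, not the assignment's):

```python
import random
from fractions import Fraction

# Exact probability: 6 favourable outcomes out of 6**5 = 7776 equally likely rolls.
EXACT = Fraction(6, 6 ** 5)  # 1/1296, about 0.00077

def yahtzee_ratio(num_rolls, seed=0):
    """Estimate P(five dice show the same face) over num_rolls simulated rolls."""
    rng = random.Random(seed)
    hits = sum(
        1 for _ in range(num_rolls)
        if len({rng.randint(1, 6) for _ in range(5)}) == 1
    )
    return hits / num_rolls
```

With enough rolls, the estimate settles near float(EXACT); with only a few thousand it still bounces around, which is the point the problem set makes.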


Python: Problem Set 6

Problem set 6 involves modeling robots that clean rooms. We have a room object with tiles to be cleaned and robots that “move” and mark tiles as cleaned. All robots begin at a random position in the room, pointed in a random direction. Standard robots move in the same direction until reaching a wall, then pick a new direction. Random-walk robots pick a new direction on every step, which looks like spinning when visualized with Tkinter. (The visualization module is provided as a helper file with the lesson, not written by me.) It came in handy for figuring out that I was not flooring the robot's position, either when passing it to or getting it from the tile-cleaning code. It was clear once I could see that robots would clean while moving straight along one axis, but not while at a decimal position “between” the tiles. They could clean the grout, but not the tile :)
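The flooring fix amounts to mapping a continuous position to the integer tile that contains it; a sketch (the function name is mine):

```python
import math

def tile_from_position(x, y):
    """Map a robot's continuous position to the integer tile containing it.
    Without this floor, a robot at (2.3, 4.7) never matches tile (2, 4):
    it cleans the grout, but not the tile."""
    return (math.floor(x), math.floor(y))

# tile_from_position(2.3, 4.7) -> (2, 4)
```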

The second part of the problem is to measure how long both types of robots take to clean a room. Graphing is done in Pylab. The random-walk robot takes more than twice as long to clean 80% of the tiles in a square room.