I wanted to analyze election donation data from the 2016 and 2020 presidential cycles to see how donors shifted between candidates. To do so, I looked at all cases in which a maximum individual donation was made under the same name in both 2016 and 2020, then plotted them in a heatmap.
See the source code here and learn more about me, the author, here.
# get key, value from line
def get_kv(l):
    # donor is column 13 and donee is column 1 (0-indexed),
    # but commas inside quotes mean we cannot simply split on ','
    key, val, commas, quotes = "", "", 0, False
    for c in l:
        if c == '"':
            quotes = not quotes
        elif c == ',' and not quotes:
            commas += 1
        elif commas == 13:
            key += c
        elif commas == 1 and (c.isupper() or c.isspace()):
            val += c
        elif commas == 14:
            # shorten val: cut at the first filler word found
            for f in ["FOR","PRES","AMER"]:
                if f in val:
                    val = val[:val.find(f)]
            for r in ["FRIENDS OF","COMMITTEE","TO ELECT","INC","CAMPAIGN"]:
                val = val.replace(r,"")
            # strip() returns a new string, so keep the result
            val = val.strip()
            return key, val
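For comparison, here is a minimal sketch of the same column extraction using the standard library's csv module instead of hand-rolled quote tracking. The sample row is made up, the column positions (1 for donee, 13 for donor) are assumed from get_kv() rather than checked against the real files, and get_kv_csv is a hypothetical name. It skips the name-shortening step.

```python
import csv
import io

# Made-up sample row with 15 columns; only columns 1 and 13 matter here.
row = ["x"] * 15
row[1] = "HILLARY FOR AMERICA"
row[13] = "DOE, JANE"          # comma inside the field forces quoting
buf = io.StringIO()
csv.writer(buf).writerow(row)
line = buf.getvalue()

def get_kv_csv(line):
    # csv.reader handles quoted commas, so no manual quote flag is needed
    fields = next(csv.reader([line]))
    return fields[13], fields[1]

key, val = get_kv_csv(line)
print(key, "|", val)
```

The trade-off is one import versus a character loop; the loop in get_kv() also filters the donee to uppercase letters and spaces as it goes, which this sketch does not.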
I did learn something here, which is that it would probably help transparency a lot if the names of donees had to match the names of the candidates on behalf of whom they collect donations.
The data set was approximately 100 megabytes, encoded as .csv files. I read the files line by line into a binary search tree that used the donor name as the key and stored the donees as values in two sets, one for 2016 and one for 2020. I used get_kv() as my function to read lines.
This is where you would use the library of the week if you wanted to use a library.
# a bst is a list - less, more, key, vals
# key is a donor
# vals is a year indexed list of sets of donees
# file line to bst entry
def insert(data, tree, func, year):
    # use func to read a line of data from file
    key, val = func(data)
    # traverse
    while tree and key != tree[2]:
        tree = tree[key > tree[2]]
    if tree:
        # add value to key
        tree[3][year].add(val)
    elif not year:
        # add key to tree
        tree.extend([[],[], key, [{val},set()]])
I stored the subtrees first to allow for boolean indexing (the comparison key > tree[2] evaluates to 0 or 1 when used as an index) and separated donees by year to track cycle-to-cycle changes. I had not originally planned to track multiple donees per year, but while working through the data I found it quite common for a donor to give to multiple donees per cycle and wanted to capture this.
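To make the boolean indexing concrete, here is a small sketch that repeats insert() so it runs on its own and feeds it a few made-up lines. The reader split_pair and the "donor,donee" line format are hypothetical stand-ins for get_kv() and the real csv rows.

```python
# Same logic as insert() above, repeated so the sketch is self-contained.
def insert(data, tree, func, year):
    key, val = func(data)
    # tree[False] is the "less" subtree, tree[True] is the "more" subtree
    while tree and key != tree[2]:
        tree = tree[key > tree[2]]
    if tree:
        tree[3][year].add(val)
    elif not year:
        tree.extend([[], [], key, [{val}, set()]])

# Hypothetical reader for "donor,donee" lines, standing in for get_kv()
def split_pair(line):
    donor, donee = line.strip().split(",")
    return donor, donee

tree = []
for line in ["M,HILLARY", "A,BERNIE", "Z,TRUMP", "M,BERNIE"]:
    insert(line, tree, split_pair, 0)      # 2016 pass builds the tree
insert("M,BIDEN", tree, split_pair, 1)     # 2020 pass fills the second set
insert("Q,BIDEN", tree, split_pair, 1)     # no 2016 entry, so it is ignored

print(tree[2], tree[3])         # root donor and its per-year donee sets
print(tree[0][2], tree[1][2])   # donors in the less and more subtrees
```

Because new keys are only added when year is 0, donors who appear only in the second file never enter the tree.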
I did need to be able to flatten the tree back into a list, so I used a simple recursive utility to do so.
def tolist(tree):
    if tree:
        # we have two sets and want to add all pairs between sets
        l = [(one, two) for one in tree[3][0] for two in tree[3][1]]
        return tolist(tree[0]) + l + tolist(tree[1])
    return []
Here I swapped from lists to tuples for the internal data (cross-cycle donee pairs), as I wanted to be able to make sets of these pairs while debugging. Lists cannot be hashed and therefore cannot be added to a Python set, so this was the first use of a Python type other than a list to store data, and it rounds out coverage of the list-tuple-set triad.
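A quick sketch of the hashing point, using made-up pairs: tuples can go into a set, lists cannot.

```python
pairs = [("BERNIE", "BIDEN"), ("BERNIE", "BIDEN"), ("HILLARY", "BIDEN")]
unique = set(pairs)        # tuples are hashable, so duplicates collapse
print(len(unique))         # 2

try:
    set([["BERNIE", "BIDEN"]])   # a list element cannot be hashed
    failed = False
except TypeError as err:
    failed = True
    print("TypeError:", err)
```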
Next I counted the pairs in a nested dictionary keyed by 2016 donee and then by 2020 donee, so that a count can be looked up as d["TRUMP"]["BIDEN"]. So in all, my code goes csv->bst->list->dictionary. I also found I had to take roughly the log of the log of the frequencies, using bit_length, because strong herding effects around the winning nominees made movement between primary candidates almost immeasurably small, especially at the point of visualization.
# csv -> bst
csvs = ["2016_2700s.csv","2020_2800s.csv"]
tree = []
for n in [0,1]:
    # open each file in a with block so it is closed after reading
    with open(csvs[n],"r") as f:
        for line in f:
            insert(line, tree, get_kv, n)
# bst -> list
dnrs = [donr for donr in tolist(tree) if all(donr)] # keep pairs with a donee name in both cycles
# list -> dict
sxtn, twnt = set([donr[0] for donr in dnrs]), set([donr[1] for donr in dnrs])
d = { s : { t : 0 for t in twnt } for s in sxtn } # init to zero
for donr in dnrs:
    d[donr[0]][donr[1]] += 1 # compute frequencies
# frequency -> log of log of frequency
d = { s : { t : d[s][t].bit_length().bit_length() for t in twnt } for s in sxtn }
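To see how hard the double bit_length call compresses the counts, here is a small sketch over a few made-up frequencies. For a positive integer n, n.bit_length() equals floor(log2(n)) + 1, so applying it twice is only an integer approximation of a log of a log. The helper name loglog is hypothetical.

```python
def loglog(n):
    # integer stand-in for log2(log2(n)); 0 stays 0
    return n.bit_length().bit_length()

for n in [0, 1, 10, 1000, 100000]:
    print(n, n.bit_length(), loglog(n))
# 0 0 0
# 1 1 1
# 10 4 3
# 1000 10 4
# 100000 17 5
```

Five orders of magnitude collapse into the values 0 through 5, which is what lets small primary-to-primary flows show up next to the nominees on one color scale.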
Once I had these dictionaries on hand, a few shortcomings emerged. First, the data was not sorted in any way. Visually, Bernie has far more in common with Hillary and Trump than with Santorum and McAfee, mainly due to the sheer quantity of donors. So I ordered donees along both axes by number of donors. This also showed that some trailing candidates only shared donors with eventual nominees. I chose to take these candidates, such as McMullin, out of the dataset.
noms = ["HILLARY", "DONALD J TRUMP", "BIDEN"]
# filter 20s candidates who only received co-donations with nominees
innr = [i for i in d[next(iter(d))] if sum([d[o][i] for o in d if o not in noms])]
# rank 20s candidates donor counts
innr = sorted(innr, key = lambda i : sum([d[o][i] for o in d]))[::-1]
# filter 16s candidates who only received co-donations with nominees
outr = [o for o in d if sum([d[o][i] for i in innr if i not in noms])]
# sort inner and outer dictionaries
d = { o : { i : d[o][i] for i in innr } for o in outr }
d = dict(sorted(list(d.items()), key = lambda x : sum(x[1].values()))[::-1])
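The filter and rank steps are easier to follow on a toy table. This is a sketch with placeholder donee names and made-up counts, not values from the real data. Outer keys play the role of 2016 donees and inner keys the role of 2020 donees.

```python
# Made-up counts: d[donee_2016][donee_2020]
d = {
    "NOM16":   {"NOM20": 9, "X20": 2, "Y20": 1},
    "SMALL16": {"NOM20": 1, "X20": 3, "Y20": 0},
    "TINY16":  {"NOM20": 1, "X20": 0, "Y20": 0},
}
noms = ["NOM16", "NOM20"]

# keep 2020 donees that share donors with at least one 2016 non-nominee
innr = [i for i in d["NOM16"] if sum(d[o][i] for o in d if o not in noms)]
# rank them by total count, largest first
innr = sorted(innr, key=lambda i: sum(d[o][i] for o in d), reverse=True)
# keep 2016 donees that share donors with at least one 2020 non-nominee
outr = [o for o in d if sum(d[o][i] for i in innr if i not in noms)]

print(innr)   # ['NOM20', 'X20']
print(outr)   # ['NOM16', 'SMALL16']
```

Y20 drops out because its only shared donors come from the 2016 nominee, and TINY16 drops out because its only shared donors went to the 2020 nominee.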
import pandas
import seaborn
import matplotlib.pyplot as plt
df = pandas.DataFrame.from_dict(d)
plt.rcParams['figure.figsize'] = [10, 10]
plt.rcParams['figure.dpi'] = 100
plt.tick_params(axis='both', which='major', labelsize=10, labelbottom=False,
                bottom=False, top=False, left=False, labeltop=True)
plt.style.use("dark_background")
p = seaborn.heatmap(df, cmap=seaborn.color_palette("mako", as_cmap=True))
p.get_figure().savefig("heat_final.png",bbox_inches='tight',transparent=True)
I did touch up a bit with Glimpse at the end. I converted the black font via color exchange to match the HTML text color, and placed the image on a black background for distribution outside of the host page. I also clipped the tick marks from the legend as they were unhelpfully quantitative.