Data Structures for Data Analysis for Data Visualization

I wanted to analyze data on election donations in the 2016 and 2020 presidential cycles to see how donors shifted between candidates across those two cycles. To do so, I looked at all cases in which there was a maximum individual donation under the same name in both 2016 and 2020, and then plotted them in a heatmap.

See the source code here and learn more about me, the author, here.

The Data

I drew from FEC individual contributions. You can look at the data for yourself: it should be static, but was retrieved on 19 April 2022, and the originals are retained under version control. As the files are too large to be stored on GitHub, they are uploaded in parts, with the split/combine script also provided. You can also run on partial data, which generally appeared to be representative. I found the CSV lines to be extremely low in information density, and over time I refined a line reader to pull just the fields I needed into an internal data structure.
# get key (donor), value (donee) from a raw CSV line
def get_kv(l):
	# we need columns 1 (donee) and 13 (donor), but quoted fields hide commas
	key, val, commas, quotes = "", "", 0, False
	for c in l:
		if c == '\"':
			quotes = not quotes
		elif c == ',' and not quotes:
			commas += 1
		elif commas == 13:
			key += c
		elif commas == 1 and (c.isupper() or c.isspace()):
			val += c
		elif commas == 14:
			# shorten val: cut at boilerplate words, drop committee noise
			for f in ["FOR","PRES","AMER"]:
				if f in val:
					val = val[:val.find(f)]
			for r in ["FRIENDS OF","COMMITTEE","TO ELECT","INC","CAMPAIGN"]:
				val = val.replace(r,"")
			return key, val.strip()
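
To make the parsing concrete, here is get_kv() run on a hypothetical line shaped like the ones it expects; the names and values are invented for illustration, not real FEC data.

# illustrative only: column 1 is the committee, column 13 the quoted donor name
line = 'C00000000,DONALD J TRUMP FOR PRESIDENT INC,' + 'X,' * 11 + '"SMITH, JOHN",500\n'
print(get_kv(line))  # -> ('SMITH, JOHN', 'DONALD J TRUMP')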
I did learn something here, which is that it would probably help transparency a lot if the names of donees had to match the names of the candidates on behalf of whom they collect donations.

The Structure

The data set was approximately 100 megabytes in size, encoded as .csv files. I read the files line by line into a binary search tree keyed on donor name, with the donees stored as values in two sets, one for 2016 and one for 2020. I used get_kv() as my line-reading function.

This is where you would use the library of the week if you wanted to use a library.

# a bst node is a list - [less, more, key, vals]
# key is a donor
# vals is a year-indexed list of sets of donees

# file line to bst entry
def insert(data, tree, func, year):
	# use func to read a line of data from file
	key, val = func(data)
	# traverse: False/True index the less/more subtrees as 0/1
	while tree and key != tree[2]:
		tree = tree[key > tree[2]]
	if tree:
		# key exists - add value under its year
		tree[3][year].add(val)
	elif not year:
		# new keys are only created in year 0 (2016); 2020-only donors
		# are dropped, since only cross-cycle pairs matter later
		tree.extend([[],[], key, [{val},set()]])

I stored the subtrees first to allow for boolean indexing, and separated donees by year to track cycle-to-cycle changes. I had not originally planned to track multiple donees per year, but while working over the data I found it quite common for a donor to give to multiple donees in a cycle, and I wanted to capture this phenomenon.
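
As a small illustration of the node layout and the boolean indexing (the values here are invented):

# a node after one 2016 insertion - [less, more, key, [donees16, donees20]]
node = [[], [], "SMITH, JOHN", [{"DONALD J TRUMP"}, set()]]

# booleans index as integers, so a comparison picks the subtree directly
node["ZZZ" > node[2]]  # True == 1, the "more" subtree
node["AAA" > node[2]]  # False == 0, the "less" subtree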

I did need to be able to flatten the tree back into a list, so I used a simple recursive utility to do so.

def tolist(tree):
	if tree:
		# we have two sets and want all cross-cycle pairs between them
		l = [(one, two) for one in tree[3][0] for two in tree[3][1]]
		return tolist(tree[0]) + l + tolist(tree[1])
	return []
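
For example, a two-node tree (invented values) flattens like this; pair order within a node may vary, since sets are unordered:

# root "A" gave to Trump in 2016 and both nominees in 2020;
# its "more" child "B" went Hillary -> Biden
demo = [[], [[], [], "B", [{"HILLARY"}, {"BIDEN"}]], "A",
	[{"DONALD J TRUMP"}, {"BIDEN", "DONALD J TRUMP"}]]
tolist(demo)
# -> [('DONALD J TRUMP', 'BIDEN'), ('DONALD J TRUMP', 'DONALD J TRUMP'),
#     ('HILLARY', 'BIDEN')]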

Here I swapped from lists to tuples for the internal data (cross-cycle donee pairs), as I wanted to be able to make sets of these pairs while debugging. Since lists cannot be hashed and therefore cannot be added to a Python set, this was the first use of an appropriate Python type other than a list to store data, and it provides coverage over the list-tuple-set triad.
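
The distinction is easy to demonstrate:

{("HILLARY", "BIDEN")}  # fine - tuples are hashable
{["HILLARY", "BIDEN"]}  # TypeError: unhashable type: 'list'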

The Analysis

To look over the data, I thought it best to encode it in a two-dimensional dictionary indexed by donees, capturing donor frequency for each donee pair. To see how many 2016 Trump -> 2020 Biden donors there were, for example, I would then check d["TRUMP"]["BIDEN"]. So in all, my code goes csv -> bst -> list -> dictionary. I also found I had to take the log of the log of the frequencies, using bit_length(), because strong herding effects around the winning nominees make movement between primary candidates almost immeasurably small, especially at the point of visualization.
# csv -> bst
csvs = ["2016_2700s.csv","2020_2800s.csv"]
tree = []
for n in [0, 1]:
	with open(csvs[n]) as f:
		for line in f:
			insert(line, tree, get_kv, n)

# bst -> list
dnrs = [donr for donr in tolist(tree) if all(donr)] # drop pairs with a blank name; tolist() already guarantees both cycles

# list -> dict
sxtn, twnt = {donr[0] for donr in dnrs}, {donr[1] for donr in dnrs}
d = { s : { t : 0 for t in twnt } for s in sxtn } # init to zero
for donr in dnrs:
	d[donr[0]][donr[1]] += 1 # compute frequencies 

# frequency -> log of log of frequency
d = { s : { t : d[s][t].bit_length().bit_length() for t in twnt } for s in sxtn }
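
As a sanity check on the double bit_length() trick: bit_length() acts as a cheap integer log, since n.bit_length() == floor(log2(n)) + 1 for n > 0, so applying it twice squeezes heavy-tailed counts into a handful of bins.

for n in [1, 10, 1000, 1000000]:
	print(n, n.bit_length(), n.bit_length().bit_length())
# 1 1 1
# 10 4 3
# 1000 10 4
# 1000000 20 5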
Once I had these dictionaries on hand, a few shortcomings emerged. First, the data was not sorted in any way: visually, Bernie had far more in common with Hillary and Trump than with Santorum and McAfee, mainly due to the sheer quantity of donors. So I ordered donees along both axes by number of donors. This also showed that some trailing candidates only shared donors with the eventual nominees. I chose to take these candidates, such as McMullin, out of the dataset.
noms = ["HILLARY", "DONALD J TRUMP", "BIDEN"]

# filter 20s candidates who only received co-donations with nominees
innr = [i for i in d[next(iter(d))] if sum([d[o][i] for o in d if o not in noms])]

# rank 20s candidates donor counts
innr = sorted(innr, key = lambda i : sum([d[o][i] for o in d]))[::-1]

# filter 16s candidates who only received co-donations with nominees
outr = [o for o in d if sum([d[o][i] for i in innr if i not in noms])]

# sort inner and outer dictionaries
d = { o : { i : d[o][i] for i in innr } for o in outr }
d = dict(sorted(d.items(), key = lambda x : sum(x[1].values()))[::-1])
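
With the sorting done, the outer keys (2016 donees) and inner keys (2020 donees) both run from most to fewest donors. A quick spot check (the exact names depend on the data):

# spot check the ordering and one cell of the final dictionary
print(list(d)[:3])                       # top 2016 donees by donor count
print(list(next(iter(d.values())))[:3])  # top 2020 donees by donor count
print(d.get("DONALD J TRUMP", {}).get("BIDEN"))  # log-log 2016 Trump -> 2020 Biden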

The Visualization

I don't really know how Seaborn works, so I just ran it with default settings and looked up things that annoyed me. I wanted a transparent background and a nice color theme so the graph would display well with the website's HTML theme, had to render at high resolution so I could read the (many) labels, and copied a snippet from Stack Overflow that manipulated tick marks to move the labels from the bottom to the top of the x axis. Otherwise, I simply ran a heatmap over the dictionary and presented what I found!
import pandas
import seaborn
import matplotlib.pyplot as plt

df = pandas.DataFrame.from_dict(d)

# high resolution so the (many) labels stay legible
plt.rcParams['figure.figsize'] = [10, 10]
plt.rcParams['figure.dpi'] = 100

# move the axis labels from the bottom to the top of the x axis
plt.tick_params(axis='both', which='major', labelsize=10, labelbottom = False, 
                bottom=False, top = False, left = False, labeltop=True)
plt.style.use("dark_background")

p = seaborn.heatmap(df, cmap=seaborn.color_palette("mako", as_cmap=True))
p.get_figure().savefig("heat_final.png",bbox_inches='tight',transparent=True)

I did touch the image up a bit with Glimpse at the end. I converted the black font via color exchange to match the HTML text color, and placed the image on a black background for distribution outside of the host page. I also clipped the tick marks from the legend, as they were unhelpfully quantitative.