Calvin (Deutschbein)
Week 02
Cloud
My first successful cloud servers were all containerized.
Spark, our current leader, is built on Scala, a language from 2004. (R is '93, Python '91, JavaScript '95)
I've heard CaaS called "IaaS for people who like containers" which I think is probably 80% true.
I hear "The Salesforce Platform is the world’s number one Platform as a Service (PaaS) solution" (from Salesforce mostly).
AWS Lambda, Google Cloud Functions, Azure Functions
>>> def my_map(f, l):
...     return [f(i) for i in l]
...
>>> my_map(print, range(10))
0
1
2
3...
One implementation of a Python "map" function.
public interface Map<K,V>
This interface takes the place of the Dictionary class, which
was a totally abstract class rather than an interface.
Python has 'dict' instead of the Map "interface"
*Hadoop, the open-source MapReduce framework I learned, was written in Java. It looks like Google MapReduce was written in C++, but the code is not public. Map is also not called map in C++...
Java "Map"
>>> dct = {'a':1,'b':2,'c':3}
>>> dct['a']
1
>>> type(dct)
<class 'dict'>
RECALL
Introduction to HashTables
A hash table, also known as a hash map, is a data structure that stores key-value pairs.
# Something like
>>> hash_table = new_ht()
>>> hash_table.add('key1', 'value1')
>>> hash_table.add('key2', 'value2')
>>> hash_table.get('key1')
value1
>>> table = [['a','b','c'],['d','e','f'],['g','h','i']]
>>> table[1]
['d','e','f']
>>> hash(1)
1
>>> hash((1,2,3))
529344067295497451
>>> hash(map)
8795890005933
>>> hash('hi world')
-8782460609096434835
It's called hashing because you mash up the information together just like when cooking hash.
>>> hash(10 ** 20)
848750603811160107
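Ideally, a hash function doesn't just hand back its input when the input is an integer: Python's hash() does for small integers like 1, but not for large ones like 10 ** 20.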
hash_table = [None for _ in range(10)]
hash_table[hash('key1') % len(hash_table)] = ('key1', 'value1')
hash_table[hash('key2') % len(hash_table)] = ('key2', 'value2')
print(hash_table)
[None, None, None, ('key2', 'value2'), None, None, None, None, None, ('key1', 'value1')]
>>> print(hash('key1'),hash('key2'))
7359185340662927529 -5201184838170105667
>>> hash_table = [[] for _ in range(10)]
>>> hash_table[hash(11) % len(hash_table)].append((11, 'eleven'))
>>> hash_table[hash(21) % len(hash_table)].append((21, 'twenty one'))
>>> hash_table
[[], [(11, 'eleven'), (21, 'twenty one')], [], [], [], [], [], [], [], []]
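With chaining, a lookup has to scan the chosen bucket for the matching key, because several keys can share one index. A minimal sketch, assuming the list-of-lists layout above; the helper names `bucket_put` and `bucket_get` are hypothetical and not from the slides:

```python
def bucket_put(table, key, value):
    """Append (key, value) to the bucket chosen by the key's hash."""
    table[hash(key) % len(table)].append((key, value))


def bucket_get(table, key):
    """Scan the key's bucket and return the matching value, or None."""
    for k, v in table[hash(key) % len(table)]:
        if k == key:
            return v
    return None


hash_table = [[] for _ in range(10)]
bucket_put(hash_table, 11, 'eleven')
bucket_put(hash_table, 21, 'twenty one')  # collides with 11: both land in bucket 1

print(bucket_get(hash_table, 21))  # twenty one
print(bucket_get(hash_table, 31))  # None: bucket 1 is scanned, no key matches
```

The scan only touches one bucket, so it stays cheap as long as buckets stay short.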
>>> hash_table[hash('key1') % len(hash_table)][1]
'value1'
>>> import time
>>> start, hash_table, end = time.time(), [[] for _ in range(10 ** 7)], time.time()
>>> end-start
2.94462513923645
>>> start, found, end = time.time(), 7 in hash_table, time.time()
>>> end-start
0.1309800148010254
>>> start, val, end = time.time(), hash_table[hash(7) % len(hash_table)], time.time()
>>> end-start
0.0
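The 0.0 above likely reflects the resolution of `time.time()` rather than a lookup that costs nothing. A sketch of the same comparison using `time.perf_counter`, which is intended for short intervals; the table size is cut to 10 ** 5 here (an assumption, purely so it runs quickly), and the exact timings will vary by machine:

```python
import time

hash_table = [[] for _ in range(10 ** 5)]

# Linear scan: compares 7 against every bucket in turn.
start = time.perf_counter()
found = 7 in hash_table
scan_seconds = time.perf_counter() - start

# Hashed index: jumps straight to one bucket.
start = time.perf_counter()
val = hash_table[hash(7) % len(hash_table)]
index_seconds = time.perf_counter() - start

print(found, val)
print(f"scan: {scan_seconds:.6f}s, index: {index_seconds:.6f}s")
```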
hash_table = [None for _ in range(8)]
hash_table[hash('key1') % len(hash_table)] = ('key1', 'value1')
hash_table[hash('key2') % len(hash_table)] = ('key2', 'value2')
print(hash_table)
hash_table[hash('key1') % len(hash_table)] = None
print(hash_table)
[None, None, None, ('key2', 'value2'), None, None, ('key1', 'value1'), None]
[None, None, None, ('key2', 'value2'), None, None, None, None]
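def custom_hash(key):
    # a deliberately weak hash: every key of the same length collides
    return len(key)

hash_table = [[] for _ in range(10)]
# append to the bucket so colliding keys are kept instead of overwritten
hash_table[custom_hash('key1') % len(hash_table)].append(('key1', 'value1'))
hash_table[custom_hash('k2') % len(hash_table)].append(('k2', 'value2'))
print(hash_table)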
class HashTable:
    def __init__(self, size=10):
        # what do you need your table to be?
        self.table = None
    def _hash(self, key):
        # a custom hash - you may use hash() but need more
        pass
    def insert(self, key, value):
        # how can we insert a key-value pair to our table?
        pass
    def search(self, key):
        # given a key, how can we return its corresponding value
        pass
    def delete(self, key):
        # given a key, how can we remove it and its value
        pass
    def _resize(self, size):
        # OPTIONAL: We may need to resize from time to time... how?
        # Also... when would this be called?
        pass
def apply_f_to_everything_in(f, data):
    return [f(d) for d in data]
lapply(X, FUN, …)
X: a vector (atomic or list) or an expression object. Other objects (including classed objects) will be coerced by base::as.list.
FUN: the function to be applied to each element of X: see ‘Details’. In the case of functions like +, %*%, the function name must be backquoted or quoted.
lst = c(1,2,3)
double <- function(x)
{
    return(x * 2)
}
lapply(lst,double)
lapply(c(1,2,3), function(x) x * 2)
In R we use the keyword "function" instead of the keyword "lambda", and the function to apply is the second argument rather than the first, but the computation is identical.
>>> d = {'a':1,'b':2}
>>> map(lambda x: x * 2, d)
<map object at 0x00000231FB586A70>
Applying Python.map() to a Python.dictionary
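The map object printed above is lazy, and iterating a dict yields its keys, not its values. A short sketch of what that call actually computes, plus two ways to reach the values instead:

```python
d = {'a': 1, 'b': 2}

# map() is lazy; wrap it in list() to see the results.
# Iterating a dict yields its keys, so the lambda doubles 'a' and 'b'.
print(list(map(lambda x: x * 2, d)))           # ['aa', 'bb']

# To transform the values instead, iterate over d.values() ...
print(list(map(lambda x: x * 2, d.values())))  # [2, 4]

# ... or rebuild the dict with a comprehension to keep the keys.
doubled = {k: v * 2 for k, v in d.items()}
print(doubled)                                 # {'a': 2, 'b': 4}
```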
>>> squares = {1:1, 2:4, 3:9}
>>> squares[2]
4
>>> squared = lambda x: x * x
>>> squared(2)
4
>>> list(map(lambda x: squares[x],[1,2,3,2,1,2,3,2,1]))
[1, 4, 9, 4, 1, 4, 9, 4, 1]
>>> locs = {"Wed":"SLM", "Thr":"PDX"}
>>> locs["Wed"]
'SLM'
>>> def locr(d):
...     if d == "Wed":
...         return "SLM"
...     if d == "Thr":
...         return "PDX"
...
>>> locr("Wed")
'SLM'
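Since the dict and the function answer the same question, either one can be handed to `map`. A small sketch, assuming the `locs` and `locr` definitions above; the `days` list is made-up input:

```python
locs = {"Wed": "SLM", "Thr": "PDX"}


def locr(d):
    if d == "Wed":
        return "SLM"
    if d == "Thr":
        return "PDX"


days = ["Wed", "Thr", "Wed"]

# locs.get is the dict's lookup as a function; like locr, it gives None on a miss.
via_dict = list(map(locs.get, days))
via_func = list(map(locr, days))

print(via_dict)              # ['SLM', 'PDX', 'SLM']
print(via_dict == via_func)  # True
```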
We can take a look at the old Hadoop documentation...
The MapReduce framework operates exclusively on <key, value> pairs, that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.
The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
Input and Output types of a MapReduce job:
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
The documentation will later note that the "combine" stage is optional.
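The map, combine, reduce pipeline can be sketched in plain Python as a word count. This is not the Hadoop API: the function names `map_phase`, `combine`, and `reduce_phase` and the two input splits are all hypothetical, chosen only to mirror the stages in the diagram.

```python
from collections import defaultdict


def map_phase(line):
    """map: one input line -> list of <word, 1> pairs."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """combine: sum counts per key locally, within a single split."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def reduce_phase(all_pairs):
    """reduce: merge the combined pairs from every split into final counts."""
    totals = defaultdict(int)
    for key, value in all_pairs:
        totals[key] += value
    return dict(totals)


splits = ["a b a", "b c"]  # two input splits (made-up data)
combined = [combine(map_phase(s)) for s in splits]
flat = [pair for part in combined for pair in part]
print(reduce_phase(flat))  # {'a': 2, 'b': 2, 'c': 1}
```

Dropping the `combine` call and feeding the mapped pairs straight to `reduce_phase` gives the same counts, which is consistent with the stage being optional.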
# rolling average isn't independent
[(arr[i] + arr[i+1] + arr[i+2])/3 for i in range(len(arr)-2)]
>>> d1 = {1:1, 2:4}
>>> d2 = {3:9, 4:16}
>>> d1 + d2
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 'dict' and 'dict'
>>> {**d1, **d2}
{1: 1, 2: 4, 3: 9, 4: 16}
>>>
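The same unpacking merge extends to any number of partial results by folding over them, which is one way to picture a reduce step. A sketch using `functools.reduce`; the `partials` list is made-up data:

```python
from functools import reduce

partials = [{1: 1, 2: 4}, {3: 9, 4: 16}, {5: 25}]

# Fold the list left to right; on a duplicate key the later dict wins.
merged = reduce(lambda acc, d: {**acc, **d}, partials, {})
print(merged)  # {1: 1, 2: 4, 3: 9, 4: 16, 5: 25}
```

On Python 3.9 and later, `d1 | d2` performs the same two-dict merge.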
ht = HashTable()
ht.insert("Prof. Calvin","ckdeutschbein")
ht._table
array([None, None, None, None, list([('Prof. Calvin', 'ckdeutschbein')]),
None, None, None, None, None], dtype=object)
ht.search("Prof. Calvin"), ht.search("ckdeutschbein"), ht.search("Prof. Cheng")
('ckdeutschbein', None, None)
ht.insert("Prof. Calvin","ckdeutschbein") # returns nothing
ht.search("Prof. Calvin") # returns 'ckdeutschbein'
ht.delete("Prof. Calvin") # returns nothing
ht.search("Prof. Calvin") # returns nothing
extend_email = lambda x, y: (x,y+"@willamette.edu")
extend_email("Prof. Cheng", "hcheng")
('Prof. Cheng', 'hcheng@willamette.edu')
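Because `extend_email` takes a key and a value and returns a new pair, it can be applied to every entry of a table at once. A sketch using a plain dict in place of the HashTable class; the `usernames` dict is assembled from the names used above:

```python
extend_email = lambda x, y: (x, y + "@willamette.edu")

usernames = {"Prof. Calvin": "ckdeutschbein", "Prof. Cheng": "hcheng"}

# dict.items() yields (key, value) tuples; * unpacks each into x and y.
emails = dict(extend_email(*pair) for pair in usernames.items())
print(emails["Prof. Cheng"])  # hcheng@willamette.edu
```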