Hadoop

Calvin (Deutschbein)

Week 03

Cloud

Announcements

  • Welcome to DATA-599: Cloud Computing!
  • This is a half lecture demo - the other half is a professional development event
    • The second homework, "Fold", is due this week at 6 PM on Wed/Thr (now).
      • Looked good so far!
    • This week is a 'non-standard homework' - I'm asking you just to run Hadoop locally and send me a README.md (via email) of what you did.
      • Today in class (ideally) we will go over how I got it working.
      • Sample README.md

Today

  • Hadoop in Docker walkthrough/code-along
  • We might as well use Hadoop once
    • With a live Hadoop, we can use Hadoop streaming with Python (later)
    • This is good bash/docker practice
    • Can say you actually used it on your resume (very cool!)
  • Hadoop seems... hard.
    • I Frankenstein'ed 3 repos and 2 OSes to get a local run through.
    • I never managed to reproduce my one successful run on plain Windows
    • Hadoop isn't intended to run on your device, and you can tell.
    • We suffer together 🙏
  • What follows are my notes, which mostly are correct probably maybe.

0. Start

  • Make a working directory, somewhere
  • Linux>Windows I think
    mkdir running-hadoop
    cd running-hadoop
  • There's no real wrong way to do this, but you'll want a clear name in case there's a disaster and you need to separate instances (as best you can).

1. Clone

  • Clone the big-data-europe fork for commodity hardware.
  • This was for Apple people but it ran smoothly on Windows/Linux (ha, smoothly) for me.
    git clone https://github.com/wxw-matt/docker-hadoop.git
    cd docker-hadoop
  • Once you have it locally, poke around a bit.
    • There's a readme we won't follow exactly but will reference.
    • There's a variety of docker files.
    • There are some jars - "Java ARchive" files - that interface smoothly with Hadoop.

2. Docker Test

  • Test Docker
    • I didn't have my Docker running and had to start it.
    docker run hello-world
  • I used Docker both locally on Windows and within Windows Subsystem for Linux (WSL).
  • I do not have an Apple device, but this was allegedly written for Apple.
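  • The compose steps below also assume the docker-compose CLI is available, so it is worth checking that too (on newer Docker installs it may ship as the "docker compose" plugin instead):

    docker-compose --version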

3. File Editing

  • Change version - we need to use docker-compose-v3, but the docker-compose command will look for an unversioned file.
    mv docker-compose.yml docker-compose-v1.yml
    mv docker-compose-v3.yml docker-compose.yml
    • I am pretty sure I mislabelled this
      • I believe the "unversioned" one is actually version 2, but I don't know what the versions refer to so...
    • Also: this will draw two warnings - you can fix the files or ignore both.
      • It's good practice to read the warnings, think about them, and try to fix them, but is not a focus of this class.
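  • One extra sanity check (my suggestion, not part of the original repo instructions): docker-compose can parse and print back the file it will actually use, which catches rename mistakes and may surface those same warnings up front.

    docker-compose config | head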

4. Compose

  • Bring the Hadoop containers up (took me 81.6s).
    • I had previously pulled other hadoop images though so your time may vary.
    docker-compose up -d
  • This will be slow and awkward and should be.
    • We should be doing this in a cluster (traditionally, for Hadoop, on-premises servers)
    • More recently, we should be doing this in AWS/GCP/Azure.
    • That said, Hadoop is intended to be testable on a personal device.
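  • To confirm the containers actually came up (and stayed up), something like the following works; "namenode" here is whatever the compose file names that service:

    docker-compose ps
    docker-compose logs namenode | tail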

5. Namenode

  • In a Hadoop cluster, we mostly work within the namenode.
  • We can think of a namenode just like any other Linux system (basically)

    I used "docker ps" to list the nodes, then picked the one with name in it. For me:

    CONTAINER ID   IMAGE                                                     NAMES
    aadf77229476   bde2020/hadoop-namenode:2.0.0-hadoop3.2.1-java8           namenode
    e7ebe649b320   bde2020/hadoop-resourcemanager:2.0.0-hadoop3.2.1-java8    resourcemanager
    4ece65b07145   bde2020/hadoop-datanode:2.0.0-hadoop3.2.1-java8           datanode
    65ce5597dde4   bde2020/hadoop-historyserver:2.0.0-hadoop3.2.1-java8      historyserver
    9d289fd85a48   bde2020/hadoop-nodemanager:2.0.0-hadoop3.2.1-java8        nodemanager

    So I used:

    docker exec -it namenode /bin/bash

    I also tested `hello world` here:

    echo hello world
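    You can also run one-off commands from the host without opening an interactive shell, which becomes handy if you later want to script any of this:

    docker exec namenode echo hello world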

6. Container Filesystem

  • While within the node, we will need directories for data, for compute, and for results.
  • I created those under a new 'app' directory.

    mkdir app
    mkdir app/data
    mkdir app/res
    mkdir app/jars
  • These aren't mandatory, but it is much easier to work with them.
  • Basically Hadoop needs:
    • Input data
    • Some kind of mapping and reducing specification
    • Somewhere to place output data
  • These are our "data", "res", and "jars" folders.

7. Fetch Data

  • 'hello world' for 'map reduce' is 'word count' so we get some words to count.
  • You can run over something trivial if you like:
  • echo hi > /app/data/hi.txt
  • I grabbed some books.
  • I got two of my favorite books and also Wuthering Heights from Project Gutenberg in plaintext format like so:

    cd /app/data
    curl https://raw.githubusercontent.com/cd-public/books/main/pg1342.txt -o austen.txt
    curl https://raw.githubusercontent.com/cd-public/books/main/pg84.txt -o shelley.txt
    curl https://raw.githubusercontent.com/cd-public/books/main/pg768.txt -o bronte.txt
  • It should be easy enough to find other text files on the Internet, but I used these three.
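  • If you go the trivial route from above, a file with a couple of lines and a repeated word lets you predict the counts exactly (here "the" should come out as 2):

    printf 'the quick brown fox\nthe lazy dog\n' > /app/data/tiny.txt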

8. Check Data

  • Before going further, I verified that I had files of some size:

    ls -al

    I got:

    total 1884
    drwxr-xr-x 2 root root   4096 May 28 19:42 .
    drwxr-xr-x 5 root root   4096 May 28 19:38 ..
    -rw-r--r-- 1 root root 772420 May 28 19:41 austen.txt
    -rw-r--r-- 1 root root 693877 May 28 19:42 bronte.txt
    -rw-r--r-- 1 root root 448937 May 28 19:42 shelley.txt
  • The byte size is between the second "root" and the date - and all three are nonzero and similarish in size (as each is a novel).
  • If you wanted to take a look at the text itself, what might you do?
    • cat
    • head
    • vi
    • grep
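  • For example (assuming these standard tools are present in the container):

    head -n 5 /app/data/austen.txt              # peek at the opening lines
    grep -ci "heathcliff" /app/data/bronte.txt  # count lines mentioning Heathcliff
    wc -w /app/data/*.txt                       # rough word totals, handy to compare against Hadoop later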

9. Fetch Compute

  • Fetch a compute jar to app/jars
  • We are using WordCount.jar
    • which is helpfully provided by somebody???
      • (there's a lot of things called "wordcount.jar" and they're binary so hard to track)
    • ...but most critically is in the repo.
    • We can grab from the local file system (boring)...
      • From outside the container...
      docker cp .\jobs\jars\WordCount.jar namenode:/app/jars/WordCount.jar
    • Or we just curl again (dangerous, bad, unsupported):

      cd /app/jars
      curl https://github.com/wxw-matt/docker-hadoop/blob/master/jobs/jars/WordCount.jar -o WordCount.jar
    • In general, you should not just pull executable files off of the internet and run them, but we are running this in a container, which provides a modicum of safety.
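    • One quick sanity check before moving on: a jar is just a zip archive, so its first bytes should be "PK". If you see "<html" instead, the curl route handed you a GitHub web page (the /blob/ URL tends to do that) rather than the jar itself - fall back to docker cp or the raw.githubusercontent.com form of the URL.

      head -c 200 /app/jars/WordCount.jar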

A. "Hash"

  • We must load data into "HDFS"
  • Hadoop can only read and write to something called the Hadoop Distributed File System.
    • It's approximately a hash table that works well in data centers.
  • We need to move data from the Linux file system into the Hadoop file system. We use the "hdfs" commands.

    cd /
    hdfs dfs -mkdir /test-1-input
    hdfs dfs -copyFromLocal -f /app/data/*.txt /test-1-input/
  • This segment is fraught with peril:
    • If the network was misconfigured, copyFromLocal will fail, and that will be the first time you find out something is wrong.
      • It will say something about not reaching a data node.
    • If you did partial previous runs, some files may exist in hdfs already.
      • Hash table deletion is hard, so hdfs deletion is hard.
      • Just use new names (why I used "test-n-*")
    • You should check things with e.g. "hdfs dfs -cat" - play around.
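  • For example, this is the kind of poking around I mean (all standard hdfs commands):

    hdfs dfsadmin -report | head -n 20    # are any datanodes actually live?
    hdfs dfs -ls /test-1-input            # did the files land?
    hdfs dfs -du -h /test-1-input         # sizes should roughly match the earlier ls -al
    hdfs dfs -cat /test-1-input/austen.txt | head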

B. Run

  • Run Hadoop/MapReduce
  • hadoop jar /app/jars/WordCount.jar WordCount /test-1-input /test-1-output
    • It needs a map/reduce jar and a main class - the "/app/jars/WordCount.jar WordCount" part. Writing one ourselves would make sense if we knew Java, I think.
    • It needs some input key/values as an HDFS folder: /test-1-input
    • It needs to know where I will look for output.
  • At a high level of abstraction, this is just:

    test_in <- c(austen, shelley, bronte)   # imagine these are already defined
    test_out <- lapply(test_in, wordcount)  # imagine wordcount is already defined
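  • Before reading anything out, it is worth checking that the job actually wrote something; a successful MapReduce run leaves part-r-* files plus an empty _SUCCESS marker:

    hdfs dfs -ls /test-1-output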

C. Read out

  • To read, we copy results out of hdfs
    • You can use "hdfs dfs -cat" but it's unwieldy
    hdfs dfs -copyToLocal /test-1-output /app/res/
  • See the results!
  • head /app/res/test-1-output/part-r-00000
  • Mine looked like this:

    #1342]    1
    #768]     1
    #84]      1
    $5,000)   3
    &         1
    ($1       3
    (801)     3
    (By       1
    (Godwin)  1
    (He       1
  • Well, we probably have more work to do.
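  • Since the output is tab-separated "token count" pairs, sorting on the count field at least surfaces the most frequent tokens while we think about better tokenization:

    sort -k2 -nr /app/res/test-1-output/part-r-00000 | head -n 20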

Fin

  • I did the setup around 20 times and it worked about twice (so roughly 10%), and I couldn't figure out why it worked sometimes and not others.
  • Once set up, it always worked.
  • Here are some tricks:
  • I believe my failures can be traced back to errors translating Linux paths to Windows paths, but I did have one successful run on "pure" Windows so it has to be possible.

Extensions

  • So you got Hadoop running, huh.
  • Good job!
  • Would be a shame if there was more work yet to do...
    • Learn Java to make new jars
    • Learn Hadoop Streaming to use Python
      • Read More: Hadoop Streaming
      • I think with the data nodes already set up, we just need a streaming jar and some .py files (rough sketch at the end of this section)
    • Automate
      • Since I had a high variance workflow, I didn't script everything.
      • Look at e.g. 'hdfs'
      • Can you script Hadoop from a host machine with "docker cp" etc.?
    • Encapsulate
      • Can you package a container to run automatically on a remote server?
  • Email me a README.md before next class.
  • Sample "README.md"
  • Sample ideas: Section 2.3 of MapReduce
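  • A minimal Hadoop Streaming sketch, assuming the image ships the standard hadoop-streaming jar somewhere under the Hadoop install (the find is a blunt way to locate it, and the cat/wc mapper/reducer pair is the classic smoke test from the streaming docs, not a real word count):

    STREAMING_JAR=$(find / -name 'hadoop-streaming*.jar' 2>/dev/null | head -n 1)
    hadoop jar "$STREAMING_JAR" \
        -input /test-1-input \
        -output /test-2-output \
        -mapper /bin/cat \
        -reducer /usr/bin/wc

  • Swapping in mapper.py and reducer.py for cat and wc (and shipping them to the nodes - the streaming docs cover the -file/-files options) is where the Python word count would go.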