Hadoop

Calvin (Deutschbein)

Week 03

Cloud

Announcements

  • Welcome to DATA-599: Cloud Computing!
  • This is a half lecture demo - the other half is a professional development event
    • The second homework, "Fold", is due this week at 6 PM on Wed/Thr (now).
      • Looked good so far!
    • This week is a 'non-standard homework' - I'm asking you just to run Hadoop locally and send me a README.md (via email) of what you did.
      • Today in class (ideally) we will go over how I got it working.
      • Sample README.md

Today

  • Hadoop in Docker walkthrough/code-along
  • We might as well use Hadoop once
    • With a live Hadoop, we can use Hadoop streaming with Python (later)
    • This is good bash/docker practice
    • Can say you actually used it on your resume (very cool!)
  • Hadoop seems... hard.
    • I Frankenstein'ed 3 repos and 2 OSes to get a local run through.
    • I never managed to reproduce my one successful run on plain Windows
    • Hadoop isn't intended to run on your device, and you can tell.
    • We suffer together 🙏
  • What follows are my notes, which mostly are correct probably maybe.

0. Start

  • Make a working directory, somewhere
  • Linux>Windows I think
    mkdir running-hadoop
    cd running-hadoop
  • There's no real wrong way to do this, but you'll want a clear name in case there's a disaster and you need to separate instances (as best you can).

1. Clone

  • Clone the big-data-europe fork for commodity hardware.
  • This was for Apple people but it ran smoothly on Windows/Linux (ha, smoothly) for me.
    git clone https://github.com/wxw-matt/docker-hadoop.git
    cd docker-hadoop
  • Once you have it locally, poke around a bit.
    • There's a readme we won't follow exactly but will reference.
    • There's a variety of docker files.
    • There are some jars - "Java ARchive" files - that interface smoothly with Hadoop.

2. Docker Test

  • Test Docker
    • I didn't have my Docker running and had to start it.
    docker run hello-world
  • I used Docker both locally on Windows and within Windows Subsystem for Linux (WSL).
  • I do not have an Apple device, but this was allegedly written for Apple.
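  • The compose steps below also assume the docker-compose CLI is available, so it is worth checking that too (on newer Docker installs it may ship as the "docker compose" plugin instead):

    docker-compose --version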

3. File Editing

  • Change version - we need to use docker-compose-v3, but the docker-compose command will look for an unversioned file.
    mv docker-compose.yml docker-compose-v1.yml
    mv docker-compose-v3.yml docker-compose.yml
    • I am pretty sure I mislabelled this
      • I believe the "unversioned" one is actually version 2, but I don't know what the versions refer to so...
    • Also: this will draw two warnings - you can fix the files or ignore both.
      • It's good practice to read the warnings, think about them, and try to fix them, but is not a focus of this class.
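  • One extra sanity check (my suggestion, not part of the original repo instructions): docker-compose can parse and print back the file it will actually use, which catches rename mistakes and may surface those same warnings up front.

    docker-compose config | head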

4. Compose

  • Bring the Hadoop containers up (took me 81.6s).
    • I had previously pulled other hadoop images though so your time may vary.
    docker-compose up -d
  • This will be slow and awkward and should be.
    • We should be doing this in a cluster (traditionally, for Hadoop, on-premises servers)
    • More recently, we should be doing this in AWS/GCP/Azure.
    • That said, Hadoop is intended to be testable on a personal device.
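  • To confirm the containers actually came up (and stayed up), something like the following works; "namenode" here is whatever the compose file names that service:

    docker-compose ps
    docker-compose logs namenode | tail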

5. Namenode

  • In a Hadoop cluster, we mostly work within the namenode.
  • We can think of a namenode just like any other Linux system (basically)

    I used "docker ps" to list the nodes, then picked the one with name in it. For me:

    CONTAINER ID   IMAGE                                                     NAMES
    aadf77229476   bde2020/hadoop-namenode:2.0.0-hadoop3.2.1-java8           namenode
    e7ebe649b320   bde2020/hadoop-resourcemanager:2.0.0-hadoop3.2.1-java8    resourcemanager
    4ece65b07145   bde2020/hadoop-datanode:2.0.0-hadoop3.2.1-java8           datanode
    65ce5597dde4   bde2020/hadoop-historyserver:2.0.0-hadoop3.2.1-java8      historyserver
    9d289fd85a48   bde2020/hadoop-nodemanager:2.0.0-hadoop3.2.1-java8        nodemanager

    So I used:

    docker exec -it namenode /bin/bash

    I also tested `hello world` here:

    echo hello world
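    You can also run one-off commands from the host without opening an interactive shell, which becomes handy if you later want to script any of this:

    docker exec namenode echo hello world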

6. Container Filesystem

  • While within the node, we will need directories for data, for compute, and for results.
  • I created those under a new 'app' directory.

    mkdir app
    mkdir app/data
    mkdir app/res
    mkdir app/jars
  • These aren't mandatory, but it is much easier to work with them.
  • Basically Hadoop needs:
    • Input data
    • Some kind of mapping and reducing specification
    • Somewhere to place output data
  • These are our "data", "res", and "jars" folders.

7. Fetch Data

  • 'hello world' for 'map reduce' is 'word count' so we get some words to count.
  • You can run over something trivial if you like:
  • echo hi > /app/data/hi.txt
  • I grabbed some books.
  • I got two of my favorite books and also Wuthering Heights from Project Gutenberg in plaintext format like so:

    cd /app/data
    curl https://raw.githubusercontent.com/cd-public/books/main/pg1342.txt -o austen.txt
    curl https://raw.githubusercontent.com/cd-public/books/main/pg84.txt -o shelley.txt
    curl https://raw.githubusercontent.com/cd-public/books/main/pg768.txt -o bronte.txt
  • It should be easy enough to find other text files on the Internet, but I used these three.
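  • If you go the trivial route from above, a file with a couple of lines and a repeated word lets you predict the counts exactly (here "the" should come out as 2):

    printf 'the quick brown fox\nthe lazy dog\n' > /app/data/tiny.txt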

8. Check Data

  • Before going further, I verified that I had files of some size:

    ls -al

    I got:

    total 1884
    drwxr-xr-x 2 root root   4096 May 28 19:42 .
    drwxr-xr-x 5 root root   4096 May 28 19:38 ..
    -rw-r--r-- 1 root root 772420 May 28 19:41 austen.txt
    -rw-r--r-- 1 root root 693877 May 28 19:42 bronte.txt
    -rw-r--r-- 1 root root 448937 May 28 19:42 shelley.txt
  • The byte size is between the second "root" and the date - and all three are nonzero and similarish in size (as each is a novel).
  • If you wanted to take a look at the text itself, what might you do?
    • cat
    • head
    • vi
    • grep
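  • For example (assuming these standard tools are present in the container):

    head -n 5 /app/data/austen.txt              # peek at the opening lines
    grep -ci "heathcliff" /app/data/bronte.txt  # count lines mentioning Heathcliff
    wc -w /app/data/*.txt                       # rough word totals, handy to compare against Hadoop later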

9. Fetch Compute

  • Fetch a compute jar to app/jars
  • We are using WordCount.jar
    • which is helpfully provided by somebody???
      • (there's a lot of things called "wordcount.jar" and they're binary so hard to track)
    • ...but most critically is in the repo.
    • We can grab from the local file system (boring)...
      • From outside the container...
      docker cp .\jobs\jars\WordCount.jar namenode:/app/jars/WordCount.jar
    • Or we just curl again (dangerous, bad, unsupported):

      cd /app/jars
      curl https://github.com/wxw-matt/docker-hadoop/blob/master/jobs/jars/WordCount.jar -o WordCount.jar
    • In general, you should not just pull executable files off of the internet and run them, but we are running this in a container, which provides a modicum of safety.
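    • One quick sanity check before moving on: a jar is just a zip archive, so its first bytes should be "PK". If you see "<html" instead, the curl route handed you a GitHub web page (the /blob/ URL tends to do that) rather than the jar itself - fall back to docker cp or the raw.githubusercontent.com form of the URL.

      head -c 200 /app/jars/WordCount.jar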

A. "Hash"

  • We must load data into "HDFS"
  • Hadoop can only read and write to something called the Hadoop Distributed File System.
    • It's approximately a hash table that works well in data centers.
  • We need to move data from the Linux file system into the Hadoop file system. We use the "hdfs" commands.

    cd /
    hdfs dfs -mkdir /test-1-input
    hdfs dfs -copyFromLocal -f /app/data/*.txt /test-1-input/
  • This segment is fraught with peril:
    • If the network was misconfigured, copyFromLocal will fail, and that will be the first time you find out something is wrong.
      • It will say something about not reaching a data node.
    • If you did partial previous runs, some files may exist in hdfs already.
      • Hash table deletion is hard, so hdfs deletion is hard.
      • Just use new names (why I used "test-n-*")
    • You should check things with e.g. "hdfs dfs -cat" - play around.
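  • For example, this is the kind of poking around I mean (all standard hdfs commands):

    hdfs dfsadmin -report | head -n 20    # are any datanodes actually live?
    hdfs dfs -ls /test-1-input            # did the files land?
    hdfs dfs -du -h /test-1-input         # sizes should roughly match the earlier ls -al
    hdfs dfs -cat /test-1-input/austen.txt | head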

B. Run

  • Run Hadoop/MapReduce
  • hadoop jar /app/jars/WordCount.jar WordCount /test-1-input /test-1-output
    • It needs a map/reduce jar and a main class - the "/app/jars/WordCount.jar WordCount" part. Writing one ourselves would make sense if we knew Java, I think.
    • It needs some input key/values as an HDFS folder: /test-1-input
    • It needs to know where I will look for output.
  • At a high level of abstraction, this is just:

    test_in <- c(austen, shelley, bronte)   # imagine these are already defined
    test_out <- lapply(test_in, wordcount)  # imagine wordcount is already defined
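  • Before reading anything out, it is worth checking that the job actually wrote something; a successful MapReduce run leaves part-r-* files plus an empty _SUCCESS marker:

    hdfs dfs -ls /test-1-output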

C. Read out

  • To read, we copy results out of hdfs
    • You can use "hdfs dfs -cat" but it's unwieldy
    hdfs dfs -copyToLocal /test-1-output /app/res/
  • See the results!
  • head /app/res/test-1-output/part-r-00000
  • Mine looked like this:

    #1342]    1
    #768]     1
    #84]      1
    $5,000)   3
    &         1
    ($1       3
    (801)     3
    (By       1
    (Godwin)  1
    (He       1
  • Well, we probably have more work to do.
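  • Since the output is tab-separated "token count" pairs, sorting on the count field at least surfaces the most frequent tokens while we think about better tokenization:

    sort -k2 -nr /app/res/test-1-output/part-r-00000 | head -n 20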

Fin

  • I did the setup around 20 times and it worked about twice (so roughly 10%), and I couldn't figure out why it worked sometimes and not others.
  • Once set up, it always worked.
  • Here are some tricks:
  • I believe my failures can be traced back to errors translating Linux paths to Windows paths, but I did have one successful run on "pure" Windows so it has to be possible.

Extensions

  • So you got Hadoop running, huh.
  • Good job!
  • Would be a shame if there was more work yet to do...
    • Learn Java to make new jars
    • Learn Hadoop Streaming to use Python
      • Read More: Hadoop Streaming
      • I think with the data nodes already set up, we just need a streaming jar and some .py files (rough sketch at the end of this section)
    • Automate
      • Since I had a high variance workflow, I didn't script everything.
      • Look at e.g. 'hdfs'
      • Can you script Hadoop from a host machine with "docker cp" etc.?
    • Encapsulate
      • Can you package a container to run automatically on a remote server?
  • Email me a README.md before next class.
  • Sample "README.md"
  • Sample ideas: Section 2.3 of MapReduce
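  • A minimal Hadoop Streaming sketch, assuming the image ships the standard hadoop-streaming jar somewhere under the Hadoop install (the find is a blunt way to locate it, and the cat/wc mapper/reducer pair is the classic smoke test from the streaming docs, not a real word count):

    STREAMING_JAR=$(find / -name 'hadoop-streaming*.jar' 2>/dev/null | head -n 1)
    hadoop jar "$STREAMING_JAR" \
        -input /test-1-input \
        -output /test-2-output \
        -mapper /bin/cat \
        -reducer /usr/bin/wc

  • Swapping in mapper.py and reducer.py for cat and wc (and shipping them to the nodes - the streaming docs cover the -file/-files options) is where the Python word count would go.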