Monday, September 15, 2014

Build a virtually distributed environment on a single laptop.

We use LXC to achieve that.
Good tutorials on LXC:
http://en.community.dell.com/techcenter/os-applications/w/wiki/6950.lxc-containers-in-ubuntu-server-14-04-lts
http://en.community.dell.com/techcenter/os-applications/w/wiki/7440.lxc-containers-in-ubuntu-server-14-04-lts-part-2

http://wupengta.blogspot.com/2012/08/lxchadoop.html

golden tutorial:
http://www.kumarabhishek.co.vu/

Once you have installed LXC and created a container, you can find it under /var/lib/lxc.
Note that you have to be root to look inside:

gstanden@vmem1:/usr/share/lxc/templates$ cd /var/lib/lxc
bash: cd: /var/lib/lxc: Permission denied
gstanden@vmem1:/usr/share/lxc/templates$ sudo cd /var/lib/lxc
sudo: cd: command not found
gstanden@vmem1:/usr/share/lxc/templates$ sudo su
root@vmem1:~# cd /var/lib/lxc

#ifconfig -a
lxcbr0    Link encap:Ethernet  HWaddr fe:d3:07:23:4d:71  

          inet addr:10.0.3.1  Bcast:10.0.3.255  Mask:255.255.255.0

LXC creates this NATed bridge "lxcbr0" at host startup; all containers attach to it, which lets them reach each other and the host.
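To see the bridge and the NAT range it was created with, something like this should work on Ubuntu 14.04 (paths may differ on other distributions):

$ brctl show lxcbr0          # lists the bridge and any container veth interfaces attached to it
$ cat /etc/default/lxc-net   # the 10.0.3.0/24 range and DHCP settings are defined here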


>sudo lxc-create  -t ubuntu -n hdp1
>sudo lxc-start -d -n hdp1
>sudo lxc-console -n hdp1
>sudo lxc-info -n hdp1
Name:       hdp1
State:      RUNNING
PID:        17954
IP:         10.0.3.156
CPU use:    2.18 seconds
BlkIO use:  160.00 KiB
Memory use: 9.13 MiB
>sudo lxc-stop -n lxc-test
>sudo lxc-destroy -n lxc-test

ubuntu@hdp1:~$ sudo useradd -m hduser1

ubuntu@hdp1:~$ sudo passwd hduser1
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully

Then install the JDK on the container (we will run Hadoop as user "hduser1"):
sudo apt-get install openjdk-7-jdk
Then we should set JAVA_HOME, etc., in hduser1's .bashrc:
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-amd64
export JRE_HOME=$JAVA_HOME/jre
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH="$PATH:$JAVA_HOME/bin:/home/hduser1/hadoop-2.4.1/bin:$JRE_HOME/bin"
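After editing .bashrc, reload it and sanity-check the settings (a quick check; the JDK path may differ on your system):

$ source ~/.bashrc
$ echo $JAVA_HOME    # should print /usr/lib/jvm/java-1.7.0-openjdk-amd64
$ java -version      # should report an OpenJDK 1.7.0 runtime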

Configure the network:
http://www.kumarabhishek.co.vu/
http://tobala.net/download/lxc/
http://containerops.org/2013/11/19/lxc-networking/
Now I have 5 LXC containers acting as virtual machines:
hdp1 : namenode,jobtracker,secondarynamenode
hdp2 : datanodes,tasktrackers
hdp3 : datanodes,tasktrackers
hdp4 : datanodes,tasktrackers
hdp5 : datanodes,tasktrackers

For each VM, check and change two files:
1:/var/lib/lxc/hdp1/config
  make sure this line exists:
     lxc.network.link = lxcbr0
    "lxcbr0" is the bridge created by LXC, whose virtual IP is: 10.0.3.1, who also has the same hostname as the host machine.
2:/var/lib/lxc/hdp1/rootfs/etc/network/interfaces
change the second stanza to assign a static IP address:
auto eth0
iface eth0 inet static
    address 10.0.3.101
    netmask 255.255.0.0
    broadcast 10.0.255.255
    gateway 10.0.3.1
    dns-nameservers 10.0.3.1
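To pick up the new address, restart the container and verify from the host; a rough sketch, assuming the container is hdp1:

$ sudo lxc-stop -n hdp1 && sudo lxc-start -d -n hdp1
$ sudo lxc-attach -n hdp1 -- ip addr show eth0    # should now show 10.0.3.101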
Once the master node is configured, we clone the container.
To clone our hdp1 container, we first need to stop it if it's running:
$ sudo lxc-stop -n hdp1
Then clone:
sudo lxc-clone -o hdp1 -n hdpX #replace X with 2,3,...,N
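A small loop saves some typing when creating the rest of the nodes; just a sketch, assuming the clones are hdp2 through hdp5:

$ for i in 2 3 4 5; do sudo lxc-clone -o hdp1 -n hdp$i; done
# each clone still carries hdp1's static address, so edit
# /var/lib/lxc/hdpX/rootfs/etc/network/interfaces and change
# 10.0.3.101 to 10.0.3.10X before starting the clone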


Then, in each container, edit /etc/hosts so it mirrors the entries we added to /etc/hosts on the host machine:
10.0.3.101 hdp1
10.0.3.102 hdp2
10.0.3.103 hdp3
10.0.3.104 hdp4
10.0.3.105 hdp5
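Instead of editing each container's file by hand, the entries can be appended from the host; a sketch, where hosts.snippet is a hypothetical file holding the five lines above:

$ for i in 1 2 3 4 5; do
    sudo tee -a /var/lib/lxc/hdp$i/rootfs/etc/hosts < hosts.snippet > /dev/null
  done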

http://jcinnamon.wordpress.com/lxc-hadoop-fully-distributed-on-single-machine/




How to create multiple bridges?

To add a bridge interface:
sudo brctl addbr br100

To delete a bridge interface:
# ip link set br100 down
# brctl delbr br100

Setting up a bridge is pretty straightforward. First you create a new bridge, then add as many interfaces to it as you want:
# brctl addbr br0
# brctl addif br0 eth0
# brctl addif br0 eth1
# ifconfig br0 netmask 255.255.255.0 192.168.32.1 up
The name br0 is just a suggestion, following the loose convention for interface names: an identifier followed by a number. However, you're free to choose anything you like. You can name your bridge pink_burning_elephant if you want to. I just don't know whether you'll remember in 5 years why you have iptables rules for a burning elephant.
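To make a bridge like this survive a reboot on Ubuntu, it can also be declared in /etc/network/interfaces (the bridge-utils package must be installed); a sketch using the same example addresses:

auto br0
iface br0 inet static
    address 192.168.32.1
    netmask 255.255.255.0
    bridge_ports eth0 eth1
    bridge_stp off
    bridge_fd 0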


Good tutorial of brctl command:
http://www.lainoox.com/bridge-brctl-tutorial-linux/


Multi-Cluster Multi-Node Distributed Virtual Network Setup

Bridge Mode

Tuesday, September 9, 2014

Install Hadoop first time!

All versions are available here:
http://mirror.tcpdiag.net/apache/hadoop/common/

I picked 2.4.1 (the current stable version).
Following the instructions on the official site:
http://hadoop.apache.org/docs/stable2/hadoop-project-dist/hadoop-common/SingleNodeSetup.html

1: pretty smooth until I saw:
In the distribution, edit the file conf/hadoop-env.sh to define at least JAVA_HOME to be the root of your Java installation.

Note that there is no folder named "conf". By comparing with the install instructions for 2.5.0, I found the correct file is: etc/hadoop/hadoop-env.sh

2: Another error in the official instructions:
$ mkdir input
$ cp conf/*.xml input
$ bin/hadoop jar hadoop-*-examples.jar  grep input output 'dfs[a-z.]+'
$ cat output/*
Again, there is no "conf" folder; here it should be:
$ mkdir input
$ cp etc/hadoop/*.xml input
$ bin/hadoop jar  share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar  grep input output 'dfs[a-z.]+'
$ cat output/*


A good unofficial installation guide:
http://data4knowledge.org/2014/08/16/installing-hadoop-2-4-1-detailed/

handling warnings you may see:
http://chawlasumit.wordpress.com/2014/06/17/hadoop-java-hotspottm-execstack-warning/

If ssh gives you problems, make sure:
1: the ssh server is running.
2: you have run: /etc/init.d/ssh reload
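Hadoop's start scripts also expect passwordless ssh to localhost; the usual key setup looks roughly like this (run as the user that will run Hadoop):

$ ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ ssh localhost    # should log in without asking for a password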

A tutorial for dummies:


if you see "WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable":
the solution is here:
http://stackoverflow.com/questions/19943766/hadoop-unable-to-load-native-hadoop-library-for-your-platform-error-on-centos

After everything is correctly installed and launched, you can check the status by running:
$ jps
The output is:

23208 SecondaryNameNode
22857 NameNode
26575 Jps
22997 DataNode

http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/



Formatting the Namenode


The first step in starting up your Hadoop installation is formatting the Hadoop filesystem, which is implemented on top of the local filesystems of your cluster. You need to do this the first time you set up a Hadoop installation. Do not format a running Hadoop filesystem; this will erase all your data. Before formatting, ensure that the dfs.name.dir directory exists. If you just used the default, then mkdir -p /tmp/hadoop-username/dfs/name will create the directory. To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the command:
% $HADOOP_INSTALL/hadoop/bin/hadoop namenode -format
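If you would rather not keep the namenode data under /tmp (which is wiped on reboot), point it at a persistent directory in etc/hadoop/hdfs-site.xml; a sketch, where the path is just an example:

<!-- etc/hadoop/hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/hduser1/hadoop_data/namenode</value>
  </property>
</configuration>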


"no such file or directory":

http://stackoverflow.com/questions/20821584/hadoop-2-2-installation-no-such-file-or-directory

hadoop fs -mkdir -p /user/[current login user]

"datanode is not running":
This is for newer version of Hadoop (I am running 2.4.0)
  • In this case stop the cluster sbin/stop-all.sh
  • Then go to /etc/hadoop for config files.
In the file: hdfs-site.xml Look out for directory paths corresponding to dfs.namenode.name.dir and dfs.namenode.data.dir

  • Delete both the directories recursively (rm -r).
  • Now format the namenode via bin/hadoop namenode -format
  • And finally sbin/start-all.sh
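Put together, the recovery sequence looks roughly like this (the dfs paths are examples; use whatever your hdfs-site.xml actually points to):

$ sbin/stop-all.sh
$ rm -r /home/hduser1/hadoop_data/namenode /home/hduser1/hadoop_data/datanode   # example paths
$ bin/hadoop namenode -format
$ sbin/start-all.sh
$ jps    # DataNode should now appear in the list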
How to copy a file from the local system to HDFS?
hadoop fs -copyFromLocal localfile.txt /user/hduser/input/input1.data


Then run an example:

$bin/hadoop jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar wordcount /user/hdgepo/input /user/hdgepo/output


The general form is bin/hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar wordcount <input> <output>, where <input> is a text file or a directory containing text files, and <output> is the name of a directory that will be created to hold the output. The output directory must not exist before running the command, or you will get an error.
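A complete round trip might look like this, assuming /user/hduser1 already exists in HDFS (the file and directory names are just examples):

$ hadoop fs -mkdir -p /user/hduser1/input
$ hadoop fs -copyFromLocal localfile.txt /user/hduser1/input/
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar wordcount /user/hduser1/input /user/hduser1/output
$ hadoop fs -cat /user/hduser1/output/part-r-00000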


Run your own Hadoop:
https://github.com/uwsampa/graphbench/wiki/Standalone-Hadoop

useful hadoop fs commands:
http://www.bigdataplanet.info/2013/10/All-Hadoop-Shell-Commands-you-need-Hadoop-Tutorial-Part-5.html
Cluster Setup:
http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/ClusterSetup.html

Web interface for hadoop 2.4.1:
http://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/ClusterSetup.html#Web_Interfaces
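With default settings, the two most useful pages are the NameNode UI on port 50070 and the YARN ResourceManager UI on port 8088:
http://<namenode-host>:50070/
http://<resourcemanager-host>:8088/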

Sunday, September 7, 2014

Conquering Spark

Spark is hot! Indeed.
I have no knowledge of Hadoop or Internet programming, but I still want to conquer Spark.

The first lesson came from downloading Spark.
https://spark.apache.org/downloads.html

They have:
Pre-built packages:
Pre-built packages, third-party (NOTE: may include non-ASF-compatible licenses):
What do all these abbreviations represent?
HDFS, HDP1, CDH3, CDH4, HDP2, CDH5, MapRv3 and MapRv4

Simply put, they are all distributions of Hadoop. Just like a Linux distribution gives you more than Linux, CDH delivers the core elements of Hadoop – scalable storage and distributed computing – along with additional components such as a user interface, plus necessary enterprise capabilities such as security, and integration with a broad range of hardware and software solutions.
http://www.dbms2.com/2012/06/19/distributions-cdh-4-hdp-1-hadoop-2-0/

HDP1 and HDP2: two versions of Hortonworks Data Platform. 
Hortonworks is a company whose mission is to promote the usage of Hadoop. Its product, the Hortonworks Data Platform (HDP), includes Apache Hadoop and is used for storing, processing, and analyzing large volumes of data. The platform is designed to deal with data from many sources and formats. It includes various Apache Hadoop projects, including the Hadoop Distributed File System (HDFS), MapReduce, Pig, Hive, HBase and ZooKeeper, plus additional components.
official site of HDP: http://hortonworks.com/
its wiki: http://en.wikipedia.org/wiki/Hortonworks

CDH3, CDH4, CDH5: versions of Cloudera Distribution Including Apache Hadoop

Its wiki: http://en.wikipedia.org/wiki/Cloudera

MapRv3, MapRv4: versions from MapR company


The 3 pillars of Hadoop: HDFS, MapReduce, YARN.

Spark may replace MapReduce in the future.

http://hortonworks.com/hadoop/hdfs/

To run Spark, you need to install a Hadoop distribution such as CDH, HDP, or MapR, or you can run Spark standalone.
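To try standalone mode quickly, a pre-built package can be unpacked and exercised with the bundled examples; a rough sketch (substitute whichever version you downloaded):

$ tar xzf spark-1.1.0-bin-hadoop2.4.tgz
$ cd spark-1.1.0-bin-hadoop2.4
$ ./bin/run-example SparkPi 10          # computes an approximation of pi locally
$ ./bin/spark-shell --master local[2]   # interactive Scala shell using 2 local cores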


Essentials for distributed development

Virtual Box

Docker: uses your existing kernel as its kernel and just creates a container around your apps; all containers share the same kernel.

Vagrant: a tool for managing lightweight, reproducible VM-based environments; since it uses full virtual machines, it gives better isolation than Docker.

How to install Docker on Ubuntu 14.04
http://docs.docker.com/installation/ubuntulinux/
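Once installed, a quick smoke test is to start a throwaway Ubuntu container (standard Docker usage, nothing specific to this setup):

$ sudo docker run -i -t ubuntu /bin/bash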


Docker vs. Vagrant
http://www.scriptrock.com/articles/docker-vs-vagrant


Installing Hadoop 2.4 on Ubuntu 14.04:
http://dogdogfish.com/2014/04/26/installing-hadoop-2-4-on-ubuntu-14-04/
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/

Using Docker to try Hadoop:
http://techtraits.com/hadoopsetup/

Distributed environment on a single laptop:
http://ofirm.wordpress.com/2014/01/05/creating-a-virtualized-fully-distributed-hadoop-cluster-using-linux-containers/