Tuesday, October 6, 2015

Learn D3 by "Official" Examples -- Treemap

Explanation of the D3 "treemap" layout. Click to see the official example. "Elements" in this article means HTML elements. In the treemap layout the basic units are rectangles, while in the pack layout the basic units are circles.

<style>
body {
  font-family: "Helvetica Neue", Helvetica, Arial, sans-serif;
  margin: auto;
  position: relative;    /*CSS positioning: http://www.w3schools.com/css/css_positioning.asp*/
  width: 960px;
}
form {
  position: absolute;
  right: 10px;
  top: 10px;
}
.node {
  border: solid 1px white;
  font: 10px sans-serif;
  line-height: 12px;
  overflow: hidden;
  position: absolute;
  text-indent: 2px;
}
</style>
<form>
  <label><input type="radio" name="mode" value="size" checked> Size</label>
  <label><input type="radio" name="mode" value="count"> Count</label>
</form>
<script src="https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.5/d3.min.js"></script>
<script>
var margin = {top: 40, right: 10, bottom: 10, left: 10},
    width = 960 - margin.left - margin.right,
    height = 500 - margin.top - margin.bottom;
var color = d3.scale.category20c();
var treemap = d3.layout.treemap()
    .size([width, height])
    .sticky(true)
    .value(function(d) { return d.size; });  /*size of rect*/
var div = d3.select("body").append("div")
    .style("position", "relative")
    .style("width", (width + margin.left + margin.right) + "px")
    .style("height", (height + margin.top + margin.bottom) + "px")
    .style("left", margin.left + "px")
    .style("top", margin.top + "px");
d3.json("flare.json", function(error, root) {
  if (error) throw error;
  var node = div.datum(root).selectAll(".node").data(treemap.nodes) //same as div.selectAll(".node").data(treemap.nodes(root))
      .enter().append("div")
      .attr("class", "node")
      .call(position)
      .style("background", function(d) { return d.children ? color(d.name) : null; }) /* "backgournd:null" means transparent.*/
      .text(function(d) { return d.children ? null : d.name; });
  d3.selectAll("input").on("change", function change() {
    var value = this.value === "count" ? function() { return 1; } : function(d) { return d.size; };
    node.data(treemap.value(value).nodes)
        .transition()
        .duration(1500)
        .call(position);
  });
});
/*compute the boundary of each rectangle*/
function position() {
  this.style("left", function(d) { return d.x + "px"; })  /*(d.x,d.y) is the top left point of the rect.*/
      .style("top", function(d) { return d.y + "px"; })
      .style("width", function(d) { return Math.max(0, d.dx - 1) + "px"; }) /*d.dx is width of rect.*/
      .style("height", function(d) { return Math.max(0, d.dy - 1) + "px"; }); /*d.dy is eight of rect.*/
}
</script>

Treemap nodes:
  • parent - the parent node, or null for the root.
  • children - the array of child nodes, or null for leaf nodes.
  • value - the node value, as returned by the value accessor.
  • depth - the depth of the node, starting at 0 for the root.
  • x - the minimum x-coordinate of the node position.
  • y - the minimum y-coordinate of the node position.
  • dx - the x-extent of the node position.
  • dy - the y-extent of the node position.
For each rectangle, the top-left corner is (x, y), the width is "dx", and the height is "dy"; "value" is the size of the rectangle.
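As a quick illustration, here is a small sketch (my own, not part of the official example) of how the flattened node array can be inspected; it assumes the same treemap and root objects defined inside the d3.json callback above:

var nodes = treemap.nodes(root);          // flat array: the root first, then its descendants
nodes.slice(0, 3).forEach(function(d) {
  // each node carries the layout fields listed above
  console.log(d.name, "x:", d.x, "y:", d.y, "dx:", d.dx, "dy:", d.dy, "value:", d.value);
});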

Friday, October 2, 2015

Learn D3 by "Official" Examples -- "Pack" layout II

Explanation of the D3 clickable (zoomable) "pack" layout. Click to see the official example. "Elements" in this article means HTML elements.
<head>
<style>
.node {
  cursor: pointer;
}
/*CSS selector ".node:hover" means select elements with "class=node" when mouse is over the element. */
.node:hover {
  stroke: #000;
  stroke-width: 1.5px;
}
/*CSS selector ".node--leaf" means select elements with "class=node--leaf". "node--leaf" is a class name. */
.node--leaf {
  fill: white;
}
.label {
  font: 11px "Helvetica Neue", Helvetica, Arial, sans-serif;
  text-anchor: middle;
  text-shadow: 0 1px 0 #fff, 1px 0 0 #fff, -1px 0 0 #fff, 0 -1px 0 #fff;
}
.label,
.node--root,
.node--leaf {
  pointer-events: none; /*no mouse event response on elements with these classes*/
}
</style>
</head>
<body>
<script>
var margin = 20,
    diameter = 960;
/*Define the color scale. A d3.scale maps inputs from .domain to outputs in .range using .interpolate. In this example, the input to the function "color" is a depth in [-1, 5] and the output is a color between "hsl(152,80%,80%)" and "hsl(228,30%,40%)". Basically, it is a mapping from [-1, 5] to colors between those two values. How do we compute which value in [-1, 5] corresponds to which color? We use the interpolator "d3.interpolateHcl".*/
var color = d3.scale.linear()
    .domain([-1, 5])
    .range(["hsl(152,80%,80%)", "hsl(228,30%,40%)"])
    .interpolate(d3.interpolateHcl);
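/* A quick, hedged illustration of the mapping (my addition, not part of the original example):
   color(-1) returns the light green end of the range, color(5) returns the dark blue end,
   and intermediate depths such as color(2) get HCL-interpolated colors in between. Try: */
// console.log(color(-1), color(2), color(5));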
/*Here is how we use pack layout. We define a size of the entire pack and a value for each of its elements.*/
var pack = d3.layout.pack()
    .padding(2)
    .size([diameter - margin, diameter - margin])
    .value(function(d) { return d.size; })     /*d.size is defined in the given dataset in file "flare.json". Just a heads up, if you do not return anything, the default return value is 0.*/
var svg = d3.select("body").append("svg")
    .attr("width", diameter)
    .attr("height", diameter)
  .append("g")
    .attr("transform", "translate(" + diameter / 2 + "," + diameter / 2 + ")");
d3.json("flare.json", function(error, root) {
  if (error) throw error;
  var focus = root,
      nodes = pack.nodes(root), /*"nodes" is an array of objects which are extracted from Json file.*/
      view;
  var circle = svg.selectAll("circle")
      .data(nodes)   /*each circle element is associated with an object stored in "nodes".*/
      .enter().append("circle")
      /*determine the classes for different HTML elements. "node node--leaf" means the element's class attribute contains two class names, "node" and "node--leaf", so the CSS rules defined for ".node" and ".node--leaf" both apply to it. However, in this example, "node" is not needed for "node--leaf" and "node--root". There would be no difference if we wrote it like this:
      .attr("class", function(d) { return d.parent ? d.children ? "node" : "node--leaf" : "node--root"; })  */
      .attr("class", function(d) { return d.parent ? d.children ? "node" : "node node--leaf" : "node node--root"; })
      .style("fill", function(d) { return d.children ? color(d.depth) : null; }) /*d.depth is level of the node in hierarchy. automatically computed.*/
      .on("click", function(d) { if (focus !== d) zoom(d), d3.event.stopPropagation(); }); /*when we click a node and it is different from previous clicked node, we "zoom" to that node. d3.event.stopPropagation() stops the click event propagated to its parent node. what does this mean? Suppose you have two circles A and A_sub. A_sub is inside A. If you click A_sub, it will trigger the actions for A_sub first and then the actions for A. If you have d3.event.stopPropagation() put in A_sub's actions, then you click A_sub, only A_sub's actions will be triggered. See an example here: http://bl.ocks.org/jasondavies/3186840*/
  var text = svg.selectAll("text")  /*select all elements with tag name "text"*/
      .data(nodes)
    .enter().append("text")
      .attr("class", "label")       /*define a class "label" for tag "text"*/
      .style("fill-opacity", function(d) { return d.parent === root ? 1 : 0; }) /*Opacity can be used if you want to create transparency or fade effect. without this line, you won't see the fade effect.*/
      .style("display", function(d) { return d.parent === root ? null : "none"; }) /*display the elements when they are the children of root module.*/
      .text(function(d) { return d.name; });
  var node = svg.selectAll("circle,text"); /*all elements with tag "circle" or tag "text" are selected.*/
  d3.select("body")
      .style("background", color(-1))      /*the parameter of color is in [-1,5]*/
      .on("click", function() { zoom(root); }); /*if you click the background, it will zoom to root.*/
  zoomTo([root.x, root.y, root.r * 2 + margin]);     /*initial display of the entire data from json file.*/
  function zoom(d) {
    var focus0 = focus; focus = d;/*update the current focus. focus0 is not used.*/
    /*simply put, d3.transition performs the transition from one state to another. Between these two states there are many "frames" (think of them as film frames). ".tween" is used to control how each frame is displayed. The tween function is called repeatedly, being passed the current normalized time t in [0, 1].*/
    var transition = d3.transition()
        .duration(d3.event.altKey ? 7500 : 750)     /*specify the length of the transition.*/
        .tween("zoom", function(d) {                
          var i = d3.interpolateZoom(view, [focus.x, focus.y, focus.r * 2 + margin]); /*compute each frame between "view" and the new focus, where "view" is a global variable set in function "zoomTo".*/
          return function(t) { zoomTo(i(t)); }; /*"i" is a function, t is passed to the 2nd parameter of tween function. i(t) is the view of frame.*/
        });
    transition.selectAll("text")
      .filter(function(d) { return d.parent === focus || this.style.display === "inline"; })
        .style("fill-opacity", function(d) { return d.parent === focus ? 1 : 0; })        /*fade effect*/
        .each("start", function(d) { if (d.parent === focus) this.style.display = "inline"; }) /*beginning of frame.*/
        .each("end", function(d) { if (d.parent !== focus) this.style.display = "none"; });    /*end of frame.*/
  }
  function zoomTo(v) {
    /*zoom from the current "view" to "v", which is computed by d3.interpolateZoom.*/
    var k = diameter / v[2]; view = v; /*v[2] is the new view width, i.e. the clicked circle's diameter plus the margin.*/
    node.attr("transform", function(d) { return "translate(" + (d.x - v[0]) * k + "," + (d.y - v[1]) * k + ")"; });
    circle.attr("r", function(d) { return d.r * k; }); /*update the circle size*/
    /*note, all nodes (the "circle" and "text" elements selected into "node" above) will be updated.*/
  }
});
d3.select(self.frameElement).style("height", diameter + "px");
</script>
</body>
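To make the zoom arithmetic above concrete, here is a small sketch (my own illustration, assuming D3 v3's d3.interpolateZoom; the view values below are made up): a view is a triple [cx, cy, w] giving the center and the visible width, and zoomTo turns it into a scale factor.

var diameter = 960;
var startView = [480, 480, 960];                 // centered on the root, showing the full diameter
var endView   = [300, 300, 200];                 // centered on a smaller circle, w = 2 * r + margin
var i = d3.interpolateZoom(startView, endView);  // returns a function of t in [0, 1]
var v = i(0.5);                                  // an intermediate view ("frame")
var k = diameter / v[2];                         // scale factor used in zoomTo; larger when zoomed in
// a node at (d.x, d.y) is drawn at ((d.x - v[0]) * k, (d.y - v[1]) * k) relative to the svg center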

Saturday, September 26, 2015

Learn D3 by "Official" Examples -- "Pack" layout

Explanation of D3 "pack" layout. click to see official example
<head>
<style>
circle {
  fill: rgb(31, 119, 180);
  fill-opacity: .25;
  stroke: rgb(31, 119, 180);
  stroke-width: 1px;
}
.leaf circle {
  fill: #ff7f0e;
  fill-opacity: 1;
}
text {
  font: 10px sans-serif;
}
</style>
<!--
The style block above defines rules for the tag selector "circle", the class selector ".leaf circle", and the tag selector "text". Priority of CSS selectors: id selector > class selector > tag selector. Wait a second, what does ".leaf circle" mean? ".leaf" selects all elements with class="leaf". ".leaf circle" selects all "circle" elements inside an element with class="leaf" and applies the specified rules to them. In the following example, "c1" and "c2" both match ".leaf circle".
<p class="leaf">
    <circle id="c1"> Circle 1 is shown here.</circle>
    <p>
        <circle id="c2"> Circle 2 is shown here.</circle>
    </p>
</p>
<p class="leaf node">double classes</p> means this element has two classes associated with it. The way to specify the style:
<style>
.leaf.node {color:red}
</style>
-->
<!--the official CDN of D3 -->
<script src="//d3js.org/d3.v3.min.js" charset="utf-8"></script>
</head>
<body>
<script>
var diameter = 960,
    /*d3.format is a function converting a number to a string. See: https://github.com/mbostock/d3/wiki/Formatting#d3_format.  Variable "format" is a function that converts integer as string with a comma for a thousands separator. If the number is not an integer it will be ignored. Try console.log(format(123456)) and console.log(format(123.456)) for details.*/
    format = d3.format(",d");
var pack = d3.layout.pack()                    //This is the way to use the pack layout.
    .size([diameter - 4, diameter - 4])        //Specify the size of the layout.
    .value(function(d) { return d.size; });    //".value" determines the size of each node in the pack layout. Where is "d"? "d" will be associated when the pack connects to dataset.  
                                               
var svg = d3.select("body").append("svg")      //append a svg which is like a whiteboard to draw.
    .attr("width", diameter)                   //specify width of svg     
    .attr("height", diameter)                  //specify height of svg
    .append("g")                               //define a group
    .attr("transform", "translate(2,2)");      //Translate will move the entire svg by 2 along x-axis and by 2 along y-axis. 
d3.json("flare.json", function(error, root) {  //read in a json file and store it to variable "root".
  if (error) throw error;
  var node = svg.datum(root)     //connect the retrieved data "root" to "svg". Similar to "d3.selection.data". 
      .selectAll(".node")        //select the "virtual" elements with <class="node">. These elements are not created yet but will be created later.
      .data(pack.nodes)          //pack.nodes is the layout's node function; because root was bound to the parent selection by datum(), .data(pack.nodes) is the same as .data(pack.nodes(root)). From now on, each element with class="node" corresponds to one node object from the given json.
      .enter().append("g")     
      /*for each node, we assign css attribute to it. Here the node's class depends on whether it has children.*/  
      .attr("class", function(d) { return d.children ? "node" : "leaf node"; })
      /*assign location for each node. d.x and d.y are automatically computed by pack layout.*/
      .attr("transform", function(d) { return "translate(" + d.x + "," + d.y + ")"; });
  node.append("title")           //assign a title to each node
      .text(function(d) { return d.name + (d.children ? "" : ": " + format(d.size)); });
  node.append("circle")          //assign a shape to each node
      .attr("r", function(d) { return d.r; });  //where is d.r from? pack layout automatically computes it. I guess it is computed based on node.value (Line 36).
  node.filter(function(d) { return !d.children; }) //choose all leaf nodes and assign text to them. 
      .append("text")
      .attr("dy", ".3em")
      .style("text-anchor", "middle")
      .text(function(d) { return d.name.substring(0, d.r / 3); });
});
d3.select(self.frameElement).style("height", diameter + "px");
</script>
</body>

The concept of "node" in a pack layout is important to understand how pack layout works.
Each node has the following attributes:

  • parent - the parent node, or null for the root. Automatically computed!
  • children - the array of child nodes, or null for leaf nodes.
  • value - the node value, as returned by the value accessor (the .value call above). If a node has no size of its own, its value is the sum of its children's values.
  • depth - the depth of the node, starting at 0 for the root. Automatically computed!
  • x - the computed x-coordinate of the node position. Automatically computed!
  • y - the computed y-coordinate of the node position. Automatically computed!
  • r - the computed node radius. Automatically computed!

So only "children" and "value" are assigned from the given input dataset. I will use an example to explain the concept.
{
 "name": "flare",
 "children": [
      {"name": "sub1", "size": 3938},
      {
          "name": "sub2", 
          "children": [
                  {"name": "sub2_sub1", "size": 6714},
                  {"name": "sub2_sub2", "size": 743}
           ]
      }
 ]
}
We have 5 objects in the json file:
Object { name: "flare", children: Array[2], depth: 0, value: 11395, y: 478, x: 478, r: 478 } 
Object { name: "sub1", size: 3938, parent: Object, depth: 1, value: 3938, r: 174.4465055101499, x: 174.4465055101499, y: 478 } 
Object { name: "sub2", children: Array[2], parent: Object, depth: 1, value: 7457, r: 303.5534944898501, x: 652.4465055101499, y: 478 } 
Object { name: "sub2_sub1", size: 6714, parent: Object, depth: 2, value: 6714, r: 227.7797367518277, x: 728.2202632481724, y: 478 }
Object { name: "sub2_sub2", size: 743, parent: Object, depth: 2, value: 743, r: 75.77375773802244, x: 424.6667687583222, y: 478 }
"Node" of the pack corresponds to each object. So we have 5 nodes in this pack layout. "x", "y", "r" and "parent" are already there which means they are computed by pack layout. "name" and "children" are assigned by the dataset in the given json file. The size of each node is controlled by pack.value (Line 40). Note object "flare" does not have a "size" attribute, so its size is the sum of its children's sizes.

Tuesday, September 15, 2015

Parallelism vs. Concurrency

I always had a perception that parallelism and concurrency are interchangeable and they convey the same concept.
But that was a wrong perception! 

Here is my understanding: 
Parallelism is using multiple threads (or cores) on the same problem: the single problem is divided into many sub-problems which are then computed simultaneously.
Concurrency is more like a concept used in distributed systems, i.e., rapid I/O interaction combined with callbacks that are triggered on certain events.

A concrete example is a web server. Suppose a web server receives 100 requests per second. If a single server handles all of these requests by interleaving them, it is running concurrently. If there are 100 web servers, each serving one of the 100 requests, then it is parallel handling.
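A rough sketch of the distinction in JavaScript terms (my own illustration; Node.js is assumed here and is not part of the original post): a single-threaded event loop handles many requests concurrently, while parallelism would mean splitting one computation across several processes that truly run at the same time.

var http = require('http');                          // Node.js assumed (illustration only)
http.createServer(function(req, res) {
  // each request is handled as its callbacks fire; no request blocks the others
  setTimeout(function() { res.end('done'); }, 100);  // simulate asynchronous I/O
}).listen(8080);
// Parallelism, by contrast, would split one big computation across multiple
// processes (e.g. via the cluster module) that run at the same time.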

Monday, September 15, 2014

Build a virtual distributed environment on a single laptop.

Using LXC to achieve that.
Good tutorials on LXC:
http://en.community.dell.com/techcenter/os-applications/w/wiki/6950.lxc-containers-in-ubuntu-server-14-04-lts
http://en.community.dell.com/techcenter/os-applications/w/wiki/7440.lxc-containers-in-ubuntu-server-14-04-lts-part-2

http://wupengta.blogspot.com/2012/08/lxchadoop.html

Golden tutorial:
http://www.kumarabhishek.co.vu/

Once you have installed LXC and created a container, you can check it under /var/lib/lxc.
Note, you have to be the root user to look inside:

gstanden@vmem1:/usr/share/lxc/templates$ cd /var/lib/lxc
bash: cd: /var/lib/lxc: Permission denied
gstanden@vmem1:/usr/share/lxc/templates$ sudo cd /var/lib/lxc
sudo: cd: command not found
gstanden@vmem1:/usr/share/lxc/templates$ sudo su
root@vmem1:~# cd /var/lib/lxc

#ifconfig -a
lxcbr0    Link encap:Ethernet  HWaddr fe:d3:07:23:4d:71  

          inet addr:10.0.3.1  Bcast:10.0.3.255  Mask:255.255.255.0

LXC creates this NATed bridge "lxcbr0" at host startup; the containers are connected to each other (and to the host) through "lxcbr0".


>sudo lxc-create  -t ubuntu -n hdp1
>sudo lxc-start -d -n hdp1
>sudo lxc-console -n hdp1
>sudo lxc-info -n hdp1
Name:       hdp1
State:      RUNNING
PID:        17954
IP:         10.0.3.156
CPU use:    2.18 seconds
BlkIO use:  160.00 KiB
Memory use: 9.13 MiB
>sudo lxc-stop -n lxc-test
>sudo lxc-destroy -n lxc-test

ubuntu@hdp1# sudo useradd -m hduser1

ubuntu@hdp1:~$ sudo passwd hduser1
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully

Then install the JDK on the VM (for user "hduser1"):
apt-get install openjdk-7-jdk
Then we should set JAVA_HOME, etc., in .bashrc:
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-amd64
export JRE_HOME=$JAVA_HOME/jre
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH="$PATH:$JAVA_HOME/bin:/home/hduser1/hadoop-2.4.1/bin:$JRE_HOME/bin"

Configure the network:
http://www.kumarabhishek.co.vu/
http://tobala.net/download/lxc/
http://containerops.org/2013/11/19/lxc-networking/
Now I have 5 LXC virtual machines.
hdp1 : namenode,jobtracker,secondarynamenode
hdp2 : datanodes,tasktrackers
hdp3 : datanodes,tasktrackers
hdp4 : datanodes,tasktrackers
hdp5 : datanodes,tasktrackers

For each VM, check and change two files:
1:/var/lib/lxc/hdp1/config
  make sure this line exists:
     lxc.network.link = lxcbr0
    "lxcbr0" is the bridge created by LXC, whose virtual IP is: 10.0.3.1, who also has the same hostname as the host machine.
2:/var/lib/lxc/hdp1/rootfs/etc/network/interfaces
change 2nd part to assign a static IP address:
auto eth0
iface eth0 inet static
    address 10.0.3.101
    netmask 255.255.0.0
    broadcast 10.0.255.255
    gateway 10.0.3.1
    dns-nameservers 10.0.3.1
Once the master node is configured, we clone the LXC container.
To clone a container, we first need to stop it if it's running:
$ sudo lxc-stop -n lxc-test
Then clone:
sudo lxc-clone -o hdp1 -n hdpX #replace X with 2,3,...,N


Then, for each VM, we need to edit /etc/hosts so it mirrors the entries we added to /etc/hosts on the host machine:
10.0.3.101 hdp1
10.0.3.102 hdp2
10.0.3.103 hdp3
10.0.3.104 hdp4
10.0.3.105 hdp5

http://jcinnamon.wordpress.com/lxc-hadoop-fully-distributed-on-single-machine/




How to create multiple bridges?

add a bridge interface:
sudo brctl addbr br100

to delete a bridge interface:
# ip link set br100 down
# brctl delbr br100

Setting up a bridge is pretty much straightforward. At first you create a new bridge, and then continue with adding as many interfaces to it as you want:
# brctl addbr br0
# brctl addif br0 eth0
# brctl addif br0 eth1
# ifconfig br0 netmask 255.255.255.0 192.168.32.1 up
The name br0 is just a suggestion, following the loose conventions for interface names -- identifier followed by a number. However, you're free to choose anything you like. You can name your bridge pink_burning_elephant if you like to. I just don't know if you remember in 5 years why you're having iptables for a burning elephant.


Good tutorial of brctl command:
http://www.lainoox.com/bridge-brctl-tutorial-linux/


Multi-Cluster Multi-Node Distributed Virtual Network Setup

Bridge Mode

Tuesday, September 9, 2014

Install Hadoop for the first time!

All versions are available here:
http://mirror.tcpdiag.net/apache/hadoop/common/

I picked 2.4.1 (the current stable version).
Following instructions on official site:
http://hadoop.apache.org/docs/stable2/hadoop-project-dist/hadoop-common/SingleNodeSetup.html

1: pretty smooth until I saw:
In the distribution, edit the file conf/hadoop-env.sh to define at least JAVA_HOME to be the root of your Java installation.

Note, there is no folder named "conf". By comparing it to the install instructions for 2.5.0, I found the correct path should be: etc/hadoop/hadoop-env.sh

2: another typo in the official instruction:
$ mkdir input
$ cp conf/*.xml input
$ bin/hadoop jar hadoop-*-examples.jar  grep input output 'dfs[a-z.]+'
$ cat output/*
 again, no "conf" folder, here it should be:
$ mkdir input
$ cp etc/hadoop/*.xml input
$ bin/hadoop jar  share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar  grep input output 'dfs[a-z.]+'
$ cat output/*


A good unoffical installation guide:
http://data4knowledge.org/2014/08/16/installing-hadoop-2-4-1-detailed/

Handling warnings you may see:
http://chawlasumit.wordpress.com/2014/06/17/hadoop-java-hotspottm-execstack-warning/

If SSH has problems, make sure:
1: ssh server is running.
2: run: /etc/init.d/ssh reload

A tutorial for dummies:


if you see "WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable":
the solution is here:
http://stackoverflow.com/questions/19943766/hadoop-unable-to-load-native-hadoop-library-for-your-platform-error-on-centos

After everything is correctly installed and launched, you can check the status by:
$jps
output is:

23208 SecondaryNameNode
22857 NameNode
26575 Jps
22997 DataNode

http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/



Formatting the Namenode


The first step to starting up your Hadoop installation is formatting the Hadoop filesystem, which is implemented on top of the local filesystems of your cluster. You need to do this the first time you set up a Hadoop installation. Do not format a running Hadoop filesystem: this will erase your data. Before formatting, ensure that the dfs.name.dir directory exists. If you just used the default, then mkdir -p /tmp/hadoop-username/dfs/name will create the directory. To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the command:
% $HADOOP_INSTALL/hadoop/bin/hadoop namenode -format


"no such file or directory":

http://stackoverflow.com/questions/20821584/hadoop-2-2-installation-no-such-file-or-directory

hadoop fs -mkdir -p /user/[current login user]

"datanode is not running":
This is for newer version of Hadoop (I am running 2.4.0)
  • In this case stop the cluster sbin/stop-all.sh
  • Then go to /etc/hadoop for config files.
In the file hdfs-site.xml, look for the directory paths corresponding to dfs.namenode.name.dir and dfs.datanode.data.dir.

  • Delete both the directories recursively (rm -r).
  • Now format the namenode via bin/hadoop namenode -format
  • And finally sbin/start-all.sh
How to copy a file from the local system to HDFS?
hadoop fs -copyFromLocal localfile.txt /user/hduser/input/input1.data


Then run an example:

$bin/hadoop jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar wordcount /user/hdgepo/input /user/hdgepo/output


bin/hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar wordcount <input> <output>, where <input> is a text file or a directory containing text files, and <output> is the name of a directory that will be created to hold the output. The output directory must not exist before running the command or you will get an error.


Run your own Hadoop:
https://github.com/uwsampa/graphbench/wiki/Standalone-Hadoop

Useful hadoop fs commands:
http://www.bigdataplanet.info/2013/10/All-Hadoop-Shell-Commands-you-need-Hadoop-Tutorial-Part-5.html
Cluster Setup:
http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/ClusterSetup.html

Web interface for hadoop 2.4.1:
http://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/ClusterSetup.html#Web_Interfaces

Sunday, September 7, 2014

Conquering Spark

Spark is hot! Indeed.
I have no knowledge of Hadoop or Internet programming. But I still want to conquer Spark.

The first thing I learned is from downloading Spark.
https://spark.apache.org/downloads.html

They have:
Pre-built packages:
Pre-built packages, third-party (NOTE: may include non ASF-compatible licenses):
What are all these abbreviations representing?
HDFS, HDP1, CDH3, CDH4, HDP2, CDH5, MapRv3 and MapRv4

Simply put, they are all distributions of Hadoop. Just like a Linux distribution gives you more than Linux, CDH delivers the core elements of Hadoop – scalable storage and distributed computing – along with additional components such as a user interface, plus necessary enterprise capabilities such as security, and integration with a broad range of hardware and software solutions.
http://www.dbms2.com/2012/06/19/distributions-cdh-4-hdp-1-hadoop-2-0/

HDP1 and HDP2: two versions of Hortonworks Data Platform. 
Hortonworks is a company built around Hadoop; its aim is to promote the usage of Hadoop. Its product, the Hortonworks Data Platform (HDP), includes Apache Hadoop and is used for storing, processing, and analyzing large volumes of data. The platform is designed to deal with data from many sources and formats. It includes various Apache Hadoop projects, including the Hadoop Distributed File System (HDFS), MapReduce, Pig, Hive, HBase and ZooKeeper, plus additional components.
official site of HDP: http://hortonworks.com/
its wiki: http://en.wikipedia.org/wiki/Hortonworks

CDH3, CDH4, CDH5: versions of Cloudera Distribution Including Apache Hadoop

Its wiki: http://en.wikipedia.org/wiki/Cloudera

MapRv3, MapRv4: versions from MapR company


3 pillars of Hadoop: HDFS, MapReduce, YARN

Now Spark may replace MapReduce in the future.

http://hortonworks.com/hadoop/hdfs/

To run Spark, you need to install CDH, HDP, or MapR Hadoop; or you can run Spark standalone.