# How To Spin Up a Hadoop Cluster with DigitalOcean Droplets
## Introduction
This tutorial covers setting up a Hadoop cluster on DigitalOcean. The Hadoop software library is an Apache framework that lets you process large data sets in a distributed manner across clusters of servers by leveraging basic programming models. The scalability provided by Hadoop allows you to scale up from a single server to thousands of machines. It also provides failure detection at the application layer, so it can detect and handle failures as a highly available service.
There are 4 important modules we'll be working with in this tutorial:
- **Hadoop Common** is the collection of common utilities and libraries that support the other Hadoop modules.
- The **Hadoop Distributed File System (HDFS)**, as described by the Apache organization, is a highly fault-tolerant, distributed file system specifically designed to run on commodity hardware in order to process large data sets.
- **Hadoop YARN** is the framework used for job scheduling and cluster resource management.
- **Hadoop MapReduce** is a YARN-based system for the parallel processing of large data sets.
In this tutorial, we'll set up and run a Hadoop cluster on four DigitalOcean Droplets.
## Prerequisites
This tutorial will require the following:
- Four Ubuntu 16.04 Droplets set up with non-root sudo users. If you do not have this set up, follow Steps 1-4 of the Initial Server Setup with Ubuntu 16.04. This tutorial will assume that you're using an SSH key from a local machine. Keeping with Hadoop parlance, we'll refer to these Droplets by the following names:
  - `hadoop-master`
  - `hadoop-worker-01`
  - `hadoop-worker-02`
  - `hadoop-worker-03`
- Additionally, you may want to use DigitalOcean Snapshots after completing the initial server setup and Steps 1 and 2 (below) of your first Droplet.
With these prerequisites in place, you'll be ready to begin setting up a Hadoop cluster.
## Step 1 — Installation Setup for Each Droplet
We'll be installing Java and Hadoop on **each of our four Droplets**. If you don't want to repeat every step on each Droplet, you can use DigitalOcean Snapshots at the end of Step 2 to replicate your initial installation and configuration.
First, we'll update Ubuntu with the latest software patches available:
```
sudo apt-get update && sudo apt-get -y dist-upgrade
```
Next, let's install the headless version of Java for Ubuntu on each Droplet. "Headless" refers to software that is able to run on a device without a graphical user interface.
```
sudo apt-get -y install openjdk-8-jdk-headless
```
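If you'd like to confirm that the JDK installed correctly before moving on, you can optionally print the Java version on each Droplet:

```bash
# Should report an OpenJDK 1.8 build if the headless package installed correctly
java -version
```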
To install Hadoop on each Droplet, let's create the directory where Hadoop will be installed. We can call it `my-hadoop-install` and then move into that directory:

```
mkdir my-hadoop-install && cd my-hadoop-install
```
With the directory created, let's install the most recent binary from the Hadoop releases list. At the time of this tutorial, the most recent is Hadoop 3.0.1.
**Note:** Keep in mind that these downloads are distributed via mirror sites, and it is recommended to first check them for tampering using GPG or SHA-256.
When you're satisfied with the download you've selected, you can use the `wget` command with your chosen binary link, such as:

```
wget http://mirror.cc.columbia.edu/pub/software/apache/hadoop/common/hadoop-3.0.1/hadoop-3.0.1.tar.gz
```
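If you'd like to follow the note above and check the archive before unpacking it, a minimal SHA-256 check might look like the following; compare the resulting hash against the checksum published alongside the release on the Apache Hadoop downloads page:

```bash
# Compute the SHA-256 hash of the downloaded archive for comparison
sha256sum hadoop-3.0.1.tar.gz
```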
Once your download is complete, unzip the file's contents using `tar`, Ubuntu's file archiving tool:

```
tar xvzf hadoop-3.0.1.tar.gz
```
We're now ready to begin our initial configuration.
## Step 2 — Update Hadoop Environment Configuration
For each Droplet node, we'll need to set `JAVA_HOME`. Open the following file with nano or another text editor of your choice so that we can update it:

```
nano ~/my-hadoop-install/hadoop-3.0.1/etc/hadoop/hadoop-env.sh
```
Update the following section, where `JAVA_HOME` is located:
```
[label hadoop-env.sh]
...
###
# Generic settings for HADOOP
###

# Technically, the only required environment variable is JAVA_HOME.
# All others are optional.  However, the defaults are probably not
# preferred.  Many sites configure these options outside of Hadoop,
# such as in /etc/profile.d

# The java implementation to use. By default, this environment
# variable is REQUIRED on ALL platforms except OS X!
# export JAVA_HOME=

# Location of Hadoop.  By default, Hadoop will attempt to determine
# this location based upon its execution path.
# export HADOOP_HOME=
...
```
Change it so that it looks like this:
```
[label hadoop-env.sh]
...
###
# Generic settings for HADOOP
###

# Technically, the only required environment variable is JAVA_HOME.
# All others are optional.  However, the defaults are probably not
# preferred.  Many sites configure these options outside of Hadoop,
# such as in /etc/profile.d

# The java implementation to use. By default, this environment
# variable is REQUIRED on ALL platforms except OS X!
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

# Location of Hadoop.  By default, Hadoop will attempt to determine
# this location based upon its execution path.
# export HADOOP_HOME=
...
```
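The path shown above is the standard location for Ubuntu's OpenJDK 8 package on 64-bit Droplets. If you're unsure of the JDK path on your own Droplet, one optional way to confirm it before editing the file is:

```bash
# Resolve the JDK directory from the real path of the javac binary
dirname "$(dirname "$(readlink -f "$(which javac)")")"
```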
We'll also need to add some environment variables to run Hadoop and its modules. They should be added to the bottom of the file so that it looks like the following, where `sammy` would be your sudo non-root user's username.
**Note:** If you are using a different username across your cluster Droplets, you'll need to edit this file to reflect the correct username for each specific Droplet.
```
[label hadoop-env.sh]
...
#
# To prevent accidents, shell commands be (superficially) locked
# to only allow certain users to execute certain subcommands.
# It uses the format of (command)_(subcommand)_USER.
#
# For example, to limit who can execute the namenode command,
export HDFS_NAMENODE_USER="sammy"
export HDFS_DATANODE_USER="sammy"
export HDFS_SECONDARYNAMENODE_USER="sammy"
export YARN_RESOURCEMANAGER_USER="sammy"
export YARN_NODEMANAGER_USER="sammy"
```
At this point, you can save and exit the file. Next, run the following command to apply our exports:
```
source ~/my-hadoop-install/hadoop-3.0.1/etc/hadoop/hadoop-env.sh
```
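If you want to make sure the exports took effect in your current shell, you can optionally echo a couple of the variables you just set; the values shown will depend on your username and Java path:

```bash
# Print variables exported by hadoop-env.sh into the current shell
echo "$JAVA_HOME"
echo "$HDFS_NAMENODE_USER"
```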
With the `hadoop-env.sh` script updated and sourced, we need to create a data directory for the Hadoop Distributed File System (HDFS) to store all relevant `HDFS` files:
```
sudo mkdir -p /usr/local/hadoop/hdfs/data
```
Set the permissions for this file with your respective user. Remember, if you are using different usernames on each Droplet, be sure to allow your respective sudo user to have these permissions:
```
sudo chown -R sammy:sammy /usr/local/hadoop/hdfs/data
```
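If you'd like to confirm the ownership change on a given Droplet, a quick optional check is:

```bash
# The directory should now be owned by your sudo user (sammy in this example)
ls -ld /usr/local/hadoop/hdfs/data
```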
If you would like to use DigitalOcean Snapshots to replicate these commands across your Droplet nodes, you can create your Snapshot now and create new Droplets from that image. For guidance on this, you can read An Introduction to DigitalOcean Snapshots.
When you have completed the steps above across **all four** Ubuntu Droplets, you can move on to completing this configuration across nodes.
## Step 3 — Complete Initial Configuration for Each Node
At this point, we need to update the `core-site.xml` file for **all 4** of your Droplet nodes. Within each individual Droplet, open the following file:
```
nano ~/my-hadoop-install/hadoop-3.0.1/etc/hadoop/core-site.xml
```
You should see the following lines:
```xml
[label core-site.xml]
...
<configuration>
</configuration>
```
Change the file so that it resembles the following XML, including **each Droplet's respective IP** in the property value where we have written `server-ip`. If you are using a firewall, you will need to open port 9000.
```xml
[label core-site.xml]
...
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://server-ip:9000</value>
    </property>
</configuration>
```
Repeat the above with the relevant Droplet's IP for **all four of your servers**.
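Regarding the firewall note above: if you happen to be using UFW on your Droplets, one way to open the port might be the following (adjust to whatever firewall tooling and rules you actually use):

```bash
# Allow inbound traffic on the HDFS NameNode port (only needed if a firewall such as UFW is active)
sudo ufw allow 9000
```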
Now that all of the general Hadoop settings are updated for each server node, we can continue to connect our nodes via SSH keys.
## Step 4 — Set Up SSH for Each Node
In order for Hadoop to work properly, we need to set up passwordless SSH between the master node and the worker nodes (the language of `master` and `worker` is Hadoop's way of referring to `primary` and `secondary` servers).
For this tutorial, the master node will be `hadoop-master` and the worker nodes will be referred to collectively as `hadoop-worker`, but you'll have three of them in total (referred to as `-01`, `-02`, and `-03`). We first need to create a public-private key pair on the master node, which will be the node with the IP address belonging to `hadoop-master`.
While on the `hadoop-master` Droplet, run the following command. You'll press `enter` to use the default for the key location, then press `enter` twice to use an empty passphrase:
```custom_prefix(sammy@hadoop-master$)
[environment second]
ssh-keygen
```
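If you'd rather skip the interactive prompts, a non-interactive invocation that should be equivalent is sketched below; it writes the key to the default location with an empty passphrase, so only use it if that is what you want:

```bash
# Generate an RSA key pair at the default path with no passphrase
ssh-keygen -t rsa -f ~/.ssh/id_rsa -N ""
```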
For each of the worker nodes, we need to take the master node's public key and copy it into each worker node's `authorized_keys` file.
Get the public key from the master node by running `cat` on the `id_rsa.pub` file located in your `.ssh` folder, printing it to the console:
```custom_prefix(sammy@hadoop-master$)
[environment second]
cat ~/.ssh/id_rsa.pub
```

Now log into each worker node Droplet, and open the `authorized_keys` file:

```custom_prefix(sammy@hadoop-worker$)
[environment fourth]
nano ~/.ssh/authorized_keys
```

You’ll copy the master node’s public key — which is the output you generated from the `cat ~/.ssh/id_rsa.pub` command on the master node — into each Droplet’s respective `~/.ssh/authorized_keys` file. Be sure to save each file before closing.

When you are finished updating the 3 worker nodes, also copy the master node’s public key into its own `authorized_keys` file by issuing the same command:

```custom_prefix(sammy@hadoop-master$)
[environment second]
nano ~/.ssh/authorized_keys
```

On `hadoop-master`, you should set up the `ssh` configuration to include each of the hostnames of the related nodes. Open the configuration file for editing, using nano:

```custom_prefix(sammy@hadoop-master$)
[environment second]
nano ~/.ssh/config
```

You should modify the file to look like the following, with relevant IPs and usernames added.

```
[environment second]
[label config]
Host hadoop-master-server-ip
    HostName hadoop-example-node-server-ip
    User sammy
    IdentityFile ~/.ssh/id_rsa

Host hadoop-worker-01-server-ip
    HostName hadoop-worker-01-server-ip
    User sammy
    IdentityFile ~/.ssh/id_rsa

Host hadoop-worker-02-server-ip
    HostName hadoop-worker-02-server-ip
    User sammy
    IdentityFile ~/.ssh/id_rsa

Host hadoop-worker-03-server-ip
    HostName hadoop-worker-03-server-ip
    User sammy
    IdentityFile ~/.ssh/id_rsa
```

Save and close the file.

From the `hadoop-master`, SSH into each node:

```custom_prefix(sammy@hadoop-master$)
[environment second]
ssh sammy@hadoop-worker-01-server-ip
```

Since it’s your first time logging into each node with the current system set up, it will ask you the following:

```
[environment second]
[secondary_label Output]
are you sure you want to continue connecting (yes/no)?
```

Reply to the prompt with `yes`. This will be the only time it needs to be done, but it is required for each worker node for the initial SSH connection. Finally, log out of each worker node to return to `hadoop-master`:

```custom_prefix(sammy@hadoop-worker$)
[environment fourth]
logout
```

Be sure to **repeat these steps** for the remaining two worker nodes.
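As an optional sanity check, you can confirm from `hadoop-master` that each worker now accepts the key without a password. The short loop below is just a sketch: substitute your actual worker IP addresses for the placeholder values.

```bash
# Ask each worker node to report its hostname over SSH; no password prompt should appear
for ip in hadoop-worker-01-server-ip hadoop-worker-02-server-ip hadoop-worker-03-server-ip; do
    ssh sammy@"$ip" hostname
done
```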
Now that we have successfully set up passwordless SSH for each worker node, we can continue to configure the master node.

## Step 5 — Configure the Master Node

For our Hadoop cluster, we need to configure the HDFS properties on the master node Droplet. While on the master node, edit the following file:

```custom_prefix(sammy@hadoop-master$)
[environment second]
nano ~/my-hadoop-install/hadoop-3.0.1/etc/hadoop/hdfs-site.xml
```

Edit the `configuration` section to look like the XML below:

```xml
[environment second]
[label hdfs-site.xml]
...
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///usr/local/hadoop/hdfs/data</value>
    </property>
</configuration>
```

Save and close the file.

We’ll next configure the `MapReduce` properties on the master node. Open `mapred-site.xml` with nano or another text editor:

```custom_prefix(sammy@hadoop-master$)
[environment second]
nano ~/my-hadoop-install/hadoop-3.0.1/etc/hadoop/mapred-site.xml
```

Then update the file so that it looks like this, with your current server’s IP address reflected below:

```xml
[environment second]
[label mapred-site.xml]
...
<configuration>
    <property>
        <name>mapreduce.jobtracker.address</name>
        <value>hadoop-master-server-ip:54311</value>
    </property>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
```

Save and close the file. If you are using a firewall, be sure to open port 54311.

Next, set up YARN on the master node. Again, we are updating the configuration section of another XML file, so let’s open the file:

```custom_prefix(sammy@hadoop-master$)
[environment second]
nano ~/my-hadoop-install/hadoop-3.0.1/etc/hadoop/yarn-site.xml
```

Now update the file, being sure to input your current server’s IP address:

```xml
[environment second]
[label yarn-site.xml]
...
<configuration>
    <!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop-master-server-ip</value>
    </property>
</configuration>
```

Finally, let’s configure Hadoop’s point of reference for what the master and worker nodes should be. First, open the `masters` file:

```custom_prefix(sammy@hadoop-master$)
[environment second]
nano ~/my-hadoop-install/hadoop-3.0.1/etc/hadoop/masters
```

Into this file, you’ll add your current server’s IP address:

```
[environment second]
[label masters]
hadoop-master-server-ip
```

Now, open and edit the `workers` file:

```custom_prefix(sammy@hadoop-master$)
[environment second]
nano ~/my-hadoop-install/hadoop-3.0.1/etc/hadoop/workers
```

Here, you’ll add the IP addresses of each of your worker nodes, underneath where it says `localhost`.

```
[environment second]
[label workers]
localhost
hadoop-worker-01-server-ip
hadoop-worker-02-server-ip
hadoop-worker-03-server-ip
```

After finishing the configuration of the `MapReduce` and `YARN` properties, we can now finish configuring the worker nodes.

## Step 6 — Configure the Worker Nodes

We’ll now configure the worker nodes so that they each have the correct reference to the data directory for HDFS. On **each worker node**, edit this XML file:

```custom_prefix(sammy@hadoop-worker$)
[environment fourth]
nano ~/my-hadoop-install/hadoop-3.0.1/etc/hadoop/hdfs-site.xml
```

Replace the configuration section with the following:

```
[label hdfs-site.xml]
[environment fourth]
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///usr/local/hadoop/hdfs/data</value>
    </property>
</configuration>
```

Save and close the file. Be sure to replicate this step on **all three** of your worker nodes.

At this point, our worker node Droplets are pointing to the data directory for HDFS, which will allow us to run our Hadoop cluster.
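If you'd prefer not to edit the same file by hand on all three workers, one optional shortcut is to prepare a worker copy of `hdfs-site.xml` once and push it out over the passwordless SSH connections from Step 4. This is only a sketch: it assumes the same `sammy` username and `my-hadoop-install` path on every node, that a local file named `worker-hdfs-site.xml` (a hypothetical helper file) already contains the worker configuration shown above, and that the placeholder IPs are replaced with your own.

```bash
# Copy a prepared worker hdfs-site.xml into place on each worker node
for ip in hadoop-worker-01-server-ip hadoop-worker-02-server-ip hadoop-worker-03-server-ip; do
    scp worker-hdfs-site.xml \
        sammy@"$ip":~/my-hadoop-install/hadoop-3.0.1/etc/hadoop/hdfs-site.xml
done
```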
## Step 7 — Run the Hadoop Cluster

We have reached a point where we can start our Hadoop cluster. Before we start it up, we need to format the HDFS on the master node. While on the master node Droplet, change directories to where Hadoop is installed:

```custom_prefix(sammy@hadoop-master$)
[environment second]
cd ~/my-hadoop-install/hadoop-3.0.1/
```

Then run the following command to format HDFS:

```custom_prefix(sammy@hadoop-master$)
[environment second]
sudo ./bin/hdfs namenode -format
```

A successful formatting of the namenode will result in a lot of output, consisting of mostly `INFO` statements. At the bottom you will see the following, confirming that you’ve successfully formatted the storage directory.

```
[environment second]
[secondary_label Output]
...
2018-01-28 17:58:08,323 INFO common.Storage: Storage directory /usr/local/hadoop/hdfs/data has been successfully formatted.
2018-01-28 17:58:08,346 INFO namenode.FSImageFormatProtobuf: Saving image file /usr/local/hadoop/hdfs/data/current/fsimage.ckpt_0000000000000000000 using no compression
2018-01-28 17:58:08,490 INFO namenode.FSImageFormatProtobuf: Image file /usr/local/hadoop/hdfs/data/current/fsimage.ckpt_0000000000000000000 of size 389 bytes saved in 0 seconds.
2018-01-28 17:58:08,505 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
2018-01-28 17:58:08,519 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop-example-node/127.0.1.1
************************************************************/
```

Now, start the Hadoop cluster by running the following scripts (be sure to check scripts before running by using the `less` command):

```custom_prefix(sammy@hadoop-master$)
[environment second]
sudo ./sbin/start-dfs.sh
```

You’ll then see output that contains the following:

```
[environment second]
[secondary_label Output]
Starting namenodes on [hadoop-master-server-ip]
Starting datanodes
Starting secondary namenodes [hadoop-master]
```

Then run YARN, using the following script:

```custom_prefix(sammy@hadoop-master$)
[environment second]
./sbin/start-yarn.sh
```

The following output will appear:

```
[environment second]
[secondary_label Output]
Starting resourcemanager
Starting nodemanagers
```

Once you run those commands, you should have daemons running on the master node and one on each of the worker nodes.

We can check the daemons by running the `jps` command to check for Java processes:

```custom_prefix(sammy@hadoop-master$)
[environment second]
jps
```

After running the `jps` command, you will see that the `NodeManager`, `SecondaryNameNode`, `Jps`, `NameNode`, `ResourceManager`, and `DataNode` are running. Something similar to the following output will appear:

```
[environment second]
[secondary_label Output]
9810 NodeManager
9252 SecondaryNameNode
10164 Jps
8920 NameNode
9674 ResourceManager
9051 DataNode
```

This verifies that we’ve successfully created a cluster and verifies that the Hadoop daemons are running.

In a web browser of your choice, you can get an overview of the health of your cluster by navigating to:

```
http://hadoop-master-server-ip:9870
```

If you have a firewall, be sure to open port 9870. You’ll see something that looks similar to the following:

![Hadoop Health Verification](https://assets.digitalocean.com/articles/hadoop-cluster/hadoop-verification.png)

From here, you can navigate to the `Datanodes` item in the menu bar to see the node activity.

### Conclusion

In this tutorial, we went over how to set up and configure a Hadoop multi-node cluster using DigitalOcean Ubuntu 16.04 Droplets. You can also now monitor and check the health of your cluster using Hadoop’s DFS Health web interface.

To get an idea of possible projects you can work on to utilize your newly configured cluster, check out Apache’s long list of projects [powered by Hadoop](https://wiki.apache.org/hadoop/PoweredBy).