Category: Hadoop

  • AWS CloudFormation Template for creating EMR Cluster with Autoscaling, Cloudwatch metrics and Lambda

    There are several ways to spin up an EMR cluster. The manual approaches are the AWS console and the CLI, where the command can be as simple as: aws emr create-cluster --name "Test Spark cluster" --release-label emr-5.7.0 --applications Name=Spark --ec2-attributes KeyName=test1_Ec2_keypair --instance-type m4.xlarge --instance-count 3 --use-default-roles or, with advanced options: aws emr create-cluster --release-label emr-5.7.0 --name…

  • Verifying HDFS in-transit encryption Using tcpdump and Wireshark

    In this document we show how to verify whether data being transferred to a Hadoop cluster with HDFS in-transit encryption enabled is actually encrypted. Note: here we are…

  • Oozie ssh action on EMR cluster

    Prerequisites for the Oozie ssh action on an EMR cluster: with an Oozie ssh action, Oozie tries to ssh into the remote host as the oozie user. We therefore first need to ensure that we can ssh into the remote host from the Oozie server as the oozie user. We also…

  • Unable to import graphframes with pyspark

    You might hit the error below while trying to import the graphframes module in a pyspark session on an EMR cluster.

        >>> print(spark.version)
        2.1.0
        >>> from graphframes import *
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        ImportError: No module named graphframes
        >>>

    it will need…

  • Zookeeper Setup

    Required software: ZooKeeper runs on Java release 1.6 or greater (JDK 6 or greater), so download and install the JDK first. ZooKeeper runs as an ensemble of ZooKeeper servers, which should be odd in number because ZooKeeper requires a majority. For example, with four machines ZooKeeper can only handle the failure of a single machine;…
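
    The majority rule in the excerpt above can be checked with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical and it does not talk to ZooKeeper. It assumes a quorum is a strict majority of the ensemble.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """Servers that can fail while a majority is still available."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    # Compare odd and even ensemble sizes side by side.
    for n in range(3, 8):
        print(f"{n} servers: quorum {quorum_size(n)}, "
              f"tolerates {tolerated_failures(n)} failure(s)")
```

    Under this assumption a fourth server adds no fault tolerance over three, which is why odd ensemble sizes are recommended.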

  • Securing Hadoop Cluster part-2 KERBEROS SETUP

    Contents: Kerberos; Kerberos installation and setup; Kerberos KDC server setup; Kerberos client setup; Create service principals and keytabs for Hadoop services; Update the configuration files for each Hadoop service. Kerberos: a secure network authentication system developed at MIT in the 1980s.…

  • Securing Hadoop Cluster part -1 (SSL/TLS for HDFS and Yarn)

    Hadoop in secure mode: the security features of Hadoop consist of authentication, service-level authorization, authentication for web consoles, and data confidentiality. For client interaction, authentication and service-level authorization can be achieved with Kerberos. The data transferred between hadoop…

  • VM setup for Hadoop Cluster (centos)

    Before you get hands-on with the Hadoop ecosystem, whether for Hadoop admin or development exposure, you will need your own small cluster setup, which will help you understand how Hadoop works internally. A distributed Hadoop cluster setup needs multiple hosts/servers, and for obvious reason…

  • Hadoop2 Cluster Setup

    This document outlines the steps for a Hadoop2 cluster setup. Table of contents: Pre-requisites; Create user and group (hadoop) on all hosts; Set up SSH for the hadoop user; Download and install Java; Download Hadoop; Update the bashrc file for the hadoop user to set the required path variables; Update Hadoop Configuration…