In this How To article I demonstrate how to use the AWS CLI to create an Amazon Elastic MapReduce (EMR) cluster along with some common supplementary resources for experimentation and development. Amazon EMR is a big data service from AWS that makes it easy to provision distributed computing clusters with user-defined open source data processing and AI/ML tools such as Apache Hive, Apache Spark, Apache Flink, and many others.
In this demonstration I provide the steps necessary to create a modest EMR cluster, complete with Apache Hadoop and Apache Hive, suitable for experimentation and interactive development. By this I mean that these steps will produce a three-node EMR cluster (one master and two core worker nodes) that can be accessed via SSH and will be given fairly open service roles to interact with other common AWS services like Kinesis, S3, Glue, and others.
Before going any further I do want to warn that this cluster will cost around $0.75 per hour to run, so if you intend to use this "recipe" of steps to create an EMR cluster you will be charged by AWS.
1) Create EC2 Key Pair for SSH Access to Master Node of Cluster
This command will create an EC2 Key Pair (.pem key) and save it to a local file which can then be used to SSH onto the resulting EMR Cluster's Master Node.
aws ec2 create-key-pair --key-name emr-keypair \
--query 'KeyMaterial' --output text > emr-keypair.pem
Change the key file permissions so it is readable by the owner only.
chmod 400 emr-keypair.pem
2) Create an S3 Bucket to Hold Resources as well as Accept Log Files
This step creates an S3 bucket that will be useful for holding data to be processed within the EMR cluster, storing results produced by jobs, and receiving log files from the cluster's execution.
If you have an S3 bucket you'd like to use you can skip creating one.
The following creates an S3 bucket named tci-emr-demo in my default region, us-east-1 (pass --region to target a different one explicitly).
aws s3 mb s3://tci-emr-demo
Next I create folders in S3 for logs, inputs, outputs, and scripts.
aws s3api put-object --bucket tci-emr-demo --key logs/
aws s3api put-object --bucket tci-emr-demo --key inputs/
aws s3api put-object --bucket tci-emr-demo --key outputs/
aws s3api put-object --bucket tci-emr-demo --key scripts/
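To confirm the bucket and its folder layout look as expected you can list the bucket contents.
aws s3 ls s3://tci-emr-demo/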
3) Create an EC2 Security Group to Allow SSH Connections from Your IP
First I find my public IP using an HTTP client such as HTTPie (you could also use your browser or curl), which I'll save to an environment variable named MY_IP.
http https://checkip.amazonaws.com -b
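If you'd like to capture the result directly into the MY_IP variable, a command substitution such as the following works (shown with HTTPie here; curl -s https://checkip.amazonaws.com would do the same).
MY_IP=$(http https://checkip.amazonaws.com -b)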
Next I create an EC2 Security Group in a VPC that has public subnets. Your AWS account likely has a default VPC with a public subnet, but you may need to create one if that is not the case.
I've previously grabbed my VPC ID and saved it to an environment variable named MY_VPC, along with a Public Subnet ID saved to a SUBNET_ID variable.
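If you need to look these values up, a sketch like the following should work, assuming your account has a default VPC and that its first listed subnet is public (worth verifying in your own account).
MY_VPC=$(aws ec2 describe-vpcs --filters Name=isDefault,Values=true \
--query 'Vpcs[0].VpcId' --output text)
SUBNET_ID=$(aws ec2 describe-subnets --filters Name=vpc-id,Values=$MY_VPC \
--query 'Subnets[0].SubnetId' --output text)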
aws ec2 create-security-group --group-name ssh-my-ip \
--description "For SSHing from my IP" --vpc-id $MY_VPC
This should output the newly created Security Group ID, which I've saved to an environment variable named MY_SG.
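If you missed capturing it from that output you can look the ID up again by group name.
MY_SG=$(aws ec2 describe-security-groups --filters Name=group-name,Values=ssh-my-ip \
--query 'SecurityGroups[0].GroupId' --output text)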
Using the Security Group ID along with the result of my IP lookup I can now add an ingress rule to the security group to allow TCP connections to the standard SSH port 22.
aws ec2 authorize-security-group-ingress --group-id $MY_SG \
--protocol tcp --port 22 --cidr $MY_IP/32
4) Create default IAM Roles for EMR
This is a simple way to create default EMR IAM roles which give an EMR cluster rather liberal access to other commonly used AWS services. Please be sure to consult your organization's AWS and IAM administrator before doing this.
aws emr create-default-roles
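This creates the EMR_DefaultRole and EMR_EC2_DefaultRole roles (along with an EC2 instance profile). You can spot check that they exist with get-role.
aws iam get-role --role-name EMR_DefaultRole
aws iam get-role --role-name EMR_EC2_DefaultRole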
5) Create EMR Cluster with Hive
The following command creates an emr-5.33.0 release EMR cluster of three m5.xlarge EC2 instances with the Hadoop and Hive applications, along with a 12 GB EBS root volume. The output of this command will include your EMR Cluster ID, which is needed for subsequent operations; I've saved mine in a variable named CLUSTER_ID (one way to capture it is sketched after the command below).
aws emr create-cluster --name tci-cluster --applications Name=Hadoop Name=Hive \
--release-label emr-5.33.0 --use-default-roles \
--instance-count 3 --instance-type m5.xlarge \
--ebs-root-volume-size 12 \
--log-uri s3://tci-emr-demo/logs \
--ec2-attributes KeyName=emr-keypair,AdditionalMasterSecurityGroups=$MY_SG,SubnetId=$SUBNET_ID \
--no-auto-terminate
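The create-cluster command prints JSON containing a ClusterId field (appending --query 'ClusterId' --output text to the command above captures it directly). Alternatively, you can look it up afterwards by name among the active clusters.
CLUSTER_ID=$(aws emr list-clusters --active \
--query 'Clusters[?Name==`tci-cluster`].Id' --output text)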
Another common application used with EMR (likely more common than Hive) is Spark. If you are looking to add Spark to the list of installed applications, amend the --applications flag to the following.
--applications Name=Hadoop Name=Hive Name=Spark
You can use the describe-cluster command to determine the MasterPublicDnsName of the cluster as well as to see when the cluster is fully provisioned and in the WAITING state.
aws emr describe-cluster --cluster-id $CLUSTER_ID
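If you'd rather block until the cluster is ready than poll describe-cluster repeatedly, the CLI also provides a waiter for this.
aws emr wait cluster-running --cluster-id $CLUSTER_ID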
If you have the CLI program jq installed you can parse out the MasterPublicDnsName using the following command.
MASTER_URL=$(aws emr describe-cluster --cluster-id $CLUSTER_ID | jq -r ".Cluster.MasterPublicDnsName")
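If you don't have jq installed, the CLI's built-in --query option produces the same result.
MASTER_URL=$(aws emr describe-cluster --cluster-id $CLUSTER_ID \
--query 'Cluster.MasterPublicDnsName' --output text)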
Next you'll want to be sure you can SSH onto the Master Node using the MasterPublicDnsName retrieved above and the SSH key pair created earlier.
ssh hadoop@$MASTER_URL -i emr-keypair.pem
Once on the Master Node I should be able to interact with HDFS and verify that Hadoop and Hive were installed by listing the contents of the /user directory like so.
hadoop fs -ls /user
Yielding the following output.
Found 4 items
drwxrwxrwx - hadoop hdfsadmingroup 0 2021-06-23 03:29 /user/hadoop
drwxr-xr-x - mapred mapred 0 2021-06-23 03:14 /user/history
drwxrwxrwx - hdfs hdfsadmingroup 0 2021-06-23 03:14 /user/hive
drwxrwxrwx - root hdfsadmingroup 0 2021-06-23 03:14 /user/root
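As a further check that Hive itself is usable, you can run a simple statement through the Hive CLI on the Master Node (this assumes the stock Hive installation EMR provides).
hive -e "SHOW DATABASES;"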
When you are done working with this cluster you will likely want to destroy it to minimize costs.
aws emr terminate-clusters --cluster-ids $CLUSTER_ID
Then use either the same describe command from earlier or list all clusters to ensure that the cluster reaches a TERMINATED state.
aws emr list-clusters
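Alternatively, the cluster-terminated waiter blocks until shutdown completes, or you can pull just the state field with --query.
aws emr wait cluster-terminated --cluster-id $CLUSTER_ID
aws emr describe-cluster --cluster-id $CLUSTER_ID --query 'Cluster.Status.State' --output text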
In this How To article I provided a recipe of steps and resources required to create an interactive AWS EMR cluster with Apache Hadoop and Apache Hive applications installed on it and configured to be accessible via SSH from a narrowly scoped individual IP address. I've regularly used this same set of steps to experiment with various cluster based Big Data technologies useful for developing many data intensive workloads and hope readers find it useful.
As always, thanks for reading and don't be shy about commenting or critiquing below.