How To Provision AWS EMR Cluster with Cross Account S3 Bucket Access Using Terraform

By Adam McQuistan in DevOps  06/28/2021

Introduction

In this article I demonstrate how to use Terraform, the popular Infrastructure as Code (IaC) configuration language, to provision an AWS EMR Cluster and establish read-only S3 bucket access for consuming data from another AWS account. To demonstrate this I have created two AWS accounts, which I'll refer to as acct_a and acct_b throughout. The EMR cluster is provisioned in acct_a, and a separate S3 bucket serving as read-only input data is provisioned in acct_b.

Provisioning AWS EMR Cluster in Account A

As already stated, I am using Terraform (version 0.14) to provision an EMR cluster along with its own VPC containing a single public subnet. The complete list of resources being created is below.

  • AWS Virtual Private Cloud (VPC) network for Cluster Isolation
  • AWS Public Subnet within the VPC, complete with a Route Table and a Route to an Internet Gateway, in which to deploy the EMR Cluster and reach it over SSH
  • AWS VPC Endpoint for S3 Service so network traffic does not leave AWS
  • Security Group for the VPC which allows SSH Connections over Port 22 Scoped to only a Whitelisted IP address
  • EMR Service IAM Role for the EMR Cluster which allows the Service to create the necessary Cluster resources
  • EMR EC2 IAM Role which gives the EC2 instances of the EMR Cluster access to many different commonly used AWS Services
  • An S3 Bucket for Storing Results of Computation produced within the EMR Cluster
  • Three EC2 instances (one master node and two core nodes) which comprise the computing resources of the EMR Cluster, with the Hadoop, Apache Hive and Apache Spark applications installed
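
Because the code in this article was written against Terraform 0.14, it can help to pin versions so that later releases do not change behavior unexpectedly. The following is a minimal sketch that could sit at the top of the Terraform source file. The AWS provider constraint is my assumption (the provider version is not stated in this article), so adjust it to whatever version you actually test with.

```hcl
terraform {
  # the Terraform release line used in this article
  required_version = "~> 0.14.0"

  required_providers {
    aws = {
      source = "hashicorp/aws"
      # assumption: a 3.x release of the AWS provider
      version = "~> 3.0"
    }
  }
}
```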

Below you can find a single Terraform source file, which I'll refer to as main.tf.

# main.tf

############################
# Variables
############################
variable "aws_region" {
  # fill in another region as desired
  default = "us-east-1"
}

variable "aws_profile" {
  # works with my computer's credentials chain, which has an acct_a profile established
  default = "acct_a"
}

variable "vpc_cidr" {
  default = "10.192.0.0/24"
}

variable "keypair_name" {
  # must be the name of a keypair that exists in the acct_a account within the region being deployed to,
  # with its private key available on the computer you wish to establish the SSH connection from
  default = "emr-keypair"
}

variable "whitelisted_ip" {
  # the IP address which you wish to establish the SSH connection from
  # you can use the AWS REST endpoint https://checkip.amazonaws.com
  default = "11.22.33.44"
}

############################
# Providers
############################
provider "aws" {
  region  = var.aws_region
  profile = var.aws_profile
}

############################
# Data
############################
data "aws_availability_zones" "available" {}


############################
# Resources
############################

# Network Resources
resource "aws_vpc" "emr_vpc" {
    cidr_block           = var.vpc_cidr
    enable_dns_hostnames = true
    enable_dns_support   = true
}

resource "aws_internet_gateway" "igw" {
    vpc_id = aws_vpc.emr_vpc.id
}

resource "aws_subnet" "pub_subnet_one" {
    # creates a /28 subnet of 16 IP addresses (10.192.0.0 to 10.192.0.15); AWS reserves the first four
    # and the last address of every subnet, leaving 10.192.0.4 to 10.192.0.14 usable
    cidr_block              = cidrsubnet(var.vpc_cidr, 4, 0)
    vpc_id                  = aws_vpc.emr_vpc.id
    map_public_ip_on_launch = true
    availability_zone       = data.aws_availability_zones.available.names[0]
}

resource "aws_route_table" "pub_route_tbl" {
    vpc_id = aws_vpc.emr_vpc.id
    route {
        cidr_block = "0.0.0.0/0"
        gateway_id = aws_internet_gateway.igw.id
    }
}

resource "aws_route_table_association" "pub_subnet_one_asn" {
    subnet_id       = aws_subnet.pub_subnet_one.id
    route_table_id  = aws_route_table.pub_route_tbl.id 
}

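resource "aws_vpc_endpoint" "s3" {
    vpc_id          = aws_vpc.emr_vpc.id
    service_name    = "com.amazonaws.${var.aws_region}.s3"
    # a Gateway endpoint only carries traffic for the route tables it is associated with
    route_table_ids = [aws_route_table.pub_route_tbl.id]
}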

# Default Security Group allowing All Inter Network Comm
resource "aws_security_group" "default" {
    name       = "emr-vpc-default-sg"
    vpc_id     = aws_vpc.emr_vpc.id
    depends_on = [ aws_vpc.emr_vpc ]
    ingress {
        from_port = "0"
        to_port   = "0"
        protocol  = "-1"
        self      = true
    }
    egress {
        from_port = "0"
        to_port   = "0"
        protocol  = "-1"
        self      = true
    }
}

# SSH Enabling Security Group (Scoped to Specific IP)
resource "aws_security_group" "ssh_myip" {
  name = "ssh-from-my-ip"
  description = "Allow SSH Access from Specific IP"
  vpc_id = aws_vpc.emr_vpc.id

  ingress {
    description = "SSH From My IP"
    from_port = 22
    to_port = 22
    protocol = "tcp"
    cidr_blocks = ["${var.whitelisted_ip}/32"]
  }

  egress {
    from_port        = 0
    to_port          = 0
    protocol         = "-1"
    cidr_blocks      = ["0.0.0.0/0"]
  }
}

# IAM role for EMR Service
resource "aws_iam_role" "iam_emr_service_role" {
  name = "iam_emr_service_role"
  
  managed_policy_arns = [
    "arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceRole"
  ]

  assume_role_policy = <<EOF
{
  "Version": "2008-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "elasticmapreduce.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
}

# IAM Role for EC2 Instance Profile
resource "aws_iam_role" "iam_emr_profile_role" {
  name = "iam_emr_profile_role"

  managed_policy_arns = [
    "arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceforEC2Role"
  ]

  assume_role_policy = <<EOF
{
  "Version": "2008-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
}

resource "aws_iam_instance_profile" "emr_profile" {
  name = "emr_profile"
  role = aws_iam_role.iam_emr_profile_role.name
}


# S3 bucket for writing results from EMR
resource "aws_s3_bucket" "emr_results" {
  bucket = "tci-emr-results"
}


# EMR Cluster with Hadoop, Hive, and Spark applications
resource "aws_emr_cluster" "cluster" {
  name = "my-cluster"
  release_label = "emr-5.33.0"
  applications = ["Hadoop", "Hive", "Spark"]

  ebs_root_volume_size = "20"

  ec2_attributes {
    key_name = var.keypair_name
    subnet_id = aws_subnet.pub_subnet_one.id
    additional_master_security_groups = aws_security_group.ssh_myip.id
    instance_profile = aws_iam_instance_profile.emr_profile.arn
  }

  master_instance_group {
    instance_type = "m5.xlarge"
  }

  core_instance_group {
    instance_type = "m5.xlarge"
    instance_count = 2
  }

  service_role = aws_iam_role.iam_emr_service_role.arn
}

############################
# Outputs
############################

output "emr_ec2_role_arn" {
  value = aws_iam_role.iam_emr_profile_role.arn
}

output "emr_master_dns" {
  value = aws_emr_cluster.cluster.master_public_dns
}

It's worth noting that the IAM roles used here rely on the default managed policies provided by AWS, which are rather liberal: they grant access to many different services, each with a very open scope. These managed policies are useful for learning the EMR service, but for a production deployment the IAM policies attached to the EMR IAM roles should be scoped much more narrowly to the specific services and resources actually being used.
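
To give a feel for what narrower scoping might look like, here is a minimal sketch of an inline policy for the EMR EC2 role that only covers the two S3 buckets used in this article. The variable input_bucket_arn and the policy name emr-s3-scoped are hypothetical names I made up for illustration. A real cluster will likely need additional permissions that I have not enumerated here, so treat this as a starting point rather than a drop-in replacement for the managed policy.

```hcl
# Hypothetical variable: ARN of the acct_b input bucket
variable "input_bucket_arn" {
  default = "arn:aws:s3:::tci-emr-inputs"
}

# Inline policy on the EMR EC2 role limited to the two buckets in this article
resource "aws_iam_role_policy" "emr_s3_scoped" {
  name = "emr-s3-scoped"
  role = aws_iam_role.iam_emr_profile_role.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        # read-only access to the input data
        Sid      = "ReadInputs"
        Effect   = "Allow"
        Action   = ["s3:GetObject", "s3:ListBucket"]
        Resource = [var.input_bucket_arn, "${var.input_bucket_arn}/*"]
      },
      {
        # read and write access to the results bucket
        Sid      = "ReadWriteResults"
        Effect   = "Allow"
        Action   = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:ListBucket"]
        Resource = [aws_s3_bucket.emr_results.arn, "${aws_s3_bucket.emr_results.arn}/*"]
      }
    ]
  })
}
```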

Steps to Provision EMR Cluster with Terraform

1) Save the above Terraform code to a file named main.tf in a directory named emr-cluster, being sure to enter valid values for your equivalent of the acct_a profile (or other credentials) as well as the VPC CIDR range, EC2 key pair name, and IP address to whitelist.

2) Change directories into the emr-cluster directory and initialize the Terraform project

terraform init

3) Generate the Terraform plan and save it to a file named emr-cluster.tfplan

terraform plan -out emr-cluster.tfplan

4) Execute the plan by applying the saved emr-cluster.tfplan file

terraform apply "emr-cluster.tfplan"

Upon completion of the last step, Terraform prints the following outputs:

  • The public DNS name of the EMR Cluster's master node, which can be used to establish SSH connections
  • The EMR EC2 Role ARN, which is needed to establish a bucket policy in acct_b allowing the EMR Cluster read-only S3 access

Provisioning S3 Bucket and Read-Only EMR Access From Account A to Account B

In this section I provide the necessary Terraform code to provision an S3 bucket in acct_b which will serve as input data to the EMR Cluster in acct_a. This cross-account access is made possible by an S3 Bucket Policy, also created using Terraform, which allows read-only access from the EMR EC2 IAM Role output by the Terraform code executed in the last step. Cross-account access needs permission on both sides; on the acct_a side, the AmazonElasticMapReduceforEC2Role managed policy attached earlier already allows the S3 reads.

Below is the Terraform code necessary to create the S3 Bucket and S3 Bucket Policy just described.

############################
# Variables
############################
variable "aws_region" {
  default = "us-east-1"
}

variable "aws_profile" {
  default = "acct_b"
}

variable "emr_ec2_role_arn" {
  # Be sure to replace this with the output of the other Terraform script
  default = "arn:aws:iam::ACCOUNT-A-ID-GOES-HERE:role/iam_emr_profile_role"
}

############################
# Providers
############################
provider "aws" {
  region  = var.aws_region
  profile = var.aws_profile
}

############################
# Resources
############################

# S3 bucket holding the input data read by EMR
resource "aws_s3_bucket" "emr_inputs" {
  bucket = "tci-emr-inputs"
}

# Policy allowing the acct_a EMR cluster to access this acct_b bucket
resource "aws_s3_bucket_policy" "allow_acct_a" {
  bucket = aws_s3_bucket.emr_inputs.id

  policy = <<EOF
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AcctAEmrS3Access",
            "Action": [
                "s3:Get*",
                "s3:List*"
            ],
            "Effect": "Allow",
            "Resource": [
                "${aws_s3_bucket.emr_inputs.arn}",
                "${aws_s3_bucket.emr_inputs.arn}/*"
            ],
            "Principal": {
                "AWS": "${var.emr_ec2_role_arn}"
            }
        }
    ]
}
EOF
}
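
The s3:Get* and s3:List* wildcards above also match bucket-level read actions beyond fetching objects and listing keys. If you want something tighter, the following sketch expresses the same intent with the aws_iam_policy_document data source, granting only s3:ListBucket on the bucket and s3:GetObject on its objects. It is meant to replace the allow_acct_a resource above, not sit alongside it, and the statement IDs are names I picked for illustration. Depending on the tools you read the data with, further read actions may be needed.

```hcl
data "aws_iam_policy_document" "acct_a_read_only" {
  # listing is a bucket-level action, so it targets the bucket ARN itself
  statement {
    sid       = "AcctAEmrListBucket"
    effect    = "Allow"
    actions   = ["s3:ListBucket"]
    resources = [aws_s3_bucket.emr_inputs.arn]

    principals {
      type        = "AWS"
      identifiers = [var.emr_ec2_role_arn]
    }
  }

  # reading is an object-level action, so it targets the objects in the bucket
  statement {
    sid       = "AcctAEmrGetObject"
    effect    = "Allow"
    actions   = ["s3:GetObject"]
    resources = ["${aws_s3_bucket.emr_inputs.arn}/*"]

    principals {
      type        = "AWS"
      identifiers = [var.emr_ec2_role_arn]
    }
  }
}

resource "aws_s3_bucket_policy" "allow_acct_a" {
  bucket = aws_s3_bucket.emr_inputs.id
  policy = data.aws_iam_policy_document.acct_a_read_only.json
}
```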

Steps to Provision S3 Bucket and Bucket Policy with Terraform

1) Save the above Terraform code to a file named main.tf in a directory named s3-resources, being sure to enter valid values for your equivalent of the acct_b profile (or other credentials) as well as the EMR EC2 IAM Role ARN.

2) Change directories into the s3-resources directory and initialize the Terraform project

terraform init

3) Generate the Terraform plan and save it to a file named s3-resources.tfplan

terraform plan -out s3-resources.tfplan

4) Execute the plan by applying the saved s3-resources.tfplan file

terraform apply "s3-resources.tfplan"
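
If you would rather not copy the role ARN between projects by hand, one option is to read it from the state of the emr-cluster project. The sketch below assumes that emr-cluster and s3-resources are sibling directories and that the emr-cluster project keeps its state in the default local terraform.tfstate file; neither is required by anything above, so adjust the path or backend to your setup.

```hcl
# Assumption: ../emr-cluster holds the local state file written by the first project
data "terraform_remote_state" "emr_cluster" {
  backend = "local"

  config = {
    path = "../emr-cluster/terraform.tfstate"
  }
}

locals {
  # can be referenced in the bucket policy in place of var.emr_ec2_role_arn
  emr_ec2_role_arn = data.terraform_remote_state.emr_cluster.outputs.emr_ec2_role_arn
}
```

With this in place, the Principal in the bucket policy would reference local.emr_ec2_role_arn, and the emr-cluster project must be applied before the s3-resources project.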

Conclusion

At this point you should be able to load input data into the acct_b S3 bucket, SSH into the EMR Cluster provisioned in acct_a from the whitelisted IP address, and read the data in the acct_b S3 bucket from the acct_a EMR Cluster.
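
As one way to load a first input file, the s3-resources project itself could upload it. This is a small sketch assuming a 3.x AWS provider, where the resource is called aws_s3_bucket_object (later provider versions rename it to aws_s3_object). The file sample.csv and the key inputs/sample.csv are hypothetical placeholders.

```hcl
# Assumption: sample.csv is a hypothetical file sitting next to main.tf in s3-resources
resource "aws_s3_bucket_object" "sample_input" {
  bucket = aws_s3_bucket.emr_inputs.id
  key    = "inputs/sample.csv"
  source = "${path.module}/sample.csv"

  # re-upload when the content of the local file changes
  etag = filemd5("${path.module}/sample.csv")
}
```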

To destroy the provisioned resources and avoid unnecessary charges for idle resources, execute the following in each directory where the provisioning took place.

terraform destroy

Be sure to enter yes when prompted.
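
One thing to watch for: Terraform will not delete an S3 bucket that still contains objects, so the destroy can fail on the results bucket once the cluster has written to it. If you are happy to discard those results, a sketch of the results bucket with force_destroy enabled looks like the following. Apply the change before running the destroy, and only use it when losing the bucket contents is acceptable.

```hcl
resource "aws_s3_bucket" "emr_results" {
  bucket = "tci-emr-results"

  # delete any remaining objects so the bucket itself can be destroyed
  force_destroy = true
}
```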

As always, thanks for reading and don't be shy about commenting or critiquing below.
