Managing S3 Data Store Partitions with AWS Glue Crawlers and Glue Partitions API using Boto3 SDK

Introduction

In this article I dive into partitions for S3 data stores within the context of the AWS Glue Metadata Catalog, covering how they can be recorded using Glue Crawlers as well as the Glue API with the Boto3 SDK. I discuss how partitions can be used to facilitate data retrieval within Glue Spark ETL jobs, along with AWS Athena, and explain some considerations for using either Crawlers or the Glue API to manage partitions.

S3 Data Store Partitions

Partitions are used in conjunction with S3 data stores to organize data using a clear, meaningful, and navigable naming convention. Partitions are particularly important for efficient data traversal and retrieval in Glue ETL jobs as well as when querying S3 data with Athena. To demonstrate the use of partitions in an S3 data store I will work with fictitious sales data partitioned by year, month, and day.

To start I need an S3 bucket to save sales data to, which I can create using the AWS CLI.

aws s3 mb s3://tci-sales-data --region us-east-2

Then I can use the Python script below to generate fake sales data using the popular Faker library. This script generates fictitious daily sales data for five products and saves the data as CSV files in S3.

# sales_generator.py

import argparse
import boto3
import csv
import random
import io

from datetime import date, datetime

from faker import Faker, providers


class Product(providers.BaseProvider):
    products = [
      {'name':'product-a', 'price': 1.99},
      {'name':'product-b', 'price': 2.99},
      {'name':'product-c', 'price': 3.99},
      {'name':'product-d', 'price': 6.99},
      {'name':'product-e', 'price': 12.99}
    ]

    def product(self):
        return random.choice(self.products)


def generate_sales(s3bucket, items, date):
    faker = Faker()
    # Faker bundles its standard date_time, address, and person providers
    # by default, so only the custom Product provider needs registering
    faker.add_provider(Product)

    start = datetime(date.year, date.month, date.day, 0, 0)
    end = datetime(date.year, date.month, date.day, 23, 59)

    iostream = io.StringIO()
    writer = csv.writer(iostream)
    writer.writerow(['created','firstname','lastname','city','product','unitprice','quantity'])

    for _ in range(items):
        fname = faker.first_name()
        lname = faker.last_name()
        city = faker.city()
        qty = random.randint(1, 10)
        product = faker.product()
        ds = faker.date_time_between(start_date=start, end_date=end)

        writer.writerow([ds.isoformat(), fname, lname, city, product['name'], product['price'], qty])

    key_opts = {
        'year': start.year,
        'month': "{:02d}".format(start.month),
        'day': "{:02d}".format(start.day),
        'filename': start.strftime('%Y%m%d.json') 
    }
    s3key = "sales/year={year}/month={month}/day={day}/{filename}".format(**key_opts)

    s3 = boto3.client('s3')
    s3.put_object(ACL='private',
                  Bucket=s3bucket,
                  Key=s3key,
                  Body=iostream.getvalue().encode('utf-8'))


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--s3bucket')
    parser.add_argument('--items', type=int)
    parser.add_argument('--date', type=date.fromisoformat)
    args = parser.parse_args()

    generate_sales(args.s3bucket, args.items, args.date)

I then execute it from the CLI shell a few times to generate three days of sales data saved to S3, partitioned by year, month, and day.

$ python sales_generator.py --s3bucket tci-sales-data --items 10 --date '2021-01-01'
$ python sales_generator.py --s3bucket tci-sales-data --items 10 --date '2021-01-02'
$ python sales_generator.py --s3bucket tci-sales-data --items 10 --date '2021-01-03'

I can use the AWS CLI once again to verify that data was stored in the S3 bucket as shown below.

$ aws s3 ls s3://tci-sales-data --recursive
2021-03-13 14:49:58        698 sales/year=2021/month=01/day=01/20210101.json
2021-03-13 14:50:02        708 sales/year=2021/month=01/day=02/20210102.json
2021-03-13 14:50:07        705 sales/year=2021/month=01/day=03/20210103.json
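This Hive-style key layout (name=value path segments) is what allows Glue to infer partitions from the folder structure. As a quick illustration of how partition values map onto object keys, the following sketch parses them back out of a key; the parse_partition_values helper is my own illustration and is not part of Boto3 or the AWS CLI.

```python
import re


def parse_partition_values(key):
    """Extract Hive-style partition name/value pairs (e.g. year=2021)
    from an S3 object key into a dict."""
    return dict(re.findall(r'([^/=]+)=([^/]+)', key))


if __name__ == '__main__':
    key = 'sales/year=2021/month=01/day=01/20210101.json'
    print(parse_partition_values(key))
    # prints {'year': '2021', 'month': '01', 'day': '01'}
```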

Glue Crawlers for Glue Table Generation

AWS Glue Crawlers automate the generation of AWS Glue Catalog Tables, which are a repository for dataset metadata such as data type classification, schema inference, and partition management. To demonstrate this I use the CloudFormation template below to create a Glue Catalog Database along with a Glue Crawler to crawl the S3 data store created in the last section.

# template.yaml

AWSTemplateFormatVersion: 2010-09-09

Parameters:
  S3DataStore:
    Type: String
    Description: Path to root directory of sales dataset in S3

Resources: 
  GlueDefaultIamRole:
    Type: 'AWS::IAM::Role'
    Properties:
      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Principal:
              Service:
              - glue.amazonaws.com
            Action:
              - 'sts:AssumeRole'
      Path: /
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/AmazonS3FullAccess
        - arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole

  SalesGlueDatabase:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: "sales"
        Description: Glue metadata catalog database sales dataset

  SalesGlueCrawler:
    Type: AWS::Glue::Crawler
    Properties:
      Name: 'sales'
      DatabaseName: !Ref SalesGlueDatabase
      Description: Crawls the Sales Data
      Role: !GetAtt GlueDefaultIamRole.Arn
      Targets:
        S3Targets:
          - Path: !Ref S3DataStore

I can again use the AWS CLI to create the CloudFormation stack and subsequent AWS resources.

$ aws cloudformation create-stack --stack-name tci-sales-demo \
  --parameters ParameterKey=S3DataStore,ParameterValue=s3://tci-sales-data/sales \
  --capabilities CAPABILITY_IAM \
  --template-body file://template.yaml

Then, if I log in to the AWS Management Console and go to the Glue service page in the same region I deployed the CloudFormation template to, I will see a new crawler on the Glue Crawlers page. I can start a crawler run by selecting the row containing the crawler and clicking Run Crawler.
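The console is not the only way to kick off a run: the same thing can be done programmatically with the Glue API's start_crawler and get_crawler calls. A minimal sketch, assuming the crawler is named sales and valid AWS credentials are configured:

```python
import time


def crawler_finished(state):
    """A crawler reports READY once a run has completed; RUNNING and
    STOPPING indicate a run is still in flight."""
    return state == 'READY'


def run_crawler(name):
    # boto3 imported here so crawler_finished stays usable without AWS access
    import boto3
    glue = boto3.client('glue')
    glue.start_crawler(Name=name)
    while True:
        state = glue.get_crawler(Name=name)['Crawler']['State']
        if crawler_finished(state):
            break
        time.sleep(15)  # poll every 15 seconds until the run completes


# run_crawler('sales')  # requires AWS credentials; blocks until the run finishes
```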

After a minute or so the crawler will generate a new table in my Glue Catalog which will appear in the Tables page of the Glue UI as shown below.

Clicking the row associated with the sales table will display the details of the metadata represented in this Glue Table associated with the S3 data store.

There are two ways to view the partitions for a Glue Table and its associated data store:

1) In the details view of the Tables UI you can click View Partitions. You can see from the UI screenshot below there are three partitions, one for each of the three days of random sales data I created earlier.

2) You can use the Glue API via either one of the SDKs, such as Boto3, or the AWS CLI, which itself utilizes Boto3 under the hood.

Using the AWS CLI produces output as shown below.

$ aws glue get-partitions --database-name sales --table-name sales                  
{
    "Partitions": [
        {
            "Values": [ "2021", "01", "03" ],
            "DatabaseName": "sales",
            "TableName": "sales",
            "CreationTime": "2021-03-13T14:51:59-06:00",
            "LastAccessTime": "2021-03-13T14:51:59-06:00",
            "StorageDescriptor": {
                "Columns": [
                    { "Name": "created",    "Type": "string" },
                    { "Name": "firstname",  "Type": "string" },
                    { "Name": "lastname",   "Type": "string" },
                    { "Name": "city",       "Type": "string" },
                    { "Name": "product",    "Type": "string" },
                    { "Name": "unitprice",  "Type": "double" },
                    { "Name": "quantity",   "Type": "bigint" }
                ],
                "Location": "s3://tci-sales-data/sales/year=2021/month=01/day=03/",
                "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
                "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
                "Compressed": false,
                "NumberOfBuckets": -1,
                "SerdeInfo": {
                    "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                    "Parameters": {
                        "field.delim": ","
                    }
                },
                "BucketColumns": [],
                "SortColumns": [],
                "Parameters": {
                    "areColumnsQuoted": "false",
                    "averageRecordSize": "60",
                    "classification": "csv",
                    "columnsOrdered": "true",
                    "compressionType": "none",
                    "delimiter": ",",
                    "objectCount": "1",
                    "recordCount": "11",
                    "sizeKey": "705",
                    "skip.header.line.count": "1",
                    "typeOfData": "file"
                },
                "StoredAsSubDirectories": false
            },
            "Parameters": {},
            "CatalogId": "omitted"
        },
        {
            "Values": [ "2021",  "01",  "01" ],
            "DatabaseName": "sales",
            "TableName": "sales",
            "CreationTime": "2021-03-13T14:51:59-06:00",
            "LastAccessTime": "2021-03-13T14:51:59-06:00",
            "StorageDescriptor": {
                "Columns": [
                    { "Name": "created",    "Type": "string" },
                    { "Name": "firstname",  "Type": "string" },
                    { "Name": "lastname",   "Type": "string" },
                    { "Name": "city",       "Type": "string" },
                    { "Name": "product",    "Type": "string" },
                    { "Name": "unitprice",  "Type": "double" },
                    { "Name": "quantity",   "Type": "bigint" }
                ],
                "Location": "s3://tci-sales-data/sales/year=2021/month=01/day=01/",
                "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
                "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
                "Compressed": false,
                "NumberOfBuckets": -1,
                "SerdeInfo": {
                    "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                    "Parameters": {
                        "field.delim": ","
                    }
                },
                "BucketColumns": [],
                "SortColumns": [],
                "Parameters": {
                    "areColumnsQuoted": "false",
                    "averageRecordSize": "61",
                    "classification": "csv",
                    "columnsOrdered": "true",
                    "compressionType": "none",
                    "delimiter": ",",
                    "objectCount": "1",
                    "recordCount": "11",
                    "sizeKey": "698",
                    "skip.header.line.count": "1",
                    "typeOfData": "file"
                },
                "StoredAsSubDirectories": false
            },
            "Parameters": {},
            "CatalogId": "omitted"
        },
        {
            "Values": [ "2021", "01", "02" ],
            "DatabaseName": "sales",
            "TableName": "sales",
            "CreationTime": "2021-03-13T14:51:59-06:00",
            "LastAccessTime": "2021-03-13T14:51:59-06:00",
            "StorageDescriptor": {
                "Columns": [
                    { "Name": "created",    "Type": "string" },
                    { "Name": "firstname",  "Type": "string" },
                    { "Name": "lastname",   "Type": "string" },
                    { "Name": "city",       "Type": "string" },
                    { "Name": "product",    "Type": "string" },
                    { "Name": "unitprice",  "Type": "double" },
                    { "Name": "quantity",   "Type": "bigint" }
                ],
                "Location": "s3://tci-sales-data/sales/year=2021/month=01/day=02/",
                "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
                "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
                "Compressed": false,
                "NumberOfBuckets": -1,
                "SerdeInfo": {
                    "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                    "Parameters": {
                        "field.delim": ","
                    }
                },
                "BucketColumns": [],
                "SortColumns": [],
                "Parameters": {
                    "areColumnsQuoted": "false",
                    "averageRecordSize": "61",
                    "classification": "csv",
                    "columnsOrdered": "true",
                    "compressionType": "none",
                    "delimiter": ",",
                    "objectCount": "1",
                    "recordCount": "11",
                    "sizeKey": "708",
                    "skip.header.line.count": "1",
                    "typeOfData": "file"
                },
                "StoredAsSubDirectories": false
            },
            "Parameters": {},
            "CatalogId": "omitted"
        }
    ]
}

I see three partitions just the same as the Glue Table UI showed previously.

For completeness I'll also demonstrate using the Python Boto3 SDK.

# glueutils.py

import argparse
import json

import boto3

def get_partitions(client, dbname, tblname):
    response = client.get_partitions(DatabaseName=dbname, TableName=tblname)
    next_token = response.get('NextToken')  # NextToken is absent on the last page
    partitions = response['Partitions']
    while next_token:
        response = client.get_partitions(DatabaseName=dbname, TableName=tblname, NextToken=next_token)
        next_token = response.get('NextToken')
        partitions += response['Partitions']

    for part in partitions:
        print('\nPartition')
        print('  Values: ', part['Values'])
        print('  Columns: ', json.dumps(part['StorageDescriptor']['Columns'], indent=4))
        print('  Location: ', part['StorageDescriptor']['Location'])
        print('  InputFormat: ', part['StorageDescriptor']['InputFormat'])
        print('  OutputFormat: ', part['StorageDescriptor']['OutputFormat'])

    return partitions


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('cmd', help='the command to run', choices=['get-partitions'])
    parser.add_argument('--database', help='glue database name')
    parser.add_argument('--table', help='glue table name')
    args = parser.parse_args()

    glue = boto3.client('glue')

    partitions = get_partitions(glue, args.database, args.table)

I've modeled this Boto3 wrapper script, named glueutils.py, after the AWS CLI in that I can run it from the command line and pass it arguments to control its behavior. Listing the partitions with the glueutils.py script and Boto3 looks as shown below.

$ python glueutils.py get-partitions --database sales --table sales

Partition
  Values:  ['2021', '01', '03']
  Columns:  [
    { "Name": "created",   "Type": "string" },
    { "Name": "firstname", "Type": "string" },
    { "Name": "lastname",  "Type": "string" },
    { "Name": "city",      "Type": "string" },
    { "Name": "product",   "Type": "string" },
    { "Name": "unitprice", "Type": "double" },
    { "Name": "quantity",  "Type": "bigint" }
  ]
  Location:  s3://tci-sales-data/sales/year=2021/month=01/day=03/
  InputFormat:  org.apache.hadoop.mapred.TextInputFormat
  OutputFormat:  org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

Partition
  Values:  ['2021', '01', '01']
  Columns:  [
    { "Name": "created",   "Type": "string" },
    { "Name": "firstname", "Type": "string" },
    { "Name": "lastname",  "Type": "string" },
    { "Name": "city",      "Type": "string" },
    { "Name": "product",   "Type": "string" },
    { "Name": "unitprice", "Type": "double" },
    { "Name": "quantity",  "Type": "bigint" }
  ]
  Location:  s3://tci-sales-data/sales/year=2021/month=01/day=01/
  InputFormat:  org.apache.hadoop.mapred.TextInputFormat
  OutputFormat:  org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

Partition
  Values:  ['2021', '01', '02']
  Columns:  [
    { "Name": "created",   "Type": "string" },
    { "Name": "firstname", "Type": "string" },
    { "Name": "lastname",  "Type": "string" },
    { "Name": "city",      "Type": "string" },
    { "Name": "product",   "Type": "string" },
    { "Name": "unitprice", "Type": "double" },
    { "Name": "quantity",  "Type": "bigint" }
  ]
  Location:  s3://tci-sales-data/sales/year=2021/month=01/day=02/
  InputFormat:  org.apache.hadoop.mapred.TextInputFormat
  OutputFormat:  org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

Again, there are three partitions, so it's safe to say the results are accurate and consistent.
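Worth noting before moving on: get_partitions also accepts an Expression parameter that filters partitions server side rather than returning them all for client-side filtering. Below is a minimal sketch; the partition_expression helper is my own convenience, while the Expression parameter itself is part of the Glue get_partitions API, and the commented-out call assumes the same sales database and table.

```python
def partition_expression(**values):
    """Build a Glue partition filter expression such as
    "year='2021' AND day='03'" from keyword arguments."""
    return ' AND '.join("{}='{}'".format(k, v) for k, v in values.items())


# Requires AWS credentials to actually run:
# import boto3
# glue = boto3.client('glue')
# resp = glue.get_partitions(DatabaseName='sales', TableName='sales',
#                            Expression=partition_expression(year='2021', day='03'))
```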

Retrieving Data from S3 Data Stores in Glue Spark ETL Jobs

In this section I want to set the stage for a proper understanding of why a properly partitioned S3 data store is important for both efficient and complete data retrieval in Glue ETL jobs. There are two aspects to that statement: (i) partitions allow for filtering to a single partition or a range of partitions, as long as those partitions are accounted for in the Glue Catalog, and (ii) if partitions are not present in the Glue Catalog the underlying data will not be visible to Glue Spark ETL jobs or Athena.

To demonstrate this I'll use the AWS Glue docker image to run a Glue Spark job locally on my laptop and show the effects of partitions on retrieval.

First I must pull down the Glue Docker image. Note that the most recent version of Glue is version 2.0 but AWS has yet to release an updated version of the Glue Docker image. For the purposes of this discussion the legacy version 1.0 will work just the same.

docker pull amazon/aws-glue-libs:glue_libs_1.0.0_image_01

Once the image is downloaded I can run it in a container, specifying a local directory for sharing notebooks along with the path to my AWS CLI credentials.

docker run -itd -p 8888:8888 -p 4040:4040 \
    -v /path/to/.aws:/root/.aws:ro \
    -v /path/to/notebooks:/home/jupyter/jupyter_default_dir \
    --name glue_jupyter \
    amazon/aws-glue-libs:glue_libs_1.0.0_image_01 \
    /home/jupyter/jupyter_start.sh

Then visiting localhost:8888 in a browser serves up a Jupyter notebook UI where I can create a new PySpark notebook.

Within the newly created Glue Spark notebook I add the following Glue and PySpark imports, then create a SparkSession and GlueContext.

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.utils import getResolvedOptions

from pyspark import SparkContext
from pyspark.sql.session import SparkSession

spark = SparkSession(SparkContext.getOrCreate())
glue_ctx = GlueContext(SparkContext.getOrCreate())

Next I create and populate a Glue DynamicFrame object with all data for which there are partitions present in the S3 store and represented in the sales Glue Table.

all_dyf = glue_ctx.create_dynamic_frame.from_catalog(
                database='sales',
                table_name='sales'
)

all_dyf.toDF().orderBy('created').show(100)
+-------------------+---------+--------+-----------------+---------+---------+--------+----+-----+---+
|            created|firstname|lastname|             city|  product|unitprice|quantity|year|month|day|
+-------------------+---------+--------+-----------------+---------+---------+--------+----+-----+---+
|2021-01-01T00:08:59|   Ariana|   Jones|         Toddside|product-d|     6.99|       9|2021|   01| 01|
|2021-01-01T01:46:11| Victoria|  Huerta|       Harrisside|product-e|    12.99|       4|2021|   01| 01|
|2021-01-01T02:19:16|  Jeffrey|  Taylor|       Mooremouth|product-c|     3.99|       6|2021|   01| 01|
|2021-01-01T03:02:43|   Kendra| Coleman|   East Aimeeberg|product-b|     2.99|       5|2021|   01| 01|
|2021-01-01T06:59:06|     Jill| Goodwin|   East Kellystad|product-c|     3.99|       7|2021|   01| 01|
|2021-01-01T10:59:36| Kimberly| Holland|        Robinberg|product-b|     2.99|       4|2021|   01| 01|
|2021-01-01T13:39:06|  Theresa|   Perry|      West Donald|product-a|     1.99|       6|2021|   01| 01|
|2021-01-01T14:24:50|  Timothy|     Orr|     Port Melissa|product-c|     3.99|       7|2021|   01| 01|
|2021-01-01T20:10:22|    Casey|   Lewis|        Singhview|product-d|     6.99|       3|2021|   01| 01|
|2021-01-01T22:07:33|     Tina|   Davis|     Gabrielshire|product-b|     2.99|       1|2021|   01| 01|
|2021-01-02T03:25:12|     Mark|   Wells| North Edwardview|product-e|    12.99|       5|2021|   01| 02|
|2021-01-02T03:31:06|  Chelsea| Sanders|        Lake Adam|product-b|     2.99|       3|2021|   01| 02|
|2021-01-02T05:43:44|    Tonya| Montoya|       Smithmouth|product-d|     6.99|      10|2021|   01| 02|
|2021-01-02T07:36:14|   Thomas| Hubbard|     Port Matthew|product-b|     2.99|       9|2021|   01| 02|
|2021-01-02T11:10:10|   Joseph|  Brooks|      New Randall|product-d|     6.99|       1|2021|   01| 02|
|2021-01-02T13:50:47|      Joe|  Miller|Lake Stephenmouth|product-b|     2.99|       8|2021|   01| 02|
|2021-01-02T15:28:51|   Tricia|  Murphy|  Jenniferchester|product-b|     2.99|       4|2021|   01| 02|
|2021-01-02T20:59:43|    Shane|Stafford|        New Jason|product-d|     6.99|       7|2021|   01| 02|
|2021-01-02T21:58:02|   Brooke|   Smith|      South Ralph|product-b|     2.99|       3|2021|   01| 02|
|2021-01-02T23:39:59|  Tiffany|   Blake|        Port Mary|product-c|     3.99|       4|2021|   01| 02|
|2021-01-03T04:19:33|     John|   Baker|   West Aaronfort|product-a|     1.99|       8|2021|   01| 03|
|2021-01-03T08:23:44|     Carl|  Rogers|       Gregoryton|product-a|     1.99|       4|2021|   01| 03|
|2021-01-03T10:19:19| Jennifer|   Price|     Lake Kenneth|product-b|     2.99|      10|2021|   01| 03|
|2021-01-03T13:38:28|  Whitney|      Li|   West Alexandra|product-a|     1.99|       5|2021|   01| 03|
|2021-01-03T14:50:00| Victoria|  Montes|      Hansonville|product-b|     2.99|       9|2021|   01| 03|
|2021-01-03T15:16:55| Margaret|  Torres|    North Kaitlyn|product-b|     2.99|       8|2021|   01| 03|
|2021-01-03T16:44:52| Nicholas|   Scott|        Jonesfort|product-d|     6.99|      10|2021|   01| 03|
|2021-01-03T16:48:35| Michelle|  Carter|         Daleland|product-d|     6.99|       1|2021|   01| 03|
|2021-01-03T23:00:50|     Lori|  Wagner|      West Carmen|product-a|     1.99|       4|2021|   01| 03|
|2021-01-03T23:29:05|  Matthew| Gregory|      Santosshire|product-b|     2.99|       7|2021|   01| 03|
+-------------------+---------+--------+-----------------+---------+---------+--------+----+-----+---+

Note that this shows all the data for each of the three days of fabricated sales data generated previously. Hopefully this is not much of a surprise so far.

Next I move on to specifying a pushdown predicate in the call to the DynamicFrameReader class's from_catalog(...) method, which filters the partitions down to just the 3rd of January 2021. This demonstrates the usefulness of filtering by partition values in that only one third of the total dataset was read and returned. The savings is negligible for such a small dataset, but when programming against very large, well-partitioned datasets it will be extremely beneficial.

jan3_dyf = glue_ctx.create_dynamic_frame.from_catalog(
                database='sales',
                table_name='sales',
                push_down_predicate="(year == '2021' and month == '01' and day == '03')"
)
jan3_dyf.toDF().orderBy('created').show(100)
+-------------------+---------+--------+--------------+---------+---------+--------+----+-----+---+
|            created|firstname|lastname|          city|  product|unitprice|quantity|year|month|day|
+-------------------+---------+--------+--------------+---------+---------+--------+----+-----+---+
|2021-01-03T04:19:33|     John|   Baker|West Aaronfort|product-a|     1.99|       8|2021|   01| 03|
|2021-01-03T08:23:44|     Carl|  Rogers|    Gregoryton|product-a|     1.99|       4|2021|   01| 03|
|2021-01-03T10:19:19| Jennifer|   Price|  Lake Kenneth|product-b|     2.99|      10|2021|   01| 03|
|2021-01-03T13:38:28|  Whitney|      Li|West Alexandra|product-a|     1.99|       5|2021|   01| 03|
|2021-01-03T14:50:00| Victoria|  Montes|   Hansonville|product-b|     2.99|       9|2021|   01| 03|
|2021-01-03T15:16:55| Margaret|  Torres| North Kaitlyn|product-b|     2.99|       8|2021|   01| 03|
|2021-01-03T16:44:52| Nicholas|   Scott|     Jonesfort|product-d|     6.99|      10|2021|   01| 03|
|2021-01-03T16:48:35| Michelle|  Carter|      Daleland|product-d|     6.99|       1|2021|   01| 03|
|2021-01-03T23:00:50|     Lori|  Wagner|   West Carmen|product-a|     1.99|       4|2021|   01| 03|
|2021-01-03T23:29:05|  Matthew| Gregory|   Santosshire|product-b|     2.99|       7|2021|   01| 03|
+-------------------+---------+--------+--------------+---------+---------+--------+----+-----+---+

Adding New Partitions and Registering them with a Crawler

At this point I will add another day of generated sales data to the S3 data store and see what effect that has on my ability to retrieve the new data from a Glue ETL job.

python sales_generator.py --s3bucket tci-sales-data --items 10 --date '2021-01-04'

If I now fetch all the data again to recreate the DynamicFrame, the newly added data will not be present because the new partition for Jan 4th has not been added to the Glue Catalog Table.

all_dyf = glue_ctx.create_dynamic_frame.from_catalog(database='sales', table_name='sales')

all_dyf.toDF().orderBy('created').show(100)
+-------------------+---------+--------+-----------------+---------+---------+--------+----+-----+---+
|            created|firstname|lastname|             city|  product|unitprice|quantity|year|month|day|
+-------------------+---------+--------+-----------------+---------+---------+--------+----+-----+---+
|2021-01-01T00:08:59|   Ariana|   Jones|         Toddside|product-d|     6.99|       9|2021|   01| 01|
|2021-01-01T01:46:11| Victoria|  Huerta|       Harrisside|product-e|    12.99|       4|2021|   01| 01|
|2021-01-01T02:19:16|  Jeffrey|  Taylor|       Mooremouth|product-c|     3.99|       6|2021|   01| 01|
|2021-01-01T03:02:43|   Kendra| Coleman|   East Aimeeberg|product-b|     2.99|       5|2021|   01| 01|
|2021-01-01T06:59:06|     Jill| Goodwin|   East Kellystad|product-c|     3.99|       7|2021|   01| 01|
|2021-01-01T10:59:36| Kimberly| Holland|        Robinberg|product-b|     2.99|       4|2021|   01| 01|
|2021-01-01T13:39:06|  Theresa|   Perry|      West Donald|product-a|     1.99|       6|2021|   01| 01|
|2021-01-01T14:24:50|  Timothy|     Orr|     Port Melissa|product-c|     3.99|       7|2021|   01| 01|
|2021-01-01T20:10:22|    Casey|   Lewis|        Singhview|product-d|     6.99|       3|2021|   01| 01|
|2021-01-01T22:07:33|     Tina|   Davis|     Gabrielshire|product-b|     2.99|       1|2021|   01| 01|
|2021-01-02T03:25:12|     Mark|   Wells| North Edwardview|product-e|    12.99|       5|2021|   01| 02|
|2021-01-02T03:31:06|  Chelsea| Sanders|        Lake Adam|product-b|     2.99|       3|2021|   01| 02|
|2021-01-02T05:43:44|    Tonya| Montoya|       Smithmouth|product-d|     6.99|      10|2021|   01| 02|
|2021-01-02T07:36:14|   Thomas| Hubbard|     Port Matthew|product-b|     2.99|       9|2021|   01| 02|
|2021-01-02T11:10:10|   Joseph|  Brooks|      New Randall|product-d|     6.99|       1|2021|   01| 02|
|2021-01-02T13:50:47|      Joe|  Miller|Lake Stephenmouth|product-b|     2.99|       8|2021|   01| 02|
|2021-01-02T15:28:51|   Tricia|  Murphy|  Jenniferchester|product-b|     2.99|       4|2021|   01| 02|
|2021-01-02T20:59:43|    Shane|Stafford|        New Jason|product-d|     6.99|       7|2021|   01| 02|
|2021-01-02T21:58:02|   Brooke|   Smith|      South Ralph|product-b|     2.99|       3|2021|   01| 02|
|2021-01-02T23:39:59|  Tiffany|   Blake|        Port Mary|product-c|     3.99|       4|2021|   01| 02|
|2021-01-03T04:19:33|     John|   Baker|   West Aaronfort|product-a|     1.99|       8|2021|   01| 03|
|2021-01-03T08:23:44|     Carl|  Rogers|       Gregoryton|product-a|     1.99|       4|2021|   01| 03|
|2021-01-03T10:19:19| Jennifer|   Price|     Lake Kenneth|product-b|     2.99|      10|2021|   01| 03|
|2021-01-03T13:38:28|  Whitney|      Li|   West Alexandra|product-a|     1.99|       5|2021|   01| 03|
|2021-01-03T14:50:00| Victoria|  Montes|      Hansonville|product-b|     2.99|       9|2021|   01| 03|
|2021-01-03T15:16:55| Margaret|  Torres|    North Kaitlyn|product-b|     2.99|       8|2021|   01| 03|
|2021-01-03T16:44:52| Nicholas|   Scott|        Jonesfort|product-d|     6.99|      10|2021|   01| 03|
|2021-01-03T16:48:35| Michelle|  Carter|         Daleland|product-d|     6.99|       1|2021|   01| 03|
|2021-01-03T23:00:50|     Lori|  Wagner|      West Carmen|product-a|     1.99|       4|2021|   01| 03|
|2021-01-03T23:29:05|  Matthew| Gregory|      Santosshire|product-b|     2.99|       7|2021|   01| 03|
+-------------------+---------+--------+-----------------+---------+---------+--------+----+-----+---+

The simplest way to register the new partition for the 4th of January is to rerun the crawler. However, this will result in the crawler revisiting all the data that was previously crawled along with the new partition's data. That may not be a big deal here, since there is a 10 minute minimum runtime charge associated with every Glue Crawler run anyway. However, if I update the previously generated Glue Crawler with an incremental recrawl policy, then only newly added folders will be crawled, which can matter a great deal for rather large S3 data stores. Unfortunately this functionality is not yet available via CloudFormation, so either the AWS CLI or an SDK is required to apply this configuration update. I will opt to use the Boto3 SDK.

import boto3

glue = boto3.client('glue')

resp = glue.update_crawler(
    Name='sales',
    DatabaseName='sales',
    RecrawlPolicy={'RecrawlBehavior': 'CRAWL_NEW_FOLDERS_ONLY'},
    SchemaChangePolicy={'UpdateBehavior': 'LOG', 'DeleteBehavior': 'LOG'}
)

Then if I go to the Glue Crawler UI in the AWS Management Console I can rerun the crawler.

Then I check the partitions UI again to verify that a new partition has been added.
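For completeness, rerunning the crawler is not the only way to register the new folder: the partition can also be added directly with the Glue API's create_partition call, cloning the table's StorageDescriptor and repointing it at the new location. A sketch under the assumption of the same sales database, table, and bucket from earlier; the partition_input helper is my own illustration.

```python
def partition_input(template_sd, bucket, year, month, day):
    """Build a PartitionInput dict from a copy of the table's
    StorageDescriptor, repointing Location at the new partition folder."""
    sd = dict(template_sd)
    sd['Location'] = 's3://{}/sales/year={}/month={}/day={}/'.format(
        bucket, year, month, day)
    return {'Values': [year, month, day], 'StorageDescriptor': sd}


# Requires AWS credentials to actually run:
# import boto3
# glue = boto3.client('glue')
# table_sd = glue.get_table(DatabaseName='sales', Name='sales')['Table']['StorageDescriptor']
# glue.create_partition(DatabaseName='sales', TableName='sales',
#                       PartitionInput=partition_input(table_sd, 'tci-sales-data',
#                                                      '2021', '01', '04'))
```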

Rerunning the Glue Spark code will now show data for all four partitions, including the newly added Jan 4th partition.

all_dyf = glue_ctx.create_dynamic_frame.from_catalog(database='sales', table_name='sales')

all_dyf.toDF().orderBy('created').show(100)
+-------------------+---------+--------+-----------------+---------+---------+--------+----+-----+---+
|            created|firstname|lastname|             city|  product|unitprice|quantity|year|month|day|
+-------------------+---------+--------+-----------------+---------+---------+--------+----+-----+---+
|2021-01-01T00:08:59|   Ariana|   Jones|         Toddside|product-d|     6.99|       9|2021|   01| 01|
|2021-01-01T01:46:11| Victoria|  Huerta|       Harrisside|product-e|    12.99|       4|2021|   01| 01|
|2021-01-01T02:19:16|  Jeffrey|  Taylor|       Mooremouth|product-c|     3.99|       6|2021|   01| 01|
|2021-01-01T03:02:43|   Kendra| Coleman|   East Aimeeberg|product-b|     2.99|       5|2021|   01| 01|
|2021-01-01T06:59:06|     Jill| Goodwin|   East Kellystad|product-c|     3.99|       7|2021|   01| 01|
|2021-01-01T10:59:36| Kimberly| Holland|        Robinberg|product-b|     2.99|       4|2021|   01| 01|
|2021-01-01T13:39:06|  Theresa|   Perry|      West Donald|product-a|     1.99|       6|2021|   01| 01|
|2021-01-01T14:24:50|  Timothy|     Orr|     Port Melissa|product-c|     3.99|       7|2021|   01| 01|
|2021-01-01T20:10:22|    Casey|   Lewis|        Singhview|product-d|     6.99|       3|2021|   01| 01|
|2021-01-01T22:07:33|     Tina|   Davis|     Gabrielshire|product-b|     2.99|       1|2021|   01| 01|
|2021-01-02T03:25:12|     Mark|   Wells| North Edwardview|product-e|    12.99|       5|2021|   01| 02|
|2021-01-02T03:31:06|  Chelsea| Sanders|        Lake Adam|product-b|     2.99|       3|2021|   01| 02|
|2021-01-02T05:43:44|    Tonya| Montoya|       Smithmouth|product-d|     6.99|      10|2021|   01| 02|
|2021-01-02T07:36:14|   Thomas| Hubbard|     Port Matthew|product-b|     2.99|       9|2021|   01| 02|
|2021-01-02T11:10:10|   Joseph|  Brooks|      New Randall|product-d|     6.99|       1|2021|   01| 02|
|2021-01-02T13:50:47|      Joe|  Miller|Lake Stephenmouth|product-b|     2.99|       8|2021|   01| 02|
|2021-01-02T15:28:51|   Tricia|  Murphy|  Jenniferchester|product-b|     2.99|       4|2021|   01| 02|
|2021-01-02T20:59:43|    Shane|Stafford|        New Jason|product-d|     6.99|       7|2021|   01| 02|
|2021-01-02T21:58:02|   Brooke|   Smith|      South Ralph|product-b|     2.99|       3|2021|   01| 02|
|2021-01-02T23:39:59|  Tiffany|   Blake|        Port Mary|product-c|     3.99|       4|2021|   01| 02|
|2021-01-03T04:19:33|     John|   Baker|   West Aaronfort|product-a|     1.99|       8|2021|   01| 03|
|2021-01-03T08:23:44|     Carl|  Rogers|       Gregoryton|product-a|     1.99|       4|2021|   01| 03|
|2021-01-03T10:19:19| Jennifer|   Price|     Lake Kenneth|product-b|     2.99|      10|2021|   01| 03|
|2021-01-03T13:38:28|  Whitney|      Li|   West Alexandra|product-a|     1.99|       5|2021|   01| 03|
|2021-01-03T14:50:00| Victoria|  Montes|      Hansonville|product-b|     2.99|       9|2021|   01| 03|
|2021-01-03T15:16:55| Margaret|  Torres|    North Kaitlyn|product-b|     2.99|       8|2021|   01| 03|
|2021-01-03T16:44:52| Nicholas|   Scott|        Jonesfort|product-d|     6.99|      10|2021|   01| 03|
|2021-01-03T16:48:35| Michelle|  Carter|         Daleland|product-d|     6.99|       1|2021|   01| 03|
|2021-01-03T23:00:50|     Lori|  Wagner|      West Carmen|product-a|     1.99|       4|2021|   01| 03|
|2021-01-03T23:29:05|  Matthew| Gregory|      Santosshire|product-b|     2.99|       7|2021|   01| 03|
|2021-01-04T01:25:30|    David|  Vargas|       Orozcoview|product-d|     6.99|       7|2021|   01| 04|
|2021-01-04T09:00:31|   Kristi|   Brown|    Phillipsville|product-d|     6.99|       4|2021|   01| 04|
|2021-01-04T09:07:05|     Mary| Collier|    Crawfordmouth|product-e|    12.99|       4|2021|   01| 04|
|2021-01-04T10:30:49|    Emily|     Liu|       Sharonberg|product-d|     6.99|       2|2021|   01| 04|
|2021-01-04T11:21:18|    James| Johnson|         Cooktown|product-c|     3.99|       1|2021|   01| 04|
|2021-01-04T11:37:16|  Heather| Nichols|      Jocelynfurt|product-d|     6.99|       9|2021|   01| 04|
|2021-01-04T17:29:35|   Rachel|   Adams|    Benjaminburgh|product-d|     6.99|       1|2021|   01| 04|
|2021-01-04T18:50:24|  Michael|   Perry|       Port Alvin|product-c|     3.99|       8|2021|   01| 04|
|2021-01-04T20:06:05|     Anna|    Cook| West Cherylmouth|product-d|     6.99|       7|2021|   01| 04|
|2021-01-04T23:12:58| Brittney|  Harris|      Obrienville|product-c|     3.99|      10|2021|   01| 04|
+-------------------+---------+--------+-----------------+---------+---------+--------+----+-----+---+

Adding New Partitions and Registering them with the Glue API

In addition to adding new partitions via Glue Crawlers, you can also use the Glue Partitions API along with one of the SDKs, such as Boto3, to add partitions to the Glue Metadata Catalog. This approach can be very useful, particularly for mitigating unnecessary costs, if you know the type and structure of the data in an S3 data store will remain constant as new data is added. In this approach you run a crawler manually once against a small to moderate amount of data in the expected format, then simply use the Glue Partitions API to add new partitions to the Data Catalog as they are introduced in the S3 data store.

To demonstrate this I first add three new batches of fake sales data for the days January 5-7, then I'll show how you can use the Boto3 SDK to register the new partitions with the Glue Metadata Catalog.

python sales_generator.py --s3bucket tci-sales-data --items 10 --date '2021-01-05'
python sales_generator.py --s3bucket tci-sales-data --items 10 --date '2021-01-06'
python sales_generator.py --s3bucket tci-sales-data --items 10 --date '2021-01-07'
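These commands lay the data out in the Hive-style year=/month=/day= convention the crawler already recorded. A small helper function (my own hypothetical convenience, not part of the article's scripts) shows how the partition values and S3 location for a given date fall out of that convention, which is exactly what the registration step below needs:

```python
from datetime import date


def partition_for(bucket, table, day):
    """Return (values, s3_location) for a Hive-style year/month/day layout."""
    values = [f'{day.year:04d}', f'{day.month:02d}', f'{day.day:02d}']
    location = (f's3://{bucket}/{table}/'
                f'year={values[0]}/month={values[1]}/day={values[2]}/')
    return values, location


values, location = partition_for('tci-sales-data', 'sales', date(2021, 1, 5))
# values   -> ['2021', '01', '05']
# location -> 's3://tci-sales-data/sales/year=2021/month=01/day=05/'
```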

Now that the new sales data has been added to the S3 data store, including three new S3 directory partitions for Jan 5 - Jan 7, what remains is to write a new function in the glueutils.py script named add_partition(...) which associates a partition with a given Glue database and table using the Glue client from the Boto3 SDK.

# glueutils.py

import argparse
import copy
import json

import boto3


def get_partitions(client, dbname, tblname):
    # Page through all partitions; NextToken is only present when
    # more pages remain, so fetch it with .get(...) to avoid a KeyError.
    response = client.get_partitions(DatabaseName=dbname, TableName=tblname)
    partitions = response['Partitions']
    next_token = response.get('NextToken')
    while next_token:
        response = client.get_partitions(DatabaseName=dbname, TableName=tblname, NextToken=next_token)
        next_token = response.get('NextToken')
        partitions += response['Partitions']

    for part in partitions:
        print('\nPartition')
        print('  Values: ', part['Values'])
        print('  Columns: ', json.dumps(part['StorageDescriptor']['Columns'], indent=4))
        print('  Location: ', part['StorageDescriptor']['Location'])
        print('  InputFormat: ', part['StorageDescriptor']['InputFormat'])
        print('  OutputFormat: ', part['StorageDescriptor']['OutputFormat'])

    return partitions


def add_partition(client, dbname, tblname, values, location):
    part_response = client.get_partitions(DatabaseName=dbname, TableName=tblname)
    existing_partitions = part_response['Partitions']
    if not existing_partitions:
        raise RuntimeError('Must have existing partitions to use this functionality')

    # Clone an existing partition's StorageDescriptor and point it at
    # the new partition's S3 location.
    existing_partition = copy.deepcopy(existing_partitions[0])

    storage_descriptor = existing_partition.pop('StorageDescriptor')
    storage_descriptor.update(Location=location)

    # Drop per-partition statistics that only described the old data;
    # pop with a default so a missing key does not raise.
    exclude_parameters = ['averageRecordSize', 'recordCount', 'sizeKey']

    for exclude_param in exclude_parameters:
        storage_descriptor['Parameters'].pop(exclude_param, None)

    response = client.batch_create_partition(
        DatabaseName=dbname,
        TableName=tblname,
        PartitionInputList=[{
          'Values': values,
          'StorageDescriptor': storage_descriptor
        }]
    )

    if response.get('Errors'):
        print(json.dumps(response, indent=4))


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('cmd', help='the command to run', choices=['get-partitions', 'add-partition'])
    parser.add_argument('--database', help='glue database name')
    parser.add_argument('--table', help='glue table name')

    parser.add_argument('--partition-values', help='a comma separated list of the partition values')
    parser.add_argument('--partition-location', help='the s3 path of the partition')

    args = parser.parse_args()

    glue = boto3.client('glue')

    if args.cmd == 'get-partitions':
        partitions = get_partitions(glue, args.database, args.table)
    elif args.cmd == 'add-partition':
        add_partition(glue,
                      args.database,
                      args.table,
                      args.partition_values.split(','),
                      args.partition_location)
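The subtlest part of add_partition(...) is cloning the StorageDescriptor from an existing partition. That step can be pulled out into a small pure function, a hypothetical refactor shown here so the logic can be exercised without AWS credentials:

```python
import copy


def clone_storage_descriptor(existing_partition, new_location,
                             exclude=('averageRecordSize', 'recordCount', 'sizeKey')):
    """Copy a partition's StorageDescriptor, point it at a new S3 location,
    and drop the per-partition statistics that only described the old data."""
    sd = copy.deepcopy(existing_partition['StorageDescriptor'])
    sd['Location'] = new_location
    for key in exclude:
        sd.get('Parameters', {}).pop(key, None)  # tolerate missing statistics
    return sd


# A trimmed-down example partition, shaped like a Glue get_partitions entry.
template = {
    'Values': ['2021', '01', '01'],
    'StorageDescriptor': {
        'Location': 's3://tci-sales-data/sales/year=2021/month=01/day=01/',
        'InputFormat': 'org.apache.hadoop.mapred.TextInputFormat',
        'Parameters': {'recordCount': '10', 'classification': 'csv'},
    },
}

new_sd = clone_storage_descriptor(
    template, 's3://tci-sales-data/sales/year=2021/month=01/day=05/')
```

Because the function deep copies its input, the template partition is left untouched, which matters when the same descriptor is reused to register several new partitions in a row.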

Lastly, I call the add-partition command from the command line to register the new partitions with the Glue Catalog like so.

$ python glueutils.py add-partition --database sales --table sales \
  --partition-values 2021,01,05 \
  --partition-location s3://tci-sales-data/sales/year=2021/month=01/day=05/
$ python glueutils.py add-partition --database sales --table sales \
  --partition-values 2021,01,06 \
  --partition-location s3://tci-sales-data/sales/year=2021/month=01/day=06/
$ python glueutils.py add-partition --database sales --table sales \
  --partition-values 2021,01,07 \
  --partition-location s3://tci-sales-data/sales/year=2021/month=01/day=07/

Now if I list the partitions present in the Glue Catalog using any of the methods I discussed earlier, such as the Glue UI, AWS CLI, or Boto3 SDK, I'll see the newly registered partitions along with the ones added previously. Below I've chosen to show them in the Glue UI.

Ok then. For completeness I'll give one more example of specifying a pushdown predicate in a Glue Spark ETL job, this time selecting only January days 4-6.

subset_dyf = glue_ctx.create_dynamic_frame.from_catalog(
        database='sales',
        table_name='sales',
        push_down_predicate="(year == '2021' and month == '01' and day in ('04', '05', '06'))"
)
subset_dyf.toDF().orderBy('created').show(100)
+-------------------+---------+---------+-----------------+---------+---------+--------+----+-----+---+
|            created|firstname| lastname|             city|  product|unitprice|quantity|year|month|day|
+-------------------+---------+---------+-----------------+---------+---------+--------+----+-----+---+
|2021-01-04T01:25:30|    David|   Vargas|       Orozcoview|product-d|     6.99|       7|2021|   01| 04|
|2021-01-04T09:00:31|   Kristi|    Brown|    Phillipsville|product-d|     6.99|       4|2021|   01| 04|
|2021-01-04T09:07:05|     Mary|  Collier|    Crawfordmouth|product-e|    12.99|       4|2021|   01| 04|
|2021-01-04T10:30:49|    Emily|      Liu|       Sharonberg|product-d|     6.99|       2|2021|   01| 04|
|2021-01-04T11:21:18|    James|  Johnson|         Cooktown|product-c|     3.99|       1|2021|   01| 04|
|2021-01-04T11:37:16|  Heather|  Nichols|      Jocelynfurt|product-d|     6.99|       9|2021|   01| 04|
|2021-01-04T17:29:35|   Rachel|    Adams|    Benjaminburgh|product-d|     6.99|       1|2021|   01| 04|
|2021-01-04T18:50:24|  Michael|    Perry|       Port Alvin|product-c|     3.99|       8|2021|   01| 04|
|2021-01-04T20:06:05|     Anna|     Cook| West Cherylmouth|product-d|     6.99|       7|2021|   01| 04|
|2021-01-04T23:12:58| Brittney|   Harris|      Obrienville|product-c|     3.99|      10|2021|   01| 04|
|2021-01-05T00:37:44| Virginia|  Johnson|     Gonzalezberg|product-e|    12.99|       4|2021|   01| 05|
|2021-01-05T04:57:20|    Carla|    Wolfe|     Jacksonmouth|product-a|     1.99|      10|2021|   01| 05|
|2021-01-05T10:40:06|    Emily|    Johns|       Deleonfort|product-c|     3.99|       1|2021|   01| 05|
|2021-01-05T16:04:35|   Andrew|    Young|       Matthewton|product-c|     3.99|       3|2021|   01| 05|
|2021-01-05T17:05:07|    Linda|   Thomas|         Rossfurt|product-b|     2.99|       8|2021|   01| 05|
|2021-01-05T17:25:53|    Scott|   Parker|     Port Caitlin|product-e|    12.99|       1|2021|   01| 05|
|2021-01-05T19:03:41|     Paul|  Shelton| Port Donaldmouth|product-c|     3.99|       7|2021|   01| 05|
|2021-01-05T20:12:42|   Jeremy|    Smith| South Russellton|product-b|     2.99|       1|2021|   01| 05|
|2021-01-05T21:36:54|   Teresa|     Nash|   North Jennifer|product-d|     6.99|      10|2021|   01| 05|
|2021-01-05T23:42:44|    Sheri|    Brown|South Amandamouth|product-e|    12.99|       8|2021|   01| 05|
|2021-01-06T05:19:30|     Lisa|  Sanchez|   West Stephanie|product-c|     3.99|       8|2021|   01| 06|
|2021-01-06T05:40:00|    Tammy|    Davis|   Alexandrashire|product-a|     1.99|       9|2021|   01| 06|
|2021-01-06T06:32:59|Christina|    Owens|        Lake Evan|product-b|     2.99|      10|2021|   01| 06|
|2021-01-06T09:13:23|   Wesley|Fernandez|   North Jennifer|product-d|     6.99|       7|2021|   01| 06|
|2021-01-06T13:07:47|   Joshua|   Stokes|      Lauriemouth|product-d|     6.99|       5|2021|   01| 06|
|2021-01-06T13:35:16| Mckenzie|    Smith|        Reneeport|product-b|     2.99|       5|2021|   01| 06|
|2021-01-06T13:41:00|    Traci|    Lynch|     Lake Michael|product-a|     1.99|      10|2021|   01| 06|
|2021-01-06T19:32:56|  Anthony|   George|       Robertside|product-a|     1.99|       6|2021|   01| 06|
|2021-01-06T19:40:25|Christina|    David|     Kevinborough|product-d|     6.99|      10|2021|   01| 06|
|2021-01-06T23:27:21|    Ricky| Cisneros|North Juanborough|product-b|     2.99|       4|2021|   01| 06|
+-------------------+---------+---------+-----------------+---------+---------+--------+----+-----+---+
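Predicate strings like the one above are easy to get wrong by hand. A small helper (my own convenience function, not part of the Glue API) can build them from plain integers, with the zero-padding the partition values require:

```python
def day_range_predicate(year, month, days):
    """Build a Glue pushdown predicate selecting a set of day partitions."""
    day_list = ', '.join(f"'{d:02d}'" for d in days)
    return f"(year == '{year}' and month == '{month:02d}' and day in ({day_list}))"


predicate = day_range_predicate(2021, 1, [4, 5, 6])
# -> (year == '2021' and month == '01' and day in ('04', '05', '06'))
```

The resulting string can be passed directly as the push_down_predicate argument to create_dynamic_frame.from_catalog.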

Conclusion

In this article I have discussed why partitions are important for retrieval of data within services that utilize the Glue Catalog, such as Glue Spark ETL jobs and AWS Athena. I've also covered how you can manage partitions using Glue Crawlers as well as the AWS Glue API via the Boto3 SDK.

As always, I thank you for reading and feel free to ask questions or critique in the comments section below.
