Blog | The Coding Interface

Exploring Online Analytical Processing Databases plus Extract, Transform, and Load in PostgreSQL

In this article I give an introduction to Online Analytical Processing databases, comparing them against traditional Online Transaction Processing systems. Emphasis is put on designing and building Star Schemas and Reporting tables using Data Engineering processes like Extract, Transform, and Load, all within an Aurora PostgreSQL database.

By Adam McQuistan on 02/17/2021

Data Engineering AWS PostgreSQL Databases DevOps

Building Data Lakes in AWS with S3, Lambda, Glue, and Athena from Weather Data

Data Engineering

In this article I cover creating a rudimentary Data Lake on AWS S3 filled with historical Weather Data consumed from a REST API. The S3 Data Lake is populated using traditional serverless technologies like AWS Lambda, DynamoDB, and EventBridge rules along with several modern AWS Glue features such as Crawlers, ETL PySpark Jobs, and Triggers.

By Adam McQuistan on 02/25/2021

PySpark AWS Glue Data Engineering AWS Serverless Application Model AWS AWS-Lambda DevOps Python

Introduction to Redshift using Pagila Sample Dataset Including ETL from Postgres using AWS Glue

Data Engineering

In this article I give a practical introductory tutorial to using Amazon Redshift as an OLAP Data Warehouse solution for the popular Pagila Movie Rental dataset. I start with a basic overview of the architecture Redshift uses to serve as a scalable and robust enterprise cloud data warehouse. Then, armed with this basic knowledge of Redshift architecture, I move on to a practical example of designing a schema optimized for Redshift based on the Pagila sample dataset.

By Adam McQuistan on 03/05/2021

AWS Glue Redshift Data Engineering AWS AWS S3 psql PostgreSQL Databases DevOps Python

Example Driven High Level Overview of Spark with Python

Data Engineering

In this article I give a high-level, example-driven overview of writing data processing programs using the Python bindings for Spark, commonly known as PySpark. I specifically cover the Spark SQL DataFrame API, which I've found to be the most useful way to write data analytics code with PySpark. The target audience for this article is Python developers, ideally those who have a cursory understanding of other popular PyData Stack libraries such as Pandas and NumPy.

By Adam McQuistan on 03/12/2021

PySpark Data Engineering Python

How To Use Window Functions in SQL

Data Engineering

When it comes to quantitative analysis on data in database tables, standard SQL provides a set of aggregate functions like SUM(), MAX(), and MIN(). There are two main ways these functions get used in practice: (i) collapsing the table data down to a result set that holds only the aggregate calculation, or (ii) presenting the aggregate calculation on every row, maintaining the granularity of the complete table. Window functions are used to accomplish this second option and will be the focus of this article.

By Adam McQuistan on 03/29/2021

Data Engineering PostgreSQL Databases
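
The two usages described in the window functions summary above can be sketched side by side. This is a minimal illustration, not code from the article: it assumes a hypothetical `sales` table with made-up rows, and it swaps in Python's built-in `sqlite3` module (SQLite 3.25 or newer is needed for window functions) in place of PostgreSQL so that it runs without a database server.

```python
import sqlite3

# In-memory database with a hypothetical sales table (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10), ("east", 30), ("west", 5)],
)

# (i) GROUP BY collapses the table to one row per region.
collapsed = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# (ii) A window function keeps every row and attaches the per-region total to it.
windowed = conn.execute(
    "SELECT region, amount, SUM(amount) OVER (PARTITION BY region) "
    "FROM sales ORDER BY region, amount"
).fetchall()

print(collapsed)
print(windowed)
```

The same `SUM(amount)` appears in both queries; only the `OVER (PARTITION BY ...)` clause changes whether the rows are collapsed or preserved.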

How to Unnest Multi-Valued Array Fields in PySpark using Explode

Data Engineering

In this How To article I will show a simple example of how to use the explode function from the SparkSQL API to unravel multi-valued fields. I have found this to be a pretty common use case when doing data cleaning using PySpark, particularly when working with nested JSON documents in an Extract Transform and Load workflow.

By Adam McQuistan on 04/30/2021

PySpark Data Engineering Python
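
To make the idea of unnesting a multi-valued field concrete, here is a plain-Python sketch of the row-level behavior that an explode operation produces. It is not PySpark and not the article's code: the helper `explode_rows` and the sample records are hypothetical, written only to show that each array element becomes its own row while the other fields are repeated.

```python
def explode_rows(rows, field):
    """Yield one output row per element of rows[field].

    Rows whose field is empty or None produce no output, which mirrors
    how Spark's explode (as opposed to explode_outer) treats them.
    """
    for row in rows:
        for value in row[field] or []:
            # Copy the row and replace the array with a single element.
            yield {**row, field: value}


# Hypothetical nested records, e.g. parsed from JSON documents.
records = [
    {"id": 1, "tags": ["a", "b"]},
    {"id": 2, "tags": []},
    {"id": 3, "tags": None},
]

exploded = list(explode_rows(records, "tags"))
print(exploded)
```

In PySpark itself the equivalent step would typically be a `select` or `withColumn` call using `pyspark.sql.functions.explode` on the array column.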

Using Apache Hive on AWS Elastic MapReduce (EMR) Clusters

Data Engineering

In this article I review key characteristics and functionality of Apache Hive and show how you can use Amazon Elastic MapReduce (EMR) to provision an Apache Hive cluster for experimentation, big data processing, and analytics.

By Adam McQuistan on 06/24/2021

AWS EMR Apache Hive Data Engineering AWS Linux

How To Provision AWS EMR Cluster with Cross Account S3 Bucket Access Using Terraform

DevOps

In this article I demonstrate how to use Terraform to provision an AWS EMR Cluster and establish read-only S3 Bucket access for consuming data from another AWS Account.

By Adam McQuistan on 06/28/2021

AWS EMR Data Engineering DevOps

Introduction to PyFlink Relational Programming: Table API and SQL

Data Engineering

PyFlink is the Python API for Apache Flink which allows you to develop batch and stream data processing pipelines on modern distributed computing architectures.

By Adam McQuistan on 07/14/2021

Apache Flink Data Engineering Python

Processing Kafka Sources and Sinks with Apache Flink in Python

Data Engineering

In this article I go over how to use the Apache Flink Table API in Python to consume data from and write data to a Confluent Community Platform Apache Kafka Cluster running locally in Docker.

By Adam McQuistan on 07/25/2021

Apache Flink Kafka Data Engineering Python
