Providing quality software engineering content in the form of tutorials, applications, services, and commentary suited for developers.

Posts about Data Engineering

Streaming Logs to S3 with Kinesis Firehose in a Serverless Project

In this article I demonstrate how to set up an AWS Serverless Application Model (SAM) project for near real-time streaming of CloudWatch logs to S3 using Kinesis Data Firehose. To keep things interesting I'll be using a Python-based demo application that exposes two REST APIs: one for scraping and saving quotes from the web to a DynamoDB table and another for listing the saved quotes.

Exploring Online Analytical Processing Databases plus Extract, Transform, and Load in PostgreSQL

In this article I give an introduction to Online Analytical Processing (OLAP) databases, comparing them against traditional Online Transaction Processing (OLTP) systems. Emphasis is placed on designing and building Star Schemas and reporting tables using Data Engineering processes like Extract, Transform, and Load, all within an Aurora PostgreSQL database.

Building Data Lakes in AWS with S3, Lambda, Glue, and Athena from Weather Data

In this article I cover creating a rudimentary Data Lake on AWS S3 filled with historical Weather Data consumed from a REST API. The S3 Data Lake is populated using traditional serverless technologies like AWS Lambda, DynamoDB, and EventBridge rules, along with several modern AWS Glue features such as Crawlers, ETL PySpark Jobs, and Triggers.

Introduction to Redshift using Pagila Sample Dataset Including ETL from Postgres using AWS Glue

In this article I give a practical, introductory tutorial on using Amazon Redshift as an OLAP Data Warehouse solution for the popular Pagila Movie Rental dataset. I start with a basic overview of the unique architecture Redshift uses to fulfill its role as a scalable, robust enterprise cloud data warehouse. Then, armed with this basic knowledge of Redshift's architecture, I move on to a practical example of designing a schema optimized for Redshift based on the Pagila sample dataset.

Example Driven High Level Overview of Spark with Python

In this article I give a high-level, example-driven overview of writing data processing programs using the Python bindings for Spark, commonly known as PySpark. I specifically cover the Spark SQL DataFrame API, which I've found to be the most useful way to write data analytics code with PySpark. The target audience for this article is Python developers, ideally those with a cursory understanding of other popular PyData stack libraries such as Pandas and NumPy.

How To Use Window Functions in SQL

When it comes to quantitative analysis of data in database tables, standard SQL provides a set of aggregate functions like SUM(), MAX(), and MIN(). There are two main ways these functions get used in practice: (i) collapsing the table data down to a result set representing the aggregate calculation, or (ii) presenting the aggregate calculation per row while maintaining the granularity of the complete table. Window functions accomplish this second option and will be the focus of this article.
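The contrast between the two options can be sketched with a small, hypothetical table run through Python's built-in sqlite3 module (SQLite supports window functions as of version 3.25); the table and column names here are illustrative, not from the article:

```python
import sqlite3

# Hypothetical sales table used only to illustrate the two options.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100), ("east", 200), ("west", 50)])

# (i) Collapsing: GROUP BY reduces the table to one row per group.
collapsed = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(collapsed)   # [('east', 300), ('west', 50)]

# (ii) Windowing: every original row is kept, each annotated with the
# aggregate computed over its partition.
windowed = conn.execute(
    "SELECT region, amount, SUM(amount) OVER (PARTITION BY region) "
    "FROM sales ORDER BY region, amount"
).fetchall()
print(windowed)    # [('east', 100, 300), ('east', 200, 300), ('west', 50, 50)]
```

Note how the windowed query returns three rows, the same granularity as the source table, while the grouped query collapses them to two.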

Processing Streams of Stock Quotes with Kafka and Confluent ksqlDB

In this article I present an example of how one can use Kafka and the Confluent ksqlDB stream processing database to process a simplified dataset of fake stock quotes. The ultimate goal of this exercise is to use ksqlDB to inspect a stream of stock quotes for individual companies in one-minute windows and identify when a window has introduced a new daily high or low stock price.
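As a rough sketch of that windowing logic in plain Python (the quote tuples, timestamps, and helper function below are hypothetical stand-ins, not the article's ksqlDB queries): quotes are bucketed into one-minute tumbling windows per symbol, and any window whose max or min breaks the running daily high or low is flagged.

```python
from collections import defaultdict

# Hypothetical quotes: (symbol, seconds since midnight, price).
quotes = [
    ("ACME", 5, 100.0), ("ACME", 30, 101.5),   # minute 0
    ("ACME", 65, 99.0),                        # minute 1: new daily low
    ("ACME", 130, 103.0),                      # minute 2: new daily high
]

def flag_extremes(quotes):
    """Bucket quotes into one-minute tumbling windows per symbol and
    flag windows that break the running daily high or low."""
    windows = defaultdict(list)            # (symbol, minute) -> prices
    for symbol, ts, price in quotes:
        windows[(symbol, ts // 60)].append(price)

    daily = {}                             # symbol -> (low, high) so far
    events = []
    for (symbol, minute), prices in sorted(windows.items()):
        lo, hi = min(prices), max(prices)
        if symbol not in daily:
            daily[symbol] = (lo, hi)       # first window sets the baseline
            continue
        d_lo, d_hi = daily[symbol]
        if hi > d_hi:
            events.append((symbol, minute, "new daily high", hi))
        if lo < d_lo:
            events.append((symbol, minute, "new daily low", lo))
        daily[symbol] = (min(lo, d_lo), max(hi, d_hi))
    return events

print(flag_extremes(quotes))
# [('ACME', 1, 'new daily low', 99.0), ('ACME', 2, 'new daily high', 103.0)]
```

In ksqlDB the same idea would be expressed declaratively with a tumbling window aggregation rather than this imperative loop.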