Providing quality software engineering content in the form of tutorials, applications, services, and commentary suited for developers.

Posts about PySpark

Building Data Lakes in AWS with S3, Lambda, Glue, and Athena from Weather Data

In this article I cover creating a rudimentary Data Lake on AWS S3 filled with historical Weather Data consumed from a REST API. The S3 Data Lake is populated using traditional serverless technologies like AWS Lambda, DynamoDB, and EventBridge rules along with several modern AWS Glue features such as Crawlers, ETL PySpark Jobs, and Triggers.

Example Driven High Level Overview of Spark with Python

In this article I give a high-level, example-driven overview of writing data processing programs using the Python programming language bindings for Spark, commonly known as PySpark. I specifically cover the Spark SQL DataFrame API, which I've found to be the most useful way to write data analytics code with PySpark. The target audience for this article is Python developers, ideally those who have a cursory understanding of other popular PyData Stack libraries such as Pandas and NumPy.