Providing quality software engineering content in the form of tutorials, applications, services, and commentary suited for developers.
In this How To article I show a simple example of how to use the explode function from the Spark SQL API to unravel multi-valued fields. I have found this to be a common use case when cleaning data with PySpark, particularly when working with nested JSON documents in an Extract, Transform, and Load (ETL) workflow.
In this How To article I demonstrate running a simple Flask Python REST API service on a local minikube Kubernetes cluster using the VirtualBox Driver.
In this article I demonstrate using a Python-based AWS Lambda SAM project with the AWS Data Wrangler Lambda Layer to translate GZipped JSON files into Parquet upon an S3 upload event.
When it comes to quantitative analysis of data in database tables, standard SQL provides a set of aggregate functions like SUM(), MAX(), and MIN(). There are two main ways these functions get used in practice: (i) collapsing the table data down to a result set that holds only the aggregate calculation, or (ii) presenting the aggregate calculation per row while maintaining the granularity of the complete table. Windowing functions are used to accomplish this second option and will be the focus of this article.
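As a rough illustration of the difference between the two usages, here is a minimal sketch using Python's built-in sqlite3 module. The sales table, its columns, and its rows are hypothetical and are not taken from the article; the sketch also assumes the bundled SQLite is version 3.25 or newer, which is when window functions were added.

```python
import sqlite3

# Hypothetical in-memory table used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10), ("east", 30), ("west", 5)],
)

# (i) Collapsing: GROUP BY returns one row per region.
collapsed = conn.execute(
    "SELECT region, SUM(amount) FROM sales "
    "GROUP BY region ORDER BY region"
).fetchall()

# (ii) Windowing: every original row is kept, and the regional
# total is attached to each row via OVER (PARTITION BY ...).
windowed = conn.execute(
    "SELECT region, amount, SUM(amount) OVER (PARTITION BY region) "
    "FROM sales ORDER BY region, amount"
).fetchall()

print(collapsed)
print(windowed)
```

The grouped query returns two rows for the three inserted rows, while the windowed query returns all three rows, each carrying the total for its region.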
Machine Learning is capturing significant attention among technologists and innovators due to a desire to shift from descriptive analytics, focused on understanding what happened in the past, towards predicting what is likely to occur in the future and prescribing actions to take in response to that prediction. In this article I focus on the use case of classifying email messages as either spam or ham with supervised machine learning using Python and SciKit Learn.
In this article I dive into partitions for S3 data stores within the context of the AWS Glue Metadata Catalog, covering how they can be recorded using Glue Crawlers as well as the Glue API with the Boto3 SDK.