This post continues the previous introductory series “Getting started with Spark in Python”, covering UDFs and Window Functions. Unfortunately it stayed marinating in my WordPress drafts for quite a while (it was already more or less 70% complete), and only now have I had the opportunity to finish it. Note: for this post I’m using Spark 1.6.1. … More Getting Started with Spark (part 3) – UDFs & Window functions
This post follows the relatively recent Spark release – Spark 2.0 – and looks at what has changed for Streaming applications. Why streaming in particular, you may ask? Well, Streaming is the ideal scenario most companies would like to use, and the competitive landscape is definitely heating up, especially with Apache Flink and … More Spark 2.0: From quasi to proper-streaming?
While working at HP some years ago, I was exposed not only to internal training materials, but also to a demo environment. I still remember the excitement when HP acquired Vertica Systems in 2011 and we had a new toy to play with… Come on, you can’t blame me; distributed DBs were something only the cool … More Overview HP Vertica vs AWS Redshift
If you are using AWS to host your applications, you have probably heard that you can apply IAM Roles to EC2 instances as well. In many cases this can be a really neat way to avoid passing AWS credentials to your applications and to spare yourself the pain of managing key distribution among servers, as … More Spark – Redshift: AWS Roles to the rescue
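A hedged sketch of what this can look like with the spark-redshift connector: instead of embedding access keys, you point the connector's `aws_iam_role` option at a role ARN. The cluster host, bucket, account id, and role name below are all fictitious; the small helper only assembles the standard `arn:aws:iam::<account-id>:role/<name>` format:

```python
# Illustrative sketch: reading a Redshift table from Spark via an IAM role
# instead of explicit AWS keys (all hosts, buckets, and ids are made up).

def iam_role_arn(account_id, role_name):
    """Build a role ARN in the standard AWS format."""
    return "arn:aws:iam::%s:role/%s" % (account_id, role_name)

# With the spark-redshift package on the classpath and `sqlContext` in scope,
# the read would look roughly like this:
#
# df = sqlContext.read \
#     .format("com.databricks.spark.redshift") \
#     .option("url", "jdbc:redshift://example-cluster:5439/mydb?user=u&password=p") \
#     .option("dbtable", "my_table") \
#     .option("tempdir", "s3n://my-bucket/tmp/") \
#     .option("aws_iam_role",
#             iam_role_arn("123456789012", "redshift-copy-role")) \
#     .load()
```

The role attached to the EC2 instance (and the one passed to Redshift for the COPY/UNLOAD through the temp bucket) replaces the key pairs you would otherwise have to distribute to every server.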
Distcp (distributed copy) is a fairly old tool used to move large quantities of files, usually within HDFS, using a MapReduce job in which mappers list the source files and reducers do the heavy lifting of the copy. Another useful feature is that it can also handle file migrations between HDFS and AWS … More Distcp S3-hdfs for eu-central-1 (AWS Frankfurt) – details “behind the trenches”
Update: this tutorial has been updated mainly up to Spark 1.6.2 (with a minor detail regarding Spark 2.0), which is not the most recent version of Spark at the time of updating this post. Nonetheless, for the operations exemplified you can pretty much rest assured that the API has not changed substantially. I will try to … More Getting started with the Spark (part 2) – SparkSQL
This is part of a series of introductory posts about Spark, meant to help beginners get started with it. Hope it helps! So what’s that funky business people call Spark? Essentially, Apache Spark is a framework for distributing parallel computational (inherently iterative) work across many nodes in a cluster of servers while maintaining high performance, and … More Getting started with Spark in Python/Scala
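To make “distributing parallel work across nodes” concrete without any Spark installed, here is a toy, pure-Python sketch of the core idea: split the data into partitions, process each partition independently (roughly what Spark ships to its executors), then combine the partial results:

```python
# Toy sketch of the partition -> parallel map -> reduce pattern that Spark
# implements at cluster scale (threads here stand in for executors).
from concurrent.futures import ThreadPoolExecutor

def partition(data, n):
    """Split `data` into at most n roughly equal chunks (Spark's partitions)."""
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]

def process_partition(chunk):
    """Work done independently on each partition: a map plus a local sum."""
    return sum(x * x for x in chunk)

data = list(range(1, 11))
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(process_partition, partition(data, 4)))

total = sum(partials)  # the final "reduce" step
print(total)           # 385, the sum of squares of 1..10
```

Spark's value is doing exactly this across machines while handling scheduling, data locality, and failures for you; the shape of the computation is the same.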