Just enough to get you started with Apache Spark
When Big Data became a thing some years ago, I tried several frameworks, platforms and technologies. Most, if not all, of them were not very good at hiding their inherently distributed nature, which meant a steep learning curve and added complexity. The first thing I noticed about Apache Spark was that it managed to create abstractions that hide all the ugly parts. It does so in a way that makes you feel you are just manipulating in-memory data structures in some high-level programming language - like a HashSet in Java. You know, the usual stuff. This guide is meant to be a bare-minimum tutorial for people who have never seen Spark before, or for whom it has been a while since they last played with it.
Apache Spark is an analytics framework for distributed computing environments, originally developed at UC Berkeley in 2009 (and later donated to the Apache Software Foundation). It supports several languages, such as Python, Scala and Java, and can be used for Big Data and Machine Learning purposes.
It is important to remember that at the heart of Spark there is the Resilient Distributed Dataset (RDD), its core data structure. We are going to talk about RDDs shortly. Spark comes with several components so that it spans a range of fields and technologies:
- Spark Core: Spark’s distributed execution engine.
- Spark SQL: Support for SQL queries.
- Spark Streaming: Stream processing of live data streams.
- GraphX: API for graphs.
- MLlib: Machine Learning in Spark.
Harder, Better, Faster, Stronger (Spark vs. Hadoop MapReduce)
Since we already have Hadoop MapReduce, why use Apache Spark for data processing instead? A key difference is that Hadoop MapReduce persists intermediate results to disk, whereas Spark processes data in memory. This makes Spark much faster, which can prove critical in real-time applications such as stream processing. The Spark project has advertised speedups of up to 100 times for in-memory workloads, although the actual gain depends on the job. However, for batch processing where time is not critical and the dataset is too large to fit in memory, so that the extra disk space comes in handy, Hadoop MapReduce is still competitive. Finally, Spark’s components allow for SQL queries, graph algorithms and Machine Learning. To recap, what we get with Spark:
- Easier programming. Because of its abstractions, the programmer does not need to know much about the internals.
- You get to start Spark from a Command Line Interface! It can be as easy as starting the CLI, loading a file and processing it.
- Faster stream processing.
- SQL queries!
RDDs are, along with Dataframes, the fundamental data structures of Spark. They are an abstraction which hides a lot of complicated details. All you need to know is their API and not necessarily what is going on under the hood. The abbreviation breaks down as:
- Resilient, which means they are fault-tolerant: Spark can recompute partitions that are missing or damaged due to node failures.
- Distributed, since their data resides on multiple nodes in a cluster.
- Dataset, a collection of partitioned data with primitive values or values of values, e.g. tuples or other objects.
There are three ways to create an RDD in Spark:
- Parallelizing an already existing collection in your driver program
- Referencing a dataset in an external storage system, e.g. HDFS
- Creating an RDD from already existing RDDs
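The three options above can be sketched in a few lines of Scala. This is a minimal illustration rather than a recipe: it assumes a local Spark installation, uses a temporary local file as a stand-in for external storage such as HDFS, and all variable names are made up for the example.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Local session for experimentation; "local[*]" runs Spark inside this JVM.
val spark = SparkSession.builder().appName("rdd-creation").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// 1. Parallelize a collection that already lives in the driver program.
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// 2. Reference a dataset in external storage. A temporary local file stands in
//    for HDFS here (assumption); an "hdfs://..." URI would be passed the same way.
val path = Files.createTempFile("words", ".txt")
Files.write(path, "spark\nhadoop\nspark".getBytes("UTF-8"))
val lines = sc.textFile(path.toString)

// 3. Derive a new RDD from an existing one by applying a transformation.
val doubled = numbers.map(_ * 2)
```

Nothing is actually computed until an action such as "count" or "collect" is called on one of these RDDs.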
Basic RDD API
RDDs offer an API which can be used to handle their lifecycle. The full API in the official documentation is fairly long, but you can get a taste of it with the following:
One of the most common operations is "parallelize". It takes as an argument a collection such as an array and it essentially does what the name suggests: copies the data to a distributed setting where we can run operations in parallel. Let’s see what the code (Scala) looks like:
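A minimal sketch of "parallelize", assuming a local Spark installation. The array contents and the choice of two partitions are arbitrary values picked for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Local session for experimentation; in spark-shell, "spark" and "sc" already exist.
val spark = SparkSession.builder().appName("parallelize-demo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// An ordinary in-memory array in the driver program.
val data = Array(1, 2, 3, 4, 5)

// Distribute the array; the optional second argument is the number of partitions.
val distData = sc.parallelize(data, 2)

// From here on, operations run in parallel over the partitions.
val total = distData.reduce(_ + _)
println(s"partitions = ${distData.getNumPartitions}, sum = $total")
```

If the second argument is omitted, Spark picks a default number of partitions based on the cluster configuration.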
Another common operation is "collect". It returns all the elements of the dataset as an array at the driver program (roughly, the process that coordinates the job). In some sense this is the reverse of "parallelize", calling the data back to the driver program. Note that this is a memory-intensive call and should be used after a "filter" or a similar size-reducing operation, otherwise it can easily cause an out-of-memory error if the dataset is too large to fit in the driver's memory. A safe alternative is to use "take" instead, in order to see only a part of the data.
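The difference between the two calls might look like this in Scala. It is a small sketch assuming a local Spark installation; the range of numbers and the filter condition are made up for the example.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("collect-demo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val numbers = sc.parallelize(1 to 1000)

// Shrink the dataset first, then bring every remaining element back to the driver.
val multiplesOf100 = numbers.filter(_ % 100 == 0).collect()

// Peek at a few elements without pulling the whole dataset into driver memory.
val firstThree = numbers.take(3)

println(multiplesOf100.mkString(", "))
println(firstThree.mkString(", "))
```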
Dataframes are another abstraction built on top of RDDs, which makes them an even higher-level abstraction. As the official Spark guide says, you can create DataFrames from an existing RDD, from a Hive table, or from Spark data sources (e.g. text file, Parquet file, JSON, etc.). Roughly, you can think of RDDs as operating on collections of records, often key/value pairs, and dataframes as operating on tables with named columns (like SQL tables). If you are familiar with R, they are very similar to R's data frames. Let’s see how to create one from a JSON file right away:
First, we create a dataframe from a JSON file which contains students along with their grades in a course. Next, we create a view, which is the equivalent of a database table. We call this "grades". Then we declare the SQL query we want to run, and finally we display the results. Simple as that!
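The four steps just described could be sketched as follows. Only the view name "grades" comes from the text; the file contents, the column names "name" and "grade" and the query itself are hypothetical, and a temporary file is written first so the snippet can run on its own with a local Spark installation.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("grades-demo").master("local[*]").getOrCreate()

// Hypothetical input file: one JSON object per line, which is what
// spark.read.json expects by default.
val path = Files.createTempFile("grades", ".json")
val records = Seq("""{"name": "Alice", "grade": 9}""", """{"name": "Bob", "grade": 6}""")
Files.write(path, records.mkString("\n").getBytes("UTF-8"))

// (1) Create a dataframe from the JSON file; the schema is inferred.
val df = spark.read.json(path.toString)
// (2) Register a view called "grades", the equivalent of a database table.
df.createOrReplaceTempView("grades")
// (3) Declare the SQL query to run (an illustrative one).
val top = spark.sql("SELECT name, grade FROM grades WHERE grade >= 8")
// (4) Display the results.
top.show()
```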
From here on, this is where the magic happens. You are free to experiment with the API and try numerous processing functions, e.g. map, reduce, filter etc. There is a lot going on, of increasing complexity, which you can pick up by consulting the official API.
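As a starting point for such experiments, here is a small sketch that chains "map", "filter" and "reduce". It assumes a local Spark installation, and the word list is made up for the example.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("map-filter-reduce").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val words = sc.parallelize(Seq("spark", "hides", "the", "ugly", "parts"))

// map: transform every element (here, each word into its length).
val lengths = words.map(_.length)
// filter: keep only the elements that satisfy a predicate.
val longOnes = lengths.filter(_ > 3)
// reduce: combine the remaining elements into a single value on the driver.
val totalChars = longOnes.reduce(_ + _)

println(s"total characters in words longer than 3 letters: $totalChars")
```

"map" and "filter" are lazy transformations, so the work only happens when "reduce" asks for a result.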
Credits to Christos Mantas for sharing his expertise on Spark.