Presentation: Spark DataFrames
The DataFrame abstraction, introduced in Apache Spark 1.3, is a distributed collection of data organized into named columns. It gives developers a DSL for expressing manipulations of that data, while the underlying Catalyst query optimizer turns those expressions into an efficient execution plan. In this presentation for the Maryland Apache Spark Meetup, I gave an overview of the DataFrames API and walked through some sample use cases and comparisons with RDDs.
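To make the RDD comparison concrete, here is a minimal sketch written against the Spark 1.3-era API (`SQLContext` and `toDF()`). The `Mail` case class, its fields, and the sample records are hypothetical and made up for illustration; they are not taken from the spark-mail code. The sketch runs the same filter-and-count once with RDD operations and once with the DataFrame DSL.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical record type for illustration only (not from spark-mail).
case class Mail(from: String, year: Int)

object DataFrameSketch {
  def main(args: Array[String]): Unit = {
    // Local master is an assumption so the sketch can run without a cluster.
    val conf = new SparkConf().setAppName("DataFrameSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._ // enables rdd.toDF()

    // Made-up sample data.
    val mails = sc.parallelize(Seq(
      Mail("alice", 2000),
      Mail("bob", 2001),
      Mail("alice", 2001)))

    // RDD version: the logic lives in opaque Scala functions.
    val rddCounts = mails
      .filter(_.year == 2001)
      .map(m => (m.from, 1))
      .reduceByKey(_ + _)
    rddCounts.collect().foreach(println)

    // DataFrame version: named columns and a declarative DSL that
    // Catalyst can analyze and optimize.
    val df = mails.toDF()
    val dfCounts = df.filter(df("year") === 2001).groupBy("from").count()
    dfCounts.show()

    // Print the logical and physical plans Catalyst produced.
    dfCounts.explain(true)

    sc.stop()
  }
}
```

Calling `explain(true)` is a quick way to see what the optimizer did with the DSL expression. The RDD version has no equivalent, since its lambdas are opaque to Spark.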
Many thanks to Brian Husted, his Tetra Concepts team, and the Jailbreak Brewing Company for organizing and hosting this meetup! Thanks also to everyone who attended for the great questions and discussions.
Materials
- Slides
- Code available via:
git clone https://github.com/medale/spark-mail.git
cd spark-mail
git checkout spark-md-meetup-june-2015
mvn clean install
# check sql-analytics/src/main/scala classes for sample code