The DataFrame abstraction, introduced in Apache Spark 1.3, is a distributed collection of data organized into named columns. It gives developers an expressive DSL for manipulating that data, while the underlying Catalyst query optimizer handles performance by optimizing the resulting query plan. In this presentation for the Maryland Apache Spark Meetup, I gave an overview of the DataFrame API and walked through sample use cases and comparisons to RDDs.
Many thanks to Brian Husted, his Tetra Concepts team, and the Jailbreak Brewing Company for organizing and hosting this meetup! Thanks also to everyone for the great questions and discussions.
- Code available via: