Introducing Distributed DataFrame (DDF)

Let’s face it: Big Data Analytics is too hard! Unreasonably hard.

If you’re a data scientist or engineer, today to get anything done, you’re likely having to deal with HDFS, MapReduce, distributed parallel processing, directed acyclic graphs, monoids, monads, map, flatmap, etc. and on and on. Now, those things are all important tools and concepts to master, at some level. But too often, they get in the way of the data scientist/engineer’s analytic productivity. Fuel injectors and carburetors are important to the functioning of your car’s engine, but they shouldn’t get in the way each time you need to drive. It’s only that way because big data science technologies are newly introduced and evolving. And because we’re building them from the data storage layer bottom-up, you end up having to know a lot of things, bottom-up, to get stuff done.

Can it be simpler and easier?

At Adatao, when in search for a big data science and engineering framework, we look for one key thing: productivity. Big data science and engineering need a framework for much better productivity. Instead of viewing things bottom-up, we take a top-down view of the big data stack, and ask what kind of API we would want to maximize data-engineering productivity. We ended up with developing the Distributed DataFrame (DDF) framework.

Design principles

The core idea of DDF is drawn from the valuable lessons that we have learned through the experience of adjacent technologies.

  • The ease of app development on RDBMS: The SQL abstraction has boosted app developer productivity tremendously, hiding away all the complexity and diversity of the database engines underneath.

  • The sophistication of R: For decades, data analysis idioms and packages have evolved around the powerful concept of the data.frame, from basic data transformation, filtering and projection, to advanced data mining and machine learning.

  • The scale of parallel, distributed computing: Thanks to technologies like Hadoop MapReduce, Apache Spark, and other parallel computing frameworks, big compute capabilities have become widely available.

DDF combines those 3 paradigms into one unified, highly productive framework. Our goal is to make the Big-Data API as simple and accessible to data scientists and engineers as the equivalent “small-data” RDBMS API. DDF could be as transformative to Big Data as SQL was to RDBMS. It takes advantage of powerful data & compute engines, hides away all their complexities, exposing them to the data scientist and engineer as a simple, productive, user-friendly development interface.

Key features and benefits

1. Table-like abstraction on top of big data

DDF lets you think of your big data sources and in-memory structures as database tables. The underlying representation can be very big, very distributed, very complicated, but from above, you only have to think of a table.

2. Native R data.frame experience

If you’re familiar with R, it’s the concept of an R data.frame. R users get out-of-the-box familiar coding syntax & patterns. Concepts like schema, filtering, projection, transformation, data mining & machine learning support all built-in. DDF also borrows heavily from pandas and scikit-learn , for their well thought-through design with proven popular success from the Python community’s data mining and machine learning efforts.

3. Focus on analytics, not MapReduce

DDF nicely wraps the complicated MapReduce patterns in simple APIs with rich set of popular analytics idioms benefited from decades of wisdom in relational databases and data science. It allows users to perform the whole analytics pipeline, from ETL/data wrangling to model building and real-time scoring, with just few lines of code.

4. Seamlessly integrate with external ML libraries

DDF’s architecture is componentized and pluggable by design, even at run-time, making it easy for users to replace or extend any component (“handler”) at will without having to modify the API or ask for permission.

5. Share DDFs via URIs

DDF allows mutable transformations and can be shared via URIs, enabling seamless collaboration in client/server settings.

6. Support multiple languages

The core DDF API is written in Java. Users will also enjoy using the DDF APIs with their familiar programming languages, including Java, Scala, R and Python. Followings is a brief demo of a flight info analysis in multiple languages, using our DDF implementation on Apache Spark.

From a Scala shell:
From R:
From Python

At Adatao, DDF has powered the rapid construction of applications like BigApp and PredictiveEngine, directly on top of Spark RDDs. This enables us to provide easy interfaces such as Natural Language, R, and Python for the underlying Spark/Shark engine. DDF is also extensible by design so that we can have DDF implemented on top of other big data engines, such as Redshift, Flink.

DDF is open sourced!