Distributed DataFrame

Simplify Analytics on Disparate Data Sources
via a Uniform API Across Engines

    A simple yet powerful API on top of, and across, multiple data and compute engines. With DDF you can:
  •   Process data at the source
  •   Bypass the hard requirement for a Hadoop data lake
  •   Future-proof your analytics applications against a rapidly changing data and compute engine landscape
// To start working with a DDF-on-Spark cluster:
DDFManager smanager = DDFManager.get("spark");
// Then, data can be loaded into a SparkDDF as follows:
DDF table = smanager.sql2ddf("select * from airline", false);
/* ETL: transform by adding a derived column */
table = table.transform("dist = round(distance/2, 2)");
/* Run machine learning using MLlib, then run prediction */
KMeansModel kmeansModel = (KMeansModel) table.ML.train("kmeans", 5, 5).getRawModel();
DDF prediction = table.ML.applyModel(kmeansModel, false, true);
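Since prediction is itself a DDF, it can be queried like any other. A minimal sketch, borrowing the SQL-on-@this idiom from the Flink example below:
/* Hedged sketch: inspect the k-means predictions with the same
   SQL idiom used elsewhere on this page. */
prediction.sql("select * from @this", "Error in SQL");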
// To start working with a DDF-on-Flink cluster:
DDFManager fmanager = DDFManager.get("flink");
// Data can be loaded into a FlinkDDF as follows:
DDF flinkTable = fmanager.sql2ddf("select * from airline", false);
/* ETL via SQL; @this refers to the DDF's own underlying table */
flinkTable.sql("select * from @this", "Error in SQL");
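Because both managers expose the same API, the choice of engine reduces to a single string. A minimal sketch, reusing only the calls shown above and assuming the airline table is reachable from both engines:
// The same routine runs unchanged on Spark and Flink; only the
// engine name passed to DDFManager.get changes.
for (String engine : new String[] {"spark", "flink"}) {
    DDFManager manager = DDFManager.get(engine);
    DDF ddf = manager.sql2ddf("select * from airline", false);
    ddf.sql("select * from @this", "Error in SQL");
}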

In R:

# Create a DDF manager to run on the Spark engine
dm <- DDFManager("spark")

# Create a DDF from a table
ddf <- sql2ddf(dm, "select * from mtcars")

# Basic stats: number of rows and columns
# (nrow/ncol/summary assumed to mirror the base-R generics)
nrow(ddf)
ncol(ddf)

# Run a standard summary on the DDF
summary(ddf)
In Python:

# Create a DDF manager to run on the Spark engine
dm = DDFManager("spark")

# Create a DDF from a table
ddf = dm.sql2ddf("select * from airline_na")

# Clean data: drop rows containing NA values
# (dropna assumed as the Python client's method name)
ddf = ddf.dropna()

# Basic stats: number of rows and columns
# (nrow/ncol assumed as the Python client's accessors)
print(ddf.nrow, ddf.ncol)

Implemented on Apache Spark and Apache Flink.

DDF Principles

The ease of app development on an RDBMS

The SQL abstraction has boosted app-developer productivity tremendously by hiding the complexity and diversity of the underlying database engines.
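DDF applies the same principle to compute engines. A minimal sketch, using only the quick-start calls above: application code is written once against the DDF interfaces and never references an engine-specific class.
// This method depends only on DDFManager and DDF, so swapping Spark
// for Flink (or a future engine) leaves it untouched.
static DDF loadAirline(DDFManager manager) {
    return manager.sql2ddf("select * from airline", false);
}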

The sophistication of R

For decades, data-analysis idioms and packages have evolved around the powerful concept of the data.frame, from basic transformation, filtering, and projection to advanced data mining and machine learning.

The scale of parallel, distributed computing

Thanks to technologies like Hadoop MapReduce, Apache Spark, and other parallel computing frameworks, big compute capabilities have become widely available.

What People Are Saying About DDF

  • "I'm working on making pandas work better with Spark, and it seems like wrapping pandas around DDF would be really cool." @holdenkarau
  • "@adataoinc is going to open-source their amazing work with distributed data frames, amazing! #sparksummit" @davidbgonzalez
  • "We are a team .. that would be very interested in getting this to open source release. We have ideas to build on top of a smoother abstraction than RDD and Schema RDD." @siditweet
  • "I really think that Spark (and in general Scala and Java) was lacking precisely the data frame concept that is so handy when you do R and data analysis." @carlosfuertes