# Getting started This document summarizes all instructions to help first time users to get and use DDF in 10 minutes. --- * <a href="#getting-started">Getting started!</a> * <a href="#prerequisites">Prerequisites</a> * <a href="#building-ddf">Building DDF</a> * <a href="#java-example">Running a Java example</a> * <a href="#ddf-shell">Interactive Analysis with DDF's Shell</a> * <a href="#spark-api">Use Spark's Java API</a> * <a href="#ddf-R">Use DDF in R</a> * <a href="#ddf-R">Use DDF in Python</a> --- <a name="getting-started"></a> ## Getting Started This tutorial provides a quick introduction to using DDF with its implementation on Apache Spark. We will introduce how to build DDF and run DDF example in Java, Scala, R and Python. <a name="prerequisites"></a> ## Prerequisites Java 7+, Scala 2.10.x, maven 3.x, git. Download the DDF Binary from the website, or clone DDF from github ``` git clone https://github.com/ddf-project/DDF ``` <a name="building-ddf"></a> ## Building DDF If you're a binary user, you can skip the building part. DDF is currently compatible with Spark 1.3.1. Now set up the neccessary environment variables (MAVEN_OPTS, JAVA_TOOL_OPTIONS) for the build. ``` cd DDF bin/set-env.sh ``` If you don't want to set up the environment variables every time, please add the commands into ~/.bashrc or ~/.bash_profile. Sbt can be installed by the following command ``` bin/get-sbt.sh ``` Building ddf by ``` mvn clean install -DskipTests ``` Modules can also be built separately ``` (cd core ; mvn clean install -DskipTests) (cd spark ; mvn clean install -DskipTests) ``` If you are more familar with the sbt tool ``` sbt (bin/sbt) clean compile publishLocal ``` Unit tests can be run by ``` mvn test ``` Or if you want to run test for individual modules, ``` cd core; mvn test cd spark; mvn test ``` Or if you are more familar with the sbt tool, ``` sbt (bin/sbt) test ``` At this point, it should pass all unit tests. <a name="java-example"></a> ## Running a Java example Run this command in the DDF directory ``` bin/run-example io.ddf.spark.examples.RowCount ``` <a name="ddf-shell"></a> ## Interactive Analysis with DDF's Shell DDF's shell provides a simple way to learn the API, as well as a powerful tool to analyze data interactively. It's available in Java, Scala and Python ``` bin/ddf-shell ``` Each DDF application has a DDFManager as an entry point to a big data engine. Now, let's create a DDFManager for the Spark engine. ``` java import io.ddf.DDFManager; DDFManager sparkDDFManager = DDFManager.get("spark"); ``` Prepare data to run ``` java sparkDDFManager.sql("create table airline (Year int,Month int,DayofMonth int, DayOfWeek int,DepTime int,CRSDepTime int,ArrTime int,CRSArrTime int,UniqueCarrier string, FlightNum int, TailNum string, ActualElapsedTime int, CRSElapsedTime int, AirTime int, ArrDelay int, DepDelay int, Origin string, Dest string, Distance int, TaxiIn int, TaxiOut int, Cancelled int, CancellationCode string, Diverted string, CarrierDelay int, WeatherDelay int, NASDelay int, SecurityDelay int, LateAircraftDelay int ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','", Boolean.FALSE); sparkDDFManager.sql("load data local inpath './resources/test/airlineWithNA.csv' into table airline", Boolean.FALSE); ``` Create DDF ``` java DDF ddf = sparkDDFManager.sql2ddf("select * from airline", Boolean.FALSE); ``` ###Get the number of rows/columns of the DDF: ``` java ddf.getNumRows(); print(ddf.getColumnNames()); ``` ### Data ETL DDF allows mutable transformations, e.g., all NA values can be dropped from the same DDF, similar to a Pandas dataframe. ``` java ddf.setMutable(true); ddf.dropNA(); ``` ### Basic Statistics DDF comes with a lot of R-like idioms, such as summary, fiveNum ``` java s = ddf.getSummary(); ``` Check the mean or NA count of the first column of the DDF: ``` java print(s[0].mean()); ``` Project the DDF with specific columns: ``` java newddf = ddf.VIEWS.project("deptime", "arrtime", "distance", "depdelay", "arrdelay"); ``` View some data in newddf: ``` java print(newddf.VIEWS.head(2)); ``` ### Machine learning Run Mllib's KMeans on newddf: ``` java import org.apache.spark.mllib.clustering.KMeansModel; kmeansModel = (KMeansModel) newddf.ML.train("kmeans", 5, 5).getRawModel(); ``` Predict on the kmeansModel: ``` java import org.apache.spark.mllib.linalg.Vectors; print(kmeansModel.predict(Vectors.dense(new double[] { 1232, 1341, 389, 100, 9 }))); ``` <a name="spark-api"></a> ## Use Spark's Java API You can also play with Spark Java API from the ddf-shell ``` java sc = sparkDDFManager.getJavaSparkContext(); ``` When done analysis, shutdown the DDFManager ``` java sparkDDFManager.shutdown(); ``` <a name="ddf-R"></a> ## Use DDF in R DDF can be used in R language with the same user experience. DDF on R is very powerful API that bring the R idioms into big data processing. After building DDF following previous steps, You can install DDF on R using the following command in the R directory. Go into DDF/R directory and run: ``` sh ./install.sh ``` It will install DDF on R package to system repo. Before using the package, users have to setup rJava environment by running those commands ``` R CMD javareconf -e export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$JAVA_LD_LIBRARY_PATH ``` After this step you can start running provided example by using this command ``` R Rscript examples/spark.R ``` <a name="ddf-python"></a> ## Use DDF in Python DDF can also be used in Python. After building DDF, we nned to install the requirements for pyddf: ``` $ cd <DDF_DIRECTORY>/python $ pip install -r requirements.txt $ python setup.py develop ``` Then set environment variable ``` $ cd <DDF_DIRECTORY> $ export DDF_HOME=`pwd` ``` Now open your Python interpreter ``` $ cd <DDF_DIRECTORY>/python $ python ``` Run tests ``` $ cd <YOUR_DDF_DIRECTORY>/python $ DDF_HOME=../ python examples/basics.py $ DDF_HOME=../ python tests/test_ddf.py ``` [Next you can visit our release notes. ](release_notes.html)