
Dataset reorganizing
In this section, we will cover dataset reorganization techniques, including some of Spark's special features for reorganizing data and some of R's methods that can be used within Spark notebooks.
After this section, we will be able to reorganize datasets for various machine learning needs.
Dataset reorganizing tasks
Reorganizing datasets sounds easy, but it can be very challenging and is often very time consuming.
Two common data reorganizing tasks are, first, obtaining a subset of the data for modeling and, second, aggregating the data to a higher level. For example, we may have data on individual students but need a dataset at the classroom level; for this, we will need to calculate some attributes per student and then reorganize them into a new classroom-level dataset.
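As a quick illustration of the second task, the following is a minimal sketch in R that aggregates hypothetical student records to the classroom level; the students data frame and its columns are invented purely for illustration.

# Hypothetical student-level data: one row per student
students <- data.frame(
  classroomId = c(1, 1, 2, 2, 2),
  score       = c(85, 90, 70, 75, 80),
  absences    = c(0, 2, 5, 1, 3))

# Aggregate to the classroom level: mean score and mean absences per classroom
classrooms <- aggregate(cbind(score, absences) ~ classroomId,
                        data = students, FUN = mean)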
To reorganize data, data scientists and machine learning professionals often rely on their familiar SQL or R programming tools. Fortunately, within the Spark environment, there are Spark SQL and R notebooks that let users continue along familiar paths; we will review both in detail in the following two sections.
Overall, we recommend using Spark SQL to reorganize datasets. However, for learning purposes, our focus in this section will be on using the R notebook in the Databricks Workspace.
R and Spark nicely complement each other for several important use cases in statistics and data science. The Databricks R notebooks include the SparkR package by default, so that data scientists can effortlessly benefit from the power of Apache Spark in their R analyses. In addition to SparkR, any R package can be easily installed into the notebook. Here, we will highlight a few of the features of these R notebooks.

To get started with R in Databricks, simply choose R as the language when creating a notebook. Since SparkR is a recent addition to Spark, remember to attach the R notebook to a cluster running Spark version 1.4 or higher. The SparkR package is imported and configured by default, so you can then run Spark queries in R.
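For instance, the following is a minimal sketch of running a Spark query from an R notebook. It assumes the sqlContext predefined in Databricks R notebooks on Spark 1.4 or higher, and uses R's built-in faithful dataset with an illustrative waiting threshold.

# sqlContext is predefined in Databricks R notebooks (Spark 1.4+)
# Convert a local R data frame into a Spark DataFrame
df <- createDataFrame(sqlContext, faithful)

# Register it as a temporary table and query it with Spark SQL from R
registerTempTable(df, "faithful")
longWaits <- sql(sqlContext, "SELECT * FROM faithful WHERE waiting > 70")

# Bring a few rows back to R for inspection
head(longWaits)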
Dataset reorganizing with Spark SQL
In the last section, we discussed using Spark SQL to reorganize datasets.
SQL is a powerful tool for performing complex aggregations, with many examples that are familiar to machine learning professionals. The SELECT command is used to obtain subsets of the data. For data aggregation, machine learning professionals may use some of Spark SQL's simple aggregate functions or its window functions.
Note
For more information about Spark SQL's various aggregation functions, go to https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.sql.functions$.
For more information on Spark SQL's window functions, go to https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html.
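To make this concrete, the following is a minimal sketch of one subsetting query, one simple aggregation, and one window-function query, issued as Spark SQL statements from an R notebook through SparkR's sql() call. The users table, its columns, and the age threshold are hypothetical and only serve to show the pattern; note that in Spark 1.4, window functions require a HiveContext.

# Subsetting: SELECT only the rows and columns needed for modeling
# (assumes a hypothetical temporary table named "users" has been registered)
userSubset <- sql(sqlContext,
  "SELECT userId, age FROM users WHERE age >= 20")

# Simple aggregation: the average age per subscription status
avgAgeBySub <- sql(sqlContext,
  "SELECT subscribed, AVG(age) AS avgAge FROM users GROUP BY subscribed")

# Window function: rank users by age within each subscription group
ageRanks <- sql(sqlContext,
  "SELECT userId, age,
          RANK() OVER (PARTITION BY subscribed ORDER BY age DESC) AS ageRank
   FROM users")

head(avgAgeBySub)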
Dataset reorganizing with R on Spark
R has a subset function to create subsets, using the following format:
# using the subset function
newdata <- subset(olddata, var1 >= 20, select=c(ID, var2))
Also, we may use the aggregate function from R, as follows:
# aggregate mtcars by the cyl and vs columns, taking the mean of each column
aggdata <- aggregate(mtcars, by=list(mtcars$cyl, mtcars$vs), FUN=mean, na.rm=TRUE)
However, data often has multiple levels of grouping (nested treatments, split plot designs, or repeated measurements) and typically requires investigation at multiple levels. For example, from a long-term clinical study, we may be interested in investigating relationships over time, or between times, patients, or treatments. To make the job even more difficult, the data has probably been collected and stored in a way optimized for ease and accuracy of collection, and in no way resembles the form needed for statistical analysis. We need to be able to fluently and fluidly reshape the data to meet our needs, but most software packages make it difficult to generalize these tasks, and new code has to be written for each new case.
In particular, R has a reshape package that was specially designed for data reorganization. The reshape package uses a paradigm of melting and casting: the data is first melted into a form that distinguishes measured and identifying variables, and is then cast into a new shape, whether it be a data frame, a list, or a high-dimensional array.
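The following is a minimal sketch of this melt-and-cast paradigm, assuming the reshape package is installed; the clinical data frame and its columns are invented only to mirror the repeated-measurements example above.

library(reshape)

# Hypothetical repeated-measures data: one row per patient per visit
clinical <- data.frame(
  patient   = c(1, 1, 2, 2),
  treatment = c("A", "A", "B", "B"),
  visit     = c(1, 2, 1, 2),
  score     = c(5.1, 5.8, 6.0, 6.4))

# Melt: name the identifying variables; the remaining columns become measured variables
molten <- melt(clinical, id = c("patient", "treatment", "visit"))

# Cast: reshape the molten data, here averaging the measured values per treatment and visit
cast(molten, treatment + visit ~ variable, mean)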
As we may recall, in the Data cleaning made easy section, we had four tables for illustration purposes:
Users(userId INT, name STRING, email STRING, age INT, latitude DOUBLE, longitude DOUBLE, subscribed BOOLEAN)
Events(userId INT, action INT, Default)
WebLog(userId, webAction)
Demographic(memberId, age, edu, income)
For this example, we often need to obtain a subset of the first table (Users) and aggregate the fourth table (Demographic), as sketched below.
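The following is a minimal sketch of these two steps using SparkR DataFrames in an R notebook. It assumes the Users and Demographic tables have already been loaded as SparkR DataFrames named users and demographic, and the age threshold of 20 is only illustrative.

# Subset the Users table: keep only users aged 20 or older,
# and only the columns needed for modeling
userSubset <- select(filter(users, users$age >= 20),
                     "userId", "age", "latitude", "longitude")

# Aggregate the Demographic table: average age and income per education level
demoAgg <- summarize(groupBy(demographic, demographic$edu),
                     avgAge    = avg(demographic$age),
                     avgIncome = avg(demographic$income))

# Bring the small aggregated result back to R for inspection
head(demoAgg)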