Applied ML using Spark

LEARNING OUTCOMES

After this course you will be able to:

– Load data into Spark and transform it

– Train models using Spark’s Machine Learning libraries

– Make predictions and evaluate models

– Build a Machine Learning pipeline

– Tune model parameters with grid search and Spark

CONTENT

Apache Spark, a cluster computing framework, is one of the most popular open source projects in the world.
This hands-on course focuses on applying different Machine Learning algorithms to datasets using Apache Spark’s capabilities through the Spark Python API, PySpark.

The course is divided into two main parts. The first is dedicated to the original Machine Learning library, MLlib, which is built on top of Resilient Distributed Datasets (RDDs). The second focuses on the spark.ml library, which is built on top of DataFrames.

For both parts, this course follows an exercise-driven approach: a new topic is introduced and the students then practice it in a Jupyter Notebook designed to help them consolidate the content. The exercises are linked together so that, at the end of each part, the students will have gone through a whole Machine Learning pipeline.

The course is self-contained, but it assumes the students are able to develop Python scripts and have been introduced to Apache Spark and Machine Learning algorithms.

Spark MLlib

– Vectors and Labeled Points, Local and Distributed Matrices

– Summary Statistics, Sampling and Hypothesis Testing

– Data Normalization and PCA for Feature Engineering

– Decision Trees, Random Forests, Gradient-Boosted Trees and Linear Methods

– Evaluation

spark.ml

– Built-in and external Data Sources

– Explode, User-Defined Functions and Pivot

– Statistics, Random Data Generation and Sampling on DataFrames

– Handling Missing Data and Imputing Values

– Transformers and Estimators

– Data Normalization, Feature Vectors, Categorical Features, PCA and R Formulas

– Pipelines

– Decision Trees, Random Forests, Gradient-Boosted Trees and Linear Methods

– Evaluation

– Saving and Reloading Models

Additional topics

– Model Tuning: Cross Validation and Train-Validation Split in Spark

– Spark Scikit-Learn: SKLearn Grid Search with Spark