Since its debut in 2010, Apache Spark has become one of the most popular Big Data technologies in the Apache open source ecosystem. In addition to enabling the processing of large data sets through its distributed computing architecture, Spark provides out-of-the-box support for machine learning, streaming, and graph processing in a single framework.

Spark is backed by companies such as Microsoft, Google, Amazon, and IBM, and interest keeps growing as more organizations integrate it into their tool chains. Unlike big-data technologies that require intensive programming in languages such as Java, Spark lets data scientists work with a big-data platform in higher-level languages like Python and R, making it accessible for experimentation and rapid prototyping.

What you will learn

In this workshop, you will learn the core statistical and machine learning techniques supported by Apache Spark. Through examples, primarily in Python, you will learn why these algorithms matter and how to apply them methodically. Topics include:

  • Key tenets of Spark’s distributed computing framework
  • Statistical and machine learning techniques
  • Apache Spark's support for solving large-scale machine learning problems
  • Considerations when building applications with Apache Spark
  • Practical case studies with fully functional code

Course summary

Apache Spark is becoming the platform of choice for data scientists who want to design and run large-scale machine learning applications in higher-level languages like Python and R. The availability of a rich API has lowered the bar for data scientists by letting them leverage the power of Apache Spark to build machine learning applications without having to write or translate cumbersome Java-based big-data applications. Though the bar to get Apache Spark up and running is low, applications must be carefully designed and tuned to fully leverage Spark's power.

In this workshop, we will cover the key tenets of large-scale machine learning and the various algorithms Apache Spark supports for building statistical and machine learning applications. Through practical case studies, we will show how these algorithms can be applied and scaled to address big-data challenges in the enterprise.


Day 1

On day one, we will begin with an introduction to Apache Spark. Through examples, we will learn how to build applications using PySpark, the Python API for Spark. We will then focus on the MLlib API for building machine learning applications in Spark.

What you will learn

  • Apache Spark: An introduction
  • Considerations for large-scale machine learning: How is it different from running algorithms on your laptop?
  • Basic Statistical techniques in Apache Spark
  • Machine Learning Techniques in Apache Spark: Clustering and Classification
  • Evaluating performance in machine learning algorithms
  • Case study 1: Clustering large financial assets with Apache Spark
  • Case study 2: Large Scale regression modeling with Apache Spark


Day 2

On day two, we will focus on building end-to-end applications with Apache Spark.

What you will learn

  • Feature Engineering with Spark: Feature extraction, transformation, dimensionality reduction, and selection
  • Machine Learning Pipelines: Constructing, evaluating, and tuning ML Pipelines
  • Deploying machine learning models: Tools and best practices
  • Case study 3: Processing large datasets with Spark
  • Case study 4: An end-to-end machine learning application pipeline


QuantUniversity Meetup Slides 8/8/2016

