Apache Spark is becoming the platform of choice for data scientists who want to design and run large-scale machine learning applications in higher-level languages such as Python and R. Its rich API lowers the bar for data scientists, letting them harness the power of Apache Spark without having to write or translate cumbersome Java-based big-data applications. But while getting Apache Spark up and running is easy, fully exploiting its power requires applications that are designed and tuned carefully.
In this workshop, we will cover the key tenets of large-scale machine learning and survey the algorithms Apache Spark provides for building statistical and machine learning applications. Through practical case studies, we will show how these algorithms can be applied and scaled to address big-data challenges in the enterprise.
On day one, we will begin with an introduction to Apache Spark. Through examples, we will see how to build applications using PySpark, the Python API for Spark. We will then focus on the MLlib API for building machine learning applications in Spark.
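To give a flavor of the programming model ahead of the workshop, here is a minimal PySpark sketch; the file name transactions.csv and the category and amount columns are illustrative assumptions, not workshop material:

```python
# A minimal PySpark program: start a session, load data, and run a
# distributed aggregation. The file and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("intro-example").getOrCreate()

# Read a CSV file into a DataFrame, inferring column types.
df = spark.read.csv("transactions.csv", header=True, inferSchema=True)

# Group and aggregate; Spark distributes this work across the cluster.
df.groupBy("category").agg(F.avg("amount").alias("avg_amount")).show()

spark.stop()
```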
What you will learn
- Apache Spark: An introduction
- Considerations for large-scale machine learning: How is it different from running algorithms on your laptop?
- Basic statistical techniques in Apache Spark
- Machine learning techniques in Apache Spark: Clustering and classification (see the sketch after this list)
- Evaluating the performance of machine learning algorithms
- Case study 1: Clustering large financial assets with Apache Spark
- Case study 2: Large-scale regression modeling with Apache Spark
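As a small preview of the clustering and evaluation topics above, here is a hedged sketch using the DataFrame-based MLlib API; the toy data, the column names f1 and f2, and the choice of k=2 are illustrative assumptions:

```python
# A sketch of k-means clustering with Spark's DataFrame-based MLlib API,
# followed by a silhouette-based evaluation. The toy data, column names,
# and k=2 are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

spark = SparkSession.builder.appName("kmeans-sketch").getOrCreate()

data = spark.createDataFrame(
    [(1.0, 1.2), (0.8, 1.1), (8.0, 8.3), (7.9, 8.1)],
    ["f1", "f2"],
)

# MLlib estimators expect features packed into a single vector column.
features = VectorAssembler(inputCols=["f1", "f2"],
                           outputCol="features").transform(data)

model = KMeans(k=2, seed=42, featuresCol="features").fit(features)
predictions = model.transform(features)

# Silhouette score: values near 1 indicate well-separated clusters.
print(ClusteringEvaluator(featuresCol="features").evaluate(predictions))

spark.stop()
```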
On day two, we will focus on building end-to-end applications with Apache Spark.
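As a hint of what "end-to-end" looks like in Spark's ML API, the sketch below chains feature assembly and a classifier into a single pyspark.ml Pipeline; the schema and the tiny training set are illustrative assumptions:

```python
# A hedged sketch of an end-to-end workflow in Spark ML: feature assembly
# and a classifier chained into one reusable Pipeline. The (label, f1, f2)
# schema and the toy data are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("pipeline-sketch").getOrCreate()

train = spark.createDataFrame(
    [(0.0, 1.0, 0.5), (1.0, 3.5, 2.0), (0.0, 0.8, 0.3), (1.0, 4.1, 2.2)],
    ["label", "f1", "f2"],
)

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["f1", "f2"], outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="label"),
])

model = pipeline.fit(train)  # fits every stage in order
model.transform(train).select("label", "prediction").show()

spark.stop()
```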