Course: Big Data Fundamentals via PySpark

Learn the fundamentals of working with big data with PySpark.


This online course, Big Data Fundamentals via PySpark, covers key skills that a future data analyst will need.

There’s been a lot of buzz about Big Data over the past few years, and it’s finally become mainstream for many companies. But what is this Big Data? This course covers the fundamentals of Big Data via PySpark. Spark is a “lightning-fast cluster computing” framework for Big Data. It provides a general data processing engine and lets you run programs up to 100x faster in memory, or 10x faster on disk, than Hadoop. You’ll use PySpark, a Python package for Spark programming, and its powerful higher-level libraries such as Spark SQL and MLlib (for machine learning) to interact with the works of William Shakespeare, analyze FIFA 2018 football data, and perform clustering of genomic datasets. By the end of this course, you will have gained an in-depth understanding of PySpark and its application to general Big Data analysis.

Enroll now in this Big Data Fundamentals via PySpark course, and don’t miss the opportunity to learn with one of the best, Upendra Kumar Devisetty. With 55 enriching exercises, 16 videos, and an estimated 4 hours to complete the course, you will become one of the best too.

Prerequisites before you start
Chapter 1: Introduction to Big Data analysis with Spark
This chapter introduces the exciting world of Big Data, as well as the various concepts and different frameworks for processing Big Data. You will understand why Apache Spark is considered the best framework for Big Data.
Chapter 2: Programming in PySpark RDD’s
The main abstraction Spark provides is the resilient distributed dataset (RDD), the fundamental data structure of the engine. This chapter introduces RDDs and shows how they can be created and operated on using RDD transformations and actions.
Chapter 3: PySpark SQL & DataFrames
In this chapter, you’ll learn about Spark SQL, Spark’s module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine. This chapter shows how Spark SQL allows you to use DataFrames in Python.
Chapter 4: Machine Learning with PySpark MLlib
PySpark MLlib is Apache Spark’s scalable machine learning library in Python, consisting of common learning algorithms and utilities. In this last chapter, you’ll learn important machine learning algorithms. You will build a movie recommendation engine and a spam filter, and use k-means clustering.

Upendra Kumar Devisetty

Science Analyst at CyVerse

Upendra Kumar Devisetty is a Science Analyst at CyVerse, where he works with biologists, bioinformaticians, programming teams, and other members of the CyVerse team. He also coordinates development across projects and facilitates integration and cross-communication. His current work focuses mainly on integrative analysis of Big Data using high-throughput methods on advanced computing systems. As scientific computing becomes indispensable for Big Data research, he has been building a community to develop and propagate a set of best practices, including continuous testing, version control, virtualization, sharing code through notebooks, and standard data structures.