Data Science: Python Basics Cheat Sheet


Python Basics Cheat Sheet
Python is one of the most popular data science tools, thanks to its gentle, gradual learning curve and the fact that it is a full-fledged programming language.
Data Science: PySpark RDD Basics Cheat Sheet

PySpark RDD Basics Cheat Sheet
“At a high level, every Spark application consists of a driver program that runs the user’s main function and executes various parallel operations on a cluster. The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures.” via spark.apache.org
Data Science: NumPy Basics Cheat Sheet

NumPy Basics Cheat Sheet
NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
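As a minimal sketch of what that description means in practice, the snippet below builds a small two-dimensional array and applies a few of NumPy's mathematical functions to it. The variable names and sample values are illustrative assumptions and are not taken from the cheat sheet itself.

```python
import numpy as np

# A 2x3 array (two rows, three columns); the values are made up for illustration.
matrix = np.array([[1.0, 4.0, 9.0],
                   [16.0, 25.0, 36.0]])

# Element-wise math: np.sqrt is applied to every entry without an explicit loop.
roots = np.sqrt(matrix)

# Reduction along an axis: axis=0 averages down each column.
column_means = matrix.mean(axis=0)

# Matrix product of the array with its own transpose gives a 2x2 result.
gram = matrix @ matrix.T

print(matrix.shape, matrix.ndim)
print(roots)
print(column_means)
print(gram)
```

The same calls work unchanged on arrays with more dimensions; only the `axis` argument needs to match the dimension being reduced.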