• Spark SQL: This is Spark’s module for working with structured data, and it is designed to support workloads that combine familiar SQL database
queries with more complicated, algorithm-based analytics. Spark SQL supports the open source Hive project and its SQL-like HiveQL query syntax.
Spark SQL also supports JDBC and ODBC connections, enabling a degree of integration with existing databases, data warehouses and business
intelligence tools. JDBC connectors can also be used to integrate with Apache Drill, opening up access to an even broader range of data sources.
• Spark Streaming: This module supports scalable and fault-tolerant processing of streaming data, and can integrate with established sources of
data streams like Flume (optimized for data logs) and Kafka (optimized for distributed messaging). Spark Streaming’s design, and its use of
Spark’s RDD abstraction, are meant to ensure that applications written for streaming data can be repurposed to analyze batches of historical data
with little modification.
• MLlib: This is Spark’s scalable machine learning library, which implements a set of commonly used machine learning and statistical algorithms.
These include correlations and hypothesis testing, classification and regression, clustering, and principal component analysis.
• GraphX: This module began life as a separate UC Berkeley research project, which was eventually donated to the Apache Spark project.
GraphX supports analysis of and computation over graphs of data, and provides a version of the Pregel API for graph processing. GraphX includes a
number of widely understood graph algorithms, including PageRank.
• SparkR: This module was added in the 1.4.x release of Apache Spark, providing data scientists and statisticians using R with a lightweight
mechanism for calling upon Spark’s capabilities.
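
To give a feel for the kind of graph algorithm mentioned under GraphX, the following is a minimal sketch of PageRank in plain Python. It does not use Spark or the GraphX API; the function name pagerank, the damping factor of 0.85, the iteration count, and the sample edge list are all assumptions chosen for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (src, dst) edges.

    Illustrative only: a single-machine sketch, not GraphX's implementation.
    """
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    # Start with rank spread evenly across all nodes.
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node first receives the "teleport" share.
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            # A node with no outgoing edges spreads its rank over all nodes.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical four-node graph: a -> b -> c -> a, plus d -> c.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "c")]
ranks = pagerank(edges)
for node, score in sorted(ranks.items()):
    print(node, round(score, 4))
```

In this sample graph, node d has no incoming edges, so it only ever receives the teleport share, while the nodes on the a, b, c cycle accumulate most of the rank. A distributed engine would partition the same per-edge contributions across a cluster rather than looping over them on one machine.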