Category: Apache Spark

  • Deployment

    spark-submit is a shell command used to deploy a Spark application on a cluster. It works with all of the supported cluster managers through a uniform interface, so you do not have to configure your application separately for each one. Example: let us take the same word-count example we used before, this time deployed with shell commands.…
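
    As a sketch of what that deployment looks like, the following Scala word-count application could be submitted with spark-submit; the file names, output path, and master URL here are illustrative assumptions, not taken from the original example.

      // SparkWordCount.scala -- minimal word-count application (illustrative).
      import org.apache.spark.{SparkConf, SparkContext}

      object SparkWordCount {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(new SparkConf().setAppName("SparkWordCount"))

          // Split each line into words and count the occurrences of each word.
          sc.textFile("input.txt")
            .flatMap(_.split(" "))
            .map(word => (word, 1))
            .reduceByKey(_ + _)
            .saveAsTextFile("outfile")

          sc.stop()
        }
      }

      // Packaged into a jar (e.g. with sbt), it would then be deployed as:
      //   spark-submit --class SparkWordCount --master local[2] wordcount.jar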

  • Core Programming

    Spark Core is the base of the whole project. It provides distributed task dispatching, scheduling, and basic I/O functionalities. Spark uses a specialized fundamental data structure known as the RDD (Resilient Distributed Dataset), which is a logical collection of data partitioned across machines. RDDs can be created in two ways: one is by referencing datasets in…
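
    A brief sketch of both creation routes, as entered in spark-shell (which provides the SparkContext as sc); the file name data.txt is a hypothetical placeholder.

      // 1. Parallelize an existing collection in the driver program.
      scala> val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))

      // 2. Reference a dataset in external storage (local file, HDFS, etc.).
      scala> val fromFile = sc.textFile("data.txt")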

  • Installation

    Spark is Hadoop’s sub-project. Therefore, it is better to install Spark on a Linux-based system. The following steps show how to install Apache Spark. Step 1: Verifying Java Installation. Installing Java is a mandatory prerequisite for installing Spark. Try the following command to verify the Java version:

      $ java -version

    If Java is already…
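
    Once the remaining steps complete, a quick sanity check (an assumption about the finished setup, not one of the original steps) is to start spark-shell and run a trivial job:

      $ spark-shell
      scala> sc.parallelize(1 to 100).sum()   // sums the integers 1..100
      res0: Double = 5050.0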

  • RDD

    Resilient Distributed Datasets (RDDs) are a fundamental data structure of Spark. An RDD is an immutable distributed collection of objects. Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.…
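
    To make the partitioning and immutability concrete, a minimal spark-shell session might look like this; the sample words are made up for illustration.

      scala> val words = sc.parallelize(Seq("spark", "rdd", "spark"), 2)  // request 2 logical partitions
      scala> words.getNumPartitions
      res0: Int = 2
      scala> val upper = words.map(_.toUpperCase)  // RDDs are immutable: map returns a new RDD
      scala> upper.collect()
      res1: Array[String] = Array(SPARK, RDD, SPARK)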

  • Introduction

    Industries are using Hadoop extensively to analyze their data sets. The reason is that the Hadoop framework is based on a simple programming model (MapReduce), and it enables a computing solution that is scalable, flexible, fault-tolerant, and cost-effective. Here, the main concern is maintaining speed in processing large datasets, in terms of waiting time…