25+ Best Apache Spark Interview Questions & Answers by [EXPERTS]
Last updated on 03rd Jun 2020, Blog, Interview Questions
Apache Spark is an open-source distributed general-purpose cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. It is used for processing and analyzing large amounts of data. Just like Hadoop MapReduce, it works with the system to distribute data across the cluster and process the data in parallel.
1. Compare Hadoop and Spark.
We will compare Hadoop MapReduce and Spark based on the following aspects:
Apache Spark vs. Hadoop
| Feature Criteria | Apache Spark | Hadoop |
| --- | --- | --- |
| Speed | 100 times faster than Hadoop | Decent speed |
| Processing | Real-time & batch processing | Batch processing only |
| Difficulty | Easy because of high-level modules | Tough to learn |
| Recovery | Allows recovery of partitions | Fault-tolerant |
| Interactivity | Has interactive modes | No interactive mode except Pig & Hive |
2. What is Apache Spark?
- Apache Spark is an open-source cluster computing framework for real-time processing.
- It has a thriving open-source community and is the most active Apache project at the moment.
- Spark provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance.
Spark is one of the most successful projects in the Apache Software Foundation. Spark has clearly evolved as the market leader for Big Data processing. Many organizations run Spark on clusters with thousands of nodes. Today, Spark is being adopted by major players like Amazon, eBay, and Yahoo!
3. List the key features of Apache Spark.
The following are the key features of Apache Spark:
- Multiple Format Support
- Lazy Evaluation
- Real Time Computation
- Hadoop Integration
- Machine Learning
4. What are the languages supported by Apache Spark and which is the most popular one?
Apache Spark supports the following four languages: Scala, Java, Python, and R. Among these, Scala and Python have interactive shells for Spark: the Scala shell can be accessed through ./bin/spark-shell and the Python shell through ./bin/pyspark. Scala is the most widely used of them, since Spark itself is written in Scala.
5. What are the benefits of Spark over MapReduce?
Spark has the following benefits over MapReduce:
- Due to the availability of in-memory processing, Spark performs processing around 10 to 100 times faster than Hadoop MapReduce, whereas MapReduce makes use of persistent storage for all of its data processing tasks.
- Unlike Hadoop, Spark provides inbuilt libraries to perform multiple tasks from the same core, such as batch processing, streaming, machine learning, and interactive SQL queries. However, Hadoop supports only batch processing.
- Hadoop is highly disk-dependent whereas Spark promotes caching and in-memory data storage.
- Spark is capable of performing computations multiple times on the same dataset. This is called iterative computation, and there is no iterative computing implemented by Hadoop.
6. What is YARN?
Similar to Hadoop, YARN is one of the key features in Spark, providing a central resource management platform to deliver scalable operations across the cluster. YARN is a distributed container manager, like Mesos for example, whereas Spark is a data processing tool. Spark can run on YARN, the same way Hadoop MapReduce can run on YARN. Running Spark on YARN necessitates a binary distribution of Spark that is built with YARN support.
7. Do you need to install Spark on all nodes of the YARN cluster?
No, because Spark runs on top of YARN and its execution is independent of where it is installed. Spark has options to use YARN when dispatching jobs to the cluster, rather than its own built-in manager or Mesos. Further, there are several configurations for running on YARN, including master, deploy-mode, driver-memory, executor-memory, executor-cores, and queue.
8. Is there any benefit of learning MapReduce if Spark is better than MapReduce?
Yes, MapReduce is a paradigm used by many big data tools including Spark as well. It is extremely relevant to use MapReduce when the data grows bigger and bigger. Most tools like Pig and Hive convert their queries into MapReduce phases to optimize them better.
9. Explain the concept of Resilient Distributed Dataset (RDD).
RDD stands for Resilient Distributed Dataset. An RDD is a fault-tolerant collection of operational elements that run in parallel. The partitioned data in an RDD is immutable and distributed in nature. There are primarily two types of RDD:
- Parallelized Collections: existing collections in the driver program that are parallelized so their partitions run in parallel with one another.
- Hadoop Datasets: RDDs that perform functions on each file record in HDFS or other storage systems.
RDDs are basically parts of data that are stored in the memory distributed across many nodes. RDDs are lazily evaluated in Spark. This lazy evaluation is what contributes to Spark’s speed.
10. How do we create RDDs in Spark?
Spark provides two methods to create an RDD:
1. By parallelizing a collection in your Driver program:
- val DataArray = Array(2,4,6,8,10)
- val DataRDD = sc.parallelize(DataArray)
2. By loading an external dataset from external storage like HDFS, HBase, or a shared file system.
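The second method, loading an external dataset, might look like the following sketch in the Scala shell (the HDFS path is a placeholder):

```scala
// Creating an RDD from an external dataset, where sc is the SparkContext
// provided by spark-shell. The path below is a placeholder.
val linesRDD = sc.textFile("hdfs:///path/to/input.txt")

// Each element of linesRDD is one line of the file.
val lineLengths = linesRDD.map(_.length)
```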
11. What is Executor Memory in a Spark application?
Every Spark application has the same fixed heap size and fixed number of cores for each Spark executor. The heap size is what is referred to as the Spark executor memory, which is controlled with the spark.executor.memory property or the --executor-memory flag. Every Spark application has one executor on each worker node. The executor memory is basically a measure of how much memory of the worker node the application will utilize.
12. Define Partitions in Apache Spark.
As the name suggests, a partition is a smaller and logical division of data, similar to a ‘split’ in MapReduce. It is a logical chunk of a large distributed data set. Partitioning is the process of deriving logical units of data to speed up processing. Spark manages data using partitions that help parallelize distributed data processing with minimal network traffic for sending data between executors. By default, Spark tries to read data into an RDD from the nodes that are close to it. Since Spark usually accesses distributed partitioned data, it creates partitions to hold the data chunks and optimize transformation operations. Everything in Spark is a partitioned RDD.
13. What operations does RDD support?
RDD (Resilient Distributed Dataset) is the main logical data unit in Spark. An RDD is a distributed collection of objects. Distributed means each RDD is divided into multiple partitions. Each of these partitions can reside in memory or be stored on the disks of different machines in a cluster. RDDs are immutable (read-only) data structures. You can’t change the original RDD, but you can always transform it into a different RDD with all the changes you want.
RDDs support two types of operations: transformations and actions.
Transformations: Transformations create a new RDD from an existing RDD, like the map, reduceByKey, and filter operations we just saw. Transformations are executed on demand; that means they are computed lazily.
Actions: Actions return the final results of RDD computations. An action triggers execution using the lineage graph to load the data into the original RDD, carry out all intermediate transformations, and return the final results to the driver program or write them out to the file system.
14. What do you understand by Transformations in Spark?
Transformations are functions applied to an RDD, resulting in another RDD. A transformation does not execute until an action occurs. map() and filter() are examples of transformations: the former applies the function passed to it to each element of the RDD and results in another RDD, while filter() creates a new RDD by selecting elements from the current RDD that pass the function argument.
For example, a rawData RDD can be transformed into a moviesData RDD this way. Transformations are lazily evaluated.
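A minimal sketch of such a transformation chain, with hypothetical names (rawData, moviesData) and an assumed CSV layout:

```scala
// rawData is transformed into moviesData via map() and filter().
// The file path and field positions are illustrative assumptions.
val rawData    = sc.textFile("hdfs:///path/to/movies.csv")
val moviesData = rawData
  .map(line => line.split(","))         // map(): applied to every element
  .filter(fields => fields.length > 2)  // filter(): keeps matching elements
// Nothing has executed yet -- transformations are lazy until an action runs.
```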
15. Define Actions in Spark.
An action helps in bringing back the data from RDD to the local machine. An action’s execution is the result of all previously created transformations. Actions triggers execution using a lineage graph to load the data into the original RDD, carry out all intermediate transformations and return final results to the Driver program or write it out to the file system.
reduce() is an action that applies the function passed to it repeatedly until one value is left. take() is an action that brings a number of values from the RDD back to the local node.
For example, a moviesData RDD can be saved into a text file called MoviesData.txt using the saveAsTextFile() action.
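A small sketch of these actions, assuming the spark-shell SparkContext sc (the output path is a placeholder):

```scala
val nums = sc.parallelize(Array(1, 2, 3, 4, 5))

// reduce() applies the function repeatedly until one value is left.
val sum = nums.reduce((a, b) => a + b)   // 15

// take(n) brings the first n elements back to the driver.
val firstThree = nums.take(3)            // Array(1, 2, 3)

// saveAsTextFile() writes the RDD out to a file system.
nums.saveAsTextFile("hdfs:///path/to/output")
```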
16. Define the functions of Spark Core.
Spark Core is the base engine for large-scale parallel and distributed data processing. The core is the distributed execution engine, and the Java, Scala, and Python APIs offer a platform for distributed ETL application development. Spark Core performs various important functions like memory management, monitoring jobs, fault tolerance, job scheduling, and interaction with storage systems. Further, additional libraries built atop the core allow diverse workloads for streaming, SQL, and machine learning. It is responsible for:
- Memory management and fault recovery
- Scheduling, distributing and monitoring jobs on a cluster
- Interacting with storage systems
17. What do you understand by Pair RDD?
Apache defines the PairRDDFunctions class as:
- class PairRDDFunctions[K, V] extends Logging with HadoopMapReduceUtil with Serializable
Special operations can be performed on RDDs in Spark using key/value pairs and such RDDs are referred to as Pair RDDs. Pair RDDs allow users to access each key in parallel. They have a reduceByKey() method that collects data based on each key and a join() method that combines different RDDs together, based on the elements having the same key.
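A short sketch of reduceByKey() and join() on pair RDDs, with made-up sample data:

```scala
// Pair RDDs: RDDs of key/value tuples.
val sales  = sc.parallelize(Seq(("apples", 3), ("oranges", 5), ("apples", 2)))
val origin = sc.parallelize(Seq(("apples", "WA"), ("oranges", "FL")))

// reduceByKey() aggregates values per key: ("apples", 5), ("oranges", 5)
val totals = sales.reduceByKey(_ + _)

// join() combines two pair RDDs on matching keys: ("apples", (5, "WA")), ...
val joined = totals.join(origin)
```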
18. Name the components of Spark Ecosystem.
- Spark Core: Base engine for large-scale parallel and distributed data processing
- Spark Streaming: Used for processing real-time streaming data
- Spark SQL: Integrates relational processing with Spark’s functional programming API
- GraphX: Graphs and graph-parallel computation
- MLlib: Performs machine learning in Apache Spark
19. How is Streaming implemented in Spark? Explain with examples.
Spark Streaming is used for processing real-time streaming data; thus it is a useful addition to the core Spark API. It enables high-throughput and fault-tolerant stream processing of live data streams. The fundamental stream unit is the DStream, which is basically a series of RDDs (Resilient Distributed Datasets) used to process real-time data. Data from different sources like Flume and HDFS is streamed and finally processed into file systems, live dashboards, and databases. It is similar to batch processing in that the input data stream is divided into batches.
20. Is there an API for implementing graphs in Spark?
GraphX is the Spark API for graphs and graph-parallel computation. Thus, it extends the Spark RDD with a Resilient Distributed Property Graph.
The property graph is a directed multigraph which can have multiple edges in parallel. Every edge and vertex has user-defined properties associated with it. Here, the parallel edges allow multiple relationships between the same vertices. At a high level, GraphX extends the Spark RDD abstraction by introducing the Resilient Distributed Property Graph: a directed multigraph with properties attached to each vertex and edge.
To support graph computation, GraphX exposes a set of fundamental operators (e.g., subgraph, joinVertices, and mapReduceTriplets) as well as an optimized variant of the Pregel API. In addition, GraphX includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.
21. What is PageRank in GraphX?
PageRank measures the importance of each vertex in a graph, assuming an edge from u to v represents an endorsement of v’s importance by u. For example, if a Twitter user is followed by many others, the user will be ranked highly.
GraphX comes with static and dynamic implementations of PageRank as methods on the PageRank Object. Static PageRank runs for a fixed number of iterations, while dynamic PageRank runs until the ranks converge (i.e., stop changing by more than a specified tolerance). GraphOps allows calling these algorithms directly as methods on Graph.
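A sketch of calling both variants on a graph loaded from an edge-list file (the path is a placeholder):

```scala
import org.apache.spark.graphx.GraphLoader

// Load a follower graph from an edge-list file.
val graph = GraphLoader.edgeListFile(sc, "hdfs:///path/to/followers.txt")

// Static PageRank: runs for a fixed number of iterations.
val staticRanks = graph.staticPageRank(10).vertices

// Dynamic PageRank: runs until ranks converge within the given tolerance.
val dynamicRanks = graph.pageRank(0.0001).vertices
```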
22. How is machine learning implemented in Spark?
MLlib is a scalable machine learning library provided by Spark. It aims to make machine learning easy and scalable with common learning algorithms and use cases like clustering, regression, filtering, dimensionality reduction, and the like.
23. Is there a module to implement SQL in Spark? How does it work?
Spark SQL is a new module in Spark which integrates relational processing with Spark’s functional programming API. It supports querying data either via SQL or via the Hive Query Language. For those of you familiar with RDBMS, Spark SQL will be an easy transition from your earlier tools where you can extend the boundaries of traditional relational data processing.
Spark SQL integrates relational processing with Spark’s functional programming. Further, it provides support for various data sources and makes it possible to weave SQL queries with code transformations thus resulting in a very powerful tool.
The following are the four libraries of Spark SQL.
- Data Source API
- DataFrame API
- Interpreter & Optimizer
- SQL Service
24. What is a Parquet file?
Parquet is a columnar format file supported by many other data processing systems. Spark SQL performs both read and write operations with Parquet file and considers it to be one of the best big data analytics formats so far.
Parquet is a columnar format, supported by many data processing systems. The advantages of having a columnar storage are as follows:
- Columnar storage limits IO operations.
- It can fetch specific columns that you need to access.
- Columnar storage consumes less space.
- It gives better-summarized data and follows type-specific encoding.
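A sketch of reading and writing Parquet with Spark SQL, assuming a SparkSession named spark as provided by spark-shell (paths are placeholders):

```scala
// Load some data (JSON here) and write it out in columnar Parquet format.
val df = spark.read.json("hdfs:///path/to/people.json")
df.write.parquet("hdfs:///path/to/people.parquet")

// Read it back; Parquet files carry their own schema.
val parquetDF = spark.read.parquet("hdfs:///path/to/people.parquet")
parquetDF.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people").show()
```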
25. How can Apache Spark be used alongside Hadoop?
The best part of Apache Spark is its compatibility with Hadoop. As a result, this makes for a very powerful combination of technologies. Here, we will be looking at how Spark can benefit from the best of Hadoop. Using Spark and Hadoop together helps us to leverage Spark’s processing to utilize the best of Hadoop’s HDFS and YARN.
Hadoop components can be used alongside Spark in the following ways:
- HDFS: Spark can run on top of HDFS to leverage the distributed replicated storage.
- MapReduce: Spark can be used along with MapReduce in the same Hadoop cluster or separately as a processing framework.
- YARN: Spark applications can also be run on YARN (HadoopNextGen).
- Batch & Real Time Processing: MapReduce and Spark are used together where MapReduce is used for batch processing and Spark for real-time processing.
26. What is RDD Lineage?
Spark does not support data replication in memory, and thus, if any data is lost, it is rebuilt using RDD lineage. RDD lineage is a process that reconstructs lost data partitions. The best part is that an RDD always remembers how to build itself from other datasets.
27. What is Spark Driver?
Spark Driver is the program that runs on the master node and declares transformations and actions on data RDDs. In simple terms, the driver in Spark creates the SparkContext, connected to a given Spark master.
The driver also delivers the RDD graphs to the master, where the standalone cluster manager runs.
28. What file systems does Spark support?
The following three file systems are supported by Spark:
- Hadoop Distributed File System (HDFS).
- Local File system.
- Amazon S3
29. List the functions of Spark SQL.
Spark SQL is capable of:
- Loading data from a variety of structured sources.
- Querying data using SQL statements, both inside a Spark program and from external tools that connect to Spark SQL through standard database connectors (JDBC/ODBC). For instance, using business intelligence tools like Tableau.
- Providing rich integration between SQL and regular Python/Java/Scala code, including the ability to join RDDs and SQL tables, expose custom functions in SQL, and more.
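A short sketch of that SQL/code integration, building a DataFrame from regular Scala data and querying it with SQL (names and data are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SqlSketch").getOrCreate()
import spark.implicits._

// Build a DataFrame from ordinary Scala data and expose it as a SQL table.
val people = Seq(("Alice", 34), ("Bob", 28)).toDF("name", "age")
people.createOrReplaceTempView("people")

// SQL statements and DataFrame code operate on the same data interchangeably.
spark.sql("SELECT name FROM people WHERE age > 30").show()
```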
30. What is Spark Executor?
When SparkContext connects to a cluster manager, it acquires an Executor on nodes in the cluster. Executors are Spark processes that run computations and store the data on the worker node. The final tasks by SparkContext are transferred to executors for their execution.
31. Name types of Cluster Managers in Spark.
The Spark framework supports three major types of Cluster Managers:
- Standalone: A basic manager to set up a cluster.
- Apache Mesos: Generalized/commonly-used cluster manager, also runs HadoopMapReduce and other applications.
- YARN: Responsible for resource management in Hadoop.
32. What do you understand about worker nodes?
Worker node refers to any node that can run the application code in a cluster. The driver program must listen for and accept incoming connections from its executors and must be network addressable from the worker nodes.
The worker node is basically the slave node. The master node assigns work, and the worker node actually performs the assigned tasks. Worker nodes process the data stored on the node and report the resources to the master. Based on resource availability, the master schedules tasks.
33. Illustrate some demerits of using Spark.
The following are some of the demerits of using Apache Spark:
- Since Spark utilizes more storage space compared to Hadoop and MapReduce, there may arise certain problems.
- Developers need to be careful while running their applications in Spark.
- Instead of running everything on a single node, the work must be distributed over multiple clusters.
- Spark’s “in-memory” capability can become a bottleneck when it comes to cost-efficient processing of big data.
- Spark consumes a huge amount of memory when compared to Hadoop.
34. List some use cases where Spark outperforms Hadoop in processing.
- Sensor Data Processing: Apache Spark’s “In-memory” computing works best here, as data is retrieved and combined from different sources.
- Real Time Processing: Spark is preferred over Hadoop for real-time querying of data. e.g. Stock Market Analysis, Banking, Healthcare, Telecommunications, etc.
- Stream Processing: For processing logs and detecting frauds in live streams for alerts, Apache Spark is the best solution.
- Big Data Processing: Spark runs up to 100 times faster than Hadoop when it comes to processing medium and large-sized datasets.
35. What is a Sparse Vector?
A sparse vector has two parallel arrays; one for indices and the other for values. These vectors are used for storing non-zero entries to save space.
36. Can you use Spark to access and analyze data stored in Cassandra databases?
Yes, it is possible if you use Spark Cassandra Connector.To connect Spark to a Cassandra cluster, a Cassandra Connector will need to be added to the Spark project. In the setup, a Spark executor will talk to a local Cassandra node and will only query for local data. It makes queries faster by reducing the usage of the network to send data between Spark executors (to process data) and Cassandra nodes (where data lives).
37. Is it possible to run Apache Spark on Apache Mesos?
Yes, Apache Spark can be run on the hardware clusters managed by Mesos. In a standalone cluster deployment, the cluster manager in the below diagram is a Spark master instance. When using Mesos, the Mesos master replaces the Spark master as the cluster manager. Mesos determines what machines handle what tasks. Because it takes into account other frameworks when scheduling these many short-lived tasks, multiple frameworks can coexist on the same cluster without resorting to a static partitioning of resources.
38. How can Spark be connected to Apache Mesos?
To connect Spark with Mesos:
- Configure the spark driver program to connect to Mesos.
- Spark binary package should be in a location accessible by Mesos.
- Install Apache Spark in the same location as that of Apache Mesos and configure the property ‘spark.mesos.executor.home’ to point to the location where it is installed.
39. How can you minimize data transfers when working with Spark?
Minimizing data transfers and avoiding shuffling helps write Spark programs that run fast and reliably. The various ways in which data transfers can be minimized when working with Apache Spark are:
- Using Broadcast Variable- Broadcast variable enhances the efficiency of joins between small and large RDDs.
- Using Accumulators – Accumulators help update the values of variables in parallel while executing.
The most common way is to avoid ByKey operations, repartition, or any other operations that trigger shuffles.
40. What are broadcast variables?
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
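A minimal sketch of a broadcast variable used as a lookup table (the sample data is made up):

```scala
// Broadcast the lookup table once to every executor instead of per task.
val lookup = sc.broadcast(Map("US" -> "United States", "IN" -> "India"))

val codes = sc.parallelize(Seq("US", "IN", "US"))
// Tasks read the broadcast value via .value; they never modify it.
val names = codes.map(c => lookup.value.getOrElse(c, "unknown")).collect()
```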
41. Explain accumulators in Apache Spark.
Accumulators are variables that are only added through an associative and commutative operation. They are used to implement counters or sums. Tracking accumulators in the UI can be useful for understanding the progress of running stages. Spark natively supports numeric accumulators. We can create named or unnamed accumulators.
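A sketch of a named numeric accumulator used as a counter (the data is illustrative):

```scala
// A named long accumulator; named accumulators show up in the Spark UI.
val errorCount = sc.longAccumulator("errorCount")

sc.parallelize(Seq("ok", "error", "ok", "error")).foreach { record =>
  if (record == "error") errorCount.add(1)   // add() is associative and commutative
}
println(errorCount.value)   // read back on the driver
```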
42. Why is there a need for broadcast variables when working with Apache Spark?
Broadcast variables are read only variables, present in-memory cache on every machine. When working with Spark, usage of broadcast variables eliminates the necessity to ship copies of a variable for every task, so data can be processed faster. Broadcast variables help in storing a lookup table inside the memory which enhances the retrieval efficiency when compared to an RDD lookup().
43. How can you trigger automatic clean-ups in Spark to handle accumulated metadata?
You can trigger the clean-ups by setting the parameter ‘spark.cleaner.ttl’ or by dividing the long running jobs into different batches and writing the intermediary results to the disk.
44. What is the significance of Sliding Window operation?
In Spark Streaming, a sliding window controls which batches of data a computation spans. The Spark Streaming library provides windowed computations, where the transformations on RDDs are applied over a sliding window of data. Whenever the window slides, the RDDs that fall within the particular window are combined and operated upon to produce new RDDs of the windowed DStream.
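A sketch of a windowed word count over a socket stream (the host, port, and intervals are assumptions; window and slide durations must be multiples of the batch interval):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Batch interval of 10 seconds; the source is a placeholder socket stream.
val ssc   = new StreamingContext(sc, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)

// Count words over the last 30 seconds of data, recomputed every 10 seconds.
val windowedCounts = lines
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
```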
45. What is a DStream in Apache Spark?
Discretized Stream (DStream) is the basic abstraction provided by Spark Streaming. It is a continuous stream of data. It is received from a data source or from a processed data stream generated by transforming the input stream. Internally, a DStream is represented by a continuous series of RDDs and each RDD contains data from a certain interval. Any operation applied on a DStream translates to operations on the underlying RDDs.
DStreams can be created from various sources like Apache Kafka, HDFS, and Apache Flume. DStreams have two operations:
- Transformations that produce a new DStream.
- Output operations that write data to an external system.
There are many DStream transformations possible in Spark Streaming. Let us look at filter(func). filter(func) returns a new DStream by selecting only the records of the source DStream on which func returns true.
46. Explain Caching in Spark Streaming.
DStreams allow developers to cache/ persist the stream’s data in memory. This is useful if the data in the DStream will be computed multiple times. This can be done using the persist() method on a DStream. For input streams that receive data over the network (such as Kafka, Flume, Sockets, etc.), the default persistence level is set to replicate the data to two nodes for fault-tolerance.
47. When running Spark applications, is it necessary to install Spark on all the nodes of the YARN cluster?
Spark need not be installed on all nodes when running a job under YARN or Mesos, because Spark can execute on top of YARN or Mesos clusters without requiring any change to the cluster.
48. What are the various data sources available in Spark SQL?
Parquet file, JSON datasets and Hive tables are the data sources available in Spark SQL.
49. What are the various levels of persistence in Apache Spark?
Apache Spark automatically persists the intermediary data from various shuffle operations; however, it is often suggested that users call the persist() method on an RDD they plan to reuse. Spark has various persistence levels to store the RDDs on disk or in memory, or as a combination of both, with different replication levels.
The various storage/persistence levels in Spark are:
- MEMORY_ONLY: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they’re needed. This is the default level.
- MEMORY_AND_DISK: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don’t fit on disk, and read them from there when they’re needed.
- MEMORY_ONLY_SER: Store RDD as serialized Java objects (one byte array per partition).
- MEMORY_AND_DISK_SER: Similar to MEMORY_ONLY_SER, but spill partitions that don’t fit in memory to disk instead of recomputing them on the fly each time they’re needed.
- DISK_ONLY: Store the RDD partitions only on disk.
- OFF_HEAP: Similar to MEMORY_ONLY_SER, but store the data in off-heap memory.
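These levels can be sketched in code as follows (the file paths are placeholders):

```scala
import org.apache.spark.storage.StorageLevel

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY), the default level.
val logs = sc.textFile("hdfs:///path/to/logs.txt").cache()

// A storage level can only be set once per RDD, so choose it up front.
val events = sc.textFile("hdfs:///path/to/events.txt")
events.persist(StorageLevel.MEMORY_AND_DISK_SER)
```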
50. Does Apache Spark provide checkpoints?
Yes. Checkpoints are similar to checkpoints in gaming: they let an application run 24/7 and make it resilient to failures unrelated to the application logic.
Lineage graphs are always useful for recovering RDDs from a failure, but this is generally time-consuming if the RDDs have long lineage chains. Spark has an API for checkpointing (i.e., a REPLICATE flag to persist). However, the decision of which data to checkpoint is made by the user. Checkpoints are useful when the lineage graphs are long and have wide dependencies.
51. How Spark uses Akka?
Spark uses Akka basically for scheduling. All the workers request tasks from the master after registering, and the master simply assigns the tasks. Here, Spark uses Akka for messaging between the workers and the master.
52. What do you understand by Lazy Evaluation?
Spark is intelligent in the manner in which it operates on data. When you tell Spark to operate on a given dataset, it heeds the instructions and makes a note of them, but it does nothing unless asked for the final result. When a transformation like map() is called on an RDD, the operation is not performed immediately. Transformations in Spark are not evaluated until you perform an action. This helps optimize the overall data processing workflow.
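A minimal sketch of lazy evaluation in action (the path is a placeholder):

```scala
// Nothing runs when these transformations are declared...
val lines    = sc.textFile("hdfs:///path/to/input.txt")
val filtered = lines.filter(_.contains("ERROR"))

// ...computation is triggered only by an action such as count().
val errorCount = filtered.count()
```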
53. What do you understand about SchemaRDD in Apache Spark RDD?
SchemaRDD is an RDD that consists of row objects (wrappers around the basic string or integer arrays) with schema information about the type of data in each column.
SchemaRDD was designed as an attempt to make life easier for developers in their daily routines of code debugging and unit testing on SparkSQL core modules. The idea can boil down to describing the data structures inside RDD using a formal description similar to the relational database schema. On top of all basic functions provided by common RDD APIs, SchemaRDD also provides some straightforward relational query interface functions that are realized through SparkSQL.
It has since been officially renamed to the DataFrame API in Spark.
54. How is Spark SQL different from HQL and SQL?
Spark SQL is a special component on the Spark Core engine that supports SQL and the Hive Query Language without changing any syntax. It is possible to join SQL tables and HQL tables in Spark SQL.
55. Explain a scenario where you will be using Spark Streaming.
When it comes to Spark Streaming, the data is streamed in real-time onto our Spark program.
Twitter Sentiment Analysis is a real-life use case of Spark Streaming. Trending Topics can be used to create campaigns and attract a larger audience. It helps in crisis management, service adjusting and target marketing.
Sentiment refers to the emotion behind a social media mention online. Sentiment Analysis is categorizing the tweets related to a particular topic and performing data mining using Sentiment Automation Analytics Tools.
Spark Streaming can be used to gather live tweets from around the world into the Spark program. This stream can be filtered using Spark SQL, and then we can filter the tweets based on sentiment. The filtering logic can be implemented using MLlib, where a model learns from existing tweet data to classify sentiment.
56. What are the main features of Apache Spark?
Following are the main features of Apache Spark:
- Integration with Hadoop.
- Includes an interactive shell for Scala, the language in which Spark is written.
- Resilient Distributed Datasets are cached between compute nodes in a cluster.
- Offers various analytical tools for real-time analysis, graphic processing, and interactive query analysis.
57. Define RDD.
Resilient Distributed Datasets (RDDs) represent a fault-tolerant collection of elements that operate in parallel. The data in an RDD is partitioned, distributed, and immutable. There are mainly two types of RDDs.
- Parallelized collections: The existing RDDs operating parallel to each other.
- Hadoop datasets: The dataset that performs a function for each file record in HDFS or other storage systems.
58. What is the use of the Spark engine?
The objective of the Spark engine is to plan, distribute, and monitor data applications in a cluster.
59. What is the Partition?
Partition is a process of obtaining logical units of data for speeding up data processing. In simple words, partitions are smaller and logical data separation is similar to a ‘split’ in MapReduce.
60. What types of operations does an RDD support?
Transformations and actions are the two types of operations supported by RDD.
61. What do you mean by transformations in Spark?
In simple words, transformations are functions applied to an RDD. They do not execute until an action is performed. map() and filter() are some examples of transformations.
While the map() function is applied to each element of the RDD and produces a new RDD, the filter() function creates a new RDD by selecting the elements that pass the function argument from the current RDD.
62. Explain Actions.
Actions in Spark make it possible to bring data from an RDD to the local machine. reduce() and take() are examples of actions. The reduce() function applies the passed function repeatedly until only one value is left, while take() brings a number of values from the RDD back to the local node.
63. Explain the functions supported by Spark Core.
Various functions are supported by Spark Core like job scheduling, fault-tolerance, memory management, monitoring jobs and much more.
64. Define RDD Lineage?
Spark does not support data replication in memory, so if you lose data, it is reconstructed using RDD lineage. RDD lineage is a way to reconstruct lost data partitions; each RDD always remembers how it was created from other datasets.
65. What does Spark Driver do?
The Spark driver is a program that runs on the master node and declares transformations and actions on data RDDs. In a nutshell, the driver in Spark creates the SparkContext, connected to the given Spark master. It also delivers the RDD graphs to the master, where the cluster manager operates.
66. What is Hive?
Apache Hive is a data warehouse tool built on top of Hadoop that lets you query data using an SQL-like language called HiveQL. By default, Hive supports Spark on YARN mode as its execution engine.
Hive execution is configured for Spark through:

```
hive> set spark.home=/location/to/sparkHome;
hive> set hive.execution.engine=spark;
```
67. List the most frequently used Spark ecosystems.
- Spark SQL (Shark) – for SQL developers
- Spark Streaming – for processing live data streams
- GraphX – for graph generation and computation
- MLlib – for machine learning algorithms
- SparkR – for promoting R programming in the Spark engine
68. Explain Spark Streaming.
Spark Streaming is an extension of the core Spark API that enables processing of live data streams. Data from various sources such as Kafka, Flume, and Kinesis is processed and then pushed to file systems, live dashboards, and databases. It resembles batch processing in that the input data stream is divided into batch-like chunks before processing.
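The micro-batch model can be sketched in plain Python (illustration only; the function name `micro_batches` is hypothetical, and real Spark Streaming cuts the stream by time interval rather than by count):

```python
# Sketch of the micro-batch model: a continuous stream is cut into small
# batches, and each batch is then processed like a normal batch job.
def micro_batches(stream, batch_size):
    """Yield consecutive fixed-size batches from an input stream."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:                  # flush the final, possibly smaller batch
        yield batch

events = range(7)
batches = list(micro_batches(events, 3))
```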
69. What is GraphX?
Spark uses GraphX for graph processing and graph building. GraphX lets programmers reason about graph-structured big data at scale.
70. What does MLlib do?
Spark supports MLlib, a scalable machine learning library. Its objective is to make machine learning easy and scalable, with common learning algorithms and use cases such as regression, collaborative filtering, clustering, dimensionality reduction, and the like.
71. Define Spark SQL?
Spark SQL, also known as Shark, is used for processing structured data. Using this module, Spark executes relational SQL queries on the data. It supports SchemaRDD, which consists of row objects and schema objects that represent the data type of each column in a row; a SchemaRDD is similar to a table in a relational database.
72. What are the different cluster managers available in Apache Spark?
- Standalone Mode: By default, applications submitted to the standalone mode cluster will run in FIFO order, and each application will try to use all available nodes. You can launch a standalone cluster either manually, by starting a master and workers by hand, or by using the provided launch scripts. It is also possible to run these daemons on a single machine for testing.
- Apache Mesos: Apache Mesos is an open-source project to manage computer clusters, and can also run Hadoop applications. The advantages of deploying Spark with Mesos include dynamic partitioning between Spark and other frameworks as well as scalable partitioning between multiple instances of Spark.
- Hadoop YARN: Apache YARN is the cluster resource manager of Hadoop 2. Spark can be run on YARN as well.
- Kubernetes: Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications.
73. Which file systems are supported by Apache Spark?
- Hadoop Distributed File System (HDFS)
- Amazon S3
- Local File system
74. Define Partitions.
As the name suggests, a partition is a smaller and logical division of data similar to a ‘split’ in MapReduce. Partitioning is the process of deriving logical units of data to speed up data processing. Everything in Spark is a partitioned RDD.
75. Name a few functions of Spark SQL.
Following are the functions of Spark SQL:
- Loads data from various structured sources.
- Query data using SQL elements.
- Provides advanced integration between regular Python/Java/Scala code and SQL.
76. What are the advantages of using Spark over MapReduce?
- Spark processes data 10-100x faster than MapReduce due to the availability of in-memory processing, whereas MapReduce uses persistent storage for data processing tasks.
- Spark offers built-in libraries to execute multiple tasks using machine learning, streaming, batch processing, and more, whereas Hadoop supports only batch processing.
- Spark supports in-memory data storage and caching, but Hadoop is highly disk-dependent.
77. Is there any benefit of learning MapReduce?
Yes, MapReduce is a standard used by many big data tools, including Apache Spark. As data grows, it becomes extremely important to use MapReduce. Many tools, such as Pig and Hive, convert queries to the MapReduce phases for optimizing them.
78. What is Lazy Evaluation?
Creating an RDD from an existing RDD is called a transformation, and the new RDD will not be materialized until you call an action. Spark delays evaluation until the result is actually needed: in interactive use you may type something wrong and have to correct it, and evaluating eagerly at every step would create unnecessary delays. Lazy evaluation also lets Spark optimize the required calculations as a whole and take intelligent decisions that are not possible with line-by-line code execution, and it helps Spark recover from failures and slow workers.
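Lazy evaluation can be sketched with a Python generator (illustration only): as with Spark transformations, nothing runs when the pipeline is defined, and the work happens only when a result, the "action", is actually requested.

```python
# Sketch of lazy evaluation: defining the pipeline does no work.
calls = []

def transform(x):
    calls.append(x)        # record that the function really ran
    return x * 2

nums = [1, 2, 3]
pipeline = (transform(x) for x in nums)   # lazily defined, like a transformation

assert calls == []         # nothing has executed yet
result = list(pipeline)    # the "action" forces evaluation of the whole chain
```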
79. Can we use Spark for accessing and analyzing the data stored in Cassandra Database?
Yes, we can use Spark for accessing and analyzing the data stored using the Spark Cassandra Connector. For connecting Spark to a Cassandra cluster, we need to add the Cassandra Connector to the Spark project.
80. How to connect Spark with Apache Mesos?
1. First, configure the Spark driver program to connect to Mesos.
2. Next, Spark binary package must be in a location available to Mesos.
3. Finally, install Apache Spark in the same location as Apache Mesos, and configure the property ‘spark.mesos.executor.home’ to point to the location where it is installed.
81. Define broadcast variables.
Broadcast variables enable developers to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They give every node a copy of a large input dataset in an efficient manner. Spark distributes broadcast variables using efficient broadcast algorithms to reduce communication cost.
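The idea can be sketched in plain Python (illustration only; the function name `run_worker` is hypothetical and this is not the real Spark broadcast API): the read-only table is shipped to a worker once, and every task on that worker reuses the same cached copy.

```python
# Sketch of the broadcast idea: one copy of a read-only lookup table per
# worker, shared by all of that worker's tasks, instead of one copy per task.
lookup = {"a": 1, "b": 2}          # the large read-only dataset

def run_worker(broadcast_table, tasks):
    # every task on this worker reads from the same cached copy
    return [broadcast_table[key] for key in tasks]

results = run_worker(lookup, ["a", "b", "a"])
```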
82. Define Transformations in Spark?
Transformations are functions applied to an RDD that create another RDD. A transformation does not execute until an action takes place. Examples of transformations are map() and filter().
83. What is the Apache Spark Machine learning library?
It is a scalable machine learning library that delivers both high-speed and high-quality algorithms.
MLlib was created to make machine learning scalable and easy. It includes implementations of various machine learning algorithms, for example clustering, regression, classification, and collaborative filtering. Some lower-level machine learning primitives, such as a generic gradient descent optimization algorithm, are also present in MLlib.
As of Apache Spark 2.0, the RDD-based API in the spark.mllib package entered maintenance mode, and the DataFrame-based API became the primary machine learning API for Spark. Therefore, MLlib will not add any new features to the RDD-based API.
MLlib is switching to the DataFrame-based API because it is more user-friendly than RDDs. Some of the benefits of using DataFrames include Spark data sources, Spark SQL DataFrame queries, the Tungsten and Catalyst optimizations, and uniform APIs across languages. The library also uses the linear algebra package Breeze, a collection of libraries for numerical computing and machine learning.
84. What are the different data sources supported by Spark SQL?
1. Parquet file
2. JSON datasets
3. Hive table
85. Explain the features of Apache Spark because of which it is superior to Apache MapReduce?
Hadoop is designed for batch processing, which is very efficient for processing high-volume data. Hadoop MapReduce is a batch-oriented processing tool: it takes a large dataset as input, processes it, and produces a result.
Batch processing is essentially processing data at rest, taking a large amount of data at once and producing an output. The MapReduce process is slower than Spark because it produces a lot of intermediary data on disk.
Spark supports batch processing as well as stream processing. Spark Streaming processes data streams in micro-batches; micro-batches follow an essentially ‘collect, then process’ computational model. Spark processes data faster than MapReduce because it caches input data in memory as RDDs.
86. Why is Apache Spark faster than Apache Hadoop?
Apache Spark is faster than Apache Hadoop due to below reasons:
- Apache Spark provides in-memory computing. Spark is designed to transform data In-memory and hence reduces time for disk I/O. While MapReduce writes intermediate results back to Disk and reads it back.
- Spark utilizes a directed acyclic graph (DAG), which helps it perform all the optimization and computation in a single stage rather than in the multiple stages of the MapReduce model.
- The Apache Spark core is developed in the Scala programming language. Scala provides immutable collections and built-in support for concurrent execution, whereas in Java explicit threads are needed to achieve parallel execution.
87. List down the languages supported by Apache Spark.
Apache Spark supports Scala, Python, Java, and R.
Apache Spark is written in Scala. Many people use Scala for the purpose of development. But it also has API in Java, Python, and R.
88. Name various types of Cluster Managers in Spark.
- Apache Mesos – a commonly used, general-purpose cluster manager
- Standalone – A basic cluster manager for setting up a cluster
- YARN – Used for resource management
89. Is it possible to use Apache Spark for accessing and analyzing data stored in Cassandra databases?
Yes, it is possible to use Apache Spark for accessing as well as analyzing data stored in Cassandra databases using the Spark Cassandra Connector. The connector needs to be added to the Spark project; with it in place, a Spark executor talks to a local Cassandra node and queries only local data.
Connecting Cassandra with Apache Spark allows making queries faster by means of reducing the usage of the network for sending data between Spark executors and Cassandra nodes.
90. What do you mean by the worker node?
Any node that is capable of running the code in a cluster can be said to be a worker node. The driver program needs to listen for incoming connections and then accept the same from its executors. Additionally, the driver program must be network addressable from the worker nodes.
A worker node is basically a slave node. The master node assigns work that the worker node then performs. Worker nodes process data stored on the node and report the resources to the master node. The master node schedules tasks based on resource availability.
91. Please explain the sparse vector in Spark.
A sparse vector is used for storing non-zero entries for saving space. It has two parallel arrays:
- One for indices
- The other for values
An example of a sparse vector is as follows:
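A hedged sketch of the two-parallel-array representation in plain Python (the MLlib equivalent would be something like `Vectors.sparse(7, [0, 3, 6], [1.0, 5.0, 2.0])` from `pyspark.ml.linalg`; here only the underlying idea is shown):

```python
# A dense vector with many zeros wastes space; a sparse vector stores only
# the positions (indices) and values of the non-zero entries.
dense = [1.0, 0.0, 0.0, 5.0, 0.0, 0.0, 2.0]

indices = [i for i, v in enumerate(dense) if v != 0.0]   # positions of non-zeros
values = [v for v in dense if v != 0.0]                  # the non-zero entries
```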
92. How will you connect Apache Spark with Apache Mesos?
Step by step procedure for connecting Apache Spark with Apache Mesos is:
- Configure the Spark driver program to connect with Apache Mesos
- Put the Spark binary package in a location accessible by Mesos
- Install Apache Spark in the same location as that of the Apache Mesos
- Configure the spark.mesos.executor.home property for pointing to the location where the Apache Spark is installed
93. Can you explain how to minimize data transfers while working with Spark?
Minimizing data transfers as well as avoiding shuffling helps in writing Spark programs capable of running reliably and fast. Several ways for minimizing data transfers while working with Apache Spark are:
- Avoiding shuffle-triggering operations – minimizing the use of ByKey operations, repartition, and other operations responsible for triggering shuffles
- Using Accumulators – Accumulators provide a way for updating the values of variables while executing the same in parallel
- Using Broadcast Variables – A broadcast variable helps in enhancing the efficiency of joins between small and large RDDs
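One common shuffle-reduction technique, combining values locally before anything crosses the network (as reduceByKey does, in contrast to groupByKey), can be sketched in plain Python (illustration only, not Spark code):

```python
# Sketch of a map-side combine: values are summed per key on the local
# partition first, so only one (key, partial sum) pair per key is shuffled
# over the network instead of every raw record.
from collections import defaultdict

partition = [("a", 1), ("b", 2), ("a", 3), ("a", 4)]

local_sums = defaultdict(int)
for key, value in partition:    # local combine, no network involved
    local_sums[key] += value

shuffled = dict(local_sums)     # only 2 records leave this partition, not 4
```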
94. What are broadcast variables in Apache Spark? Why do we need them?
Rather than shipping a copy of a variable with tasks, a broadcast variable helps in keeping a read-only cached version of the variable on each machine.
Broadcast variables are also used to provide every node with a copy of a large input dataset. Apache Spark tries to distribute broadcast variables by using effective broadcast algorithms for reducing communication costs.
Using broadcast variables eradicates the need of shipping copies of a variable for each task. Hence, data can be processed quickly. Compared to an RDD lookup(), broadcast variables assist in storing a lookup table inside the memory that enhances retrieval efficiency.
95. Please provide an explanation on DStream in Spark.
DStream is a contraction for Discretized Stream. It is the basic abstraction offered by Spark Streaming and is a continuous stream of data. DStream is received from either a processed data stream generated by transforming the input stream or directly from a data source.
A DStream is represented by a continuous series of RDDs, where each RDD contains data from a certain interval. An operation applied to a DStream is analogous to applying the same operation on the underlying RDDs. A DStream has two operations:
- Output operations responsible for writing data to an external system
- Transformations resulting in the production of a new DStream
It is possible to create DStream from various sources, including Apache Kafka, Apache Flume, and HDFS. Also, Spark Streaming provides support for several DStream transformations.
96. How to attain fault tolerance in Spark?Is Apache Spark fault tolerant? if yes, how?
The basic semantics of fault tolerance in Apache Spark is, all the Spark RDDs are immutable. It remembers the dependencies between every RDD involved in the operations, through the lineage graph created in the DAG and in the event of any failure, Spark refers to the lineage graph to apply the same operations to perform the tasks.
There are two types of failures: worker failure and driver failure. If a worker fails, the executors in that worker node are killed, along with the data in their memory. Using the lineage graph, those tasks are re-executed on other worker nodes. The data is also replicated to other worker nodes to achieve fault tolerance. There are two cases:
1.Data received and replicated – Data is received from the source, and replicated across worker nodes. In the case of any failure, the data replication will help achieve fault tolerance.
2.Data received but not yet replicated – Data is received from the source but buffered for replication. In the case of any failure, the data needs to be retrieved from the source.
For stream inputs based on receivers, the fault tolerance is based on the type of receiver:
- Reliable receiver – Once the data is received and replicated, an acknowledgment is sent to the source. If the receiver fails, the source will not receive an acknowledgment for the data; when the receiver is restarted, the source resends the data, achieving fault tolerance.
- Unreliable receiver – The received data is not acknowledged to the source. In the case of a failure, the source will not know whether the data has been received, and it will not resend it, so there is data loss.
To overcome this data loss scenario, Write Ahead Logging (WAL) has been introduced in Apache Spark 1.2. With WAL enabled, the intention of the operation is first noted down in a log file, such that if the driver fails and is restarted, the noted operations in that log file can be applied to the data. For sources that read streaming data, like Kafka or Flume, receivers will be receiving the data, and those will be stored in the executor’s memory. With WAL enabled, these received data will also be stored in the log files.
WAL can be enabled by performing the below:
- Setting the checkpoint directory, by using streamingContext.checkpoint(path)
- Enabling WAL logging, by setting spark.streaming.receiver.writeAheadLog.enable to true.
97. What are the different levels of persistence in Spark?
Although the intermediary data from different shuffle operations automatically persists in Spark, it is recommended to use the persist() method on the RDD if the data is to be reused.
Apache Spark features several persistence levels for storing the RDDs on disk, memory, or a combination of the two with distinct replication levels. These various persistence levels are:
- DISK_ONLY – Stores the RDD partitions only on the disk.
- MEMORY_AND_DISK – Stores RDD as deserialized Java objects in the JVM. In case the RDD isn’t able to fit in the memory, additional partitions are stored on the disk. These are read from here each time the requirement arises.
- MEMORY_ONLY_SER – Stores RDD as serialized Java objects, with one byte array per partition.
- MEMORY_AND_DISK_SER – Identical to MEMORY_ONLY_SER with the exception of storing partitions not able to fit in the memory to the disk in place of recomputing them on the fly when required.
- MEMORY_ONLY – The default level, it stores the RDD as deserialized Java objects in the JVM. In case the RDD isn’t able to fit in the memory available, some partitions won’t be cached, resulting in recomputing the same on the fly every time they are required.
- OFF_HEAP – Works like MEMORY_ONLY_SER but stores the data in off-heap memory.
98. Can you list down the limitations of using Apache Spark?
- It doesn’t have a built-in file management system. Hence, it needs to be integrated with other platforms like Hadoop for benefitting from a file management system
- Higher latency and, consequently, lower throughput than dedicated stream-processing engines
- No support for true real-time data stream processing. The live data stream is partitioned into batches in Apache Spark and after processing are again converted into batches. Hence, Spark Streaming is micro-batch processing and not truly real-time data processing
- Lesser number of algorithms available
- Spark streaming doesn’t support record-based window criteria
- The work needs to be distributed over multiple clusters instead of running everything on a single node
- While using Apache Spark for cost-efficient processing of big data, its ‘in-memory’ ability becomes a bottleneck, since keeping large volumes of data in memory is expensive
99. What are Accumulators?
Accumulators are write-only variables that are initialized once and sent to the workers. The workers update them based on the logic written, and the updates are sent back to the driver, which aggregates or processes them.
Only the driver can access an accumulator’s value; for tasks, accumulators are write-only. For example, an accumulator can be used to count the number of errors seen in an RDD across workers.
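The accumulator contract can be sketched in plain Python (illustration only; the `Accumulator` class here is a hypothetical stand-in, not Spark's API): tasks may only add to the variable, and the aggregated value is read back on the driver side.

```python
# Sketch of an accumulator: tasks only add; the driver reads the total.
class Accumulator:
    def __init__(self, initial=0):
        self.value = initial       # readable on the "driver" side only

    def add(self, amount):         # the only operation tasks may perform
        self.value += amount

errors = Accumulator()
for record in ["ok", "bad", "ok", "bad", "bad"]:
    if record == "bad":
        errors.add(1)              # each "task" counts the errors it sees
```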
100. What is the main purpose of the Spark Engine?
The main purpose of the Spark engine is to schedule, monitor, and distribute the data application across the cluster.