25+ Best Apache Spark Interview Questions & Answers by [EXPERTS]
Last updated on 03rd Jun 2020, Blog, Interview Questions
Apache Spark is an open-source distributed general-purpose cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. It is used for processing and analyzing large amounts of data. Just like Hadoop MapReduce, it works with the cluster to distribute data across it and process the data in parallel.
1. List the key features of Apache Spark.
- Multiple Format Support
- Lazy Evaluation
- Real-Time Computation
- Hadoop Integration
- Machine Learning
2. What is Apache Spark?
Apache Spark is an open-source distributed computing system that provides an in-memory processing engine for big data processing and analytics.
3. Explain the key features of Spark.
Spark offers in-memory processing, fault tolerance, support for various data sources, advanced analytics, and a rich set of libraries for diverse tasks.
4. What is RDD?
RDD is an abbreviation for Resilient Distributed Dataset, the basic data structure in Spark that represents an immutable, distributed collection of objects.
5. What’s the distinction between RDD and DataFrame?
RDD represents unstructured data, while DataFrame represents structured data in a tabular format. DataFrames offer optimizations through the Catalyst query optimizer.
6. What is lazy evaluation in Spark?
Lazy evaluation delays the execution of transformations until an action is called, optimizing the execution plan.
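The idea can be illustrated with a plain-Python sketch (not the actual Spark API): transformations only record a plan, and the action executes it.

```python
# Plain-Python sketch of lazy evaluation (not the actual Spark API):
# transformations only record a plan; the action executes it.
class LazyDataset:
    def __init__(self, data, plan=None):
        self.data = data
        self.plan = plan or []          # recorded transformations

    def map(self, fn):                  # transformation: nothing runs yet
        return LazyDataset(self.data, self.plan + [("map", fn)])

    def filter(self, pred):             # transformation: nothing runs yet
        return LazyDataset(self.data, self.plan + [("filter", pred)])

    def collect(self):                  # action: execute the recorded plan
        rows = self.data
        for op, fn in self.plan:
            rows = [fn(r) for r in rows] if op == "map" else [r for r in rows if fn(r)]
        return rows

ds = LazyDataset([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
# No computation has happened yet; collect() triggers it.
print(ds.collect())  # [20, 30, 40]
```

Because the whole plan is known before anything runs, Spark can optimize it as a unit, for example fusing the map and filter into a single pass.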
7. What is an action in Spark?
An action is an operation that triggers the execution of transformations and returns a result or writes data to an external storage system.
8. Explain lineage in Spark.
Lineage is the sequence of transformations that were applied to build an RDD, allowing recovery in case of data loss.
9. What is a Spark executor?
Executors are worker nodes that run tasks in a Spark application. They store data in memory and perform computations.
10. How does Spark achieve fault tolerance?
Spark achieves fault tolerance through lineage information, allowing lost data to be recomputed using the original transformations.
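A minimal plain-Python sketch of the idea (not Spark internals): a lost partition is rebuilt by replaying the recorded lineage from the source data.

```python
# Plain-Python sketch (not Spark internals): a lost partition can be
# rebuilt by replaying the lineage (the chain of transformations).
source = [[1, 2], [3, 4], [5, 6]]                  # source partitions
lineage = [lambda x: x + 1, lambda x: x * 2]       # recorded transformations

def compute_partition(i):
    rows = source[i]
    for fn in lineage:                             # replay each step in order
        rows = [fn(r) for r in rows]
    return rows

partitions = [compute_partition(i) for i in range(3)]
partitions[1] = None                               # simulate losing partition 1
partitions[1] = compute_partition(1)               # recover via lineage replay
print(partitions[1])  # [8, 10]
```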
11. What is Spark SQL?
Spark SQL is a Spark module for structured data processing that provides a programming interface to work with structured data using SQL or DataFrame APIs.
12. What are accumulators in Spark?
Accumulators are variables that can be updated by multiple tasks in a parallel and fault-tolerant manner. They are mainly used for aggregating information across multiple tasks.
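The contract can be sketched in plain Python (names here are illustrative, not the Spark API): tasks may only add to the accumulator, and the driver reads the final value.

```python
# Plain-Python sketch of an accumulator (illustrative, not the Spark API):
# tasks may only add; the driver reads the aggregated value at the end.
class Accumulator:
    def __init__(self, value=0):
        self.value = value

    def add(self, n):                  # the only operation tasks may perform
        self.value += n

bad_records = Accumulator()

def task(partition):
    for rec in partition:
        if rec is None:
            bad_records.add(1)         # each task counts its bad records

for part in [[1, None, 3], [None, None, 6]]:
    task(part)

print(bad_records.value)  # 3  (read on the driver side)
```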
13. Explain the concept of data shuffling in Spark.
Data shuffling refers to the process of redistributing data across partitions, which can be a performance-intensive operation.
14. What is the purpose of SparkContext in Spark?
SparkContext is the starting point for any Spark functionality that connects to the Spark cluster.
15. How does Spark manage memory and storage?
Spark manages memory and storage using a combination of on-heap and off-heap storage and employs mechanisms like caching and spilling to disk.
16. What is a Spark driver?
The driver is the main program that creates SparkContext and coordinates tasks across the cluster.
17. How can you persist RDD in memory?
You can use the persist() or cache() methods to store RDDs in memory for faster access.
18. What is Spark Streaming?
Spark Streaming is a micro-batch processing engine that allows real-time processing and analysis of streaming data.
19. How does Spark Streaming achieve fault tolerance?
Spark Streaming achieves fault tolerance by storing the streaming data in a reliable distributed file system, such as HDFS, and maintaining metadata to recover from failures.
20. What is the role of a Spark scheduler?
The Spark scheduler allocates resources and manages task scheduling on worker nodes.
21. Explain the concept of data locality in Spark.
Data locality refers to the ability of Spark to schedule tasks on nodes where the data is already present, reducing data transfer overhead.
22. What is the significance of Catalyst in Spark?
Catalyst is Spark’s query optimizer that optimizes query plans for DataFrame operations.
23. How can you tune the performance of a Spark application?
Performance tuning in Spark involves adjusting memory configurations, optimizing data storage, and choosing appropriate transformations and actions.
24. What are Broadcast Variables in Spark?
Broadcast variables allow you to cache a read-only variable across nodes, reducing data transfer costs.
25. What is the distinction between map transformations and flatMap transformations?
map() applies a function to each element of an RDD and returns a new RDD, while flatMap() applies a function that returns an iterator and flattens the result.
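The difference is easy to see with a plain-Python illustration of the two contracts (list comprehensions standing in for the RDD operations):

```python
# Plain-Python illustration of the map() vs flatMap() contract:
words = ["hello world", "spark"]

# map: exactly one output per input element (here, a list of tokens)
mapped = [line.split() for line in words]

# flatMap: each input may yield several outputs, and the result is flattened
flat_mapped = [w for line in words for w in line.split()]

print(mapped)       # [['hello', 'world'], ['spark']]
print(flat_mapped)  # ['hello', 'world', 'spark']
```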
26. Explain the concept of partitioning in Spark.
Partitioning is the process of dividing data into smaller chunks called partitions, which are the units of parallelism in Spark.
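Hash partitioning, which Spark's HashPartitioner uses, can be sketched in plain Python: each record is routed to a partition by hashing its key.

```python
# Plain-Python sketch of hash partitioning (Spark's HashPartitioner
# assigns keys to partitions the same way: key_hash % num_partitions).
num_partitions = 3
records = [("a", 1), ("b", 2), ("c", 3), ("a", 4)]

partitions = [[] for _ in range(num_partitions)]
for key, value in records:
    partitions[hash(key) % num_partitions].append((key, value))

# All records with the same key land in the same partition, so
# per-key aggregations never need to move data between partitions.
```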
27. How can you trigger the execution of Spark transformations?
Transformations are executed lazily. An action such as count(), collect(), or saveAsTextFile() triggers their execution.
28. What is the purpose of the groupBy() transformation?
The groupBy() transformation is used to group elements of an RDD based on a key and perform aggregation operations.
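A plain-Python sketch of the group-then-aggregate pattern (what groupByKey/reduceByKey accomplish across partitions in Spark):

```python
# Plain-Python sketch of grouping by key and aggregating:
from collections import defaultdict

sales = [("fruit", 3), ("veg", 2), ("fruit", 5)]

totals = defaultdict(int)
for key, amount in sales:
    totals[key] += amount          # combine all values sharing a key

print(dict(totals))  # {'fruit': 8, 'veg': 2}
```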
29. What is a data frame?
A data frame, like a table in a relational database, is a distributed collection of data organized into named columns.
30. How does Spark handle data skewness?
Spark handles data skewness through techniques like skewed joins and broadcasting smaller data sets.
31. What is the Catalyst optimizer?
The Catalyst Optimizer is Spark’s query optimization framework, responsible for optimizing logical and physical query plans.
32. What is a checkpoint in Spark?
A checkpoint is a mechanism to truncate the lineage of an RDD, saving its data to a reliable distributed file system for fault tolerance.
33. Explain the concept of Broadcast Hash Join in Spark.
Broadcast Hash Join is a technique that broadcasts a small DataFrame to all worker nodes, improving join performance when joining one small and one large DataFrame.
34. What is the difference between cache and persist in Spark?
Both cache() and persist() methods store RDDs in memory, but persist() allows you to specify storage levels and persist to disk as well.
35. How does Spark handle skewed data in aggregations?
Spark’s skewed join technique replicates the skewed keys onto multiple nodes to distribute the load evenly.
36. What is the purpose of the SparkSession in Spark 2.x?
SparkSession is a unified entry point for creating DataFrames, working with Spark’s SQL features, and configuring Spark settings.
37. How can you submit a Spark application to a cluster?
You can use the spark-submit command-line tool to submit a packaged Spark application to a cluster.
38. Explain the difference between a transformation and an action in Spark.
Transformations are operations on RDDs that create a new RDD, while actions trigger computation and return a result or write data.
39. What is the use of the coalesce() transformation?
The coalesce() transformation is used to reduce the number of partitions in an RDD or DataFrame, which can improve performance for the operations that follow.
40. How can you handle missing or null values in Spark DataFrames?
You can use functions like na.drop() to remove rows with missing values or na.fill() to replace them with specified values.
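The two strategies can be sketched in plain Python over a list of dicts (mirroring what na.drop() and na.fill() do on a DataFrame):

```python
# Plain-Python sketch of the two null-handling strategies:
rows = [{"name": "a", "age": 30}, {"name": "b", "age": None}]

# like df.na.drop(): remove rows containing a missing value
dropped = [r for r in rows if r["age"] is not None]

# like df.na.fill(0): replace missing values with a specified value
filled = [{**r, "age": r["age"] if r["age"] is not None else 0}
          for r in rows]

print(dropped)  # [{'name': 'a', 'age': 30}]
print(filled)   # [{'name': 'a', 'age': 30}, {'name': 'b', 'age': 0}]
```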
41. Explain the difference between a narrow transformation and a wide transformation.
Narrow transformations do not require data to be shuffled across partitions, while wide transformations do, potentially causing data movement.
42. What is the purpose of the repartition() transformation?
The repartition() transformation redistributes data across a specified number of partitions, allowing you to increase or decrease partition count.
43. What is Structured Streaming in Spark?
Structured Streaming is a high-level API for real-time stream processing that treats incoming data as a continuously growing table and analyzes it using SQL-like queries.
44. How can you optimize join operations in Spark?
Choosing proper join strategies, preventing skewness, and employing techniques like broadcast joins are all part of optimizing join operations.
45. What is the difference between local and cluster modes in Spark?
In local mode, Spark runs on a single machine, while in cluster mode, it runs on a distributed cluster of machines.
46. What is the purpose of the collect() action?
The collect() action retrieves all data from an RDD or DataFrame and returns it to the driver program as an array.
47. Explain the concept of speculative execution in Spark.
Speculative execution is a technique where Spark schedules backup tasks for slow-running tasks on other nodes to improve job completion times. Depending on the role and level of expertise being evaluated, the interviewer may ask questions focusing on specific areas. To perform well in a Spark interview, you must understand Spark’s basic principles, architecture, and numerous components.
48. Explain what flatMap and Map are in Spark.
map() processes each element of the data and produces exactly one output element per input. flatMap() may map each input element to zero or more output elements, and the results are flattened. It is therefore commonly used to produce the individual elements of an array.
49. Describe broadcast variables.
Broadcast variables let a programmer keep a read-only copy of a variable cached on each machine instead of shipping a copy with every task. Shared variables in Spark come in two kinds: broadcast variables and accumulators. Broadcast variables are stored as Array Buffers and are sent to worker nodes as read-only values.
50. What exactly is Spark Accumulators in Hadoop?
Accumulators serve as Spark’s counterpart to Hadoop counters and are useful for debugging: they can track how many times an event occurs during a job. Only the driver program, not the tasks, can read an accumulator’s value.
51. Define RDD.
Resilient Distributed Datasets (RDDs) represent a fault-tolerant collection of elements that can be operated on in parallel. The data in an RDD is partitioned, distributed, and immutable. There are mainly two ways to create RDDs.
- Parallelized collections: created by distributing an existing collection in the driver program so it can be operated on in parallel.
- Hadoop datasets: created from files in HDFS or other storage systems, with a function applied to each file record.
52. What is the use of the Spark engine?
The objective of the Spark engine is to plan, distribute, and monitor data applications in a cluster.
53. What is the Partition?
Partitioning derives logical units of data to speed up data processing. Simply put, partitions are smaller, logical divisions of data, similar to a ‘split’ in MapReduce.
54. What types of operations are supported by RDD support?
Transformations and actions are the two types of operations supported by RDD.
55. What exactly is Spark Driver?
The Spark driver code runs on the master node of the machine and declares the transformations and actions to apply to RDDs. In simple terms, a Spark driver creates a SparkContext connected to a specific Spark Master. The driver also delivers the RDD graphs to the Master, where the standalone cluster manager runs.
56. What exactly is the Spark Executor?
When SparkContext connects to a cluster manager, it receives an Executor on each node in the cluster. Executors are Spark processes that perform calculations and store the outcomes on the worker node. SparkContext assigns the final tasks to executors so that they can be completed.
57. What exactly do you mean by “worker node”?
Any node in a cluster that can run application code is referred to as a worker node. The driver program must listen for and accept incoming connections from its executors, and it must be network-addressable from the worker nodes.
58. What exactly is a sparse vector?
A sparse vector comprises two parallel arrays, one for the indices and one for the values. These vectors store entries that are not zero to save space.
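The parallel-arrays layout can be shown with a small plain-Python sketch (the helper name is illustrative): only the non-zero entries are stored.

```python
# Sketch of a sparse vector as two parallel arrays: only the non-zero
# entries are stored. Dense form [0, 0, 3.0, 0, 5.0] becomes:
size = 5
indices = [2, 4]      # positions of the non-zero entries
values = [3.0, 5.0]   # the non-zero entries themselves

def get(i):
    """Return element i, reconstructing the implicit zeros."""
    return values[indices.index(i)] if i in indices else 0.0

dense = [get(i) for i in range(size)]
print(dense)  # [0.0, 0.0, 3.0, 0.0, 5.0]
```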
59. Can Spark be used to access and analyze data contained in Cassandra databases?
To connect Spark to a Cassandra cluster, a Cassandra Connector must be added to the Spark project. During setup, a Spark executor communicates with a local Cassandra node and requests only locally stored data. This accelerates queries by moving less data over the network between the Spark executors (which process the data) and the Cassandra nodes (where the data resides).
60. Can Apache Spark be utilized in conjunction with Apache Mesos?
Yes, Apache Spark can run on hardware clusters managed by Mesos. In this deployment, Mesos replaces Spark’s standalone master as the cluster manager.
61. What are the broadcast variables?
Instead of transmitting a copy with each task, a programmer can use broadcast variables to keep a cached copy of a read-only variable on each machine. They can quickly distribute a huge input dataset to each node. Spark also attempts to spread out broadcast variables using efficient broadcast algorithms to save transmission costs.
62. Tell me about the accumulators in Apache Spark.
Accumulators are variables that can only be added to through an associative and commutative operation. They are used to implement counters and sums. Tracking accumulators in the UI helps you understand the progress of running stages. By default, Spark supports numeric accumulators.
63. What is the significance of broadcast variables while working with Apache Spark?
Broadcast variables can only be read, and they are stored in the memory cache of every system. When dealing with Spark, you don’t have to send copies of a variable for each task if you use broadcast variables. This allows data to be handled more quickly.
64. Explain Actions.
Actions in Spark bring data from an RDD back to the local machine. reduce() and take() are examples of actions: reduce() repeatedly applies a function to pairs of elements until a single value remains, while take(n) returns the first n elements of the RDD to the driver.
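A plain-Python sketch of the two actions described above (standard library stand-ins, not the Spark API):

```python
# Plain-Python sketch of reduce() and take() semantics:
from functools import reduce

data = [1, 2, 3, 4, 5]

# like rdd.reduce(): pairs are combined repeatedly until one value remains
total = reduce(lambda a, b: a + b, data)

# like rdd.take(2): the first n elements are returned to the driver
first_two = data[:2]

print(total)      # 15
print(first_two)  # [1, 2]
```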
65. Explain the functions supported by Spark Core.
Various functions are supported by Spark Core like job scheduling, fault-tolerance, memory management, monitoring jobs, and much more.
66. Define RDD Lineage.
Spark does not replicate data, so lost data is reconstructed using RDD lineage. Lineage is the record of how an RDD was built: each RDD remembers the sequence of transformations that created it from other datasets, and that recipe is replayed to rebuild lost partitions.
67. What does Spark Driver do?
The Spark driver is a program that runs on the main node of the device and announces transformations and actions on the data RDD. In a nutshell, the driver in Spark makes SparkContext in conjunction with the given Spark Master. It also provides RDD graphs to the master, where the cluster manager operates.
68. What is Hive?
Hive is a data warehouse system for Hadoop that lets you query data with the SQL-like HiveQL. By default, Hive supports Spark on YARN mode.
Hive execution is configured in Spark through:
hive> set spark.home=/location/to/sparkHome;
hive> set hive.execution.engine=spark;
69. List the most frequently used Spark ecosystems.
- For SQL on structured data, Spark SQL (Shark).
- To process live data streams, Spark Streaming.
- For graph processing and computation, GraphX.
- MLlib (machine learning algorithms).
- For supporting R programming on the Spark engine, SparkR.
70. What is GraphX?
Spark uses GraphX for graph processing and graph construction. GraphX lets programmers reason about graph-structured big data.
71. Define Spark SQL.
Spark SQL, also known as Shark, is used for processing structured data. Using this module, relational queries on data are executed by Spark. It supports SchemaRDD, which consists of schema objects and row objects representing the data type of each column in a row. This is just like a table in relational databases.
72. Which file systems are supported by Apache Spark?
- Hadoop file distribution system (HDFS)
- Amazon S3
- Local File system
73. Define Partitions.
As the name suggests, a partition is a smaller, logical division of data, similar to a ‘split’ in MapReduce. Partitioning derives logical units of data to speed up processing. Everything in Spark is a partitioned RDD.
74. Name a few functions of Spark SQL.
The following are the functions of Spark SQL:
- Loads data from various structured sources.
- Query data using SQL elements.
- Provides advanced integration between regular Python/Java/Scala code and SQL.
75. Is there any benefit of learning MapReduce?
Yes. MapReduce is a paradigm used by many big data tools, including Apache Spark, and understanding it becomes more valuable as data grows. Many tools, such as Pig and Hive, convert their queries into MapReduce phases to optimize them.
76. Can we use Spark to access and analyze the Cassandra Database data?
Yes, we can use Spark for accessing and analyzing the data stored using the Spark Cassandra Connector. We need to add the Cassandra Connector to the Spark project for connecting Spark to a Cassandra cluster.
77. How to connect Spark with Apache Mesos?
1. First, configure the Spark driver program to connect to Mesos.
2. Next, Spark binary package must be in a location available to Mesos.
3. Install Apache Spark in the same location as Apache Mesos, and configure the property ‘spark.mesos.executor.home’ to point to the location where it is installed.
78. Which data sources does Spark SQL support?
1. Parquet file
2. JSON datasets
3. Hive table
79. Why is Apache Spark faster than Apache Hadoop?
Apache Spark is faster than Apache Hadoop for the following reasons:
- Apache Spark provides in-memory computing. Spark is designed to transform data in memory, reducing time spent on disk I/O, while MapReduce writes intermediate results back to disk and reads them back.
- Spark uses a directed acyclic graph (DAG), which lets it plan and optimize the whole computation as a unit rather than as the rigid multi-stage MapReduce model.
- The Apache Spark core is developed in Scala, whose immutable collections and functional style make concurrent execution straightforward, whereas in Java you typically manage threads explicitly to achieve parallelism.
80. List down the languages supported by Apache Spark.
Apache Spark supports Scala, Python, Java, and R.
Apache Spark is written in Scala, and many people use Scala for development, but it also provides APIs in Java, Python, and R.
81. Name various types of Cluster Managers in Spark
- Apache Mesos – Commonly used cluster manager
- Standalone – A primary cluster manager for setting up a cluster
- YARN – Used for resource management
82. What exactly are partitions?
A partition is a logical division of data, similar to a ‘split’ in MapReduce. Logical units of data are derived specifically to speed up processing, and smaller chunks of data also improve scalability. All input, intermediate, and output data is held in partitioned RDDs.
83. How does Spark divide data?
Spark partitions data using the MapReduce API. The number of partitions can be set via the input format. By default, the HDFS block size is used as the partition size (for optimal performance), although the partition size can be changed with settings such as the input split size.
84. How is data stored in Spark?
Spark is a processing engine; it does not include a storage engine. It is capable of retrieving data from any storage engine, including HDFS, S3, and other data resources.
85. Is it necessary to initialize Hadoop before running the spark application?
No, it is not required, because Spark has no dedicated storage of its own; it can read data from the local file system. You can load and process data from your local system; Hadoop or HDFS is not required to run a Spark program.
86. What exactly is SparkContext?
SparkContext is the entry point to Spark functionality: when a program creates a SparkContext object, it tells Spark how to connect to the cluster. A SparkConf object, which holds the application’s configuration, is used to construct the SparkContext.
87. What are the features of SparkCore?
SparkCore is the Apache Spark framework’s basic engine. Spark’s key functionalities include memory management, fault tolerance, job scheduling and monitoring, and connecting with store systems.
88. What distinguishes SparkSQL from HQL and SQL?
SparkSQL is a component of the Spark core engine that supports SQL and Hive Query Language (HQL) without requiring changes to the syntax. SQL tables and HQL tables can be joined.
89. When do we employ Spark Streaming?
Spark Streaming is an API for processing streaming data in real time. Spark streaming collects streaming data from various sources, such as web server log files, social media data, stock market data, and Hadoop ecosystems such as Flume and Kafka.
90. How does the Spark Streaming API work?
The programmer specifies a batch interval in the configuration; data arriving within that interval is grouped into a batch. Spark Streaming begins with the input stream (DStream). The framework divides the data into small pieces called batches, which are then fed into the Spark engine for processing.
91. What exactly is GraphX?
GraphX is a Spark API for manipulating graphs and graph-parallel collections. It unifies ETL, exploratory analysis, and iterative graph computation. It is a fast, fault-tolerant graph system that is simple to use without specialized skills.
92. What exactly is File System API?
The FS API may read data from many storage media such as HDFS, S3, and the Local FileSystem. Spark reads data from several storage engines using the FS API.
93. Why are partitions immutable?
Each transformation produces a new partition rather than modifying an existing one. Immutability keeps partitions distributed and fault-tolerant: because a partition is never changed in place, it can always be rebuilt from its lineage. Partitioning also takes advantage of data locality.
94. What is the definition of Transformation in Spark?
Transformations and Actions are the two kinds of operations on RDDs provided by Spark. Transformations are lazily evaluated: nothing runs until an Action is called, and each Transformation produces a new RDD. Examples of Spark transformations: map, flatMap, groupByKey, reduceByKey, filter, cogroup, join, sortByKey, union, distinct, and sample.
95. What exactly is Action in Spark?
Actions are RDD operations that return a value to the Spark driver program and launch a job to run on the cluster. The output of a Transformation is the input to an Action. Examples of basic Apache Spark actions: reduce, collect, take, sample, saveAsTextFile, saveAsSequenceFile, and countByKey.
96. What exactly is RDD Lineage?
Lineage is an RDD technique for reassembling lost partitions. Spark does not replicate data in memory; if data is lost, RDD employs lineage to reconstruct lost data. Each RDD remembers how it was constructed from other datasets.
97. Distinguish between Map and flatMap in Spark?
map() processes one element at a time and produces exactly one output per input. In flatMap(), each input item can be mapped to several output items (the function returns a Seq rather than a single item), and the results are flattened. As a result, it is commonly used to return the individual elements of an array.
98. What exactly are broadcast variables?
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy with every task. Spark supports two kinds of shared variables: broadcast variables (similar to the Hadoop distributed cache) and accumulators (similar to Hadoop counters). Broadcast variables are saved as Array Buffers and sent to worker nodes as read-only values.
99. What do Accumulators in Spark mean?
Accumulators are Spark’s counterpart to Hadoop counters: they can be used to count the number of events occurring during a job. Only the driver program, not the tasks, can read an accumulator’s value.
100. How does RDD store data?
There are two methods for persisting data: persist(), which lets you choose a storage level, and cache(), which uses the default in-memory level (MEMORY_ONLY). Numerous storage level options are available, including MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, and more; persist() selects among these depending on the task.