Get [LATEST] Hadoop Admin Interview Questions and Answers
Last updated on 12th Nov 2021
Not long ago, hardly anyone knew what Hadoop was; today the elephant in the big data room has become the big data darling. According to Wikibon, the Hadoop market crossed $256 mn in vendor revenue and is anticipated to grow to $1.7 billion. Programmers, architects, system administrators and data warehousing professionals are leaving no stone unturned in learning Hadoop for storing and processing large data sets. Cracking a Hadoop Admin interview becomes a tedious job if you do not spend enough time preparing for it. This article lists the top Hadoop Admin questions and answers that are likely to be asked when you interview for Hadoop Administration jobs.
1. What is Hadoop and its key components?
Hadoop is an open-source distributed computing framework used to store and process large datasets. Its key components include HDFS (Hadoop Distributed File System) for storage and MapReduce for processing data in parallel.
2. Explain the role of the Hadoop Administrator?
Hadoop Administrators are responsible for setting up, configuring, and managing Hadoop clusters. They handle tasks such as capacity planning, monitoring, troubleshooting, and ensuring data security.
3. What are the common issues faced by Hadoop Administrators?
Common issues include cluster performance bottlenecks, data node failures, resource management, and data replication challenges.
4. How do you set up Hadoop High Availability (HA)?
Hadoop HA is set up by configuring HDFS HA, which runs an active and a standby NameNode for failover, typically with Apache ZooKeeper handling coordination and leader election.
5. How do you troubleshoot a slow Hadoop job?
Troubleshooting a slow Hadoop job involves analyzing logs, identifying resource bottlenecks, tuning configuration parameters, and optimizing the data processing flow.
6. Explain the process of adding new nodes to an existing Hadoop cluster.
To add new nodes, the Administrator installs Hadoop on the new machines, configures them to join the cluster, and adjusts resource allocation and data replication settings.
7. How do you enable data compression in Hadoop?
Data compression in Hadoop can be enabled by configuring compression codecs (such as Snappy, Gzip, or LZO) in the Hadoop configuration files.
8. What is the role of the ResourceManager in YARN (Yet Another Resource Negotiator)?
The ResourceManager is responsible for managing resources across the cluster and allocating resources to applications.
9. How do you secure a Hadoop cluster?
Hadoop cluster security can be achieved by enabling Kerberos authentication, setting up secure communication using SSL/TLS, and implementing access controls.
10. What are the Hadoop log files, and where are they located?
Hadoop log files include NameNode, DataNode, ResourceManager, and NodeManager logs. They are usually found in the Hadoop log directory.
11. How do you handle disk failures in Hadoop clusters?
Disk failures can be handled by configuring Hadoop to use data replication (HDFS replication factor) to ensure data redundancy and fault tolerance.
12. What is the purpose of the Hadoop Balancer tool?
The Hadoop Balancer tool is used to balance data across DataNodes in a cluster to ensure equal data distribution and optimize storage usage.
13. How do you manage and monitor resource utilization in a Hadoop cluster?
Resource utilization can be managed and monitored using Hadoop’s resource manager web UI and various monitoring tools like Ambari or Cloudera Manager.
14. How do you ensure data integrity in Hadoop?
Data integrity is ensured through data checksums, replication, and regular data backups.
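As a rough illustration of the checksum idea, the sketch below computes one checksum per fixed-size chunk when data is written and re-checks the chunks when the data is read back. The 512-byte chunk size, the function names, and the use of Python's `zlib.crc32` are assumptions made for this example; the checksum code inside HDFS differs in detail.

```python
import zlib

CHUNK_SIZE = 512  # assumed chunk size for this illustration


def compute_checksums(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[int]:
    """Return one CRC32 value per fixed-size chunk of data."""
    return [
        zlib.crc32(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]


def find_corrupt_chunks(data: bytes, expected: list[int],
                        chunk_size: int = CHUNK_SIZE) -> list[int]:
    """Return the indexes of chunks whose checksum no longer matches."""
    actual = compute_checksums(data, chunk_size)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


original = bytes(range(256)) * 6          # 1536 bytes, i.e. 3 chunks
checksums = compute_checksums(original)   # stored alongside the data

damaged = bytearray(original)
damaged[700] ^= 0xFF                      # flip bits inside the second chunk
corrupt = find_corrupt_chunks(bytes(damaged), checksums)
print(corrupt)  # [1]
```

A chunk that fails the check would then be re-read from another replica, which is where replication and checksums work together.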
15. What are the different modes of Hadoop cluster deployment?
Hadoop can be deployed in the standalone mode, pseudo-distributed mode (single-node cluster), and fully-distributed mode (multi-node cluster).
16. Explain the process of upgrading Hadoop components in a cluster.
Upgrading Hadoop components involves installing the new version, migrating configuration files and data, and performing compatibility testing.
17. What are the advantages of using YARN over the traditional MapReduce framework in Hadoop 2.x?
YARN provides better cluster resource management and supports multiple data processing models, making it more scalable and flexible.
18. How do you handle NameNode failure in a High Availability setup?
In Hadoop HA, if the active NameNode fails, the standby NameNode takes over, and the cluster continues to operate with minimal downtime.
19. What is the purpose of Hadoop Rack Awareness?
Rack Awareness ensures that data replicas are distributed across different racks in a datacenter to improve fault tolerance and data locality.
20. How do you secure Hadoop data transmission between nodes?
Data transmission security can be achieved by enabling SSL/TLS encryption for data transfer between Hadoop nodes.
21. What is Hadoop Federation and what are its benefits?
Hadoop Federation allows multiple independent namespaces and multiple NameNodes to manage separate parts of the HDFS namespace, improving scalability.
22. How do you manage Hadoop configuration files in a large cluster?
Configuration management tools like Apache Ambari or Cloudera Manager can be used to manage and distribute configuration files across the cluster.
23. Explain the process of handling missing or corrupt data blocks in Hadoop.
Missing or corrupt data blocks can be resolved by HDFS replication, which creates replicas of the lost or corrupted blocks from other DataNodes.
24. Discuss the differences between RDBMS and Hadoop.
|Criteria||RDBMS||Hadoop|
|Data volume||RDBMS is not designed to store and process very large amounts of data.||Hadoop works better for large amounts of data. It can store and process far more data than an RDBMS.|
|Throughput||RDBMS fails to achieve high throughput.||Hadoop achieves high throughput.|
|Data variety||The schema of the data is known in advance, and an RDBMS always depends on structured data.||It stores any kind of data, whether structured, unstructured, or semi-structured.|
|Data processing||RDBMS supports OLTP (Online Transactional Processing).||Hadoop supports OLAP (Online Analytical Processing).|
|Read/write speed||Reads are fast in RDBMS because the schema of the data is already known.||Writes are fast in Hadoop because no schema validation happens during an HDFS write.|
|Schema on read vs. write||RDBMS follows a schema-on-write policy.||Hadoop follows a schema-on-read policy.|
|Cost||RDBMS is typically licensed software.||Hadoop is a free and open-source framework.|
25. Explain Big Data and its characteristics.
Big Data refers to a large amount of data that exceeds the processing capacity of conventional database systems and requires a special parallel processing mechanism. This data can be either structured or unstructured data.
Characteristics of Big Data:
- Volume – The amount of data, which is increasing at an exponential rate, i.e., into gigabytes, petabytes, exabytes, etc.
- Velocity – Velocity refers to the rate at which data is generated, modified, and processed. At present, social media is a major contributor to the velocity of growing data.
- Variety – It refers to data coming from a variety of sources such as audio, video, CSV files, etc. It can be structured, unstructured, or semi-structured.
- Veracity – Veracity refers to how imprecise or uncertain the data is, i.e., how far it can be trusted.
- Value – This is the most important element of big data. It covers how to access the data and deliver quality analytics to the organization, which is what gives a fair return on the technology used.
26. What is Hadoop? List its components.
Hadoop is an open-source framework used for storing large data sets and running applications across clusters of commodity hardware. It offers extensive storage for any type of data and can handle a very large number of parallel tasks.
Core components of Hadoop:
- Storage unit – HDFS (DataNode, NameNode)
- Processing framework – YARN (NodeManager, ResourceManager)
27. What is YARN? Explain its components.
Yet Another Resource Negotiator (YARN) is one of the core components of Hadoop. It is responsible for managing resources for the various applications operating in a Hadoop cluster, and it also schedules tasks on different cluster nodes.
- Resource Manager – It is the master daemon and controls resource allocation in the cluster.
- Node Manager – It is the slave daemon that runs on each worker node and is responsible for executing tasks on that node.
- Application Master – It maintains the user job lifecycle and resource requirements of individual applications. It operates along with the Node Manager and controls the execution of tasks.
- Container – It is a bundle of resources such as network, disk, RAM, and CPU on a single node.
28. What is the difference between a regular file system and HDFS?
|Regular file system||HDFS|
|Small block size (for example, 512 bytes)||Large block size (on the order of 64 MB)|
|Reading a large file requires multiple disk seeks||Reads data sequentially after a single seek|
29. What are the Hadoop daemons? Explain their roles in a Hadoop cluster.
A daemon is simply a process that runs in the background. Hadoop has five such daemons:
- NameNode – It is the master node, responsible for storing the metadata for all the directories and files.
- DataNode – It is the slave node, responsible for storing the actual data.
- Secondary NameNode – It periodically merges the NameNode's edit log into the FsImage (checkpointing) and keeps a copy of the resulting metadata. It is a helper for the NameNode rather than a hot standby.
- JobTracker – It is used for creating and running jobs. It runs on the master node and allocates tasks to the TaskTrackers.
- TaskTracker – It runs on the data nodes. It executes the tasks and reports their status to the JobTracker.
30. What is Avro Serialization in Hadoop?
- Serialization is the process of translating the state of objects or data structures into binary or textual form. Avro is a serialization framework that describes data with a language-independent schema (written in JSON).
- It provides AvroMapper and AvroReducer for running MapReduce programs.
31. How can you skip the bad records in Hadoop?
Hadoop provides the SkipBadRecords class for skipping bad records while processing map inputs.
32. Explain HDFS and its components.
- HDFS (Hadoop Distributed File System) is the primary data storage unit of Hadoop.
- It stores various types of data as blocks in a distributed environment and follows a master-slave topology.
- NameNode – It is the master node and is responsible for maintaining the metadata information for the blocks of data stored in HDFS. It manages all the DataNodes.
- DataNode – It is the slave node and is responsible for storing data in HDFS.
33. What are the features of HDFS?
- Supports storage of very large datasets
- Write once read many access model
- Streaming data access
- Replication using commodity hardware
- HDFS is highly Fault Tolerant
- Distributed Storage
34. What is the default replication factor?
- The replication factor is the number of copies of each block that HDFS keeps across the cluster.
- The default replication factor is 3.
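To see what the replication factor means for capacity planning, here is a small back-of-the-envelope sketch. The 128 MB block size and the helper name `storage_footprint` are assumptions chosen for illustration, not values taken from this article.

```python
import math


def storage_footprint(file_size_mb: float, block_size_mb: int = 128,
                      replication: int = 3) -> tuple[int, int, float]:
    """Return (blocks, block replicas, raw MB consumed) for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    # A partially filled last block only occupies its actual size on disk.
    raw_mb = file_size_mb * replication
    return blocks, replicas, raw_mb


blocks, replicas, raw_mb = storage_footprint(1000)
print(blocks, replicas, raw_mb)  # 8 24 3000
```

With a replication factor of 3, a 1000 MB file therefore needs roughly 3000 MB of raw cluster storage.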
35. List the various HDFS commands.
- The various HDFS commands are listed below:
- copyFromLocal
35. Compare HDFS (Hadoop Distributed File System) and NAS (Network Attached Storage).
|HDFS||NAS|
|It is a distributed file system that stores data on commodity hardware.||It is a file-level computer data storage server connected to a computer network that provides network access to a heterogeneous group of clients.|
|It runs on commodity hardware, which is cost-effective.||NAS is a high-end storage device that comes at a high cost.|
|It is designed to work with the MapReduce paradigm.||It is not suitable for MapReduce.|
36. What are the limitations of Hadoop 1.0?
- NameNode: No Horizontal Scalability and No High Availability
- Job Tracker: Overburdened.
- MRv1: It can only understand Map and Reduce tasks
37. How do you commission (add) nodes in a Hadoop cluster?
- Update the network addresses in the dfs.include and mapred.include files.
- Update the NameNode: hadoop dfsadmin -refreshNodes
- Update the JobTracker: hadoop mradmin -refreshNodes
- Update the slaves file.
- Start the DataNode and NodeManager on the added node.
38. How do you decommission (remove) nodes from a Hadoop cluster?
- Update the network addresses in the dfs.exclude and mapred.exclude files.
- Update the NameNode: hadoop dfsadmin -refreshNodes
- Update the JobTracker: hadoop mradmin -refreshNodes
- Cross-check the Web UI; it will show “Decommissioning in Progress”.
- Once decommissioning has finished, remove the nodes from the include file and then run: hadoop dfsadmin -refreshNodes, hadoop mradmin -refreshNodes.
- Remove the nodes from the slaves file.
39. Compare Hadoop 1.x and Hadoop 2.x.
|Criteria||Hadoop 1.x||Hadoop 2.x|
|1. NameNode||In Hadoop 1.x, the NameNode is a single point of failure.||In Hadoop 2.x, we have both active and passive NameNodes.|
|2. Processing||MRv1 (JobTracker & TaskTracker)||MRv2/YARN (ResourceManager & NodeManager)|
40. What is a checkpoint?
Checkpointing is a process that takes an FsImage and the edit log and compacts them into a new FsImage. Instead of replaying the edit log, the NameNode can then load its final in-memory state directly from the FsImage. This is a far more efficient operation and reduces NameNode startup time.
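The merge step can be pictured with a toy model. In the sketch below the FsImage is a plain dictionary and the edit log is a list of tuples; the operation names (`create`, `set_replication`, `delete`) and the function names are hypothetical and only stand in for the real on-disk formats.

```python
def apply_edit(namespace: dict[str, dict], edit: tuple) -> None:
    """Apply one logged operation to the in-memory namespace."""
    op, path, *args = edit
    if op == "create":
        namespace[path] = {"replication": args[0]}
    elif op == "set_replication":
        namespace[path]["replication"] = args[0]
    elif op == "delete":
        namespace.pop(path, None)
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(fsimage: dict[str, dict], edit_log: list[tuple]) -> dict[str, dict]:
    """Merge an old image and its edit log into a new image."""
    new_image = {path: dict(meta) for path, meta in fsimage.items()}
    for edit in edit_log:
        apply_edit(new_image, edit)
    return new_image


old_fsimage = {"/data/a.csv": {"replication": 3}}
edit_log = [
    ("create", "/data/b.csv", 3),
    ("set_replication", "/data/a.csv", 2),
    ("delete", "/data/b.csv"),
    ("create", "/logs/app.log", 3),
]
new_fsimage = checkpoint(old_fsimage, edit_log)
print(new_fsimage)
```

After the checkpoint, the four logged edits no longer need to be replayed at startup: loading `new_fsimage` gives the same state directly.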
41. What is the difference between active and passive NameNodes?
- Active NameNode works and runs in the cluster.
- Passive (standby) NameNode holds the same data as the active NameNode and replaces it when it fails.
42. How will you resolve the NameNode failure issue?
The following steps need to be executed to resolve the NameNode issue and get the Hadoop cluster up and running:
- Use the FsImage (file system metadata replica) to start a new NameNode.
- Configure the DataNodes and clients so that they acknowledge the newly started NameNode.
- The new NameNode will start serving clients once it has finished loading the last checkpoint FsImage and has received enough block reports from the DataNodes.
43. List the different types of Hadoop schedulers.
- Hadoop FIFO scheduler
- Hadoop Fair Scheduler
- Hadoop Capacity Scheduler
44. How do you keep an HDFS cluster balanced?
It is not possible to prevent a cluster from becoming unbalanced. To bring the data nodes back within a certain utilization threshold of one another, use the Balancer tool. This tool evens out the block data distribution across the cluster.
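The threshold idea can be sketched as follows: a node counts as balanced when its utilization is within a given number of percentage points of the cluster-wide utilization. The node names, the sizes, and the 10 percent threshold below are made-up values for illustration, and the function is a simplification rather than the Balancer's actual code.

```python
def classify_nodes(used_gb: dict[str, float], capacity_gb: dict[str, float],
                   threshold_pct: float = 10.0) -> dict[str, list[str]]:
    """Group DataNodes by how their utilization compares to the cluster average."""
    cluster_util = 100.0 * sum(used_gb.values()) / sum(capacity_gb.values())
    result = {"over": [], "under": [], "balanced": []}
    for node in sorted(used_gb):
        node_util = 100.0 * used_gb[node] / capacity_gb[node]
        if node_util > cluster_util + threshold_pct:
            result["over"].append(node)       # candidate source of blocks
        elif node_util < cluster_util - threshold_pct:
            result["under"].append(node)      # candidate target for blocks
        else:
            result["balanced"].append(node)
    return result


capacity = {"dn1": 1000, "dn2": 1000, "dn3": 1000}
used = {"dn1": 900, "dn2": 500, "dn3": 100}
groups = classify_nodes(used, capacity, threshold_pct=10.0)
print(groups)
```

Here the cluster is 50 percent full overall, so `dn1` (90 percent) is over-utilized and `dn3` (10 percent) is under-utilized; a balancing pass would move blocks from the former to the latter.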
45. What is HDFS Federation?
- HDFS Federation enhances the present HDFS architecture through a clear separation of namespace and storage by enabling a generic block storage layer.
- It provides multiple namespaces in the cluster to improve scalability and isolation.
46. What is HDFS High Availability?
HDFS High Availability was introduced in Hadoop 2.0. It adds support for multiple NameNodes (an active one and a standby) to the Hadoop architecture.
47. What is a rack-aware replica placement policy?
- Rack Awareness is the algorithm the NameNode uses to reduce network traffic while reading/writing HDFS files in the Hadoop cluster.
- For a read/write request, the NameNode chooses a DataNode on the same rack or a nearby rack. This concept of choosing closer data nodes based on rack information is called Rack Awareness.
- With a replication factor of 3 for data blocks on HDFS, two copies of every block are stored on the same rack, while the third copy is stored on a different rack. This rule is called the Replica Placement Policy.
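A minimal sketch of the "two replicas on one rack, one on another" rule is shown below. The topology, node names, and the function `place_replicas` are hypothetical, and the sketch assumes the writer's rack has at least two nodes; the exact order in which HDFS picks nodes differs, so treat this only as an illustration of the rule as described above.

```python
import random


def place_replicas(topology: dict[str, list[str]], writer_node: str,
                   rng: random.Random) -> list[str]:
    """Pick three DataNodes: two on one rack, one on a different rack."""
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[writer_node]

    first = writer_node                                   # replica 1: the writer itself
    same_rack = [n for n in topology[local_rack] if n != first]
    second = rng.choice(same_rack)                        # replica 2: same rack
    remote_racks = [r for r in topology if r != local_rack]
    third = rng.choice(topology[rng.choice(remote_racks)])  # replica 3: other rack
    return [first, second, third]


topology = {"rack1": ["dn1", "dn2", "dn3"], "rack2": ["dn4", "dn5"]}
replicas = place_replicas(topology, "dn1", random.Random(42))
rack_of = {n: r for r, ns in topology.items() for n in ns}
racks_used = [rack_of[n] for n in replicas]
print(replicas, racks_used)
```

Losing a whole rack still leaves at least one copy of the block, while two of the three copies stay close together to limit cross-rack traffic.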
48. What is the main purpose of the Hadoop fsck command?
The hadoop fsck command is used for checking the HDFS file system. Different arguments can be passed to this command to produce different results.
- hadoop fsck / -files: Displays all the files in HDFS while checking.
- hadoop fsck / -files -blocks: Displays all the blocks of the files while checking.
- hadoop fsck / -files -blocks -locations: Displays the locations of all the file blocks while checking.
- hadoop fsck / -files -blocks -locations -racks: Displays the network topology for the data-node locations.
- hadoop fsck / -delete: Deletes the corrupted files in HDFS.
- hadoop fsck / -move: Moves the corrupted files to the /lost+found directory.
49. What is the purpose of a DataNode block scanner?
- The purpose of the DataNode block scanner is to operate and periodically check all the blocks that are stored on the DataNode.
- If bad blocks are detected they will be fixed before any client reads.
50. What is the purpose of the dfsadmin tool?
- The dfsadmin tool is used for examining the HDFS cluster status.
- The dfsadmin -report command produces useful information about basic statistics of the cluster, such as DataNode and NameNode status, configured disk capacity, etc.
- It performs all the administrative tasks on HDFS.
51. What is the command used for printing the topology?
hdfs dfsadmin -printTopology is used for printing the topology. It displays the tree of racks and the DataNodes attached to those racks.
52. What is RAID?
RAID (redundant array of independent disks) is a data storage virtualization technology used for improving performance and data redundancy by combining multiple disk drives into a single entity.
53. Does Hadoop require RAID?
- On DataNodes, RAID is not necessary because redundancy is achieved by replication between the nodes.
- On the NameNode’s disks, RAID is recommended.
54. List the various site-specific configuration files available in Hadoop?
55. What is the main functionality of NameNode?
It is mainly responsible for:
- Namespace – Manages metadata of HDFS.
- Block Management – Processes and manages the block reports and their location.
56. Which command is used to format the NameNode?
$ hdfs namenode -format
57. How does a client application interact with the NameNode?
- Client applications use the Hadoop HDFS API to contact the NameNode when they have to copy/move/add/locate/delete a file.
- The NameNode responds to successful requests by returning a list of the relevant DataNode servers where the data resides.
- The client can talk directly to a DataNode after the NameNode has given the location of the data.
58. What is MapReduce and list its features?
MapReduce is a programming model used for processing and generating large datasets on clusters with parallel, distributed algorithms. The syntax for running a MapReduce program is hadoop jar hadoop_jar_file.jar /input_path /output_path.
59. What are the features of MapReduce?
- Automatic parallelization and distribution.
- Built-in fault tolerance and redundancy are available.
- MapReduce Programming model is language independent
- Distributed programming complexity is hidden
- Enable data local processing
- Manages all the inter-process communication
60. What does the MapReduce framework consist of?
MapReduce framework is used to write applications for processing large data in parallel on large clusters of commodity hardware. It consists of:
- ResourceManager (RM): the global resource scheduler; there is one master RM.
- NodeManager (NM): one slave NM per cluster node.
- Container: the RM creates containers upon request by the AM; the application runs in one or more containers.
- ApplicationMaster (AM): one AM per application; it runs in a container.
61. What are the two main components of ResourceManager?
- Scheduler – It allocates resources (containers) to the various running applications based on resource availability and the configured sharing policy.
- ApplicationsManager – It is mainly responsible for managing the collection of submitted applications.
62. What is a Hadoop counter?
Hadoop Counters measure progress or track the number of operations that occur within a MapReduce job. Counters are useful for collecting statistics about MapReduce jobs, whether for application-level reporting or for quality control.
63. What are the main configuration parameters for a MapReduce application?
The job configuration requires the following:
- The job’s input and output locations in the distributed file system
- The input format of the data
- The output format of the data
- The classes containing the map function and the reduce function
- The JAR file containing the reducer, driver, and mapper classes
64. What are the steps involved in submitting a Hadoop job?
Steps involved in Hadoop job submission:
- The Hadoop job client submits the job jar/executable and configuration to the ResourceManager.
- The ResourceManager then distributes the software/configuration to the slaves.
- The ResourceManager then schedules the tasks and monitors them.
- Finally, job status and diagnostic information are provided to the client.
65. How does the MapReduce framework view its input internally?
It views the input data set as a set of <key, value> pairs and processes the map tasks in a completely parallel manner.
66. What are the basic parameters of Mapper?
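The basic parameters of a Mapper are its input and output key/value types. In the classic word-count example they are:
1. LongWritable and Text (input key and value)
2. Text and IntWritable (output key and value)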
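To make those four types concrete, here is a plain-Python imitation of a word-count map function: the input key is a byte offset (the role of LongWritable), the input value is a line (Text), and each emitted pair is a word (Text) with the count 1 (IntWritable). The function name and the sample lines are assumptions for illustration; this is not Hadoop API code.

```python
from typing import Iterator


def word_count_map(offset: int, line: str) -> Iterator[tuple[str, int]]:
    """Input: (byte offset, line). Output: one (word, 1) pair per word."""
    for word in line.split():
        yield word.lower(), 1


lines = ["Hadoop stores data", "Hadoop processes data"]
offset = 0
map_output = []
for line in lines:
    map_output.extend(word_count_map(offset, line))
    offset += len(line.encode("utf-8")) + 1  # +1 for the newline character
print(map_output)
```

The offset is not used by the word-count logic itself; it is simply the key that a line-oriented input format hands to the mapper.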
67. What are Writables and explain their importance in Hadoop?
- Writable is an interface in Hadoop. Writable classes act as wrappers for almost all the primitive data types of Java.
- A Writable is a serializable object that implements a simple and efficient serialization protocol based on DataInput and DataOutput.
- Writables are used for creating serialized data types in Hadoop.
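The write/read-back contract can be imitated in a few lines. The class below is a hypothetical record with a `write` method and a `read_fields` method that serialize its fields in a fixed order; the byte layout (4-byte big-endian lengths and integers) is an assumption chosen for simplicity and is not the exact encoding of any Hadoop Writable.

```python
import io
import struct


class PageViewWritable:
    """Hypothetical record: a URL and its view count."""

    def __init__(self, url: str = "", views: int = 0) -> None:
        self.url = url
        self.views = views

    def write(self, out: io.BufferedIOBase) -> None:
        """Serialize the fields in a fixed order."""
        encoded = self.url.encode("utf-8")
        out.write(struct.pack(">i", len(encoded)))  # 4-byte big-endian length
        out.write(encoded)
        out.write(struct.pack(">i", self.views))    # 4-byte big-endian int

    def read_fields(self, src: io.BufferedIOBase) -> None:
        """Read the fields back in exactly the order they were written."""
        (length,) = struct.unpack(">i", src.read(4))
        self.url = src.read(length).decode("utf-8")
        (self.views,) = struct.unpack(">i", src.read(4))


buffer = io.BytesIO()
PageViewWritable("/index.html", 42).write(buffer)

buffer.seek(0)
restored = PageViewWritable()
restored.read_fields(buffer)
print(restored.url, restored.views, len(buffer.getvalue()))  # /index.html 42 19
```

The compact, schema-free byte layout is the point: both sides agree on field order in code, so nothing but the values travels over the wire.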
68. Why is comparison of types important for MapReduce?
- It is important for MapReduce because the keys are compared with one another in the sorting phase.
- For comparison of types, the WritableComparable interface is implemented.
69. What is “speculative execution” in Hadoop?
In Apache Hadoop, the framework does not try to diagnose or fix slow-running tasks. Instead, the master node can redundantly launch another instance of the same task on another node as a backup (the backup task is called a speculative task). This process is called Speculative Execution in Hadoop.
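One simple way to picture the decision is "launch a backup for any running task that trails the average progress by a wide margin". The sketch below implements only that simplified rule; the 0.2 lag value, the task names, and the function are assumptions, and Hadoop's real heuristics are more involved.

```python
def pick_speculative_candidates(progress: dict[str, float],
                                lag: float = 0.2) -> list[str]:
    """Return running tasks whose progress trails the average by more than `lag`."""
    running = {task: p for task, p in progress.items() if p < 1.0}
    if not running:
        return []
    average = sum(progress.values()) / len(progress)
    return sorted(task for task, p in running.items() if average - p > lag)


progress = {"map_0": 1.0, "map_1": 0.95, "map_2": 0.9, "map_3": 0.3}
stragglers = pick_speculative_candidates(progress)
print(stragglers)  # ['map_3']
```

Whichever copy of `map_3` finishes first would be kept, and the other would be killed.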
70. What are the methods used for restarting the NameNode in Hadoop?
The methods used for restarting the NameNode are the following:
- Use the /sbin/hadoop-daemon.sh stop namenode command to stop the NameNode individually, and then start it again using /sbin/hadoop-daemon.sh start namenode.
- Use /sbin/stop-all.sh and then /sbin/start-all.sh to stop all the daemons first and then start them all again.
71. What is the difference between an “HDFS Block” and “MapReduce Input Split”?
- An HDFS Block is the physical division of data on disk and is the minimum amount of data that can be read or written, while a MapReduce InputSplit is the logical division of data created by the InputFormat specified in the MapReduce job configuration.
- HDFS divides data into blocks, whereas MapReduce divides data into input splits and assigns them to mapper functions.
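The difference shows up when a record straddles a block boundary. In the sketch below, blocks are cut at fixed byte offsets, while the reader for each split skips a leading partial line and reads past the end of its range to finish the last line. The 8-byte block size and the function names are assumptions, and the reader is a simplified imitation of line-oriented input handling rather than Hadoop's own code.

```python
def physical_blocks(data: bytes, block_size: int) -> list[tuple[int, int]]:
    """Cut data into (start, length) ranges at fixed byte boundaries."""
    return [(s, min(block_size, len(data) - s))
            for s in range(0, len(data), block_size)]


def read_split(data: bytes, start: int, length: int) -> list[bytes]:
    """Return the whole lines that logically belong to one split."""
    end = start + length
    pos = start
    if start != 0:
        # The first (possibly partial) line belongs to the previous split.
        newline = data.find(b"\n", start)
        if newline == -1:
            return []
        pos = newline + 1
    records = []
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        if newline == -1:
            newline = len(data)
        records.append(data[pos:newline])  # may extend past `end`
        pos = newline + 1
    return records


data = b"alpha\nbravo\ncharlie\ndelta\n"
blocks = physical_blocks(data, 8)
records_per_split = [read_split(data, start, length) for start, length in blocks]
print(blocks)
print(records_per_split)
```

Although "bravo" is physically spread over the first two blocks, it is read exactly once, by the first split; every record ends up in exactly one split.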
72. What are the different modes in which Hadoop can run?
- Standalone mode (local mode) – This is the default mode in which Hadoop is configured to run. In this mode, all the components of Hadoop, such as DataNode, NameNode, etc., run as a single Java process, which is useful for debugging.
- Pseudo-distributed mode (single-node cluster) – Hadoop runs on a single node, but each Hadoop daemon runs in a separate Java process, whereas in local mode everything runs in a single Java process.
- Fully distributed mode (multi-node cluster) – The daemons run on separate nodes, forming a multi-node cluster.
73. Why can aggregation not be performed on the mapper side?
- We cannot perform aggregation in the mapper because it requires sorting of data, which occurs only on the reducer side.
- For aggregation, we need the output from all the mapper functions, which is not available during the map phase because map tasks run on different nodes, wherever the data blocks are present.
74. What is the importance of “RecordReader” in Hadoop?
- RecordReader in Hadoop uses the data from the InputSplit as input and converts it into Key-value pairs for Mapper.
- The RecordReader instance used by the MapReduce framework is defined by the InputFormat.
75. What is the purpose of Distributed Cache in a MapReduce Framework?
- The Purpose of Distributed Cache in the MapReduce framework is to cache files when needed by the applications. It caches read-only text files, jar files, archives, etc.
- When you have cached a file for a job, the Hadoop framework will make it available on each and every data node where map/reduce tasks are running.
76. How do reducers communicate with each other in Hadoop?
Reducers always run in isolation and the Hadoop Mapreduce programming paradigm never allows them to communicate with each other.
77. What is Identity Mapper?
- Identity Mapper is a default Mapper class that automatically works when no Mapper is specified in the MapReduce driver class.
- It passes the mapping inputs directly to the output.
- IdentityMapper.class is used as a default value when JobConf.setMapperClass is not set.
78. What are the phases of MapReduce Reducer?
The MapReduce reducer has three phases:
- Shuffle phase – In this phase, the sorted output from the mappers is the input to the Reducer. The framework fetches the relevant partition of the output of all the mappers over HTTP.
- Sort phase – In this phase, the input from the various mappers is sorted by key, so that the framework can group the reducer inputs by key. The shuffle and sort phases occur concurrently.
- Reduce phase – In this phase, the reduce task aggregates the key-value pairs after the shuffle and sort phases. The OutputCollector.collect() method writes the output of the reduce task to the file system.
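The three phases can be walked through on a tiny in-memory example. The mapper outputs below are invented sample data, and plain Python lists stand in for the network transfer and on-disk merge that a real reducer performs.

```python
from itertools import groupby
from operator import itemgetter

# Output of two mappers for the partition handled by this reducer (sample data).
mapper_outputs = [
    [("data", 1), ("hadoop", 1)],
    [("data", 1), ("hdfs", 1), ("data", 1)],
]

# Shuffle: gather this reducer's pairs from every mapper.
shuffled = [pair for output in mapper_outputs for pair in output]

# Sort: order by key so that equal keys sit next to each other.
shuffled.sort(key=itemgetter(0))

# Reduce: aggregate the values of each key group.
reduced = {
    key: sum(value for _, value in group)
    for key, group in groupby(shuffled, key=itemgetter(0))
}
print(reduced)  # {'data': 3, 'hadoop': 1, 'hdfs': 1}
```

Sorting before grouping is what lets the reduce step see all values of a key together in a single pass.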
79. What is the purpose of MapReduce Partitioner in Hadoop?
The MapReduce Partitioner controls the partitioning of the keys of the intermediate mapper output. It makes sure that all the values of a single key go to the same reducer, while allowing the keys to be distributed evenly over the reducers.
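A common way to get both properties is hash partitioning: take a hash of the key, clear the sign bit, and use the remainder modulo the number of reducers. The sketch below imitates that idea in Python; hashing keys the way Java hashes strings is an assumption made so the result is deterministic, and the function names are hypothetical.

```python
def java_string_hash(key: str) -> int:
    """Mimic Java's String.hashCode() as a signed 32-bit integer."""
    h = 0
    for ch in key:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def hash_partition(key: str, num_reducers: int) -> int:
    """Map a key to a reducer index in the range [0, num_reducers)."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers


keys = ["hadoop", "hdfs", "yarn", "hadoop"]
partitions = [hash_partition(k, 3) for k in keys]
print(partitions)
```

Because the partition depends only on the key, both occurrences of "hadoop" land on the same reducer no matter which mapper emitted them.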
80. How will you write a custom partitioner for a Hadoop MapReduce job?
- Build a new class that extends the Partitioner class.
- Override the getPartition method.
- Add the custom partitioner to the job through the job configuration or by using the setPartitionerClass method.
81. What is a Combiner?
A Combiner is a semi-reducer that executes the local reduce task. It receives inputs from the Map class and passes the output key-value pairs to the reducer class.
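The effect of a combiner is easiest to see by counting how many pairs leave the map side. The sketch below sums values per key locally before anything is sent to the reducers; the sample pairs and the function name are assumptions for illustration.

```python
from collections import Counter


def combine(map_output: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Locally sum the values per key before they are sent to the reducers."""
    totals = Counter()
    for key, value in map_output:
        totals[key] += value
    return sorted(totals.items())


map_output = [("data", 1), ("hadoop", 1), ("data", 1), ("data", 1)]
combined = combine(map_output)
print(len(map_output), "->", len(combined), combined)
```

Four pairs shrink to two, which reduces the data shuffled over the network. This only works for operations such as sums and counts, where partial results can be merged again by the reducer.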
82. What is the use of SequenceFileInputFormat in Hadoop?
SequenceFileInputFormat is the input format used for reading in sequence files. It is a compressed binary file format optimized for passing the data between outputs of one MapReduce job to the input of some other MapReduce job.
83. What is Apache Pig?
- Apache Pig is a high-level scripting language used for creating programs to run on Apache Hadoop.
- The language used in this platform is called Pig Latin.
- It executes Hadoop jobs in Apache Spark, MapReduce, etc.
84. What are the benefits of Apache Pig over MapReduce?
- Pig Latin is a high-level scripting language while MapReduce is a low-level data processing paradigm.
- Without many complex Java implementations in MapReduce, programmers can perform the same implementations very easily using Pig Latin.
- Apache Pig decreases the length of the code by approx 20 times (according to Yahoo). Hence, this reduces development time by almost 16 times.
- Pig offers various built-in operators for data operations like filters, joins, sorting, ordering, etc., while to perform these same functions in MapReduce is an enormous task.
85. What are the Hadoop Pig data types?
Hadoop Pig supports both atomic data types and complex data types.
- Atomic data types: These are the basic data types that are used in all the languages like int, string, float, long, etc.
- Complex Data Types: These are Bag, Map, and Tuple.
86. List the various relational operators used in “Pig Latin”?
- ORDER BY
87. What is Apache Hive?
Apache Hive offers a database query interface to Apache Hadoop. It reads, writes, and manages large datasets that are residing in distributed storage and queries through SQL syntax.
88. Where does Hive store table data in HDFS?
/user/hive/warehouse is the default location where Hive stores the table data in HDFS.
89. Can the default “Hive Metastore” be used by multiple users (processes) at the same time?
By default, Hive Metastore uses the Derby database. So, it is not possible for multiple users or processes to access it at the same time.
90. What is a SerDe?
- SerDe is a combination of Serializer and Deserializer. It tells Hive how a record should be processed, allowing Hive to read from and write to a table.
91. What are the differences between Hive and RDBMS?
|Hive||RDBMS|
|Schema on read||Schema on write|
|Batch processing jobs||Real-time jobs|
|Data stored on HDFS||Data stored in the database's internal structure|
|Processed using MapReduce||Processed using the database engine|
92. What is an Apache HBase?
Apache HBase is a multidimensional, column-oriented key-value datastore that runs on top of HDFS (Hadoop Distributed File System). It is designed to provide high table-update rates and a fault-tolerant way to store large collections of sparse data sets.
93. What are the various components of Apache HBase?
- Region Server: These are the worker nodes that handle read, write, update, and delete requests from clients. The Region Server process runs on each and every node of the Hadoop cluster.
- HMaster: It monitors and manages the Region Servers in the Hadoop cluster and handles load balancing.
- ZooKeeper: HBase employs ZooKeeper for coordination in a distributed environment. It keeps track of each and every region server that is present in the HBase cluster.
94. What is WAL in HBase?
- Write-Ahead Log (WAL) is a file storage and it records all changes to data in HBase. It is used for recovering data sets.
- The WAL ensures all the changes to the data can be replayed when a RegionServer crashes or becomes unavailable.
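The write-ahead idea is independent of HBase and can be shown with a toy store: every change is appended to a log before the in-memory state is updated, so the state can be rebuilt by replaying the log. The class and method names below are hypothetical, and a Python list stands in for the durable log file.

```python
from typing import Optional

Edit = tuple[str, str, str]  # (operation, row key, value)


class TinyStore:
    """Hypothetical in-memory store that logs every change before applying it."""

    def __init__(self, wal: Optional[list[Edit]] = None) -> None:
        self.wal: list[Edit] = wal if wal is not None else []
        self.memstore: dict[str, str] = {}

    def put(self, row: str, value: str) -> None:
        self.wal.append(("put", row, value))  # 1. append to the log first
        self.memstore[row] = value            # 2. then update memory

    def delete(self, row: str) -> None:
        self.wal.append(("delete", row, ""))
        self.memstore.pop(row, None)

    @classmethod
    def recover(cls, wal: list[Edit]) -> "TinyStore":
        """Rebuild the in-memory state by replaying the log in order."""
        store = cls(wal=list(wal))
        for op, row, value in wal:
            if op == "put":
                store.memstore[row] = value
            else:
                store.memstore.pop(row, None)
        return store


store = TinyStore()
store.put("row1", "a")
store.put("row2", "b")
store.delete("row1")
store.put("row2", "c")

durable_log = list(store.wal)  # pretend this part survived on disk
del store                      # simulate a crash that loses the memory state
recovered = TinyStore.recover(durable_log)
print(recovered.memstore)  # {'row2': 'c'}
```

Replaying the four logged edits in order reproduces exactly the state the store had before the simulated crash.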
95. What are the differences between the Relational database and HBase?
|Relational database||HBase|
|It is a row-oriented datastore||It is a column-oriented datastore|
|It’s a schema-based database||Its schema is more flexible and less restrictive|
|Suitable for structured data||Suitable for both structured and unstructured data|
|Supports referential integrity||Doesn’t support referential integrity|
|It includes thin tables||It includes sparsely populated tables|
|Accesses records from tables using SQL queries.||Accesses data from HBase tables using APIs and MapReduce.|
96. What is Apache Spark?
Apache Spark is an open-source framework used for real-time data analytics in a distributed computing environment. It is a data processing engine that provides faster analytics than Hadoop MapReduce.
97. Can we build “Spark” with any particular Hadoop version?
Yes, we can build “Spark” for any specific Hadoop version.
98. What is RDD?
RDD (Resilient Distributed Dataset) is the fundamental data structure of Spark. It is a distributed collection of objects; each RDD is divided into logical partitions, which may be computed on several nodes of the cluster.
99. What is Apache ZooKeeper?
Apache ZooKeeper is a centralized service used for managing various operations in a distributed environment. It maintains configuration data, performs synchronization, naming, and grouping.
100. What is Apache Oozie?
Apache Oozie is a scheduler that controls the workflow of Hadoop jobs. There are two kinds of Oozie jobs:
- Oozie Workflow – It is a collection of actions sequenced in a control-dependency DAG (Directed Acyclic Graph) for execution.
- Oozie Coordinator – If you want to trigger workflows based on the availability of data or time then you can use Oozie Coordinator Engine.
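To illustrate what "actions sequenced in a DAG" means, the sketch below orders a made-up set of actions so that each one runs only after the actions it depends on. The action names are hypothetical, and the standard-library `graphlib` module stands in for a workflow engine; real Oozie workflows are defined in XML, not in Python.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each action maps to the set of actions it depends on.
workflow = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "report": {"aggregate"},
    "archive": {"clean"},
}

order = list(TopologicalSorter(workflow).static_order())
print(order)
```

Any order that respects the dependencies is valid, so "archive" may appear before or after "aggregate", but never before "clean". A cycle in the dependencies would make the sorter raise an error, which is why the graph has to be acyclic.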