35+ POPULAR Hadoop [HDFS] Interview Questions & Answers | ACTE
HDFS Interview Questions and Answers


Last updated on 03rd Jul 2020, Blog, Interview Questions

About author

Krishnakumar (Lead Engineer - Director Level )

Krishnakumar is highly experienced in his industry domain, with 10+ years of expertise. He has also been a technical blog writer for the past 4 years, sharing informative content for job seekers.


Hadoop Distributed File System (HDFS) is a system that stores very large datasets. As the most important component of the Hadoop architecture, it is also one of the most important interview topics. In this blog, we provide 35+ Hadoop HDFS interview questions and answers, framed by our company experts who provide training in Hadoop and other Big Data frameworks.

1) What are the core components of Hadoop? 


  • HDFS: Hadoop Distributed File System (HDFS) is a Java-based distributed file system that allows us to store Big Data across multiple nodes in a Hadoop cluster.
  • YARN: YARN is the processing framework in Hadoop that allows multiple data processing engines to manage data stored on a single platform and provides resource management.

2) What are the key features of HDFS?


Some of the prominent features of HDFS are as follows:

  • Cost effective and Scalable: HDFS is generally deployed on commodity hardware, so it is very economical in terms of the cost of ownership of the project. Also, one can scale the cluster by adding more nodes.
  • Variety and Volume of Data: HDFS is all about storing huge data, i.e. terabytes and petabytes of data, and different kinds of data. So, one can store any type of data in HDFS, be it structured, unstructured or semi-structured.
  • Reliability and Fault Tolerance: HDFS divides the given data into data blocks, replicates them and stores them in a distributed fashion across the Hadoop cluster. This makes HDFS very reliable and fault tolerant.
  • High Throughput: Throughput is the amount of work done in unit time. HDFS provides high-throughput access to application data.

3) Explain the HDFS Architecture and list the various HDFS daemons in HDFS cluster?


While listing various HDFS daemons, you should also talk about their roles in brief. Here is how you should answer this question:

Apache Hadoop HDFS Architecture follows a Master/Slave topology where a cluster comprises a single NameNode (master node or daemon) and all the other nodes are DataNodes (slave nodes or daemons). The following daemons run in an HDFS cluster:

  • NameNode: It is the master daemon that maintains and manages the data blocks present in the DataNodes.
  • DataNode: DataNodes are the slave nodes in HDFS. Unlike the NameNode, a DataNode runs on commodity hardware and is responsible for storing the data as blocks.
  • Secondary NameNode: The Secondary NameNode works concurrently with the primary NameNode as a helper daemon. It performs checkpointing.
4) What is checkpointing in Hadoop?


Checkpointing is the process of combining the Edit Logs with the FsImage (File system Image). It is performed by the Secondary NameNode.

5) What is a NameNode in Hadoop?


The NameNode is the master node that manages all the DataNodes (slave nodes). It records the metadata information regarding all the files stored in the cluster (on the DataNodes), e.g. the location of blocks stored, the size of the files, permissions, hierarchy, etc.

6) What is a DataNode?


DataNodes are the slave nodes in HDFS. A DataNode runs on commodity hardware and provides storage for the data. It serves the read and write requests of HDFS clients.

7) Is Namenode machine same as DataNode machine as in terms of hardware?


Unlike the DataNodes, a NameNode is a highly available server that manages the File System Namespace and maintains the metadata information. Therefore, NameNode requires higher RAM for storing the metadata information corresponding to the millions of HDFS files in the memory, whereas the DataNode needs to have a higher disk capacity for storing huge data sets. 

8) What is the difference between NAS (Network Attached Storage) and HDFS?


Here are the key differences between NAS and HDFS:

  • Network-attached storage (NAS) is a file-level computer data storage server connected to a computer network, providing data access to a heterogeneous group of clients. NAS can be either hardware or software that provides a service for storing and accessing files, whereas Hadoop Distributed File System (HDFS) is a distributed file system that stores data using commodity hardware.
  • In HDFS, data blocks are distributed across all the machines in a cluster, whereas in NAS, data is stored on dedicated hardware.
  • HDFS is designed to work with the MapReduce paradigm, where computation is moved to the data. NAS is not suitable for MapReduce since data is stored separately from the computations.
  • HDFS uses commodity hardware, which is cost effective, whereas NAS uses high-end storage devices, which incur a high cost.

9) What is the difference between traditional RDBMS and Hadoop?


This question seems to be very easy, but in an interview these simple questions matter a lot. So, here is how you can answer the very question:

  • Data Types: RDBMS relies on structured data and the schema of the data is always known. Any kind of data can be stored in Hadoop, be it structured, unstructured or semi-structured.
  • Processing: RDBMS provides limited or no processing capabilities. Hadoop allows us to process the data distributed across the cluster in a parallel fashion.
  • Schema on Read vs. Write: RDBMS is based on 'schema on write', where schema validation is done before loading the data. On the contrary, Hadoop follows a 'schema on read' policy.
  • Read/Write Speed: In RDBMS, reads are fast because the schema of the data is already known. In HDFS, writes are fast because no schema validation happens during an HDFS write.
  • Cost: RDBMS is licensed software, so one has to pay for it. Hadoop is an open source framework, so there is no need to pay for the software.
  • Best Fit Use Case: RDBMS is used for OLTP (Online Transactional Processing) systems. Hadoop is used for data discovery, data analytics and OLAP systems.
10) What is throughput? How does HDFS provide good throughput?


Throughput is the amount of work done in a unit time. HDFS provides good throughput because:

  • HDFS is based on the Write Once, Read Many model. This simplifies data coherency issues, as data written once can't be modified, and therefore provides high-throughput data access.
  • In Hadoop, the computation part is moved towards the data which reduces the network congestion and therefore, enhances the overall system throughput.
11) What is Secondary NameNode? Is it a substitute or back up node for the NameNode?


Here, you should also mention the function of the Secondary NameNode while answering the later part of this question so as to provide clarity:

A Secondary NameNode is a helper daemon that performs checkpointing in HDFS. No, it is not a backup or a substitute node for the NameNode. It periodically takes the edit logs (metadata file) from the NameNode and merges them with the FsImage (File system Image) to produce an updated FsImage, as well as to prevent the edit logs from becoming too large.
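To make the merge concrete, here is a minimal Python sketch (illustrative only, not Hadoop code): the FsImage is modeled as a snapshot of namespace metadata, the edit log as a list of operations recorded since that snapshot, and checkpointing replays the log onto the snapshot. The operation names and data shapes are invented for the example.

```python
# Minimal model of checkpointing: replay the edit log onto the FsImage
# snapshot to produce an updated FsImage, so the log can be truncated.

def checkpoint(fsimage, edit_log):
    """Merge edit-log operations into a copy of the FsImage snapshot."""
    new_image = dict(fsimage)                # start from the old snapshot
    for op, path, value in edit_log:
        if op == "create":
            new_image[path] = value          # record new file metadata
        elif op == "delete":
            new_image.pop(path, None)        # drop deleted file metadata
    return new_image                         # the updated FsImage

fsimage = {"/data/a.txt": {"size": 10}}
edit_log = [("create", "/data/b.txt", {"size": 20}),
            ("delete", "/data/a.txt", None)]
print(checkpoint(fsimage, edit_log))   # → {'/data/b.txt': {'size': 20}}
```

After such a merge, the real NameNode can load the new FsImage directly at startup instead of replaying a long edit log.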

12) What do you mean by meta data in HDFS? List the files associated with metadata.


The metadata in HDFS represents the structure of HDFS directories and files. It also includes the various information regarding HDFS directories and files such as ownership, permissions, quotas, and replication factor.

Tip: While listing the files associated with metadata, give a one line definition of each metadata file.

There are two files associated with metadata present in the NameNode:


  • FsImage: It contains the complete state of the file system namespace since the start of the NameNode.
  • EditLogs: It contains all the recent modifications made to the file system with respect to the recent FsImage.
13) What is the problem in having lots of small files in HDFS?


As we know, the NameNode stores the metadata information regarding the file system in RAM. Therefore, the amount of memory puts a limit on the number of files in the HDFS file system. In other words, too many files lead to the generation of too much metadata, and storing this metadata in RAM becomes a challenge. As a rule of thumb, the metadata for a file, block or directory takes about 150 bytes.
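The 150-byte rule makes this easy to estimate. The following Python sketch (the helper function and workload numbers are illustrative assumptions, not Hadoop code) compares the NameNode memory needed for many small files against the same number of blocks packed into fewer large files:

```python
# Back-of-the-envelope estimate, assuming the ~150-bytes-per-object rule of
# thumb above (one namespace object per file plus one per block).
BYTES_PER_OBJECT = 150

def namenode_memory(num_files, blocks_per_file):
    """Approximate NameNode heap consumed by file and block metadata."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT

# 10 million single-block small files vs. the same 10 million blocks packed
# into 100,000 large files: fewer files means noticeably less metadata.
print(namenode_memory(10_000_000, 1))   # → 3000000000 (about 3 GB of heap)
print(namenode_memory(100_000, 100))    # → 1515000000 (about 1.5 GB of heap)
```

The exact byte counts vary by Hadoop version, but the proportions illustrate why millions of tiny files strain the NameNode.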

14) What is a heartbeat in HDFS?


Heartbeats in HDFS are the signals sent by DataNodes to the NameNode to indicate that they are functioning properly (alive). By default, the heartbeat interval is 3 seconds, which can be configured using dfs.heartbeat.interval in hdfs-site.xml.

15) How would you check whether your NameNode is working or not?


There are many ways to check the status of the NameNode. Most commonly, one uses the jps command to check the status of all the daemons running in the HDFS. Alternatively, one can visit the NameNode’s Web UI for the same. 

16) What is a block?


You should begin the answer with a general definition of a block. Then, you should explain in brief about the blocks present in HDFS and also mention their default size. 

Blocks are the smallest contiguous locations on your hard drive where data is stored. HDFS stores each file as blocks and distributes them across the Hadoop cluster. The default size of a block in HDFS is 128 MB (Hadoop 2.x) and 64 MB (Hadoop 1.x), which is much larger than in a Linux file system, where the block size is 4 KB. The reason for having this huge block size is to minimize the cost of seeks and to reduce the metadata generated per block.

17) Suppose there is file of size 514 MB stored in HDFS (Hadoop 2.x) using default block size configuration and default replication factor. Then, how many blocks will be created in total and what will be the size of each block?


The default block size in Hadoop 2.x is 128 MB. So, a file of size 514 MB will be divided into 5 blocks (⌈514 MB / 128 MB⌉ = 5), where the first four blocks will be of 128 MB and the last block will be of 2 MB only. Since we are using the default replication factor, i.e. 3, each block will be replicated thrice. Therefore, we will have 15 blocks in total, where 12 blocks will be of size 128 MB each and 3 blocks of size 2 MB each.
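The arithmetic above can be sketched in a few lines of Python, assuming the Hadoop 2.x defaults (128 MB block size, replication factor 3):

```python
# Sketch of the block arithmetic: split a file into fixed-size blocks and
# count the total replicas the cluster will store.
BLOCK_SIZE_MB = 128
REPLICATION = 3

def split_into_blocks(file_size_mb):
    """Return the sizes of the blocks a file of the given size splits into."""
    full_blocks = file_size_mb // BLOCK_SIZE_MB
    remainder = file_size_mb % BLOCK_SIZE_MB
    return [BLOCK_SIZE_MB] * full_blocks + ([remainder] if remainder else [])

blocks = split_into_blocks(514)
print(blocks)                       # → [128, 128, 128, 128, 2]
print(len(blocks) * REPLICATION)    # → 15
```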

18) How to copy a file into HDFS with a different block size to that of existing block size configuration?


One can copy a file into HDFS with a different block size by using '-Ddfs.blocksize=block_size', where the block_size is specified in bytes.

Let me explain it with an example: Suppose, I want to copy a file called test.txt of size, say of 120 MB, into the HDFS and I want the block size for this file to be 32 MB (33554432 Bytes) instead of the default (128 MB). So, I would issue the following command:

  • hadoop fs -Ddfs.blocksize=33554432 -copyFromLocal /home/edureka/test.txt /sample_hdfs

Now, I can check the HDFS block size associated with this file by:

  • hadoop fs -stat %o /sample_hdfs/test.txt

Else, I can also use the NameNode web UI for seeing the HDFS directory.

19) Can you change the block size of HDFS files?


Yes, I can change the block size of HDFS files by changing the default size parameter present in hdfs-site.xml. But, I will have to restart the cluster for this property change to take effect.

20) What is a block scanner in HDFS?


Block scanner runs periodically on every DataNode to verify whether the data blocks stored are correct or not. The following steps will occur when a corrupted data block is detected by the block scanner:

  • First, the DataNode will report about the corrupted block to the NameNode.
  • Then, NameNode will start the process of creating a new replica using the correct replica of the corrupted block present in other DataNodes.
  • The corrupted data block will not be deleted until the replication count of the correct replicas matches with the replication factor (3 by default).

This whole process allows HDFS to maintain the integrity of the data when a client performs a read operation. One can check the block scanner report using the DataNode's web interface: localhost:50075/blockScannerReport.




    21) HDFS stores data using commodity hardware which has higher chances of failures. So, How HDFS ensures the Fault Tolerance capability of the system?


    HDFS provides fault tolerance by replicating the data blocks and distributing it among different DataNodes across the cluster. By default, this replication factor is set to 3 which is configurable. So, if I store a file of 1 GB in HDFS where the replication factor is set to default i.e. 3, it will finally occupy a total space of 3 GB because of the replication. Now, even if a DataNode fails or a data block gets corrupted, I can retrieve the data from other replicas stored in different DataNodes.  

    22) Replication causes data redundancy and consumes a lot of space, then why is it pursued in HDFS?


    Replication is pursued in HDFS to provide the fault tolerance. And, yes, it will lead to the consumption of a lot of space, but one can always add more nodes to the cluster if required. By the way, in practical clusters, it is very rare to have free space issues as the very first reason to deploy HDFS was to store huge data sets. Also, one can change the replication factor to save HDFS space or use different codec provided by Hadoop to compress the data.

    23) Can we have different replication factor of the existing files in HDFS?


    Yes, one can have different replication factors for the files existing in HDFS. Suppose I have a file named test.xml stored within the sample directory in my HDFS with the replication factor set to 1. Now, the command for changing the replication factor of the test.xml file to 3 is:

    • hadoop fs -setrep -w 3 /sample/test.xml

    Finally, I can check whether the replication factor has been changed or not by using following command:

    • hadoop fs -ls /sample


    • hadoop fsck /sample/test.xml -files
    24) What is a rack awareness algorithm and why is it used in Hadoop?


    The Rack Awareness algorithm in Hadoop ensures that all the block replicas are not stored on the same rack or a single rack. Considering the replication factor is 3, the Rack Awareness algorithm says that the first replica of a block will be stored on a local rack and the next two replicas will be stored on a different (remote) rack, but on different DataNodes within that (remote) rack. There are two reasons for using Rack Awareness:

    • To improve the network performance: In general, you will find greater network bandwidth between machines in the same rack than between machines residing in different racks. So, Rack Awareness helps reduce write traffic between different racks and thus provides better write performance.
    • To prevent loss of data: I don’t have to worry about the data even if an entire rack fails because of the switch failure or power failure. And if one thinks about it, it will make sense, as it is said that never put all your eggs in the same basket.
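As a rough illustration, the placement rule above can be sketched in Python. The rack and DataNode names are made up for the example, and the real NameNode also weighs node load and availability when choosing targets:

```python
# Toy sketch of the default replica placement policy (replication factor 3):
# first replica on the writer's local rack, the next two on one remote rack
# but on different DataNodes within it.

def place_replicas(local_rack, racks):
    """racks: dict mapping rack name -> list of DataNodes in that rack."""
    placements = [(local_rack, racks[local_rack][0])]       # replica 1: local
    remote_rack = next(r for r in racks if r != local_rack)  # pick a remote rack
    placements.append((remote_rack, racks[remote_rack][0]))  # replica 2
    placements.append((remote_rack, racks[remote_rack][1]))  # replica 3, other node
    return placements

racks = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(place_replicas("rack1", racks))
# → [('rack1', 'dn1'), ('rack2', 'dn3'), ('rack2', 'dn4')]
```

Note how losing all of rack2 still leaves one replica on rack1, and vice versa, which is exactly the "eggs in different baskets" argument above.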
    25) How data or a file is written into HDFS?


    The best way to answer this question is to take an example of a client and list the steps that will happen while performing the write without going into much of the details:

    Suppose a client wants to write a file into HDFS. So, the following steps will be performed internally during the whole HDFS write process:

    • The client will divide the files into blocks and will send a write request to the NameNode.
    • For each block, the NameNode will provide the client a list containing the IP address of DataNodes (depending on replication factor, 3 by default) where the data block has to be copied eventually.
    • The client will copy the first block into the first DataNode and then the other copies of the block will be replicated by the DataNodes themselves in a sequential manner.
    26) How is a DataNode identified as saturated?


    When a DataNode is full and has no space left, the NameNode identifies it, since DataNodes report their storage usage to the NameNode through their periodic heartbeats and block reports.

    27) What is the function of ‘job tracker’?

    Job tracker is one of the daemons that runs on the name node; it submits and tracks the MapReduce tasks in Hadoop. There is only one job tracker, which distributes tasks to the various task trackers. When it goes down, all running jobs come to a halt.

    28) What is the role played by task trackers?


    Task trackers are daemons that run on the data nodes. They take care of individual tasks on the slave node as entrusted to them by the job tracker.

    29) How the client communicates with Name node and Data node in HDFS?


    Clients communicate with the name node and data nodes using Hadoop's RPC mechanism over TCP (with a separate streaming protocol for transferring block data), not SSH; SSH is only used by the cluster management scripts to start and stop daemons.

    30) What is a rack in HDFS?


    Rack is the storage location where all the data nodes are put together. Thus it is a physical collection of data nodes stored in a single location.


    31) What Is Mapreduce?


    MapReduce is an algorithm or concept to process huge amounts of data in a faster way. As per its name, you can divide it into Map and Reduce.

    • The main MapReduce job usually splits the input data-set into independent chunks (big data sets into multiple small data sets).
    • MapTask: will process these chunks in a completely parallel manner (one node can process one or more chunks). The framework sorts the outputs of the maps.
    • ReduceTask: the above output will be the input for the reduce tasks, which produce the final result.

    Your business logic would be written in the MapTask and ReduceTask. Typically both the input and the output of the job are stored in a file system (not a database). The framework takes care of scheduling tasks, monitoring them and re-executing the failed tasks.

    32) Explain The Input And Output Data Formats Of The Hadoop Framework?


    The MapReduce framework operates exclusively on key/value pairs, that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.
    See the flow mentioned below:
    (input) <k1, v1> -> map -> <k2, v2> -> combine/sorting -> <k2, v2> -> reduce -> <k3, v3> (output)

    33) How Does An Hadoop Application Look Like Or Their Basic Components?


    Minimally, a Hadoop application would have the following components:

    • Input location of data
    • Output location of processed data.
    • A map task.
    • A reduce task.
    • Job configuration

    The Hadoop job client then submits the job (jar/executable etc.) and configuration to the JobTracker which then assumes the responsibility of distributing the software / configuration to the slaves, scheduling tasks and monitoring them, providing status and diagnostic information to the job-client.

    34) What Are The Restriction To The Key And Value Class ?


    The key and value classes have to be serialized by the framework. To make them serializable, Hadoop provides the Writable interface. As you know from Java itself, the key of a Map should be comparable, hence the key has to implement one more interface, WritableComparable.

    35)Explain The Wordcount Implementation Via Hadoop Framework ?


    We will count the words in all the input files. The flow is as below:

    Input: Assume there are two files, each containing the sentence "Hello World Hello World" (file 1 and file 2).

    Mapper: There would be one mapper per file. For the given sample input, the first map outputs:

    • < Hello, 1>
    • < World, 1>
    • < Hello, 1>
    • < World, 1>

    The second map output:

    • < Hello, 1>
    • < World, 1>
    • < Hello, 1>
    • < World, 1>

    Combiner/Sorting (This is done for each individual map) So output looks like this The output of the first map:

    • < Hello, 2>
    • < World, 2>

    The output of the second map:

    • < Hello, 2>
    • < World, 2>

    Reducer : It sums up the above output and generates the output as below

    • < Hello, 4>
    • < World, 4>


    Final output would look like

    Hello 4 times

    World 4 times
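The map → combine → reduce flow walked through above can be simulated in plain Python to check the final counts (a simulation of the data flow only, not the Hadoop framework itself):

```python
from collections import Counter

# WordCount simulation: each mapper emits (word, 1) pairs, a per-file
# combiner sums counts locally, and the reducer sums the combined outputs
# across all maps.
files = ["Hello World Hello World",   # file 1
         "Hello World Hello World"]   # file 2

def mapper(text):
    return [(word, 1) for word in text.split()]

def combine(pairs):
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return counts

combined = [combine(mapper(f)) for f in files]
print(combined[0])                    # per-file combiner output: Hello=2, World=2

reduced = sum(combined, Counter())    # reducer sums across all maps
print(dict(reduced))                  # → {'Hello': 4, 'World': 4}
```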

    36) Which Interface Needs To Be Implemented To Create Mapper And Reducer For The Hadoop?

    In the older MapReduce API, the org.apache.hadoop.mapred.Mapper and org.apache.hadoop.mapred.Reducer interfaces need to be implemented. (In the newer API, one instead extends the Mapper and Reducer classes of the org.apache.hadoop.mapreduce package.)

    37) What Mapper Does?


    Maps are the individual tasks that transform input records into intermediate records. The transformed intermediate records do not need to be of the same type as the input records. A given input pair may map to zero or many output pairs.

    38) What Is The Use Of Context Object?


    The Context object allows the mapper to interact with the rest of the Hadoop system. It Includes configuration data for the job, as well as interfaces which allow it to emit output.

    39) How Can You Add The Arbitrary Key-value Pairs In Your Mapper?


    You can set arbitrary (key, value) pairs of configuration data in your Job, e.g. with

    job.getConfiguration().set("myKey", "myVal"), and then retrieve this data in your mapper with

    context.getConfiguration().get("myKey"). This kind of functionality is typically done in the Mapper's setup() method.

    40) What Is Next Step After Mapper Or Maptask?


    The output of the Mapper is sorted, and partitions are created for the output. The number of partitions depends on the number of reducers.

    41) How Can We Control Particular Key Should Go In A Specific Reducer?


    Users can control which keys (and hence records) go to which Reducer by implementing a custom Partitioner.
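As an illustration, the default HashPartitioner-style behavior and a hypothetical custom rule can be sketched in Python (the routing rule below is invented for the example; in Hadoop itself this logic lives in a Java Partitioner subclass):

```python
# A key's partition determines which reducer receives it, so all records
# with the same key land on the same reducer.

NUM_REDUCERS = 4

def hash_partition(key):
    # Mimics Hadoop's HashPartitioner: (hash & MAX_INT) % numReduceTasks
    return (hash(key) & 0x7FFFFFFF) % NUM_REDUCERS

def custom_partition(key):
    # Hypothetical custom rule: keys starting with 'a'-'m' go to reducer 0,
    # everything else to reducer 1.
    return 0 if key[0].lower() <= "m" else 1

print(custom_partition("apple"), custom_partition("zebra"))  # → 0 1
```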

    42) What Is Nas?


    NAS is a kind of file system where data resides on one centralized machine and all the cluster members read and write data from that shared storage, which is not as efficient as HDFS.

    43) How Does HDFS Differ From NAS?


    Following are differences between HDFS and NAS

    • In HDFS Data Blocks are distributed across local drives of all machines in a cluster. Whereas in NAS data is stored on dedicated hardware.
    • HDFS is designed to work with MapReduce System, since computation is moved to data. NAS is not suitable for MapReduce since data is stored separately from the computations.
    • HDFS runs on a cluster of machines and provides redundancy using replication protocol. Whereas NAS is provided by a single machine therefore does not provide data redundancy.

    44) Where The Mapper’s Intermediate Data Will Be Stored?


    The mapper output (intermediate data) is stored on the Local file system (NOT HDFS) of each individual mapper nodes. This is typically a temporary directory location which can be setup in config by the Hadoop administrator. The intermediate data is cleaned up after the Hadoop Job completes.

    45) What Is The Meaning Of Speculative Execution In Hadoop? Why Is It Important?


    Speculative execution is a way of coping with individual Machine performance. In large clusters where hundreds or thousands of machines are involved there may be machines which are not performing as fast as others.
    This may result in delays in a full job due to only one machine not performing well. To avoid this, speculative execution in Hadoop can run multiple copies of the same map or reduce task on different slave nodes. The results from the first node to finish are used.
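A toy Python model of the idea (the node delays are invented, and real Hadoop schedules speculative attempts on other cluster nodes rather than using threads like this):

```python
import concurrent.futures
import time

# Toy model of speculative execution: the same task is submitted to a "fast"
# and a "slow" worker, and the result of whichever copy finishes first is used.

def run_task(delay_seconds):
    time.sleep(delay_seconds)          # simulate a node's processing speed
    return "task-result"

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    copies = [pool.submit(run_task, d) for d in (0.01, 0.3)]  # fast, slow copy
    done, _ = concurrent.futures.wait(
        copies, return_when=concurrent.futures.FIRST_COMPLETED)
    result = done.pop().result()       # take the first copy that finished

print(result)  # → task-result
```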

    46) When The Reducers Are Are Started In A Mapreduce Job?


    In a MapReduce job reducers do not start executing the reduce method until the all Map jobs have completed. Reducers start copying intermediate key-value pairs from the mappers as soon as they are available. The programmer defined reduce method is called only after all the mappers have finished.
    If reducers do not start before all mappers finish then why does the progress on MapReduce job shows something like Map(50%) Reduce(10%)? Why reducers progress percentage is displayed when mapper is not finished yet?
    Reducers start copying intermediate key-value pairs from the mappers as soon as they are available. The progress calculation also takes in account the processing of data transfer which is done by reduce process, therefore the reduce progress starts showing up as soon as any intermediate key-value pair for a mapper is available to be transferred to reducer.
    Though the reducer progress is updated still the programmer defined reduce method is called only after all the mappers have finished.

    47) Give A Brief Overview Of Hadoop History?


    In 2002, Doug Cutting created an open source, web crawler project.

    In 2004, Google published MapReduce, GFS papers.

    In 2006, Doug Cutting developed the open source MapReduce and HDFS project (Hadoop).

    In 2008, Yahoo ran 4,000 node Hadoop cluster and Hadoop won terabyte sort benchmark.

    In 2009, Facebook launched SQL support for Hadoop.

    48) How Indexing Is Done In Hdfs?


    Hadoop has its own way of indexing. Depending upon the block size, once the data is stored, HDFS will keep on storing the last part of the data which will say where the next part of the data will be. In fact, this is the base of HDFS.

    49) Does Hadoop Always Require Digital Data To Process?


    Yes. Hadoop always requires digital data to be processed.

    50) On What Basis Namenode Will Decide Which Datanode To Write On?


    As the Namenode has the metadata (information) related to all the data nodes, it knows which datanode is free.


    51) Doesn’t Google Have Its Very Own Version Of Dfs?


    Yes, Google owns a DFS known as “Google File System (GFS)” developed by Google Inc. for its own use.

    52) What Is The Communication Channel Between Client And Namenode/datanode?


    Clients communicate with the NameNode and DataNodes using Hadoop's RPC mechanism over TCP (plus a streaming protocol for block data), not SSH; SSH is used only by the cluster scripts that start and stop the daemons.

    53) What Is A Rack?


    A rack is a storage area with all the datanodes put together; it is a physical collection of datanodes stored at a single location. There can be multiple racks in a single location.

    54) On What Basis Data Will Be Stored On A Rack?


    When the client is ready to load a file into the cluster, the content of the file is divided into blocks. The client then consults the Namenode and gets 3 datanodes for every block of the file, which indicates where each block should be stored. While placing the replicas, the key rule followed is: "for every block of data, two copies will exist in one rack, and the third copy in a different rack". This rule is known as the "Replica Placement Policy".

    55) Do We Need To Place 2nd And 3rd Data In Rack 2 Only?


    Yes, this is to avoid datanode failure.

    56) What If Rack 2 And Datanode Fails?


    If both rack2 and datanode present in rack 1 fails then there is no chance of getting data from it. In order to avoid such situations, we need to replicate that data more number of times instead of replicating only thrice. This can be done by changing the value in replication factor which is set to 3 by default.

    57) Is Map Like A Pointer?


    No, Map is not like a pointer.

    58) Which Are The Two Types Of ‘writes’ In Hdfs?


    There are two types of writes in HDFS: posted and non-posted write. Posted Write is when we write it and forget about it, without worrying about the acknowledgement. It is similar to our traditional Indian post. In a Non-posted Write, we wait for the acknowledgement. It is similar to the today’s courier services. Naturally, non-posted write is more expensive than the posted write. It is much more expensive, though both writes are asynchronous.

    59) Why ‘reading’ Is Done In Parallel And ‘writing’ Is Not In Hdfs?


    Reading is done in parallel because by doing so we can access the data fast. But we do not perform the write operation in parallel. The reason is that if we perform the write operation in parallel, then it might result in data inconsistency. For example, you have a file and two nodes are trying to write data into the file in parallel, then the first node does not know what the second node has written and vice-versa. So, this makes it confusing which data to be stored and accessed.

    60) Can Hadoop Be Compared To Nosql Database Like Cassandra?


    Though NoSQL is the closest technology that can be compared to Hadoop, it has its own pros and cons. There is no DFS in NoSQL. Hadoop is not a database; it's a file system (HDFS) plus a distributed programming framework (MapReduce).

    61) How Can I Install Cloudera Vm In My System?


    You can download the Cloudera QuickStart VM from Cloudera's website and import it into a virtualization tool such as VirtualBox or VMware to get a ready-made single-node Hadoop environment.

    62) What Is Writable & Writablecomparable Interface?


    org.apache.hadoop.io.Writable is a Java interface. Any key or value type in the Hadoop Map-Reduce framework implements this interface. Implementations typically implement a static read(DataInput) method which constructs a new instance, calls readFields(DataInput) and returns the instance.

    org.apache.hadoop.io.WritableComparable is a Java interface. Any type which is to be used as a key in the Hadoop Map-Reduce framework should implement this interface. WritableComparable objects can be compared to each other using Comparators.

    63) What do you understand by Checkpointing?


    Performed by the Secondary NameNode, checkpointing reduces NameNode startup time. The process, in essence, involves merging the edit log into the FsImage to produce a new, compacted FsImage.

    Checkpointing allows the NameNode to load the final in-memory state directly from the FsImage.

    64) How does Apache Hadoop differ from Apache Spark?


    There are several capable cluster computing frameworks for meeting Big Data challenges. Apache Hadoop is an apt solution for analyzing Big Data when efficiently handling batch processing is the priority.

    When the priority, however, is to effectively handle real-time data then we have Apache Spark. Unlike Hadoop, Spark is a low latency computing framework capable of interactively processing data.

    Both Apache Hadoop and Apache Spark are popular cluster computing frameworks, but that doesn't mean they are identical in all respects. In fact, both cater to different Big Data analysis requirements. Following are the various differences between the two:

    • Engine Type – While Hadoop is just a basic data processing engine, Spark is a specialized data analytics engine.
    • Intended For – Hadoop is designed to deal with batch processing with Brobdingnagian volumes of data. Spark, on the other hand, serves the purpose of processing real-time data generated by real-time events, such as social media.
    • Latency – In computing, latency represents the difference between the time when the instruction of the data transfer is given and the time when the data transfer actually starts. Hadoop is a high-latency computing framework, whereas Spark is a low-latency computing framework.
    • Data Processing – Spark processes data interactively, while Hadoop can’t. Data is processed in the batch mode in Hadoop.
    • Complexity/The Ease of Use – Spark is easier to use thanks to an abstraction model. Users can easily process data with high-level operators. Hadoop’s MapReduce model is complex.
    • Job Scheduler Requirement – Spark features in-memory computation. Unlike Hadoop, Spark doesn’t require an external job scheduler.
    • Security Level – Both Hadoop and Spark offer security, but Hadoop is more tightly secured (e.g. via Kerberos authentication), while Spark's built-in security is more basic.
    • Cost – Since MapReduce model provides a cheaper strategy, Hadoop is less costly compared to Spark, which is costlier owing to having an in-memory computing solution.
    65) Enumerate the various configuration parameters that need to be specified in a MapReduce program?


    Following are the various configuration parameters that users need to specify in a MapReduce program:

    • The input format of data
    • Job’s input location in the distributed file system
    • Job’s output location in the distributed file system
    • The output format of data
    • The class containing the map function
    • The class containing the reduce function
    • The JAR file containing the mapper, reducer, and driver classes
    66) What do you understand by Combiner in Hadoop?


    Combiners enhance the efficiency of the MapReduce framework by reducing the amount of data that must be sent to the reducers. A combiner is a mini reducer that performs the local reduce task.

    A combiner receives the input from the mapper on a particular node, and sends the output to the reducer.
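    The idea can be sketched in plain Python (this is not Hadoop's API, just an illustration of why the combiner helps): the combiner pre-aggregates the mapper output, so far fewer records cross the network to the reducer.

```python
from collections import Counter

def mapper(line):
    # Emit a (word, 1) pair for every word in the line.
    return [(word, 1) for word in line.split()]

def combiner(pairs):
    # Local reduce: pre-aggregate counts before shipping to the reducer.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return list(counts.items())

def reducer(all_pairs):
    # Global reduce over the (already shrunken) combiner output.
    totals = Counter()
    for word, n in all_pairs:
        totals[word] += n
    return dict(totals)

lines = ["to be or not to be", "to be is to do"]
mapped = [pair for line in lines for pair in mapper(line)]
combined = combiner(mapped)        # 11 pairs shrink to 6
result = reducer(combined)
print(len(mapped), len(combined))  # 11 6
print(result["to"])                # 4
```

    In a real job the combiner runs per mapper node rather than over the whole map output, but the effect is the same: less intermediate data is shuffled.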

    67) Can you explain SequenceFileInputFormat?


    Sequence files are an efficient intermediate representation for data passing from one MapReduce job to the other. They can be generated as the output of other MapReduce tasks.

    SequenceFile is a compressed binary file format optimized for passing data between the output of one MapReduce job and the input of another. SequenceFileInputFormat is the input format used for reading sequence files.
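    The real SequenceFile format is more involved (header, sync markers, optional compression), but the core idea of a flat binary file of key–value records can be sketched in Python. The length-prefixed layout below is a made-up stand-in, for illustration only:

```python
import io
import struct

def write_records(buf, records):
    # Each record: 4-byte key length, key bytes, 4-byte value length, value bytes.
    for key, value in records:
        k, v = key.encode(), value.encode()
        buf.write(struct.pack(">I", len(k)) + k)
        buf.write(struct.pack(">I", len(v)) + v)

def read_records(buf):
    # Read records back until the buffer is exhausted.
    records = []
    while True:
        header = buf.read(4)
        if not header:
            break
        key = buf.read(struct.unpack(">I", header)[0]).decode()
        value = buf.read(struct.unpack(">I", buf.read(4))[0]).decode()
        records.append((key, value))
    return records

buf = io.BytesIO()
write_records(buf, [("user1", "clickA"), ("user2", "clickB")])
buf.seek(0)
print(read_records(buf))  # [('user1', 'clickA'), ('user2', 'clickB')]
```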

    68) Explain the core methods of a Reducer?


    There are three core methods of a Reducer, explained as follows:

    1. setup() – Called once at the start of the task; used for configuring parameters such as the distributed cache and input data size.
    2. reduce() – Called once per key with the list of values associated with that key.
    3. cleanup() – Called only once at the end of the task, to clean up temporary files and resources.
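    The call order of these three methods can be mimicked in plain Python (a sketch of the lifecycle, not Hadoop's Java API; the class and driver names are invented for the example):

```python
class WordCountReducer:
    def setup(self):
        # Called once before any reduce() call; configure state here.
        self.results = {}

    def reduce(self, key, values):
        # Called once per key with the list of values for that key.
        self.results[key] = sum(values)

    def cleanup(self):
        # Called once at the end of the task; release temporary resources.
        return self.results

def run_reducer(reducer, grouped):
    # Drive the lifecycle: setup, reduce per key, then cleanup.
    reducer.setup()
    for key, values in grouped.items():
        reducer.reduce(key, values)
    return reducer.cleanup()

out = run_reducer(WordCountReducer(), {"to": [1, 1, 1, 1], "be": [1, 1, 1]})
print(out)  # {'to': 4, 'be': 3}
```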
    69) Do you know how to debug Hadoop code?


    Start by checking the list of MapReduce jobs that are currently running. Then check whether one or more orphaned jobs are running; if so, you need to locate the ResourceManager (RM) logs. This can be done as follows:

    • Step 1 – Run ps -ef | grep -i ResourceManager and look for the log directory in the output. Find the job ID and check whether an error message is associated with the orphaned job.
    • Step 2 – Use the RM logs to identify the worker node involved in executing the task for the orphaned job.
    • Step 3 – Log in to the affected node and run the following command:
    • ps -ef | grep -i NodeManager
    • Step 4 – Examine the NodeManager log. Most errors come from the user-level logs for each MapReduce job.
    70) What is Pig?


    Pig is a project of Apache. It is a platform for analyzing huge datasets, and it runs on top of MapReduce.

    71) Use of Pig


    Pig is used for analyzing huge datasets. Data flows are created using Pig in order to analyze data, and the Pig Latin language is used for this purpose.

    72) What is Pig Latin?


    Pig Latin is a script language which is used in Apache Pig to create Data flow in order to analyze data.

    73) What is Hive?


    Hive is a project of Apache Hadoop. Hive is data warehouse software that runs on top of Hadoop.

    74) Use of Hive


    Hive works as a storage layer used to store structured data. It is a very useful and convenient tool for SQL users, as Hive uses HQL.

    75) What is HQL?


    HQL is an abbreviation of Hive Query Language. It is designed for users who are comfortable with SQL. HQL is used to query structured data in Hive.

    76) What is Sqoop?


    Sqoop is short for SQL-to-Hadoop. It is basically a command-line tool to transfer data between Hadoop and SQL databases, in either direction.

    77) Use of Sqoop?


    Sqoop is a CLI tool used to migrate data between an RDBMS and Hadoop, in either direction.

    78) How to delete a directory and its files recursively from HDFS?


    Below is the command:

    • hdfs dfs -rm -r <dir_path>
    79) How to read a file in HDFS?


    Below is the command:

    • hdfs dfs -cat <file_path>
    80) What are the other file systems available in the market?


    FAT, NTFS, and ext (ext2/ext3/ext4) are among the well-known file systems available in the market.

    81) What are the basic steps to be performed while working with big data?


    Below are the basic steps to be followed while working with Big Data:

    • Data Ingestion
    • Data Storage
    • Data Processing
    82) What is data ingestion?


    Before Big Data came into the picture, our data used to reside in RDBMSs. Data ingestion is the process of moving data from one place to another. In the context of Big Data, moving data from an RDBMS into Hadoop is known as data ingestion.
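    A toy illustration in Python: pulling rows out of a relational source (here an in-memory SQLite database, standing in for any RDBMS) and writing them as delimited text, the kind of flat file you would then place into HDFS. The table and column names are made up for the example:

```python
import csv
import io
import sqlite3

# Stand-in relational source with a hypothetical sample table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])

# "Ingest": dump the rows as comma-delimited text, much like a Sqoop import
# writes delimited files into HDFS.
out = io.StringIO()
writer = csv.writer(out)
for row in conn.execute("SELECT id, amount FROM orders ORDER BY id"):
    writer.writerow(row)

print(out.getvalue())
```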

    83) What are JT, TT, and Secondary NameNode in Hadoop architecture?


    JT – JobTracker, which assigns jobs to Task Trackers.

    TT – TaskTracker, which executes the jobs assigned by the JT.

    Secondary NameNode – a helper node that keeps a copy of the NameNode's metadata.

    The NameNode's metadata is periodically checkpointed into the Secondary NameNode (every hour by default).

    84) What are the 2 types of tables in Hive?


    Managed/internal table

    Here, once the table is deleted, both the metadata and the actual data are deleted.

    External table

    Here, once the table is deleted, only the metadata is deleted, not the actual data.

    85) How to create a managed table in hive?


    • hive> create table student(sname string, sid int) row format delimited fields terminated by ',';
    • //hands on
    • hive> describe student;
    86) How to load data into a table created in hive?


    • hive> load data local inpath '/home/training/simple.txt' into table student;
    • //hands on
    • hive> select * from student;
    87) How to create/load data into external tables?


    Without location

    • hive> create external table student(sname string, sid int) row format delimited fields terminated by ',';
    • hive> load data local inpath '/home/training/simple.txt' into table student;

    With Location

    • hive> create external table student(sname string, sid int) row format delimited fields terminated by ',' location '/GangBoard_HDFS';

    Here, no load command is needed, as the table reads data directly from the specified location.

    88) What is a column?


    In column-oriented stores such as HBase, a column is a collection of key-value pairs.

    89) What Is YARN?


    The full form of YARN is 'Yet Another Resource Negotiator.' YARN is a major feature rolled out as part of Hadoop 2.0. It is a large-scale distributed system for running big data applications, and it provides APIs for requesting and working with Hadoop's cluster resources. These APIs are used by Hadoop's distributed processing frameworks, such as MapReduce, Spark, Tez, and many more that run on top of YARN.

     90) What Is Resource Manager In YARN?


    The ResourceManager is the master node in YARN. It keeps track of available resources and runs several critical services, the most important of which is the Scheduler. The ResourceManager arbitrates all the available cluster resources and thus helps manage the distributed applications running on the YARN system.

    Resource Manager has two main parts:

    •  Scheduler
    •  ApplicationsManager
    91) What Is Apache Hadoop YARN?


    Apache Hadoop YARN is the job scheduling and resource management technology in the open-source Hadoop distributed processing framework. One of Apache Hadoop's core components, YARN is responsible for allocating system resources to the various applications running in a Hadoop cluster and scheduling tasks to be executed on different cluster nodes.

    92) What Are The Additional Benefits YARN Brings To Hadoop?


    The YARN framework, introduced in Hadoop 2.0, is designed to take over the cluster management responsibilities of MapReduce, leaving MapReduce free to focus on data processing and thereby streamlining the overall process. In Hadoop 1.x MapReduce there are separate slots for Map and Reduce tasks, whereas in YARN there are no fixed slots: the same container can be used for both Map and Reduce tasks, leading to better utilization.

    93) What Is the Difference between MapReduce 1 And MapReduce 2/YARN?


    In MapReduce 1, Hadoop centralized all tasks in the JobTracker, which allocated resources and scheduled jobs across the cluster. YARN de-centralizes this to ease the load on the JobTracker: the ResourceManager allocates resources to the individual nodes, and NodeManagers schedule the jobs with the ApplicationMaster. YARN permits parallel execution, with the ApplicationMaster managing and executing each job. This approach resolves many JobTracker issues, improves scalability, and optimizes job execution. Moreover, YARN allows many applications to scale up in the distributed environment.

    94) What Are The Scheduling Policies Available In YARN?


    The YARN design has pluggable scheduling policies, which depend on the application's requirements and the use case defined for the running application. You can find the YARN scheduler configuration in the yarn-site.xml file. You can also see the scheduling data of running applications in the ResourceManager UI.

    There are three kinds of scheduling policies that the YARN scheduler follows:

    • FIFO (First In, First Out) Scheduler
    • Capacity Scheduler
    • Fair Scheduler
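    The FIFO policy's main weakness can be shown with a toy Python simulation (not YARN's implementation; job names and container counts are invented): containers are granted strictly in arrival order, so a large job at the head of the queue holds up everything behind it.

```python
from collections import deque

def fifo_schedule(jobs, total_containers):
    """Grant containers to jobs strictly in arrival order."""
    queue = deque(jobs)                 # (name, containers_needed), arrival order
    started, free = [], total_containers
    while queue:
        name, need = queue[0]
        if need > free:                 # head of queue blocks everyone behind it
            break
        queue.popleft()
        free -= need
        started.append(name)
    return started

# Small jobs run once the big job at the head fits.
print(fifo_schedule([("big", 8), ("small1", 1), ("small2", 1)], 10))
# ['big', 'small1', 'small2']

# Nothing runs while the head job cannot fit, however small the jobs behind it.
print(fifo_schedule([("big", 12), ("small1", 1)], 10))
# []
```

    The Capacity and Fair schedulers exist precisely to avoid this head-of-line blocking by partitioning or sharing cluster resources among queues.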
    95) What is ResourceManager in YARN?


    The YARN ResourceManager is a global component or daemon, one per cluster, which manages the requests to and resources across the nodes of the cluster.

    The ResourceManager has two main components – Scheduler and ApplicationsManager

    Scheduler – The Scheduler is responsible for allocating resources to the various running applications, based on the abstract notion of a resource container with a constrained set of resources.

    ApplicationsManager – The ApplicationsManager is responsible for accepting job submissions, negotiating the first container for executing the application-specific ApplicationMaster, and providing the service for restarting the ApplicationMaster container on failure.
