MapReduce Interview Questions and Answers

25+ Top MapReduce Interview Questions & Answers [UPDATED]

Last updated on 3rd Jul 2020

About author

Sanjay (Sr Big Data DevOps Engineer )

Highly experienced in his industry domain with 7+ years of experience. He has also been writing technical blogs for the past 4 years, sharing informative content for job seekers.


If you are into big data, you already know about the popularity of MapReduce. There is a huge demand for MapReduce professionals in the market. It doesn’t matter whether you are a beginner or looking to re-apply for a new job position: going through these popular MapReduce interview questions from ACTE can help you get prepared for the MapReduce interview. So, without any delay, let’s jump into the questions.

1. What is MapReduce?

Ans:

  • MapReduce is at the core of Hadoop. It is a framework that enables Hadoop to scale across multiple clusters while working on big data.
  • The term “MapReduce” comes from the two key tasks in this programming paradigm. The first is “map”, which converts one set of data into another, producing output in a simple key/value-pair format. The “reduce” function then takes the output of “map” as its input and combines those key/value pairs into a smaller set of tuples, as sketched below.
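
What follows is a minimal, illustrative word-count sketch using the org.apache.hadoop.mapreduce API; the class names (WordCountSketch, TokenizerMapper, SumReducer) are assumptions chosen for this example, not part of the question.

  // Minimal word-count sketch (illustrative class names).
  import java.io.IOException;
  import java.util.StringTokenizer;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  public class WordCountSketch {

    // "map": turn each input line into (word, 1) pairs.
    public static class TokenizerMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
          word.set(tokens.nextToken());
          context.write(word, ONE);                 // emit intermediate key/value pair
        }
      }
    }

    // "reduce": combine all counts for the same word into one total.
    public static class SumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
          sum += c.get();
        }
        context.write(word, new IntWritable(sum));  // emit final key/value pair
      }
    }
  }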

2. Can You Elaborate About Mapreduce Job?

Ans:

Based on the configuration, a MapReduce job first splits the input data into independent chunks (blocks / input splits). These chunks are processed by the Map() and Reduce() functions: the map function processes the data first, and its output is then processed by the reduce function. The framework takes care of sorting the map outputs and scheduling the tasks.
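
As an illustration, a driver such as the hedged sketch below is what submits a job like this; it reuses the illustrative WordCountSketch classes from the previous answer, and the input/output paths are assumed to come from the command line.

  // Illustrative driver: configures and submits the job; the framework then
  // handles splitting, scheduling, sorting and shuffling.
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCountDriver {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Job job = Job.getInstance(conf, "word count");
      job.setJarByClass(WordCountDriver.class);
      job.setMapperClass(WordCountSketch.TokenizerMapper.class);
      job.setReducerClass(WordCountSketch.SumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));    // input is split into chunks
      FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output dir must not exist yet
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }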

3. Where is Mapper output stored?

Ans:

The intermediate key/value data produced by the mapper is stored on the local file system of the mapper node. The directory location is set in the configuration file by the Hadoop admin. Once the Hadoop job completes execution, this intermediate data is cleaned up.

4. Explain the differences between a combiner and reducer.

Ans:

A combiner can be considered a mini reducer that performs a local reduce task. It runs on the map output, and its output becomes the input of the reducers. It is usually used for network optimization when the map generates a large number of outputs.

  •  Unlike a reducer, a combiner has a constraint: its input and output key and value types must match the output types of the mapper.
  •  Combiners can operate only on a subset of keys and values, i.e. combiners can be applied only to functions that are commutative and associative.
  •  A combiner gets its input from a single mapper, whereas a reducer can get data from multiple mappers as a result of partitioning.

5. When is it suggested to use a combiner in a MapReduce  job?

Ans:

Combiners are generally used to enhance the efficiency of a MapReduce program by aggregating the intermediate map output locally on the mapper nodes. This reduces the volume of data that needs to be transferred to the reducers. Reducer code can be reused as a combiner only if the operation performed is commutative and associative. However, the execution of a combiner is not guaranteed; see the snippet below.
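
For example, in the word-count driver sketched earlier, the reducer can be reused as a combiner with a single extra line (a sketch under those assumptions, not required boilerplate):

  // Valid only because summing counts is commutative and associative;
  // the framework may run the combiner zero, one or several times.
  job.setCombinerClass(WordCountSketch.SumReducer.class);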

6. Explain what is Hadoop?

Ans:

It is an open-source software framework for storing data and running  applications on clusters of commodity hardware.  It provides enormous processing power and massive storage for any type of data.

7. Mention Hadoop core components?

Ans:

Hadoop core components include,

  •  HDFS
  •  MapReduce

8. What is NameNode in Hadoop?

Ans:

NameNode in Hadoop is where Hadoop stores all the file location information in HDFS. It is the master node that holds this metadata, and the JobTracker usually runs alongside it.

9. Mention what are the data components used by Hadoop?

Ans:

Data components used by Hadoop are

  •  Pig
  •  Hive

10. Mention what is the data storage component used by Hadoop?

Ans:

The data storage component used by Hadoop is HBase.

11. Mention what are the most common input formats defined in Hadoop?

Ans:

The most common input formats defined in Hadoop are;

  •  TextInputFormat
  •  KeyValueInputFormat
  •  SequenceFileInputFormat

12. In Hadoop what is InputSplit?

Ans:

An InputSplit is a logical chunk of input: Hadoop splits the input files into InputSplits and assigns each split to a mapper for processing.

13. For a Hadoop job, how will you write a custom  partitioner?

Ans:

To write a custom partitioner for a Hadoop job, follow these steps (a sketch follows the list):

  •   Create a new class that extends the Partitioner class
  •   Override the getPartition method
  •   In the wrapper that runs the MapReduce job, add the custom partitioner to the job by calling the setPartitionerClass method, or add it to the job as a config file
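
A hedged sketch of such a class is shown below; the class name AlphabetPartitioner and its routing rule (keys starting with a–m go to one partition, the rest to another) are purely illustrative assumptions.

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Partitioner;

  public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
      if (key.getLength() == 0) {
        return 0;                                    // guard for empty keys
      }
      char first = Character.toLowerCase(key.toString().charAt(0));
      return (first <= 'm' ? 0 : 1) % numPartitions; // two logical groups, folded into numPartitions
    }
  }

  // In the driver: job.setPartitionerClass(AlphabetPartitioner.class);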

14. For a job in Hadoop, is it possible to change the number of mappers to be created?

Ans:

No, it is not possible to change the number of mappers to be created. The number of mappers is determined by the number of input splits.

15. Explain what is a sequence file in Hadoop?

Ans:

Sequence files are used to store binary key/value pairs. Unlike a regular compressed file, a sequence file supports splitting even when the data inside the file is compressed.
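
A hedged sketch of writing a sequence file with the SequenceFile.Writer API is shown below; the path /tmp/example.seq and the sample records are assumptions made for this example.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;

  public class SequenceFileWriteExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Path path = new Path("/tmp/example.seq");       // illustrative output path
      try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
          SequenceFile.Writer.file(path),
          SequenceFile.Writer.keyClass(Text.class),
          SequenceFile.Writer.valueClass(IntWritable.class))) {
        writer.append(new Text("apple"), new IntWritable(3));   // binary key/value record
        writer.append(new Text("banana"), new IntWritable(5));
      }
    }
  }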

16. What Is A “Map” In Hadoop?

Ans:

In Hadoop, a map is the first phase of a MapReduce job. A map reads data from an input location and outputs key/value pairs according to the input type.

17. What Is A “Reducer” In Hadoop?

Ans:

In Hadoop, a reducer collects the output generated by the mapper, processes it, and creates a final output of its own.

18. Mention how Hadoop is different from other data processing tools?

Ans:

In Hadoop, you can increase or decrease the number of mappers without worrying about the volume of data to be processed.

19. List out Hadoop’s three configuration files?

Ans:

The three configuration files are

  •  core-site.xml
  •  mapred-site.xml
  •  hdfs-site.xml

20. Explain how you can check whether Namenode is working beside using the jps command?

Ans:

Besides using the jps command, you can check whether the NameNode is working with /etc/init.d/hadoop-0.20-namenode status.


    21. What Is an Identity Mapper?

    Ans:

    IdentityMapper is the default Mapper class provided by Hadoop. When no other Mapper class is defined, IdentityMapper is executed. It simply writes the input data to the output and does not perform any computation or calculation on it. The class name is org.apache.hadoop.mapred.lib.IdentityMapper.

    22. In Hadoop, which file controls reporting in Hadoop?

    Ans:

    In Hadoop, the hadoop-metrics.properties file controls reporting.

    23. For using Hadoop, list the network requirements?

    Ans:

    For using Hadoop the list of network requirements are:

    •    Password-less SSH connection
    •  Secure Shell (SSH) for launching server processes

    24. Mention what is rack awareness?

    Ans:

    Rack awareness is the way in which the namenode determines on how to place blocks based on the rack definitions.

    25. Explain what is a Task Tracker in Hadoop?

    Ans:

    A TaskTracker in Hadoop is a slave node daemon in the cluster that accepts tasks from a JobTracker. It also sends periodic heartbeat messages to the JobTracker to confirm that it is still alive.

    26. Explain how you can debug Hadoop code?

    Ans:

    The popular methods for debugging Hadoop code are:

    • By using web interface provided by Hadoop framework
    • By using Counters

    27. Mention what is the use of Context Object?

    Ans:

    The Context Object enables the mapper to interact with the rest of the Hadoop system. It includes configuration data for the job, as well as interfaces which allow it to emit output.

    28. Mention what is the next step after Mapper or MapTask?

    Ans:

    The next step after Mapper or MapTask is that the outputs of the Mapper are sorted, and partitions will be created for the output.

    29. Mention what is the number of default partitioners in Hadoop?

    Ans:

    In Hadoop, the default partitioner is a “Hash” Partitioner.

    30. Explain the purpose of RecordReader in Hadoop?

    Ans:

    In Hadoop, the RecordReader loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper.


    31. Explain how is data partitioned before it is sent to the reducer if no custom partitioner is defined in Hadoop?

    Ans:

    If no custom partitioner is defined in Hadoop, then a default partitioner computes a hash value for the key and assigns the partition based on the result.
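
    For reference, the default HashPartitioner computes the partition roughly like this one-line Java expression (numReduceTasks being the configured number of reducers):

       int partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;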

    32. Explain what happens when Hadoop spawned 50 tasks for a job and one of the tasks failed?

    Ans:

    Hadoop will restart the task on some other TaskTracker. If the task fails more than the defined limit of attempts, the job is killed.

    33. Mention what is the best way to copy files between HDFS clusters?

    Ans:

    The best way to copy files between HDFS clusters is by using multiple nodes and the distcp command, so the workload is shared.

    34. Mention what is the difference between HDFS and NAS?

    Ans:

    HDFS data blocks are distributed across local drives of all machines in a cluster while NAS data is stored on dedicated hardware.

    35. Explain how indexing in HDFS is done?

    Ans:

    Hadoop has its own way of indexing. Once the data is stored as per the block size, HDFS keeps storing the last part of the data, which indicates where the next part of the data will be.

    36. What Is an Outputcommitter?

    Ans:

    OutputCommitter describes the committing of task output for a MapReduce job. FileOutputCommitter is the default OutputCommitter class available in MapReduce. It performs the following operations:

    •  Create a temporary output directory for the job during initialization.
    • Then, it cleans the job as it removes temporary output directory post job completion.
    •  Sets up the task temporary output.
    •  Identifies whether a task needs commitment. The commit is applied if required.
    •  JobSetup, JobCleanup and TaskCleanup are important tasks during the output commit.

    37. Mention what is the Hadoop MapReduce APIs contract for a key and value class?

    Ans:

    For a key and value class, there are two Hadoop MapReduce API contracts (see the sketch below):

    • The value class must implement the org.apache.hadoop.io.Writable interface
    • The key class must implement the org.apache.hadoop.io.WritableComparable interface
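
    As an illustration, a hypothetical custom key such as the YearKey class below satisfies this contract; the class name and its single integer field are assumptions made for this sketch.

      import java.io.DataInput;
      import java.io.DataOutput;
      import java.io.IOException;
      import org.apache.hadoop.io.WritableComparable;

      public class YearKey implements WritableComparable<YearKey> {
        private int year;

        public YearKey() { }                          // no-arg constructor for the framework
        public YearKey(int year) { this.year = year; }

        @Override
        public void write(DataOutput out) throws IOException {
          out.writeInt(year);                         // serialize (Writable contract)
        }

        @Override
        public void readFields(DataInput in) throws IOException {
          year = in.readInt();                        // deserialize (Writable contract)
        }

        @Override
        public int compareTo(YearKey other) {
          return Integer.compare(year, other.year);   // keys must be comparable for sorting
        }

        @Override
        public int hashCode() { return year; }        // used by the default HashPartitioner

        @Override
        public boolean equals(Object o) {
          return o instanceof YearKey && ((YearKey) o).year == year;
        }
      }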

    38. Mention what are the three modes in which Hadoop can be run?

    Ans:

    The three modes in which Hadoop can be run are

    • Pseudo distributed mode
    •  Standalone (local) mode
    •  Fully distributed mode

    39. Mention what does the text input format do?

    Ans:

    The text input format creates a line object for each line of the input. The key is the byte offset of the line and the value is the contents of the line. The mapper receives the value as a ‘Text’ parameter and the key as a ‘LongWritable’ parameter.

    40. Mention how many InputSplits is made by a Hadoop Framework?

    Ans:

    With a 64 MB block size, Hadoop will make 5 splits for the following files:

    • 1 split for a 64 KB file
    • 2 splits for a 65 MB file
    • 2 splits for a 127 MB file

    41. Mention what is distributed cache in Hadoop?

    Ans:

    Distributed cache in Hadoop is a facility provided by MapReduce framework.  At the time of execution of the job, it is used to cache files.  The Framework copies the necessary files to the slave node before the execution of any task at that node.
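
    As a hedged illustration of the newer API, a file can be added to the distributed cache from the driver and then located inside a task; the path /shared/lookup-table.txt is an assumption for this example.

      // In the driver (inside a main that throws Exception):
      job.addCacheFile(new java.net.URI("/shared/lookup-table.txt"));   // hypothetical shared file

      // In a Mapper's or Reducer's setup() method:
      // java.net.URI[] cached = context.getCacheFiles();               // locate the cached copies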

    42. Explain how Hadoop Classpath plays a vital role in stopping or starting in Hadoop daemons?

    Ans:

    Classpath will consist of a list of directories containing jar files to stop or start daemons.

    43. What is the relationship between Job and Task in Hadoop?

    Ans:

    A single job can be broken down into one or many tasks in Hadoop.

    44.  Is it important for Hadoop MapReduce jobs to be written in Java?

    Ans:

    It is not necessary to write Hadoop MapReduce jobs in Java but users can write MapReduce jobs in any desired programming language like Ruby, Perl, Python, R, Awk, etc. through the Hadoop Streaming API.

    45. What is the process of changing the split size if there is limited storage space on Commodity Hardware?

    Ans:

    If there is limited storage space on commodity hardware, the split size can be changed by implementing the “Custom Splitter”. The call to Custom Splitter can be made from the main method.

    46.  What are the primary phases of a Reducer? 

    Ans:

    The 3 primary phases of a reducer are –

          1) Shuffle

          2) Sort

          3) Reduce

    47. What is a TaskInstance? 

    Ans:

    The actual Hadoop MapReduce jobs that run on each slave node are referred to as Task instances. Every task instance has its own JVM process. For every new task instance, a JVM process is spawned by default for a task.

    48. Can reducers communicate with each other? 

    Ans:

    Reducers always run in isolation and they can never communicate with each other as per the Hadoop MapReduce programming paradigm.

    49. What is the difference between Hadoop and RDBMS?

    Ans:

    •  In an RDBMS, data needs to be pre-processed before being stored, whereas Hadoop requires no pre-processing.
    •  RDBMS is generally used for OLTP processing, whereas Hadoop is used for analytical requirements on huge volumes of data.
    •  A database cluster in an RDBMS uses the same data files in shared storage, whereas in Hadoop the storage is independent on each processing node.

    50. Can we search files using wildcards?

    Ans:

    Yes, it is possible to search for files through wildcards.


    51. How is reporting controlled in hadoop?

    Ans:

    The hadoop-metrics.properties file controls reporting.

    52. Is it possible to rename the output file?

    Ans:

    Yes, this can be done by implementing a multiple-output format class (for example, MultipleOutputs).

    53. What do you understand by compute and storage nodes?

    Ans:

    • Storage node is the system, where the file system resides to store the data for processing.
    • Compute node is the system where the actual business logic is executed.

    54. When should you use a reducer?

    Ans:

    It is possible to process the data without a reducer but when there is a need to combine the output from multiple mappers – reducers are used. Reducers are generally used when shuffle and sort are required.

    55.  What is the role of a MapReduce partitioner?

    Ans:

    The MapReduce partitioner is responsible for ensuring that the map output is evenly distributed over the reducers. By identifying the reducer for a particular key, the mapper output is redirected to the respective reducer.

    56.  What is identity Mapper and identity reducer?

    Ans:

    • IdentityMapper is the default Mapper class in Hadoop. This mapper is executed when no mapper class is defined in the MapReduce job.
    • IdentityReducer is the default Reducer class in Hadoop. This reducer is executed when no reducer class is defined in the MapReduce job. It merely passes the input key/value pairs through to the output directory.

    57.Compare Spark and MapReduce.

    Ans:

    Apache Spark and Hadoop MapReduce are both popular tools to work on big data. Below are some of the main differences between these two.

    •  Speed: Spark is up to 100x faster in memory and 10x faster on disk; MapReduce is comparatively slower than Spark.
    •  Security: Spark only supports shared-secret (password) authentication; Hadoop, in addition to shared-secret authentication, also supports ACLs, which offers better security than Spark.
    •  Dependability: Spark can work on its own without the need for any other software; Hadoop is required for MapReduce to work.
    •  Ease of use: Spark is easy to use, learn and implement, thanks to the APIs available in Java, Python and Scala; MapReduce is harder to learn and implement, as it requires the developer to write extensive Java code.

    58. Why Compute Nodes And The Storage Nodes Are The  Same?

    Ans:

    Compute nodes process the data and storage nodes store the data. By default, the Hadoop framework tries to minimize network traffic, and to achieve that goal it follows the data locality concept: the compute code executes where the data is stored, so the data node and the compute node are the same.

    59. What Is The Configuration Object Importance In Mapreduce?

    Ans:

    The Configuration object is used to set and get parameter name/value pairs from XML files. It is used to initialize values, read parameters from external files, and set them as parameter values. Parameter values set in the program always override values coming from external configuration files, which in turn override Hadoop's default values.
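
    A small hedged sketch of typical Configuration usage follows; the file name my-overrides.xml and the property my.job.threshold are assumptions made for this example.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;

      public class ConfigExample {
        public static void main(String[] args) {
          Configuration conf = new Configuration();            // loads *-default.xml and *-site.xml
          conf.addResource(new Path("my-overrides.xml"));      // hypothetical external config file
          conf.set("my.job.threshold", "10");                  // programmatic value overrides file values
          int threshold = conf.getInt("my.job.threshold", 5);  // read back with a default fallback
          System.out.println("threshold = " + threshold);
        }
      }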

    60. Where Mapreduce Not Recommended?

    Ans:

    MapReduce is not recommended for iterative processing, i.e. repeating over the output in a loop. It is also not well suited to a series of chained MapReduce jobs: each job persists its data to the local disk and then loads it again for the next job, which is a costly operation.

    61. What Is a Namenode And It’s Responsibilities?

    Ans:

    NameNode is the daemon that runs on a particular (master) node. It is the heart of the entire Hadoop system: it stores the metadata in the FsImage and receives all block information in the form of heartbeats.

    62. What Is Jobtracker’s Responsibility?

    Ans:

    The JobTracker schedules the job's tasks on the slaves, and the slaves execute the tasks as directed by the JobTracker. It also monitors the tasks and re-executes any that fail.

    63. What Are The Jobtracker & Tasktracker In Mapreduce?

    Ans:

    The MapReduce framework consists of a single JobTracker per cluster and one TaskTracker per node. A cluster usually has multiple nodes, so each cluster has a single JobTracker and multiple TaskTrackers. The JobTracker schedules jobs and monitors the TaskTrackers; if a TaskTracker fails to execute tasks, the JobTracker tries to re-execute the failed tasks elsewhere.

    A TaskTracker follows the JobTracker's instructions and executes the tasks. As a slave daemon, it reports the job status to the master JobTracker in the form of heartbeats.

    64. What Is Job Scheduling Importance In Hadoop Mapreduce?

    Ans:

    Scheduling is a systematic procedure for allocating resources in the best possible way among multiple tasks. The Hadoop TaskTracker performs many procedures, and sometimes a particular procedure should finish quickly and be given higher priority; this is where job schedulers come into the picture. The default scheduler is FIFO. FIFO, the Fair Scheduler and the Capacity Scheduler are the most popular schedulers in Hadoop.

    65. When Used Reducer?

    Ans:

    A reducer is used to combine the output of multiple mappers. The reducer has 3 primary phases: shuffle, sort and reduce. It is possible to process data without a reducer, but one is used whenever shuffle and sort are required.

    66. What Is the Replication Factor?

    Ans:

    The replication factor is the number of nodes within a cluster on which a chunk (block) of data is stored. By default the replication factor is 3, but it is possible to change it. Each file is automatically split into blocks and spread across the cluster.

    67. Where The Shuffle And Sort Process Does?

    Ans:

    After the mapper generates its output, the intermediate data is temporarily stored on the local file system. Usually this temporary location is configured in core-site.xml. The Hadoop framework aggregates and sorts this intermediate data and then passes it on to be processed by the reduce function. The framework deletes this temporary data from the local system after the job completes.

    68. Java Is Mandatory To Write Mapreduce Jobs?

    Ans:

    No. By default Hadoop is implemented in Java™, but MapReduce applications need not be written in Java. Hadoop supports Python, Ruby, C++ and other programming languages. The Hadoop Streaming API allows users to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer, and Hadoop Pipes allows programmers to implement MapReduce applications using C++.

    69. What Methods Can Control The Map And Reduce Function’s Output?

    Ans:

    The final (reduce) output key and value types are set with setOutputKeyClass() and setOutputValueClass().

    If the map output types are different from the final output types, they can be set with setMapOutputKeyClass() and setMapOutputValueClass().
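
    For example, in a driver such as the sketch given earlier, the four calls look like this (the DoubleWritable final value type is an assumption chosen for illustration):

      job.setMapOutputKeyClass(Text.class);            // intermediate key type
      job.setMapOutputValueClass(IntWritable.class);   // intermediate value type
      job.setOutputKeyClass(Text.class);               // final key type
      job.setOutputValueClass(DoubleWritable.class);   // final value type (illustrative)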

    70. What Is The Main Difference Between Mapper And Reducer?

    Ans:

    • Map method is called separately for each key/value that has been processed. It processes input key/value pairs and emits intermediate key/value pairs.
    • Reduce method is called separately for each key/values list pair. It processes intermediate key/value pairs and emits final key/value pairs.
    • Both have an initialization step that runs before any other method is called; it takes no key/value parameters and produces no output.

    71. What Is Difference Between Mapside Join And Reduce Side Join? Or When We Go To Map Side Join And Reduce Join?

    Ans:

    Joining multiple tables on the mapper side is called a map-side join. Note that a map-side join has strict requirements: the inputs must be properly formatted, partitioned identically and sorted. It is typically used when one of the datasets is small.

    Joining the tables on the reducer side is called a reduce-side join. If you have large tables to join, for example one table with a large number of rows and columns and another with only a few, the data goes through the reduce-side join. It is the most general way to join multiple tables.

    72. What Happens If The Number Of Reducer Is 0?

    Ans:

    A reducer count of 0 is also a valid configuration in MapReduce. In this scenario no reducer executes, so the mapper output is treated as the final output, and Hadoop stores it directly in the output folder.
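
    In a driver like the sketch shown earlier, this is a single call:

      job.setNumReduceTasks(0);   // map-only job: map output is written straight to the output folder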

    73. When We Are Going To Combine? Why Is It Recommendable?

    Ans:

    Mappers and reducers are independent; they don't talk to each other. When the function being applied is commutative (a.b = b.a) and associative (a.(b.c) = (a.b).c), we use a combiner to optimize the MapReduce process. Many MapReduce jobs are limited by network bandwidth, so by default the Hadoop framework tries to minimize wasted network bandwidth. To achieve that goal, MapReduce allows a user-defined “combiner function” to run on the map output. It is a MapReduce optimization technique, but it is optional.

    74. What Is The Main Difference Between Mapreduce Combiner And Reducer?

    Ans:

    Both the combiner and the reducer are optional, but both are frequently used in MapReduce. There are three main differences:

    • A combiner gets its input from only one mapper, while a reducer gets input from multiple mappers.
    •  If a full aggregation is required, use a reducer; but if the function follows the commutative (a.b = b.a) and associative (a.(b.c) = (a.b).c) laws, a combiner can be used as well.
    •   Input and output key and value types must be the same in a combiner, whereas a reducer can accept one input type and produce a different output type.

    75. What Is a Combiner?

    Ans:

    A combiner performs a local aggregation of the key/value pairs produced by the mapper. It reduces a lot of duplicated data transfer between nodes and so eventually optimizes job performance. The framework decides whether the combiner runs zero, one or multiple times. It is not suitable for non-associative operations such as computing a mean.

    76. What Is Partition?

    Ans:

    After the map output (and the optional combiner), the partitioner controls how keys are routed for sorting and shuffling. The partitioner divides the intermediate data according to the number of reducers, so that all the data in a single partition is executed by a single reducer. Each partition is processed by exactly one reducer; whenever reducers are used, partitioning happens automatically.

    77. When We Go To Partition?

    Ans:

    By default Hive reads the entire dataset even if the application only needs a slice of the data, which is a bottleneck for MapReduce jobs. So Hive offers a special option called partitions: when you are creating a table, Hive can partition the table based on the requirement.

    78. What Are The Important Steps When You Are Partitioning The Table?

    Ans:

    Don't over-partition the data into too many small partitions; that is overhead for the NameNode.

    With dynamic partitioning in strict mode, at least one static partition must exist; to allow all partitions to be dynamic, use the following commands:

      SET hive.exec.dynamic.partition = true;

      SET hive.exec.dynamic.partition.mode = nonstrict;

    First load the data into a non-partitioned table, then load that data into the partitioned table. It is not possible to load data from a local file directly into a partitioned table.

    insert overwrite table table_name partition(year) select * from non_partitioned_table;

    79. Can You Elaborate Mapreduce Job Architecture?

    Ans:

    First, the Hadoop programmer submits the MapReduce program to the JobClient.

    The JobClient requests a job ID from the JobTracker; the JobTracker provides a job ID of the form Job_HadoopStartedtime_00001, which is unique.

    Once the JobClient receives the job ID, it copies the job resources (job.xml, job.jar) to the file system (HDFS) and submits the job to the JobTracker. The JobTracker initializes the job and schedules it.

    Based on the configuration, the job's input is divided into input splits. Each TaskTracker retrieves the job resources from HDFS and launches a child JVM; in this child JVM it runs the map and reduce tasks and notifies the JobTracker of the job status.

    80. Why Task Tracker Launch Child Jvm?

    Ans:

    Quite frequently, Hadoop developers mistakenly submit wrong jobs or jobs with bugs. If the TaskTracker used an existing JVM, a bad task might interrupt that main JVM and other tasks would be affected. With a child JVM, if a task tries to damage existing resources, the TaskTracker simply kills that child JVM and retries or relaunches a new child JVM.

    81. Why Jobclient, Job Tracker Submits Job Resources To File System?

    Ans:

    Data locality. Moving computation is cheaper than moving data, so the logic/computation (the jar file) and the splits are copied to where the data is available, namely the DataNodes of the file system. Every resource is copied to where the data resides.

    82. How Many Mappers And Reducers Can Run?

    Ans:

    By default Hadoop can run 2 mappers and 2 reducers on one DataNode; each node has 2 map slots and 2 reducer slots. It is possible to change these default values in the mapred-site.xml configuration file.

    83. What Is a Chain Mapper?

    Ans:

    ChainMapper is a special mapper class that runs a set of mappers in a chain within a single map task: one mapper's output acts as the next mapper's input, so a number of mappers are connected in chain fashion.

    84. How To Configure The Split Value?

    Ans:

    By default the block size is 64 MB, but to process the data the job tracker splits the input. Hadoop architects use this formula to work out the split size:

       split size = max(min_split_size, min(max_split_size, block_size));

       By default, split size = block size.

       The number of splits always equals the number of mappers.

    Applying the formula with min_split_size = 512 KB, block_size = 64 MB and max_split_size depending on the environment (assume 10 GB):

       split size = max(512 KB, min(10 GB, 64 MB)) = max(512 KB, 64 MB) = 64 MB.
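
    Expressed in Java, the same rule is the one-liner below:

       long splitSize = Math.max(minSplitSize, Math.min(maxSplitSize, blockSize));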

    85. How Much Ram Required To Process 64mb Data?

    Ans:

    Let's assume a 64 MB block size and that the system runs 2 mappers and 2 reducers, so 64 * 4 = 256 MB of memory; the OS takes at least 30% extra space, so roughly 256 + 80 = about 336 MB of RAM is required to process one chunk of data. Processing unstructured data in this way requires even more memory.

    86. What is distributed Cache in MapReduce Framework? Explain.

    Ans:

    Distributed cache is an important part of the MapReduce framework. It is used to cache files across operations during the time of execution and ensures that tasks are performed faster. The framework uses the distributed cache to store important files that are frequently required to execute tasks at that particular node.

    87. What is heartbeat in HDFS? Explain.

    Ans:

    A heartbeat in HDFS is a signalling mechanism used to indicate whether a node is active. For example, a DataNode uses heartbeats to tell the NameNode that it is alive; similarly, a TaskTracker uses heartbeats to the JobTracker for the same purpose.

    88. What happens when a DataNode fails?

    Ans:

    As big data processing is data and time sensitive, there are backup processes if DataNode fails. Once a DataNode fails, a new replication pipeline is created. The pipeline takes over the write process and resumes from where it failed. The whole process is governed by NameNode which constantly observes if any of the blocks is under-replicated or not.

    89. Can you tell us how many daemon processes run on a Hadoop system?

    Ans:

    There are five separate daemon processes in a Hadoop system, and each daemon process has its own JVM. Three of them run on the master node and two run on the slave nodes. They are as below.

    Master Nodes

    •  NameNode – maintains and stores the HDFS metadata.
    •  Secondary NameNode – works for the NameNode and performs housekeeping functions.
    •  JobTracker – takes care of the main MapReduce jobs and distributes tasks to the machines running TaskTrackers.

    Slave Nodes

    •   DataNode – manages HDFS data blocks.
    • TaskTracker – manages the individual Reduce and Map tasks.

    90. What Is Difference Between Block And Split?

    Ans:

    Block: how much of the data is stored together as one chunk is called a block.

    Split: how much of the data one mapper processes is called a split.

    91. Why Hadoop Framework Reads A File Parallel Why Not Sequential?

    Ans:

    To retrieve data faster, Hadoop reads data in parallel; that is the main reason it can access data quickly. Writes, however, happen sequentially rather than in parallel, because with parallel writes one node could overwrite another. Parallel processes are independent, with no coordination between two nodes, so if data were written in parallel it would not be possible to know where the next chunk of data lives.

    For example: 100 MB of data is written as one 64 MB block and one 36 MB block. If the data were written in parallel, the first block would not know where the remaining data is. So Hadoop reads in parallel and writes sequentially.

    92. If I Am Changing the Block Size From 64 To 128, Then What Happen?

    Ans:

    Changing the block size does not affect existing data. After the change, every new file is chunked with a 128 MB block size. This means old data stays in 64 MB chunks, while new data is stored in 128 MB blocks.

    93. What Is Issplitable()?

    Ans:

    By default this value is true. It tells the input format whether the data may be split. For unstructured data it is not advisable to split the file, so the entire file should be processed as a single split; to do that, override isSplitable() to return false, as in the sketch below.
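
    A hedged sketch of such an input format follows; the class name WholeFileTextInputFormat is an assumption made for this example.

      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.mapreduce.JobContext;
      import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

      public class WholeFileTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
          return false;   // never split: each file is processed by a single mapper
        }
      }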

    94. How Much Hadoop Allows Maximum Block Size And Minimum Block Size?

    Ans:

    Minimum: 512 bytes, which is the local OS file system block size. It cannot be decreased below that.

    Maximum: depends on the environment; there is no upper bound.

    95. What Are The Job Resource Files?

    Ans:

    job.xml and job.jar are the core resources needed to process the job. The JobClient copies these resources to HDFS.

    96. What’s The Mapreduce Job Consists?

    Ans:

    A MapReduce job is a unit of work that the client wants to be performed. It consists of the input data, the MapReduce program in a jar file, and configuration settings in XML files. Hadoop runs this job by dividing it into tasks with the help of the JobTracker.

    97. What Is The Data Locality?

    Ans:

    Processing the data wherever the data resides, i.e. moving the computation/processing to where the data is available, is called data locality. “Moving computation is cheaper than moving data” is the goal, and data locality is how it is achieved. It is possible when the data is splittable, which is true by default.

    98. What Is Speculative Execution?

    Ans:

    Hadoop runs its processes on commodity hardware, so it is possible for a machine to fail, for example if it is low on memory; if the system fails, the process fails too, which is not acceptable. Speculative execution is a performance optimization technique: the computation/logic is distributed to multiple systems, and the result of whichever system executes it fastest is used. By default this option is enabled, so even if one system crashes it is not a problem; the framework takes the result from another system.

    For example: the logic is distributed to systems A, B, C and D and must complete within a time limit.

    Systems A, B, C and D take 10, 8, 9 and 12 minutes respectively, running simultaneously. So the result from system B is taken, and the framework takes care of killing the processes on the remaining systems.

    99. When We Go To Reducer?

    Ans:

    A reducer is needed only when sort and shuffle are required; otherwise partitioning is not needed either. For a simple filter there is no need to sort and shuffle, so the operation can be done without a reducer.
