Hadoop Interview Questions

These Hadoop interview questions and answers are designed for both beginners and professionals. Here you can find answers to various interview questions related to Hadoop.

Name the most common InputFormats defined in Hadoop. Which one is the default?

The three most common InputFormats defined in Hadoop are:

  • TextInputFormat
  • KeyValueTextInputFormat
  • SequenceFileInputFormat

TextInputFormat is the Hadoop default.

What is InputSplit in Hadoop?

When a Hadoop job runs, it divides the input files into chunks and assigns each chunk to a mapper for processing. Each such chunk is called an InputSplit.

What platform and Java version are required to run Hadoop?

Java 1.6.x or higher is suitable for Hadoop, preferably from Sun. Linux and Windows are the supported operating systems for Hadoop, but BSD, Mac OS X and Solaris are also known to work.

What kind of hardware is best for Hadoop?

Hadoop can run on dual-processor/dual-core machines with 4-8 GB of ECC RAM. The right configuration depends on the needs of the workflow.

How is file splitting invoked in the Hadoop framework?

The Hadoop framework invokes it by calling the getSplits() method of the InputFormat class (such as FileInputFormat) defined by the user.

What is the purpose of RecordReader in Hadoop?

The InputSplit defines a slice of work but does not describe how to access it. The RecordReader class actually loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper. The RecordReader instance is defined by the InputFormat.
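
As a rough illustration of that idea, the following Python sketch mimics what a line-oriented reader does for one split: it turns a byte range of a file into (byte offset, line) pairs, which is the key/value shape that TextInputFormat hands to the Mapper. The function name read_records and the in-memory sample data are hypothetical, and the handling of lines that cross a split boundary is simplified compared with Hadoop's real reader.

```python
def read_records(data: bytes, start: int, length: int):
    """Yield (byte_offset, line) pairs for the split [start, start + length)."""
    end = start + length
    pos = start
    if start != 0:
        # A split that does not begin at offset 0 skips its first line:
        # the reader of the previous split is responsible for it.
        newline = data.find(b"\n", start)
        pos = len(data) if newline == -1 else newline + 1
    # Keep reading while the line starts at or before the end of the split,
    # so a line that crosses the boundary is read in full by this split.
    while pos < len(data) and pos <= end:
        newline = data.find(b"\n", pos)
        line_end = len(data) if newline == -1 else newline
        yield pos, data[pos:line_end].decode("utf-8")
        pos = line_end + 1


# Hypothetical 17-byte "file" cut into two splits of 8 and 9 bytes.
DATA = b"alpha\nbeta\ngamma\n"
first_split = list(read_records(DATA, 0, 8))
second_split = list(read_records(DATA, 8, 9))
```

Here the line "beta" straddles the boundary at byte 8, so it is read by the first split and skipped by the second; every line ends up in exactly one split.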

What is InputSplit in Hadoop? Explain.

When a Hadoop job runs, it divides the input files into chunks and assigns each chunk to a mapper for processing. Each such chunk is called an InputSplit.

What is a Combiner?

The Combiner is a ‘mini-reduce’ process which operates only on data generated by a mapper. The Combiner will receive as input all data emitted by the Mapper instances on a given node. The output from the Combiner is then sent to the Reducers, instead of the output from the Mappers.
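
A minimal sketch of that idea for a word count, written in plain Python rather than against the Hadoop API: the combiner sums the (word, 1) pairs emitted by the mappers on one node, so fewer pairs have to travel to the reducers. The names map_words and combine and the sample lines are hypothetical. Summation is a safe choice here because it is associative and commutative; Hadoop does not guarantee how many times a combiner runs, if at all, so a combiner should only perform operations of that kind.

```python
from collections import defaultdict


def map_words(line):
    """Mapper: emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Combiner: locally sum the values of identical keys on one node."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


# Lines processed by the mappers of a single (hypothetical) node.
node_lines = ["to be or not to be", "to do"]
mapper_output = [pair for line in node_lines for pair in map_words(line)]
combined = combine(mapper_output)

print(len(mapper_output), "pairs before combining,", len(combined), "after")
```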

How many InputSplits are made by the Hadoop framework?

Assuming three input files of 64 KB, 65 MB and 127 MB and a block size of 64 MB, Hadoop will make 5 splits, as follows:

  • One split for the 64 KB file
  • Two splits for the 65 MB file, and
  • Two splits for the 127 MB file
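
The arithmetic behind that count can be sketched as follows, assuming a 64 MB split size (the answer does not state one, so this is an assumption) and one split per started block. The helper count_splits is hypothetical. Its slop parameter imitates the 10% tolerance that Hadoop's FileInputFormat allows on the last split; with that tolerance the 65 MB file would stay in a single split, so the total seen in practice can be lower than 5.

```python
KB = 1024
MB = 1024 * KB

SPLIT_SIZE = 64 * MB                     # assumed split/block size
FILE_SIZES = [64 * KB, 65 * MB, 127 * MB]


def count_splits(file_size, split_size, slop=1.0):
    """Count the splits for one file (empty files are ignored in this sketch).

    A full-size split is cut off while the remaining bytes exceed
    slop * split_size; whatever is left over forms the last split.
    """
    splits = 0
    remaining = file_size
    while remaining / split_size > slop:
        splits += 1
        remaining -= split_size
    if remaining > 0:
        splits += 1
    return splits


per_block = [count_splits(size, SPLIT_SIZE) for size in FILE_SIZES]
with_slop = [count_splits(size, SPLIT_SIZE, slop=1.1) for size in FILE_SIZES]

print("one split per block:", per_block, "total", sum(per_block))
print("with 10% slop:      ", with_slop, "total", sum(with_slop))
```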

What is JobTracker?

JobTracker is the service within Hadoop that runs MapReduce jobs on the cluster.

What are some typical functions of JobTracker?

The following are some typical tasks of JobTracker:

  • It accepts jobs from clients.
  • It talks to the NameNode to determine the location of the data.
  • It locates TaskTracker nodes with available slots at or near the data.
  • It submits the work to the chosen TaskTracker nodes and monitors the progress of each task by receiving heartbeat signals from the TaskTrackers.
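
The placement step in the list above can be pictured with a small sketch. This is not JobTracker's actual code: the TaskTracker record, the pick_tracker function, and the host and rack names are all hypothetical, and the preference order (same host, then same rack, then any node with a free slot) is a simplified reading of "at or near the data".

```python
from dataclasses import dataclass


@dataclass
class TaskTracker:
    host: str
    rack: str
    free_slots: int


def pick_tracker(trackers, data_hosts, data_racks):
    """Return the tracker with a free slot that is closest to the data."""
    candidates = [t for t in trackers if t.free_slots > 0]
    if not candidates:
        return None  # no capacity anywhere; the task has to wait

    def distance(tracker):
        if tracker.host in data_hosts:
            return 0  # data-local: the block is stored on this host
        if tracker.rack in data_racks:
            return 1  # rack-local: the block is in the same rack
        return 2      # remote: the block must cross racks

    return min(candidates, key=distance)


trackers = [
    TaskTracker("node1", "rackA", free_slots=0),
    TaskTracker("node2", "rackA", free_slots=2),
    TaskTracker("node3", "rackB", free_slots=1),
]

# The block lives on node1, which is busy, so a rack-local node is chosen.
chosen = pick_tracker(trackers, data_hosts={"node1"}, data_racks={"rackA"})
print(chosen.host)
```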

What is a Map/Reduce job in Hadoop?

Map/Reduce is a programming paradigm that allows massive scalability across thousands of servers. MapReduce actually refers to two separate and distinct tasks that Hadoop performs. The first is the map job, which takes a set of data and converts it into another set of data. The second is the reduce job, which takes the output from the map as input and combines those data tuples into a smaller set of tuples.
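
The two steps can be illustrated with an in-memory word count. This is only a sketch of the paradigm in plain Python, not Hadoop code; the function names map_phase, shuffle, reduce_phase and word_count are hypothetical, and the grouping step in the middle stands in for the shuffle that the framework performs between map and reduce.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: convert each input line into (word, 1) tuples."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group all values that share a key, as the framework does for reducers."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into one smaller result."""
    return {key: sum(values) for key, values in sorted(groups.items())}


def word_count(lines):
    return reduce_phase(shuffle(map_phase(lines)))


counts = word_count(["Hadoop runs MapReduce", "MapReduce runs jobs"])
print(counts)
```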

Define TaskTracker.

A TaskTracker is a node in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker.

What is Hadoop Streaming?

Hadoop Streaming is a utility that allows you to create and run Map/Reduce jobs. It is a generic API that allows programs written in any language to be used as the Hadoop mapper or reducer.
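
As an example of what such a program might look like, here is a sketch of a word-count mapper in Python. Hadoop Streaming feeds input records to the program line by line on standard input and reads tab-separated key/value lines from its standard output; the function name run_mapper is hypothetical, and the demo uses in-memory streams in place of the real standard input and output.

```python
import io


def run_mapper(lines, out):
    """Emit one "word<TAB>1" line for every word read from the input."""
    for line in lines:
        for word in line.split():
            out.write(f"{word}\t1\n")


# A real streaming script would end with:
#     run_mapper(sys.stdin, sys.stdout)
# For this demo, in-memory streams stand in for stdin and stdout.
demo_out = io.StringIO()
run_mapper(io.StringIO("hello hadoop\nhello streaming\n"), demo_out)
print(demo_out.getvalue(), end="")
```

A script like this would be handed to the streaming utility through its -mapper option, with a similar script for the reducer.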

What is a combiner in Hadoop?

A Combiner is a mini-reduce process which operates only on data generated by a Mapper. When the Mapper emits data, the Combiner receives it as input and sends its output to the Reducer.

Is it necessary to know Java to learn Hadoop?

A background in any programming language such as C, C++, PHP, Python or Java is really helpful. If you have no Java experience, however, you will need to learn Java and also acquire a basic knowledge of SQL.

How to debug Hadoop code?

There are many ways to debug Hadoop code, but the most popular methods are:

  • By using Counters.
  • By using the web interface provided by the Hadoop framework.
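
As an illustration of the first approach, the sketch below counts malformed records while mapping instead of failing on them, which is a typical debugging use of a counter. It uses Python's collections.Counter as a stand-in for Hadoop's counter facility; the parse_records function, the counter names, and the sample data are hypothetical.

```python
from collections import Counter


def parse_records(lines, counters):
    """Parse "name,age" lines, counting good and malformed records."""
    parsed = []
    for line in lines:
        fields = line.split(",")
        if len(fields) != 2 or not fields[1].strip().isdigit():
            # Count the bad record and move on instead of crashing the task.
            counters["MALFORMED_RECORDS"] += 1
            continue
        counters["GOOD_RECORDS"] += 1
        parsed.append((fields[0].strip(), int(fields[1])))
    return parsed


counters = Counter()
records = parse_records(["alice,30", "bob", "carol,x", "dave, 41"], counters)
print(records)
print(dict(counters))
```

In a real job the same counts would be reported through the framework's counters, where they can be inspected after the run, for example in the job's web interface.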

What is the distributed cache in Hadoop?

The distributed cache is a facility provided by the MapReduce framework to cache files (text, archives, etc.) needed during the execution of a job. The framework copies the necessary files to a slave node before any task is executed on that node.

Copyright ©2015 Hub4Tech.com, All Rights Reserved. Hub4Tech™ is registered trademark of Hub4tech Portal Services Pvt. Ltd.
All trademarks and logos appearing on this website are the property of their respective owners.