Hadoop is a large-scale distributed batch processing infrastructure. While it can be used on a single machine, its true power lies in its ability to scale to hundreds or thousands of computers, each with several processor cores. Hadoop is also designed to efficiently distribute large amounts of work across a set of machines.
How large an amount of work? Orders of magnitude larger than many existing systems can handle. Hundreds of gigabytes of data constitute the low end of Hadoop-scale; Hadoop is built to process "web-scale" data on the order of hundreds of gigabytes to terabytes or petabytes. At this scale, the input data set is unlikely to fit on a single computer's hard drive, much less in memory. Hadoop therefore includes a distributed file system that breaks up the input data and sends pieces of it to several machines in the cluster to hold. The problem can then be processed in parallel using all of the machines in the cluster, computing output results as efficiently as possible.
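HDFS makes this data placement visible through its Java API. The following is a minimal sketch, assuming a Hadoop client classpath and a hypothetical file /data/input.txt already stored in HDFS; it asks the file system which blocks the file was split into and which machines hold each block:

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS and other settings from the cluster's core-site.xml.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical input file; substitute any path that exists in your cluster.
    Path input = new Path("/data/input.txt");
    FileStatus status = fs.getFileStatus(input);

    // Each BlockLocation describes one block of the file and the hosts storing it.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("offset=" + block.getOffset()
          + " length=" + block.getLength()
          + " hosts=" + Arrays.toString(block.getHosts()));
    }
  }
}

On a real cluster, a large file reports many blocks, each replicated on several different hosts -- which is exactly what lets the computation run close to the data on many machines at once.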
Challenges at Large Scale
Performing large-scale computation is difficult. Working with this volume of data requires distributing parts of the problem across multiple machines to handle in parallel. Whenever multiple machines are used in cooperation with one another, the probability of failures rises. In a single-machine environment, failure is not something that program designers explicitly worry about very often: if the machine has crashed, there is no way for the program to recover anyway.
In a distributed environment, however, partial failures are an expected and common occurrence. Networks can experience partial or total failure if switches and routers break down. Data may not arrive at a particular point in time due to unexpected network congestion. Individual compute nodes may overheat, crash, experience hard drive failures, or run out of memory or disk space. Data may be corrupted, or maliciously or improperly transmitted. Multiple implementations or versions of client software may speak slightly different protocols from one another. Clocks may become desynchronized, lock files may not be released, parties involved in distributed atomic transactions may lose their network connections part-way through, etc. In each of these cases, the rest of the distributed system should be able to recover from the component failure or transient error condition and continue to make progress. Of course, actually providing such resilience is a major software engineering challenge.
Different distributed systems specifically address certain modes of failure, while worrying less about others. Hadoop provides no security model, nor safeguards against maliciously inserted data. For example, it cannot detect a man-in-the-middle attack between nodes. On the other hand, it is designed to handle hardware failure and data congestion issues very robustly. Other distributed systems make different trade-offs, as they intend to be used for problems with other requirements (e.g., high security).
In addition to worrying about these sorts of bugs and challenges, there is also the fact that the compute hardware has finite resources available to it. The major resources include:
* Processor time
* Memory
* Hard drive space
* Network bandwidth
Individual machines typically have only a few gigabytes of memory. If the input data set is several terabytes, then holding it in RAM would require a thousand or more machines -- and even then, no single machine would be able to process or address all of the data.
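To make this arithmetic concrete, here is a back-of-the-envelope sketch assuming, purely for illustration, a 4 TB input data set and 4 GB of usable RAM per node:

public class RamEstimate {
  public static void main(String[] args) {
    long inputBytes = 4L * 1024 * 1024 * 1024 * 1024;  // illustrative 4 TB input data set
    long ramPerNode = 4L * 1024 * 1024 * 1024;          // illustrative 4 GB of RAM per machine
    long nodesNeeded = (inputBytes + ramPerNode - 1) / ramPerNode;  // round up
    System.out.println("Machines needed just to hold the data in RAM: " + nodesNeeded);
    // Prints 1024 -- roughly the "thousand or more machines" mentioned above,
    // before any memory is spent on the operating system or the computation itself.
  }
}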
Hard drives are much larger; a single machine can now hold multiple terabytes of information on its hard drives. But intermediate data sets generated while performing a large-scale computation can easily fill up several times more space than the original input data set occupied. During this process, some of the hard drives employed by the system may become full, and the distributed system may need to route the data to other nodes that can store the overflow.
Finally, bandwidth is a scarce resource even on an internal network. While a set of nodes directly connected by gigabit Ethernet may generally experience high throughput between them, if all of the machines transmit multi-gigabyte data sets at once, they can easily saturate the switch's bandwidth capacity. Additionally, if the machines are spread across multiple racks, the bandwidth available for the data transfer is much lower. Furthermore, RPC requests and other data transfer requests using this channel may be delayed or dropped.
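As a rough illustration of how quickly a gigabit link becomes the bottleneck, the sketch below uses illustrative numbers (not measurements) to estimate how long a single node needs to ship a 100 GB data set over a fully dedicated 1 Gbit/s link:

public class TransferEstimate {
  public static void main(String[] args) {
    double dataGigabytes = 100.0;            // illustrative intermediate data set per node
    double linkGbps = 1.0;                   // gigabit Ethernet, assumed fully dedicated
    double seconds = dataGigabytes * 8 / linkGbps;  // gigabytes -> gigabits, divided by link speed
    System.out.printf("Best case for one node: %.0f seconds (~%.1f minutes)%n",
        seconds, seconds / 60);
    // ~800 seconds. With dozens of nodes sharing one switch or a single inter-rack
    // uplink, the effective per-node bandwidth -- and the total job time -- is far worse.
  }
}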
To be successful, a large-scale distributed system must be able to manage these resources efficiently. Furthermore, it must allocate some of them toward maintaining the system as a whole, while devoting as much time as possible to the actual core computation.
Synchronization between multiple machines remains the biggest challenge in distributed system design. If nodes in a distributed system can explicitly communicate with one another, then application designers must be cognizant of the risks associated with such communication patterns. It becomes very easy to generate more remote procedure calls (RPCs) than the system can satisfy! Performing multi-party data exchanges is also prone to deadlock or race conditions. Finally, the ability to continue computation in the face of failures becomes more challenging. For example, if 100 nodes are present in a system and one of them crashes, the other 99 nodes should be able to continue the computation, ideally with only a small penalty proportionate to the loss of 1% of the computing power. Of course, this will require re-computing any work lost on the unavailable node. Furthermore, if a complex communication network is overlaid on the distributed infrastructure, then determining how best to restart the lost computation and propagating this information about the change in network topology may be non-trivial to implement.