What is Hadoop?
Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters of commodity hardware. Essentially, it accomplishes two tasks: massive data storage (via the Hadoop Distributed File System, HDFS) and parallel processing (via the MapReduce programming model).
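The processing half of that pair, MapReduce, boils down to three phases: map input records to key-value pairs, shuffle (group) the pairs by key, and reduce each group to a result. Here is a minimal, Hadoop-free sketch of that idea in plain Python, using the classic word-count example; the function names are illustrative, not part of Hadoop's actual API:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every input record."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as Hadoop does
    between the map and reduce stages."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: aggregate the list of values for each key."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data is big", "data mining is not always big"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["big"])   # 3
print(counts["data"])  # 2
```

In real Hadoop the map and reduce functions run on different machines and the shuffle moves data over the network, but the logical contract is exactly this.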
Let's look at some of the terms first!
Big data is a marketing term, not a technical term. Everything is "big data" these days. It is a largely unspecific term, defined mostly by what the marketing departments of very optimistic companies can sell – and what the C*Os of major companies buy, hoping to make magic happen.
Actually, "data mining" was just as overused… it could mean almost anything, such as:
- collecting data (think NSA)
- storing data
- machine learning / AI (which predates the term data mining)
- non-ML data mining (as in "knowledge discovery", the context in which the term data mining was actually coined; there the focus is on discovering new knowledge, not on learning existing knowledge)
- business rules and analytics
- anything involving data you want to sell for truckloads of money
Most “big” data mining isn’t big
What does Hadoop do?
Data quality suffers with size