Why do you need distributed data analytics?

It depends on your use case; I can only describe why we need it so much that we have put almost a year of development effort into Radoop so far. We are a data mining research lab with many industrial projects, and until a few years ago a typical data set we worked on was a few hundred megabytes. That was easy to handle: we could load it into memory and analyze it with SPSS, Weka, RapidMiner, R, MATLAB, SAS, or any other tool (thanks to academic licensing). All these tools have their pros and cons, and we could pick the one that best fit the given problem.

However, in the past few years we have seen enormous growth in data size across our projects. It is not that those companies have grown so much bigger; rather, they collect orders of magnitude more data about their operations. They no longer just collect pageviews on the website, but also track clicks, mouse movements, and all other interactions. Telecom companies collect more information on calls, or even on machine-to-machine messages in the network. Sensor technology also produces data streams that are hard (or even impossible) to store and analyze. Medical companies can collect data with new technologies that were never possible before. Just consider that full genome sequencing produces gigabytes of data, and many researchers claim that it will soon be possible to complete in a few minutes.

Single-machine computing power is not growing by orders of magnitude, but data is. We see this trend in our projects, and we are racing to get ahead of it with a solution. Hadoop is a good fit for analyzing big distributed data sets, and RapidMiner is a good fit as a data analytics interface. We hope that Radoop will also help you in the big data world!
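To give a flavor of why Hadoop scales where a single machine cannot, here is a minimal sketch of the classic word-count job written against the standard Hadoop MapReduce Java API. This is an illustrative example, not Radoop code; the class name and the input/output paths are placeholders. The map phase runs in parallel on the nodes that hold each block of the input, and only the small per-word counts travel over the network to the reducers.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: each mapper processes one input split, usually a local
  // HDFS block, and emits (word, 1) for every token it sees.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: receives all counts emitted for a given word and sums them.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    // Combiner does local pre-aggregation on each node before the shuffle.
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // args[0] and args[1] are HDFS input/output paths supplied at launch time.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Because each mapper reads a block stored on its own node, adding machines grows throughput almost linearly with the data, which is exactly the property that in-memory, single-machine tools lack.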
