Friday, November 21, 2014

How we came across the need for Big Data way back in Y2K

We were working on a batch job, and the requirement was to finish processing all the transactions in a database within a maintenance window of less than an hour. We had to read hundreds of thousands of records, call a service, and validate and enrich those records using reference data returned by the service (back then we had no web services; we invoked those services over CORBA or IIOP, i.e. Java RMI/IIOP). The challenge was the very brief time window. We figured out that reading the records sequentially was not an option and that we had to process them in parallel.
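To get a feel for why a sequential pass could not work, here is a rough back-of-the-envelope calculation with purely illustrative numbers (the actual volumes and latencies are not something I recall exactly): at, say, 500,000 records and about 50 ms per remote call, a single sequential pass would need 500,000 × 0.05 s ≈ 25,000 s, or roughly 7 hours. Spreading the same work across 10 parallel workers brings that down to about 40 minutes, which fits inside a one-hour window.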

So I proposed that we use batch jobs running in multiple processes spread across multiple nodes to read and process the records from the database, the slaves, or yes, the "Mappers" as in MapReduce! Now, how could we prevent the parallel batch jobs from stepping on each other's toes and picking up the same set of records for processing? How could we avoid this duplication? Use row-level read locks and slow down the whole application? Of course not. We needed a master thread or process that would control the distribution of records across the slave processes, the "Reducer" as in MapReduce. I am sure you are saying that in MapReduce the reducers do not just take part in task distribution, they also consolidate the output of the mappers. Secondly, you may say that there was no distributed file system. That is correct, but we did not have any such requirement, and the similarity ends there. A minimal sketch of the record-distribution idea follows below.
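Just to make the idea concrete, here is a minimal sketch of that master-slave record distribution written in modern Java concurrency rather than the CORBA-era plumbing we actually used; the record counts, worker count and the enrich() call are hypothetical stand-ins, not the original implementation.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class MasterSlaveBatch {

        // Hypothetical stand-in for the service call that validated and enriched a record.
        static String enrich(long recordId) {
            return "enriched-" + recordId;
        }

        public static void main(String[] args) throws InterruptedException {
            final long totalRecords = 100_000;   // assumed volume
            final int workers = 8;               // assumed number of slave processes/threads
            final long chunk = totalRecords / workers;

            ExecutorService pool = Executors.newFixedThreadPool(workers);

            // The "master" hands each worker a disjoint key range, so no two
            // workers ever pick up the same records and no row-level locks are needed.
            for (int w = 0; w < workers; w++) {
                final long start = w * chunk;
                final long end = (w == workers - 1) ? totalRecords : start + chunk;
                pool.submit(() -> {
                    for (long id = start; id < end; id++) {
                        enrich(id);   // read record, call service, write back (elided)
                    }
                });
            }

            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }
    }

The key design point is the same one we relied on back then: each slave gets a disjoint slice of the data up front, so duplication is avoided without any locking.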

However, I can think of another example that was closer to MapReduce than the one above. Back in my college days we were writing "Hello World" programs while learning new programming languages, and as with any novices we were also fascinated with the idea of hacking fellow students' passwords. So I wrote a program that could generate all possible combinations of strings. It was a quick, dirty and dumb program. The dumb part came from the fact that the program did not ensure that it generated only unique combinations: it would simply use a random number generator to produce a random ASCII code and append it to a fixed-length character sequence. Once a unique combination was found, it was stored in a file. To speed up the process, I distributed this random string generation across multiple processes, the "Mappers", and used another process to accept results from these mappers (slaves) and check whether each one was a duplicate, the "Reducer". The Reducer was also responsible for handing out the character set as well as the length of the combination to be generated to the slave jobs. Do you agree that this was nearer to MapReduce? This was not a very ethical application of technology, but it gave me a good idea of how hackers could use this technique with lots of commodity hardware (a bunch of Linux boxes as a parallel processing cluster) to crack passwords.
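For illustration only, here is a sketch of that structure in modern Java: several generator threads (the "Mappers") feed a single deduplicating consumer (the "Reducer"). The alphabet, lengths and counts are arbitrary assumptions, and this is not the original program, which also handed out the character set to the slaves and wrote the unique combinations to a file.

    import java.security.SecureRandom;
    import java.util.HashSet;
    import java.util.Set;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class CombinationGeneratorSketch {

        static final char[] ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789".toCharArray();
        static final int LENGTH = 4;      // assumed combination length
        static final int MAPPERS = 4;     // assumed number of generator threads

        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<String> candidates = new ArrayBlockingQueue<>(10_000);

            // "Mappers": blindly generate random fixed-length strings, duplicates and all.
            for (int i = 0; i < MAPPERS; i++) {
                Thread mapper = new Thread(() -> {
                    SecureRandom rnd = new SecureRandom();
                    while (true) {
                        StringBuilder sb = new StringBuilder(LENGTH);
                        for (int c = 0; c < LENGTH; c++) {
                            sb.append(ALPHABET[rnd.nextInt(ALPHABET.length)]);
                        }
                        try {
                            candidates.put(sb.toString());
                        } catch (InterruptedException e) {
                            return;
                        }
                    }
                });
                mapper.setDaemon(true);
                mapper.start();
            }

            // "Reducer": keeps only combinations not seen before.
            Set<String> seen = new HashSet<>();
            while (seen.size() < 1_000) {          // stop condition for the sketch
                String candidate = candidates.take();
                if (seen.add(candidate)) {
                    System.out.println(candidate); // the original program wrote these to a file
                }
            }
        }
    }
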

Below is the deck that I had prepared back in 2000 to present the architecture (no, not for the unethical application) to my colleagues and client. At that time it was presented as an approach for extreme scalability. I later re-wrote the whole thing with Java 5.

Please also see an important conclusion at the end of this post about the MapReduce architecture.

[Slide deck from 2000: parallel batch-processing architecture for extreme scalability]

Conclusion
I feel that MapReduce is not a new architectural style or pattern; it is based on the Master-Slave style of architecture, tailored for achieving high scalability by processing huge volumes of data in parallel.