I was in the middle of a project, and the situation, as usual, was far from good in terms of achieving the required performance. There were about 1 billion rows in an employee table to be read from a MySQL database, and I had to select some 400 million rows from this big table based on a filter criterion, say all employees who joined in the last seven years (based on a joining_num column). The client is one of the biggest in the transportation industry, with about thirty thousand offices across the United States and Latin America. For each of the rows filtered from the above table, I needed to look up another table to find the name of the department each employee initially joined, for a human-resources application.

Solving this with traditional JDBC is not an option, given the time the process takes to fetch all the records and then iterate over them to fetch the related records for the next step. Re-modelling the database was ruled out, since many web and reporting components in the existing system use these tables. A view or another copy of the combined table is not something I prefer either: avoiding the proliferation of data is a principle I follow in any data lake or transformation project. And in this big data world, I have enormous compute power available to me, so leveraging distributed computing is obviously the best way to approach the problem.
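Here is a minimal sketch of that approach, assuming Apache Spark as the distributed engine. The table names, join key, connection details, and the joining_num cutoff below are hypothetical placeholders, not the client's actual schema:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("employee-department-join")
         .getOrCreate())

# Hypothetical connection details
jdbc_url = "jdbc:mysql://db-host:3306/hr"
props = {"user": "reader", "password": "secret",
         "driver": "com.mysql.cj.jdbc.Driver"}

# Read the billion-row table in parallel by range-partitioning on a numeric
# key, so each executor pulls its own slice of the table instead of a single
# JDBC cursor fetching everything sequentially.
employees = spark.read.jdbc(
    jdbc_url, "employee", properties=props,
    column="emp_id",            # hypothetical numeric partition column
    lowerBound=1, upperBound=1_000_000_000,
    numPartitions=200)

departments = spark.read.jdbc(jdbc_url, "department", properties=props)

# Filter to employees who joined in the last seven years, then resolve the
# department they initially joined: one distributed join instead of
# row-by-row lookups against the second table.
result = (employees
          .filter(employees.joining_num >= 2014)  # placeholder cutoff
          .join(departments, "dept_id")           # hypothetical join key
          .select("emp_id", "emp_name", "dept_name"))

result.write.parquet("/data/hr/initial_departments")  # hypothetical path
```

The key point is the partitioned read: Spark issues one range query per partition across the cluster, and the subsequent filter and join run on the distributed DataFrames rather than through a single client-side loop.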
