Anyone who keeps an eye on the world of technology has come across the term “Big Data”, and yes, as one would correctly assume, it means data that comes in a greater variety, in increasing volume, and at ever higher velocity: the famous “three V’s”.
Why Big Data?
This is an important question: simple to ask, yet difficult to answer. To answer it, we have to understand what we can derive from big data and where it comes from.
There are millions of patients around the globe, with conditions ranging from simple flu to deadly cancer. Suppose we take only 10,000 ICU patients and collect only their ECG, EEG, PPG, heart rate, temperature, blood pressure, SpO2, and other vital parameters at their respective sampling frequencies. For a single patient, this generates a file (or a few files) of roughly 10 MB per hour.
1 hour — 10 MB
24 hours — 240 MB
5 days — 1.2 GB (consider we are only collecting 5 days of data)
And that is only one patient.
Imagine what it would be for 10,000 patients: 12,000 GB, or 12 TB.
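The arithmetic above can be reproduced in a few lines; the 10 MB/hour rate is the article’s working assumption, and the conversions use decimal units (1,000 MB per GB) as in the text:

```python
# Back-of-the-envelope storage estimate for the ICU example.
MB_PER_HOUR = 10        # assumed per-patient data rate
HOURS_PER_DAY = 24
DAYS = 5
PATIENTS = 10_000

per_patient_mb = MB_PER_HOUR * HOURS_PER_DAY * DAYS   # 1,200 MB ≈ 1.2 GB
total_gb = per_patient_mb * PATIENTS / 1_000          # 12,000 GB
total_tb = total_gb / 1_000                           # 12 TB

print(f"Per patient: {per_patient_mb} MB (~{per_patient_mb / 1_000:.1f} GB)")
print(f"All patients: {total_gb:,.0f} GB = {total_tb:.0f} TB")
```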
So we are able to collect and store this huge amount of patient data. But now what?
In this example, suppose we are trying to find patterns in how patients moved through different critical stages and what their survival rates were. With those patterns, we can predict (to some extent) whether the 10,001st patient will survive, right? And yes, you guessed it: eventually we reach “machine learning”. In fact, this is the final step in processing big data in most cases, but not always.
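Before any machine learning, the “pattern of critical stages and survival rates” starts as a simple aggregation. A minimal sketch, using entirely hypothetical, simplified records of (highest critical stage reached, survived?):

```python
from collections import defaultdict

# Hypothetical, simplified patient records: (critical stage, survived?)
records = [
    ("stage_1", True), ("stage_1", True), ("stage_1", False),
    ("stage_2", True), ("stage_2", False), ("stage_2", False),
    ("stage_3", False), ("stage_3", False), ("stage_3", True),
]

# Aggregate survival counts per stage: the raw "pattern"
# a predictive model would later learn from.
counts = defaultdict(lambda: [0, 0])   # stage -> [survived, total]
for stage, survived in records:
    counts[stage][1] += 1
    counts[stage][0] += survived       # True counts as 1

survival_rate = {s: lived / total for s, (lived, total) in counts.items()}
print(survival_rate)   # stage_1 shows the highest survival rate
```

At real scale, this same groupby-style aggregation is what MapReduce-style jobs run across a cluster.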
In other cases, say a mobile app, the data consists of user taps and scrolls. When processed, these become user patterns, which eventually generate revenue if we deliver what users like (and avoid what they dislike), combined with well-targeted marketing.
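Turning raw taps into a “user pattern” can be as simple as counting. A sketch over a hypothetical event stream (the field names and values are made up for illustration):

```python
from collections import Counter

# Hypothetical stream of UI events collected from a mobile app.
events = [
    {"user": "u1", "action": "tap",    "screen": "feed"},
    {"user": "u1", "action": "scroll", "screen": "feed"},
    {"user": "u2", "action": "tap",    "screen": "profile"},
    {"user": "u2", "action": "tap",    "screen": "feed"},
    {"user": "u3", "action": "tap",    "screen": "feed"},
]

# Which screens attract the most taps? This is the kind of signal
# that targeted marketing is built on.
tap_counts = Counter(e["screen"] for e in events if e["action"] == "tap")
print(tap_counts.most_common(1))   # [('feed', 3)]
```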
How huge does the data have to be to be termed Big Data?
Challenges associated with big data
Mainly, two challenges are associated with big data: storing and retrieving it efficiently, and processing it fast enough to be useful.
Two simple-sounding challenges, but very difficult to tackle, because we are talking about petabytes of data. What if we want a chunk of data, query the database, and have to wait a couple of minutes for the results? Really bad idea!
And even after we receive the data and start processing it, what if it takes a couple of days to derive any insights? Really, really bad idea!!
For storage and retrieval, we can use frameworks like Hadoop. On the processing side, efficient programming backed by powerful processors is the only hope. Even while writing a single line of code, we should keep time and space complexity in mind, because the next day we may encounter another petabyte of data!
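The space-complexity point can be made concrete: a sketch of streaming a file line by line in O(1) memory instead of loading it all at once, which is the difference between a script that survives a petabyte-scale log and one that doesn’t. (The file here is a tiny stand-in created just for the demo.)

```python
import os
import tempfile

def stream_mean(path):
    """Compute a mean over an arbitrarily large file in constant memory,
    reading one line at a time instead of loading the whole file."""
    total = 0.0
    count = 0
    with open(path) as f:
        for line in f:          # iterates lazily: O(1) memory, not O(file size)
            total += float(line)
            count += 1
    return total / count

# Demo with a small temporary file of temperature readings.
with tempfile.NamedTemporaryFile("w", delete=False, suffix=".txt") as tmp:
    tmp.write("\n".join(str(v) for v in [36.5, 37.0, 38.5, 39.0]))
    path = tmp.name

result = stream_mean(path)
print(result)   # 37.75
os.remove(path)
```

The same idea, streaming rather than materializing, is what MapReduce and Hadoop’s processing model are built around.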