Big data analysis has been possible for some companies or organizations in the past, but at an extremely high cost. Google and Walmart are early examples.
Technical developments and cheap cloud-based services now also enable small companies to use big data to gather new insights and knowledge, as they can simply rent a cloud-based big data hosting solution, from any of the many suppliers available.
In big data, it has been common to talk about the three V’s: The volume (data size), velocity (speed of the data flow), and variety (how unstructured the information is) of the data. It has been reasonably easy to get two of the three before, but until recently, it has been very expensive to get all three at the same time.
One can see big data as affordable commercial supercomputing, where data mining and processing can be done economically at scale, usually using distributed systems and parallel processing in hundreds or even thousands of computers at the same time.
The most popular big data software platform is Apache Hadoop.
Apache Hadoop was the first supercomputing software platform to be affordable at scale. It is an open-source (i.e. free) software solution for big data analysis. It was originally developed by Yahoo and is considered to have been the major driving force in the growth of big data.
Hadoop was created by Doug Cutting and Mike Cafarella in 2005 with the purpose of building a framework for distributed storage and processing of enormous data sets.
It has the capability to process massive amounts of unstructured data at low computational cost—hundreds, thousands, or even millions of gigabytes of data. It does this using a job scheduler that distributes the computations over many server computers working together in parallel.
Previously, large-scale data analysis had to be done using expensive specialized computer clusters that provided a highly reliable computing platform. Hadoop, on the other hand, can use relatively cheap standard computers, which reduces the cost.
Hadoop does this by assuming all computer nodes may fail, and includes software functionalities to recover from failing computer hardware. This enables the system to provide a high-availability service when hundreds or thousands of computers of varying reliability are thrown at a large problem.
The job scheduler in Hadoop is an open-source implementation of MapReduce, which was created by Google to handle the problem of creating web search indexes for something as large as the Internet itself. The MapReduce framework is used in most modern big data processing today.
The strength of MapReduce is its capability to take a query over a massive dataset and divide it into smaller portions that are executed in parallel on many computers. This solves the problem of the data being too large to fit into one computer.
For the parallel distributed processing to work, all computers in the cluster need to be able to access the data. Therefore, Hadoop also contains a distributed file system (HDFS) that spreads copies of the files over many computers, enabling parallel work.
To simplify development of analysis models, two complementary technologies are often used: Pig and Hive. Pig is a programming language used to define what analysis should be done in Hadoop, and Hive provides a more structured data warehouse layer over Hadoop.
Who defines what work should be done on the data? With big data analysis, a new profession is emerging, that of the data scientist who combines math skills and programming knowledge to help create a data-driven business or even new products.
A data revolution is now happening.
History will perhaps consider it as important as the Internet itself one day. The value of big data is both in its analytical power to improve current business operations, and create entirely new products based on big data analysis.
In the first case, you gain insights that help you optimize your company’s existing business. In the latter case, you create a new product that is based on big data analysis; for example, selling information on how to fertilize a farmer’s field most efficiently, or selling smart farming equipment that uses such data to operate more economically.
Given that Hadoop is the de-facto standard in big data analysis, how do you use it?
As Hadoop is free open-source software, you could download it and install it on you own computers. A somewhat simpler solution is to get a readymade distribution from a vendor who provides pre-packaged Hadoop solutions.
However, cloud hosting solutions are quickly becoming a popular choice. You can rent a big data platform from one of the many hosting companies like Amazon, Microsoft, or Google and get started more or less immediately.
The major revolution is that big data is now available to almost anyone at an affordable cost, even to small companies with some software expertise. You do not need to have a data center or a large computer cluster of your own anymore.
A credit card and a hosting supplier are all you need—in addition to data to process and some skills, of course.
Data Scientists and R
As the previous sections have clearly showed, analyzing data can be very useful. Data science is an interdisciplinary field working to extract insights from various forms of data.
Building systems for big data analysis are done by data scientists. A data scientist has knowledge in computer science, statistics, analytics, and math. But a data scientist also has a strong business insight and understands what analysis results have value to the organization.
Data scientists often use the R programming language to perform analysis of large amounts of data. R is a domain-specific language used for statistical computing and data mining and was first introduced in 1993. It is developed by Ross Ihaka and Robert Gentleman at the University of Auckland in New Zealand.
If you dive deeper into data analysis, you will probably spend a lot of time programming data-mining scripts in R.
Read the other articles in this blog post series on Business Insights:
- Part 1: Data mining and big data
- Part 2: What is big data?
- Part 3: Supercomputing for anyone (this post)