Big Data is one of the hottest topics in data systems
nowadays. Many organizations are looking for a way to start working with Big
Data, and there are many courses and conference sessions on the subject. Microsoft,
as a database and software vendor, has started to provide specific solutions for Big
Data. In this post you’ll learn about Big Data and some related terminology,
and get a high-level overview of Microsoft's solution for Big Data.
Many people think that Big Data means any database larger
than 1 TB, but this is not correct. A very simple definition of Big Data is:
Big Data is a collection of data sets with high volume, velocity,
and variety, from which information can be extracted to support
decision making.
From the definition above, the three main dimensions of Big Data are
clear:
Volume: the size of the data
Variety: the different formats of the data
Velocity: how fast the data grows and how fast it must be
processed
With the growing number of database systems,
especially transactional systems, social networks, logging systems, and many
other systems that produce a large number of transactions per unit of time, businesses
increasingly face Big Data as time goes on.
Large data volumes, the variety of data, and the concern
for velocity make it harder and harder to work with Big Data in regular
relational database systems or even in data warehouses. So database vendors
started to think about methods and tools for dealing with Big Data
efficiently.
Microsoft has also joined the Big Data vendors by introducing
Microsoft HDInsight.
What is HDInsight?
Microsoft HDInsight is developed by Microsoft together with Hortonworks,
a company that provides Hadoop-based solutions for Big Data.
HDInsight is a Hadoop-based solution for
Microsoft Windows that provides Big Data solutions with Microsoft technologies.
What is Hadoop?
Hadoop is an open-source Apache project for reliable, scalable,
distributed computing.
Hadoop provides distributed processing of large data sets
across clusters of computers using simple programming models.
The Hadoop ecosystem includes different components for working with
Big Data. Some of the main ones are listed below:
MapReduce
MapReduce is a programming model for processing large data
sets.
Hadoop's MapReduce framework is for writing applications
that process large amounts of structured/semi-structured data in parallel across
large clusters.
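To make the model more concrete, here is a minimal single-machine sketch of the map, shuffle, and reduce steps, written in plain Python. It is only an illustration of the idea, not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample input are assumptions made up for this example, and a real Hadoop job would spread these steps across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, just for illustration.
    sample = ["big data is big", "data is everywhere"]
    print(word_count(sample))
```

In a cluster, each map call could run on a different node over a different slice of the input, and the grouped keys would be distributed to reducers; that is where the parallelism comes from.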
Pig
Pig provides a high-level scripting language (Pig Latin)
whose scripts are executed as MapReduce jobs.
Hive
Hive is a data warehouse system that lets you extract meaning from
large data sets through an SQL-like query language (HiveQL), which is run as
MapReduce jobs.
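To give a feel for what "SQL-like" means here, the sketch below runs a simple GROUP BY aggregation from Python. Note the assumptions: this uses SQLite (from Python's standard library) purely as a stand-in, not Hive itself, and the table name (words) and the helper count_words_sql are made up for this example. The point is only that a short declarative query can express a word count without writing the map and reduce steps by hand.

```python
import sqlite3


def count_words_sql(words):
    """Count occurrences of each word with a SQL GROUP BY query.

    SQLite is a stand-in here; it is NOT Hive and does not run MapReduce.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE words (word TEXT)")
    conn.executemany(
        "INSERT INTO words (word) VALUES (?)", [(w,) for w in words]
    )
    rows = conn.execute(
        "SELECT word, COUNT(*) AS total FROM words "
        "GROUP BY word ORDER BY word"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    # Hypothetical sample input, just for illustration.
    print(count_words_sql(["big", "data", "is", "big"]))
```

With Hive, a query of this shape would be submitted to the cluster and executed over data stored in Hadoop rather than in a local in-memory database.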
What Does Microsoft HDInsight Provide?
Microsoft HDInsight provides Apache Hadoop-based technology
for working with Big Data and for querying large data sets to extract
meaningful data for decision making.
Links for further reading:
More about Hadoop:
http://hortonworks.com/what-is-apache-hadoop/
Hortonworks provides Microsoft HDInsight:
http://hortonworks.com/partners/microsoft/
Microsoft Big Data Solution web address:
http://www.microsoft.com/sqlserver/en/us/solutions-technologies/business-intelligence/big-data.aspx
Microsoft HDInsight PREVIEW installation:
http://www.microsoft.com/web/gallery/install.aspx?appid=HDINSIGHT-PREVIEW
In the next post I’ll explain how to install the
HDInsight Preview version and how to run some examples on it.