In our previous blogs, we defined Big Data and Big Data Analytics. In this post we discuss the tools used to tackle big data from a technology standpoint – Hadoop (HDFS, MapReduce), which is an open source computing framework, and NoSQL, which is a family of non-relational databases.
Big Data Technologies and Tools
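Hadoop is an open source software framework that stores and analyses structured and unstructured data by distributing both the data and the applications that process it across a cluster of servers. The name is sometimes expanded as “High-availability distributed object-oriented platform”, although it actually comes from a toy elephant that belonged to the son of Hadoop co-creator Doug Cutting. The overall Hadoop architecture is shown below –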
Basic Application of Hadoop
Hadoop is used for maintaining, scaling, error handling, self-healing and securing large volumes of data, whether structured or unstructured. When data grows too large for traditional systems to handle, Hadoop comes into the picture. Below are some basic features of Hadoop -[subscribelocker]
- Hadoop maintains and protects data by storing several replicas of it on different machines.
- It is built to scale out, so capacity grows by adding nodes as data usage grows.
- It can detect failed tasks and failed data transfers and run them again elsewhere in the cluster.
- It not only recovers lost data from the surviving replicas but also automatically restores the missing copies.
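To make the replication and self-healing ideas above a bit more concrete, here is a minimal Python sketch. It is a toy simulation, not Hadoop's actual placement logic: the node names, block ids, the replication factor of 3 and the helper functions `place_replicas` and `handle_node_failure` are all assumptions made up for illustration.

```python
import random

REPLICATION_FACTOR = 3  # assumed value for this sketch


def place_replicas(block_ids, nodes, factor=REPLICATION_FACTOR, seed=0):
    """Assign every block to `factor` distinct nodes (random toy placement)."""
    rng = random.Random(seed)
    return {block: set(rng.sample(nodes, factor)) for block in block_ids}


def handle_node_failure(placement, failed_node, live_nodes, factor=REPLICATION_FACTOR):
    """Forget the failed node and re-copy under-replicated blocks to live nodes."""
    for holders in placement.values():
        holders.discard(failed_node)
        # Nodes that are alive and do not yet hold this block.
        candidates = [node for node in live_nodes if node not in holders]
        while len(holders) < factor and candidates:
            holders.add(candidates.pop(0))
    return placement


nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_replicas(["blk_1", "blk_2", "blk_3"], nodes)

# Simulate node2 going down, then let the cluster heal itself.
live = [node for node in nodes if node != "node2"]
placement = handle_node_failure(placement, "node2", live)

for block, holders in sorted(placement.items()):
    print(block, sorted(holders))
```

After the simulated failure every block is back to three copies, none of them on the failed node, which is the behaviour the last two bullets describe.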
Typical Hadoop Platform Stack – HDFS + Hive + HBase + Pig
HDFS (Hadoop Distributed File System) – is the part of Hadoop that handles the distribution and storage of large data sets. HDFS stores each file as a sequence of blocks of the same size, except for the last block, which may be smaller. It also copes with hardware failure by keeping replicas of each block, which keeps data handling smooth.
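The "same size blocks except the last one" rule is easy to see with a little arithmetic. The sketch below is plain Python, not an HDFS API. The function `split_into_blocks` is a hypothetical helper, and the 128 MB block size and 300 MB file are just assumed example values (the real block size is configurable).

```python
def split_into_blocks(file_size, block_size):
    """Return a list of (offset, length) pairs, one per block."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    blocks = []
    offset = 0
    while offset < file_size:
        # Every block is full-sized except possibly the last one.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


MB = 1024 * 1024
blocks = split_into_blocks(300 * MB, 128 * MB)

for index, (offset, length) in enumerate(blocks):
    print(f"block {index}: offset={offset // MB} MB, length={length // MB} MB")
```

With these numbers the file ends up as two full 128 MB blocks followed by a 44 MB block.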
Hive – Hive was initiated by Facebook. It is a data warehouse tool built on Hadoop that converts queries into MapReduce jobs. It deals with the storage, analysis and querying of large data sets. Queries are written in the Hive Query Language (HQL, also called HiveQL), which is similar to standard SQL.
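To get a feel for what "converting a query into a MapReduce job" means, here is a rough Python sketch. It mimics, in a single process, the map, shuffle and reduce steps that a query such as `SELECT country, COUNT(*) FROM visits GROUP BY country` could be broken into. The `visits` table, its `country` column and the three function names are assumptions for illustration. This is not how Hive is implemented internally.

```python
from collections import defaultdict


def map_phase(rows):
    """Map: emit a (key, 1) pair for every input row."""
    for row in rows:
        yield row["country"], 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical input rows standing in for the "visits" table.
visits = [
    {"user": "a", "country": "IN"},
    {"user": "b", "country": "US"},
    {"user": "c", "country": "IN"},
    {"user": "d", "country": "DE"},
    {"user": "e", "country": "US"},
]

counts = reduce_phase(shuffle(map_phase(visits)))
print(counts)
```

On a real cluster the map and reduce steps run in parallel on many machines, but the shape of the computation is the same.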
HBase – HBase is a Hadoop application which runs on top of HDFS. It presents data as a set of tables, but it is a column-oriented database management system, which makes it different from a row-oriented one. When we talk about databases we usually think of relational systems, but HBase is not a relational database at all and it does not support a structured query language like SQL. Java is the preferred language for HBase applications. Its most important feature is real-time read and write access to large data sets.
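One way to picture the column-oriented model is as a sparse map from (row key, column family, column qualifier, timestamp) to a value. The Python class below is only a toy in-memory sketch of that idea. `ToyTable` and its `put`/`get` methods are hypothetical names and not the HBase client API, and the row keys and values are made up.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for a table with fixed column families."""

    def __init__(self, families):
        self.families = set(families)
        # row_key -> {(family, qualifier): [(timestamp, value), ...]}
        self.rows = defaultdict(dict)

    def put(self, row_key, family, qualifier, value, timestamp):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        cell = self.rows[row_key].setdefault((family, qualifier), [])
        cell.append((timestamp, value))

    def get(self, row_key, family, qualifier):
        """Return the newest version of a cell, or None if it is absent."""
        versions = self.rows.get(row_key, {}).get((family, qualifier))
        if not versions:
            return None
        return max(versions, key=lambda version: version[0])[1]


table = ToyTable(families=["info", "stats"])
table.put("user1", "info", "email", "old@example.com", timestamp=1)
table.put("user1", "info", "email", "new@example.com", timestamp=2)
table.put("user2", "stats", "logins", 7, timestamp=1)

print(table.get("user1", "info", "email"))
print(table.get("user2", "info", "email"))
```

Notice that `user2` simply has no `info:email` cell. Rows do not need to share the same columns, which is one practical difference from a relational table.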
Pig – initiated by Yahoo, became open source in 2007. Do you know why it is named Pig? It is because, like a pig, it can handle any type of data!! Strange but true. Pig is a high-level procedural programming platform developed to simplify querying large data sets in Hadoop and MapReduce. Pig has two components: Pig Latin, which is the programming language, and the runtime environment in which Pig Latin programs are executed.
Advantage and Disadvantage of Hadoop:
As the name suggests, NoSQL means a non-relational or non-SQL database; examples include HBase, Cassandra, MongoDB, Riak and CouchDB. These databases are not built on the fixed relational table format, which is why SQL is generally not used for data access. A traditional relational database deals with structured data stored in tables of rows and columns, whereas NoSQL deals with unstructured, unpredictable kinds of data, with the data model chosen according to the system requirements.
Cassandra is used to handle large data sets when the database needs to scale while keeping performance high. It provides fault tolerance and replication of the data. Its data model lets us go deeper into columns, super columns and more. It is not a relational database system: it offers good query capability but has no joins. It follows the column family model, which can be viewed as a two-dimensional or a three-dimensional map. The 2D model is a column family with some columns in it, while the 3D model is created by placing super columns inside a column family.
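The 2D and 3D maps are easier to see as nested dictionaries. The sketch below is plain Python and not the Cassandra driver or CQL. The row keys, column names and the `get_column` helper are invented for illustration, and super columns are a feature of older Cassandra versions.

```python
# 2D model: row key -> column name -> value
users_2d = {
    "alice": {"email": "alice@example.com", "city": "Pune"},
    "bob": {"email": "bob@example.com"},  # rows may have different columns
}

# 3D model: row key -> super column -> column name -> value
users_3d = {
    "alice": {
        "home_address": {"city": "Pune", "zip": "411001"},
        "work_address": {"city": "Mumbai", "zip": "400001"},
    },
}


def get_column(column_family, row_key, column, super_column=None):
    """Look up one value by row key; there are no joins, only key lookups."""
    row = column_family.get(row_key, {})
    if super_column is not None:
        row = row.get(super_column, {})
    return row.get(column)


print(get_column(users_2d, "alice", "city"))
print(get_column(users_3d, "alice", "city", super_column="work_address"))
```

The super column simply adds one more level of nesting between the row key and the columns.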
MongoDB is an agile NoSQL document database. Unlike a traditional database, which stores data in rows and columns, MongoDB stores data as documents in BSON, a binary form of JSON. It is used for high scalability, availability and performance. Documents with dynamic schemas are the basic unit of data in MongoDB: a set of documents forms a collection, and a set of collections makes up the database.
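Here is a small Python sketch of the database, collection and document hierarchy with dynamic schemas. It uses ordinary dictionaries and the standard `json` module as a stand-in for BSON, not the MongoDB driver. The collection names, fields and the `find` helper are assumptions for illustration.

```python
import json

# database -> collections -> documents
database = {
    "users": [
        {"_id": 1, "name": "Asha", "email": "asha@example.com"},
        # Same collection, different fields: the schema is dynamic.
        {"_id": 2, "name": "Ravi", "tags": ["admin", "beta"]},
    ],
    "orders": [
        {"_id": 100, "user_id": 1, "items": [{"sku": "A1", "qty": 2}]},
    ],
}


def find(collection, **criteria):
    """Return the documents whose fields match every given criterion."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


matches = find(database["users"], name="Ravi")
print(json.dumps(matches[0]))
```

The two user documents live in the same collection even though they do not share the same fields, which would not be possible with a fixed row-and-column layout.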
Riak is an open source NoSQL database system designed for availability, fault tolerance, scalability and high performance. It provides three kinds of storage: key/value store, document-oriented store and web-shaped store. It can also store documents in JSON format. When it comes to data modeling, there is no ‘Master’, only nodes: all nodes are the same and none of them has a different responsibility.
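A masterless design works when every node can work out on its own where a key lives. The Python sketch below illustrates that idea with a simplified hash ring. It is not Riak's actual implementation: the ring size of 8 partitions, the replica count of 3, the node names and the helper functions are all assumptions chosen to keep the example small.

```python
import hashlib

NUM_PARTITIONS = 8  # assumed ring size for this sketch


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Hash a key onto one of the ring's partitions."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_partitions


def build_ring(nodes, num_partitions=NUM_PARTITIONS):
    """Hand out partitions to nodes in round-robin order."""
    return {p: nodes[p % len(nodes)] for p in range(num_partitions)}


def owners(key, ring, n_replicas=3):
    """Nodes holding the key: its partition plus the next ones on the ring."""
    start = partition_for(key, len(ring))
    return [ring[(start + i) % len(ring)] for i in range(n_replicas)]


nodes = ["node1", "node2", "node3", "node4"]
ring = build_ring(nodes)

# Any node can run this same calculation, so no master is needed.
print(owners("user:42", ring))
```

Because the result depends only on the key and the shared ring layout, every node arrives at the same answer without asking a coordinator.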
Advantage and Disadvantage of NoSQL:
[/subscribelocker]Tools from Companies – Cassandra, Riak, Redis, HBase, Oracle, Membase, MongoDB.
Now, super fun geek time – here’s a funny parody on Hadoop and NoSQL. Enjoy and have a great weekend!