Cloudera Impala

What is Impala?

  • A query engine that runs on Hadoop
  • An integrated part of the Cloudera suite
  • Used to run low-latency SQL queries on data stored in Hadoop
  • Provides all the benefits of the Hadoop framework
  • An analytics database for Hadoop
  • Integrates easily with the Apache Hive metastore

Impala and Hive

NoSQL Data Store

Apache HBase - Overview

  • NoSQL: HBase is a type of "NoSQL" database. NoSQL is a general term meaning that the database is not an RDBMS that supports SQL as its primary access language.
  • HBase is an open-source, non-relational, distributed, sorted-map data store modeled after Google's Bigtable.
  • HBase is really more a data store than a database. HBase stores data in StoreFiles on HDFS.
  • HBase is a column-oriented system that runs on top of HDFS (Hadoop Distributed File System) and provides real-time read/write access to large datasets.
  • With YARN as the architectural center of Apache Hadoop, multiple data access engines such as Apache HBase interact with data stored in the cluster.


HBase characteristics

HBase data model

  • Tables – HBase tables are logical collections of rows, stored in separate partitions called regions.
  • Rows – A row is one instance of data in a table and is identified by a rowkey. Rowkeys are unique within a table.
  • Column Families – Data in a row is grouped into column families.
  • Each column family has one or more columns, and the columns in a family are stored together in a low-level storage file known as an HFile.
  • Columns – A column is identified by its column family name and a column qualifier, joined with a colon, for example columnfamily:columnname.
  • Cell – A cell stores data and is identified by a unique combination of rowkey, column family, and column qualifier. The data stored in a cell is called its value.
  • Version – The data stored in a cell is versioned, and versions are identified by their timestamp. The number of versions retained in a column family is configurable; the default was 3 in older releases and is 1 from HBase 0.96 onward.
  • Timestamp – A timestamp is written alongside each value and identifies a given version of that value. By default, the timestamp is the time on the RegionServer when the data was written, but you can specify a different timestamp when you put data into a cell.
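
The data model above can be pictured as a nested, sorted map. The following is a minimal sketch in plain JavaScript, not the HBase client API: the class name ToyTable, its methods, and the sample rows are all hypothetical, and the retention limit of 3 versions is chosen only for illustration.

```javascript
// Toy in-memory model of one HBase table (illustrative only, not the HBase API).
// Logical shape: rowkey -> "family:qualifier" -> list of {ts, value}, newest first.
class ToyTable {
  constructor(maxVersions) {
    this.maxVersions = maxVersions; // versions retained per cell (configurable)
    this.rows = new Map();          // rowkey -> Map(column -> versions[])
  }

  // Write a value; the timestamp defaults to "now" but may be supplied explicitly.
  put(rowkey, family, qualifier, value, ts = Date.now()) {
    const column = `${family}:${qualifier}`;
    if (!this.rows.has(rowkey)) this.rows.set(rowkey, new Map());
    const cells = this.rows.get(rowkey);
    // A write with an existing timestamp replaces that version.
    const versions = (cells.get(column) || []).filter((v) => v.ts !== ts);
    versions.push({ ts, value });
    versions.sort((a, b) => b.ts - a.ts); // newest version first
    cells.set(column, versions.slice(0, this.maxVersions));
  }

  // Read up to n versions of a cell, newest first (n = 1 gives the latest value).
  get(rowkey, family, qualifier, n = 1) {
    const cells = this.rows.get(rowkey);
    const versions = (cells && cells.get(`${family}:${qualifier}`)) || [];
    return versions.slice(0, n).map((v) => v.value);
  }

  // Rowkeys are returned in sorted order, mirroring the sorted-map layout.
  rowkeys() {
    return [...this.rows.keys()].sort();
  }
}

const table = new ToyTable(3); // keep at most 3 versions per cell (assumed setting)
table.put("row2", "info", "city", "Denton", 100);
table.put("row1", "info", "city", "Austin", 100);
table.put("row1", "info", "city", "Dallas", 200);
table.put("row1", "info", "city", "Houston", 300);
table.put("row1", "info", "city", "El Paso", 400);

console.log(table.get("row1", "info", "city"));     // newest value only
console.log(table.get("row1", "info", "city", 10)); // capped at maxVersions
console.log(table.rowkeys());                       // sorted rowkeys
```

The oldest write to row1 ("Austin") falls out once a fourth version arrives, which is the retention behaviour the Version bullet describes.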


MongoDB - Overview

  • A document-oriented database
    – documents encapsulate and encode data (or information) in some standard format or encoding
  • NoSQL database
    – does not adhere to the widely used relational database model
    – highly optimized for retrieve and append operations
  • Uses the BSON format
  • Schema-less
    – no more configuring database columns with types
  • No multi-document transactions (in releases before MongoDB 4.0)
  • No joins (in releases before the $lookup aggregation stage, added in MongoDB 3.2)
The Basics

  • A MongoDB instance may have zero or more databases.
  • A database may have zero or more collections.
    – A collection can be thought of as the relation (table) in a DBMS, but with many differences.
  • A collection may have zero or more documents.
    – Documents in the same collection don't even need to have the same fields.
    – Documents correspond to the records (rows) in an RDBMS.
    – Documents can embed other documents.
    – Documents are addressed in the database via a unique key.
  • A document may have one or more fields.
  • MongoDB indexes work much like their RDBMS counterparts.
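
As a rough illustration of this hierarchy, the sketch below uses plain JavaScript objects to stand in for an instance, a database, collections, and documents. It is not a MongoDB driver API: the database name school, the collection names, the sample documents, and the findById helper are all hypothetical. The one MongoDB-specific detail assumed is that the unique key lives in the _id field.

```javascript
// Plain objects standing in for a MongoDB instance; this is not a driver API.
const instance = {
  school: {                       // a database (hypothetical name)
    users: [                      // a collection holding two documents
      { _id: 1, name: "Z", occupation: "A scientist", location: "New York" },
      { _id: 2, name: "Y", address: { street: "Oak Terrace", city: "Denton" } },
    ],
    courses: [],                  // a collection with zero documents
  },
};

// Address a document by its unique key (MongoDB stores it in the _id field).
function findById(collection, id) {
  return collection.find((doc) => doc._id === id) || null;
}

const users = instance.school.users;
console.log(Object.keys(users[0]));           // fields of the first document
console.log(Object.keys(users[1]));           // different fields, same collection
console.log(findById(users, 2).address.city); // value inside an embedded document
```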

MongoDB Vs Relational DBMS

  • Collection vs table
  • Document vs row
  • Field vs column
  • schema-less vs schema-oriented

Example: Mongo Document

user = {
  name: "Z",
  occupation: "A scientist",
  location: "New York"
}

Queries in MongoDB

Query expression objects indicate a pattern to match

db.users.find( {last_name: 'Smith'} )

Several query objects for advanced queries

db.users.find( {age: {$gte: 23} } )

db.users.find( {age: {$in: [23,25]} } )

Exact match an entire embedded object

db.users.find( {address: {street: 'Oak Terrace', city: 'Denton'}} )

Dot-notation for a partial match

db.users.find( {"address.city": 'Denton'} )
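
To make the difference between these query forms concrete, here is a simplified emulation in plain JavaScript. It is a sketch under stated assumptions, not MongoDB's real matcher: the helpers getPath, exactEqual, matches, and find are hypothetical, only $gte and $in are handled, array fields are ignored, and the sample documents are invented.

```javascript
// Simplified emulation of the query forms above; not MongoDB's real matcher.

// Resolve a dotted path such as "address.city" against a document.
function getPath(doc, path) {
  return path.split(".").reduce((v, key) => (v == null ? undefined : v[key]), doc);
}

// Strict, order-sensitive comparison: an embedded-document equality match
// requires the same fields, the same values, and the same field order.
function exactEqual(a, b) {
  return JSON.stringify(a) === JSON.stringify(b);
}

// True for condition objects made only of operators, e.g. {$gte: 23}.
function isOperatorObject(cond) {
  return cond !== null && typeof cond === "object" && !Array.isArray(cond) &&
    Object.keys(cond).length > 0 &&
    Object.keys(cond).every((k) => k.startsWith("$"));
}

function matches(doc, query) {
  return Object.entries(query).every(([path, cond]) => {
    const value = getPath(doc, path);
    if (isOperatorObject(cond)) {
      return Object.entries(cond).every(([op, arg]) => {
        if (op === "$gte") return value >= arg;
        if (op === "$in") return arg.some((x) => exactEqual(x, value));
        throw new Error(`unsupported operator in this sketch: ${op}`);
      });
    }
    return exactEqual(value, cond); // plain equality or exact embedded match
  });
}

// Invented sample data; the third address has its fields in a different order.
const users = [
  { _id: 1, last_name: "Smith", age: 23, address: { street: "Oak Terrace", city: "Denton" } },
  { _id: 2, last_name: "Jones", age: 25, address: { street: "Elm Street", city: "Denton" } },
  { _id: 3, last_name: "Smith", age: 19, address: { city: "Denton", street: "Oak Terrace" } },
];
const find = (query) => users.filter((d) => matches(d, query)).map((d) => d._id);

console.log(find({ last_name: "Smith" }));
console.log(find({ age: { $gte: 23 } }));
console.log(find({ age: { $in: [23, 25] } }));
console.log(find({ address: { street: "Oak Terrace", city: "Denton" } }));
console.log(find({ "address.city": "Denton" }));
```

The exact match on the whole address object skips the third document because its fields appear in a different order, while the dot-notation query on address.city finds all three.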

Copyright ©2015 Hub4Tech.com, All Rights Reserved. Hub4Tech™ is registered trademark of Hub4tech Portal Services Pvt. Ltd.
All trademarks and logos appearing on this website are the property of their respective owners.