A NOVEL ARCHITECTURE TO INTEGRATE MULTI-SOURCE DATA IN DISTRIBUTED ENVIRONMENT
Author:
SIDRA ZULFIQAR
Citable URI :
https://vspace.vu.edu.pk/detail.aspx?id=326
Publisher :
Virtual University
Date Issued:
7/4/2020 12:00:00 AM
Abstract
The amount of data has been increasing over the last few years due to the emergence of various
end-user applications. These applications utilize cloud computing infrastructure in the data
centers. Apart from the increasing volume of data, there are other factors such as variety,
velocity, and veracity of the data, which together give rise to the big data problem. Traditional database
management systems cannot handle big data efficiently, so a big data platform is
needed to address the problem. Hadoop is one such platform.
Hadoop uses a distributed storage system. Hive and HBase are some of the
big data tools for storing big data in Hadoop. They run on top of Hadoop distributed file system
(HDFS). Hive is a data warehouse framework for querying and analysis of data that is stored in
HDFS. It is open-source software that lets programmers analyze large data sets on Hadoop.
HBase is a column-oriented, distributed, and highly fault-tolerant database used to store and
manage big data. It can store billions of rows. Both Hive and HBase can be used to store
big data in Hadoop. When data comes from multiple sources, it is stored in multiple
tables in Hive and HBase. As a result, query performance degrades when join operations
are required.
In this thesis, we propose an architecture which stores data from multiple sources in a single
HBase table. A new table schema with a unique row key is designed to integrate
multi-source data in one table. No join operation is needed in the proposed technique
because the data is integrated into a single HBase table. We evaluated the proposed technique on a
real testbed using a dataset from two publishers. We compared performance by storing
the data in Hive and in the proposed HBase table. Results show improved query performance
for the proposed technique compared to the traditional approach of using join operations across
multiple tables in Hive.
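
The abstract does not give the actual table schema or row key format, so the following is only a minimal sketch of the general idea, written in plain Python rather than against a real HBase client. It assumes, purely for illustration, that both publishers share an identifier (here called item_id), that this identifier is normalized into the row key, and that each source gets its own column family (pub_a, pub_b). The names make_row_key, SingleTable, and join_two_tables are hypothetical and do not come from the thesis.

```python
from collections import defaultdict


def make_row_key(item_id):
    """Normalize the shared identifier into a row key (hypothetical scheme)."""
    return item_id.strip().lower()


class SingleTable:
    """In-memory stand-in for one wide-column table.

    Maps row key -> {"family:qualifier": value}. This is not an HBase client.
    """

    def __init__(self):
        self.rows = defaultdict(dict)

    def put(self, row_key, family, record):
        # Each source writes into its own column family under the same row key.
        for qualifier, value in record.items():
            self.rows[row_key][f"{family}:{qualifier}"] = value

    def get(self, row_key):
        # One keyed lookup returns the data from every source; no join needed.
        return dict(self.rows.get(row_key, {}))


def join_two_tables(table_a, table_b, key):
    """Baseline for comparison: nested-loop inner join of two per-source tables."""
    return [(a, b) for a in table_a for b in table_b if a[key] == b[key]]


# Illustrative records only; not the dataset used in the thesis.
publisher_a = [
    {"item_id": "B-001", "title": "Big Data Basics"},
    {"item_id": "B-002", "title": "HBase Notes"},
]
publisher_b = [{"item_id": "B-001", "price": 30}]

table = SingleTable()
for family, source in (("pub_a", publisher_a), ("pub_b", publisher_b)):
    for rec in source:
        fields = {k: v for k, v in rec.items() if k != "item_id"}
        table.put(make_row_key(rec["item_id"]), family, fields)

row = table.get(make_row_key("B-001"))
joined = join_two_tables(publisher_a, publisher_b, "item_id")
```

In this sketch, table.get answers with a single keyed lookup what join_two_tables answers by matching rows across two tables. Whether that translates into the performance gain reported above depends on the real schema and cluster, which the abstract does not describe.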
URI :
https://vspace.vu.edu.pk/details.aspx?id=326
Citation:
Zulfiqar, S. (2019). A NOVEL ARCHITECTURE TO INTEGRATE MULTI-SOURCE DATA IN DISTRIBUTED ENVIRONMENT. Virtual University of Pakistan (Lahore, Pakistan).
Version :
Final Version
Files in this item
Name | Size | Format
Fall 2019_CS720_MS160400838.pdf | 3611kb | pdf