A NOVEL ARCHITECTURE TO INTEGRATE MULTI-SOURCE DATA IN DISTRIBUTED ENVIRONMENT

Author: SIDRA ZULFIQAR


Citable URI : https://vspace.vu.edu.pk/detail.aspx?id=326

Publisher : Virtual University

Date Issued: 7/4/2020


Abstract

The amount of data has been increasing over the last few years with the emergence of various end-user applications, which rely on cloud computing infrastructure in data centers. Beyond the growing volume, the variety, velocity, and veracity of the data contribute to the big data problem. Traditional database management systems cannot handle big data efficiently, so a big data platform is needed. Hadoop is one such platform, and it uses a distributed storage system. Hive and HBase are among the big data tools for storing data in Hadoop; both run on top of the Hadoop Distributed File System (HDFS). Hive is an open-source data warehouse framework that lets programmers query and analyze large data sets stored in HDFS. HBase is a column-oriented, distributed, and highly fault-tolerant database used to store and manage big data; it can store billions of rows. When data comes from multiple sources, it is stored in multiple tables in Hive and HBase, and query performance degrades whenever join operations are needed. In this thesis, we propose an architecture that stores data from multiple sources in a single HBase table. A new table schema with a unique row key is designed to integrate the multi-source data in one table, so no join operation is needed. We evaluated the proposed technique on a real testbed using a dataset from two publishers, comparing performance when the data is stored in Hive and when it is stored in the proposed HBase table. Results show improved query performance for the proposed technique compared with the traditional approach of using join operations across multiple tables in Hive.
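
The single-table idea in the abstract can be illustrated with a minimal Python sketch. It does not use HBase and does not reproduce the thesis's schema, which the abstract does not specify. The `WideTable` class, the `book_id` field, the `pub_a` and `pub_b` column-family names, the key rule in `make_row_key`, and the sample records are all assumptions made for illustration.

```python
from collections import defaultdict


class WideTable:
    """In-memory stand-in for a wide-column table: row key -> {"family:qualifier": value}."""

    def __init__(self):
        self.rows = defaultdict(dict)

    def put(self, row_key, family, values):
        # Each source writes into its own column family under the shared row key.
        for qualifier, value in values.items():
            self.rows[row_key][f"{family}:{qualifier}"] = value

    def get(self, row_key):
        # A single key lookup returns the columns from every source; no join is needed.
        return dict(self.rows.get(row_key, {}))


def make_row_key(book_id):
    # Hypothetical key rule: normalise the identifier shared by both sources.
    return book_id.strip().lower()


def integrate(sources):
    """Load {family: records} into one table, one row per shared identifier."""
    table = WideTable()
    for family, records in sources.items():
        for record in records:
            values = {k: v for k, v in record.items() if k != "book_id"}
            table.put(make_row_key(record["book_id"]), family, values)
    return table


# Made-up sample records for two publishers (not the thesis dataset).
publisher_a = [{"book_id": "B001", "title": "Example Title", "price": 20}]
publisher_b = [{"book_id": "b001 ", "title": "Example Title", "price": 18}]

table = integrate({"pub_a": publisher_a, "pub_b": publisher_b})
row = table.get(make_row_key("B001"))
print(row)
```

Because both publishers' records land in the same row, a query for one book is a single key lookup. With one table per source, the same query would first have to match rows across tables on the shared identifier.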


URI : https://vspace.vu.edu.pk/details.aspx?id=326

Citation: Zulfiqar, S. (2019). A NOVEL ARCHITECTURE TO INTEGRATE MULTI-SOURCE DATA IN DISTRIBUTED ENVIRONMENT. Virtual University of Pakistan (Lahore, Pakistan).

Version : Final Version

Files in this item

Name Size Format
Fall 2019_CS720_MS160400838.pdf 3611kb pdf


Copyright 2016 © Virtual University of Pakistan