Parallel processing of big data using Hadoop MapReduce
Kh. Sh. Kuzibaev, T. K. Urazmatov



Main part: Today, about 2.5 quintillion (2.5 × 10^18) bytes of data are created every day, which corresponds to an estimated 2.1 MB of data created per person per day in 2022 [1]. New algorithms and technologies are required to work with data on this scale. According to TrendFocus, the total amount of data collected in 2018 was 912 exabytes [2]. It has been noted that more data was collected between 2013 and 2015 than in the entire previous history of mankind. It is estimated that by 2025 the total volume of data could reach 163 zettabytes (ZB).
Big data is a collection of large-scale, voluminous and multi-format information flows originating from heterogeneous and autonomous data sources [2, 3]. A key characteristic of big data is that it occupies storage space in large-scale data centers and storage area networks. The large dimensions of big data not only lead to data heterogeneity, but also result in different dimensionalities within the dataset [4]. Analysis of large amounts of data helps to identify patterns that are beyond human perception [5]. The term big data was first introduced in a 2008 issue of the journal Nature, in which Clifford Lynch discussed the intensive growth of information volumes in the world. According to experts, streams exceeding 100 GB of data per day can be called big data. The characteristics formulated by the analyst firm META Group (later acquired by Gartner) are important for describing big data.

Fig. 1. Features of big data.



  • Volume – the size of the data [3]; the amount of data largely determines whether it can be considered big data;

  • Variety – the ability to process data of different types and formats simultaneously; it reflects the type and nature of the data;

  • Velocity – the speed at which the data grows and how close to real time it must be processed to obtain a result;

  • Value – the usefulness of the information that can be obtained by processing and analysing large data sets;

  • Veracity – an extended characteristic of big data that refers to the quality and trustworthiness of the data.

On the basis of these characteristics, the object we have chosen, Leo Tolstoy's novel "War and Peace", can be regarded as a large volume of data. The work spans 2,300 pages and contains slightly fewer than 570,000 words, while the number of characters exceeds 2,629,000. The problem we want to solve is to calculate the frequency of words in this work, i.e. to determine how many times each word is used in this huge text.
We used two methods to solve this problem:
1. Analytical computing using a program based on Java Core
2. Parallel computing based on Hadoop MapReduce
Let us now consider these two methods in turn. Our Java Core-based program was written in the Eclipse IDE. It consists of a single class, WordCount, which reads the large input file using java.io.FileInputStream and also uses classes such as java.util.ArrayList, java.util.Iterator and java.util.Scanner. The main body of the program is organised as follows.
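(The original listing is not reproduced here; the following is only a minimal sketch of such a serial word counter, assuming a placeholder input file name and a HashMap for the counts, whereas the actual program, as noted above, works with ArrayList and Iterator.)

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.PrintWriter;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

// Illustrative sketch of a serial word counter; file names are placeholders.
public class WordCount {
    public static void main(String[] args) throws Exception {
        Map<String, Integer> counts = new HashMap<>();

        // Read the text word by word through a FileInputStream, as in the original program.
        try (Scanner in = new Scanner(new FileInputStream("war_and_peace.txt"), "UTF-8")) {
            while (in.hasNext()) {
                // Normalise the token: lower-case and strip non-letter characters.
                String word = in.next().toLowerCase().replaceAll("[^\\p{L}]", "");
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }

        // Write the word frequencies to an output file via FileOutputStream.
        try (PrintWriter out = new PrintWriter(new FileOutputStream("word_counts.txt"))) {
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                out.println(e.getKey() + "\t" + e.getValue());
            }
        }
    }
}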

The program writes the counted words to a file using java.io.FileOutputStream. It performs the computation serially, without any parallelism: the program code is compiled, passed to the JRE (Java Runtime Environment), which in turn hands it to the CPU (central processing unit), where the calculation is performed, and the result is returned along the same chain. We will not dwell further on this first, Java Core-based method; instead we describe the second method in more detail.
Our second method is based on storing the large volume of data in a distributed storage system and processing it using parallel computing. For this we used Hadoop HDFS and Hadoop MapReduce, which are distributed under the Apache license. Apache Hadoop is a set of open-source software tools that make it easy to use a large network of computers to solve big data storage and computation problems. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part, the MapReduce programming model. Hadoop splits files into large blocks and distributes them across the nodes of the cluster; it then ships the packaged code to the nodes so that the data is processed in parallel. The main Apache Hadoop framework consists of the following modules:



  • Hadoop Common – libraries and utilities needed by the other Hadoop modules;

  • Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines, providing very high aggregate throughput across the cluster;

  • Hadoop YARN (introduced in 2012) – a platform responsible for managing computing resources in the cluster and using them to schedule users' applications;

  • Hadoop MapReduce – an implementation of the MapReduce programming model for large-scale data processing;

  • Hadoop Ozone (introduced in 2020) – an object store for Hadoop.


Fig. 2. Hadoop modules.
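To make the MapReduce programming model concrete, the sketch below shows what a word-count mapper and reducer can look like with the org.apache.hadoop.mapreduce API. It is only an illustration: the class names WC_Mapper and WC_Reducer are assumed so that they match the WC_Runner driver used later, and the job actually used in our experiments may be organised differently.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit (word, 1) for every token of the input split assigned to this node.
public class WC_Mapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken().toLowerCase());
            context.write(word, ONE);
        }
    }
}

// Reduce phase: sum the 1s collected for each distinct word.
class WC_Reducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));
    }
}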


After configuring these Hadoop modules on the computer, we create a job that processes the words of the novel. Hadoop jobs can be written in programming languages such as Java, Python, C++ and Scala. Once Hadoop is configured and a job suitable for our purpose has been created, the Hadoop services can be started from the command line by running the start-all script (start-all.sh, or start-all.cmd on Windows). After this command, the following Hadoop daemons are started:

  • HDFS DataNode

  • HDFS NameNode

  • YARN NodeManager

  • YARN ResourceManager

In the next step, we create a new directory in HDFS from the same command line with the command hdfs dfs -mkdir /test (the directory can be given any name). We then copy our .txt file, which serves as the large data set, into this new HDFS directory with the command hdfs dfs -put /home/codegyani/data.txt /test.
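As an aside, the same two steps can also be performed programmatically through Hadoop's FileSystem Java API; the sketch below is only illustrative and simply mirrors the paths used in the commands above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative equivalent of the hdfs dfs commands above, using the HDFS Java API.
public class HdfsUpload {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        fs.mkdirs(new Path("/test"));                            // hdfs dfs -mkdir /test
        fs.copyFromLocalFile(new Path("/home/codegyani/data.txt"),
                             new Path("/test"));                 // hdfs dfs -put ... /test
        fs.close();
    }
}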

Fig. 3. The workflow of the Hadoop modules.


Having stored the large data set in the distributed file system, we can now run our job to process it. The job is launched from the command line as follows:
hadoop jar /home/codegyani/wordcountdemo.jar com.javatpoint.WC_Runner /test/data.txt /r_output
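The driver class itself is not listed in the article; as a rough indication, a minimal runner for the mapper and reducer sketched above could look as follows (the real com.javatpoint.WC_Runner may differ in detail).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: wires the mapper and reducer into a job and submits it to the cluster.
public class WC_Runner {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WC_Runner.class);
        job.setMapperClass(WC_Mapper.class);
        job.setCombinerClass(WC_Reducer.class);   // optional local pre-aggregation on each node
        job.setReducerClass(WC_Reducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /test/data.txt
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /r_output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}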


Results: Looking at the results of the experiments, we can see that they coincide completely: the number of occurrences counted for each word is exactly the same for both methods. The obtained result is shown in the diagram below.



Fig. 4. Frequency analysis of words in the large data set.

In addition to the word-frequency analysis, the time spent processing the large volume of data is also important to us, since our main goal is to speed up the processing. The figure below shows the time taken by the Java Core-based program to process the selected data set.





Fig. 5. Time spent on Java Core-based processing.

The next figure shows the time and other resources consumed when the same data is processed in parallel with Hadoop.



Fig. 6. Time taken for parallel processing with Hadoop.
