Question:
Overview
This assignment requires you to compile a set of data, load this data into HDFS and write a map-reduce process that will extract and present the data as outlined in the following sections.
Background
The Wikimedia Foundation, Inc. (http://wikimediafoundation.org/) is a nonprofit charitable organization dedicated to encouraging the growth, development and distribution of free, multilingual, educational content, and to providing the full content of these wiki-based projects to the public free of charge. The Wikimedia Foundation operates some of the largest collaboratively edited reference projects in the world; you are probably most familiar with Wikipedia which is a free encyclopedia and is available in over 50 languages (see https://meta.wikimedia.org/wiki/List_of_Wikipedias for a list of languages).
Information on all the projects that are the core of the Wikimedia Foundation is available at
http://wikimediafoundation.org/wiki/Our_projects.
Aggregated page view statistics for Wikimedia projects are available at http://dumps.wikimedia.org/other/pagecounts-raw/. This page gives access to files that contain the total hourly page views, per page, for Wikimedia project pages. Information on the file format is given on this page view statistics page.
Required Tasks
The task of this assignment is twofold:
- 1. Use HDFS and MapReduce to identify the popularity of Wikipedia projects by the number of pages of each Wikipedia site which were accessed over an x hour period. Your job should allow you to directly identify from the output the most popular Wikipedia sites accessed over the time period selected. You can choose whichever x hour period you wish from the files available on the page view statistics page, with the constraint that x>=6.
- 2. Use HDFS and MapReduce to identify the average page count per language over the same
period, ordered by page count.
Deliverables
You will be required to document your approach for processing the data and producing the required
outputs using map-reduce only.
Your report (saved as a PDF document) should contain the following:
− Explanation of the steps you performed for loading the data sets into HDFS
− Detailed design, including diagrams and detailed explanations of each part of the process
− Explanations of any design decisions (evaluating alternatives) and any assumptions made
− Well written and fully commented Java code for the map-reduce process
− Examples of the output files from the map-reduce process illustrating the data produced at
each stage.
The output files from the map-reduce process should be included. If these are not included then
your assignment mark will be reduced by 30%.
Submission Details
The assignment is due by 26th February @23:00
You should create one document/report containing all the material for each item listed in the deliverables. Convert this document into a PDF. It is this PDF document that should be submitted. All images should be embedded in this document.
In addition to the report, the output files from the map-reduce process should be submitted. You will
need to extract these files from HDFS.
The Report and the Output Files should be ZIPPED (only zip format will be accepted) and it is this ZIP
file that should be submitted on WebCourses.
You will need to submit your assignment on WebCourses. You cannot submit your assignment via
email.
Marking Scheme
The marking scheme for this assignment is:
− 10% Explanation of the steps you performed for loading the data sets into HDFS.
− 25% Design and structure of the map-reduce process.
− 40% Well written and fully commented Java code for the map-reduce process.
− 15% Extent of use of map-reduce features and scalability.
− 10% Output files from the map-reduce process.
The documentation for your assignment must contain your name, your student number, your class, course (DT2??) and year information, the assignment, and your lecturer's name. Failure to give this information will incur a 10% penalty.
The assignment must be performed individually.
Answer:
Explanation of the design decisions
HDFS maintains an up-to-date directory structure for the page-count logs, so that storage and management of the files are handled in one place: the file-system namespace is served by the NameNode, and in larger deployments it can be partitioned across separate NameNodes. Because the NameNode holds all file and block metadata in memory, HDFS suffers from the well-known small-file problem, i.e. its scalability is limited when a very large number of small files must be tracked (Pasari et al., 2016); the hourly dump files are therefore kept as a small number of large files. HDFS also exposes data awareness (block locations), which allows map and reduce tasks to be scheduled on the nodes that already hold the data, and file access is performed through the native Java API. The known limitations of this design are its portability trade-offs and the NameNode acting as a performance bottleneck when HDFS is integrated into enterprise-level infrastructure.
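As a minimal sketch of the loading step through that native Java API, the downloaded hourly pagecounts files could be copied into HDFS as below. The local and HDFS paths and the class name are illustrative assumptions, not part of the assignment environment; the same result can be obtained from the shell with hdfs dfs -put.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Copies the downloaded hourly pagecounts files from the local
 * filesystem into HDFS so the map-reduce jobs can read them.
 * Both paths below are illustrative placeholders.
 */
public class LoadPagecounts {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();           // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);               // handle to the configured HDFS instance

        Path local  = new Path("/home/user/pagecounts/");   // directory of downloaded hourly files
        Path remote = new Path("/user/hadoop/pagecounts/"); // target directory inside HDFS

        fs.mkdirs(remote);                                  // create the target directory if absent
        fs.copyFromLocalFile(false, true, local, remote);   // keep the local copy, overwrite remote
        fs.close();
    }
}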
Well written and fully commented Java code for the map-reduce process
Provided in the Virtual Machine
Output
Provided in the Virtual Machine
Use HDFS and MapReduce to identify the popularity of Wikipedia projects by the number of pages of each Wikipedia site which were accessed over an x hour period.
Hadoop is an open-source software framework that runs MapReduce computations on clusters built from commodity hardware. Its core modules are Hadoop Common, which contains the libraries and utilities needed by the other modules; the Hadoop Distributed File System (HDFS), which stores the data across the commodity machines and provides very high aggregate bandwidth; a resource-management platform; and the MapReduce engine itself (Steele et al., 2016). Rather than moving data to the computation, Hadoop ships the packaged job code to the nodes that hold the data, so the data is processed in parallel with full data locality, which makes access and processing considerably faster. The design follows Google's MapReduce and Google File System work: the framework handles distribution, scheduling and fault tolerance, while the user program implements only the map and reduce functions.
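The following is a sketch of how Task 1 could be structured as a single MapReduce job. It assumes the documented pagecounts-raw line format of four space-separated fields (project code, page title, hourly view count, bytes transferred), and it interprets "popularity" as one record per accessed page per hour, summed per project; the class names and that interpretation are assumptions of this sketch, not the definitive solution.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Counts, per Wikimedia project code, how many page records appear
 *  in the selected hourly pagecounts files (a proxy for popularity). */
public class ProjectPopularity {

    public static class PageMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text project = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            // Each line: "<project> <page_title> <view_count> <bytes>"
            String[] fields = value.toString().split(" ");
            if (fields.length == 4) {
                project.set(fields[0]);       // e.g. "en", "de.b", "fr.d"
                ctx.write(project, ONE);      // one accessed-page record
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> vals, Context ctx)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable v : vals) total += v.get();
            ctx.write(key, new IntWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "project popularity");
        job.setJarByClass(ProjectPopularity.class);
        job.setMapperClass(PageMapper.class);
        job.setCombinerClass(SumReducer.class);   // safe: summation is associative
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Because there are only a few hundred project codes, ordering the per-project output to identify the most popular sites can be done with a short follow-up job or by inspecting the output directly.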
Use HDFS and MapReduce to identify the average page count per language over the same period, ordered by page count.
The design decisions concern work scheduling and reliable execution of the job code. HDFS replicates each block for redundancy and places the replicas on multiple racks, so that a rack or switch failure does not make the data unavailable. A Hadoop cluster consists of multiple worker nodes, each running a TaskTracker and a DataNode, coordinated by a JobTracker and a NameNode; a secondary NameNode takes periodic snapshots of the NameNode's in-memory structures, which protects against file-system corruption and metadata loss. The job-tracking machinery can also run over alternative file systems, although only HDFS provides the NameNode/DataNode architecture described here. Because replication already provides redundancy, HDFS stores file data directly on the hosts rather than on RAID storage, spreading copies across racks, and it deliberately trades off some file-system guarantees for higher data throughput (Singh & Kaur, 2016).
With HDFS as the storage layer, the file data sits on the DataNodes and the metadata on the NameNode, both on top of the underlying operating system, and the scheduler gives priority to the nodes that already contain a job's data. By default jobs are executed with FIFO scheduling, with optional priorities, and a whole job can be held up waiting for its slowest task; speculative execution mitigates this by re-running slow tasks on other nodes. The Fair Scheduler instead groups jobs into pools and splits the cluster's capacity between them, limiting the slots any one pool may consume so that similar jobs cannot monopolise the cluster. The Capacity Scheduler allocates each queue a fraction of the total resource capacity; free resources can be allocated beyond a queue's nominal capacity, with higher-priority queues getting first access to those resources (Bodkhe & Sood, 2016). On top of this, HDFS serves as the storage layer for data-warehouse workloads that process data in parallel, including near-real-time systems and machine-learning pipelines.
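For Task 2, a sketch of a job that averages the hourly view counts per language is given below. It again assumes the four-field pagecounts line format, and additionally assumes that the language code is the part of the project column before any '.' suffix (e.g. "en.b" counts towards "en"); the class names are illustrative.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Computes the average per-page view count for each language code
 *  over the selected hourly pagecounts files. */
public class LanguageAverage {

    public static class ViewMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private final Text language = new Text();
        private final LongWritable views = new LongWritable();

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            // Line: "<project> <page_title> <view_count> <bytes>";
            // the language is the project code before any '.' suffix.
            String[] fields = value.toString().split(" ");
            if (fields.length == 4) {
                try {
                    views.set(Long.parseLong(fields[2]));
                } catch (NumberFormatException e) {
                    return;                   // skip malformed records
                }
                language.set(fields[0].split("\\.")[0]);
                ctx.write(language, views);
            }
        }
    }

    public static class AvgReducer
            extends Reducer<Text, LongWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> vals, Context ctx)
                throws IOException, InterruptedException {
            long sum = 0, count = 0;
            for (LongWritable v : vals) { sum += v.get(); count++; }
            ctx.write(key, new DoubleWritable((double) sum / count));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "language average");
        job.setJarByClass(LanguageAverage.class);
        job.setMapperClass(ViewMapper.class);
        job.setReducerClass(AvgReducer.class);  // no combiner: an average of averages is wrong
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Ordering the result by page count can then be handled by a second, trivial MapReduce job that swaps key and value and relies on the framework's sort phase with a descending comparator, since a reducer only sees its input sorted by key.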
References
Steele, B., Chandler, J., & Reddy, S. (2016). Hadoop and MapReduce. In Algorithms for Data Science (pp. 105-129). Springer International Publishing.
Pasari, R., Chaudhari, V., Borkar, A., & Joshi, A. (2016, August). Parallelization of Vertical Search Engine using Hadoop and MapReduce. In Proceedings of the International Conference on Advances in Information Communication Technology & Computing (p. 51). ACM.
Singh, R., & Kaur, P. J. (2016). Analyzing performance of Apache Tez and MapReduce with hadoop multinode cluster on Amazon cloud. Journal of Big Data, 3(1), 19.
Bodkhe, B., & Sood, S. P. (2016). Dynamic Slot Allocation Optimization Framework for HADOOP MapReduce Clusters. IJETT, 3(2).