Loading data into HDFS & Explanations: 530387

Question:

Overview

This assignment requires you to compile a set of data, load this data into HDFS and to write a map-reduce process that will extract and present the data as outlined in the following sections.

 

Background

The Wikimedia Foundation, Inc. (http://wikimediafoundation.org/) is a nonprofit charitable organization dedicated to encouraging the growth, development and distribution of free, multilingual, educational content, and to providing the full content of these wiki-based projects to the public free of charge. The Wikimedia Foundation operates some of the largest collaboratively edited reference projects in the world; you are probably most familiar with Wikipedia, which is a free encyclopedia and is available in over 50 languages (see https://meta.wikimedia.org/wiki/List_of_Wikipedias for a list of languages).

 

Information on all the projects that are the core of the Wikimedia Foundation is available at http://wikimediafoundation.org/wiki/Our_projects.

 

Aggregated page view statistics for Wikimedia projects are available at http://dumps.wikimedia.org/other/pagecounts-raw/. This page gives access to files that contain the total hourly page views for Wikimedia project pages, page by page. Information on the file format is given on this page view statistics page.

 

Required Tasks

The task of this assignment is twofold:

  1. Use HDFS and MapReduce to identify the popularity of Wikipedia projects by the number of pages of each Wikipedia site which were accessed over an x hour period. Your job should allow you to directly identify from the output the most popular Wikipedia sites accessed over the time period selected. You can choose whichever x hour period you wish from the files available on the page view statistics page, with the constraint that x >= 6.
  2. Use HDFS and MapReduce to identify the average page count per language over the same period, ordered by page count.

Deliverables

You will be required to document your approach for processing the data and producing the required outputs using map-reduce only.

 

Your report (saved as a PDF document) should contain the following:

−    Explanation of the steps you performed for loading the data sets into HDFS
−    Detailed design, including diagrams and detailed explanations of each part of the process
−    Explanations of any design decisions (evaluating alternatives) and any assumptions made
−    Well written and fully commented Java code for the map-reduce process
−    Examples of the output files from the map-reduce process illustrating the data produced at each stage.

 

The output files from the map-reduce process should be included. If these are not included then your assignment mark will be reduced by 30%.

Submission Details

The assignment is due by 26th February @ 23:00.

 

You should create one document/report containing all the material for each item listed in the deliverables. Convert this document into a PDF. It is this PDF document that should be submitted. All images should be embedded in this document.

 

In addition to the report, the output files from the map-reduce process should be submitted. You will need to extract these files from HDFS.

 

The Report and the Output Files should be ZIPPED (only zip format will be accepted) and it is this ZIP file that should be submitted on WebCourses.

 

You will need to submit your assignment on WebCourses. You cannot submit your assignment via email.

 

Marking Scheme

The marking scheme for this assignment is:

−    10%    Explanation of the steps you performed for loading the data sets into HDFS.
−    25%    Design and structure of the map-reduce process.
−    40%    Well written and fully commented Java code for the map-reduce process.
−    15%    Extent of use of map-reduce features and scalability.
−    10%    Output files from the map-reduce process.

 

The documentation for your assignment must contain your name, your student number, your class, course (DT2??) and year information, the assignment, and the lecturer's name. Failure to give this information will incur a 10% penalty.

The assignment must be performed individually.

Answer:


Explanation of the design decisions

The design decisions for HDFS centre on creating a clear, up-to-date directory structure for the page view logs so that storage and management remain effective. HDFS serves its namespace through a dedicated NameNode, which is why a very large number of small files becomes a scalability concern (Pasari et al., 2016). The design also relies on data awareness: map and reduce tasks are scheduled on the nodes that already hold the relevant data blocks, and file access is performed through the native Java API. Known limitations of HDFS, such as portability overheads and the NameNode acting as a potential performance bottleneck, were taken into account when deciding how the data would be organised and integrated into an enterprise-level infrastructure.
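As an illustration of the loading step, the sketch below copies a locally downloaded hourly pagecounts file into HDFS using Hadoop's FileSystem API. The file name and the target HDFS directory are placeholders chosen for this example, not the exact paths used in the virtual machine.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: copy a downloaded hourly pagecounts dump into HDFS.
// The file name and target directory below are placeholders for illustration only.
public class LoadPagecounts {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();             // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);                 // handle to the configured file system (HDFS)

        Path local = new Path("pagecounts-20160201-000000");  // locally downloaded dump file (placeholder name)
        Path target = new Path("/user/student/pagecounts/");  // HDFS input directory (placeholder path)

        fs.mkdirs(target);                                    // make sure the input directory exists
        fs.copyFromLocalFile(local, target);                  // equivalent to: hdfs dfs -put <file> <dir>
        fs.close();
    }
}

The same copy can equally be performed from the command line with hdfs dfs -put; the API form is shown only because the rest of the process is written in Java.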

Well written and fully commented Java code for the map-reduce process

Provided in Virtual Machine

Output

Provided in the Virtual Machine

 

  1. Use HDFS and MapReduce to identify the popularity of Wikipedia projects by the number of pages of each Wikipedia site which were accessed over an x hour period.

Hadoop is an open source software framework that runs MapReduce jobs on computer clusters built from commodity hardware. Its core modules are the Hadoop Distributed File System and a processing layer that ships the packaged job code to the nodes that hold the data (Steele et al., 2016). The data is processed in parallel and with data locality, so each node manipulates the blocks it can access directly, which allows much faster processing than moving the data to the computation. Architecturally, Hadoop Common contains the libraries and utilities needed by the other modules; the Distributed File System stores the data on the commodity machines, giving very high aggregate bandwidth; YARN provides the resource management platform; and Hadoop MapReduce, which follows Google's MapReduce model (with HDFS modelled on the Google File System), implements the map and reduce parts of the user program.
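To make this concrete for task 1, the following is a minimal sketch of how such a job could be written in Java. It assumes the standard pagecounts-raw line format of four space-separated fields (project code, page title, hourly view count, bytes transferred) and simply counts one accessed page per line for each project; the class names and paths are illustrative rather than the exact code provided in the virtual machine.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ProjectPopularity {

    // Mapper: one input line = "project page_title view_count bytes".
    // Emits (project, 1) for every page entry seen in the selected hours.
    public static class PageMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text project = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(" ");
            if (fields.length >= 3) {            // skip malformed lines
                project.set(fields[0]);          // e.g. "en", "en.b", "de"
                context.write(project, ONE);
            }
        }
    }

    // Reducer: sums the page entries per project; also usable as a combiner.
    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable v : values) {
                total += v.get();
            }
            context.write(key, new LongWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "project popularity");
        job.setJarByClass(ProjectPopularity.class);
        job.setMapperClass(PageMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS directory with the hourly files
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Running this over the chosen x-hour window of files and then sorting the output by the count column makes the most popular projects directly visible.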


  2. Use HDFS and MapReduce to identify the average page count per language over the same period, ordered by page count.
    The design decisions here concern work scheduling and the execution of the code. HDFS replicates each data block for redundancy and spreads the replicas across multiple racks, which reduces the impact of a rack or switch failure while keeping the data available. The Hadoop cluster consists of multiple worker nodes, each running a TaskTracker and a DataNode, coordinated by the JobTracker and the NameNode. The NameNode's in-memory structures can be checkpointed to snapshots, which helps prevent file-system corruption and data loss, and the job tracking server can also manage alternative file systems through the NameNode/DataNode architecture. Because blocks are already replicated across hosts and racks, HDFS does not depend on RAID storage on the individual hosts, and its design deliberately trades some file-system semantics for better data throughput (Singh & Kaur, 2016). A sketch of how the averaging job could be structured is given below.
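The sketch below shows one way the averaging job could look, assuming the language is taken as the project code up to the first dot (so "en" and "en.b" both map to "en"); the class names are again illustrative only.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LanguageAverage {

    // Mapper: extracts the language (project code up to the first '.') and the hourly view count.
    public static class LanguageMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private final Text language = new Text();
        private final LongWritable views = new LongWritable();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(" ");
            if (fields.length >= 3) {
                String project = fields[0];                    // e.g. "en", "en.b", "fr.d"
                int dot = project.indexOf('.');
                language.set(dot < 0 ? project : project.substring(0, dot));
                try {
                    views.set(Long.parseLong(fields[2]));      // hourly view count for this page
                    context.write(language, views);
                } catch (NumberFormatException ignored) {      // skip malformed counts
                }
            }
        }
    }

    // Reducer: averages the per-page counts for each language.
    public static class AverageReducer extends Reducer<Text, LongWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            long pages = 0;
            for (LongWritable v : values) {
                sum += v.get();
                pages++;
            }
            context.write(key, new DoubleWritable((double) sum / pages));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "average page count per language");
        job.setJarByClass(LanguageAverage.class);
        job.setMapperClass(LanguageMapper.class);
        job.setReducerClass(AverageReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Since an average cannot be pre-aggregated safely, no combiner is set here; ordering the languages by page count can then be done with a short second job (or a simple sort of the output file) once the averages have been produced.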

Hadoop can also run over other file systems, but with HDFS the NameNode and DataNodes store their data on the underlying operating system's local storage. When MapReduce jobs are scheduled, the framework strives to place tasks on, or close to, the nodes that already contain the data, giving data locality basic priority. The default scheduler is FIFO with optional job priorities, which can leave queued work waiting behind the slowest job; the fair scheduler instead groups jobs into pools and splits the available capacity between them, limiting how many similar jobs run at once, while the capacity scheduler allocates each queue a fraction of the resource capacity, allocates free resources beyond that share, and gives higher-priority queues earlier access to resources (Bodkhe & Sood, 2016). On top of this foundation, data-warehouse style tools can process the same data in parallel, and the cluster can also support near-real-time and machine learning workloads.
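If the fair or capacity scheduler is enabled on the cluster, the driver can direct the job to a particular queue and size the reduce phase to the cluster, as in the small sketch below; the queue name and reducer count are placeholders, not settings taken from the assignment environment.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch only: scheduler- and scalability-related driver settings.
// The queue name "students" and the reducer count 4 are placeholder values.
public class SchedulerSettingsExample {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.job.queuename", "students"); // target queue/pool (if a scheduler with queues is enabled)
        Job job = Job.getInstance(conf, "pagecounts job");
        job.setNumReduceTasks(4);                        // more reducers let the reduce phase scale out
        return job;
    }
}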

References

Steele, B., Chandler, J., & Reddy, S. (2016). Hadoop and MapReduce. In Algorithms for Data Science (pp. 105-129). Springer International Publishing.

Pasari, R., Chaudhari, V., Borkar, A., & Joshi, A. (2016, August). Parallelization of Vertical Search Engine using Hadoop and MapReduce. In Proceedings of the International Conference on Advances in Information Communication Technology & Computing (p. 51). ACM.

Singh, R., & Kaur, P. J. (2016). Analyzing performance of Apache Tez and MapReduce with hadoop multinode cluster on Amazon cloud. Journal of Big Data, 3(1), 19.

Bodkhe, B., & Sood, S. P. (2016). Dynamic Slot Allocation Optimization Framework for HADOOP MapReduce Clusters. IJETT, 3(2).