Report-2282662

Combiner

Input

The script reads the sorted contents of mapout.txt from standard input.

Algorithm

The script processes the input line by line, printing each record according to its year and rating (a sketch of this step follows the Output description below).

Output

The output is printed in TSV format to a file named comout.txt.
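
The mapper's exact record layout is not reproduced in this report, so the following is only a minimal sketch of a Hadoop-streaming-style combiner. It assumes the script is written in Python and that each sorted mapper line carries six tab-separated fields, movie_id, title, genre, year, rating_sum, and rating_count (with rating_count equal to 1 straight out of the mapper). These field names are illustrative assumptions, not the actual layout used by the submitted script.

#!/usr/bin/env python3
# combiner.py -- minimal sketch only; the six-field layout is an assumption,
# not necessarily the layout produced by the actual mapper script.
import sys

current_key = None          # (movie_id, year) currently being accumulated
title = genre = ""
rating_sum = 0.0
rating_count = 0

def emit(key, title, genre, rating_sum, rating_count):
    movie_id, year = key
    # One TSV record per (movie_id, year) with partially aggregated ratings
    print(f"{movie_id}\t{title}\t{genre}\t{year}\t{rating_sum}\t{rating_count}")

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        continue                         # skip malformed lines
    movie_id, m_title, m_genre, year, m_sum, m_count = fields
    key = (movie_id, year)
    if key != current_key:
        if current_key is not None:
            emit(current_key, title, genre, rating_sum, rating_count)
        current_key = key
        title, genre = m_title, m_genre
        rating_sum = 0.0
        rating_count = 0
    rating_sum += float(m_sum)
    rating_count += int(m_count)

if current_key is not None:
    emit(current_key, title, genre, rating_sum, rating_count)

An invocation matching the Input and Output descriptions above would be: sort mapout.txt | ./combiner.py > comout.txt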

Reducer

Input

The script reads the sorted contents of comout.txt from standard input.

Algorithm

The algorithm splits each input line into six fields and computes the average rating of each movie per year (a sketch follows the Output description below).

Output

The output is printed in TSV format to a file named results.txt.
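
The six-field layout of comout.txt is not spelled out above, so the sketch below reuses the hypothetical fields from the combiner sketch (movie_id, title, genre, year, rating_sum, rating_count) and computes the average rating of each movie per year by dividing the accumulated sum by the accumulated count. Treat the field names as assumptions rather than the actual script's interface.

#!/usr/bin/env python3
# reducer.py -- minimal sketch only; assumes the same hypothetical six-field
# layout as the combiner sketch above.
import sys

current_key = None          # (movie_id, year) currently being reduced
title = genre = ""
rating_sum = 0.0
rating_count = 0

def emit(key, title, genre, rating_sum, rating_count):
    movie_id, year = key
    average = rating_sum / rating_count if rating_count else 0.0
    # One TSV record per movie per year with its average rating
    print(f"{movie_id}\t{title}\t{genre}\t{year}\t{average:.2f}")

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        continue                         # skip malformed lines
    movie_id, m_title, m_genre, year, m_sum, m_count = fields
    key = (movie_id, year)
    if key != current_key:
        if current_key is not None:
            emit(current_key, title, genre, rating_sum, rating_count)
        current_key = key
        title, genre = m_title, m_genre
        rating_sum = 0.0
        rating_count = 0
    rating_sum += float(m_sum)
    rating_count += int(m_count)

if current_key is not None:
    emit(current_key, title, genre, rating_sum, rating_count)

A matching invocation would be: sort comout.txt | ./reducer.py > results.txt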

Part 4 Distributed Computation

Although Condor and Hadoop are both used for distributed computation, they differ in architecture and functionality. Hadoop processes data in batches of resource-intensive tasks, while Condor schedules and executes computational tasks across distributed resources. For petabyte-scale datasets, Condor would not function in the MapReduce fashion.

With regard to the data storage and processing model, Hadoop stores data in its own distributed file system (HDFS) for parallel processing, while Condor relies on the existing file systems and execution environment provided by the operating system. In other words, Condor focuses on scheduling and managing computational tasks, while file management is carried out by the underlying platform.

Unlike Hadoop, Condor provides greater flexibility through job scheduling, job preemption, and fair-share scheduling. This contrasts with the Hadoop approach of minimizing data transfer and optimizing data locality, and it implies that Condor would be slower than Hadoop in terms of job completion times.

In terms of application, Condor is best suited for compute-intensive work such as scientific calculations and high-throughput tasks that call for distributed computing, whereas Hadoop is best suited for data-intensive jobs such as data warehousing, machine learning, and log analysis. In short, the choice of Condor over Hadoop depends on several factors, and the two systems serve largely different roles in distributed computing.