Subject Code & Title: COSC 2637 Big Data Processing
Assessment Type: Individual assignment. Submit online via Canvas → Assignment 2. Marks are awarded for
meeting the requirements as closely as possible. Clarifications and updates may be made via announcements or the relevant discussion forums.
Overview: Write MapReduce and Spark programs. This gives you the chance to understand the complexity of MapReduce and Spark programming, the essential components covered in lectures, the debugging methods particular to these platforms, and the impact of cluster size on performance.
COSC 2637 Big Data Processing Assignment – RMIT University Australia.

Learning Outcomes:
The key course learning outcomes are:
CLO 1. Model and implement efficient big data solutions for various application areas using appropriately
selected algorithms and data structures.
CLO 2. Analyze methods and algorithms, to compare and evaluate them with respect to time and space
requirements and make appropriate design choices when solving real-world problems.
CLO 3. Motivate and explain trade-offs in big data processing technique design and analysis in written and oral form.
CLO 4. Explain the Big Data Fundamentals, including the evolution of Big Data, the characteristics of
Big Data and the challenges introduced.
CLO 5. Apply non-relational databases, the techniques for storing and processing large volumes of
structured and unstructured data, as well as streaming data.
CLO 6. Apply the novel architectures and platforms introduced for Big Data, in particular Hadoop and
MapReduce.

Task 1 – Compute Co-occurrence Matrix
Task 1.1 – Implement both the “pairs” approach and the “stripes” approach to compute the co-occurrence matrix,
in which the frequency of each word pair is maintained. The context of a word is defined as the words in the same line.
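
As a rough illustration of the two approaches named in Task 1.1, the sketch below counts co-occurrences in plain Java without any Hadoop dependency. It only mimics what the mappers would emit and what the reducers would sum; it is not the required EMR implementation. The class and method names are hypothetical, and two choices are assumptions: pairs are treated as ordered, and a word is paired with every other position in its line, so a word repeated in a line co-occurs with itself.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the pairs and stripes (strips) approaches.
// Names are hypothetical; no Hadoop classes are used.
class CooccurrenceCountSketch {

    // Pairs: conceptually the mapper emits ((w, u), 1) for every co-occurring
    // ordered pair in a line and the reducer sums the 1s per key.
    // Here the map merge stands in for both steps.
    static Map<String, Integer> pairs(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) continue;          // skip blank lines
            String[] words = trimmed.split("\\s+");
            for (int i = 0; i < words.length; i++) {
                for (int j = 0; j < words.length; j++) {
                    if (i == j) continue;             // a position is not its own neighbour
                    counts.merge(words[i] + "," + words[j], 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Stripes: conceptually the mapper emits (w, {u1: n1, u2: n2, ...}) and the
    // reducer adds the stripes for the same w element by element.
    // Here each word's stripe is accumulated directly.
    static Map<String, Map<String, Integer>> stripes(List<String> lines) {
        Map<String, Map<String, Integer>> result = new TreeMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) continue;
            String[] words = trimmed.split("\\s+");
            for (int i = 0; i < words.length; i++) {
                Map<String, Integer> stripe =
                        result.computeIfAbsent(words[i], k -> new TreeMap<>());
                for (int j = 0; j < words.length; j++) {
                    if (i == j) continue;
                    stripe.merge(words[j], 1, Integer::sum);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a b c", "a b");
        System.out.println(pairs(lines));
        System.out.println(stripes(lines));
    }
}
```

Both methods hold the same counts. The difference in a real job lies in the number and size of intermediate records: pairs emits many small records, while stripes emits fewer but larger ones.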

Task 1.2 – Implement both the “pairs” approach and the “stripes” approach to compute the co-occurrence matrix,
in which the relative frequency of each word pair is maintained. The context of a word is defined as the words in the
same line. Note that the “pairs” approach should avoid the memory bottleneck issue.
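
One common way to read the memory note in Task 1.2 is the "order inversion" pattern: alongside each pair (w, u), the mapper also emits a marginal record for w whose key sorts ahead of all of w's real pairs. The reducer then sees the total for w first and can divide each pair count as it streams past, without buffering every pair for w. The plain-Java sketch below imitates this with a sorted map standing in for the shuffle and sort phase. The class name, method names and the empty-string marker are hypothetical, and the sketch assumes whitespace-separated tokens. In a real Hadoop job a custom partitioner would also be needed so that all keys sharing the same left word reach the same reducer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of relative frequencies with the pairs approach,
// using order inversion. Names are hypothetical; no Hadoop classes are used.
class PairsRelativeFrequencySketch {

    // Hypothetical marginal marker: "w\t" is a prefix of every "w\tu" key,
    // so it sorts before all real pairs of w.
    static final String MARGINAL = "";

    // Stand-in for map + sort: emits ((w, u), 1) and ((w, MARGINAL), 1),
    // summed per key and kept in key order by the TreeMap.
    static TreeMap<String, Integer> mapAndSort(List<String> lines) {
        TreeMap<String, Integer> sorted = new TreeMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) continue;
            String[] words = trimmed.split("\\s+");
            for (int i = 0; i < words.length; i++) {
                for (int j = 0; j < words.length; j++) {
                    if (i == j) continue;
                    sorted.merge(words[i] + "\t" + words[j], 1, Integer::sum);
                    sorted.merge(words[i] + "\t" + MARGINAL, 1, Integer::sum);
                }
            }
        }
        return sorted;
    }

    // Stand-in for reduce: streams over the sorted keys and remembers only
    // the marginal of the current left word.
    static Map<String, Double> reduce(TreeMap<String, Integer> sorted) {
        Map<String, Double> out = new TreeMap<>();
        double marginal = 0.0;
        for (Map.Entry<String, Integer> e : sorted.entrySet()) {
            String[] key = e.getKey().split("\t", -1);   // -1 keeps the empty marker
            if (key[1].equals(MARGINAL)) {
                marginal = e.getValue();                 // total for this left word
            } else {
                out.put(key[0] + "," + key[1], e.getValue() / marginal);
            }
        }
        return out;
    }

    static Map<String, Double> relativeFrequencies(List<String> lines) {
        return reduce(mapAndSort(lines));
    }

    public static void main(String[] args) {
        System.out.println(relativeFrequencies(Arrays.asList("a b c", "a b")));
    }
}
```

With the stripes approach no such trick is needed, because the reducer already holds the whole stripe for a word and can sum it before dividing.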

You should use Java to develop your MapReduce program on AWS EMR (if you want to use another programming
language, please contact the lecturer for approval).

Task 2 – Spark Streaming
Develop code in a Scala Maven project to monitor a folder in HDFS in real time, so that any new file in the folder is processed. In this assignment, you are required to load the “3 little pigs”, “Melbourne” and “RMIT” files into the monitored folder in that order, waiting at least 10 seconds between two files. For each RDD in the stream, the following subtasks are performed concurrently:
(a) Count the word frequency and save the output in HDFS.
Note: for each word, make sure that the space (“ ”), comma (“,”), semicolon (“;”), colon (“:”), period (“.”), apostrophe (“’”), quotation mark (“"”), exclamation mark (“!”), question mark (“?”) and brackets (“[”, “{”, “(”, “<”, “]”, “)”, “}”, “>”) are trimmed.
(b) Filter out the short words (i.e., < 5 characters) and save the output in HDFS.
(c) Count the co-occurrence of words in each RDD, where the context is the same line, and save the output in HDFS.
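
The trimming rule in (a) and the length filter in (b) can be sketched independently of Spark. The block below is written in Java, the language Task 1 prescribes, purely to illustrate the per-token logic; the Scala submission would apply the same regular expression inside its stream transformations. No Spark API is used, and all names are hypothetical. Two points are assumptions: both straight and curly forms of the apostrophe and quotation marks are trimmed, and "filter out the short words" is read as keeping only words of five or more characters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Plain-Java sketch of the per-token logic for subtasks (a) and (b).
// Names are hypothetical; no Spark classes are used.
class TokenCleaningSketch {

    // Characters listed in the note under (a). Including the curly quote
    // forms alongside the straight ones is an assumption.
    private static final String TRIM_CLASS =
            "[\\s,;:.'\u2019\"\u201C\u201D!?\\[\\]{}()<>]";

    // Matches a run of trim characters at the start or at the end of a token.
    private static final Pattern EDGES =
            Pattern.compile("^" + TRIM_CLASS + "+|" + TRIM_CLASS + "+$");

    // Strips the listed characters from both ends; interior characters stay.
    static String trim(String token) {
        return EDGES.matcher(token).replaceAll("");
    }

    // Splits a line on whitespace, trims each token and drops empty results.
    static List<String> cleanTokens(String line) {
        List<String> out = new ArrayList<>();
        for (String raw : line.split("\\s+")) {
            String word = trim(raw);
            if (!word.isEmpty()) out.add(word);
        }
        return out;
    }

    // Subtask (a): word frequency over cleaned tokens.
    static Map<String, Integer> wordCounts(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : cleanTokens(line)) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Subtask (b): drop words shorter than 5 characters.
    static List<String> longWords(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            for (String word : cleanTokens(line)) {
                if (word.length() >= 5) out.add(word);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("\"Little pigs!\" (said the wolf).");
        System.out.println(wordCounts(lines));
        System.out.println(longWords(lines));
    }
}
```

For subtask (c), the cleaned tokens of one line would form the context within which word pairs are counted.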

You should use Scala to develop your Spark program on AWS EMR (if you want to use another programming
language, please contact the lecturer for approval).
