Due Date: 11pm Sun 29 May (End of Week 11)
Submit one Word document containing all your answers.
1. Identify 10 movies in the movie title dataset you used in assignment 1. The 10 movies should be structurally representative of the movies in the dataset. The data for these 10 movies is referred to as the sample data.
Include the sample data in the report.
2. Design a relational representation for the selected data by showing a table with headings and the tuples for the sample data.
3. Design a logical HBase schema for the movie title dataset and show the sample data together with the schema.
The schema should include a row key and some column families. Present the sample data as attribute-value pairs within each column family.
Justify your choice of row key and column families.
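As a rough illustration of the layout described in Part 3, the sketch below models an HBase-style logical view as nested Python dictionaries (row key, then column family, then attribute-value pairs). The row keys `m0001` and `m0002`, the column families `info` and `genre`, and all movie values are hypothetical placeholders, not data from the assignment dataset or a recommended design.

```python
# Hypothetical sample rows; titles, years and genres are placeholders only.
# Layout: row key -> column family -> attribute (qualifier) -> value
movies = {
    "m0001": {
        "info": {"title": "Example Movie A", "year": "1995"},
        "genre": {"Comedy": "1", "Romance": "1"},
    },
    "m0002": {
        "info": {"title": "Example Movie B", "year": "2001"},
        "genre": {"Drama": "1"},
    },
}


def get_cell(table, row_key, family, qualifier):
    """Return one cell value, or None if the row or column is absent."""
    return table.get(row_key, {}).get(family, {}).get(qualifier)


def to_cells(table):
    """Flatten the nested layout into sorted (row, 'family:qualifier', value) triples."""
    cells = []
    for row_key in sorted(table):
        for family in sorted(table[row_key]):
            for qualifier, value in sorted(table[row_key][family].items()):
                cells.append((row_key, f"{family}:{qualifier}", value))
    return cells


for cell in to_cells(movies):
    print(cell)
```

Note that a row only stores the attributes it actually has, so the two rows above carry different genre columns without any null padding.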
4. Show HTables and region files for the sample data. Assume that each region can hold the data of 3-4 movies for a column family. Show each HTable and region file in a separate table for clarity.
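The region assumption in Part 4 can be sketched as follows. This is a simplified model that splits sorted row keys by row count so that each region holds 3-4 movies; a real HBase cluster splits regions by size rather than by a fixed number of rows, and treats a region's end key as exclusive. The row keys and the helper `split_into_regions` are hypothetical.

```python
import math


def split_into_regions(row_keys, max_rows=4):
    """Split sorted row keys into contiguous regions of at most max_rows rows,
    keeping the region sizes as even as possible."""
    keys = sorted(row_keys)
    if not keys:
        return []
    n_regions = math.ceil(len(keys) / max_rows)
    base, extra = divmod(len(keys), n_regions)
    regions, start = [], 0
    for i in range(n_regions):
        size = base + (1 if i < extra else 0)
        chunk = keys[start:start + size]
        # For readability, end_key here is the last row in the region (inclusive).
        regions.append({"start_key": chunk[0], "end_key": chunk[-1], "rows": chunk})
        start += size
    return regions


# Ten hypothetical row keys standing in for the ten sample movies.
row_keys = [f"m{i:04d}" for i in range(1, 11)]
regions = split_into_regions(row_keys)
for number, region in enumerate(regions, start=1):
    print(number, region["start_key"], region["end_key"], len(region["rows"]))
```

Under these assumptions the ten movies fall into three regions of 4, 3 and 3 rows. Within a region, each column family is stored separately, so each region contributes one region file per column family.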
5. Given an HDFS cluster with two racks, each containing three slave computers, draw a diagram showing one way in which the HBase region files could be stored in HDFS.
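One possible placement for Part 5 can be sketched in code before drawing the diagram. The sketch assumes a replication factor of 3 and the usual HDFS rack-aware pattern: the first replica on the node that writes the file, and the other two on different nodes of another rack. Real HDFS chooses the remote nodes at random; the deterministic choice here, the node and rack names, and the region file names are all assumptions made for illustration.

```python
# Hypothetical cluster: two racks with three slave nodes each.
RACKS = {
    "rack1": ["node1", "node2", "node3"],
    "rack2": ["node4", "node5", "node6"],
}


def rack_of(node):
    """Return the name of the rack that contains the given node."""
    for rack, nodes in RACKS.items():
        if node in nodes:
            return rack
    raise ValueError(f"unknown node: {node}")


def place_replicas(writer_node):
    """Place three replicas: one on the writer's node and two on distinct
    nodes of the other rack (simplified, deterministic choice)."""
    local_rack = rack_of(writer_node)
    remote_rack = next(rack for rack in RACKS if rack != local_rack)
    second, third = RACKS[remote_rack][:2]
    return [writer_node, second, third]


# Hypothetical region files and the node whose region server writes each one.
region_files = {"region1/info": "node1", "region2/info": "node2", "region3/info": "node5"}
placement = {name: place_replicas(writer) for name, writer in region_files.items()}
for name, nodes in placement.items():
    print(name, nodes)
```

With this layout every region file survives the loss of a whole rack, which is the property the diagram should make visible.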
6. Identify two example queries and analyze how they benefit from the HBase design above in comparison with the relational model. One query should be a search query (like the example below) and the other must be an aggregate query (using sum, avg, etc.).
An example search query is “find the year of a specific movie”.
To show whether a query benefits from HBase, explain which part of the data is retrieved, with reference to your answers to Parts 4 and 5 above, and how the final answer is calculated given that the data is distributed. Then compare this with how the query is processed in the relational model from Part 2. The analysis of the relational database also depends on how many records a disk block can store. Assume that the relational database is centrally stored. The comparison should consider measures such as disk reading time, calculation time, and data transportation time/cost, as well as any other measures you consider meaningful.
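The comparison in Part 6 can be supported with a back-of-the-envelope cost model such as the sketch below. It assumes the relational query scans the whole table block by block on one machine, while the HBase regions are scanned in parallel (for example by a MapReduce job) and each region sends back a partial result. The function names, the records-per-block value, and the timing figures are illustrative assumptions, not measurements.

```python
import math


def blocks_to_scan(n_records, records_per_block):
    """Number of disk blocks needed to hold n_records."""
    return math.ceil(n_records / records_per_block)


def relational_scan_time(n_records, records_per_block, block_read_ms):
    """Central store: every block of the table is read sequentially."""
    return blocks_to_scan(n_records, records_per_block) * block_read_ms


def hbase_parallel_scan_time(region_sizes, records_per_block, block_read_ms,
                             transfer_ms_per_region):
    """Distributed store: regions are read in parallel, so the elapsed read time
    is that of the slowest region, plus the cost of shipping one partial
    result per region to the client."""
    slowest = max(blocks_to_scan(n, records_per_block) for n in region_sizes)
    return slowest * block_read_ms + transfer_ms_per_region * len(region_sizes)


# Illustrative figures only: 10 records, 2 records per block, 10 ms per block
# read, 1 ms to transfer each partial result.
relational_ms = relational_scan_time(10, 2, 10)
hbase_ms = hbase_parallel_scan_time([4, 3, 3], 2, 10, 1)
print(relational_ms, hbase_ms)
```

For a search by row key the picture differs: only the one region that contains the key needs to be read, so the parallel-scan term does not apply and the transfer cost covers a single row.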
You may use tables and diagrams to make the presentation more readable.