1. Assume you have 3 documents with the following terms:
   · D1 = “computer”, “web”, “storage”, “options”
   · D2 = “computer”, “game”, “development”
   · D3 = “web”, “development”, “frameworks”
   If the query Q is composed of the terms “computer” and “development”, what is the relevance of each document to the query using the TF.IDF measure?

2. Explain and write the pseudocode for a Mapper/Reducer that takes as input a large file (possibly split into chunks) of integers and outputs:
   a. The sum of the squares of the integers
   b. The maximum integer

3. Assume you have a database with two relations (i.e. tables): customers and accounts.
   The schema for customers is composed of the following attributes:
   · customerID (integer)
   · name (string)
   · address (string)
   · phone (string)
   The schema for accounts is composed of the following attributes:
   · customerID (integer)
   · accountNumber (integer)
   · balance (float)
   Write the following queries using SQL:
   A. Find the name of the customer with account number 12345.
   B. Find the names of all customers who have at least one account with balance > $100,000.
   Write the following queries using Relational Algebra:
   C. Find all account numbers with balance < $20,000.
   D. Calculate the sum of the balances of all accounts for the customer with ID = 432.

4. Compute the Jaccard similarities of each pair of the following three sets: {1, 2, 3, 4}, {2, 3, 5, 7}, and {2, 4, 6}.

5. List the first ten 3-shingles in the following sentence:
   "In this section, we introduce the simplest and most common approach, shingling, as well as an interesting variation."

6. Fill in the signature matrix using the following input matrix of shingles of documents and the given permutations:

   Input Matrix
   Document 1   Document 2   Document 3
   1            1            0
   0            1            1
   1            0            1
   1            1            0

   Permutations:
   π1 = (2,4,3,1)
   π2 = (1,2,4,3)
   π3 = (4,3,2,1)

   Signature Matrix
   Document 1   Document 2   Document 3

7. Using your answers from problem 6, compute the column/column and signature/signature similarities of the following document pairs:
   a) 1-2
   b) 1-3
   c) 2-3

8. Perform a hierarchical clustering of the one-dimensional set of points 1, 4, 9, 16, 25, 36, 49, 64, 81, assuming clusters are represented by their centroid (average), and at each step the clusters with the closest centroids are merged. (Show each step.)

9. Use the k-means algorithm and Euclidean distance to cluster the following 5 examples into 2 clusters: p1=(2,10), p2=(2,5), p3=(8,4), p4=(5,8), p5=(7,4).

Grocery Data

10. Given the following itemsets:
    {A,B,C} {B,C} {A,B,C,D} {A,C,D} {C,D} {B,C} {B,D}
    a) What is the support of {A}, {A,C}, {B}, {B,C}, {A,C}, {A,C,B}, {B,C,D}, {A,B,C,D}?
    b) What is the confidence of {A}->C, {B}->C, {A,C}->B, {B,C}->D?
    c) What is the interest of {A}->C, {B}->C, {A,C}->B, {B,C}->D?

11. Apply the A-Priori Algorithm with support threshold 3 to the itemsets from problem 10.

12. If we use a triangular matrix to count pairs, and n, the number of items, is 20, which pair’s count is in a[100]?

13. Assume you are given the following collection of twelve baskets, each containing three of the six items 1 through 6:
    {1, 2, 3} {2, 3, 4} {3, 4, 5} {4, 5, 6}
    {1, 3, 5} {2, 4, 6} {1, 3, 4} {2, 4, 5}
    {3, 5, 6} {1, 2, 4} {2, 3, 5} {3, 4, 6}
    Suppose the support threshold is 4. On the first pass of the PCY Algorithm we use a hash table with 11 buckets, and the set {i, j} is hashed to bucket i×j mod 11.
    a) By any method, compute the support for each item and each pair of items.
    b) Which pairs hash to which buckets?
    c) Which buckets are frequent?
    d) Which pairs are counted on the second pass of the PCY Algorithm?

14. For this question, you will need to use the Orange Canvas data mining software. You can find it at: http://orange.biolab.si/
    Be sure to read the documentation, especially the following: http://docs.orange.biolab.si/widgets/rst/associate/associationrules.html#association-rules
    Then download the file grocery-data.txt (from BlackBoard) and use it with the Orange Canvas software to answer the following questions:
    a) Which rule h…
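
As a reference for the Jaccard similarity named in problem 4, the following is a minimal Python sketch rather than a prescribed solution method. It assumes the usual definition, J(A, B) = |A ∩ B| / |A ∪ B|. The function name `jaccard`, the labels S1 to S3, and the choice to treat two empty sets as identical are illustrative assumptions and are not part of the assignment.

```python
from fractions import Fraction
from itertools import combinations


def jaccard(a: set, b: set) -> Fraction:
    """Return |a ∩ b| / |a ∪ b| as an exact fraction."""
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as identical.
        return Fraction(1)
    return Fraction(len(a & b), len(union))


# The three sets from problem 4; the labels S1..S3 are hypothetical names.
sets = {
    "S1": {1, 2, 3, 4},
    "S2": {2, 3, 5, 7},
    "S3": {2, 4, 6},
}

# Compare every unordered pair of sets exactly once.
similarities = {
    (x, y): jaccard(sets[x], sets[y]) for x, y in combinations(sets, 2)
}

for (x, y), sim in similarities.items():
    print(f"J({x}, {y}) = {sim}")
```

Using `Fraction` keeps the results exact, which makes them easier to check against a hand calculation than floating-point values.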