In Python, compute the following in the code structure provided below.
(A) - Document Term Matrix. Define a function called compute_dtm as follows: Take a list of documents (docs) as a parameter. Tokenize each document into lower-cased words, stripping any leading and trailing punctuation. Let words denote the list of unique words across all documents in docs. Compute dtm, a 2-dimensional array built from the documents as follows: each row i represents a document; each column j represents a unique word in words; each cell (i, j) is the count of word j in document i, or 0 if word j does not appear in document i. Return dtm and words.
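One way to read the steps in (A) is sketched here. Splitting on whitespace, dropping tokens that consist only of punctuation, and sorting the vocabulary are assumptions made for this sketch; the task itself only asks for lower-cased, punctuation-stripped, unique words. The sample docs are made up for illustration.

```python
import string

import numpy as np


def compute_dtm(docs):
    """Return (dtm, words) for a list of document strings."""
    # Lower-case, split on whitespace (assumption), strip punctuation from
    # both ends of each token, and drop tokens that end up empty.
    tokenized = [
        [tok for tok in (w.strip(string.punctuation) for w in doc.lower().split()) if tok]
        for doc in docs
    ]
    # Sorting is an assumption: it only makes the column order deterministic.
    words = sorted({tok for doc in tokenized for tok in doc})
    index = {w: j for j, w in enumerate(words)}

    # Rows are documents, columns are words; cells default to 0.
    dtm = np.zeros((len(docs), len(words)), dtype=int)
    for i, doc in enumerate(tokenized):
        for tok in doc:
            dtm[i, index[tok]] += 1
    return dtm, words


# Hypothetical sample documents, not part of the assignment.
docs = ["Hello, world!", "Hello again; hello world."]
dtm, words = compute_dtm(docs)
print(words)
print(dtm)
```

Internal punctuation (for example the apostrophe in "don't") is left untouched, since the task mentions only leading and trailing punctuation.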
(B) - Performance Analysis. Suppose your machine learning model returns a one-dimensional array of probabilities as its output.
1. Write a function "performance_analysis" (named evaluate_performance in the solution structure below) that does the following: Take three input parameters: a probability array, a ground-truth label array, and a threshold th. If a probability > th, the prediction is positive; otherwise, it is negative. Compare the predictions with the ground-truth labels to calculate the confusion matrix as shown in the figure, where: True Positives (TP): the number of correct positive predictions. False Positives (FP): the number of positive predictions that are actually negatives. True Negatives (TN): the number of correct negative predictions. False Negatives (FN): the number of negative predictions that are actually positives. Calculate precision as TP/(TP+FP) and recall as TP/(TP+FN). Return the confusion matrix, precision, and recall.
2. Call this function with th set to 0.5 and print out the confusion matrix, precision, and recall.
3. Call this function with th varying from 0.05 to 1 in increments of 0.05. Plot a line chart of precision and recall against th, and observe how they change as th varies.
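The thresholding and counting described in (B) could be sketched as follows. Several things here are assumptions: the figure with the confusion-matrix layout is not available, so rows are taken as actual (negative, positive) and columns as predicted (negative, positive); precision or recall is returned as NaN when its denominator is zero; and the prob and truth arrays are invented sample data. The helper plot_precision_recall is a hypothetical name.

```python
import numpy as np


def performance_analysis(prob, truth, th):
    """Return (confusion matrix, precision, recall) at threshold th."""
    prob = np.asarray(prob, dtype=float)
    truth = np.asarray(truth).astype(bool)
    pred = prob > th  # strictly greater than th means positive

    tp = int(np.sum(pred & truth))
    fp = int(np.sum(pred & ~truth))
    tn = int(np.sum(~pred & ~truth))
    fn = int(np.sum(~pred & truth))

    # Assumed layout: rows = actual (neg, pos), columns = predicted (neg, pos).
    conf = np.array([[tn, fp], [fn, tp]])
    # Assumption: report NaN when there is nothing to divide by.
    prec = tp / (tp + fp) if (tp + fp) else float("nan")
    rec = tp / (tp + fn) if (tp + fn) else float("nan")
    return conf, prec, rec


def plot_precision_recall(ths, precs, recs):
    """Hypothetical helper: line chart of precision and recall against th."""
    from matplotlib import pyplot as plt  # imported lazily; needs matplotlib

    plt.plot(ths, precs, label="precision")
    plt.plot(ths, recs, label="recall")
    plt.xlabel("th")
    plt.legend()
    plt.show()


# Invented sample data, not from the assignment.
prob = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1]
truth = [1, 1, 1, 0, 0, 0]

conf, prec, rec = performance_analysis(prob, truth, 0.5)
print(conf, prec, rec)

# 20 thresholds: 0.05, 0.10, ..., 1.00 (linspace avoids float drift in arange).
ths = np.linspace(0.05, 1.0, 20)
results = [performance_analysis(prob, truth, th) for th in ths]
precs = [r[1] for r in results]
recs = [r[2] for r in results]
```

Calling plot_precision_recall(ths, precs, recs) draws the chart. Raising th can only shrink the set of positive predictions, so recall never increases along the sweep; precision usually tends upward but is not guaranteed to.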
(C) Define a class called DTM as follows: A list of documents (docs) is passed to initialize a DTM object. The __init__ function creates two attributes: an attribute called words, which saves the list of unique words in the documents, and an attribute called dtm, which saves the document-term matrix returned by calling the function defined in Q1 (compute_dtm in part (A)). This class contains two methods: max_word_freq(), which returns the word with the maximum total count in the entire corpus, and max_word_df(), which returns the word with the largest document frequency, i.e., the word that appears in the most documents.
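A minimal sketch of the class in (C) might look like this. A compact compute_dtm is included only so the block runs on its own; its whitespace tokenization and sorted vocabulary are assumptions. Ties are broken by taking the first maximal column, which is also an assumption, since the task does not say how to handle them. The sample docs are invented.

```python
import string

import numpy as np


def compute_dtm(docs):
    """Compact stand-in for the part (A) function (assumed behaviour)."""
    tokenized = [
        [t for t in (w.strip(string.punctuation) for w in d.lower().split()) if t]
        for d in docs
    ]
    words = sorted({t for doc in tokenized for t in doc})
    index = {w: j for j, w in enumerate(words)}
    dtm = np.zeros((len(docs), len(words)), dtype=int)
    for i, doc in enumerate(tokenized):
        for t in doc:
            dtm[i, index[t]] += 1
    return dtm, words


class DTM:
    def __init__(self, docs):
        # Both attributes come straight from compute_dtm.
        self.dtm, self.words = compute_dtm(docs)

    def max_word_freq(self):
        # Column sums give each word's total count over the corpus.
        totals = self.dtm.sum(axis=0)
        return self.words[int(np.argmax(totals))]

    def max_word_df(self):
        # A word's document frequency is the number of rows with count > 0.
        df = (self.dtm > 0).sum(axis=0)
        return self.words[int(np.argmax(df))]


# Invented sample corpus, not from the assignment.
docs = ["apple apple apple apple banana", "banana cherry", "banana date"]
corpus = DTM(docs)
print(corpus.max_word_freq())
print(corpus.max_word_df())
```

The two methods can disagree, as they do here: a word repeated many times in one document has a high total count but a document frequency of 1. With an empty vocabulary np.argmax raises ValueError, so a fuller version would need to decide what to return in that case.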
# Structure of your solution:
import numpy as np
import pandas as pd
import string
from matplotlib import pyplot as plt
def compute_dtm(docs):
    dtm, words = None, None
    # add your code here
    return dtm, words
def evaluate_performance(prob, truth, th):
    conf, prec, rec = None, None, None
    # add your code here
    return conf, prec, rec