In this program you will write a spam predictor based on maximum likelihood. All input comes through the console in the format shown below. The variables are defined afterwards, and examples follow after that.

    n_spam
    spam_word1 spam_word2 ...
    n_ham
    ham_word1 ham_word2 ...
    midlow midhigh
    n_msg
    message_word1 message_word2 ...

n_spam is an integer giving the number of words in the spam word list. Words can be repeated, and each repetition counts as a separate occurrence in a spam message. Following n_spam are n_spam spam words.

n_ham is an integer giving the number of words in the ham word list. Words can be repeated, and each repetition counts as a separate occurrence in a ham message. Following n_ham are n_ham ham words. Some of these words may also appear in the spam list, and some may not.

midlow is a floating point number. midhigh is a floating point number. Any word whose probability of being spam falls in the region between midlow and midhigh is ignored.

n_msg is an integer giving the number of words in the message that you will classify as spam or not. Following n_msg is the list of words in the message to be classified. If a word was listed as ham or spam above, include it in the computation as long as its probability does not fall in the midlow-midhigh region. If a word does not appear in the corpus, it is ignored.

Use Laplacian smoothing to compute the probability of a word being spam. In the walkthrough below this works out to P(word) = (spam occurrences + 1) / (spam occurrences + ham occurrences + 2).

Now a walkthrough of Example #2 below:

There are two spam words in the corpus: spam and yes.
There are two ham words in the corpus: spam and not.

Let's compute the probabilities of these words now:

    P(spam) = (1 + 1) / (2 + 2) = 2/4 = 0.5
    P(yes)  = (1 + 1) / (1 + 2) = 2/3 = 0.666667
    P(spam) = 0.5
    P(not)  = (0 + 1) / (1 + 2) = 1/3 = 0.333333

Now, spam is tossed out because it has a probability between .45 and .55.
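The probability step and the midlow-midhigh filter described above can be sketched in Python roughly as follows. The function names word_probabilities and drop_mid_band are hypothetical, and treating the endpoints of the midlow-midhigh region as part of the ignored region is an assumption, since the text does not say how the boundaries are handled.

```python
from collections import Counter


def word_probabilities(spam_words, ham_words):
    """Laplace-smoothed P(word is spam) for every word in the corpus."""
    spam_counts = Counter(spam_words)
    ham_counts = Counter(ham_words)
    probs = {}
    for word in set(spam_counts) | set(ham_counts):
        s = spam_counts[word]  # occurrences in the spam list (0 if absent)
        h = ham_counts[word]   # occurrences in the ham list (0 if absent)
        # Formula read off the walkthrough: (s + 1) / (s + h + 2).
        probs[word] = (s + 1) / (s + h + 2)
    return probs


def drop_mid_band(probs, midlow, midhigh):
    """Keep only words whose probability lies outside [midlow, midhigh].

    Assumption: the endpoints themselves count as inside the ignored region.
    """
    return {w: p for w, p in probs.items() if not (midlow <= p <= midhigh)}


# Corpus from Example #2.
probs = word_probabilities(["spam", "yes"], ["spam", "not"])
kept = drop_mid_band(probs, 0.45, 0.55)
for word in sorted(kept):
    print(word, format(kept[word], "g"))
```

With the Example #2 corpus, only yes and not survive the filter, matching the walkthrough.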
The other two words (yes and not) are the only ones to classify messages in this system.

This example has a message with 4 words: yes spam not boo. spam and boo are no longer in our list of words with probabilities, so they are tossed.

    alpha = P(yes) * P(not) = 0.666667 * 0.333333 = 0.222222
    beta = (1-P(yes)) * (1-P(not)) = 0.333333 * 0.666667 = 0.222222

So the final result is:

    alpha / (alpha + beta) = 0.222222 / (0.222222 + 0.222222) = 0.5

It should not be surprising that this system cannot classify this message conclusively.

Some examples:

--------------------
| EXAMPLE INPUT #1 |
--------------------
10
spam urgent important sweepstakes free chance win important free not
10
ham not spam please how help close correct correct free
.45 .55
20
this is the message to classify as spam or ham do you think your system classify this correct or not

--------------------
| EXAMPLE OUTPUT #1 |
--------------------
P(ham)=0.333333 alpha=0.333333 beta=0.666667
P(correct)=0.25 alpha=0.0833333 beta=0.5
Result= 0.142857

--------------------
| EXAMPLE INPUT #2 |
--------------------
2
spam yes
2
spam not
.45 .55
4
yes spam not boo

--------------------
| EXAMPLE OUTPUT #2 |
--------------------
P(yes)=0.666667 alpha=0.666667 beta=0.333333
P(not)=0.333333 alpha=0.222222 beta=0.222222
Result= 0.5
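As a rough illustration of the alpha/beta combination and the output lines in the examples, here is a minimal Python sketch. The function name classify and the kept_probs dictionary are hypothetical. The sketch assumes the word probabilities have already been computed and filtered, that words are processed in message order with the running alpha and beta printed after each usable word, and that every occurrence of a usable word contributes (the examples do not cover a repeated usable word).

```python
def classify(message_words, kept_probs):
    """Combine per-word spam probabilities into a final score.

    kept_probs maps each usable word to P(word is spam); words that are
    missing from it (not in the corpus, or filtered out) are skipped.
    Returns the output lines and the numeric result.
    """
    alpha = 1.0  # running product of P(word)
    beta = 1.0   # running product of 1 - P(word)
    lines = []
    for word in message_words:
        if word not in kept_probs:
            continue
        p = kept_probs[word]
        alpha *= p
        beta *= 1.0 - p
        # "g" formatting gives 6 significant digits, as in the examples.
        lines.append(f"P({word})={p:g} alpha={alpha:g} beta={beta:g}")
    # Assumption: with no usable words, alpha = beta = 1 and the result is 0.5.
    result = alpha / (alpha + beta)
    lines.append(f"Result= {result:g}")
    return lines, result


# Usable words and message from Example #2.
kept_probs = {"yes": 2 / 3, "not": 1 / 3}
lines, result = classify("yes spam not boo".split(), kept_probs)
print("\n".join(lines))
```

Run on the Example #2 message, this prints the same three lines as EXAMPLE OUTPUT #2.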