CS 478 Homework & Programming Assignments
All written reports are to be done on a word processor. Each assignment is limited to no more than 5 single-sided, single-spaced pages using 12 point font, including all figures. Figures may not be hand-drawn; they should be large enough to be easily legible. Anything beyond the 5-page limit will not be graded! You must work to include all important information in the allotted 5 pages. Communicating clearly and concisely what you have to say is an important skill you will use throughout your career.
Programming and homework assignments will be graded as follows:
- Completeness (i.e., implementing the full specification): 50%
- Technical accuracy: 20%
- Analytical thinking (i.e., design decisions, answers to questions, etc.): 20%
- Overall quality of the report (i.e., presentation, style, organization, etc.): 10%
There is clearly an element of subjectivity to all grading. This will be true here too, especially when it comes to your own analysis and answers to some of the more open-ended questions associated with the assignments. Your grade, in these instances, will therefore reflect perceived effort and understanding. Good writing, grammar, punctuation, etc. are extremely important because of the effect they have on the impact of your work.
HWK1: Thought Questions
A. Briefly go over the following list of Success Stories in Data/Text Mining to get a feel for where and how Machine Learning and Data Mining are used. Thoughtfully answer the following questions.
- For each of the following applications, decide whether ML/DM would offer a viable solution and briefly justify your decision:
- Predicting whether a particular chemical compound is carcinogenic (i.e., will induce cancer) or not.
- Determining which african-american applicants should be extended a home loan.
- Discovering what grocery items Walmart customers tend to buy together.
- Grouping Wells Fargo customers by socio-demographic attributes and banking habits.
- Predicting tomorrow's value of Microsoft's stock.
- Predicting whether a Netflix subscriber will rent a particular new release.
- Sorting a list of number in ascending order.
- Identifying terrorists.
- Briefly outline a possible application of ML/DM to some aspect of your life. This need not have to do with you directly; you may think of companies you/your friends/your relatives work for, schools you've attended, businesses you come in contact with regularly, etc.
B. Consider the following simple dataset.
All attributes are binary. T is the target attribute. Use ID3 to induce a decision tree from this dataset. Show all of your calculations, intermediate and final trees.
Download and install the latest version (Stable Book 3rd Ed.) of Weka on your computer.
- Using the Cardiology dataset, do the following:
- Build a neural network (using the Multilayer Perceptron/Backpropagation algorithm) that predicts whether a patient has a heart condition. Record the 10-fold cross-validation accuracy of your model as A1. Cross-validation is a form of model validation where a dataset is split into folds, and the learning algorithm is trained on all but one fold and tested on the remaining fold; the process is repeated for each fold and accuracy is averaged over all folds.
- Create a new attribute coarseBloodPressure, with values: 1 if blood pressure is less than or equal to 120, 2 if blood pressure is greater than 120 but less than or equal to 150, and 3 if blood pressure is greater than 150.
- Build a neural network (using the same algorithm as above) that predicts whether a patient has a heart condition, using the new attribute coarseBloodPressure instead of the original blood pressure. Record the 10-fold cross-validation accuracy of your model as A2.
- Compare A1 and A2. Any comments?
- Remove all records whose value of the attribute resting ecg is Abnormal.
- Construct a decision tree that predicts whether a patient has a heart condition, given the attributes age, sex, chest pain type, coarseBloodPressure, angina, peak, and slope. Insert the confusion matrix obtained with 10-fold cross-validation in your report.
- Using the CPU dataset, do the following:
- Cluster the data (using the simple k-Means algorithm, with k=3) and report on the nature and composition of the extracted clusters.
- Discretize the attributes MMIN, MMAX, CACH, CHMIN and CHMAX using 3 buckets in one step. Use the binning by frequency approach. Find associations among these attributes only (i.e., remove the other ones), using the Apriori algorithm, with support 0.1, confidence 0.95 and top 4 rules only being displayed. Insert the results in your report.
- Using the original CPU dataset, list the eigenvalues associated with the attributes selected by the Principal Components Analysis method, when the amount of variance you wish covered by the subset of attributes is 75%.
HWK3: Incremental Learning
Consider the following simple dataset. The target attribute is PlayTennis, which has values Yes or No for different Saturday mornings, depending on several other attributes of those mornings.
A. k-Nearest Neighbors.
- Design a simple distance metric for this space. Briefly justify your choice.
- Perform 7-fold cross-validation with k=3 for this dataset. Show your work (there should be 7 iterations with 7 corresponding predictions/accuracies on the held-out folds, and a final result).
B. Naive Bayes.
- Using the full dataset, induce the corresponding NB model. Show your result in the form of probability tables as we did in class.
- What would your model predict for the following two Saturday mornings: <Oct 1, Overcast, Cool, High, Weak>, and <May 26, Sunny, Hot, Normal, Strong>? Show your work.
PROG1: Decision Tree Learning
In this assignment, you are to implement the ID3 decision tree learning algorithm. We have prepared a basic ToolKit for you that you may find useful. Of course, you are welcome to implement everything from scratch, but remember that some of the functionality you will be implementing here will also be useful when implementing Backpropagation (see below). In either case, you should look at the ToolKit description so you may also become familiar with the ARFF file format, as well as cross-validation. We will ignore reduced error pruning here.
In order to get full credit, you must hand-in a detailed report of your work, carefully covering all of the following:
- Correctly implement the ID3 decision learning tree algorithm. (Note: you may wish to use a simple data set, like Lenses, that you can check by hand, to test your algorithm and make sure that it is working correctly; you should be able to get about 68% predictive accuracy on lenses with cross-validation). Your implementation must support the following.
- An option for choosing the splitting criterion: information gain or accuracy.
- Some mechanism to handle continuous-valued attributes.
- Some mechanism to handle unknown (or missing) attribute values.
- Use your ID3 algorithm on the Iris problem.
- Induce a decision tree using the entire dataset with information gain as the splitting criterion. Give a visual representation of the tree.
- Induce a decision tree using the entire dataset with accuracy as the splitting criterion. Give a visual representation of the tree. Compare this tree with the one obtained with information gain as the splitting criterion.
- Evaluate predictive accuracy using 10-fold cross-validation for information gain and accuracy. Compare the results.
- Repeat the experiment with the Voting problem.
- Describe and justify the method you used to handle missing values.
- Extend your algorithm so that, when accuracy is the splitting criterion, it may use up to 2 conditions in the tests at each node (e.g., attrX = Vx and attrY = Vy). You may choose to make that an user-specified option.
- Induce a decision tree using the entire dataset with this extended algorithm for both the Iris problem and the Voting problem. Give a visual representation of the trees and compare them with those obtained above.
- Explain why it may be necessary to thus extend the decision tree learning algorithm when using accuracy as the splitting criterion (and why the extension is of little value when information gain is the splitting criterion).
HWK4: Neural Nets
A. Consider the following simple dataset.
T is the (binary) target attribute. Consider a 2-layer feedforward neural network with two input units (one for A and one for B), a single hidden unit, and one output unit (for T). Initialize all weights (there should be 3 of them) to 0.1. Assume a learning rate of 0.3. Using incremental weight updates, show the values of the weights after each of the first three training iterations. Show your results in the form of a table as we did in class.
B. Assume that the units of a neural network are modified so they compute the squashing function tanh (instead of the sigmoid function). What is the resulting backpropagation weight update rule for the output layer? (Note, tanh’(x) = 1 – tanh2(x)).
HWK5: Data Issues
Thoughtfully answer the following questions.
- Using your knowledge of how ID3 selects its root node, describe how ID3 could be used to design a simple attribute/feature selection mechanism.
- Suppose that a potential large customer C has placed your company BetterSoft in competition with another company GoodSoft to test your relative abilities to develop good software. The task is to design an algorithm to solve a class P of problems. Your company produces algorithm A, whilst the other company produces algorithm B. Both you and your competitor are asked to run your own batch of 350 tests and report how often your algorithm gave acceptable solutions (as defined by C). GoodSoft comes out on top with a score of 83% against only 78% for your algorithm. Just as C is about to award its lucrative contract to GoodSoft, you realize that the problems in class P are not all of the same complexity. In fact, it turns out that there are two clear levels of difficulty: simple and complex. You ask C to collect more detailed data from GoodSoft and yourself. The results, when complexity is factored in, are as follows.
Simple Complex Alg A Alg B Alg A Alg B 81 out of 87 234 out of 270 192 out of 263 55 out of 80
- May this additional information change C's decision as to which company to hire? If so,how?
- A variable like complexity above is known as a confounding (or lurking) variable, because it interacts with the calculated outcome in a way that may easily be overlooked, but may have an adverse effect on the conclusions reached. Briefly describe another situation based on the real world where confounding effects would be at play.
- Using Weka and the Waveform dataset, experiment with CFS (known as CfsSubsetEval in Weka) and PCA (known as PrincipalComponents in Weka). Compare the results obtained with each method. What is the main difference between these two methods?
HWK6: Classification Model Evaluation
Assume that two individuals offer to sell you their predictive models M1 and M2. The confusion matrices produced by each model are as follows.
|Predicted True||Predicted False|
Performance of M1
|Predicted True||Predicted False|
Performance of M2
- What is the accuracy of each model?
- Assuming that precision is of paramount importance in your application, which of the two models would you buy? Why?
- Assuming that the cost of labeling as True something that is actually False far exceeds the cost of labeling as False something that is actually True, which of the two models would you buy? Why?
HWK7: Model Combination
Complete the following:
- Show that stacking N (base) algorithms with a majority (meta) learner (where the stacking considers only the predictions of the base learners) does not produce the same result as an ensemble of the same N (base) algorithms combined by majority voting. You need not write a formal proof; make your argument in prose (with diagrams as needed).
- Using Weka and a couple of datasets of your choice, compare the 10-fold cross-validation accuracies of the following algorithms:
- Decision tree (use Weka's J48)
- k-NN (use Weka's IBk algorithm and set k=3)
- Naive Bayes
- Backpropagation learning (Use Weka's MultilayerPerceptron)
- An ensemble of the above combined with majority vote (Use Weka's Vote algorithm under "meta"; make sure to set it up for majority voting)
- A stacking of the above with each one of them used in turn as the meta learner (you will get 4 different accuracies here)
- Briefly discuss your findings
PROG2: Backpropagation Learning
In this assignment, you are to implement the Backpropagation algorithm. Hopefully, this will be made easier by some of the work you did for PROG1 above.
In order to get full credit, you must hand-in a detailed report of your work, carefully covering all of the following:
- Correctly implement the Backpropagation algorithm. We will assume discrete classification. As a result, you should make sure that your output layer as one node per class and the predicted label is that of the node with the highest activation. You may also need to reformat the input files to reflect this design decision. Your implementation must support the following:
- Ability to create arbitrary network structure (number of hidden layers, number of nodes per layer, etc.).
- Random weight initialization (small random weights with mean 0).
- Incremental weight update.
- A reasonable stopping criterion.
- Training set randomization at each epoch.
- An option to include a momentum term in training.
- Use your backpropagation algorithm on the Iris problem, with a random 70/30 split.
- With a single hidden layer and a fixed number of hidden nodes (of your choice), experiment with different learning rates. Graph training and test set accuracy over time for several different learning rates. Based on these results, select a reasonable learning rate.
- With the learning rate selected and a single hidden layer, experiment with different numbers of hidden nodes, starting from 1 and adding 1 each time until you see no improvement on the training set's accuracy. For each choice of number of hidden nodes, graph training and test set accuracy over time.
- Record your best number of hidden nodes (i.e., the one resulting in highest accuracy on the test set).
- Use your backpropagation algorithm on the Vowel problem, with a random 75/25 split. (Note: Make sure you ignore the "Train or Test" attribute).
- Repeat the above experiments.
- With the learning rate selected, induce a 2-hidden layer neural network with 6 hidden nodes in the first layer and 4 hidden nodes in the second. Graph training and test set accuracy over time.
- Using only the best number of hidden nodes as recorded above for 1 hidden layer and the same training/test splits, re-run your backpropagation algorithm with the momentum term option to induce a neural network for both Iris and Vowel. Graph training and test set accuracy over time.
- Analyze the data you have collected and thoughtfully answer the following questions:
- Discuss the effect of different learning rates on the algorithm's performance.
- Discuss the effect of different numbers of hidden units on the algorithm's performance (1-hidden layer case).
- Compare your recorded best numbers of hidden nodes for each problem with the following heuristic value: H=N/(10(I+O)), where N is the size of the (training) data set, I is the number of network inputs and O is the number of outputs.
- How did the momentum term affect the learner's behavior (number of epochs to convergence, final accuracy, etc.)?
Perform the following activities: (note that for your experience, I am encouraging you to do this homework using the R software, as it is one of the richest and most versatile freeware tool for statistical analysis. If you prefer, you may also perform the assigned activities in Weka. Needed algorithms and visualizations are under the Cluster tab.)
- Download and install the latest version of R on your computer. See the link on our syllabus under Resources/Software.
- Load the R Stats Package and the R Datasets Package by typing 'library(stats)' and 'library(datasets)' respectively at the R prompt. Alternatively, you may use the Package Installer and the Package Manager of the R GUI to load these packages. Details on the various functions implemented in each package, as well as examples of usage may be found in the Package Manager by selecting the package of interest.
- Run the k-means algorithm (kmeans) on the iris dataset (iris was loaded above when you loaded the R Datasets Package). Of course, iris has a target attribute. It must be excluded from the clustering. The simplest way to do this is to create a copy of iris consisting of only the first 4 attributes. This can be accomplished with the command: 'iris_copy <- subset(iris, select=c(1:4))'. You can then run kmeans on iris_copy.
- Run k-means for k=2,3,4,5,7,9,11.
- For each value of k, report the size of the clusters and the F-measure (see this for details). Both size and cluster assignments are available in variables computed during k-means (see the documentation in the R Stats Package under k-means). You will need the target values from the original iris dataset to compute the F-score. You may write a small program in R to do this or export the data and compute elsewhere.
- Report the value of k that produces the highest F-score.
- Comment on anything interesting about your experiment.
- Display and include in your report the result of hierarchical clustering. Use the function plot() to graph the dendrogram. You can copy the result to a postscript file using the following commands: 'postscript("nameoffile.eps")', then 'plot(resultofhac)', and finally 'dev.off()'.
- By looking at the display or using the values of clustering heights, select a threshold at which you feel the clustering would be optimal and justify your choice. (In principle, we would do this by computing some quality measure during the clustering process, but for simplicity, we are just eyeballing here).
- How does the corresponding number of clusters compare with that obtained with k-means above?
HWK9: Association Rule Mining
In this assignment, you will make use of existing algorithms to build a solution to a frequent association rule mining problem. You may implement your solution in R or Weka, and/or write your own code for all or parts of the solution.
Complete and report on the following:
- Use the Apriori algorithm on the Binarized Lenses problem.
- Set minconf = 0.8. Start with minsup = 0.9 and report your results (i.e., number of rules found and sample rules) for decreasing values of minsup using decrements of 0.05, down to 0.05 (i.e., this means you should run the algorithm 18 times).
- Summarize your findings.
- Run the original Lenses problem against ID3 in Weka (or your own implementation if you prefer). Compare the rules you obtained with Apriori with the tree (or rules) induced by ID3.
- Use your algorithm on the Mirror Symmetry problem.
- Run Apriori for various combinations of minsup and minconf values.
- Summarize your findings.
- This is an artificial problem. Each attribute represents a bit position in a string of 30 bits: Lmost, Lmost1, ..., Lmost14, Rmost14, Rmost13, ..., Rmost1, Rmost and the attribute Symm is 1 if the pattern is symmetric about its center, and 0 otherwise. Given this interpretation, do any of the rules discovered by your Apriori algorithm make sense?
HWK10: LR & SVM
- Assume that logistic regression applied to a set of supercomputers produces the following simple model: log odds(solution in less than 5 min) = -14.0 + 0.25 NumCPUs. Answer the following questions (all answers should be in terms of e, where e is the inverse function of log; simplify as needed)
- What are the odds that a 64 CPU supercomputer will find a solution in less than 5 min?
- How much better are the odds for a 80 CPU supercomputer (i.e., give the odds ratio)?
- What is the probability that a 46 CPU supercomputer will find a solution in less than 5 min?
- What size supercomputer would you need to have a probability 0.73 (=e/(1+e)) of finding a solution in less than 5 minutes?
- Consider the following transformation (from 3D to 10D space): f(x) = f(x1, x2, x3) = (1, √2x1, √2x2, √2x3, x12, x22, x32, √2x1x2, √2x1x3, √2x2x3)
- Show that K(x, y) = < f(x), f(y) > is a kernel function by reducing K(x, y) to (1 + < x, y >)2 (recall that < x, y > is the inner product, i.e., the sum of products of pairwise coordinates).
- Using the SMO implementation of support vector machines in Weka, with the polynomial kernel, verify whether the above transformation is useful in learning the sqrt dataset. You should make sure that the polynomial kernel's degree in SMO is set to 2 (the default is 1) and that the useLowerOrder parameter is set to True. Compare SMO with, for example, VotedPerceptron, which is a simple linear model learner. Show the values of accuracy for both and discuss your findings. You may want to visualize the data in Weka.
HWK11: MetalearningWrite a simple program to test the validity of the No Free Lunch. Your program should act on binary classification tasks over binary input spaces.
- Requirements for your program are as follows:
- The user should be able to select the number I of input features. (For testing purposes, I would recommend choosing 3, or 4 at the most).
- The user should be able to select the size M of the test set. You would then use N-M examples for training and M for testing (where N is 2^I).
- Your program should interface with Weka, at least call the command line version of Weka from your code to run a classifier on a dataset.
- The user should be able to select the name of the Weka classifier to use (remember that these are of the form 'class.name').
- Your program should then run the selected classifier on all tasks (i.e., training sets) and test against the corresponding test set, and return the overall generalization performance (i.e., the sum of 'accuracy-50' across all tasks).
- You may find this code (in python) useful. It has a number of functions to set up ARFF training and test sets for binary tasks.
- Run your program with a decision tree (J48), a back propagation learner (MultilayerPerceptron), Naive Bayes (NB), a majority learner (ZeroR). Record the generalization performance.
- Implement a simple method for a minority learner (i.e., one that always returns the least frequent class). Modify your code so that this learner can be used instead of Weka (you do not need to put things into ARFF files, but can do everything in RAM in this simple case).
- Run your program on your minority learner and record the generalization performance.
- Is the NFL verified in every case? How do you explain what happens with the majority and minority learners?
HWK12: ReadingRead all of the articles listed on our schedule for Monday 11/19, and do the following.
- Briefly summarize each article in a short paragraph. Use your own words; do not cut and paste the abstract/summary.
- For each article, write down a comment or question that it brought to your mind.
- For each article, identify at least one thing the authors could have done/discussed that would improve the quality of their work.
HWK13: ThanksgivingThoughtfully answer the following questions:
- Consider your life's experiences and list one thing you have learned by being told, one thing you have learned by analogy, and one thing you have learned by induction. In each case, give enough details to show how your learning took place. Give a few reasons why you are grateful for your ability to learn.
- Consider all of the articles listed on our schedule for Wednesday 11/14. Using these and your own experience, do you think that people's view of privacy has been changing over the years? If so, why and how, and do you feel the trend will continue? If not, why not?
- Think of the members of your group (for your group project). For each one, indicate their name and write one thing you are grateful for about them specifically.
HWK14: GP QuestionnaireFill out and submit this questionnaire about the group project.
evaluateClustererpublic static java.lang.String evaluateClusterer(Clusterer clusterer, java.lang.String options) throws java.lang.Exception
Evaluates a clusterer with the options given in an array of strings. It takes the string indicated by "-t" as training file, the string indicated by "-T" as test file. If the test file is missing, a stratified ten-fold cross-validation is performed (distribution clusterers only). Using "-x" you can change the number of folds to be used, and using "-s" the random seed. If the "-p" option is present it outputs the classification for each test instance. If you provide the name of an object file using "-l", a clusterer will be loaded from the given file. If you provide the name of an object file using "-d", the clusterer built from the training data will be saved to the given file.
- - machine learning clusterer
- - the array of string containing the options
- a string describing the results
- - if model could not be evaluated successfully