We have decided to use Takelab for all of our experiments. Because Takelab was designed to be run on short sentences, we must combine the summary and description of each bug report into a single block of text. We intend to compare all possible pairs of bug reports. Although other researchers did not compare all possible pairs, doing so is statistically sounder because we test our model on a representative set of data, whereas changing the percentage of duplicates in a dataset introduces other problems.
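The preprocessing and pairing described above could look roughly like the following. This is only a sketch: the field names `id`, `summary`, and `description`, the helper names, and the sample reports are all made up for illustration, and the call that would score each pair with Takelab is left out because its interface is not covered here.

```python
from itertools import combinations


def combine_fields(report):
    """Join a report's summary and description into one block of text."""
    # Assumption: each report is a dict with "summary" and "description" keys.
    parts = [report.get("summary", ""), report.get("description", "")]
    return " ".join(p.strip() for p in parts if p and p.strip())


def all_pairs(reports):
    """Yield every unordered pair of reports exactly once."""
    for a, b in combinations(reports, 2):
        yield a["id"], b["id"], combine_fields(a), combine_fields(b)


# Hypothetical sample data, not taken from any real bug tracker.
reports = [
    {"id": 1, "summary": "Crash on save", "description": "Editor closes when saving."},
    {"id": 2, "summary": "Save crashes editor", "description": ""},
    {"id": 3, "summary": "Typo in menu", "description": "File menu says 'Opne'."},
]

pairs = list(all_pairs(reports))
```

Note that n reports produce n(n-1)/2 pairs, so the number of comparisons grows quadratically with the size of the dataset.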
Another issue we are considering concerns stack traces. People often paste the stack trace into the bug report's description field. We have two options: leave the periods in the stack trace or remove them. We intend to test both options to see which produces better results, or whether it matters at all.
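One way we might produce the two variants is sketched below. It assumes Java-style stack frames (lines beginning with `at` and ending in a parenthesized source location); the regular expression, the function name `strip_trace_periods`, and the sample text are illustrative assumptions rather than a settled design. Only frame lines are touched, so periods in ordinary sentences, and in any exception header line, are left alone.

```python
import re

# Assumption: stack frames look like Java frames, e.g.
#   "    at org.example.Foo.bar(Foo.java:42)"
FRAME_RE = re.compile(r"^\s*at\s+[\w$.<>]+\(.*\)\s*$")


def strip_trace_periods(description, replacement=""):
    """Remove (or replace) periods on lines that look like stack frames."""
    out = []
    for line in description.splitlines():
        if FRAME_RE.match(line):
            line = line.replace(".", replacement)
        out.append(line)
    return "\n".join(out)


# Hypothetical description text containing one stack frame.
text = "App crashed. See trace.\n    at org.example.Foo.bar(Foo.java:42)"

with_periods = text                          # option 1: leave periods in
without_periods = strip_trace_periods(text)  # option 2: remove them
```

Passing `replacement=" "` instead would split each qualified name into separate tokens rather than fusing it into one, which may be a third variant worth testing.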
I have also summarized the features used in each of the papers that we read last week. We can use this summary to analyze which features were most commonly used and which worked best.
Since we must submit our paper soon, I have also updated the bibliography and converted it into a BibTeX file for LaTeX. I have also moved the paper into the MSR LaTeX template and added the references.