We have gathered several hundred thousand bug reports from Eclipse, OpenOffice, Firefox, and Android. The data is currently stored in both JSON and CSV files. We separated the data by source system into several sets of varying size. There are two files for each set: one with comments and one without. We have not come across any papers that use the comments for duplicate detection. We would like to add this field to our model, but more tests must be done first. We also have a small set of about 1000 Eclipse bug reports for testing methods before moving to the much larger sets.
We have started preprocessing the data. First, we created an additional field for each duplicate report that contains the identification number of its master bug report (the original report of the bug that the duplicate also describes). I am now writing a Python program to confirm that each master report is in the dataset. If the master report is not in the dataset, we intend to either remove the duplicate status from that report or remove the report from the set altogether. Removing the duplicate label is reasonable: when our program is applied to a system's full set of bug reports, every report should be included, so the master would presumably be present, or else the report would never have been marked as a duplicate. Changing the status therefore only removes duplicates whose masters are impossible to find.
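The master-report check described above could be sketched roughly as follows. This is only a minimal illustration, not our actual program: the field names "id", "status", and "master_id", the status values, and both function names are hypothetical placeholders for whatever the real files use.

```python
def split_duplicates(reports):
    """Sort duplicate reports by whether their master is in the dataset.

    `reports` is a list of dicts. The keys "id", "status", and "master_id"
    are assumed names, not the real schema.
    """
    known_ids = {r["id"] for r in reports}
    resolvable, orphaned = [], []
    for r in reports:
        if r.get("status") != "DUPLICATE":
            continue
        if r.get("master_id") in known_ids:
            resolvable.append(r["id"])
        else:
            orphaned.append(r["id"])
    return resolvable, orphaned


def relabel_orphans(reports, orphaned, drop=False):
    """Return a cleaned copy of `reports`.

    With drop=False, orphaned duplicates lose their duplicate status.
    With drop=True, they are removed from the set entirely.
    """
    orphan_ids = set(orphaned)
    if drop:
        return [r for r in reports if r["id"] not in orphan_ids]
    cleaned = []
    for r in reports:
        if r["id"] in orphan_ids:
            # Copy the record so the original list is left untouched.
            r = {**r, "status": "NOT_DUPLICATE", "master_id": None}
        cleaned.append(r)
    return cleaned


# Tiny made-up example: report 3 points at a master that is not present.
reports = [
    {"id": 1, "status": "NEW", "master_id": None},
    {"id": 2, "status": "DUPLICATE", "master_id": 1},
    {"id": 3, "status": "DUPLICATE", "master_id": 99},
]
resolvable, orphaned = split_duplicates(reports)
cleaned = relabel_orphans(reports, orphaned)
```

One point to keep in mind if reports are dropped rather than relabeled: removing a report can orphan other duplicates that pointed at it, so the check would need to be repeated until no orphans remain.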
We have continued to read the literature. Some researchers have grouped reports by keywords that link each report to a specific feature of the system. We looked into applying either LDA (Latent Dirichlet Allocation) or Labeled LDA. Labeled LDA provides slightly better results because the reports are labeled by hand, but LDA automates the process and takes about 60 times less time. The LDA output is another possible feature we would like to incorporate into our model.
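To make concrete what an LDA feature might look like, here is a small sketch that fits LDA by collapsed Gibbs sampling and returns one topic distribution per report. Collapsed Gibbs sampling is one standard way to fit LDA, not necessarily the method used in the papers we read. The function name, hyperparameter values, and toy reports are all illustrative assumptions, and on the full datasets an existing library implementation would be the more realistic choice.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA with collapsed Gibbs sampling.

    docs: list of token lists. Returns a list with one topic
    distribution (a list of num_topics probabilities) per document.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    doc_topic = [[0] * num_topics for _ in docs]         # tokens in doc d with topic k
    topic_word = [Counter() for _ in range(num_topics)]  # times word w has topic k
    topic_total = [0] * num_topics                       # tokens assigned to topic k
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed per-document topic proportions.
    return [
        [(count + alpha) / (len(doc) + alpha * num_topics) for count in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]


# Toy reports, already tokenized (made-up text, not from our data).
docs = [
    "editor crash on save file".split(),
    "crash when saving file in editor".split(),
    "toolbar icon missing after update".split(),
    "icon missing from toolbar".split(),
]
theta = lda_gibbs(docs, num_topics=2)
```

Each row of `theta` is a probability vector over topics, so the similarity between the rows of two reports could serve as one additional input to the duplicate detection model alongside the existing fields.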