I wrote more python code this week. At this point, my program will go through a dataset and find all duplicate bug reports. Once it finds a bug report it checks to see if the master of that bug is also in the data set. If the master is in the set, it adds the duplicate to our new data set. Otherwise, it does not include the report, since this is not logical to include unmastered duplicates in our research. It also adds all singleton bug reports that without duplicates. We tested it on our small eclipse data set. We actually got a surprising result. Out of the 1001 bug reports in the data set there were 84 duplicates. My program then determined that of these 84 duplicates, only 22 of them had master in the set! This shockingly low number may be due to the fact that we chose such a small chunk of data to test it on. We will have to look into this fact more later. I was also able to begin some of the data preprocessing via python. I am currently able to tokenize, remove stop words, normalize, stem, legitimize, and detect N-grams with ease. Creating a vector space model and trying out some similarity measurements should be finished by next week.
I also read an interesting initial report, that gave basic statistics on 9 bug repositories (some of which we plan on studying). It found:
- Percentage of duplicate reports
- Amount of time spent processing duplicate and non duplicates
- How long reporters look for duplicates before submitting a new report
- Total number of reports submitted daily
- How many duplicates are submitted by frequent posters
One surprising statistic said that over 90% of duplicate bug reports are actual submitted by sporadic users. It gave us a few ideas for what to look into.
No comments:
Post a Comment