Monday, December 2, 2013

Week 11/25/13

I hope everyone had a nice Thanksgiving break! I've been doing quite a bit of bookkeeping this week in between Black Friday and Cyber Monday shopping.

First, I submitted an abstract to the 2014 National Conference on Undergraduate Research. If it is accepted, we will be going to the University of Kentucky on April 3 to give an oral presentation. This will be the first opportunity to present this research, so I am very excited!

I also wrote a proposal for the Undergraduate Student Research Grant at my home university. I wrote it in LaTeX, so it is looking pretty sharp. Tomorrow I am going to try to get it approved by my advisors and the chair of the computer science and information systems department. We are trying to get funding for various research supplies. Wish me luck!

I am also working on a painfully detailed procedural list for all the papers we have read thus far. We should be able to work directly off this list to reproduce some of the previous work before fully implementing our new system.

I am making progress on preprocessing our data. I am following along with a tutorial to produce a vector space model that we can apply a similarity measure to. We will probably start off with cosine similarity, but if I feel ambitious I might implement Dice or Jaccard similarity as well. A rough sketch of what that comparison might look like is below.
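To keep myself honest about what that will involve, here is a minimal sketch of the idea. This is not our actual code: the example summaries are made up, and I am leaning on scikit-learn's TfidfVectorizer for the vector space model instead of whatever the tutorial builds by hand.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Hypothetical bug report summaries -- placeholders, not our real data.
    reports = [
        "crash when opening large project in editor",
        "editor crashes opening a very large project",
        "button label is misaligned on preferences page",
    ]

    # Build a TF-IDF vector space model over the reports.
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(reports)

    # Cosine similarity between every pair of reports.
    cos = cosine_similarity(tfidf)
    print(cos[0, 1])  # reports 0 and 1 should score high

    # Dice and Jaccard computed on token *sets*, for comparison.
    def jaccard(a, b):
        a, b = set(a.split()), set(b.split())
        return len(a & b) / float(len(a | b))

    def dice(a, b):
        a, b = set(a.split()), set(b.split())
        return 2.0 * len(a & b) / (len(a) + len(b))

    print(jaccard(reports[0], reports[1]))
    print(dice(reports[0], reports[1]))

The nice thing about starting with cosine on TF-IDF vectors is that the set-based measures only need a couple of extra lines if we want to compare results later.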

Have a nice day!

Week 11/11/13 and 11/18/13

I wrote more Python code this week. At this point, my program goes through a dataset and finds all duplicate bug reports. Once it finds a duplicate, it checks whether the master of that bug is also in the dataset. If the master is in the set, it adds the duplicate to our new dataset; otherwise, it leaves the report out, since it is not logical to include unmastered duplicates in our research. It also keeps all singleton bug reports that have no duplicates. A sketch of this filtering step is below.

We tested it on our small Eclipse dataset and got a surprising result. Out of the 1001 bug reports in the set, there were 84 duplicates, and my program determined that only 22 of those 84 had their master in the set! This shockingly low number may be due to the fact that we chose such a small chunk of data to test on. We will have to look into this more later.

I was also able to begin some of the data preprocessing in Python. I am currently able to tokenize, remove stop words, normalize, stem, lemmatize, and detect N-grams with ease (see the second sketch below). Creating a vector space model and trying out some similarity measurements should be finished by next week.
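For future reference, here is roughly what that filtering logic boils down to. This is a simplified sketch rather than the real script; the 'id' and 'master_id' fields and the list-of-dicts input are stand-ins I made up for illustration.

    # Keep singletons/masters, and duplicates whose master is present.
    # Assumes each report is a dict with an 'id' and a 'master_id'
    # (master_id is None for reports that are not duplicates).
    def filter_reports(reports):
        ids = set(r['id'] for r in reports)
        kept = []
        for r in reports:
            if r['master_id'] is None:
                # Singleton (or master) report: always keep it.
                kept.append(r)
            elif r['master_id'] in ids:
                # Duplicate whose master is in the set: keep it.
                kept.append(r)
            # Otherwise: unmastered duplicate, leave it out.
        return kept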
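And since I will forget the exact steps otherwise, here is the flavor of the preprocessing pipeline. This is a minimal sketch assuming NLTK, which is the obvious Python choice for these steps; our actual code may differ in the details.

    # Requires nltk.download('punkt'), 'stopwords', and 'wordnet'.
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer, WordNetLemmatizer
    from nltk.util import ngrams

    text = "The editor crashes when opening large projects."

    # Normalize (lowercase) and tokenize.
    tokens = nltk.word_tokenize(text.lower())

    # Remove stop words and punctuation.
    stops = set(stopwords.words('english'))
    tokens = [t for t in tokens if t.isalpha() and t not in stops]

    # Stem and lemmatize (in practice we would pick one or the other).
    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()
    stems = [stemmer.stem(t) for t in tokens]
    lemmas = [lemmatizer.lemmatize(t) for t in tokens]

    # Detect bigrams (N-grams with N = 2).
    bigrams = list(ngrams(tokens, 2))

    print(stems)
    print(lemmas)
    print(bigrams)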

I also read an interesting initial report that gave basic statistics on 9 bug repositories (some of which we plan on studying). It found:

  • Percentage of duplicate reports
  • Amount of time spent processing duplicates and non-duplicates
  • How long reporters look for duplicates before submitting a new report
  • Total number of reports submitted daily
  • How many duplicates are submitted by frequent posters 

One surprising statistic said that over 90% of duplicate bug reports are actually submitted by sporadic users. It gave us a few ideas for what to look into.