Saturday, October 26, 2013

Week 10/14/13 and 10/21/13


We have gathered several hundred thousand bug reports from Eclipse, OpenOffice, Firefox, and Android. The data is currently stored in both JSON and CSV files, separated into sets of varying size by the system the reports come from. There are two files for each set: one with comments and one without. We have not come across any papers that included the comments in duplicate detection, so we would like to add this new field to our model, though more testing must be done first. We also have a small set of about 1,000 Eclipse bug reports for testing methods before moving to the much larger sets.

We have started preprocessing the data. First, we created an additional field for each duplicate report containing the identification number of its master bug report (the original report describing the same bug). I am now writing a Python program to confirm that each master report is actually present in the dataset. If a master report is missing, we intend to either remove the duplicate status from that report or remove the report from the set altogether. Removing the duplicate label is the more logical choice: when our program is applied to a complete system of bug reports, all reports should be included, so the master report would presumably be in the data (otherwise the report would never have been marked as a duplicate). Changing the status therefore only removes duplicates whose masters are impossible to find.
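
As a rough illustration, here is a sketch of the kind of check the Python program performs. The field names (`bug_id`, `status`, `master_id`) and the placeholder status value are assumptions, not the real schema, which varies per tracker:

```python
import json

def verify_masters(path):
    """Check that each duplicate's master report exists in the dataset.

    Assumes each report is a JSON object with hypothetical fields
    'bug_id', 'status', and 'master_id'; adjust to the real schema.
    """
    with open(path) as f:
        reports = json.load(f)

    known_ids = {r["bug_id"] for r in reports}
    orphans = []
    for r in reports:
        if r.get("status") == "DUPLICATE" and r.get("master_id") not in known_ids:
            orphans.append(r["bug_id"])
            # One option: demote the report rather than drop it entirely.
            r["status"] = "NOT_A_DUPLICATE"  # placeholder status value
            r.pop("master_id", None)
    return reports, orphans
```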

We have continued to read more literature. Some researchers have grouped the reports based on key words that link each report to a specific feature of the system. We looked into applying either LDA (Latent Dirichlet Allocation) or Labeled LDA. Although Labeled LDA provides slightly better results, it requires the reports to be labeled by hand; plain LDA automates the process and runs roughly 60 times faster. LDA topics are another possible feature we would like to incorporate into our model.
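
To make the idea concrete, here is a minimal sketch of how LDA could be applied to report text. Our tooling is still undecided, so the choice of gensim and the tiny toy corpus are just assumptions for illustration:

```python
from gensim import corpora, models

# Toy corpus: each bug report reduced to a list of tokens.
# Real preprocessing (stop words, stemming) is omitted here.
docs = [
    ["crash", "startup", "null", "pointer"],
    ["ui", "button", "render", "startup"],
    ["crash", "memory", "leak", "heap"],
]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Fit a small topic model; num_topics is a tuning choice.
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)

# The topic distribution of a new report could serve as a similarity feature.
new_bow = dictionary.doc2bow(["crash", "startup"])
print(lda.get_document_topics(new_bow))
```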

Monday, October 14, 2013

Week 10/7/13

Hello Folks!

We are starting to collect our first data set from the Android bug repository. We would like to incorporate comments on defect reports into our model, in addition to the conventional descriptive and categorical data about the bug and product. A significant amount of effort is required to extract this specific information in a form that will be convenient to use later, and we are currently working through these challenges.

We are continuously doing background research to get new ideas and see what others have done. I have been updating a short summary table and have also created detailed summaries of each paper. An interesting possible similarity measure from this week's research categorizes the bug reports based on the probable context of the bug. This is done using pre-made word lists developed by people with in-depth knowledge of software development and of the specific repository, as sketched below.
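
As a rough sketch of that idea (the context categories and word lists below are invented placeholders, not the ones from the paper):

```python
# Hypothetical context word lists; real ones come from domain experts.
CONTEXT_WORDS = {
    "ui": {"button", "window", "render", "display", "font"},
    "network": {"socket", "timeout", "http", "connection"},
    "memory": {"leak", "heap", "allocation", "overflow"},
}

def context_scores(report_text):
    """Score a report against each context by word-list overlap."""
    tokens = set(report_text.lower().split())
    return {
        context: len(tokens & words) / len(words)
        for context, words in CONTEXT_WORDS.items()
    }

print(context_scores("browser window fails to render after http timeout"))
```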

      Everyone is really excited about the possibility of submitting our future research to MSR 2014, the conference on Mining Software Repositories. There are several possible opportunities for our team: the conference has both a research paper section and a data section, and it also holds a data mining competition that would be very cool to participate in. Participants from all over the world are given the same data set to analyze and report on, and the top reports get to present their findings at the conference. This would be a great experience for everyone.

Sunday, October 6, 2013

Week 9/30/13

Good news, this blog is officially up to date!

      This week, I created a table summarizing all of the background research we have completed so far. The table includes the dataset used in each experiment, the specifics of each research team's mathematical model, and the results that were obtained. Because many of the papers build on methods or results from other papers, this table will be very useful when we make decisions about our own model.
       
I did more background research on achieving more accurate retrieval of duplicate bug reports. One method uses a popular ranking function called BM25F, which takes advantage of both global word usage and local word frequencies, instead of the very common cosine similarity measurement.
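
For reference, here is a simplified Python sketch of BM25F-style scoring. The per-field boosts, b, and k1 values are illustrative tuning parameters (not values from the paper), and real implementations vary in the IDF formula they use:

```python
import math

# Hypothetical per-field boosts and length-normalization parameters.
FIELDS = {"summary": {"boost": 3.0, "b": 0.5},
          "description": {"boost": 1.0, "b": 0.8}}
K1 = 1.2

def bm25f_score(query_terms, doc, avg_len, N, doc_freq):
    """Score one document against a query with a simplified BM25F.

    doc: {field: list of tokens}; avg_len: {field: average token count};
    N: corpus size; doc_freq: {term: number of docs containing term}.
    """
    score = 0.0
    for term in query_terms:
        # Combine per-field, length-normalized term frequencies.
        weight = 0.0
        for field, params in FIELDS.items():
            tokens = doc.get(field, [])
            tf = tokens.count(term)
            norm = 1 - params["b"] + params["b"] * len(tokens) / avg_len[field]
            weight += params["boost"] * tf / norm
        idf = math.log(N / (doc_freq.get(term, 0) + 1))
        score += idf * weight / (K1 + weight)
    return score
```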
     
           In addition to watching more tutorials on web scraping with XPath and processing data with RapidMiner, I have installed Scrapy (a popular web scraping framework) and started working through its tutorials.
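
A first spider might look something like this minimal sketch; the URL and XPath expressions are placeholders, not a real tracker's page layout:

```python
import scrapy

class BugSpider(scrapy.Spider):
    """Minimal Scrapy spider sketch for pulling rows from a bug list page."""
    name = "bugs"
    start_urls = ["https://example.org/bugs?page=1"]  # hypothetical URL

    def parse(self, response):
        # XPath selectors extract one record per table row.
        for row in response.xpath('//table[@id="buglist"]/tr'):
            yield {
                "bug_id": row.xpath('td[1]/text()').extract(),
                "summary": row.xpath('td[2]/text()').extract(),
            }
```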

     Thanks for reading!

Week 9/23/13

Hello,

      This week, I read a paper on responsible conduct in research, in preparation for beginning to collect data for our first round of experiments.

      I also looked into another possible method of classifying duplicate bugs. This method builds a discriminative model that compares duplicates and non-duplicates to calculate the probability that a new bug report is in fact a duplicate. The model also continually updates its coefficients to reflect new data in an ever-changing corpus of reports.
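
Here is a minimal sketch of that kind of model, assuming scikit-learn and made-up pair-similarity features; the paper's actual features and training scheme may differ:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Each row holds hypothetical similarity features computed between a new
# report and a candidate master: title similarity, description similarity,
# and a same-component flag.
X = np.array([[0.9, 0.8, 1.0],
              [0.1, 0.2, 0.0],
              [0.7, 0.6, 1.0],
              [0.2, 0.1, 1.0]])
y = np.array([1, 0, 1, 0])  # 1 = duplicate pair, 0 = non-duplicate pair

# Logistic loss yields probabilities; SGD lets the coefficients be
# updated incrementally as new reports arrive.
clf = SGDClassifier(loss="log_loss")
clf.partial_fit(X, y, classes=[0, 1])

# Later, refresh the coefficients on a new batch without retraining:
X_new = np.array([[0.8, 0.9, 1.0]])
clf.partial_fit(X_new, np.array([1]))

print(clf.predict_proba(np.array([[0.85, 0.7, 1.0]])))
```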

     Finally, I have been watching tutorial videos on web scraping and using XPath. These tools will be very useful for collecting the large set of data we plan to gather.