Monday, January 6, 2014

Week 12/30/13

Good news! Our Undergraduate Student Research Grant application was approved! We are still waiting for the official letter, but we received an email the other day saying that we will receive funding for the full amount requested. This will really help defray the cost of equipment, printing, programs, and literature. We are very thankful to Youngstown State University for the grant, and also to the CREU program for their continued support of this project!

I was able to write the program that builds a dictionary with each duplicate bug as a key. Linked to each key (duplicate bug) is an array containing every bug that is either its master, a duplicate of it, or a duplicate of its master, and so on. Each array therefore holds a group of reports that all describe the same software problem. We can use this in our later algorithm to check whether any bugs in the top-K-similar list are actually duplicates.
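Here is a rough sketch of how such a grouping dictionary could be built, assuming the duplicate relations come in as simple (bug_id, master_id) pairs (those names are just for illustration, not our actual data format):

```python
# Sketch: group duplicate bug reports that describe the same problem.
# Assumes `duplicates` is a list of (bug_id, master_id) pairs; the field
# names are hypothetical, not our actual data layout.

def build_duplicate_groups(duplicates):
    # Union-find parent map: every bug starts as its own representative.
    parent = {}

    def find(b):
        parent.setdefault(b, b)
        while parent[b] != b:
            parent[b] = parent[parent[b]]  # path compression
            b = parent[b]
        return b

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    # Link each duplicate to its master, so chains (dup -> dup -> master) merge.
    for bug_id, master_id in duplicates:
        union(bug_id, master_id)

    # Collect the members of each group, then key the dictionary by bug id.
    members = {}
    for bug in parent:
        members.setdefault(find(bug), []).append(bug)
    return {bug: members[find(bug)] for bug in parent}

if __name__ == "__main__":
    dups = [(101, 100), (102, 100), (103, 101)]  # 103 duplicates a duplicate
    print(sorted(build_duplicate_groups(dups)[103]))  # [100, 101, 102, 103]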

I also calculated the total number of duplicate pairs. After finding the unique groups of duplicates, I computed "group size choose 2" for each group and summed these values to get the total number of duplicate pairs. We can now use this to calculate the commonly used recall rate performance measure originally proposed by Runeson et al. Although this measure is very popular, it has not been studied extensively or proved to be the best way to evaluate performance. We would like to look into its validity and possibly propose another measurement technique instead of recall rate.
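As a small sketch (not our exact script), the pair count falls out of the group dictionary above by deduplicating the groups and summing n-choose-2 over their sizes:

```python
# Sketch: total duplicate pairs from the group dictionary built above.
def total_duplicate_pairs(groups):
    # Every member of a group maps to the same list, so deduplicate by the
    # set of members to count each group exactly once.
    unique_groups = {frozenset(g) for g in groups.values()}
    # A group of size n contributes n * (n - 1) / 2 pairs ("n choose 2").
    return sum(n * (n - 1) // 2 for n in (len(g) for g in unique_groups))

# One group of 4 reports (6 pairs) and one group of 2 (1 pair) -> 7 pairs.
example = {
    1: [1, 2, 3, 4], 2: [1, 2, 3, 4], 3: [1, 2, 3, 4], 4: [1, 2, 3, 4],
    5: [5, 6], 6: [5, 6],
}
print(total_duplicate_pairs(example))  # 7
```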

I then read a master's thesis written by Tomi Prifti titled Duplicate Defect Prediction. Among the new things shown in this work was a study of whether the priority of a bug report affects the number of duplicates it has; surprisingly, it had no significant effect. He also studied intentional duplicates, that is, reports submitted multiple times out of frustration when a problem is not fixed promptly. These made up 5% of the total number of duplicates in the data set he studied. He also implemented a group centroid vector that only considered the most recent 2000 bugs, instead of the traditional tf-idf vector comparison over all reports. This is something I am certainly going to look into!
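To make the idea concrete, here is my own back-of-the-envelope reading of a "recent bugs only" group centroid, using scikit-learn's TfidfVectorizer. This is just an illustration of the concept, not Prifti's actual implementation, and the report texts and window size are made up:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical list of bug report texts, ordered oldest to newest.
reports = [
    "crash when opening large file",
    "application crashes opening big files",
    "toolbar icons render blurry on high dpi",
]
RECENT_WINDOW = 2000                        # keep only the most recent 2000 reports
recent = reports[-RECENT_WINDOW:]

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(recent)  # one tf-idf vector per recent report

# Centroid of a duplicate group = mean of its members' tf-idf vectors.
group_indices = [0, 1]                      # the two crash reports above
centroid = np.asarray(vectors[group_indices].mean(axis=0))

# Compare a new incoming report against the group centroid instead of
# against every individual report.
new_vec = vectorizer.transform(["program crash with large files"])
print(cosine_similarity(new_vec, centroid))  # similarity score in [0, 1]
```

The appeal of the centroid approach is that an incoming report only needs to be compared against one vector per group (and only recent groups at that), rather than against every report in the repository.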

I hope everyone had a wonderful winter break and has a wonderful spring semester!
