Wednesday, January 15, 2014

Week 1/6/14

Happy New Year!

This week, I downloaded pymongo, MongoDB, and all of their dependencies, and I was able to connect to the MongoDB server. This is needed to restore our small Eclipse and Eclipse 2008 databases into Mongo databases. Although the small Eclipse database can be accessed directly using simple techniques, Eclipse 2008 has over 45,000 bugs, so I had to adapt the programs that I wrote last week to read from pymongo databases. I should now be able to create a dictionary of all duplicate bugs and calculate the number of unique duplicate pairs on all of our even larger datasets. We were actually able to use the results from this program to match results given in another paper.

We are also looking into similarity measures other than the very common cosine measure. I think it might be cool to use a norm instead; possibly the Euclidean norm, a p-norm, or the infinity norm would be appropriate. My advisors suggested that we calculate the Dice and Jaccard similarities first to have a baseline to compare with previously done research. Then we would be able to tell the effectiveness of these other norms. We are still brainstorming on other avenues to go down.
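For concreteness, here is a minimal plain-Python sketch of the measures mentioned above: cosine and Euclidean distance on term-frequency vectors, and Dice and Jaccard on the sets of terms in each report. The function names and toy inputs are my own, not part of our actual programs.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length term-frequency vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def euclidean_distance(u, v):
    # Euclidean (2-norm) distance; smaller means more similar.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def dice(terms_a, terms_b):
    # Dice coefficient on the sets of terms in two reports.
    return 2 * len(terms_a & terms_b) / (len(terms_a) + len(terms_b))

def jaccard(terms_a, terms_b):
    # Jaccard similarity on the sets of terms in two reports.
    return len(terms_a & terms_b) / len(terms_a | terms_b)
```

A p-norm or infinity-norm distance would follow the same pattern as `euclidean_distance`, just with a different exponent or a max over the coordinate differences.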

The deadline for our MSR abstract is quickly approaching. We have been writing as much of the paper as we can.

Monday, January 6, 2014

Week 12/30/13

Good News! Our Undergraduate Student Research Grant application was approved! We are still waiting for the official letter, but we received an email the other day saying that we will receive funding for the full amount requested. This will really help defray the cost of equipment, printing, programs, and literature. We are very thankful to Youngstown State University for the grant, and also to the CREU program for their continued support of this project!

I was able to write a program that creates a dictionary with each duplicate bug as a key. Linked to each key / duplicate bug is an array containing each bug that is either its master, a duplicate of itself, or a duplicate of its master, and so on. Each array will then contain a full group of reports that describe the same software problem. This can be used in our later algorithm to check whether any bugs in the top-K similar list are actually duplicates.
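The chaining logic can be sketched in plain Python. This is a simplified stand-in for my actual program, assuming each duplicate bug id is mapped to the id of its master:

```python
def build_duplicate_groups(master_of):
    """Group bug ids that describe the same software problem.

    master_of maps each duplicate bug id to the id of its master; chains
    (duplicate -> master -> master's master ...) are followed to a root,
    and all ids sharing a root form one group.
    """
    def find_root(bug):
        seen = set()
        while bug in master_of and bug not in seen:
            seen.add(bug)          # guard against accidental cycles
            bug = master_of[bug]
        return bug

    groups = {}
    for dup in master_of:
        groups.setdefault(find_root(dup), set()).add(dup)
    for root, members in groups.items():
        members.add(root)          # include the master itself in its group
    return groups

def duplicate_dictionary(master_of):
    # Dictionary with each bug as a key, linked to the full array of
    # bugs that describe the same problem.
    lookup = {}
    for members in build_duplicate_groups(master_of).values():
        for bug in members:
            lookup[bug] = sorted(members)
    return lookup
```

With `master_of = {2: 1, 3: 2, 5: 4}` (bug 3 is a duplicate of 2, which is a duplicate of master 1), bug 3's entry would contain the whole chain `[1, 2, 3]`.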

I also calculated the total number of pairs of duplicates. After finding only the unique groups of duplicates, I calculated the group size choose 2 for each group. The sum of these was then taken to get the total number of duplicate pairs. We can now use this to calculate the commonly used performance measure, recall rate, originally proposed by Runeson et al. Although this method is very popular, it has not been studied extensively or proven to be the best way to measure performance. We would like to look into the validity of this and possibly propose another measurement technique instead of recall rate.
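The pair counting itself is tiny with Python's `math.comb`; the toy groups below are made up for illustration:

```python
from math import comb

def total_duplicate_pairs(groups):
    # groups: iterable of sets, each a unique group of reports describing
    # the same problem; sum "group size choose 2" over the groups.
    return sum(comb(len(g), 2) for g in groups)

# A group of 3 yields C(3,2) = 3 pairs and a group of 2 yields 1,
# so these two groups contain 4 unique duplicate pairs in total.
```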

I then read a master's thesis written by Tomi Prifti titled Duplicate Defect Prediction. Some new things shown in this paper included a study on whether the priority of a bug report affected the number of duplicates it had. Surprisingly, this had no significant effect. He also studied intentional duplicates, or duplicates that are submitted multiple times out of frustration when a problem is not fixed promptly. These types of duplicates actually made up 5% of the total number of duplicates in the data set that he studied. He also implemented a group centroid vector, computed over only the most recent 2000 bugs, instead of the traditional tf/idf vector measurement. This is something I am certainly going to look into!
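As I understand the centroid idea, a group of duplicate reports is represented by the element-wise mean of its members' vectors, so a new report can be compared against one vector per group instead of every past report. A rough sketch of that (not Prifti's actual implementation):

```python
def group_centroid(vectors):
    # Element-wise mean of a group's tf/idf vectors: one vector that
    # stands in for the whole group of duplicate reports.
    n = len(vectors)
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dims)]
```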

I hope everyone had a wonderful winter break and has a wonderful spring semester!

Week 12/16/13 and 12/23/13

This week I applied what I learned last week about Gensim to our test data set (EclipseSmall). I was able to calculate tf/idf (term frequency / inverse document frequency) weights for the 1001 bug reports in EclipseSmall. This involved doing the standard preprocessing (stemming, stop words, etc.) of each bug report, followed by converting each report into a vector, and finally doing the tf/idf calculations.
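Stripped of Gensim, the core of that pipeline can be sketched in a few lines of plain Python. The stop-word list and documents here are toy placeholders, and a real pipeline would also stem each token:

```python
import math
import re

STOP_WORDS = {"the", "a", "is", "in", "when", "to", "and", "of"}  # toy list

def preprocess(text):
    # Lowercase, tokenize, and drop stop words.
    return [w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOP_WORDS]

def tfidf(docs):
    # docs: list of raw bug-report strings -> list of {term: weight} dicts.
    tokenized = [preprocess(d) for d in docs]
    n = len(tokenized)
    df = {}                                   # document frequency per term
    for tokens in tokenized:
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    weighted = []
    for tokens in tokenized:
        tf = {t: tokens.count(t) for t in set(tokens)}
        weighted.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weighted
```

A term that appears in every report gets a weight of zero (log of 1), which is exactly why tf/idf downplays uninformative words.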

Next we want to create the similarity matrix I talked about last week and retrieve the top 5 most similar reports for each bug. We then want to see if any bug that is a duplicate has its master in its list of top 5 similar bug reports. To do this, we need a list of duplicate bugs. This is less trivial than it should be, since many bugs have multiple duplicates that are chained, duplicate to master, in various ways. I am working on a program to run through the data and find all of these duplicates.

We will also need to use some combinatorics to calculate the set of unique pairs of duplicates. This number is essential for calculating recall rate, which is close to the current standard measure of duplicate detection performance.
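My current understanding of recall rate, sketched as a hypothetical helper (the input maps below are made up for illustration): a duplicate counts as recalled if its top-K list contains any report from its own duplicate group.

```python
def recall_rate(top_k_lists, duplicate_group):
    """Fraction of duplicate reports 'recalled' within their top-K list.

    top_k_lists maps each duplicate bug id to the list of its K most
    similar bug ids; duplicate_group maps each bug id to the set of
    other reports describing the same problem.
    """
    hits = sum(
        1 for bug, candidates in top_k_lists.items()
        if duplicate_group[bug] & set(candidates)
    )
    return hits / len(top_k_lists)
```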

I also read an interesting paper, Measuring the Semantic Similarity of Comments, by Dit et al. In this paper, they look at the flow of the conversation in the comment section. If any comments get off topic, they can be removed, making the conversation more readable. This is certainly an interesting application of similarity measurements that I had not thought about before.

Week 12/2/13 and 12/9/13


My advisors, the Computer Science and Information Systems Department Chair, and the Dean of STEM have approved our Undergraduate Student Research Grant. I have now submitted it to the committee for review. It is apparently a tough year, since funding is limited and more applicants than normal have applied.
We still have high hopes, though!

We have not heard anything back yet from the 2013 National Conference on Undergraduate Research, where we applied to present. I will post updates when I hear them.

I also submitted my Mid-Year Report this week.

I downloaded and installed Gensim, a program created to run different models, mainly for calculating and analyzing the semantic structure of text. It has several built-in functions that can be used in our Python programs. I was able to run through several tutorials to get acquainted with the program. One very cool thing that I was able to do was create and iterate over a similarity matrix for a corpus of documents. Many programs are not able to do this, since for a large corpus the size of the matrix grows quadratically with the number of documents. Gensim stores data on the computer's disk and thus avoids this potential memory problem.
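The memory trick can be illustrated without Gensim: instead of materializing the full n × n matrix, compute it one row at a time with a generator, so only a single row ever sits in memory. A minimal cosine-based sketch with made-up vectors:

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_rows(vectors):
    # Yield one row of the n x n similarity matrix at a time, so the
    # full matrix (which grows quadratically in n) is never stored.
    for u in vectors:
        yield [cosine(u, v) for v in vectors]
```

Gensim goes further by writing shards to disk, but the row-at-a-time iteration is the same idea from the consumer's point of view.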

Another program that we have seen several times in our literature review is Lucene. We may consider working with this program as well.