Wednesday, January 15, 2014

Week 1/6/14

Happy New Year!

This week, I downloaded pymongo, MongoDB, and all of their dependencies, and I was able to connect to the MongoDB server. This is needed to restore our small Eclipse and Eclipse 2008 databases into Mongo databases. Although the small Eclipse database can be accessed directly using simple techniques, Eclipse 2008 has over 45,000 bugs. Therefore, I had to adapt the programs that I wrote last week to read pymongo databases. I should now be able to create a dictionary of all duplicate bugs and calculate the number of unique duplicate pairs on all of our even larger data sets. We were actually able to use the results from this program to match results given in another paper.
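
A minimal sketch of what reading the reports through pymongo looks like. The database, collection, and field names here (eclipse2008, bugs, dup_id) are placeholders, not our actual schema:

```python
def is_duplicate(report):
    # A report is a duplicate when it points at a master report via dup_id
    return report.get("dup_id") is not None

def count_duplicates(db_name="eclipse2008", coll_name="bugs"):
    # Requires pymongo and a running mongod on localhost; the database,
    # collection, and field names are assumptions for illustration only.
    from pymongo import MongoClient
    bugs = MongoClient()[db_name][coll_name]
    return sum(1 for report in bugs.find() if is_duplicate(report))
```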

We are also looking into similarity measures other than the very common cosine measure. I think it might be cool to use a norm instead; the Euclidean norm, a p-norm, or the infinity norm might be appropriate. My advisors suggested that we calculate the Dice and Jaccard similarities first to have a baseline to compare with previously done research. Then we would be able to tell the effectiveness of these other norms. We are still brainstorming other avenues to go down.
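
As a sketch, the p-norm family covers all three candidates in one small function: p = 2 gives the Euclidean norm of the difference vector, p = 1 the Manhattan norm, and letting p go to infinity gives the max norm:

```python
def p_norm_distance(u, v, p=2.0):
    # u, v: equal-length dense tf/idf vectors; distance between them is the
    # p-norm of the difference. p=2 is Euclidean, float("inf") is the max norm.
    diffs = [abs(a - b) for a, b in zip(u, v)]
    if p == float("inf"):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)
```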

The deadline for our MSR abstract is quickly approaching. We have been working on getting what we can written on the paper.

Monday, January 6, 2014

Week 12/30/13

Good news! Our Undergraduate Student Research Grant application was approved! We are still waiting for the official letter, but we received an email the other day that we will receive funding for the full amount requested. This will really help defray the cost of equipment, printing, programs, and literature. We are very thankful to Youngstown State University for the grant, and also to the CREU program for their continued support of this project!

I was able to write the program that creates a dictionary with each duplicate bug as a key. Linked to each key/duplicate bug is an array containing every bug that is either its master, a duplicate of itself, or a duplicate of its master, and so on. These arrays then contain all groups of reports that describe the same software problem. This can be used in our later algorithm to check whether any bugs in the top-K-similar list are actually duplicates.
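
One way to build such a dictionary is a union-find pass over the (duplicate, master) links, which handles arbitrarily chained duplicates; this is a sketch, not the exact program:

```python
def build_duplicate_groups(dup_pairs):
    # dup_pairs: iterable of (duplicate_id, master_id) links.
    # Returns {bug_id: sorted list of every bug in its duplicate group}.
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:          # walk to the root, compressing the path
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for dup, master in dup_pairs:      # chained duplicates end up in one tree
        ra, rb = find(dup), find(master)
        if ra != rb:
            parent[ra] = rb

    groups = {}
    for bug in parent:
        groups.setdefault(find(bug), []).append(bug)
    return {bug: sorted(groups[find(bug)]) for bug in parent}
```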

I also calculated the total number of pairs of duplicates. After finding only the unique groups of duplicates, I calculated group size choose 2 for each group. The sum of these was then taken to get the total number of duplicate pairs. We can now use this to calculate the commonly used performance measure, recall rate, which was originally proposed by Runeson et al. Although this method is very popular, it has not been studied extensively or proven to be the best way to measure performance. We would like to look into its validity and possibly propose another measurement technique instead of recall rate.
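
The pair count itself is a one-liner over the groups:

```python
from math import comb

def total_duplicate_pairs(groups):
    # groups: iterable of unique duplicate groups (collections of bug ids
    # describing the same problem); each group of size n yields C(n, 2) pairs
    return sum(comb(len(g), 2) for g in groups)
```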

I then read a master's thesis written by Tomi Prifti titled Duplicate Defect Prediction. New things shown in this paper included a study of whether the priority of a bug report affected the number of duplicates it had. Surprisingly, it had no significant effect. He also studied intentional duplicates, i.e., duplicates that are submitted multiple times out of frustration when a problem is not fixed promptly. These types of duplicates actually made up 5% of the total number of duplicates in the data set that he studied. He also implemented a group centroid vector that only considered the most recent 2000 bugs instead of the traditional tf/idf vector measurement. This is something I am certainly going to look into!

I hope everyone had a wonderful winter break and has a wonderful spring semester!

Week 12/16/13 and 12/23/13

This week I applied what I learned last week about Gensim to our test data set (EclipseSmall). I was able to calculate the tf/idf (Term Frequency / inverse Document Frequency) for the 1001 bug reports in EclipseSmall. This involved doing the standard preprocessing (stemming, stop words, etc.) of each bug report, followed by converting each report into a vector, and finally doing the tf/idf calculations. 
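
For reference, the weighting that Gensim's TfidfModel applies by default (raw term frequency times log2(N/df), followed by L2 normalization of each document vector) can be sketched in plain Python like this:

```python
import math

def tfidf_vectors(texts):
    # texts: list of already-preprocessed token lists, one per bug report.
    # Sketch of Gensim's default scheme: weight = tf * log2(N / df), L2-normalized.
    df = {}
    for doc in texts:
        for term in set(doc):                 # document frequency per term
            df[term] = df.get(term, 0) + 1
    n_docs = len(texts)
    vectors = []
    for doc in texts:
        tf = {}
        for term in doc:                      # raw term frequency
            tf[term] = tf.get(term, 0) + 1
        weights = {t: c * math.log2(n_docs / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else weights)
    return vectors
```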

Next we want to create the similarity matrix I talked about last week and retrieve the top 5 most similar reports for each bug. We then want to see if any bug that is a duplicate has its master in the list of top 5 similar bug reports. To do this, we need to have a list of duplicate bugs. This is less trivial than it should be, since many bugs have multiple duplicates that are chained, duplicate to master, in various ways. I am working on a program to run through and find all of these duplicates.  
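
Assuming the tf/idf vectors are stored as {term: weight} dictionaries and are already L2-normalized, retrieving the top-5 list is a straightforward cosine ranking; a sketch, not our actual program:

```python
def cosine(u, v):
    # u, v: sparse {term: weight} vectors, assumed already L2-normalized,
    # so the dot product over shared terms is the cosine similarity
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def top_k_similar(query_id, vectors, k=5):
    # vectors: {bug_id: tfidf dict}; returns the k most similar other bugs
    scores = [(cosine(vectors[query_id], vec), bug_id)
              for bug_id, vec in vectors.items() if bug_id != query_id]
    scores.sort(reverse=True)
    return [bug_id for _, bug_id in scores[:k]]
```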

We will also need to use some combinatorial logic to calculate the set of unique pairs of duplicates. This number is essential for calculating recall rate, which is close to the current standard measure of duplicate detection performance.
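
Sketched out, recall rate is just the fraction of duplicate reports whose group-mate shows up in its top-k list; the data-structure names here are illustrative:

```python
def recall_rate(top_k_lists, duplicate_groups):
    # top_k_lists: {duplicate_bug_id: list of top-k candidate ids}
    # duplicate_groups: {bug_id: set of all ids in its duplicate group}
    # Returns the fraction of duplicates with a true group-mate in their list.
    hits = sum(1 for bug, candidates in top_k_lists.items()
               if any(c != bug and c in duplicate_groups.get(bug, set())
                      for c in candidates))
    return hits / len(top_k_lists) if top_k_lists else 0.0
```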

I also read an interesting paper on Measuring the Semantic Similarity of Comments by Dit et al. In this paper, they look at the flow of the conversation in a comment section. If any comments get off topic, they can be removed, making the conversation more readable. This is certainly an interesting application of similarity measurements that I had not thought about before.

Week 12/2/13 and 12/9/13


My advisors, the Computer Science and Information Systems Department Chair, and the Dean of STEM have approved our Undergraduate Student Research Grant application. I have now submitted it to the committee for review. It is apparently a tough year, since funding is limited and more applicants than usual have applied. We still have high hopes though!

We have not heard anything back from the 2013 National Conference on Undergraduate Research that we applied to present at. I will post updates when I hear them.  

I also submitted my Mid Year Report this week. 

I downloaded and installed Gensim. It is a program created to run different models, mainly for calculating and analyzing the semantic structure of text. It has several built-in functions that can be used from our Python programs. I was able to run through several tutorials to get acquainted with it. One very cool thing I was able to do was create and iterate over a similarity matrix for a corpus of documents. Many programs are not able to do this, since for a large corpus the size of the matrix grows quadratically with the number of documents. Gensim can store the matrix on a computer's disk and thus avoids this potential memory problem.

Another program that we have seen several times in our literature review is Lucene. We may consider working with this program as well. 

Monday, December 2, 2013

Week 11/25/13

I hope everyone had a nice Thanksgiving break! I've been doing quite a bit of bookkeeping this week in between Black Friday and Cyber Monday shopping.

First, I submitted an abstract to the 2013 National Conference on Undergraduate Research. If it is accepted, we will be going to the University of Kentucky on April 3 to give an oral presentation. This will be the first opportunity to present this research, so I am very excited!

I also wrote a proposal for the Undergraduate Student Research Grant at my home university. I made it in LaTeX, so it is looking pretty sharp. I am going to try to get it approved by my advisors and the chair of the computer science and information systems department tomorrow. We are trying to get funding for various research supplies.  Wish me luck!

I am also working on a painfully detailed procedural list of all the papers we have read thus far. We should be able to work directly from this list to reproduce some of the previous work before fully implementing our new system.

I am making progress on preprocessing our data. I am following along with a tutorial to produce a vector space model that we can apply a similarity measure to.  We will probably start off with cosine similarity, but if I feel ambitious I might implement either the Dice or Jaccard similarity as well.
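
For reference, both set-based measures are tiny functions over token sets; a sketch, treating each report as a bag of tokens:

```python
def jaccard(a, b):
    # |intersection| / |union| of the two token sets
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def dice(a, b):
    # 2 * |intersection| / (|a| + |b|)
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0
```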

Have a nice day!

Week 11/11/13 and 11/18/13

I wrote more Python code this week. At this point, my program will go through a data set and find all duplicate bug reports. Once it finds a duplicate, it checks to see whether the master of that bug is also in the data set. If the master is in the set, it adds the duplicate to our new data set; otherwise, it does not include the report, since it is not logical to include unmastered duplicates in our research. It also adds all singleton bug reports without duplicates. We tested it on our small Eclipse data set and actually got a surprising result. Of the 1001 bug reports in the data set, 84 were duplicates. My program then determined that of these 84 duplicates, only 22 had their master in the set! This shockingly low number may be due to the fact that we chose such a small chunk of data to test on. We will have to look into this more later.

I was also able to begin some of the data preprocessing in Python. I am currently able to tokenize, remove stop words, normalize, stem, lemmatize, and detect n-grams with ease. Creating a vector space model and trying out some similarity measurements should be finished by next week.
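
The master-filtering step can be sketched as follows, assuming each report dict stores its master's id under a dup_id key (a placeholder name, not our actual schema):

```python
def filter_reports(reports):
    # reports: {bug_id: report dict}; a report with dup_id points at its master.
    # Keep singletons, and keep duplicates only when their master is also here.
    kept = {}
    for bug_id, report in reports.items():
        master = report.get("dup_id")
        if master is None or master in reports:
            kept[bug_id] = report
    return kept
```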

I also read an interesting initial report that gave basic statistics on 9 bug repositories (some of which we plan on studying). It found:

  • Percentage of duplicate reports
  • Amount of time spent processing duplicate and non duplicates
  • How long reporters look for duplicates before submitting a new report
  • Total number of reports submitted daily
  • How many duplicates are submitted by frequent posters 

One surprising statistic was that over 90% of duplicate bug reports are actually submitted by sporadic users. It gave us a few ideas for what to look into.

Monday, November 11, 2013

Week 10/28/13 and 11/04/13


I was able to run through the JSON file containing the small Eclipse data set and read specific lines. This is certainly a step in the right direction, but it may only be useful for our toy set of bug reports. We have come across a slight hitch in our plan to use JSON files to store our bug reports: because we need to access different parts of the file at a time, it looks as though we will need to load all of the data into memory to work with it. Since we are planning on analyzing data sets containing several hundred thousand bug reports, this is not feasible (especially since the files are a few gigabytes even without the comments). Once the comments are included, it will probably be intractable. Therefore we are going to load the data into a Mongo database and work with that instead. I am installing pymongo to adapt the program I have been working on to use this database.
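
A sketch of the batched loading we have in mind, so the whole file never sits in memory at once; the database and collection names are placeholders:

```python
def batches(items, size=1000):
    # Yield fixed-size chunks from any iterable of parsed bug reports
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def load_into_mongo(report_iter, db_name="eclipse", coll_name="bugs"):
    # Requires pymongo and a running mongod; names here are assumptions.
    from pymongo import MongoClient
    coll = MongoClient()[db_name][coll_name]
    for batch in batches(report_iter):
        coll.insert_many(batch)
```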

The deadline is already coming up for our paper submission. We have started writing the sections we can. We are continuously adding to the bibliography, and I wrote the first draft of the related works section. The introduction section is probably going to be next on our list.

Here is a summary of this week's background research.

First we looked into using character n-grams instead of word n-grams. This process seems to be very innovative. It is not susceptible to as many natural language issues as word-based systems are. For example, character n-grams are not fazed by misspelled words, shortened words, or words with different endings. These require a great deal of preprocessing in other systems, so character n-grams may provide better results with less effort. Another astonishing fact is that character n-grams are not language dependent; implementing this system for another language would therefore be trivial compared to other models.
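
Extracting character n-grams is almost trivial, which is part of the appeal; a quick sketch:

```python
def char_ngrams(text, n=3):
    # Slide an n-character window over the text; no stemming or stop-word
    # removal needed, and misspellings still share most of their grams
    # (e.g. "similar" and "similer" share the grams sim, imi, mil)
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```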

Secondly, we looked into another method of automatically detecting duplicate reports. This method implemented quite a few state-of-the-art techniques that had only been used in non-automated systems. One of these is the information retrieval formula for calculating similarity called BM25F. Another novel aspect is a boolean feature that is true if two reports are based on the same product. It also compares the top similar report with the other top-k similar reports, providing a threshold to determine whether a report is significantly more similar to a new report than the rest of the top k. Combining these factors together, it may get results up to 200% more accurate than previous fully automated models.
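
BM25F itself additionally weights multiple fields (summary, description, etc.), which is omitted here, but the plain single-field BM25 formula it extends can be sketched as follows:

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    # Simplified single-field BM25. corpus: list of token lists; doc is one
    # of them; k1 and b are the usual tuning constants.
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)   # document frequency
        if df == 0:
            continue
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        tf = doc.count(term)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score
```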

We need to make some decisions this week about the specifics of our model. We are currently considering a few different directions. A major decision will be which similarity measurement we are going to implement. We are also looking into what set of features we would like to consider as well as maybe an additional data set that has not been included in other research experiments.  

That's all for today, enjoy your Veteran's Day!