This week I completed the first version of the poster for the Grace Hopper Celebration of Women in Computing. I had to do a lot of tinkering with the poster to fit all of the important information on it. It was very hard to narrow down what should go on the poster. I think it looks pretty nifty now.
I also submitted an application for a scholarship to attend this celebration. I really hope that I get this scholarship. Since I am graduating, I am not sure if I can still get funding from the CREU program to attend. I need to look into this matter, but the scholarship would certainly help cut down the cost of attendance.
I set a date for presenting this research to the department for my senior project. I get to present on May 28th at 11:00 am. It should be a jolly good time!
I also wrote the first draft of the final report for this project. I can't believe the year is almost over! This has been such a great experience for me. I will really miss the weekly meetings with Dr. Bonita and Dr. Sharif!
Classification Algorithms for Detecting Duplicate Bug Reports in Large Open Source Repositories
Monday, April 21, 2014
Week 3/30/14-4/7/14
This week, I presented our paper at the YSU QUEST forum for student research. It was very well received and the audience seemed very interested in the topic. I really enjoyed having the opportunity to present and got some good feedback. Some interesting questions were posed at the end of the presentation. Since we had very good results, we did not look too closely at the very few negative cases (non-duplicates) that were mislabeled as duplicates. One audience member asked what happened in this small number of cases. I intend to look into this more closely to find the source of the problem and improve our system. Another audience member, who often writes bug reports himself, asked if we had any issues with profanity in the reports we studied (since he is not often happy when he is writing his reports). This seems like an interesting thing to look into (although we did not encounter any problems from it in our tests).
I also presented our project from a more mathematical standpoint at the Ohio MAA sectional meeting. I talked about the different metric-preserving functions that were essential to make some of our distance-based machine learning algorithms (K-nearest neighbors) a logical choice. The audience here was curious about the datasets that we used. One audience member suggested that we also try out our system on some proprietary software, to see if our excellent results still hold.
Sunday, March 30, 2014
Week 3/17/14 - 3/24/14
This week, I submitted a title and abstract to YSU's QUEST forum for student scholarship. My talk is going to be this Tuesday 4/1/14. I am very excited to present our research at this venue and have been working hard to create the presentation and practice it.
I am also attending the Ohio MAA sectional conference this Friday, where I will present all of the mathematics that I have learned to accompany this project. I am going to talk about the difference between similarity measures that are metrics and ones that are non-metrics. A metric is a distance function that, among other properties, always satisfies the triangle inequality. There is a lot of debate currently going on about whether or not it is important to use metrics. This topic is truly interdisciplinary (since it spans the subjects of Ecology, Biology, Chemistry, Computer Science, and Mathematics). In fact, the book I am currently reading on the subject is all about interpreting ecological data. I also intend to do an analysis of the techniques used for our model to determine if they are distance preserving or not. If time permits, I will also talk about subadditive and superadditive functions, and the role they play in this conundrum.
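As a small illustration of the metric versus non-metric distinction, here is a minimal sketch that checks the triangle inequality on three hand-picked points. The helper names and the sample points are my own illustrative choices, not part of the project code. Euclidean distance passes the check, while cosine distance (one minus cosine similarity) fails it on these points.

```python
import math
from itertools import permutations


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def satisfies_triangle_inequality(distance, points, tol=1e-12):
    """Check d(x, z) <= d(x, y) + d(y, z) for every ordered triple of points."""
    return all(
        distance(x, z) <= distance(x, y) + distance(y, z) + tol
        for x, y, z in permutations(points, 3)
    )


# Illustrative points only: two axis vectors and the diagonal between them.
points = [(1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
euclidean_ok = satisfies_triangle_inequality(euclidean_distance, points)
cosine_ok = satisfies_triangle_inequality(cosine_distance, points)
```

A check like this can only find counterexamples on the points it is given; passing it on a sample does not prove that a function is a metric.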
The first very rough draft of our Grace Hopper Celebration poster has been created. Much work is still needed to get this ready for the conference, but we have plenty of time. I chose to create the poster in LaTeX, because I really enjoy typesetting!
Monday, March 17, 2014
Week 3/10/14
Our two papers have been accepted to the 2014 MSR Conference! One is a data paper that explains in detail how our data was collected and preprocessed. The other is a short research paper that explains the progress we have made thus far. We received a list of referee comments for both papers. We will be busy over the next week or so making these changes and getting our papers camera ready. Once these revisions are made, we will resubmit the papers for final approval. We will then travel to Hyderabad, India to present this research!
I also wrote our extended abstract for the GHC poster session. We have already submitted this. I need to get working on our poster now. I am torn between using LaTeX or an office environment to create this poster. I really like LaTeX and the equations will look sharp, but some parts of the formatting can be really finicky.
I also submitted an abstract to the Ohio MAA sectional math fest. In this presentation, I intend to talk more about the underlying math of the project and how it all comes together.
My list of research supplies from the USR Grant has been approved and ordered. I am getting three books that should be very helpful to understand all of the different aspects of the project. I am also getting a two terabyte portable hard drive to store the immense amounts of data needed for our study and a headset for our weekly online meetings.
Monday, March 10, 2014
Week 2/24/14 - 3/3/14
This week, I presented our research at the Regional Pi Mu Epsilon Conference. I gave a 15 minute presentation. I talked about the goals of this research, the importance for software companies to look into these problems, the basic techniques involved, new features that we considered, and the results obtained. The talk went well. The audience seemed really interested and had a bunch of questions. I really enjoyed attending the conference. I got to see some interesting talks given by fellow students. It was a really great experience.
I also wrote some pseudocode for the programs I previously wrote. We may include this in future papers or just have it on file to look back on. It's really weird converting from actual code to pseudocode. I generally do this in the reverse direction.
I also read a very advanced thesis paper on LDA. I have to admit this was a hard read. It used a lot of probability theory that I have never seen before. I am going to have to do a lot of background reading to get a better sense of what was done.
Since we received the USR grant from Youngstown State University, I am also getting together a list of possible books, equipment, or accessories that will help us in our research.
Sunday, February 23, 2014
Week 2/17/14
This has been such a busy time. We submitted the abstract of our data paper to the MSR conference! The full paper is due on 2/21/14.
We are looking more into the implementation of LDA. I am currently figuring out a program called Vowpal Wabbit that will make the implementation much easier.
I also prepared a 15-minute presentation in Beamer on the math behind our research and the progress we have made so far. I am going to present it at the Pi Mu Epsilon regional conference on Saturday 2/22/14. This is the first time that I am presenting our research.
Week 2/10/14
We submitted our paper to the short research paper session of MSR. Our results showed that we had a 3-6% increase in accuracy compared to state-of-the-art published results. We are very excited to get some feedback on our work and hope for the best!
We are already pushing ahead to make our model better. We would like to add some additional features. We are considering implementing the LDA algorithm. Currently the largest data set we have experimented on is Mozilla, which initially had 78,236 bugs in it. We would like to run our algorithm on some much larger data sets (at least an order of magnitude larger)!
We would also like to submit a data paper to the MSR conference. This paper will talk about specifically how we collected our data and processed it. They have a specific paper section on this topic.
Week 1/27/14 - 2/3/14
We have finished all of the experiments that we intend to include in our MSR conference paper. We submitted our abstract for the MSR conference. We calculate the similarity scores for 25 different features. 18 of these were adapted from the short text similarity papers and were calculated using Takelab. The other 7 are binary features that are 1 if the feature considered is the same for both bugs and 0 otherwise. These were created under the assumption that two duplicate bugs will be from the same piece of software and share other qualities.
We then use support vector regression to calibrate our Binary Classification Model (BCM). BCM essentially means we classify each bug as either a duplicate or a non-duplicate. This is more manageable than the ranking problem that produces a top-K list of the K most similar bugs. We tested three datasets: Eclipse, Open Office, and Mozilla. Our results indicate that this method is better than previously recorded results.
Another difference in our model compared to previous work is that we do not keep bugs that are still open. This allows us to avoid training our model on mislabeled data, since we cannot confirm if these bugs are duplicates or not.
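The binary features described above can be sketched as follows. This is only an illustration: the field names (`product`, `component`, `version`) and the sample reports are hypothetical, since the actual seven fields are not listed here.

```python
def binary_features(bug_a, bug_b, fields):
    """Return 1 for each field whose value is present and equal in both bugs, else 0."""
    features = []
    for field in fields:
        value_a = bug_a.get(field)
        value_b = bug_b.get(field)
        # A missing value never counts as a match.
        features.append(1 if value_a is not None and value_a == value_b else 0)
    return features


# Hypothetical field names and reports, for illustration only.
FIELDS = ["product", "component", "version"]
bug_a = {"product": "Platform", "component": "UI", "version": "3.4"}
bug_b = {"product": "Platform", "component": "Debug", "version": "3.4"}
example_features = binary_features(bug_a, bug_b, FIELDS)
```

For this pair, the product and version match but the component does not, so the feature vector is `[1, 0, 1]`.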
Our full paper is due on 2/7/14. Therefore we are busy writing up the final draft. I am writing the section on how our data was collected and the properties of the datasets.
Week 1/21/14
We have decided to use Takelab for all of our experiments. Because Takelab was designed to be run on short sentences, we must combine the summary and description of each bug report into one large section of text. We intend to compare all possible pairs of duplicates. Although other researchers did not compare all possible duplicates, this is statistically better because we are testing our model on a representative set of data, whereas changing the percentage of duplicates in a dataset introduces other problems.
Another issue we are considering has to do with stack traces. Oftentimes people will submit the stack trace in the bug report's description field. We have two options. We can either leave in the periods or remove them. We intend to test out both methods to see which produces better results, or to see if it matters at all.
I have also made a summary of the features from each of the papers that we read last week. We can use this to analyze which features were the most commonly used and which features worked the best.
Since we must submit our paper soon, I have also updated the bibliography and made it into a BibTeX file for LaTeX. I have also moved the paper into the MSR LaTeX template and added in the references.
Week 1/13/14
This week we are moving to a new approach. We read 3 papers that looked into short text similarity. These papers mainly compared the similarity of single sentences. Because there are very few words in a sentence, many other features must be considered to determine if they are related. For example, WordNet has an enormous number of words linked together in a hierarchy that can be used to determine if two words are used in similar contexts. These and many other methods used in these papers would be perfect to apply to our longer texts. The papers used Takelab and DKPro to analyze all of their data. These have all of the short text similarity features that we are interested in using built in. We are seriously considering running all of our experiments with the help of this software. Before we can do this, though, we need to adapt our datasets slightly. We intend to generate pairs of duplicates and non-duplicates to both calibrate and test our model. The papers we read used datasets composed of 20% duplicates and 80% non-duplicates. We are currently working on randomly generating these.
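One way to randomly generate such a set of labeled pairs is sketched below. It is a minimal sketch, assuming we already have a list of unique bug ids and a list of known duplicate pairs drawn from those ids; the function name, parameters, and sample data are all hypothetical.

```python
import random


def sample_pairs(bug_ids, duplicate_pairs, n_pairs, duplicate_fraction=0.2, seed=0):
    """Sample labeled pairs: label 1 for known duplicates, 0 for non-duplicates."""
    rng = random.Random(seed)
    known = sorted({tuple(sorted(pair)) for pair in duplicate_pairs})
    known_set = set(known)
    n_dup = round(n_pairs * duplicate_fraction)
    n_non = n_pairs - n_dup
    total_pairs = len(bug_ids) * (len(bug_ids) - 1) // 2
    if n_dup > len(known) or n_non > total_pairs - len(known):
        raise ValueError("not enough pairs to sample from")

    chosen_duplicates = rng.sample(known, n_dup)

    # Rejection sampling: draw random pairs and keep those not known to be duplicates.
    chosen_non_duplicates = set()
    while len(chosen_non_duplicates) < n_non:
        pair = tuple(sorted(rng.sample(bug_ids, 2)))
        if pair not in known_set:
            chosen_non_duplicates.add(pair)

    labeled = [(pair, 1) for pair in chosen_duplicates]
    labeled += [(pair, 0) for pair in sorted(chosen_non_duplicates)]
    rng.shuffle(labeled)
    return labeled


# Illustrative data only: 20 bug ids and three known duplicate pairs.
bug_ids = list(range(1, 21))
duplicate_pairs = [(1, 2), (3, 4), (5, 6)]
pairs = sample_pairs(bug_ids, duplicate_pairs, n_pairs=10)
```

With `n_pairs=10` and the default fraction, the sample contains 2 duplicate pairs and 8 non-duplicate pairs. The fixed seed keeps the sample reproducible between runs.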
Wednesday, January 15, 2014
Week 1/6/14
Happy New Year!
This week, I downloaded PyMongo, MongoDB, and all of their dependencies. I was able to connect to the MongoDB server. This is needed to restore our small Eclipse and Eclipse 2008 databases into Mongo databases. Although the small Eclipse database can be accessed directly using simple techniques, Eclipse 2008 has over 45,000 bugs. Therefore, I had to adapt the programs that I wrote last week to read from MongoDB through PyMongo. I should now be able to create a dictionary of all duplicate bugs and calculate the number of unique duplicate pairs on all of our even larger datasets. We were actually able to use our results from this program to match results given in another paper.
We are also looking into possibly using similarity measures other than the very common cosine measurement. I think it might be cool to use a norm instead. Possibly the Euclidean norm, p-norm, or infinity norm would be appropriate. My advisors suggested that we calculate the Dice and Jaccard similarities first to have a baseline to compare with previously done research. Then we would be able to tell the effectiveness of these other norms. We are still brainstorming on other avenues to go down.
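For reference, here is a minimal sketch of the Dice and Jaccard similarities computed on sets of tokens. The two sample summaries are made up for illustration, and treating two empty token sets as identical is my own convention.

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)


def dice_similarity(tokens_a, tokens_b):
    """Twice the intersection size divided by the sum of the set sizes."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


# Made-up bug summaries, already lowercased and split on whitespace.
summary_a = "editor crashes when saving file".split()
summary_b = "editor crashes on file save".split()
jaccard = jaccard_similarity(summary_a, summary_b)  # 3 shared of 7 distinct tokens
dice = dice_similarity(summary_a, summary_b)        # 2 * 3 / (5 + 5)
```

The two summaries share three tokens out of seven distinct ones, so the Jaccard similarity is 3/7 and the Dice similarity is 0.6.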
The deadline for our MSR abstract is quickly approaching. We have been working on getting what we can written on the paper.
Monday, January 6, 2014
Week 12/30/13
Good News! Our Undergraduate Student Research Grant application was approved! We are still waiting for the official letter, but we received an email the other day that we will receive funding for the full amount requested. This will really help out to defray the cost of equipment, printing, programs, and literature. We are very thankful to Youngstown State University for the grant, and also the CREU program for their continued support of this project!
I was able to write the program to create a dictionary with each duplicate bug as a key. Linked to each key (duplicate bug) is an array containing each bug that is either its master, a duplicate of itself, a duplicate of its master, and so on. This array will then contain the whole group of reports that describe the same software problem. This can be used in our later algorithm to check if any bugs in the top-K similar list are actually duplicates.
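The idea can be sketched with a small union-find structure. This is not the program itself, just a minimal sketch assuming the input is a mapping from each duplicate bug id to the bug it was marked as a duplicate of; the function name and the sample ids are hypothetical.

```python
def build_duplicate_groups(duplicate_of):
    """Map each duplicate bug to the sorted list of all other bugs in its group."""
    parent = {}

    def find(bug):
        # Follow parent links to the group's root, shortening the path as we go.
        parent.setdefault(bug, bug)
        while parent[bug] != bug:
            parent[bug] = parent[parent[bug]]
            bug = parent[bug]
        return bug

    # Merge each duplicate with its master, which also resolves chained duplicates.
    for duplicate, master in duplicate_of.items():
        root_duplicate, root_master = find(duplicate), find(master)
        if root_duplicate != root_master:
            parent[root_duplicate] = root_master

    members = {}
    for bug in list(parent):
        members.setdefault(find(bug), set()).add(bug)

    return {dup: sorted(members[find(dup)] - {dup}) for dup in duplicate_of}


# Illustrative chain: 3 -> 2 -> 1 and 6 -> 1 form one group; 5 -> 4 forms another.
duplicate_of = {2: 1, 3: 2, 5: 4, 6: 1}
groups = build_duplicate_groups(duplicate_of)
```

Here bug 3 is linked to bugs 1, 2, and 6 even though it was only marked as a duplicate of bug 2, which is the chaining behavior described above.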
I also calculated the total number of pairs of duplicates. After finding only the unique groups of duplicates, I calculated the group size choose 2 for each group. The sum of these was then taken to get the total number of duplicate pairs. We can now use this to calculate the commonly used performance measure, recall rate, that was originally proposed by Runeson et al. Although this measure is very popular, it has not been studied extensively or proved to be the best way to measure performance. We would like to look into its validity and possibly propose another measurement technique instead of recall rate.
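The pair count is a short calculation once the unique groups are known. A minimal sketch, assuming each group is a set of bug ids that all describe the same problem (the sample groups are made up):

```python
from math import comb


def count_duplicate_pairs(groups):
    """Sum 'group size choose 2' over the unique duplicate groups."""
    return sum(comb(len(group), 2) for group in groups)


# Illustrative groups: one with four reports and one with two.
unique_groups = [{1, 2, 3, 6}, {4, 5}]
total_pairs = count_duplicate_pairs(unique_groups)  # C(4, 2) + C(2, 2) = 6 + 1
```

Each group must appear only once in the input, otherwise its pairs are counted more than once.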
I then read a master's thesis written by Tomi Prifti titled Duplicate Defect Prediction. Some new things that were shown in this thesis included a study of whether the priority of a bug report affected the number of duplicates it had. Surprisingly, this had no significant effect. He also studied intentional duplicates, or duplicates that are submitted multiple times out of frustration when a problem is not fixed promptly. These types of duplicates actually made up 5% of the total number of duplicates in the data set that he studied. He also implemented a group centroid vector that only considered the most recent 2000 bugs instead of the traditional tf/idf vector measurement. This is something I am certainly going to look into!
I hope everyone had a wonderful winter break and has a wonderful spring semester!
Week 12/16/13 and 12/23/13
This week I applied what I learned last week about Gensim to our test data set (EclipseSmall). I was able to calculate the tf/idf (Term Frequency / inverse Document Frequency) for the 1001 bug reports in EclipseSmall. This involved doing the standard preprocessing (stemming, stop words, etc.) of each bug report, followed by converting each report into a vector, and finally doing the tf/idf calculations.
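To show what the calculation involves, here is a plain-Python sketch of the same pipeline. It does not use Gensim, it skips stemming, its stop word list is a tiny made-up one, and the weighting (raw term count times the natural log of N over document frequency) is only one common variant, so the numbers may differ from what Gensim produces.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop word list; a real run would use a much fuller one.
STOP_WORDS = {"the", "a", "an", "is", "on", "in", "when", "to", "of"}


def preprocess(text):
    """Lowercase, split into alphanumeric tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def tfidf_vectors(documents):
    """Return one {term: weight} dictionary per document."""
    tokenized = [preprocess(document) for document in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    document_frequency = Counter(
        term for tokens in tokenized for term in set(tokens)
    )
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * math.log(n_docs / document_frequency[term])
            for term, count in counts.items()
        })
    return vectors


# Made-up bug report summaries.
reports = [
    "Editor crashes when saving a file",
    "Crash in the editor on save",
    "Toolbar icons missing",
]
vectors = tfidf_vectors(reports)
```

In this example "editor" appears in two of the three reports, so it gets a lower weight than "toolbar", which appears in only one. Without stemming, "crashes" and "crash" are treated as different terms, which is why the stemming step matters.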
Next we want to create the similarity matrix I talked about last week and retrieve the top 5 most similar reports for each bug. We then want to see if any bug that is a duplicate has its master in the list of top 5 similar bug reports. To do this, we need to have a list of duplicate bugs. This is less trivial than it should be, since many bugs have multiple duplicates that are chained, duplicate to master, in various ways. I am working on a program to run through and find all of these duplicates.
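The check described above can be sketched as follows. This is a minimal sketch under my own assumptions: the ranked lists of similar reports already exist, each duplicate bug maps to the set of reports in its duplicate group, and a hit means any report from that group shows up in the top k. All names and sample data are hypothetical.

```python
def fraction_with_match_in_top_k(top_k_lists, duplicate_groups, k=5):
    """Fraction of duplicate bugs with at least one related report in their top k."""
    if not duplicate_groups:
        return 0.0
    hits = 0
    for bug, group in duplicate_groups.items():
        candidates = top_k_lists.get(bug, [])[:k]
        if any(candidate in group for candidate in candidates):
            hits += 1
    return hits / len(duplicate_groups)


# Illustrative data: ranked similar reports and duplicate groups per bug.
top_k_lists = {2: [7, 1, 9], 3: [8, 9, 10], 5: [4, 2, 6]}
duplicate_groups = {2: {1, 3}, 3: {1, 2}, 5: {4}}
hit_fraction = fraction_with_match_in_top_k(top_k_lists, duplicate_groups)
```

Bugs 2 and 5 each have a related report in their lists and bug 3 does not, so two of the three duplicates count as hits.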
We will also need to use some combinatorics to then calculate the number of unique pairs of duplicates. This number is essential to calculate recall rate, which is close to being the current standard measure of duplicate detection performance.
I also read an interesting paper on Measuring the Semantic Similarity of Comments by Dit et al. In this paper, they look at the flow of the conversation in the comment section. If any comments get off topic they can be removed and therefore the conversation will be more readable. This is certainly an interesting application of similarity measurements that I did not think about before.
Week 12/2/13 and 12/9/13
My advisors, the Computer Science and Information Systems Department Chair, and the Dean of STEM have approved our Undergraduate Student Research Grant application. I have now submitted it to the committee for review. It is apparently a tough year, since funding is limited and more applicants than normal have applied.
We still have high hopes, though!
We have not heard anything back from the 2013 National Conference on Undergraduate Research that we applied to present at. I will post updates when I hear them.
I also submitted my Mid Year Report this week.
I downloaded and installed the program Gensim. It is a program created to run different topic models. It is mainly for calculating and analyzing the semantic structure of text. It has several built-in functions that can be used in our Python programs. I was able to run through several tutorials to get acquainted with the program. One very cool thing that I was able to do was to create and iterate over a similarity matrix for a corpus of documents. Many programs are not able to do this, since the size of the matrix grows quadratically with the number of documents in the corpus. Gensim stores data on a computer's disk and thus avoids this potential memory problem.
Another program that we have seen several times in our literature review is Lucene. We may consider working with this program as well.