YSU CREU 2013/2014 Project Blog

Sunday, March 30, 2014

Week 3/17/14 - 3/24/14

This week, I submitted a title and abstract to YSU's QUEST forum for student scholarship. My talk is this Tuesday, 4/1/14. I am very excited to present our research at this venue and have been working hard to create and practice the presentation.

I am also attending the Ohio MAA sectional conference this Friday, where I will present the mathematics that I have learned to accompany this project. I am going to talk about the difference between similarity measures that are metrics and ones that are not. A metric is a distance function that is non-negative, is zero exactly when the two points are the same, is symmetric, and always satisfies the triangle inequality. There is a lot of debate currently going on about whether or not it is important to use metrics. This topic is truly interdisciplinary (it spans Ecology, Biology, Chemistry, Computer Science, and Mathematics). In fact, the book I am currently reading on the subject is all about interpreting ecological data. I also intend to analyze the techniques used in our model to determine whether they are distance preserving or not. If time permits, I will also talk about subadditive and superadditive functions and the role they play in this conundrum.
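As a rough illustration of the metric versus non-metric distinction, the sketch below checks the triangle inequality on three hand-picked points. The function names and the points are assumptions made for this example, and cosine dissimilarity is used only as a familiar non-metric; it is not necessarily one of the measures in our model.

```python
import itertools
import math

def euclidean(x, y):
    """Euclidean distance, a true metric."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_dissimilarity(x, y):
    """1 - cosine similarity; a common similarity-based measure that is not a metric."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)

def violates_triangle_inequality(dist, points, tol=1e-12):
    """Return True if some triple has d(x, z) > d(x, y) + d(y, z)."""
    for x, y, z in itertools.permutations(points, 3):
        if dist(x, z) > dist(x, y) + dist(y, z) + tol:
            return True
    return False

# Hand-picked points (an assumption for illustration only).
points = [(1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

euclidean_fails = violates_triangle_inequality(euclidean, points)
cosine_fails = violates_triangle_inequality(cosine_dissimilarity, points)
print(euclidean_fails, cosine_fails)
```

On these points the Euclidean distance passes the check, while the cosine dissimilarity does not: going from (1, 0) to (0, 1) directly costs 1, but the detour through (1, 1) costs less in total.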

The first very rough draft of our Grace Hopper Celebration poster has been created. Much work is still needed to get this ready for the conference, but we have plenty of time. I chose to create the poster in LaTeX, because I really enjoy typesetting!

Monday, March 17, 2014

Week 3/10/14

Our two papers have been accepted to the 2014 MSR Conference! One is a data paper that explains in detail how our data was collected and preprocessed. The other is a short research paper that explains the progress we have made thus far. We received a list of referee comments for both papers. We will be busy over the next week or so making these changes and getting our papers camera ready. Once these revisions are made, we will resubmit the papers for final approval. We will then travel to Hyderabad, India to present this research!

I also wrote our extended abstract for the GHC poster session. We have already submitted this. I need to get working on our poster now. I am torn between using LaTeX or an office environment to create this poster. I really like LaTeX and the equations will look sharp, but some parts of the formatting can be really finicky.

I also submitted an abstract to the Ohio MAA sectional math fest. In this presentation, I intend to talk more about the underlying math of the project and how it all comes together.

My list of research supplies from the USR Grant has been approved and ordered. I am getting three books that should be very helpful to understand all of the different aspects of the project. I am also getting a two terabyte portable hard drive to store the immense amounts of data needed for our study and a headset for our weekly online meetings.

Monday, March 10, 2014

Week 2/24/14 - 3/3/14

This week, I presented our research at the Regional Pi Mu Epsilon Conference. I gave a 15-minute presentation. I talked about the goals of this research, why it is important for software companies to look into these problems, the basic techniques involved, new features that we considered, and the results obtained. The talk went well. The audience seemed really interested and had a bunch of questions. I really enjoyed attending the conference. I got to see some interesting talks given by fellow students. It was a really great experience.

I also wrote some pseudocode for the programs I previously wrote. We may include this in future papers or just have it on file to look back on. It's really weird converting from actual code to pseudocode. I generally do this in the reverse direction.

I also read a very advanced thesis paper on LDA. I have to admit this was a hard read. It used a lot of probability theory that I have never seen before. I am going to have to do a lot of background reading to get a better sense of what was done.

Since we received the USR grant from Youngstown State University, I am also getting together a list of possible books, equipment, or accessories that will help us in our research.

Sunday, February 23, 2014

Week 2/17/14

This has been such a busy time. We submitted the abstract of our data paper to the MSR conference! The full paper is due on 2/21/14.

We are looking more into the implementation of LDA. I am currently figuring out a program called Vowpal Wabbit that will make the implementation much easier.

I also prepared a 15-minute presentation in Beamer on the math behind our research and the progress we have made so far. I am going to present it at the Pi Mu Epsilon regional conference on Saturday, 2/22/14. This is the first time that I am presenting our research.

Week 2/10/14

We submitted our paper to the short research paper session of MSR. Our results showed a 3-6% increase in accuracy compared to state-of-the-art published results. We are very excited to get some feedback on our work and hope for the best!

We are already pushing ahead to make our model better. We would like to add some additional features. We are considering implementing the LDA algorithm. Currently the largest data set we have experimented on is Mozilla, which initially had 78,236 bugs in it. We would like to run our algorithm on some much larger data sets (at least an order of magnitude larger)!

We would also like to submit a data paper to the MSR conference. This paper will talk about specifically how we collected our data and processed it. They have a specific paper section on this topic.

Week 1/27/14 - 2/3/14

We have finished all of the experiments that we intend to include in our MSR conference paper. We submitted our abstract for the MSR conference. We calculate the similarity scores for 25 different features. 18 of these were adapted from the short text similarity papers and were calculated using Takelab. The other 7 are binary features that are 1 if the feature considered is the same for both bugs and 0 otherwise. These were created under the assumption that two duplicate bugs will be from the same piece of software and share other qualities.
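A minimal sketch of how a binary feature of this kind could be computed is shown here. The field names and values are hypothetical placeholders, not the actual 7 features from our experiments, and treating a missing field as a non-match is an assumption of this example.

```python
def binary_match_features(bug_a, bug_b, fields):
    """Return one 0/1 feature per field: 1 if both bugs have the same value, else 0.

    A field missing from either bug counts as a non-match (an assumption).
    """
    return [
        1 if field in bug_a and field in bug_b and bug_a[field] == bug_b[field] else 0
        for field in fields
    ]

# Hypothetical field names and values, for illustration only.
FIELDS = ["product", "component", "version"]
bug_a = {"product": "Firefox", "component": "Tabs", "version": "27"}
bug_b = {"product": "Firefox", "component": "Bookmarks", "version": "27"}

features = binary_match_features(bug_a, bug_b, FIELDS)
print(features)  # [1, 0, 1]
```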

We then use support vector regression to calibrate our Binary Classification Model (BCM). BCM essentially means we classify each bug as either a duplicate or a non-duplicate. This is more manageable than the ranking problem, which produces a top-K list of the K most similar bugs. We tested three datasets: Eclipse, Open Office, and Mozilla. Our results indicate that this method is better than previously recorded results.
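To make the difference between the two problem settings concrete, here is a small sketch. The scores are made-up numbers standing in for a model's output (they do not come from our support vector regression), and the 0.5 threshold and the bug names are assumptions chosen only for illustration.

```python
def classify_pairs(scores, threshold):
    """Binary view: label each candidate pair as duplicate (True) or not (False)."""
    return {pair: score >= threshold for pair, score in scores.items()}

def top_k(scores, k):
    """Ranking view: return the k highest-scoring candidate pairs, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Made-up similarity scores for a new report paired with three existing ones.
scores = {
    ("new", "bug_17"): 0.91,
    ("new", "bug_42"): 0.35,
    ("new", "bug_88"): 0.62,
}

labels = classify_pairs(scores, threshold=0.5)  # one yes/no decision per pair
ranked = top_k(scores, k=2)                     # an ordered short list instead
print(labels)
print(ranked)
```

The binary view needs only one decision per pair, while the ranking view has to order every candidate before cutting the list off at K.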

Another difference in our model compared to previous work is that we do not keep bugs that are still open. This allows us to avoid training our model on mislabeled data, since we cannot confirm whether these bugs are duplicates or not.
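The filtering step could look something like the sketch below. The `status` field name and the set of statuses treated as "open" are assumptions for this example; the real values depend on how each bug tracker labels its reports.

```python
# Assumed set of "still open" statuses; the real list depends on the tracker.
OPEN_STATUSES = {"UNCONFIRMED", "NEW", "ASSIGNED", "REOPENED"}

def drop_open_bugs(bugs):
    """Keep only bugs whose status is not in the assumed open set."""
    return [bug for bug in bugs if bug["status"] not in OPEN_STATUSES]

# Hypothetical records, for illustration only.
bugs = [
    {"id": 1, "status": "RESOLVED"},
    {"id": 2, "status": "NEW"},
    {"id": 3, "status": "VERIFIED"},
]

kept = drop_open_bugs(bugs)
print([bug["id"] for bug in kept])  # [1, 3]
```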

Our full paper is due on 2/7/14. Therefore we are busy writing up the final draft. I am writing the section on how our data was collected and the properties of the datasets.

Week 1/21/14

We have decided to use Takelab for all of our experiments. Because Takelab was designed to be run on short sentences, we must combine the summary and description of each bug report into one large section of text. We intend to compare all possible pairs of duplicates. Although other researchers did not compare all possible duplicates, our approach is statistically better because we are testing our model on a representative set of data, whereas changing the percentage of duplicates in a dataset introduces other problems.
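A rough sketch of these two preprocessing steps, merging the summary and description and then enumerating every pair, might look like this. The record layout and the sample reports are hypothetical, and the code does not call Takelab itself; it only prepares the text pairs that a similarity tool would take as input.

```python
from itertools import combinations

def combine_text(bug):
    """Join a report's summary and description into one block of text."""
    return f"{bug['summary']} {bug['description']}".strip()

def all_pairs(bugs):
    """Return (id_a, id_b, text_a, text_b) for every unordered pair of reports."""
    return [
        (a["id"], b["id"], combine_text(a), combine_text(b))
        for a, b in combinations(bugs, 2)
    ]

# Hypothetical reports, for illustration only.
bugs = [
    {"id": 1, "summary": "Crash on startup", "description": "App closes right away."},
    {"id": 2, "summary": "Startup crash", "description": "Closes when launched."},
    {"id": 3, "summary": "Toolbar missing", "description": ""},
]

pairs = all_pairs(bugs)
print(len(pairs))  # 3 reports give 3 unordered pairs
```

The number of pairs grows as n(n-1)/2, so comparing every pair becomes expensive quickly as the dataset gets larger.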

Another issue we are considering has to do with stack traces. Oftentimes people will submit the stack trace in the bug report's description field. We have two options: we can either leave in the periods or remove them. We intend to test both methods to see which produces better results, or whether it matters at all.
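The two options can be sketched as follows. The stack trace line is a made-up example, and the optional `replacement` argument (deleting the periods outright versus swapping them for spaces) is an assumption added here to show that "removing" can be done in more than one way.

```python
def remove_periods(text, replacement=""):
    """Replace every period in the text; by default the periods are deleted."""
    return text.replace(".", replacement)

# Made-up stack trace line, for illustration only.
trace_line = "at org.eclipse.ui.Workbench.run(Workbench.java:42)"

kept = trace_line                            # option 1: leave the periods in
deleted = remove_periods(trace_line)         # option 2a: delete them
spaced = remove_periods(trace_line, " ")     # option 2b: turn them into spaces

for variant in (kept, deleted, spaced):
    print(len(variant.split()), variant)
```

Splitting on whitespace shows why the choice could matter: with the periods left in or deleted, the qualified name stays one long token, while replacing them with spaces breaks it into separate words.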

I have also made a summary of the features from each of the papers that we read last week. We can use this to analyze which features were the most commonly used and which features worked the best.

Since we must submit our paper soon, I have also updated the bibliography and made it into a BibTeX file for LaTeX. I have also moved the paper into the MSR LaTeX template and added in the references.