Monday, December 2, 2013

Week 11/25/13

I hope everyone had a nice Thanksgiving Break! I've been doing quite a bit of bookkeeping this week in between Black Friday and Cyber Monday shopping.

First, I submitted an abstract to the 2014 National Conference on Undergraduate Research. If it is accepted, we will be going to the University of Kentucky on April 3 to give an oral presentation. This will be the first opportunity to present this research, so I am very excited!

I also wrote a proposal for the Undergraduate Student Research Grant at my home university. I wrote it in LaTeX, so it is looking pretty sharp. I am going to try to get it approved by my advisors and the chair of the Computer Science and Information Systems department tomorrow. We are trying to get funding for various research supplies. Wish me luck!

I am also working on a painfully detailed procedural list for all the papers we have read thus far. We should be able to work directly off this list to reproduce some of the previous work before fully implementing our new system.

I am making progress on preprocessing our data. I am following along with a tutorial to produce a vector space model that we can apply a similarity measure to.  We will probably start off with cosine similarity, but if I feel ambitious I might implement either the Dice or Jaccard similarity as well.
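To make this concrete, here is a minimal sketch of the kind of vector space model I have in mind, with plain cosine similarity on top. The toy reports, the smoothed IDF weighting, and the function names are my own illustration, not the tutorial's exact method.

    import math
    from collections import Counter

    def tf_idf_vectors(documents):
        """Build a simple TF-IDF vector (stored as a dict) per tokenized document."""
        n_docs = len(documents)
        # Document frequency: how many documents each term appears in.
        df = Counter(term for doc in documents for term in set(doc))
        vectors = []
        for doc in documents:
            tf = Counter(doc)
            # Smoothed IDF so terms shared by all documents keep a small weight.
            vectors.append({term: count * math.log(1 + n_docs / df[term])
                            for term, count in tf.items()})
        return vectors

    def cosine_similarity(u, v):
        """Cosine of the angle between two sparse vectors stored as dicts."""
        dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        if norm_u == 0 or norm_v == 0:
            return 0.0
        return dot / (norm_u * norm_v)

    # Toy example: two bug report summaries, already tokenized.
    reports = [["crash", "on", "startup", "windows"],
               ["crash", "when", "starting", "on", "windows"]]
    u, v = tf_idf_vectors(reports)
    print(cosine_similarity(u, v))  # prints a value between 0 and 1

Dice and Jaccard would only swap out the last function, so adding them later should be easy.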

Have a nice day!

Week 11/11/13 and 11/18/13

I wrote more Python code this week. At this point, my program will go through a data set and find all duplicate bug reports. Once it finds a duplicate, it checks to see whether the master of that bug is also in the data set. If the master is in the set, it adds the duplicate to our new data set. Otherwise, it does not include the report, since it is not logical to include unmastered duplicates in our research. It also adds all singleton bug reports that have no duplicates. We tested it on our small Eclipse data set and actually got a surprising result. Out of the 1001 bug reports in the data set, there were 84 duplicates. My program then determined that of these 84 duplicates, only 22 had masters in the set! This shockingly low number may be due to the fact that we chose such a small chunk of data to test on. We will have to look into this more later. A minimal sketch of the filtering logic is below.

I was also able to begin some of the data preprocessing in Python. I am currently able to tokenize, remove stop words, normalize, stem, lemmatize, and detect n-grams with ease. Creating a vector space model and trying out some similarity measurements should be finished by next week.
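Here is the duplicate-filtering step as a rough sketch. The field names (id, status, master_id) are placeholders for illustration; the real fields differ by repository.

    def filter_reports(reports):
        """Keep singletons, plus duplicates whose master is also in the set."""
        ids_in_set = {report["id"] for report in reports}
        kept = []
        for report in reports:
            if report.get("status") == "duplicate":
                # Keep a duplicate only if its master report is present,
                # since an unmastered duplicate can never be matched.
                if report.get("master_id") in ids_in_set:
                    kept.append(report)
            else:
                # Singleton (non-duplicate) reports are always kept.
                kept.append(report)
        return kept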

I also read an interesting initial report that gave basic statistics on 9 bug repositories (some of which we plan on studying). It found:

  • Percentage of duplicate reports
  • Amount of time spent processing duplicates and non-duplicates
  • How long reporters look for duplicates before submitting a new report
  • Total number of reports submitted daily
  • How many duplicates are submitted by frequent posters 

One surprising statistic said that over 90% of duplicate bug reports are actually submitted by sporadic users. It gave us a few ideas for what to look into.

Monday, November 11, 2013

Week 10/28/13 and 11/04/13


I was able to run through the JSON file containing the small Eclipse data set and read specific lines. This is certainly a step in the right direction, but it may only be useful for our toy set of bug reports. We have come across a slight hitch in our plan to use JSON files to store our bug reports. Because we need to access different parts of a file at different times, it looks as though we will need to load all of the data into memory to work with it. Since we are planning on analyzing data sets containing several hundred thousand bug reports, this is not feasible (especially since the files are a few gigabytes without the comments). Once the comments are included, this will probably be intractable. Therefore, we are going to load the data into a MongoDB database and work with that instead. I am installing PyMongo to adapt the program I have been working on to use this database.
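As a first experiment, something like the sketch below should let us stream the file into MongoDB and then query individual reports without holding everything in memory. The file path, the database and collection names, and the one-JSON-object-per-line layout are all assumptions for illustration.

    import json
    from pymongo import MongoClient

    client = MongoClient()  # assumes a local MongoDB server on the default port
    collection = client["bug_research"]["eclipse_small"]  # placeholder names

    # Stream reports in one at a time instead of loading the whole file.
    # Assumes one JSON object per line; a single giant JSON array would
    # need an incremental parser instead.
    with open("eclipse_small.json") as f:
        for line in f:
            collection.insert_one(json.loads(line))

    # Individual reports can now be fetched by ID without touching the rest.
    report = collection.find_one({"id": 12345})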

The deadline is already coming up for our paper submission. We have started writing the sections we can. We are continuously adding to the bibliography, and I wrote the first draft of the related work section. The introduction is probably next on our list.

Here is a summary of this week's background research.

First, we looked into using character n-grams instead of word n-grams. This approach seems to be very innovative. It is not susceptible to as many natural language issues as word-based systems are. For example, character n-grams are not fazed by misspelled words, shortened words, or words with different endings. These issues require a great deal of preprocessing in other systems, so character n-grams may provide better results with less effort. Another striking property is that character n-grams are not language-dependent, so implementing this system for another language would be trivial compared to other models.
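As a quick illustration, character n-gram profiles take only a few lines. The trigram size and the Dice-style overlap score below are just one plausible choice on my part, not the method from the paper.

    def char_ngrams(text, n=3):
        """Return the set of character n-grams (here trigrams) in a string."""
        text = text.lower()
        return {text[i:i + n] for i in range(len(text) - n + 1)}

    def ngram_overlap(a, b, n=3):
        """Dice-style overlap between the character n-gram sets of two strings."""
        ga, gb = char_ngrams(a, n), char_ngrams(b, n)
        if not ga or not gb:
            return 0.0
        return 2.0 * len(ga & gb) / (len(ga) + len(gb))

    # Misspelled words still share most of their trigrams:
    print(ngram_overlap("application crashes", "aplication crashs"))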

Secondly, we looked into another method of automatically detecting duplicate reports. This method implemented quite a few state-of-the-art techniques that had previously only been used in non-automated systems. One of these is an information retrieval formula for calculating similarity called BM25F. Another novel aspect is a boolean feature that is true if two reports concern the same product. The method also compares the most similar report with the other top-k similar reports, using a threshold to determine whether one report is significantly more similar to a new report than the rest of the top k. It combines these factors and may get results up to 200% more accurate than previous fully automated models.
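For reference, the usual form of BM25F as I understand it from the IR literature (notation varies between papers): term frequencies are first pooled across fields f with per-field weights and length normalization, then passed through a BM25-style saturation.

    \tilde{tf}(t,d) = \sum_{f} w_f \cdot \frac{tf(t,f,d)}{1 - b_f + b_f \, l_{f,d} / \bar{l}_f}

    \mathrm{BM25F}(q,d) = \sum_{t \in q} \frac{\tilde{tf}(t,d)}{k_1 + \tilde{tf}(t,d)} \cdot \mathrm{IDF}(t)

Here tf(t,f,d) is the frequency of term t in field f of document d, l_{f,d} is that field's length, \bar{l}_f is its average length across documents, and w_f, b_f, and k_1 are tuning parameters.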

We need to make some decisions this week about the specifics of our model. We are currently considering a few different directions. A major decision will be which similarity measurement we are going to implement. We are also looking into what set of features we would like to consider, as well as possibly an additional data set that has not been included in other research experiments.

That's all for today. Enjoy your Veterans Day!

Saturday, October 26, 2013

Week 10/14/13 and 10/21/13


We have gathered several hundred thousand bug reports from Eclipse, OpenOffice, Firefox, and Android. The data is currently stored in both JSON and CSV files. We separated the data into several sets of variable size by the system the bug reports come from. There are two files for each particular set: one with comments and one without comments. We have not run across any papers that included the comments in any duplicate detection. We would like to add this new field to our model, but more tests must be done first. We also have a small set of about 1000 Eclipse bug reports to test any methods on before moving to the much larger sets.

We have started preprocessing the data. First, we had to create another field for each duplicate report. This field contains the identification number of the master bug report (the original report of the bug that is also described in the duplicate report). I am now writing a Python program to confirm that the master report is in the data set. If the master report is not in the data set, we intend to either remove the duplicate status from that particular report or remove the report altogether from the set. Removing the duplicate label is logical because, when our program is applied to a full system of bug reports, all reports will be included. Presumably, the master report would then be in the data, or else the report would never have been marked as a duplicate. Therefore, changing the status only removes duplicates whose masters are impossible to find.

We have continued to read more literature. Some researchers have grouped reports based on key words that link the report to a specific feature of the system. We looked into applying either LDA (Latent Dirichlet Allocation) or Labeled LDA. Although Labeled LDA provides slightly better results (because reports must be labeled by hand), LDA takes 60 times less time by automating the process. LDA is another possible feature we would like to incorporate into our model.
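If we go the LDA route, the gensim library would make a first experiment short. A minimal sketch, where the token lists and the topic count are placeholders:

    from gensim import corpora, models

    # Placeholder: tokenized bug report summaries.
    docs = [["crash", "startup", "windows"],
            ["ui", "button", "misaligned"],
            ["crash", "segfault", "startup"]]

    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]

    # Fit a tiny topic model; num_topics would need real tuning.
    lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary)

    # The topic distribution of a new report could serve as a feature vector.
    print(lda[dictionary.doc2bow(["crash", "on", "startup"])])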

Monday, October 14, 2013

Week 10/7/13

Hello Folks!

We are starting to collect our first data set from the Android bug repository. We would like to incorporate comments on defect reports in our model, as well as the conventional descriptive and categorical data about the bug and product. A significant amount of effort is required to extract this specific information in a form that will be convenient to use later. We are currently working out these challenges.

We are continuously doing background research to get new ideas and see what others have done. I have been updating a short summary table and have also created detailed summaries of each paper. An interesting possible similarity feature from this week's research categorizes bug reports based on the probable context of the bug. This is done using pre-made lists of words developed by people with in-depth knowledge of software development and of the specific repository used.
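A small sketch of how that could work: score a report against each hand-built context word list and pick the best match. The lists below are invented examples, not the ones from the paper.

    # Invented example word lists; the real ones come from domain experts.
    CONTEXT_WORDS = {
        "concurrency": {"deadlock", "thread", "race", "lock"},
        "ui": {"button", "dialog", "window", "render"},
    }

    def likely_context(tokens):
        """Pick the context whose word list overlaps the report the most."""
        scores = {name: len(words & set(tokens))
                  for name, words in CONTEXT_WORDS.items()}
        return max(scores, key=scores.get)

    print(likely_context(["app", "deadlock", "on", "background", "thread"]))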

Everyone is really excited about the possibility of submitting our future research to the MSR 2014 conference on Mining Software Repositories. There are several possible opportunities for our team. They have both a research paper track and a data collection track. They also hold a data mining competition that would be very cool to participate in. Participants from all over the world are given the same data set to analyze and report on. The teams with the top reports actually get to present their data at the conference. This would be a great experience for everyone.

Sunday, October 6, 2013

Week 9/30/13

Good news, this blog is officially up to date!

This week, I created a table to summarize all of the background research we have completed so far. This table includes the data set used for each experiment, specifics on each research team's mathematical model, and the results that were acquired. Because many of the papers gave methods or results that were dependent on other papers, this table will be very useful when we make decisions on our own model.

I did more background research on getting a more accurate retrieval of duplicate bug reports. This method uses a popular equation called BM25F, which takes advantage of both global word usage and local word frequencies, instead of the very common cosine similarity measurement.

In addition to watching more tutorials on web scraping using XPath and processing data using RapidMiner, I have installed Scrapy (a popular web scraping tool) and started working through some tutorials.
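To get a feel for Scrapy, a first toy spider might look something like this. The spider name, URL, and XPath expression are placeholders; the real Bugzilla pages will need their own selectors.

    import scrapy

    class BugSpider(scrapy.Spider):
        # Placeholder name and URL, for illustration only.
        name = "bugs"
        start_urls = ["https://bugzilla.example.org/buglist.cgi"]

        def parse(self, response):
            # Pull the text of each report summary with an XPath selector.
            for summary in response.xpath("//td[@class='summary']/a/text()"):
                yield {"summary": summary.get()}

Running it with scrapy runspider and an output file flag would dump the scraped items to disk.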

Thanks for reading!

Week 9/23/13

Hello,

This week, I read a paper on responsible conduct in research. This is in preparation for collecting data for our first round of experiments.

I also looked into another possible method of classifying duplicate bugs. This method creates a discriminative model that compares duplicates and non-duplicates to calculate the probability that a new bug report is in fact a duplicate. The model also constantly updates its coefficients to reflect new data in an ever-changing corpus of reports.
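I imagine the flavor of such a model being something like this logistic regression sketch with scikit-learn, trained on similarity features of report pairs. The features and the tiny data set are invented placeholders, not the paper's actual setup.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Invented features per report pair:
    # [title_similarity, description_similarity, same_product].
    X = np.array([[0.9, 0.8, 1],   # duplicate pair
                  [0.2, 0.1, 0],   # unrelated pair
                  [0.7, 0.6, 1],   # duplicate pair
                  [0.3, 0.4, 0]])  # unrelated pair
    y = np.array([1, 0, 1, 0])     # 1 = duplicate, 0 = not

    model = LogisticRegression().fit(X, y)

    # Probability that a new pair of reports is a duplicate.
    print(model.predict_proba([[0.8, 0.7, 1]])[0, 1])

Refitting on fresh pairs as reports arrive would mimic the coefficient updates described above.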

Finally, I have been watching tutorial videos on web scraping and using XPath. These tools will be very useful for collecting the large data set we plan to gather.




Sunday, September 29, 2013

Week 9/16/13

Hello everyone!

We have been working with RapidMiner. We started off processing different types of data, like large text files and Excel worksheets. We also calculated similarity measurements for each document.

Most likely, we will examine Firefox, Eclipse, and OpenOffice bugs. These are tracked on Bugzilla. There is not an automated way to get the bug data from a certain time period off of this site, so we are currently looking for an efficient method.

We also looked at a few models that use clustering of data to predict duplicates automatically. This type of model shows promise. It can be scaled to handle the massive number of bug reports in a system, and it is efficient enough for real-time duplicate detection.


Week 9/2/13 and 9/9/13

I hope everyone had a nice Labor Day!

We have been very busy. We looked into various methods of calculating text similarity. Some of these included Cosine, Dice, and Jaccard measures.
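For reference, here are the standard set-based forms of those measures, where A and B are the sets of terms in two documents (in the vector form of cosine, a dot product is divided by the product of the vector norms instead):

    \mathrm{Jaccard}(A,B) = \frac{|A \cap B|}{|A \cup B|} \qquad
    \mathrm{Dice}(A,B) = \frac{2\,|A \cap B|}{|A| + |B|} \qquad
    \mathrm{Cosine}(A,B) = \frac{|A \cap B|}{\sqrt{|A|\,|B|}}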

Other researchers have used execution data from bug reports in the past. We did research on possibly creating this execution data for each error and comparing the similarity of the commands used alongside the natural language data.

We also started to use RapidMiner to parse and process text and HTML documents.

Wednesday, September 4, 2013

Starting up 2013!

Here is the first blog post for the fall 2013 CREU project at Youngstown State University. So far, we have been busy doing preliminary research. We may apply an interesting mathematical concept, similarity indices, to the practical application of duplicate bug detection in software development. We are also testing out some data mining software to run experiments with. This year looks very promising, and we are very excited to get started!