Monday, November 11, 2013

Weeks of 10/28/13 and 11/04/13


I was able to run through the JSON file containing the small Eclipse data set and read specific lines. This is certainly a step in the right direction, but it may only be useful for our toy set of bug reports. We have come across a slight hitch in our plan to use JSON files to store our bug reports. Because we need to access different parts of the file at different times, it looks as though we would need to load all of the data into memory to work with it. Since we are planning on analyzing data sets containing several hundred thousand bug reports, this is not feasible (especially since the files are already a few gigabytes without the comments). Once the comments are included, it will probably be intractable. Therefore we are going to load the data into a MongoDB database and work with that instead. I am installing pymongo to adapt the program I have been working on to use this database.
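Here is a minimal sketch of what the pymongo version might look like. The database, collection, field, and file names are all placeholders, and I'm assuming the toy Eclipse data is a JSON array of report objects that each carry a "bug_id" field:

```python
import json
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
collection = client["bug_reports"]["eclipse"]  # placeholder database/collection names

# One-time load: put the reports in MongoDB so later lookups don't need the
# whole file in memory at once.
with open("eclipse_small.json") as f:          # placeholder file name
    reports = json.load(f)
collection.insert_many(reports)
collection.create_index("bug_id")              # assumed ID field

# Afterwards, individual reports can be fetched on demand.
report = collection.find_one({"bug_id": 12345})
print(report["summary"] if report else "not found")
```

The point is that after the one-time load, lookups hit an index on disk instead of requiring the whole data set in memory.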

The deadline for our paper submission is already coming up, so we have started writing the sections we can. We are continuously adding to the bibliography, and I wrote the first draft of the related work section. The introduction is probably next on our list.

Here is a summary of this week's background research.

First we looked into using character n-grams instead of word n-grams. This approach seems very promising: it is not susceptible to as many of the natural-language issues that word-based systems are. For example, character n-grams are not fazed by misspelled words, shortened words, or words with different endings. These issues require a great deal of preprocessing in other systems, so character n-grams may provide better results with less effort. Another notable property is that character n-grams are not language dependent, so implementing this system for another language would be trivial compared to other models.
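To illustrate why misspellings and different endings matter so little, here is a quick sketch of character n-gram extraction (the Jaccard overlap at the end is just one simple similarity choice, not necessarily the one we will use):

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams in a piece of text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

a = char_ngrams("application crashes on startup")
b = char_ngrams("aplication crashed on start-up")  # misspelled, different ending

# Despite the differences, most of the n-grams still match.
overlap = len(a & b) / len(a | b)
print(round(overlap, 2))
```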

Secondly we looked into another method of automatically detecting duplicate reports. This method implemented quite a few state-of-the-art techniques that had previously only been used in non-automated systems. One of these is BM25F, an information retrieval formula for calculating similarity. Another novel aspect is a boolean feature that is true if two reports were filed against the same product. The method also compares the most similar report with the other top-k similar reports, and provides a threshold to determine whether a report is significantly more similar to the new report than the rest of the top k. It combines these factors and may get results up to 200% more accurate than previous fully automated models.
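As a very rough sketch of how those extra signals could be combined (this is not the paper's exact formula; the similarity function below is only a placeholder for BM25F, and the field names, weight, and threshold are made up):

```python
def duplicate_score(new_report, candidate, text_sim, product_weight=0.2):
    """Combine a textual similarity score with the same-product boolean feature."""
    score = text_sim(new_report["text"], candidate["text"])
    if new_report["product"] == candidate["product"]:
        score += product_weight          # made-up weight for the boolean feature
    return score

def stands_out(scores, threshold=0.15):
    """Is the top candidate significantly more similar than the next k candidates?"""
    top, rest = scores[0], scores[1:]
    avg_rest = sum(rest) / len(rest) if rest else 0.0
    return top - avg_rest > threshold
```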

We need to make some decisions this week about the specifics of our model. We are currently considering a few different directions. A major decision will be which similarity measurement to implement. We are also deciding which set of features to consider, and whether to include an additional data set that has not been used in other research experiments.

That's all for today, enjoy your Veterans Day!