This week I applied what I learned last week about Gensim to our test data set (EclipseSmall). I was able to calculate the tf-idf (term frequency-inverse document frequency) weights for the 1001 bug reports in EclipseSmall. This involved the standard preprocessing (stemming, stop-word removal, etc.) of each bug report, followed by converting each report into a vector, and finally computing the tf-idf weights.
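As a rough illustration of those steps, here is a minimal sketch in plain Python rather than Gensim itself. The stop-word list, the function names, and the exact weighting (raw term count times log2(N/df), then length-normalized) are assumptions chosen for the example, and stemming is left out to keep it short.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "on", "in", "when", "to", "of", "and"}


def preprocess(text):
    """Lowercase, keep alphabetic tokens, and drop stop words (no stemming here)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf_vectors(documents):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [preprocess(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * math.log2(n_docs / doc_freq[t]) for t, c in counts.items()}
        # Normalize to unit length so long reports do not dominate.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm == 0:
            vectors.append({})
        else:
            vectors.append({t: w / norm for t, w in weights.items() if w > 0})
    return vectors
```

Calling `tfidf_vectors` on a list of report texts gives one sparse vector per report. Terms that occur in every report get an idf of zero and drop out, which is the behaviour we want for words that carry no distinguishing information.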
Next we want to create the similarity matrix I talked about last week and retrieve the top 5 most similar reports for each bug. We then want to see whether each bug that is a duplicate has its master in its list of top 5 similar bug reports. To do this, we need a list of duplicate bugs. This is less trivial than it should be, since many bugs have multiple duplicates, and these can be chained from duplicate to master in various ways. I am working on a program to run through and find all of these duplicates.
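One possible way to untangle the chains, sketched here under the assumption that the raw data can be read into a dictionary mapping each duplicate's id to the id it is marked a duplicate of (the name `duplicate_of` and the bug ids are made up for the example):

```python
def resolve_masters(duplicate_of):
    """Map every duplicate to the report at the end of its duplicate chain."""
    resolved = {}
    for bug in duplicate_of:
        seen = {bug}
        current = bug
        # Follow duplicate -> master links until we reach a non-duplicate.
        while current in duplicate_of:
            current = duplicate_of[current]
            if current in seen:
                # Malformed data (a cycle): stop instead of looping forever.
                break
            seen.add(current)
        resolved[bug] = current
    return resolved


def group_duplicates(duplicate_of):
    """Group each final master together with all of its direct and chained duplicates."""
    groups = {}
    for bug, master in resolve_masters(duplicate_of).items():
        groups.setdefault(master, {master}).add(bug)
    return groups
```

With this, a chain such as 103 → 102 → 101 collapses so that both 102 and 103 end up in the group of master 101.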
We will also need some combinatorics to calculate the set of unique pairs of duplicates. This number is essential for calculating the recall rate, which is close to being the current standard measure of duplicate detection performance.
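A small sketch of both pieces follows. The pair counting uses `itertools.combinations`; the recall function assumes one common definition (the fraction of duplicates whose master shows up in their top-k list), which may not be exactly the variant we end up using. All names and ids are hypothetical.

```python
from itertools import combinations


def unique_duplicate_pairs(groups):
    """All unordered pairs of reports that belong to the same duplicate group."""
    pairs = set()
    for members in groups:
        # Sorting makes (a, b) and (b, a) collapse into the same tuple.
        pairs.update(combinations(sorted(members), 2))
    return pairs


def recall_rate(top_k, master_of):
    """Fraction of duplicates whose master appears in their top-k similar list."""
    if not master_of:
        return 0.0
    hits = sum(
        1 for bug, master in master_of.items() if master in top_k.get(bug, [])
    )
    return hits / len(master_of)
```

A group of n reports contributes n(n-1)/2 pairs, so a group of four duplicates of each other yields six pairs.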
I also read an interesting paper, "Measuring the Semantic Similarity of Comments" by Dit et al. In this paper, they look at the flow of the conversation in the comment section. Comments that drift off topic can be removed, which makes the conversation more readable. This is certainly an interesting application of similarity measurements that I had not thought about before.