Benyamin Langgu Sinaga
Universiti Teknikal Malaysia Melaka

Published: 2 Documents

Articles

A recommendation system of training data selection method for cross-project defect prediction
Benyamin Langgu Sinaga; Sabrina Ahmad; Zuraida Abal Abas; Intan Ermahani A. Jalil
Indonesian Journal of Electrical Engineering and Computer Science Vol 27, No 2: August 2022
Publisher: Institute of Advanced Engineering and Science

DOI: 10.11591/ijeecs.v27.i2.pp990-1006

Abstract

Cross-project defect prediction (CPDP) has become a popular approach to address the scarcity of historical data when building a defect prediction model. However, directly learning a prediction model from cross-project datasets yields unsatisfactory predictive performance, so the selection of training data is essential. Many studies have examined the effectiveness of training data selection methods, and the best-performing method varies across datasets. Since no method consistently outperforms the others on all datasets, predicting the best method for a specific dataset becomes important. This study proposed a recommendation system that selects the most suitable training data selection method in the CPDP setting. We evaluated the proposed system using 44 datasets, 13 training data selection methods, and six classification algorithms. The findings show that the recommendation system effectively recommends the best method for selecting training data.
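The abstract does not spell out how the recommendation is made, so the sketch below is one plausible realization under an assumed meta-learning design: dataset-level meta-features are extracted, and a classifier trained on historical datasets predicts which selection method will perform best. The helper extract_meta_features, the candidate method names, and the random forest meta-model are all illustrative assumptions, not the paper's implementation.

```python
# Hypothetical meta-learning recommender for CPDP training data selection.
# All names below are illustrative assumptions, not the paper's code.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Assumed subset of candidate training data selection methods.
CANDIDATE_METHODS = ["nn_filter", "peters_filter", "random_sampling"]

def extract_meta_features(X, y):
    """Dataset-level meta-features: size, dimensionality, minority share,
    and mean feature spread (a deliberately small, assumed feature set)."""
    n, d = X.shape
    minority_share = min(np.mean(y), 1.0 - np.mean(y))
    return np.array([n, d, minority_share, X.std(axis=0).mean()])

class MethodRecommender:
    """Maps dataset meta-features to the selection method that performed
    best on historical datasets, then recommends a method for a new one."""

    def __init__(self):
        self.model = RandomForestClassifier(n_estimators=200, random_state=0)

    def fit(self, datasets, best_methods):
        # datasets: list of (X, y) pairs; best_methods: winner per dataset.
        features = np.vstack([extract_meta_features(X, y) for X, y in datasets])
        self.model.fit(features, best_methods)
        return self

    def recommend(self, X, y):
        return self.model.predict(extract_meta_features(X, y).reshape(1, -1))[0]
```

Framing method selection as a supervised meta-learning problem matches the abstract's goal of predicting the best method for a specific dataset, but the concrete meta-features and meta-model here are guesses.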
The impact of training data selection on the software defect prediction performance and data complexity
Benyamin Langgu Sinaga; Sabrina Ahmad; Zuraida Abal Abas; Antasena Wahyu Anggarajati
Bulletin of Electrical Engineering and Informatics Vol 11, No 5: October 2022
Publisher: Institute of Advanced Engineering and Science

DOI: 10.11591/eei.v11i5.3698

Abstract

Directly learning a defect prediction model from cross-project datasets results in a model with poor performance; hence, training data selection is a feasible solution to this problem. The few comparative studies that investigated the effect of training data selection on prediction performance have presented contradictory results, and they did not analyze why a training data selection method underperforms. This study investigates the impact of training data selection on the defect prediction model and on data complexity measures. The method is based on an empirical comparison of prediction performance and data complexity measures before and after selection. We compared 13 training data selection methods on 61 projects using six classification algorithms, and measured data complexity using six measures focusing on class overlap, noise level, and class imbalance ratio. Experimental results indicate that the best method varies with the dataset and the classifier, and that training data selection most strongly affects the noise rate and class imbalance. We conclude that carefully choosing the training data selection method can improve the performance of the prediction model, and we recommend addressing noise and imbalanced classes when designing training data selection methods.
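As a concrete illustration of the before/after comparison the abstract describes, the Python sketch below computes two of the mentioned complexity dimensions (class imbalance ratio and a k-NN proxy for noise rate) around a Burak/Turhan-style nearest-neighbour filter, used here as a stand-in for the 13 selection methods actually compared; the function names and parameter choices are assumptions, not the study's code.

```python
# Sketch: measure data complexity before and after one training data
# selection method (a nearest-neighbour filter). Not the paper's code.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def imbalance_ratio(y):
    """Majority-to-minority class ratio; 1.0 means perfectly balanced."""
    counts = np.bincount(y)
    return counts.max() / max(counts.min(), 1)

def knn_noise_rate(X, y, k=5):
    """Fraction of instances whose label disagrees with the majority of
    their k nearest neighbours -- a rough proxy for label noise."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)                 # column 0 is the point itself
    neighbour_labels = y[idx[:, 1:]]
    majority = (neighbour_labels.mean(axis=1) >= 0.5).astype(int)
    return float(np.mean(majority != y))

def nn_filter(X_source, y_source, X_target, k=10):
    """Nearest-neighbour training data filter: keep the k source instances
    closest to each target instance (one common CPDP selection method)."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_source)
    _, idx = nn.kneighbors(X_target)
    keep = np.unique(idx.ravel())
    return X_source[keep], y_source[keep]

# Usage (Xs, ys: cross-project source data; Xt: target project features):
# Xs_f, ys_f = nn_filter(Xs, ys, Xt)
# print(imbalance_ratio(ys), "->", imbalance_ratio(ys_f))
# print(knn_noise_rate(Xs, ys), "->", knn_noise_rate(Xs_f, ys_f))
```

Comparing the two printed pairs shows how a single selection step shifts the noise and imbalance profile of the training set, which is the kind of before/after evidence the study's comparison relies on.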