Journal: JURNAL MEDIA INFORMATIKA BUDIDARMA

Pengaruh Distribusi Panjang Data Teks pada Klasifikasi: Sebuah Studi Awal (The Effect of Text Data Length Distribution on Classification: A Preliminary Study)
Said Al Faraby; Ade Romadhony
JURNAL MEDIA INFORMATIKA BUDIDARMA Vol 6, No 3 (2022): Juli 2022
Publisher : STMIK Budi Darma

DOI: 10.30865/mib.v6i3.4259

Abstract

In text classification, the data used to train a model often differs in domain from the data the model encounters when it is applied (the cross-domain problem). The two may also differ in language (the cross-lingual problem). Many previous studies have sought ways to apply classification models effectively and efficiently in these cross-domain and cross-lingual settings. One difference, however, has received little dedicated attention because it is assumed to have little influence: the difference in text length (cross-length). In this study, we investigate the cross-length condition by constructing a dedicated dataset and testing it with several commonly used classification models. The results show that a difference in text-length distribution between the training data and the test data can affect classification performance. Averaged across all models, cross-length transfer from long to short texts lowers the F1-score by 14%, while transfer from short to long texts lowers it by 9%.
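
The cross-length setup described in the abstract can be illustrated with a rough sketch. This is not the authors' code: the abstract does not say how the dataset was split or how the F1 decrease was computed, so the median-based length threshold, the whitespace tokenization, the macro averaging, and the absolute (rather than relative) drop are all assumptions, and the function names `split_by_length`, `macro_f1`, and `f1_drop` are hypothetical.

```python
from statistics import median


def split_by_length(texts, labels, threshold=None):
    """Partition (text, label) pairs into short and long buckets.

    Length is the whitespace token count (an assumption). If no threshold
    is given, the median length of the input is used (also an assumption).
    """
    lengths = [len(t.split()) for t in texts]
    if threshold is None:
        threshold = median(lengths)
    short = [(t, y) for t, y, n in zip(texts, labels, lengths) if n <= threshold]
    long_ = [(t, y) for t, y, n in zip(texts, labels, lengths) if n > threshold]
    return short, long_


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def f1_drop(in_length_f1, cross_length_f1):
    """Absolute F1 decrease when train and test lengths differ (assumed definition)."""
    return in_length_f1 - cross_length_f1
```

Under these assumptions, a long-to-short experiment would train any classifier on the long bucket, score it once on held-out long texts and once on the short bucket with `macro_f1`, and pass the two scores to `f1_drop`; swapping the buckets gives the short-to-long direction.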