Siti Khomsah
Institut Teknologi Telkom Purwokerto

Published: 2 Documents
Articles

Cross-domain sentiment analysis model on Indonesian YouTube comment
Agus Sasmito Aribowo; Halizah Basiron; Noor Fazilla Abd Yusof; Siti Khomsah
International Journal of Advances in Intelligent Informatics Vol 7, No 1 (2021): March 2021
Publisher : Universitas Ahmad Dahlan

DOI: 10.26555/ijain.v7i1.554

Abstract

Cross-domain sentiment analysis (CDSA) of Indonesian-language text with tree-based ensemble machine learning is a topic of interest. CDSA can support the labeling of sentiment across domains and reduce dependence on experts; however, how it behaves on opinions made unstructured by stop words, language expressions, and Indonesian slang words has not yet been identified. This study aimed to obtain the best CDSA model for opinions in Indonesian, which are commonly full of stop words and slang words in Indonesian dialects. It also set out to observe the benefits of stop word cleaning and slang word conversion for CDSA in Indonesian and to find out which machine learning method suits this model. The study started by crawling five datasets of YouTube comments from five different domains. The datasets were copied into two groups: one without any stop word cleaning or slang word conversion, and one with both applied. A CDSA model was built for each dataset group and then tested using two types of tree-based ensemble machine learning, i.e., the Random Forest (RF) and Extra Tree (ET) classifiers, and, as a comparison, three types of non-ensemble machine learning: Naïve Bayes (NB), SVM, and Decision Tree (DT). The results suggest that the accuracy of CDSA in Indonesian increased when stop words were removed and slang words were converted. The best classifier was built with tree-based ensemble machine learning, particularly ET, which achieved the highest accuracy in this study, 91.19%. This model is expected to serve as an alternative CDSA technique for the Indonesian language.
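
The two dataset groups described in the abstract can be illustrated with a minimal Python sketch. The stop word list, the slang dictionary, the sample comments, and all function names below are hypothetical placeholders; the abstract does not give the study's actual resources or code.

```python
import re

# Hypothetical, tiny word lists for illustration only; the study's actual
# stop word list and slang dictionary are not given in the abstract.
STOP_WORDS = {"yang", "dan", "di", "ini", "itu"}
SLANG_TO_STANDARD = {"gak": "tidak", "bgt": "sangat", "yg": "yang"}


def tokenize(comment):
    """Lowercase a comment and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", comment.lower())


def clean(comment):
    """Convert slang words to standard forms, then drop stop words."""
    tokens = [SLANG_TO_STANDARD.get(tok, tok) for tok in tokenize(comment)]
    return [tok for tok in tokens if tok not in STOP_WORDS]


def build_groups(comments):
    """Return (raw_group, cleaned_group) as lists of token lists."""
    raw_group = [tokenize(c) for c in comments]
    cleaned_group = [clean(c) for c in comments]
    return raw_group, cleaned_group


# Made-up sample comments, not taken from the study's datasets.
comments = ["Videonya bagus bgt dan lucu", "Gak suka yg ini"]
raw_group, cleaned_group = build_groups(comments)
```

Converting slang before removing stop words is a choice made for this sketch: a slang form such as "yg" only matches the stop word list once it has been normalized to "yang".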

Model Text-Preprocessing Komentar Youtube Dalam Bahasa Indonesia [Text-Preprocessing Model for YouTube Comments in Indonesian]
Siti Khomsah; Agus Sasmito Aribowo
Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi) Vol 4 No 4 (2020): Agustus 2020
Publisher : Ikatan Ahli Informatika Indonesia (IAII)

DOI: 10.29207/resti.v4i4.2035

Abstract

YouTube is the most widely used platform in Indonesia, reaching 88% of the country's internet users. The volume of Indonesian-language YouTube comments produced by users has increased massively, and these datasets can be used to examine the polarization of public opinion on government policies. The main challenge in opinion analysis is preprocessing, especially normalizing noise such as stop words and slang words. This research aims to devise several preprocessing models for the YouTube comment dataset and then observe their effect on the accuracy of sentiment analysis. The types of preprocessing used include standard Indonesian text processing, deleting stop words and subjects or objects, and converting slang according to the Indonesian Dictionary (KBBI). Four preprocessing scenarios are designed to see the impact of each type of preprocessing on the accuracy of the model. The investigation uses two feature sets: unigrams and a combination of unigrams and bigrams. Count-Vectorizer and TF-IDF-Vectorizer are used to extract features. The experiments show that unigram features perform better than the combination of unigram and bigram features. Converting slang words to standard words raises the accuracy of the model, and removing stop words also contributes to increasing accuracy. In conclusion, the combination of standard preprocessing, stop word removal, and conversion of Indonesian slang to common words based on the KBBI raises accuracy by almost 3.5% with unigram features.
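
The difference between the two feature sets compared in the abstract can be illustrated with a minimal Python sketch. It counts n-grams in plain Python and does not reproduce the Count-Vectorizer or TF-IDF-Vectorizer used in the study; the function names and the sample tokens are assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_features(tokens, max_n):
    """Count every n-gram for n = 1..max_n in one token list."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


# Made-up, already preprocessed comment.
tokens = ["tidak", "suka", "video", "ini"]
unigram = count_features(tokens, 1)      # unigram features only
uni_bigram = count_features(tokens, 2)   # unigram and bigram features combined
```

For a comment of four distinct tokens, the combined set adds three bigram features to the four unigram features, so the feature space grows quickly as bigrams are included.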