Journal of Applied Data Sciences
Vol 5, No 1: JANUARY 2024

Active learning on Indonesian Twitter sentiment analysis using uncertainty sampling

Muhaza Liebenlito (Department of Mathematics, Faculty of Science and Technology, UIN Syarif Hidayatullah Jakarta)
Nur Inayah (Department of Mathematics, Faculty of Science and Technology, UIN Syarif Hidayatullah Jakarta)
Esti Choerunnisa (Department of Mathematics, Faculty of Science and Technology, UIN Syarif Hidayatullah Jakarta)
Taufik Edy Sutanto (Department of Mathematics, Faculty of Science and Technology, UIN Syarif Hidayatullah Jakarta)
Suma Inna (Department of Mathematics, Faculty of Science and Technology, UIN Syarif Hidayatullah Jakarta)



Article Info

Publish Date
29 Jan 2024

Abstract

Sentiment analysis of social media is a rapidly developing research area. It is typically framed as a supervised learning task, which requires annotated data, and the annotation process is notoriously time-consuming. Active learning is an effective strategy for reducing this cost: only a small subset of the dataset is labeled at the start, and a sampling strategy selects which of the remaining examples to annotate next. This study compares two active learning strategies, random sampling and margin sampling, applied to two machine learning models, logistic regression and random forest. We also evaluate the model performance and the data savings achieved by these strategies in the context of traditional machine learning for sentiment analysis on Twitter. The dataset has two labels: positive and negative sentiment. The results show that active learning can significantly reduce the amount of training data required, saving up to 65% of the total training data needed to reach peak model accuracy. The best model is a random forest with the margin sampling strategy, which yields an accuracy of 81.12% and an F1 score of 88.60%. These results underscore the viability of active learning, particularly the combination of random forest with margin sampling, for more resource-efficient sentiment analysis in social media.
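The two query strategies named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the batch size, and the probability values below are hypothetical, and the sketch assumes a classifier that outputs class probabilities for each unlabeled tweet. Margin sampling picks the examples whose two highest class probabilities are closest together, while random sampling picks examples without regard to the model.

```python
import random


def margin_scores(probabilities):
    """Margin per sample: gap between the two highest class probabilities."""
    scores = []
    for row in probabilities:
        top_two = sorted(row, reverse=True)[:2]
        scores.append(top_two[0] - top_two[1])
    return scores


def margin_sampling(probabilities, batch_size):
    """Return indices of the batch_size samples with the smallest margins."""
    scores = margin_scores(probabilities)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return ranked[:batch_size]


def random_sampling(pool_size, batch_size, seed=0):
    """Baseline: pick batch_size distinct indices uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(range(pool_size), batch_size)


# Hypothetical predicted probabilities [P(negative), P(positive)]
# for five unlabeled tweets (illustrative values, not from the study).
pool_probabilities = [
    [0.95, 0.05],
    [0.52, 0.48],
    [0.30, 0.70],
    [0.45, 0.55],
    [0.10, 0.90],
]

# The two tweets the model is least sure about are sent for annotation.
query_indices = margin_sampling(pool_probabilities, batch_size=2)
print(query_indices)  # [1, 3]

# The random baseline ignores the model's predictions entirely.
baseline_indices = random_sampling(len(pool_probabilities), batch_size=2)
```

In a full active learning loop, the selected examples would be labeled, added to the training set, and the model retrained before the next query round.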

Copyrights © 2024






Journal Info

Abbrev

JADS

Subject

Computer Science & IT
Control & Systems Engineering
Decision Sciences, Operations Research & Management

Description

One of the current hot topics in science is data: how can datasets be used in scientific and scholarly research in a more reliable, citable and accountable way? Data is of paramount importance to scientific progress, yet most research data remains private. Enhancing the transparency of the processes ...