Seminar Nasional Informatika (SEMNASIF)
Vol 1, No 1 (2010): Computational

DETEKSI BAHASA UNTUK DOKUMEN TEKS BERBAHASA INDONESIA (LANGUAGE DETECTION FOR INDONESIAN-LANGUAGE TEXT DOCUMENTS)

Amir Hamzah (Unknown)



Article Info

Publish Date
30 Jul 2015

Abstract

In a multilingual corpus such as the Internet, information retrieval systems face a difficulty: a single query can return documents in a mixture of languages that does not match the user's need. One approach to this problem is to design a cross-language search engine. However, that solution is unnecessary for users who want results in only one language, such as Bahasa Indonesia. In that case, the solution is to design a search engine for a specific language. When building a language-specific search engine for a multilingual environment, a critical step is detecting the language of each document being analyzed. This research compared several N-gram-based language detection methods, namely unigram, bigram, and trigram. The test collections comprised several news collections in Bahasa Indonesia ranging from 100 to 3,000 documents, two academic document collections of 88 and 450 documents, and two English collections, one of abstracts and one of full papers, with 40 documents each. The results showed that unigrams, bigrams, and trigrams were all good parameters for detecting the language of documents. Among these methods, bigram was the best in terms of time complexity and accuracy.
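
The abstract does not say how the N-gram statistics were built or compared, so the following is only a minimal sketch of one common way to do N-gram language detection. It assumes character-level n-grams, relative-frequency profiles, and cosine similarity against one reference profile per language. The names `ngram_profile`, `cosine`, `detect_language`, and the toy `REFERENCE_TEXTS` are hypothetical and are not taken from the paper.

```python
from collections import Counter
import math


def ngram_profile(text, n):
    """Return relative frequencies of character n-grams in `text`."""
    # Assumption: lowercase the text and collapse whitespace runs to one space.
    cleaned = " ".join(text.lower().split())
    counts = Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(value * q.get(gram, 0.0) for gram, value in p.items())
    norm_p = math.sqrt(sum(value * value for value in p.values()))
    norm_q = math.sqrt(sum(value * value for value in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def detect_language(text, reference_texts, n=2):
    """Pick the language whose reference profile is most similar to `text`."""
    doc_profile = ngram_profile(text, n)
    scores = {
        lang: cosine(doc_profile, ngram_profile(reference, n))
        for lang, reference in reference_texts.items()
    }
    return max(scores, key=scores.get)


# Toy reference samples (hypothetical). A real system would build each
# profile from a much larger training collection.
REFERENCE_TEXTS = {
    "id": "pemerintah mengatakan bahwa harga bahan bakar minyak akan naik "
          "pada bulan depan dan masyarakat diminta untuk tetap tenang",
    "en": "the government said that the price of fuel will rise next month "
          "and the public is asked to remain calm",
}

if __name__ == "__main__":
    sample = "saya akan pergi ke pasar untuk membeli makanan"
    for n in (1, 2, 3):  # unigram, bigram, trigram
        print(n, detect_language(sample, REFERENCE_TEXTS, n=n))
```

Setting `n` to 1, 2, or 3 switches between the unigram, bigram, and trigram variants compared in the study. Larger `n` yields more distinct n-grams per profile, which is one plausible reason the variants would differ in running time and accuracy.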

Copyrights © 2010