Generating meaning of textual data
Ujang Fahmi
Text mining, using manual techniques, was first used during the 1980s.
Text mining, also referred to as text data mining and roughly equivalent to text analytics, is the process of deriving high-quality information from text. High-quality information is typically derived by devising patterns and trends through means such as statistical pattern learning.
Text mining is defined as "the non-trivial extraction of hidden, previously unknown, and potentially useful information from (a large amount of) textual data."
Text cleanup means removing any unnecessary or unwanted information, such as removing ads from web pages or normalizing text converted from binary formats.
Annual Meeting PERDAMI 2018 : http://youtu.be/xkd3v18mzPo?a via @YouTube>@BPJSKesehatanRI @anjarisme @BPJSTKinfo @DPR_RI @hincapandjaitan @rs_matacicendo @RoySparringa @PBIDI @RRIPrograma3 @dedeyusuf_1 @NilaMoeloek @taufik_hd2001 @AgusYudhoyono @KPK_RI
annual meeting perdami twothousandandeighteen
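The cleanup step can be sketched roughly as follows. This is a minimal illustration, not the exact procedure behind the cleaned line above: the function name `clean_tweet` and the three regular expressions are assumptions, and the sketch does not spell out the year or drop stop words such as "via", so its output still differs from the cleaned example.

```python
import re

def clean_tweet(text: str) -> str:
    """Remove URLs, @mentions and symbols, then lowercase (illustrative only)."""
    text = re.sub(r"https?://\S+", " ", text)    # drop URLs
    text = re.sub(r"@\w+", " ", text)            # drop @mentions
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)  # drop punctuation and symbols
    return " ".join(text.lower().split())        # lowercase, collapse whitespace

# Shortened version of the raw tweet, used here only as sample input.
raw = ("Annual Meeting PERDAMI 2018 : http://youtu.be/xkd3v18mzPo?a "
       "via @YouTube>@BPJSKesehatanRI @anjarisme")
cleaned = clean_tweet(raw)
print(cleaned)  # annual meeting perdami 2018 via
```

Spelling out numbers and removing stop words would be further steps on top of this sketch.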
Tokenizing is achieved simply by splitting the text on white space.
| Text | Token |
|---|---|
| annual meeting perdami twothousandandeighteen | annual |
| annual meeting perdami twothousandandeighteen | meeting |
| annual meeting perdami twothousandandeighteen | perdami |
| annual meeting perdami twothousandandeighteen | twothousandandeighteen |
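The table above can be reproduced with a few lines of Python. This is a minimal sketch that assumes the text has already been cleaned; the variable names are illustrative.

```python
text = "annual meeting perdami twothousandandeighteen"

# str.split() with no arguments splits on any run of white space.
tokens = text.split()

# One (text, token) row per token, mirroring the table layout.
rows = [(text, token) for token in tokens]

for source, token in rows:
    print(f"{source} | {token}")
```

Splitting on white space is enough only after cleanup; punctuation left attached to words would otherwise end up inside the tokens.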
Part-of-speech (POS) tagging assigns a word class to each token. Its input is the tokenized text. Taggers have to cope with unknown words (the out-of-vocabulary, or OOV, problem) and ambiguous word-tag mappings.
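To make the two difficulties concrete, here is a toy lexicon-lookup tagger. It is a deliberately simple baseline, not how production taggers work (those normally use the surrounding context). The lexicon, the tag names, and the default tag for unknown words are all hypothetical choices made for this sketch.

```python
# Hypothetical lexicon: each known word maps to its candidate tags,
# ordered from most to least likely (an assumption for this sketch).
LEXICON = {
    "annual": ["ADJ"],
    "meeting": ["NOUN", "VERB"],  # ambiguous word-tag mapping
}

def tag_tokens(tokens, lexicon=LEXICON, oov_tag="NOUN"):
    """Assign one tag per token using lexicon lookup only."""
    tagged = []
    for token in tokens:
        candidates = lexicon.get(token)
        if candidates is None:
            # Unknown (OOV) word: fall back to a default tag.
            tagged.append((token, oov_tag))
        else:
            # Known word: pick the first (assumed most likely) candidate.
            tagged.append((token, candidates[0]))
    return tagged

tokens = "annual meeting perdami twothousandandeighteen".split()
tagged = tag_tokens(tokens)
print(tagged)
```

Here "meeting" is ambiguous and resolved by taking the first candidate, while "perdami" and "twothousandandeighteen" are OOV and receive the default tag.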
A text document is represented by the words it contains and their occurrences. Two main approaches to document representation are:
Feature selection, also known as variable selection, is the process of selecting a subset of important features for use in model creation. Redundant features are those that provide no extra information beyond the features already selected. Irrelevant features provide no useful information in any context.
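A small sketch of the idea, assuming documents are represented by per-term occurrence counts. The three sample documents and the two filtering rules are assumptions chosen for illustration: a term whose count is the same in every document is treated as irrelevant, and a term whose counts duplicate an already kept term is treated as redundant. Real feature selection methods are usually statistical and more nuanced than this.

```python
from collections import Counter

# Hypothetical mini-corpus, already cleaned.
docs = [
    "annual meeting perdami twothousandandeighteen",
    "annual meeting report",
    "annual budget report",
]

tokenized = [doc.split() for doc in docs]
vocab = sorted({token for doc in tokenized for token in doc})
counts = [Counter(doc) for doc in tokenized]

# One column of occurrence counts per term (a Counter returns 0 if absent).
columns = {term: tuple(c[term] for c in counts) for term in vocab}

def select_features(columns):
    """Keep terms that vary across documents and are not exact duplicates."""
    selected = {}
    seen = set()
    for term, column in columns.items():
        if len(set(column)) == 1:
            continue  # same count everywhere: carries no information here
        if column in seen:
            continue  # identical to a kept column: redundant
        seen.add(column)
        selected[term] = column
    return selected

selected = select_features(columns)
print(list(selected))  # ['budget', 'meeting', 'perdami', 'report']
```

In this sample, "annual" is dropped because it occurs once in every document, and "twothousandandeighteen" is dropped because its counts are identical to those of "perdami".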
Evaluate the result; after evaluation, the result may be discarded.
Go back to the pre-processing and feature selection phase.