2017.02.14 15:22

Korean News Crawler, Sentiment Analysis and Text Mining

Korean texts were crawled from various Korean news sources, sentiment analysis was run on the crawled texts on a daily basis, and text mining was carried out on the same texts. Python and R were used for this project.

Crawling of Korean News

Naver News aggregates news from a wide range of mass media sources. The site continually sorts, lists, and updates the news on its main page. The following screenshot shows the Naver News page.

After examining the HTML tags of the site, a crawler for the news listed on the site was developed in Python. The following screenshots show the development of the crawler.

Identification of HTML Tags

The tags and classes that hold the daily news from different sources were identified by examining the site's source code. These tags were then used to develop the news crawler.

Development of Naver News Crawler

As the Python code above shows, a crawler was developed to collect news content from the Naver News site. The crawler saves the news content as a text file; the following shows the crawled Korean news texts.

*Each line represents one unique news topic. The crawled result reflects news from various media sources at a specific time of day.
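As a minimal sketch of the parsing step only, the following uses Python's standard-library `html.parser` to collect the text of anchors that carry a given class and to write one headline per line. The class name `headline`, the helper names, and the sample HTML are hypothetical placeholders, not Naver's actual markup, and the original crawler may well have used a different library.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collect the text of every <a> tag that carries a target class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class  # hypothetical class name
        self._inside = False
        self._buffer = []
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and self.target_class in classes:
            self._inside = True
            self._buffer = []

    def handle_data(self, data):
        if self._inside:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._inside:
            text = "".join(self._buffer).strip()
            if text:
                self.headlines.append(text)
            self._inside = False


def extract_headlines(html, target_class="headline"):
    """Return the headline strings found in one HTML document."""
    parser = HeadlineParser(target_class)
    parser.feed(html)
    parser.close()
    return parser.headlines


def save_headlines(headlines, path):
    """Write one headline per line to a UTF-8 text file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(headlines) + "\n")


# Made-up sample page; a real run would first download the HTML.
SAMPLE_HTML = """
<ul>
  <li><a class="headline" href="/a">특검, 대통령 조사 준비</a></li>
  <li><a class="other" href="/b">광고</a></li>
  <li><a class="headline bold" href="/c">北 미사일 발사...</a></li>
</ul>
"""

headlines = extract_headlines(SAMPLE_HTML)
print(headlines)
```

Keeping the fetch step separate from the parsing step makes it possible to re-run the parser on saved pages without hitting the site again.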

Korean Natural Language Processing

The KoNLPy package was used for natural language processing of the crawled Korean texts. With KoNLPy, the Korean words in the crawled result were parsed, discarding non-text such as "..." and extracting only Korean text. The following shows the parsing of the Korean texts.

*Python code that parses Korean words from the crawled text. As the code above shows, the parsed Korean words are saved to a text file at the end.

*The above shows the parsed Korean texts.
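The filtering part of this step can be sketched without KoNLPy. The example below is a simplified stand-in, not the KoNLPy-based code from the project: it keeps runs of Hangul syllables with a regular expression and drops punctuation such as "...". Including Hanja (CJK ideographs) in the pattern is an assumption, made because characters such as "北" appear in the later frequency results. It performs no morphological analysis, so particles stay attached to words.

```python
import re

# Hangul syllables: U+AC00..U+D7A3. CJK unified ideographs (Hanja):
# U+4E00..U+9FFF. Keeping Hanja is an assumption, see the text above.
KOREAN_RUN = re.compile(r"[\uac00-\ud7a3\u4e00-\u9fff]+")


def extract_korean_words(line):
    """Return the runs of Korean characters found in one line of text."""
    return KOREAN_RUN.findall(line)


def parse_lines(lines):
    """Flatten the Korean words of several lines into one list."""
    words = []
    for line in lines:
        words.extend(extract_korean_words(line))
    return words


# Made-up sample lines in the style of crawled headlines.
sample = ["특검, 대통령 조사 준비...", "北 미사일 발사 (2017)"]
parsed = parse_lines(sample)
print(parsed)
```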

Sentiment Analysis

Two different approaches were taken to conduct sentiment analysis on the crawled texts. As shown above, the first method involves building sentiment dictionaries from the crawled texts, applying the dictionaries to the texts, and visualizing the overall sentiment in a pie chart. The second method involves applying a ready-made Korean sentiment dictionary to the crawled texts and visualizing the overall sentiment in a pie chart. After carrying out sentiment analysis with both methods, the results were compared and discussed.

Method 1

As shown above, the parsed Korean texts were manually classified into negative, neutral, and positive dictionaries, and the dictionaries grew daily as new news texts were crawled each day. Because the dictionaries were built by manual classification, the keywords in each dictionary reflect the corresponding sentiment well. Each dictionary was formatted as a "key": "list of values" mapping.

Because each sentiment dictionary was built and extended through manual classification and editing, errors such as the same keyword appearing multiple times in a dictionary can occur. To prevent this and to work with a valid list of unique sentiment keywords, the keyword list of each dictionary was read, reduced to unique keywords, and written back to the dictionary once the manual classification and editing of the daily crawled texts was complete. The code above shows this process.
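The de-duplication step might look roughly like the following sketch. The function name and the sample keywords are hypothetical; only the "key": "list of values" layout comes from the description above. `dict.fromkeys` is used so that the first occurrence of each keyword keeps its position.

```python
def dedupe_keywords(sentiment_dict):
    """Return a copy in which every keyword list holds unique entries.

    The order of first occurrence is preserved for each list.
    """
    return {
        label: list(dict.fromkeys(keywords))
        for label, keywords in sentiment_dict.items()
    }


# Made-up dictionary with a manually introduced duplicate.
raw = {
    "negative": ["논란", "의혹", "논란"],
    "neutral": ["대통령", "조사"],
    "positive": ["환영", "환영"],
}

clean = dedupe_keywords(raw)
print(clean)
```

The same keyword could still end up in two different dictionaries; this sketch does not check for that.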

Once the set of sentiment dictionaries was in place, the keywords from each dictionary were used to build separate sentiment data lists. The data lists were then used to produce a pie chart reflecting the sentiment of the crawled news overall. The code above shows this, and the pie chart below visualizes the overall sentiment of the crawled texts.

*The pie chart above reflects the sentiment of Korean news at a specific time of day.
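A minimal sketch of the counting behind such a pie chart, assuming the dictionaries are already loaded as a mapping from sentiment label to keyword list. The function names and sample words are hypothetical, and the plotting call is only indicated in a comment so that the example runs without extra packages.

```python
from collections import Counter


def count_sentiments(words, dictionaries):
    """Count how many words fall into each sentiment dictionary."""
    lookup = {
        word: label
        for label, keywords in dictionaries.items()
        for word in keywords
    }
    counts = Counter({label: 0 for label in dictionaries})
    for word in words:
        label = lookup.get(word)
        if label is not None:  # skip words outside every dictionary
            counts[label] += 1
    return counts


def to_shares(counts):
    """Convert raw counts into fractions that sum to 1 (or all 0.0)."""
    total = sum(counts.values())
    if total == 0:
        return {label: 0.0 for label in counts}
    return {label: n / total for label, n in counts.items()}


# Made-up dictionaries and parsed words.
dictionaries = {
    "negative": ["의혹", "논란"],
    "neutral": ["대통령", "조사", "특검"],
    "positive": ["환영"],
}
words = ["의혹", "대통령", "조사", "환영", "논란", "특검", "오늘"]

counts = count_sentiments(words, dictionaries)
shares = to_shares(counts)
print(counts, shares)

# With matplotlib installed, the shares could be drawn with
# matplotlib.pyplot.pie(list(shares.values()), labels=list(shares.keys())).
```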

Method 2

For Method 2, a sentiment dictionary for Korean words was acquired at the outset, so there was no need to build separate sentiment dictionaries. The following shows the acquired sentiment dictionary.

*The Korean sentiment dictionary has a total of 16,363 records, and each record contains attributes such as part-of-speech tag (P.O.S. tag), n-gram (a group of n consecutive words), sentiment, and sentiment score.

Using the already crawled Korean news texts, a function was built to apply the sentiment dictionary to the daily crawled news texts. The function then visualizes the overall sentiment of Korean news at a specific time of day. The code above shows this process, and the following pie chart is an example of the visualization.
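The lookup against a scored dictionary can be sketched as follows. The record layout (token mapped to a label and a score), the labels, the scores, and the `min_score` parameter are all assumptions made for illustration; the actual dictionary also carries part-of-speech tags and multi-word n-grams, which this single-token sketch ignores.

```python
from collections import Counter

# Hypothetical records: token -> (sentiment label, sentiment score).
LEXICON = {
    "의혹": ("NEG", 0.8),
    "논란": ("NEG", 0.6),
    "환영": ("POS", 0.9),
    "조사": ("NEUT", 0.5),
}


def score_words(words, lexicon, min_score=0.0):
    """Count words per sentiment label, ignoring low-score records."""
    counts = Counter()
    for word in words:
        record = lexicon.get(word)
        if record is None:
            continue  # word is not covered by the dictionary
        label, score = record
        if score >= min_score:
            counts[label] += 1
    return counts


# Made-up parsed words from one crawl.
words = ["의혹", "대통령", "조사", "환영", "논란"]

all_counts = score_words(words, LEXICON)
strong_counts = score_words(words, LEXICON, min_score=0.7)
print(all_counts, strong_counts)
```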

Sentiment Visualization

Visualization of Method 1 Results

Using Method 1, Korean news between 20170209 and 20170215 was crawled. The frequency of sentiment words for each day was counted with Python and visualized with R. The following line plot shows this.

The ratio of sentiment words per day can be derived from the daily frequencies of sentiment words. The sentiment ratios over the week were visualized with R, as shown below.
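Turning daily frequencies into daily ratios is a per-day normalization. The sketch below assumes the counts are held as a mapping from date string to per-label counts; the function name and the numbers are made up for illustration and are not the project's results.

```python
def daily_ratios(daily_counts):
    """Normalize each day's sentiment counts so they sum to 1."""
    ratios = {}
    for day, counts in daily_counts.items():
        total = sum(counts.values())
        ratios[day] = {
            label: (n / total if total else 0.0)
            for label, n in counts.items()
        }
    return ratios


# Made-up counts for two days.
daily_counts = {
    "20170209": {"negative": 30, "neutral": 60, "positive": 10},
    "20170210": {"negative": 25, "neutral": 50, "positive": 25},
}

ratios = daily_ratios(daily_counts)
for day, shares in ratios.items():
    print(day, shares)
```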

Sentiment analysis was also carried out on all texts crawled between 20170209 and 20170215 and visualized as a pie chart. The pie chart, based on one week's worth of crawled news texts, was drawn with Python and is shown below.

As the Method 1 results show, between 20170209 and 20170215 the majority of words from Naver Main News, which is a collection of news from other Korean mass media, were neutral. The line chart and stacked chart above likewise show that most of the Korean text in Naver Main News throughout the study period was neutral; however, the results also show that the share of positive keywords was smaller than that of negative keywords throughout the study period. This implies that negative words were, perhaps, used more frequently to connect neutral words and convey messages, possibly leaving readers with an overall negative impression of Korean society or global issues more often. In addition, it raises the possibility that Naver News ranks or lists more prominently those news items from other Korean mass media that carry more negative connotations. If the Naver News sorting algorithm is based on the number of views, this could mean that online news readers are more attracted to negative news than to positive news.

Visualization of Method 2 Results

Using the predefined Korean sentiment dictionary with sentiment scores, the ratios of sentiment words per day between 20170209 and 20170215 were computed and visualized with R. The following stacked bar chart shows this.

As with Method 1, one week's worth of crawled Korean news texts was combined and run through the Method 2 approach to compute the overall sentiment ratios. The overall ratios of sentiment words during the study period were visualized as a pie chart with Python, shown below.

Discussion of Method 1 and Method 2 Results

As shown above, the results of the two methods differed from one another to some extent. The Method 1 results indicated that the dominant sentiment of the crawled texts was "Neutral" and that this was consistent throughout the study period. In contrast, the Method 2 results showed that "Neutral" was clearly not the dominant sentiment, either overall or on any day of the study period. Instead, Method 2 suggested that the dominant sentiment was "Negative", followed by "Positive". Furthermore, while Method 1 showed a relatively large difference between the ratios of positive and negative sentiment words, both overall and throughout the study period, Method 2 showed no such large difference between the negative and positive ratios. However, both results clearly indicated that negative words were generally more present in the news than positive words. In other words, more negative than positive Korean words were used in the news between 20170209 and 20170215. The reasons for this outcome are not clear; however, one can conjecture that "negative events" occurred frequently in Korean or global society during the study period.

Text Mining with Crawled Korean News

Based on the news texts crawled between 20170209 and 20170215, some of the most frequent Korean terms were identified. After pre-processing with Python, the frequent terms were visualized with R. The bar chart below shows this.

As the bar chart above shows, Korean words such as "특검" (which means "Independent Counsel" in English), "안" (which can mean "No" in English), "대통령" (which means "President"), and "조사" (which means "Investigation") were frequent in Naver News throughout the study period. The bar chart also shows that "北" (which means "North") appeared frequently between 20170209 and 20170215. These suggest that the Korean mass media frequently put the spotlight on the Korean presidential scandal as well as on North Korean threats during this one-week period.
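The Python pre-processing for such a frequency chart can be as small as a `collections.Counter` plus a CSV export for R to read. The sketch below is an assumption about how that hand-off could look; the function names, the column names, and the sample words are made up.

```python
import csv
import io
from collections import Counter


def top_terms(words, n=10):
    """Return the n most frequent terms as (term, count) pairs."""
    return Counter(words).most_common(n)


def to_csv(pairs):
    """Render (term, count) pairs as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["term", "count"])
    writer.writerows(pairs)
    return buffer.getvalue()


# Made-up parsed words.
words = ["특검", "대통령", "특검", "조사", "北", "특검", "대통령"]

top = top_terms(words, n=2)
csv_text = to_csv(top)
print(csv_text)
```

Writing the text to a file and loading it in R with `read.csv` would then give a data frame with one row per term.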

A word cloud was drawn with R from the crawled word counts. The bigger a word is, the more frequently it appeared in the crawled texts. The following word cloud visualizes the frequencies of Korean words from the crawled news.

Visualization of Associated Texts

Associated texts were identified and visualized with R. The following shows the process.

Extracting Nouns and Pre-Processing

As shown above, the text file containing the crawled news was read, and noun keywords were extracted from it with R's KoNLP package. To eliminate repeated words, filtering functions were developed and applied.

Development of Transactions and Association Rules with Apriori Algorithm

As shown above, the pre-processed words were converted into transactions with the "arules" R package. The Apriori algorithm was then applied to the transaction data to form association rules.
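To illustrate the idea behind this step, here is a simplified, pair-only sketch in Python. It is not the "arules" API and not a full Apriori implementation: it applies the Apriori pruning idea once (drop infrequent single items, then count only pairs of the surviving items) and derives one-to-one rules with support and confidence. The function name, the thresholds, and the sample transactions are assumptions.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support=0.5, min_confidence=0.6):
    """Return rules (lhs, rhs, support, confidence) over item pairs."""
    n = len(transactions)
    if n == 0:
        return []

    # Level 1: count single items and keep only the frequent ones.
    item_counts = Counter(item for t in transactions for item in set(t))
    frequent = {i for i, c in item_counts.items() if c / n >= min_support}

    # Level 2: count pairs built only from frequent items.
    pair_counts = Counter()
    for t in transactions:
        kept = sorted(set(t) & frequent)
        pair_counts.update(combinations(kept, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules


# Made-up transactions: the nouns of four headlines.
transactions = [
    ["특검", "대통령", "조사"],
    ["특검", "대통령"],
    ["北", "미사일"],
    ["특검", "조사"],
]

rules = pair_rules(transactions)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} => {rhs} (support={support:.2f}, confidence={confidence:.2f})")
```

Here support is the fraction of transactions that contain both words, and confidence is the fraction of transactions containing the left-hand word that also contain the right-hand word.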

Converting Association Rules to Matrix Format

As shown above, the association rules were converted into matrix format as a pre-processing step. The console screenshot above shows the resulting matrix.

Visualization of Associated Texts and Closeness Centrality

As the code above shows, the "igraph" package was imported to visualize associated words. An arbitrary subset of the associated words (rows 100 to 180) was selected from the matrix data. Of the selected associated texts, the 10 with the highest closeness centrality were selected and visualized to show which of them play the most central role. The following shows the visualizations.
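Closeness centrality itself can be sketched in plain Python with a breadth-first search. The example treats each associated word pair as an undirected edge and defines closeness as the number of reachable words divided by the sum of shortest-path distances to them. That normalization is an assumption and may differ from what "igraph" computes; the function names and sample edges are made up.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Build an undirected adjacency map from (word, word) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def closeness(graph, source):
    """Reachable node count divided by total shortest-path distance."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    total = sum(distance.values())
    reachable = len(distance) - 1
    return reachable / total if total else 0.0


def top_central(edges, k=10):
    """Return the k words with the highest closeness, ties by name."""
    graph = build_graph(edges)
    scores = {node: closeness(graph, node) for node in list(graph)}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Made-up associated word pairs forming the chain
# 조사 - 특검 - 대통령 - 탄핵.
edges = [("특검", "대통령"), ("특검", "조사"), ("대통령", "탄핵")]

ranking = top_central(edges, k=2)
print(ranking)
```

Words in the middle of the chain score higher than words at its ends, which is the sense in which the top-ranked words play a central role.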