Development of Crawlers using Python
Introduction
During this project, crawlers that collect Japanese text from a Japanese bank website were developed. The crawlers' functions include crawling information according to URL tags, pre-processing the results, and saving them in text file format in the working directory. The development was carried out in Python.
Target Site and Target Information
The following screenshot shows the Japanese website of the Bank of Tokyo-Mitsubishi UFJ.
As customers navigate the website, they will find a page in Japanese that highlights warnings on financial crimes. The Bank regularly updates the list of financial crimes and their specific details, as shown below:
*Page on Financial Crime
*Lists of updates and warnings on Financial Crimes
In order to crawl only the updates on financial crimes from the site, the source code of the site had to be examined. The following shows the source code of the site.
As the above highlights, the text of each update sits in a 'p' tag with the class 'iLink02', while the dates together with the update texts sit in a 'ul' tag with the class 'listLink01 mt15 section01 date_info fsM'. Using these tags, the crawlers were developed.
The following shows the crawler that was developed in Python and successfully crawled the required information.
The crawler, as shown above, uses packages such as requests to connect to the website. After importing the necessary packages, the actual website address is assigned to the 'url' variable. Using 'urllib', the website is opened and the response is assigned to 'page'. After this, the 'BeautifulSoup' package is used to parse all the tags of the page into the variable 'soup'.
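A minimal sketch of this flow is given below. Since the original code appears only as a screenshot, the URL and variable names here are assumptions; the sketch uses 'urllib' to open the page, as described above.

```python
import urllib.request
from bs4 import BeautifulSoup

# Placeholder address for the Bank's financial crime page (assumption)
url = 'http://www.bk.mufg.jp/info/hanzai/index.html'

# Open the page and keep the response in 'page'
page = urllib.request.urlopen(url)

# Parse all the tags of the page into the variable 'soup'
soup = BeautifulSoup(page, 'html.parser')
```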
The following shows the tags that were parsed into 'soup'.
As mentioned previously, one of the tags that holds the target information is 'ul', whose class is 'listLink01 mt15 section01 date_info fsM' according to the source code of the website. That tag and class were therefore selected from the parsed tags, and a for loop with 'list.append' was used to make sure only the text inside the tag was collected.
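A sketch of this extraction step is shown below. Iterating over the 'li' items of the selected 'ul' tag is an assumption; the class name is taken from the page source quoted above.

```python
results = []

# Select the <ul> tag that holds the dates and update texts
target = soup.find('ul', class_='listLink01 mt15 section01 date_info fsM')

if target is not None:
    for item in target.find_all('li'):
        # get_text() drops the markup and keeps only the visible text
        results.append(item.get_text(strip=True))
```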
The information crawled by the crawler is shown below.
As the screenshot highlights, both the dates and the information about financial crimes have been crawled from the website. In order to preserve the crawled result, the crawler developed during the project saves the data into a text file. The screenshot below highlights the output of the crawler.
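A minimal sketch of the saving step might look like the following; the output file name is an assumption.

```python
# Write one crawled record per line to a text file in the working directory
with open('financial_crime_updates.txt', 'w', encoding='utf-8') as f:
    for line in results:
        f.write(line + '\n')
```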
Using the content in the text file, one may conduct further natural language processing or integrate it with other open-source software such as R to conduct sentiment analysis.
Another crawler was developed to crawl only the updates on financial crimes from the bank website, without the dates. The following highlights the crawler that crawled only the updates. Just like the former, this crawler automatically saves its output in text file format.
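A sketch of the second crawler's extraction step is given below, assuming the same 'soup' object as before; the update texts sit in 'p' tags with the class 'iLink02', so those are selected instead of the 'ul' tag.

```python
raw_updates = []

# Select every <p> tag with class 'iLink02' and keep its text only
for p in soup.find_all('p', class_='iLink02'):
    raw_updates.append(p.get_text(strip=True))

# At this point navigation labels such as '一覧へ' ("go to list") are still included
```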
The crawling logic is similar to the former crawler's. The difference is that a different tag and class name were used to extract the updates. This ignores the dates and extracts only the updates; however, once the crawler crawls the information from the site, unnecessary words such as '一覧へ', which means "go to list" in Japanese, are crawled as well. The following highlights the output from the crawler that contains the irrelevant words.
In order to eliminate the irrelevant words, pre-processing was done. Using 'if' and 'for' loop statements, only the necessary information was saved into a new variable as a list.
Moreover, since the result was still in 'list' form, 'chain' from 'itertools' was used to flatten the list and combine everything together. After cleansing the data, the crawler saves the cleaned data in text file format just like the former crawler did. The following highlights the output of the second crawler; it is formatted so that each record represents one specific update about a financial crime.
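A combined sketch of the pre-processing and saving steps described above is shown below. The exact intermediate shape of the data and the output file name are assumptions; the sketch only illustrates the 'if'/'for' filtering, the flattening with 'chain', and the final write.

```python
from itertools import chain

# Drop irrelevant entries such as '一覧へ' with an if test inside a for loop
cleaned = []
for text in raw_updates:
    if text and text != '一覧へ':
        cleaned.append([text])  # each record kept as its own small list (assumed shape)

# Flatten the list of lists into one flat list with itertools.chain
flat = list(chain.from_iterable(cleaned))

# Save one update per line; the file name is an assumption
with open('financial_crime_updates_only.txt', 'w', encoding='utf-8') as f:
    for record in flat:
        f.write(record + '\n')
```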