Online Product Recommendation with Hadoop Spark
Cloud Computing, Hadoop YARN, Hadoop Spark, Association Rules, Python
Background
Product recommendation is a common example of cross-selling and is frequently used by online retailers. One simple method is to recommend products that are frequently browsed together by customers.
The aim of this project was to recommend new products to the customer based on the products they have already browsed on the online website.
To conduct product recommendation, Hadoop Spark on Hadoop YARN was used and association rule learning, an unsupervised data mining technique, was implemented. Furthermore, the project was conducted through the use of Cloud Computing.
The cloud service the project utilized was Amazon Elastic Compute Cloud (EC2), a large and complex web service. EC2 provides an API for launching computing instances with any of its supported operating systems, and computation environments can be provisioned from Amazon Machine Images (AMIs). The following image highlights how EC2 operates.
Transaction data was acquired, where each line of data represents one customer browsing session. On each line, each 8-character string is the ID of an item browsed during that session, and the items are separated by spaces. The data is 3,408 KB in size, and the following screenshot shows the input data in text file format.
Cloud Computing
As mentioned, the Amazon Elastic Compute Cloud service was used throughout the project. Once signed into AWS, three EC2 instances were launched: one for the master node and two for the slave nodes. The following screenshot highlights the launching of the instances.
In order to connect to the newly created instances, the PuTTY client was used.
After entering the necessary IP address and loading the private key file into PuTTY, a PuTTY session was launched. The AMI used throughout the project had Ubuntu as its OS. After logging in as 'ubuntu', a PuTTY session with a Unix shell was initiated and the account 'djyoo1234' was created.
Hadoop
During the project, a Hadoop working environment was configured on the Amazon instances. This included installing the default JRE/JDK and downloading the latest release of Hadoop. In the process, Hadoop YARN and the Hadoop Distributed File System (HDFS) were configured. The following highlights the result after launching the pseudo-distributed Hadoop cluster.
Input Data
As shown below, the content of the text file was copied and saved to Unix local directory.
*The input data on Unix can also be saved to HDFS by creating a new directory on the Hadoop Distributed File System. This way, if needed, data can be loaded from HDFS rather than from the local directory.
Hadoop Spark
Hadoop Spark was downloaded through Unix; however, Python was used as the medium for working with Spark, so PySpark was initiated and utilized throughout the project. The following highlights the initialization of YARN, HDFS and PySpark.
Implementation with Python codes
Data was uploaded to Spark using the following Python code:
import os
# open the transaction data file from the current working directory
data = open(os.path.join(os.getcwd(), 'items.txt'), mode='r')
After uploading the data, the following steps were taken to generate the online product recommendations.
Acquiring Products that occur together at least 100 times
During the project, the threshold for frequent item sets was set to 100. All items meeting this threshold were counted as frequent and added to the 'frequentItems' variable, which is a 'list'. The following code highlights the process.
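The original code appears only as a screenshot, so the following is a hypothetical single-node sketch of this step. The variable name 'frequentItems' follows the report; the sample sessions and item IDs are made up, and the support threshold is lowered from 100 to 2 to suit the toy data:

```python
from collections import Counter

# Toy browsing sessions: one per line, space-separated 8-character item IDs
# (hypothetical IDs; the real project used 3,408 KB of data)
sessions = [
    "FRO11987 ELE17451 ELE89019",
    "FRO11987 ELE17451",
    "ELE17451 GRO99222",
]
threshold = 2  # the project used 100

# Count every item across all sessions, then keep those meeting the threshold
itemCounts = Counter(item for session in sessions for item in session.split())
frequentItems = [item for item, count in itemCounts.items() if count >= threshold]

print(frequentItems)  # ['FRO11987', 'ELE17451']
```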
Acquiring all possible pairs of frequent items
In order to acquire all pairs of frequent items, the following code was utilized; 'combinations()' from the 'itertools' package was used to generate the possible pair combinations.
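As the pair-counting code is likewise only visible as a screenshot, a hypothetical single-node sketch using 'combinations()' might look like this (the sample sessions and the 'frequentItems' list are made-up stand-ins for the previous step's output):

```python
from itertools import combinations
from collections import Counter

# Hypothetical sample data; in the project these come from the previous step
sessions = [
    "FRO11987 ELE17451 ELE89019",
    "FRO11987 ELE17451",
    "ELE17451 GRO99222",
]
frequentItems = ["FRO11987", "ELE17451"]

# Count each unordered pair of frequent items co-occurring in a session;
# sorting the items makes (A, B) and (B, A) the same dictionary key
pairCounts = Counter()
for session in sessions:
    present = sorted(set(session.split()) & set(frequentItems))
    for pair in combinations(present, 2):
        pairCounts[pair] += 1

print(pairCounts[("ELE17451", "FRO11987")])  # 2
```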
Acquiring all possible triples
Furthermore, in order to acquire frequent triples, the following code was implemented. Again, 'combinations()' from the 'itertools' package was used to generate the 3-item combinations.
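A hypothetical sketch of the triple-counting step, following the same pattern as the pairs (sample sessions and item IDs are again made up):

```python
from itertools import combinations
from collections import Counter

# Hypothetical sample data; frequent items would come from the first step
sessions = [
    "FRO11987 ELE17451 ELE89019",
    "FRO11987 ELE17451 ELE89019",
    "FRO11987 ELE17451",
]
frequentItems = ["FRO11987", "ELE17451", "ELE89019"]

# Count each unordered triple of frequent items co-occurring in a session
tripleCounts = Counter()
for session in sessions:
    present = sorted(set(session.split()) & set(frequentItems))
    for triple in combinations(present, 3):
        tripleCounts[triple] += 1

print(tripleCounts[("ELE17451", "ELE89019", "FRO11987")])  # 2
```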
Development of confidence function
Recommendations were planned to be displayed based on confidence score; therefore, in order to implement the confidence on the item sets, a function for the confidence score was developed. The function fundamentally follows the following formula for confidence:
Rule: A => B, Confidence = Frequency(A, B) / Frequency(A)
To differentiate pairs from triples when the function is used, 'if' statements were implemented.
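Since the function itself is shown only as a screenshot, the following is a hypothetical reconstruction of a 'confidence' function that distinguishes pairs from triples with 'if' statements; the count dictionaries are toy data standing in for the earlier steps' results:

```python
# Hypothetical toy counts; in the project these come from the earlier steps
itemCounts = {"A": 4, "B": 3, "C": 2}
pairCounts = {("A", "B"): 2, ("A", "C"): 2, ("B", "C"): 1}
tripleCounts = {("A", "B", "C"): 1}

def confidence(itemset, target):
    """Confidence of the rule (itemset - target) => target."""
    if len(itemset) == 2:
        # Rule A => B: Frequency(A, B) / Frequency(A)
        (antecedent,) = [x for x in itemset if x != target]
        return pairCounts[tuple(sorted(itemset))] / itemCounts[antecedent]
    if len(itemset) == 3:
        # Rule (A, B) => C: Frequency(A, B, C) / Frequency(A, B)
        antecedent = tuple(sorted(x for x in itemset if x != target))
        return tripleCounts[tuple(sorted(itemset))] / pairCounts[antecedent]

print(confidence(("A", "B"), "B"))       # 0.5
print(confidence(("A", "B", "C"), "C"))  # 0.5
```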
Acquiring pairs and triples by confidence
In order to acquire the 'rules' for pairs and triples with confidence scores, the 'confidence' function was applied. The resulting pair and triple rules were added to the 'pairRules' and 'tripleRules' variables, which start as empty dictionaries. The following code highlights the process.
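A hypothetical sketch of filling the two rule dictionaries (the variable names 'pairRules' and 'tripleRules' follow the report; the toy counts are made up): every frequent pair or triple yields one rule per possible consequent.

```python
# Hypothetical toy counts standing in for the earlier steps' results
itemCounts = {"A": 4, "B": 3, "C": 2}
pairCounts = {("A", "B"): 2, ("A", "C"): 2, ("B", "C"): 1}
tripleCounts = {("A", "B", "C"): 1}

# 'pairRules' and 'tripleRules' start as empty dictionaries, as in the report
pairRules = {}
for (a, b), n in pairCounts.items():
    pairRules[(a, b)] = n / itemCounts[a]  # rule a => b
    pairRules[(b, a)] = n / itemCounts[b]  # rule b => a

tripleRules = {}
for (a, b, c), n in tripleCounts.items():
    tripleRules[((a, b), c)] = n / pairCounts[(a, b)]  # rule (a, b) => c
    tripleRules[((a, c), b)] = n / pairCounts[(a, c)]  # rule (a, c) => b
    tripleRules[((b, c), a)] = n / pairCounts[(b, c)]  # rule (b, c) => a

print(pairRules[("A", "B")])           # 0.5
print(tripleRules[(("B", "C"), "A")])  # 1.0
```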
Sorting rules according to confidence score
In order to have the rules sorted by confidence score, 'sorted()' was applied to the dictionaries containing the rules, 'pairRules' and 'tripleRules'. During the project, only the top 20 rules by confidence score were displayed. The following code highlights this, and the following screenshot highlights the output of the project.
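The sorting step might be sketched as follows (a hypothetical 'pairRules' dictionary is used for illustration; the same call works on 'tripleRules'):

```python
# Hypothetical rule dictionary; keys are rules, values are confidence scores
pairRules = {("A", "B"): 0.5, ("B", "A"): 0.9, ("C", "A"): 0.7}

# sorted() over the dictionary items, highest confidence first, top 20 kept
topPairRules = sorted(pairRules.items(), key=lambda kv: kv[1], reverse=True)[:20]

for (lhs, rhs), score in topPairRules:
    print(f"{lhs} => {rhs}: {score}")
```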
*Output:
Conclusion and Limitation
Although the data was small enough to be processed on a single node, the project required implementing the solution on a parallel system. The example above was run on a 3-node cluster with one master node and two slave nodes for computation. This product recommendation process can greatly improve the relevancy of the recommended products while also gaining the benefits of a parallel system: should the data set become much larger or change constantly, nodes can be added to match the demand.

Nonetheless, the process was challenging and demanding. The main problem was not the coding itself but using Python on a Unix interface. Although Python had been used on the console before, experience with Python on the Eclipse interface was much richer; as a result, code had to be executed line by line rather than running whole blocks at once, as one can in Eclipse, which consumed a lot of time when tracking down the sources of errors.

In conclusion, online product recommendations based on confidence scores were generated, which would enable companies to recommend the products that are frequently browsed together by customers.