Recommender Systems

An experiment with Amazon reviews

by Tomás Pica / @tomaspdc / github.com/tomaspdc

What is a Recommender System?

Filters information

Discovers preferences on a subject

Measures similarities between users

Real Examples


Understanding Amazon's ecosystem

Over 144 million active customer accounts.
( ~2.27 times the population of the UK )

Over 222 million products on sale.

426 items sold per second.
(Christmas 2013)


Customers can rate products they bought on a scale of 1 to 5 stars.

Where's the Data?

~34.5 million reviews.

~6.5 million users.

~2.5 million products.

Spanning from Jun 1995 to Mar 2013

Source: http://snap.stanford.edu/data/web-Amazon.html
Permission granted by Julian McAuley (jmcauley@cs.stanford.edu)
J. McAuley and J. Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. RecSys, 2013.

A raw review


product/productId: B00005X3U4
product/title: The voice of Bugle Ann
product/price: unknown
review/userId: A169ZYI77GT1F3
review/profileName: Janet K. May
review/helpfulness: 0/0
review/score: 5.0
review/time: 1288051200
review/summary: Childhood Memories
review/text: My husband remembered this as a little boy. He tried to find one in the library but they had none. What a surprise he had on his birthday and thoroughly enjoyed it again. Brought a lot of old memories and stories to be told.
                    

Processed reviews


...
B000HEKTIW::A7EWCPD8COL3X::5::4.99::2/2::1292889600
B00005AQF1::A19CQRD6DIHMQL::5::unknown::0/3::1124409600
B000DZH89I::A2POGVCWFR6738::2::unknown::0/0::1358208000
B0007HEURA::A3C2A3D2KG1F1A::5::unknown::2/2::1266796800
B0002DJNNA::A1MFR5PGMZFQPX::1::5.93::0/1::1290297600
B003Y6ID2Y::ATGPAY0V61JO7::5::2.99::0/0::1178928000
B00029BM6A::A7M0T2XJM74DN::5::unknown::0/0::1333929600
B0000DD75Q::A1BKIHESLDFD95::4::9.89::3/3::1180656000
B743504704::A1IE6VWY0U0VNT::3::unknown::0/0::1204156800
B000E0C6SK::A16QQ78I8J29PA::4::unknown::3/3::1275264000
...
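
As a rough illustration of how a raw review could be reduced to one processed line, here is a minimal sketch. The field order (product ID, user ID, score, price, helpfulness, time) is inferred from the sample lines above, and the function names `parse_raw_review` and `to_processed_line` are hypothetical; the actual preprocessing script is not shown in these slides.

```python
# Field order of the processed format, inferred from the sample lines
# (an assumption; the slides do not spell it out).
FIELDS = ("product/productId", "review/userId", "review/score",
          "product/price", "review/helpfulness", "review/time")


def parse_raw_review(block):
    """Turn one raw 'key: value' review block into a dict."""
    record = {}
    for line in block.strip().splitlines():
        key, sep, value = line.partition(": ")
        if sep:  # skip lines that do not look like 'key: value'
            record[key.strip()] = value.strip()
    return record


def to_processed_line(record):
    """Join the selected fields with '::', writing the score as an integer."""
    values = []
    for field in FIELDS:
        value = record[field]
        if field == "review/score":
            value = str(int(float(value)))  # "5.0" -> "5"
        values.append(value)
    return "::".join(values)


raw = """
product/productId: B00005X3U4
product/title: The voice of Bugle Ann
product/price: unknown
review/userId: A169ZYI77GT1F3
review/profileName: Janet K. May
review/helpfulness: 0/0
review/score: 5.0
review/time: 1288051200
review/summary: Childhood Memories
"""

print(to_processed_line(parse_raw_review(raw)))
```

The title, profile name, summary and review text are dropped here because they do not appear in the processed sample lines.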
                    

Dataset slicing

True Blind subset: Random sample of ~9.8 million reviews

Second Blind subset: Random sample of ~6.5 million reviews

Training/Test sets: random samples of increasing size, from 100,500 reviews to 11 million reviews in steps of 100k reviews, each split 80%/20% into training and test sets.

That is, a 100,500-review subset, a 200,500-review subset, and so on.
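
The subset sizes and the 80%/20% split can be sketched as follows. This is only an illustration: the helper names `subset_sizes` and `train_test_split` are hypothetical, the seed is arbitrary, and the slides do not say whether each larger sample contains the smaller ones.

```python
import random


def subset_sizes(start=100_500, stop=11_000_000, step=100_000):
    """Sizes of the incremental subsets: 100,500, 200,500, ..."""
    return list(range(start, stop + 1, step))


def train_test_split(reviews, train_percent=80, seed=0):
    """Shuffle a copy of the reviews and cut it into train and test parts."""
    shuffled = list(reviews)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * train_percent // 100  # integer arithmetic, no rounding surprises
    return shuffled[:cut], shuffled[cut:]


sizes = subset_sizes()
print(sizes[:3])

# Stand-in for the smallest subset: integers instead of real reviews.
train, test = train_test_split(range(100_500))
print(len(train), len(test))
```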

Measuring Similarity

Cosine Similarity
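
For reference, cosine similarity between two rating vectors is their dot product divided by the product of their lengths. Below is a minimal sketch over sparse user profiles, assuming each user is represented as a dict from product ID to rating and that unrated products count as 0; both choices are assumptions, since the slides do not describe the representation used.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse rating vectors (dicts)."""
    shared = a.keys() & b.keys()  # only co-rated products contribute to the dot product
    dot = sum(a[k] * b[k] for k in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is treated as similar to nobody
    return dot / (norm_a * norm_b)


# Hypothetical users rating hypothetical products.
alice = {"p1": 5, "p2": 3}
bob = {"p1": 4, "p2": 2, "p3": 5}
print(cosine_similarity(alice, bob))
```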

Predicting Behaviour

Singular Value Decomposition

One SVD model was computed for each Train/Test subset.
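
The slides do not say which SVD variant or library was used. As an assumption, the sketch below shows the matrix-factorization form that recommender literature commonly calls "SVD": each rating is approximated as the global mean plus the dot product of a user vector and an item vector, fitted by stochastic gradient descent. The names `train_mf` and `predict`, the hyperparameters and the toy ratings are all hypothetical.

```python
import random


def train_mf(ratings, n_factors=2, n_epochs=300, lr=0.02, reg=0.02, seed=0):
    """Fit rating ~ mean + p[user] . q[item] with SGD and L2 regularization."""
    rng = random.Random(seed)
    mean = sum(r for _, _, r in ratings) / len(ratings)
    p, q = {}, {}
    for user, item, _ in ratings:
        if user not in p:
            p[user] = [rng.gauss(0, 0.1) for _ in range(n_factors)]
        if item not in q:
            q[item] = [rng.gauss(0, 0.1) for _ in range(n_factors)]
    for _ in range(n_epochs):
        for user, item, rating in ratings:
            pu, qi = p[user], q[item]
            err = rating - (mean + sum(a * b for a, b in zip(pu, qi)))
            for f in range(n_factors):
                pu_f, qi_f = pu[f], qi[f]  # keep old values for a simultaneous update
                pu[f] += lr * (err * qi_f - reg * pu_f)
                qi[f] += lr * (err * pu_f - reg * qi_f)
    return mean, p, q


def predict(model, user, item):
    """Predicted rating, clamped to the 1-5 scale; unknown users/items get the mean."""
    mean, p, q = model
    if user not in p or item not in q:
        return mean
    raw = mean + sum(a * b for a, b in zip(p[user], q[item]))
    return min(5.0, max(1.0, raw))


# Toy data: u1/u2 like items a/b, u3/u4 like items c/d.
# Two ratings, (u1, b) and (u3, a), are held out.
ratings = [
    ("u1", "a", 5), ("u1", "c", 1), ("u1", "d", 1),
    ("u2", "a", 5), ("u2", "b", 5), ("u2", "c", 1), ("u2", "d", 1),
    ("u3", "b", 1), ("u3", "c", 5), ("u3", "d", 5),
    ("u4", "a", 1), ("u4", "b", 1), ("u4", "c", 5), ("u4", "d", 5),
]
model = train_mf(ratings)
print(predict(model, "u1", "b"), predict(model, "u3", "a"))
```

Training one such model per Train/Test subset then amounts to calling the training function once per subset and keeping the resulting factors.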

Training Errors

Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE), averaged over 5 folds.
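
A 5-fold average can be sketched like this, assuming the usual cross-validation scheme in which the data is cut into five parts and each part serves once as the held-out fold. The helpers `k_fold_indices` and `cross_validated_mean` are hypothetical names, and the data is assumed to be shuffled beforehand.

```python
def k_fold_indices(n_items, k=5):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    fold_sizes = [n_items // k + (1 if f < n_items % k else 0) for f in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_items))
        yield train_idx, test_idx
        start += size


def cross_validated_mean(n_items, score_fold, k=5):
    """Average a per-fold score, e.g. the MAE of a model trained on that fold."""
    scores = [score_fold(train_idx, test_idx)
              for train_idx, test_idx in k_fold_indices(n_items, k)]
    return sum(scores) / len(scores)


# Dummy scorer: returns the size of the held-out fold instead of a real error.
print(cross_validated_mean(10, lambda train_idx, test_idx: len(test_idx)))
```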

Second Blind Errors

MAE and RMSE against Second Blind subset.
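
Both error measures compare predicted ratings with the true ones: MAE averages the absolute differences, while RMSE averages the squared differences and takes the square root, so it penalizes large misses more. A minimal sketch with made-up ratings:

```python
import math


def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


# Made-up ratings for illustration only.
actual = [5, 3, 1, 4]
predicted = [4, 3, 3, 4]
print(mean_absolute_error(actual, predicted))
print(root_mean_squared_error(actual, predicted))
```

RMSE is never smaller than MAE on the same predictions, which is consistent with the figures reported in these slides.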

Selecting a model

The model with the lowest MAE on the Second Blind subset was chosen.

MAE and RMSE were measured against the True Blind subset.


MODEL/ID: 7500500
MAE: 0.766031
RMSE: 1.596316
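
The selection step is a plain arg-min over the candidate models. In the sketch below the model IDs and MAE values are placeholders invented for illustration, not results from the experiment, and `select_model` is a hypothetical helper.

```python
def select_model(second_blind_mae):
    """Return the ID of the model with the lowest Second Blind MAE."""
    return min(second_blind_mae, key=second_blind_mae.get)


# Placeholder IDs and errors, NOT measured values.
candidates = {"model-a": 0.90, "model-b": 0.80, "model-c": 0.85}
print(select_model(candidates))
```

Only the selected model is then scored against the True Blind subset, so that subset plays no part in the choice.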
                        

OK cool but... how good is that?

Baseline model: guess a rating at random, weighted as follows:


1 "star"   ~7.62%
2 "stars"  ~5.13%
3 "stars"  ~8.55%
4 "stars" ~19.38%
5 "stars" ~59.29%
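
A weighted random guesser with these weights could look like the sketch below. It uses `random.choices` from the Python standard library, which normalizes the weights itself; the function name and the seed are arbitrary choices for illustration.

```python
import random

# Weights copied from the table above (percent per star rating).
STAR_WEIGHTS = {1: 7.62, 2: 5.13, 3: 8.55, 4: 19.38, 5: 59.29}


def weighted_random_ratings(n, seed=None):
    """Draw n star ratings at random according to STAR_WEIGHTS."""
    rng = random.Random(seed)
    stars = list(STAR_WEIGHTS)
    weights = list(STAR_WEIGHTS.values())
    return rng.choices(stars, weights=weights, k=n)


guesses = weighted_random_ratings(10_000, seed=1)
print(guesses[:10])
```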
                        


MODEL/ID: WEIGHTED-RANDOM
MAE: 3.93757
RMSE: 4.192413
                        


MODEL/ID: 7500500
MAE: 0.766031
RMSE: 1.596316
                        

By MAE, the SVD model is ~5 times more accurate than the Weighted-Random model (0.77 vs 3.94); by RMSE, ~2.6 times (1.60 vs 4.19).

Conclusions

Big Data != Better Data

Model Storage Size

Predict Offline, Recommend Online
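
One way to read "Predict Offline, Recommend Online" is to precompute each user's top-N products in a batch job and serve them with a simple lookup. The sketch below follows that reading, which is an interpretation of the slide; `precompute_top_n`, `recommend` and the toy scores are hypothetical.

```python
def precompute_top_n(users, items, predict, n=3):
    """Offline step: rank all items per user by predicted rating, keep the top n."""
    table = {}
    for user in users:
        ranked = sorted(items, key=lambda item: predict(user, item), reverse=True)
        table[user] = ranked[:n]
    return table


def recommend(table, user):
    """Online step: a dictionary lookup, no model evaluation needed."""
    return table.get(user, [])


# Toy predicted ratings standing in for a trained model.
scores = {("u1", "a"): 4.5, ("u1", "b"): 2.0, ("u1", "c"): 3.5,
          ("u2", "a"): 1.5, ("u2", "b"): 4.0, ("u2", "c"): 4.8}
table = precompute_top_n(["u1", "u2"], ["a", "b", "c"],
                         lambda user, item: scores[(user, item)], n=2)
print(recommend(table, "u1"), recommend(table, "u2"))
```

A real system would also skip products the user has already bought or rated; that filter is left out here to keep the sketch short.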

FIN

tomaspdc.github.io/amazon-recsys
