Filters information
Discovers preferences on a subject
Measures similarities between users
Over 144 million active customer accounts.
(~2.27 times the population of the UK)
Over 222 million products on sale.
426 items sold per second.
(Christmas 2013)
Customers can rate the products they bought on a scale of 1 to 5.
~34.5 million reviews.
~6.5 million users.
~2.5 million products.
Spanning from Jun 1995 to Mar 2013
Source: http://snap.stanford.edu/data/web-Amazon.html
Permission granted by Julian McAuley (jmcauley@cs.stanford.edu)
J. McAuley and J. Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. RecSys, 2013.
product/productId: B00005X3U4
product/title: The voice of Bugle Ann
product/price: unknown
review/userId: A169ZYI77GT1F3
review/profileName: Janet K. May
review/helpfulness: 0/0
review/score: 5.0
review/time: 1288051200
review/summary: Childhood Memories
review/text: My husband remembered this as a little boy. He tried to find one in the library but they had none. What a surprise he had on his birthday and thoroughly enjoyed it again. Brought a lot of old memories and stories to be told.
...
B000HEKTIW::A7EWCPD8COL3X::5::4.99::2/2::1292889600
B00005AQF1::A19CQRD6DIHMQL::5::unknown::0/3::1124409600
B000DZH89I::A2POGVCWFR6738::2::unknown::0/0::1358208000
B0007HEURA::A3C2A3D2KG1F1A::5::unknown::2/2::1266796800
B0002DJNNA::A1MFR5PGMZFQPX::1::5.93::0/1::1290297600
B003Y6ID2Y::ATGPAY0V61JO7::5::2.99::0/0::1178928000
B00029BM6A::A7M0T2XJM74DN::5::unknown::0/0::1333929600
B0000DD75Q::A1BKIHESLDFD95::4::9.89::3/3::1180656000
B743504704::A1IE6VWY0U0VNT::3::unknown::0/0::1204156800
B000E0C6SK::A16QQ78I8J29PA::4::unknown::3/3::1275264000
...
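The conversion from the raw multi-line record to the compact "::"-separated line can be sketched as below. This is a minimal sketch, not the original conversion script: the field order (productId, userId, score, price, helpfulness, time) is inferred from the sample lines above, and the function names parse_record and to_compact_line are hypothetical.

```python
def parse_record(lines):
    """Turn the 'key: value' lines of one raw review into a dict."""
    record = {}
    for line in lines:
        # Split on the first ': ' only, so colons inside the value survive.
        key, sep, value = line.partition(": ")
        if sep:
            record[key.strip()] = value.strip()
    return record


def to_compact_line(record):
    """Join the fields kept in the compact format with '::'.

    Assumption: the score is stored as an integer (5.0 becomes 5) and
    the title, profile name, summary and text are dropped.
    """
    score = int(float(record["review/score"]))
    fields = [
        record["product/productId"],
        record["review/userId"],
        str(score),
        record["product/price"],
        record["review/helpfulness"],
        record["review/time"],
    ]
    return "::".join(fields)
```

Applied to the sample record shown earlier, the price stays "unknown" because the sketch copies that field verbatim.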
True Blind subset: Random sample of ~9.8 million reviews
Second Blind subset: Random sample of ~6.5 million reviews
Training/Test sets: 80%/20% splits of random incremental samples, growing in steps of 100k reviews from 100,500 reviews to 11 million reviews.
That is, a 100,500-review subset, a 200,500-review subset, and so on.
One SVD model was computed for each Train/Test subset.
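The subset sizes and the 80%/20% split described above could be produced along these lines. This is only a sketch under stated assumptions: "incremental" is read here as each larger subset extending the previous one (a prefix of a single shuffle), and the names subset_sizes and incremental_split are hypothetical.

```python
import random


def subset_sizes(start=100_500, stop=11_000_000, step=100_000):
    """Sizes 100,500, 200,500, ... that do not exceed `stop`."""
    return list(range(start, stop + 1, step))


def incremental_split(reviews, size, train_percent=80, seed=0):
    """Take the first `size` reviews of one fixed shuffle, then split them.

    Using the same seed for every size means each larger subset
    contains the smaller ones (the assumed meaning of 'incremental').
    """
    rng = random.Random(seed)
    shuffled = list(reviews)
    rng.shuffle(shuffled)
    subset = shuffled[:size]
    # Integer arithmetic avoids floating-point rounding on the cut point.
    cut = len(subset) * train_percent // 100
    return subset[:cut], subset[cut:]
```

One model would then be trained on the first part and evaluated on the second part of each subset.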
5-fold averages of Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).
MAE and RMSE were also measured against the Second Blind subset.
The model with the lowest Second Blind MAE was chosen.
MAE and RMSE were measured against the True Blind subset.
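For reference, the two error measures and the selection step can be sketched as follows. The functions mae, rmse and select_best_model are hypothetical names for illustration; the sketch assumes each model's blind-set result is available as a (MAE, RMSE) pair keyed by model ID.

```python
import math


def mae(actual, predicted):
    """Mean Absolute Error: average of |actual - predicted|."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)


def rmse(actual, predicted):
    """Root Mean Squared Error: square root of the mean squared error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def select_best_model(results):
    """Return the model ID with the lowest MAE.

    `results` maps a model ID to a (mae, rmse) tuple measured on the
    Second Blind subset (assumed layout).
    """
    return min(results, key=lambda model_id: results[model_id][0])
```

RMSE penalizes large misses more heavily than MAE, which is why the two values can differ noticeably for the same model.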
MODEL/ID: 7500500
MAE: 0.766031
RMSE: 1.596316
Baseline model: guess at random, weighted:
1 "star" ~7.62%
2 "stars" ~5.13%
3 "stars" ~8.55%
4 "stars" ~19.38%
5 "stars" ~59.29%
MODEL/ID: WEIGHTED-RANDOM
MAE: 3.93757
RMSE: 4.192413
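A weighted-random guesser of the kind described above might look like this. It is a sketch, not the code behind the reported figures: the weights are the star percentages listed on the slide, and the name weighted_random_guess is hypothetical.

```python
import random

# Share of each star rating, in percent, as listed above.
STAR_WEIGHTS = {1: 7.62, 2: 5.13, 3: 8.55, 4: 19.38, 5: 59.29}


def weighted_random_guess(rng=random):
    """Draw one rating from 1 to 5 in proportion to STAR_WEIGHTS.

    random.choices normalizes the weights, so they do not need to
    sum to exactly 100.
    """
    stars = list(STAR_WEIGHTS)
    weights = list(STAR_WEIGHTS.values())
    return rng.choices(stars, weights=weights, k=1)[0]
```

Passing a seeded random.Random instance as rng makes the guesses reproducible.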
MODEL/ID: 7500500
MAE: 0.766031
RMSE: 1.596316
Measured by MAE, the SVD model's error is ~5 times lower than the Weighted-Random model's (~2.6 times lower by RMSE).
Big Data != Better Data
Model Storage Size
Predict Offline, Recommend Online