Erik Brynjolfsson and Michael Smith, the MIT economists who led the team that did the research we used as the basis of our original Amazon Long Tail sales estimate, have responded to my previous post on a new methodology for estimating Amazon's Long Tail sales. Bottom line, their latest thinking and work on this still suggests an estimate between 35% and 40%, which is at the higher end of Rosenthal's range. But if top-100 sales at Amazon are much higher than the model suggests, the number could be closer to the mid-20%s. The next step in this research is to gather more hard data, especially for bestsellers, to refine the models. Fortunately it looks like we may be able to get that in time for the book.

Professors Brynjolfsson and Smith write:

The blog is a great way to have a discussion on the growing importance of the Long Tail, and it was nice to see Morris Rosenthal update the analysis. Here are some quick thoughts on Amazon's long tail to add to the discussion:

1. In our original 2002 estimates in the paper with Jeffrey Hu, we used data from a large book publisher (who asked to remain anonymous) correlating weekly sales quantities with average weekly sales rank for 321 book titles tracked over several weeks in the summer of 2001. We ended up with a total of 861 points, spanning ranks from below 250 up to about 1,000,000. We found that these points fit remarkably well to a Pareto (a.k.a. log-log) curve and used that fit to come up with our estimate that 39.2% of Amazon's book sales fall in titles with ranks above 100,000. (Incidentally, 100,000 is the number of unique titles at a typical Barnes and Noble Superstore; the number is closer to 40,000 for the average bookstore.)

2. This estimate relies on an assumption that the Pareto curve does a good job of approximating both the very top (ranks <250) and very bottom (ranks >1,000,000) of the curve. If this isn't true, it may over- or underestimate true sales. It also relies on an assumption about the number of unique titles at Amazon, which we took to be 2,300,000 in our paper (at the time, the number of books in print). Using 2,000,000 would give an estimate of 37.8%; using 2,500,000 would give an estimate of 39.9%.

3. Others have created similar estimates. Most notably, at about the same time as our work, Judy Chevalier and Austan Goolsbee developed an experiment to approximate the fit of a Pareto curve by purchasing titles from Amazon and observing the sales and ranks before and after the experiment. They also cite two other independent estimates available at the time using similar experiments. (You are too generous in giving us the credit for developing this technique in your original blog entry, although we did incorporate a version of it into subsequent drafts of our paper.) We think the larger sample of data from the publisher is more reliable, although both approaches yield broadly similar estimates, as we note in the paper. In mid-2004, Michael Smith conducted his own calibration of Amazon's Pareto using the experiment-based technique. Applying the Pareto estimates from these four sources to our calculations above would give Long Tail estimates ranging from 27.7% to 44.5% of Amazon's sales.

4. Thus, we believe estimates in the range of 35-40% are probably the best place to start. This encompasses both our initial estimate of 39.2% and Mr. Rosenthal's estimate of 36% using Amazon's most recent ranking algorithm. But you could convince us of estimates as low as 25% if the Pareto assumptions are violated in the highest- or lowest-ranked books. (However, we found our original data matched the Pareto assumptions well, and if anything had a slightly fatter than normal long tail.)

In any event, I agree with the spirit of your blog comments -- let's not lose the forest for the trees. Wherever the true number lies in this range, the value generated by Long Tail markets to consumers and smart producers is substantial. In our original work, we placed the value of access to "The Long Tail" in books at around $1 billion per year to consumers. It's undoubtedly much higher now. And this value -- combined with value generated from the other long tail industries that your work has identified -- is the real story.

We're continuing to do some work in this area with new data and methods. We don't have anything to report just yet, but early results suggest that we're all on the right track regarding the growing importance of this phenomenon.
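The calculation the professors describe (fit a Pareto in log-log space, then integrate it out to Amazon's full catalog) can be sketched in a few lines of Python. The data below are synthetic stand-ins for the 861 publisher points, which aren't public, and the 0.871 slope is simply one that reproduces the quoted 39.2% figure, so treat all of this as illustrative rather than a re-derivation of their result:

```python
import numpy as np

# Synthetic (rank, sales) points standing in for the 861 publisher
# observations described above -- the real data are not public.
rng = np.random.default_rng(0)
log_ranks = rng.uniform(np.log(250), np.log(1_000_000), 861)
log_sales = 10.5 - 0.871 * log_ranks + rng.normal(0.0, 0.5, 861)

# A Pareto relationship q = A * rank**(-b) is a straight line in
# log-log space, so ordinary least squares on the logs recovers b.
b_fit = -np.polyfit(log_ranks, log_sales, 1)[0]

def tail_share(b, n_titles, cutoff):
    """Fraction of total sales at ranks above `cutoff`, integrating
    A * r**(-b) over ranks 1..n_titles (valid for b < 1)."""
    e = 1.0 - b
    return (n_titles ** e - cutoff ** e) / (n_titles ** e - 1.0)

b_paper = 0.871  # illustrative slope that reproduces the 39.2% estimate
print(f"fitted exponent: {b_fit:.3f}")
print(f"share above rank 100,000: {tail_share(b_paper, 2_300_000, 100_000):.3f}")
for n_titles in (2_000_000, 2_500_000):
    print(n_titles, round(tail_share(b_paper, n_titles, 100_000), 3))
```

The final loop shows the sensitivity the letter mentions in point 2: shrinking or growing the assumed catalog moves the long-tail share a point or two either side of 39%.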

Professors Brynjolfsson and Smith are satisfied with the Pareto distribution, but as I described in the comments on the previous Long Tail posting, I don't believe it can be applied to the old sales rank system (in use until October 2004), on which all of their analysis and the related studies were based. In short, the Pareto equation is continuous; the Amazon sales-rank function was not. They simply used points from the middle couple decades of the curve where the Pareto function happened to work OK.

In addition, their analysis put Amazon's U.S. sales rate for 2001 at 99.4 million books per year. Amazon's total U.S. media sales for 2001 (Books, CDs, and DVDs/Video) were $1.688 billion; the book portion works out to about $1.17 billion, which yields an average Amazon selling price of $11.77 per book. The same M.I.T. paper put the average selling price of a book at Amazon between $29 and $41, which would mean the area under their curve was off by a factor of three, or 300%.
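The arithmetic here is easy to check (note that the $1.17 billion book share is an estimate, not a figure Amazon reports separately):

```python
# Average selling price implied by the figures above.
book_revenue = 1.17e9  # estimated book share of Amazon's $1.688B 2001 US media sales
units_sold = 99.4e6    # 2001 unit volume implied by the M.I.T. model
asp = book_revenue / units_sold
print(f"${asp:.2f} per book")  # prints "$11.77 per book"
```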

I have to admit this is my first experience trying to carry on a conversation via comments on a long blog post, and I suspect these particular trees kept getting lost in the forest.

Posted by: Morris Rosenthal | August 09, 2005 at 06:31 AM

I appreciate Morris Rosenthal’s interest in our paper and welcome any suggestions for improving the research. He makes a number of points in his August 9 post, but unfortunately they require some clarification.

1. The Pareto distribution is widely used for discrete data like book sales ranks -- Vilfredo Pareto's first use was for wealth of individuals. Individuals are discrete (except, perhaps, those assimilated by the Borg). The Pareto distribution is also commonly fit to the size ranks of cities, sand particles, meteorites, and numerous other discrete data. More broadly, the essence of econometrics is fitting continuous lines and curves to discrete data, and in this case, the Pareto fit happened to be unusually good. If Mr. Rosenthal wants to propose an alternative functional form that fits the old or new book sales rank data even better, we'd love to see it.

2. I’m puzzled that he takes us to task for using “points from the middle couple decades of the curve where the Pareto function happened to work OK,” yet our 861 data points run from less than 250 to around 1,000,000, basically the same span as his dataset, which, from the description at http://www.fonerbooks.com/surfing.htm, seems to run from 1,000 to 1,000,000.

3. Mr. Rosenthal also notes that our estimate implied a 2001 sales rate of 99.4 million books for Amazon. In comparison, his estimate is 101 million. In my experience, 99.4 ~ 101 in most of the social sciences, but I’ll grant that we each may be off by 2% or more, which would translate into a somewhat smaller potential error in the size of the long tail.

4. A careful reader of our paper would notice that we do NOT use the Dealtime data, with prices of $29 to $41, as the basis for our calculations as Mr. Rosenthal implies (see the text above table 5 at http://ssrn.com/abstract=400940 and equation 9). Instead, we use the average selling price for this purpose, exactly as he advocates. The area under the curve was not off by “a factor of three”, although we did offer error bands of about 30% in the paper.

Bottom line: The forest of the long tail remains visible to anyone who steps back and looks at the big picture, even if particular trees are occasionally lost. Our results in the paper (since peer-reviewed and published in the journal Management Science) were indeed correct with a reasonable margin of error, though they are at times misquoted or misunderstood.

However, I can only agree that carrying on conversations via blog posts can be frustrating, and I welcome Mr. Rosenthal to email or call us if he has additional questions or suggestions on our research (and especially if he has data he’d like to share!).

Posted by: Erik Brynjolfsson | August 09, 2005 at 03:39 PM

Ah, communications. I assume I'll be able to find your e-mail somewhere and write directly, but I may as well respond to your numbered list online.

1) I don't propose an alternative to the Pareto function for the old ranking system. The result of the overlapping Amazon ranking systems was not amenable to a single power-law function. I don't understand why you all assumed it was.

2) Data from 250 to 1,000,000 on a log graph spans between 3 and 4 decades. I would define 3 or 4 as "a few." Depending on the number of data points you had in the range from 250 to 1,000, or close to 1,000,000, you may well find you're closer to 3 than 4. That said, the graph you are looking at on my site corresponds to the new ranking system. If you read through to the bottom of the page, you'll get a description of how the old system worked, plus my old graph, which covered 7 decades. I eventually split the head of the curve into multiple lines to drive home the fact that it was a moving target.

3 + 4) The only average price information I find in your paper is in Table 5, where you give the average Amazon prices you observed for ranks under 100,000 and over 100,000 as $29.26 and $41.60. I don't follow your Dealtime comment. My assertion that the area under your curve was off by a factor of 3 by your own data takes the average price to be on the low side of your spread, $33. Using revenue data from Amazon's financial reports for that period and the number of titles you estimated they sold yields an average selling price of about $11.00. If $11 doesn't go into $33 three times, I'll admit defeat. If it does, you have a factor of three to explain :-)
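Both bits of arithmetic in this exchange take one line apiece (a "decade" on a log plot is one power of ten):

```python
import math

# Decades spanned by ranks 250 to 1,000,000 on a log10 axis.
decades = math.log10(1_000_000 / 250)

# Ratio of the two average-price figures in dispute.
ratio = 33 / 11

print(round(decades, 2), ratio)  # prints "3.6 3.0"
```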

Morris

Posted by: Morris Rosenthal | August 09, 2005 at 05:23 PM

0. My email address is erikb (at) mit.edu. You can find it by clicking on my name next to my post and then going to my home page.

1. Glad to hear you don't seem to object to research examining whether the Pareto is a good fit. We didn't assume it necessarily would be a fit, but we did examine this hypothesis. What we found, and reported, was that this very simple equation has an R^2 of over 80% for these 861 data points. Perhaps you don't find that worth publishing. No problem.

2. Yes, I was comparing our results to your newest results. They seem to use a comparable span of data. And I heartily commend you on the span of data you chose to use!

3. A careful reading of our paper will reveal that we do not use the $29.26 and $41.60 figures (which, as we note next to table 5, are from Dealtime) to estimate the value of sales in the Long Tail. (For the record, we use these figures to support our conjecture that average prices in the "tail" are not lower than at the "head". This allows us to allocate a proportional amount of total revenues to the tail and compute the total consumer surplus using equation 9 and total revenues, and implicitly Amazon's -- NOT Dealtime's -- overall average selling prices). I'm not sure what you think we are using $33 for, but if you Read The Fine Manuscript you should get a clearer idea of our methods, which we tried our very best to describe carefully to the interested reader.

4. Yes, you are absolutely, positively correct that 33 is three times 11. Unfortunately for both of us, this has little or nothing to do with our analysis or results. To be as clear as possible: we could omit the Dealtime numbers entirely from the paper and the basic calculations for the Long Tail would be unchanged. Please see equation 9 and the rest of the detailed methodology if you are genuinely interested in learning what we did.

I know your goal is to help illuminate the blogosphere on this topic but I think it would be most useful if you carefully read the paper (and/or ask one of the authors to explain it) before posting your interpretations. For my part, I apologize for not writing the paper (and my postings) more clearly.

Please let me buy you a cup of coffee if you come to Cambridge and I'll go into as much detail as you like. Perhaps we can jointly analyze some of the fascinating new data you have.

Posted by: erikbrynjolfsson | August 09, 2005 at 06:52 PM

In case anybody is actually following this thread, Erik and I are now in a direct correspondence and hope to arrive at some mutually agreed conclusion:-)

Morris

Posted by: Morris Rosenthal | August 10, 2005 at 11:15 AM