Review of Statistics Done Wrong by Alex Reinhart
Since the advent of the big data era, organizations have been crying out for data scientists. Initially the goal was to find the true data scientist, but because such people were scarce, the search broadened to anyone who could hand-code analytical models in a Hadoop environment. As statistical tools such as R, Alteryx, and RapidMiner augmented Hadoop environments, the net widened again to include traditional tools such as SAS and SPSS. These data scientists, or data scientists in training, were asked to take large amounts of data and divine the nuggets within: a “cross-sell/up-sell” recommendation engine that would launch the next Netflix, or a link between two disparate data sets that would explain how markets interact and reveal the next groundbreaking investment opportunity.
What organizations did not necessarily do was provide these data scientists with guidance on using those tools effectively. They expected that level of experience to come with the “package” of a data scientist. When organizations instead attempted to grow their own, the learning curve was steep. Often, executive management said “go find me something no one has” and provided an analytical toolset and a BUNCH of disparate data. This approach can lead to amazing discoveries . . . and it can also lead to “analytical disaster” when correlation among attributes in a data set is mistaken for cause and effect (a trap sketched below). The use of advanced analytical models is as much art as it is science.
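To make that trap concrete, here is a short Python sketch of my own (not from the book, and using made-up metric names): two series that merely share a time trend will correlate strongly even though neither drives the other, and the apparent relationship vanishes once the shared trend is removed.

```python
# Two unrelated metrics that both happen to grow over time will
# correlate strongly despite having no causal link.
import numpy as np

rng = np.random.default_rng(42)
n = 100
trend = np.arange(n)

# Hypothetical example: monthly ice-cream sales and monthly
# sunscreen complaints, each driven only by the seasonal trend.
sales = 50 + 2.0 * trend + rng.normal(0, 10, n)
complaints = 5 + 0.3 * trend + rng.normal(0, 3, n)

r = np.corrcoef(sales, complaints)[0, 1]
print(f"correlation: {r:.2f}")  # typically > 0.9, yet neither causes the other

# Differencing removes the shared trend; the "relationship" disappears.
r_diff = np.corrcoef(np.diff(sales), np.diff(complaints))[0, 1]
print(f"correlation after differencing: {r_diff:.2f}")  # near 0
```

An analyst handed a BUNCH of disparate data would be tempted to report the first number as a discovery; the second shows why that confidence would be misplaced.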
What Statistics Done Wrong does is give the novice, and even the expert, excellent examples of how to avoid overpromising and underdelivering with statistical analysis. Alex Reinhart does not set out to teach the reader how to be a statistician or a data scientist; rather, he offers common-sense ways to avoid the pitfalls of standard statistical analysis.
Reinhart walks the reader through multiple situations, with excellent examples, where making common assumptions about statistics can lead to overconfidence in the analytical results. From p values to regression to omitted data, each scenario gives data scientists and analysts the kind of lessons that usually come only with years of experience. For those looking for a resource that points out and documents those lessons, Statistics Done Wrong is an excellent reference and, rarer still for a statistics book, a pleasure to read.
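One of those lessons is easy to demonstrate for yourself. The following Python sketch (my own illustration, not Reinhart's) shows the overconfidence that p values invite: run enough hypothesis tests on pure noise and some will cross the p < 0.05 threshold by luck alone.

```python
# Repeatedly compare two groups drawn from the SAME distribution.
# No real effect exists, yet some tests still come out "significant".
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_tests, n_samples = 100, 30

false_positives = 0
for _ in range(n_tests):
    a = rng.normal(0, 1, n_samples)
    b = rng.normal(0, 1, n_samples)
    _, p = stats.ttest_ind(a, b)  # two-sample t-test
    if p < 0.05:
        false_positives += 1

# At a 0.05 threshold, roughly 5 of 100 tests will be false alarms.
print(f"{false_positives} of {n_tests} tests 'significant' with no real effect")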