Design Paradigms in R
Lately I have been developing a deep curiosity of the origins of the R language. I have since read a more from the WayBack Machine than a Master’s student probably should. There are four documents that I believe to be extremely foundational and most clearly outline the original philosophies underpinning both R and its predecessor S. These are Evolution of the S Language (Chambers, 1996), A Brief History of S (Becker), Stages in the Evolution of S (Chambers, 200), and R: Past and Future History by Ross Ihaka (1998). The readings have elicited many lines of thought and potential inquiry. What interests me the most at present is the question “how have the design principles of S and R manifested themselves today?”
There are a number of design principles that I believe still exist in the R language today. Two of these find their origin in the development of S and the third was most clearly emphasized in the development of R. These are, in no particular order, what I believe to be the design principles that are most prominent in the modern iteration of R. The software should:
- be an interface language;
- be performant;
- and make users feel like they are performing analyses.
Identifying R’s evolution as an interface language is a rather objective task. Assessing these latter two principles is a subjective task, but one I will endeavor nonetheless.
The second principle is stated clearly by Ihaka (1998).
“My own conclusion has been that it is important to pursue efficiency issues, and in particular, speed.”
In my view of what is the prominent landscape of R today, few packages have paid as much attention to this principle as data.table. data.table’s performance is beyond reproach. The speed at which one can subset, manipulate, order, etc. using data.table is remarkable. Recent speed tests by package author Matt Dowle illustrate this. In performance testing, data.table regularly outperforms other existing tools such as Spark, dplyr, and pandas—spark being an exceptionally notable one.
This brings us to the third point which actually finds its origination in the development of S. John Chambers in Stages in the Evolution of S wrote that
“The ambiguity is real and goes to a key objective: we wanted users to be able to begin in an interactive environment, where they did not consciously think of themselves as programming.”
This is a helpful reminder that S was developed with the intentions of creating a tool for statistical analysis. In the design and implementation of S, the emphasis was doing analysis not programming. From this paradigm is where I think the tidyverse came—perhaps unintentionally, but consequently nonetheless.
The development of the tidyverse has, I believe, adhered to this principle. There is what seems to be a constant and conscious consideration to the useR experience. dplyr, for example, has developed a way to perform an analysis that is clear in intent and, to an extent, that can be read in a linguistically cogent manner.