Subject: Re: Eight Features of an Ideal Intro Stat Course (Response to comments by Herman Rubin)
To: EdStat-L and sci.stat.edu
From: Donald B. Macnaughton <donmac@matstat.com>
Date: Sunday May 16, 1999
Cc: Herman Rubin <hrubin@b.stat.purdue.edu>

Quoting a 98/7/23 post of mine, Herman Rubin writes (on 98/8/3)

> Donald Macnaughton ... wrote:
>
>> In a July 17 post I recommend that teachers emphasize the
>> concept of a relationship between variables and I recommend
>>
>> a de-emphasis of less important topics such as univariate
>> distributions ...
>
> As such, I agree about the point on univariate distributions.
> One does not need a catalog of the standard ones, nor [does one
> need to] be adept at calculating them.
>
> HOWEVER, on consideration of the actual problems, they are an
> essential tool.

I fully agree that univariate distributions are an essential tool in actual statistical problems -- most statistical analyses depend directly on concepts of univariate distributions.

However, as Herman may agree, the ubiquity of univariate distributions in statistical analyses does NOT speak to whether a teacher should discuss univariate distributions near the beginning of an introductory statistics course when the course is aimed at students who are NOT majoring in statistics. I explain why I recommend that discussion of univariate distributions be omitted near the beginning of such courses in some Usenet posts (1998a, 1999a).

>
> The real problem is, what is needed to discuss relations?

Herman is using the term "relations" to refer to what I call "relationships between variables". I compare the terms "relationship" and "relation" in appendix A.

I agree with Herman that an important problem is to clarify the conceptual underpinnings of relation(ship)s between variables.

>
>> ( snip )
>> For example, students in high school physics courses learn
>> about the relationship between acceleration (a) and force (f)
>> with the model equation
>>
>>      f = ma
>>
>> where m is the mass of the body being accelerated.
>
> I agree that there is not TOO much problem with understanding
> this, but the way algebra is taught, I would not be that sure.

Many students seem to understand the relationship between variables implied by f = ma (Isaac Newton's second law of motion). Students also understand other similar model equations they study in science classes. But, unfortunately, most students do NOT seem to understand the broad importance of the *general* concept of a relationship between variables. (I discuss an approach to teaching the concept of a relationship between variables to students in three papers [1996, 1998b, 1999b].)

>
> But do RANDOM VARIABLES have relations like this? Taking the
> classical Galton observations on heights of fathers and heights
> of sons, there is no such relation.

Herman's conclusion that there is no relation(ship) between the heights of fathers and the heights of sons in the population of families from which Galton (1886; 1889, chap. 6-7) drew his sample depends on how we define the concept of a relationship between variables. Herman recognizes this and proposes two types of definition:

>
> The user who understands statistical problems is likely to
> formulate a relation as either a multivariate distribution or as a
> conditional univariate distribution, most of the time with
> unknown parameters.

I agree that it is possible to define the concept of 'relationship between variables' in terms of multivariate or conditional univariate distributions. However, another simpler way of defining the concept is also available -- in terms of conditional expected value:

     DEFINITION: There is a *relationship* between the variables x and y if for at least one value x' of x

          E(y|x') ~= E(y)                              (1)

     where
          E(*) is the expected value operator
          E(y|x') is the expected value of y given that x has the value x'
     and
          ~= stands for "is not equal to".

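As a rough illustration of definition (1), the following Python sketch compares sample estimates of E(y|x') with a sample estimate of E(y). It does not use Galton's data: the pairs are synthetic, generated under an assumed linear-plus-noise model, and the function names, the slope, and the noise level are all hypothetical choices made for the example.

```python
import random
from statistics import mean

def simulate_pairs(n, slope, seed=0):
    """Generate n synthetic (x, y) pairs under an assumed model y = slope*x + noise."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x = rng.choice([1, 2, 3])          # predictor takes a few discrete values
        y = slope * x + rng.gauss(0, 1)    # response: linear trend plus Gaussian noise
        pairs.append((x, y))
    return pairs

def conditional_means(pairs):
    """Return {x': sample mean of y among the pairs whose x equals x'}."""
    groups = {}
    for x, y in pairs:
        groups.setdefault(x, []).append(y)
    return {x: mean(ys) for x, ys in sorted(groups.items())}

pairs = simulate_pairs(n=3000, slope=2.0)
overall = mean(y for _, y in pairs)        # sample estimate of E(y)
for x_val, m in conditional_means(pairs).items():
    print(f"x' = {x_val}: mean(y | x') = {m:.2f}   overall mean(y) = {overall:.2f}")
```

With a nonzero slope the conditional means differ visibly from the overall mean for at least one x', which is the situation (1) describes. With slope=0.0 they agree up to sampling noise. Deciding whether an observed difference is more than noise is the job of a statistical test.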
Defining the concept of 'relationships between variables' in terms of conditional expected value leads to a simpler definition than the definitions Herman proposes above because the expected-value approach replaces the complicated concept of 'distribution' with the simpler concept of 'expected value'.

Herman implies above that there is no relation(ship) between the heights of the fathers and the heights of the sons in Galton's population. However, under the definition I give above, it can be easily shown (in terms of a low p-value in a statistical test) that there IS a "relationship" in Galton's population between the heights of the fathers (x) and the heights of the sons (y).

(Although it is not necessary to take account of the concept of a distribution in the definition of a relationship between variables, if we wish to *perform the statistical test* I refer to above to check whether there is convincing evidence in Galton's data of a relationship between the heights of the fathers and the heights of the sons, we do need to take account of the distributions of the values of the response variable [i.e., y = "height of the son"] for given values of the predictor variable [x = "height of the father"]. Of course, much statistical machinery is available to take account of these distributions in performing the statistical test.)

(I discuss issues pertaining to the choice of an appropriate statistical test for the Galton data in appendix B.)

> ( snip )
> Considering the problems with interpreting multivariate data,
> not starting with a distributional type of assumption, even if
> the form of the distribution is largely unspecified, is likely
> to lead to quite inappropriate analysis.

Although the definition above of the concept of a relationship between variables makes no reference to distributions, it leads (as far as I can see) to fully appropriate analyses.
Furthermore, since the approach makes no use of multivariate distributions, it bypasses all "the problems with interpreting multivariate data" Herman refers to.

I further discuss defining the concept of 'relationship between variables' in terms of conditional expected value (and I propose a definition of "expected value") in a paper (1996, sec. 7.10).

I thank Herman for his thought-provoking comments.

-------------------------------------------------------
Donald B. Macnaughton   MatStat Research Consulting Inc
donmac@matstat.com      Toronto, Canada
-------------------------------------------------------


APPENDIX A: TERMINOLOGY: SHOULD IT BE "RELATIONSHIP" OR "RELATION" BETWEEN VARIABLES?

In a paper I discuss whether we should use the preposition "between" or the preposition "among" in the phrase "relationship ... variables" and I conclude that "between" is preferred in most situations (1999b, app. C).

Similarly, following Herman's remarks above, we can ask whether the phrase should be:

     relationship between variables

or

     relation between variables.

To help resolve this issue of terminology, let me first present some dictionary definitions of the terms "relationship" and "relation" since these definitions show how the terms are commonly used by speakers of English. The second edition of the Oxford English Dictionary (OED) defines the relevant senses as:

     relationship
     The state of being related; a condition or character based upon this; kinship.

     relation
     3. That feature or attribute of things which is involved in considering them in comparison or contrast with each other; the particular way in which one thing is thought of in connexion with another; any connexion, correspondence, or association, which can be conceived as naturally existing between things.

Note that the OED lexicographers define the relevant sense of a relationship first as a state, second as a condition, and last as a character (i.e., a property).
On the other hand, they define the relevant sense of a relation first as a property (feature or attribute), second as a way of thinking, and last as a condition or state (connexion, correspondence, or association).

The 1993 Random House Unabridged Electronic Dictionary defines the relevant senses of the two terms as:

     relationship
     1. a connection, association, or involvement.

     relation
     1. an existing connection; a significant association between or among things: "the relation between cause and effect".

These definitions suggest that the Random House lexicographers do not see much difference between the two terms.

Merriam-Webster's Collegiate Dictionary (tenth edition, 1993) defines the relevant senses of the two terms as:

     relationship
     1. the state of being related or interrelated <studied the *relationship* between the variables>

     relation
     2. an aspect or quality (as resemblance) that connects two or more things or parts as being or belonging or working together or as being of the same kind <the *relation* of time and space>; specifically : a property (as one expressed by "is equal to", "is less than", or "is the brother of") that holds between an ordered pair of objects

Since the Merriam-Webster lexicographers actually cite the phrase "relationship between variables", it is clear which word they view as being more naturally used in the phrase. Note that the Merriam-Webster definitions and the OED definitions are essentially the same -- a relationship is mainly a state and a relation is mainly a property (feature, attribute, aspect, or quality).

It seems more reasonable to me to view a relation(ship) between variables as a *state* or condition than to view it as a property of the situation, although the latter point of view is possible. Thus the dictionary definitions (as they reflect common usage) suggest to me that the word "relationship" is more appropriate than the word "relation" for use in the phrase "relation(ship) between variables".
However, the word "relation" is shorter than "relationship", which I (as a writer) view as a significant advantage. Also, the use of the phrase "relation between variables" does not seem to lead to confusion or misunderstanding. Thus although I believe the term "relationship" is currently preferred, it seems possible (and reasonable) that idiom will migrate to the phrase "relation between variables".

My informal sense of the frequency of use of the two terms in relevant statistical contexts is that the term "relationship" receives substantially more use than the term "relation", but the latter term is used by several writers of note. For example, the term "relation" occurs at several places in an important book edited by John Bailar and Frederick Mosteller (1992, pp. 27, 215, 294, 296, 306, 328), although they also allow their authors to use the term "relationship" (pp. 10-11).


APPENDIX B: TESTING FOR A RELATIONSHIP BETWEEN VARIABLES IN GALTON'S DATA

I suggest above that we can perform a statistical test on Galton's data to determine whether there is evidence of a relationship in the population between the heights of the fathers (x) and the heights of the sons (y). Note that in actually performing such a test we need not perform a test that directly tests the inequality stated above in (1) since we can easily derive from (1) other equivalent conditions we can test. If any of these other conditions is satisfied, we can easily show that (1) is also satisfied.

For example, we can test whether

     E(y|x1) ~= E(y|x2)                                (2)

where we might choose x1 and x2 to be as far apart as possible since (if the relationship is strictly monotonic, as many relationships are) this will give us (with other things being equal) a more powerful test of the existence of the relationship than if we use x1 and x2 closer together. If we can show that (2) is satisfied, it follows that so also is (1).

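A test of condition (2) can be sketched in Python as a two-sample comparison of the mean of y at two chosen values x1 and x2. This is only one possible approach, offered under stated assumptions: the heights below are synthetic (not Galton's data), the group means, spread, and sample sizes are invented for illustration, the function name is hypothetical, and the p-value uses a large-sample normal approximation rather than an exact t distribution.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, variance

def two_sample_z_test(ys1, ys2):
    """Two-sided large-sample test of equal means, allowing unequal variances.

    Returns (statistic, approximate p-value). The normal approximation
    is an assumption that is reasonable only for fairly large samples.
    """
    se = sqrt(variance(ys1) / len(ys1) + variance(ys2) / len(ys2))
    z = (mean(ys1) - mean(ys2)) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Synthetic sons' heights (inches) for fathers at two assumed heights x1 < x2.
rng = random.Random(1)
sons_x1 = [rng.gauss(67.0, 2.5) for _ in range(200)]  # sons of shorter fathers
sons_x2 = [rng.gauss(70.0, 2.5) for _ in range(200)]  # sons of taller fathers

z, p = two_sample_z_test(sons_x1, sons_x2)
print(f"z = {z:.2f}, approximate p = {p:.3g}")
```

A small p-value here is evidence that E(y|x1) and E(y|x2) differ. Since both cannot then equal E(y), at least one of them differs from E(y), which is condition (1).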
Alternatively, if there is no compelling evidence that the best line for the relationship between the two variables is not a straight line, we can fit a straight line to the data and then test the hypothesis that the slope of the line in the population is zero. If the data allow us to reject this hypothesis, it is easy to show that (1) is satisfied, and thus we can conclude that a relationship exists between the two variables. On the other hand, if there is good evidence that the best line is NOT straight, this is also evidence of a relationship between the two variables in the sense that it also implies that (1) is satisfied.

(NOTE: In his 1886 and 1889 works, Galton focuses on the relationship between the height of the "mid-parent" and the height of the son, where the height of the mid-parent is a weighted average of the heights of the mother and father. Galton does not focus in these works on the relationship between the height of the *father* and the height of the son. However, the points Herman and I discuss above are independent of whether we view "father's height" or "mid-parent's height" as being the predictor variable in the example.)


REFERENCES

Bailar, J. C., III, and Mosteller, F., eds. 1992. _Medical uses of statistics._ 2d ed. Boston: NEJM (New England Journal of Medicine) Books.

Galton, F. 1886. "Regression towards mediocrity in hereditary stature." _Journal of the (Royal) Anthropological Institute,_ 15, 246-263.

Galton, F. 1889. _Natural inheritance._ London: Macmillan.

Macnaughton, D. B. 1996. "The entity-property-relationship approach to statistics: An introduction for students." Available at http://www.matstat.com/teach/

Macnaughton, D. B. 1998a. "Re: Eight features of an ideal intro stat course (response to comments by Dennis Roberts, Mark Myatt, Rolf Dalin, Gary Smith, and Rossi Hassad)." Posted to sci.stat.edu and EdStat-L beginning on July 23, 1998. Available at http://www.matstat.com/teach/

Macnaughton, D. B. 1998b. "Eight features of an ideal introductory statistics course." Available at http://www.matstat.com/teach/

Macnaughton, D. B. 1999a. "Re: Eight features of an ideal intro stat course (response to comments by Dennis Roberts and Karl L. Wuensch)." Posted to sci.stat.edu and EdStat-L on May 2 and May 9, 1999. Available at http://www.matstat.com/teach/

Macnaughton, D. B. 1999b. "The introductory statistics course: The entity-property-relationship approach." Available at http://www.matstat.com/teach/