Eight Features: Rubin Response

Subject: Re: Eight Features of an Ideal Intro Stat Course
         (Response to comments by Herman Rubin)

     To: EdStat-L and sci.stat.edu

   From: Donald B. Macnaughton <donmac@matstat.com>

   Date: Sunday May 16, 1999

     Cc: Herman Rubin <hrubin@b.stat.purdue.edu>

Quoting a 98/7/23 post of mine, Herman Rubin writes (on 98/8/3):

> Donald Macnaughton ... wrote:
> 
>> In a July 17 post I recommend that teachers emphasize the con-
>> cept of a relationship between variables and I recommend
>>
>>    a de-emphasis of less important topics such as univariate 
>>    distributions ...
>
> As such, I agree about the point on univariate distributions.
> One does not need a catalog of the standard ones, nor [does one
> need to] be adept at calculating them.
>
> HOWEVER, on consideration of the actual problems, they are an
> essential tool.  

I fully agree that univariate distributions are an essential tool 
in actual statistical problems -- most statistical analyses de-
pend directly on concepts of univariate distributions.

However, as Herman may agree, the ubiquity of univariate distri-
butions in statistical analyses does NOT speak to whether a 
teacher should discuss univariate distributions near the begin-
ning of an introductory statistics course when the course is 
aimed at students who are NOT majoring in statistics.  I explain 
why I recommend that discussion of univariate distributions be 
omitted near the beginning of such courses in two Usenet posts 
(1998a, 1999a).

>
> The real problem is, what is needed to discuss relations?

Herman is using the term "relations" to refer to what I call "re-
lationships between variables".  I compare the terms "relation-
ship" and "relation" in appendix A.  

I agree with Herman that an important problem is to clarify the 
conceptual underpinnings of relation(ship)s between variables.

>
>>   ( snip )
>> For example, students in high school physics courses learn
>> about the relationship between acceleration (a) and force (f)
>> with the model equation
>>
>>                            f = ma
>>
>> where m is the mass of the body being accelerated.
>
> I agree that there is not TOO much problem with understanding
> this, but the way algebra is taught, I would not be that sure.

Many students seem to understand the relationship between vari-
ables implied by f = ma (Isaac Newton's second law of motion).  
Students also understand other similar model equations they study 
in science classes.  But, unfortunately, most students do NOT 
seem to understand the broad importance of the *general* concept 
of a relationship between variables.  

(I discuss an approach to teaching the concept of a relationship 
between variables to students in three papers [1996, 1998b, 
1999b].)

>
> But do RANDOM VARIABLES have relations like this?  Taking the
> classical Galton observations on heights of fathers and heights
> of sons, there is no such relation.  

Herman's conclusion that there is no relation(ship) between the 
heights of fathers and the heights of sons in the population of 
families from which Galton (1886; 1889, chap. 6-7) drew his sam-
ple depends on how we define the concept of a relationship be-
tween variables.  Herman recognizes this and proposes two types 
of definition:

>
> The user who understands statistical problems is likely to for-
> mulate a relation as either a multivariate distribution or as a
> conditional univariate distribution, most of the time with un-
> known parameters.  

I agree that it is possible to define the concept of 'relationship 
between variables' in terms of multivariate or conditional 
univariate distributions.  However, another, simpler way of 
defining the concept is also available -- in terms of conditional 
expected value:

    DEFINITION:  There is a *relationship* between the vari-
    ables x and y if for at least one value x' of x

                       E(y|x') ~= E(y)                        (1)
    where 

       E(*) is the expected value operator

       E(y|x') is the expected value of y given that x has 
       the value x' and 

       ~= stands for "is not equal to".

Defining the concept of 'relationships between variables' in 
terms of conditional expected value leads to a simpler definition 
than the definitions Herman proposes above because the expected-
value approach replaces the complicated concept of 'distribution' 
with the simpler concept of 'expected value'.
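
As a minimal sketch of definition (1), the following Python code 
estimates E(y) and E(y|x') from a sample by averaging y overall and 
within each value of x.  The data are simulated under an assumed 
linear model (they are NOT Galton's measurements), and the function 
name and model constants are hypothetical choices made only for 
illustration.

```python
import random
from collections import defaultdict
from statistics import mean

def conditional_means(pairs):
    """Return the overall mean of y and the mean of y for each distinct x."""
    groups = defaultdict(list)
    for x, y in pairs:
        groups[x].append(y)
    overall = mean(y for _, y in pairs)
    by_x = {x: mean(ys) for x, ys in sorted(groups.items())}
    return overall, by_x

# Hypothetical simulated data (NOT Galton's measurements): father's height
# x in whole inches, son's height y = 34 + 0.5*x + random noise.
rng = random.Random(1)
pairs = []
for _ in range(2000):
    x = round(rng.gauss(68, 2.5))
    y = 34 + 0.5 * x + rng.gauss(0, 2)
    pairs.append((x, y))

overall, by_x = conditional_means(pairs)
print(f"estimate of E(y) = {overall:.2f}")
for x, m in by_x.items():
    print(f"estimate of E(y|x={x}) = {m:.2f}")
```

If the estimates of E(y|x') differ from the estimate of E(y) by more 
than sampling error can explain, that is evidence that (1) holds.  
Judging "more than sampling error" is the job of a statistical test.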

Herman implies above that there is no relation(ship) between the 
heights of the fathers and the heights of the sons in Galton's 
population.  However, under the definition I give above, it can 
be easily shown (in terms of a low p-value in a statistical test) 
that there IS a "relationship" in Galton's population between the 
heights of the fathers (x) and the heights of the sons (y).

(Although it is not necessary to take account of the concept of a 
distribution in the definition of a relationship between vari-
ables, if we wish to *perform the statistical test* I refer to 
above to check whether there is convincing evidence in Galton's 
data of a relationship between the heights of the fathers and the 
heights of the sons, we do need to take account of the distribu-
tions of the values of the response variable [i.e., y = "height 
of the son"] for given values of the predictor variable [x = "height 
of the father"].  Of course, much statistical machinery is avail-
able to take account of these distributions in performing the 
statistical test.)

(I discuss issues pertaining to the choice of an appropriate sta-
tistical test for the Galton data in appendix B.)

>    ( snip )
> Considering the problems with interpreting multivariate data,
> not starting with a distributional type of assumption, even if
> the form of the distribution is largely unspecified, is likely
> to lead to quite inappropriate analysis.

Although the definition above of the concept of a relationship 
between variables makes no reference to distributions, it leads 
(as far as I can see) to fully appropriate analyses.  Further-
more, since the approach makes no use of multivariate distribu-
tions, it bypasses all "the problems with interpreting multivari-
ate data" Herman refers to.  

I further discuss defining the concept of 'relationship between 
variables' in terms of conditional expected value (and I propose 
a definition of "expected value") in a paper (1996, sec. 7.10).

I thank Herman for his thought-provoking comments.

-------------------------------------------------------
Donald B. Macnaughton   MatStat Research Consulting Inc
donmac@matstat.com      Toronto, Canada
-------------------------------------------------------

APPENDIX A: TERMINOLOGY: SHOULD IT BE "RELATIONSHIP" OR
            "RELATION" BETWEEN VARIABLES?

In a paper I discuss whether we should use the preposition "be-
tween" or the preposition "among" in the phrase "relationship ... 
variables" and I conclude that "between" is preferred in most 
situations (1999b, app. C).

Similarly, following Herman's remarks above, we can ask whether 
the phrase should be:

                relationship between variables

                              or

                  relation between variables.

To help resolve this issue of terminology, let me first present 
some dictionary definitions of the terms "relationship" and "re-
lation" since these definitions show how the terms are commonly 
used by speakers of English.  

The second edition of the Oxford English Dictionary (OED) defines 
the relevant senses as:

relationship
   The state of being related; a condition or character based 
   upon this; kinship.

relation
3. That feature or attribute of things which is involved in con-
   sidering them in comparison or contrast with each other; the 
   particular way in which one thing is thought of in connexion 
   with another; any connexion, correspondence, or association, 
   which can be conceived as naturally existing between things.

Note that the OED lexicographers define the relevant sense of a 
relationship first as a state, second as a condition, and last as 
a character (i.e., a property).  On the other hand, they define 
the relevant sense of a relation first as a property (feature or 
attribute), second as a way of thinking, and last as a condition 
or state (connexion, correspondence, or association).

The 1993 Random House Unabridged Electronic Dictionary defines 
the relevant senses of the two terms as:

relationship
1. a connection, association, or involvement.

relation 
1. an existing connection; a significant association between or 
   among things: "the relation between cause and effect".

These definitions suggest that the Random House lexicographers do 
not see much difference between the two terms.

Merriam-Webster's Collegiate Dictionary (tenth edition, 1993) de-
fines the relevant senses of the two terms as:

relationship
1. the state of being related or interrelated <studied the 
   *relationship* between the variables>

relation
2. an aspect or quality (as resemblance) that connects two or 
   more things or parts as being or belonging or working together 
   or as being of the same kind <the *relation* of time and 
   space>; specifically : a property (as one expressed by "is 
   equal to", "is less than", or "is the brother of") that holds 
   between an ordered pair of objects

Since the Merriam-Webster lexicographers actually cite the phrase 
"relationship between variables", it is clear which word they 
view as being more naturally used in the phrase.  Note that the 
Merriam-Webster definitions and the OED definitions are essen-
tially the same -- a relationship is mainly a state and a rela-
tion is mainly a property (feature, attribute, aspect, or qual-
ity).

It seems more reasonable to me to view a relation(ship) between 
variables as a *state* or condition than to view it as a property 
of the situation, although the latter point of view is possible.  
Thus the dictionary definitions (as they reflect common usage) 
suggest to me that the word "relationship" is more appropriate 
than the word "relation" for use in the phrase "relation(ship) 
between variables". 

However, the word "relation" is shorter than "relationship", 
which I (as a writer) view as a significant advantage.  Also, the 
use of the phrase "relation between variables" does not seem to 
lead to confusion or misunderstanding.  Thus although I believe 
the term "relationship" is currently preferred, it seems possible 
(and reasonable) that idiom will migrate to the phrase "relation 
between variables".

My informal sense of the frequency of use of the two terms in 
relevant statistical contexts is that the term "relationship" re-
ceives substantially more use than the term "relation", but the 
latter term is used by several writers of note.  For example, the 
term "relation" occurs at several places in an important book ed-
ited by John Bailar and Frederick Mosteller (1992, pp. 27, 215, 
294, 296, 306, 328), although they also allow their authors to 
use the term "relationship" (pp. 10-11).

APPENDIX B: TESTING FOR A RELATIONSHIP BETWEEN VARIABLES IN
            GALTON'S DATA

I suggest above that we can perform a statistical test on Galton's 
data to determine whether there is evidence of a relationship in 
the population between the heights of the fathers (x) and the 
heights of the sons (y).  Note that in actually performing such a 
test we need not test the inequality stated above in (1) directly, 
since we can easily derive other conditions that imply (1) and 
that we can test.  If any of these other conditions is satisfied, 
we can easily show that (1) is also satisfied.  

For example, we can test whether

                        E(y|x1) ~= E(y|x2)                    (2)

where we might choose x1 and x2 to be as far apart as possible 
since (if the relationship is strictly monotonic, as many rela-
tionships are) this will give us (with other things being equal) 
a more powerful test of the existence of the relationship than if 
we use x1 and x2 closer together.  If we can show that (2) is 
satisfied, it follows that so also is (1), because E(y|x1) and 
E(y|x2) cannot both equal E(y) if they differ from each other.
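
One way to test condition (2) can be sketched in Python as a 
two-sided permutation test on the difference between two group 
means.  The permutation test is a choice made for this sketch 
(other tests would serve as well), the data are simulated rather 
than Galton's, and the function name, group sizes, and heights are 
hypothetical.

```python
import random

def permutation_p_value(ys1, ys2, n_perm=5000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    n1 = len(ys1)
    observed = abs(sum(ys1) / n1 - sum(ys2) / len(ys2))
    pooled = list(ys1) + list(ys2)
    hits = 0
    for _ in range(n_perm):
        # Reassign the pooled y values to the two groups at random.
        rng.shuffle(pooled)
        first, second = pooled[:n1], pooled[n1:]
        diff = abs(sum(first) / len(first) - sum(second) / len(second))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value above zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical simulated sons' heights (NOT Galton's measurements).
rng = random.Random(2)
sons_x1 = [rng.gauss(66, 2) for _ in range(50)]  # fathers at x1 = 64 inches
sons_x2 = [rng.gauss(70, 2) for _ in range(50)]  # fathers at x2 = 72 inches

p = permutation_p_value(sons_x1, sons_x2)
print(f"estimated p-value for condition (2): {p:.4f}")
```

A low p-value here is evidence that E(y|x1) and E(y|x2) differ, and 
therefore evidence that (1) is satisfied.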

Alternatively, if there is no compelling evidence that the best 
line for the relationship between the two variables is not a 
straight line, we can fit a straight line to the data and then 
test the hypothesis that the slope of the line in the population 
is zero.  If the data allow us to reject this hypothesis, it is 
easy to show that (1) is satisfied, and thus we can conclude that 
a relationship exists between the two variables.
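
The straight-line approach can be sketched in Python as follows.  
The code fits a least-squares line and tests the hypothesis that 
the population slope is zero.  It assumes a large sample, so that 
the normal distribution can stand in for the t distribution with 
n - 2 degrees of freedom.  The data are simulated (NOT Galton's) and 
the function name and model constants are hypothetical.

```python
import math
import random
from statistics import NormalDist

def slope_test(xs, ys):
    """Least-squares slope, its standard error, and a large-sample p-value."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual sum of squares around the fitted line.
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(rss / (n - 2) / sxx)
    z = slope / se
    # Assumption: n is large, so the normal approximates t(n - 2).
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return slope, se, p

# Hypothetical simulated heights in inches (NOT Galton's measurements).
rng = random.Random(3)
fathers = [rng.gauss(68, 2.5) for _ in range(500)]
sons = [34 + 0.5 * x + rng.gauss(0, 2) for x in fathers]

slope, se, p_slope = slope_test(fathers, sons)
print(f"slope = {slope:.3f}, standard error = {se:.3f}, p = {p_slope:.4g}")
```

For small samples the p-value should be taken from the t 
distribution instead, which requires software beyond the Python 
standard library.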

On the other hand, if there is good evidence that the best line 
is NOT straight, this is also evidence of a relationship between 
the two variables in the sense that it also implies that (1) is 
satisfied.

(NOTE:  In his 1886 and 1889 works, Galton focuses on the rela-
tionship between the height of the "mid-parent" and the height of 
the son, where the height of the mid-parent is a weighted average 
of the heights of the mother and father.  Galton does not focus 
in these works on the relationship between the height of the 
*father* and the height of the son.  However, the points Herman 
and I discuss above are independent of whether we view "father's 
height" or "mid-parent's height" as being the predictor variable 
in the example.)

REFERENCES

Bailar, J. C., III, and Mosteller, F., eds. 1992. _Medical uses 
   of statistics._ 2d ed. Boston: NEJM (New England Journal of 
   Medicine) Books.

Galton, F. 1886. "Regression towards mediocrity in hereditary 
   stature." _Journal of the (Royal) Anthropological Institute,_ 
   15, 246-263.

Galton, F. 1889. _Natural inheritance._ London: Macmillan.

Macnaughton, D. B. 1996.  "The entity-property-relationship ap-
   proach to statistics: An introduction for students."  Avail-
   able at http://www.matstat.com/teach/

Macnaughton, D. B. 1998a. "Re: Eight features of an ideal intro 
   stat course (response to comments by Dennis Roberts, Mark 
   Myatt, Rolf Dalin, Gary Smith, and Rossi Hassad)."  Posted to 
   sci.stat.edu and EdStat-L beginning on July 23, 1998.  Avail-
   able at http://www.matstat.com/teach/

Macnaughton, D. B. 1998b.  "Eight features of an ideal introduc-
   tory statistics course."  Available at 
   http://www.matstat.com/teach/

Macnaughton, D. B. 1999a. "Re: Eight features of an ideal intro 
   stat course (response to comments by Dennis Roberts and Karl 
   L. Wuensch)."  Posted to sci.stat.edu and EdStat-L on May 2 
   and May 9, 1999.  Available at http://www.matstat.com/teach/

Macnaughton, D. B. 1999b.  "The introductory statistics course: 
   The entity-property-relationship approach."  Available at 
   http://www.matstat.com/teach/
