Subject: Definition of "Relationship Between Variables"
         (was Re: Eight Features of an Ideal Intro Stat Course)

     To: sci.stat.* Usenet Newsgroups
         EdStat E-Mail List

   From: Donald B. Macnaughton < donmac@matstat.com >

   Date: Monday January 28, 2002

     Cc: Jan de Leeuw 
         Herman Rubin 
         Robert Frick 

-----------------------------------------------------------------

This post is dedicated to the memory of Daniel DeLury (1907 -
1993) of the Department of Statistics at the University of
Toronto.  Dr. DeLury's influence on me is reflected throughout
the post, but most directly in the last appendix.

This post evaluates seven definitions of the concept of 'rela-
tionship between variables', including important definitions pro-
posed in earlier posts by Jan de Leeuw, Herman Rubin, and Robert
Frick.  It also discusses whether a teacher needs to discuss uni-
variate distributions or mathematics near the beginning of an in-
troductory statistics course for students who are not majoring in
statistics.

For simplicity, I assume throughout this post that all variables
are numeric -- that is, their values are numbers.  However, the
discussion and conclusions below easily generalize to situations
with non-numeric variables if the values of the variables are
(suitably) recoded to be numeric, and then one thinks in terms of
the recoded values.


A DEFINITION OF "RELATIONSHIP BETWEEN VARIABLES" BASED ON
EXPECTED VALUE

Responding to two informal definitions proposed by Herman Rubin
(in sci.stat.edu on 98/8/3), I proposed (on 99/5/16):

>>   DEFINITION: There is a *relationship* between the vari-
>>   ables x and y if for at least one value x' of x
>>
>>                    E(y|x') ~= E(y)                         [1]
>>   where
>>
>>      E(*) is the expected-value operator
>>
>>      E(y|x') is the expected value of y given that x has
>>      the value x' and
>>
>>      ~= stands for "is not equal to".
>>
>> Defining the concept of 'relationships between variables' in
>> terms of conditional expected value leads to a simpler defini-
>> tion than the definitions Herman proposes ... because the
>> expected-value approach replaces the complicated concept of
>> 'distribution' with the simpler concept of 'expected value'
>> [1999a].
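
To make [1] concrete, here is a minimal numerical sketch (in
Python, using a made-up ten-entity population; the numbers are
mine, purely for illustration):

    # A toy check of [1]: compare E(y|x') to E(y) in a small
    # made-up population in which x takes the values 0 and 1.
    import numpy as np

    x = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
    y = np.array([2., 3., 4., 3., 3., 5., 6., 7., 6., 6.])

    e_y = y.mean()                  # E(y) = 4.5
    e_y_given_1 = y[x == 1].mean()  # E(y|x'=1) = 6.0

    # E(y|x'=1) ~= E(y), so [1] declares that a relationship
    # exists between x and y.
    print(e_y, e_y_given_1)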


JAN DE LEEUW'S REMARKS ABOUT [1]

Quoting [1], Jan de Leeuw writes (on 99/5/16)

> It seems to me this is too narrow.  Suppose, for example, that
> E(y|x) = E(y) for all x, but V(y|x) ~= V(y) for some x, where
> V is variance (for instance V(y|x) = \sigma^2 x^2).  Seems like
> a relationship to me.

Two points of view are available to handle Jan's example:

1. We can adopt Jan's point of view and say that the example is
   an example of a relationship between the variables x and y.

2. We can adopt the point of view suggested by [1] and say that
   the example is not an example of a relationship between the
   variables x and y.  (Of course, [1] clearly implies that the
   example is an example of a relationship between the variables
   x and V(y).)

We can adopt either of these points of view because they both ap-
pear to work satisfactorily.  I discuss which point of view is
preferred below, but first it is helpful to consider some pre-
liminary material.
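
For readers who would like to see Jan's example numerically,
here is a small simulation sketch (in Python; the specific
distributions are my own choices, not Jan's):

    # Simulate y = x * epsilon with epsilon standard normal, so
    # that E(y|x) = 0 for every x but V(y|x) = x**2 varies with x.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 200_000
    x = rng.choice([1.0, 2.0, 3.0], size=n)
    y = x * rng.standard_normal(n)

    for x_val in (1.0, 2.0, 3.0):
        sel = (x == x_val)
        print(x_val, y[sel].mean().round(3), y[sel].var().round(3))

    # The conditional means are all near 0 (matching E(y)), so [1]
    # sees no relationship between x and y; but the conditional
    # variances are near 1, 4, and 9, so x is clearly related to
    # V(y), as point of view 2 above states.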

(In an introductory statistics course for less advanced students
a teacher might reasonably decide not to present any formal defi-
nition of the concept of 'relationship between variables'.  In
this case the teacher would not present either of the above
points of view.  Instead, the teacher might choose to character-
ize the concept of 'relationship between variables' informally in
terms of one variable "depending" on the other, or in terms of
the values of one variable "varying somewhat in step" with the
values of the other.  Although these characterizations are not
mathematically explicit, I believe they are reasonable approaches
for less advanced students if the characterizations are developed
in terms of sufficient practical examples.)

>
> There is also a problem with symmetry.  Can we reverse the role
> of x and y in these definitions ? It seems so.

Appendix C discusses the symmetry of definitions of the concept
of 'relationship between variables'.


JAN DE LEEUW'S DEFINITION OF "RELATIONSHIP BETWEEN VARIABLES"

>
> This leads to a somewhat more straightforward definition: there
> is a relationship between random variables x and y if and only
> if they are not independent

Jan defines the concept of 'relationship between variables' in
terms of the concept of 'independence' of variables.  This leads
one to seek a definition of that concept, which Jan provides (in
terms of a definition of "dependence" or "relationship") as fol-
lows:

> (or, if you like, p(y|x) ~= p(y) for some x).

For clarity, let me make Jan's definition more explicit:

    DEFINITION: There is a *relationship* between the random
    variables x and y if and only if

                       p(y'|x') ~= p(y')                      [2]

    for some x' and some y' where

       p(y') = the unconditional probability that the vari-
               able y has the value y' (or equals the value
               of the probability density of y at y') and

    p(y'|x') = the probability that the variable y has the
               value y' given that the variable x has the
               value x' (or equals the value of the probabil-
               ity density of y at y' given that x is at x').

I hope that [2] properly characterizes the spirit of Jan's defi-
nition.  However, [2] differs from Jan's definition in two sig-
nificant ways:

- Primes appear on x and y in [2] to reinforce the idea that the
  definition is referring to (any) specific values of the vari-
  ables x and y.  That is, the variables are being used in the
  existential sense, as opposed to the universal sense.  This is
  also implied by the phrase "for some x' and some y'" in the
  definition.

- The phrase "and some y'" is added to [2].  This gives y the
  same existential freedom in the definition as x, which helps to
  show the broadness of the definition.

Note that [2] has the same structure as [1], with the expected-
value operator replaced by the probability (density) operator.

Definition [2] refers to the concept of a "random" variable, but
definition [1] does not refer to this concept.  Appendix A dis-
cusses the concept of 'random variable'.

Definition [2] is broader than [1] because [2] is satisfied by a
difference at any point across the two distributions (conditional
and unconditional) of the values of y, while [1] is satisfied
only if the means of the two distributions are different.

I further compare [1] and [2] below.


HERMAN RUBIN'S REMARKS ABOUT INTRODUCTORY STATISTICS

For brevity, I use the phrase "non-statistics-majors" in the fol-
lowing discussion to refer to students who are not majoring in
statistics or mathematics.

Herman Rubin begins his post by addressing the issue that started
the debate about the definition of "relationship between vari-
ables".  He writes (on 99/5/17)

> Donald Macnaughton ... wrote:
>
M> Quoting a 98/7/23 post of mine, Herman Rubin writes (on
M> 98/8/3)
>>
R>> Donald Macnaughton ... wrote:
>>>
M>>> In a July 17 post I recommend that teachers emphasize the
M>>> concept of a relationship between variables and I recommend
M>>> a de-emphasis of less important topics such as univariate
M>>> distributions ...
>>>
R>> As such, I agree about the point on univariate distributions.
R>> One does not need a catalog of the standard ones, nor [does
R>> one need to] be adept at calculating them.
R>>
R>> HOWEVER, on consideration of the actual problems, they are an
R>> essential tool.
>>
M> I fully agree that univariate distributions are an essential
M> tool in actual statistical problems -- most statistical analy-
M> ses depend directly on concepts of univariate distributions.
M>
M> However, as Herman may agree, the ubiquity of univariate dis-
M> tributions in statistical analyses does NOT speak to whether a
M> teacher should discuss univariate distributions near the be-
M> ginning of an introductory statistics course when the course
M> is aimed at students who are NOT majoring in statistics.
>
> On the contrary, it is these who need to understand, not the
> formulas for the standard univariate distributions, but what
> distributions, including univariate, are in general, and also
> some of their basic properties.

Herman and I disagree here:  He recommends that a teacher discuss
univariate distributions near the beginning of an introductory
course for non-statistics-majors.  In contrast, I recommend that
a teacher begin such a course by discussing relationships between
variables, with no discussion (near the beginning) of univariate
distributions.

>
> If the person in the other field cannot move the problem from
> "biological space" to "statistics space", the problem is not
> ready for the use of statistics

Again, I respectfully disagree.  By forcing our students (and
clients) to clamber from their own "space" into our "statistics
space" (that is, by forcing them to speak our complicated mathe-
matical language), I believe we confuse them and frighten many of
them away.

I believe that the mathematical language is unnecessary for non-
statistics-majors.  Instead of struggling to explain the mathe-
matics, we can focus on the practical use of statistics in em-
pirical research.  We can reasonably tell students that the main
practical use of statistics is to assist researchers to study re-
lationships between variables.

As noted, a relationship exists between two variables if, when one
variable "goes up and down" in entities (or in the entities' en-
vironment), the other variable "goes up and down somewhat in
step".  We can illustrate this phenomenon on a scatterplot with
no discussion of any underlying mathematics.  We can then gener-
alize these ideas in various useful directions, again without the
(direct) need of mathematics.

In particular, we can show students that most empirical research
projects (or logical units of research projects) can be usefully
viewed as studying the relationship between a single response
variable and one or more predictor variables.  The response vari-
able is the variable that we wish to learn how to predict or con-
trol.  The predictor variable(s) is (are) the other variable(s)
that we observe or manipulate in a research project to help us
learn how to predict or control the values of the response vari-
able.

Many readers will agree that the statistical procedures that are
commonly used in empirical research include
- the t-test
- analysis of variance
- regression analysis
- response surface analysis
- categorical analysis
- time series analysis
- survey analysis
- survival analysis
- Bayesian analysis
- neural networks
- discriminant analysis
- nonparametric analysis
- logistic regression analysis
- probit analysis
- data mining methods
- univariate methods
- and others.
Examination of these procedures suggests that they can all be
reasonably and usefully viewed (for the most part) as optimal
methods for studying the relationship between a single response
variable and zero or more predictor variables under various cir-
cumstances.

The preceding four paragraphs suggest that the easy-to-understand
concept of 'relationship between variables' is a central unifying
concept of both the field of statistics and empirical research.
Thus it is reasonable to emphasize this concept in an introduc-
tory statistics course.

The main ideas are surprisingly simple:  In a typical research
project using statistical methods the researcher (e.g., a medical
researcher) would like the field of statistics to answer three
key questions, which are

1. How can we discover and demonstrate reliable evidence that a
   relationship exists (if one does) between the response vari-
   able and predictor variable(s) of interest?

2. If we find good evidence that a relationship exists, how can
   we best use our knowledge of the relationship to predict or
   possibly control the values of the response variable in new
   entities from the population on the basis of the values of the
   predictor variable(s)?

3. If we make such predictions or attempt such control, how accu-
   rate will the prediction or control be?

These questions make no reference (at least on the surface) to
mathematics.  We can show non-statistics-majors that much of the
field of statistics is about answering these questions in empiri-
cal research under various circumstances.  Discussing these ideas
(using sufficient practical examples) gives students a broad
overview of the vital role of statistics in empirical research.
This is more likely to impress non-statistics-majors than if we
discuss the mathematics.

Under this approach I do not suggest that we hide from students
the fact that statistical procedures are based on mathematical
principles.  Instead, I recommend that teachers make students
well aware of the existence of the important underlying mathemat-
ics.  But we can defer the details until a later course.

Similarly, it is important to inform students about the underly-
ing assumptions of statistical analysis -- we cannot have confi-
dence in the conclusions of a statistical analysis unless we know
that the underlying assumptions of the analysis are adequately
satisfied by the situation and data under study.  I recommend
that introductory statistics teachers impress students with this
important point.  But, as with the mathematics, we can defer the
details of the assumptions until later.

I further discuss the above points and the teaching approach I
recommend in two essays (1998a, 1999b) and in appendix H of this
post.  I discuss empirical research projects that do not study
relationships between variables in two essays (1997a; 1999b, app.
C).  Moore (1997a, sec. 4) and the American Statistical
Association (2002) also recommend de-emphasizing mathematics in
statistics education.


STATISTICAL PROCEDURES AS RELIGIOUS MANTRAS

Herman continues ...

> [If the person in the other field cannot move the problem from
> "biological space" to "statistics space", the problem is not
> ready for the use of statistics] except as "religious" mantras.

I like the metaphor of a religious mantra to characterize certain
traditional practices in statistics.  One area of statistics in
which I think statisticians and empirical researchers sometimes
use a mantra is in the important area of hypothesis testing.  I
have written about hypothesis testing earlier (1997b, sec. 9;
1998b, sec. 5) and the ideas appear from time to time below.  I
plan to present some further ideas in a later post.


HERMAN RUBIN'S FIRST DEFINITION OF "RELATIONSHIP BETWEEN
VARIABLES"

Herman next changes his focus to the main topic of the present
post -- the definition of the concept of 'relationship between
variables'.  Quoting [1] above, he writes

>
> I agree with de Leeuw that this definition is far too narrow.
>
> The appropriate version of this [is]
>
>     DEFINITION:  There is a *stochastic relationship*
>     between the random variables X and Y if for at
>     least one value x' the conditional distribution         [3]
>     of Y given X=x' is different from the uncondi-
>     tional distribution of Y.

Definition [3] is equivalent to Jan de Leeuw's definition [2] in
the sense that [3] will declare that a relationship exists be-
tween two "compatible" variables if and only if [2] also declares
that a relationship exists.  Appendix B discusses the equivalence
of [2] and [3].

(Two variables are "compatible" if they both reflect properties
of the same type of entity [or one may reflect a property of the
entities' environment], and if the available values of the vari-
ables are reasonably linked within entities and within time.
Clearly, we can reasonably study a relationship between variables
only if the variables are compatible.)

Despite the equivalence of [2] and [3], definition [3] differs
from definition [2] in the sense that [3] is effectively refer-
ring to the entire probability (density) function of the y-values
for a given x-value while [2] is effectively referring to a point
on the probability (density) function of the y-values for the
given x-value.  Definition [2] is thus more specific, and thus
perhaps slightly clearer, reducing the necessary and sufficient
condition for a relationship to a reasonable minimum condition.

>
> An alternative version is that X and Y are dependent random
> variables.  But the operational meaning of this is the above
> formulation; objects are independent if knowing one provides no
> information about the distribution of the other.  I would sug-
> gest that this be used as the definition of independence, and
> it goes over immediately to many objects.

For discussions about empirical research, I agree with Herman's
approach of defining the concept of 'independence' in terms of
the concept of 'relationship between objects', rather than the
other way around.  On the other hand, in theoretical discussions
it is often useful to begin with and focus on the concept of 'in-
dependence', as discussed in appendix I.

Herman speaks about "objects" because he wishes to apply the con-
cept of 'independence' to two different types of object, as he
indicates in his next sentence:

> A random variable here is an object, as is an event.

Consider Herman's concept of an event, and consider his notion
(implied in the second most recent quotation above) of the "dis-
tribution" of events -- how are events distributed?  One answer
is that they are distributed over time (or over some other appro-
priate dimension).  Thus consider the variable "time of an
event".  We can view the idea of independence of events simply in
terms of independence of (i.e., lack of relationship between) two
variables reflecting the (distribution over) time of the two
(types of) events.

Thus rather than needing two notions of independence (one for
variables and the other for events), we can subsume both types of
independence under the idea of a lack of a relationship between
variables.

>
> How can this be understood without knowing what it means for
> something to be the distribution of a random variable?

The referent of Herman's "this" is unclear although convention
suggests that the referent is the point he makes in his sentence
that precedes the above sentence.  However, I suspect that Herman
is not referring to the (somewhat peripheral) point in that sen-
tence.  Instead, I suspect that he is asking how his definition
[3] of the concept of a relationship between variables can be un-
derstood if one does not understand the concept of the distribu-
tion of the values of a (random) variable.

If that is Herman's point, I fully agree with it.  If we are to
successfully use [3] (or [2]) to define the concept of 'relation-
ship between variables' in a statistics course, students must
first understand the concept of the distribution of the values of
a variable.


HERMAN RUBIN'S REMARKS ABOUT [1]

Herman continues ...

>
> Expectation should not be taught using the formulas usually
> given,

By the "formulas usually given" I think Herman means the sum (or
integral) across all the possible values of the variable of the
product of the variable and its probability (density) function
[e.g., for the variable x, the sum (integral) across x of the
product of x and p(x)].

If we wish to teach the concept of 'expected value' to non-
statistics-majors, I agree with Herman that the formulas usually
given should not be used.  We can teach the concept to non-
statistics-majors in terms of the concept of 'arithmetic mean' or
'average'.  That is, the expected value is the value we will get
if we compute the average of the values of the variable for all
the entities in the population.

Non-statistics-majors readily understand that we can estimate
with reasonable precision the expected value of any variable by
computing the average of the values of the variable in a suitable
sample.  Here students need an informal awareness of the concept
of 'distribution'.  That is, they need to understand the idea
that the values of variables generally vary.  But they need no
mathematical awareness of distributions beyond adding together
the values and dividing by N.  (Most students already know from
statistical reports in the media that the average lies at the
"center" of the values.)

> but those formulas involve the concept of distribution as well.

I think Herman is here making the following argument:

- Definition [1] defines the concept of 'relationship between
  variables' in terms of the concept of 'expectation' or 'ex-
  pected value'.

- But the formulas usually given for the concept of 'expected
  value' involve the concept of 'distribution'.

- Therefore, [1] depends on the concept of 'distribution'.

I agree that [1] appeals to the concept of 'expected value' and
that the formulas usually given for expected value involve the
concept of 'distribution' [which is implicit in the function
p(x)].  However, if we bypass the formulas usually given and
characterize the concept of 'expected value' in terms of the con-
cept of 'arithmetic mean' or 'average', we bypass the need to re-
fer to the mathematical concept of 'distribution'.  This makes
the ideas substantially easier to understand.  I further discuss
this approach to expected value in a paper for students (1997b,
sec. 7.10).


HERMAN RUBIN'S SECOND DEFINITION OF "RELATIONSHIP BETWEEN
VARIABLES"

>
> The above definition could also be formulated as
>
>                     E(f(y)|x') ~= E(f(y))                   [4]
>
> for all functions f for which the expectations exist,

I suspect that Herman here means not all functions f but, in-
stead, some function f from among the set of all functions for
which the expectations exist.  That is, under [4] a relationship
exists between x and y if and only if [4] is satisfied for some
x' and some (any) specific value of y, and some (one, any) func-
tion f.

In [4] Herman has neatly changed from using the probability (den-
sity) operator as the main operator in the definition (as in [2]
and [3]) to using the expected-value operator (as in [1]).

Definition [4] is equivalent to [2] and [3] in the sense that [4]
will declare that a relationship exists between two compatible
variables if and only if [2] and [3] also declare that a rela-
tionship exists.  Appendix B discusses the equivalence of [2],
[3], and [4].

If the function f in [4] is the identity function (which it usu-
ally can be), [4] becomes [1].  In other words, [1] identifies a
subset of the cases that satisfy [4].

(For the broadest generality, the function f in [4] is allowed to
take multiple y-values [i.e., a vector of y-values] as its argu-
ment.  This enables us to include the variance function [as used
in Jan de Leeuw's example] and similar functions in the set of
permissible functions that may appear as f in the definition.)
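
To see [4] in action on Jan de Leeuw's example, here is a
continuation of the earlier simulation sketch (again purely my
own illustration): choosing f(y) = y^2 makes the inequality in
[4] hold even though [1] is not satisfied.

    # Continue the simulation of Jan's example: y = x * epsilon,
    # so E(y|x) = E(y) = 0 but E(y**2|x) varies with x.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 200_000
    x = rng.choice([1.0, 2.0, 3.0], size=n)
    y = x * rng.standard_normal(n)

    f = lambda v: v ** 2                   # one choice of f in [4]
    print(f(y).mean().round(3))            # E(f(y)), roughly 4.67
    print(f(y[x == 1.0]).mean().round(3))  # E(f(y)|x'=1), roughly 1

    # E(f(y)|x') ~= E(f(y)), so [4] declares that a relationship
    # exists between x and y.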

I further discuss Herman's two definitions below, but first it is
helpful to consider three other definitions of the concept of
'relationship between variables'.


A STANDARD DEFINITION OF THE CONCEPT OF 'RELATIONSHIP BETWEEN
VARIABLES' FROM MATHEMATICAL STATISTICS

Jan de Leeuw begins his discussion above of the definition of
"relationship between variables" with the concept of 'independ-
ence', as opposed to beginning with the concept of 'dependence'
or 'relationship'.  Jan may have begun this way because it is a
standard way to begin.  For example, Freund and Walpole present
the following familiar definition of "independence" of two vari-
ables in their popular mathematical statistics textbook:

    If p(x,y) is the value of the joint probability distri-
    bution of the discrete random variables X and Y at 
    (x,y), and p1(x) and p2(y) are the values of the mar-
    ginal distributions of X and Y at x and y, X and Y are
    *independent* if and only if

                    p(x,y) = p1(x) p2(y)

    for all (x,y) within their range.
    
    To give a corresponding definition for continuous random
    variables, we simply substitute the word "density" for
    the word "distribution" [and the word "continuous" for
    the word "discrete"] (1987, p. 126).

(For comparability, I have reduced Freund and Walpole's original
definition from N variables to two variables and I have changed
the variable and function names to be consistent with those in
this post.)

Hogg and Craig, in their popular mathematical statistics text-
book, define "independence" the same way, although they use dif-
ferent wording and notation (1995, p. 101).  Other textbooks also
give conceptually the same definition, and thus Freund and
Walpole's definition reflects a widely-held view of the concept
of 'independence of two variables'.

Freund and Walpole emphasize the concept of 'independence' and
give much less attention to the concept of 'dependence' or 'rela-
tionship'.  Instead, "dependence" between two variables is simply
(and reasonably) implied as the negation of independence.

Thus we can define the concept of 'relationship between vari-
ables' in terms of the negation of Freund and Walpole's defini-
tion of "independence".  A reasonable version of this is

    DEFINITION: If p(x,y) is the value of the joint prob-
    ability (density) function of the random variables X and
    Y at (x,y), and p1(x) and p2(y) are the values of the
    marginal probability (density) functions of X and Y at x
    and y, there is a *relationship* between X and Y if and
    only if

                      p(x,y) ~= p1(x) p2(y)                   [5]

    for some (x,y) within their range.
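
A tiny numerical sketch of [5] (in Python, with a made-up joint
probability table for two discrete variables) may be helpful:

    # Check [5] for a made-up 2 x 2 joint distribution of X and Y.
    import numpy as np

    p_xy = np.array([[0.30, 0.20],    # rows index values of X
                     [0.10, 0.40]])   # columns index values of Y

    p1 = p_xy.sum(axis=1)             # marginal distribution of X
    p2 = p_xy.sum(axis=0)             # marginal distribution of Y
    product = np.outer(p1, p2)        # p1(x) * p2(y) for each (x,y)

    # p(x,y) differs from p1(x) * p2(y) in at least one cell, so
    # [5] declares that a relationship exists between X and Y.
    print(np.allclose(p_xy, product))   # False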

Definition [5] is equivalent to [2] through [4] in the sense that
[5] will declare that a relationship exists between two compatible
variables if and only if [2] through [4] also declare that a rela-
tionship exists -- see appendix B.

Although [5] is equivalent to [2] through [4] in the sense above,
[5] is different from [2] through [4] (and [1]) in an important
sense -- [5] makes no significant distinction between the re-
sponse variable and the predictor variable, while the other defi-
nitions all make such a distinction.  Specifically, definitions
[1] through [4] use the vertical bar to mean "given that", and
the response variable y always appears to the left of the verti-
cal bar, and the predictor variable x always appears to the right
of the vertical bar.  (The vertical bar is implicit in [3].)

As discussed above, most empirical research projects (or logical
units of research projects) can be usefully viewed as studying
the relationship between a single response variable and one or
more predictor variables.  Thus the distinction between the re-
sponse variable and the predictor variable(s) is important in
most empirical research projects.  But [5] does not significantly
distinguish between these variables.  Thus [5] has less direct
applicability to the use of statistics in empirical research than
[1] through [4].

Consider the issue of quantification:  Note how Freund and
Walpole's definition of "independence" is a universally quanti-
fied statement, as indicated by the phrase "for all (x,y)" in the
definition.  On the other hand [5], which is the negation of
Freund and Walpole's definition, is an existentially quantified
statement, as indicated by the phrase "for some (x,y)" in the
definition.  Definitions [1] through [4] are also existentially
quantified statements, as is underscored by the primes on some of
the x's and y's in the definitions.

An advantage of using an existentially quantified definition is
that, in general, existentially quantified statements can (if
they are true) be verified in empirical research while univer-
sally quantified statements can almost never (even if they are
true) be directly verified.  (Universally quantified statements
can be falsified -- see appendix D.)  Universally quantified
statements cannot be verified because proper verification re-
quires an exhaustive search, which (due to resource limitations)
is almost always impossible.

(Because providing empirical support for a universally quantified
statement is almost always impossible, empirical researchers
rarely make universally quantified statements.  In particular,
empirical researchers rarely try to empirically support the claim
that no relationship exists between two given compatible vari-
ables -- that is, they rarely [if ever] try to empirically sup-
port the claim that the two variables are independent.  They do
not attempt to support this claim because generally it cannot be
reasonably empirically supported.  Instead, following the princi-
ple of parsimony, most experienced researchers simply formally
assume that no relationship exists between a response variable
and one or more compatible predictor variables until unequivocal
empirical evidence is brought forward that allows us to reject
the "null" assumption of no relationship.

(Appendix E discusses the "conservation" laws of physical sci-
ence, which are an interesting exception to the main point of the
preceding paragraph.)

As noted above, non-statistics-majors are more likely to be im-
pressed by the practical side of statistics.  Thus it makes sense
to introduce them to the existentially quantified definition of
"relationship between variables" (as defined by any of defini-
tions [1] through [7] in this post) instead of the universally
quantified definition of "independence".  This helps non-
statistics-majors to see the practical use of statistics in em-
pirical research, which is mostly about relationships between
variables (and not about "independences").

On the other hand, if we are teaching statistics to students who
are majoring in statistics or mathematics, it is important to in-
troduce the fundamental universally quantified definition of "in-
dependence of N random variables", as discussed in appendix I.


ROBERT FRICK'S DEFINITION OF "RELATIONSHIP BETWEEN VARIABLES"

Bob Frick wrote privately to me proposing another form of [1].
(I quote him here with his permission.)  Referring to [1], he
writes

> I propose
>
>     DEFINITION: There is a *relationship* between the vari-
>     ables x and y if for at least one pair of values x'
>     and x" of x
>
>                      E(y|x') ~= E(y|x").                    [6]
>
> I think this definition is mathematically equivalent to your
> definition and better fits my intuitive understanding and the
> typical understanding of causality.  I had to translate from
> your formulation to mine in order to understand and evaluate
> yours.

I agree with Bob that [1] and [6] are "mathematically equivalent"
in the sense that [1] will declare that a relationship exists be-
tween two compatible variables if and only if [6] also declares
that a relationship exists.  Appendix F discusses the equivalence
of [1] and [6].

(Discussion at several places below focuses on continuous [as op-
posed to discrete] response variables.  This is because continu-
ous response variables are available in most areas of empirical
research and continuous variables generally carry substantially
more information in their values.  Thus using a continuous re-
sponse variable usually enables a researcher to obtain better
knowledge of the relationship between the response variable and
the predictor variable[s] under study.)

Consider some properties of [1] and [6]:

1. Definition [1] is simpler than [6] in the sense that the right
   side of [1] is an unconditional expected value while the right
   side of [6] is a (more complicated) conditional expected
   value.

2. Definition [6] directly reflects how the detection of rela-
   tionships between variables is usually done in an important
   case in empirical research -- the case in which the response
   variable is continuous and the predictor variable is discrete,
   with two values.  This reflects the simplest standard experi-
   mental design, which is usually best analyzed with the sim-
   plest case of analysis of variance -- the one-way case with
   two levels (also called the t-test).  In this case we test
   whether a relationship exists between the variables by testing
   whether [6] (not [1]) is satisfied.  Statistical practice fa-
   vors [6] here over [1] because a research project properly 
   based on [6] generally provides (other things being equal) a 
   more powerful statistical test of whether the sought-after re-
   lationship exists.  (A small code sketch of this case follows
   this list.)

3. The approach implied by [6] is generally not used to detect
   relationships between variables when we have a continuous re-
   sponse variable and a continuous predictor variable.  In this
   case the test for the existence of a relationship is generally 
   a test of whether a parameter in a model equation has a cer-
   tain "null" value.  We know or believe that the parameter will
   have the null value (typically zero) if no relationship is 
   present and will have a different value if a relationship is
   present.  If we can reasonably reject the hypothesis that the
   parameter has the null value, we can (tentatively) conclude
   that a relationship exists between the variables.  Statistical
   practice favors a test of a parameter here instead of the ap-
   proach implied by [6] because the parameter test provides
   (other things being equal) a more powerful statistical test of
   whether the sought-after relationship exists.

4. An approach (properly) based on [6] can generally give better
   prediction or control capability than a similar approach based
   on [1].

5. Although the approach implied by [6] is directly used some of
   the time to test for relationships between variables in em-
   pirical research, the approach implied by [1] is almost never
   directly used.  Instead, forms that can be derived from [1] 
   (such as [6] or a test of a parameter) are used in actual
   practice.

6. Consider the case in which humans informally study relation-
   ships between variables in everyday life.  In this case we are
   usually not conscious of the concept of 'relationship between
   variables'.  For example, after several visits to a new bank a
   person may observe, "The earlier in the morning I go to the 
   bank, the less time I have to wait to be served."  ("Duration
   of waiting time" is the response variable and "bank arrival
   time" is the predictor variable.)  In this case people seem
   more often to begin with an approach resembling [1] than one
   resembling [6].  This may be because [1] is simpler and lends
   itself at least as well as [6] to natural situations.  Here,
   we often begin without knowledge of the identity of the rele-
   vant predictor variable, and thus without direct knowledge of
   the values of the response variable when the predictor vari-
   able is at two different values (x' and x"), although this
   type of knowledge usually comes later.  Instead, we initially
   discover the relationship by noting that the response variable
   deviates from its expected value when the predictor variable
   is at a particular value (or in some value range).  Thus [1]
   seems more basic or more "natural" to me than [6].
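
As promised in point 2 above, here is a minimal sketch of the
two-group case (in Python with SciPy, using made-up data; the
group labels "control" and "treatment" are hypothetical):

    # Test [6] with a two-sample t-test: does E(y|x') differ from
    # E(y|x") when x has two values (control and treatment)?
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    control = rng.normal(loc=10.0, scale=2.0, size=30)    # y at x'
    treatment = rng.normal(loc=12.0, scale=2.0, size=30)  # y at x"

    t_stat, p_value = stats.ttest_ind(treatment, control)
    print(round(t_stat, 2), round(p_value, 4))

    # A small p-value is evidence that E(y|x') ~= E(y|x"), that
    # is, evidence of a relationship in the sense of [6].  The
    # parameter (slope) test of point 3 is analogous.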

The above discussion suggests that [1] and [6] both have advan-
tages.  Thus in statistics courses in which the teacher decides
to introduce [1] or [6] to define "relationship between vari-
ables", I recommend introducing both definitions to students.


A DEFINITION OF "RELATIONSHIP BETWEEN VARIABLES" IN TERMS OF A
MATHEMATICAL FUNCTION

The discussion above refers to the idea of a parameter in a model
equation.  The idea of a model equation suggests the following
definition of the concept of 'relationship between variables':

    DEFINITION: There is a *relationship* between the vari-
    ables x and y if the value of y can be expressed as a
    non-constant mathematical function of the value of x.     [7]
    An "error" term e is generally included with the func-
    tion, where e is usually viewed as being independent of
    x.  This is stated algebraically as

                        y = g(x) + e.

For maximum generality, the function g(x) is shown as a fully
general function.  However, in any real study of a relationship
between variables the general function g is replaced by a spe-
cific mathematical function that is chosen from among the many
types that are available.

Definition [7] is closely related to the concept of 'expected
value' because the function g is almost always chosen so as to
"best" estimate E(y|x).

(If the response variable is continuous and if the conditional
distribution of its values is noticeably non-symmetric, which I
estimate occurs in less than four percent of empirical research
projects with continuous response variables, the median may be
used instead of the mean [expected value].  The same basic prin-
ciples apply, but g(x) estimates the median of the conditional
distribution of the response variable instead of the mean.)

The mathematical form of g is chosen mostly through analysis of
data obtained in empirical research, although theoretical consid-
erations sometimes also play a central role, especially in the
physical sciences.

The function g is usually a mathematical function in the strict
sense of the term "function".  That is, g is a one-to-one or pos-
sibly many-to-one mapping between two sets, with no random ele-
ment involved.  (That is, the mapping is not one-to-many.)

The random element in [7] is handled by the error term e (which
is usually represented by the lowercase Greek letter epsilon).
This term takes account of the (empirical) fact that invariably
in real situations the best function g cannot perfectly predict
the associated value of y from a value of x -- the e is the error
in the prediction.  Researchers often determine the distribution
of e, but in any real situation the term itself has a different
unpredictable value every time an instance of the equation oc-
curs.

The error term in [7] provides two important conceptual benefits:

1. The error term enables the equation to satisfy the mathemati-
   cal requirements of the equals sign.

2. The error term collects all the unaccounted-for variation in
   the values of y in a single sensible place.  (Some complicated
   analyses use multiple error terms.)

Definition [7] is equivalent to [1] and [6] in the sense that [7]
will declare that a relationship exists between two compatible
variables if and only if [1] and [6] also declare that a rela-
tionship exists.  Appendix F discusses the equivalence of [1],
[6], and [7].

(Definition [7] is not equivalent to definitions [2] through [5],
but can be made so by broadening it, as discussed in appendix G.)

Definition [7] is important because mathematical functions are
often used to represent relationships between variables in most
branches of the physical and biological sciences, and also (at a
more abstract and implicit level) in much research in the social
sciences.  In the physical sciences the error term e in [7] is
usually omitted, but the same general principle of stating rela-
tionships between variables in the form of mathematical functions
is widely used.

Definition [7] refers to the concept of 'independence'.  A
teacher presenting [7] to students can use the standard approach
exemplified above in Freund and Walpole's definition of "inde-
pendence of two random variables" to characterize 'independence'.
However, that approach appeals to the concept of 'distribution'.
Thus students must understand the statistical concept of 'distri-
bution' to fully understand [7].

If a teacher chooses to present definitions [1] and [6] to stu-
dents, and if the students have sufficient mathematical ability,
I recommend that the teacher also present definition [7].  I rec-
ommend that the three definitions be presented in succession,
separated only by careful discussion of practical examples of ac-
tual relationships to reinforce each definition.  Presenting the
definitions in succession helps students to attain a unified
sense of the various ways that relationships between variables
appear in life and empirical research.

(My experience suggests that most students cannot understand any
definition of the concept of 'relationship between variables'
without sufficient discussion of practical examples, with "suffi-
cient" depending on the level of the students.)


COMPARISON OF THE DEFINITIONS

The preceding material discusses seven definitions of the concept
of 'relationship between two variables'.  Which definition is
preferred?

To simplify this question, the following discussion views defini-
tions [1], [6], and [7] as if they are the same definition.  This
is reasonable because the three definitions are theoretically
equivalent, as noted in appendix F.  The discussion refers to the
three definitions jointly as the "expected-value" definition of
the concept of 'relationship between variables'.

Similarly, the following discussion views definitions [2], [3],
[4], and [5] as if they are the same definition.  This is reason-
able because [2] through [5] are theoretically equivalent, as
noted in appendix B.  The discussion refers to [2] through [5]
jointly as the "distribution" definition of the concept of 'rela-
tionship between variables'.

The expected-value and distribution definitions are not equiva-
lent, as is illustrated by Jan de Leeuw's variance example:  If
we consider the example with y in the role of the response vari-
able, the expected-value definition does not directly declare
that a relationship exists between x and y, but the distribution
definition does directly declare that a relationship exists.

Since the two definitions are not equivalent, which of them is
preferred?

It is reasonable to split this question into two more specific
questions:

- Which definition is preferred in the introductory statistics
  course for non-statistics-majors?

- Which definition is preferred in general statistical discourse?

In determining the preferred definition, I assume we are not
Platonists.  Thus neither the expected-value definition nor the
distribution definition is more "correct".  This is because we do
not believe that some true Platonic concept of 'relationship be-
tween variables' exists somewhere, and we are trying to capture
the concept in the definition.  Instead, we are free to choose a
definition as being "correct".  Many readers will agree that a
reasonable approach to making this choice is to choose whichever
definition has more conceptual advantages.

Consider some features and advantages of the expected-value defi-
nition:

 1. The expected-value definition is easier to understand than
    the distribution definition because it does not require
    mathematical understanding of the statistical concept of
    'distribution (of the values of a variable)'.

 2. Empirical researchers are generally much more interested in
    directly predicting or controlling the values of the response
    variable in an empirical research project (i.e., in predict-
    ing or controlling expected value) than in predicting or con-
    trolling the values of higher moments (e.g., variance) of the
    response variable.  And although situations arise (especially
    in quality control) in which examples like Jan's are impor-
    tant, I estimate that more than ninety-six percent of all em-
    pirical research projects that study relationships with con-
    tinuous response variables (as reported in the empirical re-
    search literature) can be reasonably understood as viewing
    relationships in terms of the expected value (or occasionally
    in terms of the expected median) of the response variable.
    And usually, if a relationship is found between the variance
    of the response variable and a predictor variable, this is
    merely viewed as a nuisance.  (The variance relationship is a
    nuisance because heterogeneity of response variable variance
    adds complexity to the analysis.)

 3. Consistent with point 2, the expected-value definition is
    (implicitly) used much more frequently than the distribution
    definition to define the statistical tests that are performed
    in empirical research to detect relationships between vari-
    ables.  In the case of a continuous response variable the
    tests are almost always (effectively) tests of whether some
    measure of center or some parameter of a model (both of which
    are often linear functions of the [perhaps trimmed or
    subsetted] values of the response variable) has some value,
    or is different from some other fixed value, or is different
    from some other empirically derived value or values.  These
    tests are thus effectively tests of the first moment (possi-
    bly with appropriate subsetting) of the values of the re-
    sponse variable.  Only rarely are the key tests performed on
    other moments of the values of the response variable.  Also,
    tests that are in terms of the probability (density) function
    of the values of the response variable are performed only in-
    frequently.  (Instances occur when the response variable is
    discrete, as opposed to continuous, but, as suggested above,
    discrete response variables are used less often.)

 4. Statistically knowledgeable empirical researchers often per-
    form statistical tests for variance relationships.  But when
    they perform such tests they are usually directly interested
    in studying a relationship between variables as defined by
    the expected-value definition.  And they are only performing
    the variance tests to assist in verifying that the underlying
    assumptions of the statistical procedure being used are ade-
    quately satisfied.  Furthermore, statistically knowledgeable
    empirical researchers almost never check whether the third or
    higher moments of the response variable change as a function
    of a predictor variable.  This suggests that empirical re-
    searchers generally view study of moments of the response
    variable higher than the second as being of little interest
    or value.

 5. A function (transformation) is sometimes applied to the val-
    ues of the response variable in the data analysis of an em-
    pirical research project.  However, if such a function is
    used, the purpose is usually merely to stabilize the variance
    of the response variable to satisfy assumptions of the sta-
    tistical procedure being used -- not to support direct study
    of higher moments or other similar study of the values of the
    response variable.

 6. Although the expected-value definition does not directly
    cover certain cases (such as Jan's V(y|x) case), it covers
    all these cases indirectly when the appropriate function is
    applied to the values of the response variable, as suggested
    by Herman's definition [4].

 7. The terminology of the expected-value definition is consis-
    tent with common language.  For example, it is natural and
    informative to report the results of an empirical research
    project that found the result in Jan's example as "There is
    no evidence of a relationship between x and y but there is
    good evidence of a relationship between x and V(y)."

 8. The expected-value definition makes a distinction between
    certain types of relationships between variables -- a dis-
    tinction that definitions [2], [3], and [5] do not make.
    (The distinction is also made by definition [4].)  The dis-
    tinction is in terms of the function f (which is usually
    merely the identity function) that is applied to the values
    of the response variable.  In cases in which this function is
    not the identity function, naming it helps one to understand
    the relationship.

 9. Empirical researchers are generally interested in minimizing
    the (error) variance in the values of the response variable
    in a research project.  This is because minimizing variance
    results in increased precision of prediction or control of
    the values of the response variable, which is a widely pur-
    sued general goal of empirical research.  However, minimiza-
    tion of variance is usually not pursued directly in empirical
    research.  Instead, minimization of variance comes as a sec-
    ondary benefit from studying prediction or control of ex-
    pected value through relationships between variables.  That
    is, usually a large part of the variability in the values of
    the response variable in an empirical research project is as-
    sumed to reflect the fact that this variable depends on nu-
    merous other variables (many of which may be unknown), and
    these other "influencing" variables may be varying (either
    systematically or at random) within or between entities,
    thereby causing some of the variation in the values of the
    response variable.  (Some of the variation in the response
    variable is also due to measurement error, and some of the
    variation may be "totally random".)  Identifying the influ-
    encing variables through studying relationships between vari-
    ables in terms of the expected-value definition "removes" the
    variation from the values of the response variable that can
    be associated with these variables, thereby reducing the "er-
    ror" variance in the values of the response variable, and
    thereby increasing precision in prediction or control.  That
    is, researchers generally increase precision through studying
    relationships between variables (and improving measurement
    methods) -- not through direct efforts to somehow reduce
    variance without studying relationships between variables.

10. The expected-value definition is consistent with the distri-
    bution definition.  This is because the expected-value defi-
    nition is not stated as "if and only if".  The expected-value
    definition gives only a sufficient condition for a relation-
    ship between variables -- it does not give a necessary condi-
    tion.  (As noted above, this condition defines an empirically
    large subset of the cases defined by the distribution defini-
    tion.)  Thus the expected-value definition leaves open the
    possibility that other forms of "relationship" might also di-
    rectly qualify, although we need not discuss this esoteric
    point with non-statistics-majors.

Consider some features and advantages of the distribution defini-
tion:

 1. The distribution definition identifies a class of relation-
    ships between variables that the expected-value definition
    does not directly identify.  These are the relationships that
    resemble Jan's V(y|x) case.  (However, as noted, the ex-
    pected-value definition does identify these cases if an ap-
    propriate function is applied to the values of the response
    variable.)

 2. Unlike the expected-value definition, the distribution defi-
    nition (with the exception of [4]) does not force one to look
    for a function to deal with cases like Jan's V(y|x) case.  If
    the inequalities in [2], [3], or [5] are satisfied in any
    way, the distribution definition declares that a relationship
    exists between the two variables.  This idea is important
    from a theoretical point of view.  However, the idea is not
    often directly applied in empirical research.  This is be-
    cause in empirical research it is generally easier to find an
    appropriate function (if needed) and then to use the expected-
    value definition than it is to study the entire distribution
    of the values of the response variable.  That is, researchers
    generally focus on a key aspect of the distribution, which
    is usually the value it is "centered" around, which is
    usually (perhaps after a transformation) best represented by
    its expected value.  Experience has shown that studying the
    expected value (or occasionally some other measure of
    central tendency) of the response variable (while keeping an
    eye on the spread) is an efficient way of breaking down the
    concepts to simple yet generally sufficient principles.

 3. The distribution definition reflects the Bayesian approach to
    the study of relationships between variables.  This approach
    is reasonably viewed as focusing on the distribution of the
    values of the response variable (as opposed to focusing
    merely on the expected value of the response variable).  Re-
    searchers using the Bayesian approach study the relationship
    between a response variable (which may be a parameter) and
    zero or more predictor variables by inferring the "posterior"
    distribution of the values of the response variable.  They
    make this inference on the basis of Bayes' theorem and

    (a) the values of the response variable and predictor vari-
        able(s) (if any) obtained from the entities in the sample
        in the research project and
    
    (b) the "prior" distribution of the values of the response
        variable (possibly conditioned on the values of the pre-
        dictor variables).
   
    Thus the distribution definition directly mirrors the
    Bayesian approach.  This is a crucial advantage of the dis-
    tribution definition if one is using the Bayesian approach.
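
As a minimal illustration of the Bayesian idea in point 3 above
(a toy example of my own, not tied to any particular research
project), the following sketch computes a posterior distribution
with the standard Beta-Binomial conjugate update:

    # Infer the posterior distribution of a proportion from a
    # flat Beta(1, 1) prior and made-up observed data.
    from scipy import stats

    prior_a, prior_b = 1.0, 1.0
    successes, failures = 27, 13      # made-up sample counts

    post = stats.beta(prior_a + successes, prior_b + failures)
    print(round(post.mean(), 3))               # posterior mean
    print([round(v, 3) for v in post.interval(0.95)])

    # The whole posterior distribution (not just its mean) is the
    # object of interest, mirroring the distribution definition.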

I am unable to think of other significant features or advantages
of the distribution definition.  If readers see other features or
advantages of either definition, I hope they will present them to
this debate.


WHICH DEFINITION IS PREFERRED?

Having considered some features and advantages of the two defini-
tions of the concept of 'relationship between variables', let us
now consider which definition is preferred.

First, which definition is preferred for an introductory statis-
tics course for non-statistics-majors?  For such a course, if the
teacher elects to present a formal definition of the concept of
'relationship between variables', I recommend emphasizing the
expected-value definition.  I base this on my beliefs that (a)
the expected-value approach is easier to understand, and (b) the
expected-value approach mirrors the methods statisticians and re-
searchers usually use to detect and study relationships between
variables in empirical research.

On the other hand, I recommend emphasizing the distribution defi-
nition if a teacher elects to teach the Bayesian approach in a
statistics course.  The distribution definition is preferred in
this case because, as noted, it directly mirrors the Bayesian ap-
proach.  (However, I recommend against teaching the Bayesian ap-
proach in an introductory statistics course for non-statistics
majors -- see appendix J.)

Finally, in a statistics course for students who are majoring in
statistics or mathematics, or in a statistics course for students
who have sufficient statistical experience, or in general statis-
tical discussion, I believe the preferred definition of "rela-
tionship between variables" should be at the discretion of the
instructor or participants.  Reasonable criteria for making the
choice are that the preferred definition for a particular discus-
sion should

1. maximize understanding and

2. provide optimal support for the intended analysis.

In addition to being preferred in Bayesian cases, the distribu-
tion definition can better satisfy the two criteria in some non-
Bayesian cases, especially in some theoretical and mathematical
cases.  For example, the technical discussion in appendix D ap-
peals to the distribution definition.

On the other hand, in many other non-Bayesian cases the expected-
value definition seems superior.  For example, "standard" analy-
sis of variance seems better viewed in terms of the expected-
value definition.  This is because in standard analysis of vari-
ance the resulting p-values are almost always reasonably viewed
as testing for relationships between variables in terms of de-
tecting differences between means of the values of the response
variable -- standard analysis of variance does not (directly)
test anything about the higher moments of the values of the re-
sponse variable.  Appendix K further discusses this point.


GENERALIZATION OF THE DEFINITIONS

Definitions [1] through [7] are all definitions of a relationship
between a single response variable and a single predictor vari-
able.  Appendix H discusses the important issue of generalizing
the definitions to situations with multiple response variables
and multiple predictor variables.


MAIN POINTS

The concept of 'relationship between variables' can be reasonably
defined in terms of the concept of 'expected value' and in terms
of the concept of 'univariate distribution'.  The two definitions
are not equivalent.  The expected-value definition identifies a
large subset of the cases identified by the distribution defini-
tion.  The expected-value definition indirectly identifies the
remaining cases.

The expected-value definition is easier to understand and has
several other significant advantages over the distribution defi-
nition.  And for the introductory statistics course for non-
statistics-majors the expected-value definition appears to have
no serious disadvantages.  Thus I recommend that a teacher empha-
size the expected-value definition in an introductory statistics
course for non-statistics majors if the teacher elects to present
a formal definition of the concept of 'relationship between vari-
ables'.

The distribution definition is preferred when the Bayesian ap-
proach is used and in some theoretical and mathematical discus-
sions.

The easy-to-understand concept of 'relationship between vari-
ables' is a central unifying concept of both the field of statis-
tics and empirical research.  A key use of the concept is to as-
sist researchers in accurate prediction and control.  Thus I rec-
ommend that the introductory statistics course for non-statistics-
majors focus on the study of relationships between variables in
empirical research as a means to accurate prediction and control.
This focus is important whether the concept of 'relationship' is
formally defined or is instead informally characterized in terms
of practical examples.

Don Macnaughton

-------------------------------------------------------
Donald B. Macnaughton   MatStat Research Consulting Inc
donmac@matstat.com      Toronto, Canada
-------------------------------------------------------


APPENDIX A:  IS THE CONCEPT OF A "RANDOM" VARIABLE NECESSARY IN
             THE DEFINITION OF "RELATIONSHIP BETWEEN VARIABLES"?

(The order of the appendices in this essay respects logical de-
pendencies among them, which makes the order different from the
order in which they are first referenced above.)

The definitions I discuss above by de Leeuw, Rubin, Freund and
Walpole, and Hogg and Craig all refer to the concept of a "ran-
dom" variable.  On the other hand, definition [1], Frick's [6],
and [7] omit the concept of 'random' and simply use the concept
of 'variable'.  Should the definition of "relationship between
variables" appeal to the concept of 'random variable'?

First, what is a random variable?  I suggest that the following
definition mirrors current statistical thinking:

    DEFINITION: A variable is a *random variable* if it has a
    probability distribution (density) implicitly or ex-
    plicitly associated with it.

The definition is sometimes stated in other ways, such as "random
variable: a measurable function on a probability space; ..."
(Parker 1994).  But in almost every case the same ideas are pre-
sent, especially the key idea of 'probability distribution'.

Lay people often view a random variable as varying "at random",
with the concept of 'at random' connoting a vague sense of unpre-
dictability.  This view is consistent with the statistical defi-
nition of "random variable".  However, the statistical definition
sharpens the focus by using the idea of the probability distribu-
tion of the values.  This enables us to go beyond vague "unpre-
dictability" to being able to efficiently characterize various
properties of the randomness or unpredictability.  We character-
ize these properties in terms of properties of the distribution --
specifically, in terms of the type of the distribution and in
terms of the values of its parameters.

A key use of the concept of 'random variable' is to concisely
distinguish between the variables that are used in statistics
(which generally have probability distributions) and the vari-
ables that are used in general mathematics (which generally lack
probability distributions).  Thus in distinguishing statistics
(especially the mathematical aspects of statistics) from general
mathematics, the distinction between "random" variables and other
variables is clearly important.

On the other hand, the distinction between random variables and
other variables is less important in empirical research, because
virtually all variables studied in empirical research can be rea-
sonably viewed as random variables.  This is because
virtually all variables in empirical research are repeatedly
measured and, in general, the values vary from one measurement to
the next, which implies a probability distribution.  (An arguable
exception is that each manipulated [predictor] variable in a for-
mal experiment is "fixed" at two or more different values by the
researcher, and thus [in a sense] lacks a distribution.  But the
fixed values differ from each other and thereby [speaking perhaps
loosely] have a distribution with probabilities, which satisfies
the condition of the definition of "random variable".)  Since
virtually all variables in empirical research can be reasonably
viewed as being "random" variables, the idea of 'random variable'
does not make a useful distinction in empirical research.  Per-
haps for this reason, the idea of a "random" variable often fades
into the background in empirical research, and is not mentioned.
Instead, discussion usually proceeds in terms of the concept of
'variable' without the modifier "random".

For the introductory statistics course for non-statistics majors,
I recommend above that teachers emphasize the use of statistics
in empirical research where the idea of a random variable often
fades into the background and thus is less important.  I also
recommend above that teachers omit (or minimize) discussing the
mathematical aspects of statistics where the idea of a random
variable is important.  These points lead me to recommend that
teachers omit discussing the somewhat abstruse concept of 'random
variable' in introductory statistics courses for non-statistics-
majors.

On the other hand, in courses for students who are majoring in
statistics or mathematics, it is clear that the concept of 'ran-
dom variable' is important, identifying a key distinction between
mathematical statistics and general mathematics.

(The above ideas raise the question of how we should define the
statistical concept of 'variable'.  I propose a definition of the
concept in the 1999b paper [sec 3.2].)


APPENDIX B:  EQUIVALENCE OF DEFINITIONS [2], [3], [4], AND [5]

In the body of this essay I say that definitions [2], [3], [4],
and [5] of the concept of 'relationship between variables' are
equivalent to each other.  This appendix illustrates the
equivalences.

In saying that the four definitions are "equivalent" I mean that
they will always theoretically agree as to whether a relationship
exists between two compatible variables.

To see the equivalence of [2], [3], and [4], imagine (in three-
space) the surface defined by the bivariate probability (density)
function of two compatible variables x and y and consider slices
through this surface made parallel to the plane defined by the p-
axis (probability [-density] axis) and the y-axis.  The slices
can be made at any permissible value for x in the range of the x-
axis.  Consider two exhaustive situations:

A. the situation in which all such slices are similar to each
   other and

B. the situation in which at least one slice is not similar to
   another slice.

(Two slices [made parallel to the p-y plane] of the surface de-
fined by the bivariate probability [density] function are "simi-
lar" to each other if for every value y' in the y-dimension the
probability [density] function values at y' for the two slices
are in a constant proportion to each other, except that at some
values y' both function values may be zero.)

If situation A is satisfied for a given pair of compatible vari-
ables, one can easily see that [2], [3], and [4] all declare that
no relationship is present.  On the other hand, if situation B is
satisfied, one can easily see that [2] and [3] declare that a re-
lationship is present.

To show that situation B implies that [4] declares that a rela-
tionship is present, we must show that whenever situation B is
satisfied a function f exists such that the inequality in [4] is
satisfied.  Such a function always (theoretically) exists because
the probability (density) function of y can always serve as the
function.  That is, if situation B is satisfied, and if we use
the probability (density) function as the function in [4], we can
always find a point (x',y') -- a point of dissimilarity between
the marginal and conditional probability (density) functions of y
-- such that [4] is satisfied.

                            *   *   *

Simple algebra and the standard definition of conditional prob-
ability imply that [2] and [5] are equivalent.

Therefore, definitions [2], [3], [4], and [5] of the concept of
'relationship between variables' are theoretically equivalent to
each other in terms of their pronouncements of whether a rela-
tionship exists between two compatible variables.
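
Here is a minimal Python sketch of these equivalences for a small
made-up discrete joint distribution.  The rows of the conditional-
probability table play the role of the slices discussed above, and
the factorization check corresponds to the independence condition
negated in [5]:

    import numpy as np

    # Hypothetical joint probability table for discrete x and y.
    # Rows index values of x; columns index values of y.
    pxy = np.array([[0.10, 0.20, 0.10],    # x = x1
                    [0.15, 0.30, 0.15]])   # x = x2

    px = pxy.sum(axis=1)                   # marginal distribution of x
    py = pxy.sum(axis=0)                   # marginal distribution of y

    # Slices parallel to the p-y plane, rescaled: p(y|x) for each x.
    cond = pxy / px[:, None]

    # Situation A of this appendix: all slices similar to each other.
    slices_similar = np.allclose(cond, cond[0])

    # The independence condition: p(x, y) = p(x) p(y) everywhere.
    independent = np.allclose(pxy, np.outer(px, py))

    print(slices_similar, independent)     # True True for this table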


APPENDIX C: THE SYMMETRY OF DEFINITIONS OF "RELATIONSHIP BETWEEN
            VARIABLES"

In his 99/5/16 post Jan de Leeuw speaks of "a problem with symme-
try" and asks whether we can "reverse the role of x and y in
these definitions".

By "these definitions" I think Jan means the following:

- the definition I propose in my 99/5/16 post, which appears
  above as [1]

- some derived forms of [1] in the 99/5/16 post

- the two informal definitions Herman Rubin proposes in his
  98/8/3 post and

- the definition Jan proposes in his 99/5/16 post, which I quote
  above, and which is reflected in [2].

In referring to the "symmetry" of the definitions, I think Jan
means the following:

    DEFINITION: A definition of the concept of 'relationship
    between two variables' is *symmetric* if and only if for
    all pairs of compatible variables x and y the definition
    declares that a relationship exists between the vari-
    ables with y in the role of the response variable if and
    only if it also declares that a relationship exists be-
    tween the variables with x in the role of the response
    variable.

In referring to a "problem" with symmetry, Jan may be suggesting
that a reasonable definition of the concept of 'relationship be-
tween variables' ought to be symmetric.  Jan may be noting that
[2] is symmetric, but [1] is not, and he may be suggesting that
therefore [1] is less reasonable.

In the following paragraphs I first discuss how the distribution
definition is symmetric.  Then I discuss how the expected-value
definition is not symmetric.  Then I discuss whether the lack of
symmetry of the expected-value definition is a problem.

To see the symmetry of the distribution definition (i.e., defini-
tions [2] through [5]), consider [5].  Clearly, [5] is symmetric.
But, as discussed in appendix B, [2] through [5] are all equiva-
lent.  Therefore, the distribution definition is symmetric.

A graphical way to see the symmetry of [2] and [3] is to consider
the joint probability (density) function of two compatible vari-
ables x and y when there is and is not a relationship (as defined
by any of [2] through [5]), and to note the concomitant absence
or presence of similarities of slices made parallel to the p-y
plane, as discussed in appendix B.  If no relationship is pre-
sent, [2] and [3] imply that all slices parallel to the p-y plane
will be similar to each other.  But this geometrically implies
that all slices parallel to the p-x plane will also be similar to
each other.

On the other hand, if a relationship is present, [2] and [3] im-
ply that at least one slice parallel to the p-y plane will be
dissimilar to another slice parallel to that plane.  But this geo-
metrically implies that at least one slice parallel to the p-x
plane will also be dissimilar to another slice parallel to that
plane.  Thus the distribution definition is symmetric.

On the other hand, the expected-value definition is not symmet-
ric, as illustrated by Jan's example.  That is, suppose for some
pair of compatible variables x and y the conditions of Jan's ex-
ample are satisfied and therefore E(y|x) = E(y) for all x, but
V(y|x') ~= V(y) for some x'.  Furthermore, for ease of visualiza-
tion, suppose that V(y|x) is an increasing function of x.  If the
expected-value definition is to be symmetric, it would have to be
true that E(x|y) = E(x) for all y.  But this is clearly not the
case.  For example, if a value y' of y is "far away" (in either
direction) from the mean of the y's, E(x|y') is higher than if y'
is closer to the mean.  This disconfirming example shows that the
expected-value definition is not generally symmetric.
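
A rough Python simulation may help to visualize this asymmetry.
The uniform range for x and the normal errors are illustrative
assumptions, not part of Jan's example:

    import numpy as np

    rng = np.random.default_rng(0)

    # E(y|x) = 0 for every x, but V(y|x) = x**2 grows with x.
    x = rng.uniform(1.0, 3.0, size=200_000)
    y = x * rng.standard_normal(200_000)

    # With y as the response, conditional means of y barely vary:
    print(y[x < 2.0].mean(), y[x >= 2.0].mean())   # both near 0

    # With x as the response, the conditional mean of x depends on
    # whether y is far from its mean:
    far = np.abs(y) > 3.0
    print(x[far].mean(), x[~far].mean())   # first value is larger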

However, although the expected-value definition is not generally
(analytically) symmetric, this does not appear to be a problem
when we are using statistics in empirical research.  This is be-
cause in empirical research we are generally only interested in a
one-way relationship.  That is, we identify a response variable y
whose values we would like to predict or control.  Then we iden-
tify predictor variables x1, x2, ... that we will measure (along
with y) in an empirical research project to see if we can find
evidence of a relationship.  If we find such (reasonable) evi-
dence, we can use the information derived from the relationship
to make predictions (or perhaps exercise control) of the value of
y in new entities from the population on the basis of the values
of the x's.  But we are usually not interested in turning things
around and predicting or controlling the values of the x's from
the values of y.  Thus in empirical research the question of
whether the expected-value definition (or any other definition of
"relationship between variables") is symmetric is usually irrele-
vant.

The lack of symmetry in the expected-value definition actually
arises only in unusual cases.  In most real cases in empirical
research if the expected-value definition is satisfied by a pair
of compatible variables x and y, we will find that the definition
is also satisfied if the roles of x and y are reversed, although
this fact may not be of much interest or importance.

Finally, the lack of symmetry in the expected-value definition,
rather than being a shortcoming, can be viewed as an advantage.
This is because if we determine that a certain "relationship" be-
tween variables is not "bi-directional", which we can do if we
think in terms of the expected-value definition or [4], this is
an additional piece of possibly interesting information about the
relationship -- information that we cannot directly obtain if we
use [2], [3], or [5].

(A relationship between variables x and y is bi-directional if
the expected-value definition declares that the relationship ex-
ists regardless of whether x or y is used in the role of response
variable.)

(We can use [4] to determine whether a given relationship between
variables is bi-directional by noting the function f used in
specifying the relationship -- the nature of this function may
help us to determine whether the relationship is bi-directional.)


APPENDIX D: VERIFICATION AND FALSIFICATION IN THE STUDY OF
            RELATIONSHIPS BETWEEN VARIABLES

Consider the statement that a relationship exists between a re-
sponse variable and a predictor variable.  If we use the distri-
bution definition of "relationship", this statement is logically
equivalent to the statement that the variables are not independ-
ent.  This leads to the question of which of the following two
equivalent approaches we should emphasize in technical discus-
sions about empirical research:

- verifying the existence of a relationship

- falsifying the non-existence of the relationship (i.e., falsi-
  fying the independence of the variables).

Popper (1959, 1983, 1989) emphasizes falsification although he
emphasizes falsification of theories, as opposed to falsification
of independence of variables.  (I propose a list of components of
a scientific theory in a Usenet post [1997c].)

Work that falsifies a theory is clearly important -- an effective
approach to advancing knowledge is to engage in an ongoing exer-
cise to try to find (reasonable) falsifications of scientific
theories.  Most scientists continually follow this approach, es-
pecially with new theories.

We can falsify a theory in two ways:

- We can falsify a theory by finding a reasonable alternative ex-
  planation of a leading research result that supports the the-
  ory.  The alternative explanation implies that the result is
  equivocal, and therefore the support for the theory is dimin-
  ished (Mosteller 1990, Lipsey 1990).  For example, Fleischmann,
  Pons, and Hawkins proposed the "cold fusion" theory (1989).
  This theory describes a hypothetical process the authors be-
  lieved could provide safe inexpensive energy.  They reported
  research results that suggest (somewhat) that the theory is
  correct.  A few other researchers also reported research re-
  sults that support the theory.  But many other researchers were
  unable to replicate these results and were unable to obtain
  other results that should be obtainable if the theory is cor-
  rect.  Thus reasonable alternative explanations of the original
  results were proposed, as detailed by Huizenga (1993) and
  Hoffman (1995).  (The alternative explanations generally in-
  volve shortcomings in the design and conduct of the original
  research.)  A few physical scientists continue to believe that
  the cold fusion theory is correct, but most now believe that
  some of the alternative explanations are correct.  Thus most
  physical scientists believe that the cold fusion theory lacks
  unequivocal empirical support and thus lacks credence.  (Of
  course, the theory could immediately be resurrected if new un-
  equivocal and replicable empirical results supporting the the-
  ory are brought forward.)

- We can falsify a theory by finding a new (reliable) empirical
  research result that is inconsistent with the theory.  For ex-
  ample, Michelson and Morley carefully measured the relative
  speed of light when it was traveling in different directions
  and found (within an estimable amount of measurement error)
  that the speed was (apparently) the same in all the (measured)
  directions (1887).  That is, they found no evidence of a rela-
  tionship between "direction of light travel" and "speed of
  light".  This tended to falsify the then-popular theory of the
  luminiferous ether, which had predicted that the speed would
  vary in different directions to a degree substantially greater
  than the measurement error in the experiment.

Consider the totality of empirical research projects, including
those that serve to falsify theories.  As noted above, examina-
tion of such projects suggests that most of them can be reasona-
bly viewed as studying the existence (and other aspects) of a re-
lationship between a single response variable and one or more
predictor variables.

As also noted, we can turn things around and say that these re-
search projects are attempting to falsify the statement that the
variables are independent.  However, this is effectively a double
negative (i.e., falsifying the statement that no relationship ex-
ists) and seems harder to understand than the simple logically
equivalent idea of verifying the statement that the relationship
exists.

Thus despite Popper's emphasis on falsification, in technical
discussions about empirical research it is reasonable to empha-
size the concept of verifying the existence of a relationship be-
tween variables (and further study of the relationship) instead
of emphasizing the logically equivalent but harder-to-understand
concept of falsifying the independence of the variables.


APPENDIX E:  A CASE WHEN RESEARCHERS DO DISCUSS INDEPENDENCE OF
             VARIABLES

The discussion of [5] contrasts the definitions of "relationship
between variables" with the definition of "independence of vari-
ables" and says that empirical researchers rarely attempt to sup-
port the claim that two variables are independent.  That is, re-
searchers rarely attempt to support the claim that no relationship ex-
ists between two given compatible variables.  An informative ex-
ception to this rule occurs with the so-called "conservation"
laws in the physical sciences.

A conservation law states that within a "closed system" a cer-
tain variable is constant.  That is, no relationship exists be-
tween the variable and any other variable.  For example, the law
of conservation of energy states that the total amount of energy
in a closed system is constant.  In other words, there is no re-
lationship between the total amount of energy in such a system
and any other variable (including the variable "time").

Other conservation laws (all defined by Parker 1994, p. 441) are
about conservation of
- angular momentum (in mechanics)
- areas (in orbital mechanics)
- charge (in the study of electricity)
- condensation (in fluid mechanics)
- mass (which is similar to "weight", and is a property of physi-
  cal objects)
- matter (which is the "material" that physical objects are made
  of)
- momentum (in mechanics)
- orbital symmetry (in organic chemistry)
- parity (in quantum mechanics)
- probability (in quantum mechanics)
- vorticity (in fluid mechanics).

The conservation laws reflect the efforts of empirical research-
ers to find relationships between the variables named in the laws
and other variables.  These efforts have repeatedly failed to
find any such relationships in closed systems.  The various fail-
ures to find relationships (that is, the "negative results") are
enshrined in the conservation laws.

Although the conservation laws have substantial empirical sup-
port, the principle of parsimony (as discussed above) recommends
that we assume that no relationship exists between any pair of
compatible variables until unequivocal evidence of such a rela-
tionship is brought forward.  Thus the conservation laws may seem
less important because they only state what the principle of par-
simony recommends we assume.  Thus perhaps the conservation laws
need no empirical support, and thus perhaps the time and effort
spent discovering them was wasted.

However, the researchers who discovered the conservation laws
usually did not set out to discover them.  Instead, they set out
to discover a relationship between the variable named in a con-
servation law and some other variable.  But they failed because
(apparently) no such relationships exist.

Although the conservation laws merely state what the principle of
parsimony recommends we assume, they serve three important pur-
poses:

- The conservation laws state that certain (apparent) constancies
  have been repeatedly found in empirical research.  Knowledge of
  these constancies is generally a crucial element in building
  models of how the particular area of interest works.

- The conservation laws warn researchers that any search (in a
  closed system) for a relationship between a variable named in a
  conservation law and another variable will likely be unsuccess-
  ful.

- The conservation laws are monuments to the nameless researchers
  who diligently tried to refute them, but failed.  (One rarely
  finds fame in a negative result, although Michelson and Morley
  did [1887].)

Because the conservation laws reflect negative results, they are
not necessarily the last word on the situations they describe.
In particular, it is possible that researchers will someday dis-
cover that some of the conservation laws are untrue, and that a
relationship does exist in a closed system between a variable
named in a conservation law and one or more other variables.  (In
general, any such relationships will be weak or hard-to-observe --
otherwise they would have already been discovered.)  For exam-
ple, the law of conservation of matter has already been shown to
be untrue in a certain hard-to-observe situation -- the situation
in which matter is converted to energy through a nuclear reac-
tion.


APPENDIX F: EQUIVALENCE OF DEFINITIONS [1], [6], AND [7]

In the body of this essay I say that definitions [1], [6], and
[7] of the concept of 'relationship between variables' are
equivalent to each other.  This appendix illustrates the
equivalences.

As before, in saying that the three definitions are "equivalent"
I mean that they will always theoretically agree as to whether a
relationship exists between two compatible variables.

Here is an argument to show the equivalence of [1] and [6]:

First, suppose that definition [1] declares that a relationship
exists between two compatible variables x and y.  Then [1] im-
plies that there exists a value of x, say x', such that E(y|x')
~= E(y).  Then one of two cases must occur:

1. There is a value x" of x different from x' such that E(y|x") =
   E(y).  Thus E(y|x") ~= E(y|x').  Thus [6] declares that a re-
   lationship exists between x and y.

2. There is no value x" such that E(y|x") = E(y).  This case may
   occur because the mean of the values of y is not a permissible
   value.  But in this case there must exist a value x" that is a
   permissible value of x such that E(y|x") ~= E(y|x').  This
   value is necessary to make E(y) different from E(y|x') as
   given by [1].  But in this case [6] declares that a relation-
   ship exists between x and y.

Q.E.D.

For the converse, suppose that definition [6] declares that a re-
lationship exists between two compatible variables x and y.  Then
[6] implies that values exist of x, say x' and x", such that
E(y|x') ~= E(y|x").  Then E(y|x) ~= E(y) for some x (i.e., either
x' or x").  But in this case definition [1] declares that a rela-
tionship exists between x and y.

Q.E.D.


Similarly, here is an argument to show the equivalence of [6] and
[7]:

First, suppose that definition [6] declares that a relationship
exists between two compatible variables x and y.  Then [6] im-
plies that values x' and x" of x exist such that

                      E(y|x') ~= E(y|x").                     (a)

Define a function g(x) such that g estimates E(y|x).  In view of
(a), if g is sufficiently accurately defined, g(x') ~= g(x").
Thus g (if sufficiently accurately defined) is a non-constant
function that estimates the value of y from a given value of x.

(Because this argument is attempting to show the theoretical
equivalence of [6] and [7], it can require definition of the
function g and require that g be sufficiently accurately defined.
In a specific empirical situation the operations associated with
properly defining g are carried out by collecting sufficient ap-
propriate data and performing a reasonable analysis of the data
to derive g.)

Consider a specific evaluation of g(x) in an empirical situation.
In this situation the associated error term e has a specific nu-
meric value (called the "residual") that is equal to the differ-
ence between the measured value of y and g(x).  Assuming we have
sufficient access to data, we can use appropriate residual analy-
sis to refine the function g to the point that the residuals are
statistically independent of the values of x.  Thus we can sat-
isfy the condition of [7] regarding the error term.  But in this
case definition [7] declares that a relationship exists between x
and y.

(Actually, the condition in the preceding paragraph need not be
satisfied for us to believe that a relationship exists under
definition [7].  And if a non-constant function g can do better
than chance at predicting the value of y from a value of x, most
statisticians will agree that a relationship exists regardless of
whether the residuals are independent of x.  However, it is use-
ful to emphasize the requirement that the residuals be independ-
ent of x because [as discussed by Draper and Smith 1998, chap. 2]
if we refine g to make the error term properly independent of the
predictor variable(s), this leads us to the form of g that gives
the most accurate predictions.)

Q.E.D.
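
Before turning to the converse, here is a small Python sketch of
deriving g and checking its residuals against x.  The straight-
line relationship and the specific numbers are assumptions for
illustration only:

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical data with E(y|x) = 2 + 3x.
    x = rng.uniform(0.0, 10.0, size=1_000)
    y = 2.0 + 3.0 * x + rng.standard_normal(1_000)

    # One reasonable way to derive g: the least-squares line.
    slope, intercept = np.polyfit(x, y, deg=1)
    g = intercept + slope * x
    residuals = y - g

    # g is non-constant, and the residuals show essentially no
    # linear association with x, as definition [7] requires.
    print(round(slope, 2))                             # near 3
    print(round(np.corrcoef(x, residuals)[0, 1], 3))   # near 0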

For the converse, suppose that definition [7] declares that a re-
lationship exists between two compatible variables x and y.  Then
[7] implies that a non-constant function g(x) is defined such
that g estimates E(y|x).  Since g is non-constant, there exist
two distinct values of x, say x' and x", such that g(x') ~=
g(x").  Then, assuming g is sufficiently accurately defined,
E(y|x') ~= E(y|x").  But in this case definition [6] declares
that a relationship exists between x and y.

Q.E.D.

Therefore, definitions [1], [6], and [7] of the concept of 'rela-
tionship between variables' are theoretically equivalent to each
other in terms of their pronouncements of whether a relationship
exists between two compatible variables.


APPENDIX G:  REWORDING [7] TO BE EQUIVALENT TO THE DISTRIBUTION
             DEFINITION

As discussed in appendix F, definition [7] (which defines "rela-
tionship between variables" in terms of the concept of 'mathe-
matical function') is equivalent to [1] and [6].  What is the re-
lationship between [7] and the distribution definition (i.e., [2]
through [5])?

Definition [7] is not equivalent to the distribution definition,
as can be seen by studying Jan de Leeuw's variance example:  It
is not possible to express Jan's variance relationship in the
form y = g(x) + e if g is restricted to being a standard mathe-
matical function (i.e., a one-to-one or many-to-one mapping) and
if e is restricted to being independent of x.  (Using Jan's spe-
cific example, we can express the key relationship as V(y|x) =
g(x) + e, but the relationship under consideration is between x
and y, not between x and V(y|x).)

Although [7] is not equivalent to the distribution definition, we
can reword it to be equivalent as follows:

    DEFINITION: There is a *relationship* between the vari-
    ables x and y if and only if the value of y can be ex-
    pressed as a "mapping" of x, symbolized by m(x).  At
    least one of the moments of m(x) must be a non-constant
    mathematical function of x.  An error term e may be in-   [8]
    cluded with the mapping, where e is usually viewed as
    being independent of x.  This is stated algebraically as

                          y = m(x) + e.

Note that m(x) represents a mapping (possibly one-to-many).  This
broadens the conditions of the definition to allow relationships
in which E(y|x) (first moment of y given x) is constant for all
values of x, but one or more of the higher moments of y depend on
x.

(Since [8] refers to the moments of the distribution of the val-
ues of y, it is valid only if the relevant moments of y exist.
Relevant moments of the values of the response variable can usu-
ally be found to exist in empirical research although focus is
usually on the first moment [reflecting the expected value or av-
erage], occasionally on the second [reflecting the spread], and
only rarely on the third [reflecting asymmetry], or fourth [re-
flecting peakedness].  Empirical researchers virtually never
study moments greater than the fourth.  This may be because the
"normal" nature of most data makes phenomena in these moments
[and in the third and fourth moments] unlikely to exist.  Fur-
thermore, even in cases in which such phenomena may exist, they
are not readily interpretable, nor obviously useful, although ex-
ceptions are conceivable.)

Consider the case in which the distribution definition correctly
declares that a relationship is present between two compatible
variables x and y, but the expected-value definition cannot di-
rectly detect the relationship.  For example, consider
again Jan's variance example.  In this case the mapping m(x) re-
quired by the definition will be of an unusual form because it
will include a random element (or some other form of "memory").
The random element is necessary to provide the one-to-many map-
ping that makes the variance of y dependent on x even though E(y)
is independent of x.  Such unusual forms are not often found in
empirical research.

(Another approach to making [7] equivalent to the distribution
definition is to allow the values [or moments] of the error term
e to be functions of the predictor variable x.  However, this ap-
proach has the undesirable feature that it violates the [not sac-
rosanct] standard principle that the error term is independent of
the values of the predictor variable[s].)


APPENDIX H: GENERALIZATION OF [1] THROUGH [8]

Definitions [1] through [8] are all definitions of a relationship
between a single response variable and a single predictor vari-
able.  How should we generalize these definitions to the case
with multiple response variables or multiple predictor variables?
Consider first the case of multiple response variables.

Most research projects with multiple response variables can be
assigned to one of three categories:

(a) research projects that use "repeated measurements" (also
    called "repeated measures") in which we have a single re-
    sponse variable that is measured repeatedly (i.e., more than
    once) in each research entity or

(b) research projects in which each response variable can be rea-
    sonably viewed as defining a logically separate research pro-
    ject having only a single response variable (but sharing the
    research entities and predictor variables with the other re-
    sponse variables) or

(c) research projects in which the multiple response variables
    can be viewed as a single (vector-valued) variable, even
    though they are not the same variable.

Some medical, social, and biological research falls into category
(a) in which the procedure of repeated measurements enables the
researcher to "reuse" the research entities (typically organisms,
perhaps humans), which can provide the dual benefits of decreased
research costs and increased power of the statistical tests (but
at a cost of increased complexity of concepts and assumptions).

Surveys often fall into category (b) -- a survey may ask respon-
dents several questions that each define a separate response
variable.  For example, in surveying customers leaving a store we
may ask about the quality of the service, whether the location of
the store is convenient, whether the price ranges for merchandise
are appropriate, and so on.  Each of these questions defines a
different response variable.  Each response variable is typically
analyzed separately to see if it depends on any of the shared
predictor variables, such as the customer's age, gender, purchas-
ing behavior, occupation, clerk used, and so on.

In category (c) note that the response variable in a practical
empirical research project is invariably viewed as a unity -- as
a single response variable (which may on rare occasion be a vec-
tor of different variables).  This is reasonable because if the
response "variable" is just a random conglomeration of (compati-
ble) variables, there would appear to be no obvious empirical
sense in using this conglomeration as the response "variable" in
an analysis.

Thus cases with multiple response variables can usually be rea-
sonably viewed (in one of the above three senses) as having only
a single response variable.

In categories (a) and (b) no definition of "relationship between
variables" with multiple response variables is needed because the
cases degenerate into the case with only a single response vari-
able.  In category (c) we do need such a definition for com-
pleteness; however, in view of the rareness of this case, that
definition is beyond the scope of this essay.

                            *   *   *

Let us now consider the important case with multiple predictor
variables.  How should we define the concept of 'relationship be-
tween variables' in this case?

First, it is reasonable to ask whether we need a definition of
the relationship between a response variable and multiple predic-
tor variables or whether we might get by more parsimoniously by
applying the simple definition multiple times.  That is, why not
say a relationship exists between a response variable and multi-
ple predictor variables if a separate simple relationship exists
between the response variable and each of the predictor vari-
ables?  Here, a simple relationship is defined by the version of
the definition in which only a single predictor variable is pre-
sent (e.g., one of definitions [1] through [8] above).

However, defining the concept of 'relationship between variables'
with multiple predictor variables in terms of multiple simple re-
lationships does not allow us to detect or study all possible
forms that a relationship might have.  In particular, using sim-
ple relationships does not allow us to fully study "interactions"
between the predictor variables with respect to their joint rela-
tionship to the response variable.  Interactions were discovered
(invented, devised) by Fisher (1935, chap. VI) as a key part of
his invention of analysis of variance.  Interactions provide a
method of detecting all the possible (strong enough) relation-
ships between the response variable and the predictor variables
in a research project -- certain interactive relationships cannot
generally be detected or observed if we define relationships be-
tween variables solely in terms of one of the simple definitions.
I give a definition of the concept of 'interaction' in a paper
(1998c, sec. 6.10).
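
The following Python simulation sketches a made-up extreme case
of such an interaction: the response is related to the pair of
predictors, yet neither predictor shows a simple relationship
with the response on its own:

    import numpy as np

    rng = np.random.default_rng(2)

    # x1 and x2 are independent +1/-1 variables and
    # E(y | x1, x2) = x1 * x2.
    n = 100_000
    x1 = rng.choice([-1.0, 1.0], size=n)
    x2 = rng.choice([-1.0, 1.0], size=n)
    y = x1 * x2 + 0.1 * rng.standard_normal(n)

    # Conditioning on one predictor at a time reveals nothing:
    print(y[x1 == 1].mean(), y[x1 == -1].mean())   # both near 0
    print(y[x2 == 1].mean(), y[x2 == -1].mean())   # both near 0

    # Conditioning on the pair reveals a strong relationship:
    print(y[(x1 == 1) & (x2 == 1)].mean())         # near +1
    print(y[(x1 == 1) & (x2 == -1)].mean())        # near -1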

                            *   *   *

Let us consider generalizations of [1] through [8] that enable us
to take account of interactions.  For maximum efficiency, we
would like each generalization to state a minimal set of condi-
tions for a relationship to exist.  The following generalizations
provide reasonable minimal conditions:

I give a generalization of [1] to the case with an arbitrary num-
ber of predictor variables in the paper for students (1997b, sec.
7.10).  That generalization can be used as a template for the
generalizations I propose for [2] through [4].

For example, here is the resulting generalization of [2] to the
case with two predictor variables:

    DEFINITION: There is a *relationship* between the pre-
    dictor variables x1 and x2 and the response variable y if
    and only if

                p(y'|x1',x2') ~= p(y'|x1') and

                p(y"|x1",x2") ~= p(y"|x2")

    for some y' and y" and for some permissible pair
    (x1',x2') and some permissible pair (x1",x2") of values
    of x1 and x2 where

    p(y'|*) = the probability (density) that the variable y
              has the value y' given that the variable(s)
              appearing at * is (are) at its (their) indi-
              cated value(s).

(The values of y' and y" in the definition need not be distinct,
nor the values of x1' and x1", nor x2' and x2".)
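
Here is a small Python check of the two inequalities in this gen-
eralization, applied to a made-up joint probability table (axis
order x1, x2, y) in which the response depends on the predictors
only through their interaction:

    import numpy as np

    # Hypothetical joint probabilities p(x1, x2, y), binary variables.
    p = np.array([[[0.20, 0.05],    # x1 = 0, x2 = 0
                   [0.05, 0.20]],   # x1 = 0, x2 = 1
                  [[0.05, 0.20],    # x1 = 1, x2 = 0
                   [0.20, 0.05]]])  # x1 = 1, x2 = 1

    p_y_x1x2 = p / p.sum(axis=2, keepdims=True)         # p(y|x1,x2)

    p_x1y = p.sum(axis=1)                                # p(x1,y)
    p_y_x1 = p_x1y / p_x1y.sum(axis=1, keepdims=True)    # p(y|x1)

    p_x2y = p.sum(axis=0)                                # p(x2,y)
    p_y_x2 = p_x2y / p_x2y.sum(axis=1, keepdims=True)    # p(y|x2)

    # Both inequalities of the generalized definition hold somewhere:
    print(not np.allclose(p_y_x1x2, p_y_x1[:, None, :]))  # True
    print(not np.allclose(p_y_x1x2, p_y_x2[None, :, :]))  # True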

Generalization of [3] and [4] follows the same pattern as that
of [1] and [2].
The general pattern is to state each inequality with the function
on the left having all the predictor variables as arguments.  The
function on the right is a reduced function with one of the pre-
dictor variables omitted.  For k predictor variables k inequali-
ties are stated, each with a different one of the predictor vari-
ables omitted from the function on the right.  The "function" is
either the expected-value function (for [1] and [4]) or the prob-
ability (density) function (for [2] and [3]).

Generalization of [5] can be performed by converting the gener-
alization of [2], which involves conditional probabilities, into
expressions in unconditional probabilities using the definition
of conditional probability.  The converted generalization of [2]
yields a generalization of [5].  This approach to generalizing
[5] yields expressions that are non-intuitive and is thus (like
[5]) of less direct interest in standard empirical research.

Generalization of [6] is slightly different from the generaliza-
tion of [1], but straightforward.

Generalization of [7] to the case with k predictor variables can
be performed by adding the requisite arguments for the additional
predictor variables to the argument list of g in the equation,
and by requiring that for each predictor variable xi the first-
order partial derivative of g(x1, ..., xk) with respect to xi be
non-zero for at least one set of values of the predictor vari-
ables.  (A different set of values of the predictor variables may
be used for each partial derivative.)

The approach in the preceding paragraph requires that g be dif-
ferentiable as indicated.  The definition can be generalized to
cases when g is not differentiable by speaking in terms of par-
ticular differences between the g(x1, ..., xk) for different
given values of the x's -- the discrete analogue of partial dif-
ferentiation.

(Fisher invented analysis of variance to systematize the study of
these differences in the general case when the response variable
is continuous and the predictor variables are discrete [1935].  I
believe that analysis of variance is the most important of
Fisher's many important contributions because it theoretically
underpins almost all modern scientific experimentation.)

Generalization of [8] to the case with k predictor variables can
be performed by adding the requisite arguments for the additional
predictor variables to the argument list of m in the equation,
and by requiring that for each predictor variable xi the first-
order partial derivative of at least one of the moments of m(x1,
..., xk) with respect to xi be non-zero for at least one set of
values of the predictor variables.  (A different moment and a
different set of values of the predictor variables may be used
for each partial derivative.)

The generalization of [8] can be generalized to cases when m is
not differentiable using the approach described above for the
generalization of [7].

Appendix F discusses how definitions [1], [6], and [7] are
equivalent, and appendix B discusses how definitions [2] through
[5] are equivalent.  These equivalences raise the question of
whether the generalizations of these definitions are also respec-
tively equivalent.  Although I suspect that the generalizations
discussed above are respectively equivalent, I have not studied
this question in detail.

Another approach to a general definition of the concept of 'rela-
tionship between variables' is to attempt a negation of the gen-
eral definition of "independence of variables".  This approach
was used above in the two-variable case to define [5].  However,
as noted in the discussion of [5], this approach does not distin-
guish between the response and predictor variables.  This makes
it difficult to distinguish the dependence or independence of
the predictor variables with each other from the dependence or
independence of the predictor variables with the response vari-
able.  But the dependence or independence of the predictor
variables with each other is irrelevant to defining a relation-
ship between a response variable and a set of predictor vari-
ables.  Thus it seems more difficult to obtain a minimal defini-
tion of the standard general concept of 'relationship between
variables' if we work from the negation of the general definition
of "independence of variables".

Other general definitions of "relationship between variables" may
be possible.  If readers see other useful general definitions, I
hope they will present them to this debate.


APPENDIX I: THE IMPORTANCE OF THE CONCEPT OF 'INDEPENDENCE OF
            VARIABLES'

In the body of this essay I say that in empirical research it is
reasonable to think in terms of the concept of 'relationship be-
tween variables' instead of thinking in terms of the concept of
'independence' of variables.  On the other hand, in mathematical
statistics, it is often easier to think in terms of the concept
of 'independence'.  One can see this by studying the generaliza-
tions of the definition of "relationship between variables" to
the case in which more than one predictor variable is present.
These generalizations can be compared to the generalization of
the definition of "independence".

Note how the generalizations of [1] through [8] (as discussed in
appendix H) are complicated.  This is because if k predictor
variables are present, the generalizations require that the re-
searcher or student verify that k different conditions are satis-
fied (instead of only the single condition that must be satisfied
in [1] through [8] if only one predictor variable is present).

On the other hand, the generalization of Freund and Walpole's
definition of "independence" (which was negated to yield [5]) is
(as given by Freund and Walpole) simple:  The generalization is
performed by merely adding more arguments to the left side and
more concomitant marginal factors to the right side of the single
equation.  We also add concomitant terms to the statement of the
list of universally quantified variables.
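
For example, for three variables the generalized condition is the
single factorization p(x1,x2,x3) = p(x1) p(x2) p(x3) for all val-
ues of the variables.  The following Python sketch (with made-up
marginal distributions) applies that single check to an independ-
ent table and to a perturbed, dependent one:

    import numpy as np

    # Hypothetical marginals for three binary variables x1, x2, x3.
    p1 = np.array([0.3, 0.7])
    p2 = np.array([0.6, 0.4])
    p3 = np.array([0.5, 0.5])

    # The single generalized condition:
    # p(x1, x2, x3) = p(x1) p(x2) p(x3) for all values.
    def independent(joint):
        m1 = joint.sum(axis=(1, 2))
        m2 = joint.sum(axis=(0, 2))
        m3 = joint.sum(axis=(0, 1))
        return np.allclose(joint, np.einsum('i,j,k->ijk', m1, m2, m3))

    joint_a = np.einsum('i,j,k->ijk', p1, p2, p3)   # built independent
    joint_b = joint_a.copy()
    joint_b[0, 0, 0] += 0.02                        # perturb, keeping
    joint_b[0, 0, 1] -= 0.02                        # the total at 1

    print(independent(joint_a), independent(joint_b))   # True False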

The ease of generalization of the definition of "independence" is
a strong reason why the concept of 'independence' is often pre-
ferred to its negation (i.e., the concept of 'relationship be-
tween variables') in general discussion in mathematical statis-
tics.


APPENDIX J:  SHOULD THE INTRODUCTORY STATISTICS COURSE TEACH THE
             BAYESIAN APPROACH?

If valid and reliable prior information is available in an em-
pirical research project, the Bayesian approach is generally the
best approach to use to analyze the data from the project.  This
is because (a) the Bayesian approach is generally equal (through
symmetry) to other approaches in its ability to take account of
the general needs of research projects, and (b) the Bayesian ap-
proach takes much better account of the prior information.  Un-
fortunately, however, valid and reliable prior information is un-
available in many empirical research projects because the pro-
jects are exploring new ground.

The Bayesian approach is generally more complicated than the non-
Bayesian approach.  This can be seen by noting that a Bayesian
procedure can usually (always?) be reasonably viewed as an en-
hancement of some non-Bayesian procedure.  The enhancement in-
volves extending the non-Bayesian procedure to take proper ac-
count of available prior information.

Because valid and reliable prior information is unavailable in
many empirical research projects, and in view of my recommenda-
tion above that the introductory statistics course for non-
statistics-majors focus on the use of statistics in empirical re-
search, and because the Bayesian approach is more complicated, I
recommend against emphasizing the Bayesian approach in an intro-
ductory statistics course for non-statistics-majors.  However, it
is helpful to briefly show introductory students the high-level
concepts of the approach.  This is because if valid and reliable
prior information is available, the Bayesian approach has the po-
tential to significantly enhance empirical research by refining
conclusions and reducing costs.

Albert (1997), Berry (1997), and Moore (1997b) debate teaching
the Bayesian approach in an introductory statistics course.


APPENDIX K:  DO ANALYSIS OF VARIANCE F-TESTS TEST VARIANCES?

In the body of this essay I say that statistical tests for de-
tecting relationships between variables are usually performed in
terms of linear functions of the values of the response variable
and not in terms of higher moments of the values, such as vari-
ances.  Consider a possible counterexample:

The standard F-tests in analysis of variance are generally used
to detect relationships between the response variable and the
predictor variable(s) in a designed experiment.  A standard
analysis of variance F-test is performed by computing the ratio
of two variance estimates, called the "F-ratio" (in Fisher's
honor).  The variance estimate in the denominator of the F-ratio
is "unadulterated", but the variance estimate in the numerator is
(if the associated relationship between variables exists in the
population) "adulterated" and "inflated" by the effects of the
predictor variable(s) involved in the relationship.  Thus these
tests are testing variances.  Thus perhaps this fact is a coun-
terexample to my claim that most tests for the existence of a re-
lationship between variables are done in terms of linear func-
tions of the values of the response variable.

In considering this issue, it is useful to distinguish between
the mathematics of the analysis of variance F-test and what is
functionally done.  Mathematically, for each hypothesis test in
standard analysis of variance it is clear that a ratio of two
variance estimates is tested with the F-test to determine whether
the numerator variance is significantly greater than the denomi-
nator variance.

(Confusingly, although only one tail of the central F-
distribution is used in computing the p-value for this test, the
test is still a "two-tail" test in the sense that in the one-way
case with two levels the p-value is identical to the two-tail p-
value from the associated t-test, and different from (i.e.,
twice) the associated one-tail p-value from the t-test.  [Only
one tail of the F-distribution is used because F is t-squared,
and thus the upper and lower tails of the t-distribution are
folded into the upper tail of the F-distribution.])
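
As a concrete illustration of this identity, the following Python
sketch (with made-up data) compares the one-way two-level F-test
with the pooled two-sample t-test on the same values:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)

    # Two illustrative groups for a one-way layout with two levels.
    g1 = rng.normal(10.0, 2.0, size=15)
    g2 = rng.normal(11.0, 2.0, size=15)

    f_stat, f_p = stats.f_oneway(g1, g2)   # one-way ANOVA F-test
    t_stat, t_p = stats.ttest_ind(g1, g2)  # pooled two-sample t-test

    print(np.isclose(f_stat, t_stat ** 2))  # True: F equals t squared
    print(np.isclose(f_p, t_p))             # True: same two-tail p-value
    # The one-tail t-test p-value (t_p / 2) is half the F p-value.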

Mathematically analysis of variance performs tests of variances.
But functionally analysis of variance does something quite dif-
ferent, as one can see by studying expressions for the statisti-
cal hypotheses being tested in analysis of variance.  I discuss
these expressions in a paper in terms of the cell-means and over-
parameterized model equations (1998c, sec. 14 & 15).

In standard analysis of variance the hypotheses being tested can
always be stated in terms of one or more linear functions (more
specifically, contrasts) of the values of the response variable.
I illustrate this fact in a computer program that shows that the
numerator sums of squares in standard balanced and standard un-
balanced analysis of variance (with no empty cells) can always be
viewed as being computed in terms of a set of contrasts (held in
the "projection matrix") that are applied to the vector of values
of the response variable (1998d, line 1047).
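
The following Python sketch (not the cited computer program, and
using made-up data) shows the same idea for a small balanced one-
way layout: the numerator sum of squares is a quadratic form ap-
plied to the vector of response values, built from a projection
matrix:

    import numpy as np

    # Made-up data: 3 groups of 4 values each, listed group by group.
    y = np.array([5., 7., 6., 8.,  9., 11., 10., 12.,  4., 6., 5., 7.])
    X = np.kron(np.eye(3), np.ones((4, 1)))    # cell-means design

    # Projection matrix for the hypothesis: difference between the
    # hat matrices of the cell-means model and the grand-mean model.
    H_full = X @ np.linalg.pinv(X)
    ones = np.ones((12, 1))
    H_mean = ones @ np.linalg.pinv(ones)
    P = H_full - H_mean

    # Numerator (between-groups) sum of squares computed two ways.
    ss_projection = y @ P @ y
    means = y.reshape(3, 4).mean(axis=1)
    ss_groups = 4 * np.sum((means - y.mean()) ** 2)

    print(np.isclose(ss_projection, ss_groups))   # True: both are 56.0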

Thus although the mathematics of the standard F-test in analysis
of variance is conveniently understood in terms of tests of vari-
ances, it is reasonable to view the detailed hypotheses being
tested in terms of linear functions (contrasts) of the values of
the response variable.

I learned the contrast approach to analysis of variance from
Daniel DeLury.  Dr. DeLury's simple but moving introduction to
the beauty of analysis of variance was the highlight of my formal
education.


REFERENCES

Albert, J. 1997. "Teaching Bayes' rule: A data-oriented ap-
   proach," The American Statistician, 51, 247-253.  (Part of a
   larger discussion pp. 241-274.)

American Statistical Association. 2002. "Curriculum guidelines
   for undergraduate programs in statistical science." Available
   at http://www.amstat.org/education/Curriculum_Guidelines.html

Berry, D. A. 1997. "Teaching elementary Bayesian statistics with
   real applications in science," The American Statistician,
   51, 241-246.  (Part of a larger discussion pp. 241-274.)

Draper, N. R. and Smith, H. 1998. Applied regression analysis.
   3rd ed. New York: John Wiley.

Fisher, R. A. 1935. The design of experiments. Edinburgh:
   Oliver and Boyd.  The 8th edition of this seminal work appears
   in Fisher (1990).

---- 1990. Statistical methods, experimental design, and scien-
   tific inference ed. by J. H. Bennett.  Oxford: Oxford
   University Press.

Fleischmann, M., Pons, S., and Hawkins, M. 1989. "Electrochemi-
   cally induced nuclear fusion of deuterium," Journal of
   Electroanalytical Chemistry, 261, 301-308.  Errata in 263,
   187-188 (1989).

Freund, J. E. and Walpole, R. E. 1987. Mathematical statistics.
   4th ed. Englewood Cliffs, NJ: Prentice-Hall.

Hoffman, N. 1995. A dialogue on chemically induced nuclear ef-
   fects: A guide for the perplexed about cold fusion. La Grange
   Park, Illinois: American Nuclear Society.

Hogg, R. V. and Craig, A. T. 1995. Introduction to mathematical
   statistics. 5th ed. Englewood Cliffs, NJ: Prentice Hall.

Huizenga, J. R. 1993. Cold fusion: The scientific fiasco of the
   century (revised and updated). Oxford: Oxford University
   Press.

Lipsey, M. W. 1990. "Theory as method:  Small theories of treat-
   ments," in Research Methodology:  Strengthening Causal Inter-
   pretations of Nonexperimental Data, ed. L. Sechrest, E.
   Perrin, and J. Bunker, Rockville MD:  U.S. Department of
   Health and Human Services, pp. 33-51.

Macnaughton, D. B. 1997a. "Re: How should we motivate students in
   intro stat? (Response to comments by John R. Vokey)." Posted
   to EdStat and sci.stat.edu on April 6, 1997.  Available at
   http://www.matstat.com/teach/p0024.htm

---- 1997b. "The entity-property-relationship approach to statis-
   tics:  An introduction for students."  Available at
   http://www.matstat.com/teach/

---- 1997c. 'EPR approach and scientific "explanation." (Response
   to comments by Robert Frick).' Posted to EdStat and
   sci.stat.edu on July 23, 1997. Available at
   http://www.matstat.com/teach/p0026.htm

---- 1998a. "Re: Eight features of an ideal introductory statis-
   tics course. (Response to comments by Gary Smith)." Posted to
   EdStat and sci.stat.edu on November 23, 1998.  Available at
   http://www.matstat.com/teach/p0036.htm

---- 1998b. "Eight features of an ideal introductory statistics
   course." Available at http://www.matstat.com/teach/

---- 1998c. "Which sums of squares are best in unbalanced analy-
   sis of variance?" Available at http://www.matstat.com/ss/

---- 1998d. "Computing numerator sums of squares in unbalanced
   analysis of variance: Two-way case (computer output)." Avail-
   able at http://www.matstat.com/ss/pr0139.htm

---- 1999a. "Response to comments by Herman Rubin." Posted to
   EdStat and sci.stat.edu on May 16, 1999.  Available at
   http://www.matstat.com/teach/p0041.htm

---- 1999b. "The introductory statistics course: The entity-
   property-relationship approach." Available at
   http://www.matstat.com/teach/

Michelson, A. A. and Morley, E. W. 1887. "On the relative motion
   of the earth and the luminiferous ether," American Journal of
   Science, 3rd Series, 34, 333-345.

Moore, D. S. 1997a. "New pedagogy and new content: The case of
   statistics" (with discussion), International Statistical
   Review, 65, 123-165.

---- 1997b. "Bayes for beginners? Some reasons to hesitate," The
   American Statistician, 51, 254-261.  (Part of a larger dis-
   cussion pp. 241-274.)

Mosteller, F. 1990. "Improving research methodology:  An over-
   view," in Research Methodology:  Strengthening Causal Inter-
   pretations of Nonexperimental Data, ed. L. Sechrest, E.
   Perrin, and J. Bunker, Rockville MD:  U. S. Department of
   Health and Human Services, pp. 221-230.

Parker, S. B. ed. 1994. McGraw-Hill dictionary of scientific and
   technical terms. 5th ed. New York: McGraw-Hill.

Popper, K. R. 1959. The logic of scientific discovery. Reprint
   1992, London: Routledge.

---- 1983. Realism and the aim of science. Reprint 1992,
   London: Routledge.

---- 1989. Conjectures and refutations. (5th ed.) Reprint 1992,
   London: Routledge.
