Subject: Re: Eight Features of an Ideal Intro Stat Course
         (Response to comments by Jan de Leeuw)

     To: EdStat-L and sci.stat.edu

   From: Donald B. Macnaughton <donmac@matstat.com>

   Date: Sunday October 31, 1999

     Cc: Jan de Leeuw <deleeuw@stat.ucla.edu>

Quoting a 99/5/2 post of mine, Jan de Leeuw writes (on 99/5/6):

> This is a very interesting discussion and Don's position is, as
> always, well-argued and convincing.  But
>
> -- "the external world" is ill-defined.  How about introspec-
> tion ?  How about flying saucers ?  Or, to put it differently,
> what ARE data about a non-external world ?

These are excellent questions.  I begin with a definition:

    The EXTERNAL WORLD is what is "out there" -- what we see 
    when we look out the windows in our heads and what we 
    sense through our other senses.  

I use the concept of 'external world' as a way of indicating the 
broadness of the applications of statistics -- we can use the 
field of statistics almost anywhere "external" to ourselves.  (To 
use statistics in some area we need only find reasonable ways to 
measure the relevant variables in the area.)

However, until I received Jan's post I believed that we could not 
obtain reliable measurements of variables in the INTERNAL world 
because that world is often (as a fact of human nature) ruled by 
prejudices -- it is subjective.  Thus I concluded that we could 
NOT use the field of statistics in the world of our inner experi-
ence.  

However, after considering Jan's question about introspection, I 
now think my conclusion may be incorrect.  It is indeed an 
EMPIRICAL question whether the internal world of one's own per-
sonal experience can also be subjected to statistical analysis, 
just like the external world.  I will be pleased if researchers 
show that the field of statistics is also applicable in the in-
ternal world of one's private experience.

                            *   *   *

In asking about flying saucers, I think Jan is wondering about 
things that do not "really" exist.  That is, I think he probably 
believes (as I believe) that flying saucers (vehicles used by ex-
traterrestrials visiting the earth) do not really exist (except 
in people's imaginations).  If Jan is asking how such things re-
late to the external world I refer to, my answer is that this is 
a matter of choosing a point of view.  I take the point of view 
that things (entities) that do not exist (like flying saucers) 
are still a part of the external world, because people talk about 
such things.  The fact that these things are (or may be) imagi-
nary need not somehow remove them (in their imaginary existence) 
from being a part of external reality.

                            *   *   *

In my 99/5/2 post, my main reference to the concept of 'external 
world' was in the definition I proposed of "empirical research", 
which I gave as follows:

    EMPIRICAL RESEARCH is any research in which data are 
    gathered from the external world and then conclusions are 
    drawn from the data about the external world.

Jan's questions have shown me that the concept of 'external 
world' is an unnecessary restriction in the definition.  Thus I 
now propose the following new definition:

    EMPIRICAL RESEARCH is any research in which data are 
    gathered from some area of experience and then conclu-
    sions are drawn from the data about the area of experi-
    ence.

The new definition encompasses both the external and internal 
worlds and is thus broader than the original definition.  I thank 
Jan for motivating the new definition.

                            *   *   *

Note that the new definition does not explicitly refer to the 
concept of 'data analysis'.  However, this concept is certainly 
implicit in the definition because the conclusions we draw from 
data in empirical research can only (sensibly) be drawn on the 
basis of some form of examination or "analysis" of the data.  
Diagrammatically:

  empirical research = data gathering, which is followed by
                       data analysis, which is followed by 
                       conclusion drawing.

I omit a direct reference to the concept of 'data analysis' in 
the definition because I wish to emphasize the goal of empirical 
research, which is not data analysis, but is instead (correct) 
conclusion drawing. 

(I discuss three types of conclusion drawing in the application 
of the field of statistics in appendix A.)


> -- I suggest "Statistics develops and evaluates techniques for
> data analysis". 

Although Jan does not say so explicitly, I believe he is propos-
ing a definition of the role of the field of statistics.  Jan's 
definition is worth setting apart:

    Statistics develops and evaluates techniques for data 
    analysis.

Contrast Jan's definition of the role of statistics with mine, 
which is

    Statistics provides efficient general techniques to help 
    empirical researchers study variables and relationships 
    between variables, mainly as a means to predict and con-
    trol the values of variables.

Note that the concept of 'data analysis', which is central 
in Jan's definition, is implicit in my definition because (as I 
discuss above) the concept is implicit in my prior definition of 
empirical research.

I believe that Jan's definition and my definition of the role of 
statistics are fully consistent.  However, my definition 
is both more explicit and more limiting.  That is, Jan's defini-
tion says that statistics supports "data analysis".  My defini-
tion also says (through my definition of "empirical research") 
that statistics supports "data analysis", but it also says where, 
how, and why statistics supports data analysis.  

The definition says WHERE statistics supports data analysis -- in 
empirical research.  I believe we should emphasize where statis-
tics supports data analysis because people sometimes forget that 
the field of statistics is tightly tied (both generally and spe-
cifically) to empirical research projects -- for most statisti-
cians the field of statistics is not an end in itself.

The definition says HOW statistics supports data analysis -- by 
providing efficient general techniques for studying variables and 
relationships between variables.

The definition says WHY statistics supports data analysis -- 
(mainly) to enable us (i.e., society) to accurately predict and 
control the values of variables.  

The "where", "how", and "why" ideas in my definition impose extra 
limitations on the role of statistics that are not present in 
Jan's definition.  However, many readers will agree that these 
limitations are not unduly restrictive because statistics does not 
(in a practical sense) do much beyond providing techniques for 
studying variables and relationships between or among variables 
in empirical research, and such study is mainly useful as a means 
to accurately predict and control the values of variables.  

(One can also reasonably argue that, in addition to providing the 
ability to predict and control, the field of statistics also en-
ables us to "explain" and "understand" the area of experience we 
are studying.  I view the relationship between the first pair of 
concepts ['prediction' and 'control'] and the second ['explana-
tion' and 'understanding'] as follows:  In almost every area of 
serious human endeavor the most widely accepted TEST of a good 
explanation or of good understanding is whether the explanation 
or understanding leads to correct prediction or correct control.  
Therefore, although explanation and understanding are important, 
they are subordinate to prediction and control.  I develop this 
view in two earlier posts [1996a, 1996b].  I discuss the action-
prescriptive role of the field of statistics in appendix A.)

I include the "where", "how", and "why" ideas in the definition 
because I believe these ideas deserve strong emphasis.  I believe 
that simply saying the role of statistics is to support "data 
analysis" is too vague -- outsiders cannot easily see value in 
doing "data analysis".  Instead, to an outsider, doing data 
analysis sounds like something medieval monks might do with great 
rigor in an isolated monastery in the mountains, but with no 
known practical value.

Most importantly, if we emphasize WHY statistics supports data 
analysis (i.e., to enable accurate prediction and control), we 
pique the curiosity of outsiders because almost everyone would 
like an easy path to accurate prediction and control.  Most stat-
isticians and many empirical researchers agree that proper use of 
statistics is by far the easiest and best path to accurate pre-
diction and control in any area of empirical research.


> This [i.e., Jan's proposed definition of the role of statis-
> tics] avoids "empirical", "external", and "science", because
> defining these is metaphysics. 

I believe Jan is here using the term "metaphysics" in the pejora-
tive sense of "building conceptual structures on slippery elusive 
concepts at the bottom".  I fully agree with the importance of 
eliminating this type of metaphysics.  

Jan identifies three possible slippery elusive concepts -- 'em-
pirical', 'external', and 'science'.  Can we omit these three 
concepts from the definition of the role of statistics?

Regarding the concept of 'empirical', recall that my definition 
of the role of statistics refers to the concept of 'empirical re-
search'.  Since the concept of 'empirical research' is central in 
the present discussion, let us focus not on the concept of 'em-
pirical' but on the concept of 'empirical research'.  

My definition above of "empirical research" states that empirical 
research is data gathering, (data analysis,) and conclusion draw-
ing.  The main concepts in this definition are the concept of 
'data' and the concept of 'conclusion drawing'.  To determine 
whether the concept of 'empirical research' is slippery or elu-
sive let us consider its roots and ask whether the concepts of 
'data' and 'conclusion drawing' are slippery or elusive.

Consider the concept of 'data' (which Jan uses directly and which 
I use indirectly in defining the role of statistics).  I define 
"data" as "a set of measured values of properties of entities or, 
equivalently, a set of values of variables".  This concept ap-
pears to be a straightforward (not slippery or elusive) concept.

Consider the concept of 'conclusion-drawing'.  The act of drawing 
a conclusion is equivalent to making a (preferably correct) 
statement about some area of experience.  Thus this concept 
(which I discuss in appendix A) also appears to be a straightfor-
ward concept.

In view of the preceding points, I suggest that the concept of 
'empirical research' in my definition of the role of statistics 
is not particularly slippery or elusive.  And the term "empiri-
cal" is merely a useful modifier that conveys the important ideas 
of 'data-based' and 'observation-based'.

Finally, as I discuss above, I believe it is helpful to include 
the concept of 'empirical research' in the definition of the role 
of statistics because I believe students understand our field 
better if we emphasize where it plays its role.

                            *   *   *

Consideration of Jan's second concept ('external') has led me to 
realize (as I discuss above) that it is unnecessary to refer to 
the concept of the external world in the definition of "empirical 
research".  Thus I fully agree with Jan that this concept need 
not be (explicitly or implicitly) present in the definition of 
the role of statistics.

I also agree that we should not use Jan's third concept ('sci-
ence') in the definition of the role of statistics.  The field of 
statistics has many applications outside of what is traditionally 
viewed as "science", such as in business, industry, law, and gov-
ernment.


> It [i.e., Jan's proposed definition of the role of statistics]
> also avoids "inference" and "modeling" and "quantitative", be-
> cause these are irrelevant for the definition. 

I believe we should include the concept of 'inference' in the 
definition of the role of statistics.  (I include the concept in 
my definition of the role by including the phrase "conclusions 
are drawn from the data" in my conceptually prior definition of 
"empirical research".)  We should include the concept of 'infer-
ence' in the definition because the data analysis we do in sta-
tistics is not an end in itself, but is instead a MEANS to draw-
ing valid conclusions (inferences) about the particular area of 
experience we are using statistics to study.


> Inference is the responsibility of the scientist, statisticians
> just develop tools which may help the scientist doing the job.

I agree that the final responsibility for correct statistical in-
ference rests with the scientist or empirical researcher who uses 
statistics.  However, statisticians must be experts in the 
GENERAL ideas of statistical inference so that we can teach users 
of statistics to draw valid inferences from data.  Thus inference 
(in the GENERAL sense) is very much the responsibility of statis-
ticians.

I suspect Jan wrote his sentence above to be provocative.  He may 
agree that saying "statisticians just develop tools which may 
help" minimizes the fundamental role our field plays (or can 
play) across almost all empirical research.  Such minimization 
may suggest to outsiders that our field is obscure and uninter-
esting.  

Instead of minimization, if statisticians politely trumpet the 
vital role our field plays (in providing optimal methods for ac-
curate prediction and control), I believe we can lead the field 
to receive the very broad recognition it deserves.

                            *   *   *

The second concept Jan says is irrelevant for the definition of 
the role of statistics is the concept of 'modeling'.  By "model-
ing" Jan may mean "building statistical model equations of rela-
tionships between variables".  I agree that it is unnecessary to 
refer to this sense of the word "modeling" in the definition of 
the role of statistics.

On the other hand, by "modeling" Jan may mean "studying relation-
ships between variables" WITHOUT any reference to statistical 
model equations.  If Jan is using the word "modeling" in this 
sense, he is suggesting that the idea of studying relationships 
between variables is irrelevant for the definition of (the role 
of) statistics.

My definition of the role of statistics explicitly refers to the 
idea of studying relationships between variables.  I believe this 
idea is relevant for the definition because my experience sug-
gests that 

- almost all empirical research projects can be usefully viewed 
  as studying relationships between or among variables and 

- almost all the commonly used statistical procedures can be use-
  fully unified as methods for studying relationships between or 
  among variables.  

I discuss these points further in a paper (1999a).

                            *   *   *

The third concept Jan says is irrelevant for the definition of 
the role of statistics is the concept of 'quantitative', and I 
agree.


> Although, in the case of hypothesis testing for instance, they
> [statisticians] may have hindered more than they have helped. 

I agree that there is much confusion about hypothesis testing -- 
so much confusion that some empirical researchers eschew hypothe-
sis testing and instead use confidence intervals to perform what 
is usually the same function.

I suggest that many instances of hypothesis testing can be best 
viewed as testing for evidence of the existence of a relationship 
between variables (or testing for evidence of the existence of an 
extension to an already known relationship between variables).  
This point of view helps empirical researchers to better under-
stand the usefulness of hypothesis testing.  

Testing hypotheses about the existence of relationships between 
variables is important because (to avoid embarrassing and costly 
errors) we must first verify that there is clear evidence of a 
relationship between the relevant variables before attempting to 
use information about a putative relationship between variables 
for prediction or control.  Computing p-values and verifying that 
they are low enough is an objective and standardized aid in 
checking for proper evidence of the existence of a relationship 
between variables.  I discuss these ideas further in two papers 
(1997, sec. 9; 1998, sec. 5).
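
As a rough illustration of checking for evidence of a relation-
ship, the sketch below computes an approximate p-value with a 
permutation test.  The permutation test is only one of many pos-
sible tests and is not named in the discussion above.  The data 
values, the function name, and the number of permutations are all 
assumptions made up for the example.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_perm=2000, seed=0):
    """Approximate two-sided p-value for a difference in group means.

    The group labels are shuffled many times; the p-value is the share
    of shuffles whose mean difference is at least as large as the one
    actually observed.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical response values at two levels of a predictor variable.
treated = [5.1, 5.8, 6.2, 6.0, 5.9, 6.4]
control = [4.2, 4.9, 4.4, 5.0, 4.6, 4.8]

p = permutation_p_value(treated, control)
print(f"approximate p-value: {p:.4f}")
```

A low p-value here says only that chance alone is an implausible 
explanation of the difference between the two groups.  It says 
nothing about how useful the relationship is for prediction or 
control.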


> Modeling (in statistics) is one way (and a controversial one at
> that) to develop data analysis techniques.

I agree.  (I believe that most [all?] statistical models are 
models of relationships between variables.)


> -- the distinction between univariate frequency distributions
> and "relations between variables" is better formulated, I
> think, as the distinction between univariate and multivariate
> frequency distributions or between unconditional and condi-
> tional distributions.  This makes most of the distinction go
> away. 

The two formulations Jan proposes are more elegant and more com-
pact than the formulation I propose.  Jan's formulations may be 
very useful in advanced statistics courses.  

However, I am interested in finding the best formulation of the 
concepts for the introductory statistics course for students who 
are not majoring in statistics.  Here I think Jan's formulations 
are more complicated than necessary because they require that we 
use the concept of 'distribution' (either 'multivariate distribu-
tion' or 'conditional distribution'), which is a difficult con-
cept for beginners to grasp.  

Contrast Jan's formulations with the formulation I propose, which 
makes no reference to the concept of 'distribution'.  We can ini-
tially characterize relationships between variables in terms of 
(the values of) one variable "depending" on (the values of) an-
other variable, or the values of one variable varying "in step" 
with the values of another variable.  (We can reinforce the char-
acterizations by carefully discussing illustrative practical ex-
amples of relationships.)  Later, we can formally characterize 
the concept in terms of conditional expected value (still without 
having to appeal to the concept of 'distribution') as I discuss 
in a paper and in an earlier post (1997, sec. 7.10; 1999b).
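
A minimal sketch of the conditional-expected-value idea, assuming 
made-up data:  group the values of the response variable by the 
value of the predictor variable and compare the group means.  If 
the mean of y changes as x changes (by more than chance can ex-
plain), the data suggest a relationship.  The function name and 
the (x, y) pairs are hypothetical, and sample means stand in here 
for the expected values.

```python
from collections import defaultdict
from statistics import mean

def conditional_means(pairs):
    """Return {x: mean of y among the pairs that have that x}."""
    groups = defaultdict(list)
    for x, y in pairs:
        groups[x].append(y)
    return {x: mean(ys) for x, ys in sorted(groups.items())}

# Hypothetical (predictor, response) pairs.
data = [(0, 10.0), (0, 12.0), (1, 15.0), (1, 17.0), (2, 21.0), (2, 23.0)]

cond = conditional_means(data)
overall = mean(y for _, y in data)

for x, m in cond.items():
    print(f"x = {x}: mean of y = {m:.1f} (overall mean {overall:.1f})")
```

Note that the sketch never constructs a distribution.  It only 
compares averages of y at different values of x.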

I suggest that if two or more consistent formulations are avail-
able upon which to build an approach to the introductory statis-
tics course, we should use the formulation that is easiest for 
beginners to understand.  I believe the formulation that distin-
guishes between univariate distributions and relationships be-
tween variables allows the teacher to introduce the concept of 
'relationship between variables' very early in the introductory 
course.  I further believe that a careful discussion of this 
easy-to-understand concept gives students a fundamental grasp of 
the important practical value of the field of statistics (in fa-
cilitating accurate prediction and control).  Thus I recommend 
that teachers use the formulation that distinguishes between re-
lationships between variables and univariate distributions in-
stead of Jan's (correct, more elegant, but harder to understand) 
formulations that make the distinction go away.


> The constructions that make studying a univariate distribution
> a limiting case of studying a relationship are not very con-
> vincing to me.  Another construction (instead of zero predic-
> tors) is one predictor which only takes a single value.  Dis-
> tribution of IQ in Arkansas is distribution of IQ in the 50
> states, conditional on Arkansas.

Here Jan is referring to the idea that univariate distributions 
are simply a special (degenerate) case of relationships between 
variables -- the case in which, as usual, there is one response 
variable, but the number of predictor variables, instead of being 
one or more, is reduced to zero.  Jan's construction (with a sin-
gle constant-valued predictor variable) is another valid way of 
showing that univariate distributions are (rigorously) a special 
case of relationships between variables.  
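
The two constructions can be sketched for the simplest summary, 
the mean (this illustrates only the mean, not the whole distribu-
tion).  With zero predictors, the least-squares fit of the model 
y = b0 + error is the sample mean.  With a single predictor that 
takes only one value, the mean of y conditional on that value is 
the same number.  The scores and the function names below are as-
sumptions made up for the example.

```python
from statistics import mean

def sum_sq_error(y, c):
    """Sum of squared deviations of the values in y from the constant c."""
    return sum((v - c) ** 2 for v in y)

def fit_zero_predictors(y):
    """Least-squares fit of the model y = b0 + error (no predictors).

    The constant that minimizes the sum of squared errors is the mean.
    """
    return mean(y)

def mean_given(pairs, x_value):
    """Mean of y among the (x, y) pairs whose x equals x_value."""
    return mean(y for x, y in pairs if x == x_value)

# Hypothetical scores, all from entities with the same predictor value.
scores = [96.0, 104.0, 110.0, 88.0, 102.0]
pairs = [("Arkansas", s) for s in scores]

b0 = fit_zero_predictors(scores)          # zero-predictor construction
b_const = mean_given(pairs, "Arkansas")   # constant-predictor construction

print(b0, b_const)
# Moving away from the mean can only increase the squared error.
print(sum_sq_error(scores, b0) < sum_sq_error(scores, b0 + 1.0))
```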


> -- It simply will not do to minimize the importance of univari-
> ate summaries.  My guess is (unlike Don's) that 96% of the sta-
> tistics published (in newspapers, the gray literature, govern-
> ment reports) are univariate frequency distributions or their
> measures of central tendency and dispersion.  Saying that this
> is not science will not make it go away.  

To follow up on Jan's point I studied each article and letter in 
the first section of the May 3, 1999 issue of The Toronto Star 
newspaper, and I identified all the statements or graphics that in 
any way refer to the values of variables.  I classified each of 
these statements and graphics into one of the following four 
categories:  

1. Statements about the value of a variable in a single entity.    
   (For example, a statement might say a particular person is 36 
   years old, or a particular concert was attended by approxi-
   mately 25,000 people, or a particular town is "small".)

2. Statements about a univariate distribution of the values of a 
   variable (including statements about the central tendency or 
   the dispersion of a univariate distribution).  (For example, a 
   statement might say that the average age of members of a par-
   ticular social group is 34 years.)

3. Statements about a relationship between variables.  (For exam-
   ple, a statement might say that the average temperature of the 
   earth has been increasing over the last 100 years.)

4. Statements that do not fit into any of the preceding three 
   categories.  

Of the 1237 statements I found that refer to the values of vari-
ables, 85% are statements of the value of a variable in a single 
entity, 2% are statements of univariate distributions, and 13% 
are statements of relationships between variables.  (No state-
ments fall into the fourth category.)  I further discuss my find-
ings in appendix B.

Jan estimates that 96% of the statistics published in newspapers 
are reports of univariate distributions or their measures of cen-
tral tendency and dispersion.  My informal survey of articles in 
The Toronto Star suggests that Jan's estimate may be high.  

However, my estimate in my 99/5/2 post that 96% of empirical re-
search projects study relationships between variables, and Jan's 
estimate that 96% of statistics in newspapers, the gray litera-
ture, and government reports are related to univariate distribu-
tions, and my informal survey of The Toronto Star are NOT par-
ticularly reliable sources of information.  It would be much bet-
ter if someone (an empirical researcher) could present properly-
sampled survey information about the relative frequency with 
which various types of statistical statements appear in various 
areas of interest.  As I suggest in the 99/5/2 post, I hope an 
interested reader will perform a proper survey.

Finally, to the extent that the general public PREFERS reports of 
univariate distributions over reports of relationships between 
variables, I believe we have failed as statistics teachers to 
educate the general public.  We have failed because a report of a 
relationship between a response variable Y and one or more rele-
vant predictor variables is invariably more informative and more 
useful than a report of the univariate distribution of Y in iso-
lation.


> Also remember that density estimation, robust estimation, etc.
> are important areas of statistical activity.

With respect to empirical density estimation, assuming we are 
studying a UNIVARIATE density, this is clearly NOT (in my every-
day language) a study of a relationship between variables.  I be-
lieve this activity is subsumed under the roughly 2% of empirical 
research projects that study univariate distributions that I re-
fer to in the 99/5/2 post.  Again though, this is an estimate on 
my part, which should be corroborated or corrected through a 
proper survey.  

With respect to empirical MULTIVARIATE density estimation, in 
this case two or more variables are always involved.  Thus it is 
easy to view this activity as simply another way of studying the 
relationship between (or, better, among) these variables. 

With respect to robust estimation, if a research project uses ro-
bust estimation to estimate the values of the parameters of a 
univariate distribution, the research project is clearly studying 
a univariate distribution and not studying a relationship between 
variables.  I believe this use of robust estimation is subsumed 
under the 2% of empirical research projects that study univariate 
distributions.  On the other hand, if a research project uses ro-
bust estimation to estimate the parameters of a model, it is al-
most certainly studying a relationship between variables because 
(as I suggest above) most (all?) statistical models are models of 
relationships between variables.


> -- Emphasizing that statisticians develop and evaluate data
> analysis techniques is consistent with the Box idea that sta-
> tistics is a service activity, serving the sciences, but also
> policy, government, sports, the law, and the public.

I fully agree with Jan's statement of George Box's important idea 
that statistics is a service activity serving the areas mentioned 
(and other areas, such as business and industry).  

Jan recommends emphasizing "data analysis techniques".  However, 
as I discuss above, I believe the concept of 'data analysis' is 
cryptic and diffuse -- outsiders cannot easily see any practical 
service or practical value in "data analysis".

In contrast, I recommend that teachers emphasize the easy-to-
understand unifying concept of (accurate) prediction and control 
on the basis of relationships between variables.  I believe that 
emphasizing prediction and control

- impresses students with the significant practical value of our 
  field and 

- makes our field substantially easier for students to under-
  stand.

-------------------------------------------------------
Donald B. Macnaughton   MatStat Research Consulting Inc
donmac@matstat.com      Toronto, Canada
-------------------------------------------------------


APPENDIX A:  THREE TYPES OF CONCLUSIONS IN STATISTICAL ANALYSIS
             OF EMPIRICAL RESEARCH RESULTS

In the new definition I propose above of "empirical research" I 
say that empirical research involves "drawing conclusions" from 
data.  The conclusions we draw from data in empirical research 
can be broken into three types.  We can draw

- conclusions about the values of variables or about the values 
  of parameters in entities of some type (and more generally con-
  clusions about univariate distributions of the values of vari-
  ables) or

- conclusions about relationships between or among variables in 
  entities of some type (and how the relationships can be used 
  for accurate prediction or control) or

- conclusions about the best course of action to take in some en-
  deavor.

The first type of conclusion -- about the values of variables or 
the values of parameters -- is the simplest.  For example, if a 
researcher has made several empirical estimates of the speed of 
light, he or she can use statistical methods to analyze these es-
timates and determine a "best" estimate of the speed, and also 
determine an estimate of the precision of the estimate.  
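
A minimal sketch of this first type of conclusion, assuming that 
the "best" estimate is taken to be the sample mean and that the 
precision is summarized by the standard error of the mean.  Both 
choices are assumptions for illustration, and they presume inde-
pendent measurements with no systematic error.  The measurement 
values are made up.

```python
from math import sqrt
from statistics import mean, stdev

def estimate_with_precision(measurements):
    """Return (sample mean, standard error of the mean)."""
    n = len(measurements)
    best = mean(measurements)
    std_error = stdev(measurements) / sqrt(n)
    return best, std_error

# Hypothetical repeated measurements of a speed, in km/s.
speeds_km_s = [299790.0, 299800.0, 299785.0, 299795.0, 299792.0]

best, std_error = estimate_with_precision(speeds_km_s)
print(f"best estimate: {best:.1f} km/s, standard error: {std_error:.2f} km/s")
```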

The second type of conclusion -- about relationships between 
variables -- is, I believe, the most important type.  The field 
of statistics can help researchers to determine whether a rela-
tionship exists between variables and, when a relationship is 
found, can help to accurately predict and possibly control the 
values of the response variable in new entities from the popula-
tion on the basis of the relationship.

The third type of conclusion -- about the best course of action 
-- is clearly important because almost every individual and al-
most every organization is regularly faced with making important, 
but difficult, conclusions or decisions about courses of action.  
Thus the statistical methods for making conclusions or decisions 
about the best course of action are an important subset of the 
methods of statistics.

                            *   *   *

I place the three types of conclusions in two main groups.  I 
call the methods to make the first and second types of conclu-
sions the "descriptive" aspect of statistics because these con-
clusions DESCRIBE a state of the world.  That is, they DESCRIBE 
values of variables or parameters, or they DESCRIBE relationships 
between variables, or they DESCRIBE methods for predicting and 
controlling on the basis of such relationships.  

I call the methods to make the third type of conclusion the 
"action-prescriptive" aspect of statistics because these conclu-
sions prescribe a course of action.  

Conclusions in the descriptive aspect of statistics often involve 
statistical tests of hypotheses.  For example, a researcher may 
perform a research project and conclude from the data that a re-
lationship exists between one or more of the predictor variables 
and the response variable.  The researcher will typically support 
this conclusion by reporting a test of the hypothesis that the 
relationship exists and will note that the p-value yielded by the 
test is low enough that we cannot reasonably explain away the 
evidence of the relationship in the data as being merely due to 
chance.

Conclusions in the action-prescriptive aspect of statistics are 
usually based on conclusions provided by the descriptive aspect.  
That is, most (all?) conclusions made using the action-prescrip-
tive methods of statistics are based on knowledge about the val-
ues of relevant variables and parameters or (more likely) are 
based on knowledge about relevant relationships between vari-
ables.  However, in addition, the action-prescriptive aspect of 
statistics takes account of a set of VALUES and GOALS.  (Here the 
word "values" denotes principles or qualities that are socially 
or commercially valuable as opposed to denoting the values of a 
variable or parameter.)  The values and goals are used together 
with knowledge of variables and relationships between variables 
to identify (possibly with the help of "objective functions" or 
"utility functions") the conclusion (i.e., the decision about a 
course of action) that optimizes the achievement of the values 
and goals.  
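
One common way to combine descriptive knowledge with values and 
goals is to choose the action with the highest expected utility.  
The sketch below assumes that the descriptive analysis has al-
ready supplied the probabilities of the possible states and that 
someone has supplied a utility for each action in each state.  
The states, actions, probabilities, and utilities are all hypo-
thetical numbers invented for the example.

```python
def expected_utilities(utilities, probabilities):
    """Expected utility of each action.

    utilities:     {action: {state: utility of that action in that state}}
    probabilities: {state: probability of that state}
    """
    return {
        action: sum(probabilities[state] * u for state, u in by_state.items())
        for action, by_state in utilities.items()
    }

def best_action(utilities, probabilities):
    """Return the action whose expected utility is highest."""
    scores = expected_utilities(utilities, probabilities)
    return max(scores, key=scores.get)

# Hypothetical descriptive conclusion: the drug probably helps.
probabilities = {"drug_helps": 0.8, "drug_does_not_help": 0.2}

# Hypothetical values and goals, expressed as a utility table.
utilities = {
    "prescribe": {"drug_helps": 10.0, "drug_does_not_help": -5.0},
    "withhold": {"drug_helps": -2.0, "drug_does_not_help": 0.0},
}

scores = expected_utilities(utilities, probabilities)
choice = best_action(utilities, probabilities)
print(scores)
print("chosen action:", choice)
```

Changing the utility table can change the chosen action even when 
the probabilities stay the same, which is the sense in which the 
decision depends on values and goals as well as on the data.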

Clearly, descriptive conclusions and action-prescriptive conclu-
sions are made using different sets of underlying methods.  
Therefore, in discussing the use of statistics for drawing con-
clusions or making decisions it is important to identify the type 
of conclusions or decisions one is discussing.  

                            *   *   *

As a practical matter, decision-makers often find it unnecessary 
to use action-prescriptive concepts to make a decision about a 
course of action on the basis of the results of an empirical re-
search project because the appropriate course of action is obvi-
ous.  For example, if a medical research project shows that a 
particular drug lessens the severity of a disease in people, the 
action-prescriptive aspect of statistics cannot generally improve 
on the recommendation that doctors should prescribe the lowest 
(safe) dose of the drug that yields a "full" response in most 
people (possibly subject to corrections related to ancillary 
variables such as people's ages or body weights).

As a second practical matter, the concepts of the action-pre-
scriptive aspect of statistics are often unusable in real-life 
situations in which we must decide a course of action.  This is 
because the necessary objective functions (or other pieces of in-
formation needed to use precise action-prescriptive concepts) are 
often unavailable.  (This problem can sometimes be bypassed by 
working with a range of possible objective functions.)
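
One way to picture working with a range of objective functions is 
to repeat the choice of action under each candidate function and 
check whether the choice changes.  The sketch below does this in 
Python.  The doses, the response curve, and the range of cost 
weights are all hypothetical values chosen only to illustrate the 
idea.

```python
# Hypothetical sketch: the doses, the response curve, and the range
# of cost weights are all invented for illustration.

def predicted_response(dose_mg):
    # Assumed descriptive conclusion: response levels off at 40 mg.
    return min(1.0, dose_mg / 40.0)

def make_utility(cost_weight):
    # Build one candidate objective function per assumed cost weight.
    def utility(dose_mg):
        return 100.0 * predicted_response(dose_mg) - cost_weight * dose_mg
    return utility

doses = [10, 20, 40, 80]
cost_weights = [0.1, 0.5, 1.0, 2.0]

# Best dose under each candidate objective function.
choices = {w: max(doses, key=make_utility(w)) for w in cost_weights}

# The decision is robust if every candidate function picks the same dose.
robust = len(set(choices.values())) == 1
print(choices, robust)
```

If the same action wins under every function in the range, the 
missing precise objective function does not matter for the 
decision.  If the winner changes, the range at least shows which 
piece of information would settle the matter.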

As a third practical matter, descriptive conclusions and action-
prescriptive conclusions are often widely separated.  That is, 
empirical researchers generally provide descriptive conclusions 
about the values of variables or about the values of parameters 
or descriptive conclusions about relationships between variables.  
But (apart from possibly SUGGESTING a course of action) the em-
pirical researchers who provide the descriptive conclusions play 
little or no direct role in MAKING DECISIONS about a course of 
action on the basis of the conclusions.  This is because empiri-
cal researchers typically place the information they discover at 
other people's disposal.  (Researchers often place the informa-
tion in the public domain in the form of a journal article.)  
Then the researchers go back to doing what they do best -- em-
pirical research -- and other people (people who make policy) 
make the relevant decisions about courses of action.

(An [infrequently occurring] exception to the point in the pre-
ceding paragraph occurs when we directly use methods from the ac-
tion-prescriptive aspect of statistics in conducting empirical 
research.  This occurs in so-called "sequential" statistical 
analysis in which at each main step in the analysis we make a 
formal decision as to what action to take next.  [Usually one 
possible action is to gather more data.]  Each decision is made 
on the basis of the data accumulated to that point.)

Since 

- the action-prescriptive aspect of statistics is not often used 
  to interpret the results of empirical research projects and 

- when the action-prescriptive aspect IS used it is usually 
  widely separated from the descriptive aspect

it is possible and reasonable to teach students about the 
descriptive and action-prescriptive aspects of statistics 
separately.

Since the action-prescriptive aspect of statistics relies on con-
clusions provided by the descriptive aspect, but not vice versa, 
it is reasonable to teach the descriptive aspect of statistics to 
students before teaching them the action-prescriptive aspect.

Finally, the descriptive aspect of statistics is much broader 
than the action-prescriptive aspect, perhaps by a ratio of at 
least ten to one in terms of, say, the amount of associated 
literature published.

The above considerations lead me to recommend that introductory 
teachers focus on the descriptive aspect of statistics.  That is, 
I recommend that teachers concentrate on how to draw conclusions 
about variables and parameters and (especially) how to draw con-
clusions about relationships between variables.  I recommend that 
discussion of how to use statistical techniques to draw conclu-
sions (or make decisions) about correct courses of action be 
postponed until a later course.

My sense of the distinction between the descriptive and action-
prescriptive aspects of statistics grew from an insightful arti-
cle about p-values by Ware, Mosteller, Delgado, Donnelly, and 
Ingelfinger (1992). 


APPENDIX B:  EXAMINATION OF THE MAY 3 ISSUE OF THE TORONTO STAR
             FOR STATEMENTS ABOUT VARIABLES 

The Toronto Star is a broadsheet newspaper with a weekday circu-
lation of about 450,000.  I chose to work with the May 3, 1999 
Metro Edition of the Star "at random" in the sense that this was 
the issue at hand when I began work.

The first section of the May 3 Star contains 20 pages, with 
roughly 45% of the material being advertising.  It contains 44 
news stories (some only a paragraph or two long) in the catego-
ries of world news, Canada news, and Ontario news.  (Most Toronto 
news appears in another section.)  The first section also con-
tains 10 letters to the editor, 6 opinion articles, and 4 edito-
rials.  

I carefully scanned all the above-mentioned non-advertising mate-
rial in the section, circling each statement of the simple value 
of a variable, writing "D" in the margin for each statement of 
some aspect of a univariate distribution, writing "R" in the mar-
gin for each statement of a relationship between variables, and 
writing "N" in the margin for each statement that pertains to 
variables but that falls into none of the preceding three groups.  
If I detected that the same statement was made more than once 
(e.g., in the text and in a graphic), I counted the statement 
only once.  The following table shows my counts of the four types 
of statements:

                     Frequency of Occurrence 
                of Different Types of Statements 
                     Pertaining to Variables 
                 in the Non-Advertising Material 
                   in the First Section of the
                      Metro Edition of the 
                    May 3, 1999 Toronto Star
                ---------------------------------
                   Statement
                      Type       Count   Percent
                ---------------------------------
                 1. Value of      1051      85
                    a variable         
  
                 2. Univariate      30       2
                    distribution
 
                 3. Relationship   156      13
                    between
                    variables

                 4. Other            0       0
                                   ___     ___

                     TOTALS       1237     100
                ---------------------------------
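
The percentages in the table can be rechecked from the counts.  
The short Python sketch below uses only the four counts given in 
the table.  The variable names are my own, and rounding to the 
nearest whole percent is an assumption about how the table's 
percentages were obtained.

```python
# Counts copied from the table above.
counts = {
    "Value of a variable": 1051,
    "Univariate distribution": 30,
    "Relationship between variables": 156,
    "Other": 0,
}

total = sum(counts.values())

# Percent of all statements, rounded to the nearest whole percent
# (the rounding rule is an assumption).
percents = {kind: round(100 * n / total) for kind, n in counts.items()}

for kind, n in counts.items():
    print(f"{kind:32s}{n:6d}{percents[kind]:6d}")
print(f"{'TOTALS':32s}{total:6d}{sum(percents.values()):6d}")
```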

Let me describe some of my classifications.  The first news story 
on the front page of the May 3 Star is titled "Release of PoWs 
fails to sway NATO".  The first sentence in this story says

    NATO will continue bombing Yugoslavia despite the release 
    of three U.S. prisoners of war.

I interpreted this (unhappy) sentence as stating a relationship 
between variables or, actually, a NON-relationship between vari-
ables.  The response variable can be viewed as "bombing intensity 
of Yugoslavia" and the predictor variable can be viewed as "num-
ber of NATO prisoners released".  The sentence says there will be 
no relationship between the release of the three prisoners and 
the bombing intensity.

The second sentence fills in details of the first sentence and 
introduces a new relationship between variables as follows:

    "We are not only not going to stop the bombing, we're go-
    ing to intensify the bombing," U.S. Defence Secretary 
    William Cohen said yesterday.

This sentence indicates that there will be a relationship between 
the intensity of the bombing and time -- the intensity will in-
crease in the immediately following time.  

The second sentence also states the value of the variable that 
indicates the time of Cohen's statement: "yesterday".

The third sentence says

    Still, diplomatic efforts to end the conflict in the
    Balkans appeared to gain some strength yesterday.  

This sentence is another statement of a relationship between 
variables.  The response variable is "strength of diplomatic ef-
forts to end the conflict in the Balkans" and the predictor vari-
able is time.  The sentence is saying that the strength of the 
efforts appears to have increased in recent time (i.e., over the 
course of "yesterday").

The fourth sentence indicates how the diplomatic efforts to end 
the conflict appeared to gain strength as follows:

    The White House announced that U.S. President Bill 
    Clinton would meet today with Russian envoy Viktor 
    Chernomyrdin, who will brief him on Russian efforts to 
    resolve the crisis.

Part of the fourth sentence states the value of a variable -- the 
(anticipated) value of the "when" property of the meeting, which 
is "today".

The second story on the front page is titled "Rental housing cri-
sis predicted" and begins with the following sentence:

    Toronto and other Ontario cities face a housing crisis as 
    demand for affordable apartments grows at a much faster 
    rate than construction, a comprehensive new report says.

This sentence is a somewhat complicated statement of an 
(extrapolated) relationship between variables.  One way of 
viewing the relationship is to say that the predictor variable 
is "time" and the two response variables are "demand for 
affordable apartments" and "rate of construction of affordable 
apartments".

A third story on the front page is titled "Immigrants feel wel-
come" and begins with the following sentence:

    Toronto is a place that welcomes newcomers, more so than 
    many cities around the world, say participants in a poll 
    commissioned for The Star's Beyond 2000 project.

This is a statement of a relationship between variables.  The re-
sponse variable could be labeled "amount of welcome extended to 
newcomers" and the predictor variable is the nominal-level vari-
able "city" (e.g., Toronto, New York, Tokyo).  

                            *   *   *

Although relationships between variables dominate the important 
lead sentences of the three stories I discuss above, relation-
ships between variables are only peripheral in the other two sto-
ries on the front page.  One of these stories is about the 
Toronto Maple Leafs hockey team winning a series of playoff 
games.  This story contains many statements of the values of 
variables, but I found no statements of relationships between 
variables.  

To illustrate some statements of values of variables, here are 
the third and fourth sentences from the hockey story, with the 
values of variables between asterisks:

    *Thirty-two years to the day* after winning their *last* 
    Stanley Cup, the Leafs moved on to the *second* round of 
    the *'99* playoffs *last night* by finishing off the 
    Philadelphia Flyers.  The *1-0* victory came on *winger* 
    *Sergei Berezin's* goal with *one* minute to play in 
    regulation time.

(The final score of the game can be viewed as a vector-valued 
variable with two components.)

Note the broad view I take of what are statements of the values 
of variables.  That is, as well as values that denote numbers, I 
view all words or phrases that function as adjectives or adverbs 
and all words or phrases that state a time or date as stating 
values of variables.  I take this point of view because these 
differing linguistic operations all seem to be performing the 
same conceptual function -- that of stating the value of some 
property of an entity or, equivalently, that of stating the value 
of some variable.  (In the case of an adjective, the entity is 
denoted by the NOUN the adjective modifies.  In the case of an 
adverb, the entity is the PROCESS the adverb helps to describe.  
In the case of a point in time or a duration, the entity is the 
EVENT the time refers to.)

The final story on the front page is about a shooting in Alberta.  
This story contains many statements of the values of variables, 
but only one statement of a relationship between variables.

                            *   *   *

As I note above, I found only 30 statements of univariate distri-
butions, including statements of means, standard deviations, or 
other measures of location or dispersion.  Two of these occur in 
one sentence in the hockey story on the front page.  The sentence 
quotes the player who scored the only goal in the game as fol-
lows:

    "It's the biggest goal I've ever scored in the NHL," said 
    Berezin, who led the team in goals this season.

This sentence identifies a property that Berezin associates with 
each of the goals he has scored in the NHL -- the goal's bigness 
(i.e., its importance) -- and the sentence says that his latest 
goal has the maximum value in the distribution of the values of 
that property.  

The same sentence (implicitly) refers to the distribution of the 
number of goals scored by players on Berezin's team and says that 
Berezin is the player who is associated with the maximum value of 
that distribution.

Here are some other examples of statements of univariate distri-
butions:

    Testicular cancer tends to hit men in their 20s (page 
    A3).

The statement says that the age distribution of the occurrence of 
testicular cancer peaks strongly in the 20s.

    [Curt] Newport used computer models and state-of-the-art 
    underwater equipment to search an area near Grand Bahama 
    in water that averaged almost 4,700 meters deep (page 
    A3).

The statement gives the average of the univariate distribution of 
water depths in the area searched.

    Mireya Moscoso ... won election yesterday to become the 
    president [of Panama] ....  She beat Martin Torrijos 
    ....  Torrijos conceded defeat with 59.4 per cent of bal-
    lots counted, showing Moscoso with 44 per cent of the 
    vote to his 38 per cent.  ... A third candidate ... 
    Alberto Vallarino ... held 17 per cent of the vote (page 
    A10).

This passage reports the univariate distribution of the votes 
among the three candidates.  (One could also reasonably view the 
passage as stating a relationship between two variables:  "candi-
date" [i.e., Moscoso, Torrijos, or Vallarino] and "percent of the 
vote won".)

                            *   *   *

Classifying the text in the Star was difficult because 
rereading a passage often suggested another (usually deeper) 
interpretation, or an additional statement of one of the types.  
Thus it seems likely that another reader performing the 
classification would produce a somewhat different classification 
from mine.  

However, I believe one could develop a set of straightforward 
rules for classifying statements that refer to variables.  These 
rules would lead to strong agreement among trained classifiers 
about statements in passages of text.  I suspect that these rules 
would lead to finding proportions of the three types of state-
ments about variables that are close to the proportions I give in 
the table above.
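
Agreement between two classifiers can be quantified in several 
ways.  I name no particular measure above, so the following 
Python sketch uses Cohen's kappa purely as one common choice.  The 
two lists of labels are invented for illustration ("V" for value 
of a variable, "D" for univariate distribution, "R" for 
relationship between variables).  They are not data from my 
classification of the Star.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length lists of category labels.

    Assumes the chance-agreement rate is below 1 (otherwise the
    denominator is zero).
    """
    n = len(labels_a)
    # Proportion of statements on which the two readers agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each reader's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Invented labels for eight statements, for illustration only.
reader_1 = ["V", "V", "R", "D", "V", "R", "V", "V"]
reader_2 = ["V", "V", "R", "V", "V", "R", "V", "D"]

kappa = cohen_kappa(reader_1, reader_2)
print(kappa)
```

A kappa of 1 indicates perfect agreement and a kappa near 0 
indicates agreement no better than chance.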

                            *   *   *

I also studied the sports section in the May 3 Star.  The sports 
section is a special case because it is full of statistics.  
Since many newspaper sports statistics are presented in tables, 
and since the row and column (and other) dimensions of a table 
each suggest a different variable, it is clear that tables of 
newspaper sports statistics reflect relationships between vari-
ables.  

Interestingly, however, tables of newspaper sports statistics do 
not generally reflect relationships between variables in 
POPULATIONS.  This is because the tables are generally not rea-
sonably viewed as representing SAMPLES that reflect some wider 
population of entities.  Instead, the tables are generally de-
tailed cross-classified reports of the values of properties of 
individual entities (typically teams, players, or horses in the 
Star), and interest is focused on these specific entities, rather 
than on some population they represent.  

Thus (unless the tables themselves are sampled over different is-
sues of a newspaper) the relationships between variables re-
flected in tables of newspaper sports statistics cannot generally 
be used for predicting or controlling the values of variables in 
other entities in a population, as is generally the case in stan-
dard statistical analysis.  

Although tables of newspaper sports statistics are generally not 
(directly) useful for predicting and controlling the values of 
variables in other entities in a population, such tables are of 
definite interest in predicting and controlling the values of the 
variables FOR THE SAME INDIVIDUAL ENTITIES that are reported in a 
table AT A LATER TIME.  For example, if a table of sports statis-
tics shows teams as rows and if the statistics in the columns of 
the table indicate that one team is presently doing particularly 
well relative to the others, we can reasonably use the table 
(possibly with other relevant consistent information) to predict 
that this particular team will do well in its next game.  If one 
is a betting person, this prediction might then form the basis of 
a bet.

Thus the use of statistics in sports is often different from the 
standard use of statistics in empirical research.


REFERENCES

Macnaughton, D. B. 1996a. "EPR approach to intro stat: Relation-
   ships between variables (response to comments by Brian 
   Griffin)." Published in sci.stat.edu and EdStat-L on July 18, 
   1996.  Available at http://www.matstat.com/teach/p0012.htm

Macnaughton, D. B. 1996b. "EPR approach to intro stat: Relation-
   ships between variables (response to comments by George 
   Zeliger)." Published in sci.stat.edu and EdStat-L on July 30, 
   1996.  Available at http://www.matstat.com/teach/p0016.htm

Macnaughton, D. B. 1997. "The entity-property-relationship ap-
   proach to statistics: An introduction for students."  Avail-
   able at http://www.matstat.com/teach/

Macnaughton, D. B. 1998. "Eight features of an ideal introductory 
   statistics course."  Available at 
   http://www.matstat.com/teach/

Macnaughton, D. B. 1999a. "The introductory statistics course: 
   The entity-property-relationship approach."  Available at 
   http://www.matstat.com/teach/

Macnaughton, D. B. 1999b. "Re: Eight features of an ideal intro 
   stat course (response to comments by Herman Rubin)." Published 
   in sci.stat.edu and EdStat-L on May 16, 1999.  Available at 
   http://www.matstat.com/teach/p0041.htm

Ware, J. H., Mosteller, F., Delgado, F., Donnelly, M. S., and 
   Ingelfinger, J. A. 1992. "P values." In _Medical uses of sta-
   tistics_ 2nd ed. edited by J. C. Bailar III and F. Mosteller, 
   181-200. Boston: NEJM (New England Journal of Medicine) Books. 
