Eight Features: 2nd Response to Roberts

Subject: Re: Eight Features of an Ideal Intro Stat Course
         (Second response to comments by Dennis Roberts)

     To: EdStat-L and sci.stat.edu

   From: Donald B. Macnaughton <donmac@matstat.com>

   Date: Sunday May 2, 1999

     Cc: Dennis Roberts <dmr@psu.edu>

Referring to a 98/8/3 post of mine, Dennis Roberts writes (on 
98/8/4):

>   ( snip )
> Donald Macnaughton wrote (in part):
>
>> - the usefulness of the field of statistics lies solely in its
>>   applications in empirical research
>
> sorry ... have to disagree.  while not exclusive to statistics
> ... there is a general logic in thinking that derives from such
> study ... that applies to empirical research and OTHER arenas
> of thought ...  

To resolve this disagreement it is helpful to have a clear sense 
of the concept of 'empirical research'.  I propose the following 
definition:

    EMPIRICAL RESEARCH is any research in which data are 
    gathered from the external world and then conclusions 
    are drawn from the data about the external world.

Under this definition, I cannot see any (practical) uses of the 
field of statistics beyond its uses in empirical research.  If 
Dennis still disagrees, I hope he will identify the "other arenas 
of thought" (beyond empirical research) he refers to, and I hope 
he will describe the usefulness of the field of statistics in 
those arenas.

Two distinguished statisticians support my claim about the use-
fulness of statistics.  Harry Roberts insightfully discusses us-
ing student projects to teach statistics (1992).  He reinforces 
his points by quoting George Box as saying:

    In my view statistics has no reason for existence except 
    as the catalyst for investigation and discovery (p. 109).

Study of Box's writing suggests that he uses the phrase "investi-
gation and discovery" to denote what I refer to as "empirical re-
search".  (That is, my experience suggests that ALL discussions 
in Box's writing about the use of statistics are directly aimed 
[either specifically or generally] at drawing conclusions about 
the external world from data.)  Thus Box's statement corroborates 
my statement that Dennis quotes above.  


>> - almost all empirical research projects can be usefully
>>   characterized as studying relationships between variables
>
> if you consider "frequency" to be a variable ... in a simple
> frequency distribution which resides on the "y" axis ... then
> perhaps so.  but ... this is kind of a stretch ... in that we
> normally don't consider 'frequency' to be a variable like most
> others that fall along a relationship graph (scatterdiagram)
> with X and y axes ...

Dennis is considering the situation in which an empirical re-
search project is studying the univariate distribution of the 
variable X.  He proposes (in graphical terms) a way we could view 
the study of the univariate distribution of X as *actually* being 
a study of a relationship between two variables.  But he then re-
jects his proposal as a "stretch", and I fully agree.

In a more complete form, Dennis' argument above would seem to run 
as follows:

1. It is not reasonable to view an empirical research project 
   that studies the univariate distribution of a variable X as 
   actually studying a *relationship* between X and another vari-
   able Y, where Y is the variable "frequency of occurrence of 
   different values of X".

2. No other point of view exists under which we can reasonably 
   view an empirical research project that studies a univariate 
   distribution as studying a relationship between variables.

3. Therefore, Macnaughton's statement that almost all empirical 
   research projects can be usefully characterized as studying 
   relationships between variables is incorrect.

This argument is unsound because the second premise is incorrect 
-- we can reasonably view each empirical research project that 
studies a univariate distribution as studying a special type of 
relationship between variables as follows:  As usual, one re-
sponse variable is present in the example, namely the variable 
whose univariate distribution is under study.  But the number of 
predictor variables, instead of being one or more, is reduced to 
zero.  The preceding three sentences appear to be rigorously true 
as a limiting (degenerate) case in two senses:
- in an empirical sense and
- in a strict mathematical sense.  

That is, every empirical or mathematical procedure we use to 
study univariate distributions can be easily viewed as the limit-
ing case (when the number of predictor variables is reduced to 
zero) of a similar (but more complicated) procedure we use (or 
could use) to study relationships between variables.
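
The mathematical sense of this limiting case can be sketched in
Python.  The sketch is an illustration only:  the helper function
fit_least_squares and the data values are hypothetical and do not
come from any of the papers cited here.  It fits an ordinary
least-squares model first with zero predictor variables and then
with one, to show that the zero-predictor fit reduces to a
familiar univariate summary, the sample mean.

```python
from statistics import mean


def fit_least_squares(rows, y):
    """Ordinary least squares via the normal equations.

    rows: one list of predictor values per observation
          (the lists are empty when there are zero predictors).
    Returns [intercept, slope_1, ..., slope_k].
    """
    # Design matrix: a leading 1 for the intercept, then predictors.
    X = [[1.0] + [float(v) for v in r] for r in rows]
    p = len(X[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(p)]
         + [sum(x[i] * yi for x, yi in zip(X, y))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]


# Hypothetical response values, made up for illustration.
y = [3.0, 5.0, 4.0, 8.0, 10.0]

# Zero predictors: every observation has an empty predictor list.
coef_zero = fit_least_squares([[] for _ in y], y)

# One (made-up) predictor, fitted by the same procedure.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
coef_one = fit_least_squares([[xi] for xi in x], y)

print("zero predictors:", coef_zero, "sample mean:", mean(y))
print("one predictor:  ", coef_one)
```

With zero predictors the only fitted coefficient is the
intercept, and it equals the mean of the response variable.
Adding a predictor variable extends the same procedure rather
than replacing it.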

                            *   *   *

However, one need not view univariate distributions as degenerate 
cases of relationships between variables to see the correctness 
of the point that almost all empirical research projects can be 
usefully viewed as studying relationships.  For even if we say 
that studies of univariate distributions are NOT studies of rela-
tionships, it is still true that almost all empirical research 
projects can be usefully viewed as studying relationships between 
variables.  This is because very few *real* empirical research 
projects study univariate distributions.  Instead, almost all 
real empirical research projects (or logical components of re-
search projects) can be best viewed as studying the relationship 
between a single response variable and one or more predictor 
variables.

Let me make some predictions:

1. More than ninety-six percent of real empirical research proj-
   ects are best viewed as studying relationships between vari-
   ables.

2. Less than two percent of real empirical research projects are 
   best viewed as studying univariate distributions.

3. Less than two percent of real empirical research projects are 
   best viewed as not belonging to either of the above two 
   groups.  For example, a research project might be best viewed 
   as studying entities or relationships between entities as op-
   posed to studying properties of entities (variables) or rela-
   tionships between properties.

(Some readers will suspect that my first [i.e., 96%] prediction 
above is much too high.  Such a suspicion is reasonable because 
many modern empirical research projects are NOT generally viewed 
as studying relationships between variables.  I discuss how it is 
possible to usefully view many such research projects as studying 
relationships in a paper [1999, app. B].)

The main point I wish to make here is the following:  If almost 
all real empirical research projects are easily and usefully 
viewed as studying relationships between variables, then, because 
most modern statistical procedures can be easily characterized as 
studying relationships between variables (Macnaughton 1999, sec. 
4.3) and because the concept of a relationship between variables 
is not hard to understand, it is reasonable to emphasize the 
concept of a relationship between variables early in the 
introductory statistics course.

I discuss some further issues about my three predictions above in 
the appendix.  I discuss some research projects that do NOT study 
relationships between (or among) variables in an earlier post 
(1997a, app. A).  The "laws" of science are a small but important 
group of statements derived from empirical *scientific* research.  
In another post I report on a classification of 213 laws of sci-
ence into eight categories (with most of the laws being classi-
fied as statements of relationships [or non-relationships] be-
tween variables) (1997b, app. A).


>> - almost all the statistical methods can be usefully charac-
>>   terized as methods for studying relationships between vari-
>>   ables.
>
> are you admitting that some cannot be?  what are these?  are
> they not important too?

A few statistical methods cannot easily be characterized as 
studying relationships between variables in the standard "re-
sponse variable - predictor variable(s)" sense.  These methods 
include 
- cluster analysis
- factor analysis
- principal components analysis
- multidimensional scaling and
- a few other infrequently used statistical methods.

These methods still study variables (i.e., properties of enti-
ties), and in a loose sense they also study relationships *be-
tween* or *among* the variables.  But since none of these methods 
focuses on a specific response variable, I view them as a sepa-
rate group.

These methods are exceptions to the "response variable - predic-
tor variable(s)" rule and appear only rarely in real empirical 
research.  (I estimate that these methods appear in total in less 
than one percent of reported empirical research projects that use 
statistical methods.)  Thus although these methods are important 
in a small percentage of research projects, I believe they are 
not important topics for discussion in an introductory statistics 
course.

(I discuss these methods further in a paper [1999, app. C].)


>>   ( snip )
>> If we spend initial time discussing univariate distributions
>> before we discuss relationships between variables, I believe
>> we *alienate* students because students find univariate dis-
>> tributions to be boring and of little obvious use.
>
> what evidence do you have to support this? 

In what follows in this post, when I refer to the "introductory 
(statistics) course" I mean the introductory course for students 
who are NOT majoring in statistics, whom I call "non-statistics-
majors".  I am not discussing the introductory course for statis-
tics majors.

My main evidence for my claim that univariate distributions are 
boring and of little obvious use is that *I*, a practicing stat-
istician, find univariate distributions to be boring and of lit-
tle obvious use for beginning students, despite the fact that I 
have looked carefully for practical uses.  

(I am not saying that univariate distributions have NO uses --
they are indispensable tools in support of the mathematics in 
most statistical analyses.  I am only saying that univariate dis-
tributions have little obvious use that can be appreciated by 
students at the beginning of an introductory statistics course.)

I cannot prove that univariate distributions are of little or no 
use for beginning students because such a proof appears to be 
logically impossible.  On the other hand, if univariate distribu-
tions DO have important uses for beginning students, it should be 
easy for the proponents of discussing univariate distributions to 
prove THEIR case by merely describing these uses.  I invite pro-
ponents of teaching univariate distributions at the beginning of 
the introductory course to propose examples of univariate distri-
butions that both (a) have practical uses and (b) students find 
of interest.  

(I discuss some putative examples of interesting univariate dis-
tributions in a paper [1998a, app. G] and in a Usenet post 
[1998b].  I shall discuss in a forthcoming post two examples of 
interesting univariate distributions discussed on November 26 by 
Karl Wuensch.)  

                            *   *   *

My recommendation against discussing univariate distributions at 
the *beginning* of the introductory statistics course raises an 
important question:

    If univariate distributions are not to be discussed at or 
    near the beginning of the introductory course, where 
    should they be discussed?  

I recommend that univariate distributions be discussed near the 
*end* of the introductory statistics course or at the beginning 
of a second course.  I explain this recommendation and propose a 
syllabus for the introductory course in a paper (1999, sec. 6.4 
and 6.9).


> personally ... i think most students find statistics boring ...
> whether it be studying relationships or not ... 

I agree that many students find statistics boring in SOME intro-
ductory courses.  However, the fact that statistics is boring in 
SOME courses does NOT allow us to conclude that statistics will 
be boring if we emphasize interesting relationships between vari-
ables right from the start.  This is because almost no statistics 
courses presently take a relationship-between-variables approach, 
so we have almost no relevant data on which to base a conclusion.

(Some leaders in statistical education have already independently 
adopted the approach of emphasizing relationships between vari-
ables, although perhaps not to the extent I recommend.  For exam-
ple, using an idea developed by Gudmund Iversen, George Cobb 
teaches two introductory courses, both of which start with rela-
tionships -- one devoted to experimental design and applied 
analysis of variance and the other devoted to applied regression 
[G. Cobb, personal communication, August 21, 1996].  Similarly, 
Robin Lock teaches an introductory course devoted to time series 
analysis -- i.e., methods for studying relationships between 
variables when an important predictor variable is "time" [Cobb 
1993, sec. 3.1].)


> and this [the fact that students find statistics boring] is
> primarily because it [statistics] is foisted on them and re-
> quired ... and not naturally selected. 

Dennis makes a good point -- the fact that statistics courses are 
often mandatory does not endear students to statistics.  Stu-
dents' lack of respect is heightened when they (as I have sug-
gested earlier) have trouble seeing any practical value in sta-
tistics.  Thus to avoid making things worse in an already bad 
situation, it is helpful to quickly show students the *practical 
value* of statistics.  

I believe we can best show students practical value by showing 
them how relationships between variables enable accurate predic-
tion and control (of the values of variables).  


> what i think does make a difference is to have data of interest
> to them ... 

I fully agree.  If we use interesting data that students can see 
practical value in studying, we are much more likely to give stu-
dents a lasting appreciation of statistics.  I further discuss 
the choice of data in two papers (1998a, sec. 6; 1999, sec. 6.5).


> whether this [data] be studied in the context of some relation-
> ship problem or not ... 

This clause gives insight into Dennis' view of relationships be-
tween variables:  He seems to be suggesting that relationships 
are somewhat incidental in the field of statistics.  


>> On the other hand, students find relationships between vari-
>> ables to be fascinating.
>
> donald ... i think you do stretch a bit ... and are assigning a
> characteristic to intro students that they just don't possess
> ... 

Dennis speaks from his experience with introductory students.  
However, although he may have taught many introductory statistics 
courses, he has probably not seriously *manipulated* his approach 
to teaching the fundamental statistical concepts.  In particular, 
since Dennis' statement quoted above suggests he 
believes relationships between variables are somewhat incidental 
in statistics, he has presumably not emphasized relationships be-
tween variables in any of his introductory courses.  Thus he can-
not speak from experience about the effectiveness of carefully 
discussing relationships between variables early in the introduc-
tory course.

My own experience with students is that they find relationships 
between variables to be fascinating.

>> Relationships are fascinating
>
> maybe some are ... but NOT because of the following ..
>
>> because study of relationships is the only known objective
>> method for accurate prediction and control
>
> how do students know that?

Most entering students have no knowledge of the broad usefulness 
of relationships between variables for accurate prediction and 
control.  We can, however, easily enrich our students with this 
powerful knowledge.

-------------------------------------------------------
Donald B. Macnaughton   MatStat Research Consulting Inc
donmac@matstat.com      Toronto, Canada
-------------------------------------------------------


APPENDIX:  A SURVEY OF EMPIRICAL RESEARCH PROJECTS

In the body of this post I predict that more than ninety-six per-
cent of real empirical research projects are best viewed as 
studying relationships between variables.  My prediction is not 
based on a proper statistical survey of empirical research proj-
ects and is instead based simply on my experience as a statisti-
cian.  I make the prediction as a straw man basis for discussion.  

A proper statistical survey might draw a sample of reports of re-
search projects from a sample of journals reporting empirical re-
search results and then use experts in empirical research to (1) 
identify the entities and variables of interest in each research 
project and (2) classify the main focus (or foci) of each re-
search project into one of the three categories I describe in the 
body of this post.  I hope that an interested reader will perform 
a proper empirical version of the survey.

Ironically, the survey is a survey (across empirical research 
projects) of the univariate distribution of the nominal-level 
variable "main focus of the research project".  Thus here, de-
spite my discussion above and elsewhere of how univariate distri-
butions are generally uninteresting, we have an interesting uni-
variate distribution.  However, I suggest that we can make this 
distribution *more* interesting if we turn it into a study of a 
relationship between two variables.  That is, we could study 
(across empirical research projects) the distribution of the 
(nominal-level) response variable "main focus of the research 
project" as a function of the (nominal-level) predictor variable 
"branch of empirical research to which the research project be-
longs" (e.g., medicine, physics, sociology, etc.).  Perhaps the 
distribution is the same in each branch of empirical research or 
perhaps it is different.
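
The bookkeeping for such a two-variable survey can be sketched in
Python.  Everything in the sketch is hypothetical:  the function
name focus_distribution_by_branch is invented for illustration,
and the classified projects are placeholders, not survey results.
The sketch tabulates the distribution of "main focus" within each
branch, alongside the univariate distribution that ignores
branch.

```python
from collections import Counter

# Placeholder records: (branch, main focus of the project).
# Invented only to show the tabulation; these are NOT survey data.
projects = [
    ("medicine", "relationship"),
    ("medicine", "relationship"),
    ("medicine", "relationship"),
    ("medicine", "univariate"),
    ("physics", "relationship"),
    ("physics", "relationship"),
    ("sociology", "relationship"),
    ("sociology", "relationship"),
    ("sociology", "other"),
]


def focus_distribution_by_branch(records):
    """Return {branch: {focus: proportion within that branch}}."""
    counts = Counter(records)
    totals = Counter(branch for branch, _ in records)
    table = {}
    for (branch, focus), n in counts.items():
        table.setdefault(branch, {})[focus] = n / totals[branch]
    return table


# Univariate view: distribution of main focus, ignoring branch.
overall = Counter(focus for _, focus in projects)

# Two-variable view: main focus as a function of branch.
by_branch = focus_distribution_by_branch(projects)

print("overall counts:", dict(overall))
for branch, dist in sorted(by_branch.items()):
    print(branch, dist)
```

Comparing the rows of the second table shows whether the
distribution of main focus is the same in each branch or differs
between branches.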


REFERENCES

Cobb, G. W. 1993. "Reconsidering statistics education:  A Na-
   tional Science Foundation conference."  _Journal of Statistics 
   Education 1(1)._  Available at 
   http://www.amstat.org/publications/jse/v1n1/cobb.html

Macnaughton, D. B. 1997a. "Re: How should we *motivate* students 
   in intro stat? (response to comments by John R. Vokey)."  
   Posted to sci.stat.edu and EdStat-L on April 6, 1997 and re-
   vised on June 1, 1997.  Available at 
   http://www.matstat.com/teach/p0024.htm

Macnaughton, D. B. 1997b. "EPR approach and scientific explana-
   tion (response to comments by Robert Frick)."  Posted to 
   sci.stat.edu and EdStat-L on July 23, 1997.  Available at 
   http://www.matstat.com/teach/p0026.htm

Macnaughton, D. B. 1998a. "Eight features of an ideal introduc-
   tory statistics course."  Available at 
   http://www.matstat.com/teach/

Macnaughton, D. B. 1998b. "Re: Eight features of an ideal intro-
   ductory statistics course (response to comments by Gary 
   Smith)."  Posted to sci.stat.edu and EdStat-L on November 23, 
   1998.  Available at http://www.matstat.com/teach/p0036.htm

Macnaughton, D. B. 1999. "The introductory statistics course:  
   The entity-property-relationship approach." Available at 
   http://www.matstat.com/teach/

Roberts, H. V. 1992. "Student-conducted projects in introductory 
   statistics courses." In Gordon, F. S. and Gordon, S. P. (eds.) 
   _Statistics for the Twenty-First Century, MAA Notes, No. 26,_ 
   Washington, DC: Mathematical Association of America. pp. 109 - 
   121.
