Subject: Definition of "Relationship Between Variables"
         (was Re: Eight Features of an Ideal Intro Stat Course)
To:      sci.stat.* Usenet Newsgroups
         EdStat E-Mail List
From:    Donald B. Macnaughton <donmac@matstat.com>
Date:    Monday January 28, 2002
Cc:      Jan de Leeuw
         Herman Rubin
         Robert Frick
-----------------------------------------------------------------

This post is dedicated to the memory of Daniel DeLury (1907 - 1993) of the Department of Statistics at the University of Toronto. Dr. DeLury's influence on me is reflected throughout the post, but most directly in the last appendix.

This post evaluates seven definitions of the concept of 'relationship between variables', including important definitions proposed in earlier posts by Jan de Leeuw, Herman Rubin, and Robert Frick. It also discusses whether a teacher needs to discuss univariate distributions or mathematics near the beginning of an introductory statistics course for students who are not majoring in statistics.

For simplicity, I assume throughout this post that all variables are numeric -- that is, their values are numbers. However, the discussion and conclusions below easily generalize to situations with non-numeric variables if the values of the variables are (suitably) recoded to be numeric, and then one thinks in terms of the recoded values.

A DEFINITION OF "RELATIONSHIP BETWEEN VARIABLES" BASED ON
EXPECTED VALUE

Responding to two informal definitions proposed by Herman Rubin (in sci.stat.edu on 98/8/3), I proposed (on 99/5/16):

>> DEFINITION: There is a *relationship* between the variables
>> x and y if for at least one value x' of x
>>
>>      E(y|x') ~= E(y)                                   [1]
>>
>> where
>>
>>      E(*) is the expected-value operator
>>
>>      E(y|x') is the expected value of y given that x has
>>      the value x' and
>>
>>      ~= stands for "is not equal to".
>>
>> Defining the concept of 'relationships between variables' in
>> terms of conditional expected value leads to a simpler
>> definition than the definitions Herman proposes ... because
>> the expected-value approach replaces the complicated concept
>> of 'distribution' with the simpler concept of 'expected
>> value' [1999a].

JAN DE LEEUW'S REMARKS ABOUT [1]

Quoting [1], Jan de Leeuw writes (on 99/5/16)

> It seems to me this is too narrow. Suppose, for example, that
> E(y|x) = E(y) for all x, but V(y|x) ~= V(y) for some x, where
> V is variance (for instance V(y|x) = \sigma^2 x^2). Seems like
> a relationship to me.

Two points of view are available to handle Jan's example:

1. We can adopt Jan's point of view and say that the example is an example of a relationship between the variables x and y.

2. We can adopt the point of view suggested by [1] and say that the example is not an example of a relationship between the variables x and y. (Of course, [1] clearly implies that the example is an example of a relationship between the variables x and V(y).)

We can adopt either of these points of view because they both appear to work satisfactorily. I discuss which point of view is preferred below, but first it is helpful to consider some preliminary material.

(In an introductory statistics course for less advanced students a teacher might reasonably decide not to present any formal definition of the concept of 'relationship between variables'. In this case the teacher would not present either of the above points of view.
Instead, the teacher might choose to character- ize the concept of 'relationship between variables' informally in terms of one variable "depending" on the other, or in terms of the values of one variable "varying somewhat in step" with the values of the other. Although these characterizations are not mathematically explicit, I believe they are reasonable approaches for less advanced students if the characterizations are developed in terms of sufficient practical examples.) > > There is also a problem with symmetry. Can we reverse the role > of x and y in these definitions ? It seems so. Appendix C discusses the symmetry of definitions of the concept of 'relationship between variables'. JAN DE LEEUW'S DEFINITION OF "RELATIONSHIP BETWEEN VARIABLES" > > This leads to a somewhat more straightforward definition: there > is a relationship between random variables x and y if and only > if they are not independent Jan defines the concept of 'relationship between variables' in terms of the concept of 'independence' of variables. This leads one to seek a definition of that concept, which Jan provides (in terms of a definition of "dependence" or "relationship") as fol- lows: > (or, if you like, p(y|x) ~= p(y) for some x). For clarity, let me make Jan's definition more explicit: DEFINITION: There is a *relationship* between the random variables x and y if and only if p(y'|x') ~= p(y') [2] for some x' and some y' where p(y') = the unconditional probability that the vari- able y has the value y' (or equals the value of the probability density of y at y') and p(y'|x') = the probability that the variable y has the value y' given that the variable x has the value x'(or equals the value of the probabil- ity density of y at y' given that x is at x'). I hope that [2] properly characterizes the spirit of Jan's defi- nition. However, [2] differs from Jan's definition in two sig- nificant ways: - Primes appear on x and y in [2] to reinforce the idea that the definition is referring to (any) specific values of the vari- ables x and y. That is, the variables are being used in the existential sense, as opposed to the universal sense. This is also implied by the phrase "for some x' and some y'" in the definition. - The phrase "and some y'" is added to [2]. This gives y the same existential freedom in the definition as x, which helps to show the broadness of the definition. Note that [2] has the same structure as [1], with the expected- value operator replaced by the probability (density) operator. Definition [2] refers to the concept of a "random" variable, but definition [1] does not refer to this concept. Appendix A dis- cusses the concept of 'random variable'. Definition [2] is broader than [1] because [2] is satisfied by a difference at any point across the two distributions (conditional and unconditional) of the values of y, while [1] is satisfied only if the means of the two distributions are different. I further compare [1] and [2] below. HERMAN RUBIN'S REMARKS ABOUT INTRODUCTORY STATISTICS For brevity, I use the phrase "non-statistics-majors" in the fol- lowing discussion to refer to students who are not majoring in statistics or mathematics. Herman Rubin begins his post by addressing the issue that started the debate about the definition of "relationship between vari- ables". He writes (on 99/5/17) > Donald Macnaughton ... wrote: > M> Quoting a 98/7/23 post of mine, Herman Rubin writes (on M> 98/8/3) >> R>> Donald Macnaughton ... 
wrote: >>> M>>> In a July 17 post I recommend that teachers emphasize the M>>> concept of a relationship between variables and I recommend M>>> a de-emphasis of less important topics such as univariate M>>> distributions ... >>> R>> As such, I agree about the point on univariate distributions. R>> One does not need a catalog of the standard ones, nor [does R>> one need to] be adept at calculating them. R>> R>> HOWEVER, on consideration of the actual problems, they are an R>> essential tool. >> M> I fully agree that univariate distributions are an essential M> tool in actual statistical problems -- most statistical analy- M> ses depend directly on concepts of univariate distributions. M> M> However, as Herman may agree, the ubiquity of univariate dis- M> tributions in statistical analyses does NOT speak to whether a M> teacher should discuss univariate distributions near the be- M> ginning of an introductory statistics course when the course M> is aimed at students who are NOT majoring in statistics. > > On the contrary, it is these who need to understand, not the > formulas for the standard univariate distributions, but what > distributions, including univariate, are in general, and also > some of their basic properties. Herman and I disagree here: He recommends that a teacher discuss univariate distributions near the beginning of an introductory course for non-statistics-majors. In contrast, I recommend that a teacher begin such a course by discussing relationships between variables, with no discussion (near the beginning) of univariate distributions. > > If the person in the other field cannot move the problem from > "biological space" to "statistics space", the problem is not > ready for the use of statistics Again, I respectfully disagree. By forcing our students (and clients) to clamber from their own "space" into our "statistics space" (that is, by forcing them to speak our complicated mathe- matical language), I believe we confuse them and frighten many of them away. I believe that the mathematical language is unnecessary for non- statistics-majors. Instead of struggling to explain the mathe- matics, we can focus on the practical use of statistics in em- pirical research. We can reasonably tell students that the main practical use of statistics is to assist researchers to study re- lationships between variables. As noted, a relationship exists between two variables if when one variable "goes up and down" in entities (or in the entities' en- vironment), the other variable "goes up and down somewhat in step". We can illustrate this phenomenon on a scatterplot with no discussion of any underlying mathematics. We can then gener- alize these ideas in various useful directions, again without the (direct) need of mathematics. In particular, we can show students that most empirical research projects (or logical units of research projects) can be usefully viewed as studying the relationship between a single response variable and one or more predictor variables. The response vari- able is the variable that we wish to learn how to predict or con- trol. The predictor variable(s) is (are) the other variable(s) that we observe or manipulate in a research project to help us learn how to predict or control the values of the response vari- able. 
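To make the response/predictor idea concrete, the following small sketch (Python with NumPy; the data are simulated and every number is invented purely for illustration) estimates the overall mean of a response variable and its conditional means at two values of a predictor variable -- the quantities that appear in definition [1].

  # Illustration only: simulated data with invented numbers.
  import numpy as np

  rng = np.random.default_rng(0)

  # Predictor variable x takes two values; response variable y
  # "goes up and down somewhat in step" with x, plus unexplained
  # variation.
  x = np.repeat([1.0, 2.0], 500)
  y = 3.0 + 2.0 * x + rng.normal(0.0, 1.0, size=x.size)

  e_y = y.mean()                      # estimates E(y)
  e_y_given_1 = y[x == 1.0].mean()    # estimates E(y | x = 1)
  e_y_given_2 = y[x == 2.0].mean()    # estimates E(y | x = 2)

  # Definition [1] asks whether E(y|x') differs from E(y) for at
  # least one value x'.  Here both conditional means clearly
  # differ from the overall mean.
  print(e_y, e_y_given_1, e_y_given_2)

With these invented numbers the conditional means are near 5 and 7 while the overall mean is near 6, so an analysis in the spirit of [1] would suggest that a relationship exists (with real data a formal statistical test of the difference would of course be needed).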
Many readers will agree that the statistical procedures that are commonly used in empirical research include - the t-test - analysis of variance - regression analysis - response surface analysis - categorical analysis - time series analysis - survey analysis - survival analysis - Bayesian analysis - neural networks - discriminant analysis - nonparametric analysis - logistic regression analysis - probit analysis - data mining methods - univariate methods - and others. Examination of these procedures suggests that they can all be reasonably and usefully viewed (for the most part) as optimal methods for studying the relationship between a single response variable and zero or more predictor variables under various cir- cumstances. The preceding four paragraphs suggest that the easy-to-understand concept of 'relationship between variables' is a central unifying concept of both the field of statistics and empirical research. Thus it is reasonable to emphasize this concept in an introduc- tory statistics course. The main ideas are surprisingly simple: In a typical research project using statistical methods the researcher (e.g., a medical researcher) would like the field of statistics to answer three key questions, which are 1. How can we discover and demonstrate reliable evidence that a relationship exists (if one does) between the response vari- able and predictor variable(s) of interest? 2. If we find good evidence that a relationship exists, how can we best use our knowledge of the relationship to predict or possibly control the values of the response variable in new entities from the population on the basis of the values of the predictor variable(s)? 3. If we make such predictions or attempt such control, how accu- rate will the prediction or control be? These questions make no reference (at least on the surface) to mathematics. We can show non-statistics-majors that much of the field of statistics is about answering these questions in empiri- cal research under various circumstances. Discussing these ideas (using sufficient practical examples) gives students a broad overview of the vital role of statistics in empirical research. This is more likely to impress non-statistics-majors than if we discuss the mathematics. Under this approach I do not suggest that we hide from students the fact that statistical procedures are based on mathematical principles. Instead, I recommend that teachers make students well aware of the existence of the important underlying mathemat- ics. But we can defer the details until a later course. Similarly, it is important to inform students about the underly- ing assumptions of statistical analysis -- we cannot have confi- dence in the conclusions of a statistical analysis unless we know that the underlying assumptions of the analysis are adequately satisfied by the situation and data under study. I recommend that introductory statistics teachers impress students with this important point. But, as with the mathematics, we can defer the details of the assumptions until later. I further discuss the above points and the teaching approach I recommend in two essays (1998a, 1999b) and in appendix H of this post. I discuss empirical research projects that do not study relationships between variables in two essays (1997a; 1999b, app. C). Moore (1997a, sec. 4) and the American Statistical Association (2002) also recommend de-emphasizing mathematics in statistics education. STATISTICAL PROCEDURES AS RELIGIOUS MANTRAS Herman continues ... 
> [If the person in the other field cannot move the problem from > "biological space" to "statistics space", the problem is not > ready for the use of statistics] except as "religious" mantras. I like the metaphor of a religious mantra to characterize certain traditional practices in statistics. One area of statistics in which I think statisticians and empirical researchers sometimes use a mantra is in the important area of hypothesis testing. I have written about hypothesis testing earlier (1997b, sec. 9; 1998b, sec. 5) and the ideas appear from time to time below. I plan to present some further ideas in a later post. HERMAN RUBIN'S FIRST DEFINITION OF "RELATIONSHIP BETWEEN VARIABLES" Herman next changes his focus to the main topic of the present post -- the definition of the concept of 'relationship between variables'. Quoting [1] above, he writes > > I agree with de Leeuw that this definition is far too narrow. > > The appropriate version of this [is] > > DEFINITION: There is a *stochastic relationship* > between the random variables X and Y if for at > least one value x' the conditional distribution [3] > of Y given X=x' is different from the uncondi- > tional distribution of Y. Definition [3] is equivalent to Jan de Leeuw's definition [2] in the sense that [3] will declare that a relationship exists be- tween two "compatible" variables if and only if [2] also declares that a relationship exists. Appendix B discusses the equivalence of [2] and [3]. (Two variables are "compatible" if they both reflect properties of the same type of entity [or one may reflect a property of the entities' environment], and if the available values of the vari- ables are reasonably linked within entities and within time. Clearly, we can reasonably study a relationship between variables only if the variables are compatible.) Despite the equivalence of [2] and [3], definition [3] differs from definition [2] in the sense that [3] is effectively refer- ring to the entire probability (density) function of the y-values for a given x-value while [2] is effectively referring to a point on the probability (density) function of the y-values for the given x-value. Definition [2] is thus more specific, and thus perhaps slightly clearer, reducing the necessary and sufficient condition for a relationship to a reasonable minimum condition. > > An alternative version is that X and Y are dependent random > variables. But the operational meaning of this is the above > formulation; objects are independent if knowing one provides no > information about the distribution of the other. I would sug- > gest that this be used as the definition of independence, and > it goes over immediately to many objects. For discussions about empirical research, I agree with Herman's approach of defining the concept of 'independence' in terms of the concept of 'relationship between objects', rather than the other way around. On the other hand, in theoretical discussions it is often useful to begin with and focus on the concept of 'in- dependence', as discussed in appendix I. Herman speaks about "objects" because he wishes to apply the con- cept of 'independence' to two different types of object, as he indicates in his next sentence: > A random variable here is an object, as is an event. Consider Herman's concept of an event, and consider his notion (implied in the second most recent quotation above) of the "dis- tribution" of events -- how are events distributed? 
One answer is that they are distributed over time (or over some other appro- priate dimension). Thus consider the variable "time of an event". We can view the idea of independence of events simply in terms of independence of (i.e., lack of relationship between) two variables reflecting the (distribution over) time of the two (types of) events. Thus rather than needing two notions of independence (one for variables and the other for events), we can subsume both types of independence under the idea of a lack of a relationship between variables. > > How can this be understood without knowing what it means for > something to be the distribution of a random variable? The referent of Herman's "this" is unclear although convention suggests that the referent is the point he makes in his sentence that precedes the above sentence. However, I suspect that Herman is not referring to the (somewhat peripheral) point in that sen- tence. Instead, I suspect that he is asking how his definition [3] of the concept of a relationship between variables can be un- derstood if one does not understand the concept of the distribu- tion of the values of a (random) variable. If that is Herman's point, I fully agree with it. If we are to successfully use [3] (or [2]) to define the concept of 'relation- ship between variables' in a statistics course, students must first understand the concept of the distribution of the values of a variable. HERMAN RUBIN'S REMARKS ABOUT [1] Herman continues ... > > Expectation should not be taught using the formulas usually > given, By the "formulas usually given" I think Herman means the sum (or integral) across all the possible values of the variable of the product of the variable and its probability (density) function [e.g., for the variable x, the sum (integral) across x of the product of x and p(x)]. If we wish to teach the concept of 'expected value' to non- statistics-majors, I agree with Herman that the formulas usually given should not be used. We can teach the concept to non- statistics-majors in terms of the concept of 'arithmetic mean' or 'average'. That is, the expected value is the value we will get if we compute the average of the values of the variable for all the entities in the population. Non-statistics-majors readily understand that we can estimate with reasonable precision the expected value of any variable by computing the average of the values of the variable in a suitable sample. Here students need an informal awareness of the concept of 'distribution'. That is, they need to understand the idea that the values of variables generally vary. But they need no mathematical awareness of distributions beyond adding together the values and dividing by N. (Most students already know from statistical reports in the media that the average lies at the "center" of the values.) > but those formulas involve the concept of distribution as well. I think Herman is here making the following argument: - Definition [1] defines the concept of 'relationship between variables' in terms of the concept of 'expectation' or 'ex- pected value'. - But the formulas usually given for the concept of expected value' involve the concept of 'distribution'. - Therefore, [1] depends on the concept of 'distribution'. I agree that [1] appeals to the concept of 'expected value' and that the formulas usually given for expected value involve the concept of 'distribution' [which is implicit in the function p(x)]. 
However, if we bypass the formulas usually given and characterize the concept of 'expected value' in terms of the concept of 'arithmetic mean' or 'average', we bypass the need to refer to the mathematical concept of 'distribution'. This makes the ideas substantially easier to understand. I further discuss this approach to expected value in a paper for students (1997b, sec. 7.10).

HERMAN RUBIN'S SECOND DEFINITION OF "RELATIONSHIP BETWEEN
VARIABLES"

> The above definition could also be formulated as
>
>      E(f(y)|x') ~= E(f(y))                             [4]
>
> for all functions f for which the expectations exist,

I suspect that Herman here means not all functions f but, instead, some function f from among the set of all functions for which the expectations exist. That is, under [4] a relationship exists between x and y if and only if [4] is satisfied for some x' and some (any) specific value of y, and some (one, any) function f.

In [4] Herman has neatly changed from using the probability (density) operator as the main operator in the definition (as in [2] and [3]) to using the expected-value operator (as in [1]).

Definition [4] is equivalent to [2] and [3] in the sense that [4] will declare that a relationship exists between two compatible variables if and only if [2] and [3] also declare that a relationship exists. Appendix B discusses the equivalence of [2], [3], and [4].

If the function f in [4] is the identity function (which it usually can be), [4] becomes [1]. In other words, [1] identifies a subset of the cases that satisfy [4].

(For the broadest generality, the function f in [4] is allowed to take multiple y-values [i.e., a vector of y-values] as its argument. This enables us to include the variance function [as used in Jan de Leeuw's example] and similar functions in the set of permissible functions that may appear as f in the definition.)

I further discuss Herman's two definitions below, but first it is helpful to consider three other definitions of the concept of 'relationship between variables'.

A STANDARD DEFINITION OF THE CONCEPT OF 'RELATIONSHIP BETWEEN
VARIABLES' FROM MATHEMATICAL STATISTICS

Jan de Leeuw begins his discussion above of the definition of "relationship between variables" with the concept of 'independence', as opposed to beginning with the concept of 'dependence' or 'relationship'. Jan may have begun this way because it is a standard way to begin. For example, Freund and Walpole present the following familiar definition of "independence" of two variables in their popular mathematical statistics textbook:

     If p(x,y) is the value of the joint probability
     distribution of the discrete random variables X and Y at
     (x,y), and p1(x) and p2(y) are the values of the marginal
     distributions of X and Y at x and y, X and Y are
     *independent* if and only if

          p(x,y) = p1(x) p2(y)

     for all (x,y) within their range. To give a corresponding
     definition for continuous random variables, we simply
     substitute the word "density" for the word "distribution"
     [and the word "continuous" for the word "discrete"]
     (1987, p. 126).

(For comparability, I have reduced Freund and Walpole's original definition from N variables to two variables and I have changed the variable and function names to be consistent with those in this post.)

Hogg and Craig, in their popular mathematical statistics textbook, define "independence" the same way, although they use different wording and notation (1995, p. 101).
Other textbooks also give conceptually the same definition, and thus Freund and Walpole's definition reflects a widely-held view of the concept of 'independence of two variables'.

Freund and Walpole emphasize the concept of 'independence' and give much less attention to the concept of 'dependence' or 'relationship'. Instead, "dependence" between two variables is simply (and reasonably) implied as the negation of independence.

Thus we can define the concept of 'relationship between variables' in terms of the negation of Freund and Walpole's definition of "independence". A reasonable version of this is

     DEFINITION: If p(x,y) is the value of the joint
     probability (density) function of the random variables X
     and Y at (x,y), and p1(x) and p2(y) are the values of the
     marginal probability (density) functions of X and Y at x
     and y, there is a *relationship* between X and Y if and
     only if

          p(x,y) ~= p1(x) p2(y)                           [5]

     for some (x,y) within their range.

Definition [5] is equivalent to [2] through [4] in the sense that [5] will declare that a relationship exists between two compatible variables if and only if [2] through [4] also declare that a relationship exists -- see appendix B.

Although [5] is equivalent to [2] through [4] in the sense above, [5] is different from [2] through [4] (and [1]) in an important sense -- [5] makes no significant distinction between the response variable and the predictor variable, while the other definitions all make such a distinction. Specifically, definitions [1] through [4] use the vertical bar to mean "given that", and the response variable y always appears to the left of the vertical bar, and the predictor variable x always appears to the right of the vertical bar. (The vertical bar is implicit in [3].)

As discussed above, most empirical research projects (or logical units of research projects) can be usefully viewed as studying the relationship between a single response variable and one or more predictor variables. Thus the distinction between the response variable and the predictor variable(s) is important in most empirical research projects. But [5] does not significantly distinguish between these variables. Thus [5] has less direct applicability to the use of statistics in empirical research than [1] through [4].

Consider the issue of quantification: Note how Freund and Walpole's definition of "independence" is a universally quantified statement, as indicated by the phrase "for all (x,y)" in the definition. On the other hand [5], which is the negation of Freund and Walpole's definition, is an existentially quantified statement, as indicated by the phrase "for some (x,y)" in the definition. Definitions [1] through [4] are also existentially quantified statements, as is underscored by the primes on some of the x's and y's in the definitions.

An advantage of using an existentially quantified definition is that, in general, existentially quantified statements can (if they are true) be verified in empirical research while universally quantified statements can almost never (even if they are true) be directly verified. (Universally quantified statements can be falsified -- see appendix D.) Universally quantified statements cannot be verified because proper verification requires an exhaustive search, which (due to resource limitations) is almost always impossible.

(Because providing empirical support for a universally quantified statement is almost always impossible, empirical researchers rarely make universally quantified statements.
In particular, empirical researchers rarely try to empirically support the claim that no relationship exists between two given compatible vari- ables -- that is, they rarely [if ever] try to empirically sup- port the claim that the two variables are independent. They do not attempt to support this claim because generally it cannot be reasonably empirically supported. Instead, following the princi- ple of parsimony, most experienced researchers simply formally assume that no relationship exists between a response variable and one or more compatible predictor variables until unequivocal empirical evidence is brought forward that allows us to reject the "null" assumption of no relationship. (Appendix E discusses the "conservation" laws of physical sci- ence, which are an interesting exception to the main point of the preceding paragraph.) As noted above, non-statistics-majors are more likely to be im- pressed by the practical side of statistics. Thus it makes sense to introduce them to the existentially quantified definition of "relationship between variables" (as defined by any of defini- tions [1] through [7] in this post) instead of the universally quantified definition of "independence". This helps non- statistics-majors to see the practical use of statistics in em- pirical research, which is mostly about relationships between variables (and not about "independences"). On the other hand, if we are teaching statistics to students who are majoring in statistics or mathematics, it is important to in- troduce the fundamental universally quantified definition of "in- dependence of N random variables", as discussed in appendix I. ROBERT FRICK'S DEFINITION OF "RELATIONSHIP BETWEEN VARIABLES" Bob Frick wrote privately to me proposing another form of [1]. (I quote him here with his permission.) Referring to [1], he writes > I propose > > DEFINITION: There is a *relationship* between the vari- > ables x and y if for at least one pair of values x' > and x" of x > > E(y|x') ~= E(y|x"). [6] > > I think this definition is mathematically equivalent to your > definition and better fits my intuitive understanding and the > typical understanding of causality. I had to translate from > your formulation to mine in order to understand and evaluate > yours. I agree with Bob that [1] and [6] are "mathematically equivalent" in the sense that [1] will declare that a relationship exists be- tween two compatible variables if and only if [6] also declares that a relationship exists. Appendix F discusses the equivalence of [1] and [6]. (Discussion at several places below focuses on continuous [as op- posed to discrete] response variables. This is because continu- ous response variables are available in most areas of empirical research and continuous variables generally carry substantially more information in their values. Thus using a continuous re- sponse variable usually enables a researcher to obtain better knowledge of the relationship between the response variable and the predictor variable[s] under study.) Consider some properties of [1] and [6]: 1. Definition [1] is simpler than [6] in the sense that the right side of [1] is an unconditional expected value while the right side of [6] is a (more complicated) conditional expected value. 2. Definition [6] directly reflects how the detection of rela- tionships between variables is usually done in an important case in empirical research -- the case in which the response variable is continuous and the predictor variable is discrete, with two values. 
This reflects the simplest standard experimental design, which is usually best analyzed with the simplest case of analysis of variance -- the one-way case with two levels (also called the t-test). In this case we test whether a relationship exists between the variables by testing whether [6] (not [1]) is satisfied. Statistical practice favors [6] here over [1] because a research project properly based on [6] generally provides (other things being equal) a more powerful statistical test of whether the sought-after relationship exists.

3. The approach implied by [6] is generally not used to detect relationships between variables when we have a continuous response variable and a continuous predictor variable. In this case the test for the existence of a relationship is generally a test of whether a parameter in a model equation has a certain "null" value. We know or believe that the parameter will have the null value (typically zero) if no relationship is present and will have a different value if a relationship is present. If we can reasonably reject the hypothesis that the parameter has the null value, we can (tentatively) conclude that a relationship exists between the variables. Statistical practice favors a test of a parameter here instead of the approach implied by [6] because the parameter test provides (other things being equal) a more powerful statistical test of whether the sought-after relationship exists.

4. An approach (properly) based on [6] can generally give better prediction or control capability than a similar approach based on [1].

5. Although the approach implied by [6] is directly used some of the time to test for relationships between variables in empirical research, the approach implied by [1] is almost never directly used. Instead, forms that can be derived from [1] (such as [6] or a test of a parameter) are used in actual practice.

6. Consider the case in which humans informally study relationships between variables in everyday life. In this case we are usually not conscious of the concept of 'relationship between variables'. For example, after several visits to a new bank a person may observe, "The earlier in the morning I go to the bank, the less time I have to wait to be served." ("Duration of waiting time" is the response variable and "bank arrival time" is the predictor variable.) In this case people seem more often to begin with an approach resembling [1] than one resembling [6]. This may be because [1] is simpler and lends itself at least as well as [6] to natural situations. Here, we often begin without knowledge of the identity of the relevant predictor variable, and thus without direct knowledge of the values of the response variable when the predictor variable is at two different values (x' and x"), although this type of knowledge usually comes later. Instead, we initially discover the relationship by noting that the response variable deviates from its expected value when the predictor variable is at a particular value (or in some value range). Thus [1] seems more basic or more "natural" to me than [6].

The above discussion suggests that [1] and [6] both have advantages. Thus in statistics courses in which the teacher decides to introduce [1] or [6] to define "relationship between variables", I recommend introducing both definitions to students.

A DEFINITION OF "RELATIONSHIP BETWEEN VARIABLES" IN TERMS OF A
MATHEMATICAL FUNCTION

The discussion above refers to the idea of a parameter in a model equation.
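For readers who want to see such a parameter test in a small computation, here is a minimal sketch (Python; the data are simulated and the coefficients are invented solely for illustration) that fits a simple model equation and tests whether its slope parameter has the null value zero:

  # Illustration only: simulated data, invented coefficients.
  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(1)
  x = rng.uniform(0.0, 10.0, size=200)    # continuous predictor
  e = rng.normal(0.0, 2.0, size=200)      # error term
  y = 5.0 + 0.8 * x + e                   # model equation with slope parameter 0.8

  fit = stats.linregress(x, y)
  # fit.pvalue tests the hypothesis that the slope parameter is
  # zero (the "null" value); a small p-value is taken as evidence
  # that a relationship exists between x and y.
  print(fit.slope, fit.pvalue)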
The idea of a model equation suggests the following definition of the concept of 'relationship between variables': DEFINITION: There is a *relationship* between the vari- ables x and y if the value of y can be expressed as a non-constant mathematical function of the value of x. [7] An "error" term e is generally included with the func- tion, where e is usually viewed as being independent of x. This is stated algebraically as y = g(x) + e. For maximum generality, the function g(x) is shown as a fully general function. However, in any real study of a relationship between variables the general function g is replaced by a spe- cific mathematical function that is chosen from among the many types that are available. Definition [7] is closely related to the concept of 'expected value' because the function g is almost always chosen so as to "best" estimate E(y|x). (If the response variable is continuous and if the conditional distribution of its values is noticeably non-symmetric, which I estimate occurs in less than four percent of empirical research projects with continuous response variables, the median may be used instead of the mean [expected value]. The same basic prin- ciples apply, but g(x) estimates the median of the conditional distribution of the response variable instead of the mean.) The mathematical form of g is chosen mostly through analysis of data obtained in empirical research, although theoretical consid- erations sometimes also play a central role, especially in the physical sciences. The function g is usually a mathematical function in the strict sense of the term "function". That is, g is a one-to-one or pos- sibly many-to-one mapping between two sets, with no random ele- ment involved. (That is, the mapping is not one-to-many.) The random element in [7] is handled by the error term e (which is usually represented by the lowercase Greek letter epsilon). This term takes account of the (empirical) fact that invariably in real situations the best function g cannot perfectly predict the associated value of y from a value of x -- the e is the error in the prediction. Researchers often determine the distribution of e, but in any real situation the term itself has a different unpredictable value every time an instance of the equation oc- curs. The error term in [7] provides two important conceptual benefits: 1. The error term enables the equation to satisfy the mathemati- cal requirements of the equals sign. 2. The error term collects all the unaccounted-for variation in the values of y in a single sensible place. (Some complicated analyses use multiple error terms.) Definition [7] is equivalent to [1] and [6] in the sense that [7] will declare that a relationship exists between two compatible variables if and only if [1] and [6] also declare that a rela- tionship exists. Appendix F discusses the equivalence of [1], [6], and [7]. (Definition [7] is not equivalent to definitions [2] through [5], but can be made so by broadening it, as discussed in appendix G.) Definition [7] is important because mathematical functions are often used to represent relationships between variables in most branches of the physical and biological sciences, and also (at a more abstract and implicit level) in much research in the social sciences. In the physical sciences the error term e in [7] is usually omitted, but the same general principle of stating rela- tionships between variables in the form of mathematical functions is widely used. Definition [7] refers to the concept of 'independence'. 
A teacher presenting [7] to students can use the standard approach exemplified above in Freund and Walpole's definition of "inde- pendence of two random variables" to characterize 'independence'. However, that approach appeals to the concept of 'distribution'. Thus students must understand the statistical concept of 'distri- bution' to fully understand [7]. If a teacher chooses to present definitions [1] and [6] to stu- dents, and if the students have sufficient mathematical ability, I recommend that the teacher also present definition [7]. I rec- ommend that the three definitions be presented in succession, separated only by careful discussion of practical examples of ac- tual relationships to reinforce each definition. Presenting the definitions in succession helps students to attain a unified sense of the various ways that relationships between variables appear in life and empirical research. (My experience suggests that most students cannot understand any definition of the concept of 'relationship between variables' without sufficient discussion of practical examples, with "suffi- cient" depending on the level of the students.) COMPARISON OF THE DEFINITIONS The preceding material discusses seven definitions of the concept of 'relationship between two variables'. Which definition is preferred? To simplify this question, the following discussion views defini- tions [1], [6], and [7] as if they are the same definition. This is reasonable because the three definitions are theoretically equivalent, as noted in appendix F. The discussion refers to the three definitions jointly as the "expected-value" definition of the concept of 'relationship between variables'. Similarly, the following discussion views definitions [2], [3], [4], and [5] as if they are the same definition. This is reason- able because [2] through [5] are theoretically equivalent, as noted in appendix B. The discussion refers to [2] through [5] jointly as the "distribution" definition of the concept of 'rela- tionship between variables'. The expected-value and distribution definitions are not equiva- lent, as is illustrated by Jan de Leeuw's variance example: If we consider the example with y in the role of the response vari- able, the expected-value definition does not directly declare that a relationship exists between x and y, but the distribution definition does directly declare that a relationship exists. Since the two definitions are not equivalent, which of them is preferred? It is reasonable to split this question into two more specific questions: - Which definition is preferred in the introductory statistics course for non-statistics-majors? - Which definition is preferred in general statistical discourse? In determining the preferred definition, I assume we are not Platonists. Thus neither the expected-value definition nor the distribution definition is more "correct". This is because we do not believe that some true Platonic concept of 'relationship be- tween variables' exists somewhere, and we are trying to capture the concept in the definition. Instead, we are free to choose a definition as being "correct". Many readers will agree that a reasonable approach to making this choice is to choose whichever definition has more conceptual advantages. Consider some features and advantages of the expected-value defi- nition: 1. 
The expected-value definition is easier to understand than the distribution definition because it does not require mathematical understanding of the statistical concept of 'distribution (of the values of a variable)'. 2. Empirical researchers are generally much more interested in directly predicting or controlling the values of the response variable in an empirical research project (i.e., in predict- ing or controlling expected value) than in predicting or con- trolling the values of higher moments (e.g., variance) of the response variable. And although situations arise (especially in quality control) in which examples like Jan's are impor- tant, I estimate that more than ninety-six percent of all em- pirical research projects that study relationships with con- tinuous response variables (as reported in the empirical re- search literature) can be reasonably understood as viewing relationships in terms of the expected value (or occasionally in terms of the expected median) of the response variable. And usually, if a relationship is found between the variance of the response variable and a predictor variable, this is merely viewed as a nuisance. (The variance relationship is a nuisance because heterogeneity of response variable variance adds complexity to the analysis.) 3. Consistent with point 2, the expected-value definition is (implicitly) used much more frequently than the distribution definition to define the statistical tests that are performed in empirical research to detect relationships between vari- ables. In the case of a continuous response variable the tests are almost always (effectively) tests of whether some measure of center or some parameter of a model (both of which are often linear functions of the [perhaps trimmed or subsetted] values of the response variable) has some value, or is different from some other fixed value, or is different from some other empirically derived value or values. These tests are thus effectively tests of the first moment (possi- bly with appropriate subsetting) of the values of the re- sponse variable. Only rarely are the key tests performed on other moments of the values of the response variable. Also, tests that are in terms of the probability (density) function of the values of the response variable are performed only in- frequently. (Instances occur when the response variable is discrete, as opposed to continuous but, as suggested above, discrete response variables are used less often.) 4. Statistically knowledgeable empirical researchers often per- form statistical tests for variance relationships. But when they perform such tests they are usually directly interested in studying a relationship between variables as defined by the expected-value definition. And they are only performing the variance tests to assist in verifying that the underlying assumptions of the statistical procedure being used are ade- quately satisfied. Furthermore, statistically knowledgeable empirical researchers almost never check whether the third or higher moments of the response variable change as a function of a predictor variable. This suggests that empirical re- searchers generally view study of moments of the response variable higher than the second as being of little interest or value. 5. A function (transformation) is sometimes applied to the val- ues of the response variable in the data analysis of an em- pirical research project. 
However, if such a function is used, the purpose is usually merely to stabilize the variance of the response variable to satisfy assumptions of the statistical procedure being used -- not to support direct study of higher moments or other similar study of the values of the response variable.

6. Although the expected-value definition does not directly cover certain cases (such as Jan's V(y|x) case), it covers all these cases indirectly when the appropriate function is applied to the values of the response variable, as suggested by Herman's definition [4].

7. The terminology of the expected-value definition is consistent with common language. For example, it is natural and informative to report the results of an empirical research project that found the result in Jan's example as "There is no evidence of a relationship between x and y but there is good evidence of a relationship between x and V(y)."

8. The expected-value definition makes a distinction between certain types of relationships between variables -- a distinction that definitions [2], [3], and [5] do not make. (The distinction is also made by definition [4].) The distinction is in terms of the function f (which is usually merely the identity function) that is applied to the values of the response variable. In cases in which this function is not the identity function, naming it helps one to understand the relationship.

9. Empirical researchers are generally interested in minimizing the (error) variance in the values of the response variable in a research project. This is because minimizing variance results in increased precision of prediction or control of the values of the response variable, which is a widely pursued general goal of empirical research. However, minimization of variance is usually not pursued directly in empirical research. Instead, minimization of variance comes as a secondary benefit from studying prediction or control of expected value through relationships between variables. That is, usually a large part of the variability in the values of the response variable in an empirical research project is assumed to reflect the fact that this variable depends on numerous other variables (many of which may be unknown), and these other "influencing" variables may be varying (either systematically or at random) within or between entities, thereby causing some of the variation in the values of the response variable. (Some of the variation in the response variable is also due to measurement error, and some of the variation may be "totally random".) Identifying the influencing variables through studying relationships between variables in terms of the expected-value definition "removes" the variation from the values of the response variable that can be associated with these variables, thereby reducing the "error" variance in the values of the response variable, and thereby increasing precision in prediction or control. That is, researchers generally increase precision through studying relationships between variables (and improving measurement methods) -- not through direct efforts to somehow reduce variance without studying relationships between variables.

10. The expected-value definition is consistent with the distribution definition. This is because the expected-value definition is not stated as "if and only if". The expected-value definition gives only a sufficient condition for a relationship between variables -- it does not give a necessary condition.
(As noted above, this condition defines an empirically large subset of the cases defined by the distribution defini- tion.) Thus the expected-value definition leaves open the possibility that other forms of "relationship" might also di- rectly qualify, although we need not discuss this esoteric point with non-statistics-majors. Consider some features and advantages of the distribution defini- tion: 1. The distribution definition identifies a class of relation- ships between variables that the expected-value definition does not directly identify. These are the relationships that resemble Jan's V(y|x) case. (However, as noted, the ex- pected-value definition does identify these cases if an ap- propriate function is applied to the values of the response variable.) 2. Unlike the expected-value definition, the distribution defi- nition (with the exception of [4]) does not force one to look for a function to deal with cases like Jan's V(y|x) case. If the inequalities in [2], [3], or [5] are satisfied in any way, the distribution definition declares that a relationship exists between the two variables. This idea is important from a theoretical point of view. However, the idea is not often directly applied in empirical research. This is be- cause in empirical research it is generally easier to find an appropriate function (if needed) and then to use the expected- value definition than it is to study the entire distribution of the values of the response variable. That is, researchers generally focus on a key aspect of the distribution, which is usually the value it is "centered" around, which is usually (perhaps after a transformation) best represented by its expected value. Experience has shown that studying the expected value (or occasionally some other measure of central tendency) of the response variable (while keeping an eye on the spread) is an efficient way of breaking down the concepts to simple yet generally sufficient principles. 3. The distribution definition reflects the Bayesian approach to the study of relationships between variables. This approach is reasonably viewed as focusing on the distribution of the values of the response variable (as opposed to focusing merely on the expected value of the response variable). Re- searchers using the Bayesian approach study the relationship between a response variable (which may be a parameter) and zero or more predictor variables by inferring the "posterior" distribution of the values of the response variable. They make this inference on the basis of Bayes' theorem and (a) the values of the response variable and predictor vari- able(s) (if any) obtained from the entities in the sample in the research project and (b) the "prior" distribution of the values of the response variable (possibly conditioned on the values of the pre- dictor variables). Thus the distribution definition directly mirrors the Bayesian approach. This is a crucial advantage of the dis- tribution definition if one is using the Bayesian approach. I am unable to think of other significant features or advantages of the distribution definition. If readers see other features or advantages of either definition, I hope they will present them to this debate. WHICH DEFINITION IS PREFERRED? Having considered some features and advantages of the two defini- tions of the concept of 'relationship between variables', let us now consider which definition is preferred. First, which definition is preferred for an introductory statis- tics course for non-statistics-majors? 
For such a course, if the teacher elects to present a formal definition of the concept of 'relationship between variables', I recommend emphasizing the expected-value definition. I base this on my beliefs that (a) the expected-value approach is easier to understand, and (b) the expected-value approach mirrors the methods statisticians and researchers usually use to detect and study relationships between variables in empirical research.

On the other hand, I recommend emphasizing the distribution definition if a teacher elects to teach the Bayesian approach in a statistics course. The distribution definition is preferred in this case because, as noted, it directly mirrors the Bayesian approach. (However, I recommend against teaching the Bayesian approach in an introductory statistics course for non-statistics majors -- see appendix J.)

Finally, in a statistics course for students who are majoring in statistics or mathematics, or in a statistics course for students who have sufficient statistical experience, or in general statistical discussion, I believe the preferred definition of "relationship between variables" should be at the discretion of the instructor or participants. Reasonable criteria for making the choice are that the preferred definition for a particular discussion should

1. maximize understanding and

2. provide optimal support for the intended analysis.

In addition to being preferred in Bayesian cases, the distribution definition can better satisfy the two criteria in some non-Bayesian cases, especially in some theoretical and mathematical cases. For example, the technical discussion in appendix D appeals to the distribution definition.

On the other hand, in many other non-Bayesian cases the expected-value definition seems superior. For example, "standard" analysis of variance seems better viewed in terms of the expected-value definition. This is because in standard analysis of variance the resulting p-values are almost always reasonably viewed as testing for relationships between variables in terms of detecting differences between means of the values of the response variable -- standard analysis of variance does not (directly) test anything about the higher moments of the values of the response variable. Appendix K further discusses this point.

GENERALIZATION OF THE DEFINITIONS

Definitions [1] through [7] are all definitions of a relationship between a single response variable and a single predictor variable. Appendix H discusses the important issue of generalizing the definitions to situations with multiple response variables and multiple predictor variables.

MAIN POINTS

The concept of 'relationship between variables' can be reasonably defined in terms of the concept of 'expected value' and in terms of the concept of 'univariate distribution'. The two definitions are not equivalent. The expected-value definition identifies a large subset of the cases identified by the distribution definition. The expected-value definition indirectly identifies the remaining cases.

The expected-value definition is easier to understand and has several other significant advantages over the distribution definition. And for the introductory statistics course for non-statistics-majors the expected-value definition appears to have no serious disadvantages.
Thus I recommend that a teacher empha- size the expected-value definition in an introductory statistics course for non-statistics majors if the teacher elects to present a formal definition of the concept of 'relationship between vari- ables'. The distribution definition is preferred when the Bayesian ap- proach is used and in some theoretical and mathematical discus- sions. The easy-to-understand concept of 'relationship between vari- ables' is a central unifying concept of both the field of statis- tics and empirical research. A key use of the concept is to as- sist researchers in accurate prediction and control. Thus I rec- ommend that the introductory statistics course for non-statistics- majors focus on the study of relationships between variables in empirical research as a means to accurate prediction and control. This focus is important whether the concept of 'relationship' is formally defined or is instead informally characterized in terms of practical examples. Don Macnaughton ------------------------------------------------------- Donald B. Macnaughton MatStat Research Consulting Inc donmac@matstat.com Toronto, Canada ------------------------------------------------------- APPENDIX A: IS THE CONCEPT OF A "RANDOM" VARIABLE NECESSARY IN THE DEFINITION OF "RELATIONSHIP BETWEEN VARIABLES"? (The order of the appendices in this essay respects logical de- pendencies among them, which makes the order different from the order in which they are first referenced above.) The definitions I discuss above by de Leeuw, Rubin, Freund and Walpole, and Hogg and Craig all refer to the concept of a "ran- dom" variable. On the other hand, definition [1], Frick's [6], and [7] omit the concept of 'random' and simply use the concept of 'variable'. Should the definition of "relationship between variables" appeal to the concept of 'random variable'? First, what is a random variable? I suggest that the following definition mirrors current statistical thinking: DEFINITION: A variable is a *random variable* if it has a probability distribution (density) implicitly or ex- plicitly associated with it. The definition is sometimes stated in other ways, such as "random variable: a measurable function on a probability space; ..." (Parker 1994). But in almost every case the same ideas are pre- sent, especially the key idea of 'probability distribution'. Lay people often view a random variable as varying "at random", with the concept of 'at random' connoting a vague sense of unpre- dictability. This view is consistent with the statistical defi- nition of "random variable". However, the statistical definition sharpens the focus by using the idea of the probability distribu- tion of the values. This enables us to go beyond vague "unpre- dictability" to being able to efficiently characterize various properties of the randomness or unpredictability. We character- ize these properties in terms of properties of the distribution -- specifically, in terms of the type of the distribution and in terms of the values of its parameters. A key use of the concept of 'random variable' is to concisely distinguish between the variables that are used in statistics (which generally have probability distributions) and the vari- ables that are used in general mathematics (which generally lack probability distributions). Thus in distinguishing statistics (especially the mathematical aspects of statistics) from general mathematics, the distinction between "random" variables and other variables is clearly important. 
On the other hand, the distinction between random variables and other variables is less important in empirical research. This is because virtually all variables studied in empirical research can be reasonably viewed as being random variables. This is because virtually all variables in empirical research are repeatedly measured and, in general, the values vary from one measurement to the next, which implies a probability distribution. (An arguable exception is that each manipulated [predictor] variable in a for- mal experiment is "fixed" at two or more different values by the researcher, and thus [in a sense] lacks a distribution. But the fixed values differ from each other and thereby [speaking perhaps loosely] have a distribution with probabilities, which satisfies the condition of the definition of "random variable".) Since virtually all variables in empirical research can be reasonably viewed as being "random" variables, the idea of 'random variable' does not make a useful distinction in empirical research. Per- haps for this reason, the idea of a "random" variable often fades into the background in empirical research, and is not mentioned. Instead, discussion usually proceeds in terms of the concept of 'variable' without the modifier "random". For the introductory statistics course for non-statistics majors, I recommend above that teachers emphasize the use of statistics in empirical research where the idea of a random variable often fades into the background and thus is less important. I also recommend above that teachers omit (or minimize) discussing the mathematical aspects of statistics where the idea of a random variable is important. These points lead me to recommend that teachers omit discussing the somewhat abstruse concept of 'random variable' in introductory statistics courses for non-statistics- majors. On the other hand, in courses for students who are majoring in statistics or mathematics, it is clear that the concept of 'ran- dom variable' is important, identifying a key distinction between mathematical statistics and general mathematics. (The above ideas raise the question of how we should define the statistical concept of 'variable'. I propose a definition of the concept in the 1999b paper [sec 3.2].) APPENDIX B: EQUIVALENCE OF DEFINITIONS [2], [3], [4], AND [5] In the body of this essay I say that definitions [2], [3], [4], and [5] of the concept of 'relationship between variables' are equivalent to each other. This appendix illustrates the equivalences. In saying that the four definitions are "equivalent" I mean that they will always theoretically agree as to whether a relationship exists between two compatible variables. To see the equivalence of [2], [3], and [4], imagine (in three- space) the surface defined by the bivariate probability (density) function of two compatible variables x and y and consider slices through this surface made parallel to the plane defined by the p- axis (probability [-density] axis) and the y-axis. The slices can be made at any permissible value for x in the range of the x- axis. Consider two exhaustive situations: A. the situation in which all such slices are similar to each other and B. the situation in which at least one slice is not similar to another slice. 
(Two slices [made parallel to the p-y plane] of the surface defined by the bivariate probability [density] function are "similar" to each other if for every value y' in the y-dimension the probability [density] function values at y' for the two slices are in a constant proportion to each other [allowing that both values of the function at y' may be zero].)

If situation A is satisfied for a given pair of compatible variables, one can easily see that [2], [3], and [4] all declare that no relationship is present. On the other hand, if situation B is satisfied, one can easily see that [2] and [3] declare that a relationship is present.

To show that situation B implies that [4] declares that a relationship is present, we must show that whenever situation B is satisfied a function f exists such that the inequality in [4] is satisfied. Such a function always (theoretically) exists because the probability (density) function of y can always serve as the function. That is, if situation B is satisfied, and if we use the probability (density) function as the function in [4], we can always find a point (x',y') -- a point of dissimilarity between the marginal and conditional probability (density) functions of y -- such that [4] is satisfied.

* * *

Simple algebra and the standard definition of conditional probability imply that [2] and [5] are equivalent.

Therefore, definitions [2], [3], [4], and [5] of the concept of 'relationship between variables' are theoretically equivalent to each other in terms of their pronouncements of whether a relationship exists between two compatible variables.

APPENDIX C: THE SYMMETRY OF DEFINITIONS OF "RELATIONSHIP BETWEEN VARIABLES"

In his 99/5/16 post Jan de Leeuw speaks of "a problem with symmetry" and asks whether we can "reverse the role of x and y in these definitions". By "these definitions" I think Jan means the following:

- the definition I propose in my 99/5/16 post, which appears above as [1]

- some derived forms of [1] in the 99/5/16 post

- the two informal definitions Herman Rubin proposes in his 98/8/3 post and

- the definition Jan proposes in his 99/5/16 post, which I quote above, and which is reflected in [2].

In referring to the "symmetry" of the definitions, I think Jan means the following:

     DEFINITION: A definition of the concept of 'relationship
     between two variables' is *symmetric* if and only if for
     all pairs of compatible variables x and y the definition
     declares that a relationship exists between the variables
     with y in the role of the response variable if and only if
     it also declares that a relationship exists between the
     variables with x in the role of the response variable.

In referring to a "problem" with symmetry, Jan may be suggesting that a reasonable definition of the concept of 'relationship between variables' ought to be symmetric. Jan may be noting that [2] is symmetric, but [1] is not, and he may be suggesting that therefore [1] is less reasonable.

In the following paragraphs I first discuss how the distribution definition is symmetric. Then I discuss how the expected-value definition is not symmetric. Then I discuss whether the lack of symmetry of the expected-value definition is a problem.

To see the symmetry of the distribution definition (i.e., definitions [2] through [5]), consider [5]. Clearly, [5] is symmetric. But, as discussed in appendix B, [2] through [5] are all equivalent. Therefore, the distribution definition is symmetric.
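As a small numerical illustration of this symmetry, the following Python fragment (the joint probabilities are invented, and the language is used only for convenience) checks definition [2] for a small discrete joint distribution with the roles of x and y exchanged:

    # Invented discrete joint distribution p(x, y); x and y are dependent.
    p_xy = {(0, 0): 0.4, (0, 1): 0.1,
            (1, 0): 0.1, (1, 1): 0.4}
    xs, ys = {0, 1}, {0, 1}

    p_x = {x: sum(p_xy[(x, y)] for y in ys) for x in xs}  # marginal of x
    p_y = {y: sum(p_xy[(x, y)] for x in xs) for y in ys}  # marginal of y

    # Definition [2] with y as the response: p(y'|x') ~= p(y') for some x', y'?
    y_on_x = any(abs(p_xy[(x, y)] / p_x[x] - p_y[y]) > 1e-12
                 for x in xs for y in ys)

    # Roles reversed: p(x'|y') ~= p(x') for some y', x'?
    x_on_y = any(abs(p_xy[(x, y)] / p_y[y] - p_x[x]) > 1e-12
                 for x in xs for y in ys)

    print(y_on_x, x_on_y)   # prints: True True

Both checks report a relationship; if the table is replaced by an independent one (for example, all four probabilities equal to 0.25), both checks report no relationship. The verdict is the same in either direction, which is the symmetry discussed above.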
A graphical way to see the symmetry of [2] and [3] is to consider the joint probability (density) function of two compatible variables x and y when there is and is not a relationship (as defined by any of [2] through [5]), and to note the concomitant absence or presence of similarities of slices made parallel to the p-y plane, as discussed in appendix B. If no relationship is present, [2] and [3] imply that all slices parallel to the p-y plane will be similar to each other. But this geometrically implies that all slices parallel to the p-x plane will also be similar to each other.

On the other hand, if a relationship is present, [2] and [3] imply that at least one slice parallel to the p-y plane will be dissimilar to another slice parallel to that plane. But this geometrically implies that at least one slice parallel to the p-x plane will also be dissimilar to another slice parallel to that plane. Thus the distribution definition is symmetric.

On the other hand, the expected-value definition is not symmetric, as illustrated by Jan's example. That is, suppose for some pair of compatible variables x and y the conditions of Jan's example are satisfied and therefore E(y|x) = E(y) for all x, but V(y|x') ~= V(y) for some x'. Furthermore, for ease of visualization, suppose that V(y|x) is an increasing function of x. If the expected-value definition is to be symmetric, it would have to be true that E(x|y) = E(x) for all y. But this is clearly not the case. For example, if a value y' of y is "far away" (in either direction) from the mean of the y's, E(x|y') is higher than if y' is closer to the mean. This disconfirming example shows that the expected-value definition is not generally symmetric.

However, although the expected-value definition is not generally (analytically) symmetric, this does not appear to be a problem when we are using statistics in empirical research. This is because in empirical research we are generally only interested in a one-way relationship. That is, we identify a response variable y whose values we would like to predict or control. Then we identify predictor variables x1, x2, ... that we will measure (along with y) in an empirical research project to see if we can find evidence of a relationship. If we find such (reasonable) evidence, we can use the information derived from the relationship to predict (or perhaps control) the value of y in new entities from the population on the basis of the values of the x's. But we are usually not interested in turning things around and predicting or controlling the values of the x's from the values of y. Thus in empirical research the question of whether the expected-value definition (or any other definition of "relationship between variables") is symmetric is usually irrelevant.

The lack of symmetry in the expected-value definition actually arises only in unusual cases. In most real cases in empirical research if the expected-value definition is satisfied by a pair of compatible variables x and y, we will find that the definition is also satisfied if the roles of x and y are reversed, although this fact may not be of much interest or importance.

Finally, the lack of symmetry in the expected-value definition, rather than being a shortcoming, can be viewed as an advantage.
This is because if we determine that a certain "relationship" be- tween variables is not "bi-directional", which we can do if we think in terms of the expected-value definition or [4], this is an additional piece of possibly interesting information about the relationship -- information that we cannot directly obtain if we use [2], [3], or [5]. (A relationship between variables x and y is bi-directional if the expected-value definition declares that the relationship ex- ists regardless of whether x or y is used in the role of response variable.) (We can use [4] to determine whether a given relationship between variables is bi-directional by noting the function f used in specifying the relationship -- the nature of this function may help us to determine whether the relationship is bi-directional.) APPENDIX D: VERIFICATION AND FALSIFICATION IN THE STUDY OF RELATIONSHIPS BETWEEN VARIABLES Consider the statement that a relationship exists between a re- sponse variable and a predictor variable. If we use the distri- bution definition of "relationship", this statement is logically equivalent to the statement that the variables are not independ- ent. This leads to the question of which of the following two equivalent approaches we should emphasize in technical discus- sions about empirical research: - verifying the existence of a relationship - falsifying the non-existence of the relationship (i.e., falsi- fying the independence of the variables). Popper (1959, 1983, 1989) emphasizes falsification although he emphasizes falsification of theories, as opposed to falsification of independence of variables. (I propose a list of components of a scientific theory in a Usenet post [1997c].) Work that falsifies a theory is clearly important -- an effective approach to advancing knowledge is to engage in an ongoing exer- cise to try to find (reasonable) falsifications of scientific theories. Most scientists continually follow this approach, es- pecially with new theories. We can falsify a theory in two ways: - We can falsify a theory by finding a reasonable alternative ex- planation of a leading research result that supports the the- ory. The alternative explanation implies that the result is equivocal, and therefore the support for the theory is dimin- ished (Mosteller 1990, Lipsey 1990). For example, Fleischmann, Pons, and Hawkins proposed the "cold fusion" theory (1989). This theory describes a hypothetical process the authors be- lieved could provide safe inexpensive energy. They reported research results that suggest (somewhat) that the theory is correct. A few other researchers also reported research re- sults that support the theory. But many other researchers were unable to replicate these results and were unable to obtain other results that should be obtainable if the theory is cor- rect. Thus reasonable alternative explanations of the original results were proposed, as detailed by Huizenga (1993) and Hoffman (1995). (The alternative explanations generally in- volve shortcomings in the design and conduct of the original research.) A few physical scientists continue to believe that the cold fusion theory is correct, but most now believe that some of the alternative explanations are correct. Thus most physical scientists believe that the cold fusion theory lacks unequivocal empirical support and thus lacks credence. (Of course, the theory could immediately be resurrected if new un- equivocal and replicable empirical results supporting the the- ory are brought forward.) 
- We can falsify a theory by finding a new (reliable) empirical research result that is inconsistent with the theory. For example, Michelson and Morley carefully measured the relative speed of light when it was traveling in different directions and found (within an estimable amount of measurement error) that the speed was (apparently) the same in all the (measured) directions (1887). That is, they found no evidence of a relationship between "direction of light travel" and "speed of light". This tended to falsify the then-popular theory of the luminiferous ether, which had predicted that the speed would vary in different directions to a degree substantially greater than the measurement error in the experiment.

Consider the totality of empirical research projects, including those that serve to falsify theories. As noted above, examination of such projects suggests that most of them can be reasonably viewed as studying the existence (and other aspects) of a relationship between a single response variable and one or more predictor variables.

As also noted, we can turn things around and say that these research projects are attempting to falsify the statement that the variables are independent. However, this is effectively a double negative (i.e., falsifying the statement that no relationship exists) and seems harder to understand than the simple logically equivalent idea of verifying the statement that the relationship exists.

Thus despite Popper's emphasis on falsification, in technical discussions about empirical research it is reasonable to emphasize the concept of verifying the existence of a relationship between variables (and further study of the relationship) instead of emphasizing the logically equivalent but harder-to-understand concept of falsifying the independence of the variables.

APPENDIX E: A CASE WHEN RESEARCHERS DO DISCUSS INDEPENDENCE OF VARIABLES

The discussion of [5] contrasts the definitions of "relationship between variables" with the definition of "independence of variables" and says that empirical researchers rarely attempt to support the claim that two variables are independent. That is, researchers rarely attempt to support the claim that no relationship exists between two given compatible variables. An informative exception to this rule occurs with the so-called "conservation" laws in the physical sciences.

A conservation law states that within a "closed system" a certain variable is constant. That is, no relationship exists between the variable and any other variable. For example, the law of conservation of energy states that the total amount of energy in a closed system is constant. In other words, there is no relationship between the total amount of energy in such a system and any other variable (including the variable "time").

Other conservation laws (all defined by Parker 1994, p. 441) are about conservation of

- angular momentum (in mechanics)
- areas (in orbital mechanics)
- charge (in the study of electricity)
- condensation (in fluid mechanics)
- mass (which is similar to "weight", and is a property of physical objects)
- matter (which is the "material" that physical objects are made of)
- momentum (in mechanics)
- orbital symmetry (in organic chemistry)
- parity (in quantum mechanics)
- probability (in quantum mechanics)
- vorticity (in fluid mechanics).

The conservation laws reflect the efforts of empirical researchers to find relationships between the variables named in the laws and other variables.
These efforts have repeatedly failed to find any such relationships in closed systems. The various fail- ures to find relationships (that is, the "negative results") are enshrined in the conservation laws. Although the conservation laws have substantial empirical sup- port, the principle of parsimony (as discussed above) recommends that we assume that no relationship exists between any pair of compatible variables until unequivocal evidence of such a rela- tionship is brought forward. Thus the conservation laws may seem less important because they only state what the principle of par- simony recommends we assume. Thus perhaps the conservation laws need no empirical support, and thus perhaps the time and effort spent discovering them was wasted. However, the researchers who discovered the conservation laws usually did not set out to discover them. Instead, they set out to discover a relationship between the variable named in a con- servation law and some other variable. But they failed because (apparently) no such relationships exist. Although the conservation laws merely state what the principle of parsimony recommends we assume, they serve three important pur- poses: - The conservation laws state that certain (apparent) constancies have been repeatedly found in empirical research. Knowledge of these constancies is generally a crucial element in building models of how the particular area of interest works. - The conservation laws warn researchers that any search (in a closed system) for a relationship between a variable named in a conservation law and another variable will likely be unsuccess- ful. - The conservation laws are monuments to the nameless researchers who diligently tried to refute them, but failed. (One rarely finds fame in a negative result, although Michelson and Morley did [1887].) Because the conservation laws reflect negative results, they are not necessarily the last word on the situations they describe. In particular, it is possible that researchers will someday dis- cover that some of the conservation laws are untrue, and that a relationship does exist in a closed system between a variable named in a conservation law and one or more other variables. (In general, any such relationships will be weak or hard-to-observe -- otherwise they would have already been discovered.) For exam- ple, the law of conservation of matter has already been shown to be untrue in a certain hard-to-observe situation -- the situation in which matter is converted to energy through a nuclear reac- tion. APPENDIX F: EQUIVALENCE OF DEFINITIONS [1], [6], AND [7] In the body of this essay I say that definitions [1], [6], and [7] of the concept of 'relationship between variables' are equivalent to each other. This appendix illustrates the equivalences. As before, in saying that the three definitions are "equivalent" I mean that they will always theoretically agree as to whether a relationship exists between two compatible variables. Here is an argument to show the equivalence of [1] and [6]: First, suppose that definition [1] declares that a relationship exists between two compatible variables x and y. Then [1] im- plies that there exists a value of x, say x', such that E(y|x') ~= E(y). Then one of two cases must occur: 1. There is a value x" of x different from x' such that E(y|x") = E(y). Thus E(y|x") ~= E(y|x'). Thus [6] declares that a re- lationship exists between x and y. 2. There is no value x" such that E(y|x") = E(y). 
This case may occur because the mean of the values of y is not a permissible value. But in this case there must exist a value x" that is a permissible value of x such that E(y|x") ~= E(y|x'). This value is necessary to make E(y) different from E(y|x') as given by [1]. But in this case [6] declares that a relation- ship exists between x and y. Q.E.D. For the converse, suppose that definition [6] declares that a re- lationship exists between two compatible variables x and y. Then [6] implies that values exist of x, say x' and x", such that E(y|x') ~= E(y|x"). Then E(y|x) ~= E(y) for some x (i.e., either x' or x"). But in this case definition [1] declares that a rela- tionship exists between x and y. Q.E.D. Similarly, here is an argument to show the equivalence of [6] and [7]: First, suppose that definition [6] declares that a relationship exists between two compatible variables x and y. Then [6] im- plies that values x' and x" of x exist such that E(y|x') ~= E(y|x"). (a) Define a function g(x) such that g estimates E(y|x). In view of (a), if g is sufficiently accurately defined, g(x') ~= g(x"). Thus g (if sufficiently accurately defined) is a non-constant function that estimates the value of y from a given value of x. (Because this argument is attempting to show the theoretical equivalence of [6] and [7], it can require definition of the function g and require that g be sufficiently accurately defined. In a specific empirical situation the operations associated with properly defining g are carried out by collecting sufficient ap- propriate data and performing a reasonable analysis of the data to derive g.) Consider a specific evaluation of g(x) in an empirical situation. In this situation the associated error term e has a specific nu- meric value (called the "residual") that is equal to the differ- ence between the measured value of y and g(x). Assuming we have sufficient access to data, we can use appropriate residual analy- sis to refine the function g to the point that the residuals are statistically independent of the values of x. Thus we can sat- isfy the condition of [7] regarding the error term. But in this case definition [7] declares that a relationship exists between x and y. (Actually, the condition in the preceding paragraph need not be satisfied for us to believe that a relationship exists under definition [7]. And if a non-constant function g can do better than chance at predicting the value of y from a value of x, most statisticians will agree that a relationship exists regardless of whether the residuals are independent of x. However, it is use- ful to emphasize the requirement that the residuals be independ- ent of x because [as discussed by Draper and Smith 1998, chap. 2] if we refine g to make the error term properly independent of the predictor variable(s), this leads us to the form of g that gives the most accurate predictions.) Q.E.D. For the converse, suppose that definition [7] declares that a re- lationship exists between two compatible variables x and y. Then [7] implies that a non-constant function g(x) is defined such that g estimates E(y|x). Since g is non-constant, there exist two distinct values of x, say x' and x", such that g(x') ~= g(x"). Then, assuming g is sufficiently accurately defined, E(y|x') ~= E(y|x"). But in this case definition [6] declares that a relationship exists between x and y. Q.E.D. 
Therefore, definitions [1], [6], and [7] of the concept of 'relationship between variables' are theoretically equivalent to each other in terms of their pronouncements of whether a relationship exists between two compatible variables.

APPENDIX G: REWORDING [7] TO BE EQUIVALENT TO THE DISTRIBUTION DEFINITION

As discussed in appendix F, definition [7] (which defines "relationship between variables" in terms of the concept of 'mathematical function') is equivalent to [1] and [6]. What is the relationship between [7] and the distribution definition (i.e., [2] through [5])?

Definition [7] is not equivalent to the distribution definition, as can be seen by studying Jan de Leeuw's variance example: It is not possible to express Jan's variance relationship in the form y = g(x) + e if g is restricted to being a standard mathematical function (i.e., a one-to-one or many-to-one mapping) and if e is restricted to being independent of x. (Using Jan's specific example, we can express the key relationship as V(y|x) = g(x) + e, but the relationship under consideration is between x and y, not between x and V(y|x).)

Although [7] is not equivalent to the distribution definition, we can reword it to be equivalent as follows:

     DEFINITION: There is a *relationship* between the variables
     x and y if and only if the value of y can be expressed as a
     "mapping" of x, symbolized by m(x). At least one of the
     moments of m(x) must be a non-constant mathematical function
     of x. An error term e may be included with the mapping,
     where e is usually viewed as being independent of x. This is
     stated algebraically as

          y = m(x) + e.                                       [8]

Note that m(x) represents a mapping (possibly one-to-many). This broadens the conditions of the definition to allow relationships in which E(y|x) (the first moment of y given x) is constant for all values of x, but one or more of the higher moments of y depend on x.

(Since [8] refers to the moments of the distribution of the values of y, it is valid only if the relevant moments of y exist. Relevant moments of the values of the response variable can usually be found to exist in empirical research although focus is usually on the first moment [reflecting the expected value or average], occasionally on the second [reflecting the spread], and only rarely on the third [reflecting asymmetry] or fourth [reflecting peakedness]. Empirical researchers virtually never study moments greater than the fourth. This may be because the "normal" nature of most data makes phenomena in these moments [and in the third and fourth moments] unlikely to exist. Furthermore, even in cases in which such phenomena may exist, they are not readily interpretable, nor obviously useful, although exceptions are conceivable.)

Consider the case in which the distribution definition correctly declares that a relationship is present between two compatible variables x and y, but the expected-value definition can directly detect no evidence of the relationship. For example, consider again Jan's variance example. In this case the mapping m(x) required by the definition will be of an unusual form because it will include a random element (or some other form of "memory"). The random element is necessary to provide the one-to-many mapping that makes the variance of y dependent on x even though E(y) is independent of x. Such unusual forms are not often found in empirical research.
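A small simulation may make this concrete. The following Python sketch uses an invented stand-in for Jan's example (y = x * z with z a standard normal "random element", so that E(y|x) is constant but V(y|x) equals x squared); a comparison of conditional means shows nothing, while a comparison of the conditional distributions (here, their variances) shows the relationship:

    import random
    random.seed(1)

    # Invented stand-in for Jan's variance example:
    # E(y|x) = 0 for every x, but V(y|x) = x**2.
    def draw_y(x):
        return x * random.gauss(0.0, 1.0)

    def mean(v):
        return sum(v) / len(v)

    def var(v):
        m = mean(v)
        return sum((u - m) ** 2 for u in v) / (len(v) - 1)

    y_at_1 = [draw_y(1.0) for _ in range(100000)]
    y_at_3 = [draw_y(3.0) for _ in range(100000)]

    # Expected-value definition: the conditional means are both near 0.
    print(round(mean(y_at_1), 2), round(mean(y_at_3), 2))

    # Distribution definition: the conditional distributions differ,
    # visible here as conditional variances near 1 and 9.
    print(round(var(y_at_1), 2), round(var(y_at_3), 2))

The first comparison directly detects no evidence of a relationship, while the second does, which is the behavior described in the preceding paragraphs.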
(Another approach to making [7] equivalent to the distribution definition is to allow the values [or moments] of the error term e to be functions of the predictor variable x. However, this ap- proach has the undesirable feature that it violates the [not sac- rosanct] standard principle that the error term is independent of the values of the predictor variable[s].) APPENDIX H: GENERALIZATION OF [1] THROUGH [8] Definitions [1] through [8] are all definitions of a relationship between a single response variable and a single predictor vari- able. How should we generalize these definitions to the case with multiple response variables or multiple predictor variables? Consider first the case of multiple response variables. Most research projects with multiple response variables can be assigned to one of three categories: (a) research projects that use "repeated measurements" (also called "repeated measures") in which we have a single re- sponse variable that is measured repeatedly (i.e., more than once) in each research entity or (b) research projects in which each response variable can be rea- sonably viewed as defining a logically separate research pro- ject having only a single response variable (but sharing the research entities and predictor variables with the other re- sponse variables) or (c) research projects in which the multiple response variables can be viewed as a single (vector-valued) variable, even though they are not the same variable. Some medical, social, and biological research falls into category (a) in which the procedure of repeated measurements enables the researcher to "reuse" the research entities (typically organisms, perhaps humans), which can provide the dual benefits of decreased research costs and increased power of the statistical tests (but at a cost of increased complexity of concepts and assumptions). Surveys often fall into category (b) -- a survey may ask respon- dents several questions that each define a separate response variable. For example, in surveying customers leaving a store we may ask about the quality of the service, whether the location of the store is convenient, whether the price ranges for merchandise are appropriate, and so on. Each of these questions defines a different response variable. Each response variable is typically analyzed separately to see if it depends on any of the shared predictor variables, such as the customer's age, gender, purchas- ing behavior, occupation, clerk used, and so on. In category (c) note that the response variable in a practical empirical research project is invariably viewed as a unity -- as a single response variable (which may on rare occasion be a vec- tor of different variables). This is reasonable because if the response "variable" is just a random conglomeration of (compati- ble) variables, there would appear to be no obvious empirical sense in using this conglomeration as the response "variable" in an analysis. Thus cases with multiple response variables can usually be rea- sonably viewed (in one of the above three senses) as having only a single response variable. In categories (a) and (b) no definition of "relationship between variables" with multiple response variables is needed because the cases degenerate into the case with only a single response vari- able. However, in category (c) we do need such a definition for completeness. However, in view of the rareness of this case, that definition is beyond the scope of this essay. 
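As an aside, the category (b) pattern of one logically separate analysis per response variable might be sketched roughly as follows in Python; the survey columns, the data values, and the use of ordinary least squares are all invented for illustration and are not meant to prescribe a particular analysis:

    # Rough sketch of the category (b) pattern: each survey question is its
    # own response variable, analyzed separately against shared predictors.
    import pandas as pd
    import statsmodels.formula.api as smf

    survey = pd.DataFrame({
        "service_quality":     [4, 5, 3, 2, 5, 4, 3, 4],
        "location_convenient": [1, 1, 0, 0, 1, 1, 0, 1],
        "prices_appropriate":  [3, 4, 2, 2, 5, 4, 3, 3],
        "age":    [34, 51, 27, 19, 63, 45, 30, 38],
        "visits": [ 2, 10,  1,  1, 12,  6,  3,  4],
    })

    responses = ["service_quality", "location_convenient", "prices_appropriate"]

    for y in responses:
        # One logically separate analysis per response variable,
        # sharing the predictor variables age and visits.
        fit = smf.ols(f"{y} ~ age + visits", data=survey).fit()
        print(y, fit.pvalues.round(3).to_dict())

Each pass through the loop is, in effect, a separate single-response research question that happens to share the respondents and the predictor variables with the others.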
* * * Let us now consider the important case with multiple predictor variables. How should we define the concept of 'relationship be- tween variables' in this case? First, it is reasonable to ask whether we need a definition of the relationship between a response variable and multiple predic- tor variables or whether we might get by more parsimoniously by applying the simple definition multiple times. That is, why not say a relationship exists between a response variable and multi- ple predictor variables if a separate simple relationship exists between the response variable and each of the predictor vari- ables? Here, a simple relationship is defined by the version of the definition in which only a single predictor variable is pre- sent (e.g., one of definitions [1] through [8] above). However, defining the concept of 'relationship between variables' with multiple predictor variables in terms of multiple simple re- lationships does not allow us to detect or study all possible forms that a relationship might have. In particular, using sim- ple relationships does not allow us to fully study "interactions" between the predictor variables with respect to their joint rela- tionship to the response variable. Interactions were discovered (invented, devised) by Fisher (1935, chap. VI) as a key part of his invention of analysis of variance. Interactions provide a method of detecting all the possible (strong enough) relation- ships between the response variable and the predictor variables in a research project -- certain interactive relationships cannot generally be detected or observed if we define relationships be- tween variables solely in terms of one of the simple definitions. I give a definition of the concept of 'interaction' in a paper (1998c, sec. 6.10). * * * Let us consider generalizations of [1] through [8] that enable us to take account of interactions. For maximum efficiency, we would like each generalization to state a minimal set of condi- tions for a relationship to exist. The following generalizations provide reasonable minimal conditions: I give a generalization of [1] to the case with an arbitrary num- ber of predictor variables in the paper for students (1997b, sec. 7.10). That generalization can be used as a template for the generalizations I propose for [2] through [4]. For example, here is the resulting generalization of [2] to the case with two predictor variables: DEFINITION: There is a *relationship* between the pre- dictor variables x1 and x2 and the response variable y if and only if p(y'|x1',x2') ~= p(y'|x1') and p(y"|x1",x2") ~= p(y"|x2") for some y' and y" and for some permissible pair (x1',x2') and some permissible pair (x1",x2") of values of x1 and x2 where p(y'|*) = the probability (density) that the variable y has the value y' given that the variable(s) appearing at * is (are) at its (their) indi- cated value(s). (The values of y' and y" in the definition need not be distinct, nor the values of x1' and x1", nor x2' and x2".) Generalization of [3] and [4] is the same as for [1] and [2]. The general pattern is to state each inequality with the function on the left having all the predictor variables as arguments. The function on the right is a reduced function with one of the pre- dictor variables omitted. For k predictor variables k inequali- ties are stated, each with a different one of the predictor vari- ables omitted from the function on the right. 
The "function" is either the expected-value function (for [1] and [4]) or the prob- ability (density) function (for [2] and [3]). Generalization of [5] can be performed by converting the gener- alization of [2], which involves conditional probabilities, into expressions in unconditional probabilities using the definition of conditional probability. The converted generalization of [2] yields a generalization of [5]. This approach to generalizing [5] yields expressions that are non-intuitive and is thus (like [5]) of less direct interest in standard empirical research. Generalization of [6] is slightly different from the generaliza- tion of [1], but straightforward. Generalization of [7] to the case with k predictor variables can be performed by adding the requisite arguments for the additional predictor variables to the argument list of g in the equation, and by requiring that for each predictor variable xi the first- order partial derivative of g(x1, ..., xk) with respect to xi be non-zero for at least one set of values of the predictor vari- ables. (A different set of values of the predictor variables may be used for each partial derivative.) The approach in the preceding paragraph requires that g be dif- ferentiable as indicated. The definition can be generalized to cases when g is not differentiable by speaking in terms of par- ticular differences between the g(x1, ..., xk) for different given values of the x's -- the discrete analogue of partial dif- ferentiation. (Fisher invented analysis of variance to systematize the study of these differences in the general case when the response variable is continuous and the predictor variables are discrete [1935]. I believe that analysis of variance is the most important of Fisher's many important contributions because it theoretically underpins almost all modern scientific experimentation.) Generalization of [8] to the case with k predictor variables can be performed by adding the requisite arguments for the additional predictor variables to the argument list of m in the equation, and by requiring that for each predictor variable xi the first- order partial derivative of at least one of the moments of m(x1, ..., xk) with respect to xi be non-zero for at least one set of values of the predictor variables. (A different moment and a different set of values of the predictor variables may be used for each partial derivative.) The generalization of [8] can be generalized to cases when m is not differentiable using the approach described above for the generalization of [7]. Appendix F discusses how definitions [1], [6], and [7] are equivalent, and appendix B discusses how definitions [2] through [5] are equivalent. These equivalences raise the question of whether the generalizations of these definitions are also respec- tively equivalent. Although I suspect that the generalizations discussed above are respectively equivalent, I have not studied this question in detail. Another approach to a general definition of the concept of 'rela- tionship between variables' is to attempt a negation of the gen- eral definition of "independence of variables". This approach was used above in the two-variable case to define [5]. However, as noted in the discussion of [5], this approach does not distin- guish between the response and predictor variables. This makes it difficult to distinguish between the dependence or independ- ence of the predictor variables with each other from the depend- ence or independence of the predictor variables with the response variable. 
But the dependence or independence of the predictor variables with each other is irrelevant to defining a relationship between a response variable and a set of predictor variables. Thus it seems more difficult to obtain a minimal definition of the standard general concept of 'relationship between variables' if we work from the negation of the general definition of "independence of variables".

Other general definitions of "relationship between variables" may be possible. If readers see other useful general definitions, I hope they will present them to this debate.

APPENDIX I: THE IMPORTANCE OF THE CONCEPT OF 'INDEPENDENCE OF VARIABLES'

In the body of this essay I say that in empirical research it is reasonable to think in terms of the concept of 'relationship between variables' instead of thinking in terms of the concept of 'independence' of variables. On the other hand, in mathematical statistics, it is often easier to think in terms of the concept of 'independence'. One can see this by studying the generalizations of the definition of "relationship between variables" to the case in which more than one predictor variable is present. These generalizations can be compared to the generalization of the definition of "independence".

Note how the generalizations of [1] through [8] (as discussed in appendix H) are complicated. This is because if p predictor variables are present, the generalizations require that the researcher or student verify that p different conditions are satisfied (instead of only the single condition that must be satisfied in [1] through [8] if only one predictor variable is present).

On the other hand, the generalization of Freund and Walpole's definition of "independence" (which was negated to yield [5]) is (as given by Freund and Walpole) simple: The generalization is performed by merely adding more arguments to the left side and more concomitant marginal factors to the right side of the single equation. We also add concomitant terms to the statement of the list of universally quantified variables.

The ease of generalization of the definition of "independence" is a strong reason why the concept of 'independence' is often preferred to its negation (i.e., the concept of 'relationship between variables') in general discussion in mathematical statistics.

APPENDIX J: SHOULD THE INTRODUCTORY STATISTICS COURSE TEACH THE BAYESIAN APPROACH?

If valid and reliable prior information is available in an empirical research project, the Bayesian approach is generally the best approach to use to analyze the data from the project. This is because (a) the Bayesian approach is generally equal (through symmetry) to other approaches in its ability to take account of the general needs of research projects, and (b) the Bayesian approach takes much better account of the prior information. Unfortunately, however, valid and reliable prior information is unavailable in many empirical research projects because the projects are exploring new ground.

The Bayesian approach is generally more complicated than the non-Bayesian approach. This can be seen by noting that a Bayesian procedure can usually (always?) be reasonably viewed as an enhancement of some non-Bayesian procedure. The enhancement involves extending the non-Bayesian procedure to take proper account of available prior information.
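As a minimal sketch of this "enhancement" view (assuming, purely for illustration, a normal mean with known sampling variance and a normal prior, with all numbers invented), the Bayesian estimate is just the non-Bayesian sample mean pulled toward the prior mean by an amount that reflects the relative precisions:

    # Minimal sketch: estimating a normal mean with known sampling variance.
    # The data, the prior mean, and the prior variance are all invented.
    data = [10.2, 9.8, 10.5, 10.1, 9.9]
    sampling_var = 0.25                 # assumed known variance of one observation

    n = len(data)
    sample_mean = sum(data) / n         # the non-Bayesian estimate

    prior_mean, prior_var = 9.0, 1.0    # the available prior information

    # Conjugate normal updating: a precision-weighted average of prior and data.
    post_precision = 1.0 / prior_var + n / sampling_var
    post_mean = (prior_mean / prior_var
                 + n * sample_mean / sampling_var) / post_precision

    print(sample_mean)                  # about 10.10 (data alone)
    print(post_mean)                    # about 10.05 (refined by the prior)

As the prior variance grows (that is, as the prior information becomes vaguer), the second estimate approaches the first, which is one way to see the Bayesian procedure as an extension of the non-Bayesian one.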
Because valid and reliable prior information is unavailable in many empirical research projects, and in view of my recommenda- tion above that the introductory statistics course for non- statistics-majors focus on the use of statistics in empirical re- search, and because the Bayesian approach is more complicated, I recommend against emphasizing the Bayesian approach in an intro- ductory statistics course for non-statistics-majors. However, it is helpful to briefly show introductory students the high-level concepts of the approach. This is because if valid and reliable prior information is available, the Bayesian approach has the po- tential to significantly enhance empirical research by refining conclusions and reducing costs. Albert (1997), Berry (1997), and Moore (1997b) debate teaching the Bayesian approach in an introductory statistics course. APPENDIX K: DO ANALYSIS OF VARIANCE F-TESTS TEST VARIANCES? In the body of this essay I say that statistical tests for de- tecting relationships between variables are usually performed in terms of linear functions of the values of the response variable and not in terms of higher moments of the values, such as vari- ances. Consider a possible counterexample: The standard F-tests in analysis of variance are generally used to detect relationships between the response variable and the predictor variable(s) in a designed experiment. A standard analysis of variance F-test is performed by computing the ratio of two variance estimates, called the "F-ratio" (in Fisher's honor). The variance estimate in the denominator of the F-ratio is "unadulterated", but the variance estimate in the numerator is (if the associated relationship between variables exists in the population) "adulterated" and "inflated" by the effects of the predictor variable(s) involved in the relationship. Thus these tests are testing variances. Thus perhaps this fact is a coun- terexample to my claim that most tests for the existence of a re- lationship between variables are done in terms of linear func- tions of the values of the response variable. In considering this issue, it is useful to distinguish between the mathematics of the analysis of variance F-test and what is functionally done. Mathematically, for each hypothesis test in standard analysis of variance it is clear that a ratio of two variance estimates is tested with the F-test to determine whether the numerator variance is significantly greater than the denomi- nator variance. (Confusingly, although only one tail of the central F- distribution is used in computing the p-value for this test, the test is still a "two-tail" test in the sense that in the one-way case with two levels the p-value is identical to the two-tail p- value from the associated t-test, and different from (i.e., twice) the associated one-tail p-value from the t-test. [Only one tail of the F-distribution is used because F is t-squared, and thus the upper and lower tails of the t-distribution are folded into the upper tail of the F-distribution.]) Mathematically analysis of variance performs tests of variances. But functionally analysis of variance does something quite dif- ferent, as one can see by studying expressions for the statisti- cal hypotheses being tested in analysis of variance. I discuss these expressions in a paper in terms of the cell-means and over- parameterized model equations (1998c, sec. 14 & 15). 
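A quick numerical check of the F-equals-t-squared point noted above is easy to run; the two small groups of values below are invented, and the scipy functions are used only for convenience:

    # Check that in the one-way, two-level case the ANOVA F-test and the
    # pooled two-tail t-test agree.  The data are invented.
    from scipy import stats

    group_a = [5.1, 4.8, 5.6, 5.0, 4.9]
    group_b = [5.9, 6.1, 5.4, 6.0, 5.7]

    f_res = stats.f_oneway(group_a, group_b)
    t_res = stats.ttest_ind(group_a, group_b)   # equal-variance t-test

    print(round(f_res.statistic, 4), round(t_res.statistic ** 2, 4))  # F = t squared
    print(round(f_res.pvalue, 6), round(t_res.pvalue, 6))             # identical p-values

The first line prints two equal numbers and the second prints two identical p-values, which is the sense in which the one-way, two-level F-test is functionally the two-tail t-test.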
In standard analysis of variance the hypotheses being tested can always be stated in terms of one or more linear functions (more specifically, contrasts) of the values of the response variable. I illustrate this fact in a computer program that shows that the numerator sums of squares in standard balanced and standard unbalanced analysis of variance (with no empty cells) can always be viewed as being computed in terms of a set of contrasts (held in the "projection matrix") that are applied to the vector of values of the response variable (1998d, line 1047).

Thus although the mathematics of the standard F-test in analysis of variance is conveniently understood in terms of tests of variances, it is reasonable to view the detailed hypotheses being tested in terms of linear functions (contrasts) of the values of the response variable.

I learned the contrast approach to analysis of variance from Daniel DeLury. Dr. DeLury's simple but moving introduction to the beauty of analysis of variance was the highlight of my formal education.

REFERENCES

Albert, J. 1997. "Teaching Bayes' rule: A data-oriented approach," The American Statistician, 51, 247-253. (Part of a larger discussion pp. 241-274.)

American Statistical Association. 2002. "Curriculum guidelines for undergraduate programs in statistical science." Available at http://www.amstat.org/education/Curriculum_Guidelines.html

Berry, D. A. 1997. "Teaching elementary Bayesian statistics with real applications in science," The American Statistician, 51, 241-246. (Part of a larger discussion pp. 241-274.)

Draper, N. R. and Smith, H. 1998. Applied regression analysis. 3rd ed. New York: John Wiley.

Fisher, R. A. 1935. The design of experiments. Edinburgh: Oliver and Boyd. The 8th edition of this seminal work appears in Fisher (1990).

---- 1990. Statistical methods, experimental design, and scientific inference ed. by J. H. Bennett. Oxford: Oxford University Press.

Fleischmann, M., Pons, S., and Hawkins, M. 1989. "Electrochemically induced nuclear fusion of deuterium," Journal of Electroanalytical Chemistry, 261, 301-308. Errata in 263, 187-188 (1989).

Freund, J. E. and Walpole, R. E. 1987. Mathematical statistics. 4th ed. Englewood Cliffs, NJ: Prentice-Hall.

Hoffman, N. 1995. A dialogue on chemically induced nuclear effects: A guide for the perplexed about cold fusion. La Grange Park, Illinois: American Nuclear Society.

Hogg, R. V. and Craig, A. T. 1995. Introduction to mathematical statistics. 5th ed. Englewood Cliffs, NJ: Prentice Hall.

Huizenga, J. R. 1993. Cold fusion: The scientific fiasco of the century (revised and updated). Oxford: Oxford University Press.

Lipsey, M. W. 1990. "Theory as method: Small theories of treatments," in Research Methodology: Strengthening Causal Interpretations of Nonexperimental Data, ed. L. Sechrest, E. Perrin, and J. Bunker, Rockville MD: U.S. Department of Health and Human Services, pp. 33-51.

Macnaughton, D. B. 1997a. "Re: How should we motivate students in intro stat? (Response to comments by John R. Vokey)." Posted to EdStat and sci.stat.edu on April 6, 1997. Available at http://www.matstat.com/teach/p0024.htm

---- 1997b. "The entity-property-relationship approach to statistics: An introduction for students." Available at http://www.matstat.com/teach/

---- 1997c. 'EPR approach and scientific "explanation." (Response to comments by Robert Frick).' Posted to EdStat and sci.stat.edu on July 23, 1997. Available at http://www.matstat.com/teach/p0026.htm

---- 1998a. "Re: Eight features of an ideal introductory statistics course. (Response to comments by Gary Smith)." Posted to EdStat and sci.stat.edu on November 23, 1998. Available at http://www.matstat.com/teach/p0036.htm

---- 1998b. "Eight features of an ideal introductory statistics course." Available at http://www.matstat.com/teach/

---- 1998c. "Which sums of squares are best in unbalanced analysis of variance?" Available at http://www.matstat.com/ss/

---- 1998d. "Computing numerator sums of squares in unbalanced analysis of variance: Two-way case (computer output)." Available at http://www.matstat.com/ss/pr0139.htm

---- 1999a. "Response to comments by Herman Rubin." Posted to EdStat and sci.stat.edu on May 16, 1999. Available at http://www.matstat.com/teach/p0041.htm

---- 1999b. "The introductory statistics course: The entity-property-relationship approach." Available at http://www.matstat.com/teach/

Michelson, A. A. and Morley, E. W. 1887. "On the relative motion of the earth and the luminiferous ether," American Journal of Science, 3rd Series, 34, 333-345.

Moore, D. S. 1997a. "New pedagogy and new content: The case of statistics" (with discussion), International Statistical Review, 65, 123-165.

---- 1997b. "Bayes for beginners? Some reasons to hesitate," The American Statistician, 51, 254-261. (Part of a larger discussion pp. 241-274.)

Mosteller, F. 1990. "Improving research methodology: An overview," in Research Methodology: Strengthening Causal Interpretations of Nonexperimental Data, ed. L. Sechrest, E. Perrin, and J. Bunker, Rockville MD: U.S. Department of Health and Human Services, pp. 221-230.

Parker, S. B. ed. 1994. McGraw-Hill dictionary of scientific and technical terms. 5th ed. New York: McGraw-Hill.

Popper, K. R. 1959. The logic of scientific discovery. Reprint 1992, London: Routledge.

---- 1983. Realism and the aim of science. Reprint 1992, London: Routledge.

---- 1989. Conjectures and refutations. 5th ed. Reprint 1992, London: Routledge.