Subject: Re: Eight Features of an Ideal Intro Stat Course
         (Response to comments by Jan de Leeuw)
To:      EdStat-L and sci.stat.edu
From:    Donald B. Macnaughton <donmac@matstat.com>
Date:    Sunday October 31, 1999
Cc:      Jan de Leeuw <deleeuw@stat.ucla.edu>

Quoting a 99/5/2 post of mine, Jan de Leeuw writes (on 99/5/6)

> This is a very interesting discussion and Don's position is, as
> always, well-argued and convincing. But
>
> -- "the external world" is ill-defined. How about introspection ?
> How about flying saucers ? Or, to put it differently,
> what ARE data about a non-external world ?

These are excellent questions. I begin with a definition:

   The EXTERNAL WORLD is what is "out there" -- what we see when we look out the windows in our heads and what we sense through our other senses.

I use the concept of 'external world' as a way of indicating the broadness of the applications of statistics -- we can use the field of statistics almost anywhere "external" to ourselves. (To use statistics in some area we need only find reasonable ways to measure the relevant variables in the area.)

However, until I received Jan's post I believed that we could not obtain reliable measurements of variables in the INTERNAL world because that world is often (as a fact of human nature) ruled by prejudices -- it is subjective. Thus I concluded that we could NOT use the field of statistics in the world of our inner experience.

However, after considering Jan's question about introspection, I now think my conclusion may be incorrect. It is indeed an EMPIRICAL question whether the internal world of one's own personal experience can also be subjected to statistical analysis, just like the external world. I will be pleased if researchers show that the field of statistics is also applicable in the internal world of one's private experience.

* * *

In asking about flying saucers, I think Jan is wondering about things that do not "really" exist. That is, I think he probably believes (as I believe) that flying saucers (vehicles used by extraterrestrials visiting the earth) do not really exist (except in people's imaginations).
If Jan is asking how such things relate to the external world I refer to, my answer is that this is a matter of choosing a point of view. I take the point of view that things (entities) that do not exist (like flying saucers) are still a part of the external world, because people talk about such things. The fact that these things are (or may be) imaginary need not somehow remove them (in their imaginary existence) from being a part of external reality.

* * *

In my 99/5/2 post, my main reference to the concept of 'external world' was in the definition I proposed of "empirical research", which I gave as follows:

   EMPIRICAL RESEARCH is any research in which data are gathered from the external world and then conclusions are drawn from the data about the external world.

Jan's questions have shown me that the concept of 'external world' is an unnecessary restriction in the definition. Thus I now propose the following new definition:

   EMPIRICAL RESEARCH is any research in which data are gathered from some area of experience and then conclusions are drawn from the data about the area of experience.

The new definition encompasses both the external and internal worlds and is thus broader than the original definition. I thank Jan for motivating the new definition.

* * *

Note that the new definition does not explicitly refer to the concept of 'data analysis'. However, this concept is certainly implicit in the definition because the conclusions we draw from data in empirical research can only (sensibly) be drawn on the basis of some form of examination or "analysis" of the data. Diagrammatically

   empirical research = data gathering, which is followed by data analysis, which is followed by conclusion drawing.

I omit a direct reference to the concept of 'data analysis' in the definition because I wish to emphasize the goal of empirical research, which is not data analysis, but is instead (correct) conclusion drawing.
(I discuss three types of conclusion drawing in the application of the field of statistics in appendix A.)

> -- I suggest "Statistics develops and evaluates techniques for
> data analysis".

Although Jan does not say so explicitly, I believe he is proposing a definition of the role of the field of statistics. Jan's definition is worth setting apart:

   Statistics develops and evaluates techniques for data analysis.

Contrast Jan's definition of the role of statistics with mine, which is

   Statistics provides efficient general techniques to help empirical researchers study variables and relationships between variables, mainly as a means to predict and control the values of variables.

Note that the concept of 'data analysis', which is central in Jan's definition, is implicit in my definition because (as I discuss above) the concept is implicit in my prior definition of empirical research.

I believe that Jan's definition and my definition of the role of statistics are mutually fully consistent. However, my definition is both more explicit and more limiting. That is, Jan's definition says that statistics supports "data analysis". My definition also says (through my definition of "empirical research") that statistics supports "data analysis", but it also says where, how, and why statistics supports data analysis.

The definition says WHERE statistics supports data analysis -- in empirical research. I believe we should emphasize where statistics supports data analysis because people sometimes forget that the field of statistics is tightly tied (both generally and specifically) to empirical research projects -- for most statisticians the field of statistics is not an end in itself.

The definition says HOW statistics supports data analysis -- by providing efficient general techniques for studying variables and relationships between variables.

The definition says WHY statistics supports data analysis -- (mainly) to enable us (i.e., society) to accurately predict and control the values of variables.

The "where", "how", and "why" ideas in my definition impose extra limitations on the role of statistics that are not present in Jan's definition. However, many readers will agree that these limitations are not unduly limiting because statistics does not (in a practical sense) do much beyond providing techniques for studying variables and relationships between or among variables in empirical research, and such study is mainly useful as a means to accurately predict and control the values of variables.

(One can also reasonably argue that, in addition to providing the ability to predict and control, the field of statistics also enables us to "explain" and "understand" the area of experience we are studying. I view the relationship between the first pair of concepts ['prediction' and 'control'] and the second ['explanation' and 'understanding'] as follows: In almost every area of serious human endeavor the most widely accepted TEST of a good explanation or of good understanding is whether the explanation or understanding leads to correct prediction or correct control. Therefore, although explanation and understanding are important, they are subordinate to prediction and control. I develop this view in two earlier posts [1996a, 1996b]. I discuss the action-prescriptive role of the field of statistics in appendix A.)

I include the "where", "how", and "why" ideas in the definition because I believe these ideas deserve strong emphasis. I believe that simply saying the role of statistics is to support "data analysis" is too vague -- outsiders cannot easily see value in doing "data analysis". Instead, to an outsider, doing data analysis sounds like something medieval monks might do with great rigor in an isolated monastery in the mountains, but with no known practical value.

Most importantly, if we emphasize WHY statistics supports data analysis (i.e., to enable accurate prediction and control), we pique the curiosity of outsiders because almost everyone would like an easy path to accurate prediction and control. Most statisticians and many empirical researchers agree that proper use of statistics is by far the easiest and best path to accurate prediction and control in any area of empirical research.

> This [i.e., Jan's proposed definition of the role of statistics]
> avoids "empirical", "external", and "science", because
> defining these is metaphysics.

I believe Jan is here using the term "metaphysics" in the pejorative sense of "building conceptual structures on slippery elusive concepts at the bottom". I fully agree with the importance of eliminating this type of metaphysics.

Jan identifies three possible slippery elusive concepts -- 'empirical', 'external', and 'science'. Can we omit these three concepts from the definition of the role of statistics?

Regarding the concept of 'empirical', recall that my definition of the role of statistics refers to the concept of 'empirical research'. Since the concept of 'empirical research' is central in the present discussion, let us focus not on the concept of 'empirical' but on the concept of 'empirical research'.

My definition above of "empirical research" states that empirical research is data gathering, (data analysis,) and conclusion drawing. The main concepts in this definition are the concept of 'data' and the concept of 'conclusion drawing'. To determine whether the concept of 'empirical research' is slippery or elusive let us consider its roots and ask whether the concepts of 'data' and 'conclusion drawing' are slippery or elusive.

Consider the concept of 'data' (which Jan uses directly and which I use indirectly in defining the role of statistics).
I define "data" as "a set of measured values of properties of entities or, equivalently, a set of values of variables". This concept appears to be a straightforward (not slippery or elusive) concept.

Consider the concept of 'conclusion-drawing'. The act of drawing a conclusion is equivalent to making a (preferably correct) statement about some area of experience. Thus this concept (which I discuss in appendix A) also appears to be a straightforward concept.

In view of the preceding points, I suggest that the concept of 'empirical research' in my definition of the role of statistics is not particularly slippery or elusive. And the term "empirical" is merely a useful modifier that conveys the important ideas of 'data-based' and 'observation-based'.

Finally, as I discuss above, I believe it is helpful to include the concept of 'empirical research' in the definition of the role of statistics because I believe students understand our field better if we emphasize where it plays its role.

* * *

Consideration of Jan's second concept ('external') has led me to realize (as I discuss above) that it is unnecessary to refer to the concept of the external world in the definition of "empirical research". Thus I fully agree with Jan that this concept need not be (explicitly or implicitly) present in the definition of the role of statistics.

I also agree that we should not use Jan's third concept ('science') in the definition of the role of statistics. The field of statistics has many applications outside of what is traditionally viewed as "science", such as in business, industry, law, and government.

> It [i.e., Jan's proposed definition of the role of statistics]
> also avoids "inference" and "modeling" and "quantitative",
> because these are irrelevant for the definition.

I believe we should include the concept of 'inference' in the definition of the role of statistics.
(I include the concept in my definition of the role by including the phrase "conclusions are drawn from the data" in my conceptually prior definition of "empirical research".) We should include the concept of 'inference' in the definition because the data analysis we do in statistics is not an end in itself, but is instead a MEANS to drawing valid conclusions (inferences) about the particular area of experience we are using statistics to study.

> Inference is the responsibility of the scientist, statisticians
> just develop tools which may help the scientist doing the job.

I agree that the final responsibility for correct statistical inference rests with the scientist or empirical researcher who uses statistics. However, statisticians must be experts in the GENERAL ideas of statistical inference so that we can teach users of statistics to draw valid inferences from data. Thus inference (in the GENERAL sense) is very much the responsibility of statisticians.

I suspect Jan wrote his sentence above to be provocative. He may agree that saying "statisticians just develop tools which may help" minimizes the fundamental role our field plays (or can play) across almost all empirical research. Such minimization may suggest to outsiders that our field is obscure and uninteresting.

Instead of minimization, if statisticians politely trumpet the vital role our field plays (in providing optimal methods for accurate prediction and control), I believe we can lead the field to receive the very broad recognition it deserves.

* * *

The second concept Jan says is irrelevant for the definition of the role of statistics is the concept of 'modeling'. By "modeling" Jan may mean "building statistical model equations of relationships between variables". I agree that it is unnecessary to refer to this sense of the word "modeling" in the definition of the role of statistics.

On the other hand, by "modeling" Jan may mean "studying relationships between variables" WITHOUT any reference to statistical model equations. If Jan is using the word "modeling" in this sense, he is suggesting that the idea of studying relationships between variables is irrelevant for the definition of (the role of) statistics.

My definition of the role of statistics explicitly refers to the idea of studying relationships between variables. I believe this idea is relevant for the definition because my experience suggests that

- almost all empirical research projects can be usefully viewed as studying relationships between or among variables and

- almost all the commonly used statistical procedures can be usefully unified as methods for studying relationships between or among variables.

I discuss these points further in a paper (1999a).

* * *

The third concept Jan says is irrelevant for the definition of the role of statistics is the concept of 'quantitative', and I agree.

> Although, in the case of hypothesis testing for instance, they
> [statisticians] may have hindered more than they have helped.

I agree that there is much confusion about hypothesis testing -- so much confusion that some empirical researchers eschew hypothesis testing and instead use confidence intervals to perform what is usually the same function.

I suggest that many instances of hypothesis testing can be best viewed as testing for evidence of the existence of a relationship between variables (or testing for evidence of the existence of an extension to an already known relationship between variables). This point of view helps empirical researchers to better understand the usefulness of hypothesis testing.

Testing hypotheses about the existence of relationships between variables is important because (to avoid embarrassing and costly errors) we must first verify that there is clear evidence of a relationship between the relevant variables before attempting to use information about a putative relationship between variables for prediction or control. Computing p-values and verifying that they are low enough is an objective and standardized aid in checking for proper evidence of the existence of a relationship between variables. I discuss these ideas further in two papers (1997, sec. 9; 1998, sec. 5).

> Modeling (in statistics) is one way (and a controversial one at
> that) to develop data analysis techniques.

I agree. (I believe that most [all?] statistical models are models of relationships between variables.)

> -- the distinction between univariate frequency distributions
> and "relations between variables" is better formulated, I
> think, as the distinction between univariate and multivariate
> frequency distributions or between unconditional and
> conditional distributions. This makes most of the distinction go
> away.

The two formulations Jan proposes are more elegant and more compact than the formulation I propose. Jan's formulations may be very useful in advanced statistics courses.

However, I am interested in finding the best formulation of the concepts for the introductory statistics course for students who are not majoring in statistics. Here I think Jan's formulations are more complicated than necessary because they require that we use the concept of 'distribution' (either 'multivariate distribution' or 'conditional distribution'), which is a difficult concept for beginners to grasp.

Contrast Jan's formulations with the formulation I propose, which makes no reference to the concept of 'distribution'.
We can initially characterize relationships between variables in terms of (the values of) one variable "depending" on (the values of) another variable, or the values of one variable varying "in step" with the values of another variable. (We can reinforce the characterizations by carefully discussing illustrative practical examples of relationships.) Later, we can formally characterize the concept in terms of conditional expected value (still without having to appeal to the concept of 'distribution') as I discuss in a paper and in an earlier post (1997, sec. 7.10; 1999b).

I suggest that if two or more consistent formulations are available upon which to build an approach to the introductory statistics course, we should use the formulation that is easiest for beginners to understand. I believe the formulation that distinguishes between univariate distributions and relationships between variables allows the teacher to introduce the concept of 'relationship between variables' very early in the introductory course. I further believe that a careful discussion of this easy-to-understand concept gives students a fundamental grasp of the important practical value of the field of statistics (in facilitating accurate prediction and control). Thus I recommend that teachers use the formulation that distinguishes between relationships between variables and univariate distributions instead of Jan's (correct, more elegant, but harder to understand) formulations that make the distinction go away.

> The constructions that make studying a univariate distribution
> a limiting case of studying a relationship are not very
> convincing to me. Another construction (instead of zero
> predictors) is one predictor which only takes a single value.
> Distribution of IQ in Arkansas is distribution of IQ in the 50
> states, conditional on Arkansas.

Here Jan is referring to the idea that univariate distributions are simply a special (degenerate) case of relationships between variables -- the case in which, as usual, there is one response variable, but the number of predictor variables, instead of being one or more, is reduced to zero. Jan's construction (with a single constant-valued predictor variable) is another valid way of showing that univariate distributions are (rigorously) a special case of relationships between variables.

> -- It simply will not do to minimize the importance of
> univariate summaries. My guess is (unlike Don's) that 96% of the
> statistics published (in newspapers, the gray literature,
> government reports) are univariate frequency distributions or their
> measures of central tendency and dispersion. Saying that this
> is not science will not make it go away.

To follow up on Jan's point I studied each article and letter in the first section of the May 3, 1999 issue of The Toronto Star newspaper, and I identified all the statements or graphics that in any way refer to the values of variables. I classified each of these statements and graphics into one of the following four categories:

1. Statements about the value of a variable in a single entity. (For example, a statement might say a particular person is 36 years old, or a particular concert was attended by approximately 25,000 people, or a particular town is "small".)

2. Statements about a univariate distribution of the values of a variable (including statements about the central tendency or the dispersion of a univariate distribution). (For example, a statement might say that the average age of members of a particular social group is 34 years.)

3. Statements about a relationship between variables. (For example, a statement might say that the average temperature of the earth has been increasing over the last 100 years.)

4. Statements that do not fit into any of the preceding three categories.
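
As a rough arithmetic check, the percentage shares for the four categories can be recomputed from the raw counts given in the table in appendix B (1051, 30, 156, and 0). The following is only a minimal sketch in Python; the dictionary keys and variable names are illustrative and are not part of the original survey.

```python
# Counts per category, copied from the table in appendix B.
# The labels and names used here are illustrative only.
counts = {
    "value of a variable": 1051,
    "univariate distribution": 30,
    "relationship between variables": 156,
    "other": 0,
}

total = sum(counts.values())

# Share of each category, rounded to the nearest whole percent.
percents = {name: round(100 * n / total) for name, n in counts.items()}

for name, n in counts.items():
    print(f"{name:32s} {n:5d} {percents[name]:3d}%")
print(f"{'total':32s} {total:5d}")
```

Rounding each share separately happens to give whole percentages that sum to 100 for these counts; with other counts the rounded shares need not sum exactly to 100.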

Of the 1237 statements I found that refer to the values of variables 85% are statements of the value of a variable in a single entity, 2% are statements of univariate distributions, and 13% are statements of relationships between variables. (No statements fall into the fourth category.) I further discuss my findings in appendix B.

Jan estimates that 96% of the statistics published in newspapers are reports of univariate distributions or their measures of central tendency and dispersion. My informal survey of articles in The Toronto Star suggests that Jan's estimate may be high.

However, my estimate in my 99/5/2 post that 96% of empirical research projects study relationships between variables, and Jan's estimate that 96% of statistics in newspapers, the gray literature, and government reports are related to univariate distributions, and my informal survey of The Toronto Star are NOT particularly reliable sources of information. It would be much better if someone (an empirical researcher) could present properly-sampled survey information about the relative frequency with which various types of statistical statements appear in various areas of interest. As I suggest in the 99/5/2 post, I hope an interested reader will perform a proper survey.

Finally, to the extent that the general public PREFERS reports of univariate distributions over reports of relationships between variables, I believe we have failed as statistics teachers to educate the general public. We have failed because a report of a relationship between a response variable Y and one or more relevant predictor variables is invariably more informative and more useful than a report of the univariate distribution of Y in isolation.

> Also remember that density estimation, robust estimation, etc.
> are important areas of statistical activity.

With respect to empirical density estimation, assuming we are studying a UNIVARIATE density, this is clearly NOT (in my everyday language) a study of a relationship between variables. I believe this activity is subsumed under the roughly 2% of empirical research projects that study univariate distributions that I refer to in the 99/5/2 post. Again though, this is an estimate on my part, which should be corroborated or corrected through a proper survey.

With respect to empirical MULTIVARIATE density estimation, in this case two or more variables are always involved. Thus it is easy to view this activity as simply another way of studying the relationship between (or, better, among) these variables.

With respect to robust estimation, if a research project uses robust estimation to estimate the values of the parameters of a univariate distribution, the research project is clearly studying a univariate distribution and not studying a relationship between variables. I believe this use of robust estimation is subsumed under the 2% of empirical research projects that study univariate distributions. On the other hand, if a research project uses robust estimation to estimate the parameters of a model, it is almost certainly studying a relationship between variables because (as I suggest above) most (all?) statistical models are models of relationships between variables.

> -- Emphasizing that statisticians develop and evaluate data
> analysis techniques is consistent with the Box idea that
> statistics is a service activity, serving the sciences, but also
> policy, government, sports, the law, and the public.

I fully agree with Jan's statement of George Box's important idea that statistics is a service activity serving the areas mentioned (and other areas, such as business and industry).

Jan recommends emphasizing "data analysis techniques".
However, as I discuss above, I believe the concept of 'data analysis' is cryptic and diffuse -- outsiders cannot easily see any practical service or practical value in "data analysis".

In contrast, I recommend that teachers emphasize the easy-to-understand unifying concept of (accurate) prediction and control on the basis of relationships between variables. I believe that emphasizing prediction and control

- impresses students with the significant practical value of our field and

- makes our field substantially easier for students to understand.

-------------------------------------------------------
Donald B. Macnaughton   MatStat Research Consulting Inc
donmac@matstat.com      Toronto, Canada
-------------------------------------------------------


APPENDIX A: THREE TYPES OF CONCLUSIONS IN STATISTICAL ANALYSIS OF EMPIRICAL RESEARCH RESULTS

In the new definition I propose above of "empirical research" I say that empirical research involves "drawing conclusions" from data. The conclusions we draw from data in empirical research can be broken into three types. We can draw

- conclusions about the values of variables or about the values of parameters in entities of some type (and more generally conclusions about univariate distributions of the values of variables) or

- conclusions about relationships between or among variables in entities of some type (and how the relationships can be used for accurate prediction or control) or

- conclusions about the best course of action to take in some endeavor.

The first type of conclusion -- about the values of variables or the values of parameters -- is the simplest. For example, if a researcher has made several empirical estimates of the speed of light, he or she can use statistical methods to analyze these estimates and determine a "best" estimate of the speed, and also determine an estimate of the precision of the estimate.

The second type of conclusion -- about relationships between variables -- is, I believe, the most important type. The field of statistics can help researchers to determine whether a relationship exists between variables and, when a relationship is found, can help to accurately predict and possibly control the values of the response variable in new entities from the population on the basis of the relationship.

The third type of conclusion -- about the best course of action -- is clearly important because almost every individual and almost every organization is regularly faced with making important, but difficult, conclusions or decisions about courses of action. Thus the statistical methods for making conclusions or decisions about the best course of action are an important subset of the methods of statistics.

* * *

I place the three types of conclusions in two main groups. I call the methods to make the first and second types of conclusions the "descriptive" aspect of statistics because these conclusions DESCRIBE a state of the world. That is, they DESCRIBE values of variables or parameters, or they DESCRIBE relationships between variables, or they DESCRIBE methods for predicting and controlling on the basis of such relationships.

I call the methods to make the third type of conclusion the "action-prescriptive" aspect of statistics because these conclusions prescribe a course of action.

Conclusions in the descriptive aspect of statistics often involve statistical tests of hypotheses. For example, a researcher may perform a research project and conclude from the data that a relationship exists between one or more of the predictor variables and the response variable.
The researcher will typically support this conclusion by reporting a test of the hypothesis that the relationship exists and will note that the p-value yielded by the test is low enough that we cannot reasonably explain away the evidence of the relationship in the data as being merely due to chance.

Conclusions in the action-prescriptive aspect of statistics are usually based on conclusions provided by the descriptive aspect. That is, most (all?) conclusions made using the action-prescriptive methods of statistics are based on knowledge about the values of relevant variables and parameters or (more likely) are based on knowledge about relevant relationships between variables.

However, in addition, the action-prescriptive aspect of statistics takes account of a set of VALUES and GOALS. (Here the word "values" denotes principles or qualities that are socially or commercially valuable as opposed to denoting the values of a variable or parameter.) The values and goals are used together with knowledge of variables and relationships between variables to identify (possibly with the help of "objective functions" or "utility functions") the conclusion (i.e., the decision about a course of action) that optimizes the achievement of the values and goals.

Clearly, descriptive conclusions and action-prescriptive conclusions are made using different sets of underlying methods. Therefore, in discussing the use of statistics for drawing conclusions or making decisions it is important to identify the type of conclusions or decisions one is discussing.

* * *

As a practical matter, decision-makers often find it unnecessary to use action-prescriptive concepts to make a decision about a course of action on the basis of the results of an empirical research project because the appropriate course of action is obvious.
For example, if a medical research project shows that a particular drug lessens the severity of a disease in people, the action-prescriptive aspect of statistics cannot generally improve on the recommendation that doctors should prescribe the lowest (safe) dose of the drug that yields a "full" response in most people (possibly subject to corrections related to ancillary variables such as people's ages or body weights).

As a second practical matter, the concepts of the action-prescriptive aspect of statistics are often unusable in real-life situations in which we must decide a course of action. This is because the necessary objective functions (or other pieces of information needed to use precise action-prescriptive concepts) are often unavailable. (This problem can sometimes be bypassed by working with a range of possible objective functions.)

As a third practical matter, descriptive conclusions and action-prescriptive conclusions are often widely separated. That is, empirical researchers generally provide descriptive conclusions about the values of variables or about the values of parameters or descriptive conclusions about relationships between variables. But (apart from possibly SUGGESTING a course of action) the empirical researchers who provide the descriptive conclusions play little or no direct role in MAKING DECISIONS about a course of action on the basis of the conclusions. This is because empirical researchers typically place the information they discover at other people's disposal. (Researchers often place the information in the public domain in the form of a journal article.) Then the researchers go back to doing what they do best -- empirical research -- and other people (people who make policy) make the relevant decisions about courses of action.

(An [infrequently occurring] exception to the point in the preceding paragraph occurs when we directly use methods from the action-prescriptive aspect of statistics in conducting empirical research. This occurs in so-called "sequential" statistical analysis in which at each main step in the analysis we make a formal decision as to what action to take next. [Usually one possible action is to gather more data.] Each decision is made on the basis of the data accumulated to that point.)

Since

- the action-prescriptive aspect of statistics is not often used to interpret the results of empirical research projects and

- when the action-prescriptive aspect IS used it is usually widely separated from the descriptive aspect

therefore, it is possible and reasonable to separately teach students about the descriptive and action-prescriptive aspects of statistics.

Since the action-prescriptive aspect of statistics relies on conclusions provided by the descriptive aspect, but not vice versa, it is reasonable to teach the descriptive aspect of statistics to students before teaching them the action-prescriptive aspect.

Finally, the descriptive aspect of statistics is much broader than the action-prescriptive aspect, perhaps at least in the ratio of ten to one in terms of, say, the amount of associated literature published.

The above considerations lead me to recommend that introductory teachers focus on the descriptive aspect of statistics. That is, I recommend that teachers concentrate on how to draw conclusions about variables and parameters and (especially) how to draw conclusions about relationships between variables. I recommend that discussion of how to use statistical techniques to draw conclusions (or make decisions) about correct courses of action be postponed until a later course.
My sense of the distinction between the descriptive and action-prescriptive aspects of statistics grew from an insightful article about p-values by Ware, Mosteller, Delgado, Donnelly, and Ingelfinger (1992).


APPENDIX B: EXAMINATION OF THE MAY 3 ISSUE OF THE TORONTO STAR FOR STATEMENTS ABOUT VARIABLES

The Toronto Star is a broadsheet newspaper with a weekday circulation of about 450,000. I chose to work with the May 3, 1999 Metro Edition of the Star "at random" in the sense that this was the issue at hand when I began work.

The first section of the May 3 Star contains 20 pages, with roughly 45% of the material being advertising. It contains 44 news stories (some only a paragraph or two long) in the categories of world news, Canada news, and Ontario news. (Most Toronto news appears in another section.) The first section also contains 10 letters to the editor, 6 opinion articles, and 4 editorials.

I carefully scanned all the above-mentioned non-advertising material in the section, circling each statement of the simple value of a variable, writing "D" in the margin for each statement of some aspect of a univariate distribution, writing "R" in the margin for each statement of a relationship between variables, and writing "N" in the margin for each statement that pertains to variables but that falls into none of the preceding three groups. If I detected that the same statement was made more than once (e.g., in the text and in a graphic), I counted the statement only once. The following table shows my counts of the four types of statements:

      Frequency of Occurrence of Different Types of
      Statements Pertaining to Variables in the
      Non-Advertising Material in the First Section of
      the Metro Edition of the May 3, 1999 Toronto Star

      ---------------------------------
      Statement Type     Count  Percent
      ---------------------------------
      1. Value of         1051     85
         a variable
      2. Univariate         30      2
         distribution
      3. Relationship      156     13
         between
         variables
      4. Other               0      0
                          ____    ___
         TOTALS           1237    100
      ---------------------------------

Let me describe some of my classifications. The first news story on the front page of the May 3 Star is titled "Release of PoWs fails to sway NATO". The first sentence in this story says

    NATO will continue bombing Yugoslavia despite the release of three U.S. prisoners of war.

I interpreted this (unhappy) sentence as stating a relationship between variables or, actually, a NON-relationship between variables. The response variable can be viewed as "bombing intensity of Yugoslavia" and the predictor variable can be viewed as "number of NATO prisoners released". The sentence says there will be no relationship between the release of the three prisoners and the bombing intensity.

The second sentence fills in details of the first sentence and introduces a new relationship between variables as follows:

    "We are not only not going to stop the bombing, we're going to intensify the bombing," U.S. Defence Secretary William Cohen said yesterday.

This sentence indicates that there will be a relationship between the intensity of the bombing and time -- the intensity will increase in the immediately following time. The second sentence also states the value of the variable that indicates the time of Cohen's statement: "yesterday".

The third sentence says

    Still, diplomatic efforts to end the conflict in the Balkans appeared to gain some strength yesterday.

This sentence is another statement of a relationship between variables. The response variable is "strength of diplomatic efforts to end the conflict in the Balkans" and the predictor variable is time. The sentence is saying that the strength of the efforts appears to have increased in recent time (i.e., over the course of "yesterday").

The fourth sentence indicates how the diplomatic efforts to end the conflict appeared to gain strength as follows:

    The White House announced that U.S.
    President Bill Clinton would meet today with Russian envoy Viktor Chernomyrdin, who will brief him on Russian efforts to resolve the crisis.

Part of the fourth sentence states the value of a variable -- the (anticipated) value of the "when" property of the meeting, which is "today".

The second story on the front page is titled "Rental housing crisis predicted" and begins with the following sentence:

    Toronto and other Ontario cities face a housing crisis as demand for affordable apartments grows at a much faster rate than construction, a comprehensive new report says.

This sentence is a somewhat complicated statement of an (extrapolated) relationship between variables. One way of viewing the relationship is to say that the predictor variable is "time" and the two response variables are "demand for affordable apartments" and "rate of construction of affordable apartments".

A third story on the front page is titled "Immigrants feel welcome" and begins with the following sentence:

    Toronto is a place that welcomes newcomers, more so than many cities around the world, say participants in a poll commissioned for The Star's Beyond 2000 project.

This is a statement of a relationship between variables. The response variable could be labeled "amount of welcome extended to newcomers" and the predictor variable is the nominal-level variable "city" (e.g., Toronto, New York, Tokyo).

                          *   *   *

Although relationships between variables dominate the important lead sentences of the three stories I discuss above, relationships between variables are only peripheral in the other two stories on the front page. One of these stories is about the Toronto Maple Leafs hockey team winning a series of playoff games. This story contains many statements of the values of variables, but I found no statements of relationships between variables.
To illustrate some statements of values of variables, here are the third and fourth sentences from the hockey story, with the values of variables between asterisks:

    *Thirty-two years to the day* after winning their *last* Stanley Cup, the Leafs moved on to the *second* round of the *'99* playoffs *last night* by finishing off the Philadelphia Flyers. The *1-0* victory came on *winger* *Sergei Berezin's* goal with *one* minute to play in regulation time.

(The final score of the game can be viewed as a vector-valued variable with two components.)

Note the broad view I take of what are statements of the values of variables. That is, as well as values that denote numbers, I view all words or phrases that function as adjectives or adverbs and all words or phrases that state a time or date as stating values of variables. I take this point of view because these differing linguistic operations all seem to be performing the same conceptual function -- that of stating the value of some property of an entity or, equivalently, that of stating the value of some variable. (In the case of an adjective, the entity is denoted by the NOUN the adjective modifies. In the case of an adverb, the entity is the PROCESS the adverb helps to describe. In the case of a point in time or a duration, the entity is the EVENT the time refers to.)

The final story on the front page is about a shooting in Alberta. This story contains many statements of the values of variables, but only one statement of a relationship between variables.

                          *   *   *

As I note above, I found only 30 statements of univariate distributions, including statements of means, standard deviations, or other measures of location or dispersion. Two of these occur in one sentence in the hockey story on the front page. The sentence quotes the player who scored the only goal in the game as follows:

    "It's the biggest goal I've ever scored in the NHL," said Berezin, who led the team in goals this season.
This sentence identifies a property that Berezin associates with each of the goals he has scored in the NHL -- the goal's bigness (i.e., its importance) -- and the sentence says that his latest goal has the maximum value in the distribution of the values of that property. The same sentence (implicitly) refers to the distribution of the number of goals scored by players on Berezin's team and says that Berezin is the player who is associated with the maximum value of that distribution.

Here are some other examples of statements of univariate distributions:

    Testicular cancer tends to hit men in their 20s (page A3).

The statement says that the age distribution of the occurrence of testicular cancer peaks strongly in the 20s.

    [Curt] Newport used computer models and state-of-the-art underwater equipment to search an area near Grand Bahama in water that averaged almost 4,700 meters deep (page A3).

The statement gives the average of the univariate distribution of water depths in the area searched.

    Mireya Moscoso ... won election yesterday to become the president [of Panama] .... She beat Martin Torrijos .... Torrijos conceded defeat with 59.4 per cent of ballots counted, showing Moscoso with 44 per cent of the vote to his 38 per cent. ... A third candidate ... Alberto Vallarino ... held 17 per cent of the vote (page A10).

This passage reports the univariate distribution of the votes among the three candidates. (One could also reasonably view the passage as stating a relationship between two variables: "candidate" [i.e., Moscoso, Torrijos, or Vallarino] and "percent of the vote won".)

                          *   *   *

I found that classifying the text in the Star was difficult because I often found that rereading a passage suggested another (usually deeper) interpretation, or suggested an additional statement of one of the types. Thus it seems likely that another reader performing the classification would produce a somewhat different classification from mine.
However, I believe one could develop a set of straightforward rules for classifying statements that refer to variables. These rules would lead to strong agreement among trained classifiers about statements in passages of text. I suspect that these rules would lead to finding proportions of the three types of statements about variables that are close to the proportions I give in the table above.

                          *   *   *

I also studied the sports section in the May 3 Star. The sports section is a special case because it is full of statistics. Since many newspaper sports statistics are presented in tables, and since the row and column (and other) dimensions of a table each suggest a different variable, it is clear that tables of newspaper sports statistics reflect relationships between variables.

Interestingly however, tables of newspaper sports statistics do not generally reflect relationships between variables in POPULATIONS. This is because the tables are generally not reasonably viewed as representing SAMPLES that reflect some wider population of entities. Instead, the tables are generally detailed cross-classified reports of the values of properties of individual entities (typically teams, players, or horses in the Star), and interest is focused on these specific entities, rather than on some population they represent.

Thus (unless the tables themselves are sampled over different issues of a newspaper) the relationships between variables reflected in tables of newspaper sports statistics cannot generally be used for predicting or controlling the values of variables in other entities in a population, as is generally the case in standard statistical analysis.
Although tables of newspaper sports statistics are generally not (directly) useful for predicting and controlling the values of variables in other entities in a population, such tables are of definite interest in predicting and controlling the values of the variables FOR THE SAME INDIVIDUAL ENTITIES that are reported in a table AT A LATER TIME. For example, if a table of sports statistics shows teams as rows and if the statistics in the columns of the table indicate that one team is presently doing particularly well relative to the others, we can reasonably use the table (possibly with other relevant consistent information) to predict that this particular team will do well in its next game. If one is a betting person, this prediction might then form the basis of a bet.

Thus the use of statistics in sports is often different from the standard use of statistics in empirical research.


REFERENCES

Macnaughton, D. B. 1996a. "EPR approach to intro stat: Relationships between variables (response to comments by Brian Griffin)." Published in sci.stat.edu and EdStat-L on July 18, 1996. Available at http://www.matstat.com/teach/p0012.htm

Macnaughton, D. B. 1996b. "EPR approach to intro stat: Relationships between variables (response to comments by George Zeliger)." Published in sci.stat.edu and EdStat-L on July 30, 1996. Available at http://www.matstat.com/teach/p0016.htm

Macnaughton, D. B. 1997. "The entity-property-relationship approach to statistics: An introduction for students." Available at http://www.matstat.com/teach/

Macnaughton, D. B. 1998. "Eight features of an ideal introductory statistics course." Available at http://www.matstat.com/teach/

Macnaughton, D. B. 1999a. "The introductory statistics course: The entity-property-relationship approach." Available at http://www.matstat.com/teach/

Macnaughton, D. B. 1999b. "Re: Eight features of an ideal intro stat course (response to comments by Herman Rubin)." Published in sci.stat.edu and EdStat-L on May 16, 1999. Available at http://www.matstat.com/teach/p0041.htm

Ware, J. H., Mosteller, F., Delgado, F., Donnelly, M. S., and Ingelfinger, J. A. 1992. "P values." In _Medical uses of statistics_ 2nd ed. edited by J. C. Bailar III and F. Mosteller, 181-200. Boston: NEJM (New England Journal of Medicine) Books.