The following is self-documenting output from a SAS IML computer program:
NOTE: SAS (r) Proprietary Software Release 6.12 TS020 Licensed to DONALD B. MACNAUGHTON, Site 0025250001. 1 /* 2 PR0139.SAS 3 4 COMPUTING NUMERATOR SUMS OF SQUARES 5 IN UNBALANCED ANALYSIS OF VARIANCE: 6 Two-Way Case 7 8 Donald B. Macnaughton 9 donmac@matstat.com 10 11 12 TABLE OF CONTENTS 13 14 Abstract > 15 16 Introduction > 17 - preliminary notes > 18 - introduction to the Searle example > 19 - a goal of experiments and analysis of variance > 20 - a controversy about sums of squares > 21 - model equations > 22 - the residual sum of squares of a model equation > 23 - an interpretation of analysis of variance sums of squares 24 in terms of model equations > 25 - summing up > 26 - program overview > 27 28 Preliminary Steps > 29 - start PROC IML > 30 - define the Searle data > 31 - generate the main effect submatrices of the design matrix > 32 - generate the interaction submatrix of the design matrix > 33 - obtain the SS subroutine and list it > 34 - abstract > 35 - uses > 36 - main arguments: y, XE, and XR > 37 - method > 38 - details of the method > 39 - the hypothesis matrix H as a function of XE and XR > 40 - the projection matrix PM as a function of H > 41 - yp, the projection of y by PM > 42 - the sum of squares as the squared length of yp > 43 - other methods of computing the sum of squares > 44 - general comments > 45 - secondary arguments > 46 - notes > 47 - executable statements > 48 - references > 49 - set the values of the three secondary arguments of the SS 50 subroutine > 51 52 Compute the Seven Sums of Squares Using the SS Subroutine > 53 - HTO A sum of squares > 54 - specify the matrix XE for the effect being tested > 55 - specify the matrix XR for the other effects in the 56 two models > 57 - call SS to compute the sum of squares > 58 - HTO B sum of squares > 59 - HTI A sum of squares > 60 - HTI B sum of squares > 61 - sequential A sum of squares when A is entered first > 62 - sequential B sum of squares when B is entered first > 63 - interaction sum of squares > 64 65 Save the Data in a SAS Dataset and Quit from IML > 66 67 Compute the Sums of Squares using PROC GLM (for comparison with 68 the values generated above) > 69 70 Summary > 71 72 Notes > 73 74 Appendix: Steps to Run the Program > 75 76 References > 77 78 Output from PROC GLM > 79 80 81 ABSTRACT 82 83 This SAS program illustrates a conceptual point of view and the 84 matrix arithmetic for computing the following types of analysis 85 of variance numerator sums of squares: 86 87 - HTO (Higher-level Terms are Omitted) 88 = SPSS ANOVA Experimental 89 = SAS Type II in the two-way case 90 91 - HTI (Higher-level Terms are Included) 92 = SAS Type III 93 = SPSS ANOVA Unique 94 = the default approach in many analysis of variance pro- 95 grams 96 97 - sequential 98 = SAS Type I 99 = SPSS ANOVA Hierarchical in the two-way case. 100 101 The conceptual point of view is one of computing an analysis of 102 variance sum of squares by computing the difference between the 103 residual sums of squares of two model equations (Yates 1934). 104 The matrix arithmetic is simple and is specified directly in 105 terms of the conceptual point of view. 106 107 The program is heavily annotated. Computations are illustrated 108 using data from a 2 x 3 unbalanced experiment discussed by 109 Shayle Searle (1987, 79). 
110 111 112 PRELIMINARY NOTES 113 114 If you are not familiar with SAS, you can tell the difference 115 between my comments and the SAS program statements as follows: 116 My comments begin with the two symbols /* and end with the two 117 symbols */ /* Anything outside these symbols (except 118 blanks) is a program statement, which SAS will try to execute. 119 120 To lay some groundwork, I begin the program not with SAS pro- 121 gram statements but instead with about 500 lines of my own com- 122 ments. These set the stage for the later program statements. 123 124 125 INTRODUCTION TO THE SEARLE EXAMPLE 126 127 Analysis of variance is a broadly used method for analyzing the 128 results of scientific experiments. The method was invented by 129 Sir Ronald Aylmer Fisher (1925, 1935) and is generally viewed 130 as the most powerful and versatile available method for scien- 131 tifically inferring causation. 132 133 A current controversy in analysis of variance pertains to ana- 134 lyzing the data from "unbalanced" experiments. The controversy 135 is important because (for various reasons) a large proportion 136 of real-world experiments end up being unbalanced. Shayle 137 Searle addresses the controversy in his book LINEAR MODELS FOR 138 UNBALANCED DATA (1987) in which he discusses both mathematical 139 and philosophical issues. Although I disagree with some of 140 Professor Searle's philosophical conclusions, I am in awe of 141 his mathematical work. It is with deep respect that I offer 142 the following analysis of an important example in his book. 143 144 Searle's example is of an experiment to test whether "type of 145 potting soil" influences "time to germination" in three varie- 146 ties of carrot seed (1987, 78-79). 147 148 (Searle's example is from the field of agriculture. However, 149 the discussion is not limited to experiments in the field of 150 agriculture. Instead, both Searle's discussion and my discus- 151 sion apply to experiments in all fields of empirical research. 152 In particular, the discussion applies to a majority of the ex- 153 periments in the physical, biological, and social sciences.) 154 155 Clearly, the response variable in Searle's experiment is "time 156 to germination", and the two predictor variables are "soil 157 type" and "seed variety". Searle presents the following data 158 as possible results of the experiment: 159 160 TABLE 1 161 Time in Days to First Germination 162 of Three Varieties of Carrot Seed 163 Grown in Two Different Potting Soils 164 ------------------- 165 Seed 166 Soil Variety (B) 167 Type ------------ 168 (A) 1 2 3 169 ------------------- 170 1 6 13 14 171 10 15 22 172 11 173 174 2 12 31 18 175 15 9 176 19 12 177 18 178 ------------------- 179 180 The first number in the body of the table (i.e., 6) indicates 181 that in one of the fifteen trials in the experiment it took six 182 days for seeds of variety 1 to germinate when they were planted 183 in soil of type 1. 184 185 Searle's experiment is unbalanced because, as the table shows, 186 the number of values of the response variable available in the 187 various "cells" in the experiment differs from cell to cell. 188 For example, three values of the response variable are avail- 189 able in the (1,1) cell, but only two values are available in 190 the (1,2) cell. 
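The unbalance is easier to see if we count the number of trials in each of the six cells of table 1:

                     Seed
           Soil      Variety (B)
           Type      ------------
           (A)        1    2    3
           -------------------------
            1         3    2    2
            2         4    1    3
           -------------------------

In a balanced experiment all six counts would be equal; here they range from one (the (2,2) cell) to four (the (2,1) cell).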
191 192 193 A GOAL OF EXPERIMENTS AND ANALYSIS OF VARIANCE 194 195 When discussing analysis of variance, it is important to be 196 aware of both the goal of scientific experiments and the role 197 of analysis of variance in achieving the goal. Otherwise, the 198 discussion may become an arbitrary mathematical exercise. One 199 useful way of characterizing the goal is 200 201 The goal of experiments and analysis of variance is to 202 obtain knowledge about relationships between variables. 203 204 In any scientific experiment, an important step in achieving 205 this goal is to determine, in as unequivocal a way as possible, 206 whether a relationship actually *exists* between the variables 207 under study. In particular, we wish to determine whether the 208 response variable in an experiment "depends" on one or more of 209 the predictor variables. If a relationship is found between 210 the variables, a second goal is to determine the nature of the 211 relationship. 212 213 Thus in the Searle data we wish to determine whether "time to 214 germination" depends on "soil type", or whether "time to germi- 215 nation" depends on "seed variety". 216 217 It is invariably the case in experimental research that we wish 218 to determine whether the dependence exists in the *population* 219 of entities under study, not just in the sample of entities 220 that participated in the experiment. (In Searle's example the 221 entities are trials, or "carrot seed plantings".) 222 223 In an experiment with two predictor variables, the nature of 224 the relationship between the response variable and the predic- 225 tor variables can be either 226 - no (detectable) relationship or 227 - one or two "simple" relationships ("main effects") or 228 - an "interaction". 229 230 Interactions were invented by Fisher (1935, ch. VI). Interac- 231 tions provide a comprehensive means for detecting any (detect- 232 able) form of relationship that might exist between the re- 233 sponse variable and the predictor variables in an experiment. 234 In particular, interactions help us to detect complicated rela- 235 tionships between the response variable and a predictor vari- 236 able in which the specific form of the relationship depends on 237 the level of one or more *other* predictor variables. I give 238 formal definitions of the concepts of 'relationship between 239 variables', 'interaction', and 'simple relationship' in a paper 240 (1997, sec. 6). 241 242 The use of analysis of variance to detect relationships between 243 variables is (at a high level) straightforward: We submit the 244 data summarizing the results of an experiment (e.g., the data 245 in table 1) to an analysis of variance program, and the program 246 computes a set of "p-values". Assuming the experiment was 247 properly designed, the program provides a p-value for each sim- 248 ple relationship (main effect) and a p-value for each interac- 249 tion. If the p-value for a main effect or interaction is low 250 enough (and if there is no reasonable alternative explanation), 251 we conclude that the particular relationship between variables 252 associated with the p-value is extant in the population of en- 253 tities under study. 254 255 I discuss the general scientific study of relationships between 256 variables (including the notion of a p-value) further in two 257 papers (1996a, 1996b). 
258 259 260 A CONTROVERSY ABOUT SUMS OF SQUARES 261 262 To compute an analysis of variance p-value from the results of 263 an experiment, it is mathematically necessary to first compute 264 certain "sums of squares". All analysis of variance programs 265 compute these sums of squares as an intermediate step in com- 266 puting p-values. Currently controversy exists about how sums 267 of squares should be computed. Controversy exists about both 268 the numerator sum of squares and the denominator sum of squares 269 used in the "F-ratio" to compute a p-value. The present dis- 270 cussion focuses exclusively on computing numerator sums of 271 squares for unbalanced experiments. 272 273 (Since balanced experiments are an "internal" limiting case of 274 unbalanced experiments, the discussion below also applies to 275 balanced experiments.) 276 277 This program illustrates the computations of the three best- 278 known conceptual methods for computing numerator sums of 279 squares. The program also illustrates the mathematical aspects 280 of computing numerator sums of squares by illustrating a simple 281 mathematical algorithm that can carry out all of the three con- 282 ceptual methods. 283 284 The purpose of this program is not to judge the various methods 285 of computing sums of squares, but rather to illustrate them. 286 Therefore, I make no judgments here about the merits of the 287 various methods. However, I do make judgments in the paper 288 (1997) and I shall extend these judgments in material I shall 289 publish later. 290 291 292 MODEL EQUATIONS 293 294 To understand sums of squares, it is useful to understand the 295 notion of a "model equation" of the relationship between the 296 response variable and the predictor variables in an experiment. 297 Two types of model equations (often called simply "models") are 298 in use today: "overparameterized" models and "cell-means" mod- 299 els. I discuss both types in the paper (1997, sec. 9). The 300 following discussion is in terms of overparameterized models. 301 302 Consider a two-way experiment (e.g., the Searle experiment) 303 that has predictor variables A and B. We can model the rela- 304 tionship between the response variable, called y, and the pre- 305 dictor variables with the following model equation: 306 307 y(i,j,k) = mu + alpha(i) + beta(j) + phi(i,j) + e(i,j,k). (1) 308 309 (Normally the i's, j's and k's in the parentheses would be sub- 310 scripts, and the five terms on the right side of the equation 311 would be Greek letters, but these features are not yet avail- 312 able for comments in computer programs.) 313 314 The terms in the equation have the following interpretations: 315 316 y(i,j,k) = value of the response variable for the kth entity in 317 the (i,j) treatment group in the experiment 318 319 mu = grand mean of the values of the response variable 320 for all the entities in the population 321 322 alpha(i) = simple ("main") effect of predictor variable A on y 323 when A is at level i 324 325 beta(j) = simple ("main") effect of predictor variable B on y 326 when B is at level j 327 328 phi(i,j) = the joint (interaction) effect of predictor vari- 329 ables A and B on y when they are at levels i and j 330 respectively 331 332 e(i,j,k) = the "error" term, which takes account of the vari- 333 ation in y that cannot be accounted for by the other 334 four terms on the right side of the equation. 
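For example, under model (1) the 9-day germination time recorded in the (2,3) cell of table 1 (soil type 2, seed variety 3) is represented as

   y(2,3,k) = mu + alpha(2) + beta(3) + phi(2,3) + e(2,3,k)

for the appropriate value of k, with e(2,3,k) absorbing whatever part of the nine days is not accounted for by the grand mean, the two main effects, and the interaction.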
335 336 Model (1) gives us a succinct picture of how the value of the 337 response variable "depends" on the values of the two predictor 338 variables, and how it also depends on other unknown factors, 339 which are taken account of by the error term. Model (1) is 340 called the "saturated" model for a two-way experiment because 341 it contains all the possible terms for such an experiment. 342 343 As we shall see, it often makes sense to use a reduced version 344 of (1) in which certain terms are omitted from the right side. 345 For example, if we omit the interaction term, phi, we get 346 347 y(i,j,k) = mu + alpha(i) + beta(j) + e(i,j,k). (2) 348 349 350 THE RESIDUAL SUM OF SQUARES OF A MODEL EQUATION 351 352 Once we have the data from an experiment (e.g., the data in ta- 353 ble 1), we can "fit" various models to the data. That is, we 354 can use a mathematical algorithm to compute values for the pa- 355 rameters associated with the terms in the model [i.e., values 356 for mu, the alpha(i)s, the beta(j)s, and the phi(i,j)s]. The 357 algorithm operates by choosing values for the parameters asso- 358 ciated with the four terms so that the resulting model gives 359 the "best" predictions of the values of the response variable y 360 for all the values of y obtained in the experiment. 361 362 The fitting of the terms is usually done by the method of least 363 squares or by the closely related method of maximum likelihood. 364 In the standard case, both methods yield "identical" estimates 365 for the values of the parameters associated with the terms 366 (excluding the error term e) on the right side of the equation 367 in the sense that 368 the sum of the squared differences between the values 369 of the response variable predicted by the equation and 370 the *actual* values of the response variable in the ex- 371 periment 372 is the lowest possible value. 373 374 After we have fitted a model to the results of an experiment 375 and obtained estimates for the values of the parameters, we can 376 then use the model to compute the predicted value for each of 377 the values of the response variable obtained in the experiment. 378 Then we can subtract each predicted value from the correspond- 379 ing actual value to get a number called the "residual". If we 380 square each of these residuals and add the squared residuals 381 together, we get the "residual sum of squares" for the model 382 for the experimental data at hand. 383 384 (Of course, the residual sum of squares is the number that was 385 itself minimized two paragraphs above in order to determine the 386 estimates of the values of the parameters -- a piece of mathe- 387 matical bootstrapping that still amazes me.) 388 389 390 AN INTERPRETATION OF ANALYSIS OF VARIANCE SUMS OF SQUARES 391 IN TERMS OF MODEL EQUATIONS 392 393 We can view all standard analysis of variance numerator sums of 394 squares as being the value we obtain if we subtract the resid- 395 ual sum of squares for one model from the residual sum of 396 squares for another model (Yates 1934, 63). For example, con- 397 sider the following two models for the results of a two-way ex- 398 periment: 399 400 y(i,j,k) = mu + beta(j) + e(i,j,k) (3) 401 402 y(i,j,k) = mu + alpha(i) + beta(j) + e(i,j,k). (2) 403 404 I shall use the term Rn to denote the residual sum of squares 405 for model (n). Thus R3 denotes the residual sum of squares for 406 (3). 
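In IML a residual sum of squares can be computed in a single line once a full-column-rank design matrix for the model has been built. As an illustrative sketch only (the name X3 is mine, and it borrows the vector y, the submatrix Bdesign, and the function j(15,1) that are defined in the preliminary steps below), R3 could be computed as

   X3 = j(15,1) || Bdesign             <-- mu and beta columns for model (3)
   R3 = ssq(y - X3 * ginv(X3) * y)     <-- residual sum of squares for (3)

Here X3 * ginv(X3) * y is the vector of least-squares predicted values, so y minus that vector is the vector of residuals, and SSQ sums their squares.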
407 408 Suppose we perform a two-way experiment (such as the Searle ex- 409 periment), and suppose we (separately) fit models (2) and (3) 410 to the results of the experiment, yielding R2 and R3. If we 411 subtract R2 from R3, this difference is identical to the HTO (= 412 SAS Type II = SPSS ANOVA Experimental) sum of squares for the A 413 main effect. 414 415 It stands to reason that the numerical difference R3 - R2 416 should equal a sum of squares for the A main effect since the 417 only conceptual difference between the two models is that (2) 418 contains alpha, the term for predictor variable A, and (3) does 419 not. 420 421 Consider the following two models: 422 423 y(i,j,k) = mu + beta(j) + phi(i,j) + e(i,j,k) (4) 424 425 y(i,j,k) = mu + alpha(i) + beta(j) + phi(i,j) + e(i,j,k). (1) 426 427 Note that (4) and (1) are the same as (3) and (2) respectively, 428 except that both (4) and (1) have an extra term, namely 429 phi(i,j). R4 - R1 is identical to the HTI (= SAS Type III = 430 SPSS ANOVA Unique) sum of squares for the A main effect. 431 432 We can see the source of the names HTI and HTO by studying the 433 two pairs of models above. For the HTI sum of squares the 434 Higher-level interaction Term [phi(i,j)] is Included (HTI) in 435 both models. For the HTO sum of squares the Higher-level in- 436 teraction Term is Omitted (HTO) from both models. 437 438 Finally, consider the following two models: 439 440 y(i,j,k) = mu + e(i,j,k) (5) 441 442 y(i,j,k) = mu + alpha(i) + e(i,j,k). (6) 443 444 Note that (5) and (6) are the same as (3) and (2) respectively, 445 except that both (5) and (6) lack the term beta(j). R5 - R6 is 446 identical to the sequential (= SAS Type I = SPSS ANOVA Hierar- 447 chical) sum of squares for the A main effect when A is entered 448 first (after the mean) into the model. 449 450 If we compute the difference between the residual sums of 451 squares of two models, the difference is called the "sum of 452 squares for the effect being tested". The "effect" is the term 453 that is present in one of the two models, but absent from the 454 other. 455 456 Each sum of squares has associated with it a number of "degrees 457 of freedom". 458 459 (For any main effect, the number of degrees of freedom is one 460 less than the number of values assumed by the associated pre- 461 dictor variable in the experiment. For example, in the Searle 462 data, predictor variable B assumes three different values in 463 the experiment so the number of degrees of freedom for B is 464 two. For an interaction, the number of degrees of freedom is 465 the product of the numbers of degrees of freedom for the main 466 effects for each of the predictor variables associated with the 467 interaction.) 468 469 If we divide a sum of squares for a particular effect by its 470 degrees of freedom, we get the "mean square" for the effect. 471 Similarly, if we divide the residual sum of squares for the 472 saturated model by *its* degrees of freedom, we get the resid- 473 ual mean square. The reason we might *want* to compute any of 474 these mean squares rests on three facts 475 476 - If there is *no* relationship in the population between the 477 response variable and predictor variable A, clearly the cor- 478 rect value for alpha(i) is zero for all values of i in all 479 the equations above. In this case, it can be shown that the 480 three mean squares for the A effect can be expected (under 481 certain often-satisfiable assumptions) to equal the "residual 482 variance" in the experiment. 
(The residual variance is esti- 483 mated by the residual mean square.) 484 485 - If there *is* a relationship between A and the response vari- 486 able, the three mean squares for the A effect can usually be 487 expected to be *greater* than the residual variance. 488 489 - Thus to determine whether there is evidence of a relationship 490 between the response variable and predictor variable A, we 491 need only determine whether the appropriate effect mean 492 square is significantly greater than the residual mean 493 square. The p-value is simply an easy-to-interpret result of 494 this determination. 495 496 These facts, buttressed by mathematical arguments (Searle 1987, 497 sec. 8.6), imply that the three approaches to computing sums of 498 squares provide (with certain limitations) valid candidates for 499 the numerator sums of squares in F-ratios used to compute p- 500 values used to test for the existence of a relationship between 501 the response variable and predictor variable A. 502 503 (The technique I discuss above for detecting relationships be- 504 tween variables is undeniably complex. A questioning reader 505 might wonder whether a simpler technique [with reasonable power 506 and objectivity] might be found. So far, no such technique has 507 been found.) 508 509 The above discussion states that three popular types of numera- 510 tor sums of squares (HTO, HTI, and sequential) can be computed 511 in a two-way experiment by computing the difference between the 512 residual sums of squares of two model equations. This general- 513 izes: 514 515 All standard analysis of variance numerator sums of 516 squares for two-way, three-way, and higher experiments 517 can be viewed as reflecting the difference between the 518 residual sums of squares of two overparameterized model 519 equations. 520 521 522 SUMMING UP 523 524 - an important use of analysis of variance is to help research- 525 ers detect relationships between variables in populations of 526 entities 527 528 - we can detect relationships between variables by studying the 529 p-values obtained by applying analysis of variance to the re- 530 sults of an experiment 531 532 - analysis of variance computer programs compute numerator sums 533 of squares as an intermediate step in computing p-values 534 535 - controversy exists about which method of computing numerator 536 sums of squares is preferred in unbalanced experiments 537 538 - one easy-to-understand way of viewing all standard approaches 539 to computing numerator sums of squares is to view them as re- 540 flecting the difference between the residual sums of squares 541 of two different overparameterized model equations for the 542 relationship between the response variable and predictor 543 variables in the experiment. 544 545 546 PROGRAM OVERVIEW 547 548 The program statements below demonstrate a simple algorithm for 549 computing analysis of variance numerator sums of squares. The 550 algorithm takes three pieces of information as input 551 552 1. a specification of the two model equations whose residual 553 sums of squares we wish to "difference" in order to compute 554 an analysis of variance numerator sums of squares [e.g., (3) 555 and (2) above] 556 557 2. a specification of the layout of the experiment 558 559 3. the response variable data vector containing the values of 560 the response variable obtained in the experiment. 561 562 The algorithm uses the input to compute the specified sum of 563 squares. 
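The algorithm below works through a projection matrix rather than by literally fitting two models, but by Yates' characterization the two routes agree. As a brute-force check of that viewpoint, the earlier sketch of R3 can be extended to the HTO sum of squares for A (the names X2 and ssHTOA are mine, and Adesign, Bdesign, and y are defined in the preliminary steps below):

   X2 = j(15,1) || Adesign || Bdesign   <-- design matrix for model (2)
   R2 = ssq(y - X2 * ginv(X2) * y)      <-- residual sum of squares for (2)
   ssHTOA = R3 - R2                     <-- HTO A sum of squares

The SS subroutine introduced below computes the same quantity without fitting the two models separately.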
564 565 To show that the algorithm works properly, the program computes 566 seven analysis of variance sums of squares for the Searle data 567 given above. 568 569 The program below is organized into five parts: 570 571 1. a set of six preliminary steps to get things ready for com- 572 puting the sums of squares 573 574 2. seven repetitions of three simple steps to compute the seven 575 sums of squares 576 577 3. a recomputation of the seven sums of squares with SAS PROC 578 GLM for comparison 579 580 4. a summary 581 582 5. notes, an appendix, and references. 583 584 585 PRELIMINARY STEP 1: START PROC IML AND RESET SOME OPTIONS 586 587 PROC IML (Interactive Matrix Language) is an easy-to-use com- 588 puter language for general matrix arithmetic. It is an add-on 589 component of the SAS system and has many built-in functions to 590 facilitate matrix operations particular to statistical analy- 591 sis. 592 593 The PROC IML statement below initiates the IML environment. 594 After that statement is executed, the statements that follow 595 are statements in the IML language until we reach the QUIT 596 statement, which takes us back into a standard SAS program. 597 */ 598 599 proc iml; IML Ready 600 601 /* 602 The options in the following RESET statement control the desti- 603 nation and appearance of the IML output. 604 */ 605 606 reset log fw=3 spaces=3; 607 608 609 /* 610 PRELIMINARY STEP 2: DEFINE THE SEARLE DATA 611 612 To get the Searle data into IML, we define two column vectors 613 (a and b) of values for the two predictor variables, and we de- 614 fine a column vector y, containing the values of the response 615 variable. Following are the three IML statements that define 616 the vectors a, b, and y: 617 */ 618 619 a = { 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2,2, 2 }; 620 621 b = { 1, 1, 1, 2, 2, 3, 3, 1, 1, 1, 1, 2, 3,3, 3 }; 622 623 y = { 6,10,11, 13,15, 14,22, 12,15,19,18, 31, 18,9,12 }; 624 /* 625 Note how the values in the three vectors "line up" correctly 626 with each other to reflect exactly the same information as is 627 given near the beginning of this program in table 1. For exam- 628 ple, the last three values in y (i.e., 18, 9, and 12) each line 629 up with a value of 2 in vector a and a value of 3 in vector b, 630 reflecting the fact that the last three values in y are associ- 631 ated with the (2,3) cell in table 1. 632 633 The set of fifteen numbers between the braces in each of the 634 three IML statements above is called a "matrix literal". Al- 635 though the above three matrix literals are laid out horizon- 636 tally, they do not specify row vectors but instead specify 637 *column* vectors. They specify column vectors because each of 638 the numbers (except the last) in each matrix literal is fol- 639 lowed by a comma, which in an IML matrix literal indicates the 640 end of a row. (Numbers *within* a row of an IML matrix literal 641 are not separated by commas, but by blanks.) 642 643 644 PRELIMINARY STEP 3: GENERATE THE MAIN EFFECT SUBMATRICES 645 646 Before we can begin computing sums of squares, we must first 647 generate the "full-column-rank" submatrices of the overall 648 "design matrix" for the experiment. These submatrices (which 649 are surprisingly simple) are used in the computation of the 650 sums of squares. 651 652 Any analysis of variance has a separate submatrix of the design 653 matrix for each of its main effects (i.e., for each predictor 654 variable) and a separate submatrix for each interaction in the 655 experiment. 
Each submatrix has N rows, where N equals the num- 656 ber of elements (entities, rows, cases, observations) in the 657 data. (In the Searle data N is fifteen since there were fif- 658 teen trials in the experiment; these trials are reflected in 659 the fifteen elements in each of vectors a, b, and y defined 660 above.) Each submatrix has DF columns, where DF is the number 661 of degrees of freedom associated with the main effect or inter- 662 action associated with the submatrix. 663 664 The full-column-rank submatrix for any main effect can be gen- 665 erated with a single statement in IML. For example, to gener- 666 ate the submatrix for the main effect of predictor variable B, 667 and to store that submatrix in the matrix called Bdesign, we 668 use the following statement: 669 670 Bdesign = designf(b) 671 672 where b is the column vector containing the raw values of pre- 673 dictor variable B used in the experiment. 674 675 DESIGNF is a built-in function in IML. The function returns a 676 matrix of zeros and (positive and negative) ones, with N rows 677 and DF columns, where N and DF are as described above. 678 679 Consider the main effect for predictor variable B (seed vari- 680 ety) in the Searle data. Here is a copy of the definition 681 (given above) of the (column) vector showing the different val- 682 ues of predictor variable B for the fifteen trials in the ex- 683 periment: 684 685 b = { 686 1, 1, 1, 2, 2, 3, 3, 1, 1, 1, 1, 2, 3, 3, 3 }. 687 688 Following is the IML statement to compute the submatrix of the 689 design matrix for predictor variable B. This is followed by a 690 statement to print the newly computed matrix. (The matrix is 691 printed transposed for ease of study.) 692 */ 693 694 Bdesign = designf(b); 695 696 print (Bdesign`); #TEM1001 1 1 1 0 0 -1 -1 1 1 1 1 0 -1 -1 -1 0 0 0 1 1 -1 -1 0 0 0 0 1 -1 -1 -1 697 /* 698 Consider the first row of 1's, zeros, and -1's in the output 699 above. Compare the values in this row with the values in vec- 700 tor B given further above. Note how the 1's identify the cases 701 in which B has the value 1. Similarly, the zeros identify the 702 cases in which B has the value 2. Finally, the -1's identify 703 the cases in which B has the value 3. 704 705 Similarly, in the second row note how the 1's identify the 706 cases in which B has the value 2. And the -1's identify 707 (again) the cases in which B has the value 3. The zeros iden- 708 tify the cases in which B has the value 1. 709 710 This generalizes: Suppose some (discrete-valued) predictor 711 variable P has assumed m different values in an experiment. 712 (The value of m must be greater than or equal to 2 for P to be 713 a valid predictor variable.) Then the (transposed) submatrix 714 for P will have m-1 rows, each row being associated with one of 715 the m values, and with one of the m values being "left out" 716 with no row of its own. Within each row the elements are +1 if 717 this case received the associated value of predictor variable 718 P; -1 if the case received the "left out" value of predictor 719 variable P; and zero if m is greater than 2 and the case re- 720 ceived one of the other m-2 values of predictor variable P. 721 722 Design matrices are mysterious because, in the case of comput- 723 ing the standard analysis of variance sums of squares, they are 724 not unique. 
That is, we could get exactly the same results in 725 the output below (apart from slight differences due to roundoff 726 errors) if we specified a new submatrix of the design matrix 727 for predictor variable B provided that the two columns in the 728 new matrix were themselves any two independent "linear combina- 729 tions" of the two columns of the submatrix for B given above. 730 731 How is it possible to get the same results if a submatrix of 732 the design matrix is replaced with a transformed version of it- 733 self? This puzzling aspect of design matrices is explained by 734 the fact that the relevant information (for computing a sum of 735 squares) in a submatrix of a design matrix is not stored in the 736 *values* of the elements in the matrix -- it is stored in the 737 *relationships among* the values. Independent linear combina- 738 tions of the columns in a submatrix preserve this relevant in- 739 formation. 740 741 (Technically, the linear combinations preserve the information 742 in the sense that all transformed sets of columns of a sub- 743 matrix of the design matrix so-defined are generating vectors 744 of the same relevant subspace of the N-dimensional vector space 745 under study.) 746 747 Now let us compute the submatrix of the design matrix for pre- 748 dictor variable A (soil type). Here is a copy of the (column) 749 vector of values from the definition given above for this vari- 750 able: 751 752 a = { 753 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2 } 754 755 Following are the statements to compute and print the submatrix 756 of the design matrix for predictor variable A. (The matrix is 757 again printed transposed for ease of study.) 758 */ 759 760 Adesign = designf(a); 761 762 print (Adesign`); #TEM1001 1 1 1 1 1 1 1 -1 -1 -1 -1 -1 -1 -1 -1 763 /* 764 Since predictor variable A assumed only two values in the ex- 765 periment, the (transposed) submatrix of the design matrix has 766 only one row. 767 768 As was the case with predictor variable B, note how the 1's and 769 -1's above in Adesign line up respectively with the 1's and 2's 770 in the column vector of the values of predictor variable A. 771 772 773 PRELIMINARY STEP 4: GENERATE THE INTERACTION SUBMATRIX 774 775 Each interaction in an experiment has its own submatrix of the 776 design matrix. We can generate an interaction submatrix by 777 computing the "horizontal direct product" of the design matri- 778 ces for all the main effects for the predictor variables asso- 779 ciated with the interaction. 780 781 Consider two matrices P and Q with n rows and np and nq col- 782 umns, respectively. The horizontal direct product of P and Q 783 has n rows and np x nq columns. Each column in the horizontal 784 direct product is associated with a unique pair of columns, one 785 from P and one from Q. The elements in a given column in the 786 horizontal direct product are the scalar products of corre- 787 sponding elements in the two associated columns of P and Q. 788 789 For example, consider the following two matrices: 790 791 P = {1 2, and Q = {1 6, 792 3 4, 2 5, 793 5 6 } 3 4 }. 794 795 The horizontal direct product of P and Q is 796 797 { 1 6 2 12, 798 6 15 8 20, 799 15 20 18 24 }. 800 801 The IML function HDIR(P,Q) computes the horizontal direct prod- 802 uct of two matrices P and Q. 803 804 Following is a statement to generate the submatrix of the de- 805 sign matrix for the A x B interaction in the Searle data. 
The 806 submatrix is generated by computing the horizontal direct prod- 807 uct of the two design submatrices (Adesign and Bdesign) gener- 808 ated above. This is followed by a statement to print the in- 809 teraction submatrix (again transposed for ease of study). 810 */ 811 812 ABdesign = hdir(Adesign, Bdesign); 813 814 print (ABdesign`); #TEM1001 1 1 1 0 0 -1 -1 -1 -1 -1 -1 0 1 1 1 0 0 0 1 1 -1 -1 0 0 0 0 -1 1 1 1 815 816 817 /* 818 COMPUTATION STEP 1: FOR A GIVEN SUM OF SQUARES SPECIFY XE 819 (WE SHALL FIRST COMPUTE THE HTO SUM OF SQUARES FOR A) 820 821 Having computed the three submatrices of the design matrix, we 822 can begin computing sums of squares. Let us first compute the 823 HTO sum of squares for predictor variable A (soil type). 824 825 The actual computing of sums of squares in this program is done 826 by a subroutine called SS. Using a subroutine is efficient be- 827 cause it consolidates the code to compute sums of squares into 828 a reusable and portable package. I discuss the SS subroutine 829 in detail below. 830 831 In order to compute a sum of squares, the SS subroutine re- 832 quires three pieces of input: 833 834 1. the values of the response variable obtained in the experi- 835 ment 836 837 2. information about which sum of squares we wish to compute, 838 as specified by two overparameterized model equations, as 839 discussed above 840 841 3. information about the layout of the experiment. 842 843 The first piece of input, the values of the response variable, 844 was defined above in the definition of the vector y. Thus we 845 need only pass the vector y to the subroutine for the subrou- 846 tine to use the values of the response variable. 847 848 The second and third pieces of input are passed to the subrou- 849 tine through two matrices that are obtained from the submatri- 850 ces of the design matrix discussed above. The two matrices are 851 called XE and XR. 852 853 The first matrix, XE, is simply the submatrix of the design ma- 854 trix for the particular effect we wish to test. In the present 855 case we are testing the A main effect. Thus we specify XE as 856 follows: 857 */ 858 859 XE = Adesign; 860 861 862 /* 863 COMPUTATION STEP 2: FOR A GIVEN SUM OF SQUARES SPECIFY XR 864 865 In the present example we wish to compute the HTO numerator sum 866 of squares for the A main effect. Thus (as discussed above) we 867 wish to compute the difference between the residual sums of 868 squares of the following two model equations: 869 870 y(i,j,k) = mu + beta(j) + e(i,j,k) (3) 871 872 y(i,j,k) = mu + alpha(i) + beta(j) + e(i,j,k). (2) 873 874 We have already specified the conceptual difference between (3) 875 and (2) in the specification of XE above. Thus all that re- 876 mains in order to fully specify the two equations is to list 877 the "other" terms on the right side of (3) and (2). 878 879 The "other" terms on the right side of (3) and (2) (excluding 880 the error term) are mu and beta. We specify the other terms by 881 "horizontally concatenating" the submatrices for these terms to 882 form the matrix XR. 883 884 As suggested by the name, the horizontal concatenation of two 885 matrices P and Q is the matrix formed by appending the rows of 886 Q to the corresponding rows of P. (Horizontal concatenation of 887 two matrices is possible only if the matrices have the same 888 number of rows.) In SAS IML the expression P || Q is used to 889 specify the horizontal concatenation of two matrices P and Q. 
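For example, if

   P = {1 2,          and     Q = {5,
        3 4 }                      6 },

then

   P || Q = {1 2 5,
             3 4 6 }.

That is, each row of Q is appended to the end of the corresponding row of P.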
890 891 Since mu is a constant in the two equations, the submatrix of 892 the design matrix for mu is a column vector of 1's. We can 893 specify a column vector of fifteen 1's with the IML function 894 895 j(15,1). 896 897 Thus we specify the XR matrix for the HTO A (soil type) sum of 898 squares in the experiment with the following statement: 899 */ 900 901 XR = j(15,1) || Bdesign; 902 903 /* 904 Here is the (transposed) XR matrix: 905 */ 906 907 print (XR`); #TEM1001 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 -1 -1 1 1 1 1 0 -1 -1 -1 0 0 0 1 1 -1 -1 0 0 0 0 1 -1 -1 -1 908 909 /* 910 Note that the XE and XR contain two types of information: 911 912 - information about which analysis of variance sum of squares 913 we wish to compute (as captured by the *choice* of the sub- 914 matrices we have used to define XE and XR) 915 916 - information about the layout of the experiment (as captured 917 in the *layout* of the submatrices of the design matrix used 918 to define XE and XR). 919 920 The E in XE stands for the Effect being tested. The R in XR 921 stands for otheR terms on the right side of the two model equa- 922 tions. 923 924 925 PRELIMINARY STEP 5: 926 MAKE THE SS SUBROUTINE AVAILABLE TO THE PROGRAM 927 928 (I have delayed this preliminary step until after defining XE 929 and XR, since this will help you to understand the subroutine.) 930 931 Since we have now defined y, XE, and XR for the HTO A (soil 932 type) effect, we are ready to call subroutine SS to compute the 933 sum of squares. In fact, we shall call the subroutine momen- 934 tarily. But before we can actually call the subroutine, we 935 must first make it known to IML. We do this with the %INCLUDE 936 statement that follows. 937 938 Because the SOURCE2 option is specified in the %INCLUDE state- 939 ment, SAS lists all of the statements in the subroutine in the 940 output below immediately after the %INCLUDE statement. 941 */ 942 943 %include 'D:\PROGS\SS.SAS' / source2; NOTE: %INCLUDE (level 1) file D:\PROGS\SS.SAS is file D:\PROGS\SS.SAS. 944 + /*********************************************************/ 945 + 946 + start SS (result, y, XE, XR, level,printss,printh); 947 + 948 + /* 949 + SS.SAS 950 + 951 + SUBROUTINE TO COMPUTE ANALYSIS OF VARIANCE 952 + NUMERATOR SUMS OF SQUARES 953 + 954 + Donald B. Macnaughton 955 + donmac@matstat.com 956 + 957 + ABSTRACT 958 + 959 + This SAS IML subroutine computes analysis of variance nu- 960 + merator sums of squares for statistical tests in unbalanced 961 + analysis of variance. The subroutine can compute the fol- 962 + lowing types of sums of squares: 963 + 964 + - HTO (Higher-level Terms are Omitted) 965 + = SPSS ANOVA Experimental 966 + = SAS Type II in the two-way case 967 + 968 + - SAS Type II 969 + 970 + - HTOS (Higher-level Terms are Omitted unless Significant) 971 + = superset of SAS Type II and HTO 972 + 973 + - HTI (Higher-level Terms are Included) 974 + = SAS Type III 975 + = SPSS ANOVA Unique 976 + = the default method in many analysis of variance pro- 977 + grams 978 + 979 + - sequential 980 + = SAS Type I 981 + = SPSS ANOVA Hierarchical in the two-way case 982 + 983 + - SPSS ANOVA Hierarchical 984 + 985 + - other types of sums of squares (including those in analy- 986 + sis of covariance) that can be specified as being equal to 987 + the difference between the residual sums of squares of two 988 + overparameterized model equations. 
989 + 990 + Each time the subroutine is called it computes the single 991 + sum of squares specified by the calling arguments and re- 992 + turns the result to the calling program through the returned 993 + argument RESULT. 994 + 995 + 996 + SUBROUTINE USES 997 + 998 + The subroutine 999 + 1000 + - provides an easy way of computing numerator sums of 1001 + squares that cannot be computed in some statistical pack- 1002 + ages 1003 + 1004 + - provides a way of checking how a particular sum of squares 1005 + is computed in a statistical package 1006 + 1007 + - illustrates the mathematical aspects of computing sums of 1008 + squares 1009 + 1010 + - assists in performing simulation tests of conjectures 1011 + about analysis of variance sums of squares. 1012 + 1013 + 1014 + SUBROUTINE MAIN ARGUMENTS 1015 + 1016 + When calling this subroutine you must specify three main ar- 1017 + guments: y, XE, and XR. 1018 + 1019 + The first main argument, y, must be a column vector contain- 1020 + ing the values of the response variable obtained in the ex- 1021 + periment. 1022 + 1023 + The second and third main arguments, XE and XR, must be sub- 1024 + matrices of the (full-column-rank) design matrix for the ex- 1025 + periment. These submatrices jointly specify the particular 1026 + sum of squares to be computed. XE must contain the subma- 1027 + trix associated with the effect being tested. XR must con- 1028 + tain the horizontal concatenation of the submatrices associ- 1029 + ated with all the other terms (except the error term) on the 1030 + right side of the two model equations whose residual sums of 1031 + squares you wish to difference. 1032 + 1033 + I give examples of specification of y, XE, and XR in two 1034 + sample calling programs (1998a, 1998b). 1035 + 1036 + 1037 + SUBROUTINE METHOD 1038 + 1039 + As noted above, the subroutine is supplied with the vector y 1040 + and the matrices XE and XR. Here is the code (all four 1041 + lines) for computing the requested sum of squares and stor- 1042 + ing it in the variable RESULT: 1043 + 1044 + H = XE - XR * ginv(XR) * XE <-- hypothesis matrix 1045 + (dimensions are N x DF) 1046 + PM = H * ginv(H) <-- projection matrix (N x N) 1047 + yp = PM * y <-- projection of y (N x 1) 1048 + result = ssq(yp) <-- squared length of the 1049 + projection (1 x 1). 1050 + 1051 + If you are familiar with matrix algebra, the above code will 1052 + probably be understandable, even if you are not familiar 1053 + with SAS IML. But here are definitions for functions that 1054 + might be puzzling: 1055 + 1056 + ginv(X) = Moore-Penrose generalized inverse of X 1057 + 1058 + ssq(x) = sum of the squares of the elements in x. 1059 + 1060 + The subroutine method is derived from discussion and proofs 1061 + by Hocking (1985) and Searle (1987). Aldrich (1998) dis- 1062 + cusses the history of the methods. I discuss the method in 1063 + more detail below. 1064 + 1065 + 1066 + SUMMARY OF THE REST OF THE SUBROUTINE 1067 + 1068 + The remaining lines in this subroutine (717 lines) contain 1069 + - a discussion of details of the method discussed above 1070 + - a discussion of details of the subroutine operation and 1071 + - the executable version of the four statements given above. 1072 + I recommend that you omit reading these lines unless you are 1073 + interested in the details. (To find the end of the subrou- 1074 + tine to return to reading the calling program, search for 1075 + the row of asterisks.) 
1076 + 1077 + 1078 + THE HYPOTHESIS MATRIX H 1079 + 1080 + As noted above, the first step of this subroutine is to gen- 1081 + erate the hypothesis matrix H, using the following state- 1082 + ment: 1083 + 1084 + H = XE - XR * ginv(XR) * XE. 1085 + 1086 + Actually, the above statement is one of two different meth- 1087 + ods that the subroutine can use to generate H. (I describe 1088 + how to specify which method is to be used in the Secondary 1089 + Arguments section below.) The other method is 1090 + 1091 + H = XE - XR * inv(XR` * XR) * XR` * XE. 1092 + 1093 + Algebraically, both methods yield exactly the same result 1094 + for H, as shown by Graybill's theorem 6.2.16 (1983, 112). 1095 + 1096 + (At several places in this subroutine I state that two meth- 1097 + ods yield exactly the same result. Although my statements 1098 + are algebraically correct, if we use the two methods in a 1099 + computer program, usually differences between the values 1100 + computed by the two methods will occur, due to differences 1101 + in roundoff errors. One should be aware of these differ- 1102 + ences but they can usually be ignored because they are usu- 1103 + ally restricted to just the last few significant digits in 1104 + computations that are performed in SAS in sixteen decimal 1105 + digit precision.) 1106 + 1107 + H has the following properties: 1108 + 1109 + 1. H has N rows, where N is the number of rows in y, XE, 1110 + and XR. 1111 + 1112 + 2. H has DF columns, where DF is the number of degrees of 1113 + freedom of the effect being tested. 1114 + 1115 + 3. Choose any cell in the table that summarizes the layout 1116 + of the experiment. (For an example of such a table, see 1117 + Searle's carrot seed germination data [1987, 79], repro- 1118 + duced in table 1 in Macnaughton [1998a].) All the rows 1119 + in H associated with the chosen cell are identical to 1120 + each other. 1121 + 1122 + 4. The sum of each column of H equals zero. This is sur- 1123 + prising because the sums of the columns of XE and XR, 1124 + which are used to generate H, generally do NOT equal 1125 + zero in unbalanced experiments. 1126 + 1127 + 5. If H has only one column (i.e., DF = 1), we can view the 1128 + elements in H as a statement of the hypothesis being 1129 + tested. That is, the elements in H are (indirectly) 1130 + multiplied by the corresponding elements in the response 1131 + data vector (in a "contrast") as a step in computing the 1132 + sum of squares. 1133 + 1134 + 6. If H has more than one column, the columns of H are 1135 + "linearly independent". 1136 + 1137 + 7. If H has more than one column, the elements in each col- 1138 + umn of H can be viewed as a "portion" of the hypothesis 1139 + being tested. That is, each column of H represents a 1140 + separate contrast that is applied to the response data 1141 + vector. The results of these contrasts are mathemati- 1142 + cally combined to compute the sum of squares. (The con- 1143 + trasts are not directly applied to the response data 1144 + vector, but only indirectly through PM, as discussed be- 1145 + low.) 1146 + 1147 + 8. Surprisingly, H is not unique. That is, for any given H 1148 + we can replace it by any of an infinite number of "re- 1149 + lated" matrices (which have, of course, the same number 1150 + of rows and columns as H), and this subroutine will re- 1151 + turn exactly the same value for the sum of squares. 
H 1152 + does not need to be unique to yield a unique sum of 1153 + squares because the columns of H are not defining them- 1154 + selves. Instead, the columns are defining a unique sub- 1155 + space of the N-dimensional vector space under study. 1156 + Linear algebra shows that other versions of the matrix H 1157 + can be used to define the same unique subspace. More 1158 + specifically, suppose T is ANY "non-singular" matrix 1159 + with DF rows and DF columns. Then we can replace H with 1160 + H1 = H * T and this subroutine will return exactly the 1161 + same sum of squares. 1162 + 1163 + 1164 + THE PROJECTION MATRIX PM 1165 + 1166 + As noted above in the "Method" section, the next step after 1167 + computing H is to compute the projection matrix PM, using 1168 + the following statement: 1169 + 1170 + PM = H * ginv(H). 1171 + 1172 + Actually (as with H), the above statement is one of two dif- 1173 + ferent methods that this subroutine can use to generate PM. 1174 + The other method is 1175 + 1176 + PM = H * inv(H` * H) * H`. 1177 + 1178 + As (again) shown by Graybill's theorem 6.2.16, both methods 1179 + yield exactly the same value for PM. 1180 + 1181 + The method immediately above is equivalent to computing 1182 + Hocking's P(c) in (6.168) (1985, 153). This method is par- 1183 + tially shown to be valid in Graybill's Theorem 4.4.1 (1983, 1184 + 73). Harville proves some useful results for projection ma- 1185 + trices (1997, sec 12.3). 1186 + 1187 + Note how the above two equations imply that PM is a "normal- 1188 + ized" version of the hypothesis matrix H. PM has the fol- 1189 + lowing properties: 1190 + 1191 + 1. PM has N rows and N columns. 1192 + 1193 + 2. Choose any cell in the table that summarizes the layout 1194 + of the experiment. All the rows in PM associated with 1195 + that cell are identical to each other. 1196 + 1197 + 3. PM is symmetric. That is, the first row has exactly the 1198 + same elements (in left-to-right order) as the first col- 1199 + umn (in top-to-bottom order), and the second row has ex- 1200 + actly the same elements as the second column, and so on. 1201 + 1202 + 4. The sum of the elements in each row of PM (and the sum 1203 + of the elements in each column) is equal to zero. This 1204 + implies that if a row of PM is multiplied by the vector 1205 + y, it produces a "contrast" of the values in y. 1206 + 1207 + 5. The trace of PM (i.e., the sum of the elements on the 1208 + descending diagonal) is equal to DF, the number of de- 1209 + grees of freedom associated with the hypothesis being 1210 + tested. 1211 + 1212 + 6. The rank of PM (i.e., the number of linearly independent 1213 + rows or columns in PM) is equal to DF. Thus if DF is 1214 + equal to 1, any row in PM is a multiple of any other 1215 + row. If DF is equal to k, any row in PM can be gener- 1216 + ated as a linear combination of any k mutually independ- 1217 + ent other rows. 1218 + 1219 + 7. The sum of the squares of the elements in PM is equal to 1220 + DF. 1221 + 1222 + 8. The rows (and columns) of PM are linear combinations of 1223 + the columns of H. 1224 + 1225 + 9. PM is unique. That is, suppose a projection matrix PM 1226 + is computed from a given hypothesis matrix H. Suppose 1227 + that then H is multiplied by any non-singular matrix T 1228 + with DF rows and DF columns to yield H1. Suppose that 1229 + then a new projection matrix PM1 is computed from H1. 1230 + Then PM1 = PM. 1231 + 1232 + 10. PM is idempotent. That is, if we multiply PM by itself, 1233 + the answer we get is PM. 
1234 + 1235 + 11. If (as discussed below) we project an arbitrary vector x 1236 + through PM to yield x1, and if we then project x1 1237 + through PM to yield x2, we will find that x1 and x2 are 1238 + identical, although they will generally differ from x. 1239 + 1240 + 12. PM has DF eigenvalues (characteristic values) that are 1241 + equal to +1 and the remaining eigenvalues are equal to 1242 + zero. The eigenvectors (characteristic vectors) of PM 1243 + that correspond to the non-zero eigenvalues can be used 1244 + as columns to form a valid version of the hypothesis ma- 1245 + trix H. 1246 + 1247 + 1248 + yp, THE PROJECTION OF y BY PM 1249 + 1250 + Thus far, we have only used information about the *predic- 1251 + tor* variables in the experiment to derive the following 1252 + four new matrices: XE, XR, H, and PM. That is, we have not 1253 + yet taken any account of the values of the response variable 1254 + stored in the vector y. The next step in computing the sum 1255 + of squares is to mathematically marry the response variable 1256 + and the predictor variables. We do so by using the projec- 1257 + tion matrix PM to "project" y. That is, we postmultiply the 1258 + projection matrix by y to yield a new vector, called yp, as 1259 + follows: 1260 + 1261 + yp = PM * y. 1262 + 1263 + The vector yp has the following properties: 1264 + 1265 + 1. Like y, yp is an N x 1 column vector. 1266 + 1267 + 2. Choose any cell in the table that summarizes the layout 1268 + of the experiment. All the elements in yp associated 1269 + with that cell are identical to each other. 1270 + 1271 + 1272 + THE SUM OF SQUARES AS THE SQUARED LENGTH OF yp 1273 + 1274 + The desired sum of squares is simply the squared length of 1275 + yp, and is computed as 1276 + 1277 + result = ssq(yp). 1278 + 1279 + Thus the sum of squares is simply the squared length of the 1280 + projection of the vector y by the projection matrix PM. PM 1281 + has two important properties related to the projection of y 1282 + 1283 + 1. If no relationship exists between the response variable 1284 + and the particular set of predictor variables associated 1285 + with XE, (and if certain well-known assumptions are ade- 1286 + quately satisfied), the length of the projection of y can 1287 + be expected to be "short"; it will be roughly equal to a 1288 + known length (i.e., a length that can be computed from 1289 + the data). On the other hand if a relationship between 1290 + the variables *does* exist, the length of the projection 1291 + of y will tend to be longer than the known length. Thus 1292 + computing a p-value is simply computing whether the pro- 1293 + jection of y is longer than could be reasonably expected 1294 + if no relationship exists. 1295 + 1296 + 2. If we are studying a balanced experiment, and if we com- 1297 + pute projection matrices PM1 and PM2 for any two of the 1298 + effects (main effects or interactions) in the experiment, 1299 + PM1 * PM2 = 0, where 0 is an N x N matrix of zeros. 1300 + (This impressive result does not generally occur in un- 1301 + balanced experiments.) This means that in a balanced ex- 1302 + periment the projection of any vector y by PM1 is "or- 1303 + thogonal" to (i.e., at right angles to) the projection of 1304 + the same vector (or any other vector y1) by PM2. It also 1305 + means that if an experiment is balanced, none of the ef- 1306 + fects in the experiment can "contaminate" the statistical 1307 + tests of other effects in the experiment. 
(This contami- 1308 + nation, which I shall demonstrate in later material, oc- 1309 + curs with some statistical tests in unbalanced experi- 1310 + ments.) 1311 + 1312 + 1313 + OTHER METHODS OF COMPUTING THE SUM OF SQUARES 1314 + 1315 + Simple linear algebra implies that we can also compute the 1316 + desired sum of squares as 1317 + 1318 + result = yp` * yp. 1319 + 1320 + Also, Searle (1987) shows in (82) on page 264 and (90) on 1321 + page 272 and (90) on page 318 that we can compute the de- 1322 + sired sum of squares directly from PM as a quadratic form 1323 + 1324 + result = y` * PM * y. 1325 + 1326 + I find the projection approach (a geometric approach) easier 1327 + to understand than the quadratic form approach (an algebraic 1328 + approach). I visualize the response vector as an arrow in 1329 + the N-dimensional vector space that is "projected through" 1330 + the projection matrix to generate another arrow in a sub- 1331 + space of the first space. The length of the second arrow is 1332 + related to selected properties of the first arrow. In par- 1333 + ticular, the projection matrix is specifically designed so 1334 + that the length of the second arrow shows (in a way that is 1335 + as mathematically "clear" as possible) the strength of sup- 1336 + port (provided by the y-values obtained in the experiment) 1337 + for the hypothesis that the relationship between variables 1338 + associated with the projection matrix exists in the popula- 1339 + tion. 1340 + 1341 + 1342 + GENERAL COMMENTS ABOUT THE SUBROUTINE METHOD 1343 + 1344 + It is helpful to review what this subroutine accomplishes. 1345 + In essence, the calling program passes the matrices XE and 1346 + XR to the subroutine. These matrices contain the specifica- 1347 + tions of the layout of the experiment and the specification 1348 + of two model equations. The calling program also passes the 1349 + vector y to the subroutine. This vector contains the values 1350 + of the response variable obtained in the experiment. Fol- 1351 + lowing Yates' characterization of analysis of variance sums 1352 + of squares (1934, 63), the subroutine uses the three argu- 1353 + ments to compute the difference between the residual sums of 1354 + squares of the two model equations. Depending on how the 1355 + calling program specifies the values of XE and XR, this al- 1356 + lows the subroutine to compute a sum of squares using any of 1357 + the seven approaches to computing analysis of variance sums 1358 + of squares named in the abstract of the subroutine. (I il- 1359 + lustrate how to call the subroutine to compute sums of 1360 + squares for some of the approaches in two computer programs 1361 + [1998a, 1998b].) 1362 + 1363 + 1364 + SUBROUTINE SECONDARY ARGUMENTS 1365 + 1366 + The remaining lines in this subroutine (419 lines) contain 1367 + - details of the subroutine operation and 1368 + - the executable version of the statements given above. 1369 + You can find the end of the subroutine to return to reading 1370 + the calling program by searching for the row of asterisks. 1371 + 1372 + To use this subroutine you must supply values for three sec- 1373 + ondary arguments: LEVEL, PRINTSS, and PRINTH. These argu- 1374 + ments control details of how the subroutine performs the 1375 + computation and prints the results. 1376 + 1377 + The argument LEVEL controls which method the subroutine uses 1378 + to compute the hypothesis matrix H and the projection matrix 1379 + PM. 
If you set LEVEL to 1, the two matrices are computed 1380 + using the standard inverse. If you set LEVEL to 2, the two 1381 + matrices are computed using the generalized inverse (as 1382 + shown in the Method section). Using LEVEL = 1 seems to 1383 + yield solutions that are slightly more accurate. 1384 + 1385 + The argument PRINTSS controls whether the subroutine prints 1386 + the sum of squares it has computed. If you set PRINTSS to 1387 + 1, the subroutine prints the value of the sum of squares in 1388 + 25.15 format. If you set PRINTSS to zero, the subroutine 1389 + does not print the sum of squares but instead only returns 1390 + the value to the calling program through the argument 1391 + RESULT. If you set PRINTSS to 2, the subroutine prints all 1392 + the printable digits of the computed sum of squares in E-no- 1393 + tation for possible comparison against other computed val- 1394 + ues. If PRINTSS is 1 or 2, the subroutine also prints the 1395 + results of tests of the integrity of the projection matrix 1396 + PM. (I describe the tests below.) 1397 + 1398 + The argument PRINTH controls whether the subroutine prints 1399 + the following three intermediate results: 1400 + - the hypothesis matrix H 1401 + - the projection matrix PM and 1402 + - the projection, yp, of the response data vector. 1403 + If you set PRINTH to 1, the subroutine prints the intermedi- 1404 + ate results. If you set PRINTH to 2, the subroutine prints 1405 + the intermediate results but prints the hypothesis matrix 1406 + and the projection of the response vector transposed, which 1407 + can sometimes save space in the output. If you set PRINTH 1408 + to zero, the subroutine does not print intermediate results. 1409 + 1410 + 1411 + SUBROUTINE NOTES 1412 + 1413 + The numerical method of computing sums of squares used by 1414 + this subroutine is efficient for supporting the conceptual 1415 + approach to analysis of variance sums of squares of comput- 1416 + ing the difference between the residual sums of squares of 1417 + two overparameterized model equations. However, the method 1418 + is generally not the most efficient method in terms of 1419 + - economy of computation in memory required 1420 + - economy of computation in time required 1421 + - minimization of roundoff errors for large datasets or ill- 1422 + conditioned data. 1423 + Nevertheless, the subroutine generally yields highly accu- 1424 + rate sums of squares in minimal time and is therefore more 1425 + than adequate for most applications that cannot be handled 1426 + by a general analysis of variance program. 1427 + 1428 + One inefficiency of the method used by this subroutine re- 1429 + lates to the inclusion of identical rows within each of the 1430 + following five arrays: 1431 + - XE and XR, the relevant submatrices of the design matrix 1432 + - H, the hypothesis matrix 1433 + - PM, the projection matrix 1434 + - yp, the projection of vector y. 1435 + That is, (as noted above) for any given cell in the table 1436 + that summarizes the layout of the experiment, all the rows 1437 + in each of the five arrays that correspond to that cell are 1438 + (within the array) identical -- one row for each value of 1439 + the response variable associated with the cell. Sunwoo and 1440 + Kim (1997) discuss an approach to analyzing unbalanced ex- 1441 + periments that eliminates this duplication. 
For very large 1442 + experiments, this subroutine could be enhanced to take ac- 1443 + count of the Sunwoo and Kim approach, thereby substantially 1444 + increasing the computational efficiency (at the expense of 1445 + an increase in complexity). 1446 + 1447 + (The enhancement might work as follows: Compute y1, which 1448 + is an m-dimensional column vector containing the cell means 1449 + [or possibly cell totals] of the values of the response 1450 + variable, where m is the number of cells in the experiment. 1451 + Compute XE, XR, H, and PM as above but on the basis of an 1452 + equivalent experiment with only one observation per cell. 1453 + Compute yp1 as the projection of y1 by PM. Scale yp1 with 1454 + the Sunwoo and Kim T matrix to [in effect] yield yp, and 1455 + compute the desired sum of squares from yp as before.) 1456 + 1457 + The hypothesis matrix H I use in this subroutine has dimen- 1458 + sions N x DF. This is the transpose of the hypothesis ma- 1459 + trix Ronald Hocking discusses in his book (1985, 153), and I 1460 + discuss in a paper (1997, appendix C). I have used the 1461 + transpose because it yields slightly simpler notation. 1462 + 1463 + To conserve memory, the subroutine erases XE and XR after 1464 + using them. 1465 + 1466 + If you wish to run this subroutine on an EBCDIC system 1467 + (e.g., an IBM mainframe), see the note near the end. 1468 + 1469 + 1470 + SUBROUTINE EXECUTABLE STATEMENTS 1471 + 1472 + First, check the arguments passed to the subroutine and stop 1473 + if a problem is found. 1474 + */ 1475 + 1476 + if level ^= 1 & level ^=2 then do; 1477 + print '***ERROR*** in call to SS subroutine.'; 1478 + print 'Value of LEVEL is' level; 1479 + print 'LEVEL must be 1 or 2. Execution terminated.'; 1480 + abort; 1481 + end; 1482 + 1483 + if printss ^= 0 & printss ^= 1 & printss ^= 2 then do; 1484 + print '***ERROR*** in call to SS subroutine.'; 1485 + print 'Value of PRINTSS is' printss; 1486 + print 'Value must be 0, 1, or 2. Execution terminated.'; 1487 + abort; 1488 + end; 1489 + 1490 + if printh ^= 0 & printh ^= 1 & printh ^= 2 then do; 1491 + print '***ERROR*** in call to SS subroutine.'; 1492 + print 'Value of PRINTH is' printh; 1493 + print 'PRINTH must be 0, 1, or 2. Execution terminated.'; 1494 + abort; 1495 + end; 1496 + 1497 + if type(y) = 'U' | type(XE) = 'U' | type(XR) = 'U' then do; 1498 + print '***ERROR*** in call to SS subroutine.'; 1499 + string = 'One or more of the matrices y, XE'; 1500 + string = string + ', and XR do not exist. You must '; 1501 + string = string + 'specify the three matrices before '; 1502 + string = string + 'calling the SS subroutine.'; 1503 + print string; 1504 + print 'Execution terminated.'; 1505 + abort; 1506 + end; 1507 + 1508 + n = nrow(y); 1509 + if nrow(XE) ^= n | nrow(XR) ^= n then do; 1510 + string = '***ERROR*** in call to SS subroutine. '; 1511 + string = string + 'Discrepancy found between the '; 1512 + string = string + 'number of rows in y, XE, and XR:'; 1513 + print string; 1514 + nrow_y = n; 1515 + nrow_XE = nrow(XE); 1516 + nrow_XR = nrow(XR); 1517 + print nrow_y nrow_XE nrow_XR; 1518 + print 'Execution terminated.'; 1519 + abort; 1520 + end; 1521 + 1522 + /* 1523 + Compute Searle's M1 as defined in his (76) on pages 263 and 1524 + 318 of his book (1987) as 1525 + 1526 + M1 = I(n) - XR * ginv(XR) 1527 + 1528 + where 1529 + 1530 + I(n) = the identity matrix (of dimension n x n with 1's 1531 + on the diagonal and zeros elsewhere). 
1532 + 1533 + Then compute the hypothesis matrix 1534 + 1535 + H = M1 * XE. 1536 + 1537 + Note that XE in this subroutine is equivalent to Searle's X2 1538 + and XR is equivalent to Searle's X1. 1539 + 1540 + The following statements perform the above arithmetic but 1541 + bypass the intermediate step of computing M1. The chosen 1542 + method depends on the value of LEVEL. 1543 + */ 1544 + 1545 + if level = 1 then 1546 + H = XE - XR * inv(XR` * XR) * XR` * XE; 1547 + else 1548 + H = XE - XR * ginv(XR) * XE; 1549 + 1550 + /* 1551 + Since they are no longer needed, erase XE and XR to conserve 1552 + memory. Note that erasing the two matrices here means that 1553 + after returning from this subroutine the values in the ma- 1554 + trices will not be available in the calling program. 1555 + 1556 + To conserve memory, other matrices are also erased below as 1557 + soon as they are no longer needed. 1558 + */ 1559 + 1560 + free XE XR; 1561 + 1562 + /* 1563 + Compute the projection matrix PM using the appropriate 1564 + method, as determined by LEVEL. 1565 + */ 1566 + 1567 + If level = 1 then 1568 + PM = H * inv(H` * H) * H`; 1569 + else 1570 + PM = H * ginv(H); 1571 + 1572 + if printh = 0 then free H; 1573 + 1574 + /* 1575 + Compute the projection of y. 1576 + */ 1577 + 1578 + yp = PM * y; 1579 + 1580 + if printss = 0 then free PM; 1581 + 1582 + /* 1583 + Compute the desired sum of squares as the squared length of 1584 + the projection yp. 1585 + */ 1586 + 1587 + result = ssq(yp); 1588 + 1589 + if printh = 0 then free yp; 1590 + 1591 + /* 1592 + If requested, print the computed sum of squares. 1593 + */ 1594 + 1595 + if printss = 1 then print result [format=25.15]; 1596 + if printss = 2 then print result [format=e23.]; 1597 + 1598 + /* 1599 + If requested, print the intermediate results: 1600 + - the hypothesis matrix 1601 + - the projection matrix 1602 + - the projection of y. 1603 + 1604 + Print the hypothesis matrix and the projection of y untrans- 1605 + posed or transposed, depending on the value of PRINTH. 1606 + */ 1607 + 1608 + if printh = 1 then do; 1609 + print , 1610 + 'Hypothesis matrix H = XE - XR * ginv(XR) * XE:', H; 1611 + print ,'Projection matrix PM = H * ginv(H):', PM; 1612 + print ,'Projection of y: yp = PM * y:', yp; 1613 + end; 1614 + 1615 + 1616 + if printh = 2 then do; 1617 + print , 1618 + 'Transpose of hypoth. matrix H = XE - XR * ginv(XR) * XE:', 1619 + (H`); 1620 + print ,'Projection matrix PM = H * ginv(H):', PM; 1621 + print ,'Transpose of projection of y: yp = PM * y:', 1622 + (yp`); 1623 + end; 1624 + 1625 + /* 1626 + If the argument PRINTSS requested printing of sums of 1627 + squares then, as a way of (indirectly) checking the accuracy 1628 + of the subroutine, perform checks on PM to see how close it 1629 + is to having the following properties: 1630 + - symmetric 1631 + - rows sum to zero 1632 + - idempotent. 1633 + */ 1634 + 1635 + if printss > 0 then do; 1636 + 1637 + /* 1638 + The checks are done by first computing the largest relative 1639 + difference between corresponding values in two matrices that 1640 + should be identical. The relative difference between two 1641 + matrices P and Q (called the "relative error") is defined as 1642 + 1643 + E = (P - Q) / P. 1644 + 1645 + Both the subtraction and the division in the above expres- 1646 + sion are done on an element-by-element basis. Thus E is it- 1647 + self a matrix with the same dimensions as P and Q. 
The sub- 1648 + routine computes L, the largest absolute value of the ele- 1649 + ments in E as a measure of the equality of the two matrices. 1650 + 1651 + The subroutine then converts L to a rough number of digits 1652 + of accuracy with the formula 1653 + 1654 + ndigits = -log10(L). 1655 + 1656 + First generate PMD, the divisor matrix for two of the rela- 1657 + tive errors. PMD is simply PM, except that we ensure that 1658 + none of the values in PMD is zero. 1659 + */ 1660 + 1661 + if all(PM ^= 0) then PMD = PM; 1662 + else do; 1663 + PMD = PM + ((PM = 0) * .6e-78); 1664 + n = sum(PM = 0); 1665 + string = 'elements of PM were zero. .6E-78 '; 1666 + string = string + 'was added to these elements '; 1667 + string = string + 'to avoid dividing by zero in '; 1668 + string = string + 'computing relative errors.'; 1669 + print n string; 1670 + end; 1671 + 1672 + /* 1673 + Second, set DECACC, which is the number of decimal digits of 1674 + accuracy carried in the computations. This value depends on 1675 + the computer operating system and language used to run the 1676 + subroutine. For my (Windows) computer the correct value 1677 + with SAS is 16. 1678 + */ 1679 + 1680 + decacc = 16; 1681 + 1682 + /* 1683 + Third, compute the maximum absolute relative difference be- 1684 + tween corresponding elements in PM and its transpose as a 1685 + measure of the symmetry of PM. 1686 + */ 1687 + 1688 + mrelerrS = max( abs( (PM - PM`)/PMD ) ); 1689 + if mrelerrS = 0 then ndigitsS = decacc; 1690 + else ndigitsS = -log10(mrelerrS); 1691 + 1692 + /* 1693 + Fourth, compute the maximum absolute relative error between 1694 + the sum of elements in a row of PM and zero; use the average 1695 + of the absolute values of the elements in the row as the di- 1696 + visor. (See page 47 of the IML manual for subscript reduc- 1697 + tion operators [SAS Institute Inc., 1989].) 1698 + */ 1699 + 1700 + D = abs(PM); 1701 + mrelerrR = max( abs( PM[,+] / D[,:] ) ); 1702 + if mrelerrR = 0 then ndigitsR = decacc; 1703 + else ndigitsR = -log10(mrelerrR); 1704 + 1705 + /* 1706 + Fifth, compute the maximum absolute relative difference be- 1707 + tween corresponding elements in PM and PM-squared as a meas- 1708 + ure of how close PM is to being idempotent. 1709 + */ 1710 + 1711 + mrelerrI = max( abs( (PM - (PM * PM)) / PMD ) ); 1712 + if mrelerrI = 0 then ndigitsI = decacc; 1713 + else ndigitsI = -log10(mrelerrI); 1714 + 1715 + /* 1716 + Print the computed numbers of digits of accuracy. 1717 + */ 1718 + 1719 + print ndigitsS [format=5.1] 1720 + ndigitsR [format=5.1] 1721 + ndigitsI [format=5.1]; 1722 + 1723 + end; 1723 + /* of if printss > 0 then do */ 1724 + 1725 + /* 1726 + 1727 + SUBROUTINE NOTE FOR USERS OF EBCDIC COMPUTER SYSTEMS 1728 + 1729 + The NOT operator (^) above in this subroutine is the correct 1730 + operator for ASCII systems. If the subroutine is run on an 1731 + EBCDIC system (e.g., an IBM mainframe), you may have to 1732 + change each occurrence of ^ to the EBCDIC logical NOT opera- 1733 + tor, which looks like a minus sign with a short vertical bar 1734 + dropping down from the right end (and which is EBCDIC hexa- 1735 + decimal 5F). 1736 + 1737 + 1738 + SUBROUTINE REFERENCES 1739 + 1740 + Aldrich, J. 1998. Doing least squares: Perspectives from 1741 + Gauss and Yule. _International Statistical Review_ 66, 1742 + 61-81. 1743 + 1744 + Graybill, F. A. 1983. _Matrices with Applications in 1745 + Statistics_ 2d ed. Belmont, CA: Wadsworth. 1746 + 1747 + Harville, D. A. 1997. 
_Matrix Algebra From a Statistician's 1748 + Perspective._ New York: Springer-Verlag. 1749 + 1750 + Hocking, R. R. 1985. _The Analysis of Linear Models._ 1751 + Monterey, CA: Brooks/Cole. 1752 + 1753 + Macnaughton, D. B. 1997. Which sums of squares are best in 1754 + unbalanced analysis of variance. Available at 1755 + http://www.matstat.com/ss/ 1756 + 1757 + Macnaughton, D. B. 1998a. PR0139.HTM: Computing numerator 1758 + sums of squares in unbalanced analysis of variance: Two- 1759 + way case. Available in the "Computer Programs" section 1760 + at http://www.matstat.com/ss/ 1761 + 1762 + Macnaughton, D. B. 1998b. PR0165.HTM: Computing numerator 1763 + sums of squares in unbalanced analysis of variance: 1764 + Three-way case. Available in the "Computer Programs" 1765 + section at http://www.matstat.com/ss/ 1766 + 1767 + SAS Institute Inc. 1989. _SAS/IML Software: Usage and Refer- 1768 + ence, Version 6, First Edition._ Cary, NC: author. 1769 + 1770 + Searle, S. R. 1987. _Linear Models for Unbalanced Data._ 1771 + New York: Wiley. 1772 + 1773 + Sunwoo, H., and B. C. Kim. 1997. Analysis of the unbalanced 1774 + linear model based on the balanced model. _Journal of 1775 + Statistical Computation and Simulation_ 56, 373-385. 1776 + 1777 + Yates, F. 1934. The analysis of multiple classifications 1778 + with unequal numbers in the different classes. _Journal 1779 + of the American Statistical Association_ 29, 51-66. 1780 + */ 1781 + 1782 + 1783 + finish SS; NOTE: Module SS defined. 1783 + /* end of subroutine SS 1784 + version of June 19, 1998 1785 + ***********************************************************/ NOTE: %INCLUDE (level 1) ending. 1786 1787 1788 /* 1789 PRELIMINARY STEP 6: 1790 SET THE VALUES OF THE THREE SECONDARY ARGUMENTS 1791 OF THE SS SUBROUTINE 1792 1793 The values of the three secondary arguments of the SS subrou- 1794 tine are set immediately below. These arguments control de- 1795 tails of how the subroutine computes sums of squares and prints 1796 results. The values set below are used on every call to the 1797 subroutine in this program. The values instruct the subroutine 1798 to 1799 - use the first method of computing sums of squares 1800 - print the value of the computed sum of squares and 1801 - print intermediate results (showing the transposed versions 1802 of the matrix H and the vector yp). 1803 */ 1804 1805 level = 1; 1806 printss = 1; 1807 printh = 2; 1808 1809 1810 /* 1811 COMPUTATION STEP 3: CALL SS 1812 1813 Now that we have made the SS subroutine known to IML, we can 1814 use it. Recall that we defined y, XE and XR above to allow us 1815 to compute the HTO sum of squares for the A (soil type) main 1816 effect in the Searle data. Thus we are now ready to actually 1817 call the SS subroutine to do the computation. But just before 1818 we do, note that Searle's exact answer for the HTO sum of 1819 squares for the A main effect (1987, 113, 114, 122 [typo]) is 1820 1821 83 127/141 = 83.90070 92198 58156 0 1822 1823 Here is the statement to call the SS subroutine to compute and 1824 print the HTO sum of squares for the A main effect: 1825 */ 1826 1827 call SS(result, y, XE, XR, level,printss,printh); RESULT 83.900709219858100 Transpose of hypoth. 
matrix H = XE - XR * ginv(XR) * XE: #TEM1001 1.1 1.1 1.1 .67 .67 1.2 1.2 -.9 -.9 -.9 -.9 -1 -.8 -.8 -.8 Projection matrix PM = H * ginv(H): PM .09 .09 .09 .05 .05 0.1 0.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 .09 .09 .09 .05 .05 0.1 0.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 .09 .09 .09 .05 .05 0.1 0.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 .05 .05 .05 .03 .03 .06 .06 0 0 0 0 -.1 0 0 0 .05 .05 .05 .03 .03 .06 .06 0 0 0 0 -.1 0 0 0 0.1 0.1 0.1 .06 .06 0.1 0.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 0.1 0.1 0.1 .06 .06 0.1 0.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 0 0 -.1 -.1 .05 .05 .05 .05 .08 .05 .05 .05 -.1 -.1 -.1 0 0 -.1 -.1 .05 .05 .05 .05 .08 .05 .05 .05 -.1 -.1 -.1 0 0 -.1 -.1 .05 .05 .05 .05 .08 .05 .05 .05 -.1 -.1 -.1 0 0 -.1 -.1 .05 .05 .05 .05 .08 .05 .05 .05 -.1 -.1 -.1 -.1 -.1 -.1 -.1 .08 .08 .08 .08 .12 .07 .07 .07 -.1 -.1 -.1 0 0 -.1 -.1 .05 .05 .05 .05 .07 .04 .04 .04 -.1 -.1 -.1 0 0 -.1 -.1 .05 .05 .05 .05 .07 .04 .04 .04 -.1 -.1 -.1 0 0 -.1 -.1 .05 .05 .05 .05 .07 .04 .04 .04 Transpose of projection of y: yp = PM * y: #TEM1001 -3 -3 -3 -2 -2 -3 -3 2.1 2.1 2.1 2.1 3.2 1.9 1.9 1.9 NDIGITSS NDIGITSR NDIGITSI 15.7 15.3 15.7 1828 /* 1829 The SS subroutine generated the twenty-seven printed lines 1830 above. First it printed the computed sum of squares, 83.9.... 1831 Note how this value is very close to Searle's exact answer: the 1832 difference occurs in the fifteenth significant digit (which, if 1833 it is rounded, should be 2 instead of 1). 1834 1835 Next, the subroutine printed the transposed version of the hy- 1836 pothesis matrix H that was used to compute the sum of squares. 1837 (This matrix was generated solely from XE and XR, which were, 1838 in turn, generated solely from the values of the two predictor 1839 variables. That is, the response variable, y, played no role 1840 in generating H.) 1841 1842 Next, the subroutine printed the projection matrix PM. This 1843 matrix was generated directly from the hypothesis matrix H and 1844 is simply a different way of stating the hypothesis being 1845 tested. 1846 1847 (The zeros shown above in PM are not true zeros but are only 1848 shown as zeros due to necessary rounding on the part of the SAS 1849 printing algorithm due to the narrow print fields.) 1850 1851 Note that PM is symmetric (i.e., the rows equal the columns) 1852 and the values in the matrix change in step with the changes in 1853 the treatment groups in the experiment, as defined by the dif- 1854 ferent values in the a and b vectors specified earlier. 1855 1856 Next, the subroutine printed the vector yp, which is the pro- 1857 jection of the vector y obtained when y is multiplied by PM. 1858 The computed sum of squares (i.e., 83.9...) was computed by 1859 summing the squared elements of yp. 1860 1861 To help us judge the accuracy of the computed sum of squares, 1862 the subroutine concluded by printing the three numbers 1863 NDIGITSS, NDIGITSR, and NDIGITSI. These numbers are measures 1864 of the integrity of the projection matrix -- the smallest of 1865 these numbers (i.e., 15.3) gives a rough indication of the 1866 ceiling of the number of digits of accuracy in the numbers in 1867 the projection matrix. Since IML maintains a precision of 1868 roughly sixteen floating point decimal digits on my (Windows) 1869 computer, the maximum possible number of digits of accuracy on 1870 my computer is 16. (For curious readers, I describe the three 1871 numbers further in the comments near the end of the SS subrou- 1872 tine.) 
1873 1874 Since the three measures of the integrity of the projection ma- 1875 trix suggest that the projection matrix is accurate to roughly 1876 fifteen significant digits, and since an accuracy of a sum of 1877 squares to four significant digits is fully adequate for com- 1878 puting a p-value, we can be confident that the computed sum of 1879 squares is sufficiently accurate. 1880 1881 1882 COMPUTE THE REMAINING SIX SUMS OF SQUARES 1883 1884 Following are steps to compute six other sums of squares for 1885 the Searle data. Note that each case requires only the follow- 1886 ing three lines of code: 1887 - a line to specify the effect being tested (via XE) 1888 - a line to specify the other terms in the two models (via XR) 1889 - a call to the subroutine (with the CALL statement). 1890 1891 In each case I first state the two model equations whose resid- 1892 ual sums of squares are being differenced. I state the models 1893 in an abbreviated form, omitting the subscripts on the terms 1894 and omitting the error term. 1895 1896 1897 SUM OF SQUARES FOR THE HTO B (SEED VARIETY) MAIN EFFECT 1898 1899 The two model equations for computing the HTO B sum of squares 1900 are 1901 y = m + a 1902 y = m + a + b. 1903 1904 Thus to compute the HTO B sum of squares we specify XE as 1905 BDESIGN -- the submatrix of the design matrix for main effect 1906 B. 1907 */ 1908 1909 XE = Bdesign; 1910 1911 /* 1912 We specify XR as the horizontal concatenation of J(15,1) and 1913 ADESIGN, which are the submatrices for m and a, which are the 1914 other terms (excluding the error term) on the right side of the 1915 two model equations. 1916 */ 1917 1918 XR = j(15,1) || Adesign; 1919 1920 /* 1921 Searle's exact answer for HTO B sum of squares (1987, 104, 113, 1922 114, 122) is 1923 1924 124 69/94 = 124.73404 25531 91489 4 1925 */ 1926 1927 call SS(result, y, XE, XR, level,printss,printh); RESULT 124.734042553191000 Transpose of hypoth. matrix H = XE - XR * ginv(XR) * XE: #TEM1001 .86 .86 .86 -.1 -.1 -1 -1 .88 .88 .88 .88 -.1 -1 -1 -1 0 0 0 1 1 -1 -1 .25 .25 .25 .25 1.3 -.8 -.8 -.8 Projection matrix PM = H * ginv(H): PM .09 .09 .09 -.1 -.1 0 0 .07 .07 .07 .07 -.1 -.1 -.1 -.1 .09 .09 .09 -.1 -.1 0 0 .07 .07 .07 .07 -.1 -.1 -.1 -.1 .09 .09 .09 -.1 -.1 0 0 .07 .07 .07 .07 -.1 -.1 -.1 -.1 -.1 -.1 -.1 .22 .22 -.1 -.1 0 0 0 0 .27 0 0 0 -.1 -.1 -.1 .22 .22 -.1 -.1 0 0 0 0 .27 0 0 0 0 0 0 -.1 -.1 .16 .16 -.1 -.1 -.1 -.1 -.1 .13 .13 .13 0 0 0 -.1 -.1 .16 .16 -.1 -.1 -.1 -.1 -.1 .13 .13 .13 .07 .07 .07 0 0 -.1 -.1 .07 .07 .07 .07 0 -.1 -.1 -.1 .07 .07 .07 0 0 -.1 -.1 .07 .07 .07 .07 0 -.1 -.1 -.1 .07 .07 .07 0 0 -.1 -.1 .07 .07 .07 .07 0 -.1 -.1 -.1 .07 .07 .07 0 0 -.1 -.1 .07 .07 .07 .07 0 -.1 -.1 -.1 -.1 -.1 -.1 .27 .27 -.1 -.1 0 0 0 0 .33 -.1 -.1 -.1 -.1 -.1 -.1 0 0 .13 .13 -.1 -.1 -.1 -.1 -.1 .12 .12 .12 -.1 -.1 -.1 0 0 .13 .13 -.1 -.1 -.1 -.1 -.1 .12 .12 .12 -.1 -.1 -.1 0 0 .13 .13 -.1 -.1 -.1 -.1 -.1 .12 .12 .12 Transpose of projection of y: yp = PM * y: #TEM1001 -3 -3 -3 5.1 5.1 -.9 -.9 -2 -2 -2 -2 6.1 .19 .19 .19 NDIGITSS NDIGITSR NDIGITSI 15.7 15.0 15.5 1928 1929 1930 /* 1931 SUM OF SQUARES FOR THE HTI (HIGHER-LEVEL TERMS INCLUDED) 1932 A MAIN EFFECT 1933 1934 HTI sums of squares operate by including higher-level interac- 1935 tion terms in the two model equations whose residual sums of 1936 squares are differenced. Thus the two model equations for com- 1937 puting the HTI A sum of squares are 1938 y = m + b + p 1939 y = m + a + b + p. 1940 Note the appearance of the p (interaction) term in both equa- 1941 tions. 
This term was omitted above in the computation of the 1942 two HTO sums of squares. Recall that the submatrix of the de- 1943 sign matrix for the interaction is ABdesign. 1944 1945 Searle's exact answer for this sum of squares (1987, 91) is 1946 1947 123 27/35 = 123.77142 85714 28571 4 1948 */ 1949 1950 XE = Adesign; 1951 XR = j(15,1) || Bdesign || ABdesign; 1952 call SS(result, y, XE, XR, level,printss,printh); RESULT 123.771428571428000 Transpose of hypoth. matrix H = XE - XR * ginv(XR) * XE: #TEM1001 .69 .69 .69 1 1 1 1 -.5 -.5 -.5 -.5 -2 -.7 -.7 -.7 Projection matrix PM = H * ginv(H): PM .04 .04 .04 .06 .06 .06 .06 0 0 0 0 -.1 0 0 0 .04 .04 .04 .06 .06 .06 .06 0 0 0 0 -.1 0 0 0 .04 .04 .04 .06 .06 .06 .06 0 0 0 0 -.1 0 0 0 .06 .06 .06 .09 .09 .09 .09 0 0 0 0 -.2 -.1 -.1 -.1 .06 .06 .06 .09 .09 .09 .09 0 0 0 0 -.2 -.1 -.1 -.1 .06 .06 .06 .09 .09 .09 .09 0 0 0 0 -.2 -.1 -.1 -.1 .06 .06 .06 .09 .09 .09 .09 0 0 0 0 -.2 -.1 -.1 -.1 0 0 0 0 0 0 0 .02 .02 .02 .02 .09 .03 .03 .03 0 0 0 0 0 0 0 .02 .02 .02 .02 .09 .03 .03 .03 0 0 0 0 0 0 0 .02 .02 .02 .02 .09 .03 .03 .03 0 0 0 0 0 0 0 .02 .02 .02 .02 .09 .03 .03 .03 -.1 -.1 -.1 -.2 -.2 -.2 -.2 .09 .09 .09 .09 .34 .11 .11 .11 0 0 0 -.1 -.1 -.1 -.1 .03 .03 .03 .03 .11 .04 .04 .04 0 0 0 -.1 -.1 -.1 -.1 .03 .03 .03 .03 .11 .04 .04 .04 0 0 0 -.1 -.1 -.1 -.1 .03 .03 .03 .03 .11 .04 .04 .04 Transpose of projection of y: yp = PM * y: #TEM1001 -2 -2 -2 -3 -3 -3 -3 1.6 1.6 1.6 1.6 6.5 2.2 2.2 2.2 NDIGITSS NDIGITSR NDIGITSI 16.0 16.0 15.7 1953 1954 1955 /* 1956 SUM OF SQUARES FOR THE HTI B MAIN EFFECT 1957 1958 The two models for computing the HTI B sum of squares are 1959 y = m + a + p 1960 y = m + a + b + p. 1961 1962 Searle does not give an exact answer for this sum of squares. 1963 The SAS Type III sum of squares for the B main effect is 1964 1965 192.12765 957 1966 */ 1967 1968 XE = Bdesign; 1969 XR = j(15,1) || Adesign || ABdesign; 1970 call SS(result, y, XE, XR, level,printss,printh); RESULT 192.127659574468000 Transpose of hypoth. matrix H = XE - XR * ginv(XR) * XE: #TEM1001 .98 .98 .98 -.1 -.1 -1 -1 .73 .73 .73 .73 -.2 -.9 -.9 -.9 .17 .17 .17 .77 .77 -1 -1 .13 .13 .13 .13 1.5 -.7 -.7 -.7 Projection matrix PM = H * ginv(H): PM 0.1 0.1 0.1 -.1 -.1 -.1 -.1 .07 .07 .07 .07 -.1 -.1 -.1 -.1 0.1 0.1 0.1 -.1 -.1 -.1 -.1 .07 .07 .07 .07 -.1 -.1 -.1 -.1 0.1 0.1 0.1 -.1 -.1 -.1 -.1 .07 .07 .07 .07 -.1 -.1 -.1 -.1 -.1 -.1 -.1 .14 .14 -.1 -.1 0 0 0 0 .27 0 0 0 -.1 -.1 -.1 .14 .14 -.1 -.1 0 0 0 0 .27 0 0 0 -.1 -.1 -.1 -.1 -.1 0.2 0.2 -.1 -.1 -.1 -.1 -.1 .13 .13 .13 -.1 -.1 -.1 -.1 -.1 0.2 0.2 -.1 -.1 -.1 -.1 -.1 .13 .13 .13 .07 .07 .07 0 0 -.1 -.1 .06 .06 .06 .06 -.1 0 0 0 .07 .07 .07 0 0 -.1 -.1 .06 .06 .06 .06 -.1 0 0 0 .07 .07 .07 0 0 -.1 -.1 .06 .06 .06 .06 -.1 0 0 0 .07 .07 .07 0 0 -.1 -.1 .06 .06 .06 .06 -.1 0 0 0 -.1 -.1 -.1 .27 .27 -.1 -.1 -.1 -.1 -.1 -.1 .54 -.1 -.1 -.1 -.1 -.1 -.1 0 0 .13 .13 0 0 0 0 -.1 .09 .09 .09 -.1 -.1 -.1 0 0 .13 .13 0 0 0 0 -.1 .09 .09 .09 -.1 -.1 -.1 0 0 .13 .13 0 0 0 0 -.1 .09 .09 .09 Transpose of projection of y: yp = PM * y: #TEM1001 -3 -3 -3 4.8 4.8 .16 .16 -2 -2 -2 -2 9.5 .11 .11 .11 NDIGITSS NDIGITSR NDIGITSI 15.6 14.9 15.4 1971 1972 1973 /* 1974 SUM OF SQUARES FOR THE SEQUENTIAL A MAIN EFFECT 1975 WHEN A IS ENTERED FIRST 1976 1977 The two models for computing the sequential A sum of squares 1978 when A is entered first are 1979 y = m 1980 y = m + a. 1981 1982 (When A is entered SECOND [i.e., after B], the sequential sum 1983 of squares for A is the same as the HTO sum of squares for A, 1984 which is computed above.) 
1985 1986 Searle's exact answer for this sum of squares (1987, 93, 113, 1987 114, 122) is 1988 1989 52 1/2 1990 */ 1991 1992 XE = Adesign; 1993 XR = j(15,1); 1994 call SS(result, y, XE, XR, level,printss,printh); RESULT 52.500000000000000 Transpose of hypoth. matrix H = XE - XR * ginv(XR) * XE: #TEM1001 1.1 1.1 1.1 1.1 1.1 1.1 1.1 -.9 -.9 -.9 -.9 -.9 -.9 -.9 -.9 Projection matrix PM = H * ginv(H): PM .08 .08 .08 .08 .08 .08 .08 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 .08 .08 .08 .08 .08 .08 .08 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 .08 .08 .08 .08 .08 .08 .08 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 .08 .08 .08 .08 .08 .08 .08 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 .08 .08 .08 .08 .08 .08 .08 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 .08 .08 .08 .08 .08 .08 .08 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 .08 .08 .08 .08 .08 .08 .08 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 .06 .06 .06 .06 .06 .06 .06 .06 -.1 -.1 -.1 -.1 -.1 -.1 -.1 .06 .06 .06 .06 .06 .06 .06 .06 -.1 -.1 -.1 -.1 -.1 -.1 -.1 .06 .06 .06 .06 .06 .06 .06 .06 -.1 -.1 -.1 -.1 -.1 -.1 -.1 .06 .06 .06 .06 .06 .06 .06 .06 -.1 -.1 -.1 -.1 -.1 -.1 -.1 .06 .06 .06 .06 .06 .06 .06 .06 -.1 -.1 -.1 -.1 -.1 -.1 -.1 .06 .06 .06 .06 .06 .06 .06 .06 -.1 -.1 -.1 -.1 -.1 -.1 -.1 .06 .06 .06 .06 .06 .06 .06 .06 -.1 -.1 -.1 -.1 -.1 -.1 -.1 .06 .06 .06 .06 .06 .06 .06 .06 Transpose of projection of y: yp = PM * y: #TEM1001 -2 -2 -2 -2 -2 -2 -2 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8 NDIGITSS NDIGITSR NDIGITSI 16.0 15.2 16.0 1995 1996 1997 /* 1998 SUM OF SQUARES FOR THE SEQUENTIAL B MAIN EFFECT 1999 WHEN B IS ENTERED FIRST 2000 2001 The two models for computing the sequential B sum of squares 2002 when B is entered first are 2003 y = m 2004 y = m + b. 2005 2006 (When B is entered SECOND [i.e., after A], the sequential sum 2007 of squares for B is the same as the HTO sum of squares for B, 2008 which is computed above.) 2009 2010 Searle's exact answer for this sum of squares (1987, 95, 113, 2011 114, 122 [typo]) is 2012 2013 93 1/3 2014 */ 2015 2016 XE = Bdesign; 2017 XR = j(15,1); 2018 call SS(result, y, XE, XR, level,printss,printh); RESULT 93.333333333333300 Transpose of hypoth. matrix H = XE - XR * ginv(XR) * XE: #TEM1001 .87 .87 .87 -.1 -.1 -1 -1 .87 .87 .87 .87 -.1 -1 -1 -1 .13 .13 .13 1.1 1.1 -.9 -.9 .13 .13 .13 .13 1.1 -.9 -.9 -.9 Projection matrix PM = H * ginv(H): PM .08 .08 .08 -.1 -.1 -.1 -.1 .08 .08 .08 .08 -.1 -.1 -.1 -.1 .08 .08 .08 -.1 -.1 -.1 -.1 .08 .08 .08 .08 -.1 -.1 -.1 -.1 .08 .08 .08 -.1 -.1 -.1 -.1 .08 .08 .08 .08 -.1 -.1 -.1 -.1 -.1 -.1 -.1 .27 .27 -.1 -.1 -.1 -.1 -.1 -.1 .27 -.1 -.1 -.1 -.1 -.1 -.1 .27 .27 -.1 -.1 -.1 -.1 -.1 -.1 .27 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 .13 .13 -.1 -.1 -.1 -.1 -.1 .13 .13 .13 -.1 -.1 -.1 -.1 -.1 .13 .13 -.1 -.1 -.1 -.1 -.1 .13 .13 .13 .08 .08 .08 -.1 -.1 -.1 -.1 .08 .08 .08 .08 -.1 -.1 -.1 -.1 .08 .08 .08 -.1 -.1 -.1 -.1 .08 .08 .08 .08 -.1 -.1 -.1 -.1 .08 .08 .08 -.1 -.1 -.1 -.1 .08 .08 .08 .08 -.1 -.1 -.1 -.1 .08 .08 .08 -.1 -.1 -.1 -.1 .08 .08 .08 .08 -.1 -.1 -.1 -.1 -.1 -.1 -.1 .27 .27 -.1 -.1 -.1 -.1 -.1 -.1 .27 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 .13 .13 -.1 -.1 -.1 -.1 -.1 .13 .13 .13 -.1 -.1 -.1 -.1 -.1 .13 .13 -.1 -.1 -.1 -.1 -.1 .13 .13 .13 -.1 -.1 -.1 -.1 -.1 .13 .13 -.1 -.1 -.1 -.1 -.1 .13 .13 .13 Transpose of projection of y: yp = PM * y: #TEM1001 -2 -2 -2 4.7 4.7 0 0 -2 -2 -2 -2 4.7 0 0 0 NDIGITSS NDIGITSR NDIGITSI 15.4 15.1 15.4 2019 2020 2021 /* 2022 SUM OF SQUARES FOR THE A x B INTERACTION EFFECT 2023 2024 The two models for computing the A x B interaction sum of 2025 squares are 2026 y = m + a + b 2027 y = m + a + b + p. 
2028 2029 Note that the sum of squares for the highest-level interaction 2030 in an experiment does not differ from one standard approach to 2031 computing analysis of variance sums of squares to the next. 2032 2033 Searle's exact answer for this sum of squares (1987, 113, 114) 2034 is 2035 2036 222 36/47 = 222.76595 74468 08510 6 2037 */ 2038 2039 XE = ABdesign; 2040 XR = j(15,1) || Adesign || Bdesign; 2041 call SS(result, y, XE, XR, level,printss,printh); RESULT 222.765957446808000 Transpose of hypoth. matrix H = XE - XR * ginv(XR) * XE: #TEM1001 .98 .98 .98 -.1 -.1 -1 -1 -.7 -.7 -.7 -.7 .19 .91 .91 .91 .17 .17 .17 .77 .77 -1 -1 -.1 -.1 -.1 -.1 -2 .68 .68 .68 Projection matrix PM = H * ginv(H): PM 0.1 0.1 0.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 .11 .06 .06 .06 0.1 0.1 0.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 .11 .06 .06 .06 0.1 0.1 0.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 .11 .06 .06 .06 -.1 -.1 -.1 .14 .14 -.1 -.1 .04 .04 .04 .04 -.3 .04 .04 .04 -.1 -.1 -.1 .14 .14 -.1 -.1 .04 .04 .04 .04 -.3 .04 .04 .04 -.1 -.1 -.1 -.1 -.1 0.2 0.2 .07 .07 .07 .07 .11 -.1 -.1 -.1 -.1 -.1 -.1 -.1 -.1 0.2 0.2 .07 .07 .07 .07 .11 -.1 -.1 -.1 -.1 -.1 -.1 .04 .04 .07 .07 .06 .06 .06 .06 -.1 0 0 0 -.1 -.1 -.1 .04 .04 .07 .07 .06 .06 .06 .06 -.1 0 0 0 -.1 -.1 -.1 .04 .04 .07 .07 .06 .06 .06 .06 -.1 0 0 0 -.1 -.1 -.1 .04 .04 .07 .07 .06 .06 .06 .06 -.1 0 0 0 .11 .11 .11 -.3 -.3 .11 .11 -.1 -.1 -.1 -.1 .54 -.1 -.1 -.1 .06 .06 .06 .04 .04 -.1 -.1 0 0 0 0 -.1 .09 .09 .09 .06 .06 .06 .04 .04 -.1 -.1 0 0 0 0 -.1 .09 .09 .09 .06 .06 .06 .04 .04 -.1 -.1 0 0 0 0 -.1 .09 .09 .09 Transpose of projection of y: yp = PM * y: #TEM1001 -1 -1 -1 -4 -4 5.9 5.9 .93 .93 .93 .93 8.1 -4 -4 -4 NDIGITSS NDIGITSR NDIGITSI 15.7 15.2 15.6 2042 2043 2044 /* 2045 SAVE THE DATA IN A SAS DATASET 2046 2047 The next three IML statements create a SAS dataset (called 2048 "Searle_1") and then transfer the values of vectors a, b, and y 2049 to variables with the same names in the dataset. This enables 2050 us to compute all the sums of squares for the Searle data with 2051 SAS GLM. 2052 */ 2053 2054 create Searle_1 var {a b y}; 2055 append; 2056 close Searle_1; NOTE: The data set WORK.SEARLE_1 has 15 observations and 3 variables. 2057 2058 2059 /* 2060 QUIT FROM IML 2061 */ 2062 2063 quit; Exiting IML. NOTE: The PROCEDURE IML used 27.57 seconds. 2064 2065 2066 /* 2067 RUN PROC GLM TWICE 2068 2069 The following statements run SAS GLM on the data to enable com- 2070 paring the above IML output with output from an analysis of the 2071 data by GLM. (The output from GLM comes in a separate output 2072 file.)
(Note: The output appears below after the end of this log.)
2073 2074 Examination of the GLM output reveals that all the GLM sums of 2075 squares are identical in all available digits to the sums of 2076 squares produced by the SS subroutine. 2077 */ 2078 2079 title 'IML/GLM 2 x 3 Unbalanced ANOVA, Searle (1987, 79)'; 2080 options nodate linesize=80 probsig=2; 2081 2082 proc glm data=Searle_1; 2083 class a b; 2084 model y = a | b / ss1 ss2 ss3; 2085 quit; NOTE: The PROCEDURE GLM used 3.06 seconds. 2086 2087 /* 2088 Run GLM a second time with a and b reversed in the model state- 2089 ment to get the sequential (Type I) sum of squares for B when B 2090 is entered first into the model. 2091 */ 2092 2093 proc glm data=Searle_1; 2094 class a b; 2095 model y = b | a / ss1 ss2 ss3; 2096 quit; NOTE: The PROCEDURE GLM used 1.32 seconds. 2097 2098 options date linesize=80 probsig=2; 2099 title; 2100 2101 2102 /* 2103 SUMMARY 2104 2105 This program illustrates three approaches to computing numera- 2106 tor sums of squares in unbalanced analysis of variance: HTO, 2107 HTI, and sequential. The three approaches are specified in 2108 terms of computing the difference between the residual sums of 2109 squares of two overparameterized model equations. The two mod- 2110 els are specified in terms of two matrices, XE and XR, which 2111 are made up of submatrices of the overall (full-column-rank) 2112 design matrix for the experiment. XE is the submatrix for the 2113 effect being tested. XR is the horizontal concatenation of the 2114 submatrices for all the other terms (except the error term) on 2115 the right side of the two model equations. 2116 2117 Note the conceptual economy of the approach: After the data 2118 are known to IML, the submatrices of the design matrix can be 2119 specified with one simple statement per submatrix (using the 2120 DESIGNF and HDIR functions). It takes just two more statements 2121 to specify (via XE and XR) a particular sum of squares to be 2122 computed. And it takes only four more reusable statements (in 2123 the SS subroutine) to compute any standard numerator sum of 2124 squares. 2125 2126 2127 NOTES 2128 2129 If you wish to run this program on your computer, see the 2130 checklist in the appendix. 2131 2132 This program illustrates computation of numerator sums of 2133 squares in the two-way case -- i.e., the case with two dis- 2134 crete-valued predictor variables. I discuss five approaches to 2135 computing analysis of variance numerator sums of squares in the 2136 three-way case in another program (1998). 2137 2138 I discuss in the paper (1997, sec. 17) the fact that statisti- 2139 cal tests provided by the HTO sums of squares are (in the rele- 2140 vant cases) generally more powerful than the corresponding 2141 tests provided by the HTI sums of squares. This implies that 2142 if there is a relationship, an HTO sum of squares will usually 2143 be greater than the corresponding HTI sum of squares. However, 2144 Searle's data twice illustrate the fact that an HTO sum of 2145 squares is not *always* greater than the corresponding HTI sum 2146 of squares. Specifically, the HTO (SAS Type II) sum of squares 2147 for the A main effect in Searle's data is less than the HTI 2148 (SAS Type III) sum of squares for the same effect (83.9 versus 2149 123.8). Similarly, the HTO sum of squares for the B main ef- 2150 fect in Searle's data is also less than the HTI sum of squares 2151 (124.7 versus 192.1). 2152 2153 2154 APPENDIX: STEPS TO RUN THIS PROGRAM 2155 2156 1. 
Ensure that the STAT and IML components of the SAS system 2157 are available on your computer. Information about the SAS 2158 system is available at http://www.sas.com 2159 2160 2. Ensure that you have the source version of this program, 2161 which is called PR0139.SAS (not the HTML version, which is 2162 called PR0139.HTM). You can obtain a copy of the source 2163 version in the "Computer Programs" section of the page at 2164 http://www.matstat.com/ss/ 2165 2166 3. Install a copy of the SS subroutine on your computer. This 2167 subroutine does the actual computations of sums of squares 2168 and is available from the above MatStat web page. 2169 2170 4. Edit the %INCLUDE statement in preliminary step 5 above to 2171 correctly point to the location of the SS.SAS subroutine 2172 file on your computer. That is, change the 2173 D:\PROGS\SS.SAS 2174 in the statement to the location where SS.SAS is stored on 2175 your computer. 2176 2177 5. (Optional.) Modify the two OPTIONS statements in the pro- 2178 gram that adjust the DATE, LINESIZE, and PROBSIG options. 2179 2180 6. Submit the program to SAS. 2181 2182 2183 REFERENCES 2184 2185 Fisher, R. A. 1925. _Statistical Methods for Research Workers._ 2186 Edinburgh: Oliver and Boyd. The 14th edition of this semi- 2187 nal work appears in Fisher (1990). 2188 2189 Fisher, R. A. 1935. _The Design of Experiments._ Edinburgh: 2190 Oliver and Boyd. The 8th edition of this seminal work ap- 2191 pears in Fisher (1990). 2192 2193 Fisher, R. A. 1990. _Statistical Methods, Experimental Design, 2194 and Scientific Inference_ edited by J. H. Bennett. Oxford: 2195 Oxford University Press. 2196 2197 Macnaughton, D. B. 1996a. The introductory statistics course: 2198 A new approach. Available at http://www.matstat.com/teach/ 2199 2200 Macnaughton, D. B. 1996b. The entity-property-relationship ap- 2201 proach to statistics: An introduction for students. Avail- 2202 able at http://www.matstat.com/teach/ 2203 2204 Macnaughton, D. B. 1997. Which sums of squares are best in un- 2205 balanced analysis of variance? Available at 2206 http://www.matstat.com/ss/ 2207 2208 Macnaughton, D. B. 1998. PR0165.HTM: Computing numerator sums 2209 of squares in unbalanced analysis of variance: Three-way 2210 case. Available in the "Computer Programs" section at 2211 http://www.matstat.com/ss/ 2212 2213 Searle, S. R. 1987. _Linear Models for Unbalanced Data._ New 2214 York: Wiley. 2215 2216 Yates, F. 1934. The analysis of multiple classifications with 2217 unequal numbers in the different classes. _Journal of the 2218 American Statistical Association_ 29, 51-66. 2219 2220 version of June 19, 1998 2221 (end of program pr0139.sas) */
This is the end of the program log for the run of the program.
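The SUMMARY in the program comments above notes that, once the data are known to IML, each submatrix of the design matrix takes one statement (using the DESIGNF and HDIR functions) and each sum of squares takes only a few statements more. The following sketch (not part of PR0139.SAS) collects those steps in one place for the A x B interaction sum of squares. It assumes the vectors a, b, and y and the SS subroutine have been made available to the IML session as in the program above; the DESIGNF and HDIR statements shown are a plausible form of the statements the program itself uses to build the submatrices.

   Adesign  = designf(a);               /* full-rank submatrix for main effect A   */
   Bdesign  = designf(b);               /* full-rank submatrix for main effect B   */
   ABdesign = hdir(Adesign, Bdesign);   /* submatrix for the A x B interaction     */
   XE = ABdesign;                       /* effect being tested                     */
   XR = j(15,1) || Adesign || Bdesign;  /* other terms in the two models: m, a, b  */
   level = 1; printss = 1; printh = 0;
   call SS(result, y, XE, XR, level, printss, printh);  /* prints 222.7659...      */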
Following are the three analysis of variance tables that were output on the first of the two runs of PROC GLM:
Source    DF    Type I SS       Mean Square     F Value   Pr > F
A          1     52.50000000     52.50000000       3.94   .07851
B          2    124.73404255     62.36702128       4.68   .04048
A*B        2    222.76595745    111.38297872       8.35   .00889

Source    DF    Type II SS      Mean Square     F Value   Pr > F
A          1     83.90070922     83.90070922       6.29   .03339
B          2    124.73404255     62.36702128       4.68   .04048
A*B        2    222.76595745    111.38297872       8.35   .00889

Source    DF    Type III SS     Mean Square     F Value   Pr > F
A          1    123.77142857    123.77142857       9.28   .01386
B          2    192.12765957     96.06382979       7.20   .01355
A*B        2    222.76595745    111.38297872       8.35   .00889
Note that all the numbers given above in the "SS" column are identical in all available digits to the corresponding sums of squares computed earlier in the program by PROC IML.
(As most readers will know, the numbers in the "Mean Square" column are computed by dividing the sums of squares by their degrees of freedom. The F values are computed by dividing the mean squares by the residual mean square, which SAS gives earlier [not shown] as 13.33333333. The p-values, which SAS labels "Pr > F", are computed from the F values using the "cumulative distribution function" of the "central F-distribution": each p-value is the upper-tail area, that is, one minus the value of the cumulative distribution function at the observed F value. [A short sketch after this parenthetical note illustrates the arithmetic.] Assuming there is no reasonable alternative explanation for the results, the low p-value for the A × B interaction [i.e., .00889 in all three tables] provides good evidence that [1] a relationship exists between the response variable and predictor variables A and B and [2] the relationship is a two-way interaction.
(After discovering the interactive relationship, the researcher's next step is to study a graph of the mean values of the response variable for the six treatment groups to gain an understanding of the relationship. By convention, the response variable is plotted on the vertical axis of the graph, one of the predictor variables is plotted on the horizontal axis, and the values of the other predictor variable are reflected by different lines (or different-shaped points) on the graph. "Standard error of the mean" bars are sometimes plotted on such graphs to give the reader a sense of the stability of the mean value of the response variable in each of the treatment groups [cells].)
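The following few IML statements, which are not part of PR0139.SAS, show the arithmetic just described for the A*B line of the tables. The sum of squares and the residual mean square of 13.33333333 are taken from the output quoted above; the 9 residual degrees of freedom are the 15 observations minus the 6 cell means.

   proc iml;
      ss  = 222.76595745;          /* A*B interaction sum of squares   */
      df  = 2;                     /* A*B degrees of freedom           */
      mse = 13.33333333;           /* residual mean square (9 df)      */
      ms  = ss / df;               /* mean square = 111.38297872...    */
      f   = ms / mse;              /* F value     = 8.35...            */
      p   = 1 - probf(f, df, 9);   /* Pr > F      = .00889...          */
      print ms f p;
   quit;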
Note the relationship between Type I and Type II sums of squares: the Type I sum of squares for B in the above tables is the same as the Type II sum of squares for B, but the Type I sum of squares for A is different from the Type II sum of squares for A. This is because B was entered after A in the MODEL statement, so its Type I (sequential) sum of squares is adjusted for A, which in the two-way case is exactly the Type II (HTO) computation for B; A, on the other hand, was entered first, so its Type I sum of squares is unadjusted for B.
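To connect these tables back to the Yates characterization of a sum of squares as a difference between residual sums of squares, the following IML sketch (not part of PR0139.SAS) recomputes the Type II (HTO) sum of squares for A directly from the two model equations y = m + b and y = m + a + b. It assumes the vector y and the submatrices Adesign and Bdesign defined earlier in the program are still available in the IML session.

   X1 = j(15,1) || Bdesign;             /* design matrix for  y = m + b        */
   X2 = j(15,1) || Adesign || Bdesign;  /* design matrix for  y = m + a + b    */
   rss1 = ssq(y - X1 * ginv(X1) * y);   /* residual SS of the smaller model    */
   rss2 = ssq(y - X2 * ginv(X2) * y);   /* residual SS of the larger model     */
   ss_hto_a = rss1 - rss2;              /* difference = 83.9007..., the HTO    */
   print ss_hto_a [format=25.15];       /* (Type II) sum of squares for A      */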
Following are the three analysis of variance tables that were output on the second of the two runs of PROC GLM. This run had the same specifications as the first run except that the order of A and B was reversed in the MODEL statement. Comparison of the GLM output below with the GLM output above reveals that the only new value in the sums of squares columns is the Type I value for B (i.e., 93.3...).
Source    DF    Type I SS       Mean Square     F Value   Pr > F
B          2     93.33333333     46.66666667       3.50   .07508
A          1     83.90070922     83.90070922       6.29   .03339
A*B        2    222.76595745    111.38297872       8.35   .00889

Source    DF    Type II SS      Mean Square     F Value   Pr > F
B          2    124.73404255     62.36702128       4.68   .04048
A          1     83.90070922     83.90070922       6.29   .03339
A*B        2    222.76595745    111.38297872       8.35   .00889

Source    DF    Type III SS     Mean Square     F Value   Pr > F
B          2    192.12765957     96.06382979       7.20   .01355
A          1    123.77142857    123.77142857       9.28   .01386
A*B        2    222.76595745    111.38297872       8.35   .00889
This is the end of the output from PR0139.SAS.