Experimental methods: A review
(Source: Tombaugh & Dillon's "A Practical Introduction to Experimental Design in CHI Research", 1993)

Empirical methods in HCI
- Heuristic evaluation
- Controlled laboratory experiments
- Quasi-experiments
- Ethnographic observation
- Task analysis
- User studies/user testing

Experiment design is similar to user interface design
- Iterative: conceptualize, pilot, improve, run
- The perfect experiment does not exist!
- Experiments involve tradeoffs:
  - Cost vs. running the ideal experiment
  - Saving time vs. running the ideal experiment
  - Sometimes there are conflicting guidelines

What IS an experiment?
- All experiments have control conditions (comparisons are made)
- Hypotheses are tested
- Assignment is random (if assignment is not random, then it's a quasi-experiment)
- In a successful experiment, conclusions about causality can be made (no bias or confounds)
- It can be replicated!!

Criteria for experiment design
- External validity
- Internal validity
- Reliability

Overview of experiment design
1. Determine the research problem
2. Consider validity, etc.
3. Design the experiment
4. Pilot!!
5. Is the design OK? NO: go back and revise the design. YES: collect data.

Choices in experiment design
- Variables (what to manipulate and measure)
- Design (within-Ss, between-Ss, mixed)
- Controls (what will be compared)
- Sample (how subjects will be chosen)
- Task (many considerations)
- Stimuli

Variables
- IV: independent variable (what is manipulated; also called a factor or treatment)
- DV: dependent variable (what is measured)
- Extraneous variable: any variable, other than the IV, that might affect the DV
- Confound: an extraneous variable that covaries with the IV

Confounds
Example from handout:
Compare Screen A vs. Screen B. Screen A is used in a room with windows; Screen B is used in a room without windows. What can you conclude if performance is better with Screen B than with Screen A?
Another example: the dreaded end-of-the-semester effect.

Between-subjects designs
- Each person is tested in one condition
- Avoiding confounds: subjects are randomly assigned to conditions
- If individual differences are likely to be important, subjects can be matched on important characteristics

Within-subjects designs (also known as repeated measures)
- Each person is tested in all conditions
- This avoids the effects of individual differences!
- The order of conditions is randomized or counterbalanced
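The assignment procedures described above (random assignment and matching for between-subjects designs, a randomized order of conditions for within-subjects designs) can be sketched in Python. This is a minimal illustration, not a procedure taken from Tombaugh & Dillon: the function names, the "command"/"menu" conditions (borrowed from the dialog example later in these notes) and the matching variable (a hypothetical experience score) are all assumptions. Passing a seed makes an assignment reproducible, which helps when documenting a study.

```python
import random


def random_assignment(subjects, conditions, seed=None):
    """Between-Ss: shuffle subjects, then deal them out to conditions
    round-robin so group sizes differ by at most one."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    return {s: conditions[i % len(conditions)] for i, s in enumerate(shuffled)}


def matched_assignment(scores, conditions, seed=None):
    """Between-Ss with matching: rank subjects on a characteristic, cut the
    ranking into blocks of len(conditions) similar subjects, and randomly
    assign the conditions within each block."""
    rng = random.Random(seed)
    ranked = sorted(scores, key=scores.get)
    k = len(conditions)
    assignment = {}
    for start in range(0, len(ranked), k):
        block = ranked[start:start + k]
        shuffled_conditions = list(conditions)
        rng.shuffle(shuffled_conditions)
        # zip stops early if the last block is incomplete
        for subject, condition in zip(block, shuffled_conditions):
            assignment[subject] = condition
    return assignment


def random_orders(subjects, conditions, seed=None):
    """Within-Ss: every subject gets every condition, in a random order."""
    rng = random.Random(seed)
    orders = {}
    for s in subjects:
        order = list(conditions)
        rng.shuffle(order)
        orders[s] = order
    return orders


subjects = [f"S{i}" for i in range(20)]
# Hypothetical matching variable, e.g. years of computer experience.
experience = {s: i % 7 for i, s in enumerate(subjects)}
groups = random_assignment(subjects, ["command", "menu"], seed=1)
matched = matched_assignment(experience, ["command", "menu"], seed=2)
orders = random_orders(subjects, ["command", "menu"], seed=3)
```

Matching only balances the characteristic you match on; random assignment is still what protects against the confounds you did not think of.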
- But you can get unexpected order effects!

Advantages of within-Ss designs
- Fewer subjects are needed
- Statistical tests are more powerful
- They control for individual differences
- They are the best way to study learning, forgetting, or the effects of expertise (longitudinal designs)
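Why are within-Ss tests more powerful? Each subject serves as their own control, so stable differences between people cancel out of the per-subject differences. The sketch below simulates this with made-up numbers (people differ far more than trials do, and condition A is truly 2 time units slower) and compares a pooled two-sample t statistic on a between-Ss experiment with 60 people against a paired t statistic on a within-Ss experiment with 30 people. The parameters are hypothetical, and the simulation assumes there are no order effects, which a real within-Ss study must control by counterbalancing.

```python
import math
import random
import statistics


def independent_t(a, b):
    """Pooled-variance two-sample t statistic (between-Ss analysis)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / se


def paired_t(a, b):
    """Paired t statistic on per-subject differences (within-Ss analysis)."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))


# Hypothetical numbers: large individual differences (PERSON_SD) relative to
# trial-to-trial noise (NOISE_SD); condition A is truly EFFECT units slower.
N, EFFECT, PERSON_SD, NOISE_SD = 30, 2.0, 10.0, 1.0
rng = random.Random(42)

# Between-Ss: 2 * N different people, each tested in one condition.
group_a = [rng.gauss(50, PERSON_SD) + EFFECT + rng.gauss(0, NOISE_SD) for _ in range(N)]
group_b = [rng.gauss(50, PERSON_SD) + rng.gauss(0, NOISE_SD) for _ in range(N)]
between_t = independent_t(group_a, group_b)

# Within-Ss: N people tested in both conditions; each person's baseline
# speed cancels out of their A - B difference.
baseline = [rng.gauss(50, PERSON_SD) for _ in range(N)]
cond_a = [b + EFFECT + rng.gauss(0, NOISE_SD) for b in baseline]
cond_b = [b + rng.gauss(0, NOISE_SD) for b in baseline]
within_t = paired_t(cond_a, cond_b)

print(f"between-Ss t = {between_t:.2f} with {2 * N} subjects")
print(f"within-Ss  t = {within_t:.2f} with {N} subjects")
```

With half as many people, the within-Ss statistic is typically several times larger, because the large person-to-person variance never enters the paired test.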
Disadvantages of within-Ss designs
- Order effects can ruin the results (practice, fatigue, learning, boredom); counterbalancing is necessary!
- More testing materials are required.
- The order of presentation of materials must be controlled (counterbalanced).
- It can be difficult to get subjects to return for repeated testing.
(Between-Ss designs are an alternative.)
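One common way to counterbalance the order of conditions is a balanced Latin square. The source does not name a specific scheme, so treat this as one reasonable choice among several. In the square, every condition appears once in each serial position, and every condition immediately follows every other condition equally often, so practice and fatigue effects are spread evenly over conditions. The function names are illustrative.

```python
def balanced_latin_square(n):
    """Return orders (lists of condition indices) for n conditions.

    Each condition appears once in every position, and each condition
    immediately follows every other one equally often. For odd n this
    needs 2 * n orders (each row plus its reverse)."""
    first = [0]
    low, high = 1, n - 1
    take_low = True
    while len(first) < n:  # first row: 0, 1, n-1, 2, n-2, ...
        if take_low:
            first.append(low)
            low += 1
        else:
            first.append(high)
            high -= 1
        take_low = not take_low
    rows = [[(c + r) % n for c in first] for r in range(n)]
    if n % 2 == 1:
        rows += [list(reversed(row)) for row in rows]
    return rows


def orders_for_subjects(conditions, n_subjects):
    """Cycle subjects through the square's rows (use a multiple of the row count)."""
    rows = balanced_latin_square(len(conditions))
    return [[conditions[i] for i in rows[s % len(rows)]] for s in range(n_subjects)]


for order in orders_for_subjects(["A", "B", "C", "D"], 4):
    print(" -> ".join(order))
# A -> B -> D -> C
# B -> C -> A -> D
# C -> D -> B -> A
# D -> A -> C -> B
```

With only two conditions, such as commands vs. menus, the square reduces to the familiar AB/BA counterbalancing.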
Mixed designs
- Combine between- and within-subjects comparisons
- One or more comparisons are between two groups of different people
- One or more comparisons are within the same group of people

What is an interaction?
Example: Which type of dialog is better, commands or menus? (When the answer is "it depends!", that suggests an interaction.)

Example of an interaction (mean task time)

                DIALOG STYLE
USER        Command     Menu
Novice         42        28
Expert         16        20
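The interaction can be read off the table by computing the dialog-style effect separately for each level of expertise (the simple effects) and comparing them. This sketch uses only the four cell means, with task time as the DV (lower is faster), so it describes the pattern but is not a significance test; with real data the interaction would be tested with an analysis of variance on the individual scores. The names are illustrative.

```python
# Mean task times from the table (lower is faster).
times = {
    ("novice", "command"): 42, ("novice", "menu"): 28,
    ("expert", "command"): 16, ("expert", "menu"): 20,
}


def simple_effect(user):
    """Command time minus menu time for one level of expertise."""
    return times[(user, "command")] - times[(user, "menu")]


def main_effect_of_dialog():
    """Command minus menu, averaged over novices and experts."""
    return (simple_effect("novice") + simple_effect("expert")) / 2


# Interaction contrast: does the dialog effect change with expertise?
interaction = simple_effect("novice") - simple_effect("expert")

print(simple_effect("novice"))   # 14: menus are faster for novices
print(simple_effect("expert"))   # -4: commands are faster for experts
print(main_effect_of_dialog())   # 5.0: averaged, menus look better overall
print(interaction)               # 18: nonzero, so the answer "depends"
```

The averaged main effect alone would suggest menus are simply better; the simple effects have opposite signs, which is exactly what "it depends" means.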
(Here the DV is the time it takes to do the task. The IVs are dialog style and user expertise.)

Interaction
Which type of dialog is better, commands or menus? Answer: commands are faster for experts, and menus are faster for novices. If the answer is "it depends", that is, it depends on the levels of another independent variable such as expertise, then there is an interaction.

[Figures: the same data graphed twice, with Time (0 to 45) on the y-axis, dialog style (Command, Menu) on the x-axis, and separate series for Novice and Expert users.]
(Which representation is better?)

Between vs. within vs. mixed designs

Fully between-subjects (each cell is a different group of people):

                DIALOG STYLE
USER        Command     Menu
Novice       10 Ss      10 Ss
Expert       10 Ss      10 Ss

Mixed (User between-subjects, Dialog Style within-subjects):

                DIALOG STYLE
USER        Command and Menu
Novice           10 Ss
Expert           10 Ss
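The subject counts for the two layouts follow from a simple rule: every combination of between-subjects levels needs its own group of people, while within-subjects levels multiply the number of conditions each person sees instead. The helper names below are hypothetical. Note that User cannot be made a within-subjects variable in the usual sense, since people cannot be randomly assigned to be experts.

```python
from math import prod


def subjects_needed(n_per_group, between_levels):
    """Every combination of between-subjects levels needs its own group."""
    return n_per_group * prod(between_levels)


def conditions_per_subject(within_levels):
    """Each subject is tested in every combination of within-subjects levels."""
    return prod(within_levels)


# Fully between: User (2 levels) x Dialog Style (2 levels), 10 Ss per cell.
between_total = subjects_needed(10, [2, 2])
# Mixed: User between (2 levels), Dialog Style within (2 levels).
mixed_total = subjects_needed(10, [2])
mixed_conditions = conditions_per_subject([2])

print(between_total, mixed_total, mixed_conditions)  # 40 20 2
```

Both layouts yield 10 scores per cell; the mixed design gets them from half as many people, at the price of order effects that must be counterbalanced.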
(What are the advantages and disadvantages of having Dialog Style as a within-subjects vs. a between-subjects variable?)

What can go wrong
Two types of errors:
- Type 1: your data show a statistical effect, but it's not real.
- Type 2: your data fail to show any statistical effect, but the effect is out there in the world.
Avoid Type 1 errors by replicating your effects. Avoid Type 2 errors by increasing your power.

Power
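Power is the probability that an experiment detects an effect that really exists, that is, one minus the Type 2 error rate. The sketch below illustrates both error types under stated simplifications. It swaps in a two-sample z-test with a known standard deviation for the t-tests an HCI study would normally use, and the effect sizes, sample sizes and function names are hypothetical. The simulation runs many experiments on an effect that does not exist: about 5% come out "significant" at alpha = 0.05, and requiring a same-direction replication removes almost all of them. The power function shows that, for the same difference in means, more subjects or less variance raises power.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_two_groups(mean_diff, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test with known,
    equal SDs (a simplification of the usual t-test)."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = (mean_diff / sd) * math.sqrt(n_per_group / 2)
    # Probability the test statistic lands beyond either critical value.
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


def simulate_null(n_per_group=10, trials=10_000, alpha=0.05, seed=0):
    """Run many experiments in which there is NO real effect.

    Returns (share with a significant result, share whose replication is
    also significant in the same direction)."""
    rng = random.Random(seed)
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    se = math.sqrt(2 / n_per_group)  # both groups have SD 1

    def one_experiment():
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        return (sum(a) - sum(b)) / n_per_group / se

    type1 = replicated = 0
    for _ in range(trials):
        z = one_experiment()
        if abs(z) > z_crit:
            type1 += 1
            z_again = one_experiment()
            if abs(z_again) > z_crit and (z_again > 0) == (z > 0):
                replicated += 1
    return type1 / trials, replicated / trials


false_positive_rate, survives_replication = simulate_null()
print(f"Type 1 rate: {false_positive_rate:.3f}; after replication: {survives_replication:.4f}")

# Same difference in means; more subjects or less variance -> more power.
print(f"n=20, sd=1.0: {power_two_groups(0.5, 1.0, 20):.2f}")
print(f"n=64, sd=1.0: {power_two_groups(0.5, 1.0, 64):.2f}")
print(f"n=20, sd=0.5: {power_two_groups(0.5, 0.5, 20):.2f}")
```

The three power values come out at roughly 0.35, 0.81 and 0.89: more than tripling the sample and halving the standard deviation both move a weak design to a reasonably powerful one.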
Ways to increase power:
- Run more subjects
- Include more observations/items/tasks
- Eliminate noise (reduce variance)
- Try to achieve better control
[Figure: pairs of distributions that show the same difference in means, but very different variances.]

Common ways to reduce variance and increase power
- Random assignment
- Minimize differences in subjects and items
- Try to use a within-Ss design
- Match subjects' characteristics
- Remove confounds by making the comparison groups or items similar
- Counterbalance

Quasi-experiments
When no random assignment is possible, e.g., in the study of:
- Gender effects
- Bilingualism
- Experts vs. novices (unless expertise can be acquired in the course of the experiment)
You cannot conclude anything about causality in a quasi-experiment, due to potential confounds! Many HCI experiments are quasi-experiments.

When is it appropriate to do an experiment?
- When a direct comparison of two or more systems or variables is required
- When it's feasible to achieve some measure of control
- When you want to show causality
- When you want to test predictions; when precise understanding is needed
- When you need data for establishing the parameters of a model

Advantages of experiments
- They provide comparative data
- They enable strong statements of causality
- A wide variety of designs is available
- There is an excellent conceptual match between experimental problems and statistical tests (always know how you're going to analyze the data before you collect it!!)

Limitations of experiments
Experiments often:
- are time-consuming
- are expensive
- can be ineffective for comparing complex systems (what is causing the differences?)
- can result in weak generalization (if the task is overly simplified, or if materials and setting aren't varied enough)

R&D (research vs. development)
Scientific research is strongly associated with experiments (hypothesis testing). Science also includes a descriptive component.
Discuss: applied vs. basic research; development.

Summary
- Designing an experiment involves many tradeoffs
- There is no perfect experiment
- Pilot! (Experiment design is iterative!)
- Applied research makes tradeoffs differently than academic research: it's more timely, more generalizable, more descriptive, less controlled, and more relevant to real-world problems.

A note about questionnaires:
Questionnaire design
Sample questionnaire (CGB, Fig. 14.3):
- Overall, the system was easy to use. (strongly disagree to strongly agree)
- The system was quick and efficient.
- The system had the capabilities I expected.
- What are the top 2 suggestions you could make to improve the system?
- What are the top 3 things you liked about the system?

Questionnaire design: critique! Biased!

Review of user studies (Gomoll's article)
User studies (Gomoll, 1990):
1. Set up the observation (tasks, users, situation)
2. Describe the evaluation's purpose
3. Tell the user she can quit at any time
4. Introduce the equipment
5. Explain how to think aloud
6. Explain that you will not provide help
7. Describe the tasks and the system
8. Ask for questions
9. Conduct the observations (debrief the subject)
10. Summarize the results