Empirically Revisiting the Test Independence Assumption
Sai Zhang, Darioush Jalali, Jochen Wuttke, Kıvanç Muşlu, Wing Lam, Michael D. Ernst, David Notkin
University of Washington

Order dependence

Consider two tests:

  createFile(foo)
  readFile(foo)

Executing them in the default order yields the intended test results; executing them in a different order yields different results.

Dependent test: a test that yields a different test result than the default result (the test result by design) in a reordered subsequence of the original test suite. Three aspects of this definition:
- Visible test result, rather than internal program state
- The default execution order as the baseline
- Real tests, rather than contrived ones

Why should we care about test dependence?
- It makes test behaviors inconsistent.
- It affects downstream testing techniques: test prioritization, test parallelization, and test selection.
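The createFile/readFile example above can be made concrete with a minimal, runnable sketch. All names here are hypothetical, and an in-memory map stands in for the real file system; the point is only that shared mutable state makes the second test's result depend on execution order.

```java
import java.util.HashMap;
import java.util.Map;

public class OrderDependence {
    // Shared mutable state, analogous to the file system.
    static Map<String, String> files = new HashMap<>();

    // Always passes; creates "foo" as a side effect.
    static boolean testCreate() {
        files.put("foo", "contents");
        return true;
    }

    // Passes only if some earlier test created "foo".
    static boolean testRead() {
        return "contents".equals(files.get("foo"));
    }

    public static void main(String[] args) {
        // Default order: create, then read -> both pass.
        boolean defaultOrder = testCreate() && testRead();
        files.clear();                      // fresh state for the reordered run
        // Reordered: read before create -> read fails.
        boolean reordered = testRead();
        System.out.println("default order passes: " + defaultOrder);   // true
        System.out.println("reordered read passes: " + reordered);     // false
    }
}
```

Running the pair in both orders visibly flips testRead's result, which is exactly the paper's definition of a dependent test: the outcome, not just internal state, changes.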
Conventional wisdom: test dependence is not a significant issue.

Test independence is assumed by: test selection, test prioritization, test parallel execution, test factoring, and test generation. Of 31 papers in ICSE, FSE, ISSTA, ASE, ICST, TSE, and TOSEM (2000-2013):
- 27 assume test independence without justification
- 1 considers test dependence
- 3 mention it only as a threat to validity

Is the test independence assumption valid? No!
- Does test dependence arise in practice? Yes, in both human-written and automatically-generated test suites.
- What repercussions does test dependence have? Inconsistent results (missed alarms and false alarms), and effects on downstream testing techniques.
- How to detect test dependence? The general problem is NP-complete; approximate algorithms based on heuristics work well.
Implications:
- Test independence should no longer be assumed.
- Test dependence poses new challenges in designing testing techniques.
Methodology

We studied dependent tests from two sources: dependent tests reported in 5 issue tracking systems, and new dependent tests found in 4 real-world projects.

Reported dependent tests: we searched the 5 issue tracking systems for 4 key phrases ("dependent test", "test dependence", "test execution order", "different test outcome"), manually inspected 450 matched bug reports, and identified 96 distinct dependent tests. For each we examined three characteristics: manifestation, root cause, and developers' action.
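The phrase-matching step of this methodology can be sketched as a simple filter. This is only an illustration with hypothetical names and inputs; the actual retrieval from the trackers and the inspection of the 450 matches were manual.

```java
import java.util.List;
import java.util.stream.Collectors;

public class BugReportFilter {
    // The four key phrases used to query the issue trackers.
    static final List<String> PHRASES = List.of(
        "dependent test", "test dependence",
        "test execution order", "different test outcome");

    // Keep the reports whose text mentions any key phrase,
    // as candidates for manual inspection.
    static List<String> matches(List<String> reports) {
        return reports.stream()
            .filter(r -> PHRASES.stream().anyMatch(r.toLowerCase()::contains))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> reports = List.of(
            "Test fails depending on test execution order",
            "NPE when input file is empty");
        System.out.println(matches(reports)); // only the first report matches
    }
}
```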
Manifestation: number of tests involved to yield a different result. Relative to the default order, a dependent test may change its result when run in isolation (#Tests = 1) or when run after another test (#Tests = 2). Among the 96 reported dependent tests:
- #Tests = 1: 6
- #Tests = 2: 73
- #Tests = 3: 2
- Unknown: 15
82% can be revealed by no more than 2 tests.

Root cause
Of the 96 reported dependent tests, at least 61% (59 tests) are due to side-effecting access to static variables; 14 involve the file system or a database; the root cause of the remaining 23 is unknown.
Developers' action
- 98% of the reported tests are marked as major or minor issues.
- 91% of the dependences have been fixed, by improving documentation or by fixing test code or source code.

Methodology: new dependent tests (4 real-world projects)
- Human-written test suites: 4176 tests, 29 dependent tests.
- Automatically-generated test suites (using Randoop [Pacheco07]): 6330 tests, 354 dependent tests.
- We found these by running our dependent-test detection algorithms (details later).

Characteristics
Manifestation: number of tests to yield a different result.
- 29 manual dependent tests: #Tests = 1: 23; #Tests = 2: 4; #Tests = 3: 2.
- 354 auto-generated dependent tests: #Tests = 1: 186; #Tests >= 2: 168.
Root cause (new dependent tests): all are due to side-effecting access of static variables.

Developers' actions: developers confirmed all manual dependent tests ("tests should always stand alone, that is test engineering 101"). They merged two tests to remove a dependence, opened a bug report to fix a dependent test, and marked one dependence "won't fix", since it is due to the library design.

What repercussions does test dependence have?

Of the 96 dependent tests reported in the 5 issue tracking systems: missed alarms: 2;
false alarms: 94.

Example false alarm:

  void testDisplay() {
    // create a Display object
    // dispose the Display object
  }
  void testShell() {
    // create a Display object
  }

In Eclipse, only one Display object is allowed. In the default order (testDisplay, then testShell), both tests pass. In a non-default order (testShell, then testDisplay), testShell leaves its Display undisposed and testDisplay fails. This led to a false bug report that took developers 3 months to resolve.

Example missed alarm:

  public final class OptionBuilder {
    static String argName = null;
    static void reset() {
      argName = "arg";
    }
  }

argName needs to be set to "arg" before a client calls any method in the class. BugTest.test13666 validates correct behavior. This test should fail, but passes when run in the default order, because another test calls reset() before it. The dependence hid the bug for 3 years. The fix initializes the state when the class is loaded:

  public final class OptionBuilder {
    static String argName = null;
    static void reset() {
      argName = "arg";
    }
    static {        // bug fix
      argName = "arg";
    }
  }

Test prioritization reorders a test execution order into a new one, to achieve coverage faster and to improve the fault detection rate. It assumes that each test yields the same result in the new order.
We evaluated five test prioritization techniques [Elbaum et al. ISSTA 2000] on the 4 real-world projects (4176 manual tests in total), recording the number of tests yielding different results:

  Test prioritization technique                         | #Tests yielding different results
  Randomized ordering                                   | 12
  Prioritize on coverage of statements                  | 11
  Prioritize on coverage of statements not yet covered  | 17
  Prioritize on coverage of methods                     | 11
  Prioritize on coverage of methods not yet covered     | 12
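The "statements not yet covered" strategy in the table is the classic additional-coverage greedy heuristic: repeatedly schedule the test that covers the most statements not yet covered by already-scheduled tests. A minimal sketch, with hypothetical names and per-test coverage supplied as precomputed sets:

```java
import java.util.*;

public class AdditionalCoveragePrioritizer {
    // Greedily order tests by how many not-yet-covered statements each adds.
    static List<String> prioritize(Map<String, Set<Integer>> coverage) {
        List<String> order = new ArrayList<>();
        Set<Integer> covered = new HashSet<>();
        Set<String> remaining = new TreeSet<>(coverage.keySet()); // deterministic ties
        while (!remaining.isEmpty()) {
            String best = null;
            int bestGain = -1;
            for (String t : remaining) {
                Set<Integer> gain = new HashSet<>(coverage.get(t));
                gain.removeAll(covered);           // statements this test would add
                if (gain.size() > bestGain) { bestGain = gain.size(); best = t; }
            }
            order.add(best);
            covered.addAll(coverage.get(best));
            remaining.remove(best);
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> cov = new HashMap<>();
        cov.put("t1", new HashSet<>(Arrays.asList(1, 2)));
        cov.put("t2", new HashSet<>(Arrays.asList(1, 2, 3)));
        cov.put("t3", new HashSet<>(Arrays.asList(4)));
        System.out.println(prioritize(cov)); // prints [t2, t3, t1]
    }
}
```

Note that any such reordering implicitly assumes test independence: the greedy order is valid only if each test still yields its default result.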
Implication: existing test prioritization techniques are not aware of test dependence.

How to detect test dependence?

The general problem (given a test suite, find all dependent tests) is NP-complete. Proof: by reducing the Exact Cover problem to the dependent-test detection problem.

Detecting dependent tests in a test suite:
Since finding all dependent tests is intractable, we use four approximate, heuristic algorithms; all are sound but incomplete.
- Reversal algorithm: run the suite in reverse order. Intuition: changing the relative order of each pair of tests may expose dependences.
- Randomized execution: shuffle the execution order multiple times.
- Exhaustive bounded algorithm: execute all k-permutations of the tests for a bounding parameter k. We use k = 2, because most dependent tests can be found by running short test subsequences (82% of the reported dependent tests are revealed by no more than 2 tests).
- Dependence-aware bounded algorithm: record read/write information for each test (e.g., test x reads a variable that test y writes), then filter away unnecessary permutations before the bounded (k = 2) execution.

Evaluating the approximate algorithms on the 4 real-world projects:
- Human-written test suites: 4176 tests; 29 dependent tests found.
- Automatically-generated test suites (Randoop [Pacheco07]): 6330 tests; 354 dependent tests found.
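The filtering step of the dependence-aware bounded algorithm can be sketched as follows. This is a minimal illustration under simplifying assumptions: test names and state variables are hypothetical, and each test is modeled purely by the shared state it reads and writes. A pair of tests needs to be executed in both orders only if one test writes state that the other reads or writes.

```java
import java.util.*;

public class DependenceAwarePairs {
    // A test modeled only by the shared state it reads and writes.
    static class TestCase {
        final String name; final Set<String> reads; final Set<String> writes;
        TestCase(String name, Set<String> reads, Set<String> writes) {
            this.name = name; this.reads = reads; this.writes = writes;
        }
    }

    // Running a before b (or b before a) could change a result only if
    // one test writes state that the other reads or writes.
    static boolean mayInterfere(TestCase a, TestCase b) {
        return overlaps(a.writes, b.reads) || overlaps(b.writes, a.reads)
                || overlaps(a.writes, b.writes);
    }

    static boolean overlaps(Set<String> s1, Set<String> s2) {
        for (String x : s1) if (s2.contains(x)) return true;
        return false;
    }

    // Keep only the 2-permutations worth executing.
    static List<String> candidatePairs(List<TestCase> tests) {
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i < tests.size(); i++)
            for (int j = 0; j < tests.size(); j++)
                if (i != j && mayInterfere(tests.get(i), tests.get(j)))
                    pairs.add(tests.get(i).name + "->" + tests.get(j).name);
        return pairs;
    }

    public static void main(String[] args) {
        List<TestCase> tests = Arrays.asList(
            new TestCase("t1", Set.of(), Set.of("OptionBuilder.argName")),
            new TestCase("t2", Set.of("OptionBuilder.argName"), Set.of()),
            new TestCase("t3", Set.of("unrelated"), Set.of()));
        // Only the t1/t2 orders survive the filter; t3 shares no state.
        System.out.println(candidatePairs(tests)); // prints [t1->t2, t2->t1]
    }
}
```

On the example, the filter cuts the six 2-permutations of three tests down to two, which is the source of the algorithm's cost advantage over the exhaustive bounded search.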
Evaluating approximate algorithms

[Two charts omitted: the number of dependent tests detected by each algorithm (Reversal, Randomized, k-bounded, Dependence-aware) and each algorithm's time cost in seconds (estimated vs. actual, log scale).]
- Reversal: cheap, but finds only about half of the dependent tests.
- Randomized execution (shuffled 1000 times): cheap and detects most of the dependent tests.
- Exhaustive bounded (k = 2): detects all dependences within the bound, but is computationally infeasible; it did not finish for some programs.
- Dependence-aware bounded (k = 2): detects the dependences within the bound while filtering away unnecessary permutations, at far lower cost.

Related work

Existing definitions of test dependence are based on program state change [Kapfhammer03] or are informal [Bergelson06]. Our definition focuses on the concrete test execution result.
A program state change may not affect the test execution result.

Flaky tests [Luo et al. '14, Google testing blog] are tests revealing inconsistent results; a dependent test is a special type of flaky test.

Some tools support executing tests in different orders: JUnit 4.1 executes tests in alphabetical order by name; DepUnit and TestNG support specifying the test execution order. None of them supports detecting test dependence.

Contributions

Revisiting the test independence assumption:
- Test dependence arises in practice
- Test dependence has non-trivial repercussions
- Test dependence detection is NP-complete
- Heuristic algorithms are effective in practice
Test independence should no longer be assumed!
Our tool implementation: http://testisolation.googlecode.com

[Backup slides]

Why not run each test in a separate process?
- It is implemented in JCrasher and supported in Ant + JUnit, but has unacceptably high overhead: a 10-138X slowdown.
- Recent work merges tests running in separate processes into a single process [Bell & Kaiser, ICSE 2014].

Why more dependent tests in automatically-generated test suites?
- Manual test suites: developers' understanding of the code and their testing goals helps build well-structured tests, and developers often try to initialize and destroy the shared objects each unit test may use.
- Auto-generated test suites: most tools are not state-aware; the generated tests often misuse APIs (e.g., setting up the environment incorrectly); and most tools cannot generate environment setup/teardown code.

What is the default test execution order?
It is the intended execution order as designed, specified by developers (for example, in a make file, an Ant file, or TestAll.java); it leads to the intended results that developers want to see.

Dependent tests vs. nondeterministic tests
- Nondeterminism does not imply dependence: a program may execute nondeterministically, yet its tests may deterministically succeed.
- Test dependence does not imply nondeterminism: a program may have no sources of nondeterminism, yet its tests can still be dependent on each other.

Controlled Regression Testing Assumption (CRTA) [Rothermel et al., TSE 1996]
A stronger assumption than determinism, forbidding:
- Porting to another system
- Nondeterminism
- Time dependencies
- Interaction with the external environment
- (implicitly) test dependence
The authors commented that CRTA "is not necessarily impossible to employ". Our paper has a more practical focus on the overlooked issue of test dependence.