ICPC 2009 - Working session

TDD = Too Dumb Developers?
Implications of Test-Driven Development on
Maintainability and Comprehension of Software


Important Dates
May 1st: Deadline for submitting position papers
(circa 2 pages - PDF format)
May 17th: Working session


Summary:

Test-Driven Development (TDD) is a development discipline prescribing that tests be written before the implementation code, which must then pass them. Several criticisms have questioned TDD's ability to deliver well-structured code, mainly because of its narrow focus on the features needed right now, with little looking ahead. The main threat possibly introduced by TDD therefore lies in a lack of maintainability and evolvability of the resulting system.
The goal of the working session is to gather opinions, studies, and, above all, experiences related to the maintenance implications of TDD adoption. The working session will be organized around position papers presented and discussed by the participants.

TDD

Test-Driven Development [2] is one of the central practices of the eXtreme Programming approach [1]. The essence of TDD consists in formalizing a piece of functionality as a test, implementing the functionality so that the test passes, and iterating the process. While writing production code, the developer (or pair) must focus closely on the test at hand. Any additional code that is not strictly required to pass the test must be avoided, since "you ain't gonna need it" (YAGNI).
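As a minimal, purely illustrative sketch of this cycle (the Stack class and test names below are invented for the example, not taken from the cited works), the red-green rhythm might look as follows in Python:

  # Hypothetical TDD sketch.
  # Step 1 (red): formalize the desired behaviour as a test before any
  # implementation exists; running it at this point fails, because Stack
  # has not been written yet.
  import unittest

  class TestStack(unittest.TestCase):
      def test_push_then_pop_returns_last_item(self):
          s = Stack()
          s.push(42)
          self.assertEqual(s.pop(), 42)

  # Step 2 (green): write only the code needed to make the test pass,
  # avoiding any speculative features (YAGNI).
  class Stack:
      def __init__(self):
          self._items = []

      def push(self, item):
          self._items.append(item)

      def pop(self):
          return self._items.pop()

  # Step 3: run the tests, refactor if needed, then iterate with the next test.
  if __name__ == "__main__":
      unittest.main()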
The positive aspects of TDD can be considered from several points of view:

Empirical Evidence (bits of)

Software engineering continuously introduces new development methods, techniques, and tools that promise cost savings, reduced development time, better product quality, etc. TDD is one such technique. However, when practitioners need to adopt a specific technique, they would like evidence that it actually delivers the promised benefits.
Sensible evidence to substantiate such claims can be provided by rigorous empirical investigation of techniques in different contexts, to assess their strengths and weaknesses and their suitability for the specific needs of practitioners. Such evaluations are complex, since several variables need to be taken into account (e.g., background of the people involved, working environment, technologies and tools adopted, etc.). Moreover, there are several different types of empirical study (e.g., surveys, case studies, and controlled experiments), each providing a different degree of strength and being appropriate for different kinds of empirical questions and contexts.
Rigorous procedures should be followed during an empirical study. When designing and conducting one, the investigator needs to make several decisions, including how to perform the study and which kind of data to collect, e.g., how to measure the contextual factors that affect the study, how to choose the subjects and objects of study, etc. These are a few of the problems that may arise if the investigator does not follow a rigorous procedure or makes the wrong choices: data may fail to support even a true hypothesis, or, conversely, false hypotheses may be believed true because of inadequate evidence; insufficient documentation of contextual factors may prevent other people from adequately replicating a study, so results may not be fully comparable and combinable; an insufficient number of data points may result in a non-statistically significant result.
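As a back-of-the-envelope illustration of the last point (a hypothetical calculation in Python, not tied to any of the cited studies), the usual normal-approximation formula for a two-group comparison shows how quickly the number of subjects required grows as the expected effect shrinks:

  # Hypothetical sample-size sketch: subjects needed per group for a
  # two-sample comparison, using the normal approximation
  #   n = 2 * (z_{1-alpha/2} + z_{1-beta})^2 / d^2
  # where d is the standardized effect size (Cohen's d).
  from scipy.stats import norm

  def subjects_per_group(effect_size, alpha=0.05, power=0.80):
      z_alpha = norm.ppf(1 - alpha / 2)
      z_beta = norm.ppf(power)
      return 2 * (z_alpha + z_beta) ** 2 / effect_size ** 2

  for d in (0.8, 0.5, 0.2):  # large, medium, small effects
      print(f"d = {d}: about {subjects_per_group(d):.0f} subjects per group")

  # Roughly 25, 63, and 392 subjects per group: a small expected effect is
  # easily missed with the class-sized samples typical of student experiments.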
From an empirical point of view we observe a contrasting body of evidence about TDD.
An early work [4] found that, although no direct impact on quality (intended as fault density) could be observed, a higher number of tests were written with TDD, and the more tests, the higher the quality of the program.
A more recent work [5] found that TDD substantially weakens traditional metrics of sound architecture: measures of coupling and cohesion (another way of talking about quality).
Another point of view [3] is far more critical about adopting TDD: the lack of up-front design leads to a lack of architecture, which in turn overcomplicates any bug fixing, evolution, or maintenance activity.

Goals

The goal of this working session is to share opinions, studies, and experiences related to the maintenance implications of TDD adoption. The session is intended as a meeting forum where researchers will discuss such topics and possibly plan joint empirical studies together. As previously mentioned, the design of such studies is tricky, and there are many obstacles to replication.
The definition of empirical studies for assessing the effectiveness of TDD could be the main objective of the session.
A number of open research issues are related to such definitions:
  1. Identification of key variables to measure during experiments. We also lack a general taxonomy that defines and relates key variables and documents potential interactions and other confounding factors.
  2. Identification of an appropriate subject population for these experiments. While students are often the most convenient to enlist, a subject group made up entirely of students might not adequately represent the intended user population. Open issues include how best to engage industrial practitioners in empirical studies and how to judge when a subject population made up of students is sufficiently representative.
  3. Identification of appropriate benchmark systems or tasks. Benchmarks facilitate direct comparisons of alternative approaches and may allow the meaningful aggregation of data from multiple studies.
  4. Standardization of the experimental design format, which would facilitate the replication of experiments and the aggregation of data from multiple experiments.
  5. Development of standardized instrumentation and analysis methodologies.
  6. Standard procedures and criteria for packaging the material and the results of a study to facilitate replication.
We hope to establish collaborations to reach a critical mass for the replication of a selected set of studies. For example, meta-analysis is a well-known technique for aggregating the results of multiple studies to improve the statistical significance of the (combined) result. Often, a single research group conducts a user study whose result suggests a trend but lacks statistical significance because the group cannot assemble a sufficiently large subject pool. If multiple research groups run independent studies using the same material, a meta-analysis may be able to combine these studies to yield a statistically significant result. By planning these independent studies and then performing a meta-analysis, we could learn much about how to deal with this general problem. One important issue to be discussed in the working session is therefore how to package and share materials and results, so as to facilitate replication and to make sure we learn from mistakes made in the first replications, e.g., by improving the experiment design or material whenever needed.
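To make the meta-analysis argument concrete, here is a small sketch (in Python, with purely invented numbers) of a fixed-effect, inverse-variance pooling of three imagined replications: none is significant on its own, yet the combined estimate is.

  # Hypothetical fixed-effect (inverse-variance) meta-analysis sketch:
  # each replication reports an effect estimate and its standard error.
  from math import sqrt
  from scipy.stats import norm

  # (effect estimate, standard error) from three imagined replications
  studies = [(0.30, 0.20), (0.25, 0.18), (0.40, 0.22)]

  weights = [1 / se ** 2 for _, se in studies]
  pooled = sum(w * e for (e, _), w in zip(studies, weights)) / sum(weights)
  pooled_se = sqrt(1 / sum(weights))
  z = pooled / pooled_se
  p = 2 * (1 - norm.cdf(abs(z)))

  print(f"pooled effect = {pooled:.2f} +/- {pooled_se:.2f}, z = {z:.2f}, p = {p:.3f}")
  # Individually z = 1.50, 1.39, 1.82 (all p > 0.05); pooled z ~ 2.68, p ~ 0.007.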

Organization

Position papers will be invited for submission until a couple of weeks before the working session.
Specifically, we encourage researchers and practitioners to exchange opinions, studies, experiences, and designs of empirical studies related to TDD.
Interested participants are encouraged to submit to the organizers a 2-page position paper related to TDD; proposals could describe, for example, opinions, experience reports, or designs of empirical studies on TDD. The working session will be organized around position paper presentations.
Each proposer will briefly present their position paper, followed by a discussion. Participants will then cluster into groups based on interest in a particular topic. Each group will discuss the position papers in that area and define a shared position.
Finally, each group will briefly present its position and solicit feedback from the larger group.

Expected Results

The desired outcome is an improved framework for understanding the implications of TDD. The individual pieces of the framework should, as far as possible, be backed up by some form of empirical evidence.
The organizers, in cooperation with the working session participants, will edit a final summary report.

Organizers

References

[1] K. Beck. Extreme Programming Explained: Embrace Change. Addison-Wesley, 1999.
[2] K. Beck. Test Driven Development: By Example. Addison-Wesley, 2003.
[3] G. Bjørnvig, J. Coplien, and N. Harrison. A story about user stories and test driven development. Better Software, November 2007.
[4] H. Erdogmus, M. Morisio, and M. Torchiano. On the effectiveness of the test-first approach to programming. IEEE Transactions on Software Engineering, 31(3):226-237, March 2005.
[5] M. Siniaalto and P. Abrahamsson. Comparative case study on the effect of test-driven development on program design and test coverage. In Proc. Empirical Software Engineering and Measurement (ESEM), pages 275-284, 2007.