How do I gain confidence in code that generates HTML, such as tag libraries or view templates?

Well, it depends on what I’m trying to do.

Am I learning how an existing tag library works? If so, then I create a bare project, install the tag library, use it to generate some HTML, then use something like HTMLUnit[1] (any HTML parser will do) to check the results. This way, I can explore all the features that the tag library has without mixing those tests up with the tests for my project’s core behavior. I can use what I learn from these Learning Tests[2], meaning the contract of the tag library features that matter to me, to write tests for my core behavior that make safe (well, safer) assumptions about what the tag libraries do.

Am I creating my own tag library? I typically create custom tags by extracting duplication from HTML, so whatever tests I already have for HTML indirectly test my custom tags. Once I extract enough behavior into a little group of custom tags, then I begin to feel like I have a proper, reusable library[3], and then I treat it exactly like I do any existing tag library, so this reduces to my answer above.

Am I testing other view code that generates HTML, meaning not a tag library? In this case, I make sure to separate that code from the rest of the system. In particular, I don’t want to have to click-click-click in order to get to the right page so that I can check the resulting HTML. If I have to click-click-click, then I’ve clearly violated the Dependency Inversion Principle, since the view depends on its invoker, the controller.

Please note that automating this click-click-click with something like Selenium doesn’t make this problem go away; it merely makes it easier to tolerate the problem… for a while.

This means finding a way to render my HTML template directly without invoking the rest of the application. How to do this varies from framework to framework, or from library to library. It’s one of the reasons that, way back in the 2000s, I preferred using an HTML template engine like Apache Velocity over using JSP. I never did figure out how to reliably render a JSP without involving the rest of the web container and its nonsense. Are there any standalone JSP engines now? I don’t know.

I know that RSpec does this well for Rails. I can simply render a view template with whatever data I desire, and I never have to invoke a controller nor run the rest of the system. Now how RSpec-Rails does this amounts to killing kittens, but that’s mostly because Rails likes coupling everything to everything else and expects you to like it, too. I try to ignore the mewling of dying kittens as I run my view specs.

The Two Key Points

To check the HTML that X generates, run X without running the things that invoke X. (Dependency Inversion Principle.) This is true for X = JSP processor; X = HTML template engine; X = whatever. Write a test like this:

htmlAsString = render(template, dictionaryOfDynamicDataToDisplayOnTemplate)
htmlDocument = htmlParser.parse(htmlAsString)
assert whatever you like about htmlDocument
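
Here is a minimal sketch of that shape in plain Java. Everything in it is an assumption for illustration: render() is a hypothetical stand-in for whatever template engine you use, the customers list is invented, and I parse the output with the JDK's XML parser only because the sample markup happens to be well-formed. With real-world HTML you would swap in a proper HTML parser.

```java
import java.io.StringReader;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CustomerListViewTest {
    // Hypothetical stand-in for a template engine: it needs only the data
    // to display, not a controller and not a web container.
    static String render(List<String> customerNames) {
        StringBuilder html = new StringBuilder("<html><body><ul id=\"customers\">");
        for (String name : customerNames) {
            html.append("<li>").append(name).append("</li>");
        }
        return html.append("</ul></body></html>").toString();
    }

    // Parse the rendered text, then ask the document a question,
    // instead of scraping the raw string.
    static int countCustomerItems(String htmlAsString) {
        try {
            Document htmlDocument = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(htmlAsString)));
            Number count = (Number) XPathFactory.newInstance().newXPath()
                    .evaluate("count(//ul[@id='customers']/li)", htmlDocument, XPathConstants.NUMBER);
            return count.intValue();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse rendered HTML", e);
        }
    }

    public static void main(String[] args) {
        String htmlAsString = render(List.of("Alice", "Bob"));
        if (countCustomerItems(htmlAsString) != 2) {
            throw new AssertionError("expected one list item per customer");
        }
        System.out.println("the view renders one list item per customer");
    }
}
```

Notice that nothing here clicks through pages or starts a server: the test hands data to the view and inspects what comes back.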

As you do this, you describe the contract of the view. You can use this information to check that the controller puts the right data in the right view template variables without having to run the controller and view together.

For example, if you know that your view expects the scripting variable customers with a collection of Customer objects, then your controller tests can check that it puts a valid (non-null) collection of Customer objects wherever the view rendering engine will look for the scripting variable customers. In the Spring WebMVC world—and I realise I’m old—this meant the customers key in the model Map inside the ModelAndView object that the controller returns from its handleRequest() implementation.
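
A rough sketch of such a controller test follows, written without Spring so that it stands on its own. The Customer class, the handleRequest() body, and the satisfiesViewContract() helper are all hypothetical names that I made up for this example; a plain Map stands in for the model that a real framework would carry.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CustomerControllerContractTest {
    // Hypothetical domain object, invented for this sketch.
    static class Customer {
        final String name;
        Customer(String name) { this.name = name; }
    }

    // Hypothetical controller: it returns only the model that the view will read.
    static Map<String, Object> handleRequest() {
        Map<String, Object> model = new HashMap<>();
        model.put("customers", List.of(new Customer("Alice"), new Customer("Bob")));
        return model;
    }

    // The view's side of the contract: "customers" must be a non-null
    // collection that contains only Customer objects.
    static boolean satisfiesViewContract(Map<String, Object> model) {
        Object customers = model.get("customers");
        if (!(customers instanceof Collection)) {
            return false; // also covers a missing (null) value
        }
        for (Object each : (Collection<?>) customers) {
            if (!(each instanceof Customer)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        if (!satisfiesViewContract(handleRequest())) {
            throw new AssertionError("controller broke the view's contract");
        }
        System.out.println("controller supplies what the view expects");
    }
}
```

The controller test never renders anything. It only checks that the controller keeps its half of the agreement that the view tests describe.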

Don’t test a tag library by testing your application. If you want to test the tag library, then test it in isolation. This also applies to learning about the tag library by writing Learning Tests for it.

When you want to use a tag library, you think about which features of it you want to use and how you expect those features to behave. You can probably explore those more thoroughly by not limiting yourself to the exact context in which you plan to use that tag library feature right now. You’ll probably learn more than simply trying to get the current thing working that you want to get working. This helps you better understand which part of the tag library’s contract your application will depend on. You will find this useful, I promise.

Notice that, in what I’ve just written here, you can replace “tag library” with other generic services like “database driver”.


J. B. Rainsberger and Scott Stirling, JUnit Recipes. In particular, chapter 12, “Testing Web Components” covers a lot of this ground. Even if you don’t use JUnit, the principles apply.

J. B. Rainsberger, “Demystifying the Dependency Inversion Principle”. A few different ways to think about this principle, as well as what it means for your ability to test your code.

Michael Feathers, Working Effectively with Legacy Code. Still the classic text on legacy code, in which he discusses characterisation tests. Some modern libraries make it easier to write these kinds of tests, like TextTest or ApprovalTests.

  1. I know that I’m showing my age here, but I was there when HTMLUnit was born, so I like to mention it every now and then.

  2. Tests that I write to document how a library behaves. When they pass, then I understand what the library does; when they fail, I don’t. Michael Feathers also refers to characterisation tests, which characterise what the code does, rather than specify what we want the code to do.

  3. Reusability happens when we make it happen.


Almost everyone starts organising their tests according to the module (or class or object) that they’re testing. If they have a class called Customer, then they have a test case class called CustomerTest and put all the tests for Customer into this one bundle (module, class, describe block, whatever you call it).

Don’t stop here.

If you continue to add all your Customer tests to CustomerTest, then you’ll certainly judge it as “too big” after a while. Even if you don’t, you’ll notice some patterns in the names of the tests themselves.

If you work with libraries like Spock or RSpec that let you name tests with arbitrary text, then you might not notice these patterns as easily, or the duplication might seem “more natural” in natural language. Don’t let that fool you into thinking that you haven’t duplicated concepts in your code!

You’ve almost certainly noticed a pattern in the names of some of your tests.

  • testAdd_EmptyList
  • testAdd_NonemptyList
  • testAdd_ListAtCapacity
  • testAdd_DuplicateItem
  • testContains_ItemFound
  • testContains_ItemNotFound
  • testContains_DuplicateItem
  • testSize_EmptyList
  • testSize_NonEmptyList
  • testIsEmpty_EmptyList
  • testIsEmpty_NonEmptyList
  • testIndexOf_ItemFound
  • testIndexOf_ItemNotFound
  • testTrimToCapacity_TrimmingNeeded
  • testTrimToCapacity_TrimmingNotNeeded

I don’t endorse this pattern for naming tests in general, but this reflects common practice. In a real-life version of this example, I’d be writing Learning Tests to help me understand how ArrayList works in Java. In such a situation I often write tests oriented around each method and the various special cases, because I’m trying to document the API as designed. When designing new behavior in new modules or classes, I prefer not to name my tests for any particular method, function, or even class, so as not to couple—even in my own mind—the tests unnecessarily to an initial implementation.

I can imagine finding this set of tests, and more, in a test called ArrayListTest.[1] You can already see two things:

  1. There are a lot of tests here.
  2. There is a fair amount of duplication here.

You can also see that we can’t remove that duplication with just a single refactoring: the various tests fall into different groups, and so need us to organise them slightly differently.

Remove Duplication

I don’t seem to have any problem understanding the names of these tests—I wrote them, so I guess that shouldn’t surprise me—which means that I will turn my attention to removing duplication.

Remove Duplication and Improve Names in small batches
The Simple Design Dynamo™

In this case, duplication in the names of the tests will suggest different ways of reorganising the tests than would a simple refactoring of the duplicate code. Even though I haven’t written these tests out in code, I’ve seen them a number of times. Especially when a programmer writes all these tests in one test case class, e[2] typically arrives at only one line of setup code shared by all the tests:

public void setUp() { theList = new ArrayList<String>(); }

Removing this duplication helps a little, but we can do much better. For example, looking at the tests for the “non-empty list” cases, I imagine I’ll find copied-and-pasted lists of “pre-populated” items.

public void testSize_NonEmptyList() {
    theList.add("jbrains is awesome");
    theList.add("jbrains is awesomer");
    theList.add("jbrains is even awesomer than that");

    Assert.assertEquals(3, theList.size());
}

…and a little farther down…[3]

public void testIsEmpty_NonEmptyList() {
    theList.add("jbrains is awesome");
    theList.add("jbrains is awesomer");
    theList.add("jbrains is even awesomer than that");

    Assert.assertFalse(theList.isEmpty());
}


When I look at the “isEmpty(), non-empty case” test, I get the idea that although I might want to check the “3” case for size(), I might prefer to check the boundary case for isEmpty(), meaning a list of one single item. Quite often, however, I see programmers merrily copy and paste lists of items to new tests because, well, we find that easier.

Now that I say this, perhaps I should add a test named testIsEmpty_BarelyNonEmptyList in order to distinguish the cases. I’ll add that to the to-do list I have by my computer.[4]

Group Tests by Fixture

Long ago, in a book far, far away, I wrote about grouping tests by fixture. I recommended that you “test behavior, not methods” (section 1.3.2) and “move special cases to their own fixture” (recipe 3.7). I gave some examples. It was fine. I encouraged the reader to remove duplication in the setup (now called @Before) code. More than anything else, however, don’t let tests twiddle the fixture. If a handful of tests want to share a fixture, then I prefer that they all share the very same fixture, meaning the same objects in the same state. This becomes especially important when you start trying to reuse fixture objects using inheritance. (I used to do this; I tend not to do it any more. The cure always eventually hurts more than the disease.)

Junk Drawer

You probably have a junk drawer in your house. You throw junk into it. Some of that junk you need, so you root around in it to find something specific, like a pen or a paperclip. Eventually, you find that you need a paperclip unusually often (usually to press a recessed reset button on some electronic thing) and so you decide to put the paperclip somewhere to make it easier to find. If you put it in its own little compartment, then you’ll find it, but if you then start putting some other, not-so-related items in with the paperclip, then before long you find yourself with a second junk drawer.[5] Then a third. Then you just have junk everywhere. It doesn’t work.

So it goes when you try to organise fixture objects into a setup function. This works great until the first time a test wants to change the fixture just a little for its own purposes. For the first test, you don’t worry so much: you put it in the same test class, twiddle the fixture—what harm can one extra line of setup do?—then go along your merry way. The very next special case wants to twiddle the fixture in precisely the same way. Then a third. Now is the time to move these three tests into their own test class with their own fixture, as I recommended in JUnit Recipes. If you don’t do this now, then before you know it, there’s graffiti everywhere. Almost every test twiddles the fixture in some unexpected way. You find some of that fixture up in superclasses, and you become lost in a maze of super() calls that you need to make at just the right time, otherwise your tests vomit NullPointerExceptions all over the place.

Ewww. You should have moved those tests to their own fixture when you had the chance.

Organise By Fixture or By Action?

When you find a group of tests inside a larger test class, you can either extract those tests by fixture or by action.[6] I used to think that choosing between these options amounted to black magic, “skill”, or wisdom. Now I think I have a rule suitable for an advanced beginner (on the Dreyfus model) to use.

If you name your tests using a convention like test<action>_<special case> (for example, testIsEmpty_NonEmptyList), then examine the test names for patterns. First look for multiple groupings of the same set of special case words, then group those tests into a test class by fixture. Then look for multiple groupings of the same set of action words, then group those tests into a test class by action.

I think this works because the special case names generally correspond to similar fixtures. If you have a bunch of tests that need to operate on a “non-empty list”, then you’ll probably copy and paste the same three items into each list object in those tests. (I don’t claim to call this a good thing, but we do it.) Moreover, if you try to organise the special case groupings by action instead, then you’ll move those tests away from each other into separate test classes, even though they have similar setup code. This creates a cohesion problem[7] solved by reorganising those tests by similar fixture.

Group Tests First By Special Cases, Then By Actions

Returning to our tests for ArrayList, we have

  • testAdd_EmptyList
  • testAdd_NonemptyList
  • testAdd_ListAtCapacity
  • testAdd_DuplicateItem
  • testContains_ItemFound
  • testContains_ItemNotFound
  • testContains_DuplicateItem
  • testSize_EmptyList
  • testSize_NonEmptyList
  • testIsEmpty_EmptyList
  • testIsEmpty_NonEmptyList
  • testIndexOf_ItemFound
  • testIndexOf_ItemNotFound
  • testTrimToCapacity_TrimmingNeeded
  • testTrimToCapacity_TrimmingNotNeeded

Following my proposed rule, I would end up first with these tests grouped by fixture:

  • EmptyListTest
    • testAdd
    • testSize
    • testIsEmpty
  • NonEmptyListTest
    • testAdd
    • testSize
    • testIsEmpty
  • BarelyNonEmptyListTest
    • testIsEmpty
  • MatchingItemsTest
    • testContains
    • testIndexOf
  • NotMatchingItemsTest
    • testContains
    • testIndexOf
  • DuplicateMatchingItemsTest
    • testContains
    • testIndexOf

Also these tests grouped by function—the junk drawers:

  • AddItemToListTest
    • testListAtCapacity
    • testListNotYetAtCapacity
    • testItemAlreadyInList
  • TrimArrayListToCapacityTest
    • testNeedsTrimming
    • testDoesNotNeedTrimming
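
As a rough sketch of what one of the fixture-based classes above might look like, here is NonEmptyListTest. I wrote it in plain Java, without JUnit, only so that it runs on its own; the check() and runAll() helpers are hypothetical stand-ins for what a test framework would give you. The point to notice: setUp() builds the whole fixture, and no test adds its own setup lines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class NonEmptyListTest {
    private List<String> theList;

    // Every test in this class shares exactly this fixture:
    // the same objects in the same state.
    void setUp() {
        theList = new ArrayList<>();
        theList.add("jbrains is awesome");
        theList.add("jbrains is awesomer");
        theList.add("jbrains is even awesomer than that");
    }

    void testSize() {
        check(theList.size() == 3, "size() should count every item");
    }

    void testIsEmpty() {
        check(!theList.isEmpty(), "isEmpty() should answer false");
    }

    void testAdd() {
        theList.add("one more");
        check(theList.size() == 4, "add() should append to the list");
    }

    // Hypothetical stand-in for a test framework's assertion.
    static void check(boolean condition, String message) {
        if (!condition) throw new AssertionError(message);
    }

    // Hypothetical stand-in for a test runner: each test gets a fresh fixture.
    static int runAll() {
        List<Consumer<NonEmptyListTest>> tests = List.of(
                NonEmptyListTest::testSize,
                NonEmptyListTest::testIsEmpty,
                NonEmptyListTest::testAdd);
        for (Consumer<NonEmptyListTest> test : tests) {
            NonEmptyListTest freshFixture = new NonEmptyListTest();
            freshFixture.setUp();
            test.accept(freshFixture);
        }
        return tests.size();
    }

    public static void main(String[] args) {
        System.out.println(runAll() + " tests passed");
    }
}
```

If a new test needs a list with one item, or a list at capacity, that is the signal to start a new class with its own setUp() rather than to patch this one.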

Of course, this doesn’t constitute an exhaustive test for ArrayList, but you get the idea. You’ll notice that I’ve renamed some of the tests and added a few. By reorganising the tests this way, a few ideas popped into my head, such as “adding an item when the list is not yet at capacity”. When I first wrote this list of tests, I thought of “not yet at capacity” as an unstated default assumption. Since Java creates an ArrayList with a capacity of 10 items by default, I could think of testAdd_EmptyList as implicitly checking the “not yet at capacity” case. This kind of implicit checking can lead to “holes in our tests”, which can lead to the dreaded “green bar, but there’s a bug” problem that brings us back to my old favorite: integrated tests are a scam. I don’t want to go there just now.

Instead, let me close by proposing that you try grouping tests first by repeated special cases (which correspond to similar fixtures), then by actions. I think you’ll like the results.

“If I Group Tests Like This…

…then I won’t be able to find anything!” Srsly, this is 2014. Don’t you use ag or ack or grep or something? Can’t you search your project for uses of the function add(), or at worst, the regular expression /\.add\(/?!


J. B. Rainsberger, JUnit Recipes: Practical Methods for Programmer Testing. I wrote ten years ago about the benefits of organising tests by fixture, rather than by function. I never felt truly comfortable with how easily the reader could apply that advice. This article attempts to assuage my guilt at giving such questionable advice.

J. B. Rainsberger, “Integrated Tests are a Scam”. The talk as I performed it at DevConFu in Jurmala, Latvia in December 2013. Don’t watch the Agile 2009 conference version any more.

J. B. Rainsberger, “Integrated Tests are a Scam”. The series of articles. Start with the oldest ones and work your way towards the present.

J. B. Rainsberger and friends, Understanding Coupling and Cohesion. 57 minutes of video. I invited some of my friends to discuss the nebulous concepts of coupling and cohesion in software design. How do we think about these topics? How do we understand the terms? How do we use that in our work as programmers? How do we teach it to others? How much does any of it even matter? Our invited guests: Corey Haines, Curtis Cooley, Dale Emery, J. B. Rainsberger, Jim Weirich, Kent Beck, Nat Pryce, Ron Jeffries.

  1. Yes, I’m assuming Java here. Don’t let that fool you: I see exactly the same patterns in Ruby/PHP/Python as I do in Java/C#/C++/C.

  2. Not a typo, but a Spivak pronoun.

  3. …we see little Father Down… Wait. Not a Benny Hill fan?

  4. I intend eventually to replace this sentence with a link to an article that discusses in more depth how to avoid feeling distracted while programming. If you can read this after December 2014, then tell me to write this article now.

  5. This phenomenon relates to the Broken Windows Theory in which once we decide not to repair the first broken window in a neighborhood, vandalism and further damage follows soon thereafter.

  6. Do you remember the “Three A’s” of arrange, act, and assert? By action I mean the function that you intend to test with that test.

  7. Although we don’t generally agree on how to define cohesion, I find it useful to move similar things closer together and keep dissimilar things farther apart. This leads me towards higher (better) cohesion.

You have inherited some code. Congratulations. Now you need to change it.

There, there.

Michael Feathers once wrote that legacy code is “code without unit tests”. I use a slightly more general definition.

Legacy code is valuable code that we feel afraid to change.

I think that both parts matter. You probably accepted the “afraid to change” part without any need for convincing. (If not, then this article probably won’t interest you.) Moreover, if the code doesn’t generate significant value, then I don’t see much risk in changing it. If the cost of “getting it wrong” doesn’t significantly outweigh the profit we derive from “getting it right”, then who cares? Probably not I.

I treat valuable code with considerable respect. It provides food for families. I treat difficult-to-change code also with considerable respect, although this comes more from fear than admiration. If we put these two things together, then, quite simply, one false move and I might destroy an order of magnitude more profit than the yearly cost to keep me around.

This brings me to Rule Number Zero of Surviving Legacy Code:

Maximise safety.

We find ourselves in the typical chicken-and-egg problem: we want to write tests in order to refactor more safely, but then we remember that integrated tests are a scam℠[1] and decide that we’d rather break things apart a little in order to write less-costly-to-maintain tests. So which do we do first?

In a situation like this, I like to go back to my guiding principles.

Integrated tests are a scam℠ in part because they don’t put enough positive pressure on my designs and thereby don’t give me enough useful design feedback. Right now, I don’t care about this. I already know that the design needs significant work. I also know that I can’t handle the torrent of feedback that microtests would give me about the design.[2][3] If I want to use this principle to guide my behavior, then I need to find another justification.

Integrated tests remain a scam℠ in part because of the combinatoric explosion in the number of tests I need to achieve a strong level of coverage, which in this case correlates to confidence. I might have to write millions of tests to achieve high coverage. I probably only have time to write hundreds of tests, in which case I have to gamble about the level of coverage. Perchance, could I not care about coverage in this situation?

Test coverage, however one measures or defines it, links directly to safety in changing code. I want to use those tests as change detectors. I want the red light that flashes the moment I make a mistake. Microtests, especially if I write them first, give me that. They help me find mistakes immediately. They help drive down the cost of making a mistake, an essential technique for managing risk.[4] If I can’t write microtests cost-effectively, then what can I do?

What if, instead of a red light that flashes the moment I make (almost) any mistake, I had a pink light that flashes when I make a really obvious mistake? I can’t have what I want, but I can afford this; will it do? It will help more than doing nothing. I will simply buy as much of this confidence as I can afford. To do this, I combine two simple ideas: Golden Master and sampling.

Golden Master

I use Golden Master to help me detect changes in the behavior of a system when I can’t justify writing the typical kind of assertion that you’ve grown used to seeing in tests. I use this trick, for example, when I find it difficult to articulate the expected result of a test. Imagine a function whose output consists of an image. It happens quite often that a binary comparison between actual and expected result yields a hyperactive assertion: one which frequently fails even when a human would judge that the test had passed. I suppose some people know tricks to make it easier to articulate “looks similar enough” for images, but I don’t know how to do that, and that leaves me to choose either a hyperactive bit-by-bit comparison or ongoing, manual inspection. Rather than revert to the Guru Checks Output antipattern[5], however, I take a snapshot of the last-known acceptable output (I call that the golden master) and save it for future use. When I run the test again, I compare the output to the golden master, and if they match, then the test passes; if they don’t match, then the test fails. This doesn’t make the code wrong, but it means that I need to check the result and decide whether the code needs fixing or the golden master needs replacing.

You can use Golden Master wherever you already have some output to check, even if you find the form of that output particularly challenging. With this technique, you simply diff the output and inspect the situation only when you find differences between the current test run and the golden master. If your system already sends text to an output stream that you can capture, then you have the tools to use this technique.
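
To make the mechanics concrete, here is a minimal sketch, assuming the output fits in a string and the golden master lives in a single text file. The GoldenMaster class, its Result values, and the sample "total" text are all hypothetical names that I invented for this example; this is not how any particular library does it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class GoldenMaster {
    enum Result { RECORDED, MATCHED, DIFFERENT }

    // Compare the actual output to the saved snapshot. If there is no
    // snapshot yet, save this output as the golden master.
    static Result check(Path goldenMasterFile, String actualOutput) {
        try {
            if (!Files.exists(goldenMasterFile)) {
                Files.write(goldenMasterFile, actualOutput.getBytes(StandardCharsets.UTF_8));
                return Result.RECORDED;
            }
            String expected = new String(Files.readAllBytes(goldenMasterFile), StandardCharsets.UTF_8);
            return expected.equals(actualOutput) ? Result.MATCHED : Result.DIFFERENT;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Walks through the three situations using a temporary file.
    static List<Result> demo() {
        try {
            Path file = Files.createTempDirectory("golden").resolve("report.txt");
            return List.of(
                    check(file, "total: 42\n"),   // first run: nothing saved yet
                    check(file, "total: 42\n"),   // same behavior as last time
                    check(file, "total: 43\n"));  // behavior changed: go look
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [RECORDED, MATCHED, DIFFERENT]
    }
}
```

A DIFFERENT result does not say that the code is wrong. It says that a human needs to decide whether to fix the code or replace the golden master.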

I warn you, however, not to give in to the temptation to start scraping your output for specific information to check. Unless you have no other alternative, you will probably find it more cost-effective to carefully extract that information from the code and check it directly using good, old-fashioned assertEquals(). Don’t build a parser for an arbitrary, unplanned, probably context-sensitive grammar. That way lies madness. (Of course, if a context-free grammar happens to describe the format, then go for it. You’ve always wanted to learn lex and yacc, haven’t you?)

Sampling


I find one glaring problem with the Golden Master technique: if the output describes a long-running algorithm, process, or path through the system, then the golden master itself might describe only one of a thousand, million, or even billion potentially-interesting possible outputs. Welcome back to the combinatoric explosion problem that makes integrated tests such a scam℠. How do we proceed when we can’t possibly check the variety of paths through the system that we need to check?

Ideally, we refactor! I know that if I can break my system into many smaller, composable pieces, then I turn products into sums: instead of checking combinations of paths through multiple parts of the system at once, I can check the handful of pairwise connexions between parts of the system in relative isolation. I could turn millions of tests into hundreds. Unfortunately, in our current situation, I don’t feel comfortable refactoring, so that means that I have to sample the inputs and hope for the best.

You can find more sophisticated sampling systems out there among blogs written by experienced testers, but they all amount to sampling: if I can’t try every combination of inputs, then I try some combinations of some of the inputs and aim for the best coverage that I can.

This shouldn’t surprise you. You’ve done this before. You’ve written a function that operates on an integer, and you knew enough about the algorithm to identify boundary cases at, for example, -1, 0, and 1, as well as around 100 and 1000, so you check on the order of ten inputs and feel satisfied that the algorithm will work for the remaining few billion inputs. You were sampling.

In the case of legacy code, however, sometimes we can’t sample quite so intentionally. Sometimes even when we limit our scope to characteristic inputs, we have so many combinations of those inputs that we still can’t afford to write and run all those tests. In some cases, we don’t even know how to identify the characteristic inputs. In other cases, the algorithm itself has a random element, defeating our goal of writing nice, deterministic, repeatable tests. Random sampling to the rescue.

If you can use the random number generator to generate a stream of inputs to your system, then you can use this to generate a collection of output files, and that collection can act as your golden master. You only need to control the random number generator by seeding it with the same stream of seeds every time. I use a simple linear generating function like m + p * i where m and p represent arbitrarily-chosen numbers and i represents a loop index. Now I simply have to decide how big a sample to take. Generally speaking, a larger sample gives me more confidence in the sensitivity of the pink flashing light that signals danger.
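
Here is a small sketch of seeded sampling, under some loud assumptions: legacyRun() is a hypothetical stand-in for the legacy code, I collect the outputs in memory rather than in files, and the values of m and p are arbitrary numbers that I picked for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class SampledGoldenMaster {
    // Hypothetical stand-in for the legacy code: its behavior depends on
    // a random number generator that we pass in and therefore control.
    static String legacyRun(Random random) {
        StringBuilder output = new StringBuilder();
        for (int turn = 0; turn < 5; turn++) {
            output.append("roll ").append(random.nextInt(6) + 1).append('\n');
        }
        return output.toString();
    }

    // One output per seed, with seeds from the linear function m + p * i.
    // The same sample size always produces the same stream of seeds.
    static List<String> sample(int sampleSize) {
        final long m = 7919;    // arbitrary
        final long p = 104729;  // arbitrary
        List<String> outputs = new ArrayList<>();
        for (int i = 0; i < sampleSize; i++) {
            outputs.add(legacyRun(new Random(m + p * i)));
        }
        return outputs;
    }

    public static void main(String[] args) {
        List<String> goldenMaster = sample(100); // in real life, saved from an earlier run
        List<String> currentRun = sample(100);
        System.out.println(goldenMaster.equals(currentRun)
                ? "no change detected"
                : "pink light: something changed");
    }
}
```

Because the seeds repeat from run to run, any difference between the two collections comes from a change in the code, not from the randomness.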

I adjust the size of the sample depending on how long it takes to execute a test run, and how significantly that affects my flow while programming. I also adjust the size of the sample to match my fear level: the more worried I feel about getting something wrong, the larger sample I take while working, and I accept the cost of slowing down. I’d usually rather go a little too slow than a little too fast, because I know that the cost of making a mistake would likely dominate the savings from going more quickly.

The Techniques in Action

You can see an example of this technique in action by reading this code. If you’d like to see how I added this behavior to some legacy code, then start at this commit and follow the process step by step.

Although these techniques do not, on their own, guarantee success, when I combine Golden Master and Sampling, I can usually find a way to proceed safely. When I combine these with microcommitting[6], I can proceed at an even quicker pace. They help me avoid the Catch-22 problem that arises from needing to refactor dangerously unsafely in order to be able to refactor safely and sensibly. Where might you use Golden Master and Sampling to help get your arms (at least a little) around your legacy code?



Michael Feathers, Working Effectively with Legacy Code. Still the classic work on winding your way through legacy code.

J. B. Rainsberger, “Getting Started with Getting Things Done”. You don’t have time to read Getting Things Done? Start here. Four pages. It’ll be fine.

David Allen, Getting Things Done. I use it. Not all of it, and not all the time, but I use its core principles quite significantly in managing my work and home lives. No cult, I promise.

J. B. Rainsberger, “A Proven Path to Effectiveness”. A “method” that combines Getting Things Done and Test-Driven Development aimed specifically at the beleaguered programmer.

TextTest. A library to help you write text-based tests, such as I would use to provide golden masters. Do not download this tool until you have written your own golden master at least once. That is an order. After that, use TextTest, because it really helps.

  1. You can find a series of articles on that topic at

  2. When diving into legacy code, I find it more important than ever to keep stuff out of my head. During the two hours it takes to safely refactor some large function, I’m probably going to spot 14 potentially-useful refactorings. I can’t chase every bunny, no matter how cute they are. I need to write those ideas down, get them out of my head, and get back to the tricky surgery at hand.

  3. I see little point in spending energy generating a backlog knowing full well that I will never get around to doing about 80% of it. Who would volunteer to do that? (Ask your project manager if value-driven product development is right for em.)

  4. I claim that “the agile approach” to risk management complements the typical approach to risk management of limiting the probability of failure in order to limit exposure. “The agile way”, if you will permit me to use this shorthand, involves limiting the cost of failure instead. Eventually I will replace this sentence with a link to an article that goes into this topic in more detail.

  5. Marcia, the guru, looks at the output, pauses for a moment, then says, “Yep. That’s it.” If you want to re-run the test, then you need Marcia. That doesn’t seem to scale particularly well.

  6. Really frequent committing, like after changing a single line of code. No, really. Eventually I will replace this sentence with a reference to an article that explores this topic in more detail.


Some time ago a client asked me some questions about spies and mocks. I wanted to share what we discussed with you.

So here’s the issue my mind has been toiling over…

The project I’m on is using Jasmine for BDD. Technically though, I think most people aren’t actually executing real TDD/BDD. As in, they’re not letting the tests guide their design, but instead are sticking on unit tests at the end, after writing most of the code… this is what their tests suggest, at least.

I see, in their tests, a lot of spies and mocks. This tends to worry me,… especially the spies.

I see a lot of it as unnecessary, and even damaging. They appear to be reducing the module that they’re testing to nothing more than a series of spies and mocks. The thing they’re testing seems to bear little resemblance to the real run-time module.

From my perspective, mocking is very good and even essential in the cases of module dependencies that:

  1. Would add too many extraneous variables to the testing environment
  2. Add lag to the tests
  3. Are not semantically tied to the thing we’re testing

Examples I like are database mocks, ajax mocks etc.

But spies…. I’m very unsure of the value of spies.

The tests I’m reading are creating a series of spies… in fact, every method of the module is spied.. even private methods. The tests will call some public method (for example initiatePriceFeed()), and then assert success by ensuring that certain spied methods have been called. This just seems to be testing the implementation… not the actual exposed behavior, which is what I thought BDD/TDD was all about.

So finally, I have a few questions:

  • What is the best way to decide whether a spy is necessary?
  • Is it ever acceptable to test the implementation, instead of exposed behavior? (for example spying on private methods)
  • How do you decide what to mock and what not to?

I am sorry for the length of this email. There seem to be so many things I’d like to say and ask about TDD.

Note! In the JavaScript world, it’s common to talk about “spies” rather than “stubs”. A spy and a stub do the same thing. They only differ in intent. In what follows, you can treat “spy” and “stub” as synonyms with, I think, no risk of confusion.

That sounds common. I started doing test-first programming, rather than test-driven development. I probably spent two years focusing on tests as tests before I felt comfortable letting my tests guide my design.

I think the people writing all these spies and mocks do this because it “seems right”. People they respect do it. They need to spend some time practising the technique, so they do it at every opportunity. This corresponds to the Novice/Advanced Beginner stages of the Dreyfus Model: either they just want to practise the technique (Novice), or they feel comfortable using spies/expectations1 and treat every opportunity as an equally appropriate time to use them (Advanced Beginner). Good news: this is a natural part of learning.

Where to go next? Find one example where a module would benefit from depending on data, rather than another module. I go back to the difference between Virtual Clock (spy on the clock so that you can make it return hardcoded times) and Instantaneous Request (pass timestamps directly, rather than the clock, pushing the clock up one level in the call stack). Perhaps this will help people start to question where they could change their approach.
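Here is a minimal sketch of that difference, written in Python with the standard unittest.mock library. The functions is_overdue_with_clock() and is_overdue_at() are names I invented for illustration; they don't come from any real project.

```python
from datetime import datetime, timedelta
from unittest.mock import Mock


# Virtual Clock: the function depends on a clock object, so a test has to
# spy on (stub) the clock to make it return a hardcoded time.
def is_overdue_with_clock(due_date, clock):
    return clock.now() > due_date


# Instantaneous Request: the caller reads the clock one level up the call
# stack and passes the timestamp down as plain data.
def is_overdue_at(due_date, now):
    return now > due_date


due = datetime(2014, 1, 31)

# Checking the first version needs a spy standing in for the clock.
clock = Mock()
clock.now.return_value = due + timedelta(days=1)
overdue_via_clock = is_overdue_with_clock(due, clock)

# Checking the second version needs only data.
overdue_via_data = is_overdue_at(due, due + timedelta(days=1))
```

Both versions answer the same question. The second one simply moves the decision about where the time comes from to the caller.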

IMPORTANT! Instantaneous Request isn’t necessarily always better than Virtual Clock. Which you choose is less important than the discussions and thoughts that lead you to the choice. Also: starting to use Instantaneous Request over Virtual Clock means that the programmer is evolving, not the code. What matters is not “use fewer spies”, but rather “don’t let spies become a Golden Hammer”. Spies still help, I use them frequently, and I wouldn’t give them up.

I wrote about this approach in some detail in “Beyond Mock Objects”.

Regarding the value of spies, I don’t consider spies and expectations much different from one another. A spy is merely an expectation that doesn’t verify which methods were called—instead it waits for you to do that. In some tests, it’s not important to verify what happened, but rather to provide a hardcoded answer for any method our Subject uses. One rule of thumb: spies for queries, but expectations for actions. This works because we tend to want more flexibility in our queries, but more precision in the actions we invoke. Think of the difference between findAllOverdueBalances() and findAllBalances().selectBy("overdue")—it doesn’t matter how I find all the overdue balances. Spies simply make it easier to hardcode 0, 1, a few, or a large number of overdue balances, as each test needs.

So: spies for queries, but expectations for actions.
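As a rough illustration of the rule of thumb, here is a sketch in Python using unittest.mock. The names send_overdue_reminders(), find_all_overdue_balances() and send_reminder() are hypothetical. Note that in this library the same Mock class plays both roles: I only set a return value on the query, and I only verify calls on the action.

```python
from unittest.mock import Mock


# Hypothetical Subject: sends a reminder for every overdue balance.
def send_overdue_reminders(ledger, mailer):
    for balance in ledger.find_all_overdue_balances():  # a query
        mailer.send_reminder(balance)                    # an action


# Spy (stub) on the query: hardcode the answer this test needs.
ledger = Mock()
ledger.find_all_overdue_balances.return_value = ["balance-1", "balance-2"]

# Expectation on the action: verify precisely what the Subject invoked.
mailer = Mock()

send_overdue_reminders(ledger, mailer)

mailer.send_reminder.assert_any_call("balance-1")
mailer.send_reminder.assert_any_call("balance-2")
reminders_sent = mailer.send_reminder.call_count
```

The test never checks how the ledger found the overdue balances, only what the Subject did with them.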

Spy, then Spy, then Spy…

I understand your concern about series of spies, but let me check that I understand what you mean. When you say a series of spies, do you mean spying on A.getB() to return a spy B, whose B.getC() returns a spy C so that you can spy on C.theMethodIFindReallyInteresting()?
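To make the question concrete, this is the kind of chain I have in mind, sketched in Python with unittest.mock. Every name here is a placeholder taken from the question above or invented for the sketch.

```python
from unittest.mock import Mock

# Spy C answers the one question the Subject really cares about.
c = Mock()
c.the_method_i_find_really_interesting.return_value = 42

# Spy B exists only to hand over C, and spy A only to hand over B.
b = Mock()
b.get_c.return_value = c
a = Mock()
a.get_b.return_value = b


# A hypothetical Subject that digs through A to reach C.
def subject_digging_through_a(a):
    return a.get_b().get_c().the_method_i_find_really_interesting() + 1


# The same Subject, given the interesting value directly as data.
def subject_given_the_value(interesting_value):
    return interesting_value + 1


through_chain = subject_digging_through_a(a)
with_data = subject_given_the_value(42)
```

Three spies in the first case, none in the second, and the same result.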

As for ensuring that spied methods have been called, those “spies” become expectations, and it can feel like those tests only check the implementation. That’s OK. If the implementation is so simple that we can check it with a simple test, then that’s good! It’s like double-entry book-keeping in accounting. If the tests are complicated and only check implementation, then that usually points to a missing abstraction, or at least, obsession with unnecessary details (could be a missing abstraction or could just be an unnecessarily complicated API). This last point is an example of not listening to what the tests are trying to tell you.

Programmers generally have this feeling eventually that expectations mean “I’m just checking the implementation”. I had the same feeling once, so I asked myself, “assuming that this actually makes sense, what am I missing?” Well, if the interactions between objects were simpler, then this “checking the implementation” issue wouldn’t cause any real problems, would it? In fact, it would only clarify what we’re trying to do. Maybe, then, when checking the implementation feels weird, we could ask about potential underlying design problems, and if those problems disappeared, then we’d feel less weird. This is one of those cases.

Go to a few tests where you feel weird in this particular way, and look for duplication between the examples. You might be surprised!

When Is A Spy “Necessary”?

You ask about “the best way” to decide whether a spy is necessary (maybe appropriate). I don’t know of One Best Way. I use them, then let duplication drive changes. I especially look for duplicating unnecessary details in the test. If I have to duplicate details in a handful of tests, just to be able to check some other part of the system, then perhaps I have two things in one place, and when I separate them, the corresponding spies become much simpler, and sometimes I can replace a spy with data (from Virtual Clock to Instantaneous Request).

Is It Ever Acceptable…?

You also ask whether it is ever acceptable to test the implementation instead of the behavior. “Is it ever acceptable…?” questions almost always have the answer “yes”, because we can always find a situation in which something becomes acceptable. On the other hand, I don’t typically spy on private methods. If I need to know that level of detail in a test, then the test is trying to tell me that A cares too much about the internals of B. First, I try to remove unnecessary details from A’s tests. Next, I look for duplication in A’s tests. Especially if I spy on the same functions in the same sequence, that duplication points to a missing abstraction C.

So When to Mock?

I have two answers to this question. First, when do I use spies/expectations compared to simply using “the real thing”? I like to program to interfaces (or protocols, depending on the language) and I like to clarify the contracts of those interfaces, something that expectations help me do effectively. To learn more about this, read the articles I list at the end related to contract tests. Especially read “When Is It Safe to Introduce Test Doubles?”.

Finally, when I’m not sure whether to use a spy or an expectation, I go back to the rule of thumb: spy on queries, but expect (mock) actions.


Wikipedia, “Dreyfus model of skill acquisition”. Not everyone likes this model of how people develop skills. I find it useful and refer to it frequently in my work.

“Virtual Clock”. An overview of the Virtual Clock testing pattern, with further links.

J. B. Rainsberger, “Beyond Mock Objects”. I use test doubles (mock objects) extensively in my designs and they help me clarify the contracts between components. Even so, using test doubles mindlessly can interfere with seeing further simplifications in our design.

I apologise again for not having collected my thoughts about collaboration and contract tests into a single work. I need to find the time and energy (simultaneously) to do that. In the meantime, I have a few articles on the topic:

  1. In order to avoid confusion with the generic concepts of “mock objects” (better called “test doubles”), I use the term expectations to refer to what many people consider a mock: function foo() should be called with arguments 1, 2, 3.


I think that programmers worry far too much about design.

No, I don’t mean that they should care less about design. I think that programmers worry so much about design that they forget to just program. As they try to learn more about how to design software well, they become more reluctant to write code, fearing that “it won’t be right”. I think that we contribute to this reluctance by writing articles with a tone that implies don’t write code unless you write it my way. I don’t think we mean to do this, but we do it nonetheless.

What if we thought about design a slightly different way? Let’s not think about design principles as constraints for how we write code, but rather as suggestions for how code wants to flow. Focus your energy first on writing correct code, then use the principles of design that you’ve learned to guide the flow of code from where you’ve written it to where it seems to belong. If you prefer a more direct metaphor, then imagine you’re writing prose. Rather than obsessing over the rules of grammar on your first draft, use them to guide how you edit. Let yourself more freely write your first draft without fear of “getting it wrong”, then use your editing passes to fix grammar errors, improve clarity and elevate style.

Now you’ve probably heard this before. “Make it work, then make it right, then make it fast.” This constitutes the same advice. So why repeat it? You probably also know that sometimes we need to hear the same advice in a variety of forms before we feel comfortable using it. I’ve been talking in my training classes about “code flow” for a few years, and it seems to help some people feel more comfortable adopting an evolutionary design approach. In particular it helps some programmers avoid feeling overwhelmed by design principles to the point of not wanting to write any code at all, for fear of “doing it wrong”. After all, the more we say that “code is a liability”, the more people will tend to think of writing code as an evil act. That sounds extreme, but so does some of our rhetoric!

When I teach software design, usually through test-driven development, one or two people in the class commonly ask me questions like “Can I use set methods?” or “Can I write a second constructor?”, which convey to me a feeling of reluctance to “break the rules”. I really don’t want my course participants to feel like I want to stop them from writing code; on the contrary, I want them to feel more comfortable writing code precisely because they can apply their newly-learned design principles to improve their designs easily and quickly over time. I expect them to feel less fear as their design skills improve, because no matter what crap they write in the morning, they can mold it into something beautiful in the afternoon. I have to remind myself to approach code this way, rather than worrying too much about “getting it right the first time”.

An Example

Consider this article on the topic of encapsulation. I like it. I think it explains a few key points about encapsulation quite well. Unfortunately, it includes a line that, out of its context, contributes to this fear-based mindset that I’ve seen so often:

If you ever use a setter or define an attribute of a component from the outside, you’re breaking encapsulation.

I remember myself as an inexperienced programmer trying to improve at my craft. That version of me would have read this sentence and thought I must not use setters any more. This would invariably lead me to a situation where I would refuse to write a setter method, even when I have no other option. (Sometimes tools get in the way.) This way lies design paralysis. When I’ve written over the years about design principles, I’ve certainly not wanted to make it harder for you to write code.

What Should I Do, Then?

Later in the same article, the author writes this:

It’s common in Rails projects to use patterns such as User.where("something = something_else") from controllers or service classes. How do you know the internal of the database to be able to pass that SQL parameters? What happens if you ever change the database? Or User? Instead, User.some_method is the way to go.

I agree with the principle and the example. I would, however, like to highlight a different way to interpret this passage. Rather than thinking, “I should never write User.where("something = something_else")”, think of it this way instead:

I’ll write User.where("something = something_else") for now, just because I know it should work, but I probably shouldn’t leave it like that once it’s working.

Don’t let design guidelines (like improve encapsulation) stop you from writing the code you need to write in the moment (as a first draft), but rather use them to guide the next steps (your editing). Don’t let design guidelines stop you from getting things working, but rather use them to stop you from leaving freshly-written legacy code behind.

So What’s This About Code Flow?!

Many programmers offer suggestions (for varying meanings of “suggest”) for where to put code. Some programmers write frameworks to try to constrain where you put code, lest you make (what they consider) silly mistakes. This eventually leads even experienced programmers into situations where they feel like they’re stuck in some Brazilesque bureaucracy preventing them from writing the one line of code they need to make something work. (You need a controller, a model, a DTO, a database client object, …) Instead of thinking of “the right place to put things”, I prefer to offer suggestions about how to move code closer to “where it belongs”.

Going back to the previous example from that encapsulation article, I would certainly have no problem writing User.where("role = 'admin'") directly in a controller just to get things working, but I just know that if I leave the design at that, then I will have set a ticking time bomb for myself to explode at some unforeseen and, almost certainly, inopportune time. As a result, once I get my tests passing with this poorly-encapsulated code, then I can take a moment to look at that code, ask what does this mean?, realise that it means “the user is an admin”, then extract the function User.admin?. In the process, the details in this code will have flowed from the controller into the model, where they seem to belong.
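The same flow can be sketched outside Rails. The following is a small Python analogue, not the Rails code from the article: the User class, its role field and both controller functions are assumptions I made up to show the first draft and the edited version side by side.

```python
from dataclasses import dataclass


@dataclass
class User:
    name: str
    role: str

    # After editing: the detail "role == 'admin'" lives in the model,
    # behind a name that says what it means.
    def is_admin(self):
        return self.role == "admin"


# First draft: the controller knows how roles are stored.
def admin_names_first_draft(users):
    return [user.name for user in users if user.role == "admin"]


# After editing: the controller only asks the question.
def admin_names_after_editing(users):
    return [user.name for user in users if user.is_admin()]


users = [User("ada", "admin"), User("bob", "member")]
first_draft_result = admin_names_first_draft(users)
edited_result = admin_names_after_editing(users)
```

The first draft got things working. The edit changed nothing about the behavior, only where the detail lives.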

I have found this pattern repeating itself in all the major application frameworks I’ve ever used: while learning the framework I put code directly into the controller/transaction script/extension point, and after I’ve written several of these and found them difficult to test or change, the details flow into more suitable components, often representing a model or view (or a little bit of each). By understanding my design principles in terms of where code ought to flow, I get the benefits of better design without the paralysing fear that I might “get it wrong”.

So if you need to break encapsulation, just for now, just to get something working, then do it. Just don’t leave it like that.


Alexandre de Oliveira, “Complexity in Software 2: Honor Thy Encapsulation”. In this article, Alexandre talks about “real” and “perceived” complexity in software design, which seem to me to relate to Fred Brooks’ concepts of “essential” and “accidental” complexity. He also includes a definition for encapsulation that I hadn’t read before, and that I quite like. Enjoy the article.


I don’t intend to argue Alistair’s contention one way or the other, but I invite you to set aside some time to read David Parnas’ paper “On the Criteria To Be Used in Decomposing Systems into Modules”, which I have embedded in this article. Do not let yourself be put off by the quaint-sounding title. If you prefer, think of it as titled “The Essence of Modularity”.

I care about this paper because I strive for modularity in designing software systems and I find that programmers routinely lose sight of both what modularity offers them and what it means. I value modularity as a way to drive down the cost of changing software. I value that because most of what we do as programmers consists of changing software, and so it strikes me as a sensible place to economise.

If you can’t take the time to read the whole paper now, then let me direct you to a particularly salient part of the conclusion.

We have tried to demonstrate by these examples that it is almost always incorrect to begin the decomposition of a system into modules on the basis of a flowchart. We propose instead that one begins with a list of difficult design decisions or design decisions which are likely to change. Each module is then designed to hide such a decision from the others.
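To give a feel for what “hiding a decision” can look like in code, here is a tiny Python sketch. It is my own illustration, not an example from Parnas’ paper, and the Inventory class and its methods are hypothetical.

```python
class Inventory:
    """Hides one decision that seems likely to change: how stock is stored."""

    def __init__(self):
        # Today: a dictionary in memory. Tomorrow this might become a file
        # or a database table, and no client would need to know.
        self._levels = {}

    def receive(self, sku, quantity):
        self._levels[sku] = self._levels.get(sku, 0) + quantity

    def in_stock(self, sku):
        return self._levels.get(sku, 0) > 0


inventory = Inventory()
inventory.receive("widget", 3)
widget_available = inventory.in_stock("widget")
gadget_available = inventory.in_stock("gadget")
```

Clients only ever call receive() and in_stock(), so changing the storage decision touches one module.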

Enjoy the paper.


J. B. Rainsberger, “Modularity. Details. Pick One.”. We introduce modularity by refusing to let details burden us.

Martin Fowler, Refactoring: Improving the Design of Existing Code. A classic text that takes an evolutionary approach to increasing modularity in a software system.


I wanted to change some of the styling at, but I have a legacy Wordpress template, so I needed a way to start making incremental changes with something remotely approximating tests. I knew that I didn’t want to have to crawl every page to check that every pixel remained in the same place, in part because that would kill me, and in part because I don’t need every pixel to remain in the same place. I needed another way.

How to Refactor CSS/SCSS

I chose to replace the raw CSS with SCSS using the WP-SCSS Wordpress plugin. Since I had all this legacy CSS lying around in imported files and I had fiddled with some of it before I knew how the original authors had organised it, I needed to consolidate the CSS rules as soon as possible so that I could change them without accidentally breaking them.

First, I created one big CSS file (the “entry point”) that imports all the other CSS files. Then, in order to use WP-SCSS effectively, I needed to move the imported files into a subdirectory css/, so that I could generate CSS from only the SCSS located in scss/. This meant changing some @import statements that loaded images using a relative path. I fixed those with some simple manual checks that the images load correctly before and after the change. (Naturally, I discovered the problem by accident, then fixed it.) At this point I had one big CSS entry point that imported a bunch of other CSS files from css/. I committed this to version control and treated it as the Golden Master1.

Next, I copied all the CSS “partials” into genuine SCSS partials and changed the entry point to import a single generated CSS file. I created an SCSS entry point that imports all the SCSS partials. This should generate the same CSS entry point, but get rid of all the little generated CSS “partials”. It did. I committed this to version control.

Now I can freely change my SCSS, generate the CSS, and check the git index for changes. As long as only the SCSS changes and the generated CSS doesn’t change, I definitely haven’t broken the CSS. If the generated CSS changes, then I check the affected web pages by hand and either undo the change or commit the generated CSS as the new golden master.
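I did this check with the git index, but the idea fits in a few lines of code. Here is a minimal Python sketch of the comparison using the standard difflib module. The function name and the sample CSS are assumptions for illustration, and a real check would read the committed and freshly generated files from disk.

```python
import difflib


def diff_against_golden_master(golden_css, generated_css):
    """Return unified-diff lines; an empty list means nothing changed."""
    return list(difflib.unified_diff(
        golden_css.splitlines(),
        generated_css.splitlines(),
        fromfile="golden master",
        tofile="freshly generated",
        lineterm="",
    ))


golden = "a { color: red; }\n.nav { margin: 0; }\n"
after_safe_refactoring = "a { color: red; }\n.nav { margin: 0; }\n"
after_risky_change = "a { color: blue; }\n.nav { margin: 0; }\n"

# No differences: the SCSS refactoring left the generated CSS alone.
unchanged = diff_against_golden_master(golden, after_safe_refactoring)

# Differences: time to check the affected pages by hand.
changed = diff_against_golden_master(golden, after_risky_change)
```

An empty diff means I can commit without looking at a single page. A non-empty diff tells me exactly which rules to go and inspect.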

I hope this helps you deal with your own legacy CSS. You know you have some.

  1. This refers to the Golden Master technique where we check the result once by hand, then compare future versions automatically to the hand-checked “golden master” to detect changes. It’s like testing.


I have written elsewhere that people, not rules, do things. I have written this in exasperation over some people claiming that TDD has ruined their lives in all manner of ways. Enough!

People, not rules, design software systems. People decide which rules to follow and when. The (human) system certainly influences them, but ultimately, the people decide. In particular, people, not TDD, decide how to design software systems. James Shore has recently written “How Does TDD Affect Design?” to offer his opinion, in which he leads with this.

I’ve heard people say TDD automatically creates good designs. More recently, I’ve heard David Hansson say it creates design damage. Who’s right?

Neither. TDD doesn’t create design. You do.

I agree. Keith Braithwaite responded in a comment with this.

TDD does not by itself create good or bad designs, but I have evidence (see “Complexity and Test-First 0”) suggesting that it does create different designs.

Keith’s comment triggered me to think about how practising TDD has affected the way I design software systems, of which this article represents a summary. I might add more to this list over time. If you’ve noticed an interesting pattern in your designs that you attribute to your practice of TDD, then please share that in the comments.

More value objects, meaning objects with value-equality over identity-equality. I do this more because I want to use assertEquals() a lot in my tests. This also leads to smaller functions that return a value object. This also leads specifically to more functions that return a value object signifying the result of the function, where I might not have cared about the result before. Sometimes this leads to unnecessary code, and when it does, I usually find that I improve the design by introducing a missing abstraction, such as an event.
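For readers who want to see value-equality in action, here is a short Python sketch using the standard dataclasses module, which generates the equality check for me. Money and add() are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


# @dataclass generates __eq__ from the fields, so two Money objects with
# the same cents and currency compare as equal even though they are
# distinct objects in memory.
@dataclass(frozen=True)
class Money:
    cents: int
    currency: str


# A small function that returns a value object signifying its result.
def add(a, b):
    if a.currency != b.currency:
        raise ValueError("cannot add different currencies")
    return Money(a.cents + b.cents, a.currency)


total = add(Money(250, "CAD"), Money(795, "CAD"))
expected = Money(1045, "CAD")
```

With value-equality in place, the test reduces to a single comparison of total against expected.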

More fire-and-forget events. I do this more because I want to keep irrelevant details out of my tests. Suppose that function X should cause side-effect Y. If I check for side-effect Y, then I have to know the details of how to produce side-effect Y, which usually leads to excessive, duplicate setup code in both X’s tests and Y’s tests. Not only that, but when X’s tests fail, I have to investigate to learn whether I have a problem in X or Y or both. Whether I approach this mechanically (remove duplication in the tests) or intuitively (remove irrelevant details from the tests), I end up introducing event Z and recasting my expectations of X to “X should fire event Z”. This kind of thing gives many programmers the impression of “testing the implementation”, whereas I interpret this as “defining the essential interaction between X and the rest of the system”. The decision to make function X fire event Z respects the Open/Closed Principle: inevitably I want X to cause new side-effects A, B, and C. By designing function X as a source for event Z, I can add side-effects A, B, and C as listeners for event Z without changing anything about function X. This leads me to see the recent (as of 2014) trend towards Event Sourcing as a TDD-friendly trend.
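Here is a rough Python sketch of “X should fire event Z”. Everything in it is invented for illustration: place_order() plays the role of X, OrderPlaced plays the role of Z, and the tiny EventBus stands in for whatever publish/subscribe mechanism a real system would use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderPlaced:  # event Z
    order_id: int


def place_order(order_id, publish):  # function X
    # X does its own work, then fires the event and forgets about it.
    publish(OrderPlaced(order_id))


class EventBus:
    def __init__(self):
        self._listeners = []

    def subscribe(self, listener):
        self._listeners.append(listener)

    def publish(self, event):
        for listener in self._listeners:
            listener(event)


# X's test only records which events X fired; it knows nothing about
# the side-effects that eventually happen.
fired = []
place_order(42, fired.append)

# In the running system, side-effects attach as listeners, and adding a
# new one requires no change to place_order().
bus = EventBus()
emails, audit_log = [], []
bus.subscribe(lambda event: emails.append("confirm order %d" % event.order_id))
bus.subscribe(audit_log.append)
place_order(7, bus.publish)
```

The test for X stays the same no matter how many listeners I add later.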

More interfaces in languages that have interface types. In the old days, we had to introduce interfaces (in Java/C#) in order to use the cool, new dynamic/proxy-based mocking libraries, like EasyMock and JMock. Since the advent of bytecode generators like cglib, we no longer need to do this, but my habit persists of introducing interfaces liberally. Many programmers complain about having only one implementation per interface, although I still haven’t understood what makes that a problem. If the language forces me to declare an interface type in order to derive the full benefits of abstraction, then I do it. At least it encourages me to organise and document essential interactions between modules in a way that looser languages like Ruby/Python/PHP don’t. (Yes, we can implement interfaces in the duck-typing languages, but Java and C# force us to make them a separate type if we want to use them.) Moreover, the test doubles themselves act as additional implementations of the interfaces, which most detractors fail to notice. They might argue that I overuse interfaces, but I argue that they underuse them. Interfaces provide an essential service: they constrain and clarify the client’s interaction with the rest of the system. Most software flaws that I encounter amount to muddled interactions—usually misunderstood contracts—between modules. I like the way that the interfaces remind me to define and refine the contracts between modules.
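Even in a duck-typing language I can write the interface down. The following Python sketch uses typing.Protocol as a loose analogue of a Java or C# interface. Catalog, InMemoryCatalog and display_text_for() are hypothetical names, and the in-memory implementation is exactly the kind of test double that acts as a second implementation.

```python
from typing import Optional, Protocol


class Catalog(Protocol):
    """The contract: return the price in cents, or None when unknown."""

    def find_price(self, barcode: str) -> Optional[int]:
        ...


# A test double that satisfies the contract without any database.
class InMemoryCatalog:
    def __init__(self, prices_by_barcode):
        self._prices = dict(prices_by_barcode)

    def find_price(self, barcode: str) -> Optional[int]:
        return self._prices.get(barcode)


# The client depends only on the interface, not on any implementation.
def display_text_for(barcode: str, catalog: Catalog) -> str:
    price = catalog.find_price(barcode)
    if price is None:
        return "Product not found: " + barcode
    return "${:.2f}".format(price / 100)


catalog = InMemoryCatalog({"12345": 795})
found_text = display_text_for("12345", catalog)
missing_text = display_text_for("99999", catalog)
```

Writing the Protocol forces me to decide what “not found” means, which is the kind of contract question I want the interface to raise.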

Immutability. As functional programming languages have become more popular, I’ve noticed more talk about mutability of state, with an obvious leaning towards immutability. In particular, not only do I find myself wanting functions more often to return value objects, but specifically immutable value objects. Moreover, thinking about tests encourages me to consider the pathological consequences of mutability. This happened recently when I wrote “The Curious Case of Tautological TDD”. Someone responded to the code I’d written pointing out a problem in the case of a mutable Cars class. I had so long ago decided to treat all value objects as immutable that I’d even forgotten that the language doesn’t naturally enforce that immutability. I’ve valued immutability for so long that, for me, it goes without saying. I reached this point after writing too many tests that only failed when devious programmers took advantage of unintended mutability, such as when a function returns a Java Collection object. I went through a phase of ensuring that I always returned an unmodifiable view of any Collection, but after a while, I simply decided to treat every return value as immutable, for the sake of my sanity. Functional languages push the programmer towards more enforced immutability, and even the eradication of state altogether. I feel like my experience practising TDD in languages like Java and Ruby has prepared me for this shift, so it already feels quite natural to me; in fact, it annoys me when I have to work in a language that doesn’t enforce immutability for me.
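The “unmodifiable view” phase looks something like this in Python. This is a sketch under my own assumptions: Garage is a made-up class, and returning a tuple is one simple way to get the effect, not the only one.

```python
class Garage:
    def __init__(self, cars):
        # Copy on the way in, so later changes to the caller's list
        # cannot reach our internal state.
        self._cars = list(cars)

    def cars(self):
        # A tuple offers no append/remove, so callers cannot mutate
        # what we hand back.
        return tuple(self._cars)


original = ["Civic", "Golf"]
garage = Garage(original)
original.append("Model T")  # mutating the input after the fact

snapshot = garage.cars()
try:
    snapshot.append("Beetle")
    mutation_blocked = False
except AttributeError:
    mutation_blocked = True
```

Neither the caller’s later append nor the attempt to append to the returned value changes what the garage holds.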

How has TDD affected the way you design? Or, perhaps more importantly, what about the way TDD might affect your designs makes you uneasy about trying it? I might have some useful advice for you.


I’m publishing this as a “rough cut”, so I apologise to everyone annoyed by having to download a PDF to read this article. I have my reasons. Some of the links in the document don’t work; the links to external web sites, however, should. A handful of people have already suggested improvements and reported problems, which I appreciate, particularly the cordial and civil manner in which they’ve done it. (Hint.)

(In response to complaints, I have removed the embedded PDF viewer. It was an experiment. Thank you for your consideration.)


Recently Bob Marshall opined that refactoring code is waste. This reminds me of passionate discussions from a decade ago about testing: should we classify testing as a value-added activity or as an unavoidable waste? I’d like to change the question a little, but first, allow me to play the ball where it lies.

If you haven’t read Bob’s article yet, then do so now. You’ll find it quite short; I read it in a few minutes. I composed this as a comment to Bob’s article, but it expanded to the point where I chose to promote it to a short article. You might say that I refactored my writing. With that segue manufactured…

Is editing waste for a writer? Why don’t writers simply write the right words/articles/books/sentences the first time? So I think it goes for programmers. I think of refactoring as editing for programmers. Since I plan to refactor, I don’t have to program like Mozart and “get it right” in my head before writing it down. This helps me, because often I don’t see trouble with code until I’ve written it down, even though sometimes drawing its structure helps me enough to spot trouble.

Sometimes problems don’t emerge until long after I’ve written it down and the situation changes, putting pressure on an old choice or negating an old assumption. Absolutely/permanently right-first-time seems to require clairvoyance. Writing any code entails risk.

Even so, I agree that we programmers don’t need to deny our own experience just to fit some arbitrary goal of taking tiny steps and refactoring towards abstractions. (This has got me in trouble with some people who declare what I do “not TDD”. As they wish.) Sometimes I can see the abstractions, so I go there sooner. Sometimes that doesn’t work out, so I refactor towards different abstractions. Often it works out and I’ve skipped a handful of tedious intermediary steps. One could measure my “expertise” in design by measuring the additional profit I can squeeze out of these trade-offs compared to others. (No, I don’t know how to measure that directly.) I think we broadly call that “judgment”.

A Question of Intent

I find refactoring wasteful when I do it out of habit, rather than with a purpose. Nevertheless, I don’t know how I could have developed the judgment to know the difference without making a habit of refactoring. (Of course, I like to think that I do everything always with a purpose.) I encourage novices (in the Dreyfus Model sense) to force code into existence primarily through refactoring with the purpose of developing that judgment and calling into question their assumptions about design. That reasoning sounds circular, but I have written and said elsewhere how refactoring helps programmers smooth out the cost of maintaining a system over time. I can only assert that I produce more maintainable software this way, compared to what I used to do, and that refactoring plays a role. I really wish I knew how much of that improvement to attribute to refactoring. Refactoring still saves my ass from time to time, so it must pull some of its own weight.

I would classify refactoring as waste in the same way that I’d classify verification-style testing as waste: since we don’t work perfectly, we need feedback on the fitness of our work. Not only that, but I refactor to support not having to future-proof my designs, because of the waste of building structures now that we don’t intend to exploit until later. Which waste costs more? I find that open question quite interesting.


Bob Marshall, “Code Refactoring”. In his article, Bob surmises that programmers can’t quite “get it right” in their heads, and highlights refactoring as potentially a self-fulfilling waste: if we assume that we have to live with it, then we will choose to live with it. I leave the parallel with #NoEstimates as an exercise for the reader.

Gemma Cameron, “Is Refactoring Waste?”. I noticed Gemma’s article on Twitter and it led me to read Bob’s. She mentions that she plans to experiment with a TDD microtechnique that I use often: noticing while ‘on red’ that a little refactoring would make it easier to pass the test, and so ignoring (or deleting) the test to get back to ‘green’ in order to refactor safely. I don’t always do this, but I consider it part of the discipline of TDD and teach it in my training courses.

“The Dreyfus Model of Skill Acquisition”. Wikipedia’s introduction to the topic. All models are wrong; some models are useful. I find this one helpful in explaining to people the various microtechniques that I teach, when I follow them and when I don’t.

J. B. Rainsberger, “The Eternal Struggle Between Business and Programmers”. The article in which I make the case for refactoring as a key element in reducing the cost of adding features to a system over time.