Workshop Position Statement
Software Archeology: Understanding Large Systems
OOPSLA 2001
Ward Cunningham
Cunningham & Cunningham, Inc.
At times one must become familiar with a large body of software. Examples include bug fixing legacy programs, due diligence purchasing intellectual property and legal disputes over patents and copyrights. It is my position that custom source code browsing and searching tools provide the most efficient means for gaining such familiarity and that a server based engine is a convenient custom tool building environment. In this report I further summarize an example of such a tool implemented as a pair of cgi scripts.
Programmers use various mechanisms for structuring code. Sometimes these are dictated by the programming environment, such as classes and methods in Smalltalk or packages and files in Java. More structure is imposed by the language using various syntax for, say, blocks and statements. The change history may be available from a code repository. And further structural clues may be available from coding conventions or artifacts present in the problem domain addressed by the program.
Typically a browsing mechanism for large, structured programs will expose a small part of the program in response to some form of query, possibly as simple as picking from a list. However, when examining an unfamiliar program one needs to get a feel for the whole program at once. To meet this need one is better off writing a custom tool for viewing the whole program. Such a tool is easily written as a cgi script which runs on a server, reads all of the code base and then reports a summary of that code as a single html document. Here are some excerpts of such a summary document produced from the Java 1.3 source code distribution:
This report consists of 1800 lines corresponding to the 1800 source fines in the distribution. Each file is numbered, named and summarized on a single line (which may be folded by the browser.) In this case the summary is formed by eliding all but select punctuation characters from the file. I call this the file's "signature". The quantity of punctuation (length of signature) gives an intuitive sense for the size of the file while the patterns that appear in the punctuation adds to this a sense of complexity and repetition. The complete report from which these excerpts have been extracted is available here (400k).
Of course such radical summarization as offered by punctuation signatures
requires some time to get use to. Do not expect to learn much from browsing
the examples in this paper. Rather, the process of developing a sense for
the report runs in parallel to and is cooperative with familiarization
with the code under inspection. It is an interactive process that repeats
the following steps.
In the inspection step one forms theories about the whole body of
code. In the projection step one exploits (or refutes) these theories by
performing ad-hoc experiments based on the code. The process terminates
when one can find and understand any desired feature of the code base with
modest effort.
There are a number of details to this process that deserve attention. First is the mechanism of actually inspecting code. Features in the summary report can be hyperlinked to the code they summarize. This allows rapid alternation between summary and detail, a technique that exploits both visual pattern recognition and short term memory. Rapid hyperlinked retrieval of source code can be done through a second cgi script. Such a script should take as arguments file name and line (byte) offset and use these to issue the necessary system calls to retrieve the relevant source code without noticeable delay.
The same source code retrieval cgi script can do some additional formatting to facilitate rapid reading. The following is an excerpt of the html formatted java source extracted from Sun's Java 1.3 distribution. The complete report is available here. Important text has been emphasized by enlarging it. Unimportant text (in this case comments and some punctuation "noise") has been de-emphasized by printing in gray.
/* * @(#)BandedSampleModel.java 1.27 00/02/02 * * Copyright 1997-2000 Sun Microsystems, Inc. All Rights Reserved. * * This software is the proprietary information of Sun Microsystems, Inc. * Use is subject to license terms. * */ /* **************************************************************** ****************************************************************** ****************************************************************** *** COPYRIGHT (c) Eastman Kodak Company, 1997 *** As an unpublished work pursuant to Title 17 of the United *** States Code. All rights reserved. ****************************************************************** ****************************************************************** ******************************************************************/ package java.awt.image; /** * This class represents image data which is stored in a band interleaved * fashion and for * which each sample of a pixel occupies one data element of the DataBuffer. * It subclasses ComponentSampleModel but provides a more efficent * implementation for accessing band interleaved image data than is provided * by ComponentSampleModel. This class should typically be used when working * with images which store sample data for each band in a different bank of the * DataBuffer. Accessor methods are provided so that image data can be * manipulated directly. Pixel stride is the number of * data array elements between two samples for the same band on the same * scanline. The pixel stride for a BandedSampleModel is one. * Scanline stride is the number of data array elements between * a given sample and the corresponding sample in the same column of the next * scanline. Band offsets denote the number * of data array elements from the first data array element of the bank * of the DataBuffer holding each band to the first sample of the band. * The bands are numbered from 0 to N-1. * Bank indices denote the correspondence between a bank of the data buffer * and a band of image data. This class supports * {@link DataBuffer#TYPE_BYTE TYPE_BYTE}, * {@link DataBuffer#TYPE_USHORT TYPE_USHORT}, * {@link DataBuffer#TYPE_SHORT TYPE_SHORT}, * {@link DataBuffer#TYPE_INT TYPE_INT} datatypes */ public final class BandedSampleModel extends ComponentSampleModel { /** * Constructs a BandedSampleModel with the specified parameters. * The pixel stride will be one data element. The scanline stride * will be the same as the width. Each band will be stored in * a separate bank and all band offsets will be zero. * @param dataType The data type for storing samples. * @param w The width (in pixels) of the region of * image data described. * @param h The height (in pixels) of the region of image * data described. * @param numBands The number of bands for the image data. * @throws IllegalArgumentException if <code>dataType</code> is not * one of the supported data types */ public BandedSampleModel(int dataType, int w, int h, int numBands) { super(dataType, w, h, 1, w, BandedSampleModel.createIndiciesArray(numBands), BandedSampleModel.createOffsetArray(numBands)); } /** * Constructs a BandedSampleModel with the specified parameters. * The number of bands will be inferred from the lengths of the * bandOffsets bankIndices arrays, which must be equal. The pixel * stride will be one data element. * @param dataType The data type for storing samples. * @param w The width (in pixels) of the region of * image data described. * @param h The height (in pixels) of the region of * image data described. * @param numBands The number of bands for the image data. * @param scanlineStride The line stride of the of the image data. * @param bankIndices The bank index for each band. * @param bandOffsets The band offset for each band. * @throws IllegalArgumentException if <code>dataType</code> is not * one of the supported data types */ public BandedSampleModel(int dataType, int w, int h, int scanlineStride, int bankIndices[], int bandOffsets[]) { super(dataType, w, h, 1,scanlineStride, bankIndices, bandOffsets); } /** * Creates a new BandedSampleModel with the specified * width and height. The new BandedSampleModel will have the same * number of bands, storage data type, and bank indices * as this BandedSampleModel. The band offsets will be compressed * such that the offset between bands will be w*pixelStride and * the minimum of all of the band offsets is zero. * @param w the width of the resulting <code>BandedSampleModel</code> * @param h the height of the resulting <code>BandedSampleModel</code> * @return a new <code>BandedSampleModel</code> with the specified * width and height. * @throws IllegalArgumentException if <code>w</code> or * <code>h</code> equals either * <code>Integer.MAX_VALUE</code> or * <code>Integer.MIN_VALUE</code> * @throws IllegalArgumentException if <code>dataType</code> is not * one of the supported data types */ public SampleModel createCompatibleSampleModel(int w, int h) { int[] bandOffs; if (numBanks == 1) { bandOffs = orderBands(bandOffsets, w*h); } else { bandOffs = new int[bandOffsets.length]; } SampleModel sampleModel = new BandedSampleModel(dataType, w, h, w, bankIndices, bandOffs); return sampleModel; } /** * Creates a new BandedSampleModel with a subset of the bands of this * BandedSampleModel. The new BandedSampleModel can be * used with any DataBuffer that the existing BandedSampleModel * can be used with. The new BandedSampleModel/DataBuffer * combination will represent an image with a subset of the bands * of the original BandedSampleModel/DataBuffer combination. * @throws RasterFormatException if the number of bands is greater than * the number of banks in this sample model. * @throws IllegalArgumentException if <code>dataType</code> is not * one of the supported data types */ public SampleModel createSubsetSampleModel(int bands[]) { if (bands.length > bankIndices.length) throw new RasterFormatException("There are only " + bankIndices.length + " bands"); int newBankIndices[] = new int[bands.length]; int newBandOffsets[] = new int[bands.length]; for (int i=0; i<bands.length; i++) { newBankIndices[i] = bankIndices[bands[i]]; newBandOffsets[i] = bandOffsets[bands[i]]; } return new BandedSampleModel(this.dataType, width, height, this.scanlineStride, newBankIndices, newBandOffsets); } /** * Creates a DataBuffer that corresponds to this BandedSampleModel, * The DataBuffer's data type, number of banks, and size * will be consistent with this BandedSampleModel. * @throws IllegalArgumentException if <code>dataType</code> is not * either <code>DataBuffer.TYPE_BYTE</code>, * <code>DataBuffer.TYPE_USHORT</code>, or * <code>DataBuffer.TYPE_SHORT</code>, or * <code>DataBuffer.TYPE_INT</code> */ public DataBuffer createDataBuffer() { DataBuffer dataBuffer = null; int size = scanlineStride * height; switch (dataType) { case DataBuffer.TYPE_BYTE: dataBuffer = new DataBufferByte(size, numBands); break; case DataBuffer.TYPE_USHORT: dataBuffer = new DataBufferUShort(size, numBands); break; case DataBuffer.TYPE_SHORT: dataBuffer = new DataBufferShort(size, numBands); break; case DataBuffer.TYPE_INT: dataBuffer = new DataBufferInt(size, numBands); break; } return dataBuffer;Just as with the summary report, the detail report can offer hyperlinks to additional renderings of the code. In this case the method names are hyperlinked back to the summary cgi script but with an additional search parameter that adds to the summary an indication of each method reference where it occurs within the summarizing signatures. These are highlighted with a contrasting color so that they are easy to spot within a lengthy summary.
Both the summary and detail cgi scripts can produce a large amount of html output. I continue to be impressed with my browser's ability to present these documents. (This work was done using Netscape version 4.5.) I am able to fill a large screen with highly textured information and scroll through it at will. I have on occasion written summarizing cgi scripts that take some time to run. My client-server configuration and my browser's ability to display partial results keeps even this activity interactive. I've had acceptable response browsing several hundred megabytes of source by scrolling through summary documents consisting of multiple megabytes of html.
It is also important to keep the cgi scripts short and simple for without this one cannot perform browsing experiments with sufficient ease. For this reason I often write the scripts fresh each time rather than trying to preserve any investment in prior versions. Likewise, it is important that one understands exactly what the scripts do even if what they do is not 100% correct from a language syntax parsing point of view. For these reasons I've always relied on Perl for writing these scripts. Perl has excellent file and text processing facilities. With these caveats and for example use only I offer both the summary script and the detail script used to create these examples. These scripts were written as extensions to a WikiWikiWeb server but do not use any important facility not available within the cgi framework of most web servers.
To a small degree the report formats I use are inspired by Edward R. Tufte
who is famous for his series of books on the visual display of information.
Although much of his advice applies best to paper, I've found large documents
and rapid scrolling to be an acceptable substitute. I have been particularly
inspired by his insistence that single displays can provide information
at multiple levels of scale. When we use programming tools to inspect large
programs we are too quickly drawn to the one instruction at a time mode
used by both the author and the computer. Custom browsing tools as described
in this paper can escape from this microscopic view and offer instead many
simultaneous levels of perspective.