How Best to Capture Output from Scientific Calculations?

When writing programs that do computations, my overwhelming preference is to simply write results to standard output, and to use shell redirection to capture the output in a file. In this way, I am leveraging the shell’s full functionality, in particular filename completion, in the most convenient way possible. For the file format itself, I prefer simple, column-oriented, delimiter-separated flat files. They are completely portable, and can be read and understood by most tools. (They also play well with the usual Unix toolset.)
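
As a minimal sketch of this style, a Go program might write one tab-separated row per step to standard output. The column layout and the formatRow helper are assumptions made for illustration; they are not taken from any particular program.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// formatRow joins a step number and its values into one tab-separated line.
// The layout (step first, then the values) is an assumption for illustration.
func formatRow(step int, values ...float64) string {
	fields := make([]string, 0, len(values)+1)
	fields = append(fields, fmt.Sprintf("%d", step))
	for _, v := range values {
		fields = append(fields, fmt.Sprintf("%g", v))
	}
	return strings.Join(fields, "\t")
}

func main() {
	// Buffer standard output and flush once at the end.
	out := bufio.NewWriter(os.Stdout)
	defer out.Flush()

	// A comment line naming the columns, then one row per step.
	fmt.Fprintln(out, "# step\tx\tx_squared")
	for step := 0; step < 5; step++ {
		x := 0.5 * float64(step)
		fmt.Fprintln(out, formatRow(step, x, x*x))
	}
}
```

Invoked as, say, ./prog > results.dat (with prog standing in for the compiled program), the shell takes care of naming and placing the file.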

But this simple approach breaks down once a program has to write more than one output stream. In a simulation run, for example, I may want to capture periodic snapshots of the simulation itself, and also track various calculated metrics. These two streams will not fit comfortably into a single flat file. One option is to use a structured file format; the other is to write to multiple files simultaneously.

In general, I stay away from structured files and prefer multiple flat files over a single structured file. The primary reason is that flat files are easy to read, and, invariably, the results of scientific calculations will have to be read by another program, either for further analysis, for plotting, or as input for further calculations.

(There is also no obvious structured file format to converge on. The case of HDF5 serves more as a warning than as an invitation. JSON can be a good solution for configuration data and similar applications, but it is verbose and not very suitable for large amounts of strictly numerical data. Bespoke file formats are almost always a bad idea. Because it provides transactional integrity, writing results to an SQLite database can be interesting when it is essential that no data point is lost, even in the case of system failure, but this is somewhat of a special case, and not something I’d generally do.)

But with multiple files, simple shell redirection will no longer work, which means that our program now needs to open and manage a set of output files… and all that that entails. The problem is not the opening of the files and writing to them: that’s easy. The problem is that now our calculation program needs to provide some form of “user interface”, where the user can enter a filename, and do so conveniently. In other words, we have to replicate some of the functionality that is usually provided by the shell.

Towards a Solution

A basic convention that I have found useful uses file extensions to indicate the type of content. For instance, I may let the names of files containing simulation snapshots end in .snp, whereas all files with analysis data end in .stt. Since this information tends to be specific to the overall computation, and not be subject to change from one run of the program to the next, it can be hardcoded into the program itself.

What needs to be supplied, then, at each run is the base filename (before the extension), possibly including a destination directory.

A few principles:

  • I take it as a given that the program will take whatever input it needs from the command line.
  • Next, the input required from the user should be minimal. During development in particular, a program is often run and re-run many times. Entering complicated arguments repeatedly is a nuisance.
  • Finally, I find that most arguments to a typical bespoke computational program tend to be mandatory, not optional. Hence, command line arguments can be strictly positional, without the use of Unix-style “flags”.
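
A strictly positional interface of this kind might look roughly as follows. The three parameters (number of steps, temperature, output name) and the names runConfig and parseArgs are made up for the example; a real program would have its own list.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// runConfig holds the positional arguments of a hypothetical simulation.
type runConfig struct {
	steps   int
	temp    float64
	outfile string
}

// parseArgs expects exactly three arguments, in fixed order:
// steps, temperature, output file specification.
func parseArgs(args []string) (runConfig, error) {
	var cfg runConfig
	if len(args) != 3 {
		return cfg, fmt.Errorf("expected 3 arguments (steps temp outfile), got %d", len(args))
	}
	steps, err := strconv.Atoi(args[0])
	if err != nil {
		return cfg, fmt.Errorf("steps: %w", err)
	}
	temp, err := strconv.ParseFloat(args[1], 64)
	if err != nil {
		return cfg, fmt.Errorf("temp: %w", err)
	}
	return runConfig{steps: steps, temp: temp, outfile: args[2]}, nil
}

func main() {
	args := os.Args[1:]
	if len(args) == 0 {
		// Demo values, so that the sketch also runs without arguments.
		args = []string{"1000", "0.75", "run+"}
	}
	cfg, err := parseArgs(args)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("steps=%d temp=%g outfile=%s\n", cfg.steps, cfg.temp, cfg.outfile)
}
```

Because every argument is mandatory, a simple count check is enough to catch most invocation mistakes.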

Without the help of shell-level file and path name completion, here is what the program should provide:

  • Protection against clobbering an existing filename: fail if the desired filename already exists.
  • It should be possible to override this, and force the existing file to be overwritten.
  • Alternatively, instead of overwriting the existing file, the program should be able to modify the name, for example by appending or inserting a number (e.g. file.1.snp, file.2.snp, etc).
  • It can be useful to create a filename from all the input parameters. Such filenames will be long and cumbersome, but informative. They are therefore appropriate for final “production” runs, but tend to get in the way during development.
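
One way to get the first of these guarantees without a separate existence check is to let the operating system refuse the creation. The sketch below is not the approach taken in the implementation further down, which looks for existing files by pattern; it only illustrates the O_EXCL flag of os.OpenFile. The helper names createNoClobber and demo are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// createNoClobber creates path for writing, but fails if it already exists.
// With O_EXCL, the existence check and the creation are a single step.
func createNoClobber(path string) (*os.File, error) {
	return os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0o644)
}

// demo creates a file in a temporary directory, then tries to create it
// again. It reports whether the first attempt succeeded and whether the
// second one was rejected because the file already existed.
func demo() (created, rejected bool) {
	dir, err := os.MkdirTemp("", "noclobber")
	if err != nil {
		return false, false
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "run.snp")
	f, err := createNoClobber(path)
	if err != nil {
		return false, false
	}
	f.Close()

	_, err = createNoClobber(path)
	return true, errors.Is(err, os.ErrExist)
}

func main() {
	created, rejected := demo()
	fmt.Println("first attempt succeeded:", created)
	fmt.Println("second attempt rejected:", rejected)
}
```

Forcing an overwrite then amounts to calling os.Create instead, which truncates an existing file.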

Taking all these features together, I arrived at the following requirements for the user experience. On the command line, the user specifies:

  • either the base filename, or the character @, in which case the program will create a filename from the parameters of the run,
  • optionally followed, without whitespace, by either the ! character, indicating that clobbering an existing file is acceptable, or the + character, in which case the program will insert a number in case of conflict. Without either ! or +, the program will fail with an error message in case of conflict.
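
What the filename built for @ looks like is up to the program. As a rough sketch, assuming a hypothetical simulation with a system size, a temperature, and a random seed, it could be assembled like this (the function name fullName and the naming scheme are illustrative only):

```go
package main

import (
	"fmt"
	"strings"
)

// fullName builds a base filename from the run parameters.
// The parameter set (n, temp, seed) and the layout are hypothetical.
func fullName(n int, temp float64, seed int64) string {
	// Replace the decimal point, so that the base name contains no dots
	// that could be mistaken for the counter or the extension.
	t := strings.ReplaceAll(fmt.Sprintf("%g", temp), ".", "p")
	return fmt.Sprintf("sim_n%d_T%s_s%d", n, t, seed)
}

func main() {
	fmt.Println(fullName(1000, 0.75, 42))
}
```

A string of this kind is what would be handed to the function described below as its second argument.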

Implementation

The following function implements the functionality described above.

It takes three arguments:

  • The raw input from the command line
  • A fully formed filename, to be used when the @ flag (or the # flag) is given.
  • A slice of extensions (as strings). The function will supply a leading dot for each extension if it is not supplied.

The function returns a slice of *os.File objects, one for each extension.

package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
	"strconv"
	"strings"
	"unicode"
)

func MakeOutputFiles( rawinput, fullname string,
	extensions []string ) []*os.File {
	
	if len(rawinput) == 0 {
		log.Fatalln( "Missing outfile" )
	}

	outname, clobber, count := rawinput, false, false

	// Slice off suffix if there is one
	if len(rawinput) > 1 {
		switch rawinput[len(rawinput)-1:] {  // switch on last char in string
		case "!":
			outname = rawinput[:len(rawinput)-1] 
			clobber = true
		case "+":
			outname = rawinput[:len(rawinput)-1] 			
			count = true
		} // default: outname = rawinput
	}

	// If only a single char is left, deal with that
	if len(outname) == 1 {
		if outname == "#" || outname == "@" {
			outname = fullname
		} else if !unicode.IsLetter( rune(outname[0]) ) {
			log.Fatalln( "Single char filename must be a letter:", outname )
		}
	}

	// Now we have outname, clobber, count...

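	// Fail on conflict, unless clobber or count.
	// Match only "outname.<something>", so that unrelated files which merely
	// share the prefix (e.g. "runner.dat" for "run") do not count as conflicts.
	entries, err := filepath.Glob( outname + ".*" )
	if err != nil {
		log.Fatalln( "Bad filename pattern: ", outname )
	}
	if len(entries) > 0 && !clobber && !count {
		log.Fatalln( "Filename exists: ", outname )
	}

	// On count, find previous number. No number found: start with 1
	if count {
		prefix := filepath.Base( outname ) + "."
		max_counter := 0
		for _, e := range entries {
			name := filepath.Base( e )
			if !strings.HasPrefix( name, prefix ) {
				continue
			}
			// Look only at what follows the base name, so that dots in the
			// directory or in the base name itself do not confuse the split.
			fs := strings.Split( strings.TrimPrefix( name, prefix ), "." )
			if len(fs) == 2 {   // number part and extension both present...
				counter, err := strconv.Atoi( fs[0] )
				if err != nil {
					log.Fatalln( "Can't read file counter: ", fs[0] )
				}
				if counter > max_counter {
					max_counter = counter
				}
			}
		}
		outname = fmt.Sprintf( "%s.%03d", outname, max_counter+1 )
	}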

	// If clobber: nothing more to do, because os.Create truncates existing files.

	// Create files:
	files := make( []*os.File, 0, len(extensions) )
	for _, ext := range extensions {
		tmp := outname
		if !strings.HasPrefix( ext, "." ) {
			tmp += "."
		}
		tmp += ext
		
		f, err := os.Create( tmp )
		if err != nil {
			log.Fatalln( "Could not create file:", err )
		}
		files = append( files, f )
	}

	return files
}

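func main() {
	// Without this check, a missing argument makes os.Args[1] panic.
	if len(os.Args) < 2 {
		log.Fatalln( "Usage:", os.Args[0], "outfile[!|+]" )
	}

	files := MakeOutputFiles( os.Args[1], "fullname", []string{".dtt", ".dat"} )
	for _, f := range files {
		defer f.Close()
	}

	fmt.Fprintln( files[0], "dtt data goes here..." )
	fmt.Fprintln( files[1], "dat data goes here..." )
}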

In Conclusion

Yes, this function is easily longer than the whole functional “meat” of a simple scientific calculation or simulation.

Which is precisely why “my overwhelming preference is to simply write results to standard output, and to use shell redirection to capture the output in a file.”

Source File

makefiles.go