Data Analysis

Selecting a random element from an array of length n is easy: simply generate a random integer i, with 0 <= i < n, and use the array element at that index position. But what if the length of the array is not known beforehand, or is, in fact, infinite (i.e. a stream)? And what if we don’t just want a single element, but a set of m samples, without replacement?
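One standard answer to these questions is reservoir sampling. The sketch below is a minimal version of the textbook "Algorithm R", not necessarily the approach the post itself takes. The function name `reservoir_sample` and its parameters are made up for illustration. It keeps the first m items, then lets each later item replace a randomly chosen slot with steadily decreasing probability.

```python
import random


def reservoir_sample(stream, m, rng=random):
    """Draw up to m items, without replacement, from an iterable of unknown length.

    Hypothetical helper for illustration; implements the classic Algorithm R.
    """
    reservoir = []
    for k, item in enumerate(stream):
        if k < m:
            # Fill the reservoir with the first m items.
            reservoir.append(item)
        else:
            # Pick a slot uniformly from 0..k (inclusive); the new item
            # survives only if that slot falls inside the reservoir.
            j = rng.randint(0, k)
            if j < m:
                reservoir[j] = item
    return reservoir
```

Because the function only iterates once and never asks for the length, it can be called on a generator or a file handle; for a truly infinite stream, the loop would have to be cut off by the caller, at which point the reservoir holds a uniform sample of everything seen so far.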

When writing programs that do computations, my overwhelming preference is to simply write results to standard output, and to use shell redirection to capture the output in a file. In this way, I am leveraging the shell’s full functionality, in particular filename completion, in the most convenient way possible. For the file format itself, I prefer simple, column-oriented, delimiter-separated flat files. They are completely portable, and can be read and understood by most tools. (They also play well with the usual Unix toolset.)

But this simple approach breaks down once a program has to write more than one output stream. In a simulation run, for example, I may want to capture periodic snapshots of the simulation itself and also track various calculated metrics. These two streams will not fit comfortably into a single flat file. One option is to use a structured file format; the other is to write to multiple files simultaneously.
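As a rough illustration of the second option, the sketch below passes each output stream to the program as a separate file-like object and writes tab-separated rows to both. The toy update rule, the function name `run_simulation`, and its parameters are all assumptions made for this example.

```python
import sys


def run_simulation(steps, snapshot_every, metrics_out, snapshots_out):
    """Toy simulation writing two tab-separated streams (hypothetical example)."""
    x = 0.0
    for t in range(steps):
        x = 0.5 * x + 1.0  # stand-in for the real simulation step
        # Stream 1: one metrics row per time step.
        print(t, x, sep="\t", file=metrics_out)
        # Stream 2: a periodic snapshot of the state.
        if t % snapshot_every == 0:
            print(t, x, sep="\t", file=snapshots_out)
```

With this shape, the metrics can still go to standard output and be captured by shell redirection, while only the snapshots need an explicitly opened file, for instance `run_simulation(100, 10, sys.stdout, open("snapshots.tsv", "w"))` with a file name of your choosing.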

I have become interested in Hidden Markov Models (HMMs). As a warm-up, I prepared a pure Python implementation of the relevant algorithms (github).

I occasionally see references to the HDF5 file format, but I have never encountered it in the wild. A recent project, however, generated multiple data sets simultaneously, in addition to metadata. Was there a better way than maintaining a collection of flat files? This prompted me to look at HDF5.

If you are in a hurry to learn D3.js, the leading JavaScript library for web-based graphics and visualization, this book is for you. Written for technically savvy readers with a background in programming or data science, the book moves quickly, emphasizing unifying concepts and patterns. Anticipating common difficulties, the book teaches you how to apply D3 to your own problems.

Gnuplot in Action, 2nd Edition, is the authoritative guide to the gnuplot graphics and visualization program for developers, engineers, and scientists.

With this insightful book, intermediate to experienced programmers interested in data analysis will learn techniques for working with data to discover what it contains, how to capture those ideas in conceptual models, and how to feed that understanding back into the organization through business plans, metrics dashboards, and other applications.

Gnuplot in Action is a comprehensive tutorial written for all gnuplot users: data analysts, computer professionals, scientists, researchers, and others. It shows how to apply gnuplot to data analysis problems. It gets into tricky and poorly documented areas.

Sampling from a Stream

How Best to Capture Output from Scientific Calculations?

Hidden Markov Models

A Look at the HDF5 Format

D3 for the Impatient

Gnuplot in Action, 2nd Edition

Data Analysis with Open Source Tools

Gnuplot in Action