Tuesday, October 23, 2007

Generalized Covariance Models

It appears that the richest and most complex models people have developed in bioinformatics, structural biology, and systems biology all revolve around one very powerful and general phenomenon, both qualitative and quantitative: the observed correlation structure of the data. In my own research, the observed correlation structure is what dictates whether or not you can reject the null hypothesis for a structural relationship between endogenous variables. The same holds in evolutionary biology and quantitative genetics. These fields have powerful equations that attempt to be dynamically descriptive and sufficient, and they revolve around patterns of covariation observed between multiple endogenous variables, e.g. phenotypes. Within patterns of covariation you can capture stochastic as well as deterministic connections between modules in a complex system.
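To make the "reject the null hypothesis for a structural relationship" idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the three-variable chain x -> y -> z, the coefficient 0.8, and the helper names `pearson`, `partial_corr`, and `simulate_chain` are all made up. The point is only that a chain structure implies x and z are correlated marginally, but that the correlation should vanish once you control for y, and that prediction can be checked against the observed correlations.

```python
import math
import random


def pearson(a, b):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def partial_corr(a, b, given):
    """Correlation between a and b after linearly controlling for `given`."""
    r_ab = pearson(a, b)
    r_ag = pearson(a, given)
    r_bg = pearson(b, given)
    return (r_ab - r_ag * r_bg) / math.sqrt((1 - r_ag ** 2) * (1 - r_bg ** 2))


def simulate_chain(n, seed=0):
    """Hypothetical chain x -> y -> z; the 0.8 coefficients are made up."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [0.8 * xi + rng.gauss(0, 1) for xi in x]
    z = [0.8 * yi + rng.gauss(0, 1) for yi in y]
    return x, y, z


x, y, z = simulate_chain(5000)
r_xz = pearson(x, z)                  # marginal correlation, clearly nonzero
r_xz_given_y = partial_corr(x, z, y)  # should be close to zero under the chain
print(f"corr(x, z)     = {r_xz:.3f}")
print(f"corr(x, z | y) = {r_xz_given_y:.3f}")
```

If the partial correlation came out far from zero, the chain structure would be a poor description of the data. A real analysis would attach a proper test statistic to that comparison rather than eyeballing it.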

There are two major questions one can ask of a correlation structure. The first, and most difficult, is: what structure do the observed patterns of correlation suggest? This is hard because, under an assumed probabilistic model, the number of possible structural relationships between endogenous variables can be an enormous combinatorial function of the number of variables in the system. Devising clever computational tricks to search this vast space of models is very tricky, and the search problem is often NP-hard. A much easier question (though still very difficult) is: given some known structure (tree, directed acyclic graph, undirected graph...), what are the parameters describing the probabilistic or structural relationships between variables in the system?
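To get a feel for how quickly the space of candidate structures grows, here is a small sketch that counts the directed acyclic graphs on n labeled variables using Robinson's recurrence. The function name `count_dags` is my own, and DAGs are just one of the structure classes mentioned above, picked because the count is easy to compute exactly.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n):
    """Number of DAGs on n labeled nodes, via Robinson's recurrence.

    a(0) = 1
    a(n) = sum_{k=1..n} (-1)^(k+1) * C(n, k) * 2^(k*(n-k)) * a(n-k)
    """
    if n == 0:
        return 1
    total = 0
    for k in range(1, n + 1):
        sign = 1 if k % 2 == 1 else -1
        total += sign * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
    return total


# Print the number of candidate DAG structures for small systems.
for n in range(1, 9):
    print(n, count_dags(n))
```

Three variables already admit 25 DAGs and five variables admit 29,281, so exhaustive search stops being an option almost immediately.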

Yet I feel like the correlation structure is just a first pass at deeper patterns present in the data. There must exist higher-order correlations between modules of variables. Pairwise correlation can only get you so far. In the field of data mining there are many dimensionality reduction techniques that let you search through high-dimensional multivariate data and try to find global patterns that can be mapped down to low-dimensional manifolds. These include principal component analysis, its kernel-based variants, and many others.
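A toy example shows how pairwise correlation can miss structure entirely. The setup is hypothetical: x and y are independent fair coins taking values -1 or +1, and z is their product. Every pair of variables is uncorrelated, yet z is completely determined by x and y, and that dependence only shows up in the third-order moment. The sketch enumerates the four equally likely states exactly instead of sampling.

```python
import itertools

# Hypothetical toy system: x, y independent fair +/-1 coins, z = x * y.
states = [(x, y, x * y) for x, y in itertools.product((-1, 1), repeat=2)]


def expectation(f):
    """Exact expectation of f(x, y, z) over the four equally likely states."""
    return sum(f(*s) for s in states) / len(states)


# All three variables have mean 0 and variance 1, so the expected product of
# a pair is exactly its correlation.
means = [expectation(lambda x, y, z, i=i: (x, y, z)[i]) for i in range(3)]
pairwise = {
    "xy": expectation(lambda x, y, z: x * y),
    "xz": expectation(lambda x, y, z: x * z),
    "yz": expectation(lambda x, y, z: y * z),
}
third_order = expectation(lambda x, y, z: x * y * z)

print("means:", means)
print("pairwise correlations:", pairwise)
print("third-order moment E[xyz]:", third_order)
```

A model fit only to the pairwise correlation matrix would treat these three variables as unrelated, which is the sense in which pairwise correlation only gets you so far.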

This is what I was driving at with all the gibberish about h-grammar on m-languages: being able to accurately model and encapsulate not just the pairwise correlation between elements of a system, but also the higher-order correlations between modules of the system, and the implied hierarchy therein. The most difficult aspect of such systems is designing experiments or collecting data that would let you start asking questions at that level. Most experiments have barely enough power to detect pairwise interactions between variables, much less complicated higher-order interactions between entire suites of variables. If one has an intimate knowledge of the underlying biology, then one can impose deterministic and stochastic constraints on variables that can lead to meaningful results about higher-order patterns of covariation, but even this is subtle and not straightforward.
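For a rough sense of the power problem, here is a sketch of the standard Fisher z approximation for the sample size needed to detect a single pairwise correlation of size r with a two-sided test. The function name `required_n` and the defaults (alpha of 0.05, power of 0.8) are my own choices for illustration, and the approximation assumes roughly bivariate normal data. It says nothing directly about higher-order interactions, which would generally need more data still.

```python
import math
from statistics import NormalDist


def required_n(r, alpha=0.05, power=0.8):
    """Approximate sample size to detect a correlation of magnitude r.

    Uses the Fisher z-transform: atanh(r_hat) is roughly normal with
    standard error 1 / sqrt(n - 3), tested two-sided at level alpha.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(((z_alpha + z_beta) / math.atanh(abs(r))) ** 2 + 3)


# Weak correlations need far more samples than strong ones.
for r in (0.1, 0.2, 0.3, 0.5):
    print(f"r = {r}: n is about {required_n(r)}")
```

The required sample size grows roughly like 1 / r^2 for small r, so weak pairwise effects alone can demand hundreds of samples.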