dimecres, 3 de març de 2010

Analysis: Markov clustering and the case of the unnatural clusters

Aquest es potser el post mes tecnic que mai havia escrit, be de fet, com veureu, no l'escrit jo, pero hi he treballat per obtenir els resultats que despres han portat a escriure aquesta historia. Tot va començar quan vam intentar formar grups de proteines basant-nos en les interaccions que hi ha entre aquestes. Les proteines eren les proteines humanes i la base de dades emprada era STRING, on hi ha recollides les interaccions conegudes i predites entre proteines.

Per aquest objectiu necessites un fitxer amb les interaccions, del tipus ProtA ProtB 1 (la interaccio pot tenir diferents rangs, per exemple entre 0 i 1). Evidenment, en aquest exemple, com mes proper a 1 mes probabilitat que les proteines A i B interaccionin. En canvi, com mes proper a 0 la interaccio es mes debil. El segon que necesssites es un algoritme que et permeti formar grups de proteines segons el valor de la interaccio que tenen entre elles. Teoricament, el que t'esperes es que nomes les proteines que tenen interaccio acabin en el mateix cluster (grup de proteines). Pero podria passar que proteines acabessin en el mateix cluster sense tenir cap interaccio directa?

Us copio aqui sota el relat de la historia extret del blog d'en Lars Juhl Jensen, qui es avui en dia el meu supervisor de postdoc. El blog s'anomena Buried treasure.  En aquest mateix blog podeu seguir els dos seguents capitols d'aquesta historia. Hi trobareu els links al final d'aquest post i properament us copiare el text aqui tambe, mes que res per tenir jo un record personal. Espero que disfruteu amb la historia: ;)

----------------------------

The MCL (Markov CLustering) algorithm was invented/discovered by Stijn van Dongen and was published in 2000. It has since become highly popular in bioinformatics and has proven to perform well on a variety of different problems.
It was also the method of choice when my postdoc Albert Palleja needed to cluster the human interaction network from the STRING database. However, we got strange results. More specifically, we observed that some clusters contained proteins that had no interactions with any other proteins within the same cluster. I call these unnatural clusters; this should be seen as a contrast to natural clusters, which are characterized by the presence of many edges between the members of a cluster.
After we had spent a week unsuccessfully trying to find out what we were doing wrong, I finally asked myself if it could be that we were not doing anything wrong. Might it be that applying the MCL algorithm to a protein interaction network can result in clusters of non-interacting proteins?
To test this, I constructed the following toy network consisting of only 10 nodes and 12 edges:


Assigning a weight of 1 to all edges and running this network through MCL using an inflation factor (the key parameter in the MCL algorithm) between 1.734 and 3.418 yields five clusters. In the figure below, the nodes are colored according to which cluster they belong to:


Note the black cluster which consists of two proteins, X and Y, despite the two nodes only being connected via nodes that are not part of the same cluster. This example clearly shows that the MCL algorithm is indeed capable of producing unnatural clusters containing nodes with no direct edges to any other members in the cluster.
In my view this is not as such a error in the the MCL algorithm. The algorithm is based on simulation of flow in the graph. The nodes X and Y are clustered due to the strong flow between them via nodes A, C, E, and G. However, I think it is fair to say that this behavior will catch many users by surprise and that it can give rise to misleading results when applying MCL to certain types of networks.
Edit: I suspect that this is the same issue that was reported on the Mcl-users mailing list by Sungwon Jung. Using the --force-connected=y option prevents the undesirable clustering of X and Y.

Analysis: Markov clustering and the case of the nonhomologous orthologs

Analysis: Markov clustering and the case of the unsupported protein complexes

Source: 

Buried treasure