Wednesday, 10 March 2010

Markov clustering and the case of the nonhomologous orthologs

Here is the second chapter on how the MCL clustering algorithm behaves in certain contexts. In this case, we analyzed how the method behaves when forming clusters of protein families grouped by sequence similarity.

You can also follow this story on the blog Buried treasure, where you will find the original posts.


In the previous blog post I described how the MCL algorithm can sometimes produce unnatural clusters with disconnected parts. The C implementation of MCL has an option to suppress this behavior (--force-connected=y), but I suspect that it is rarely used. I have thus taken a closer look at some notable applications of MCL in bioinformatics to see if unnatural clusters arise in real data sets.
Here I will focus on OrthoMCL-DB, which is a database of orthologous groups of protein sequences. These were constructed by applying the MCL algorithm to the normalized results of an all-against-all BLAST search of the protein sequences.

To check the connectivity of the resulting orthologous groups, I downloaded OrthoMCL version 4 including the 13+ GB of gzipped BLAST results that formed the basis for the MCL clustering. I wish to thank the OrthoMCL-DB team for being very helpful and making this large data set available to me.

A few Perl scripts and CPU hours later, Albert Palleja and I had extracted the BLAST network for each of the 116,536 orthologous groups and performed single-linkage clustering to check if any of them contained disconnected parts. We found that this was the case for the following 28 orthologous groups:

Orthologous group Protein
OG4_10123 tcru|Tc00.1047053448329.10
OG4_10133 cmer|CMS291C
OG4_11608 bmor|BGIBMGA011561
OG4_13082 lbic|eu2.Lbscf0004g03370
OG4_17434 cint|ENSCINP00000028818
OG4_20715 mbre|fgenesh2_pg.scaffold_4000474
OG4_20953 tpal|NP_218832
OG4_21182 tvag|TVAG_333570
OG4_24433 tmar|NP_229533
OG4_29163 tcru|Tc00.1047053508221.76
OG4_32884 gzea|FGST_11535
OG4_36484 cbri|WBGene00088730
OG4_39391 ddis|DDB_G0279421
OG4_43780 cpar|cgd3_1080
OG4_44179 atha|NP_177880
OG4_44684 bmal|YP_104794
OG4_45409 rcom|29647.m002000
OG4_50671 pram|C_scaffold_62000023
OG4_50712 bpse|YP_331887.1
OG4_52326 bmaa|14961.m05365
OG4_52455 bmal|YP_338428
OG4_55725 apis|XP_001952076
OG4_57272 bbov|XP_001610684.1
OG4_58797 hwal|YP_659316
OG4_61264 crei|122343
OG4_68577 bmor|BGIBMGA000864
OG4_71107 cbur|NP_819756
OG4_84041 tcru|Tc00.1047053479883.10

For convenience, the orthologous groups are linked to the corresponding web pages in OrthoMCL-DB, which enable viewing of Pfam domain architectures and multiple sequence alignments. Cursory inspection suggests that the majority of the sequences listed in the table do not belong to the orthologous groups in question.

Of the 28 orthologous groups, 24 groups contain a single protein with no BLAST hits to other group members, 2 groups each contain 2 such singletons, and the remaining 2 groups each contain 2 proteins that show weak similarity to each other but not to any other group members. The latter proteins are highlighted in red.
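
The connectivity check described above amounts to finding the connected components of each group's BLAST network. The original analysis used Perl scripts that are not shown here, so what follows is only a minimal Python sketch of the same idea. It assumes that any BLAST hit between two group members counts as a link (no score or E-value cutoff), and the function names and the protein IDs in the usage note are hypothetical.

```python
from collections import defaultdict


def connected_components(members, hits):
    """Split one orthologous group into its connected parts.

    members: iterable of protein IDs belonging to the group.
    hits:    iterable of (query, subject) BLAST hit pairs. Pairs that
             involve a protein outside the group are ignored.
             Assumption: every hit is treated as a link, with no cutoff.
    """
    # Union-find: each protein starts as its own component.
    parent = {m: m for m in members}

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hits:
        if a in parent and b in parent:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_a] = root_b  # merge the two components

    parts = defaultdict(set)
    for m in parent:
        parts[find(m)].add(m)
    # Largest component first, so stray singletons end up at the end.
    return sorted(parts.values(), key=len, reverse=True)


def is_disconnected(members, hits):
    """True if the group falls apart into more than one component."""
    return len(connected_components(members, hits)) > 1
```

For a hypothetical group with members p1 to p4 and hits (p1, p2) and (p2, p3), the function returns two components, {p1, p2, p3} and {p4}, which corresponds to the "single protein with no BLAST hits to other group members" case. Linking on any hit is the most permissive choice, so a group that is disconnected under this rule stays disconnected under any stricter similarity cutoff.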

In summary, this analysis shows that the unnatural clustering by MCL reported for a toy example in the previous post also affects the results of real-world bioinformatics applications of the algorithm.


Buried treasure