A Neural Network to Simulate NCBI Blast

A Project for CECS 477: Neural Networks

 

By Jim Ries

4/30/2000

 

 

Jim Ries

Student #423566

JimR@acm.org

4/30/2000

 

Introduction

The National Center for Biotechnology Information (NCBI; http://www.ncbi.nlm.nih.gov/) has created a family of tools to compare genetic or protein "sequences". This family of tools is called Basic Local Alignment Search Tool

(BLAST) and is publicly available from NCBI at http://www.ncbi.nlm.nih.gov/BLAST/. These tools are widely used by biologists and genomic researchers to compare the "distance" or "similarity" of one genetic sequence to another.

Genetic sequences are strings of characters from the alphabet {A,G,C,T}. Each sequence may be many hundreds of characters (or "bases") long. In order to compare these sequences, BLAST employs a "scoring matrix" which is used to define the distance that a given transform will cost. For example, when comparing the strings "AGCTA" and "GGCTC", BLAST may align the middle three ("GCT") characters of the two strings. BLAST will apply a dissimilarity cost to the characters that do not match, and arrive at a score indicating how similar the two strings are.

This project seeks to approximate the function of BLAST by use of a neural network. The hope is that a neural network may be able to provide a good approximation to the BLAST algorithm, but take less processing time. In addition, a future version of the developed neural network may be able to take advantage of inherent parallelism to better scale on multi-processor platforms.

 

Methodology

I selected a set of 96 reasonably "good" sequences taken from a sequencing plate from the Monsanto Swine Genome project being conducted here at the University of Missouri. The sequences are mostly "good" in that they meet acceptance criteria used by the genomic researchers in Animal Science and the DNA Core Facility to determine whether a sequence is worthwhile to analyze.

I ran NCBI's BLAST utility for comparing two sequences (bl2seq) against all of the possible combinations of sequences, and saved the results. That is, I ran #1 against #1, #1 against #2, …, #n against #1, …, #n against #n. This yielded 9,216 results (96 squared). The result files were named by concatenating the two sequence file names together and appending ".out".

Since the sequence files are encoded as strings of letters representing the possible protein bases ("A", "G", "C", and "T"), a conversion was required. I wrote a program called NNConvert to read all of the sequence files and their corresponding results, encode the information, and pull everything together in one file. I then used this summary file as input to train (and later test) my neural network. It was necessary to encode the sequences as floating point numbers for input into the neural network. I did this via the following encoding scheme:

Original Code

Meaning

Output Code

A

A type base

0.1

G

G type base

0.2

C

C type base

0.3

T

T type base

0.4

N

Unknown or sequence misread (sequencing error)

0.5

All Else

Unexpected data in the input file. This did not actually come up in the conversion process.

0.9

After completing the conversion, I was ready to read the data into a neural network for training. I selected a simple three-layer back-propagation neural network, and implemented it using the freely available libneural C++ library. The library utilizes a sigmoid activation function at each node, but allows one to tune parameters such as the learning rate and the error tolerance (convergence point).

I tried training the network through a variety of learning rates, error targets, and hidden layer node counts. I also tried using portions of my large data set for training and portions for testing versus utilizing the entire database for training. I ran training tests on both Windows NT 4.0 and on a Compaq True 64 Unix machine.

 

Results and Discussion

Accuracy

In the end, the best network was trained using the entire data set on the Compaq True 64 Unix architecture with 50 hidden units, a 0.25 learning rate, and a convergence of mean squared error set to 5 * 10^(-07). I trained using the entire data set over 10 full iterations. That is, I repeatedly trained with all of the data. This net produced an average of about 26% variance from the scores that the real BLAST program generated. A portion of the results for this test run showing some of the tested sequences are available in Appendix D. Both the training data and the test data are quite voluminous, and so are not reproduced in their entirety here. However, they are available electronically upon request.

For a training set of this size, and a manageably sized neural network, a 26% variance from the real BLAST analysis seems acceptable. A brief scan of the entire result shows that there are relatively few cases where the network's score is off more than 100% from the BLAST score. In general, biologists look to BLAST for relative similarity of sequences. Since our neural network approach was never variant from BLAST by an order of magnitude, the results would likely be useful to a biologist. In fact, biologists seem to use BLAST scores as a starting point to filter out sequences that are not similar to those being analyzed. For gross tasks such as this, the BLAST neural network simulation may provide a reasonable alternative.

Speed

I compared running the BLAST utility bl2seq on each of the combinations of sequences studied with the performance of the neural network. I ran bl2seq on a Windows NT 4.0 machine using a batch file to generate all of the combinations. This test took 31 minutes and 40 seconds. For comparison, the neural network took 1 minute 18 seconds to perform essentially the same task. This is an improvement of 24 fold!

It must be noted, however, that the bl2seq utility provides additional information and analysis that cannot be gleaned from the neural network. BLAST's bl2seq provides an "e-score" which was not trained into our neural network, and a character representation of the best alignment, which was used to generate the given score. In addition, our simulation required bl2seq to open 9,216 individual files whereas the neural network's data was contained all in one file (which had been created by the NNConvert utility). This clearly added some possibly significant overhead to the comparison.

Due to the rough nature of this comparison, it is not conclusively certain that our neural network performs the basic task faster than BLAST. However, I do suspect this is the case. Without modifying the bl2seq source code to put it on a more even footing with our neural network, an exact comparison is not possible.

 

Future Directions

Clearly, there are a number of areas in this project, which leave room for future work. Initially I had hoped to compare the speed of the trained neural network to the BLAST tools on both a single processor machine and a multi-processor. The BLAST tools can be compiled to utilizes threads on a multi-processor machine, so this would be a good comparison of the "engineered parallelism" of BLAST to the "inherent parallelism" of a neural network. Unfortunately, the libneural software used as a basis for this neural network does not support multiple threads of execution, MPI, or any parallel mechanism. It seems that this would be a straightforward extension of the library, and would thus be an excellent next step in the project.

If time had permitted, it would have been very instructive to try much larger hidden layers in the training experiments. I sampled the current state of the neural network at various stages of the training, and its accuracy (compared to the entire data set) oscillated quite severely. That is, a network trained at one stage might produce a variance over the entire data set averaging 40%, while the same network at a later stage of training might produce a worse variance of 50%! However, the final network produced after 10 iterations of training with the entire data set did produce a reasonably good result. Still, this implies that the network may be experiencing interference from one training instance to another. This might be mitigated by a larger hidden layer, thus allowing a more complex regionalization of the network.

Another possible way of dealing with this interference would be to utilize an ensemble approach as discussed in class. This would require setting up multiple neural networks. The data could be clustered before training to determine which regions are candidates for their own networks. Alternatively, one could make use of another network that was designed to control the clustering for the smaller networks. Either of these approaches would add significant complexity to the project, and are out of the scope of a single semester project.

It makes intuitive sense that this problem would require a very large and complex network for accurate representation. Consider that BLAST may shift sequences around, insert gaps, etc. in order to find a good alignment for comparison of sequences. In fact, a complete analysis of best alignment of two sequences is an NP complete problem and covers a combinatorial explosion of possible best alignments. In order for a neural network to represent even a small but heuristically useful subset of this large search space, it may require a very large number of nodes.

 

Conclusion

The score function of the BLAST family of genomic sequence alignment and scoring tools can be grossly approximated by a back-propagation neural network. The resultant neural network may provide a processing speed advantage over the BLAST tool, but may suffer somewhat in comparison to the accuracy of BLAST. Further study is necessary to determine whether a neural network with additional hidden units or structural complexity could be used to more closely approximate BLAST. However, closer approximation may also limit the speed performance advantages enjoyed by the neural network approach.

 

Appendix A - Sample Sequence File

The following is a sample sequence file in a format appropriate for use with BLAST.

 

Appendix B - Sample BLAST Result File

The following is a file output from a bl2seq BLAST analysis.

 

Appendix C - A Portion of the Encoded Sequence/Result File

The following file shows a portion of an encoded sequence/result file that was generated by my own NNConvert program. The complete file is extremely large and not reproduced here. However, it is available electronically upon request by emailing JimR@acm.org.

 

Appendix D - Neural Network Results Compared to BLAST

The following is clipped from a full run of results compared to BLAST. The last line shows the average variance.

Test8964 should be 0.00897 but is 0.0160594. 79.035% difference.

Test8965 should be 0.0155 but is 0.0234338. 51.1858% difference.

Test8966 should be 0.0159 but is 0.0209292. 31.6301% difference.

Test8967 should be 0.0159 but is 0.0230557. 45.0042% difference.

Test8968 should be 0.0163 but is 0.0246003. 50.9219% difference.

Test8969 should be 0.0159 but is 0.0221933. 39.5808% difference.

Test8970 should be 0.0121 but is 0.0175644. 45.1605% difference.

Test8971 should be 0.0149 but is 0.0195024. 30.8889% difference.

Test8972 should be 0.0163 but is 0.0220672. 35.3814% difference.

Test8973 should be 0.0159 but is 0.0200938. 26.3758% difference.

Test8974 should be 0.0159 but is 0.0206077. 29.6082% difference.

Test8975 should be 0.0159 but is 0.0247638. 55.7473% difference.

Test8976 should be 0.0155 but is 0.0184595. 19.0934% difference.

Test8977 should be 0.0159 but is 0.0255887. 60.9349% difference.

Test8978 should be 0.0155 but is 0.0157164. 1.39631% difference.

Test8979 should be 0.0155 but is 0.0192265. 24.0418% difference.

Test8980 should be 0.0155 but is 0.0204756. 32.1007% difference.

Test8981 should be 0.0155 but is 0.0231429. 49.3091% difference.

Test8982 should be 0.0159 but is 0.0205096. 28.9913% difference.

Test8983 should be 0.0155 but is 0.0179287. 15.6687% difference.

Test8984 should be 0.0172 but is 0.0163977. 4.66466% difference.

Test8985 should be 0.0149 but is 0.0195166. 30.984% difference.

Test8986 should be 0.0151 but is 0.0195769. 29.6483% difference.

Test8987 should be 0.0149 but is 0.0206737. 38.7499% difference.

Test8988 should be 0.0159 but is 0.0208911. 31.3903% difference.

Test8989 should be 0.0159 but is 0.0230177. 44.7657% difference.

Test8990 should be 0.0159 but is 0.0220626. 38.7584% difference.

Test8991 should be 0.0155 but is 0.016894. 8.99345% difference.

Test8992 should be 0.0155 but is 0.0202786. 30.8297% difference.

Test8993 should be 0.0151 but is 0.019656. 30.1722% difference.

Test8994 should be 0.0159 but is 0.017163. 7.94352% difference.

Test8995 should be 0.0159 but is 0.0230527. 44.9853% difference.

Test8996 should be 0.0159 but is 0.0258489. 62.5719% difference.

Test8997 should be 0.0159 but is 0.022056. 38.7172% difference.

Test8998 should be 0.0155 but is 0.0184634. 19.119% difference.

Test8999 should be 0.0155 but is 0.0176856. 14.1007% difference.

Average difference is 25.8304%.

 

Appendix E - Source Code

The various source code for programs used in this project is included on the pages which follow.