Pattern recognition in the case of strong background noise
This study presents the development of a method for recognizing a class of patterns in signals contaminated by strong noise. The class of signals considered is described by a finite alphabet. The target class of patterns is assumed to have specific statistical properties that can be conveniently captured by a position weight matrix (PWM) description. It is also assumed that the signals contain numerous patterns similar to the patterns of the target class, but which belong to different classes. These other patterns represent the noise in the signals. The method for improved recognition of the target class of patterns is based on clustering of the target motifs with regard to their distance from a reference point (event) in the signal. This positional clustering enables a more precise description of the target class of patterns by means of PWMs. However, it requires the use of as many PWMs as there are clusters of the target class. The method developed is general in nature and applicable to the situations described. It is, however, applied here to the recognition of specific short motifs in DNA sequences. The short motif considered is the TATA-box, one of the most important docking sites for proteins in eukaryotic polymerase II promoter regions. The reference point in the signals obtained from DNA sequences is the transcription start site (TSS). The positional clustering of the TATA-box motif resulted in 20 different PWMs, instead of only one that describes the whole TATA motif class. This, however, resulted in more discriminative PWMs, and the recognition accuracy increased by about a factor of two compared with recognition of the TATA motif based on the original PWM.
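The idea of using one PWM per positional cluster can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several details are assumptions made only for illustration: motifs are grouped by their exact offset from the reference point (the abstract does not state the clustering rule), each PWM is a log-odds matrix with a uniform background and a pseudocount of 1, and the function names (`build_pwm`, `build_positional_pwms`, `scan`) are hypothetical.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # finite alphabet of the signal (DNA here)

def build_pwm(motifs, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from equal-length motif instances.

    Assumption: uniform background and additive pseudocounts.
    """
    pwm = []
    for i in range(len(motifs[0])):
        counts = {s: pseudocount for s in ALPHABET}
        for motif in motifs:
            counts[motif[i]] += 1
        total = sum(counts.values())
        pwm.append({s: math.log2(counts[s] / total / background)
                    for s in ALPHABET})
    return pwm

def build_positional_pwms(examples, pseudocount=1.0):
    """Group (motif, offset) pairs by offset and build one PWM per group.

    Assumption: a cluster is a single exact offset from the reference point.
    """
    groups = defaultdict(list)
    for motif, offset in examples:
        groups[offset].append(motif)
    return {off: build_pwm(ms, pseudocount) for off, ms in groups.items()}

def score(pwm, window):
    """Sum the log-odds of each symbol in the window."""
    return sum(column[symbol] for column, symbol in zip(pwm, window))

def scan(sequence, ref_index, pwms):
    """Score each candidate window only with the PWM of its own offset.

    Returns (best_score, start_index), or None if no window fits.
    """
    best = None
    for offset, pwm in pwms.items():
        start = ref_index + offset
        end = start + len(pwm)
        if start < 0 or end > len(sequence):
            continue
        s = score(pwm, sequence[start:end])
        if best is None or s > best[0]:
            best = (s, start)
    return best
```

In this sketch a window is never scored against a PWM trained for a different offset, which is what makes each matrix more specific than a single pooled PWM. In practice neighbouring offsets would likely be merged into broader clusters, and a decision threshold per cluster would be needed. Neither is specified in the abstract.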