Patrice Koehl
Department of Computer Science
Genome Center
Room 4319, Genome Center, GBSF
451 East Health Sciences Drive
University of California
Davis, CA 95616
Phone: (530) 754 5121
koehl@cs.ucdavis.edu




New Algorithms for Biological Sequence Analyses

Software: SeqKernel

The C++ source code for SeqKernel is available under LGPL license.

Source code: WSeqKernel.cpp

Input substitution matrix: BLOSUM62.txt

Optimized subsitution matrix: OPTIM10.txt

To compile and run the code:

  • Compiling: g++ -o WSeqKernel.exe -O SeqKernel.cpp
    where g++ (the GNU c++ compiler) can be replaced with your c++ compiler of choice
  • Running: WseqKernel.exe -i Seq1.fa -j Seq2.fa -p SM.txt -o Output -b 0.1 -k 10 -w 1
    with:
    • Seq1.fa: file containg sequence 1, in FASTA format
    • Seq2.fa: file containg sequence 1, in FASTA format
    • SM.txt : file containg substitution matrix, in its own format (see BLOSUM62.txt for example)
    • Output : file containg distance between Seq1 and Seq2
    • 0.1 : selected value for parameter beta
    • 10 : selected value for parameter kmax
    • 1 : weighting scheme (here mean value)
    WseqKernel.exe -h will give you the list of options, with description

Overview: alignment-free method for sequence alignment

The order in which amino acids appear defines the primary sequence of a protein. Amino acids are usually labeled using a one-letter code, and sequences are correspondingly represented as strings of letters. This representation has proved very valuable, especially in the context of sequence alignment that are performed using string-matching algorithms. Those methods represent the overwhelming majority of methods used for sequence comparisons in bioinformatics. When comparing two sequences, they proceed in two steps, first the generation of the alignment between the two sequences, then the derivation of a statistical score for that alignment. They rely on a weighting scheme to measure the cost of matching pairs of amino acids. Many such weights have been proposed, from substitution matrices derived from evolution models such as the PAM matrices and the BLOSUM matrices, to matrices that capture physico-chemical properties of amino acids. Using this score, an alignment is derived following a dynamic programming algorithm, either the local method of Smith and Waterman, or the global method of Needleman and Wunsch. This alignment is then scored by summing the individual weights of the matching pairs of amino acids and adding penalties for the presence of gaps. It should be noted that this score is not a metric in sequence space. Statistical methods have been developed to assess the significance of such scores, both for gapped and non-gapped alignments. Such statistical scores are widely used for the identification of homologous sequences, or for fold recognition. It has been shown however that those scores are efficient for both tasks when sequences with high similarities can be identified, but often fail for dissimilar pairs. Extensions to pair-wise alignment methods have been proposed to alleviate those problems, such as those based on multiple sequence alignments and profiles, and those based on Hidden Markov Models. While those show improved sensitivity, they remain prone to the problems related to the construction and use of alignments.

To overcome the limitations of the alignment-based methods described above, many "alignment-free" methods have been developed over the past three decades. We have developed our own version of such an alignment free alternative that is based on the concept of string kernels. Starting from recently proposed kernels on the discrete space of protein sequences (Shen et al, Found. Comput. Math., 2013,14:951-984), we have developed our own version, SeqKernel.

SeqKernel: A string kernel for alignment-free sequence comparison

The string kernel considered here is inspired by the convolution string kernels introduced by D. Haussler (University of California, Santa Cruz; 1999. UCS-CRL-99-10.), adapted by Saigo et al (Saigo H, Vert JP, Ueda N, Akutsu T. Bioinformatics. 2004;20:1682-1689.) as the local alignment kernels, and later by Smale and co-workers (Shen et al, Found. Comput. Math., 2013,14:951-984). We provide here the key elements of its construction. Readers should refer to the original papers for a more detailed presentation, notably for the proofs of the mathematical properties that are relevant to kernels in general.

Notations Let $X$ be a finite set of size $n$. A kernel K is a symmetric function from $X\times X$ to $\mathbb{R}$ such that the Gram matrix $G$ of size $n\times n$ defined by $G(i,j)=K(x_i,x_j)$ is symmetric, positive, and definite. Let $\mathcal{A}$ be the set of the standard twenty amino acids found in proteins. A protein sequence $S$ is a string of elements from $\mathcal{A}$. We note $|S|$ the length of $S$.

A kernel for amino acid pairs

Let $SA$ be a function from $\mathcal{A}\times \mathcal{A}$ to $\mathbb{R}$, such that $SA(i,j)$ measures the similarity of the amino acids $i$ and $j$, and let $SM$ be the Gram matrix associated to $SA$. Examples of $SM$ include the matrices representing the raw data of any BLOSUM matrices, namely the raw counts of how often amino acid $i$ is substituted by amino acid $j$ in a set of selected protein sequence alignments that is then normalized by considering its row sums $P(i)$:

$$P(i) = \sum_{j=1}^{20} SM(i,j) $$

$$SM2(i,j) = \frac{SM(i,j)}{P(i)p(j)} $$

We have checked that when $SM$ is a raw count BLOSUM matrix, then $SM2$ is symmetric, positive, and definite. Note that when $SM$ is such a raw count matrix, the corresponding BLOSUM matrix $BL$ is defined as $BL(i,j) = round(log_2(SM2(i,j)))$, where $round$ is the function that rounds a real number to its nearest integer. Given a strictly positive real number $\beta$, we define the function $K_1: \mathcal{A}\times \mathcal{A} \rightarrow \mathbb{R}$ as:

$$K_1(i,j) = SM2(i,j)^{\beta}$$

$K_1$ is a kernel function on $\mathcal{A}\times \mathcal{A}$, as long as $SM2$ is symmetric, positive, definite, and $\beta$ is strictly positive.

A kernel for comparing two strings of the same length.

Let $k$ be a strictly positive integer and let $\mathcal{A}^k$ be the $k$-th Cartesian power of $\mathcal{A}$. An element of $\mathcal{A}^k$ is a string of $k$ letters taken from $\mathcal{A}$, it is usually referred to as a k-mer, or a k-gram. Let $u^k=(u_1,\ldots,u_k)$ and $v^k=(v_1,\ldots,v_k)$ be two k-mers in $\mathcal{A}^k$. The function $K_2^k$ defined by:

$$K_2^k(u^k,v^k) = \prod_{i=1}^k K_1(u_i,v_i)$$

is a kernel on $\mathcal{A}^k$, the set of all k-mers. We note that $K_2^k$ is a convolution kernel.

A kernel for computing protein sequence similarity.

Let $S=(s_1,\ldots,s_n)$ and $T=(t_1,\ldots,t_m)$ be two protein sequences; both are strings, with $S \in \mathcal{A}^n$ and $T \in \mathcal{A}^m$. Let $u^k$ and $v^k$ be substrings of length $k$ (i.e. k-mers) of $S$ and $T$ respectively. $u^k$ and $v^k$ are considered contiguous, i.e. we do not allow gaps. There are therefore $n-k+1$ and $m-k+1$ such substrings in $S$ and $T$, respectively. We define

  • $$K_3^k(S,T) = \sum_{u^k \in S} \sum_{v^k \in T} K_2^k(u_k,v_k)$$
  • $$K_3(S,T) = \sum_{k=1}^p w_k K_3^k(S,T)$$

where the summation extends to $p=\min(n,m)$, $n$ and $m$ the lengths of the two sequences that are compared, and $\omega_k$ a weight.

We set the weights $\omega_k$ to:

$$\omega(k)=\frac{1}{(n-k+1)(m-k+1)}$$

Finally, we define the correlation kernel $\hat{K}_3$ as:

$$\hat{K}_3(S,T) = \frac{K_3(S,T)}{ \sqrt{K_3(S,S)K_3(T,T)}}$$

We make the following remarks:
  • The input kernel matrix $SM$ is not a traditional substitution matrix, as it does not involve applying the logarithm function on the probability measures. While the latter is needed to make scores additive, a necessary condition to the use of dynamic programming algorithms to generate pairwise sequence alignment, it is not needed for the string kernel we use here. Note that this differs from the local alignment kernel, which is designed to mimic pairwise alignment.
  • The summation in Equation \ref{eqn:k3} extends to the lengths of the smallest sequence. This summation can be truncated however to only include \emph{k-mers} up to a fixed lengths, $k_{max}$. This will be discussed in details.
  • $\hat{K}_3$ is a kernel as long as $SM2$ and $\beta$ (which define $K_1$) are definite positive, and strictly positive, respectively. Reversely, for any kernel $K_1$ defined on $\mathcal{A}\times \mathcal{A}$, we can define a correlation string kernel $\hat{K}_3$.
  • As defined, $\hat{K}_3$ does not consider gap penalties, or even gaps. We consider this as an advantage, as it does reduce the number of parameters that defines $\hat{K}_3$.
  • The string kernel $\hat{K}_3$ is a similarity measure in the space of sequence. Notice that for all sequences $S$, $\hat{K}_3(S) = 1$. This similarity measure can be transformed into a distance measure, using $D(S,T) = \sqrt{2-2\hat{K}_3(S,T)}$. $D(S,T)$ takes values between 0 and $\sqrt{2}$.





  Page last modified 19 July 2017 http://www.cs.ucdavis.edu/~koehl/