Thursday, May 31, 2012

Finding palindromes efficiently

Using the palindromes in palindromes property, I now develop an algorithm for finding palindromes that requires a number of steps approximately equal to the length of its input. This linear-time algorithm can be used to find palindromes in documents of any size, even in DNA strings. Finding palindromes in a string of length $20,000,000$ using this algorithm requires a number of seconds on a modern laptop. It is impossible to find palindromes significantly faster, unless you have a machine with multiple cores, and use a parallel algorithm.

The program for efficiently finding palindromes is only about 25 lines long, but for formatting purposes it extends to over 80 lines in this blog post. Although the program is short, it is rather intricate. I guess that you need to experiment a bit with to find out how and why it works.

The naive algorithm for finding palindromes discussed in a previous blog post uses the following top-level definition for finding maximal palindromes:


< maximalPalindromes :: Array Int Char -> [Int]
< maximalPalindromes a  =   
<   let (first,last)  =  bounds a
<       centers       =  [0 .. 2*(last-first+1)]
<   in  map (lengthPalindromeAround a) centers

The reason why this algorithm is naive is that lengthPalindromeAround calculates the maximal palindrome around a center independently of the palindromes calculated previously. I now change this by calculating the maximal palindromes from left to right around the centers of a string. In this calculation I either extend a palindrome around a center, or I move the center around which I determine the maximal palindrome rightwards because I have found a maximal palindrome around a center. So I replace the above definition by

> maximalPalindromes :: Array Int Char -> [Int]
> maximalPalindromes a  =   
>   let (first,last)  =  bounds a
>   in  reverse (extendPalindrome a first 0 [])

Before I introduce and explain function extendPalindrome, I give an example of how the algorithm works. Suppose I want to find the maximal palindromes in the string "yabadabadoo". The algorithm starts by finding the maximal palindrome around the position in front of the string, which cannot be anything else than the empty string. It moves the position around which to find the maximal palindrome one step to point to the `y'. The maximal palindrome around this position is "y", since there is no character in front of it. It again moves the position around which to find palindromes one step to point to the position in between `y' and `a'. Since `y' and `a' are different, the maximal palindrome around this position is the empty string. Moving the center to `a', it finds that "a" is the maximal palindrome around this center, since `y' and `b' are different. The maximal palindrome around the next center in between `a' and `b' is again the empty string. Moving the center to `b', it can extend the current longest palindrome "b" around this center, since both before and after `b' it finds an `a'. It cannot further extend the palindrome "aba", since `y' and `d' are different. To determine the maximal palindrome around the center in between `b' and `a', the next center position, it uses the fact that "aba" is a palindrome, and that it already knows that the maximal palindrome around the center in between `a' and `b' is the empty string. Using the palindromes in palindromes property, it finds that the maximal palindrome around the position in between `b' and `a' is also the empty string, without having to look at the `b' and the `a'. To determine the maximal palindrome around the next center position on the last `a' of "aba", it has to determine if `d' equals `b', which it doesn't of course. Also here it uses the palindromes in palindromes property. Since "a" is the maximal palindrome around the center of the first `a' in "aba", and it reaches until the start of the palindrome "aba", I have to determine if the palindrome "a" around the second `a' can be extended. I won't describe all steps extendPalindrome takes in detail, but only give one more detail I already described above: the second occurrence of the palindrome "aba" in "yabadabadoo" is not found by extending the palindrome around its center, but by using the palindromes in palindromes property to find "aba" a second time in "abadaba".

Function extendPalindrome takes four arguments. The first argument is the array a in which we are determining maximal palindromes. The second argument is the position in the array directly after the longest palindrome around the current center (the longest palindrome around the center before the first symbol has length $0$, so the position directly after the empty palindrome around the first center is the first position in the array). I will call this the current rightmost position. The third argument is the length of the current longest palindrome around that center (starting with 0), and the fourth and final argument is a list of lengths of longest palindromes around positions before the center of the current longest tail palindrome, in reverse order (starting with the empty list []). It returns the list of lengths of maximal palindromes around the centers of the array, in reverse order. Applying function reverse to the result gives the maximal palindromes in the right order. The function extendPalindrome maintains the invariant that the current palindrome is the longest palindrome that reaches until the current rightmost position.

There are three cases to be considered in function extendPalindrome.

  • If the current position is after the end of the array, so rightmost is greater than last, I cannot extend the current palindrome anymore, and it follows that it is maximal. This corresponds to the null after case in the definition of maximal palindromes. It only remains to find the maximal palindromes around the centers between the current center and the end of the array, for which I use the function finalPalindromes.
  • If the current palindrome extends to the start of the array, or it cannot be extended, it is also maximal, and I add it to the list of maximal palindromes found. This case corresponds to the null after || last before /= head after case in the definition of maximal palindromes. I then determine the maximal palindrome around the following center by means of the function moveCenter.
  • If the element at the current rightmost position in the array equals the element before the current palindrome I extend the current palindrome.
In the following Haskell code, I list the arguments to a function below the function, since they don't fit on a single line anymore. I have also included some comments in the code, which are the pieces of text appearing after --.

> extendPalindrome 
>   a 
>   rightmost 
>   currentPalindrome 
>   currentMaximalPalindromes
>   | rightmost > last                                   =  
>       -- reached the end of the array
>       finalPalindromes 
>         currentPalindrome 
>         currentMaximalPalindromes 
>         (currentPalindrome:currentMaximalPalindromes)
>   | rightmost-currentPalindrome == first ||
>     a!rightmost /= a!(rightmost-currentPalindrome-1)   =  
>       -- the current palindrome extends to the start 
>       -- of the array, or it cannot be extended
>       moveCenter 
>         a 
>         rightmost 
>         (currentPalindrome:currentMaximalPalindromes) 
>         currentMaximalPalindromes 
>         currentPalindrome 
>   | otherwise                                          =  
>       -- the current palindrome can be extended
>       extendPalindrome 
>         a 
>         (rightmost+1) 
>         (currentPalindrome+2) 
>         currentMaximalPalindromes      
>   where  (first,last)  =  bounds a

In two of the three cases, function extendPalindrome finds a maximal palindrome, and goes on to the next center by means of function finalPalindromes or moveCenter. In the other case it extends the current palindrome, and moves the rightmost position one further to the right.

Function moveCenter moves the center around which the algorithm determines the maximal palindrome. In this function I make essential use of the palindromes in palindromes property. It takes the array as argument, the current rightmost position in the array, the list of maximal palindromes to be extended, the list of palindromes around centers before the center of the current palindrome, and the number of centers in between the center of the current palindrome and the rightmost position. It uses the palindromes in palindromes property to calculate the longest palindrome around the center after the center of the current palindrome.

  • If the last center is on the last element, there is no center in between the rightmost position and the center of the current palindrome. I call extendPalindrome with rightmost position one more than the previous position, and a current palindrome of length 1.
  • If the previous element in the list of maximal palindromes reaches exactly to the left end of the current palindrome, I use the palindromes in palindromes property of palindromes to find the next current palindrome using extendPalindrome.
  • In the other case, I have found the longest palindrome around a center, add that to the list of maximal palindromes, and proceed by moving the center one position, and calling moveCenter again. I only know that the previous element in the list of maximal palindromes does not reach exactly to the left end of the current palindrome, so it might be either shorter or longer. If it is longer, I need to cut off the new maximal palindrome found, so that it reaches exactly to the current rightmost position.

> moveCenter 
>   a 
>   rightmost 
>   currentMaximalPalindromes 
>   previousMaximalPalindromes 
>   nrOfCenters
>   | nrOfCenters == 0                                  =  
>       -- the last center is on the last element: 
>       -- try to extend the palindrome of length 1
>       extendPalindrome 
>         a 
>         (rightmost+1) 
>         1 
>         currentMaximalPalindromes
>   | nrOfCenters-1 == head previousMaximalPalindromes  =  
>       -- the previous maximal palindrome reaches 
>       -- exactly to the end of the last current 
>       -- palindrome. Use the palindromes in palindromes 
>       -- property to extend the current palindrome
>       extendPalindrome 
>         a 
>         rightmost 
>         (head previousMaximalPalindromes) 
>         currentMaximalPalindromes
>   | otherwise                                         =  
>       -- move the center one step. Add the length of 
>       -- the longest palindrome to the maximal
>       -- palindromes
>       moveCenter 
>         a 
>         rightmost 
>         (min (head previousMaximalPalindromes) 
>              (nrOfCenters-1)
>         :currentMaximalPalindromes) 
>         (tail previousMaximalPalindromes) 
>         (nrOfCenters-1)

In the first case, function moveCenter moves the rightmost position one to the right. In the second case it calls extendPalindrome to find the maximal palindrome around the next center, and in the third case it adds a maximal palindrome to the list of maximal palindromes, and moves the center of the current palindromes one position to the right.

Function finalPalindromes calculates the lengths of the longest palindromes around the centers that come after the center of the current palindrome of the array. These palindromes are again obtained by using the palindromes in palindromes property. Function finalPalindromes is called when we have reached the end of the array, so it is impossible to extend a palindrome. We iterate over the list of maximal palindromes, and use the palindromes in palindromes property to find the maximal palindrome at the final centers. As in the function moveCenter, if the previous element in the list of maximal palindromes reaches before the left end of the current palindrome, I need to cut off the new maximal palindrome found, so that it reaches exactly to the end of the array.


> finalPalindromes 
>   nrOfCenters     
>   previousMaximalPalindromes 
>   currentMaximalPalindromes   
>   | nrOfCenters == 0  =  currentMaximalPalindromes
>   | otherwise         =
>       finalPalindromes 
>         (nrOfCenters-1) 
>         (tail previousMaximalPalindromes) 
>         (min (head previousMaximalPalindromes) 
>              (nrOfCenters-1)
>         :currentMaximalPalindromes)

In each step, function finalPalindromes adds a maximal palindrome to the list of maximal palindromes, and moves on to the next center.

I have discussed the number of steps this algorithm takes for each function. At a global level, this algorithm either extends the current palindrome, and moves the rightmost position in the array, or it extends the list of lengths of maximal palindromes, and moves the center around which we determine the maximal palindrome. If the length of the input array is $n$, the number of steps the algorithm is $n$ for the number of moves of the rightmost position, plus $2n+1$ for the number of center positions. This is a linear-time algorithm.

Tuesday, May 22, 2012

Sotades

His name has been mentioned already a couple of times: Sotades reportedly wrote some of the first palindromes.

Sotades was an Ancient Greek poet, living in the third century before Christ. He lived in Alexandria during the reign of Ptolemy II Philadelphus (285 BC-246 BC). Ptolemy II was born in Kos in about 309 BC. As a youth, he enjoyed the best tutors. The practice of getting the best scholars or poets available to educate the crown prince was something that Ptolemy II's father (also called Ptolemy, which explains the II in his son's name), had the occasion to observe in Macedonia, where the young Alexander was taught by no less a figure than Aristotle himself. Ptolemy II and his father set up the Alexandrian University and Library, and attracted the most learned men of the times to Alexandria. Ptolemy II provided these scholars with a life free from want and taxes, allowing them to study, write, perform research, and lecture in their respective disciplines.

Ptomely II was married to Arsinoe (I), and got three children with her. He had also a sister called Arsinoe (II), who was married to a Greek king. Sister Arsinoe was a scheming woman. She had the oldest son of her husband, by a different woman, murdered to position her own children for the throne. When her husband was killed in the battle that ensued after the widow of the murdered son fled to a neighboring country, she married her half-brother (also called Ptolemy), who claimed the same throne as her husband had had. When her new husband became too powerful, she and her sons conspired against him. They were found out. Two sons got killed, and she and her eldest son (also called Ptolemy) fled to her full brother Ptolemy II in Egypt. In Eqypt she managed to get rid of Ptolemy's first wife Arsinoe I by accusing her of conspiring against her husband. Arsinoe I was sent in exile, and Arsinoe II married her brother.

Alexandria not only attracted scholars, but also poets and satirists, such as Sotades. One of his poems attacked Ptolemy II's marriage to his sister Arsinoe, from which came the infamous line: "You're sticking your prick in an unholy hole." For this, Sotades was imprisoned, but he escaped to the island of Kaunos, where he was captured by Ptolemy's admiral, shut up in a leaden chest, and thrown into the sea.

Sotades wrote obscene verses, in particular about the love between an older man and an adolescent boy. He invented his own metre for such verses, the sotadic metre. This metre spread around the world, and was used by other satirists to hint at an obscene or satirical interpretation. For example, Ennius wrote his `Sota' already in the same century in southern Italy. Sotades' verses could be read both forwards and backwards, where the backwards reading was the same, or contained an obscene interpretation. Unfortunately, none of the palindromic verses of Sotades survived.

Tuesday, May 15, 2012

Palindromes in palindromes

In a previous blog post, I develop a quadratic time algorithm for finding the length of the maximal palindromes around all centers of a string. Although it is a lot faster than the naive algorithm that finds all palindromes in a string, it is still not very useful when trying to find palindromes in DNA, or in a text as long as the Bible.

To design an efficient algorithm for finding palindromes, I reuse as much information as possible when calculating the length of maximal palindromes around centers. In this blog post, I will explain the central idea that allows me to reuse previously calculated maximal palindromes.

In the naive algorithm for finding the maximal palindromes around all centers of a string, I start with calculating the maximal palindrome around the first, leftmost, center, then the palindrome around the next center on the first character, then the palindrome around the center in between the first two characters, and so on. I calculate the palindromes from left to right. In the illustration below I depict a string with the letters blackened out. But you may assume that it represents the string "yabadabadoo". In this string, p is the maximal palindrome around center b. For example, p might be "abadaba", with b equal to $9$. I assume p is the longest palindrome that reaches to its rightmost position, so that no palindrome around a center before b reaches there.

I calculate the maximal palindromes from left to right, so I have already calculated the maximal palindrome around center a, which appears within palindrome p. I call the maximal palindrome around center a q. In the "yabadabadoo" example a might be center $5$, with "aba" as maximal palindrome q around it. In the illustration q is completely within p, but it might also extend to before p. It cannot extend after p, since then there would be a longer palindrome that extends to the rightmost position of p. Center a' is the center I get when mirroring center a with respect to center b: b plus the difference between b and a. This would be $13$ ($=9+(9-5)$) in our example. What is the maximal palindrome around a'?

Since p is a palindrome, the string with the same length as q around a' is reverse q, provided q is completely within p. If q extends to before p, we cut it off at the start of p, and remove an equally long part at the end of q. Since q is a palindrome, reverse q equals q, and is hence also a palindrome. I will call this palindrome q' to distinguish it from q. Is it the maximal palindrome around a'?

To find out whether or not q' is the maximal palindrome around a', I determine whether or not the characters before and after q' are the same. If the maximal palindrome around center a does not extend to the start of p, I know that the characters around q are different. Since the same characters appear around the palindrome q', it is the maximal palindrome around center a'. In this case I don't even have to look at the characters before and after q': the fact that q does not extend to the start of p gives me enough information to determine that q' is maximal. If the maximal palindrome around center q does extend to the start of p, as in our example, where "abadaba" starts with "aba", or even before, I have to compare the character before q' with the character after p. If these are different, q' is the maximal palindrome around a', otherwise, I have to investigate further to find out the maximal palindrome around a'. I will do this in the blog post that discusses an efficient algorithm for finding the length of all maximal palindromes in a string. In our example, "aba" is not the maximal palindrome around center a', since the character 'd' appears both before and after q'

Wednesday, May 9, 2012

The word `palindrome'

The word `palindrome' comes from the Greek παλινδρόμους. The word παλιν translates to `again', and the word δρόμους (or δρόμος) to `way' or `direction', and the combination is often translated to `running back again'. In the 17th century, the English poet Ben Johnson connected the word palindrome to the concept of a word or a sequence of words that reads, letter for letter, the same backwards as forwards. The palindrome concept existed long before the 17th century. Reportedly, the first recorded palindromes were written by Sotades in (again) Greece in the third century B.C.

`Palindrome' is an old word, which already appears in some early Greek texts. For example, it appears in the work Timon of Lucian of Samosata.

Lucian was a rhetorician and satirist living in the 2nd century A.D. He wrote more than eighty works, among which Timon, sometimes also translated as The Misanthrope. Shakespeare's play Timon of Athens has been influenced by the work of Lucian.

Timon of Athens was a citizen of Athens during the era of the Peloponnesian War between Athens and Sparta (431 BC–404 BC). He was wealthy and lavished his money on flattering friends. When funds ran out, friends deserted and Timon was reduced to working in the fields. One day, he found a pot of gold and soon his unreliable friends were back. This time, he drove them away with dirt clods. Timon became an angry despiser of mankind, and his reputation for misanthropy grew to legendary status.

He can't have been nice company: a typical sentence of the misanthropic Timon in Lucian's work is `Go your ways, then, Hermes, and take Plutus back to Zeus. I am quite content to let every man of them go hang.' Hermes is the messenger god, en Plutus the god of wealth. In Greek, this sentence reads: ὥστε παλίνδρομος ἄπιθι, ὦ Ἑρμῆ, τὸν Πλοῦτον ἐπανάγων τῷ Διί: ἐμοὶ δὲ τοῦτο ἱκανὸν ἦν, πάντας ἀνθρώπους ἡβηδὸν οἰμώζειν ποιῆσαι. The second word is the occurrence of `palindrome'. It is not translated by `running back again', but clearly something goes back, or returns, here.

Another occurrence of `palindrome' is found in Diogenes Laërtius' book `Lives of Eminent Philosophers', which was probably published somewhere in the 3rd century A.D. Laërtius introduces Aristippus in Chapter 8.

Aristippus was drawn to Athens by the fame of Socrates. He became a lecturer himself, and was one of the first to both pay and charge fees for lecturing. `And on one occasion the sum of twenty minae which he had sent was returned to him, Socrates declaring that the supernatural sign would not let him take it; the very offer, in fact, annoyed him.' In Greek, this reads: καί ποτε πέμψας αὐτῷ μνᾶς εἴκοσι παλινδρόμους ἀπέλαβεν, εἰπόντος Σωκράτους τὸ δαιμόνιον αὐτῷ μὴ ἐπιτρέπειν: ἐδυσχέραινε γὰρ ἐπὶ τούτῳ. The 7th word of this sentence is `palindrome' in Greek. Again, it is not translated exactly as `running back again', but something is returned.

Although the roots of the word `palindrome' are in Greek, the Greek themselves describe the palindrome concept by karkinikê epigrafê (καρκινικὴ επιγραφή; "crab inscription"), or simply karkinoi (καρκίνοι; "crabs"), alluding to the movement of crabs.

Thursday, May 3, 2012

A naive algorithm for finding maximal palindromes

The previous blog post discusses why it is sufficient to calculate the length of the maximal palindrome around each center position in a string if I want to find palindromes in a string. In this blog post I will describe a naive algorithm for finding the length of all maximal palindromes in a string. This algorithm is slightly less naive than the naive algorithm for finding palindromes given in an earlier blog post. The latter algorithm requires a supercomputer for a couple of days to find palindromes in a long string, the algorithm I will develop in this blogpost requires several days of computing power of the laptop on which I am typing this post. In numbers, this algorithm is about a milliard times faster on such a long string.

Given a string as input, I want to find the list of lengths of maximal palindromes around all centers of the string. I will use the function maximalPalindromes for this purpose. It has the following type


< maximalPalindromes :: String -> [Int]

I want to find the length of the maximal palindrome around each center in a string. I will do this by trying to extend the trivial palindromes consisting of either a single letter (for odd centers) or of the empty string (for even centers) around each center. To extend a palindrome, I have to compare the characters before and after the current palindrome. It would be helpful if I had random access into the string, so that looking up the character at a particular position in a string can be done in constant time. Since an array allows for constant time lookup, I change the input type of the maximalPalindromes function to an array. I have to import the module Data.Array to use arrays.

> import Data.Array
>
> maximalPalindromes :: Array Int Char -> [Int]

If I change my input type from strings to arrays, I have to convert an input string into an array. The Data.Array module contains a function just for this purpose: the function listArray creates an array from a string together with a pair of indices for the first and last position in the string. Function maximalPalindromes calculates the following lengths of maximal palindromes in the string "abb":

?maximalPalindromes (listArray (0,2) "abb")
[0,1,0,1,2,1,0]

Function maximalPalindromes calculates the length of maximal palindromes by first calculating all center positions of an input array, and then the length of the maximal palindrome around each of these centers. The center positions of an array with first and last as bounds are the elements in the list [0 .. 2*(last-first+1)]. I will always use arrays that start at index 0, which implies that the length of an array is last+1.


> maximalPalindromes a  =   
>   let (first,last)  =  bounds a
>       centers       =  [0 .. 2*(last-first+1)]
>   in  map (lengthPalindromeAround a) centers

Function lengthPalindromeAround takes an array and a center position, and calculates the length of the longest palindrome around that position. Center positions do not exactly correspond to indices in the array, since they not only describe locations of characters, but also locations in between characters. To obtain an array index from a center position I divide it by two. If the center position is even, it is in between two characters, and if I divide it by two, I get an array index that points at the character after the center position. I compare the character pointed to by the index with the character before it to find out if the (empty) palindrome can be extended. If it can, I subtract one from the starting position, and add one to the end position, and try to extend the palindrome with the new indices. If the palindrome cannot be extended, because the elements around it are different, or the current palindrome starts at the start or ends at the end of the array, I return the length of the maximal palindrome found, which is the difference between the end and the start position minus one. For odd center positions I can start with the indices before and after the character pointed to by the index obtained by dividing the center position by two, rounding down.

> lengthPalindromeAround  ::  Array Int Char -> Int -> Int
> lengthPalindromeAround a center 
>   | even center = 
>       lengthPalindrome (first+c-1) (first+c) 
>   | odd  center = 
>       lengthPalindrome (first+c-1) (first+c+1) 
>   where  c             =  div center 2
>          (first,last)  =  bounds a
>          lengthPalindrome start end  = 
>             if   start < 0 
>               || end > last-first 
>               || a!start /= a!end
>             then end-start-1
>             else lengthPalindrome (start-1) (end+1) 

For each position, this function may take an amount of steps linear in the length of the array, so this is a quadratic-time algorithm.