Proper Treatment 正當作法/ blog/ posts/ Word numbers, Part 1: Billion approaches
標籤 Tags:
2008-08-17 19:19

ITA Software recruits computer scientists using puzzles such as the following.

If the integers from 1 to 999,999,999 are written as words, sorted alphabetically, and concatenated, what is the 51 billionth letter?

In a series of posts, Dylan Thurston and I will solve this problem step by step, introducing concepts such as monoids and differentiation along the way. We will use the programming language Haskell: every post will be a literate program that you can run as is. For example, you can download this post as a program.

Our basic strategy is quintessential computer science: first write a program to specify the problem, then interpret the program creatively to find the solution before the the universe ends. Our first step, then, is to specify how to write integers as words in English. Such a specification defines a long list of strings

["one", "two", ...,
 "onehundred", ...,
 "ninehundredninetyninemillionninehundredninetyninethousandninehundredninetynine"]

of type [String]. There is a lot of repetitive structure within and among these strings. We need to express this structure concisely so that we can exploit it later to solve the problem efficiently.

The structure we will use is that lists of strings form a seminearring. To use the nice-looking symbols + and * for operations in a seminearring, we hide the definitions of these operators from the Prelude, but we will keep the same infix precedences.

{-# OPTIONS -W -fglasgow-exts #-}
module WordNumbers1 where

import Prelude hiding ((+), (*), sum, product)
import qualified Prelude as P

infixl 6 +
infixl 7 *

A seminearring is first of all a monoid. A monoid is a set along with an associative operation + and its identity element zero.

class Monoid a where
  zero :: a
  (+) :: a -> a -> a

(The Data.Monoid module already defines a Monoid type class, but our notation fits better with the development below.) Any list type is a monoid: addition is concatenation, and the identity is the empty list.

instance Monoid [a] where
  zero = []
  (+) = (++)

You can think of + as nondeterministic choice and zero as failure.

A seminearring is a monoid with an additional associative operation * and its identity element one

class (Monoid a) => Seminearring a where
  one :: a
  (*) :: a -> a -> a

—satisfying distributivity (x+y) * z = x*z + y*z on one side only. That is, the other distribution law x * (y+z) = x*y + x*z does not necessarily hold (hence the “near” in “seminearring”). For example, every list-of-list type (including [String] because String is just [Char]) is a seminearring: to multiply two string lists is to concatenate every string in the first list and every string in the second list; the identity for this operation is the singleton list containing the empty string.

instance Seminearring [ [a] ] where
  one = [[]]
  xss * yss = [ xs ++ ys | xs <- xss, ys <- yss ]

Whereas distributivity holds on one side…

*WordNumbers1> (["twenty"] + ["thirty"]) * ["","one","two"]
["twenty","twentyone","twentytwo","thirty","thirtyone","thirtytwo"]
*WordNumbers1> ["twenty"] * ["","one","two"] + ["thirty"] * ["","one","two"]
["twenty","twentyone","twentytwo","thirty","thirtyone","thirtytwo"]

… it does not hold on the other side.

*WordNumbers1> ["twenty","thirty"] * (["","one"] + ["two"])
["twenty","twentyone","twentytwo","thirty","thirtyone","thirtytwo"]
*WordNumbers1> ["twenty","thirty"] * ["","one"] + ["twenty","thirty"] * ["two"]
["twenty","twentyone","thirty","thirtyone","twentytwo","thirtytwo"]

We require distributivity on one side only because a list of choices, unlike a set of choices, is ordered: concatenating a choice of 2 strings and a choice of 3 strings yields a choice of 6 strings, but there are two obvious ways to order the output choices, depending on which input choices are grouped together. The definition of * in the instance above enforces left-to-right evaluation: the leftmost choice is the outermost loop. This convention makes sense for English because we pronounce the most significant digit first and we want a list of English strings ordered by their numeric value. Unfortunately, some mathematicians use the opposite convention.

The seminearring [String] is generated by characters: every character maps to an element of this seminearring. Let us represent this property by a type class.

class Character a where
  char :: Char -> a

instance Character Char where
  char = id

instance (Character a) => Character [a] where
  char c = [char c]

We can extend this mapping from characters to strings by concatenating (multiplying) its output.

product :: (Seminearring a) => [a] -> a
product = foldr (*) one

string :: (Seminearring a, Character a) => String -> a
string = product . map char

We can now express a choice of strings, such as a digit between "one" and "three", not just as a list of strings but generically as a value of any type that is an instance of Seminearring and Character.

string "one" + string "two" + string "three" :: (Seminearring a, Character a) => a

We can specify a choice of words more concisely as a space-delimited string, as in strings "one two three".

sum :: (Monoid a) => [a] -> a
sum = foldr (+) zero

strings :: (Seminearring a, Character a) => String -> a
strings = sum . map string . words

We can now concisely define the list of 109−1 strings at the core of the ITA problem, in a way that expresses its repetitive structure.

ten1, ten2, ten3, ten6, ten9 :: (Seminearring a, Character a) => a
ten1 = strings "one two three four five six seven eight nine"
ten2 = ten1
     + strings "ten eleven twelve"
     + (strings "thir four" + prefixes) * string "teen"
     + (strings "twen thir for" + prefixes) * string "ty" * (one + ten1)
    where prefixes = strings "fif six seven eigh nine"
ten3 = ten2 + ten1 * string "hundred" * (one + ten2)
ten6 = ten3 + ten3 * string "thousand" * (one + ten3)
ten9 = ten6 + ten3 * string "million" * (one + ten6)

If you ignore the order of strings in these lists, the code above is just a context-free grammar for numbers in English. It is a bit strange for one to mean the empty string (usually notated ε or λ), but you get used to it. And it is super natural for + to mean alternation and * to mean concatenation.

We can check by brute force that ten6 contains 106−1 strings, but the same check on ten9 exhausted my patience.

*WordNumbers1> length (ten6 :: [String])
999999
*WordNumbers1> length (ten9 :: [String])
Interrupted.

We can even compute the total length of all words between

*WordNumbers1> head (ten6 :: [String])
"one"

and

*WordNumbers1> last (ten6 :: [String])
"ninehundredninetyninethousandninehundredninetynine"

by evaluating

*WordNumbers1> length (concat (ten6 :: [String]))
44872000

by brute force. But again, the same computation on ten9 exhausted my patience.

*WordNumbers1> length (concat (ten9 :: [String]))
Interrupted.

In the next post, we will compute the total length of ten9 without trying my patience.