A124015 thoughts

Jonathan Post jvospost3 at gmail.com
Sat Nov 4 03:49:46 CET 2006


I mostly agree.  A large sample of text as actually used in a language, such
as a set of books, newspaper articles, transcripts of phone conversations,
or the like, is often called a "corpus" by those who undertake statistical
analyses.  Various government agencies have done so since well before World
War II, tapping the minds of as many as half the mathematicians in the
world, with computational resources well ahead of those available to
corporate and personal users, and with the help of mathematicians such as
Turing.

There are statistical analyses of various editions of the Bible, of the
works of Shakespeare, and the like, with phalanxes of statistical analysts
at work on literature.

Many of these are covered in specialized journals, of questionable value to
seqfans, and unnoticed here except when, for instance, a minor poem is
claimed by such methods to be by Shakespeare.

Hence I second the motion by David Wilson to start with an extremely
well-defined if artificial corpus, such as "In various languages, what is
the distribution of the number of letters in the names of numbers from one
to (say) one million?"
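
For the English case, a rough sketch of how one might tabulate this follows.
It is only an illustration under stated assumptions: American-style names
with no "and" ("three hundred forty two"), spaces and hyphens not counted as
letters, and a range limited to 1 through 999,999,999.  The function names
are made up for this example and are not taken from any existing sequence
entry.

```python
from collections import Counter

# Names for 0..19; index 0 is empty because "zero" never appears inside a name.
ONES = ["", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
# Names for multiples of ten; indices 0 and 1 are unused.
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def letters_below_1000(n):
    """Letters in the name of n for 0 <= n < 1000 (0 contributes nothing)."""
    count = 0
    if n >= 100:
        count += len(ONES[n // 100]) + len("hundred")
        n %= 100
    if n >= 20:
        count += len(TENS[n // 10])
        n %= 10
    return count + len(ONES[n])


def letters_in_name(n):
    """Letters in the English name of n, assuming no 'and' and no hyphens."""
    if not 1 <= n < 10**9:
        raise ValueError("this sketch only handles 1 <= n < 10**9")
    millions, rest = divmod(n, 10**6)
    thousands, units = divmod(rest, 1000)
    count = 0
    if millions:
        count += letters_below_1000(millions) + len("million")
    if thousands:
        count += letters_below_1000(thousands) + len("thousand")
    return count + letters_below_1000(units)


def letter_count_distribution(limit):
    """Map each letter count k to how many n in 1..limit have a k-letter name."""
    return Counter(letters_in_name(n) for n in range(1, limit + 1))
```

Calling letter_count_distribution(10**6) gives the table for one to one
million.  British usage ("three hundred and forty two") would shift the
counts, so any resulting sequence should state which convention it uses.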

I have commented more extensively on Scrabble in other venues.  Do not ever
agree to use the OED (Oxford English Dictionary) as the official dictionary
when playing Scrabble.  Many misspellings by tournament rules are there.
Even though the OED is the definitive dictionary of the largest language on
Earth, and fascinating reading for various purposes, it is just too big and
too weird for game use.  Also, it makes it hard to distinguish a bluff from
a variant.

Now that I think of it, logarithmic trends can be interesting.  How about "what
is the distribution of the number of letters in the names of numbers from
one to one billion?"

-- Jonathan Vos Post

On 11/3/06, David Wilson <davidwwilson at comcast.net> wrote:
>
>  The description of A124015 is
>
> %N A124015 Number of words with n letters in the National Scrabble
> Association Dictionary.
>
> Some notes:
>
> - It should be noted that the NSAD includes a restricted set of English
> words (words of 2 to 14 letters, no proper nouns or derivatives, no words
> with non-alphabetic characters, e.g., contractions or hyphenated words), and
> it is this restricted set that is being counted.
>
> - The edition of the NSAD used to create A124015 should be specified,
> since the NSAD is continually edited.  I do not think that A124015
> should change with each new edition of the NSAD, because it may be
> referenced in other literature. On the other hand, I don't think that a new
> sequence should be created for each new edition of the NSAD unless the new
> edition exhibits some interesting statistical departure from the current
> NSAD (e.g., in some future NSAD, the number of 9-letter words exceeds the
> number of 8-letter words). This is because I believe that the value of
> A124015 is mainly to indicate a distribution of English word lengths.
>
> - On the other hand, I have always understood that 4-letter words are most
> common in literature (where small words appear more often). It might be
> interesting to create a sequence counting words of length n in the King
> James Bible, War and Peace, or some other suitably large and stable piece of
> English literature (we don't have to go overboard on this, just one or two
> examples indicating word length distributions in literature).
>
> - Another interesting idea: In various languages, what is the distribution
> of the number of letters in the names of numbers from one to (say) one
> million?
>
>
>