Rastell Toull

A website devoted to Brittany,
North Africa,
French song,
scientific research,
and many other subjects ...

The monoalpha-multinumeric numeral system:
a numeration compatible with the lexicographic order and concatenation


by Jacques-Deric Rouault

Original article
Public page
Operational page

Version 4.1 of
4 May 2012

How to cite this document:
Jacques-Deric Rouault, 2012. The monoalpha-multinumeric numeration: a numeration compatible with the lexicographic order and concatenation. Rastell Toull page C115.

Summary

    The monoalpha-multidecimal numeration joins a first number, made up of a single Latin alphabetic numeral A-Z, to a second, classical decimal number written with the Arabic numerals 0-9. The alphabetic numeral codes the number of decimal digits of the second number (A=1, B=2, …, Z=26). This process makes the decimal writing fully compatible with the alphanumeric sort, with maximal compactness and a built-in redundancy check. The numeral system is therefore wholly compatible with the lexicographic order used in computers. Monoalpha-multidecimal numbers can be directly concatenated, and the result remains compatible with the lexicographic order. This numeral system provides an economical way of naming all the branches of a tree.

Introduction
 
    When a file or directory name is coded by a number, the alphanumeric sort used in computer operating systems does not restore the natural numeric order of the numbers. For instance, the alphanumeric sort of the numbers 1, 2, 3, 4, 10, 11, 12, 20, 23, 105 gives the order 1, 10, 105, 11, 12, 2, 20, 23, 3, 4 instead of the natural numeric order. We are looking for a numeral system for integers that stays close to the classical decimal numeration and is compatible with the lexicographic order.
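The mismatch described above can be reproduced with a few lines of Python. This is only an illustrative sketch: the variable names are ours, and Python's built-in string comparison stands in for the alphanumeric sort of an operating system.

```python
# Names taken from the example in the text, stored as plain decimal strings.
names = ["1", "2", "3", "4", "10", "11", "12", "20", "23", "105"]

# sorted() on strings compares character by character (lexicographic order).
alphanumeric = sorted(names)

# Sorting on the integer value restores the natural numeric order.
numeric = sorted(names, key=int)

print(alphanumeric)
print(numeric)
```

The first list shows 10 and 105 slipping in before 2, exactly as in the example of the text.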

First solution

    A first solution consists in prefixing the decimal number with zeros, so that all numbers have the same number of digits. Compatibility with the alphanumeric sort is then obtained. However, the value of the greatest codable integer has to be fixed beforehand, and this value is usually unknown when the process is started.

    Two drawbacks are then possible. In the first case, the greatest codable integer is overestimated and many useless leading zeros are carried along, weighing down the writing and significantly reducing readability. In the second case, the greatest codable integer is underestimated and the current numbering has to be reworked by adding an extra zero at the beginning of all the file or directory names previously created. This revision does not raise any particular technical problem, but its cost may be prohibitive, as with the Y2K bug, where the year was coded by its last two digits instead of all four, so that the year 1999 was followed by 1900. The "...97, 98, 99, 00..." ascending numbering assumption suddenly became invalid.
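Both drawbacks of zero-padding can be illustrated with a short sketch. The helper `pad` and the chosen width of 3 are assumptions made for the example, not part of the original article.

```python
def pad(n: int, width: int) -> str:
    """Zero-pad n to a fixed width; refuse values that do not fit."""
    digits = str(n)
    if n < 0 or len(digits) > width:
        raise ValueError(f"{n} does not fit in {width} digits")
    return digits.zfill(width)

values = [1, 2, 3, 4, 10, 11, 12, 20, 23, 105]
padded = [pad(v, 3) for v in values]

# With a fixed width, the lexicographic order matches the numeric order.
print(sorted(padded) == padded)

# Two-digit years: "00" sorts before "97", as in the Y2K example.
years = sorted(["97", "98", "99", "00"])
print(years)
```

Calling `pad(1000, 3)` raises a `ValueError`: this is the underestimated case, where every existing name would have to be rewritten with one more zero.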

Second solution

    In the second solution, put forward here, we propose to prefix the decimal number with a first figure which codes the number of digits composing the decimal number. To make reading easier and to limit confusion, this first figure is taken from the Latin alphabet with the convention A=1, B=2, C=3, ..., Z=26. It can be written either in capitals or in lower-case letters, but to avoid problems with operating systems that distinguish upper-case from lower-case letters (such as UNIX), the use of capitals is strongly recommended.

This monoalpha-multidecimal numeration is written as follows: A0, A1, A2, ..., A9, B10, B11, ..., B99, C100, C101, ..., C999, D1000, D1001, etc.

The compatibility with the alphanumeric sort is complete (a property that the plain decimal numeration lacks), and the compactness is optimal because only a single figure is added to the classical decimal writing.

The greatest expressible value in this numeration is 10^26 - 1 (the letter Z followed by 26 nines), which is sufficient for all common applications.
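A minimal sketch of the coding rule follows, assuming non-negative integers of at most 26 digits and capital letters only. The function names `encode` and `decode` are ours and do not come from the original article.

```python
import string

LETTERS = string.ascii_uppercase  # A=1, B=2, ..., Z=26

def encode(n: int) -> str:
    """Prefix the decimal writing of n with the letter coding its digit count."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    digits = str(n)
    if len(digits) > len(LETTERS):
        raise ValueError("more than 26 digits")
    return LETTERS[len(digits) - 1] + digits

def decode(name: str) -> int:
    """Drop the leading letter and read the decimal part (no validation here)."""
    return int(name[1:])

values = [1, 2, 3, 4, 10, 11, 12, 20, 23, 105]
encoded = [encode(v) for v in values]
print(encoded)

# The lexicographic order of the coded names matches the numeric order.
print(sorted(encoded) == encoded)
```

Since the letter is compared first, a shorter number always sorts before a longer one, and numbers of equal length are compared digit by digit.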

Discussion

    Several alphadecimal numerations already exist: the hexadecimal or base-16 numeration, based on the 10 Arabic numerals 0-9 and the first 6 capital Latin letters A-F; the hexatridecimal, sexatrigesimal or hexatrigesimal numeration, or base 36, based on the 10 Arabic numerals 0-9 and the 26 Latin letters A-Z; and other numerations in base 64. In the numeration proposed here, unlike the above alphadecimal numerations, the Latin letters and the Arabic numerals occupy definite places in the number. This is why, to avoid any ambiguity, we call it the monoalpha-multidecimal numeration, to specify that it is based on a first Latin letter followed by one or several Arabic numerals.

Coding numbers by letters was relatively common in antiquity, in particular in the Hebrew and Greek civilizations (Ifrah, 1981). In these alphabets, however, the first 9 letters code the units, the following ones code the tens, and the last letters code the first hundreds. There is, however, an archaic system of numeration based on the 24 Greek letters, dating from the 6th century BC, and analogous to the one described here. This system, which does not exceed 24, was used in particular to number the 24 books of the Iliad and the Odyssey (Reinach, 1885).

The kind of code we propose here can also be compared to the coding of variable-length strings in some computer languages. In one convention, each character of the string is coded by one byte (value in 0..255), and an extra first byte codes the actual number of bytes present in the string. In other languages, the string is terminated by a byte with a predefined value (usually the null character, ASCII 0, as in C), which marks the end of the string in the same manner as a stop codon in a DNA sequence.

The complete structure of the monoalpha-multidecimal numeration can also be interpreted as integrating a redundancy check, like parity bits (Spataru 1987), thus safeguarding integrity when data are transferred or stored. Not all possible sequences are valid: for instance, 1B2 and B123 are obviously wrong and reveal corrupted data.
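The redundancy check can be sketched as a small validation function. The name `is_valid` is ours, and two points are assumptions rather than statements from the article: only capital letters are accepted, and a leading zero is rejected except in A0.

```python
import string

LETTERS = string.ascii_uppercase  # A=1, B=2, ..., Z=26
DIGITS = set(string.digits)

def is_valid(name: str) -> bool:
    """Check that the letter matches the number of decimal digits that follow."""
    if len(name) < 2 or name[0] not in LETTERS:
        return False
    digits = name[1:]
    if any(c not in DIGITS for c in digits):
        return False
    if len(digits) != LETTERS.index(name[0]) + 1:
        return False
    # Assumption: canonical writing only, so no leading zero except "A0".
    return len(digits) == 1 or digits[0] != "0"

for candidate in ["C100", "A0", "1B2", "B123"]:
    print(candidate, is_valid(candidate))
```

The two wrong examples of the text are rejected for different reasons: 1B2 does not start with a letter, and B123 announces two digits but carries three.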


Extensions

In order to improve the readability of large integers, a separator symbol (blank or point in French, comma in English) is usually inserted every 3 digits. This practice remains compatible with the monoalpha-multidecimal numeration, provided that it is applied to all the numbers of the series. For reasons of compatibility between computer operating systems, the underscore character (ASCII 95), used for writing numbers in the Ada language (Taft & Duff, 1997), is highly recommended as the separator.

    In order to code numbers with more than 26 digits, the length is coded with two Latin alphabetic characters, allowing numbers below 10^(26*26) = 10^676. We then call it the dialpha-multidecimal numeration. And so on ...
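One possible reading of the two-letter extension is sketched below. The article does not state how the two letters code the length; this sketch assumes AA=1, AB=2, ..., AZ=26, BA=27, ..., ZZ=676, and the function name `encode_dialpha` is ours.

```python
import string

LETTERS = string.ascii_uppercase

def encode_dialpha(n: int) -> str:
    """Prefix n with two letters coding its digit count (assumed AA=1 ... ZZ=676)."""
    digits = str(n)
    length = len(digits)
    if n < 0 or length > 26 * 26:
        raise ValueError("only non-negative integers of at most 676 digits")
    high, low = divmod(length - 1, 26)
    return LETTERS[high] + LETTERS[low] + digits

values = [5, 99, 10**26, 10**30]
names = [encode_dialpha(v) for v in values]
print(names[0], names[1], names[2][:2], names[3][:2])
print(sorted(names) == names)
```

Under this assumption the two-letter prefixes grow in the same order as the lengths they code, so the lexicographic order is preserved as in the one-letter case.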

This method of prefixing with a Latin letter can also be extended to the binary, hexadecimal and other numerations.


Concatenation

    The direct concatenation of two or more integers can be read either as a single integer or as several integers written in a fixed format. For instance, the number 20080611 can also be read as the eleventh day of the sixth month of 2008 in the fixed format YYYYMMDD. To remove any ambiguity, the writing D2008A6B11 clearly shows that this number is a concatenation of three integers. Because in the ASCII table the digits (codes 48-57) are ranked before the upper-case letters (65-90) and the lower-case letters (97-122), the concatenation of monoalpha-multidecimal numbers is wholly compatible with the lexicographic order. A1B12 comes before B12A1 because A comes before B, and A1C124 comes before A1D1308 because C comes before D.
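Concatenation and its inverse can be sketched as follows. The helper names `encode`, `concat` and `split` are ours, and the sketch assumes capital letters and non-negative integers of at most 26 digits.

```python
import string

LETTERS = string.ascii_uppercase  # A=1, B=2, ..., Z=26
DIGITS = set(string.digits)

def encode(n: int) -> str:
    digits = str(n)
    if n < 0 or len(digits) > 26:
        raise ValueError("only non-negative integers of at most 26 digits")
    return LETTERS[len(digits) - 1] + digits

def concat(numbers) -> str:
    """Concatenate the coded form of each integer."""
    return "".join(encode(n) for n in numbers)

def split(name: str) -> list:
    """Recover the integers: each letter says how many digits to read next."""
    numbers, i = [], 0
    while i < len(name):
        count = LETTERS.find(name[i]) + 1  # 0 if not a capital letter
        chunk = name[i + 1 : i + 1 + count]
        if count == 0 or len(chunk) != count or any(c not in DIGITS for c in chunk):
            raise ValueError(f"malformed name: {name!r}")
        numbers.append(int(chunk))
        i += 1 + count
    return numbers

date = concat([2008, 6, 11])
print(date, split(date))

# Sorting the names gives the same order as sorting the integer tuples.
tuples = [(1, 12), (12, 1), (1, 124), (1, 1308), (1,), (2,)]
by_name = sorted(tuples, key=concat)
print(by_name == sorted(tuples))
```

No separator is needed between the parts, because each letter announces exactly how many digits belong to the number that follows it.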


Naming the branches of a tree

    In a tree, the name of a branch is the concatenation of the name of the previous branch and of the rank of the branch at the node. The use of the monoalpha-multidecimal numeration provides a natural way to name all the branches in an order compatible with the lexicographic order (Figure 1).



Figure 1: Naming the branches of a tree by concatenating monoalpha-multidecimal numbers

 
    In the tree of Figure 1, the names of the 46 branches are automatically ranked in the lexicographic order A1, A1A1, A1A1A1, A1A1A2, A1A1A2A1, A1A1A2A2, A1A1A2A3, A1A1A2A4, A1A1A3, A1A2, A1A3, A1A3A1, A1A3A1A1, A1A3A2, A1A4, A1A4A1, A2, A2A1, A2A1A1, A2A1A1A1, A2A1A1A2, A2A1A2, A2A1A3, A2A1A3A1, A2A2, A3, A3A1, A3A1A1, A3A1A2, A3A1A3, A3A1A4, A3A1A5, A3A1A6, A3A1A7, A3A1A8, A3A1A9, A3A1B10, A3A1B11, A3A1B12, A3A1B13, A3A1B14, A3A2, A3A3, A3A3A1, A3A3A2, A4. The lexicographic order describes the tree by taking first, at each node, the rightmost branch. Even for a branch at the fourth level, the name (for instance A2A1A3A1) remains easily readable.
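The naming rule can be sketched with a recursive generator. The tree below is a small hypothetical one, not the tree of Figure 1, and the representation of a node as the list of its subtrees is an assumption made for the example.

```python
import string

LETTERS = string.ascii_uppercase  # A=1, B=2, ..., Z=26

def encode(n: int) -> str:
    digits = str(n)
    if n < 0 or len(digits) > 26:
        raise ValueError("only non-negative integers of at most 26 digits")
    return LETTERS[len(digits) - 1] + digits

def name_branches(children, prefix=""):
    """Yield branch names depth first; a node is the list of its subtrees."""
    for rank, subtree in enumerate(children, start=1):
        name = prefix + encode(rank)  # parent name + rank at the node
        yield name
        yield from name_branches(subtree, name)

# Hypothetical tree: a first branch with 12 leaves, a second one with 2.
tree = [[[] for _ in range(12)], [[], []]]
names = list(name_branches(tree))
print(names)

# The depth-first order of the names is already the lexicographic order.
print(sorted(names) == names)
```

The first branch has more than nine children, so the names run from A1A9 to A1B10 without breaking the sorted order.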


Applications

    The monoalpha-multidecimal numeration was originally developed while building a database devoted to an exhaustive census of transposable elements, which are short DNA sequences repeated to a greater or lesser degree in the genomes of living organisms. The same transposable element can be found in different organisms, and may therefore appear under different names in the available data banks. Since two transposable elements of different lengths are necessarily different, the elements are coded by their number of base pairs using the monoalpha-multidecimal numeration. The name of a sequence is built as the concatenation of two numbers, coding its length and its order of appearance, for instance C451B48. Comparing a new element with those previously recorded is then much faster, in the manner of a quick sort, because one need only compare it with the elements of the same length, whose names begin with the same first number.
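The lookup by length can be sketched as follows. This is not the procedure of the original database, only an illustration under stated assumptions: the catalogue entries are invented, the helper names are ours, and a binary search over the sorted names stands in for whatever search the database actually used.

```python
import bisect
import string

LETTERS = string.ascii_uppercase  # A=1, B=2, ..., Z=26

def encode(n: int) -> str:
    digits = str(n)
    if n < 0 or len(digits) > 26:
        raise ValueError("only non-negative integers of at most 26 digits")
    return LETTERS[len(digits) - 1] + digits

def sequence_name(length: int, order: int) -> str:
    """Name = coded length in base pairs + coded order of appearance."""
    return encode(length) + encode(order)

def same_length(sorted_names, length: int) -> list:
    """Return the names whose first number is `length`, using the sorted order."""
    prefix = encode(length)
    start = bisect.bisect_left(sorted_names, prefix)
    found = []
    for name in sorted_names[start:]:
        if not name.startswith(prefix):
            break  # names sharing a prefix are contiguous once sorted
        found.append(name)
    return found

# Hypothetical catalogue of (length, order of appearance) pairs.
entries = [(451, 48), (451, 2), (1200, 1), (87, 3), (452, 1)]
catalogue = sorted(sequence_name(length, order) for length, order in entries)
print(catalogue)
print(same_length(catalogue, 451))
```

Because the letter fixes the number of digits of the length, a prefix such as C451 can only match elements of exactly 451 base pairs.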


References

Ifrah G, 1981-1994. L'alphabet et la numération. pp 511-545. In Histoire universelle des chiffres. Volume 1. Bouquins. Robert Laffont. 1042 pp.
Ifrah G, 1998. The universal history of numbers. From prehistory to the invention of the computer. Wiley.

Reinach S, 1885. Traité d'épigraphie grecque. E Leroux, Paris. Facsimile reprint, 2000.

Spataru A, 1987. Fondements de la théorie de l'information. Presses polytechniques Romandes. 644 pp.

Taft T, Duff RA, 1997. Ada 95 Reference Manual. Language and standard libraries: International standard ISO/IEC 8652:1995(E). Lecture Notes in Computer Science no. 1246. Springer Verlag. 526 pp.

Year 2000 bug
http://fr.wikipedia.org/wiki/Passage_informatique_%C3%A0_l%27an_2000
http://en.wikipedia.org/wiki/Year_2000_problem

Numeration
http://fr.wikipedia.org/wiki/Num%C3%A9ration
http://fr.wikipedia.org/wiki/Syst%C3%A8me_de_num%C3%A9ration
http://fr.wikipedia.org/wiki/Les_Shadoks#Arithm.C3.A9tique_-_compter_en_Shadok
http://en.wikipedia.org/wiki/Numeral_system


This page 115 is the English version of page C104.

