A fast (written in assembly) library for continued fractions

Fri Nov 3 23:17:29 CET 2006

I wrote for my own purposes the following X86 assembly code, intended
to be called from GFORTRAN. Such a thing should only work if you have
a similar configuration to mine, but anyway, it may interest you.
I wanted a fast library for computing the length in the continued
fraction expansion of sqrt(n) for any integer n that can fit in
a REAL*8 gfortran type for statistical purposes. I also included
a faster function that tells if this length is odd or even. Though
I am not an assembly expert, there should be a significant speed
improvement if you use this code rather than your own fortran
functions, since the whole process is performed on the FPU stack
with no load/store useless travels.

It may help you, it may not. You know about these kinds of software...

Instructions are at the beginning of the file, as comments.
Read them carefully. Code is below.

Regards.

------------------ FILE quad.s BEGINS BELOW --------------------

/*

  Compile your piece of code with something like
    gfortran -fno-underscoring mycode.f90 quad.s
  and add freely your own flags. Enjoy !
  Of course, if you don't want to use the -fno-underscoring
  option for some reason, you can add an underscore at the end
  of the global functions below, like that:
    .globl initfpu_
    initfpu_:
  it should work.

  Remember this piece of code has the following status:
    Seems to work for me, maybe it will also seem for you,
    maybe your computer will explode, maybe it will make the
    earth explode, maybe God will choose not to have created
    the world, think about it before using it.
  (Anyway, if you see some kind of bug, please let me know ;-)
  Remember that
    a) this is x86 assembly code;
    b) it uses the GNU AS (gas) syntax;
    c) it uses the gfortran calling conventions;
  (now you understand why you can't expect it will works ;-)

  Usage. In your program (or subroutine or whatever), declare
  the function you need like:

    logical :: oddperiod
    integer*8 :: period

  (the two subroutines don't need to be declared).

  Then use

    integer*4 :: oldfpustatus
    call initfpu(oldfpustatus)

  before any computation; if you don't do it, it won't work.
  When you don't need any more the library,

    call restorefpu(oldfpustatus)

  will restore the control register of the FPU.

  During all that time, you shouldn't be disturbed too much, 
  for your own mathematical operations, but
  remember the rounding mode of the FPU has been set to
  "round down" (rather than the usual "nearest").

  How does it work? The whole computation is performed on the
  stack of the FPU; only integer values are needed for these
  tasks, but they are stored as REAL*10, 80 bits reals. They
  are not converted to anything during the whole process, thus
  it should work for great integers stored in your own REAL*8,
  64 bits reals. Of course, it may happen (I think it will never,
  but didn't check) that an intermediate result can not fit in
  such a REAL*10 register; in that case, read above what is
  written about the end of the world...

*/

/*
  call initfpu(integer*4)
  =====================
    This subroutine should be called before any other call to the
    library; its purpose is to set the rounding control in the FPU
    control register to the "round down" mode.
    The subroutine puts in the integer variable the initial value
    of the FPU control register.

    If you want to use often this subroutine (which you shouldn't:
    speed being an issue, this subroutine is intended to be used
    once, then the functions billions of times, not the contrary ;-),
    maybe removing the "fnclex" can improve the speed, since I don't
    think it plays a great role here.
*/
.globl initfpu
initfpu:
	pushl	%ebp
	movl	%esp, %ebp
	movl	8(%ebp), %eax
        fnstcw  (%eax)
	fwait
        movw    (%eax), %dx
	andb	$0xf3, %dh
	orb	$0x4, %dh
        movw    %dx, 2(%eax)
	fnclex
	fldcw   2(%eax)
        fwait
        movw    $0, 2(%eax)
	leave
	ret

/*
  call restorefpu(integer*4)
  ========================
    This subroutine should be used for restoring the initial state of
    the FPU control register; the integer variable must contain the
    right value (which should have been previously set by the
    subroutine initfpu).
*/
.globl restorefpu
restorefpu:
	pushl	%ebp
	movl	%esp, %ebp
	movl	8(%ebp), %eax
	fclex
	fldcw   (%eax)
	leave
	ret

/*
  sqrtint( real double precision : REAL*8 )
  =========================================
    This function returns the integer part of the square root
    of the given real. The return value has itself the real type.
*/
.global sqrtint
sqrtint:
	pushl	%ebp
	movl	%esp, %ebp
	movl	8(%ebp), %eax
	fldl	(%eax)
	fsqrt
        frndint
	leave
	ret

/*
  oddperiod( real double precision : REAL*8 )
  ===========================================
    Despite the type of the real argument, the variable has to
    be some integer value stored as a double precision.
    The function has the logical type. It returns FALSE if
    the argument is a perfect square or if the continued
    fraction expansion of its square root has an even period
    length; it returns TRUE in the remaining case.
      oddperiod(dble(25)) --> FALSE
      oddperiod(dble(13)) --> TRUE
      oddperiod(dble(12)) --> FALSE
    If it may help, you can declare the function as an integer*4
    type rather than a logical one (output will be 0 or 1).

    Though it seems to work with integer*8, I am not absolutely
    sure it *always* will (unless you add the following instruction
      movl $0, %edx
    at the beginning of the code (for instance between the current
    second and third lines of code), but really why would you
    do it? Maybe something in the calling conventions is such that
    the register EDX already contains 0, but, again, who cares?

    If you read the instructions for the period() function, you
    will learn about some 2^32 limitations, but you don't have to
    care about that here: the length is not kept in any register
    during the computation and the code should work as long as the
    values in the FPU stack are the right one.

*/
.global oddperiod
oddperiod:
	pushl	%ebp
	movl	%esp, %ebp
	movl	8(%ebp), %eax
        fldln2
	fldl	(%eax)
        fld     %st
        fsqrt
        frndint
        fld     %st
        fld     %st
        fmul    %st, %st
        fld     %st(3)
        fsubp   %st, %st(1)
        fcom    %st(4)
        fstsw   %ax
        fwait
        sahf
        ja      oddperiod0
        ffree   %st
        ffree   %st(1)
        ffree   %st(2)
        ffree   %st(3)
        ffree   %st(4)
oddperiod2:
        movl    $0, %eax
        leave
        ret
oddperiod0:
        fld     %st(1)
        fadd    %st(3), %st
        fdiv    %st(1), %st
        frndint
        fld     %st(2)
        fld     %st(1)
        fmul    %st(3), %st
        fsubp   %st, %st(1)
        fxch    %st(3)
        fsub    %st(3), %st
        fabs
        fcomp   %st(6)
        fstsw   %ax
        fwait
        sahf
        ja      oddperiod1
        fld     %st(3)
        fadd    %st, %st
        fsubp   %st, %st(1)
        fcomp   %st(5)
        fstsw   %ax
        fwait
        ffree   %st
        ffree   %st(1)
        ffree   %st(2)
        ffree   %st(3)
        ffree   %st(4)
        sahf
        ja      oddperiod2
        movl    $1, %eax
        leave
        ret
oddperiod1:
        ffree   %st
        fincstp
        fld     %st(1)
        fmul    %st, %st
        fld     %st(4)
        fsubp   %st, %st(1)
        fdivp   %st, %st(1)
        frndint
        jmp oddperiod0

/*
  period( real double precision : REAL*8 )
  ========================================
    Despite the type of the real argument, the variable has to
    be some integer value stored as a double precision.
    The function has integer*8 type, though the output
    can't be greater than 2^32-1 (this choice has been made
    for avoiding sign problem when the result is greater
    (or equal) then 2^31 and smaller then 2^32.
    The result is the length of period in the continued fraction
    expansion of the square root of the given integer.

    If speed is really for you an obsession, remember that it
    is highly unprobable that you use that library for periods
    between the two fatidic values. Maybe it will increase a little
    the speed to use integer*4 which allows to delete twice the line
        movl    $0,%edx
    (not the first time it appears, but the second and third one).
    If you try such a thing and see some speed improvement, please
    let me know.
*/
.global period
period:
	pushl	%ebp
	movl	%esp, %ebp
	movl	8(%ebp), %eax
        movl    $0, %edx
        fldln2
	fldl	(%eax)
        fld     %st
        fsqrt
        frndint
        fld     %st
        fld     %st
        fmul    %st, %st
        fld     %st(3)
        fsubp   %st, %st(1)
        fcom    %st(4)
        fstsw   %ax
        fwait
        sahf
        ja      period0
        ffree   %st
        ffree   %st(1)
        ffree   %st(2)
        ffree   %st(3)
        ffree   %st(4)
        movl    %edx, %eax
        movl     $0, %edx
        leave
        ret
period0:
        inc     %edx
        fld     %st(1)
        fadd    %st(3), %st
        fdiv    %st(1), %st
        frndint
        fld     %st(2)
        fld     %st(1)
        fmul    %st(3), %st
        fsubp   %st, %st(1)
        fxch    %st(3)
        fsub    %st(3), %st
        fabs
        fcomp   %st(6)
        fstsw   %ax
        fwait
        sahf
        ja      period1
        fld     %st(3)
        fadd    %st, %st
        fsubp   %st, %st(1)
        fcomp   %st(5)
        fstsw   %ax
        fwait
        ffree   %st
        ffree   %st(1)
        ffree   %st(2)
        ffree   %st(3)
        ffree   %st(4)
        sahf
        jb      period2
        add     %edx, %edx
period2:
        movl    %edx, %eax
        movl    $0,%edx
        leave
        ret
period1:
        ffree   %st
        fincstp
        fld     %st(1)
        fmul    %st, %st
        fld     %st(4)
        fsubp   %st, %st(1)
        fdivp   %st, %st(1)
        frndint
        jmp period0