1. 25 Nov, 2012 1 commit
  2. 21 Nov, 2012 2 commits
  3. 20 Nov, 2012 2 commits
  4. 29 Oct, 2012 2 commits
  5. 28 Oct, 2012 2 commits
    • make the index-computation logic less paranoid · 905ded71
      athena authored
      The problem is that for each K and for each expression of the form P[I
      + STRIDE * K] in a loop, most compilers will try to lift an induction
      variable PK := &P[I + STRIDE * K].  In large codelets we have many
      such values of K.  For example, a codelet of size 32 with 4 input
      pointers will generate O(128) induction variables, which will likely
      overflow the register set, which is likely worse than doing the index
      computation in the first place.
      
      In the past we (wisely and correctly) assumed that compilers will do
      the wrong thing, and consequently we disabled the induction-variable
      "optimization" altogether by setting STRIDE ^= ZERO, where ZERO is a
      value guaranteed to be 0.  Since the compiler does not know that
      ZERO=0, it cannot perform its "optimization" and it is forced to
      behave sensibly.
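A minimal sketch of the trick described above, assuming the XOR is applied once per loop iteration. The names `opaque_zero` and `strided_sum` are hypothetical and are not taken from the FFTW sources; `volatile` is used here only as one simple way to keep the compiler from proving that the value is 0.

```c
#include <stddef.h>

/* Hypothetical "value guaranteed to be 0".  Because it is volatile, the
   compiler must reload it and cannot assume that it is still 0. */
static volatile ptrdiff_t opaque_zero = 0;

/* Toy "codelet": for each i in [0, n), add up p[i + stride * k], k = 0..3. */
double strided_sum(const double *p, ptrdiff_t n, ptrdiff_t stride)
{
    double acc = 0.0;
    ptrdiff_t i;

    /* stride ^= 0 leaves stride unchanged, but the compiler has to assume
       that stride may differ on every iteration.  The addresses
       &p[i + stride * k] are then no longer simple induction variables,
       so the index arithmetic is redone in each iteration. */
    for (i = 0; i < n; ++i, stride ^= opaque_zero) {
        acc += p[i]
             + p[i + stride * 1]
             + p[i + stride * 2]
             + p[i + stride * 3];
    }
    return acc;
}
```

With a 16-element array holding 0..15, `strided_sum(a, 4, 4)` visits every element exactly once.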
      
      With this patch, FFTW is a little bit less paranoid.  FFTW now
      disables the induction-variable "optimization" only when we estimate
      that the codelet uses more than ESTIMATED_AVAILABLE_INDEX_REGISTERS
      induction variables.
      
      Currently we set ESTIMATED_AVAILABLE_INDEX_REGISTERS=16.  16 registers ought
      to be enough for anybody (or so the amd64 and ARM ISAs seem to imply).
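The decision described in this message can be sketched as a simple threshold test. The function names are hypothetical, and the estimate "codelet size times number of pointers" is an assumption read off the 32 x 4 = 128 example above, not necessarily the formula used in the patch.

```c
/* Threshold quoted in the commit message. */
#define ESTIMATED_AVAILABLE_INDEX_REGISTERS 16

/* Assumed estimate: one induction variable per (pointer, K) pair. */
int estimated_induction_variables(int codelet_size, int n_pointers)
{
    return codelet_size * n_pointers;
}

/* Nonzero when the stride should be hidden from the compiler, that is,
   when the codelet is estimated to use more induction variables than
   there are index registers. */
int should_hide_stride(int codelet_size, int n_pointers)
{
    return estimated_induction_variables(codelet_size, n_pointers)
           > ESTIMATED_AVAILABLE_INDEX_REGISTERS;
}
```

Under this assumption a size-32 codelet with 4 pointers still gets the XOR treatment, while a size-4 codelet with 2 pointers is left to the compiler.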
    • silence warnings · 1dacef5b
      athena authored
  6. 27 Oct, 2012 3 commits
    • bump version to 3.3.3 · fb08724b
      athena authored
    • evaluate plans for >1ms when using gettimeofday() · c4d6abbc
      athena authored
      The previous limit of 10ms was too paranoid, and it made life difficult
      on machines without an "official" cycle counter, such as ARM.
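One common way to apply such a minimum measurement time is to repeat the work until a batch lasts long enough for the clock resolution not to matter. The sketch below assumes a repetition-doubling scheme; `time_per_call`, `elapsed_seconds`, and `count_call` are hypothetical names, and only `gettimeofday()` itself is a real (POSIX) call.

```c
#include <limits.h>
#include <stddef.h>
#include <sys/time.h>

#define MIN_MEASUREMENT_SECONDS 1.0e-3   /* 1 ms, per the commit title */

/* Difference t1 - t0 in seconds. */
static double elapsed_seconds(struct timeval t0, struct timeval t1)
{
    return (double)(t1.tv_sec - t0.tv_sec)
         + 1.0e-6 * (double)(t1.tv_usec - t0.tv_usec);
}

/* Example workload: counts how many times it was called. */
static void count_call(void *arg)
{
    ++*(unsigned long *)arg;
}

/* Double the repetition count until one batch takes at least
   MIN_MEASUREMENT_SECONDS, then return the estimated seconds per call. */
double time_per_call(void (*work)(void *), void *arg)
{
    unsigned long reps;

    for (reps = 1; ; reps *= 2) {
        struct timeval t0, t1;
        unsigned long i;
        double dt;

        gettimeofday(&t0, NULL);
        for (i = 0; i < reps; ++i)
            work(arg);
        gettimeofday(&t1, NULL);

        dt = elapsed_seconds(t0, t1);
        /* The second condition only guards against overflow of reps. */
        if (dt >= MIN_MEASUREMENT_SECONDS || reps > ULONG_MAX / 2)
            return dt / (double)reps;
    }
}
```

A lower threshold means fewer repetitions per measurement, which matters most on slow machines where the microsecond-resolution clock is the only timer available.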
    • use 4-way NEON SIMD instead of 2-way · 172dd3de
      athena authored
      Kai-Uwe Bloem tried to warn me a year ago that 128-bit NEON was better
      than 64-bit NEON even on machines with a 64-bit pipe, but I foolishly
      did not listen.  Now that 128-bit NEON pipes are starting to appear on
      the market it is definitely time to switch.
  7. 26 Sep, 2012 1 commit
  8. 18 Jul, 2012 1 commit
  9. 29 Jun, 2012 1 commit
  10. 28 Apr, 2012 2 commits
  11. 26 Apr, 2012 2 commits
    • change revision to 3.3.2 · cb553a83
      athena authored
    • Remove old aligned_main() hack. · 98229b0d
      athena authored
      On i386, in our benchmark program we used to manually align the
      stack to a 16-byte boundary via asm trickery.  This was a good idea in
      1999 (and it was actually necessary to make things work) but the hack
      is now obsolete and it seems to break gcc-4.7.  So the hack is now
      gone.
  12. 29 Mar, 2012 1 commit
  13. 20 Mar, 2012 1 commit
  14. 06 Mar, 2012 2 commits
  15. 02 Mar, 2012 3 commits
  16. 25 Feb, 2012 1 commit
  17. 21 Feb, 2012 2 commits
  18. 20 Feb, 2012 4 commits
  19. 09 Nov, 2011 2 commits
  20. 25 Sep, 2011 1 commit
  21. 18 Sep, 2011 1 commit
  22. 13 Sep, 2011 1 commit
  23. 03 Sep, 2011 2 commits
    • shorten ESTIMATE planning time for certain weird sizes · f004d764
      athena authored
      FFTW includes a collection of "solvers" that apply to a subset of
      "problems".  Assume for simplicity that a "problem" is a single 1D
      complex transform of size N, even though real "problems" are much more
      general than that.  FFTW includes three "prime" solvers called
      "generic", "bluestein", and "rader", which implement different
      algorithms for prime sizes.
      
      Now, for a "problem" of size 13 (say) FFTW also includes special code
      that handles that size at high speed.  It would be a waste of time to
      measure the execution time of the prime solvers, since we know that
      the special code is way faster.  However, FFTW is modular and one may
      or may not include the special code for size 13, in which case we must
      resort to one of the "prime" solvers.  To address this issue, the
      "prime" solvers (and others) are proclaimed to be "SLOW".  When
      planning, FFTW first tries to produce a plan ignoring all the SLOW
      solvers, and if this fails FFTW tries again allowing SLOW solvers.
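The two-pass strategy in the paragraph above can be sketched as follows. This is a deliberately simplified model, not FFTW's planner: the `solver` struct and all function names are hypothetical, and the sketch returns the first applicable solver instead of measuring or estimating costs.

```c
#include <stddef.h>

/* Hypothetical, highly simplified solver record. */
typedef struct {
    const char *name;
    int slow;                    /* nonzero: solver is marked SLOW */
    int (*applicable)(int n);    /* can this solver handle size n? */
} solver;

/* Toy applicability tests. */
static int only_13(int n)  { return n == 13; }
static int any_size(int n) { return n > 1; }

/* Return the first applicable solver, skipping SLOW ones unless allowed. */
static const solver *search(const solver *s, size_t count, int n,
                            int allow_slow)
{
    size_t i;
    for (i = 0; i < count; ++i) {
        if (s[i].slow && !allow_slow)
            continue;
        if (s[i].applicable(n))
            return &s[i];
    }
    return NULL;
}

/* Pass 1 ignores SLOW solvers; pass 2 runs only if pass 1 found nothing. */
const solver *plan(const solver *s, size_t count, int n)
{
    const solver *found = search(s, count, n, 0);
    if (found == NULL)
        found = search(s, count, n, 1);
    return found;
}
```

With both a size-13 solver and a SLOW generic solver registered, size 13 is planned in the first pass; if the size-13 solver is left out, the first pass fails and the generic solver is picked in the second.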
      
      This heuristic works ok unless the sizes are too large.  For example
      for 1044000=2*2*2*2*2*3*3*5*5*5*29 FFTW explores a huge search tree of
      all zillion factorizations of 1044000/29, failing every time because
      29 is SLOW; then it finally allows SLOW solvers and finds a solution
      immediately.
      
      This patch proclaims solvers to be SLOW only for small values of N.
      For example, the "generic" solver implements an O(n^2) DFT algorithm;
      we say that it is SLOW only for N<=16.
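As a worked check of the example size, the sketch below factors 1044000 and applies the N<=16 rule quoted above. Both function names are hypothetical, and the predicate models only the "generic" solver threshold stated in this message.

```c
/* Threshold quoted in the commit message for the "generic" solver. */
#define GENERIC_SLOW_MAX_N 16

/* After the patch, "generic" counts as SLOW only for small sizes. */
int generic_is_slow(long n)
{
    return n <= GENERIC_SLOW_MAX_N;
}

/* Largest prime factor of n (n >= 2), by trial division. */
long largest_prime_factor(long n)
{
    long p, largest = 1;

    for (p = 2; p * p <= n; ++p) {
        while (n % p == 0) {
            largest = p;
            n /= p;
        }
    }
    if (n > 1)
        largest = n;   /* whatever remains is a prime factor */
    return largest;
}
```

The largest prime factor of 1044000 is 29, which is above the threshold, so under this rule the generic solver is no longer SLOW for the factor of 29 and the first planning pass can already succeed.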
      
      The side effects of this choice are as follows.  If one modifies FFTW to
      include a fast solver of size 17, then planning for N=17*K will be
      slower than today, because FFTW will try both the fast solver and the
      generic solver (which is SLOW today and therefore not tried, but is no
      longer SLOW after the patch).  If one removes a fast solver, of size say
      13, then one may still fall into the current exponential-search behavior
      for "problems" of size 13*HIGHLY_FACTORIZABLE_N.
      
      If somebody had complained about transforms of size 1044000 ten years
      ago, "don't do that" would have been an acceptable answer.  I guess the
      bar is higher today, so I am going to include this patch in our 3.3.1
      release despite its side effects for people who want to modify FFTW.