Data reordering, bit reversal, and in-place algorithms
Although the abstract Cooley-Tukey factorization of the DFT, above, applies in some form to all implementations of the algorithm, much greater diversity exists in the techniques for ordering and accessing the data at each stage of the FFT. Of special interest is the problem of devising an in-place algorithm that overwrites its input with its output data using only O(1) auxiliary storage.
The most well-known reordering technique involves explicit bit reversal for in-place radix-2 algorithms. Bit reversal is the permutation where the data at an index k, written in binary with digits b4b3b2b1b0 (e.g. 5 digits for n=32 inputs), is transferred to the index with reversed digits b0b1b2b3b4 . Consider the last stage of a radix-2 DIT algorithm like the one presented above, where the output is written in-place over the input: when fj' and fj'' are combined with a size-2 DFT, those two values are overwritten by the outputs. However, the two output values should go in the first and second halves of the output array, corresponding to the most significant bit b4 (for n=32); whereas the two inputs fj' and fj'' are interleaved in the even and odd elements, corresponding to the least significant bit b0. Thus, in order to get the output in the correct place, these two bits must be swapped in the input. If you include all of the recursive stages of a radix-2 DIT algorithm, all the bits must be swapped and thus one must pre-process the input with a bit reversal to get in-order output. Correspondingly, the reversed (dual) algorithm is radix-2 DIF, and this takes in-order input and produces bit-reversed output, requiring a bit-reversal post-processing step. Alternatively, some applications (such as convolution) work equally well on bit-reversed data, so one can do radix-2 DIF without bit reversal, followed by processing, followed by the radix-2 DIT inverse DFT without bit reversal to produce final results in the natural order.
Many FFT users, however, prefer natural-order outputs, and a separate, explicit bit-reversal stage can have a non-negligible impact on the computation time, even though bit reversal can be done in O(n) time and has been the subject of much research (e.g. Karp, 1996; Carter, 1998; and Rubio, 2002). Also, while the permutation is a bit reversal in the radix-2 case, it is more generally an arbitrary (mixed-base) digit reversal for the mixed-radix case, and the permutation algorithms become more complicated to implement. Moreover, it is desirable on many hardware architectures to re-order intermediate stages of the FFT algorithm so that they operate on consecutive (or at least more localized) data elements. To these ends, a number of alternative implementation schemes have been devised for the Cooley-Tukey algorithm that do not require separate bit reversal and/or involve additional permutations at intermediate stages.
The problem is greatly simplified if it is out-of-place: the output array is distinct from the input array or, equivalently, an equal-size auxiliary array is available. The Stockham auto-sort algorithm (Stockham, 1966) performs every stage of the FFT out-of-place, typically writing back and forth between two arrays, transposing one "digit" of the indices with each stage, and has been especially popular on SIMD architectures (Swarztrauber, 1982). Even greater potential SIMD advantages (more consecutive accesses) have been proposed for the Pease (1968) algorithm, which also reorders out-of-place with each stage, but this method requires separate bit/digit reversal and O(n log n) storage. One can also directly apply the Cooley-Tukey factorization definition with explicit (depth-first) recursion and small radices, which produces natural-order out-of-place output with no separate permutation step and can be argued to have cache-oblivious locality benefits on systems with hierarchical memory (Singleton, 1967; see also FFTW).
A typical strategy for in-place algorithms without auxiliary storage and without separate digit-reversal passes involves small matrix transpositions (which swap individual pairs of digits) at intermediate stages, which can be combined with the radix butterflies to reduce the number of passes over the data (Johnson & Burrus, 1984; Temperton, 1991; Qian et al., 1994; Hegland, 1994).
- James W. Cooley and John W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Math. Comput. 19, 297–301 (1965).
- Carl Friedrich Gauss, "Nachlass: Theoria interpolationis methodo nova tractata," Werke band 3, 265–327 (Königliche Gesellschaft der Wissenschaften, Göttingen, 1866). See also M. T. Heideman, D. H. Johnson, and C. S. Burrus, "Gauss and the history of the fast Fourier transform," IEEE ASSP Magazine 1 (4), 14–21 (1984).
- G. C. Danielson and C. Lanczos, "Some improvements in practical Fourier analysis and their application to X-ray scattering from liquids," J. Franklin Inst. 233, 365–380 and 435–452 (1942).
- W. M. Gentleman and G. Sande, "Fast Fourier transforms—for fun and profit," Proc. AFIPS 29, 563–578 (1966).
- P. Duhamel and M. Vetterli, "Fast Fourier transforms: a tutorial review and a state of the art," Signal Processing 19, 259–299 (1990).
- David H. Bailey, "FFTs in external or hierarchical memory," J. Supercomputing 4 (1), 23–35 (1990).
- Alan H. Karp, "Bit reversal on uniprocessors," SIAM Review 38 (1), 1–26 (1996).
- Larry Carter and Kang Su Gatlin, "Towards an optimal bit-reversal permutation program," Proc. 39th Ann. Symp. on Found. of Comp. Sci. (FOCS), 544–553 (1998).
- M. Rubio, P. Gómez, and K. Drouiche, "A new superfast bit reversal algorithm," Intl. J. Adaptive Control and Signal Processing 16, 703–707 (2002).
- T. G. Stockham, "High speed convolution and correlation", Spring Joint Computer Conference, Proc. AFIPS 28, 229–233 (1966).
- P. N. Swarztrauber, "Vectorizing the FFTs", in G. Rodrigue (Ed.), Parallel Computations (Academic Press, New York, 1982), pp. 51–83.
- M. C. Pease, "An adaptation of the fast Fourier transform for parallel processing", J. ACM 15 (2), 252–264 (1968).
- Richard C. Singleton, "On computing the fast Fourier transform", Commun. of the ACM 10 (1967), 647–654.
- H. W. Johnson and C. S. Burrus, "An in-place in-order radix-2 FFT," Proc. ICASSP, 28A.2.1–28A.2.4 (1984).
- C. Temperton, "Self-sorting in-place fast Fourier transform," SIAM J. Sci. Stat. Comput. 12 (4), 808–823 (1991).
- Z. Qian, C. Lu, M. An, and R. Tolimieri, "Self-sorting in-place FFT algorithm with minimum working space," IEEE Trans. ASSP 52 (10), 2835–2836 (1994).
- M. Hegland, "A self-sorting in-place fast Fourier transform algorithm suitable for vector and parallel processing," Numerische Mathematik 68 (4), 507–547 (1994).
- Matteo Frigo and Steven G. Johnson: FFTW, http://www.fftw.org/. A free (GPL) C library for computing discrete Fourier transforms in one or more dimensions, of arbitrary size, using the Cooley-Tukey algorithm. Also M. Frigo and S. G. Johnson, "FFTW: An adaptive software architecture for the FFT," Proc. ICASSP 3, 1381–1384 (1998).
Source | Copyright