Astronomical Data Formats: What we have and how we got here

by Jessica D. Mink
Smithsonian Astrophysical Observatory, 60 Garden St., Cambridge, MA 02138

Abstract: Despite almost all being acquired as photons, astronomical data from different instruments and at different stages in its life may exist in different formats to serve different purposes. Beyond the data itself, descriptive information is associated with it as metadata, either included in the data format or in a larger multi-format data structure. Those formats may be used for the acquisition, processing, exchange, and archiving of data. It has been useful to use similar formats, or even a single standard, to ease interaction with data in its various stages using familiar tools. Knowledge of the evolution and advantages of present standards is useful before we discuss the future of how astronomical data is formatted. The evolution of the use of world coordinates in FITS is presented as an example.

1. Where We Are

The astronomical community's most widespread data format, FITS [1], is 35 years old, and interest is spreading [2], [3] in developing new and improved standards for formatting the larger and more varied types of astronomical data being produced by more and more complicated instruments on larger and larger telescopes. Several existing options are being proposed: HDF5, a Hierarchical Data System [4], and JPEG2000, a widely-used image format [5], among others. The problems we face are not all new, and I would like to cover some history about how we got where we are and how our present solutions developed.

2. Formatting Data Through Its Life-cycle

2.1. Genesis: Origins of Astronomical Data

In the beginning, there is light. Most astronomical observations are of photons.
In addition to a count or other measure of their intensity, we record information including direction of the source, wavelength, polarization, or frequency or energy of individual photons or some grouping thereof, and the time(s) at which they were collected. Associated metadata describing the conditions under which the data was created may be included with the data as a header, trailer, or internal labels, or may reside in a separate format, such as a logbook, digital table, or label on the data container.

At its simplest, a data format includes data structured in some way to make it retrievable. It may be a qualitative drawing in a logbook, such as Galileo's drawing of Jupiter and its largest satellites, with descriptive information about the data written right next to it. It may be a table of numbers in a published paper or monograph, with the text of the paper providing the contextual metadata and the headings on the table labeling the actual numbers.

In the nineteenth century, it became possible to record a signal from photons from the sky more directly on glass photographic plates, such as those in the Harvard Plate Collection [6]. The collection is made up of photographic plates containing images of the sky, with metadata as notes in logbooks and on the paper jackets in which the plates are stored. Metadata for each plate includes the pointing direction, the time and exposure of the observation, the name of the object being observed, and who observed it. The logbook and jacket indicate what telescope was used and where it was located (see Figure 1). Sky coverage comes from the telescope focal plane plate scale and the physical size of the plate. Not all useful parameters were (or needed to be) written out because the humans using the data knew them, so additional work has been needed to make the plate images scientifically usable [7].

Digital data acquisition formats are usually limited to one instrument or a class of instruments.
Metadata is usually recorded digitally, either in the same files or in files associated in some way within a data structure, so that processing software can learn as much as possible from its digital input. For spacecraft observations, this would be digital telemetry; for optical telescopes, this is likely to be some sort of image files. Both might have associated input files, such as pointing catalogs or fiber positions.

2.2. Transfer and Exchange

As we process that data, we end up with derived data which can take many different forms. To exchange that data, process it with standard software, and archive it so scientists can read it in the near or far future, the metadata has to be standardized and well-documented. It is helpful if that metadata travels easily with the actual bits of data.

2.3. Processing and Analysis

But the data we have acquired is not ready to be used for science. As David Hogg noted [8], "The data is like a noisy hash of the things we care about." It has to be processed from raw data, including observations and calibrations, to something that can be analyzed. Before we can look at the data, we want it formatted in a specific way which makes it accessible to the tools we wish to use. In the course of processing, metadata as well as data is changed, with information about the processing or results of the analysis being added. It is useful if the metadata both utilizes standard definitions and is clearly associated with the data as part of the same file or data structure.

2.4. Archiving Data

Final disposition of data can take several paths. It can be destroyed because it is seen to have no value or there is no space to keep it. It can be stored on media which become unreadable, such as obsolete or slowly decaying tape or disk formats. It can be presented in a publication, copies of which are preserved in multiple places, though only that part of the data content relevant to the publication will be preserved.
And finally, data may be preserved in a format which is persistent in both content and readability.

Most ground-based and much space-based data from the 1970s through the 1990s was recorded on magnetic tapes or disks which are now next to impossible to read. More recent data on CD-ROMs and DVDs will be lost as the media degrade and readers become less common. While tapes I made during the 1970s and 1980s are no longer readable, Hollerith cards of photometry and software from my senior year in college are still human- and scanner-readable (see Figure 2).

If they can, scientists share their data, analysis of it, and conclusions from it in presentations and papers. In the age of hardcopy journals and books, these methods were more persistent than any other method of preservation, but 21st century astronomers tend to find articles through ADS [9] or the arXiv server [10] and read them online instead of in hard copy. As journals move toward online publication, standardized, retrievable formatting for the long term is becoming an issue here, too.

We now have the storage capacity to save most of the data that we take, as well as its derivatives. So far the most permanent format we are using is printed paper, with tables and graphs containing data. Photographic plates, which degrade a bit faster over time, with metadata in separate paper logs, have lasted over a century. But now most of our data and more and more of the papers and presentations which describe its meaning are digital. We need both persistent media and persistent formats to keep today's data accessible over time.

3. Standardizing Data Formats

As the inventors of FITS noted in their first paper [11]: "Under the traditional system for data interchange in astronomy, each institution exports data on magnetic tape in its own unique internal format. Thus, a group of N "cooperating" institutions would begin by creating N(N-1) format translation programs.
Then, whenever one of the institutions changes its internal format, the other N-1 institutions have to change their corresponding translation programs. For obvious reasons, this traditional system has been very inefficient. It would be very nice if the astronomical community could agree, instead, on a unique data interchange format."

Such an interchange format may take the form of a structure, such as a single file with several FITS extensions [12], or a file with metadata linking to data, such as VO returned data packages [13] containing metadata separate from the actual data.

FITS (Flexible Image Transport System) [1] was originally designed in 1979 as an exchange format, first used by radio astronomers in the AIPS software system [14]. It enabled astronomers to share data without having to maintain separate translation programs. At roughly the same time, the more flexible and more complicated N-Dimensional Data Format (NDF) was being developed in the U.K. Tim Jenness has explained the evolution of that system [15].

The original simple FITS consisted of a human-readable ASCII header of 80-character lines (matching the width of the Hollerith cards then used to store and use computer software and data) and blocks of binary data described by the header and system commands. Each of these units contained an integral number of 2880-byte blocks, padded with spaces at the end of each unit. A basic set of standard metadata keywords was included with the original FITS definition.

The use of FITS expanded beyond exchange and archiving to recording and processing as computers got fast enough that the time it took to read and write ASCII header information and convert pixel information into internally-usable bits became increasingly negligible.
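The header layout described above, 80-character ASCII lines grouped into 2880-byte blocks, can be illustrated with a short Python sketch. It is not taken from any FITS library: the function names make_card and build_header are hypothetical, and the column conventions used (keyword in columns 1-8, "= " in columns 9-10, simple values right-justified to column 30, a final END line) are a simplified reading of the fixed-format rules that handles only logical, numeric, and short string values.

```python
CARD_LEN = 80      # one header line, the width of a Hollerith card
BLOCK_LEN = 2880   # header and data units are whole multiples of this


def make_card(keyword, value, comment=""):
    """Format one simplified 'keyword = value / comment' header line."""
    if isinstance(value, bool):            # test bool first: bool is an int subclass
        field = ("T" if value else "F").rjust(20)
    elif isinstance(value, (int, float)):
        field = repr(value).upper().rjust(20)
    else:                                  # strings are quoted and left-justified
        field = ("'" + str(value).ljust(8) + "'").ljust(20)
    card = f"{keyword.upper():<8}= {field}"
    if comment:
        card += " / " + comment
    if len(keyword) > 8 or len(card) > CARD_LEN:
        raise ValueError("keyword or card too long for this sketch")
    return card.ljust(CARD_LEN)


def build_header(entries):
    """Join cards, append END, and pad with spaces to a 2880-byte boundary."""
    cards = [make_card(*entry) for entry in entries]
    cards.append("END".ljust(CARD_LEN))
    text = "".join(cards)
    padding = -len(text) % BLOCK_LEN       # spaces needed to fill the last block
    return (text + " " * padding).encode("ascii")


header = build_header([
    ("SIMPLE", True, "file conforms to the standard"),
    ("BITPIX", 16, "bits per data value"),
    ("NAXIS", 0, "no data array in this example"),
])
print(len(header))                # total header length in bytes
print(header[:80].decode())       # the first 80-character line
```

Because every line has a fixed width and every unit a fixed block size, a reader can step through a header with simple slicing, which is part of why the format was easy to implement on many systems. Real software should use an established library such as CFITSIO rather than a sketch like this.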
The expanding use of FITS was aided by the ability to use FITS reading and writing libraries such as FITSIO [16] and CFITSIO [17] to deal with input and output from local software which worked on the bits, or packages of tools such as AIPS [14], IRAF [18] (which started with a proprietary format and added FITS [19]), and WCSTools [20], which perform sophisticated operations directly on FITS files. If we wish to display a FITS file, we can use a variety of tools: DS9 [21] for images, TOPCAT [22] for tables, FV [23] for either images or tables, and something like the WCSTools [20] program IMHEAD to check out the metadata.

4. Standardizing Metadata: World Coordinate Systems

The availability of the FITS data format standard, with its standard header format for metadata and system of registered conventions, enabled astronomers to concentrate on science, using its simple keyword = value metadata format to transmit parameters of their data and models to users. One movement to standardize metadata has been the inclusion of a set of parameters linking pixels in an image to pointing directions in the sky or, in the case of spectra, specific wavelengths or energies. As more precise relationships between image and spatial direction became necessary for specific projects, a variety of world coordinate system (WCS) solutions were developed. Because there was a standard way of defining parameters within a FITS header, it was straightforward to implement these projections in other software.

The original FITS format [1] included only a linear projection defined by the keywords CRPIXn, the coordinate system reference pixel for axis n, CRVALn, the coordinate system value for axis n at that reference pixel, CDELTn for the coordinate increment along axis n, CROTAn for the rotation angle of coordinate system axis n (usually only defined for one axis and assumed to be the same for both), and CTYPEn for the name of the coordinate axis n.
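To make the roles of these keywords concrete, the following Python sketch maps a pixel position to world coordinates for a purely linear two-axis system. The function name pixel_to_world_linear is hypothetical, and the way the single rotation angle is applied (a rotation of the scaled pixel offsets, with the signs shown in the comments) is one common reading of CROTA and should be treated as an assumption; sign conventions have differed between packages.

```python
import math


def pixel_to_world_linear(px, py, crpix, crval, cdelt, crota_deg=0.0):
    """Linear pixel-to-world mapping from CRPIXn, CRVALn, CDELTn, and CROTA.

    crpix, crval, and cdelt are (axis 1, axis 2) pairs; crota_deg is the
    single rotation angle in degrees, assumed to apply to both axes.
    """
    # Offsets from the reference pixel, in pixels
    dx = px - crpix[0]
    dy = py - crpix[1]

    # Scale by the coordinate increments and rotate (assumed sign convention)
    rot = math.radians(crota_deg)
    x = cdelt[0] * dx * math.cos(rot) - cdelt[1] * dy * math.sin(rot)
    y = cdelt[0] * dx * math.sin(rot) + cdelt[1] * dy * math.cos(rot)

    # Add the coordinate values at the reference pixel
    return crval[0] + x, crval[1] + y


# Illustrative values only: 0.001 coordinate units per pixel, axis 1 decreasing
crpix = (100.5, 100.5)
crval = (10.0, 20.0)
cdelt = (-0.001, 0.001)
print(pixel_to_world_linear(100.5, 100.5, crpix, crval, cdelt))
print(pixel_to_world_linear(101.5, 100.5, crpix, crval, cdelt))
```

The first call evaluates the mapping at the reference pixel itself, so it returns the CRVAL pair; the second moves one pixel along axis 1 and shifts the first coordinate by one CDELT1 step. On the sky such a linear mapping is only adequate over small fields, which is what motivated the projections discussed next.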
For the 1983 IRAS satellite [24], a set of standard projections (GNOMONIC for sky sections, AITOFF for all-sky images, and SINUSOIDAL for the galactic plane) were used in distributed images, with parameters and comment lines including Fortran code defining the projection in the image headers. Table 1 shows how an AITOFF projection is presented in a FITS header.

Table 1: Aitoff all-sky projection information for converting between image coordinates PIXEL,LINE and sky coordinates XLON,XLAT from the header of an IRAS image.

    COMMENT PROJECTION FORMULAE:
    COMMENT FORWARD FORMULA; XLON0 IS THE CENTER LONGITUDE OF THE
    COMMENT MAP. ARC-SINE AND ARC-COSINE FUNCTIONS ARE REQUIRED.
    COMMENT R2D = 45. / ATAN(1.)
    COMMENT PIX = 2.
    COMMENT RHO = ACOS( COS(XLAT) * COS((XLON-XLON0)/2.) )
    COMMENT THETA = ASIN( COS(XLAT) * SIN((XLON-XLON0)/2.) / SIN(RHO))
    COMMENT F = 2. * PIX * R2D * SIN(RHO/2.)
    COMMENT SAMPLE = -2. * F * SIN(THETA)
    COMMENT XLINE = -F * COS(THETA)
    COMMENT IF(XLAT .LT. 0.) XLINE = -XLINE
    COMMENT
    COMMENT REVERSE FORMULA; XLON0 IS THE CENTER LONGITUDE OF THE MAP.
    COMMENT ARC-SINE AND ARC-COSINE FUNCTIONS NEEDED.
    COMMENT R2D = 45. / ATAN(1.)
    COMMENT PIX = 2.
    COMMENT Y = -XLINE / (PIX * 2. * R2D)
    COMMENT X = -SAMPLE / (PIX * 2. * R2D)
    COMMENT A = SQRT(4.-X*X-4.*Y*Y)
    COMMENT XLAT = R2D * ASIN(A*Y)
    COMMENT XLON = XLON0 + 2. * R2D * ASIN(A*X/(2.*COS(XLAT)))
    COMMENT
    COMMENT REFERENCES:
    COMMENT IRAS SDAS SOFTWARE INTERFACE SPECIFICATION(SIS) 623-94/NO. SF05
    COMMENT ASTRON. ASTROPHYS. SUPPL. SER. 44,(1981) 363-370 (RE:FITS)
    COMMENT RECONCILIATION OF FITS PARMS W/ SIS SF05 PARMS:
    COMMENT NAXIS1 = (ES - SS + 1); NAXIS2 = (EL - SL + 1);
    COMMENT CRPIX1 = (1 - SS); CRPIX2 = (1 - SL)

In the AIPS system, a set of "Non-Linear Coordinate Systems" [25] (TAN (tangent plane), SIN (sinusoidal, typically along the galactic equator), ARC, and NCP (centered on the north celestial pole)) were developed.
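The forward and reverse formulae quoted in the Table 1 header comments can be transcribed almost line for line into Python, which makes it easy to check that they invert each other. The sketch below is such a transcription, not code from IRAS or any FITS package. The function names are hypothetical, angles are assumed to be in degrees on input and output (the header comments leave the units of the trigonometric arguments implicit), PIX is kept as a parameter with the value 2 given in the header, and the guard for the map center and the clamping of rounding errors are additions.

```python
import math

R2D = 45.0 / math.atan(1.0)   # degrees per radian, as defined in the header


def _clamp(v):
    """Keep asin/acos arguments inside [-1, 1] despite rounding error."""
    return max(-1.0, min(1.0, v))


def aitoff_forward(xlon, xlat, xlon0=0.0, pix=2.0):
    """Sky (XLON, XLAT) in degrees to image (SAMPLE, XLINE) offsets."""
    lat = math.radians(xlat)
    half = math.radians(xlon - xlon0) / 2.0
    rho = math.acos(_clamp(math.cos(lat) * math.cos(half)))
    if rho == 0.0:                       # map center: SIN(RHO) would be zero
        return 0.0, 0.0
    theta = math.asin(_clamp(math.cos(lat) * math.sin(half) / math.sin(rho)))
    f = 2.0 * pix * R2D * math.sin(rho / 2.0)
    sample = -2.0 * f * math.sin(theta)
    xline = -f * math.cos(theta)
    if xlat < 0.0:
        xline = -xline
    return sample, xline


def aitoff_reverse(sample, xline, xlon0=0.0, pix=2.0):
    """Image (SAMPLE, XLINE) offsets back to sky (XLON, XLAT) in degrees."""
    y = -xline / (pix * 2.0 * R2D)
    x = -sample / (pix * 2.0 * R2D)
    a = math.sqrt(4.0 - x * x - 4.0 * y * y)
    lat = math.asin(_clamp(a * y))       # latitude in radians
    xlon = xlon0 + 2.0 * R2D * math.asin(_clamp(a * x / (2.0 * math.cos(lat))))
    return xlon, R2D * lat


sample, xline = aitoff_forward(30.0, 45.0)
print(aitoff_reverse(sample, xline))     # should recover roughly (30, 45)
```

The round trip illustrates both the strength and the weakness of this approach: the header is self-describing enough for a reader to reimplement the projection, but every reader has to do so by hand, which is the gap that keyword-based projection codes were later designed to close.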
By 1986 [26], four more projections had been added to AIPS: GLS (Sanson-Flamsteed sinusoidal), MER (Mercator), AIT (Hammer-Aitoff equal area all-sky), and STG (stereographic or zenithal orthomorphic).

In 1993 and 1994, the Space Telescope Science Institute released the "Digitized Sky Survey", which included astrometric solutions [27] in FITS images extracted from compressed scans of Schmidt plates from Palomar and ESO. This was the first attempt to provide an astrometrically accurate WCS for astronomical images, a problem which has become more acute over time.

After much discussion in the community, Mark Calabretta and Eric Greisen [28] presented a more completely worked out set of 25 world coordinate system transformations, along with a code library which implemented them [29]. Figure 3 shows how well such a tangent plane projection can be fit by matching stars in an image to a catalog using WCSTools.

But there were still problems exactly matching telescope images, and several methods of adding information to FITS headers to refine their astrometry were proposed. Calabretta, Valdes, and Greisen started in 2003 [30] with an expansion on the previous FITS WCS papers, but a resulting standard has yet to be published. In the meantime, the Spitzer space observatory needed to release data with improved WCS information and developed the SIP convention [31], while NOAO established the IRAF-understandable polynomial expansions TNX [32] (based on FITS-WCS TAN) and ZPX [33] (based on FITS-WCS ZPN). These registered conventions are documented with the FITS standard at the FITS Support Office at NASA/Goddard Space Flight Center [34]. In France, the Terapix project also needed to manage distortion, and Emmanuel Bertin wrote SCAMP [35] to produce a distortion correction to an image's WCS. WCSTools includes subroutines to decode all of these distortion methods (see Table 2), and the AST library [36], IRAF [18], and Astrometry.net [37] implement some of them.
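Several of the conventions above (TNX, SIP, and the SCAMP corrections) start from the tangent plane projection and add correction terms to it. As a point of reference, the undistorted gnomonic mapping between sky positions and tangent plane offsets can be sketched in Python as follows. These are the standard textbook formulae, not the WCSTools or WCSLIB code; the function names are hypothetical, all angles are in degrees, and the offsets are taken as positive toward increasing right ascension and declination, which is an assumption about orientation.

```python
import math


def tan_forward(ra, dec, ra0, dec0):
    """Sky position to tangent plane offsets (xi, eta), all in degrees."""
    a, d = math.radians(ra), math.radians(dec)
    a0, d0 = math.radians(ra0), math.radians(dec0)
    # Cosine of the angular distance from the tangent point
    cosc = (math.sin(d0) * math.sin(d)
            + math.cos(d0) * math.cos(d) * math.cos(a - a0))
    if cosc <= 0.0:
        raise ValueError("point is 90 degrees or more from the tangent point")
    xi = math.cos(d) * math.sin(a - a0) / cosc
    eta = (math.cos(d0) * math.sin(d)
           - math.sin(d0) * math.cos(d) * math.cos(a - a0)) / cosc
    return math.degrees(xi), math.degrees(eta)


def tan_inverse(xi, eta, ra0, dec0):
    """Tangent plane offsets (xi, eta) back to a sky position, in degrees."""
    x, y = math.radians(xi), math.radians(eta)
    a0, d0 = math.radians(ra0), math.radians(dec0)
    denom = math.cos(d0) - y * math.sin(d0)
    ra = a0 + math.atan2(x, denom)
    dec = math.atan2(math.sin(d0) + y * math.cos(d0), math.hypot(x, denom))
    return math.degrees(ra) % 360.0, math.degrees(dec)


# Illustrative values: a star a fraction of a degree from the field center
xi, eta = tan_forward(150.3, 2.1, 150.0, 2.0)
print(xi, eta)
print(tan_inverse(xi, eta, 150.0, 2.0))   # should recover roughly (150.3, 2.1)
```

In a full WCS chain, pixel offsets would first be converted to (xi, eta) with the linear keywords, and a distortion convention would adjust either the pixel offsets or the plane coordinates before the inverse projection is applied.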
Table 2: WCS projections supported by WCSTools

    Code     Projection
    PIX      Pixel WCS
    LIN      Linear projection
    AZP      Zenithal/Azimuthal Perspective
    SZP      Zenithal/Azimuthal Perspective
    TAN      Gnomonic = Tangent Plane
    SIN      Orthographic/synthesis
    STG      Stereographic
    ARC      Zenithal/azimuthal equidistant
    ZPN      Zenithal/azimuthal Polynomial
    ZEA      Zenithal/azimuthal Equal Area
    AIR      Airy
    CYP      CYlindrical Perspective
    CAR      Cartesian
    MER      Mercator
    CEA      Cylindrical Equal Area
    COP      Conic Perspective
    COD      Conic equidistant
    COE      Conic Equal area
    COO      Conic Orthomorphic
    BON      Bonne
    PCO      Polyconic
    SFL      Sanson-Flamsteed (Global Sinusoidal)
    PAR      Parabolic
    AIT      Hammer-Aitoff
    MOL      Mollweide
    CSC      COBE quadrilateralized Spherical Cube
    QSC      Quadrilateralized Spherical Cube
    TSC      Tangential Spherical Cube
    NCP      Special case of SIN from AIPS
    GLS      Same as SFL from AIPS
    DSS      Digitized Sky Survey plate solution
    PLT      Plate solution (SAO corrections)
    TNX      Tangent Plane (NOAO corrections)
    ZPX      Zenithal Polynomial (NOAO corrections)
    TPV      Tangent Plane (SCAMP corrections)
    TAN-SIP  Tangent Plane (Spitzer corrections)

5. Into the Future

Most of the observational astronomical community uses the FITS format for at least some stage in the life of their data. FITS meets their needs by being a well-defined format with embedded human-readable metadata, and having multiple software packages capable of reading it, published specifications, and an evolving, well-defined, published set of metadata models and keywords. But it is not everything to everybody. Where do we go from here, and how do we keep the advantages of a format which has reached across disciplines and uses over decades? As Brian Schmidt has said [38], "Getting standards for data in place that work requires a consensus dictatorship. It requires collaborations between librarians, and computer scientists to figure out how to create and maintain data hierarchies." As we move forward, we should be careful not to lose what we have.

6.
Acknowledgements

Thanks to the Harvard Plate Collection for getting me involved in its digitization and providing useful images, and to Bob Mann and Tim Jenness for comments which were very useful in developing this paper from a talk at the 2014 Astronomical Data Analysis Software and Systems conference, which is summarized in [39].

References

[1] D. C. Wells, E. W. Greisen, R. H. Harten, FITS - a Flexible Image Transport System, A&AS 44 (1981) 363.
[2] B. Thomas, T. Jenness, F. Economou, P. Greenfield, P. Hirst, D. S. Berry, E. M. Bray, N. Gray, D. Muna, J. Turner, M. de Val-Borro, J. Santander-Vela, D. Shupe, J. Good, G. B. Berriman, Significant Problems in FITS Limit Its Use in Modern Astronomical Research, in: N. Manset, P. Forshay (Eds.), Astronomical Data Analysis Software and Systems XXIII, Vol. 485 of Astronomical Society of the Pacific Conference Series, 2014, p. 351. arXiv:1502.05958.
[3] B. Thomas, T. Jenness, F. Economou, P. Greenfield, P. Hirst, D. S. Berry, E. Bray, N. Gray, D. Muna, J. Turner, M. de Val-Borro, J. Santander-Vela, D. Shupe, J. Good, G. B. Berriman, S. Kitaeff, J. Fay, O. Laurino, A. Alexov, W. Landry, J. Masters, A. Brazier, R. Schaaf, K. Edwards, R. O. Redman, T. R. Marsh, O. Streicher, P. Norris, S. Pascual, M. Davie, M. Droettboom, T. Robitaille, R. Campana, A. Hagen, P. Hartogh, D. Klaes, M. W. Craig, D. Homeier, Learning from FITS: Limitations in use in modern astronomical research, Astron. Comput., in press. arXiv:1502.00996, doi:10.1016/j.ascom.2015.01.009.
[4] T. Jenness, Reimplementing the hierarchical data system using HDF5 (2015). doi:10.1016/j.ascom.2015.02.003.
[5] V. V. Kitaeff, A. Cannon, A. Wicenec, D. Taubman, Astronomical Imagery: Considerations For a Contemporary Approach with JPEG2000, Astron. Comput., in press. arXiv:1403.2801, doi:10.1016/j.ascom.2014.06.002.
[6] J. Grindlay, S. Tang, R. Simcoe, S. Laycock, E. Los, D. Mink, A. Doane, G. Champine, DASCH to Measure (and preserve) the Harvard Plates: Opening the 100-year Time Domain Astronomy Window, in: W. Osborn, L. Robbins (Eds.), Preserving Astronomy's Photographic Legacy: Current State and the Future of North American Astronomical Plates, Vol. 410 of Astronomical Society of the Pacific Conference Series, 2009, p. 101.
[7] S. Tang, J. Grindlay, E. Los, M. Servillat, Improved Photometry for the DASCH Pipeline, PASP 125 (2013) 857-865. doi:10.1086/671760.
[8] D. Bard, D. Hogg, Big everything: the future of astronomical data (Nov. 2013). URL http://kipac.stanford.edu/kipac/big-everything-future-astronomical-data
[9] M. J. Kurtz, G. Eichhorn, A. Accomazzi, C. S. Grant, S. S. Murray, J. M. Watson, The NASA Astrophysics Data System: Overview, A&AS 143 (2000) 41-59. arXiv:astro-ph/0002104, doi:10.1051/aas:2000170.
[10] T. Vence, One Million Preprints and Counting: A conversation with arXiv founder Paul Ginsparg (Dec. 2014). URL http://www.the-scientist.com/?articles.view/articleNo/41677/title/Q-A--One-Million-Preprints-and-Counting/
[11] E. W. Greisen, D. C. Wells, R. H. Harten, The FITS Tape Formats - Flexible Image Transport Systems, in: D. A. Elliott (Ed.), Conference on Applications of Digital Image Processing to Astronomy, Vol. 264 of Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series, 1980, p. 298.
[12] J. D. Ponz, R. W. Thompson, J. R. Munoz, The FITS image extension, A&AS 105 (1994) 53-55.
[13] M. Dolensky, D. Tody, The Simple Spectral Access protocol, in: P. J. Quinn, A. Bridger (Eds.), Optimizing Scientific Return for Astronomy through Information Technologies, Vol. 5493 of Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series, 2004, pp. 262-268. doi:10.1117/12.553167.
[14] Associated Universities, Inc., AIPS: Astronomical Image Processing System, Astrophysics Source Code Library ascl:9911.003 (Nov. 1999).
[15] T. Jenness, D. Berry, M. Currie, P. W. Draper, F. Economou, N. Gray, B. McIlwrath, K. Shortridge, M. Taylor, P. Wallace, R. Warren-Smith, Learning from 25 years of the extensible n-dimensional data format (2015). doi:10.1016/j.ascom.2014.11.01.
[16] W. D. Pence, FITSIO - A New Fortran-77 Subroutine Interface for Reading and Writing FITS Format Files, in: Bulletin of the American Astronomical Society, Vol. 23 of Bulletin of the American Astronomical Society, 1991, p. 936.
[17] W. Pence, CFITSIO, v2.0: A New Full-Featured Data Interface, in: D. M. Mehringer, R. L. Plante, D. A. Roberts (Eds.), Astronomical Data Analysis Software and Systems VIII, Vol. 172 of Astronomical Society of the Pacific Conference Series, 1999, p. 487.
[18] National Optical Astronomy Observatories, IRAF: Image Reduction and Analysis Facility, Astrophysics Source Code Library ascl:9911.002 (Nov. 1999).
[19] N. Zarate, P. Greenfield, A FITS Image Extension Kernel for IRAF, in: G. H. Jacoby, J. Barnes (Eds.), Astronomical Data Analysis Software and Systems V, Vol. 101 of Astronomical Society of the Pacific Conference Series, 1996, p. 331.
[20] J. D. Mink, WCSTools: Image Astrometry Toolkit, Astrophysics Source Code Library ascl:1109.015 (Sep. 2011).
[21] Smithsonian Astrophysical Observatory, SAOImage DS9: A utility for displaying astronomical images in the X11 window environment, Astrophysics Source Code Library ascl:0003.002 (Mar. 2000).
[22] M. Taylor, TOPCAT: Tool for OPerations on Catalogues And Tables, Astrophysics Source Code Library ascl:1101.010 (Jan. 2011).
[23] W. Pence, P. Chai, Fv: Interactive FITS file editor, Astrophysics Source Code Library ascl:1205.005 (May 2012).
[24] C. A. Beichman, G. Neugebauer, H. J. Habing, P. E. Clegg, T. J. Chester (Eds.), Infrared astronomical satellite (IRAS) catalogs and atlases. Volume 1: Explanatory supplement, Vol. 1, 1988.
[25] E. W. Greisen, Non-linear Coordinate Systems in AIPS (Jun. 1983). URL ftp://ftp.aoc.nrao.edu/pub/software/aips/TEXT/PUBL/AIPSMEMO27.PS
[26] E. W. Greisen, Additional Non-linear Coordinate Systems in AIPS (Jan. 1993). URL ftp://ftp.aoc.nrao.edu/pub/software/aips/TEXT/PUBL/AIPSMEMO46.PS
[27] J. L. Russell, B. M. Lasker, B. J. McLean, C. R. Sturch, H. Jenkner, The Guide Star Catalog. II - Photometric and astrometric models and solutions, AJ 99 (1990) 2059-2081. doi:10.1086/115484.
[28] E. W. Greisen, M. R. Calabretta, Representations of world coordinates in FITS, A&A 395 (2002) 1061-1075. doi:10.1051/0004-6361:20021326.
[29] M. R. Calabretta, Wcslib and Pgsbox, Astrophysics Source Code Library ascl:1108.003 (Aug. 2011).
[30] M. R. Calabretta, F. Valdes, E. W. Greisen, S. L. Allen, Representations of distortions in FITS world coordinate systems, in: F. Ochsenbein, M. G. Allen, D. Egret (Eds.), Astronomical Data Analysis Software and Systems (ADASS) XIII, Vol. 314 of Astronomical Society of the Pacific Conference Series, 2004, p. 551.
[31] D. L. Shupe, M. Moshir, J. Li, D. Makovoz, R. Narron, R. N. Hook, The SIP Convention for Representing Distortion in FITS Image Headers, in: P. Shopbell, M. Britton, R. Ebert (Eds.), Astronomical Data Analysis Software and Systems XIV, Vol. 347 of Astronomical Society of the Pacific Conference Series, 2005, p. 491.
[32] D. Tody, L. Davis, F. Valdes, TNX convention for Representing FITS Image Distortions (2008). URL http://fits.gsfc.nasa.gov/registry/tnx.html
[33] F. Valdes, ZPX convention for Representing FITS Image Distortions (2011). URL http://fits.gsfc.nasa.gov/registry/zpxwcs.html
[34] W. D. Pence, The FITS Support Office at NASA/GSFC (2014). URL http://fits.gsfc.nasa.gov
[35] E. Bertin, SCAMP: Automatic Astrometric and Photometric Calibration, Astrophysics Source Code Library ascl:1010.063 (Oct. 2010).
[36] D. S. Berry, R. F. Warren-Smith, AST: World Coordinate Systems in Astronomy, Astrophysics Source Code Library ascl:1404.016 (Apr. 2014).
[37] D. Lang, D. W. Hogg, K. Mierle, M. Blanton, S. Roweis, Astrometry.net: Astrometric calibration of images, Astrophysics Source Code Library ascl:1208.001 (Aug. 2012).
[38] Astronomical Data and Astronomical Digital Stewardship: An interview with Brian Schmidt by Jane Mandelbaum (Nov. 2013). URL http://blogs.loc.gov/digitalpreservation/2013/11/astronomical-data-and-astronomical-digital-stewardship-an-interview-
[39] J. Mink, R. G. Mann, R. Hanisch, A. Rots, R. Seaman, T. Jenness, B. Thomas, W. O'Mullane, The Past, Present and Future of Astronomical Data Formats, ArXiv e-prints, arXiv:1411.0996.