Funtext(7) Support for Column-based Text Files

DESCRIPTION

Funtools will automatically sense and process ``standard'' column-based text files as if they were FITS binary tables without any change in Funtools syntax. In particular, you can filter text files using the same syntax as FITS binary tables:

  fundisp foo.txt'[cir 512 512 .1]'
  fundisp -T foo.txt > foo.rdb
  funtable foo.txt'[pha=1:10,cir 512 512 10]' foo.fits

The first example displays a filtered selection of a text file. The second example converts a text file to an RDB file. The third example converts a filtered selection of a text file to a FITS binary table.

Text files can also be used in Funtools image programs. In this case, you must provide binning parameters (as with raw event files), using the bincols keyword specifier:

  bincols=([xname[:tlmin[:tlmax[:binsiz]]]],[yname[:tlmin[:tlmax[:binsiz]]]])

For example:

  funcnts foo'[bincols=(x:1024,y:1024)]' "ann 512 512 0 10 n=10"

Standard Text Files

Standard text files have the following characteristics:

  • Optional comment lines start with #
  • Optional blank lines are considered comments
  • An optional table header consists of the following (in order):
      • a single line of alpha-numeric column names
      • an optional line of unit strings containing the same number of cols
      • an optional line of dashes containing the same number of cols
  • Data lines follow the optional header and (for the present) consist of
         the same number of columns as the header.
  • Columns are separated by standard delimiters such as space, tab, comma, semi-colon, and bar.

Examples:

  # rdb file
  foo1  foo2    foo3    foos
  ----  ----    ----    ----
  1     2.2     3       xxxx
  10    20.2    30      yyyy

  # multiple consecutive whitespace and dashes
  foo1   foo2    foo3 foos
  ---    ----    ---- ----
     1    2.2    3    xxxx
    10   20.2    30   yyyy

  # comma delims and blank lines
  foo1,foo2,foo3,foos

  1,2.2,3,xxxx
  10,20.2,30,yyyy

  # bar delims with null values
  foo1|foo2|foo3|foos
  1||3|xxxx
  10|20.2||yyyy

  # header-less data
  1     2.2   3 xxxx
  10    20.2 30 yyyy

The default set of token delimiters consists of spaces, tabs, commas, semi-colons, and vertical bars. Several parsers are used simultaneously to analyze a line of text in different ways. One way of analyzing a line is to allow a combination of spaces, tabs, and commas to be squashed into a single delimiter (no null values between consecutive delimiters). Another way is to allow tab, semi-colon, and vertical bar delimiters to support null values, i.e., two consecutive delimiters imply a null value (as in an RDB file). A successful parser is one that returns a consistent number of columns for all rows, with each column having a consistent data type. More than one parser can be successful. For now, it is assumed that successful parsers all return the same tokens for a given line. (Theoretically, there are pathological cases, which will be taken care of as needed.) Bad parsers are discarded on the fly.
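
The two parsing strategies described above can be sketched in Python. This is an illustration only, not the Funtools implementation: the names split_squashing, split_with_nulls, and successful_parsers are hypothetical, and the check covers only column counts, not the per-column data-type consistency that is also required.

```python
import re

def split_squashing(line):
    """Treat any run of spaces, tabs, and commas as one delimiter (no nulls)."""
    return [tok for tok in re.split(r"[ \t,]+", line.strip()) if tok]

def split_with_nulls(line):
    """Treat each tab, semi-colon, or bar as a delimiter; empty fields are nulls."""
    return re.split(r"[\t;|]", line.rstrip("\n"))

def successful_parsers(lines):
    """Return the names of parsers that give the same column count on every line."""
    parsers = {"squash": split_squashing, "nulls": split_with_nulls}
    good = []
    for name, parse in parsers.items():
        counts = {len(parse(line)) for line in lines}
        if len(counts) == 1:  # an inconsistent parser is discarded
            good.append(name)
    return good

# Tab-delimited rows with null values: only the null-aware parser survives.
rdb_rows = ["a\tb\tc", "1\t\t3", "10\t20\t"]
print(successful_parsers(rdb_rows))
```

With these rows, the squashing parser sees 3, 2, and 2 columns and is rejected, while the null-aware parser sees 3 columns on every line.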

If the header does not exist, then the names ``col1'', ``col2'', etc. are assigned to the columns to allow filtering. Furthermore, the data type of each column is determined by the data type found in that column of the first data line, and can be one of the following: string, int, or double. Thus, all of the above examples return the following display:

  fundisp foo'[foo1>5]'
        FOO1                  FOO2       FOO3         FOOS
  ---------- --------------------- ---------- ------------
          10           20.20000000         30         yyyy
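
The naming and typing rule can be sketched as follows. This is a minimal sketch, assuming Python's int() and float() as stand-ins for the parsers' numeric checks (the real checks may accept a different set of tokens); the names infer_type and describe_columns are hypothetical.

```python
def infer_type(token):
    """Classify a token as 'int', 'double', or 'string'."""
    try:
        int(token)
        return "int"
    except ValueError:
        pass
    try:
        float(token)
        return "double"
    except ValueError:
        return "string"

def describe_columns(first_data_row, header=None):
    """Pair each column name with the type inferred from the first data row.

    When no header is given, columns are named col1, col2, and so on.
    """
    names = header or [f"col{i}" for i in range(1, len(first_data_row) + 1)]
    return [(name, infer_type(tok)) for name, tok in zip(names, first_data_row)]

# Header-less data: default names are assigned.
print(describe_columns(["1", "2.2", "3", "xxxx"]))
```

Because only the first data line is consulted, a later row whose token has a different type is not taken into account by this rule.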

Comments Convert to Header Params

Comments which precede data rows are converted into header parameters and will be written out as such using funimage or funhead. Two styles of comments are recognized:

1. FITS-style comments have an equal sign ``='' between the keyword and value and an optional slash ``/'' to signify a comment. The strict FITS rules on column positions are not enforced. In addition, strings only need to be quoted if they contain whitespace. For example, the following are valid FITS-style comments:

  # fits0 = 100
  # fits1 = /usr/local/bin
  # fits2 = "/usr/local/bin /opt/local/bin"
  # fits3c = /usr/local/bin /opt/local/bin /usr/bin
  # fits4c = "/usr/local/bin /opt/local/bin" / path dir

Note that the fits3c value is not quoted; therefore its value is the single token ``/usr/local/bin'', and the slash that begins ``/opt'' starts the comment ``opt/local/bin /usr/bin''. This differs from fits4c, where the quoted string is the value and the comment is ``path dir''.
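
The FITS-style rules can be sketched with a regular expression. This is a hedged illustration, not the Funtools parser: the pattern and the function name parse_fits_comment are hypothetical, and values are returned as plain strings.

```python
import re

# key = value [/ comment]; the value is either a quoted string or one bare token.
FITS_COMMENT = re.compile(
    r'^#\s*(?P<key>\w+)\s*=\s*'
    r'(?:"(?P<quoted>[^"]*)"|(?P<bare>\S+))'
    r'\s*(?:/\s*(?P<comment>.*))?$'
)

def parse_fits_comment(line):
    """Return (key, value, comment) or None if the line is not FITS-style."""
    m = FITS_COMMENT.match(line.strip())
    if m is None:
        return None
    value = m.group("quoted") if m.group("quoted") is not None else m.group("bare")
    return m.group("key"), value, m.group("comment")

print(parse_fits_comment("# fits3c = /usr/local/bin /opt/local/bin /usr/bin"))
print(parse_fits_comment('# fits4c = "/usr/local/bin /opt/local/bin" / path dir'))
```

In the unquoted case the bare token ends at the first whitespace, so the next slash is taken as the comment marker, which matches the fits3c behavior described above.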

2. Free-form comments can have an optional colon separator between the keyword and value. In the absence of quotes, all tokens after the keyword are part of the value, i.e., no comment is allowed. If a string is quoted, then a slash ``/'' after the string will signify a comment. For example:

  # com1 /usr/local/bin
  # com2 "/usr/local/bin /opt/local/bin"
  # com3 /usr/local/bin /opt/local/bin /usr/bin
  # com4c "/usr/local/bin /opt/local/bin" / path dir

  # com11: /usr/local/bin
  # com12: "/usr/local/bin /opt/local/bin"
  # com13: /usr/local/bin /opt/local/bin /usr/bin
  # com14c: "/usr/local/bin /opt/local/bin" / path dir

Note that com3 and com13 are not quoted, so the whole string is part of the value, while com4c and com14c are quoted and have comments following the values.
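
The free-form rules can be sketched the same way. Again this is only an illustration under stated assumptions: the pattern and the name parse_freeform_comment are hypothetical and do not come from Funtools.

```python
import re

# key[:] value; a comment is recognized only after a quoted value.
FREE_COMMENT = re.compile(
    r'^#\s*(?P<key>\w+):?\s+'
    r'(?:"(?P<quoted>[^"]*)"\s*(?:/\s*(?P<comment>.*))?|(?P<bare>.+))$'
)

def parse_freeform_comment(line):
    """Return (key, value, comment) or None if the line does not match."""
    m = FREE_COMMENT.match(line.strip())
    if m is None:
        return None
    if m.group("quoted") is not None:
        return m.group("key"), m.group("quoted"), m.group("comment")
    # Unquoted: every token after the keyword belongs to the value.
    return m.group("key"), m.group("bare"), None

print(parse_freeform_comment("# com3 /usr/local/bin /opt/local/bin /usr/bin"))
print(parse_freeform_comment('# com14c: "/usr/local/bin /opt/local/bin" / path dir'))
```

Unlike the FITS-style case, an unquoted value here keeps all of its slashes, so com3 and com13 yield the full three-path string.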

Some text files have column name and data type information in the header. You can specify the format of column information contained in the header using the ``hcolfmt='' specification. See below for a detailed description.

Multiple Tables in a Single File

Multiple tables are supported in a single file. If an RDB-style file is sensed, then a ^L (form feed) will signify end of table. Otherwise, an end of table is sensed when a new header (i.e., all alphanumeric columns) is found. (Note that this heuristic does not work for single column tables where the column type is ASCII and the table that follows also has only one column.) You can also specify characters that signal an end of table condition using the eot= keyword. See below for details.

You can access the nth table (starting from 1) in a multi-table file by enclosing the table number in brackets, as with a FITS extension:

  fundisp foo'[2]'

The above example will display the second table in the file. (Index values start at 1 in order to maintain logical compatibility with FITS files, where extension numbers also start at 1.)
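
The ^L case and the 1-based index can be sketched as follows. This is a simplified illustration, not Funtools code: it handles only an explicit end-of-table character (^L is ASCII 12, written "\f" in Python) and not the new-header heuristic, and the names split_tables and nth_table are hypothetical.

```python
def split_tables(text, eot="\f"):
    """Split text into tables at each end-of-table character, dropping empty chunks."""
    return [chunk.strip("\n") for chunk in text.split(eot) if chunk.strip()]

def nth_table(text, n, eot="\f"):
    """Return table n, counting from 1 as in foo'[2]'."""
    tables = split_tables(text, eot)
    if not 1 <= n <= len(tables):
        raise IndexError(f"table {n} not found; file has {len(tables)} table(s)")
    return tables[n - 1]

# Two small tab-delimited tables separated by ^L.
sample = "a\tb\n-\t-\n1\t2\n\fc\td\n-\t-\n3\t4\n"
print(nth_table(sample, 2))
```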

TEXT() Specifier

As with ARRAY() and EVENTS() specifiers for raw image arrays and raw event lists respectively, you can use TEXT() on text files to pass key=value options to the parsers. An empty set of keywords is equivalent to not having TEXT() at all, that is:

  fundisp foo
  fundisp foo'[TEXT()]'

are equivalent. When indexing into a multi-table file, the table index number is placed before the TEXT() specifier as the first token:

  fundisp foo'[2,TEXT(...)]'

The filter specification is placed after the TEXT() specifier, separated by a comma, or in an entirely separate bracket:

  fundisp foo'[TEXT(...),circle 512 512 .1]'
  fundisp foo'[2,TEXT(...)][circle 512 512 .1]'

TEXT() Keyword Options

The following is a list of keywords that can be used within the TEXT() specifier (the first three are the most important):

Environment Variables

Environment variables are defined to allow many of these TEXT() values to be set without having to include them in TEXT() every time a file is processed:

  keyword       environment variable
  -------       --------------------
  delims        TEXT_DELIMS
  comchars      TEXT_COMCHARS
  cols          TEXT_COLUMNS
  eot           TEXT_EOT
  null1         TEXT_NULL1
  alen          TEXT_ALEN
  bincols       TEXT_BINCOLS
  hcolfmt       TEXT_HCOLFMT

Restrictions and Problems

As with raw event files, the '+' (copy extensions) specifier is not supported for programs such as funtable.

String to int and int to string data conversions are allowed by the text parsers. This is done more by force of circumstance than by conviction: these transitions often happen with VizieR catalogs, which we want to support fully. One consequence of allowing these transitions is that the text parsers can get confused by columns which contain a valid integer in the first row and then switch to a string. Consider the following table:

  xxx   yyy     zzz
  ----  ----    ----
  111   aaa     bbb
  ccc   222     ddd

The xxx column has an integer value in row one and a string in row two, while the yyy column has the reverse. The parser will erroneously treat the first column as having data type int:

  fundisp foo.tab
         XXX          YYY          ZZZ
  ---------- ------------ ------------
         111        'aaa'        'bbb'
  1667457792        '222'        'ddd'

while the second column is processed correctly. This situation can be avoided in any number of ways, all of which force the data type of the first column to be a string. For example, you can edit the file and explicitly quote the first row of the column:

  xxx   yyy     zzz
  ----  ----    ----
  "111" aaa     bbb
  ccc   222     ddd

  [sh] fundisp foo.tab
           XXX          YYY          ZZZ
  ------------ ------------ ------------
         '111'        'aaa'        'bbb'
         'ccc'        '222'        'ddd'

You can edit the file and explicitly set the data type of the first column:

  xxx:3A   yyy  zzz
  ------   ---- ----
  111      aaa  bbb
  ccc      222  ddd

  [sh] fundisp foo.tab
           XXX          YYY          ZZZ
  ------------ ------------ ------------
         '111'        'aaa'        'bbb'
         'ccc'        '222'        'ddd'

You also can explicitly set the column names and data types of all columns, without editing the file:

  [sh] fundisp foo.tab'[TEXT(xxx:3A,yyy:3A,zzz:3a)]'
           XXX          YYY          ZZZ
  ------------ ------------ ------------
         '111'        'aaa'        'bbb'
         'ccc'        '222'        'ddd'
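
As an aside on the erroneous display shown earlier, the value 1667457792 in the XXX column is exactly what results from reading the bytes of the string ``ccc'', followed by a NUL byte, as a big-endian 32-bit integer. The arithmetic below is certain; that the parser produces the value in this way is an inference from the output, not a statement about Funtools internals.

```python
# Bytes of the string 'ccc', padded with a NUL to fill a 4-byte integer slot.
raw = b"ccc" + b"\x00"

# Interpret those 4 bytes as a signed big-endian integer: 0x63636300.
value = int.from_bytes(raw, byteorder="big", signed=True)
print(value)  # 1667457792
```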

The issue of data type transitions (which to allow and which to disallow) is still under discussion.