
This is a summary of parsing rules for fastq files, as implemented in the latf-load tool.

I. Notation

The parsing rules are specified using a variant of Backus-Naur Form notation (BNF). The conventions are:

    Token names are specified in capital letters. Tokens are defined in section II.
    Names of non-terminals are specified in lowercase. Non-terminals are defined in section III.
    { A }   means "A repeated 0 or more times"
    [ A ]   means "optional A"
    A | B   means "A or B"
    ( )     parentheses are used to group items

II. Tokens

The following tokens are used in the grammar description.

    ALPHANUM    - a sequence of Latin letters, decimal digits, or dashes, beginning with a letter
    NUMBER      - a sequence of decimal digits
    WS          - a sequence of white space characters (spaces or tabs)
    EOL         - an end of line character (not considered a white space)
    COORDS      - a sequence of 4 decimal numbers, each preceded by ':', e.g. :8:2:342:540
    SPOTGROUP   - '#' followed by a possibly empty sequence of letters, decimal digits, or '_'
    RUNDOTSPOT  - a special identifier representing a spot inside a run, in the format [SDE]RR{digits}.{digits}. An example would be "DRR000123.12"
    BASES       - a sequence of characters representing bases in text notation. Valid characters are A, C, G, T, a, c, g, t, N, n.
    CSBASES     - a sequence of characters representing bases in colorspace notation. The first character has to be one of A, C, G, T, a, c, g, t.
                  The remaining characters have to be '1', '2', '3', or '.'
    QUAL        - a sequence of characters representing qualities. The range of valid characters is determined by the quality encoding
                  specified by the command line option --quality: ASCII codes 33-126 for PHRED_33, 64-127 for PHRED_64, or 59-126 for LOGODDS (all ranges inclusive).

In addition, grammar rules use literal tokens in apostrophes, e.g. ':'
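
The token definitions above can be approximated with regular expressions. The following Python sketch is illustrative only and is not taken from the latf-load source; the names TOKEN_PATTERNS, QUALITY_RANGES, is_token and is_qual are hypothetical. It assumes that CSBASES requires at least one color character after the leading base, and it uses the character set exactly as listed above. A real lexer is context dependent (for example, "ACGT" matches both ALPHANUM and BASES), which this sketch does not model.

```python
import re

# Hypothetical regex approximations of the tokens in section II.
# Illustrative only; not taken from the latf-load source.
TOKEN_PATTERNS = {
    "ALPHANUM": r"[A-Za-z][A-Za-z0-9-]*",
    "NUMBER": r"[0-9]+",
    "WS": r"[ \t]+",
    "COORDS": r"(?::[0-9]+){4}",
    "SPOTGROUP": r"#[A-Za-z0-9_]*",
    "RUNDOTSPOT": r"[SDE]RR[0-9]+\.[0-9]+",
    "BASES": r"[ACGTacgtNn]+",
    # Assumption: at least one color character follows the leading base.
    "CSBASES": r"[ACGTacgt][123.]+",
}

# Inclusive ASCII ranges per quality encoding, as listed for QUAL.
QUALITY_RANGES = {
    "PHRED_33": (33, 126),
    "PHRED_64": (64, 127),
    "LOGODDS": (59, 126),
}


def is_token(kind, text):
    """Return True if the whole text matches the pattern for the given token."""
    return re.fullmatch(TOKEN_PATTERNS[kind], text) is not None


def is_qual(text, encoding):
    """Return True if text is non-empty and every character is in range."""
    lo, hi = QUALITY_RANGES[encoding]
    return len(text) > 0 and all(lo <= ord(ch) <= hi for ch in text)
```

For example, is_token("COORDS", ":8:2:342:540") and is_token("RUNDOTSPOT", "DRR000123.12") both hold for the examples given in the token list.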

III. Parsing rules

1.  Input file

    input:
        { readLines [ qualityLines ] } |
        { qualityLines } |
        { name COORDS ':' ( BASES | CSBASES ) ':' QUAL EOL }

Normally, an input file consists of a sequence of reads, each occupying multiple lines that identify the read and contain its bases, with or without qualities.
It is possible to have a file with only qualities specified for every read; in that case the loader must also be given a file with corresponding bases.

An alternative is an "inline" format which has name, bases and qualities of each read on a single line.
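
As an illustration of the inline format, the Python sketch below splits one such line into its parts. It is a simplified approximation and not the latf-load implementation: INLINE_RE and parse_inline are hypothetical names, the name pattern is a loose rendering of the name rule in section III.3, and the quality field is assumed to use the PHRED_33 character range.

```python
import re

# Hypothetical pattern for: name COORDS ':' ( BASES | CSBASES ) ':' QUAL
INLINE_RE = re.compile(
    r"(?P<name>[A-Za-z0-9][A-Za-z0-9_.:-]*)"        # simplified name rule
    r"(?P<coords>(?::[0-9]+){4})"                   # COORDS, e.g. :8:2:342:540
    r":(?P<bases>[ACGTacgtNn]+|[ACGTacgt][123.]+)"  # BASES or CSBASES
    r":(?P<qual>[!-~]+)"                            # QUAL, PHRED_33 range assumed
)


def parse_inline(line):
    """Parse one inline-format line; return a dict, or None if it does not match."""
    m = INLINE_RE.fullmatch(line.rstrip("\r\n"))
    if m is None:
        return None
    # ":8:2:342:540" splits into ['', '8', '2', '342', '540']; drop the empty head.
    coords = tuple(int(n) for n in m.group("coords").split(":")[1:])
    return {
        "name": m.group("name"),
        "coords": coords,
        "bases": m.group("bases"),
        "qual": m.group("qual"),
    }
```

Because the bases field cannot contain ':', the first ':' after the bases marks the start of the qualities, even when the quality string itself contains ':'.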


2.  Read

    readLines : ( '@' | '>' ) ( tagLine | runSpotRead ) EOL read

    read : baseRead | csRead

    baseRead : BASES EOL { BASES EOL }

    csRead : CSBASES EOL { CSBASES EOL }

    qualityLines: '+' { TOKEN } EOL quality

    quality: QUAL EOL { QUAL EOL }

A read in a fastq file is identified by its tag line. The tag line is followed by bases in text or colorspace format.

An optional quality specification may follow. The quality specification starts with '+' and its own tag line, which is
expected to match the tag line of the textually preceding read, although latf-load does not enforce this.
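
The multi-line structure described above can be sketched in Python as follows. This is a minimal reader under stated assumptions, not the latf-load algorithm: read_records is a hypothetical name, tag lines are returned unparsed, bases and qualities are not validated, and blank lines between records are skipped. Since a quality line may itself begin with '@' or '+', the sketch reads quality lines until their combined length reaches the length of the bases; this is a common heuristic and may differ from what latf-load does. Files containing only quality lines are not handled.

```python
def read_records(lines):
    """Yield (tag, bases, quality) tuples; quality is None when absent."""
    lines = [ln.rstrip("\r\n") for ln in lines]
    i, n = 0, len(lines)
    while i < n:
        if not lines[i]:
            i += 1  # assumption: blank lines between records are skipped
            continue
        if lines[i][0] not in "@>":
            raise ValueError(f"line {i + 1}: expected '@' or '>'")
        tag = lines[i][1:]
        i += 1

        # One or more lines of bases, up to the next line starting a new item.
        bases = []
        while i < n and lines[i] and lines[i][0] not in "@>+":
            bases.append(lines[i])
            i += 1
        seq = "".join(bases)
        if not seq:
            raise ValueError(f"read '{tag}' has no bases")

        # Optional quality section: '+' line, then quality lines.
        quality = None
        if i < n and lines[i].startswith("+"):
            i += 1  # the '+' tag line is not compared with the read's tag line
            parts, got = [], 0
            while i < n and got < len(seq):
                parts.append(lines[i])
                got += len(lines[i])
                i += 1
            quality = "".join(parts)
        yield tag, seq, quality
```

Given the lines "@r1", "ACGT", "ACGT", "+r1", "IIII", "@III", the sketch yields a single record with bases "ACGTACGT" and qualities "IIII@III", because the second quality line is consumed by length rather than being mistaken for a new tag line.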

3. Tag line

    tagLine : nameSpotGroup |
              nameSpotGroup readNumber WS NUMBER ':' ALPHANUM ':' NUMBER indexSequence |
              nameSpotGroup readNumber [ WS [ ALPHANUM ] ] |
              nameSpotGroup WS casava1_8 |
              nameSpotGroup WS ALPHANUM |
              runSpotRead [ WS ] |
              name [ readNumber [ WS ] ]

    nameSpotGroup : [ name WS ] nameWithCoords [ SPOTGROUP ] |
                    name SPOTGROUP |
                    name [ readNumber [ WS ] ] |
                    name WS ALPHANUM '='

    nameWithCoords :  name COORDS |
                      name COORDS '_' [ casava1_8 ] |
                      name COORDS [ ':' ] '.' name |
                      name COORDS ':' [ name ]

    name : ( ALPHANUM | NUMBER ) { '_' | '-' | '.' | ':' | ALPHANUM | NUMBER }

    readNumberOrTail : ( readNumber | WS [ casava1_8 ] | WS tail ) { WS tail }

    readNumber : '/' NUMBER { '/' name }

    casava1_8 : NUMBER ':' ALPHANUM ':' NUMBER [ ':' [ ALPHANUM | NUMBER ] ]

    tail : ALPHANUM { NUMBER | ALPHANUM | '_' | '/' | '=' }

    runSpotRead : RUNDOTSPOT [ ( '.' | '/' ) NUMBER ] { tail | NUMBER }

The purpose of the tag line is to identify a read. The components of a tag line may represent a run name (possibly including coordinates), a spot group, a read number, and some additional information which may be of no concern to latf-load.
Tag lines come in a multitude of variations on the basic fastq format, which is somewhat loosely defined (see e.g. http://en.wikipedia.org/wiki/FASTQ_format).
The parsing rules for the tag line in latf-load were designed to recognize the major variations of the basic tag line format and
have evolved based on format variations encountered in real-life submissions.
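
To make the components concrete, the Python sketch below recognizes one family of tag lines: a name with coordinates, followed by an optional spot group, an optional read number, and an optional casava1_8 suffix. It is a simplified approximation covering only part of the grammar above (runSpotRead and the other nameWithCoords variants are not handled), and the names TAG_RE and parse_tag are hypothetical.

```python
import re

# Hypothetical pattern for:
#   name COORDS [ SPOTGROUP ] [ '/' NUMBER ] [ WS casava1_8 ]
TAG_RE = re.compile(
    r"(?P<name>[A-Za-z0-9][A-Za-z0-9_.:-]*)"  # simplified name rule
    r"(?P<coords>(?::[0-9]+){4})"             # COORDS
    r"(?P<spot_group>#[A-Za-z0-9_]*)?"        # optional SPOTGROUP
    r"(?:/(?P<read_number>[0-9]+))?"          # optional '/' NUMBER
    r"(?:[ \t]+(?P<casava>"                   # optional WS casava1_8
    r"[0-9]+:[A-Za-z][A-Za-z0-9-]*:[0-9]+"
    r"(?::(?:[A-Za-z][A-Za-z0-9-]*|[0-9]+)?)?"
    r"))?"
)


def parse_tag(tag):
    """Split a tag line (without '@' or '>') into parts; None if unrecognized."""
    m = TAG_RE.fullmatch(tag.strip())
    if m is None:
        return None
    fields = m.groupdict()
    fields["coords"] = tuple(int(n) for n in fields["coords"].split(":")[1:])
    if fields["read_number"] is not None:
        fields["read_number"] = int(fields["read_number"])
    if fields["spot_group"] is not None:
        fields["spot_group"] = fields["spot_group"][1:]  # drop the leading '#'
    return fields
```

For a tag such as "HWUSI-EAS100R:6:73:941:1973#0/1", the sketch reports the name "HWUSI-EAS100R", the coordinates (6, 73, 941, 1973), the spot group "0", and the read number 1. The trailing four numeric fields are always taken as COORDS, so any earlier colon-separated fields remain part of the name.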