I. State and Event Based Parsing

 

a.  All computer programs have some kind of state inherent to their execution.  Whether you, the programmer, need to worry about this state depends on the question you are asking or the task you are trying to solve.  One type of programming in Perl that regularly involves state information is regular expression parsing of data.  First, we need to define state.

 

    State:  When a program runs, it is in the running state.  For many programs, this is the only state that is ever expressed, and the programmer doesn't even take its existence into consideration unless the program fails to enter it correctly (e.g. it terminates with syntax errors or other exceptions).  Other programs explicitly take state into account, and design specific states into their execution model.  User interfaces, such as the browser you are using to read this blog, or even the operating system running on your machine, are written to loop through the same code over and over again until you hit a button or type a command (an event), which triggers a state change.  That state change may be from an 'open' state in the browser to a 'closed' state (e.g. hitting the exit button), or from a 'running' state to an 'off' state (shutting down the computer).  The state of a program can be diagrammed with flow charts, and it is a good idea to start thinking about how to do this before writing your programs.  The diagrams on this page use the following notation:

 

 

    [STATE]

    <? event test ?>

    ----> state transition

    +--> state transition after true event test

    !--> state transition after false event test

    ---( process instructions to be done during state transition )-->

 

Example: Browser

[open]---><? exit button hit ?>+--->[closed]

                      !---><? new url entered ?>+--(go to new url)--->[open]

                                     !--->[open]
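
The browser diagram above can be sketched in Perl as a loop over events.  This is only an illustration: the @events list is a hypothetical stand-in for real user input, and the 'new url: ...' event format is an assumption made for this example.

```perl
use strict;
use warnings;

# Hypothetical event stream standing in for real user input.
my @events = ('new url: http://example.org', 'scroll', 'exit');

my $state = 'open';
my @visited;    # URLs loaded during state transitions

foreach my $event (@events) {
    last if $state eq 'closed';

    if ($event eq 'exit') {
        # true branch of <? exit button hit ?>
        $state = 'closed';
    }
    elsif ($event =~ m/^new url:\s+(\S+)/) {
        # true branch of <? new url entered ?>: do the work, stay open
        push @visited, $1;
        $state = 'open';
    }
    else {
        # both event tests false: remain in [open]
        $state = 'open';
    }
    print "event '$event' -> state [$state]\n";
}
```

Each pass through the loop corresponds to one trip through the diagram: test for the events in order, do any processing attached to the transition, and land in the next state.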

 

 

    b. Event Based Processing:  When you are processing a text file in Perl, especially when you are using regular expressions, you will typically design the system around a pre-defined set of states, and make the system transition between them when a line matches a particular regular expression (the match is the state transition event).  In some cases, you will set Perl variables to true (e.g. defined) or false (e.g. undefined), or test whether some part of your data structure is defined, to change or determine state.
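
As a minimal sketch of this idea, the following uses a variable that is either defined or undefined to record the current state, and regular expression matches as the events that change it.  The BEGIN and END markers and the @lines data are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical input: only the lines between BEGIN and END are data.
my @lines = ('header', 'BEGIN', 'a', 'b', 'END', 'footer');

my $in_block;    # undefined (false) until we see BEGIN
my @data;

foreach my $line (@lines) {
    if ($line =~ m/^BEGIN$/) {
        $in_block = 1;        # event: enter the collecting state
    }
    elsif ($line =~ m/^END$/) {
        undef $in_block;      # event: leave the collecting state
    }
    elsif (defined $in_block) {
        push @data, $line;    # no event: act according to the current state
    }
}
print join(',', @data), "\n";    # prints "a,b"
```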

In the class, we used a GEO file (produced by a research group within the IGSP).  We actually stripped the file down to two samples for shorter processing times (/home/londo003/perl_class/data_files/GSE3149.two_sample), but the full dataset is available at /home/londo003/perl_class/data_files/GSE3149.  We wanted to parse this into an array of sample entries (using an iterator instead of a real array, just like in the Fasta Processing Example).  Each entry has the following structure:

 

$entry = {

                   'ID' => $SAMPLE_ID,

                   $attributeKey => $attributeValue,

                   'table' => [

                                        [$sampleTableColumn1Value, $sampleTableColumn2Value],

                                   ]

                };
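
For illustration, here is one such entry filled in and read back.  The sample ID, attribute name, and table values are hypothetical placeholders, not data from the GEO file.

```perl
use strict;
use warnings;

my $entry = {
    'ID'           => 'GSM_EXAMPLE',      # hypothetical sample ID
    'Sample_title' => 'example title',    # hypothetical attribute key/value
    'table'        => [
        [ 'probe_1', 10.5 ],              # one arrayRef per table row
        [ 'probe_2', 7.25 ],
    ],
};

print "sample: $entry->{'ID'}\n";
foreach my $row (@{ $entry->{'table'} }) {
    my ($id_ref, $value) = @{$row};
    print "$id_ref\t$value\n";
}
my $row_count = scalar @{ $entry->{'table'} };
```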

 

To get at the headers for the sample_table, we would use a global @sample_table_headers array.  The index of a header in this array is the same as the index of the corresponding value in each data row arrayRef, e.g. in this case 'ID_REF' corresponds to $sampleTableColumn1Value, and 'VALUE' corresponds to $sampleTableColumn2Value.  Note, if you wanted to, you could replace the arrayRef representation of a single row of data with a hashRef like so:  { 'ID_REF' => $sampleTableColumn1Value, 'VALUE' => $sampleTableColumn2Value }.  This would use more memory, as you are storing the strings 'ID_REF' and 'VALUE' each time you store a single row of data (each sample contains about 23000 rows of data).
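
A small sketch of both representations, assuming the two headers named above and a made-up row.  The %column_of lookup hash is a helper introduced only for this example.

```perl
use strict;
use warnings;

my @sample_table_headers = ('ID_REF', 'VALUE');
my $row = [ 'probe_1', 10.5 ];    # hypothetical data row (arrayRef form)

# Map each header name to its column index once, then reuse it for every row.
my %column_of;
@column_of{@sample_table_headers} = (0 .. $#sample_table_headers);
my $value = $row->[ $column_of{'VALUE'} ];

# hashRef alternative: a hash slice pairs headers with the row values.
# More convenient to read, but the header strings are stored per row.
my %row_hash;
@row_hash{@sample_table_headers} = @{$row};

print "VALUE by index: $value\n";
print "VALUE by name:  $row_hash{'VALUE'}\n";
```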

 

We designed our program around the following state diagram.  Note, it uses a global boolean $in_sample_table variable as part of its state, in addition to the regular expression events it encounters:

 

[RUN]---><? m/\^SAMPLE\s+\=\s+(\w+)/ ?>+----( set $entry->{'ID'} and $next_sample_id to $1)--->[SAMPLE DATA COLLECTION]

                           !--->[RUN]

 

[SAMPLE DATA COLLECTION]---><? m/^\!(.*)/ ?>+--->[SAMPLE ATTRIBUTE COLLECTION]

                                                                  !--->[SAMPLE DATA COLLECTION]

 

[SAMPLE ATTRIBUTE COLLECTION]---><? m/sample_table_begin/ ?>+--( set $in_sample_table true, get @sample_table_headers from next line)--->[SAMPLE TABLE DATA COLLECTION]

                                                                     !--(add $attributeKey and $attributeValue to $entry)-->[SAMPLE ATTRIBUTE COLLECTION]

 

[SAMPLE TABLE DATA COLLECTION]--><? m/sample_table_end/ ?>+--(set $in_sample_table false)-->[RUN]

                                                                    !--( push [ $sampleTableColumn1Value, $sampleTableColumn2Value ] onto @{$entry->{'table'}})-->[SAMPLE TABLE DATA COLLECTION]

 

 

The result was /home/londo003/perl_class/scripts/geo_parser.pl.
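
That script is not reproduced on this page.  The following is a minimal sketch of how the state diagram might translate into Perl, not the class script itself.  It assumes attribute lines look like '!key = value' and that table rows are tab-separated; the input text and the make_sample_iterator helper are hypothetical.  Instead of naming each state, it tracks state with whether $entry is defined and whether $in_sample_table is true.

```perl
use strict;
use warnings;

# Hypothetical two-sample input in the layout the diagram assumes.
my $text = join "\n",
    '^SAMPLE = GSM_A',
    '!Sample_title = first',
    '!sample_table_begin',
    "ID_REF\tVALUE",
    "probe_1\t10.5",
    "probe_2\t7.25",
    '!sample_table_end',
    '^SAMPLE = GSM_B',
    '!Sample_title = second',
    '!sample_table_begin',
    "ID_REF\tVALUE",
    "probe_1\t3.0",
    '!sample_table_end',
    '';

my @sample_table_headers;    # global header list, filled at each table start

# Returns a closure that yields one $entry hashRef per call,
# or undef when the input is exhausted.
sub make_sample_iterator {
    my ($fh) = @_;
    return sub {
        my $entry;
        my $in_sample_table;    # false (undefined) outside the table
        while (my $line = <$fh>) {
            chomp $line;
            if ($in_sample_table) {
                # table finished: hand back the entry, back to [RUN]
                return $entry if $line =~ m/sample_table_end/;
                push @{ $entry->{'table'} }, [ split /\t/, $line ];
            }
            elsif ($line =~ m/\^SAMPLE\s+\=\s+(\w+)/) {
                $entry = { 'ID' => $1, 'table' => [] };
            }
            elsif (defined $entry && $line =~ m/sample_table_begin/) {
                $in_sample_table = 1;
                my $header_line = <$fh>;    # headers are on the next line
                last unless defined $header_line;
                chomp $header_line;
                @sample_table_headers = split /\t/, $header_line;
            }
            elsif (defined $entry && $line =~ m/^\!(\w+)\s+\=\s+(.*)/) {
                $entry->{$1} = $2;    # attribute key and value
            }
        }
        return;    # end of input
    };
}

# Usage: read from an in-memory filehandle instead of the real data file.
open my $fh, '<', \$text or die "cannot open in-memory file: $!";
my $next_sample = make_sample_iterator($fh);
my @ids;
while (my $entry = $next_sample->()) {
    push @ids, $entry->{'ID'};
    printf "%s: %d rows\n", $entry->{'ID'}, scalar @{ $entry->{'table'} };
}
```

Because the iterator returns one entry at a time, only a single sample's table is held in memory at once, which matters when each sample has about 23000 rows.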

 

 

On Wednesday, we will be doing a practical style test of your programming knowledge.  I will make a few problems available to you, and you can write code to solve them and cp your solutions to me for feedback (if you want).  There is no grade; it is just to let you see what you know.

 

