Brian Stanley
LIS452LE Final Project
Fall 2008

----------------------------------------------------------------------------------------------------------------
Poetry Web Application - Documentation
----------------------------------------------------------------------------------------------------------------

Contents:

1. Overview of the Project
2. XML Processing
   a. Tree-Based v. Stream-Based Parsers
   b. Tree-Based Parsing with Python's lxml module
   c. TEI encoding
   d. Stream-Based Parsing with PHP's XMLReader
3. Web Development with PHP
4. PHP v. Python
5. AJAX
6. Bibliography

----------------------------------------------------------------------------------------------------------------
1. Overview of the Project

I developed a web application for the comparative study of poetry editions. The application is powered by PHP, with AJAX and JavaScript components. The poems are encoded in TEI, with corresponding stanzas in the various editions linked through a numeric attribute associated with each stanza.

The application contains forms for the user to specify the poem and version to be displayed. Each stanza of the displayed poem contains a button labeled "compare". Clicking the button calls a JavaScript function, which uses AJAX to call a PHP function that retrieves the stanzas corresponding to the stanza making the call. These stanzas are delivered to the browser and displayed beneath the original stanza.

The display page also contains a checkbox where users can select whether or not to highlight changes in the retrieved stanzas compared to the original stanza. If the box is checked, the PHP function for retrieving the stanzas compares each retrieved line to the corresponding line in the stanza making the function call; if the line is different, it is returned highlighted in yellow.

----------------------------------------------------------------------------------------------------------------
2. XML Processing

a. Tree-Based v. Stream-Based Parsers

The bulk of the application consists of XML processing. PHP, like Python, includes several XML parsing modules to choose from. These parsers can be divided into two main categories: tree-based and stream-based (or event-based).

Tree-based parsers (such as DOM) parse an entire XML file and create a model of the XML tree in memory. This in-memory tree can then be navigated and modified in a variety of ways. When finished, the internal tree can be written to a file.

Stream-based parsers do not build an internal model of the XML tree that can then be manipulated. Rather, the parser reads and deals with the XML file only as the file is being streamed through the parser. Any processing of nodes in the XML file must occur as those nodes are being read. Among streaming parsers, a distinction can be made between push parsers and pull parsers. Push parsers, once they begin parsing, parse the entire XML file without stopping, "pushing" the parsed data to the application. Pull parsers only move through the document when told to do so. Thus parsed data must be "pulled" from the parser.

Tree-based and stream-based parsers have distinct pros and cons, as discussed in Robert Richards's book, Pro PHP XML and Web Services. Because tree-based parsers build an internal representation of the XML file, they offer maximum flexibility. The nodes can be processed in any order because they reside in memory: nodes appearing later in the document can be visited prior to earlier nodes; also, nodes can be revisited. However, this flexibility comes at a cost. Tree-based parsers, not surprisingly, use more memory than stream-based parsers, due to the construction of the tree in memory. Also, because a tree-based parser must recreate the entire XML file in memory before any processing can begin, tree-based parsers are slower than stream-based parsers.

Stream-based parsers are lightweight. Because they do not store the tree, they use little memory.
And because they process the file as they are reading it, they are very fast. Nevertheless, their functionality is limited compared to tree-based parsers. Most notably, because the tree is not stored in memory, nodes must be processed in the order they are read from the document. It is not possible to query a later node and then process an earlier node, unless one reads the entire file again. With a tree-based parser, by contrast, this would be a simple task. In addition, stream-based parsers are read-only; tree-based parsers are required in order to modify or create XML.

b. Tree-Based Parsing with Python's lxml module

Although my web application is written in PHP, I wrote a small program in Python to assist with TEI encoding. Of the four poems available in my web application, one (Edward FitzGerald's Rubaiyat of Omar Khayyam) is long (100+ stanzas). Manually encoding four editions of this poem and linking stanzas together through numeric attributes would have been tedious and time-consuming. Fortunately, a website exists (www.therubiayat.com) where an enthusiast of the poem has coded the four versions of the poem, side by side with corresponding stanzas aligned, into HTML tables. Each set of corresponding stanzas occupies a distinct table row, with one version's stanza in each cell of the row. This orderly arrangement makes it possible to extract the poem versions programmatically and encode them in TEI.

The Python script I designed is procedural. I used the tree-based lxml parser, a non-standard Python module, to complete the task. This decision was partly for the practical reason that I had used lxml in completing assignments 2 and 3, and was familiar with it. However, the nature of the task also dictated the choice of parser. One advantage of lxml is that it is capable of reading HTML (even poorly formed HTML) and storing it in memory as an XML tree.
Also, because I needed to create XML files, not just read them, a stream-based parser would have been inappropriate. Finally, the ability to use XPath in lxml made navigating the complex table and poem structure fairly simple, whereas this would have been difficult (though probably possible) with a stream-based parser.

c. TEI encoding

The Python script extracted the poem from the webpage and wrote each version of the poem to a separate, TEI-encoded XML file. The TEI structure consisted of line group nodes (<lg>) for each stanza of the poem. Within the <lg> node, a heading node stored the stanza number for that version of the poem, and line nodes (<l>) stored each line. The line nodes were numbered 1-4 in each stanza using the n attribute.

Perhaps the most critical aspect of the TEI markup was the n attribute for each line group. Unlike TEI's xml:id attribute, which must be unique, the n attribute need not be unique. In the present case, this numeric attribute serves to link each stanza with the stanzas corresponding to it in other versions of the poem. Since the ordering of stanzas often changed from one version to the next, stanza heading numbers cannot be used to link corresponding stanzas. Fortunately, because corresponding stanzas were horizontally aligned on the webpage the script was run on, the attribute value could be automatically derived from the order of table rows in the parsed webpage.

Because the stanzas were arranged on the webpage so that corresponding stanzas were aligned, the stanzas within individual versions of the poem were out of order. Left unfixed, this would have required a more complicated PHP script to display each version of the poem in order. (It would also have required a tree-based parser instead of the stream-based parser which was used.) Therefore, I designed a simple XSLT stylesheet to sort each TEI file by the stanza number, stored in the heading node.
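
The sorting step can be illustrated with a minimal Python sketch. This is not the XSLT stylesheet described above, which is not reproduced in this document; it swaps in Python's standard xml.etree.ElementTree module to show the same idea. The element name head for the stanza-number node, the enclosing div element, the Arabic stanza numbers, and the absence of a TEI namespace are all assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: n is the linking attribute taken from the table row
# order, while <head> (an assumed name) holds the stanza number within this
# version of the poem. The stanzas are out of order, as on the source webpage.
SAMPLE = """<div>
  <lg n="1"><head>1</head><l n="1">line from stanza one</l></lg>
  <lg n="2"><head>3</head><l n="1">line from stanza three</l></lg>
  <lg n="3"><head>2</head><l n="1">line from stanza two</l></lg>
</div>"""


def sort_stanzas(xml_text):
    """Return the parsed root with its <lg> children sorted by stanza number."""
    root = ET.fromstring(xml_text)
    # Replace the children in place with the same elements in sorted order.
    root[:] = sorted(root.findall("lg"), key=lambda lg: int(lg.findtext("head")))
    return root


if __name__ == "__main__":
    for lg in sort_stanzas(SAMPLE):
        print(lg.findtext("head"), "links to n =", lg.get("n"))
```

After sorting, each stanza keeps its n attribute, so the link to the corresponding stanzas in other versions is unaffected by the reordering.
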
Finally, I pasted the four files into a single XML file and added an appropriate TEI header to make the document valid.

d. Stream-Based Parsing with PHP's XMLReader

Based on Richards's discussion of the pros and cons of tree-based and stream-based XML parsers, I decided to employ a stream-based parser for the PHP script. There were several reasons for this decision. First of all, there was to be no writing of XML, only retrieving and displaying of stanzas. Second, because the web application generates pages on the fly at the time of the user's request, the speed of script execution was far more important than in the case of the offline file processing using Python's lxml. Finally, I valued the opportunity to work with a stream-based parser, to complement my experience with tree-based lxml.

PHP's XMLReader comes with several advantages over PHP's other stream-based parser, SAX, including faster processing (according to Richards) and simpler implementation. Also, XMLReader is a pull parser, while SAX is a push parser, meaning XMLReader gives the programmer more control. (This does have a downside; I found it tedious to have to continually write lines of code to move the parser forward.)

Compared to using a tree-based parser like lxml in Python, using a stream-based parser like PHP's XMLReader definitely feels like low-level programming. With lxml, you simply tell the parser where you want it to go; with XMLReader, you have to tell it how to get there. The stream-based parser does not know where it is in relation to the rest of the document. This made it necessary to use extensive nesting of if clauses. For example, with a tree-based parser, you could use XPath to collect a list of all <lg> elements with attribute n=20, then use a for loop to perform some operation (like printing the child nodes) on each element in the list.
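
The contrast between the two styles can be sketched in Python. This is only an illustration under stated assumptions: it uses the standard library's xml.etree.ElementTree (tree-based) and xml.dom.pulldom (a pull parser) in place of lxml and PHP's XMLReader, and the sample markup, with un-namespaced lg and l elements, is hypothetical rather than taken from the project's TEI files.

```python
import xml.etree.ElementTree as ET
from xml.dom import pulldom

# Hypothetical TEI-like fragment; real TEI files would also carry a namespace.
SAMPLE = """<text>
  <lg n="19"><l n="1">first line of stanza 19</l></lg>
  <lg n="20"><l n="1">first line of stanza 20</l><l n="2">second line of stanza 20</l></lg>
</text>"""


def lines_tree(xml_text, n):
    """Tree-based: one XPath-style query, then a loop over the matches."""
    root = ET.fromstring(xml_text)
    return [l.text for lg in root.findall(f".//lg[@n='{n}']") for l in lg.findall("l")]


def lines_stream(xml_text, n):
    """Pull-based: advance event by event and track position with flags."""
    lines, buffer = [], []
    in_target = False  # inside an <lg> whose n attribute matches
    in_line = False    # inside an <l> of that <lg>
    for event, node in pulldom.parseString(xml_text):
        if event == pulldom.START_ELEMENT:
            if node.tagName == "lg":
                in_target = node.getAttribute("n") == str(n)
            elif node.tagName == "l" and in_target:
                in_line, buffer = True, []
        elif event == pulldom.CHARACTERS and in_line:
            buffer.append(node.data)
        elif event == pulldom.END_ELEMENT:
            if node.tagName == "l" and in_line:
                lines.append("".join(buffer))
                in_line = False
            elif node.tagName == "lg":
                in_target = False
    return lines


if __name__ == "__main__":
    print(lines_tree(SAMPLE, 20))
    print(lines_stream(SAMPLE, 20))
```

Both functions return the same lines, but the pull-based version has to remember for itself which element it is inside, which is where the nested conditions come from.
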
With a stream-based parser, you have to tell the parser to start reading and, if it encounters an <lg> element, then check the n attribute, and if the attribute value is 20, then keep reading, and if it encounters an <l>, move forward to the text node and display the contents. Thus, what requires one for loop in a tree-based parser requires three levels of if clauses in a stream-based parser. This equates to bulkier code and more tedious programming.

Stream-based parsers are limited to moving forward through a document. Although I anticipated only needing to move forward through the TEI files, there turned out to be two situations where this was not the case. First of all, to output an HTML form with options for each version of the poem, it was necessary to cycle through the file and retrieve the versions (from the TEI node). Then, it was still necessary to cycle through the file again to read and display the current version. I used a nested function to retrieve the version information from within the display function. Secondly, if the highlighting box is checked, the script must not only retrieve the corresponding stanzas but also compare each line to the corresponding line of the current version. Therefore it must first retrieve the lines of the current version. Unless the current version is the first version in the document, the script must cycle through, collect the lines, then cycle through again to retrieve the corresponding stanzas and compare them to the collected lines. Again, I used a nested function to achieve this.

Given the speed benefits of stream-based parsing, this double pass through the file was still almost certainly faster than using a tree-based parser (which by its very nature has to make a double pass through every document, once to parse and store it in memory, and then again to actually process it).

----------------------------------------------------------------------------------------------------------------
3. Web Development with PHP

Constructing dynamic web pages with PHP is convenient and straightforward. To create HTML, you can either insert HTML outside of the PHP script tags, or you can use PHP's echo command to print the HTML from within the script. The former approach seems sensible when most of the page is static with a few dynamic areas, while the latter seems more sensible when (as in this application) the majority of the page is dynamically generated. PHP variables and values can be intermixed with the HTML to produce dynamic content. If one is familiar with HTML and familiar with PHP, there is little to no learning curve involved in combining them.

Arguments can be passed from the user's click to the PHP script using the GET or POST method. Since I was not passing private data, I used the GET method, which encodes the arguments into the URL as key-value pairs. For example, if a form field named "version" with a user-selected value of "manuscript" is submitted to the PHP script called sample.php, the URL will read:

http://www.sample.php?version=manuscript

Multiple key-value pairs can be separated by ampersands:

http://www.sample.php?version=manuscript&highlighting=on

The arguments can be retrieved by the PHP script using the syntax $_GET["version"] and $_GET["highlighting"]. Once retrieved, the arguments can be manipulated or passed to functions just as in a non-web-based program.

For this application, the PHP script needed not only to receive arguments from the webpage, but also to insert them there in the first place. This was done by having the script create forms, JavaScript buttons, and div ids.

For selecting the poem and poem version, forms were used. The PHP script created the forms and populated them with the appropriate choices. When a user selects an option from the form menu, that particular value is passed back to the script via the GET method, and the script displays the new poem or new poem version.
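
The key-value encoding used by the GET method can be sketched with Python's standard urllib.parse module. This is an illustration only, not the PHP code of the application: the helper names build_url and read_args are hypothetical, and the dictionary returned by read_args merely plays the role that $_GET plays in PHP.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_url(script_url, **args):
    """Append the arguments to the script URL as key=value pairs joined by '&'."""
    return f"{script_url}?{urlencode(args)}"


def read_args(url):
    """Recover the key-value pairs from the query string of a URL."""
    query = urlsplit(url).query
    # parse_qs returns a list of values per key; keep the first value of each.
    return {key: values[0] for key, values in parse_qs(query).items()}


if __name__ == "__main__":
    url = build_url("http://www.sample.php", version="manuscript", highlighting="on")
    print(url)
    print(read_args(url)["version"])
```
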
For comparing stanzas, each stanza includes a "compare" button. This button, when clicked, calls a JavaScript function which initiates the process of retrieving the corresponding stanzas. For the appropriate stanzas to be retrieved, the button must pass appropriate arguments to the function, including the poem name, current version, and numeric id of the stanza. These values are inserted into the button code by the PHP script when the page is initially generated, using variables in the script. In addition, when corresponding stanzas are retrieved, the script must know where on the page to place them. This is accomplished by inserting the stanza's numeric id as an id attribute of the div box for that stanza, allowing the script to target that particular box.

In short, the PHP script creates and encodes the values which the user, by clicking a button or selecting a form option, passes back to the script as arguments for performing some new operation.

----------------------------------------------------------------------------------------------------------------
4. PHP v. Python

Developing with PHP did not differ significantly from developing with Python. The primary differences were ones of syntax. Whereas Python relies on line breaks and indentation to distinguish different portions of code, PHP relies on specific character markers. Variables must be preceded by $. For if-else and similar constructions, the expression being tested must be in parentheses (which are optional in Python). Nesting is indicated by enclosing an entire sequence of code within curly brackets, {}, instead of by indentation as in Python. The end of any particular statement must be indicated by a semicolon, whereas Python uses line breaks. Python code is therefore much cleaner than PHP code; Python deserves its reputation as an elegant language.

----------------------------------------------------------------------------------------------------------------
5. AJAX

The application also relies on AJAX. When a user clicks a "compare" button, AJAX is used to send a request to a PHP script to retrieve the appropriate stanzas; the PHP script returns the stanzas to the AJAX function, which passes them back to the webpage. This process occurs without reloading the webpage. Consequently it is less intrusive for the user than a normal server-side script (which requires reloading the page). I deemed AJAX functionality to be important for this application because a user who clicks to compare a stanza does not wish the entire poem to be reloaded. Such a reload wastes time, could be confusing, and, depending on the browser, might revert to the top of the page, losing the user's place in the poem.

AJAX, which stands for Asynchronous JavaScript and XML, is not a distinct scripting language but a way of using scripting languages. It uses features inherent in JavaScript to create an XMLHttpRequest object, which sends a request to a script on the server and returns the results of that request to the browser.

While I worried that AJAX would add a layer of complexity to the development of this application, it turned out to be one of the simplest components. Because the protocol for making an XMLHttpRequest is the same or similar regardless of the nature of the request, the AJAX code could mainly be copied and adapted from existing code on the web. There were only two customized aspects. First, it was necessary to pass arguments from the webpage and use them to construct an appropriate URL with which the AJAX function could make its request of the PHP script and provide it with appropriate GET arguments. Second, in order for the AJAX function to know where to display the retrieved stanzas, an appropriate div id value also had to be passed to the function as an argument.

----------------------------------------------------------------------------------------------------------------
6. Bibliography

Pro PHP XML and Web Services
By Robert Richards
Published by Apress, 2006
ISBN 1590596331, 9781590596333
919 pages

XML and PHP
By Vikram Vaswani
Published by Sams Publishing, 2002
ISBN 0735712271, 9780735712270
358 pages

Programming PHP
By Rasmus Lerdorf, Kevin Tatroe, Peter MacIntyre
Published by O'Reilly, 2006
ISBN 0596006810, 9780596006815
521 pages

Beginning Ajax with PHP: From Novice to Professional
By Lee Babin
Published by Apress, 2007
ISBN 1590596676, 9781590596678
253 pages

Ajax with PHP 5
By Andrew G Curioso
Published by O'Reilly, 2007
ISBN 0596514034, 9780596514037