XML Pocket Reference
Second Edition

Robert Eckstein
with Michel Casabianca

Beijing • Cambridge • Farnham • Köln • Paris • Sebastopol • Taipei • Tokyo
XML Pocket Reference, Second Edition
by Robert Eckstein with Michel Casabianca

Copyright © 2001, 1999 O’Reilly & Associates, Inc. All rights reserved.
Printed in the United States of America.

Published by O’Reilly & Associates, Inc., 101 Morris Street, Sebastopol, CA 95472.

Editor: Ellen Siever
Production Editor: Jeffrey Holcomb
Cover Designer: Hanna Dyer

Printing History:
October 1999: First Edition
April 2001: Second Edition

Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of O’Reilly & Associates, Inc. The use of the image of the peafowl in association with XML is a trademark of O’Reilly & Associates, Inc. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly & Associates, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher assumes no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

0-596-00133-9 [C] [5/01]
Table of Contents

Introduction
XML Terminology
Unlearning Bad Habits
An Overview of an XML Document
A Simple XML Document
A Simple Document Type Definition (DTD)
A Simple XSL Stylesheet
XML Reference
Well-Formed XML
Special Markup
Element and Attribute Rules
XML Reserved Attributes
Entity and Character References
Document Type Definitions
Element Declarations
ANY and PCDATA
Entities
Attribute Declarations in the DTD
Included and Ignored Sections
The Extensible Stylesheet Language
Formatting Objects
XSLT Stylesheet Structure
Templates and Patterns
Parameters and Variables
Stylesheet Import and Rules of Precedence
Loops and Tests
Numbering Elements
Output Method
XSLT Elements
XPath
Axes
Predicates
Functions
Additional XSLT Functions and Types
XPointer and XLink
Unique Identifiers
ID References
XPointer
XLink
Building Extended Links
XBase
XML Pocket Reference

Introduction

The Extensible Markup Language (XML) is a document-processing standard that is an official recommendation of the World Wide Web Consortium (W3C), the same group responsible for overseeing the HTML standard. Many expect XML and its sibling technologies to become the markup language of choice for dynamically generated content, including nonstatic web pages. Many companies are already integrating XML support into their products.

XML is actually a simplified form of Standard Generalized Markup Language (SGML), an international documentation standard that has existed since the 1980s. However, SGML is extremely complex, especially for the Web. Much of the credit for XML’s creation can be attributed to Jon Bosak of Sun Microsystems, Inc., who started the W3C working group responsible for scaling down SGML to a form more suitable for the Internet.

Put succinctly, XML is a metalanguage that allows you to create and format your own document markups. With HTML, existing markup is static: its elements are tightly integrated into the HTML standard and cannot be changed or extended. XML, on the other hand, allows you to create your own markup tags and configure each to your liking. Each of these elements can be defined through your own document type definitions and stylesheets and applied to one or more XML documents. XML schemas provide another way to define elements. Thus, it is important to realize that there are no “correct” tags for an XML document, except those you define yourself.

While many XML applications currently support Cascading Style Sheets (CSS), a more extensible stylesheet specification exists, called the Extensible Stylesheet Language (XSL). With XSL, you ensure that XML documents are formatted the same way no matter which application or platform they appear on.

XSL consists of two parts: XSLT (transformations) and XSL-FO (formatting objects). Transformations, as discussed in this book, allow you to work with XSLT and convert XML documents to other formats such as HTML. Formatting objects are described briefly in the section “Formatting Objects.”

This book offers a quick overview of XML, as well as some sample applications that allow you to get started in coding. We won’t cover everything about XML. Some XML-related specifications are still in flux as this book goes to print. However, after reading this book, we hope that the components that make up XML will seem a little less foreign.
XML Terminology

Before we move further, we need to standardize some terminology. An XML document consists of one or more elements. An element is marked with the following form:

    <Body>This is text formatted according to the Body element</Body>

This element consists of two tags: an opening tag, which places the name of the element between a less-than sign (<) and a greater-than sign (>), and a closing tag, which is identical except for the forward slash (/) that appears before the element name. Like HTML, the text between the opening and closing tags is considered part of the element and is processed according to the element’s rules.

Elements can have attributes applied. For example, an element whose content is 25.43 might carry currency="Euro" in its opening tag. Here, the attribute is specified inside of the opening tag and is called currency. It is given a value of Euro, which is placed inside quotation marks. Attributes are often used to further refine or modify the default meaning of an element.

In addition to the standard elements, XML also supports empty elements. An empty element has no text between the opening and closing tags. Hence, both tags can (optionally) be combined by placing a forward slash before the closing marker. For example, an opening tag immediately followed by its closing tag is identical to a single tag that ends with />. Empty elements are often used to add nontextual content to a document or provide additional information to the application that parses the XML.

Note that while the closing slash may not be used in single-tag HTML elements, it is mandatory for single-tag XML empty elements.
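To make the terminology concrete, here is a minimal sketch that feeds an element with an attribute, and both spellings of an empty element, to a parser. It assumes Python's standard xml.etree.ElementTree module as the XML processor; the element names Price and Marker are hypothetical, and only the currency="Euro" attribute and the 25.43 content come from the text above.

```python
import xml.etree.ElementTree as ET

# Hypothetical element name; the attribute and content are from the text.
price = ET.fromstring('<Price currency="Euro">25.43</Price>')
print(price.tag, price.get("currency"), price.text)

# Two spellings of the same (hypothetical) empty element.
long_form = ET.fromstring("<Marker></Marker>")
short_form = ET.fromstring("<Marker/>")

# Both parse to an element with no text and no children.
same = ET.tostring(long_form) == ET.tostring(short_form)
print(same)
```

The parser reports the attribute as a name/value pair attached to the element, and it treats the two empty-element spellings as equivalent.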
Unlearning Bad Habits

Whereas HTML browsers often ignore simple errors in documents, XML applications are not nearly as forgiving. For the HTML reader, there are a few bad habits from which we should dissuade you:

XML is case-sensitive
    Element names must be used exactly as they are defined. For example, two names that differ only in capitalization are not the same element.

Attribute values must be in quotation marks
    You can’t leave an attribute value unquoted, an error that HTML browsers often overlook. An attribute value must always be inside single or double quotation marks, or else the XML parser will flag it as an error.

A non-empty element must have an opening and a closing tag
    Each element that specifies an opening tag must have a closing tag that matches it. If it does not, and it is not an empty element, the XML parser generates an error. In other words, you cannot open one paragraph element, write its text, and then open the next paragraph without closing the first. Instead, you must have an opening and a closing tag for each paragraph element.

Tags must be nested correctly
    It is illegal to close an outer element while an element opened inside it is still open. The closing tag for the inner element should be inside the closing tag for the outer element, to match the nearest opening tag and preserve the correct element nesting. It is essential for the application parsing your XML to process the hierarchy of the elements.

These syntactic rules are the source of many common errors in XML, especially because some of this behavior can be ignored by HTML browsers. An XML document adhering to these rules (and a few others that we’ll see later) is said to be well-formed.
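The four habits above can be checked mechanically. The following sketch assumes Python's standard xml.etree.ElementTree module as the parser; the helper name is_well_formed and all sample element names are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text, False if it rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# One hypothetical sample per bad habit, plus a corrected document.
samples = {
    "case mismatch": "<Para>text</para>",
    "unquoted attribute": "<img width=100/>",
    "missing closing tag": "<doc><p>one<p>two</doc>",
    "bad nesting": "<a><b>text</a></b>",
    "corrected": '<doc><p width="100">one</p><p>two</p></doc>',
}
results = {label: is_well_formed(xml) for label, xml in samples.items()}

for label, ok in results.items():
    print(label, "->", "accepted" if ok else "rejected")
```

Only the corrected sample is accepted; an HTML browser would probably render most of the others without complaint.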
An Overview of an XML Document

Generally, two files are needed by an XML-compliant application to use XML content:

The XML document
    This file contains the document data, typically tagged with meaningful XML elements, any of which may contain attributes.

Document Type Definition (DTD)
    This file specifies rules for how the XML elements, attributes, and other data are defined and logically related in the document.

Additionally, another type of file is commonly used to help display XML data: the stylesheet. The stylesheet dictates how document elements should be formatted when they are displayed. Note that you can apply different stylesheets to the same document, depending on the environment, thus changing the document’s appearance without affecting any of the underlying data. The separation between content and formatting is an important distinction in XML.
A Simple XML Document

Example 1 shows a simple XML document.

Example 1. sample.xml
Let’s look at this example line by line. In the first line, the code between <?xml and ?> is called an XML declaration. This declaration contains special information for the XML processor (the program reading the XML), indicating that this document conforms to Version 1.0 of the XML standard and uses UTF-8 (Unicode optimized for ASCII) encoding.

The second line is the document type declaration (<!DOCTYPE>). It points out the root element of the document, as well as the DTD validating each of the document elements that appear inside the root element. The root element is the outermost element in the document that the DTD applies to; it typically denotes the document’s starting and ending point. In this example, the outermost element, which encloses the text XML Pocket Reference and 12.95, serves as the root element of the document. The SYSTEM keyword denotes that the DTD of the document resides in an external file named sample.dtd. On a side note, it is possible to simply embed the DTD in the same file as the XML document. However, this is not recommended for general use because it hampers reuse of DTDs.

Following that line is a comment. Comments always begin with <!-- and end with -->. You can write whatever you want inside comments; they are ignored by the XML processor. Be aware, however, that comments cannot come before the XML declaration and cannot appear inside an element tag. For example, it is illegal to place a comment between the < and > of a tag.

Finally, the remaining elements are XML elements we invented. Like most elements in XML, they hold no special significance except for whatever document rules we define for them. Note that these elements look slightly different than those you may have seen previously because we are using namespaces. Each element tag can be divided into two parts. The portion before the colon (:) identifies the tag’s namespace; the portion after the colon identifies the name of the tag itself.

Let’s discuss some XML terminology. The two elements that hold XML Pocket Reference and 12.95 would both consider the element that encloses them their parent.
In the same manner, elements can be grandparents and grandchildren of other elements. However, we typically abbreviate multiple levels by stating that an element is either an ancestor or a descendant of another element.

Namespaces

Namespaces were created to ensure uniqueness among XML elements. They are not mandatory in XML, but it’s often wise to use them. For example, let’s pretend that one of our elements had a plain name with no namespace prefix. When you think about it, it’s not out of the question that another publisher would create its own element with the same name in its own XML documents. If the two publishers combined their documents, resolving a single (correct) definition for the tag would be impossible. When two XML documents containing identical elements from different sources are merged, those elements are said to collide. Namespaces help to avoid element collisions by scoping each tag.

In Example 1, we scoped each tag with the OReilly namespace. Namespaces are declared using the xmlns:something attribute, where something defines the prefix of the namespace. The attribute value is a unique identifier that differentiates this namespace from all other namespaces; the use of a URI is recommended. In this case, we use the O’Reilly URI http://www.oreilly.com as the namespace identifier, which should guarantee uniqueness. A namespace declaration can appear as an attribute of any element, in which case the namespace remains in scope between that element’s opening and closing tags. You are allowed to define more than one namespace in the context of an element.

If you do not specify a name after the xmlns prefix, the namespace is dubbed the default namespace and is applied to all elements inside the defining element that do not use a namespace prefix of their own. A default namespace represented by the URI http://www.oreilly.com, for instance, is applied to every unprefixed element inside the defining element. However, it is not applied to an element that carries a prefix bound to its own namespace.

Finally, you can set the default namespace to an empty string (xmlns=""). This ensures that there is no default namespace in use within a specific element: the element carrying that declaration and the unprefixed elements inside it have no default namespace.
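The scoping rules for default and prefixed namespaces are easier to see when a parser reports the fully qualified name of each element. The sketch below assumes Python's standard xml.etree.ElementTree module, which writes qualified names as {URI}localname. The element names, the other prefix, and the http://example.com/other URI are hypothetical; only http://www.oreilly.com comes from the text.

```python
import xml.etree.ElementTree as ET

doc = """
<Books xmlns="http://www.oreilly.com" xmlns:other="http://example.com/other">
  <Title>XML Pocket Reference</Title>
  <other:Code>18231</other:Code>
  <Notes xmlns=""><Remark>no default namespace here</Remark></Notes>
</Books>
"""
root = ET.fromstring(doc)

# iter() walks the tree in document order, starting with the root itself.
tags = [element.tag for element in root.iter()]
for tag in tags:
    print(tag)
```

Unprefixed elements pick up the default namespace, the prefixed element keeps its own, and everything from the xmlns="" declaration inward is reported with no namespace at all.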
A Simple Document Type Definition (DTD)

Example 2 creates a simple DTD for our XML document.

Example 2. sample.dtd

The purpose of this DTD is to declare each of the elements used in our XML document. All document-type data is placed inside a construct that begins with the characters <! and ends with >. Each <!ELEMENT> construct declares a valid element for our XML document. With the second line, we’ve specified that the root element is valid. The parentheses group together the required child elements for the root element. In this case, the two child elements must be included inside our root element tags, and they must appear in the order specified. These two elements are therefore considered children of the root element.

Likewise, the two child elements are declared in our DTD. Again, parentheses specify required elements. In this case, they both have a single requirement, represented by #PCDATA. This is shorthand for parsed character data, which means that any characters are allowed, as long as they do not include other element tags or contain the characters < or &, or the sequence ]]>. These characters are forbidden because they could be interpreted as markup. (We’ll see how to get around this shortly.)

The <!ATTLIST> line indicates that the namespace attribute of the root element defaults to the URI associated with O’Reilly & Associates if no other value is explicitly specified in the element.

The XML data shown in Example 1 adheres to the rules of this DTD: it contains a root element, which in turn contains the two declared child elements (in that order). Therefore, if this DTD is applied to the data with a <!DOCTYPE> statement, the document is said to be valid.
A Simple XSL Stylesheet

XSL allows developers to describe transformations using XSL Transformations (XSLT), which can convert XML documents into XSL Formatting Objects, HTML, or other textual output. As this book goes to print, the XSL Formatting Objects specification is still changing; therefore, this book covers only the XSLT portion of XSL. The examples that follow, however, are consistent with the W3C specification.

Let’s add a simple XSL stylesheet to the example:
The first thing you might notice when you look at an XSL stylesheet is that it is formatted in the same way as a regular XML document. This is not a coincidence. By design, XSL stylesheets are themselves XML documents, so they must adhere to the same rules as well-formed XML documents.

Breaking down the pieces, you should first note that all XSL elements must be contained in the appropriate <xsl:stylesheet> outer element. This tells the XSLT processor that it is describing stylesheet information, not XML content itself. After the opening <xsl:stylesheet> tag, we see an XSLT directive to optimize output for HTML. Following that are the rules that will be applied to our XML document, given by the <xsl:template> elements (in this case, there is only one rule).

Each rule can be further broken down into two items: a template pattern and a template action. Consider the line:

    <xsl:template match="/">

This line forms the template pattern of the stylesheet rule. Here, the target pattern is the document root, as designated by match="/". The / is shorthand for the root of the XML document, the node that contains the root element.

The contents of the <xsl:template> element specify the template action that should be performed on the target. In this case, we see the empty <xsl:apply-templates/> element located inside an HTML element that sets the font size. When the XSLT processor transforms the target, every element inside the root is surrounded by those font tags, which will likely cause the application formatting the output to increase the font size.

In our initial XML example, the two child elements are both enclosed inside the root element’s tags. Therefore, the font size will be applied to the contents of those tags. Example 3 displays a more realistic example.
Example 3. sample.xsl
In this example, we target the root element, printing the word Books: before it in a larger font size. In addition, the first child element applies the default font size to each of its children, and the price element uses a slightly larger font size to display its children, overriding the default size of its parent, the root element. (Of course, neither one has any child elements; they simply have text between their tags in the XML document.) The text Price: $ will precede each of the price element’s children, and the characters + tax will come after it, formatted accordingly.

Here is the result after we pass sample.xsl through an XSLT processor:

    Books:
    XML Pocket Reference
    Price $12.95 + tax

And that’s it: everything needed for a simple XML document! Running the result through an HTML browser, you should see something similar to Figure 1.
Figure 1. Sample XML output

XML Reference

Now that you have had a quick taste of working with XML, here is an overview of the more common rules and constructs of the XML language.
Well-Formed XML

These are the rules for a well-formed XML document:

• All element attribute values must be in quotation marks.
• An element must have both an opening and a closing tag, unless it is an empty element.
• If a tag is a standalone empty element, it must contain a closing slash (/) before the end of the tag.
• All opening and closing element tags must nest correctly.
• Isolated markup characters are not allowed in text; < or & must use entity references. In addition, the sequence ]]> must be expressed as ]]&gt; when used as regular text. (Entity references are discussed in further detail later.)
• Well-formed XML documents without a corresponding DTD must have all attributes of type CDATA by default.
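The rule about isolated markup characters is the one most often broken by generated content. As a sketch, assuming Python's standard library (xml.sax.saxutils.escape to produce the entity references and xml.etree.ElementTree to parse the result), the following shows raw text being rejected and its escaped form surviving a round trip. The code element name and the sample string are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "if a < b && c then ]]> end"

# escape() replaces &, <, and > with &amp;, &lt;, and &gt;.
escaped = escape(raw)
round_trip = ET.fromstring("<code>" + escaped + "</code>").text

# The same text placed in the document unescaped is not well-formed.
try:
    ET.fromstring("<code>" + raw + "</code>")
    raw_is_well_formed = True
except ET.ParseError:
    raw_is_well_formed = False

print(escaped)
print(round_trip == raw, raw_is_well_formed)
```

Escaping every > is more than the rules require (only the > in ]]> must be escaped), but it is harmless and keeps the helper simple.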
Special Markup

XML uses the following special markup constructs.

Although they are not required to, XML documents typically begin with an XML declaration, which must start with the characters <?xml and end with the characters ?>. Attributes include:

version
    The version attribute specifies the correct version of XML required to process the document, which is currently 1.0. This attribute cannot be omitted.

encoding
    The encoding attribute specifies the character encoding used in the document (e.g., UTF-8 or iso-8859-1). UTF-8 and UTF-16 are the only encodings that an XML processor is required to handle. This attribute is optional.

standalone
    The optional standalone attribute specifies whether an external DTD is required to parse the document. The value must be either yes or no (the default). If the value is no or the attribute is not present, the document may rely on an external DTD declared with a <!DOCTYPE> instruction. If it is yes, no external DTD is required.

A processing instruction allows developers to place information specific to an outside application within the document. Processing instructions always begin with the characters <? and end with the characters ?>. You can create your own processing instructions if the XML application processing the document is aware of what the data means and acts accordingly.

The <!DOCTYPE> instruction allows you to specify a DTD for an XML document. This instruction currently takes one of two forms:

SYSTEM
    The SYSTEM variant specifies the URI location of a DTD for private use in the document.

PUBLIC
    The PUBLIC variant is used in situations in which a DTD has been publicized for widespread use. In these cases, the DTD is assigned a unique name, which the XML processor may use by itself to attempt to retrieve the DTD. If this fails, the URI is used. Public DTDs follow a specific naming convention. See the XML specification for details on naming public DTDs.

You can place comments anywhere in an XML document, except within element tags or before the initial XML processing instructions. Comments in an XML document always start with the characters <!-- and end with the characters -->. In addition, they may not include double hyphens within the comment. The contents of the comment are ignored by the XML processor.
CDATA

You can define special sections of character data, or CDATA, which the XML processor does not attempt to interpret as markup. Anything included inside a CDATA section is treated as plain text. CDATA sections begin with the characters <![CDATA[ and end with the characters ]]>. For example:

    <![CDATA[ ... tag of documents 5 & 6: "Sales" and "Profit and Loss". Luckily, the XML processor won’t apply rules of formatting to these sentences! ]]>

Note that entity references inside a CDATA section will not be expanded.
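A short sketch shows what "treated as plain text" means in practice. It assumes Python's standard xml.etree.ElementTree module as the processor; the note element and the sample sentence are hypothetical.

```python
import xml.etree.ElementTree as ET

# Inside the CDATA section, <, &, and even &lt; are ordinary characters.
doc = (
    "<note><![CDATA[Documents 5 & 6 use <Sales>, "
    "and &lt; stays exactly as typed]]></note>"
)
text = ET.fromstring(doc).text
print(text)
```

The parser hands back the section's content character for character: the angle brackets do not start a tag, and the &lt; reference is not expanded.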
Element and Attribute Rules

An element is either bound by its start and end tags or is an empty element. Elements can contain text, other elements, or a combination of both. For example, a chapter might contain a title and multiple paragraphs, and a paragraph might contain text and emphasis elements.

An element name must start with a letter or an underscore. It can then have any number of letters, numbers, hyphens, periods, or underscores in its name. Elements are case-sensitive: three names that differ only in capitalization are considered three different element types.

Element type names may not start with the string xml in any variation of upper- or lowercase. Names beginning with xml are reserved for special uses by the W3C XML Working Group. Colons (:) are permitted in element type names only for specifying namespaces; otherwise, colons are forbidden. For example, <_Budget> is legal, whereas <205Para> is illegal because it starts with a number. A name is likewise illegal if it has a space, contains an @ character, or starts with xml.

Element type names can also include accented Roman characters, letters from other alphabets (e.g., Cyrillic, Greek, Hebrew, Arabic, Thai, Hiragana, Katakana, or Devanagari), and ideograms from the Chinese, Japanese, and Korean languages. Valid element type names can therefore be written in any of these scripts.

If you use a DTD, the content of an element is constrained by its DTD declaration. Better XML applications inform you which elements and attributes can appear inside a specific element. Otherwise, you should check the element declaration in the DTD to determine the exact semantics.

Attributes describe additional information about an element. They always consist of a name and a value, as in currency="Euro". The attribute value is always quoted, using either single or double quotes. Attribute names are subject to the same restrictions as element type names.
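The naming rules above can be approximated in a few lines. This is only a rough sketch, not the XML specification's exact grammar: it assumes Python's str.isalpha and str.isalnum are a close enough stand-in for "letters" and "numbers", and it rejects colons outright instead of handling namespace prefixes. The function name is_legal_name and every test name except _Budget and 205Para are hypothetical.

```python
def is_legal_name(name: str) -> bool:
    """Approximate check of the element-name rules described in the text."""
    if not name or name.lower().startswith("xml"):
        return False  # empty, or reserved xml prefix in any letter case
    first, rest = name[0], name[1:]
    if not (first.isalpha() or first == "_"):
        return False  # must start with a letter or an underscore
    # Remaining characters: letters, numbers, hyphens, periods, underscores.
    return all(ch.isalnum() or ch in "-._" for ch in rest)

candidates = ["_Budget", "205Para", "Punch line", "repair@log", "XMLnote", "Größe"]
verdicts = {name: is_legal_name(name) for name in candidates}

for name, ok in verdicts.items():
    print(repr(name), "legal" if ok else "illegal")
```

A validating parser remains the authority; a helper like this is mainly useful for catching bad names before a document is generated.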
XML Reserved Attributes

The following are reserved attributes in XML.

xml:lang

    xml:lang="iso_639_identifier"

The xml:lang attribute can be used on any element. Its value indicates the language of the body of the element. This is useful in a multilingual context. For example, you might have one element marked xml:lang="en" that contains Hello and another marked xml:lang="fr" that contains Bonjour. This format allows you to display one element or the other, depending on the user’s language preference.

The syntax of the xml:lang value follows the IETF language tags of RFC 1766, which build on ISO 639. A two-letter language code is optionally followed by a hyphen and a two-letter country code from ISO 3166. Traditionally, the language is given in lowercase and the country in uppercase (and for safety, this rule should be followed), but processors are expected to use the values in a case-insensitive manner. In addition, the i- and x- prefixes provide extensions for nonstandardized languages or language variants. Valid xml:lang values include notations such as en, en-US, en-GB, en-cockney, i-navajo, and x-minbari.

xml:space

    xml:space="default|preserve"

The xml:space attribute indicates whether any whitespace inside the element is significant and should not be altered by the XML processor. The attribute can take one of two enumerated values:

preserve
    The XML application preserves all whitespace (newlines, spaces, and tabs) present within the element.

default
    The XML processor uses its default processing rules when deciding to preserve or discard the whitespace inside the element.

You should set xml:space to preserve only if you want an element to behave like the HTML <pre> element, such as when it documents source code.
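Selecting an element by language might look like the following sketch. It assumes Python's standard xml.etree.ElementTree module, which exposes the reserved xml: prefix through the namespace URI http://www.w3.org/XML/1998/namespace. The greetings and para element names and the pick_text helper are hypothetical; the Hello and Bonjour content comes from the text.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is always bound to this namespace URI.
XML_NS = "http://www.w3.org/XML/1998/namespace"

doc = """
<greetings>
  <para xml:lang="en">Hello</para>
  <para xml:lang="fr">Bonjour</para>
</greetings>
"""
root = ET.fromstring(doc)

def pick_text(parent, lang):
    """Return the text of the first child whose xml:lang matches, ignoring case."""
    for child in parent:
        if child.get("{%s}lang" % XML_NS, "").lower() == lang.lower():
            return child.text
    return None

print(pick_text(root, "en"), pick_text(root, "FR"), pick_text(root, "de"))
```

The comparison is case-insensitive, in line with the expectation stated above; a fuller implementation would also fall back from a tag such as en-US to en.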
Entity and Character References

Entity references are used as substitutions for specific characters (or any string substitution) in XML. A common use for entity references is to denote document symbols that might otherwise be mistaken for markup by an XML processor. XML predefines five entity references for you, which are substitutions for basic markup symbols. However, you can define as many entity references as you like in your own DTD. (See the next section.)

Entity references always begin with an ampersand (&) and end with a semicolon (;). They cannot appear inside a CDATA section but can be used anywhere else. Predefined entities in XML are shown in the following table:

    Entity    Char    Notes
    &amp;     &       Do not use inside processing instructions.
    &lt;      <       Use inside attribute values quoted with ".
    &gt;      >       Use after ]] in normal text and inside processing instructions.
    &quot;    "       Use inside attribute values quoted with ".
    &apos;    '       Use inside attribute values quoted with '.

In addition, you can provide character references for Unicode characters with a numeric character reference. A decimal character reference consists of the string &#, followed by the decimal number representing the character, and finally, a semicolon (;). For hexadecimal character references, the string &#x is followed first by the hexadecimal number representing the character and then a semicolon. For example, to represent the copyright character, you could use either of the following lines:

    This document is &#169; 2001 by O’Reilly and Assoc.
    This document is &#xA9; 2001 by O’Reilly and Assoc.

The character reference is replaced with the “circled-C” (©) copyright character when the document is formatted.
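To see these substitutions happen, the sketch below parses text containing a decimal reference, a hexadecimal reference, and the five predefined entities. It assumes Python's standard xml.etree.ElementTree module as the processor; the p element name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Decimal and hexadecimal references to the same character (169 == 0xA9).
decimal = ET.fromstring("<p>This document is &#169; 2001</p>").text
hexadecimal = ET.fromstring("<p>This document is &#xA9; 2001</p>").text

# The five predefined entity references need no DTD.
predefined = ET.fromstring(
    "<p>&lt;tag&gt; &amp; &quot;text&quot; &apos;x&apos;</p>"
).text

print(decimal)
print(decimal == hexadecimal)
print(predefined)
```

Both numeric forms yield the same copyright character, and the predefined entities come back as the literal markup characters they stand for.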
Document Type Definitions

A DTD specifies how elements inside an XML document should relate to each other. It also provides grammar rules for the document and each of its elements. A document adhering to the XML specifications and the rules outlined by its DTD is considered to be valid. (Don’t confuse this with a well-formed document, which adheres only to the XML syntax rules outlined earlier.)
Element Declarations

You must declare each of the elements that appear inside your XML document within your DTD. You can do so with the <!ELEMENT> declaration, which uses this format:

    <!ELEMENT elementname rule>

This declares an XML element and an associated rule called a content model, which relates the element logically to the XML document. The element name should not include < > characters. An element name must start with a letter or an underscore. After that, it can have any number of letters, numbers, hyphens, periods, or underscores in its name. Element names may not start with the string xml in any variation of upper- or lowercase. You can use a colon in element names only if you use namespaces; otherwise, it is forbidden.
ANY and PCDATA

The simplest element declaration states that between the opening and closing tags of the element, anything can appear; it uses the ANY keyword as its rule. The ANY keyword allows you to include other valid tags and general character data within the element.

However, you may want to specify a situation where you want only general characters to appear. This type of data is better known as parsed character data, or PCDATA. You can specify that an element contain only PCDATA by using (#PCDATA) as its rule. Remember, this declaration means that any character data that is not an element can appear between the element tags. Therefore, it’s legal for such an element in your XML document to be empty or to contain plain text such as XML Pocket Reference or Java Network Programming. However, it is illegal with the previous PCDATA declaration for the element to contain the text XML followed by a child element that wraps the words Pocket Reference.

On the other hand, you may want to specify that another element must appear between the two tags specified. You can do this by placing the name of the element in the parentheses. For example, one rule can state that an outer element must contain a particular child element, and a second rule can state that the child element must contain parsed character data (or null content) but not another element.
Multiple sequences

If you wish to dictate that multiple elements must appear in a specific order between the opening and closing tags of a specific element, you can use a comma (,) to separate the two instances. In such a declaration, the DTD states that within the opening and closing tags of the outer element, there must first appear one child element consisting of parsed character data. It must be immediately followed by a second child element containing parsed character data. The second element cannot precede the first.

A valid XML document for such a DTD excerpt would contain, for instance, a first element holding XML Pocket Reference, Second Edition immediately followed by a second element holding Robert Eckstein with Michel Casabianca.

The previous example showed how to specify both elements in a declaration. You can just as easily specify that one or the other appear (but not both) by using the vertical bar (|). Such a declaration states that either the first element or the second element can appear inside the outer element. Note that it must have one or the other. If you omit both elements or include both elements, the XML document is not considered valid. You can, however, use a recurrence operator to allow such an element to appear more than once. Let’s talk about that now.
Grouping and recurrence

You can nest parentheses inside your declarations to give finer granularity to the syntax you’re specifying. For example, a DTD can state that inside an outer element, the XML document must contain either one element alone, or a second element immediately followed by a third. All three elements must consist of parsed character data.

Now for the fun part: you are allowed to dictate inside an element declaration whether a single element (or a grouping of elements contained inside parentheses) must appear zero or one times, one or more times, or zero or more times. The characters used for this appear immediately after the target element (or element grouping) that they refer to and should be familiar to Unix shell programmers. Occurrence operators are shown in the following table:

    Operator    Description
    ?           Must appear once or not at all (zero or one times)
    +           Must appear at least once (one or more times)
    *           May appear any number of times or not at all (zero or more times)

If you want to provide finer granularity to an element, you can redefine its declaration in the DTD so that a child element name is followed by +. This indicates that the element must have at least one such child element under it. It is allowed to have more than one as well. You can define more complex relationships by combining these operators with parentheses.
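Because the occurrence operators mean the same thing as their regular-expression counterparts, a content model can be imitated with a pattern over the sequence of child element names. The sketch below is only an illustration of that correspondence, not a DTD validator. It assumes Python's standard re and xml.etree.ElementTree modules, and a hypothetical content model (title, author+, note?) for a hypothetical book element.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical content model (title, author+, note?) written as a regex
# over child names, with each name followed by a single space.
MODEL = re.compile(r"(title )(author )+(note )?")

def children_match(xml_text: str) -> bool:
    """Check the order and count of the root's direct children against MODEL."""
    root = ET.fromstring(xml_text)
    names = "".join(child.tag + " " for child in root)
    return MODEL.fullmatch(names) is not None

checks = {
    "two authors": children_match("<book><title/><author/><author/></book>"),
    "with note": children_match("<book><title/><author/><note/></book>"),
    "no author": children_match("<book><title/></book>"),
    "wrong order": children_match("<book><author/><title/></book>"),
}
for label, ok in checks.items():
    print(label, "->", "matches" if ok else "does not match")
```

The comma in the content model becomes simple concatenation in the pattern, while + and ? carry over unchanged.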
Mixed content
Using the rules of grouping and recurrence to their fullest allows you to create very useful elements that contain mixed content. Elements with mixed content contain child elements that can intermingle with PCDATA. The most obvious example of this is a paragraph:
This is a paragraph element. It contains this link to the W3C. Their website is very helpful.
A mixed content declaration allows such an element to contain text (#PCDATA) and any of the child elements it lists, in any order. You can’t specify a fixed sequence or a fixed number of occurrences for those children. Once you include #PCDATA in a declaration, any following elements must be separated by “or” bars (|), and the grouping must be optional and repeatable (*).
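As a sketch of a mixed content declaration, assuming the hypothetical element names para, emphasis, and link and a hypothetical href attribute:

```xml
<?xml version="1.0"?>
<!DOCTYPE para [
  <!-- #PCDATA comes first, alternatives are separated by |,
       and the whole group is followed by *. -->
  <!ELEMENT para (#PCDATA | emphasis | link)*>
  <!ELEMENT emphasis (#PCDATA)>
  <!ELEMENT link (#PCDATA)>
  <!ATTLIST link href CDATA #REQUIRED>
  <!-- Not allowed, because it tries to impose an order on mixed content:
       <!ELEMENT para (#PCDATA, emphasis, link)> -->
]>
<para>This is a <emphasis>paragraph</emphasis> element. It contains this
<link href="http://www.w3.org/">link</link> to the W3C. Their website is
<emphasis>very</emphasis> helpful.</para>
```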
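Empty elements
You must also declare each of the empty elements that can be used inside a valid XML document. This can be done with the EMPTY keyword, which takes the place of the content model in the element declaration. An element declared this way can be written in the XML document either as a single empty-element tag or as a start tag immediately followed by its end tag.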
Entities
Inside a DTD, you can declare an entity, which allows you to use an entity reference to substitute a series of characters for another character in an XML document, similar to macros.
General entities
A general entity is an entity that can substitute other characters inside the XML document. The declaration for a general entity uses the following format:
<!ENTITY name "replacement text">
We have already seen five general entity references, one for each of the characters <, >, &, ', and ". Each of these can be used inside an XML document to prevent the XML processor from interpreting the characters as markup. (Incidentally, you do not need to declare these in your DTD; they are always provided for you.)
Earlier, we provided an entity reference for the copyright character. We could declare such an entity in the DTD with the following:
<!ENTITY copyright "&#169;">
Again, we have tied the &copyright; entity to Unicode value 169 (or hexadecimal 0xA9), which is the “circled-C” (©) copyright character. You can then use the following in your XML document:
&copyright; 2001 by MyCompany, Inc.
There are a couple of restrictions to declaring entities:
• You cannot make circular references in the declarations. For example, two entities whose replacement text refers to each other are invalid.
• You cannot substitute nondocument text in a DTD with a general entity reference. The general entity reference is resolved only in an XML document, not a DTD document. (If you wish to have an entity reference resolved in the DTD, you must instead use a parameter entity reference.)
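The following sketch puts a general entity declaration and its use in one document. The notice element and the owner entity are hypothetical; only the copyright entity and the sample sentence come from the text.

```xml
<?xml version="1.0"?>
<!DOCTYPE notice [
  <!ELEMENT notice (#PCDATA)>
  <!-- Character reference 169 is the copyright sign. -->
  <!ENTITY copyright "&#169;">
  <!ENTITY owner "MyCompany, Inc.">
  <!-- Circular, so referencing either one would be an error:
       <!ENTITY first "see &second;">
       <!ENTITY second "see &first;"> -->
]>
<notice>&copyright; 2001 by &owner;</notice>
```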
Parameter entities
Parameter entity references appear only in DTDs and are replaced by their entity definitions in the DTD. All parameter entity references begin with a percent sign, which denotes that they cannot be used in an XML document, only in the DTD in which they are defined. A parameter entity is defined with the format <!ENTITY % name "replacement text"> and is referenced as %name;.
As with general entity references, you cannot make circular references in declarations. In addition, parameter entity references must be declared before they can be used.
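A short sketch of parameter entities in an external DTD file follows. The file name and every element, attribute, and entity name are hypothetical. The declarations are placed in an external DTD because parameter entity references inside other declarations are not allowed in an internal subset.

```xml
<!-- shapes.dtd: a hypothetical external DTD -->

<!-- Each parameter entity is declared before it is referenced. -->
<!ENTITY % inline "emphasis | link">
<!ENTITY % size "length CDATA #REQUIRED
                 width  CDATA #REQUIRED">

<!-- %inline; expands inside the content model. -->
<!ELEMENT para (#PCDATA | %inline;)*>
<!ELEMENT emphasis (#PCDATA)>
<!ELEMENT link (#PCDATA)>

<!-- %size; expands to two attribute definitions. -->
<!ELEMENT box EMPTY>
<!ATTLIST box %size;>
```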
External entities
XML allows you to declare an external entity with the following syntax:
<!ENTITY name SYSTEM "URI">
This allows you to copy the XML content (located at the specified URI) into the current XML document using an external entity reference. For example:
<!ENTITY quotes SYSTEM "http://www.oreilly.com/stocks/quotes.xml">
Current Stock Quotes
&quotes;
This example copies the XML content located at the URI http://www.oreilly.com/stocks/quotes.xml into the document when it’s run through the XML processor. As you might guess, this works quite well when dealing with dynamic data.
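Put together as one document, the external entity example might look like the sketch below. The entity name quotes and the URI come from the text; the element names document and title are assumptions.

```xml
<?xml version="1.0"?>
<!DOCTYPE document [
  <!-- The parser fetches this URI when it expands &quotes;. -->
  <!ENTITY quotes SYSTEM "http://www.oreilly.com/stocks/quotes.xml">
]>
<document>
  <title>Current Stock Quotes</title>
  &quotes;
</document>
```

A parser is free to skip external entities it cannot or will not retrieve, so whether the quotes actually appear depends on the processor and its configuration.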
Unparsed entities
By the same token, you can use an unparsed entity to declare non-XML content in an XML document. For example, if you want to declare an outside image to be used inside an XML document, you can declare it in the DTD as an external entity whose declaration ends with the NDATA (notation data) keyword and a notation name. The NDATA keyword tells exactly what type of unparsed entity the XML processor is dealing with. You typically use an unparsed entity reference as the value of an element’s attribute, one defined in the DTD with the type ENTITY or ENTITIES. To use the unparsed entity, give its name as the value of that attribute. Note that you do not use an ampersand (&) or a semicolon (;). These are only used with parsed entities.
Notations
Finally, notations are used in conjunction with unparsed entities. A notation declaration simply matches the value of an NDATA keyword (GIF89a in our example) with more specific information. Applications are free to use or ignore this information as they see fit.
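The sketch below ties an unparsed entity, its notation, and an ENTITY attribute together. Only the notation name GIF89a comes from the text; the entity name logo, the image URL, the system identifier of the notation, and the catalog and image elements are hypothetical.

```xml
<?xml version="1.0"?>
<!DOCTYPE catalog [
  <!-- The notation gives applications a hint about the GIF89a format. -->
  <!NOTATION GIF89a SYSTEM "image/gif">
  <!-- NDATA marks the entity as unparsed and names its notation. -->
  <!ENTITY logo SYSTEM "http://www.example.com/logo.gif" NDATA GIF89a>
  <!ELEMENT catalog (image)>
  <!ELEMENT image EMPTY>
  <!ATTLIST image src ENTITY #REQUIRED>
]>
<catalog>
  <!-- The bare entity name is used: no ampersand, no semicolon. -->
  <image src="logo"/>
</catalog>
```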
Attribute Declarations in the DTD
Attributes for various XML elements must be specified in the DTD. You can specify each of the attributes with the <!ATTLIST> declaration, which uses the following form:
<!ATTLIST target_element attr_name attr_type default>
The declaration consists of the target element name, the name of the attribute, its datatype, and any default value you want to give it. The first keyword after ATTLIST declares the name of the target element. This is followed by the name of the attribute (e.g., length, width, visible, marital). This, in turn, is generally followed by the datatype of the attribute and its default value.
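Here is a sketch of attribute declarations that use the attribute names mentioned above. The element names (room, box, frame, person), the enumerated values, and the defaults are assumptions made for illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE room [
  <!ELEMENT room (box, frame, person)>
  <!ELEMENT box EMPTY>
  <!ELEMENT frame EMPTY>
  <!ELEMENT person (#PCDATA)>
  <!-- target element, attribute name, datatype, default value -->
  <!ATTLIST box length CDATA "0">
  <!ATTLIST box width CDATA "0">
  <!ATTLIST frame visible (true | false) "true">
  <!ATTLIST person marital (single | married | divorced | widowed) #IMPLIED>
]>
<room>
  <box length="10"/>
  <frame/>
  <person marital="single">Pat</person>
</room>
```

In this document the box has no width attribute and the frame has no visible attribute, so a parser that reads the DTD supplies the defaults "0" and "true".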
Attribute modifiers
Let’s look at the default value first. You can specify any default value allowed by the specified datatype. This value must appear as a quoted string. If a default value is not appropriate, you can specify one of the modifiers listed in the following table in its place:
Modifier        Description
#REQUIRED       The attribute value must be specified with the element.
#IMPLIED        The attribute value is unspecified, to be determined by the application.
#FIXED "value"  The attribute value is fixed and cannot be changed by the user.
"value"         The default value of the attribute.
With the #IMPLIED keyword, the value can be omitted from the XML document. The XML parser must notify the application that no value was given, and the application can take whatever action it deems appropriate at that point. With the #FIXED keyword, you must specify the default value immediately afterwards.
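The four kinds of default can be compared side by side in a single declaration. This is a sketch only; the report element and all four attribute names are hypothetical.

```xml
<?xml version="1.0"?>
<!DOCTYPE report [
  <!ELEMENT report (#PCDATA)>
  <!ATTLIST report
      id      CDATA #REQUIRED
      author  CDATA #IMPLIED
      version CDATA #FIXED "1.0"
      status  CDATA "draft">
]>
<!-- id must be present; author may be left out; version, if written,
     must be exactly "1.0"; status falls back to "draft". -->
<report id="r42">Quarterly summary</report>
```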
Datatypes
The following table lists legal datatypes to use in a DTD:
Type        Description
CDATA       Character data
enumerated  A series of values from which only one can be chosen
ENTITY      An entity declared in the DTD
ENTITIES    Multiple whitespace-separated entities declared in the DTD
ID          A unique element identifier
IDREF       The value of a unique ID type attribute
IDREFS      Multiple whitespace-separated IDREFs of elements
NMTOKEN     An XML name token
NMTOKENS    Multiple whitespace-separated XML name tokens
NOTATION    A notation declared in the DTD
The CDATA keyword simply declares that any character data can appear, although it must adhere to the same rules as the PCDATA tag. With an enumerated datatype, no keyword is specified. Instead, the possible values are simply listed in parentheses, separated by vertical bars.
The ID, IDREF, and IDREFS datatypes allow you to define attributes as IDs and ID references. An ID is simply an attribute whose value distinguishes the current element from all others in the current XML document. IDs are useful for applications to link to various sections of a document that contain an element with a uniquely tagged ID. IDREFs are attributes that reference other IDs. Consider an XML document containing one element for each of four employees:
Jack Russell
Samuel Tessen
Terri White
Steve McAlister
In this document and its DTD, all employees have their own identification numbers (e1013, e1014, etc.), which we define in the DTD with the ID keyword using the empid attribute. This attribute then forms an ID for each employee’s element; no two elements can have the same ID.
Attributes that only reference other elements use the IDREF datatype. In this case, the boss attribute is an IDREF because it uses only the values of other ID attributes as its values. IDs will come into play when we discuss XLink and XPointer.
The IDREFS datatype is used if you want the attribute to refer to more than one ID in its value. The IDs must be separated by whitespace. For example, adding an attribute of type IDREFS to the DTD allows the element for Steve McAllister to list several IDs in that attribute’s value.
The NMTOKEN and NMTOKENS attributes declare XML name tokens. An XML name token is built from the same characters as an XML name: letters, digits, underscores, hyphens, and periods. It can contain a colon if it is part of a namespace. It may not contain whitespace; however, unlike an XML name, any of the permitted characters can be the first character of an XML name token (e.g., .profile is a legal XML name token, but not a legal XML name). These datatypes are useful if you enumerate tokens of languages or other keyword sets that match these restrictions in the DTD.
The attribute types ENTITY and ENTITIES allow you to exploit an entity declared in the DTD. This includes unparsed entities. For example, you can link to an image by declaring it as an unparsed entity, declaring an attribute of type ENTITY, and giving that attribute the entity’s name as its value. The ENTITIES datatype allows multiple whitespace-separated references to entities, much like IDREFS and NMTOKENS allow multiple references to their datatypes.
The NOTATION keyword simply expects a notation that appears in the DTD with a <!NOTATION> declaration.
For example, the player attribute of an element can be restricted so that its value is either mpeg or jpeg. Both notations are declared first. Note that you must then enumerate each of the notations allowed in the attribute when you declare it. According to the rules of such a DTD, the element is not allowed to play AVI files, because no AVI notation is listed. The NOTATION keyword is rarely used.
Finally, you can place all the ATTLIST entries for an element inside a single ATTLIST declaration, as long as you follow the rules of each datatype.
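The ID, IDREF, and IDREFS datatypes discussed in this section can be sketched in one document, with all three attributes gathered into a single ATTLIST declaration. The attribute names empid and boss, the ID values e1013 and e1014, and the employee names come from the text. The element names sector and employee, the managers attribute, the remaining ID values, and who reports to whom are assumptions.

```xml
<?xml version="1.0"?>
<!DOCTYPE sector [
  <!ELEMENT sector (employee*)>
  <!ELEMENT employee (#PCDATA)>
  <!-- One ATTLIST declaration holding several attribute definitions. -->
  <!ATTLIST employee
      empid    ID     #REQUIRED
      boss     IDREF  #IMPLIED
      managers IDREFS #IMPLIED>
]>
<sector>
  <employee empid="e1013">Jack Russell</employee>
  <employee empid="e1014" boss="e1013">Samuel Tessen</employee>
  <employee empid="e1015" boss="e1013">Terri White</employee>
  <!-- managers lists two IDs separated by whitespace. -->
  <employee empid="e1016" boss="e1014" managers="e1014 e1013">Steve McAlister</employee>
</sector>
```

A validating parser would reject this document if two employees shared an empid, or if a boss or managers value named an ID that does not exist.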
Included and Ignored Sections
Within a DTD, you can bundle together a group of declarations that should be ignored using the IGNORE directive:
<![IGNORE[ declarations ]]>
Conversely, if you wish to ensure that certain declarations are included in your DTD, use the INCLUDE directive, which has a similar syntax:
<![INCLUDE[ declarations ]]>
Why you would want to use either of these declarations is not obvious until you consider replacing the INCLUDE or IGNORE directives with a parameter entity reference that can be changed easily on the spot. For example, consider a DTD in which two alternative declarations of the text element are each wrapped in a conditional section, one controlled by the parameter entity book and the other by the parameter entity article. Depending on the values of the entities book and article, the definition of the text element will be different:
• If book has the value INCLUDE and article has the value IGNORE, then the text element must include chapters (which in turn may contain sections that themselves include paragraphs).
• But if book has the value IGNORE and article has the value INCLUDE, then the text element must include sections.
When writing an XML document based on this DTD, you may write either a book or an article simply by properly defining book and article entities in the document’s internal subset.
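A sketch of such a DTD follows. The entity names book and article and the text element come from the text; the file name text.dtd, the remaining element names, and the default entity values are assumptions.

```xml
<!-- text.dtd: a hypothetical external DTD -->

<!-- Defaults, used only if the document does not define these first. -->
<!ENTITY % book "INCLUDE">
<!ENTITY % article "IGNORE">

<![%book;[
<!ELEMENT text (chapter+)>
]]>
<![%article;[
<!ELEMENT text (section+)>
]]>

<!ELEMENT chapter (title, section+)>
<!ELEMENT section (title, para+)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT para (#PCDATA)>
```

A document can then select the article form by defining the two entities in its internal subset, which is read before the external DTD and therefore takes precedence:

```xml
<?xml version="1.0"?>
<!DOCTYPE text SYSTEM "text.dtd" [
  <!ENTITY % book "IGNORE">
  <!ENTITY % article "INCLUDE">
]>
<text>
  <section>
    <title>Introduction</title>
    <para>An article is made of sections.</para>
  </section>
</text>
```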
Internal subsets
You can place parts of your DTD declarations inside the DOCTYPE declaration of the XML document, enclosed in square brackets. The region between brackets is called the DTD’s internal subset. When a parser reads the DTD, the internal subset is read first, followed by the external subset, which is the file referenced by the DOCTYPE declaration.
There are restrictions on the complexity of the internal subset, as well as processing expectations that affect how you should structure it:
• Conditional sections (such as INCLUDE or IGNORE) are not permitted in an internal subset.
• Any parameter entity reference in the internal subset must expand to zero or more declarations. For example, specifying the parameter entity reference %paradecl; is legal as long as %paradecl; expands to one or more whole declarations. However, a parameter entity reference written inside a declaration in the internal subset is considered illegal because it does not expand to a whole declaration.
Nonvalidating parsers aren’t required to read the external subset and process its contents, but they are required to process any defaults and entity declarations in the internal subset. However, a parameter entity can change the meaning of those declarations in an unresolvable way. Therefore, a parser must stop processing the internal subset when it comes to the first external parameter entity reference that it does not process. If it’s an internal reference, it can expand it, and if it chooses to fetch the entity, it can continue processing. If it does not process the entity’s replacement, it must not process the attribute list or entity declarations in the remainder of the internal subset.
Why use this? Since some entity declarations are often relevant only to a single document (for example, declarations of chapter entities or other content files), the internal subset is a good place to put them. Similarly, if a particular document needs to override or alter the DTD values it uses, you can place a new definition in the internal subset. Finally, in the event that an XML processor is nonvalidating (as we mentioned previously), the internal subset is the best place to put certain DTD-related information, such as the identification of ID and IDREF attributes, attribute defaults, and entity declarations.
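An internal subset that respects these restrictions might look like the sketch below. The entity name paradecl comes from the text; the elements, the status attribute, and the publisher and notecontent entities are hypothetical.

```xml
<?xml version="1.0"?>
<!DOCTYPE book [
  <!-- Legal: the reference expands to one whole declaration. -->
  <!ENTITY % paradecl "<!ELEMENT para (#PCDATA)>">
  %paradecl;
  <!ELEMENT book (para+)>
  <!-- An attribute default and an entity declaration, which even a
       nonvalidating parser is expected to process. -->
  <!ATTLIST book status CDATA "draft">
  <!ENTITY publisher "MyCompany, Inc.">
  <!-- Illegal here, because the reference sits inside a declaration:
       <!ELEMENT note (%notecontent;)> -->
]>
<book>
  <para>Published by &publisher;.</para>
</book>
```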
The Extensible Stylesheet Language
The Extensible Stylesheet Language (XSL) is one of the most intricate specifications in the XML family. XSL can be broken into two parts: XSLT, which is used for transformations, and XSL Formatting Objects (XSL-FO). While XSLT is currently in widespread use, XSL-FO is still maturing; both, however, promise to be useful for any XML developer.
This section will provide you with a firm understanding of how XSL is meant to be used. For the very latest information on XSL, visit the home page for the W3C XSL working group at http://www.w3.org/Style/XSL/.
As we mentioned, XSL works by applying element-formatting rules that you define for each XML document it encounters. In reality, XSL simply transforms each XML document from one series of element types to another. For example, XSL can be used to apply HTML formatting to an XML document, which would transform it from:
Starting XML
If you haven’t used XML, then ...
to the following HTML:
XML Comments
Working with XML
If you haven’t used XML, then ...
If you look carefully, you can see a predefined hierarchy that remains from the source content to the resulting content. To venture a guess, the outermost element of the source probably maps to several structural elements in HTML, while each of the elements inside it maps to a single HTML element, and so on. This demonstrates an essential aspect of XML: each document contains a hierarchy of elements that can be organized in a tree-like fashion. (If the document uses a DTD, that hierarchy is well defined.) In the previous XML example, the innermost elements are leaves of the element that contains them, and the same holds for the corresponding elements in the HTML document. XSL’s primary purpose is to apply formatting rules to a source tree, rendering its results to a result tree, as we’ve just done.
However, unlike other stylesheet languages such as CSS, XSL makes it possible to transform the structure of the document. XSLT applies transformation rules to the document source and, by changing the tree structure, produces a new document. It can also amalgamate several documents into one or even produce several documents starting from the same XML file.
Formatting Objects
One area of the XSL specification that is gaining steam is the idea of formatting objects. These objects serve as universal formatting tags that can be applied to virtually any arena, including both video and print. However, this (rather large) area of the specification is still in its infancy, so we will not discuss it further in this reference. For more information on formatting objects, see http://www.w3.org/TR/XSL/.
The remainder of this section discusses XSL Transformations.
XSLT Stylesheet Structure
The general order for elements in an XSL stylesheet boils down to a few simple rules. First, all XSL stylesheets must be well-formed XML documents, and each XSL element must use the namespace specified by the xmlns declaration in the <xsl:stylesheet> element (commonly xsl:). Second, all XSL stylesheets must begin with the XSL root element tag, <xsl:stylesheet>, and close with the corresponding tag, </xsl:stylesheet>. Within the opening tag, the XSL namespace must be defined:
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
After the opening tag of the root element, you can import external stylesheets with <xsl:import> elements, which must always be first within the <xsl:stylesheet> element. Any other elements can then be used in any order and in multiple occurrences if needed.
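These rules can be sketched as a stylesheet skeleton. The imported file name common.xsl and the choice of top-level elements shown after the import are assumptions; XSLT defines others that could appear in the same position.

```xml
<?xml version="1.0"?>
<!-- The root element declares the XSL namespace and binds it to xsl:. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Imports, if any, come before every other child element. -->
  <xsl:import href="common.xsl"/>

  <!-- The remaining top-level elements may appear in any order
       and more than once. -->
  <xsl:output method="html"/>

  <xsl:template match="/">
    <xsl:apply-templates/>
  </xsl:template>

</xsl:stylesheet>
```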
Templates and Patterns
An XSLT stylesheet transforms an XML document by applying templates for a given type of node. A template element looks like this:
<xsl:template match="pattern"> ... </xsl:template>
where pattern selects the type of node to be processed. For example, say you want to write a template to transform a paragraph node into HTML. This template will be applied to all paragraph elements of that type, and the tag at the beginning of the template names that element in its match attribute. The body of the template often contains a mix of “template instructions” and text that should appear literally in the result, although neither is required. In this example, we want to wrap the contents of the paragraph element in <p> and </p> HTML tags. Thus, the body of the template consists of those two tags with an <xsl:apply-templates/> element between them. The <xsl:apply-templates/> element recursively applies all other templates from the stylesheet against the contents of the paragraph element (the current node) while this template is processing. Every stylesheet has at least two templates that apply by default.
The first default template processes text and attribute nodes and writes them literally in the document. The second default template is applied to elements and root nodes that have no matching template of their own. In this case, no output is generated, but templates are applied recursively from the node in question.
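Written out explicitly, the two default templates behave like the rules below, which follow the built-in template rules of XSLT 1.0. A stylesheet does not need to contain them; they are shown here only to make the behavior visible.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Text and attribute nodes: copy the value into the result. -->
  <xsl:template match="text()|@*">
    <xsl:value-of select="."/>
  </xsl:template>

  <!-- Elements and the root node: produce nothing themselves,
       but keep applying templates to their children. -->
  <xsl:template match="*|/">
    <xsl:apply-templates/>
  </xsl:template>

</xsl:stylesheet>
```

Run against any document, a stylesheet made of only these two rules outputs the document's text with all markup removed.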
Now that we have seen the principle of templates, we can look at a more complete example. Consider the following XML document:
Sample text
This is the first section of the text.
This is the second section of the text.
To transform this into HTML, we use the following template:
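A sketch of a source document and stylesheet consistent with this example is given here. The element names text, title, and section, the section title values, and the HTML tags chosen are assumptions inferred from the surrounding prose. First the source document:

```xml
<?xml version="1.0"?>
<text>
  <title>Sample text</title>
  <section title="First section">This is the first section of the text.</section>
  <section title="Second section">This is the second section of the text.</section>
</text>
```

Then a stylesheet with one template per element type:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>

  <!-- Matches the document element and builds the HTML skeleton. -->
  <xsl:template match="text">
    <html>
      <head>
        <title><xsl:value-of select="title"/></title>
      </head>
      <body>
        <xsl:apply-templates/>
      </body>
    </html>
  </xsl:template>

  <!-- The document title becomes a top-level heading. -->
  <xsl:template match="title">
    <h1><xsl:apply-templates/></h1>
  </xsl:template>

  <!-- @title selects the title attribute of the current section. -->
  <xsl:template match="section">
    <h2><xsl:value-of select="@title"/></h2>
    <p><xsl:apply-templates/></p>
  </xsl:template>

</xsl:stylesheet>
```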
Let’s look at how this stylesheet works. As processing begins, the current node is the document root (not to be confused with the document’s root element, which is its only child), designated as / (like the root directory in a Unix filesystem). The XSLT processor searches the stylesheet for a template with a matching pattern in any children of the root. Only the first template matches. The first template is then applied to the root element’s node, which becomes the current node. The transformation then takes place: the literal HTML elements in the template are simply copied into the document because they are not XSL instructions. Between one pair of these tags, an <xsl:value-of> element copies the contents of a source element into the document. Finally, the <xsl:apply-templates/> element tells the XSL processor to apply the templates recursively and insert the result between another pair of HTML tags.
This time through, the title and section templates are applied because their patterns match. The title template inserts the contents of the title element between a pair of HTML tags, thus displaying the document title. The section template works by using the <xsl:value-of> element to recopy the contents of the current element’s title attribute into the document produced. We can indicate in a pattern that we want to copy the value of an attribute by placing the at symbol (@) in front of its name.
The process continues recursively to produce the following HTML document:
Sample text
This is the first section of the text.
This is the second section of the text.
As you will see later, patterns are XPath expressions for locating nodes in an XML document. This example includes very basic patterns, and we have only scratched the surface of what can be done with templates. More information can be found in the section “XPath.”
In addition, the <xsl:template> element has a mode attribute that can be used for conditional processing. An <xsl:template match="pattern" mode="mode"> template is tested only when it is called by an <xsl:apply-templates mode="mode"> element that matches its mode. This functionality can be used to change the processing applied to a node dynamically.
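As a sketch of the mode attribute, the stylesheet below processes the same section elements twice: once in a hypothetical toc mode to build a list of titles, and once without a mode to output the full content. The element names text and section and the title attribute are assumptions carried over from the earlier example.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>

  <xsl:template match="text">
    <html>
      <body>
        <!-- First pass: only templates in the toc mode are tested. -->
        <ul>
          <xsl:apply-templates select="section" mode="toc"/>
        </ul>
        <!-- Second pass: only templates without a mode are tested. -->
        <xsl:apply-templates select="section"/>
      </body>
    </html>
  </xsl:template>

  <xsl:template match="section" mode="toc">
    <li><xsl:value-of select="@title"/></li>
  </xsl:template>

  <xsl:template match="section">
    <h2><xsl:value-of select="@title"/></h2>
    <p><xsl:apply-templates/></p>
  </xsl:template>

</xsl:stylesheet>
```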
Parameters and Variables
To finish up with templates, we should discuss the name attribute. Templates that carry a name attribute are similar to functions and can be called explicitly with the <xsl:call-template name="name"/> element, where name matches the name of the template you want to invoke. When you call a template, you can pass it parameters. Let’s assume we wrote a template to add a footer containing the date the document was last updated. We could call the template, passing it the date of the last update. The called template declares the parameter and uses it to write a footer line that begins with “Last update:” and is followed by the date. The parameter is declared within the template with the