Showing posts with label xml. Show all posts

Sunday, February 12, 2012

Basic ISO-Schematron functions

Every blogger loves Google Analytics: oh, the voyeuristic joy of knowing what search terms people misspelled for Google to return your blog as a 'top' result is irresistible! My blog isn't popular because it's about VMware, it's because I keep misspelling VMware as WMware. Did you know my blog is the most popular WMware blog in the world?! Actually, that's not true. My blog isn't popular at all. But I'm hoping today's post about ISO-Schematron will turn my fortunes around. I'll be grabbing some search terms out of Google Analytics and answering the questions you never asked me, which is not dissimilar to hosting an unpopular talkback radio show.

I seem to get a lot of ISO-Schematron-related hits, so if you're uninterested in this topic, close your browser window now and throw your computer into a fountain. If you're interested in the XML validating magic of ISO-Schematron, feel free to read on or throw your computer into a fountain as well.

What is ISO-Schematron?

ISO-Schematron is an XML schema language. Schema languages are ways of making sure your XML is valid. Take a look at the following XML code:

<dog id="500">
     <name>Chop Chop</name>
     <breed>Silky Terrier</breed>
     <dob>2012-01-05</dob>
</dog>

It's well-formed XML. Without knowing too much about the XML, you can tell it describes a dog named Chop Chop. The dog has the breed Silky Terrier and a date of birth sometime in 2012. This reminds me, I'd better change my password reset questions now. Let's look at the same snippet of XML code except with some creativity.


<dogPASTA
     <name>Chop Chop</name
     <breed>Silky Terrier
     <dob>2012-01-05</date fo birth y'all>
<dog> PASTAPASTASPTA


This is definitely not well formed! Not every opening tag has a closing tag, there are slashes missing and there's pasta everywhere! You don't use an XML schema language to determine whether this XML is well formed or not: you use an XML syntax checker and common sense. Let's look at yet another snippet of XML.


<dog id="500">
     <name>Chop Chop</name>
     <name>Choppy</name>
     <breed>Silky Terrier</breed>
     <dob>2012-01-05</dob>
</dog>


It's easy to tell that this XML is well formed (there's no pasta lying around, for starters). But is the XML valid? Is a dog allowed to have two names? Whether or not a dog is allowed to have two names is the decision of the owner. I, for one, welcome our multi-named dog overlords! But what if you don't? What if you believe that a dog can have one name only?! We'd be in disagreement! We'd both agree that the XML is well formed, but we would disagree on whether the XML was valid.
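The difference between well formed and valid can be sketched in a few lines of Python. This is only an illustration using the standard library's xml.etree.ElementTree; the one-name rule and the helper names are made up for the example and are not part of ISO-Schematron.

```python
import xml.etree.ElementTree as ET

# A shortened version of the pasta document: broken tags everywhere.
PASTA = "<dogPASTA <name>Chop Chop</name <dog> PASTAPASTASPTA"

TWO_NAMES = """<dog id="500">
     <name>Chop Chop</name>
     <name>Choppy</name>
     <breed>Silky Terrier</breed>
     <dob>2012-01-05</dob>
</dog>"""


def is_well_formed(xml_text):
    """Syntax check only: can a parser read the text at all?"""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def has_exactly_one_name(xml_text):
    """A hypothetical validity rule: a dog may have one name only."""
    return len(ET.fromstring(xml_text).findall("name")) == 1


print(is_well_formed(PASTA))            # False: the parser gives up on the pasta
print(is_well_formed(TWO_NAMES))        # True: the syntax is fine
print(has_exactly_one_name(TWO_NAMES))  # False: the one-name camp calls it invalid
```

The first function answers the syntax question. The second answers a question only you (or your schema) can ask.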

What is ISO-Schematron? (no really, answer my question this time).

I need to test XML code to see whether it meets some rules.

  • A dog has an ID attribute
  • A dog can have multiple names, but has to have at least one name.
  • A dog definitely has a date of birth
  • A dog must have a breed
  • A dog's date of birth cannot be in the future
With ISO-Schematron, I can write rules. Each rule will contain one or more assertions. Writing these assertions gets tricky.
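To make the five rules concrete before writing any Schematron, here is a rough Python sketch that checks them by hand. It is not how ISO-Schematron works internally; the function name check_dog, the messages and the today parameter are all assumptions for the example, and it expects dob to be in YYYY-MM-DD form.

```python
import datetime
import xml.etree.ElementTree as ET


def check_dog(xml_text, today=None):
    """Return one message per broken rule (an empty list means valid)."""
    today = today or datetime.date.today()
    dog = ET.fromstring(xml_text)
    problems = []
    if "id" not in dog.attrib:
        problems.append("dog has no id attribute")
    if len(dog.findall("name")) < 1:
        problems.append("dog needs at least one name")
    if dog.find("breed") is None:
        problems.append("dog must have a breed")
    dob = dog.findtext("dob")
    if dob is None:
        problems.append("dog must have a date of birth")
    elif datetime.date.fromisoformat(dob.strip()) > today:
        problems.append("date of birth is in the future")
    return problems


doc = ('<dog id="500"><name>Chop Chop</name>'
       '<breed>Silky Terrier</breed><dob>2012-01-05</dob></dog>')
print(check_dog(doc, today=datetime.date(2012, 2, 12)))  # []
```

Each if statement plays the part of one assertion: a test plus a message to show when the test fails.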

Alright! Let's write assertions!

Let's start by ripping search terms out of analytics like uncreative Law & Order writers rip stories out of headlines and back episodes.

Google search term: count elements schematron children must have
Translation: Señor Paul, what ISO-Schematron function could check whether an element has the correct number of child elements?

What you're looking for is the count() function. The following snippet will make sure the dog element has exactly one breed child element.

<iso:rule context="dog">
<iso:assert test="count(breed) = 1">The dog element must have one breed element only!</iso:assert>
</iso:rule>

With a simple and, you can check for the correct number of several types of child elements.

<iso:rule context="dog">
<iso:assert test="count(breed) = 1 and count(dob) = 1">The dog element must have 1 breed element and 1 dob element</iso:assert>
</iso:rule>

Easy, next!

Google search term: check schematron first element
Translation: Señor Paul, I too use ISO-Schematron to verify the validity of my XML files, possibly because I'm a student trying to cheat on my homework! Please assist me by explaining how to check if the first child element within an element is of a certain type.

Try this, buster. This will check if the first element in a dog element is name.

<iso:rule context="dog">
<iso:assert test="*[1][self::name]">The first element within dog must be name</iso:assert>
</iso:rule>
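If the XPath *[1][self::name] looks cryptic, it reads as "take the first child element, whatever it is, and check that it is a name". The same idea in Python, as a sketch with a made-up helper name:

```python
import xml.etree.ElementTree as ET


def first_child_is(xml_text, expected_tag):
    """True when the first child element of the root has the expected tag."""
    children = list(ET.fromstring(xml_text))  # child elements, in document order
    return bool(children) and children[0].tag == expected_tag


print(first_child_is("<dog><name>Chop Chop</name><breed>Silky Terrier</breed></dog>", "name"))  # True
print(first_child_is("<dog><breed>Silky Terrier</breed><name>Chop Chop</name></dog>", "name"))  # False
```

Note this is different from asking for the first name element: the second document has a name, it just isn't first.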

Google search term: check existance of attribute schematron
Translation: do my homework for me please

First of all, you spelled existence incorrectly. I'm guessing you want to check if an element has an attribute. This snippet checks whether the dog element has an id attribute.

<iso:rule context="dog">
<iso:assert test="@id">The dog element doesn't have an ID attribute!</iso:assert>
</iso:rule>


Easy! If you have any other tricky ISO-Schematron questions, put them in the comments and I'll try to help you and then make fun of you.

Monday, June 13, 2011

ISO 8601 date/time/duration manipulation with XSL

I have an XML log full of events (yay). The vendor has chosen to represent events and event durations with two XML elements: eventStart and eventDuration. My challenge: I need to transform the following XML


<eventStart>2011-07-13T18:00:00</eventStart>
<eventDuration>PT1H30M</eventDuration>
<eventDescription>Cisco Burger Maker cannot make burgers</eventDescription>


into CSV that can be scraped by another application


7/13/2011,18:00,19:30,"Cisco Burger Maker cannot make burgers"


You might comment "that eventDescription looks normal, but what kind of silly notation are eventStart and eventDuration in?!" It's ISO 8601, which is the standard for the interchange of dates and times. It's commonly used in XML documents to prevent an Abbott and Costello "Who's on first?" ambiguity when representing date, time and duration.


"One second? No, I need you to tell me the duration now!"

The XML contains the eventStart and eventDuration, but no end time. To produce the output I need, I'm going to need to do some date manipulation. This would be easy enough in most other languages: Java and C# have classes that deal with date manipulation. Unfortunately, XSL isn't as flexible. To do the sort of transformations required to get the output, we'll need to do a bit of string hacking. To ease the string hacking, I've expended all my Photoshop skills to produce this diagram that shows the character positions.

I installed Photoshop for this?!

Let's get manipulating!


1) How do you convert an ISO 8601 date to DD/MM/YYYY?
We can do this with simple string manipulation. Because ISO 8601 requires padding of date fields (ie. the Queen's birthday is stored as 2011-06-13 and not 2011-6-13), we are guaranteed that the first four characters are the year, the 6th and 7th are the month, and the 9th and 10th are the day. You can use the substring function to grab the right characters, some / characters to separate them, and the concat function to glue it all together.


<xsl:value-of select="concat(substring(.,9,2),'/',substring(.,6,2),'/',substring(.,1,4))"/>


Using that operation could result in the output 02/05/2011. What if we want to drop the leading zero (ie. get 2/5/2011)? The number function does that.

<xsl:value-of select="concat(number(substring(.,9,2)),'/',number(substring(.,6,2)),'/',number(substring(.,1,4)))"/>
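If you want to sanity-check the character positions outside of XSL, the same slicing can be sketched in Python. The function name and the strip_zeros flag are made up for this example. The only trap is that XPath's substring counts from 1 while Python slices count from 0.

```python
def iso_date_to_dmy(iso, strip_zeros=False):
    """Reformat 'YYYY-MM-DD...' as 'DD/MM/YYYY' by character position."""
    # XPath substring() counts from 1, Python slices count from 0:
    # substring(., 9, 2) -> iso[8:10]
    # substring(., 6, 2) -> iso[5:7]
    # substring(., 1, 4) -> iso[0:4]
    day, month, year = iso[8:10], iso[5:7], iso[0:4]
    if strip_zeros:
        # int() plays the role of XPath's number() here.
        day, month = str(int(day)), str(int(month))
    return "/".join([day, month, year])


print(iso_date_to_dmy("2011-05-02T18:00:00"))                    # 02/05/2011
print(iso_date_to_dmy("2011-05-02T18:00:00", strip_zeros=True))  # 2/5/2011
```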


2) How do you get the time from an ISO 8601 date?
This can be performed with easy string manipulation. We can use substring to grab the five characters after the T and stick them into a new variable called start-time.

<xsl:variable name="start-time" select="substring(substring-after(.,'T'),1,5)"/>


Applying this to 2011-06-13T18:30:00 gives 18:30.

3) How do I convert an ISO 8601 duration into a 24hr duration?
For the purposes of simplicity, I'm going to assume that your periods contain only hours ('H') and minutes ('M') (ie. your period will either be in the form PT30M, PT1H, PT1H30M). No days/weeks/months/years.

To do this, I'll create three variables.
  • duration-dirty will contain the duration in ISO 8601 format, except with the PT and M characters removed. I'm using this variable to reduce the number of substring and translate calls in the later code.
  • duration-hour will contain the hour digits
  • duration-minute will contain the minute digits
Here's a diagram that shows how these variables relate to the original eventDuration element.

To get duration-dirty, we can use the translate function to perform some slicing and dicing. To start, we can get rid of the PT and M characters.

<xsl:variable name="duration-dirty" select="translate(translate(eventDuration/text(),'PT',''),'M','')"/> 

Once we've done that, the period will look something like 30 (30 minutes), 1H (1 hour) or 1H30 (1 hour and 30 minutes). We can determine whether the duration contained hours by converting the duration-dirty variable to a number. If the conversion outputs NaN (not a number), we know the H is still in there, so the duration contained hours.

<xsl:variable name="duration-hour">
     <xsl:choose>
          <!-- If it's not NaN (ie. a valid number), then the hours are 0 -->
          <xsl:when test="not(string(number($duration-dirty)) = 'NaN')">
               <xsl:text>0</xsl:text>
          </xsl:when>
          <xsl:otherwise>
               <!-- If it's NaN, there were hours. Use substring-before to grab anything before the H. -->
               <xsl:value-of select="substring-before($duration-dirty,'H')"/>
          </xsl:otherwise>
     </xsl:choose>
</xsl:variable>

Calculating the minutes is same same but different: we check if the duration-dirty variable can be converted to a number. If it can, then duration-dirty contained only minutes (so we can use it). If converting it to a number returns NaN, there were hours so we need to grab everything after the H.

<xsl:variable name="duration-minute">
     <xsl:choose>
          <!-- If it's a valid number (not NaN), there are no hours. duration-dirty is good to use. -->
          <xsl:when test="not(string(number($duration-dirty)) = 'NaN')">
               <xsl:value-of select="$duration-dirty"/>
          </xsl:when>
          <xsl:otherwise>
               <!-- If it's NaN, then there are hours! Grab everything after the H. -->
               <xsl:value-of select="substring-after($duration-dirty,'H')"/>
          </xsl:otherwise>
     </xsl:choose>
</xsl:variable>
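For comparison, here is the same split sketched in Python. The function name is made up, and it differs from the XSL in two small ways: it looks for the H directly instead of using the NaN trick, and it treats missing minutes (as in PT1H) as zero rather than an empty string. Like the XSL, it assumes the duration contains only hours and minutes.

```python
def split_duration(duration):
    """Split 'PT1H30M', 'PT1H' or 'PT30M' into (hours, minutes)."""
    # Same idea as duration-dirty: drop the P, T and M characters.
    dirty = duration.translate(str.maketrans("", "", "PTM"))
    if "H" in dirty:
        # Everything before the H is hours, everything after is minutes.
        hours, _, minutes = dirty.partition("H")
    else:
        hours, minutes = "0", dirty
    # An empty minutes string (as in 'PT1H') counts as zero.
    return int(hours), int(minutes or 0)


print(split_duration("PT1H30M"))  # (1, 30)
print(split_duration("PT1H"))     # (1, 0)
print(split_duration("PT30M"))    # (0, 30)
```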

4) How do I add times together?
Suppose we want to calculate the end time of an event given eventStart and eventDuration. Step 2 will give us 18:00 from 2011-07-13T18:00:00. Step 3 will give us the variables duration-hour and duration-minute (1 and 30 respectively). But how do we add these two?

Start by calculating the end hour. If adding duration-minute to the minute digits in start-time gives 60 or more, an extra hour has passed. And if we end up with 24 hours or more...go back to zero using the mod operator! The mod operator is sorta like the math equivalent of word wrap: 22 mod 24 = 22, 23 mod 24 = 23, 24 mod 24 = 0, 25 mod 24 = 1. Perfect for 'resetting' back to 0.

<xsl:choose>
     <xsl:when test="substring($start-time,4,2) + $duration-minute &gt;= 60">
          <!-- An hour has passed! Add an extra hour -->
          <xsl:value-of select="(number(substring($start-time,1,2)) + $duration-hour + 1) mod 24"/> 
     </xsl:when>
     <xsl:otherwise>
          <!-- An hour has not passed. Just add the hours together. -->
          <xsl:value-of select="(substring($start-time,1,2) + $duration-hour) mod 24"/> 
     </xsl:otherwise>
</xsl:choose>


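Awesome! But...if you have less than 10 hours, your output won't look pretty (ie. we want 09:30 rather than 9:30). We can easily fix this by padding a zero character if the hours are less than 10. The test below matches the branch where an extra hour was added; drop the + 1 for the other branch.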


<xsl:if test="((number(substring($start-time,1,2)) + $duration-hour + 1) mod 24) &lt; 10">
     <xsl:text>0</xsl:text>
</xsl:if>


Good. Now calculate the end minute. I'm no physics major but if I recall correctly, there are only 60 minutes in an hour. If there are 60 minutes, the hour increments and the minutes reset back to 0.

<xsl:choose>
     <xsl:when test="not(string(number($duration-minute)) = 'NaN')">
          <!-- The duration has minutes: add them and wrap back to 0 at 60. -->
          <xsl:value-of select="(number(substring($start-time,4,2)) + $duration-minute) mod 60"/>
     </xsl:when>
     <xsl:otherwise>
          <!-- The duration has no minutes (eg. PT1H): keep the start minutes. -->
          <xsl:value-of select="substring($start-time,4,2)"/>
     </xsl:otherwise>
</xsl:choose>

Awesome! This code assumes that your event starts and ends during the same day. I'll leave incrementing the day as an exercise for you. Not because I don't know how, but because my '<' key is playing up!
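The hour and minute arithmetic is easier to follow when it isn't wrapped in angle brackets, so here is a rough Python sketch of it. The function name is an assumption; the inputs are the HH:MM start time from step 2 and the hour and minute numbers from step 3. In an HH:MM string the minute digits are characters 4 and 5, after the colon. Like the XSL, it wraps at midnight and does not touch the date.

```python
def end_time(start_time, duration_hours, duration_minutes):
    """Add a duration to an 'HH:MM' start time, wrapping at midnight."""
    start_hour = int(start_time[0:2])    # characters 1 and 2
    start_minute = int(start_time[3:5])  # characters 4 and 5, after the colon
    # divmod gives the carried hours and the leftover minutes in one step.
    carry, minute = divmod(start_minute + duration_minutes, 60)
    hour = (start_hour + duration_hours + carry) % 24
    # Pad both fields to two digits so 9:05 prints as 09:05.
    return f"{hour:02d}:{minute:02d}"


print(end_time("18:00", 1, 30))  # 19:30
print(end_time("08:30", 0, 35))  # 09:05
print(end_time("23:45", 0, 30))  # 00:15
```

The last example shows the modulo doing its word wrap: 23:45 plus half an hour lands on 00:15, which is really tomorrow.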

Monday, May 30, 2011

Limit number of element occurrences with DTD

Both DTD and XML Schema allow the restriction of how many times an element occurs. DTD has modifiers that allow the limiting of element occurrences: * ? and +. If you add any of these symbols to an element, the number of times it can occur is restricted.

<!ELEMENT Computer (Disk?)> means the Disk element can occur zero times or once.
<!ELEMENT Computer (Disk+)> means the Disk element can occur one or more times.
<!ELEMENT Computer (Disk*)> means the Disk element can occur zero or more times.

XML Schema allows you to achieve the same result with minOccurs and maxOccurs attributes. In the following example, the Disk element can occur a minimum of once and a maximum of three times.


<xs:element name="Disk" type="xs:string" minOccurs="1" maxOccurs="3"/>


Can we perform the same restriction in DTD? It's messy, but possible! Use the OR operator (|) to list every permitted number of occurrences as a separate option.

<!ELEMENT DiskSection (Disk | (Disk,Disk) | (Disk,Disk,Disk))>

Of course, anything more than a few options and it becomes very messy!
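If you do need a bigger range, you could generate the content model rather than type it. The sketch below uses made-up function names and assumes 1 <= low <= high. The first function spells out every option the way the example above does. The second builds a nested-optional form, which is my alternative and not part of the original trick: XML 1.0 asks for deterministic content models, and some validating parsers complain about the alternation form because after the first Disk they cannot tell which option they are in.

```python
def alternation_model(name, low, high):
    """Spell out every permitted count: (Disk | (Disk,Disk) | ...)."""
    if not 1 <= low <= high:
        raise ValueError("expected 1 <= low <= high")
    options = []
    for count in range(low, high + 1):
        group = ",".join([name] * count)
        options.append(group if count == 1 else f"({group})")
    return "(" + " | ".join(options) + ")"


def nested_model(name, low, high):
    """Alternative form: the required items, then nested optional ones."""
    if not 1 <= low <= high:
        raise ValueError("expected 1 <= low <= high")
    tail = ""
    for _ in range(high - low):
        # Wrap the previous tail one level deeper on each pass.
        tail = f"({name}, {tail})?" if tail else f"{name}?"
    parts = [name] * low + ([tail] if tail else [])
    return "(" + ", ".join(parts) + ")"


print(alternation_model("Disk", 1, 3))  # (Disk | (Disk,Disk) | (Disk,Disk,Disk))
print(nested_model("Disk", 1, 3))       # (Disk, (Disk, Disk?)?)
```

Both strings accept one, two or three Disk elements. Paste either after <!ELEMENT DiskSection and close the declaration with >.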

Sunday, May 8, 2011

Validating an XML document with an ISO-Schematron schema on OSX

Schema languages make sure your XML documents are valid. But not all schema languages are equal. ISO-Schematron is great, but validation isn't as simple as right-clicking on your XML document in Eclipse and clicking Validate - Eclipse doesn't support ISO-Schematron natively!

A good free and open-source XSLT and XQuery toolkit is Saxon. You can use the XSLT processor in Saxon to validate XML documents against ISO-Schematron schemas. There are three editions of Saxon - HE (Home Edition), PE (Professional Edition) and EE (Enterprise Edition). For validating the occasional document, Saxon-HE is fine. As of writing, the latest available is 9.3 which you can download here (saxonhe9-3-0-4j.zip).

Here are a few tips for first time Saxon users.

  1. When you've finished downloading Saxon, don't unzip it with OSX's built-in Archive Utility. There is a bug in Apple's Archive Utility that affects the way Java .jar files are handled (it "helpfully" extracts the contents of the .jar file). Unzip it with StuffIt Expander (available in the Apple App Store).

  2. The command line usage of Saxon on OSX is slightly different to Windows. The command

    java -jar saxon9he.jar -o output.xsl -s mySchema.sch iso_svrl_for_xslt2.xsl

    will work in Windows but will throw an error in OSX ("Command line option -o requires a value"). You'll need to change your command slightly: add a colon after the -o and -s options and remove the space.

    java -jar saxon9he.jar -o:output.xsl -s:mySchema.sch iso_svrl_for_xslt2.xsl

Now, let's test an XML document against an ISO-Schematron schema.

Saxon doesn't simply spit out a 'Your XML document makes no sense' report. It takes two steps to validate an XML document against an ISO-Schematron schema. Step one involves three files.
  1. iso_schematron_skeleton_for_saxon.xsl - this contains the ISO-Schematron schema definition/rules of war! This comes with Saxon.
  2. iso_svrl_for_xslt2.xsl - SVRL is the Schematron Validation Report Language. It prepares your report and shows where you screwed up.
  3. mySchema.sch - this is the schema you've written. It defines valid content.
We need to transform mySchema.sch with iso_svrl_for_xslt2.xsl to create yet another XSL (let's call it output.xsl). To do this, execute the following command:

java -jar saxon9he.jar -o:output.xsl -s:mySchema.sch iso_svrl_for_xslt2.xsl

If your schema was invalid, you'll get a Transformation failed: Run-time errors were reported message. Luckily, the error message is verbose and will tell you what line in your schema is invalid and why. If your schema was valid, you'll now have an output.xsl file.

Now, this output.xsl file you've generated is special. It contains a mashup of your schema with the SVRL. If you transform any XML document with the output.xsl file, you'll get an XML report detailing whether it is valid against your schema! How clever! Let's transform myXMLDocument.xml.

java -jar saxon9he.jar -o:whereDidIScrewUp.xml -s:myXMLDocument.xml output.xsl

In this case, whereDidIScrewUp.xml is the validation report of your XML document.
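If you run this often, the two steps are easy to script. The Python below is a sketch under a few assumptions: the helper names are mine, the jar is assumed to sit in the current directory, and SAMPLE_REPORT is a hand-written stand-in for a real report (the location strings a real run produces will look different). It builds the colon-style commands and pulls the failed assertions out of an SVRL report.

```python
import xml.etree.ElementTree as ET

SVRL_URI = "http://purl.oclc.org/dsdl/svrl"
SVRL_NS = {"svrl": SVRL_URI}

# Hand-written example report, not real Saxon output.
SAMPLE_REPORT = """<svrl:schematron-output xmlns:svrl="http://purl.oclc.org/dsdl/svrl">
  <svrl:failed-assert test="@id" location="/dog">
    <svrl:text>The dog element doesn't have an ID attribute!</svrl:text>
  </svrl:failed-assert>
</svrl:schematron-output>"""


def saxon_command(output, source, stylesheet, jar="saxon9he.jar"):
    """Build the argument list for one Saxon transformation (colon syntax)."""
    return ["java", "-jar", jar, f"-o:{output}", f"-s:{source}", stylesheet]


def failed_assertions(svrl_text):
    """Return (location, message) pairs for each svrl:failed-assert."""
    root = ET.fromstring(svrl_text)
    results = []
    for failed in root.iter(f"{{{SVRL_URI}}}failed-assert"):
        text = failed.findtext("svrl:text", default="", namespaces=SVRL_NS)
        # Collapse any line breaks or indentation inside the message.
        results.append((failed.get("location"), " ".join(text.split())))
    return results


step_one = saxon_command("output.xsl", "mySchema.sch", "iso_svrl_for_xslt2.xsl")
step_two = saxon_command("whereDidIScrewUp.xml", "myXMLDocument.xml", "output.xsl")
print(" ".join(step_one))
print(" ".join(step_two))
print(failed_assertions(SAMPLE_REPORT))
```

To actually run the steps you could hand each list to subprocess.run(command, check=True), then read whereDidIScrewUp.xml and pass its contents to failed_assertions. An empty list means your document passed.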

I hope this helps! I might write an Eclipse plugin if I get some time.

UPDATE: I've been beaten to the punch! Castle Systems have released Schematron-EP (Eclipse Plugin). I've yet to test it.