parsing with parser combinators

I've known about the limitations of hand-written parsing for a while. Parsing that relies on token positions quickly gets out of hand once the grammar includes repetitions, options, and choices of sequences. At some point I decided to use Scala's parser combinators to parse content types, but it's been a long road to implement it.

First, let's look at a real-life example of such a complex structure:

<complexType name="SubjectType">
    <choice>
        <sequence>
            <choice>
                <element ref="saml:BaseID"/>
                <element ref="saml:NameID"/>
                <element ref="saml:EncryptedID"/>
            </choice>
            <element ref="saml:SubjectConfirmation" minOccurs="0" maxOccurs="unbounded"/>
        </sequence>
        <element ref="saml:SubjectConfirmation" maxOccurs="unbounded"/>
    </choice>
</complexType>

There's a choice within a sequence within a choice. How does 0.0.3 handle this?

case class SubjectType(arg1: org.scalaxb.rt.DataRecord[Any]) extends org.scalaxb.rt.DataModel {
....
}

The case class only has a single rt.DataRecord[Any], which is a mistake considering that the second option of the choice is a repetition of one or more saml:SubjectConfirmation elements.

object SubjectType {
  def fromXML(node: scala.xml.Node): SubjectType =
    SubjectType(SubjectTypeOption.fromXML(node.child.filter(_.isInstanceOf[scala.xml.Elem])(0))) 
}

The problem is in the parsing. First, the above logic only deals with the first child element, while both options of the choice can cover a sequence of multiple elements.

object SubjectTypeOption {  
  def fromXML: PartialFunction[scala.xml.NodeSeq, org.scalaxb.rt.DataRecord[Any]] = {
    case x: scala.xml.Elem if (x.label == "SubjectConfirmation" && 
        x.scope.getURI(x.prefix) == "urn:oasis:names:tc:SAML:2.0:assertion") =>
      org.scalaxb.rt.DataRecord(x.scope.getURI(x.prefix), x.label, SubjectConfirmationType.fromXML(x))
  }
}

The limitation stems from the fact that SubjectTypeOption.fromXML is implemented to handle only one element. In other words, the option that includes the sequence is completely ignored.

Here's the code generated for the same type by the new version:

case class SubjectType(arg1: rt.DataRecord[Any]*) extends rt.DataModel {
     .... 
}

object SubjectType extends rt.ElemNameParser[SubjectType] {
  val targetNamespace = "urn:oasis:names:tc:SAML:2.0:assertion"

  def parser(node: scala.xml.Node): Parser[SubjectType] =
    rep((((((rt.ElemName(targetNamespace, "BaseID")) ^^ 
      (x => rt.DataRecord(x.namespace, x.name, BaseIDAbstractType.fromXML(x.node)))) ||| 
    ((rt.ElemName(targetNamespace, "NameID")) ^^ 
      (x => rt.DataRecord(x.namespace, x.name, NameIDType.fromXML(x.node)))) ||| 
    ((rt.ElemName(targetNamespace, "EncryptedID")) ^^ 
      (x => rt.DataRecord(x.namespace, x.name, EncryptedElementType.fromXML(x.node))))) ~ 
    rep(rt.ElemName(targetNamespace, "SubjectConfirmation"))) ^^ 
      { case p1 ~ 
      p2 => rt.DataRecord(null, null, SubjectTypeSequence1(p1,
      p2.map(x => SubjectConfirmationType.fromXML(x.node)).toList)) }) ||| 
    ((rt.ElemName(targetNamespace, "SubjectConfirmation")) ^^ 
      (x => rt.DataRecord(x.namespace, x.name, SubjectConfirmationType.fromXML(x.node))))) ^^
        { case p1 => SubjectType(p1.toList: _*) }
}

case class SubjectTypeSequence1(arg1: rt.DataRecord[Any],
  SubjectConfirmation: Seq[SubjectConfirmationType]) extends rt.DataModel {
    ....
}

The parsing logic is now implemented using parser combinators. The subtree of the XML document in question is first distilled down to a sequence of ElemNames, each a pair of namespace and element label. That sequence of ElemNames is then evaluated against a parser built from combinators. The inline sequence structure is explicitly expressed as SubjectTypeSequence1.

The repetition is expressed as rep(rt.ElemName(targetNamespace, "SubjectConfirmation")), which matches zero or more occurrences and so corresponds to <element ref="saml:SubjectConfirmation" minOccurs="0" maxOccurs="unbounded"/>.
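To make the idea concrete, here is a minimal, self-contained sketch of matching a list of element names against a combinator grammar shaped like SubjectType. It is not the scalaxb runtime and not the generated code: Name, P, elem, rep1, parseAll, and the Subject result types are all hypothetical names made up for illustration, and the combinators are reimplemented in a few lines rather than taken from scala.util.parsing.combinator. It also models the content as a single choice instead of reproducing the outer rep(...) of the generated parser.

```scala
import scala.annotation.tailrec

object SubjectSketch {
  // Hypothetical stand-in for rt.ElemName: a namespace/label pair.
  final case class Name(namespace: String, label: String)

  // A parser consumes a prefix of the input and returns a value plus the
  // remaining input, or None when it does not match.
  final case class P[+A](run: List[Name] => Option[(A, List[Name])]) {
    // Sequence: this parser, then that parser on what is left.
    def ~[B](that: => P[B]): P[(A, B)] = P { in =>
      run(in).flatMap { case (a, rest) =>
        that.run(rest).map { case (b, rest2) => ((a, b), rest2) }
      }
    }
    // Alternation: if both sides match, keep the one that consumed more.
    def |||[B >: A](that: => P[B]): P[B] = P { in =>
      (run(in), that.run(in)) match {
        case (Some(l), Some(r)) => Some(if (r._2.length < l._2.length) r else l)
        case (l, r)             => l.orElse(r)
      }
    }
    // Map the parsed value.
    def ^^[B](f: A => B): P[B] =
      P(in => run(in).map { case (a, rest) => (f(a), rest) })
  }

  // Match exactly one element with the given namespace and label.
  def elem(ns: String, label: String): P[Name] = P {
    case head :: tail if head.namespace == ns && head.label == label =>
      Some((head, tail))
    case _ => None
  }

  // Zero or more repetitions (minOccurs="0" maxOccurs="unbounded").
  def rep[A](p: P[A]): P[List[A]] = P { in =>
    @tailrec
    def loop(rest: List[Name], acc: List[A]): (List[A], List[Name]) =
      p.run(rest) match {
        // Only continue while p consumes input, so the loop terminates.
        case Some((a, next)) if next.length < rest.length => loop(next, a :: acc)
        case _                                            => (acc.reverse, rest)
      }
    Some(loop(in, Nil))
  }

  // One or more repetitions (maxOccurs="unbounded" with the default minOccurs).
  def rep1[A](p: P[A]): P[List[A]] =
    (p ~ rep(p)) ^^ { case (a, as) => a :: as }

  // Succeed only if the whole input is consumed.
  def parseAll[A](p: P[A], in: List[Name]): Option[A] =
    p.run(in).collect { case (a, Nil) => a }

  // Hypothetical result types, one per option of the outer choice.
  sealed trait Subject
  final case class IdThenConfirmations(id: String, confirmations: Int) extends Subject
  final case class ConfirmationsOnly(count: Int) extends Subject

  val ns = "urn:oasis:names:tc:SAML:2.0:assertion"

  val id: P[Name]   = elem(ns, "BaseID") ||| elem(ns, "NameID") ||| elem(ns, "EncryptedID")
  val conf: P[Name] = elem(ns, "SubjectConfirmation")

  // choice( sequence( choice(ids), conf* ), conf+ )
  val subject: P[Subject] =
    ((id ~ rep(conf)) ^^ { case (i, cs) => IdThenConfirmations(i.label, cs.length): Subject }) |||
      (rep1(conf) ^^ (cs => ConfirmationsOnly(cs.length): Subject))

  // Helper for building inputs in the SAML namespace.
  def names(labels: String*): List[Name] = labels.toList.map(Name(ns, _))
}
```

With these definitions, SubjectSketch.parseAll(SubjectSketch.subject, SubjectSketch.names("NameID", "SubjectConfirmation", "SubjectConfirmation")) should take the first option of the choice, while a list containing only SubjectConfirmation elements should take the second. The point of the sketch is that the grammar reads almost the same as the schema, which is what position-based parsing could not offer.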