Parsing OPML files in Java

The full and latest version of this code is available here. The unit tests are also there; they are very long and boring, so I didn't include them in this article.

Outline Processor Markup Language (OPML) is a format that didn't really take off. Maybe it did in some industry I'm unaware of, completely possible. I was working on an integration system of sorts when the XML specification was first published so I've been there since day one. I've only seen OPML used to import/export lists of podcasts or RSS subscriptions.

OPML is fine for that. It has a light generic structure that works for lists of things.

I rely on RSS for following sites and news. I've been off social media for nearly a decade and off TV news for longer. If RSS disappeared I'd be one of those people who doesn't know what year it is. I am not kidding at all. I use Feedly for my RSS reader, ever since Google killed off their free reader. Feedly allows you to export your feed list to OPML. It would be nice if Spotify did that for podcasts.

One random day I thought it would be "fun" to export my RSS feeds from Feedly and post them here. I have a zany tool I wrote to update and publish this site. It seemed logical to add a feature where it can read the latest OPML export and re-create this page of RSS links.

So I guess I'm writing a parser based on the OPML 2.0 spec. The first step in that is creating constants for all the element and attribute names.


public abstract class OPMLConstants{
 public final static String ELEMENT_OPML="opml";
 public final static String ELEMENT_HEAD="head";
 public final static String ELEMENT_TITLE="title";
 public final static String ELEMENT_DATECREATED="dateCreated";
 public final static String ELEMENT_DATEMODIFIED="dateModified";
 public final static String ELEMENT_OWNERNAME="ownerName";
 public final static String ELEMENT_OWNEREMAIL="ownerEmail";
 public final static String ELEMENT_OWNERID="ownerId";
 public final static String ELEMENT_DOCS="docs";
 public final static String ELEMENT_EXPANSIONSTATE="expansionState";
 public final static String ELEMENT_VERTSCROLLSTATE="vertScrollState";
 public final static String ELEMENT_WINDOWTOP="windowTop";
 public final static String ELEMENT_WINDOWLEFT="windowLeft";
 public final static String ELEMENT_WINDOWBOTTOM="windowBottom";
 public final static String ELEMENT_WINDOWRIGHT="windowRight";
 public final static String ELEMENT_BODY="body";
 public final static String ELEMENT_OUTLINE="outline";
 public final static String ATTRIBUTE_VERSION="version";
 public final static String ATTRIBUTE_TEXT="text";
 public final static String ATTRIBUTE_DESCRIPTION="description";
 public final static String ATTRIBUTE_HTMLURL="htmlUrl";
 public final static String ATTRIBUTE_LANGUAGE="language";
 public final static String ATTRIBUTE_TITLE="title";
 public final static String ATTRIBUTE_TYPE="type";
 public final static String ATTRIBUTE_XMLURL="xmlUrl";
 public final static String ATTRIBUTE_ISCOMMENT="isComment";
 public final static String ATTRIBUTE_ISBREAKPOINT="isBreakpoint";
 public final static String ATTRIBUTE_CREATED="created";
 public final static String ATTRIBUTE_CATEGORY="category";
 public final static String ATTRIBUTE_URL="url";
 public final static String XML_VERSION="<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"; 
 public final static String OPML_DEFAULT_VERSION="2.0";
}

The last two constants are going to come up later when we get to writing a toXml() method. That has nothing to do with the goal of this article though.

The OPML file/object has two parts, a head and a collection of elements (called outlines). The OPML object itself is going to be the last thing to write.

The OPML head is simple overall, though the date fields are a little messy. The specification allows years to be written with either two or four digits, so there are effectively two date formats. I went with trying to parse both and storing the result as a long (epoch time), which makes it possible to sort on the date fields. Sorting by date when the dates are strings is of course possible but more complicated. I didn't try to convert the string dates to a Java Date object because I value my sanity.


import com.huguesjohnson.dubbel.util.DateUtil;

public class OPMLHead{
 private String title;
 /* 
  * OPML allows years to be four or two digits which is, whatever... 
  * Easier to treat them as Strings and store a hidden epoch time that can be used for sorting.
  */
 private String dateCreated;
 private long dateCreatedEpoch=0L;
 private String dateModified;
 private long dateModifiedEpoch=0L;
 private String ownerName;
 private String ownerEmail;
 private String ownerId; 
 private String docs; 
 private String expansionState;
 //These are Integer so they can be null to indicate there is no value for them, which is common.
 private Integer vertScrollState; 
 private Integer windowTop; 
 private Integer windowLeft;
 private Integer windowBottom;
 private Integer windowRight;

 public void setDateCreated(String dateCreated){
  this.dateCreated=dateCreated;
  this.dateCreatedEpoch=DateUtil.toEpochTime(dateCreated,DateUtil.DF_RFC822);
  if(this.dateCreatedEpoch==0L){
   //try again I guess, if still 0 after this then the date passed is definitely invalid
   this.dateCreatedEpoch=DateUtil.toEpochTime(dateCreated,DateUtil.DF_RFC822_OPML_ALT);
  }
 }
 
 public void setDateModified(String dateModified){
  this.dateModified=dateModified;
  this.dateModifiedEpoch=DateUtil.toEpochTime(dateModified,DateUtil.DF_RFC822);
  if(this.dateModifiedEpoch==0L){
   //try again I guess, if still 0 after this then the date passed is definitely invalid
   this.dateModifiedEpoch=DateUtil.toEpochTime(dateModified,DateUtil.DF_RFC822_OPML_ALT);
  }
 }
 [... autogenerated constructors + get/set methods ...]
}
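
DateUtil isn't shown in this article, so here is a rough, self-contained sketch of what a "try the four-digit year, then the two-digit year" parse could look like using only java.time. The class name, method name, patterns, and the choice of milliseconds are all my assumptions for illustration and are not taken from DateUtil. It also only handles named zones such as GMT, not numeric offsets.

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;

// Hypothetical stand-in for a two-pass date parse; not the author's DateUtil.
final class OpmlDateSketch{
 // RFC 822 layout with a four-digit year, e.g. "Mon, 31 Oct 2005 19:23:00 GMT"
 private static final DateTimeFormatter FOUR_DIGIT_YEAR=
  DateTimeFormatter.ofPattern("[EEE, ]d MMM uuuu HH:mm:ss z",Locale.ENGLISH);
 // Same layout with a two-digit year, e.g. "Mon, 31 Oct 05 19:23:00 GMT".
 // java.time maps a two-digit year into the range 2000-2099.
 private static final DateTimeFormatter TWO_DIGIT_YEAR=
  DateTimeFormatter.ofPattern("[EEE, ]d MMM uu HH:mm:ss z",Locale.ENGLISH);

 // Returns epoch milliseconds, or 0 if neither format matches.
 static long toEpochMillis(String date){
  if(date==null){
   return 0L;
  }
  for(DateTimeFormatter format:new DateTimeFormatter[]{FOUR_DIGIT_YEAR,TWO_DIGIT_YEAR}){
   try{
    return ZonedDateTime.parse(date.trim(),format).toInstant().toEpochMilli();
   }catch(DateTimeParseException e){
    // not this format, try the next one
   }
  }
  return 0L;
 }
}
```

Returning 0 for "could not parse" mirrors the convention used in the setters above. The trade-off is that a genuine date of exactly 1 Jan 1970 00:00:00 GMT is indistinguishable from an invalid one, which seems acceptable for feed lists.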

The OPML outline also has to navigate the potential of multiple date formats. It implements compareTo() to facilitate sorting by title; it would be trivial to make it sort by date instead.


import java.util.ArrayList;
import java.util.List;

import com.huguesjohnson.dubbel.util.DateUtil;

public class OPMLOutline implements Comparable<OPMLOutline>{
 private List<OPMLOutline> children;
 private String text;
 private String type;
 private Boolean isComment;
 private Boolean isBreakpoint;
 private String created;
 private long createdEpoch=0L;
 private String language;
 private String category;
 private String xmlUrl;
 private String description;
 private String htmlUrl;
 private String title;
 private String version;
 private String url;

 public OPMLOutline(){
  this.children=new ArrayList<OPMLOutline>();
 }

 public void setCreated(String created){
  this.created=created;
  this.createdEpoch=DateUtil.toEpochTime(created,DateUtil.DF_RFC822);
  if(this.createdEpoch==0L){
   //try again I guess, if still 0 after this then the date passed is definitely invalid
   this.createdEpoch=DateUtil.toEpochTime(created,DateUtil.DF_RFC822_OPML_ALT);
  } 
 }

 @Override
 public int compareTo(OPMLOutline arg0){
  String titleCompare=arg0.getTitle();
  if(titleCompare==null){
   if(this.title==null){
    return(0);
   }else{
    return(1);
   }
  }else{
   if(this.title==null){
    return(-1);
   }else{
    return(this.title.compareToIgnoreCase(titleCompare));
   }
  }
 } 
 [... autogenerated constructors + get/set methods ...]
}
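
For the "sort by date instead" case, one option that leaves compareTo() alone is a pair of Comparators. This is only a sketch: OutlineSketch is a hypothetical cut-down class with the two fields that matter for ordering, not the real OPMLOutline.

```java
import java.util.Comparator;

// Hypothetical cut-down outline with only the fields needed for sorting.
final class OutlineSketch{
 final String title;
 final long createdEpoch;

 OutlineSketch(String title,long createdEpoch){
  this.title=title;
  this.createdEpoch=createdEpoch;
 }

 // Same ordering as the compareTo() above: null titles first, then case-insensitive.
 static final Comparator<OutlineSketch> BY_TITLE=
  Comparator.comparing((OutlineSketch o)->o.title,
   Comparator.nullsFirst(String.CASE_INSENSITIVE_ORDER));

 // Oldest first. Dates that failed to parse (epoch 0) end up at the front.
 // Ties fall back to the title ordering.
 static final Comparator<OutlineSketch> BY_CREATED=
  Comparator.comparingLong((OutlineSketch o)->o.createdEpoch).thenComparing(BY_TITLE);
}
```

With something like this, list.sort(OutlineSketch.BY_CREATED) switches the ordering at the call site without touching the class's natural order.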

Let's break down the OPML object into pieces. As previously noted, the object only contains the head and a collection of outlines. There is also a member to store the version.


public class OPMLObject{
 private String version;
 private OPMLHead head;
 private List<OPMLOutline> body;

 public OPMLObject(){
  this.head=new OPMLHead();
  this.body=new ArrayList<OPMLOutline>();
 }

I wanted the ability to sort the entire collection alphabetically. Changing it to sort by created date is a very small bit of work (see the compareTo() above). I did this through recursion which could be a problem for a very large OPML file. Programming is all about trade-offs. I'm choosing something simple that would fail in extreme edge cases.


 public void sortOutlines(){
  this.sortOutlines(this.getBody());
 }

 private void sortOutlines(List<OPMLOutline> outlines){
  Collections.sort(outlines);
  for(OPMLOutline outline:outlines){
   List<OPMLOutline> children=outline.getChildren();
   if((children!=null)&&(children.size()>0)){
    sortOutlines(children);
   }
  }
 }
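
If the recursion ever did become a problem, the same sort can be done with an explicit stack. The sketch below uses a hypothetical SortNode class in place of OPMLOutline so that it compiles on its own; the idea carries over directly.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;

// Hypothetical minimal node standing in for OPMLOutline.
final class SortNode implements Comparable<SortNode>{
 final String title;
 final List<SortNode> children=new ArrayList<>();

 SortNode(String title){
  this.title=title;
 }

 @Override
 public int compareTo(SortNode other){
  return this.title.compareToIgnoreCase(other.title);
 }
}

final class IterativeSortSketch{
 // Sorts every level of the tree without recursion.
 static void sortOutlines(List<SortNode> roots){
  Deque<List<SortNode>> pending=new ArrayDeque<>();
  pending.push(roots);
  while(!pending.isEmpty()){
   List<SortNode> level=pending.pop();
   Collections.sort(level);
   // queue up each non-empty child list to be sorted later
   for(SortNode node:level){
    if(!node.children.isEmpty()){
     pending.push(node.children);
    }
   }
  }
 }
}
```

The depth of the outline no longer affects the call stack; the pending lists live on the heap instead.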

Although I didn't need it, I wanted to add a toXml() method. It wasn't very complicated so why not?

The prep work for that is having a way to convert elements to an XML fragment.


 private void appendElementXml(StringBuilder sb,Integer element,String constant,String newLine){
  if(element!=null){
   appendElementXml(sb,element.toString(),constant,newLine);
  }
 } 
 
 private void appendElementXml(StringBuilder sb,String element,String constant,String newLine){
  if((element!=null)&&(element.length()>0)){
   sb.append("<");
   sb.append(constant);
   sb.append(">");
   sb.append(element);
   sb.append("</");
   sb.append(constant);
   sb.append(">");
   sb.append(newLine);
  }   
 }

The same is needed for attributes.


 private void appendAttribute(StringBuilder sb,String attribute,String constant){
  if((attribute!=null)&&(attribute.length()>0)){
   sb.append(" ");
   sb.append(constant);
   sb.append("=\"");
   sb.append(attribute);
   sb.append("\"");
  }  
 }
 
 private void appendAttribute(StringBuilder sb,Boolean attribute,String constant){
  if(attribute!=null){
   if(attribute.booleanValue()){
    appendAttribute(sb,"true",constant);
   }else{
    appendAttribute(sb,"false",constant);
   }
  }  
 } 
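
One caveat with the helpers above: they append values as-is, so a feed title containing an ampersand or a quote would produce output that an XML parser rejects. A minimal sketch of an escaping step that could be applied to each value before appending it follows. The class and method names are hypothetical and this is not part of the code described in this article.

```java
// Hypothetical helper; escapes the five characters XML reserves.
final class XmlEscapeSketch{
 static String escapeXml(String value){
  StringBuilder sb=new StringBuilder(value.length()+16);
  for(int i=0;i<value.length();i++){
   char c=value.charAt(i);
   switch(c){
    case '&': sb.append("&amp;"); break;
    case '<': sb.append("&lt;"); break;
    case '>': sb.append("&gt;"); break;
    case '"': sb.append("&quot;"); break;
    case '\'': sb.append("&apos;"); break;
    default: sb.append(c);
   }
  }
  return sb.toString();
 }
}
```

For example, escapeXml("News & Weather") gives "News &amp;amp; Weather" written out literally as `News &amp; Weather`, which is safe inside both element text and a double-quoted attribute.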

The toXml() method is then only a matter of going through all the members and converting them.


 public String toXml(){
  StringBuilder sb=new StringBuilder();
  String newLine=System.lineSeparator();
  sb.append(OPMLConstants.XML_VERSION);
  sb.append(newLine);
  //start of <opml>
  sb.append("<");
  sb.append(OPMLConstants.ELEMENT_OPML);
  sb.append(" version=\"");
  if((this.version!=null)&&(this.version.length()>0)){
   sb.append(this.version);
  }else{
   sb.append(OPMLConstants.OPML_DEFAULT_VERSION);
  }
  sb.append("\">");
  sb.append(newLine);
  //start of <head>
  sb.append("<");
  sb.append(OPMLConstants.ELEMENT_HEAD);
  sb.append(">");
  sb.append(newLine);
  //elements in the head section
  this.appendElementXml(sb,head.getTitle(),OPMLConstants.ELEMENT_TITLE,newLine);
  this.appendElementXml(sb,head.getDateCreated(),OPMLConstants.ELEMENT_DATECREATED,newLine);
 [... and so on until ...]
  this.appendElementXml(sb,head.getWindowBottom(),OPMLConstants.ELEMENT_WINDOWBOTTOM,newLine);
  //end of <head>
  sb.append("</");
  sb.append(OPMLConstants.ELEMENT_HEAD);
  sb.append(">");
  sb.append(newLine);
  //start of <body>
  sb.append("<");
  sb.append(OPMLConstants.ELEMENT_BODY);
  sb.append(">");
  sb.append(newLine);
  //outlines
  this.appendOutlines(sb,this.getBody(),newLine);
  //end of <body>
  sb.append("</");
  sb.append(OPMLConstants.ELEMENT_BODY);
  sb.append(">");
  sb.append(newLine);
  //end of <opml>
  sb.append("</");
  sb.append(OPMLConstants.ELEMENT_OPML);
  sb.append(">");
  //done
  return(sb.toString());
 }

Next we're off to the appendOutlines method that is called in toXml(). This is again recursive with the same trade-offs previously noted.


 private void appendOutlines(StringBuilder sb,List<OPMLOutline> outlineList,String newLine){
  for(OPMLOutline outline:outlineList){
   sb.append("<");
   sb.append(OPMLConstants.ELEMENT_OUTLINE);
   //attributes of the outline element
   this.appendAttribute(sb,outline.getTitle(),OPMLConstants.ATTRIBUTE_TITLE);
   this.appendAttribute(sb,outline.getText(),OPMLConstants.ATTRIBUTE_TEXT);
   [... and so on until ...]
   this.appendAttribute(sb,outline.getVersion(),OPMLConstants.ATTRIBUTE_VERSION);
   //recursively add children if there are any
   List<OPMLOutline> children=outline.getChildren();
   if((children!=null)&&(children.size()>0)){
    sb.append(">");
    sb.append(newLine);
    //add child nodes
    this.appendOutlines(sb,children,newLine);
    //close out this outline
    sb.append("</");
    sb.append(OPMLConstants.ELEMENT_OUTLINE);
    sb.append(">");
    sb.append(newLine);
   }else{
    sb.append("/>");
    sb.append(newLine);
   }
  }
 }
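
For comparison, here is a sketch of the same nested-outline output using javax.xml.stream's XMLStreamWriter instead of a StringBuilder. It is an alternative I did not use, shown with a hypothetical WriterNode class that carries only a text attribute. The writer decides between a self-closing tag and an open/close pair based on which method is called, and it escapes attribute values.

```java
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical minimal node standing in for OPMLOutline.
final class WriterNode{
 final String text;
 final List<WriterNode> children=new ArrayList<>();

 WriterNode(String text){
  this.text=text;
 }
}

final class StaxWriterSketch{
 // Writes <body> and its outlines; wraps the checked exception for brevity.
 static String bodyToXml(List<WriterNode> outlines){
  try{
   StringWriter out=new StringWriter();
   XMLStreamWriter writer=XMLOutputFactory.newInstance().createXMLStreamWriter(out);
   writer.writeStartElement("body");
   writeOutlines(writer,outlines);
   writer.writeEndElement();
   writer.flush();
   writer.close();
   return out.toString();
  }catch(XMLStreamException e){
   throw new IllegalStateException(e);
  }
 }

 private static void writeOutlines(XMLStreamWriter writer,List<WriterNode> outlines) throws XMLStreamException{
  for(WriterNode outline:outlines){
   if(outline.children.isEmpty()){
    // leaf: written as a self-closing element
    writer.writeEmptyElement("outline");
    writer.writeAttribute("text",outline.text);
   }else{
    writer.writeStartElement("outline");
    writer.writeAttribute("text",outline.text);
    writeOutlines(writer,outline.children);
    writer.writeEndElement();
   }
  }
 }
}
```

This is still recursive, so the earlier trade-off applies. It also gives up control over line breaks, which the StringBuilder version handles explicitly.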

With that distraction out of the way we can get to the actual topic of the article - parsing an OPML file.

I'll get the extremely simple part out of the way, a custom exception for any parsing errors.


public class OPMLParseException extends Exception{
 private static final long serialVersionUID=666136489L;
 public OPMLParseException(String message,Throwable cause){
  super(message,cause);
 }
 public OPMLParseException(String message){
  super(message);
 }
 public OPMLParseException(Throwable cause){
  super(cause);
 } 
}

The comments at the start of the class are kind of important to read.


/*
 * This will also parse an OPML file that isn't perfectly structured.
 * 
 * For example:
 * -This doesn't care what order the <head> and <body> elements are in.
 * -Actually this doesn't care whether there is a <body> element at all.
 * --A document with a bunch of <outline> elements and no <body> would be handled.
 * --It could also read documents missing <head> and/or <outline> without any issues.
 * -The specification has rules about <ownerId> that this does not care about.
 * 
 * Reminder, this is called "Parser" and not "Validator".
 */

This code will read a valid OPML file and also handle anything that is close to one. If you only need to support valid files, it takes about one second to find the source for a JavaScript validator that would be easy to port to another language.

There are two public parse methods, but one is just a wrapper for the other. Let's start with that one.


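public class OPMLParser{
 public static OPMLObject parse(File f) throws OPMLParseException{
  //try-with-resources so the stream is closed even if parsing fails
  try(InputStream in=new FileInputStream(f)){
   return(parse(in));
  }catch(IOException iox){
   //FileNotFoundException is an IOException, so it is still wrapped here
   OPMLParseException opmlEx=new OPMLParseException(iox.getMessage(),iox);
   throw(opmlEx);
  }
 }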

We can debate whether it should wrap the FileNotFoundException or make it part of the throws clause. I went with what I preferred in this moment.

I did not feel like re-implementing javax.xml.stream so the parser relies heavily on it. You'll notice right away this indeed does not care about the order of the head and outline. It looks for the opml node, head node, and outline node. The parsing for head and outline will follow shortly.


 public static OPMLObject parse(InputStream in) throws OPMLParseException{
  OPMLObject opml=new OPMLObject();
  try{
   XMLInputFactory xin=XMLInputFactory.newInstance();
   XMLStreamReader xmlReader=xin.createXMLStreamReader(in);
   while(xmlReader.hasNext()){
    int eventType=xmlReader.next();
    if(eventType==XMLStreamConstants.START_ELEMENT){
     if(xmlReader.hasName()){
      String name=xmlReader.getName().toString();
      if(name.equals(OPMLConstants.ELEMENT_OPML)){//start of <opml>
       //find the version string
       int attributeCount=xmlReader.getAttributeCount();
       for(int i=0;i<attributeCount;i++){
        String attributeName=xmlReader.getAttributeName(i).toString();
        if(attributeName.equalsIgnoreCase(OPMLConstants.ATTRIBUTE_VERSION)){
         opml.setVersion(xmlReader.getAttributeValue(i));
        }
       }
      }else if(name.equalsIgnoreCase(OPMLConstants.ELEMENT_HEAD)){//start of <head>
       opml.setHead(parseHead(xmlReader));
      }else if(name.equalsIgnoreCase(OPMLConstants.ELEMENT_OUTLINE)){//start of <outline>
       opml.getBody().add(parseOutline(xmlReader));
      }
     }
    }
   }
   return(opml);
  }catch(Exception x){
   OPMLParseException opmlEx=new OPMLParseException(x.getMessage(),x);
   throw(opmlEx);
  }
 }
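
To see the event loop in isolation, here is a self-contained sketch that pulls only the xmlUrl attribute from every outline, however deeply nested. FeedUrlSketch is a hypothetical class written for this illustration. It matches on getLocalName() rather than getName().toString(), which is my choice here: QName.toString() includes the namespace URI when one is present, so the local name is the safer thing to compare.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class, not part of the parser in this article.
final class FeedUrlSketch{
 static List<String> feedUrls(InputStream in){
  List<String> urls=new ArrayList<>();
  try{
   XMLStreamReader reader=XMLInputFactory.newInstance().createXMLStreamReader(in);
   while(reader.hasNext()){
    // only START_ELEMENT events for <outline> are interesting
    if(reader.next()==XMLStreamConstants.START_ELEMENT
      &&"outline".equalsIgnoreCase(reader.getLocalName())){
     // a null namespace means "match the attribute in any namespace"
     String url=reader.getAttributeValue(null,"xmlUrl");
     if(url!=null){
      urls.add(url);
     }
    }
   }
   reader.close();
  }catch(XMLStreamException e){
   throw new IllegalStateException("input is not well-formed XML",e);
  }
  return urls;
 }

 // Convenience overload for trying this out with a String.
 static List<String> feedUrls(String xml){
  return feedUrls(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
 }
}
```

Because the loop is flat, category outlines without an xmlUrl are simply skipped and no recursion is needed. That only works when the tree structure doesn't matter, which is not the case for the full parser.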

Before we get to the head and outline, here are some methods to parse Strings and Integers.


 private static String parseString(XMLStreamReader xmlReader) throws XMLStreamException{
  int eventType=xmlReader.next();
  if(eventType==XMLStreamConstants.CHARACTERS){
   return(xmlReader.getText());
  }
  return(null);
 }
 
 private static Integer parseInteger(XMLStreamReader xmlReader) throws XMLStreamException{
  int eventType=xmlReader.next();
  if(eventType==XMLStreamConstants.CHARACTERS){
   //trim() because integer parsing rejects surrounding whitespace
   //parseInt() rather than decode() so a leading zero is not read as octal
   return(Integer.valueOf(Integer.parseInt(xmlReader.getText().trim())));
  }
  return(null);
 } 
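
A StAX reader is allowed to deliver one element's text as several CHARACTERS events, and comments or entity references can split it further. XMLStreamReader.getElementText() handles that by joining everything up to the matching end tag. The sketch below shows it on its own; ElementTextSketch and its method names are hypothetical and it is not a drop-in replacement for the methods above, since getElementText() must be called while the reader is on the START_ELEMENT and leaves it on the END_ELEMENT.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class, not part of the parser in this article.
final class ElementTextSketch{
 // Returns the trimmed text of the first element with this name,
 // or null if the element is missing or has no text.
 static String readText(String xml,String elementName){
  try{
   XMLStreamReader reader=XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
   while(reader.hasNext()){
    if(reader.next()==XMLStreamConstants.START_ELEMENT
      &&elementName.equals(reader.getLocalName())){
     // joins all text pieces and skips comments inside the element
     String text=reader.getElementText().trim();
     return text.isEmpty()?null:text;
    }
   }
   return null;
  }catch(XMLStreamException e){
   throw new IllegalStateException(e);
  }
 }

 // Same idea for the Integer fields in <head>; null means "no value".
 static Integer readInteger(String xml,String elementName){
  String text=readText(xml,elementName);
  return (text==null)?null:Integer.valueOf(text);
 }
}
```

One thing to keep in mind is that getElementText() throws an XMLStreamException if the element contains child elements, so it only suits the simple text-only elements in the head section.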

Parsing the head section isn't terribly exciting, a lot of repetition since there are so many elements in it.


 private static OPMLHead parseHead(XMLStreamReader xmlReader) throws XMLStreamException{
  OPMLHead head=new OPMLHead();
  boolean atEnd=false;
  while((xmlReader.hasNext())&&(!atEnd)){
   int eventType=xmlReader.next();
   if(eventType==XMLStreamConstants.END_ELEMENT){//end of <head>?
    if(xmlReader.hasName()){
     String name=xmlReader.getName().toString();
     if(name.equalsIgnoreCase(OPMLConstants.ELEMENT_HEAD)){//yes, at the end of <head>
      atEnd=true;
     }
    }
   }else if(eventType==XMLStreamConstants.START_ELEMENT){//start of element within <head>
    if(xmlReader.hasName()){
     String name=xmlReader.getName().toString();
     if(name.equalsIgnoreCase(OPMLConstants.ELEMENT_TITLE)){
      head.setTitle(parseString(xmlReader));
     }else if(name.equalsIgnoreCase(OPMLConstants.ELEMENT_DATECREATED)){
      head.setDateCreated(parseString(xmlReader));
     }else if(name.equalsIgnoreCase(OPMLConstants.ELEMENT_DATEMODIFIED)){
      head.setDateModified(parseString(xmlReader));
     [... and so on until ...]
     }else if(name.equalsIgnoreCase(OPMLConstants.ELEMENT_WINDOWRIGHT)){
      head.setWindowRight(parseInteger(xmlReader));
     }
    }
   }
  }
  return(head);
 }

Parsing the outline is also not extremely exciting except for the recursion again.


 private static OPMLOutline parseOutline(XMLStreamReader xmlReader) throws XMLStreamException{
  OPMLOutline outline=new OPMLOutline();
  //read the attributes
  int attributeCount=xmlReader.getAttributeCount();
  for(int i=0;i<attributeCount;i++){
   String attributeName=xmlReader.getAttributeName(i).toString();
   if(attributeName.equalsIgnoreCase(OPMLConstants.ATTRIBUTE_TEXT)){
    outline.setText(xmlReader.getAttributeValue(i));
   }else if(attributeName.equalsIgnoreCase(OPMLConstants.ATTRIBUTE_VERSION)){
    outline.setVersion(xmlReader.getAttributeValue(i));
  [... and so on until ..]   
   }else if(attributeName.equalsIgnoreCase(OPMLConstants.ATTRIBUTE_ISCOMMENT)){
    String s=xmlReader.getAttributeValue(i);
    if(s!=null){
     if(s.equalsIgnoreCase("true")){
      outline.setComment(Boolean.TRUE);
     }else{
      outline.setComment(Boolean.FALSE);
     }
    }
   }else if(attributeName.equalsIgnoreCase(OPMLConstants.ATTRIBUTE_ISBREAKPOINT)){
    String s=xmlReader.getAttributeValue(i);
    if(s!=null){
     if(s.equalsIgnoreCase("true")){
      outline.setBreakpoint(Boolean.TRUE);
     }else{
      outline.setBreakpoint(Boolean.FALSE);
     }
    }
   }  
  }
  //now look for either the end of the outline or the start of a child outline
  boolean atEnd=false;
  while((xmlReader.hasNext())&&(!atEnd)){
   int eventType=xmlReader.next();
   if(eventType==XMLStreamConstants.END_ELEMENT){//end of <outline>?
    if(xmlReader.hasName()){
     String name=xmlReader.getName().toString();
     if(name.equalsIgnoreCase(OPMLConstants.ELEMENT_OUTLINE)){//yes, at the end of <outline>
      atEnd=true;
     }
    }
   }else if(eventType==XMLStreamConstants.START_ELEMENT){//start of nested <outline>?
    if(xmlReader.hasName()){
     String name=xmlReader.getName().toString();
     if(name.equalsIgnoreCase(OPMLConstants.ELEMENT_OUTLINE)){//yes, new <outline>
      outline.getChildren().add(parseOutline(xmlReader));
     }
    }
   }
  }//while
  return(outline);
 }

Ultimately this is simply parsing an XML file with some expected element names and nested content. XML parsing always looks like a lot of code but in reality it's a lot of repetitive code. Perhaps this is the kind of thing I should have AI do now.


