gembin



          Why And How To Use PDOM: A Persistent W3C DOM API

          What is PDOM?

          PDOM stands for Persistent Document Object Model.

          PDOM implements the W3C DOM API. Just as MiniDOM solves SAX processing problems, PDOM solves DOM scalability problems by providing a persistent implementation of the DOM API.

          An enhanced implementation of XPath provides excellent usability.

          PDOM's implementation exploits the capabilities of GPO, the Generic Persistent Object model.

          Is it straightforward to use?

          It is simplest to give an example. Here is some Java code:

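          import cutthecrap.pdom.*;

          // Open the PDOM datastore file
          PDOMClient client = new PDOMClient(null, "D:/testxml/pdom.rw");
          PDOM pdom = client.getPDOM();

          // Import the XML file as a persistent document named "Opera"
          PDocument doc = pdom.createDocument("Opera", "D:/testxml/opera.xtm");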

          The PDocument class implements the org.w3c.dom.Document interface.

          You should be able to see from the above code that a PDOM system can contain many DOM documents.

          At some later stage, the persistent document could be retrieved:

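          import cutthecrap.pdom.*;

          // Reopen the same PDOM datastore file
          PDOMClient client = new PDOMClient(null, "D:/testxml/pdom.rw");
          PDOM pdom = client.getPDOM();

          // Look the document up by the name it was created with
          PDocument doc = pdom.getDocument("Opera");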

          So PDOM provides a persistent DOM repository that can both manage and interact with individual huge XML documents and store perhaps millions of separate XML documents.
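The "parse once, then fetch by name" idea can be sketched without PDOM at all. The class below is a hypothetical in-memory stand-in (a plain map plus the JDK parser); it is not the PDOM API and nothing in it is persistent. It only illustrates the lookup pattern that createDocument and getDocument suggest.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical in-memory stand-in for a named document repository; not the PDOM API. */
class NamedDocuments {
    private final Map<String, Document> byName = new HashMap<>();
    private int parseCount = 0;

    /** Returns the stored document for the name, parsing the XML only on first use. */
    Document getOrCreate(String name, String xml) {
        Document doc = byName.get(name);
        if (doc == null) {
            doc = parse(xml);
            parseCount++;
            byName.put(name, doc);
        }
        return doc;
    }

    /** Number of times XML was actually parsed. */
    int parseCount() {
        return parseCount;
    }

    private static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        NamedDocuments repo = new NamedDocuments();
        Document first = repo.getOrCreate("Opera", "<topicMap/>");
        Document again = repo.getOrCreate("Opera", "<topicMap/>");
        System.out.println("same instance: " + (first == again));
        System.out.println("parses: " + repo.parseCount());
    }
}
```

The difference with PDOM, as described above, is that the store survives between program runs, so the second lookup can happen in a later process.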

          Once a PDocument has been returned, the Document interface can be used to navigate to the contained nodes.

          In addition to the org.w3c.dom interfaces, support is also provided for XPath-based queries.
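Since PDocument implements org.w3c.dom.Document, navigation code written against the standard interfaces should carry over. The sketch below walks a tree using only those interfaces. It uses the JDK's in-memory parser as a stand-in for PDOM so that it runs on its own, and the element names are invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomWalk {
    /** Parses an XML string with the JDK parser (a stand-in for obtaining a PDocument). */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    /** Collects the tag names of the node and all elements below it, in document order. */
    static List<String> elementNames(Node node) {
        List<String> names = new ArrayList<>();
        collect(node, names);
        return names;
    }

    private static void collect(Node node, List<String> names) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            names.add(((Element) node).getTagName());
        }
        // Standard sibling/child navigation; no implementation-specific calls.
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, names);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<topicMap><topic id=\"t1\"><baseName/></topic><topic id=\"t2\"/></topicMap>");
        System.out.println(elementNames(doc.getDocumentElement()));
    }
}
```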

          What PDOM Isn't

          PDOM has not been developed to provide a rigorous implementation of the full W3C DOM model. It does not currently support DTDs and there are no immediate plans to do so.

          One example of this is that PDOM automatically recognises the "id" attribute as the identity of an element, subsequently accessible using document.getElementById, whereas the standard specifies that the DTD must indicate which attribute is used to identify a specific element type.

          Also by default, text nodes are not added if they contain only whitespace, although this behaviour can be overridden when an XML document is imported.
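For contrast, here is how a standard DOM behaves on both points. This sketch uses the JDK's built-in parser, not PDOM, with a made-up three-line document and no DTD: whitespace-only text nodes are kept, and getElementById finds nothing until the application flags the attribute as an ID itself.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class StandardDomDefaults {
    // Illustrative document: one element, surrounded by whitespace, no DTD.
    static final String XML = "<topics>\n  <topic id=\"t1\"/>\n</topics>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(XML);
        Element root = doc.getDocumentElement();

        // The whitespace before and after <topic> becomes two text nodes.
        System.out.println("child nodes: " + root.getChildNodes().getLength());

        // Without a DTD, no attribute is known to be of type ID.
        System.out.println("before: " + doc.getElementById("t1"));

        // DOM Level 3 lets the application declare the ID attribute explicitly.
        Element topic = (Element) root.getElementsByTagName("topic").item(0);
        topic.setIdAttribute("id", true);
        System.out.println("after: " + doc.getElementById("t1").getTagName());
    }
}
```

PDOM's defaults, as described above, skip both of these steps: the "id" attribute is recognised automatically and the whitespace-only nodes are not stored.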

          XPath

          Support is provided for using XPath to return nodes from the DOM.

          // Start from a known element and query relative to it
          PNode someNode = doc.getElementById("someId");
          XPathQuery query = new XPath(".//baseNameString/text()");

          query.setContext(someNode);

          Iterator nodes = query.execute();

          // Each result is a text node; print its value
          while (nodes.hasNext()) {
              Text txt = (Text) nodes.next();

              System.out.println("baseNameString : " + txt.getNodeValue());
          }

          A number of utility methods are provided to make this even simpler, for example:

          PNode someNode = doc.getElementById("someId");
          Iterator nodes = someNode.queryXPath(".//baseNameString/text()");

          This produces the same result.

          Creating XPathQuery objects directly may have some advantages, though. For example, a query object can be passed as an argument to a method and applied to other, computationally chosen nodes by simply calling setContext for each node to be queried.
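The same "build the query once, apply it to many context nodes" pattern exists in the JDK's own javax.xml.xpath package, which is used here as a runnable analogue. This is not the PDOM XPathQuery class; the class name, helper methods, and sample document are assumptions made for the sketch.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ReusableQuery {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    /** Compiles the expression once so it can be reused for many context nodes. */
    static XPathExpression compile(String expression) {
        try {
            return XPathFactory.newInstance().newXPath().compile(expression);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("invalid XPath: " + expression, e);
        }
    }

    /** Evaluates the precompiled query against each context node and gathers node values. */
    static List<String> valuesFor(XPathExpression query, List<Node> contexts) {
        List<String> values = new ArrayList<>();
        try {
            for (Node context : contexts) {
                NodeList hits = (NodeList) query.evaluate(context, XPathConstants.NODESET);
                for (int i = 0; i < hits.getLength(); i++) {
                    values.add(hits.item(i).getNodeValue());
                }
            }
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("query failed", e);
        }
        return values;
    }

    public static void main(String[] args) {
        Document doc = parse("<map>"
                + "<topic><baseNameString>Tosca</baseNameString></topic>"
                + "<topic><baseNameString>Aida</baseNameString></topic>"
                + "</map>");
        List<Node> topics = new ArrayList<>();
        NodeList found = doc.getElementsByTagName("topic");
        for (int i = 0; i < found.getLength(); i++) {
            topics.add(found.item(i));
        }
        XPathExpression query = compile(".//baseNameString/text()");
        System.out.println(valuesFor(query, topics));
    }
}
```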

          XPath [Predicates]

          The XPath support now also includes predicates where before it was limited to object navigation. For example:

          PElement root = (PElement) doc.getDocumentElement();

          nodes = root.queryXPath(".//instanceOf/topicRef[starts-with(@xlink:href,'#wri')]");

          or

          nodes = root.queryXPath(".//instanceOf/topicRef[string-length(@xlink:href)=9]");

          It should be stressed, though, that XPath access should not be "abused". Ill-considered XPath queries may involve traversal of the entire XML tree where more focused queries could and should be used.
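The difference between a broad and a focused query can be sketched with the JDK's javax.xml.xpath package (again a stand-in, not PDOM's XPath support). The sample document is made up, and it uses a plain href attribute instead of xlink:href to avoid namespace setup. Both queries find the same node, but the first is free to search the whole document while the second only follows one explicit path below a known element.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class FocusedQuery {
    // Illustrative data only; attribute name simplified from xlink:href to href.
    static final String XML = "<topicMap>"
            + "<topic id=\"t1\"><instanceOf><topicRef href=\"#writer\"/></instanceOf></topic>"
            + "<topic id=\"t2\"><instanceOf><topicRef href=\"#opera\"/></instanceOf></topic>"
            + "</topicMap>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    /** Returns how many nodes the expression selects relative to the context node. */
    static int count(String expression, Node context) {
        try {
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, context, XPathConstants.NODESET);
            return hits.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("invalid XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(XML);
        Node firstTopic = doc.getDocumentElement().getFirstChild();

        // Broad: "//" considers every descendant of the document.
        System.out.println(count("//topicRef[starts-with(@href,'#wri')]", doc));

        // Focused: an explicit child path below one known topic.
        System.out.println(count("instanceOf/topicRef[starts-with(@href,'#wri')]", firstTopic));
    }
}
```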

          Performance

          PDOM is built using the Generic Persistent Object Model. No special optimization has been carried out to minimise storage requirements for the PDOM data model.

          Compared with the Xerces DOM, an in-memory PDOM requires over twice the Java memory that Xerces needs to store the same data. For example:

          Source XML    Xerces    PDOM (memory based)
          ----------    ------    -------------------
          523K          4.8Mb     10.2Mb

          The figures for the in-memory PDOM representation are a little disappointing. It would have been nice to show a broad equivalence with Xerces for in-memory options. Xerces is also significantly quicker than PDOM at parsing the document.

          It should be stressed that these figures demonstrate what an excellent product Xerces is. PDOM uses a generic representation that requires many Java objects. The "bloat" in the PDOM memory usage is mostly explained by the overhead associated with any object instance.

          However, if the PDOM is stored persistently, the memory requirement drops. Here are the figures for the PDOM memory requirements and the datastore disk space:

          Source XML    PDOM      GPO Datastore
          ----------    -----     -------------
          523K          1.9Mb     1.6Mb

          You may find it odd that the datastore is so small. This is achieved by various optimizations that ensure the object data is packed efficiently.

          Clearly, as objects are read in, the PDOM Java memory requirement will increase - particularly if the application retains references to many objects.

          It should be emphasised that the PDOM memory increases only very slightly as the source XML becomes bigger, while the backing datastore will be approximately three times the size of the source XML.

          Scalability

          The main reason to use PDOM is scalability. For small DOMs Xerces is an excellent choice, and its parsing performance is particularly impressive, but if you cannot predict what size the DOM will be, then PDOM provides a scalable solution.

          If you read in a 300Mb XML file, the Xerces DOM will require a Java VM of around two gigabytes just to hold the data, while PDOM would process the file with a backing store of around 1Gb and do so quite happily, even with a Java VM limited to 10Mb.

          Furthermore, processing a 300Mb XML file will take Xerces a considerable time - assuming the memory is available. Processing with PDOM will also take some time - perhaps several times longer than Xerces would (if it is able to do so) - but thereafter the DOM could be accessed directly rather than having to reprocess the file.

          Not having a 300Mb XML file around, here are some figures for a 5Mb file.

          Source XML    Xerces    PDOM      GPO Datastore
          ----------    ------    -----     -------------
          5.2Mb         21Mb      1.9Mb     13Mb
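The ratios implied by the figures quoted in this article can be checked with a little arithmetic. The sketch below only divides the published numbers; it measures nothing itself, and it assumes 1Mb = 1024K when converting the 523K source size.

```java
class FootprintRatios {
    /** Converts a size in K to Mb, assuming 1 Mb = 1024 K. */
    static double kToMb(double k) {
        return k / 1024.0;
    }

    /** How many times larger a footprint is than the source XML (both in Mb). */
    static double ratio(double footprintMb, double sourceMb) {
        return footprintMb / sourceMb;
    }

    public static void main(String[] args) {
        double small = kToMb(523); // the 523K source file
        double large = 5.2;        // the 5.2Mb source file

        System.out.printf("523K  Xerces heap   : %.1fx%n", ratio(4.8, small));
        System.out.printf("523K  PDOM in-memory: %.1fx%n", ratio(10.2, small));
        System.out.printf("523K  GPO datastore : %.1fx%n", ratio(1.6, small));
        System.out.printf("5.2Mb Xerces heap   : %.1fx%n", ratio(21.0, large));
        System.out.printf("5.2Mb GPO datastore : %.1fx%n", ratio(13.0, large));
    }
}
```

On these figures the datastore comes out at roughly 3.1 and 2.5 times the source size, in line with the "approximately three times" estimate, while the PDOM heap figure is 1.9Mb in both persistent cases.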

          When PDOM was used to produce these figures, I ran with

          java -mx10M

          This limits the Java heap to a maximum of 10Mb. The overhead will effectively remain constant no matter how big the DOM documents are or how many of them are stored in the datastore.
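To confirm that a heap limit is actually in effect, the running JVM can report its own ceiling. This is a small standalone check and not part of PDOM; the class name is arbitrary. On current JVMs the documented spelling of the option is -Xmx10m, and the reported value may differ slightly from the requested one.

```java
class HeapLimit {
    /** Maximum heap the JVM will attempt to use, in bytes. */
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    public static void main(String[] args) {
        double mb = maxHeapBytes() / (1024.0 * 1024.0);
        System.out.printf("max heap: %.1f Mb%n", mb);
    }
}
```

Run it with the same option as the program under test, for example `java -Xmx10m HeapLimit`.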

          Summary

          PDOM solves the problem of using the standard DOM API to access huge XML data files.

          The persistent DOM allows for XML files to be parsed once, and thereafter retrieved by name.

          The resource overhead on the java VM - and OS virtual memory - when retaining huge in-memory DOMs is removed.

          How Can I Get PDOM?

          PDOM is provided as part of the full Cut The Crap distribution and can be downloaded from www.cutthecrap.biz/software/downloads.html along with other Cut The Crap software.

          Posted on 2008-07-29 17:18 by gembin. Category: XML
