Reducing memory usage with XslCompiledTransform

I've been using an XslCompiledTransform in my software to apply an XSL stylesheet to a user-provided XML file and then bulk-loading the resulting file into a SQL table. This worked great until an XML file larger than 100 MB was thrown into the mix. The RAM usage jumps off the charts and, for most users, my program exits with an OutOfMemoryException. I've looked for ways to get around this and cannot seem to find any solutions.


What can I do differently, or what alternatives do I have, other than walking through the XML file line by line (or node by node) and performing the operations the XSL file does myself? That is my LAST option, considering I am sometimes dealing with 2 GB and 3 GB XML files.

Oh, and here's a small excerpt of my code. And no, I don't use any XSL scripts; it's a simple, basic XSL file and simple, basic C# code:

C#:
-----------------------------
// Requires: using System.IO; using System.Xml; using System.Xml.Xsl;
XslCompiledTransform transform = new XslCompiledTransform(false); // false = no debug info
transform.Load(tempXSLFile); // load the stylesheet straight from its path

string tempFile = System.IO.Path.GetTempFileName();

// 'using' closes both files even if Transform throws.
using (XmlReader sourceReader = XmlReader.Create(mSourceFile))
using (TextWriter destWriter = new StreamWriter(tempFile))
{
    transform.Transform(sourceReader, null, destWriter);
}
-----------------------------
XSL File:

------------------------------------------------
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" indent="yes" />

    <xsl:template match="/*">
        <Data>
            <xsl:apply-templates select="Document"/>
        </Data>
    </xsl:template>

    <xsl:template match="Document">
        <xsl:call-template name="proc" >
            <xsl:with-param name="id" select="DocID/text()" />
            <xsl:with-param name="txt" select="value/text()" />
        </xsl:call-template>
    </xsl:template>
    <xsl:template name="proc">
        <xsl:param name="id" select="DocID" />
        <xsl:param name="txt" select="value/text()" />
        <xsl:choose>
            <xsl:when test="contains($txt, 'value')" >
                <xsl:element name="document" >
                    <xsl:element name="docid" >
                        <xsl:value-of select="$id"/>
                    </xsl:element>
                    <xsl:element name="value" >
                        <xsl:value-of select="substring-before($txt, 'value')" />
                    </xsl:element>
                    <xsl:element name="fieldname" >
                        <xsl:text>NEWVAL</xsl:text>
                    </xsl:element>
                </xsl:element>
            </xsl:when>
            <xsl:otherwise>
                <xsl:element name="document" >
                    <xsl:element name="docid" >
                        <xsl:value-of select="$id"/>
                    </xsl:element>
                    <xsl:element name="value" >
                        <xsl:value-of select="$txt" />
                    </xsl:element>
                    <xsl:element name="fieldname" >
                        <xsl:text>NEWVAL</xsl:text>
                    </xsl:element>
                </xsl:element>
            </xsl:otherwise>
        </xsl:choose>
        <xsl:variable name="left" select="substring-after($txt, 'value')" />
        <xsl:if test="string-length($left)>1" >
            <xsl:call-template name="proc" >
                <xsl:with-param name="id" select="DocID/text()" />
                <xsl:with-param name="txt" select="$left" />
            </xsl:call-template>
        </xsl:if>
    </xsl:template>
</xsl:stylesheet>
---------------------------------------------

Thanks!




  • nhaas

    Hi,

    I am facing the same problem. For a 700 KB XSLT file and a 2 KB XML file, the transformation takes almost 50 MB of memory using the XslCompiledTransform object.

    If you have any solution to minimize this, please let me know.

    Cheers,

    Ravi


  • Meliphar

    Well, basically, I have XML files that are larger than 100 MB, and they have data in them such as "test1;test2;test3;". I have written an XSL stylesheet to find nodes with this text and break it out into individual nodes containing "test1", "test2", "test3" and so on. Since I am dealing with such large XML files, parsing through the entire file is a bit too time-consuming, so I decided on using an XSL stylesheet and the XslCompiledTransform object. This works great (and fast!) except that it uses too much memory. This data is being prepared for a bulk load into SQL, and we had originally written a stored procedure to do the parsing, but it is VERY slow. Are there any other options I have?
  • Ruud Poutsma

    XSLT usually works on a tree model where the complete XML input is parsed and a tree for it is built in memory on which the XSLT processor executes the stylesheet. Processing 100MB files that way sounds doable but as you also mention 2 or 3 GB files I am not sure XSLT is a viable choice. In the end it obviously depends on the details of the system you want to run the transformation on.



  • windowsxp168

    Thanks for all the help, but a few days ago I went the XmlReader route and I simply perform the transformations in C# instead of using XSLT. The process goes through each document node with the XmlReader, parses the value of the node with regular expressions, and then writes out the parsed values with an XmlWriter. I find it to be the sloppiest way, but everything I've tried using XSLT takes up too much memory. It's faster than I expected, but still not as fast as using the XslCompiledTransform object to do all the work. Thanks again!
  • ameyayashu

    I've tried that in the past, and when dealing with large XML files it requires a lot of time, so I guess the trade-off here is time vs. RAM. Using the XslCompiledTransform takes less than 40 seconds on a 200 MB XML file (and requires 800 MB of RAM), but parsing through the XML file with XmlReader takes about 3 minutes (and takes up less than 1 MB of RAM). I'm not sure about the Transform object on a 3.5 GB XML file, but using the XmlReader on the 3.5 GB XML file takes at least 10 minutes to parse.
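    A time vs. RAM comparison like the one above could be reproduced with a small helper along these lines. This is only a sketch: the class name RunStats and its two methods are hypothetical names made up for illustration, and it relies on Stopwatch and Process.PeakWorkingSet64 from System.Diagnostics.

```csharp
using System;
using System.Diagnostics;

static class RunStats
{
    // Runs 'work' once and returns the wall-clock time it took.
    public static TimeSpan Time(Action work)
    {
        Stopwatch sw = Stopwatch.StartNew();
        work();
        sw.Stop();
        return sw.Elapsed;
    }

    // Peak working set of the current process so far, in megabytes.
    public static double PeakWorkingSetMB()
    {
        using (Process p = Process.GetCurrentProcess())
        {
            return p.PeakWorkingSet64 / (1024.0 * 1024.0);
        }
    }
}
```

    The peak working set is a high-water mark for the whole process, so each approach (XslCompiledTransform vs. XmlReader) would need to run in its own process for the memory numbers to be comparable.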
  • Dr Crs

    For very large XML files, using XmlReader to read through the file and XmlWriter to write a new file is a much better option in terms of memory consumption, as XmlReader reads node by node and does not build a complex in-memory model of the complete XML document.

  • Simon bridgens

    First of all, I'd like to rewrite your XSLT a bit. This will not solve the problem, but it improves readability.

    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <xsl:output method="xml" indent="yes" />

        <xsl:template match="/">
            <!-- With for-each instead of a template here we avoid one recursion level -->
            <xsl:for-each select="*">
                <Data>
                    <!-- I assume that all "Document" elements are always children of the root element (like /*/Document)
                         and replace apply-templates with for-each here as well. If this assumption is not correct, put it back. -->
                    <xsl:for-each select="Document">
                        <xsl:call-template name="proc">
                            <xsl:with-param name="id" select="string(DocID)" />
                            <xsl:with-param name="txt" select="string(value)" />
                        </xsl:call-template>
                    </xsl:for-each>
                    <!-- Using string(DocID/text()) instead of DocID/text() is important to prevent allocating node-set objects that are unneeded in your case.
                         (DocID/text() is the collection of all text nodes of all DocID elements, and value-of would take the string value of the first of them anyway.)
                         string(DocID) is not the same as string(DocID/text()): string(DocID) is the concatenation of all text nodes of the first DocID element.
                         Most likely you need string(DocID), and if these values were ever different you had a bug. -->
                </Data>
            </xsl:for-each>
        </xsl:template>

        <xsl:template name="proc">
            <xsl:param name="id" /> <!-- removed unused default values -->
            <xsl:param name="txt" />
            <document><!-- Used literal result elements instead of xsl:element for readability -->
                <docid><xsl:value-of select="$id"/></docid>
                <value>
                    <xsl:choose>
                        <xsl:when test="contains($txt, 'value')">
                            <xsl:value-of select="substring-before($txt, 'value')" />
                        </xsl:when>
                        <xsl:otherwise>
                            <xsl:value-of select="$txt" />
                        </xsl:otherwise>
                    </xsl:choose>
                </value>
                <fieldname>NEWVAL</fieldname>
            </document>
            <xsl:variable name="left" select="substring-after($txt, 'value')" />
            <xsl:if test="string-length($left)>1">
                <xsl:call-template name="proc">
                    <xsl:with-param name="id" select="$id" /> <!-- you already have the $id string calculated; no reason to recalculate it -->
                    <xsl:with-param name="txt" select="$left" />
                </xsl:call-template>
            </xsl:if>
        </xsl:template>
    </xsl:stylesheet>



  • Poetjevel

    Finally, this is C# code that uses a streaming algorithm to process the input:

    using System;
    using System.Xml.XPath;
    using System.Xml;
    using System.Diagnostics;

    class Class1 {
        static void Main() {
            using (XmlReader r = XmlReader.Create("data.xml")) {
                XmlWriterSettings ws = new XmlWriterSettings();
                ws.Indent = true;
                using (XmlWriter w = XmlWriter.Create("out.r.xml", ws)) {
                    while (r.Read()) { // <xsl:for-each select="*">
                        if (r.NodeType == XmlNodeType.Element) {
                            ProcessRootElement(r, w);
                        }
                    } // </xsl:for-each>
                }
            }
        }

        // Reader is on the root start tag; on return it is on the root end tag.
        static void ProcessRootElement(XmlReader r, XmlWriter w) {
            int depth = r.Depth;
            w.WriteStartElement("Data"); //<Data>
            if (!r.IsEmptyElement) {
                do { // <xsl:for-each select="Document">
                    r.Read();
                    if (
                        r.NodeType == XmlNodeType.Element &&
                        r.LocalName == "Document" &&
                        r.NamespaceURI.Length == 0
                    ) {
                        ProcessDocument(r, w);
                        Debug.Assert(r.LocalName == "Document");
                    }
                } while (depth < r.Depth);
                Debug.Assert(depth == r.Depth && r.NodeType == XmlNodeType.EndElement);
            }
            w.WriteEndElement(); //</Data>
        }

        // Collects the first DocID and the first value of one Document, then writes the output.
        static void ProcessDocument(XmlReader r, XmlWriter w) {
            int depth = r.Depth;
            if (!r.IsEmptyElement) {
                string docID = null;
                string value = null;
                do {
                    r.Read();
                    if (r.NodeType == XmlNodeType.Element && r.NamespaceURI.Length == 0) {
                        if (r.LocalName == "DocID" && docID == null) {
                            docID = ReadTextValue(r);
                        } else if (r.LocalName == "value" && value == null) {
                            value = ReadTextValue(r);
                        }
                    }
                } while (depth < r.Depth);
                Debug.Assert(depth == r.Depth && r.NodeType == XmlNodeType.EndElement);
                if (docID == null) docID = "";
                if (value == null) value = "";
                ProcessValue(docID, value, w);
            }
        }

        static string[] separator = new string[] {"value"};

        // Writes one <document> per piece of 'value' split on the separator.
        static void ProcessValue(string docID, string value, XmlWriter w) {
            string[] values = value.Split(separator, StringSplitOptions.None);
            foreach (string txt in values) {
                w.WriteStartElement("document"); //<document>
                w.WriteStartElement("docid"); //<docid><xsl:value-of select="$id"/></docid>
                w.WriteString(docID);
                w.WriteEndElement();
                w.WriteStartElement("value"); // <value>
                w.WriteString(txt); // <xsl:value-of select="$txt"/>
                w.WriteEndElement(); // </value>
                w.WriteStartElement("fieldname"); //<fieldname>NEWVAL</fieldname>
                w.WriteString("NEWVAL");
                w.WriteEndElement();
                w.WriteEndElement(); //</document>
            }
        }

        // Returns the string value of the current element, like string(.) in XPath.
        static string ReadTextValue(XmlReader r) {
            int depth = r.Depth;
            string text = "";
            if (!r.IsEmptyElement) {
                do {
                    r.Read();
                    // CDATA is included because it is part of an element's string value too.
                    if (r.NodeType == XmlNodeType.Text || r.NodeType == XmlNodeType.CDATA || r.NodeType == XmlNodeType.Whitespace || r.NodeType == XmlNodeType.SignificantWhitespace) {
                        text += r.Value;
                    }
                } while (depth < r.Depth);
                Debug.Assert(depth == r.Depth && r.NodeType == XmlNodeType.EndElement);
            }
            return text;
        }
    }



  • Lorenatcallwave

    This is an XML document I used to verify the results. Because of the comment and the PI, your stylesheet gives a different result than mine.

    <Data>
        <Document>
            <DocID>12</DocID>
            <value>value 5value8<!--test comment--> 4value 65<?pi test pi ?>
            </value>
        </Document>
        <Document>
            <DocID>13</DocID>
            <value>4</value>
        </Document>
    </Data>



  • Hoopla

    Martin, as usual, is absolutely right. XSLT is defined as a transformation over cached (in-memory) data and as a result has problems scaling up to large documents. The limit depends on your machine and on what you do in the transformation, but at some point you will not be able to fit the document in memory.

    It’s theoretically possible to implement XSLT, or a subset of it, over a stream of XML nodes (over XmlReader). We (MS) don’t have any solution in this area yet.

    There are different approaches to dealing with the problem. The most typical is splitting the large XML into chunks and processing them one by one. It should be simple in your case: you can put each “Document” element into a separate file. Some teams write their own caches that “lazily” load the document and so cache it chunk by chunk.

    I like streaming XML processing. It is my hobby, and I’ll try to convince you that it is the best way to deal with such problems. Processing with XmlReader can’t be slower or take more memory than XSLT, because to load the cache you in any case read the input with a reader, and at the end of the transformation you write with XmlWriter. This is a price you can’t avoid, but it is really minimal. If it’s slower, something is wrong with the way you process it.
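
    The chunk-by-chunk approach mentioned above could look roughly like the sketch below. It is not code from this thread: the class name XmlChunks and the method Stream are hypothetical, and it assumes each chunk is a no-namespace element (such as “Document”) small enough to hold in memory on its own. It uses XNode.ReadFrom from System.Xml.Linq to materialize one element at a time while the rest of the file stays on disk.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;
using System.Xml.Linq;

static class XmlChunks
{
    // Lazily yields each element called 'name' (no namespace) found in the
    // input, one at a time, so only the current chunk is held in memory.
    public static IEnumerable<XElement> Stream(XmlReader reader, string name)
    {
        reader.MoveToContent();
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element &&
                reader.LocalName == name &&
                reader.NamespaceURI.Length == 0)
            {
                // ReadFrom consumes the whole element and leaves the reader
                // on the node that follows it, so no extra Read() here.
                yield return (XElement)XNode.ReadFrom(reader);
            }
            else
            {
                reader.Read();
            }
        }
    }
}
```

    Each yielded XElement could then be written to its own file with Save, or handed to a transform through CreateReader(). A stylesheet applied this way would see one “Document” as its root rather than the whole input, so it would need to be written with that in mind.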


