cancel
Showing results for 
Show  only  | Search instead for 
Did you mean: 
JMP is taking Discovery online, April 16 and 18. Register today and join us for interactive sessions featuring popular presentation topics, networking, and discussions with the experts.
Choose Language Hide Translation Bar
uday_guntupalli
Level VIII

Help With XML Parsing

All,
    This is in connection to a different question on the forum (https://community.jmp.com/t5/Discussions/Can-JMP-Support-Excel-Formatting-through-JSL/m-p/86355#M384...) . I am providing a small subset of a really large XML file ( XML of a worksheet in Excel). I would like to see a good example on how to traverse the XML tree to get to a targeted element efficiently. Can someone provide a good example on what is the best way to use the Parse XML or a different function in JSL to get a handle to the value here and edit the value ? 

     

<c r="A1" s="67" t="s">
     <v>0</v>
</c>



<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<worksheet
    xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
    xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
    xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" mc:Ignorable="x14ac"
    xmlns:x14ac="http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac">
    <dimension ref="A1:BB101473" />
    <sheetViews>
        <sheetView showGridLines="0" tabSelected="1" zoomScaleNormal="100" workbookViewId="0">
            <selection activeCell="A13" sqref="A13" />
        </sheetView>
    </sheetViews>
    <sheetFormatPr defaultRowHeight="15" x14ac:dyDescent="0.25" />
    <cols>
        <col min="1" max="1" bestFit="1" width="27.5703125" customWidth="1"/>
        <col min="2" max="2" width="5.140625" customWidth="1"/>
        <col min="3" max="3" width="5.140625" customWidth="1"/>
        <col min="4" max="4" width="5.140625" customWidth="1"/>
        <col min="5" max="5" width="5.140625" customWidth="1"/>
        <col min="6" max="6" width="5.140625" customWidth="1"/>
        <col min="7" max="7" width="5.140625" customWidth="1"/>
        <col min="8" max="8" width="5.140625" customWidth="1"/>
        <col min="9" max="9" width="5.140625" customWidth="1"/>
        <col min="10" max="10" width="6.5703125" customWidth="1"/>
        <col min="11" max="11" width="5.140625" customWidth="1"/>
        <col min="12" max="12" width="5.140625" customWidth="1"/>
        <col min="13" max="13" width="5.140625" customWidth="1"/>
        <col min="14" max="14" width="2.85546875" customWidth="1"/>
        <col min="15" max="15" bestFit="1" width="22" customWidth="1"/>
        <col min="16" max="16" width="15.85546875" customWidth="1"/>
        <col min="17" max="17" width="5.140625" customWidth="1"/>
        <col min="18" max="18" width="33" customWidth="1" style="43"/>
        <col min="19" max="19" width="14.42578125" customWidth="1" style="43"/>
        <col min="20" max="20" width="15.85546875" customWidth="1" style="43"/>
        <col min="21" max="22" width="15.85546875" customWidth="1" style="44"/>
    </cols>
    <sheetData>
        <row r="1" ht="24">
            <c r="A1" s="67" t="s">
                <v>0</v>
            </c>
            <c r="B1" s="67"/>
            <c r="C1" s="67"/>
            <c r="D1" s="67"/>
            <c r="E1" s="67"/>
            <c r="F1" s="67"/>
            <c r="G1" s="67"/>
            <c r="H1" s="67"/>
            <c r="I1" s="67"/>
            <c r="J1" s="67"/>
            <c r="K1" s="67"/>
            <c r="L1" s="67"/>
            <c r="M1" s="67"/>
            <c r="N1" s="10"/>
            <c r="O1" s="4"/>
            <c r="P1" s="4"/>
            <c r="Q1" s="4"/>
            <c r="R1" s="27"/>
            <c r="S1" s="27"/>
            <c r="T1" s="27"/>
        </row>
	</sheetData>
</worksheet>

   I tried the following and got an empty table: 

 

Parse XML(xml,
			On Element("sheetData",Start Tag(New Table())),
			On Element( "col", End Tag( New Column( XML Attr( "row r = '1' ht ='24'" ), Set Values( Parse( XML Text() ) ) ) ) )
		 );
Best
Uday
6 REPLIES 6
Craige_Hales
Super User

Re: Help With XML Parsing

if nothing else updates the XML, you could keep it really simple with a little pattern matching:

 


Pat Match(
	xml, // source
	Pat Regex( "\[<c\s+r="A1"\s*s="67"\s*t="s">\s*<v>]\" ) >> prefix + // what comes before goes in prefix
	Pat Regex( "[^<]*" ) >> old + // the middle part we care about, probably some digits
	Pat Regex( "\[</v>\s*</c>]\" ) >> suffix, // what comes after goes in suffix
	prefix || "42" || suffix // replace the matched section with this
);
show(old);

 

which will replace the 0 with 42. It shows the old value (0). run it again and it will show 42.

I used \s+ and \s* rather than a hard coded set of white space characters. The suffix/prefix is captured from the document, including the document's actual white space characters, so the only change should be the 0 -> 42. The middle bit, [^<]*, captures a run of zero or more characters that are not < (the beginning of the next tag).

 

Be sure to test the return code from patmatch(). If it isn't 1, the XML is no longer what the pattern is expecting.

Craige
uday_guntupalli
Level VIII

Re: Help With XML Parsing

@Craige_Hales
      This is really helpful. Can we try to tap in to the inherent XML structure which has well defined structure ? 
      The reason I ask this question is - while Pat Regex approach is fast and efficient, I am worried that it might require constant changes to the script to keep up with even minute changes in the source Excel. While targeting Attributes rather than strings might be more robust ? 

Best
Uday
uday_guntupalli
Level VIII

Re: Help With XML Parsing


@uday_guntupalli wrote:

@Craige_Hales
      This is really helpful. Can we try to tap in to the inherent XML structure which has well defined structure ? 
      The reason I ask this question is - while Pat Regex approach is fast and efficient, I am worried that it might require constant changes to the script to keep up with even minute changes in the source Excel. While targeting Attributes rather than strings might be more robust ? 


Update:  As I start building my solution, the more Pat Regex operations I use, the slower the solution is becoming. It doesn't look scalable. I have atleast 20 fields that I need to replace plus a time series of 8760 values in 5 columns that I need to replace. Currently over 8 regex operations, JMP is taking an average of 3.25 seconds per each, can you recommend any other scalable alternatives. Additionally, I don't get 1 at the end of Pat Regex. 

x = char(10); 

Pat Match(
	   xml, // source
           Pat Regex( "\[<c\s+r="J6"\s*s="48"\s*t="s">\s*<v>]\" ) >> prefix + // what comes before goes in prefix
	   Pat Regex( "[^<]*" ) >> old + // the middle part we care about, probably some digits
	   Pat Regex( "\[</v>\s*</c>]\" ) >> suffix, // what comes after goes in suffix
	   prefix || x|| suffix // replace the matched section with this
        );
		 
Best
Uday
gzmorgan0
Super User (Alumni)

Re: Help With XML Parsing

@uday_guntupalli from your earlier post and this email string, I still am not sure what you are trying to do? Excel owns its format.  I made a simple read from one xml related file and wrote to the file and saved the zipped file and when I opened it in Excel it said it was corrupted but when I selected the option to fix it, the Excel workbook contained my changed values.  (Maybe I wrote to a formula cell??).

 

Are you trying to use a large Excel file as a database to read and write? are you trying to change formatting?  What is the specific task including scope (size, task, udate in place, add rows, delete rows, etc.).

 

Below is a snippet from one XML file in a 3 sheet XLSX file craeted from Big Class.jmp, Iris.jmp, and Semiconductor Capability.jmp. This is the first sheet.  Item to note teh dimension is A1:E41 since the first row is column names. Once you find the Tag sheetData, the data is organized by row and cell, not easy to replace a column.  Also note the cells with strings have extra coding t =\!"s\!" denoting it is a string and the value is a number.  If you need that string value  you need to to lookup that string value in the file "xl/sharedStrings.xml". The reference numbering starts with 0.  For this simple 3 sheet Excel file there are 490 unique strings, String 44 is "F" and 45 is "M" for gender.

 

If you are interested in reading and writing XML there is a secion in JSL Companion 2nd Edition and several scripts and examples. But For what I "tink" you are trying to do, I wouldn't do it with XML.  

"xl/worksheets/sheet1.xml"
"<?xml version=\!"1.0\!" encoding=\!"UTF-8\!" standalone=\!"yes\!"?>
<worksheet xmlns=\!"http://schemas.openxmlformats.org/spreadsheetml/2006/main\!" 
  xmlns:r=\!"http://schemas.openxmlformats.org/officeDocument/2006/relationships\!" 
  xmlns:mc=\!"http://schemas.openxmlformats.org/markup-compatibility/2006\!" mc:Ignorable=\!"x14ac xr xr2 xr3\!" 
  xmlns:x14ac=\!"http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac\!" 
  xmlns:xr=\!"http://schemas.microsoft.com/office/spreadsheetml/2014/revision\!" 
  xmlns:xr2=\!"http://schemas.microsoft.com/office/spreadsheetml/2015/revision2\!" 
  xmlns:xr3=\!"http://schemas.microsoft.com/office/spreadsheetml/2016/revision3\!" 
  xr:uid=\!"{00000000-0001-0000-0000-000000000000}\!">
<dimension ref=\!"A1:E41\!"/>
<sheetViews><sheetView tabSelected=\!"1\!" workbookViewId=\!"0\!"/></sheetViews>
<sheetFormatPr defaultRowHeight=\!"14.4\!" x14ac:dyDescent=\!"0.3\!"/>
<cols>
 <col min=\!"1\!" max=\!"1\!" width=\!"10.33203125\!" bestFit=\!"1\!" customWidth=\!"1\!"/>
 <col min=\!"2\!" max=\!"2\!" width=\!"7.77734375\!" bestFit=\!"1\!" customWidth=\!"1\!"/>
 <col min=\!"3\!" max=\!"3\!" width=\!"7.6640625\!" bestFit=\!"1\!" customWidth=\!"1\!"/>
 <col min=\!"4\!" max=\!"4\!" width=\!"11\!" bestFit=\!"1\!" customWidth=\!"1\!"/>
 <col min=\!"5\!" max=\!"5\!" width=\!"11.6640625\!" bestFit=\!"1\!" customWidth=\!"1\!"/>
</cols>
<sheetData>
 <row r=\!"1\!" spans=\!"1:5\!" ht=\!"18.600000000000001\!" x14ac:dyDescent=\!"0.45\!">
   <c r=\!"A1\!" s=\!"1\!" t=\!"s\!"><v>0</v></c>
   <c r=\!"B1\!" s=\!"1\!" t=\!"s\!"><v>1</v></c>
   <c r=\!"C1\!" s=\!"1\!" t=\!"s\!"><v>2</v></c>
   <c r=\!"D1\!" s=\!"1\!" t=\!"s\!"><v>3</v></c>
   <c r=\!"E1\!" s=\!"1\!" t=\!"s\!"><v>4</v></c>
 </row>
<row r=\!"2\!" spans=\!"1:5\!" x14ac:dyDescent=\!"0.3\!">
   <c r=\!"A2\!" t=\!"s\!"><v>5</v></c>
   <c r=\!"B2\!"><v>12</v></c>
   <c r=\!"C2\!" t=\!"s\!"><v>44</v></c>
   <c r=\!"D2\!"><v>59</v></c>
   <c r=\!"E2\!"><v>95</v></c>
</row>
gzmorgan0
Super User (Alumni)

Re: Help With XML Parsing

p.s. for the snippet of XML you had in your post, only cell A1 had a value. It was of type string and you would have to get its value from the file  "xl/sharedStrings.xml"

uday_guntupalli
Level VIII

Re: Help With XML Parsing

@gzmorgan0,
        Thank you for a detailed response. 
         Objective: Take a large Excel workbook with lot of formatting and edit the values in the sheet without changing the formatting. 

         Options I have come across so far: 

  1. Use an R package (like XLConnect (too slow) or openxlsx (messes up formatting and slow) )
  2. Use JSL 
    1. Using JSL, the only viable option I have come across so far to preserve the formatting is based on Ian's suggestion i.e. to tap into the XML behind the Excel workbook which is what I am trying to do. 
    2. There may be a way to use XPath Query like so: 
      XPath Query("\[<c r="J6" s="48"><v>1</v></c>]\","//v/text()");
      It returns the right value in the cell, however then the question falls on how do you replace this value without performing a Global Replace. 
  3. Use VB script with Run Program as you suggested. 

         Since, the XML behind sheet1 is so big (100,000 lines or more), I took a small snippet to provide an example to seek the community's help on how I can drill down the hierarchy efficiently. I was hoping to keep the entire solution within JSL and make it clean and efficient, but it looks like I have to rely on other programs if there are no other options to modify the XML. Do you think there are any other options that I am missing ? Is there a better way to handle this problem ? 

Best
Uday