thickey1
Level III

Pull ZIP Files from HTTP Link

I have published ZIP files that I want to pull programmatically and store on my PC using JSL. I don't know up front how many files will be present.

Is this possible with JSL?

 

(attached image: zip.png)
1 ACCEPTED SOLUTION

Craige_Hales
Super User

Re: Pull ZIP Files from HTTP Link

Or maybe this is closer to what you are asking:

path="https://www.vsp.virginia.gov/downloads/"; // a page with an index of files. Yours may be different format, adjust pattern below.
html = loadtextfile(path); // get the HTML text so we can scrape the links
// somewhat custom pattern for scraping the links, may be specific to this page
urls = {}; // this list will collect the urls 
rc = patmatch(html,
	patpos(0)+ // make sure the pattern matches from the start
	patrepeat( // this is the loop that extracts the urls from the html
		(
			// the urls look like <a href="2017%20Virginia%20Firearms%20Dealers%20Procedrures%20Manual.pdf">
			// and we want just the part between the quotation marks. Quickly scan forward (patBreak)
			// for a < then see if it matches. >>url grabs the text between quotation marks.
			(patbreak("<") + "<a href=\!"" + patbreak("\!"") >> url + pattest(insertinto(urls,url);1))
			| // OR
			patlen(1) // skip forward one character
		)
		+
		patfence() // fence off the successfully matched text. There is no need to backtrack if something goes wrong.
	) + 
	patrepeat(patnotany("<"),0) + // any trailing bits of html are consumed here
	patrpos(0) // make sure the pattern matches to the end
);

if(rc==0, throw("pattern did not match everything"));
show(nitems(urls),urls[6]); // pick item 6. You'll have a different strategy.

fullpath = path||regex(urls[6],"%20"," ",GLOBALREPLACE);// minimal effort to fix up the url, might need more work

pdfblob = loadtextfile(fullpath,blob); // download item 6, it is a pdf when this was written...
savetextfile("$temp/example.pdf",pdfblob); // save it somewhere
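
Since the question was about an unknown number of ZIP files, one way to extend this is sketched below (the ".zip" filter and the $temp target folder are assumptions; adjust them to the actual page):

// sketch only: loop over every scraped link, keep the ones ending in ".zip", and save each to $temp
for( i = 1, i <= nitems(urls), i++,
	if( endswith( lowercase(urls[i]), ".zip" ),
		name = regex(urls[i],"%20"," ",GLOBALREPLACE); // same minimal url cleanup as above, might need more work
		zipblob = loadtextfile( path||name, blob );    // download as binary
		savetextfile( "$temp/"||name, zipblob );       // save under the link's own name
	)
);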
Craige


4 REPLIES
Craige_Hales
Super User

Re: Pull ZIP Files from HTTP Link

 

Here's an example using a zip file from the @wilkap presentation:

za=open("https://community.jmp.com/kvoqx44227/attachments/kvoqx44227/virtual-jug/12/1/VJUG%20July%202015.zip","zip");
zipfiles=za<<dir;
show(zipfiles);
blob=za<<read(zipfiles[4],format(blob));
dt=open(blob,jmp);
clearglobals(za);

Several things to note:

  • The file is downloaded to your temp directory; the "zip" option to Open returns a zip archive object.
  • You can get a list of members from the zip archive using <<dir.
  • You can use the blob format with the zip archive for reading binary data like JMP tables.
  • The open(blob, jmp) call uses a 2nd argument to tell Open that the blob is a JMP data table.
  • Clearing the za variable is needed if you rerun the whole script; otherwise the zip archive object keeps the file in the temp directory from being reused.
  • You could use loadtextfile/savetextfile with blobs to download the zip file to a location of your choice (and delete it when done) and then use the zip archive to process that file; a sketch of that follows this list.
  • I already checked that the 4th item in the archive directory is a JMP data table.
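
A minimal sketch of the loadtextfile/savetextfile idea from that list (the save location below is only an example; the URL is the same one used above):

// sketch only: download the zip as a blob, save your own copy, then open that copy as a zip archive
zipurl = "https://community.jmp.com/kvoqx44227/attachments/kvoqx44227/virtual-jug/12/1/VJUG%20July%202015.zip";
localzip = "$documents/VJUG July 2015.zip"; // pick your own location
zipblob = loadtextfile( zipurl, blob ); // download the zip as binary
savetextfile( localzip, zipblob );      // write the copy to disk
za = open( localzip, "zip" );           // open the local copy as a zip archive
zipfiles = za << dir;
show( zipfiles );
clearglobals( za );                     // release the archive object first
deletefile( localzip );                 // optional: remove the copy when done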

 

Craige
thickey1
Level III

Re: Pull ZIP Files from HTTP Link

Thanks Craige for the comprehensive reply. I'll take elements of both suggestions and merge them into a generic function to suit my current and future needs.

 

I know I'd have to use a regexp to find the links in the HTTP source, but I was hoping for a '<< saveLink' function for the zip part.
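
Just to illustrate the idea (JSL has no << saveLink message; the saveLink name and the example URL below are made up), a small wrapper over loadtextfile/savetextfile gets close:

// hypothetical helper, not a built-in: download a link target as a blob and save it locally
saveLink = function( {url, dest},
	{data},
	data = loadtextfile( url, blob ); // fetch the link target as binary
	savetextfile( dest, data );       // write it to the chosen destination
	dest                              // return the saved path
);
// example: saveLink( "https://www.example.com/somefile.zip", "$temp/somefile.zip" );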

 

This will work perfectly fine though.

 

Great answer(s)

 

 

Craige_Hales
Super User

Re: Pull ZIP Files from HTTP Link

Glad you can get something out of it! I'm pretty sure the pattern could be improved speed-wise. It probably doesn't make a difference for directories of only a few thousand links, but the patLen(1) fallback could skip non-link text faster. And a more flexible pattern for the links would be better too.
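
For what it's worth, one untested sketch of that speed-up: keep everything else the same, but let the fallback alternative consume everything up to and including the next "<" in a single step, so the loop jumps from tag to tag instead of advancing one character at a time (this assumes html has already been loaded as in the earlier script):

// untested sketch: same scraper loop, but the fallback skips past the next "<" in one step
urls = {};
rc = patmatch(html,
	patpos(0) +
	patrepeat(
		(
			(patbreak("<") + "<a href=\!"" + patbreak("\!"") >> url + pattest(insertinto(urls,url);1))
			| // OR: the "<" found was not an <a href=...> link, so consume up to and including it
			(patbreak("<") + patlen(1))
		)
		+
		patfence() // commit each iteration so nothing is re-scanned on backtracking
	) +
	patrepeat(patnotany("<"),0) + // trailing html with no more "<" is consumed here
	patrpos(0)
);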

 

@bryan_boone @ErnestPasour @paul_vezzetti 

 

 

Craige