Importing Binary Data with JSL (2022-EU-45MP-946)

Hello, I'm Michael Hecht,

and I'm here today to talk to you about importing binary data in JMP using JSL.

I'm on the team that develops the software for JMP, and let's get started.

JMP can import lots of different file formats,

everything from plain text to Excel spreadsheets.

When those are imported into JMP, they're shown to you as a data table

that you can then use for further analysis with all of JMP's capabilities.

But what if there's a data type that you can't import?

JMP doesn't know how to do it.

In this case study, I'll be looking at the JPEG image format

that is pretty common amongst all digital cameras and smartphones.

I'm sure everyone's familiar with it.

In fact, you might be saying,

wait a minute, I thought JMP can open JPEGs,

and in fact, it can open a JPEG.

But when it does, you get an image like this.

But there's more in that JPEG than just the image.

For example, if I get information on this one,

I see what kind of device the image was taken with, what lenses were used,

and even the GPS coordinates of where I was standing when I took the photo.

Now, how can we get that data imported into JMP?

Well, we can do it through JSL,

which I'll get to in just a minute.

But if we open this file in a text editor, we see that it's not human readable text.

It's a series of bytes that are shown as these unprintable characters,

and we call that binary data.

It's data that's outside the range

of normal alphabetic text.

It has a structure, though, and the data is locked inside there.

If only we can determine how to unravel it.

To do that, we need to know

the specification of how this data is laid out.

In the case of JPEG, that's defined by a specification

called Exif or Exchangeable Image File Format,

and we can download the spec for it.

It's a document that's been around for about 20 years,

and it's in use by all the devices that produce JPEGs.

Not only hardware devices like cameras,

but even Photoshop puts metadata in a JPEG in this Exif format.

To access it from JMP,

we need to use a JSL object known as the BLOB or Binary Large Object.

This is just a JSL object that holds a sequence of bytes.

Like the name says, that sequence of bytes can be large.

We can actually create a BLOB by loading the contents

of any file on your hard disk into it using this Load Text File function.

Normally, that would return the contents of a file as text,

but if we add this BLOB keyword as a second parameter,

then the function returns a BLOB.

We can take one BLOB and subset a part of it

into another BLOB using BLOB Peek like you see here.

This is taking 50 bytes from b, starting at offset 100.

Now the offset for BLOBs is always starting at zero,

so the first byte in the BLOB is at offset zero.

Now we could do both of those operations in a single function call

by passing the offset and length as parameters

to the BLOB keyword when we call Load Text File.

If you see here that says, open the file at this path,

skip 100 bytes in and then read 50 bytes, and return that as a BLOB.
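For readers who want to try the same three operations outside JMP, here is a rough Python analogue. It is not the JSL from the slides: the temporary file and the names b, b2 and b3 are stand-ins chosen for illustration.

```python
import os
import tempfile

# Stand-in for a real file on disk: 256 bytes with the values 0..255.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(bytes(range(256)))
    path = f.name

# Load the whole file as raw bytes (the analogue of loading it as a BLOB).
with open(path, "rb") as f:
    b = f.read()

# Take 50 bytes starting at zero-based offset 100 (the analogue of Blob Peek).
b2 = b[100:150]

# Do both in one step: skip 100 bytes in, then read only 50.
with open(path, "rb") as f:
    f.seek(100)
    b3 = f.read(50)

os.remove(path)
print(len(b), len(b2), b2 == b3)
```

Reading only the slice you need avoids holding the whole file in memory, which matters more as files get large.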

Once we have a BLOB,

we can convert it to a character string using the Blob To Char function.

Here we're taking b2 and converting its bytes into a character string,

assuming that those bytes are in the "us-ascii" charset.

If we don't specify a charset, JMP assumes it's UTF-8.

We could also consider that BLOB

to contain a series of numeric values, all of the same type and size.

Using Blob to Matrix here, we're taking b2,

which we read in and set to be a length of 50 bytes,

and interpreting it as an array of unsigned integers,

each of which is two bytes long.

We should get back a matrix with 25 numbers in it.

Now that fourth parameter, the string "big,"

says that those unsigned integers are in big endian format,

meaning the first byte is the most significant,

is the highest part of the int,

and then followed by the lowest part of the int.

We could also specify the string "little" to specify little-endian format.

Binary files have

both of these kinds of representation of integers and other numbers.

In fact, the Exif format uses both big-endian and little-endian.
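As a sketch of the same two conversions in Python (an analogue, not the JSL functions themselves), the standard struct module makes the byte order explicit. The 50 sample bytes below are made up for illustration.

```python
import struct

# 50 hypothetical bytes: the byte pair 0x01 0x02 repeated 25 times.
b2 = bytes([0x01, 0x02]) * 25

# Bytes to text, assuming the bytes are US-ASCII.
text = b"Exif".decode("ascii")

# Interpret the 50 bytes as 25 unsigned two-byte integers.
# ">" means big-endian (most significant byte first), "<" means little-endian.
big = struct.unpack(">25H", b2)
little = struct.unpack("<25H", b2)

print(text, len(big), hex(big[0]), hex(little[0]))
```

The same two bytes give 0x0102 when read big-endian and 0x0201 when read little-endian, which is why a parser has to know which order a file uses.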

Let's take a look at these operations in action.

I'm going to switch over to JMP,

and I'm going to open this demo script here, Demo number 1.

Now we see some of the code we just looked at.

Here we are loading a text file, this file named Beach.jpeg.

This is the same file that I used in my slide.

It's right here and you can see it.

You can also see that it has a size of about 3.4 meg.

When I run this one line of code

and look at the log, it tells me that

b was assigned a BLOB of 3,450,000, et cetera, bytes, or 3.4 meg.

It doesn't show me all those bytes, but I can see how big it is.

I can get that length using the Length function,

just like you use for character strings.

But when I use Length on a BLOB, it gives me back the number of bytes.

I can

get a sub-BLOB of the first six bytes

in b using BLOB Peek.

We'll do that here,

and I see a BLOB of six bytes was assigned.

I can actually look at the value if I want by just

submitting the name of the variable.

I can see here are those six bytes in this "ascii-hex" format

FF-D8-FF-E1, et cetera.

I can take these six bytes and convert them to a matrix,

and I'm going to convert them to a matrix of two-byte unsigned ints,

or shorts, in big-endian format.

Given that there are six bytes here,

I should end up with a matrix of three numbers.

When I run this, sure enough, there's my three numbers.

We can see those three numbers in Hex

just to verify with this little For loop that I wrote, so let's do that.

There they are, just same as before, FFD8, FFE1, 0982.

So now, let's look at the next four bytes following those six in the file,

and we'll get them in a sub-BLOB all by themselves

so that we can then convert them to a character string using Blob To Char.

When I run this, I get the character string "Exif."

You may have noticed in the slide

showing the binary file contents

that that little string was up there near the top,

and it's part of the Exif file specification and identifies it as such.
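The demo steps can be replayed on just the ten bytes mentioned so far. This is a Python sketch, not the demo script: the byte values come from the log output described above, and the variable names are illustrative.

```python
import struct

# The first ten bytes as described in the demo: six header bytes, then "Exif".
header = bytes.fromhex("FFD8FFE10982") + b"Exif"

first_six = header[0:6]

# Six bytes read as three big-endian unsigned shorts.
shorts = struct.unpack(">3H", first_six)

# Show each short as four hex digits to compare with the raw bytes.
hex_strings = [format(s, "04X") for s in shorts]

# The next four bytes, decoded as US-ASCII text.
tag = header[6:10].decode("ascii")

print(hex_strings, tag)
```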

Let's go back to the slides.

Those functions are powerful.

They let us do what we need to do to manipulate and read data from a BLOB,

but they're a little cumbersome to use.

Let's write our own utilities to make them a bit more manageable.

I'm going to start with a function that I've named Read Value

and it takes a BLOB and then some offset within that BLOB,

and then the numeric type I want to read and the size of that type.

It's going to read one value out of the BLOB.

I pass my BLOB and offset and size into Blob Peek,

get back a sub-BLOB of just those bytes, and then call Blob To Matrix

passing in the type.

I use the same size,

so the size of the BLOB and the size of an element are the same.

I should get back a matrix of one value

and I pass in "big" because I'm just always going to use big-endian format.

But I don't want to return a matrix.

I want to return that one value, so I pull it out of the matrix and return that.

This is called like so. I call Read Value, I pass in b.

I read one unsigned int starting at offset zero

and it's two bytes long and I get back that value FFD8.
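A minimal Python sketch of a helper with the same shape follows. It is an analogue rather than the JSL function from the slide: it handles only unsigned integers, and the name read_value and the UINT_CODES table are assumptions made for this example.

```python
import struct

# struct format codes for the unsigned integer sizes handled here.
UINT_CODES = {1: "B", 2: "H", 4: "I"}


def read_value(blob, offset, size):
    """Read one big-endian unsigned integer of `size` bytes at `offset`."""
    chunk = blob[offset:offset + size]          # sub-sequence of just those bytes
    (value,) = struct.unpack(">" + UINT_CODES[size], chunk)
    return value                                # a single number, not a 1-tuple


b = bytes.fromhex("FFD8FFE10982")
first = read_value(b, 0, 2)
print(format(first, "04X"))
```

Python hands the function a reference to the bytes object rather than a copy, so the copying cost discussed next for JSL does not arise in this analogue.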

There's a problem with this code though, and that's in this parameter b.

B is that BLOB that's 3.4 meg in size.

The problem is that JSL, when passing a BLOB to a function,

always passes it by value, meaning it makes a copy of it.

For every single number I want to pull out of my BLOB using this function,

it will make a copy of that 3.4 meg

just to pull out two bytes, or four bytes or whatever, and then throw it away

when the function returns.

That's inefficient, wasteful, and probably really slow,

so we don't want to do that.

How can we get around that?

Well, instead of passing it as a parameter, let's put it in a global.

We'll make a global that we load with our BLOB,

and then we can call the function

and it'll just refer to the global instead of a parameter.

In fact, we can make a bunch of globals.

We can record the length of the BLOB, the offset

to where we are currently processing data in the BLOB,

and maybe even the endianness.

The problem with globals, though,

is that they are in the global symbol table,

and they might interfere with other code that you have.

In fact, we'd like to write our importing code as something that can be used

by other clients,

and those clients might have their own variables by these names,

or they might be using other code libraries that would interfere.

How do we get around that?

Well, I've done it by creating a namespace.

I call my namespace "EXIF Parser."

Now, instead of globals, I put them all as variables inside my namespace,

and now they're namespace globals with that prefix.

Before I call my function, I need to initialize them.

I'll load the "Beach.jpeg" file into the EXIF Parser BLOB,

I'll record its length.

I'll start off my offset at the very beginning at zero,

and I'll set the endianness to "big,"

and then I can change my function, simplifying it a bit like this.

Now, I've actually put my function Read Value in the same namespace,

as part of the EXIF Parser namespace.

Now, all I need to do is pass a type and a size.

BLOB Peek now uses the global BLOB that's stored in the namespace,

the global current offset that we're reading from,

and Blob To Matrix even uses the endianness,

so we can parameterize that, we can change it.

Once I've retrieved the result I want,

I'll increment the offset by the number of bytes

we just processed, and then I'll return it as before.
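The idea of shared parsing state can be sketched in Python with a small class standing in for the namespace. This is an analogue under stated assumptions: the class name ExifParser, its attribute names, and the restriction to unsigned integers are all choices made for this example.

```python
import struct

UINT_CODES = {1: "B", 2: "H", 4: "I"}


class ExifParser:
    """Shared parsing state, playing the role of the namespace variables."""

    def __init__(self, blob):
        self.blob = blob
        self.length = len(blob)
        self.offset = 0          # current read position, starting at zero
        self.endian = "big"      # "big" or "little"

    def read_value(self, size):
        """Read one unsigned integer at the current offset, then advance."""
        prefix = ">" if self.endian == "big" else "<"
        chunk = self.blob[self.offset:self.offset + size]
        (value,) = struct.unpack(prefix + UINT_CODES[size], chunk)
        self.offset += size      # step past the bytes just read
        return value


parser = ExifParser(bytes.fromhex("FFD8FFE10982"))
first = parser.read_value(2)
second = parser.read_value(2)
print(format(first, "04X"), format(second, "04X"), parser.offset)
```

Because the offset lives with the data, callers only say what they want to read next, not where it is.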

Let's see what that looks like in action.

We'll look at Demo 2

and here's my namespace.

Here are my globals in that namespace.

Here's the Read Value function, just like we saw.

I've got some more functions that I've put in EXIF Parser.

Here's Read Short, which just calls Read Value,

but it always passes unsigned integer of two bytes.

Similarly, Read Long reads an unsigned int of four bytes.

I've also got Read Ascii, to which you pass a size in bytes,

and it makes a sub-BLOB of that many bytes

at the current offset from the global BLOB,

and then calls Blob To Char to convert it into a string.

It's using the "us-ascii" charset,

because that's the charset that the Exif specification says

all of its character data uses.

Then just like with Read Value, it increments the offset past the bytes

that were already processed and returns the string.

Let's submit all of this code

so that those things are all defined and then we can try to use them.

First, we'll initialize our EXIF Parser globals,

and then I'll read the first three shorts from the file just like we did before.

But now, I'm going to call Read Short.

We're starting it off at zero,

so I'm going to call it three times in this loop.

It will read in each successive short,

advancing offset as it goes, and then print them out just like before.

There they are.

Now, our offset is sitting at offset six just past the last thing it read.

I can call Read Ascii for four bytes and I get back that same string.
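Here is a compact Python sketch of the same three convenience readers, run against the ten bytes seen in the demo. The class Reader and its method names are hypothetical; they mirror the roles of Read Short, Read Long and Read Ascii without reproducing the JSL.

```python
import struct


class Reader:
    """Sequential reader that keeps its own offset and endianness."""

    def __init__(self, blob):
        self.blob, self.offset, self.endian = blob, 0, "big"

    def _unpack(self, code, size):
        prefix = ">" if self.endian == "big" else "<"
        (value,) = struct.unpack_from(prefix + code, self.blob, self.offset)
        self.offset += size
        return value

    def read_short(self):
        return self._unpack("H", 2)      # unsigned, two bytes

    def read_long(self):
        return self._unpack("I", 4)      # unsigned, four bytes

    def read_ascii(self, size):
        text = self.blob[self.offset:self.offset + size].decode("ascii")
        self.offset += size
        return text


r = Reader(bytes.fromhex("FFD8FFE10982") + b"Exif")
shorts = [format(r.read_short(), "04X") for _ in range(3)]
offset_after_shorts = r.offset
tag = r.read_ascii(4)
print(shorts, offset_after_shorts, tag)
```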

Okay, so let's go back to slides.

Now, we have some tools we can use to start building our EXIF Parser.

We need to dig into the specifications

to see what the Exif data format in this JPEG file looks like.

Well, at the top level, it looks something like this.

It starts with two bytes, which are what's called the start of image marker.

We've already seen those two bytes.

They're the value FFD8.

If your file doesn't start with that, it's not a JPEG.

Then there's a series of blocks of data,

and each block starts with two bytes, which is a marker,

then two bytes, which is a size,

and then some data, which is however many bytes the size said there were.

Now the size also includes itself, so really the data is size minus two.

You can see there are a bunch of different block types defined,

but some of them are optional, some of them can be repeated.

The ones that we care about are APP1 and APP2.

That's where the EXIF data will be.

Then there's a bunch of others that we don't care about.

Eventually, we see one called Start of Scan or SOS.

When we hit that, we know that the next part of the file

will be the actual image data, which is the pixels.

When we hit that, we can stop.

Then after the image data is end of image.

We need an algorithm to read this data.

Here's what we'll use.

First, we'll read the first two bytes,

see if it matches the start of image marker,

and then we know we have a JPEG.

Then we'll have a while loop, where within the while loop

each time through, we'll process a single block.

To do that, we will save the current offset position,

read the two bytes for the next marker,

and if that marker is SOS, we can break out of the loop.

Next, we read the two bytes for the size.

Now we have all the information we need to process the block.

Whatever that entails, we'll do it.

Then we can skip past the data in case processing the block

didn't change our offset at all, but we'll explicitly move our offset

to whatever it was at the beginning of the loop,

plus the two bytes for the marker, plus the value of the block size.

When we get out of the loop,

we either ran out of data in the BLOB to process or we hit that SOS marker,

so we're done.
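The loop just described can be sketched in Python as follows. This is an illustration of the algorithm, not the talk's JSL: the function name walk_blocks and the tiny hand-built byte string are assumptions, while the marker values are the standard JPEG ones (FFD8 start of image, FFE1 APP1, FFDB quantization table, FFDA start of scan).

```python
import struct

SOI, SOS = 0xFFD8, 0xFFDA   # start-of-image and start-of-scan markers


def walk_blocks(data):
    """Return (offset, marker, size) for each block before the SOS marker."""
    (first,) = struct.unpack_from(">H", data, 0)
    if first != SOI:
        return None                          # not a JPEG
    blocks = []
    offset = 2
    while offset + 2 <= len(data):
        start = offset                       # save the current position
        (marker,) = struct.unpack_from(">H", data, start)
        if marker == SOS:
            break                            # image data follows; stop here
        if start + 4 > len(data):
            break                            # ran out of data
        (size,) = struct.unpack_from(">H", data, start + 2)
        blocks.append((start, marker, size))
        # Skip the block: two bytes of marker plus `size` bytes,
        # since the size field counts its own two bytes.
        offset = start + 2 + size
    return blocks


# A tiny hand-built example: SOI, an APP1 block, a DQT block, then SOS.
data = (bytes.fromhex("FFD8")
        + bytes.fromhex("FFE10006") + b"Exif"
        + bytes.fromhex("FFDB0004") + b"\x00\x00"
        + bytes.fromhex("FFDA"))
blocks = walk_blocks(data)
for start, marker, size in blocks:
    print(start, format(marker, "04X"), size)
```

Computing the next offset from the saved start, rather than from wherever block processing left off, keeps the walk correct even if a block handler reads only part of its data.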

Let's see what that looks like.

Demo 3 has this code.

You see, we have our namespace, and all this is the same as before.

I'm going to run it just to make sure everything is defined.

Now, we're adding a new function,

which I've also put in the namespace,

and it's called EXIF Parser:Get EXIF Data Raw,

and I'm passing in the path to the file, that JPEG file that we want to process.

Now, I've defined an associative array here that maps those magic marker codes

to their abbreviations so that we can print them out on the log.

I load up my

Exif Parser globals like before,

only now I'm passing in the file path that I was given,

and then I start interpreting what the data is.

First, I look at the very first short

and make sure it's the start of image marker.

If not, I just return because it's not a JPEG file.

I'm going to write to the log that I saw it at offset zero.

Then here's my while loop to walk through those blocks.

At the top of the loop,

I'm going to reset my endianness to "big," because some of the blocks,

when we process them, will have their own endianness

and change it to little.

We want to know that the endianness is big at the top of the loop

because the block structure always uses big-endian data.

Then I'm going to save whatever the current offset is,

and then I'm going to read the next marker.

It's a short, and I look to see if it's equal to SOS,

which is that magic number.

If it is, I can break out of the loop after logging that I saw it.

Next, I'll read the two bytes for the block size,

and then I will process the block.

Now, in this example, my processing consists of writing a message to the log,

so I'll do that.

Then I'm ready to skip past the block.

I do that by changing my offset to be whatever it was at the beginning,

plus the two bytes for the marker, plus the block size.

When I break out of the loop, I reset my globals and I'm done.

Let's define that function by submitting this.

I'll run script, and then I can call it passing in "Beach.jpg,"

let's see what we get.

It printed out to the log.

At offset zero, there's start of image, then at offset two, there's APP1,

and it has this size 2,466 bytes.

Then we get APP2, which has about 30K of data.

That's most of it right there.

Then we have a bunch of blocks that we don't really care about,

but eventually, we see SOS so we break out of the loop.

That's all working well.

Let's go back to slides.

Now, I'm going to skip ahead in processing some of this file format

just for the sake of time.

But if you download the paper that's associated with this talk,

the full code is there and much more detail.

I highly recommend that you do that,

but I'm going to give you a flavor of it here.

What we do next is we process each of those blocks that we have read,

and some of them we can ignore.

We want to filter out the blocks that do not contain Exif data.

Then for the ones that do, we need to do their own parsing.

What we've discovered when we look into the Exif specification

is that these blocks contain their own set of blocks of data

called Image File Directories or IFDs.

Then those contain individual metadata information

with tags saying what the data is

and then what format it's in, and then the data itself.

We want to collect all of those things together into these lists.

There'll be lists of lists of lists, a somewhat complicated data structure.

But the list data structure is very generic

in JMP and can hold all kinds of data, so that's what we want to use.

It'll have those metadata items tagged

with these numeric values that we call the raw tags,

but we want to replace those

with actual human readable labels that identify what they are.

Let's look at this in JMP, and I'll look at Demo 4.

Now, at this point,

I have taken all my code in the "EXIF Parser" namespace

and put it into its own file.

Now Demo 4 is a client of my code, which is in "EXIF Parser.jsl."

I can just include that.

Now, the function that gets the data,

I can call it passing in the name of the JPEG file.

Now I've extended this function in here to actually process those blocks,

and break them down, filter out the ones that are not EXIF,

all the things we just talked about, and give us back that data structure.

Let's run this and see what we get.

Well, we get a lot of numbers, some strings.

You see, this is in lists within lists here,

and there's an outermost list,

and then it contains different items which are lists.

Each of those have these pairs of values.

There's a number which is the raw tag, and then the value.

This one's a string. This one is also a string.

This one's a number. This one's a matrix.

The data can be different types, but the tags are all numbers.

What we want to do next is convert those numbers into human readable items.

In fact, this whole list,

we want to convert it into an associative array that indexes the data by keys.

The first thing I'm going to do

is define a mapping from these numbers to the human readable keys we want to use.

That's in this long associative array right here.

Now, I'm actually using the Hex values for the keys,

because that's how they're specified in the Exif specification,

and it makes it easier to follow along when you're looking at the spec.

There's a bunch of those,

and I'm going to start at the bottom and work my way up.

Down here is this function, Label Raw Data.

I pass in this whole data structure that we got back from parsing the BLOB.

Here's the definition of Label Raw Data.

I'm going to return

a list as my result,

so I'm going to walk through the list as my input and use it

to build up the result list.

I use this For Each construct, which is a new modern JMP function

that walks through a list, and for each element of the list,

it pulls that out into this variable raw exif, which I'll pass to this function.

Then I want to append it to my result. I do that using this Insert Into line.

I'm inserting it to the end of result, and I have to use Eval List to overcome

something that JMP is doing to be helpful with list creation.

Again, there's more detail about this stuff in the paper.

It's worth downloading and checking into.

But for here, we're just going to look at the call to Label EXIF,

and that's right up here.

Label EXIF is going to do a similar thing

where it's going to walk through each of these tag value pairs.

Instead of returning a list, it's going to return an associative array.

Here, we are initializing results to be an associative array.

That's what this token means.

It's an empty associative array, and then we'll return it at the end.

We'll also use For Each to walk through the list,

and we know that each raw item is going to be a list of two elements.

We get the first element,

which is the raw tag, and then the second element is the data.

We simply build up our associative array by adding

an item keyed by the raw tag with the value of the data.

Pretty simple, except we don't want the key to be the raw tag.

We want to transform it using our lookup table.

That's what Get Tag does, and that's defined next up here.

It simply takes the tag id we pass in, converts it to hexadecimal.

This will give us back a four-character hexadecimal string.

We need the rightmost four characters from that,

and then we look it up in our ifd tag array up here.
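The labeling step can be sketched in Python with a dictionary in place of the associative array. This is an analogue, not the talk's code: the three tag IDs are standard Exif/TIFF tags, while the function names, the sample values (the date in particular is made up), and the choice to keep unknown tags under their hex code are assumptions for illustration.

```python
# A few entries from the Exif tag tables, keyed by four-digit hex strings
# so they can be compared directly against the specification.
IFD_TAGS = {
    "010F": "Make",
    "0110": "Model",
    "0132": "DateTime",
}


def get_tag(tag_id, table):
    """Convert a numeric raw tag to its label via a hex-keyed lookup table."""
    key = format(tag_id, "04X")
    return table.get(key, key)       # assumption: unknown tags keep their hex code


def label_exif(raw_items, table):
    """Turn a list of [raw_tag, value] pairs into a dict keyed by label."""
    result = {}
    for raw_tag, value in raw_items:
        result[get_tag(raw_tag, table)] = value
    return result


raw = [
    [271, "Apple"],
    [272, "iPhone 12 Pro"],
    [306, "2021:06:01 10:30:00"],    # hypothetical timestamp
    [4660, 1],                       # 0x1234, not in the table
]
labeled = label_exif(raw, IFD_TAGS)
print(labeled)
```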

I'm going to submit all this code

to define it, and then we'll have it call Label Raw Data.

Here's the result.

You can see that it's similar to before, except now,

inside the topmost-level list, it doesn't have another list,

it has an associative array.

This topmost thing is a list of associative arrays.

This first one we can see that the raw key, whatever the number was,

got converted to DateTime and there's its value and so on.

But we notice the second one looks like it has tags

that didn't get converted.

Why is that?

Well, it's because this key ExifIFD has as its data,

actually, another IFD. Yes, this is a recursive data structure

that's defined in terms of itself.

If we want to label the things inside here,

we have to change our code to label recursively,

and we'll get to that in a minute.

But before I leave this, I want to show that I'm going to actually

combine these two steps into a single function that I call Get EXIF Data,

where I first get the raw Exif data out of the BLOB,

and then I label it, and then I return the result of that.

Let's define and run that,

and it should be exactly the same as what we just saw.

Sure enough, it is.

I'm going to close this and go back to slides.

Yeah, skipping ahead again,

as I mentioned, we have to do our labeling recursively.

As I mentioned, some entries in our metadata or IFDs

have as their data another IFD.

That means we have to call our labeling routine recursively.

The way that I do it,

is to use this JSL built-in Recurse, which calls the current function,

and you can pass in separate parameters for the recursive call.

There's more details on that in the paper,

which I'm sure you've already downloaded at this point.

Now, the one thing to be aware of

is that some of these embedded IFDs, most of them use the same look-up table

that we already defined,

but some of them have their own lookup table.

We have to make sure we're passing

the correct look-up table with its own definition of tags

to our recursive call as we're going through the different levels of recursion.
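A Python sketch of recursive labeling with per-IFD lookup tables follows. It is an illustration only: Python calls the function by name rather than through anything like Recurse, the tag IDs are standard Exif ones, and the names label_ifd, SUB_IFD_TABLES and the label "GPSIFD" are assumptions for this example.

```python
IFD_TAGS = {"0110": "Model", "8825": "GPSIFD"}
GPS_TAGS = {
    "0001": "GPSLatitudeRef",
    "0002": "GPSLatitude",
    "0003": "GPSLongitudeRef",
    "0004": "GPSLongitude",
}

# Labels whose value is itself an IFD, and the table that nested IFD uses.
SUB_IFD_TABLES = {"GPSIFD": GPS_TAGS}


def label_ifd(raw_items, table):
    """Label one IFD, recursing into nested IFDs with the right table."""
    result = {}
    for raw_tag, value in raw_items:
        key = format(raw_tag, "04X")
        label = table.get(key, key)
        if label in SUB_IFD_TABLES:
            # The value is another list of [tag, value] pairs.
            value = label_ifd(value, SUB_IFD_TABLES[label])
        result[label] = value
    return result


raw = [
    [0x0110, "iPhone 12 Pro"],
    [0x8825, [[0x0001, "N"], [0x0003, "W"]]],
]
labeled = label_ifd(raw, IFD_TAGS)
print(labeled)
```

Passing the table as a parameter is what lets tag 0001 mean one thing inside the GPS IFD and something else, or nothing, elsewhere.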

Then once we have a fully labeled data structure returned,

we can extract pieces from it to get the things we're interested in.

We can run that over a whole series of images

and collect all that data into a data table or some other format.

Let's look at that in JMP.

We'll pull up Demo 5, which here I've

rolled all of that recursive labeling into my Get EXIF Data function.

I'm going to include my namespace code and then run that function.

Now we're getting back

our fully labeled

data structure.

You can see that now this ExifIFD has labels in it.

There is this big block of data of numeric stuff,

and we look and we see that that's in this thing called Maker Note.

Maker Note is a special extension to the Exif specification

that allows the maker of a particular device, in this case, Apple,

the maker of an iPhone 12, to embed their own proprietary data.

In some cases, a camera manufacturer might reveal what they've embedded there.

In other cases, people have sort of

guessed at it and come up with their best guess.

That's the case with Apple.

There's some things that are known and other things that aren't known.

You see a lot of this is just untagged.

But some things we can see, acceleration vector, and runtime,

and whatnot in there.

Anyway, we're going to ignore that for the most part

and look at what this thing contains.

I can see that it's an associative array in that first element,

and that's where most of the things I want to deal with are,

so I'm going to pull that out into its own variable right here.

It has 14 elements, so we can see what those are, what their keys are

like so.

There are those keys.

If we want to pull something out like "Model,"

I can do that simply by subscripting into exif1 with "Model"

and I can see it's iPhone 12 Pro.

I can do the same thing to get the date time, and there it is.

But this is in the date time format that the Exif specification defines.

That's not a format that JMP recognizes,

but I can use JMP's Informat function to convert it into something

using this Format Pattern option, which is a modern JSL thing

that lets us specify the pattern of the date time data,

and JMP will convert it to a numeric date time,

which it recognizes as such, and formats it for the log.
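For comparison, the same pattern-based conversion looks like this in Python. The Exif specification writes date-times as "YYYY:MM:DD HH:MM:SS" with colons in the date part; the sample value here is hypothetical.

```python
from datetime import datetime

# Hypothetical Exif timestamp; note the colons separating year, month and day.
exif_datetime = "2021:06:01 10:30:00"

# Describe the layout with a pattern and let the library build a real date-time.
parsed = datetime.strptime(exif_datetime, "%Y:%m:%d %H:%M:%S")

print(parsed.isoformat())
```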

That worked.

Now, I'm also interested in the GPS coordinates,

and that's in this GPS IFD part of the EXIF.

It is one of those that is itself an IFD.

Let's access it, and then we can see what it contains.

It has information about the altitude,

differential, image direction, latitude, longitude, and speed.

What we care about is the latitude and longitude, which is these four things.

There's latitude, there's longitude,

and then they have these associated Ref values.

We need all four of those to compute the coordinates.

Let's start with the latitude.

We'll pull that out into its own variable, and we see it's a list of three elements.

If we look at what those three elements are,

we see that there are three vectors with two numbers in each one.

The Exif specification refers to these as rationals, and it uses them a lot.

But what we want is actual numbers instead of these rationals,

numerator, denominator.

We can convert them using this Transform Each function,

which loops across this list and processes each element

after putting it into a local variable r.

We want to process that by dividing

the denominator into the numerator, and then Transform Each builds a new list

of those results and puts it in our variable,

like you see there.

Now these three numbers are the degrees, minutes, and seconds,

but we want to combine them all into a single value for JMP to use.

We have to add them together, scaling each component appropriately.

Do that there and we get

the value.

Now, if that's in the Northern hemisphere, we're fine.

But if it's in the Southern hemisphere, it needs to be negative.

If we look at that "GPSLatitudeRef,"

it's either N or S, which tells us if it's S, we need to negate it.

We can do the same thing for longitude.

Its LongitudeRef will be either E or W.

E is positive east values and W is negative west values.

Here we see we had to negate

the longitude because it was in the West.
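The whole conversion, from three rationals plus a reference letter to one signed number, fits in a few lines of Python. This is a sketch with made-up coordinates; the function name to_decimal_degrees is an assumption, and each rational is represented as a (numerator, denominator) pair.

```python
def to_decimal_degrees(rationals, ref):
    """Combine (degrees, minutes, seconds) rationals into signed decimal degrees."""
    # Each rational is a (numerator, denominator) pair.
    degrees, minutes, seconds = (num / den for num, den in rationals)
    # Scale each component: 60 minutes per degree, 3600 seconds per degree.
    value = degrees + minutes / 60 + seconds / 3600
    # South latitudes and west longitudes are negative.
    return -value if ref in ("S", "W") else value


# Hypothetical coordinates: 35 deg 30 min 0 sec N, 78 deg 45 min 0 sec W.
latitude = [(35, 1), (30, 1), (0, 1)]
longitude = [(78, 1), (45, 1), (0, 1)]

lat = to_decimal_degrees(latitude, "N")
lon = to_decimal_degrees(longitude, "W")
print(lat, lon)
```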

If we dump those two numbers out, we can use JMP's built-in formats for

latitude and longitude,

and we can verify that they match

what the Finder can pull out of the file as those coordinates.

We can see that it's North and West, so it's in North America.

Now I'm going to skip to Demo 6, where we put all this together.

I'm going to use that code to pull information

out of a whole bunch of photos.

In this folder, I have 16 of them,

and they're photos that I've taken at previous JMP Discovery Summits

in Europe in years back when we used to do them in person.

It'll put all the info it finds into a data table.

I'm going to run this and here's our data table,

and we can see that I've captured the names of the image files

and the timestamps.

You can see it goes from 2016 to 2018,

and the Lat and Longs are North and East, so that's Europe.

I can even see the progression of various iPhone models I had

across those years and how their lenses improved over time.

I've set this up so that I can select a row

and click this Get Info table script, and it opens a window for me

that shows me the photograph and the metadata for it

that I've captured.

I even have a button here showing Google Maps,

so I can click that, and up pops a Google map of that location.

It's right there with the red marker.

I can see if I zoom in that this is the Hilton Amsterdam.

That's where we had the conference in 2016.

That all seems to work well.

In this case,

I'm going to add the photos themselves to my data table as an expression column.

That's what this Add Photos table script does.

For larger collections, I would not want to do that

because it's actually making a copy of those photos into the data table.

But for 16, it's fine.

For thousands or even hundreds, you'd probably not want to do that.

But it's also set my new column to be a label column.

Now I can run this Explorer script which opens

a graph builder of those latitudes and longitudes.

I can see some points here. Here's Amsterdam.

There's a photo we just saw and here's some more.

That's definitely Amsterdam.

These are Brussels. Yes, that's Brussels.

Over here, we have Frankfurt.

That year, we got to go to the cool

supercar Museum.

That was pretty neat.

Over here we have Prague.

I'm going to use the magnifier to zoom in on Prague a couple of times.

At this point our detailed Earth background is not really helping us much

so I'm going to switch to street map service.

We can see, yes, that's definitely Prague,

and here is where we rode the historic street cars up to Prague Castle.

Here are some JMP attendees crossing the bridge to Prague Castle.

I can see John Sall there in the distance.

Over here, we have a very nice reception that we held in the municipal hall.

Here's me checking the time

on my Apple Watch against the Orloj to make sure that it's right.

That all seems to be doing what we want.

In conclusion, I want to touch on the things that we learned.

We learned about the JSL object BLOB,

which is a good tool to have on your tool belt

for manipulating arbitrary binary data.

We use that to build up a little application for importing files.

Along the way, we learned some things about namespaces, and JSL recursion,

specialized list handling, and some modern JSL things like Transform Each, For Each.

Again, those things are covered in much more detail in the paper.

But most importantly,

I think is that we saw a case study of how to take a difficult problem

like a complex file format

and break it into smaller subtasks that we could conquer.

That's a skill that we all have and need to make use of in our professional work.

But I think it's many times helpful to

observe someone else doing it and pick up tips and tools and techniques

that we can then use in our own work.

Now I want to turn it over to you

to take these tools and use them to import and write the code to create your own

binary files and solve your own problems.

Thank you very much.