Parallel Plots for Visual Configuration Management of Model Applicability
At GE Aerospace, we build predictive models to optimize utilization of aircraft engines against the cost of maintaining them. These models may be applied to many engine families within which exist multiple combinations of hardware configuration. To complicate things further, there are myriad other labels we apply to various groups or fleets of engines (e.g., airline, region, etc.).
Parallel plots provide a relatively simple means to visualize and manage these models’ applicability to specific engine serial numbers based on their specific configuration and grouping. As an additional model metric, this visual management helps prioritize updates and tuning of individual models based on impact.
All right. Hello. Welcome to my talk. My name is Lee Becker. I'm a Principal Analytics Engineer with GE Aerospace. I work in the data science organization. This is going to be a talk on an interesting find within JMP that I discovered recently in terms of a useful graphical tool to help with data exploration.
You can read my abstract here. It's also published on the website. But at GE Aerospace, we build predictive models to optimize utilization of aircraft engines against the cost of maintaining them. These models may be applied to many engine families within which exist multiple combinations of hardware configuration. To complicate things further, there are myriad other labels we apply to various groups or fleets of engines, like airline, region, et cetera.
Parallel plots provide a relatively simple means to visualize and manage these models' applicability to specific engine serial numbers or assets based on their specific configuration and grouping. As an additional model metric, this visual management helps prioritize updates and tuning of those models based on the impact that they have.
With that, I'll go ahead and get started. My first set of bullets here, my background, I've been with GE Aerospace for 23 years in various engineering roles. Most having to do with flight data, and especially with trending and dealing with the data that comes off of our assets. Like I said, managing time on wing and trying to balance that with the cost of maintaining.
Currently, my role is Principal Analytics Engineer in an organization called Data Science & Analytics. I am an engineer. I'm not in data science, so I always get that caveat on saying what my role is. I'm a visual learner. I'm very passionate about visualization of data. The last caveat that I'll put in here is when I deliver talks externally, I have to make stuff up. All the data has been masked to protect some of our customers, but it is based on some real case studies.
As I mentioned in the abstract, at GE Aerospace, we build a lot of predictive models. These models tend to be very nuanced to specific groups of assets based on their configuration. Configuration in this context can mean things like engine family. I either have a small regional aircraft or large international, intercontinental flights that need to be supported by much larger, more powerful engines.
Module means, I've got a fan, I've got a compressor, I've got a combustor, I've got a turbine, suck, squeeze, bang, blow. That's the general configuration of a turbine engine. But these components of the engine can have various vintages or configurations of the hardware within them. The models that we build need to know that.
Then airline customers behave differently in their flight practices, but also where they operate. There's various regional effects, route structures, and other dimensions to this that have to be considered. We're trying to predict time on wing or engine life and manage those groups of assets. The graphic that I've got, and another caveat that I'll throw out there, all of the graphics that I've built in here, I was able to build through various platforms in JMP.
Maybe talk to me after, if you're interested in trying to reproduce maybe some of these maps or even back to the simple graphic of… It's a heatmap of the cross-section of an engine, very cartoonish, but at least colorization and blocking out these map elements was all done in JMP.
Back to it. Long story short, it gets complicated when trying to manage all of these dimensions. When you're trying to build predictive models, you've got all of these considerations as potential axes to try to separate fleets and keep things together. But the nuance for us is that these assets will be upgraded. They will change configurations over time.
Model management becomes an element of understanding what hardware is in what asset at any given point in time. Trying to manage, "Hey, this particular fleet has 75 engines with this compressor configuration. Last year, I had 100 of them with the same configuration, and 25 of them have been upgraded or changed to another configuration." I'll get into some of those graphics later.
But we'll just go ahead and get into the specific example that I've built here for talking purposes today. Imagine we're tracking five airlines, each having two different engine types that they're operating. Each of those engine types has four components: fan, compressor, combustor, and turbine. Each of those components can have multiple vintages or configurations within the fleet that they're trying to manage.
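To get a feel for how quickly that example grows, here is a minimal Python sketch of the configuration space. Only the counts (five airlines, two engine types, four components) come from the talk; every label, the number of vintages per component, and the `ESN` asset naming are hypothetical placeholders.

```python
import itertools
import random

# Hypothetical labels: only the counts (5 airlines, 2 engine types,
# 4 components) come from the talk; every name below is made up.
AIRLINES = ["Airline A", "Airline B", "Airline C", "Airline D", "Airline E"]
ENGINE_TYPES = ["Type 1", "Type 2"]
COMPONENT_CONFIGS = {
    "fan": ["2D", "3D"],
    "compressor": ["old", "mid", "new"],
    "combustor": ["v1", "v2"],
    "turbine": ["V1", "V2", "V3"],
}


def configuration_space():
    """Every possible (airline, engine type, fan, compressor, combustor, turbine) combination."""
    axes = [AIRLINES, ENGINE_TYPES, *COMPONENT_CONFIGS.values()]
    return list(itertools.product(*axes))


def make_fleet(n_engines, seed=0):
    """Build a masked, synthetic fleet table: one dict per engine."""
    rng = random.Random(seed)
    columns = ["airline", "engine_type", *COMPONENT_CONFIGS]
    space = configuration_space()
    return [
        {"asset": f"ESN{i:04d}", **dict(zip(columns, rng.choice(space)))}
        for i in range(n_engines)
    ]


combos = configuration_space()
fleet = make_fleet(200)
print(len(combos))  # 5 * 2 * 2 * 3 * 2 * 3 = 360 possible combinations
```

Even with these small, made-up level counts there are 360 possible combinations, which is why a single view across all axes is attractive.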
It really does get to be very complex to try to understand which assets have which configurations at any point in time. When you're starting to explore, well, what do these distributions look like? I think the first inkling is, I'm probably just going to start plotting things. Let's understand what these distributions actually look like. If I'm starting out, and I'm looking for a very initial look at some histograms, very basic distributions.
First instinct is probably just go ahead and utilize the header graphs here at the top. If you didn't know that feature existed within the more recent versions of JMP, try that out. This will give you very simple histograms. All my data is categorical, so there's no challenge here with looking at the continuous distributions and seeing a bunch of outliers and that kind of thing. But it's fairly easy and straightforward to look at categorical data this way and start to look at some of the relationships.
I think the next instinct is probably, go ahead and… Let's just go ahead and leverage the distributions platform, which is, again, fairly straightforward. You get a little bit more descriptive statistics for cases where you're dealing with the continuous data, so you get more information down here in the frequencies and the statistics. I've hidden them here just for simplicity's sake.
A third option, and this is maybe way out there and something that you might not have thought of, or didn't even want to consider. But you could go and build a contingency plot for each of these separate combinations of pairs of inputs or pairs of axes. I've done that here to try to illustrate that, yeah, that's hard to look at. I know some folks that do like to visualize the data this way, but when you're dealing with multiple dimensions, it does tend to get hard to digest what's actually going on here.
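As a rough illustration of why the pairwise approach gets hard to digest, the sketch below builds one contingency table per pair of columns in plain Python. The rows and the "Airline B" and "V2" labels are hypothetical, and this is not a reproduction of JMP's contingency platform, only the counting behind it.

```python
from collections import Counter
from itertools import combinations
from math import comb

# Hypothetical, masked rows; one dict per engine.
rows = [
    {"airline": "Air America", "fan": "2D", "turbine": "V1"},
    {"airline": "Air America", "fan": "3D", "turbine": "V2"},
    {"airline": "Airline B", "fan": "2D", "turbine": "V1"},
    {"airline": "Airline B", "fan": "3D", "turbine": "V2"},
    {"airline": "Airline B", "fan": "3D", "turbine": "V2"},
]


def contingency(rows, col_a, col_b):
    """Count how many rows fall in each (level of col_a, level of col_b) cell."""
    return Counter((r[col_a], r[col_b]) for r in rows)


def all_pairwise_tables(rows, columns):
    """One contingency table for every unordered pair of columns."""
    return {pair: contingency(rows, *pair) for pair in combinations(columns, 2)}


tables = all_pairwise_tables(rows, ["airline", "fan", "turbine"])
print(len(tables))  # 3 tables for 3 columns
print(comb(6, 2))   # 15 tables once all six columns of the example are included
```

The number of tables grows as n(n-1)/2 with the number of columns, so six categorical columns already means fifteen separate plots to cross-reference.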
Another comment that I'll make here is that even within JMP, there's a bunch of different ways to explore. There's usually two or three different ways of doing just about everything that you're trying to do inside of JMP. The interactivity is a huge benefit, especially when you're exploring. If I just pick one of the airlines out of the five in my airlines distribution, I can start to see which engine type and which fan configuration, which compressor configuration.
I can see those distributions in the highlighting change as I select through the different bars within those distributions. I can see in the underlying data table, I can see records changing as a result of those selections as well. You can use these visual tools to help with selection within any one of the platforms that you've got open, including the data table. There's just a lot of great interactivity. But as I mentioned before, the higher the dimensionality, the harder this gets to wrap your brain around.
I've gotten pretty interested in trying to find new creative ways of visualizing this data. I stumbled upon this method, which is the parallel plot. If I take a step back, and describe what a parallel plot is and just try to give you a little bit of background and introduction onto how it can be used. That's going to be the goal of the next couple of minutes' worth of this talk here.
At a very high level, a parallel plot is a visualization tool for displaying multivariate data, which is what we have here, where each axis represents a different variable, and the axes are placed parallel to each other instead of cross-plotting them. Data points are represented as lines that intersect each axis at the value corresponding to that variable, allowing for the visualization of patterns, correlations, and outliers across multiple dimensions. It's particularly useful for analyzing high-dimensional data, helping to identify those relationships, trends, and clusters within the data set.
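That definition can be sketched in a few lines of Python: each record becomes a polyline with one vertex per axis, and records that share the same path stack into one ribbon. The rows, the axis list, and the evenly spaced level positions are assumptions for illustration; JMP's own layout of categorical levels may differ.

```python
from collections import Counter

# Hypothetical, masked rows and a chosen left-to-right axis order.
AXES = ["airline", "fan", "turbine"]
rows = [
    {"airline": "Air America", "fan": "2D", "turbine": "V1"},
    {"airline": "Air America", "fan": "3D", "turbine": "V2"},
    {"airline": "Airline B", "fan": "2D", "turbine": "V1"},
    {"airline": "Airline B", "fan": "3D", "turbine": "V2"},
    {"airline": "Airline B", "fan": "3D", "turbine": "V2"},
]


def axis_positions(rows, axis):
    """Spread the sorted levels of one axis evenly over the range 0..1."""
    levels = sorted({r[axis] for r in rows})
    if len(levels) == 1:
        return {levels[0]: 0.5}
    return {lvl: i / (len(levels) - 1) for i, lvl in enumerate(levels)}


def polyline(row, positions):
    """One (x, y) vertex per axis: x is the axis index, y the level position."""
    return [(x, positions[axis][row[axis]]) for x, axis in enumerate(AXES)]


def ribbons(rows):
    """Records sharing the same path across all axes form one ribbon; its width is the count."""
    return Counter(tuple(r[a] for a in AXES) for r in rows)


positions = {axis: axis_positions(rows, axis) for axis in AXES}
print(polyline(rows[0], positions))
print(ribbons(rows).most_common(1))
```

The ribbon counts are what make the categorical version readable: the thicker the ribbon, the more engines share that exact combination.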
If I go ahead and generate that plot here, we can again start to see how some of that interactivity within JMP is very powerful in trying to understand where some of those relationships are. Now I'm not having to rely on multiple graphs and see how they relate to one another within, say, the distributions platform. I'm able to see those relationships all at once as they flow through in these ribbon lines.
You'll get different views based on whether your data is continuous or categorical, and I'll show some examples later. But this is the general gist of what I've been able to find and find some very significant value in plotting the data this way and utilizing parallel plots for exploration. With that, quick tutorial on how to create a parallel plot the way that I prefer.
If you go into Graph Builder, so let's start back with our data table. If I go into Graph Builder, which is this icon here, or you can get to it by going to Graph and Graph Builder. Step 2, select all the columns that you want to plot all at once. Drag them all at once down to your X-axis. Then the magic button here of generate the parallel plot [inaudible 00:13:00]. Now we're not quite done yet. I want to… First off, here's maybe another technique that you may or may not know about within JMP.
I want things to be basically the same size and shape all the time. I can use this, right-click, copy frame settings, and paste over here, edit, paste, frame settings, and that gets everything to look same size, rectangle, in terms of the graph. Just a personal preference of mine, but you may not have known that existed.
Then the last comment that I had in my instruction here, let's colorize this somehow. In my prior example, I was colorizing on airline. Just drag your airline parameter over here into the airline drop zone or into the color drop zone, and you're done, but not quite. Another little pet peeve of mine is when people don't say they're done when they're done. There.
Always click the Done button when you're in Graph Builder, and there you go. That's how I generated this plot for the talk here today. It was… It took me, I don't know, two or three minutes to talk you through it, but it literally takes seconds to go and put together a very quick visual aid to help you explore your data in some very different, in probably a very different way than you might have known existed before. Which is just simple distributions and trying to rely on the interactivity of JMP to help you navigate your way through all that.
I wanted to talk also briefly on some advanced features in customization. I showed you the color feature already. That's pretty helpful in a lot of cases, but when I'm using the color just for airline and that airline is already being plotted on the graph, maybe it's not so helpful. I can trace some of these ribbons through all the various axes without having color benefit me that much.
But one thing you can try to do to maybe even simplify the view that you're looking at is to use the colorization on a parameter that you're not plotting in the actual graph. I've got an example of that here where I've taken away the engine type from the graph. There's one fewer X here at the bottom. But I've chosen to use that colorization. I've chosen to use that parameter as the colorization to help map my way through it.
The way to go do that, if I go and turn on my control panel again, I remove engine type from the list. I take color out of the color bucket, or drop zone, and I put engine type into [inaudible 00:16:12]. That's basically the same. I will probably stop walking you through every example like that. But these are just ways to highlight different ways of viewing the data and looking at the parallel plot, using the colorization to help with the exploration and the multiple selection of your data as you're exploring your way through it.
Let's get back to my example 2. Here's an example where I've got… I've only plotted three axes now, and I've colored on my fan configuration, either I have a 2D fan blade or a 3D fan blade, and I can see that all of my 2D fan blades are blue, and all of those 2D fan blades map to my legacy V1 turbine configuration, where I may or may not have been able to deduce that from just looking at the broader parallel plot understanding that, that really is a one-to-one relationship.
Again, just another technique you can try to use to maybe simplify the view. But using color as the helper here to help explore the dimensions within the data. The other thing that's neat here is that you can see these cases where I've got distinctly red or distinctly blue parts of the population beyond one of these axes where there's shades of purple to the left of that. Sometimes you can rely on if your data set is very binary, either tending towards one or the other, it's easy to see.
Again, I could take away one of these variables and take away that stratification of blue versus red and rely on the shade of purple to help me to maybe define how far to one of those extremes I am within the population. Just based on the visual clue of that shade of purple. Either I'm really blue or I'm really red, or I'm generally somewhere in between.
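One way to reason about that shade of purple is as a blend weighted by the mix of the two color levels inside each ribbon. The sketch below is an assumption about how such a shade could be computed, not a description of how JMP renders it; the rows, the "Type B" and "V2"/"V3" labels, and the linear RGB blend are all hypothetical.

```python
from collections import defaultdict

BLUE = (0, 0, 255)
RED = (255, 0, 0)

# Hypothetical rows: engine type is used for color but is not a plotted axis.
rows = [
    {"fan": "2D", "turbine": "V1", "engine_type": "CM5-5"},
    {"fan": "2D", "turbine": "V1", "engine_type": "CM5-5"},
    {"fan": "3D", "turbine": "V2", "engine_type": "CM5-5"},
    {"fan": "3D", "turbine": "V2", "engine_type": "Type B"},
    {"fan": "3D", "turbine": "V3", "engine_type": "Type B"},
]


def blend(frac_red):
    """Linear RGB blend: 0.0 gives pure blue, 1.0 gives pure red."""
    return tuple(round(b + (r - b) * frac_red) for b, r in zip(BLUE, RED))


def ribbon_colors(rows, axes, color_col, red_level):
    """Shade each ribbon by the share of its rows that carry the 'red' level."""
    totals = defaultdict(int)
    reds = defaultdict(int)
    for row in rows:
        key = tuple(row[a] for a in axes)
        totals[key] += 1
        if row[color_col] == red_level:
            reds[key] += 1
    return {key: blend(reds[key] / totals[key]) for key in totals}


colors = ribbon_colors(rows, ["fan", "turbine"], "engine_type", "Type B")
for path, rgb in colors.items():
    print(path, rgb)
```

A ribbon that comes out pure blue or pure red is a one-to-one relationship with the color variable; anything in between says that combination is shared by both groups.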
Another technique you might try. Local data filters are also quite helpful in being able to explore through some of the various permutations of the data set that you have, and the way to set up. Our local data filter is right here. You can set up a local data filter on any of the variables that you have in your data set, not just the ones you have plotted. Let's go ahead and do that. This view is basically the same as the view that I had before.
But the point of doing things this way and using the local data filter to help you explore is that now I can use that to help me select through some of the various sub fleets or subgroups within the data set. If I only want to look at Air America and see how they're represented across the other axes, I can do that very easily here versus trying to only look at… If I select this and try to interpret what's going on in the context of everything else. Or I can have the choice of only looking at that one particular fleet. Or I could do the same for one of my other axes and only look at a certain configuration of fan and see how they show up.
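Conceptually, a local data filter restricts the rows feeding the plot without touching the underlying table, and it can key on any column, plotted or not. A minimal Python sketch of that idea follows; the rows, the `region` column, and the "Airline B" label are hypothetical.

```python
from collections import Counter

# Hypothetical rows; "region" exists in the table but is not a plotted axis.
rows = [
    {"airline": "Air America", "fan": "2D", "turbine": "V1", "region": "North"},
    {"airline": "Air America", "fan": "3D", "turbine": "V2", "region": "South"},
    {"airline": "Airline B", "fan": "2D", "turbine": "V1", "region": "North"},
    {"airline": "Airline B", "fan": "3D", "turbine": "V2", "region": "North"},
]
PLOTTED_AXES = ["airline", "fan", "turbine"]


def local_filter(rows, criteria):
    """Keep rows whose value is in the allowed set for every filtered column."""
    return [
        r for r in rows
        if all(r[col] in allowed for col, allowed in criteria.items())
    ]


def ribbons(rows, axes):
    """Count rows per path across the plotted axes."""
    return Counter(tuple(r[a] for a in axes) for r in rows)


only_air_america = local_filter(rows, {"airline": {"Air America"}})
north_2d = local_filter(rows, {"region": {"North"}, "fan": {"2D"}})
print(ribbons(only_air_america, PLOTTED_AXES))
print(ribbons(north_2d, PLOTTED_AXES))
```

The original `rows` list is never modified, which mirrors the point that the filter is local to the view rather than a change to the data table.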
It's just another way to help explore and down select within the data that you're trying to interrogate. Last local data filter example. This was… Okay, so this is an example that I put together to help with what I mentioned before, that time element of how this data is behaving. This 2024 view should look exactly like the data set that I've been talking to up until this point. Prior to that, I've had history with these fleets of assets or engines, and they've all had configuration changes over time. That's what I'm trying to show or describe here, and visualize those permutations or those changes over time.
I've got another dimension here that I haven't plotted up until now. But that dimension is the element of time. Just as the example, I'm interested to know how the sub fleet of CM5-5s has potentially diminished over time as retirements are happening. I can zoom all the way out, look at the full superset of data, select that sub fleet or that group of engines in my case.
Now I can iterate through the years and say, "All right, back in 2020, looks like every customer had at least a few of them. Then in 2021, a couple of these customers are retiring. Into 2022, looks like I only have four of them." Air America got rid of all theirs between 2021 and 2022. Now, only two customers have those asset types. In the very latest, you can see how the fleet is significantly depleted up until that point.
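Stepping the year filter while holding a sub fleet selected amounts to counting that sub fleet per operator per year. The sketch below shows the idea in Python; the records and counts are invented for illustration and do not reproduce the numbers on the slide, and "Type B" and "Airline B" are hypothetical labels.

```python
from collections import Counter, defaultdict

# Hypothetical (year, airline, engine_type) records, one per engine per year.
records = [
    (2020, "Air America", "CM5-5"),
    (2020, "Air America", "CM5-5"),
    (2020, "Airline B", "CM5-5"),
    (2021, "Air America", "CM5-5"),
    (2021, "Airline B", "CM5-5"),
    (2022, "Airline B", "CM5-5"),
    (2022, "Air America", "Type B"),
]


def operators_by_year(records, engine_type):
    """For one engine type, count engines per airline in each year."""
    counts = defaultdict(Counter)
    for year, airline, etype in records:
        if etype == engine_type:
            counts[year][airline] += 1
    return {year: dict(c) for year, c in sorted(counts.items())}


history = operators_by_year(records, "CM5-5")
for year, by_airline in history.items():
    print(year, by_airline, "total:", sum(by_airline.values()))
```

An airline dropping out of a year's dictionary is the tabular equivalent of its ribbon disappearing from the plot as retirements happen.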
Another useful technique here that I'll try to illustrate with this method of trying to look at things over time. At some point, I had this compressor configuration of this super old configuration. Again, getting back to the superset of all the data to use as a selection tool, going back to the underlying data table and saying for those assets back in time, and the way to do that is to select your asset number and select, go to row selection and select Matching cells.
That picks the assets that have ever been that super old config and where they are currently as well. Then takes us through that example. Now I've got, and this isn't quite there yet, but I've got the visualization. I've got the selection of all assets that are not only currently, but have ever been this super old config.
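The "select matching cells" step is a two-pass selection: first find the assets that ever had the configuration, then select every row for those assets in any year. A small Python sketch of that logic, with hypothetical asset IDs and configuration labels:

```python
# Hypothetical per-year snapshot rows; asset IDs and config names are made up.
rows = [
    {"asset": "ESN0001", "year": 2020, "compressor": "super old"},
    {"asset": "ESN0001", "year": 2024, "compressor": "new"},
    {"asset": "ESN0002", "year": 2020, "compressor": "mid"},
    {"asset": "ESN0002", "year": 2024, "compressor": "new"},
    {"asset": "ESN0003", "year": 2020, "compressor": "super old"},
    {"asset": "ESN0003", "year": 2024, "compressor": "super old"},
]


def ever_had(rows, column, value, key="asset"):
    """Select every row of any asset that had `value` in `column` at any time."""
    matching_assets = {r[key] for r in rows if r[column] == value}
    return [r for r in rows if r[key] in matching_assets]


selected = ever_had(rows, "compressor", "super old")
for year in sorted({r["year"] for r in selected}):
    configs = sorted(r["compressor"] for r in selected if r["year"] == year)
    print(year, configs)
```

Because the selection follows the asset rather than the configuration value, it keeps tracking an engine after it has been upgraded, which is what makes the walk through time possible.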
Now I can walk them through time and look at where they were back in 2020. I can look at them back in 2021. How that's changing into 2022. Looks like a couple of them might have gotten upgraded already. But you'll see the big shift here between where they were in this super old config into 2024.
They all moved up into… Quite a few of them got upgraded into this new config. The difference between this plot and this plot might not have been obvious at that superset level. But again, that's another technique you can use to explore data. Looking at a dimension or an X, a variable that you've not plotted on the X axis using color and using local data filter as a way to help you explore and select data.
Then I think the last thing I wanted to talk about here was the continuous data aspect of this. This is the Iris data set. This is what I stumbled upon as I was exploring and trying to build this presentation. But if you actually go to the graph, and you pick Parallel Plot, this is what you'll see in terms of that result. I tried looking at this. This technique is actually really good at plotting up the continuous data. But it's not so great in plotting up the categorical data.
If I go to Graph and Parallel Plot for my original data set and plot that up. It's not nearly as fancy or neat-looking. There's probably some other options that I can explore to go and make that look a little prettier, but different strokes for different folks. Again, it's a way of visualizing the data that's maybe a little bit different than what you've maybe historically been able to find on your own. But recognize that there are different applications and different ways of exploring the data.
If I go back to the Iris data set, and I do the parallel plot the way that I normally would have, and pick my list of potential axes and drop them into the X category and pick my parallel plot. I get something maybe a little bit more informative that way. But again, it's just a bunch of different ways to do the same thing. But using that multidimensional data and using this technique to help explore it is why I'm here to talk today.
That basically wraps up what I wanted to cover in this talk. Besides maybe some advice around how to use and how to visualize data. I've gotten a couple of these from an on-demand webinar that you can find at JMP's website, and talk to… You can watch his webinar, or you can maybe reach out to him directly. But he's got some great advice on just data visualization and keeping things simple.
I truly take that to heart because you will have to be very careful when you're in, what I call exploratory versus explanatory mode. If I'm exploring data, and I'm visualizing stuff for my own purpose, I know exactly why I'm clicking what I'm clicking, and I know exactly why I'm trying to do things the way that I'm doing them. When you're trying to use those same graphics to explain things, you may have a hard time explaining yourself.
Be careful, and try to keep things as simple as possible when using something as complicated as this looks for explanatory purposes. This is a great tool for exploratory visualization, but I think you might need to go back to some of the methods that I've mentioned here to try to maybe simplify the views and get to the punchline quicker when you're trying to explain things to an audience or to your leadership. Simple is always better.
Then this is a takeaway. It may seem like common sense, but never assume what is obvious to you is obvious to your audience. If you've put something on a chart on a PowerPoint, you've built a plot that is trying to explain something, talk it out and make sure that you've got some descriptive bullets that are explaining the walk of how you got to that particular conclusion, and be clear on the conclusion. What am I trying to deduce from this view? Never assume what is obvious to you is obvious to your audience.
Then the last thing I'll mention here, and I trip over this myself, honestly, is there's a delicate balance between visualization and art, and so don't make it pretty just to make it pretty. Make sure that the graphics that you're building are useful and are helping you tell the story and aren't getting in the way of telling the story.
Visualization for visualization's sake, art for art's sake. I've also included the links here to the JMP help around parallel plots. As I mentioned, that on-demand webinar from Mike Anderson. It can be very helpful, so I highly recommend that. But with that, I thank you for your attention, and that's it. Thanks a lot.