Georg Raming, Senior Manager, Siltronic AG
Siltronic AG is a global technology leader in the semiconductor wafer industry. This presentation will introduce the Siltronic AG approach to preparing batch process data for modeling with JMP Pro. It will demonstrate some interactive steps to clean and rearrange the dataset before modeling using an anonymized dataset containing both historical and experimental batch data. Once the best model algorithm is found, the boosted tree model will be tuned.
The Siltronic AG team found that a technically sound model may be physically worthless, meaning it had been overfitted. Therefore, the team started with a large set of factors, gradually reducing the factor list and testing the model's behavior to find the most effective factors (step backward strategy for a boosted tree in a small JSL routine). The last step provided the best insight into which levers are the strongest to optimize the process.
Hello, everyone. Thanks for joining in. In this talk, I want to show how we prepared our batch process data for modeling with JMP Pro and gained valuable insights with a team approach.
My presentation starts with a PowerPoint part, and all the details shown there will then follow live in JMP: how the data set looks and which platforms I have used, such as missing data analysis, multicollinearity, the Functional Data Explorer, Predictor Screening, modeling batch data with Boosted Tree, and the Profiler. The summarized data will be analyzed with Boosted Tree as well, and then with a script that performs Boosted Tree backward selection.
First, about my company: Siltronic has world-class production sites all over the world, as shown here, and about 4,000 employees. Here are some key figures.
Imagine the complex process flows we have, like the one shown here: silicon is melted in a crucible, and a silicon ingot is created. That is my special task: developing processes for growing silicon ingots. The ingot is then ground and sliced; the wafers get edge rounding, laser marking, lapping, cleaning, etching, polishing, and maybe epitaxy before the final wafer is created.
Our portfolio covers 300 mm, 200 mm, and smaller-diameter wafers for different applications, as shown here: silicon wafers with several specifications.
About me: I am an electrical engineer by education with some Six Sigma training. My main task is to develop processes for growing silicon crystals, like the one shown here, and I am also responsible for around 500 JMP users at Siltronic.
What does the task look like? What we see here is the final table, but creating it took a lot of effort as well. There are database queries behind it that fetch the data from the database. We pulled the results into JMP data tables, enlarged the data set with archives from earlier dates, enriched it with information such as details of experiments and consumables, and wrote some scripts for graphs and evaluations.
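As a rough illustration of that assembly step, a minimal JSL sketch might look like the following; the table names, archive path, and options are placeholders, not the actual Siltronic sources.

```jsl
// Hypothetical sketch: append archived batches to a fresh query result.
// "Query Results" and the archive path are placeholders.
dtCurrent = Data Table( "Query Results" );           // result of the database query
dtArchive = Open( "$DOCUMENTS/batch_archive.jmp" );  // assumed archive table
dtAll = dtCurrent << Concatenate(
	dtArchive,
	Output Table( "Batch Process Data" ),
	Create Source Column                             // records where each row came from
);
```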
Then we did the modeling tasks and, of course, looked at missing data and correlations to see which effects are most significant, and did feature engineering to see which features are important for generating an optimal result.
At this point, I will switch into JMP. You can see here the journal I am working with, the JMP main window, and the abstract. We will start with some technical hints.
The data set I show here is fully anonymized and standardized, and all identifiers are generic, to make it easier to understand what the features are, what the result is, and so on.
The aim of this presentation is to show all the steps we needed for getting an overview, restructuring, and understanding the data set, and how to build the models to gain insights into its content. I will show some results that we discussed as a team. The team is very important here, because it drove a lot of discussion and work: how to analyze, which features may be interesting and which should not be, and what the physics behind them may be.
I will start with the data set itself; it is also part of my contribution in the community. Here it is opened, and I will change the layout a little bit to see what it looks like. We have around 80,000 rows in this data set. It is a batch data set, so we have a batch ID.
This data set is quite challenging because it is a mixture of historical data, the POR batches here, and experiments. We can see that most of the data is historical, and there are only a few special experiments, shown here. We then have several features, including one categorical: the consumable.
We have the batch maturity, which is the time, also standardized. Then we have several features, the X values here, and one result column. To reduce the noise a little bit, we also calculated a moving average of the result.
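A minimal sketch of such a smoothing column, assuming JMP 16 or later where Col Moving Average() is available as a formula function; the window width and the by-variable here are illustrative guesses, not the values used in the talk.

```jsl
// Sketch: add a moving-average column to reduce noise in the result.
dt = Current Data Table();
dt << New Column( "Yield Moving Average",
	Numeric, Continuous,
	// 25 rows before, 0 after, restarted per batch -- assumed settings
	Formula( Col Moving Average( :Yield, 1, 25, 0, :Batch ID ) )
);
```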
Let's look at the data set in more detail. We can do a summary on the data like this; you can also do this from the Tables menu: Summary.
We get around 500 rows — 500 batches. This is a summary by batch, and we see that there is no variation in the parameters X1 to X4, meaning they are constant within each batch, while the others change at different rates.
To get a first overall look at the data, we can plot the result parameter, the yield, over time for all rows of the batch data set. The smoothing here is done by the JMP Graph Builder platform; we also implemented it as a formula, since it is available as a function in JMP.
We can also look at some individual batches. If we use the local data filter, we can see how the average works and how much noise there is in the single data points: the blue points are the original yield data, and the orange ones are the moving average.
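The view described above might be scripted roughly like this; the moving-average column name follows the earlier sketch and is an assumption.

```jsl
// Sketch: yield and its moving average over batch maturity,
// with a local data filter to pick single batches.
Graph Builder(
	Variables(
		X( :Time ),
		Y( :Yield ),
		Y( :Yield Moving Average, Position( 1 ) )  // same axis as Yield
	),
	Elements( Points( X, Y( 1 ), Y( 2 ) ) ),
	Local Data Filter( Add Filter( Columns( :Batch ID ) ) )
);
```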
I will close this. The next step is to look at how much data is missing. We have this in JMP as well: we can select all the columns and run the Missing Data Pattern platform, like this. It shows us that out of about 80,000 rows, 178 rows have missing data in one column.
This can also be shown as a graph. It is very important, at least for the data preparation steps, to see where data is missing and to fix the missing data as far as possible.
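Scripted, those steps might look like this sketch; restricting the launch to the continuous columns is an assumption.

```jsl
// Sketch: launch Missing Data Pattern on all continuous columns.
dt = Current Data Table();
colList = dt << Get Column Names( Continuous );  // assumed restriction
dt << Missing Data Pattern( Columns( Eval( colList ) ) );
```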
Another way to look at the data is the Columns Viewer. I put all the columns in, and we can see again, as before, that some rows are missing for parameter X2. We can also see the min, max, mean, standard deviation, and so on for all the parameters. Here we see nicely that everything is standardized; the yield is between zero and 100.
We can also start the Distribution platform from here: with all columns selected, a single click gives us the distributions of all the data.
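From JSL, the same one-click view can be sketched like this, using a few of the anonymized columns as examples.

```jsl
// Sketch: distributions of a categorical and two continuous columns.
Distribution(
	Nominal Distribution( Column( :Consumable ) ),
	Continuous Distribution( Column( :Time ) ),
	Continuous Distribution( Column( :Yield ) )
);
```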
We can see which consumables are used and how often, and that most of the data comes from historical processes, with only a few experiments using special settings. The time, of course, looks nicely distributed, but the others do not.
There is a lot of room between some settings: the data is sparse and non-normally distributed for most parameters, which makes it even more challenging to analyze. Let's go to the next steps; I will close these reports.
Then we can look in even more detail at things like how the parameters are correlated. We can see this in the Multivariate platform; it needs some time to compute. You will find it under the Analyze menu: Multivariate. It takes the numeric columns and generates this correlation report, and you will see that parameters like X6 and X5 are highly correlated, as are X10 and X9.
This makes feature engineering difficult. What we want to know from the analysis is which parameter causes the yield drop, and if two parameters are correlated, it is not so easy to find out which one is the responsible one.
In the scatterplot matrix, you can also see which parameters change with time — X1 up to X4 are constant over time, the others are changing — and how they are distributed. You can also mark some rows, like here; they are then selected in the data table, and you can see the curve of each parameter over time, or what any parameter-versus-parameter combination looks like.
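A sketch of that launch in JSL, with an illustrative subset of the columns:

```jsl
// Sketch: correlation report and scatterplot matrix.
Multivariate(
	Y( :Time, :X1, :X5, :X6, :X9, :X10, :Yield ),
	Estimation Method( "Row-wise" ),
	Correlations Multivariate( 1 ),
	Scatterplot Matrix( 1 )
);
```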
Next, I want to use the Functional Data Explorer. It allows us to fit curves for each batch and extract the features of each curve. Then we can look at which batches behave similarly, or find extreme ones. The start looks like this.
Let's look at how I launched the analysis: I put Time as the X parameter, Yield as the output Y, and the Batch ID as the ID function. We also have some supplementary columns here, like Part and Group. This platform is available in JMP Pro only. When we start it, we could do some data processing here, but in this case it is not necessary.
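The launch can be sketched in JSL like this (JMP Pro only); the roles follow the dialog described above, and the later fitting steps, such as the P-splines below, are driven from the platform's Models menu.

```jsl
// Sketch: launch the Functional Data Explorer on the stacked batch data.
Functional Data Explorer(
	Data Format( "Stacked Data" ),
	X( :Time ),
	Y( :Yield ),
	ID( :Batch ID )
);
```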
We can look at each batch individually; there are a lot of graphs here, like this. We can mark rows in a graph and see them highlighted in the data table as well.
To go on with this platform, we need to fit models such as P-splines for each batch. JMP does this and decides by itself which splines are used and how many supporting functions are needed, like the knots shown here. The best result is given by a cubic spline with 20 knots.
You can see how each batch is modeled by the red line shown here. We also get the shape functions: each curve is composed of a combination of shape functions, and for each batch we get the coefficient of each shape function.
If we look at Shape Function 1, it is the main behavior of all batches, with a drop at around 0.7. Component 1 here is the coefficient for Shape Function 1, and if we select these batches, we see that they have a pronounced shape like Shape Function 1. We can see it here.
We can use the Profiler here as well. This was mostly for understanding the data; we did not use it for further analysis, because we did not really need the information about the shape of each batch curve. We were more interested in the average yield of each batch, because we cannot decide to use only the first part of the batch and forget about the second part; that would not work in our case.
To see again how this fits together, we can look in Graph Builder at the graph of some batches we saw just before. Maybe you recognize this number; here it is shown again, together with the moving average of yield.
The next step is to start modeling the batch data. When modeling, it is useful to have an idea which parameters are most important for the variability of the output. For that, we have the Predictor Screening platform.
You can also start it from the Analyze menu: Predictor Screening. I wrote it here as a script, simply to start it by pressing a button.
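Such a one-button script can be as small as the following sketch; the factor list is illustrative.

```jsl
// Sketch: rank factors by their contribution to yield variability.
dt = Current Data Table();
dt << Predictor Screening(
	Y( :Yield ),
	X( :Time, :Consumable, :Part, :X1, :X2, :X3, :X4,
	   :X5, :X6, :X7, :X8, :X9, :X10 )
);
```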
When doing so, a Bootstrap Forest analysis runs, and it shows us the importance of the features in our data set. Time is the most important, but that is useless for us in the end, because we need to use the full batch.
Then comes X1, then Part, X8, X5, and so on. Here we could also select a few rows, copy them, and put them into a model.
I will stop this here. To see which model type works best, I used the Model Screening platform. I will not run it here because it takes several minutes, but we saw that the Boosted Tree platform performs well. There may not be a big difference to the next candidates, but that is why I used Boosted Tree.
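A minimal, assumed form of that launch in JSL is shown below; the set of methods and their options are chosen in the dialog, and the exact scripting options may differ by JMP version.

```jsl
// Sketch: launch Model Screening (JMP Pro) to compare model types.
dt = Current Data Table();
dt << Model Screening(
	Y( :Yield ),
	X( :Consumable, :Part, :X1, :X2, :X3, :X4,
	   :X5, :X6, :X7, :X8, :X9, :X10 )
);
```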
Then we run Boosted Tree like this on the batch data; it runs quite quickly. We see the result, and a nice feature of the Boosted Tree platform is the column contributions report, which supports feature engineering nicely.
We get an R² of 71% for training and 66% for validation, which may be okay, and we still have all features in. But we are mostly interested in which features are reliably the most important ones. We can save the model as a column: Save Prediction Formula puts it into the data table.
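The fit and the saved formula might be scripted like this sketch; the validation portion is an arbitrary choice, and the tuning options (number of layers, splits per tree, learning rate) are set in the launch dialog.

```jsl
// Sketch: fit Boosted Tree on the batch data and save its formula.
dt = Current Data Table();
bt = dt << Boosted Tree(
	Y( :Yield ),
	X( :Consumable, :Part, :X1, :X2, :X3, :X4,
	   :X5, :X6, :X7, :X8, :X9, :X10 ),
	Validation Portion( 0.25 )   // assumed split; see the caveat below
);
bt << Save Prediction Formula;   // adds the formula column to the table
```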
In the data table we now have a formula column, and we can use it — for example, to look at how the model performs, or in Graph Builder, simply to see what the model output looks like over the batch maturity. I hope Graph Builder shows the graph soon... and here it comes.
Yes. We have seen that this modeling works quite well, so we now have a formula that reproduces the data, and we can work with it.
But especially for the batch data modeling, we have a problem: validation will not work well, because for a given batch some rows may land in the training set while the neighboring rows land in the validation set. The sets are then not well separated with respect to the features that control the batch.
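One way around that leakage is to assign whole batches to either set. The following is a sketch of the idea, not the original script; the 75/25 split is arbitrary.

```jsl
// Sketch: batch-wise validation column so whole batches stay together.
dt = Current Data Table();
Summarize( dt, batches = By( :Batch ID ) );        // unique batch IDs
assign = Associative Array();
For( i = 1, i <= N Items( batches ), i++,
	assign[batches[i]] = If( Random Uniform() < 0.75, 0, 1 )  // 0 = training
);
dt << New Column( "Validation by Batch", Numeric, Nominal );
For Each Row( dt, :Validation by Batch = assign[Char( :Batch ID )] );
```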
Additionally, the model is not very stable, so different runs of the model give different results. This is known for tree-based methods: they may give different results on high-variability data.
If we run Boosted Tree twice, we also get different column contributions, maybe in a different order; here, for example, Part and X5 are switched between these two runs. I will show it again here as well.
If we run Boosted Tree twice... I don't know — yes, here I should have the script; it comes later.
At this point, we decided it may be better to model the summarized data, because we need to use the full batch. Here I have a script to summarize the data so that we have only one row per batch.
There is a nice feature, Statistics Column Name Format, which gives the summarized data the same column names as the original table, so we can use the same scripts for both.
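A sketch of such a summary call; the statistic (mean) and the output name are illustrative.

```jsl
// Sketch: one row per batch, keeping the original column names.
// The talk additionally restricts rows to batch maturity 0.6-0.8
// before summarizing.
dt = Current Data Table();
dt << Summary(
	Group( :Batch ID ),
	Mean( :Yield ), Mean( :X1 ), Mean( :X5 ), Mean( :X8 ),
	Statistics Column Name Format( "Column" ),
	Output Table Name( "Batch Summary" )
);
```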
Doing so, we get the summary data table — I can close the script — with around 500 batches. It is a lot easier to model. Here I summarized the data for the time range 0.6 to 0.8, which is where the yield drop was.
We can again do some predictor screening like this. I kept Time in that data set: it now mostly shows the noise level, and the parameters ranked around it are likely also just noise for the model.
Then we can, of course, do some model comparison. I selected the few parameters that we found most probably responsible, ran two Boosted Tree analyses, and then did a model comparison of both. It looks like this.
We get a Profiler and can compare the results for different settings, maybe like this. But we still have the problem that we see some features — like this one here for X10 — in one model and not in the other, so they are likely just noise.
At the beginning, we discussed these differences a lot. We saw an effect sometimes and sometimes not, and asked the question: what is true, what is physical, and what is not?
That brought me to the next step: we needed to continue with feature selection, and that is why we created this script. It takes the summary data; it has been run on the batch data as well.
In each step, it builds a Boosted Tree model for the current parameter set, saves the model into the Formula Depot, saves the model performance (R² and so on) into a data table, and shows us the column contributions.
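A condensed sketch of that backward-selection loop is shown below. It is not the original Siltronic script: it runs on the one-row-per-batch summary table, the starting factor list is the anonymized one, the report-box indices are assumptions that can differ between JMP versions, and publishing to the Formula Depot is left as a comment.

```jsl
// Sketch: Boosted Tree backward selection -- fit, log, drop the factor
// with the smallest column contribution, repeat.
dt = Current Data Table();
factorNames = {"Consumable", "Part", "X1", "X2", "X3", "X4",
	"X5", "X6", "X7", "X8", "X9", "X10"};
resultTbl = New Table( "Backward Selection Log",
	New Column( "Step", Numeric ),
	New Column( "N Factors", Numeric ),
	New Column( "Dropped Factor", Character )
);
step = 1;
While( N Items( factorNames ) > 2,
	cols = {};
	For( i = 1, i <= N Items( factorNames ), i++,
		Insert Into( cols, Column( dt, factorNames[i] ) )
	);
	bt = dt << Boosted Tree(
		Y( :Yield ), X( Eval( cols ) ),
		Validation Portion( 0.25 ), Invisible
	);
	rep = Report( bt );
	// read factor names and contribution portions from the report;
	// box numbering is an assumption and may need adjusting
	names = rep["Column Contributions"][String Col Box( 1 )] << Get;
	portions = rep["Column Contributions"][Number Col Box( 2 )] << Get;
	weakest = names[Loc Min( Matrix( portions ) )];
	resultTbl << Add Rows( 1 );
	resultTbl:Step[N Rows( resultTbl )] = step;
	resultTbl:N Factors[N Rows( resultTbl )] = N Items( factorNames );
	resultTbl:Dropped Factor[N Rows( resultTbl )] = weakest;
	// the original script also publishes each model to the Formula Depot
	Remove From( factorNames, Loc( factorNames, weakest )[1] );
	bt << Close Window;
	step++;
);
```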
Here we can see something we observed often: the higher-numbered model — the one with fewer parameters, as the column contributions show — gives the best result. It looks different for each run, but we see this tendency most of the time.
So here we can say, more or less for sure, that Part, X1, and X5 are the most important parameters. This other one may appear sometimes and sometimes not, so we will focus on these three parameters.
We can also look in the Formula Depot, where we can start a model comparison. Maybe we compare the first model; we do it like this: Model Comparison. This is our data table.
Take the first number — the numbers here are shifted by one, so maybe the fifth should be that one, and the last one... this will not work; I think it is number three, and the last one — the ones with the highest validation score — and compare them here.
We see the Model Comparison dialog, and we see that the last model is among the best models we could fit at all. We can look at the Profiler here, for example, and also use extrapolation control.
We have seen that we have sparse data, so there is not data behind every point. Let's look for where the option is... here it is: Extrapolation Control, Warning On. It shows us when there is no data between the points.
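Profiling a saved prediction formula can be sketched like this; the formula column name is whatever Save Prediction Formula created and is assumed here.

```jsl
// Sketch: profile the saved Boosted Tree prediction formula.
Profiler( Y( :Pred Formula Yield ) );
// Extrapolation Control > Warning On is then switched on from the
// Profiler's red-triangle menu (JMP Pro).
```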
Here we can compare the models, and we see that there is no variability on the X factors that were not used in a given model.
To sum up — let me first close some tables and dialogs. To sum up: we prepared a workflow for modeling this data, with several steps and additional scripts, to enhance understanding and to drive the discussion about what is important and what is not.
I have a proposal for a model and some tasks we can focus on to improve the yield of our process, and you will find the data and the presentation in the user community.
If you have other ideas on how to explore this data set and how to find the final best model, you can contact me or post something on my contribution in the community for this Discovery Summit. Thanks for your attention, and bye.
That's it, Martin.