
Modern Approaches in Classification of Covalent Organic Frameworks by Textural Properties Using JSL (2021-US-30MP-874)

Level: Intermediate

 

Michael Nazarkovsky, Chemistry Department, Pontifical Catholic University of Rio de Janeiro (DQ PUC Rio)
Felipe Lopes de Oliveira, Institute of Chemistry, Federal University of Rio de Janeiro (IQ UFRJ)

 

Covalent organic frameworks (COFs) are an emerging class of crystalline organic nanoporous materials, designed in a bottom-up approach by the covalent bonding of one or more building blocks. Due to the substantial quantity and diversity of building blocks, there is a massive number of different COFs that can be synthesized. In this sense, the development of machine learning techniques that can guide the synthesis of new materials is extremely important. In this work, after a set of 590 structures was selected, DFT and Grand Canonical Monte Carlo methods were used to determine the enthalpy of adsorption and the capture of CO2 or H2 by these structures. Attempts were also made to develop classification models.

In the present study, the structures were classified by means of unsupervised (Hierarchical Clustering, PCA) and supervised (Multiple Logistic Regression, Naive Bayes, KNN, SVM) methods coded in JSL (JMP Pro 15). The COFs were separated into 2D and 3D classes. The largest clusters were selected for supervised machine learning, stratifying the data by structure (62.9% 2D and 37.1% 3D) to avoid disproportion in the training/validation split (80%/20%). The 2D/3D classification by textural properties was successfully accomplished with 100% accuracy (validation) for all models. Other metrics, such as Entropy R2, Generalized R2, and Mean -log(p), were compared and discussed in order to select the optimal model.

 

 

Auto-generated transcript...

 



Felipe Lopes So, hello everyone. My name is Felipe Lopes de Oliveira. I'm from the Federal University of Rio de Janeiro, and I'm presenting this work with Michael Nazarkovsky, from the Pontifical Catholic University, also in Rio de Janeiro. Today we present the work Modern Approaches in Classification of Covalent Organic Frameworks by Textural Properties Using JSL.
  So covalent organic frameworks are a new class of materials composed only of light elements such as carbon, nitrogen, oxygen, hydrogen, and boron.
  They are organic, crystalline, nanoporous, highly stable reticular materials. This means that the chemical bonds that form these materials follow a well-defined geometric pattern dictated by the topology of the network.
  In this image, I'm showing an example of an HCB-type lattice, which is composed of triangles and rods connected periodically, forming an extended network. These triangles and rods represent molecules, which we call building blocks, that are covalently bonded to form an extended structure with the geometric characteristics of this net.
  There are many different geometries of molecules that can be used as building blocks, and several types of chemical bonds that can be used to connect them. In this way, it's possible to form structures with extremely specific characteristics, such as ???, which we can control, and to define the physical and chemical characteristics of the interior of the ??? through the choice of building blocks. So we can think of the building blocks as Lego pieces, where different pieces can be used to build several structures with unique characteristics.
  In this slide I show different possible topologies that can be used to build both two-dimensional and three-dimensional covalent organic frameworks, as well as the geometry that each building block must have in order to build these networks. On the right, I show an example of a crystal structure model of a COF with a ??? network, which will be used to calculate the properties of interest later on. It's possible to see that it forms a structure composed of two-dimensional covalent layers, and these layers are stacked along the third dimension.
  Due to their incredible modulation capacities, COFs can be used for several extremely interesting applications, such as heterogeneous catalysis, energy production and storage, organic semiconductors, chemical sensing, thermal insulation, and gas capture and storage, among others. However, this large modularity can also be a problem, because the development of these materials is really complex.
  They are usually developed through a combination of chemical intuition and available molecules, with several steps of synthesis and characterization repeated until a material with high ??? is obtained. This materials development is very time-consuming and expensive, and it requires several professionals to perform it. And sometimes all these efforts do not lead to a material with the desired characteristics. So to try to solve this problem, we propose a simple and ??? computational approach to study these materials and thus try to accelerate their development.
  We initially use a database of structures known as CURATED COFs, which compiles several structures synthesized and reported in the literature. Then we use some routines to validate those structures and make sure that everything is okay. Next, we use a DFT-based electronic structure calculation software called CP2K to optimize the cell parameters and the atomic positions of the structures. Then we use a software called Zeo++, which performs a geometrical analysis of the pores and of the formation of ??? networks inside them, and determines a set of structural descriptors for the selected COFs. These descriptors are then analyzed with scripts written by my colleague Michael in JMP to do several extremely interesting analyses.
  Well, here I'm showing the main descriptors that we use. First we have a structure ID, which is basically an identification of the name of the structure. Next, we have the LIS, which stands for largest included sphere and gives us information about the pore diameter of the material. We have the LFS, the largest free sphere, and the LSFP, the largest sphere along the free path, which are basically the same thing as the largest included sphere but can encode the difference in pore diameter caused by functionalizations within the pores. We also have the SSA and the SSAV, which are the specific surface area, in square meters per gram and square meters per unit volume, respectively. AVF is the accessible volume fraction, that is, how much of the material is formed by pores, and Vp is the pore volume. Then we have N(chan), which is the number of accessible channels. Here in the figure I'm showing a structure with only one channel, but three-dimensional materials can have more than one accessible channel. We have ChanDim, which is the dimensionality of the channels, which can be 1D, as in this structure, 2D, or 3D. And finally the channel size, which is basically the pore diameter in most cases, but for ??? structures it can be a little bit different.
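As a brief aside, these descriptors would typically be collected into a JMP data table before any JSL scripting. A minimal sketch, assuming a hypothetical CSV export named cof_descriptors.csv and the column names used in the later sketches (these names are assumptions, not the authors' actual ones):

```jsl
// Hypothetical descriptor export; adjust the path and column names as needed.
dt = Open( "cof_descriptors.csv" );
dt << Set Name( "COF textural descriptors" );

// Treat the class labels as nominal so the platforms model them as categories.
Column( dt, "Structure" ) << Set Modeling Type( "Nominal" );  // 2D / 3D
Column( dt, "ChanDim" ) << Set Modeling Type( "Nominal" );    // channel dimensionality
```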
  And now, Michael will talk a little bit about the results.
Michael Nazarkovsky Yeah, thanks a lot. Thank you very much, Felipe, for such a brilliant introduction. Now it will be pure data analysis and data science, so let me share my screen in my turn.
  Okay, let's start the second part. The time has come to analyze the data for this huge set of structures obtained by Felipe.
  Before any analysis, we have to retrieve the data, exclude some erroneous or irrelevant structures, and screen for missing values. No missing values were detected during the screening. We also discovered, through data visualization, that the absolute majority of the available structures, more than 85%, are 2D, and about 15% are 3D. So from the beginning we have an unbalanced data set to be predicted. The absolute majority of the channels are one-dimensional, more than 80%; almost 17% are three-dimensional, and less than 2% are two-dimensional.
  Moreover, all these channels are quite isotropic, meaning that in all directions (X, Y, and Z) they are equal, which is reflected in the 3D diagram.
  If we distribute the channel dimensionalities between the 2D and 3D structures, we see that one- and two-dimensional channels belong almost entirely (more than 90%) to 2D structures. Three-dimensional channels belong mostly, more than 60%, to 3D structures; however, a fairly high share of them is also found in 2D structures, simply because 2D structures dominate the data set at about 85%.
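A hedged JSL sketch of this screening and visualization step, using the hypothetical column names introduced above:

```jsl
dt = Current Data Table();  // the descriptor table

// Tabulate missing-value patterns across the textural descriptors.
dt << Missing Data Pattern(
	Columns( :LIS, :LFS, :LSFP, :SSA, :SSAV, :AVF, :Vp, :N_chan )
);

// Visualize the class imbalance (about 85% 2D vs 15% 3D) and the channel dimensionality.
dt << Distribution( Column( :Structure ), Column( :ChanDim ) );
```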
  To avoid such an unbalanced situation, we undertake unsupervised approaches together with supervised machine learning methods after the data preparation. In the first step of this analysis, we perform a multivariate analysis to detect collinearity and correlations between the parameters. As we can see, the first three (LIS, LFS, and LSFP) are highly correlated, so any one of them can substitute for the others. However, this characteristic will be very important later, in the PCA (principal component analysis); this is why, for hierarchical clustering, we selected only one of these parameters. The X, Y, and Z channel sizes are likewise isotropic, meaning essentially the same; however, we decided to keep all of them, because they are highly related to the native structure of the 2D COFs.
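A correlation and scatterplot matrix like the one described here could be launched in JSL roughly as follows (a sketch with the assumed column names):

```jsl
dt = Current Data Table();

// Pairwise correlations to expose collinearity; LIS, LFS and LSFP are
// expected to be nearly redundant, as discussed above.
dt << Multivariate(
	Y( :LIS, :LFS, :LSFP, :SSA, :SSAV, :AVF, :Vp ),
	Correlations Multivariate( 1 ),
	Scatterplot Matrix( 1 )
);
```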
  Well, due to the high number of 2D structures, almost 500, we decided to cut the dendrogram at 13 clusters, which is optimal as a proportion of the entire number of clusters, and to analyze them by their mean parameters, which are shown in the table. As we can see, after 13 clusters there is another huge reduction of the cluster distances, so 13 was assumed to be the optimal number of clusters. From them we selected the most massive cluster, cluster #7, which comprises 128 samples of 2D structures; it will be used for classification afterward.
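A minimal JSL sketch of such a clustering step, assuming the 2D subset is the current data table and using the hypothetical column names from above:

```jsl
dt = Current Data Table();  // assumed: the 2D subset of structures

// Ward hierarchical clustering on the non-redundant textural descriptors;
// 13 clusters for the 2D subset (7 were used for the 3D subset).
hc = dt << Hierarchical Cluster(
	Y( :LIS, :SSA, :SSAV, :AVF, :Vp ),
	Method( "Ward" ),
	Number of Clusters( 13 )
);

// Save the cluster assignments so the most massive cluster (e.g., #7 with
// 128 structures) can be extracted for the supervised models.
hc << Save Clusters;
```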
  In turn, the 3D structures were, I would dare to say, treated in much the same way. We still observed a high correlation between LIS, LFS, and LSFP, so the conditions for cluster extraction were the same as for 2D, but with a lower number of clusters; we decided to take seven. From these seven clusters, we extracted four, just to keep the ratio between 2D and 3D closer to equality. It will not be equal; however, as you will see on the next slide, it will be much less unbalanced than 85 to 15.
  Comparing the two tables of parameter variability, we can see that the last four parameters in both cases are responsible for the higher variability of the structures within each set: volume fraction, total pore volume, and both types of specific surface area are on the same level and account for the higher variability among the structures. So, as I said, we got a smaller imbalance between the two structure classes in the end.
  Training and validation sets were then built, stratified on the basis of the structures taken from the most massive clusters, and they are well distributed between the classes.
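In JMP Pro this stratified 80/20 split can be scripted with the Make Validation Column utility; a hedged sketch follows, with argument names written from memory and worth checking against the Scripting Index:

```jsl
dt = Current Data Table();

// Stratifying on :Structure keeps the 2D/3D proportions (about 62.9% / 37.1%)
// the same in the training and validation sets.
dt << Make Validation Column(
	Training Set( 0.80 ),
	Validation Set( 0.20 ),
	Test Set( 0.00 ),
	Stratification Columns( :Structure )  // argument name assumed
);
```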
  Let's get started with logistic regression, the simplest parametric analysis. After taking out the most irrelevant variables, we were left with only two: total pore volume and volume fraction. They are also highly correlated. However, as you can see, they have about the same LogWorth, which means that we cannot exclude, for example, volume fraction; I did that once to test how it would work, and its absence became significant: the misclassification rate in validation reached 20%. With both predictors, by comparison, we got zero misclassification in validation, with fully correct attribution to each group, and a pretty low misclassification rate in the training set. The rest of the performance measures will be discussed at the end, when we compare the performance of all the models. The respective equations to predict the probability of each structure were also obtained and are presented on this slide.
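A compact JSL sketch of this model, assuming the validation column created above and the hypothetical column names :Vp (total pore volume) and :AVF (volume fraction):

```jsl
dt = Current Data Table();

// Nominal logistic regression of the 2D/3D class on the two retained predictors.
dt << Fit Model(
	Y( :Structure ),
	Effects( :Vp, :AVF ),
	Validation( :Validation ),   // JMP Pro validation role
	Personality( "Nominal Logistic" ),
	Run()
);
```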
  In the subsequent modeling, we used only these two parameters to predict the structures (2D or 3D). K nearest neighbors, as a non-parametric model, gave the best performance in validation at K = 1, so only one neighbor is enough to recognize the structure. As we can see from the diagram of total pore volume versus volume fraction, the 2D structures are very concentrated in one spot, whereas the 3D structures, in contrast, are quite spread out. This is why it is worth comparing the distances: in the right part of the slide, you can see that the largest distances between neighbors are related to 3D structures, in both the validation and the training set. So we can also use this simple non-parametric model for the attribution, using just these two parameters.
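A sketch of the corresponding JMP Pro platform launch in JSL (argument names assumed):

```jsl
dt = Current Data Table();

// K Nearest Neighbors on the same two predictors; the platform reports
// results for k = 1..K, and k = 1 was the best performer in validation.
dt << K Nearest Neighbors(
	Y( :Structure ),
	X( :Vp, :AVF ),
	Validation( :Validation ),
	K( 10 )
);
```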
  Quadratic discriminant analysis also gives zero misclassification in validation and less than 5% misclassification in the training set, misclassifying just eight samples, as can be seen in the confusion matrix and in the parallel plot diagram. The rest of its metrics will be discussed later, because this model has some tricks; it will become more interesting, as we will see at the end.
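A hedged JSL sketch of the quadratic discriminant launch; note that in JMP's Discriminant platform the continuous predictors go into the Y (covariates) role and the class column into X (categories). The option names below are assumptions to verify:

```jsl
dt = Current Data Table();

// Quadratic discriminant analysis of 2D vs 3D on the two textural predictors.
dt << Discriminant(
	Y( :Vp, :AVF ),             // covariates
	X( :Structure ),            // categories
	Validation( :Validation ),
	Discriminant Method( "Quadratic" )  // option name assumed
);
```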
  Naive Bayes, another parametric but very simple model, has also given zero misclassification, so the situation is becoming quite competitive. As we can see here, both parameters are almost equal in their effects: the total effect is almost 0.88 for total pore volume and almost 0.8 for volume fraction, so they contribute almost equally to the classification task. Simple prediction plots were also built.
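A minimal JSL sketch of the corresponding JMP Pro Naive Bayes launch (same assumed columns):

```jsl
dt = Current Data Table();

// Naive Bayes classifier on the same two predictors and validation split.
dt << Naive Bayes(
	Y( :Structure ),
	X( :Vp, :AVF ),
	Validation( :Validation )
);
```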
  The support vector machine has also demonstrated brilliant performance. As we can see, 46 support vectors were used to build the model, with zero misclassification in validation and some misclassification in training (6%); the biggest share of the misclassification was exactly for the 3D structures.
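A sketch of the JMP Pro 15 Support Vector Machines launch; the kernel and option names here are assumptions rather than the authors' exact settings:

```jsl
dt = Current Data Table();

// SVM classifier; the talk reports 46 support vectors and zero
// misclassification in validation for the final model.
dt << Support Vector Machines(
	Y( :Structure ),
	X( :Vp, :AVF ),
	Validation( :Validation ),
	Kernel Function( "Radial Basis Function" )  // option name assumed
);
```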
  Okay, let's go further. Finally, we come to a complex model which combines unsupervised and supervised approaches, starting with unsupervised PCA. In this case, we took all the quantitative continuous parameters, including even those that are highly correlated with each other, and built the PCA diagrams and plots. As we can see, two main components have eigenvalues of more than one by the Kaiser rule; there is a third component close to one, but we will not involve it in the prediction model. We will involve only the first two, whose eigenvalues are more than one. What is interesting, the first principal component, which describes almost 66% of the variation, comprises LIS, LFS, LSFP and the channel parameters, I mean the sizes in the X, Y, and Z directions; this is, let's say, the ??? channel set. The second component is mostly described by both types of specific surface area and porosity; these are the purely textural parameters. So even here we observe such a separation and can draw conclusions about the contribution of both groups of parameters to the description of the structures.
  We then built a logistic regression using only these two principal components; its formula will be given in the paper which we are going to submit within the scope of the present summit. Here also, obviously, we got a zero misclassification rate in validation and a very low, less than 1%, misclassification rate in training, which was not observed for the previous models.
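A hedged JSL sketch of this combined approach: run PCA on all continuous descriptors, save the first two components, and feed them to a nominal logistic regression. The saved column names Prin1/Prin2 are assumptions and may differ by JMP version:

```jsl
dt = Current Data Table();

// PCA on all continuous descriptors, including the collinear ones.
pca = dt << Principal Components(
	Y( :LIS, :LFS, :LSFP, :SSA, :SSAV, :AVF, :Vp,
	   :ChanSizeX, :ChanSizeY, :ChanSizeZ )
);

// Keep the two components with eigenvalues > 1 (Kaiser rule).
pca << Save Principal Components( 2 );

// Logistic regression on the saved components.
dt << Fit Model(
	Y( :Structure ),
	Effects( :Prin1, :Prin2 ),   // saved column names assumed
	Validation( :Validation ),
	Personality( "Nominal Logistic" ),
	Run()
);
```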
  Now we come to the most interesting part, the conclusions about the summarized models. As we can see, we summarized the models by misclassification in the training set, obviously, because we got some non-zero values there. The logistic regression models, both simple and PCA-based, are at the top of the chart by training misclassification, while by validation misclassification it is the discriminant analysis. However, since all the models have a zero misclassification rate, or full accuracy as the opposite parameter, in validation, we should compare other metrics. By these other metrics, the discriminant analysis also shows higher values, so it looks advantageous. But if we take a closer look, we will see that there is a huge difference between these metrics, even between the misclassification rates, in validation versus training. So we can suspect that we encountered some trouble with overfitting or underfitting. In this case, we mostly deal with underfitting, since the metrics in validation are better than in training. Normally the opposite occurs: most cases deal with overfitting, when the model overfits the training data and we obtain worse metrics in validation. Here it is completely the opposite. This is why the differences between validation and training were analyzed for four metrics: Entropy R square, Generalized R square, Mean -log(p), and the misclassification rate.
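For reference, these comparison metrics are commonly defined as follows (a sketch in standard notation, with L_M the fitted model's likelihood, L_0 the intercept-only likelihood, n the number of observations, and p-hat_i(y_i) the predicted probability of the observed class; JMP's Model Comparison report uses equivalent forms):

$$
R^2_{\text{Entropy}} = 1 - \frac{\ln L_M}{\ln L_0},
\qquad
R^2_{\text{Generalized}} = \frac{1 - \left( L_0 / L_M \right)^{2/n}}{1 - L_0^{\,2/n}},
$$
$$
\text{Mean } -\log p = \frac{1}{n} \sum_{i=1}^{n} \left[ -\ln \hat p_i(y_i) \right],
\qquad
\text{Misclassification rate} = \frac{\#\{\text{misclassified}\}}{n}.
$$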
  On this basis we select the best of the best. As mentioned before, the discriminant analysis demonstrated the highest performance; however, it has one of the largest differences between the validation and training metrics compared to the other models. Remember that in validation all of the models gave zero misclassification, so we can select almost any of them, choosing the one with the optimal balance of the validation-versus-training differences for the other metrics. What we can see here, for example, is that the lowest difference in misclassification rate and Entropy R square is attributed to the logistic regression combined with PCA, while the lowest differences in the other two metrics, Generalized R square and Mean -log(p), are observed for the support vector machine. We cannot select Naive Bayes, however, because in at least two of these parameters it has a very large difference. This is why we can conclude that, for the best and most correct performance, the PCA-based logistic regression is recommended, with the support vector machine optionally recommended as another model to predict 2D and 3D structures. More details and methods will be given in our paper, which we are going to submit within two weeks and publish on the JMP Community website. So thank you very much, it was a big pleasure to give this presentation. I would also like to acknowledge my colleague David Kirmayer from the Hebrew University of Jerusalem in Israel. Thank you for your attention.
Josh Staunton All right.
  Thank you, gentlemen.