Run the following script to simulate your data set:
Names Default To Here( 1 );

// simulate data set with client responses
dt = New Table( "Loans",
	Add Rows( 25 ),
	New Column( "Loan Description", "Character", "Unstructured Text" ),
	New Column( "Key Term", "Character", "Nominal" ),
	New Column( "Purpose of the Loan", "Character", "Nominal" ),
	New Column( "Modified Purpose of Loan", "Character", "Nominal" )
);

// make a list of some target terms
target term = List( "credit card balance ", "home improvement ", "new car " );

For Each Row(
	// make unstructured text
	embedded term = target term[Random Integer( 1, N Items( target term ) )];
	description = Repeat( "blah ", Random Integer( 2, 5 ) ) || embedded term || Repeat( "blah ", Random Integer( 2, 5 ) );
	// assign to the current row of the first column
	dt:Loan Description = description;
	// make structured response
	purpose = If( Random Uniform() < 0.25, "Other", embedded term );
	// assign to the current row of the third column
	dt:Purpose of the Loan = purpose;
);

// hold off formula evaluation until both formulas are in place
dt << Suppress Formula Eval( 1 );

// now show one solution
Column( dt, 2 ) << Set Formula( Regex( :Loan Description, " credit card balance | home improvement | new car " ) );
Column( dt, 4 ) << Set Formula( If( :Purpose of the Loan == "Other", :Key Term, :Purpose of the Loan ) );

dt << Suppress Formula Eval( 0 ) << Run Formulas;
Ignore the script and now focus on the data table example:
I assume that your real data set has columns like the first and third data columns in my example above (the free-form loan description and the stated purpose of the loan).
You can now use the formulas that I made in the second and fourth data columns (Key Term and Modified Purpose of Loan) to begin your solution.
Would you please describe the nature of the text fields and the model? For example, is the text unstructured data? Is it responses to survey questions?
Sure. It's unstructured data. What I'm trying to do is build a model that looks for specific phrases: Home Improvement, Debt Consolidation, Mortgage, etc. So, if the data in my field contained this:
Several years ago my wife started a graphic design company and because she was still in college she decided to use a credit card to help fund the startup. Now several years later she still has the balance from those expenses and the company has since dissolved (she chose to take a salaried position once we had our first of two kids). Because the balance is so high we send in a large payment monthly, which in turn, eats up our spending money for the month and the result, is having to use the same credit card. The balance has really not moved in more than 2 years and I do not see an end in sight. Please help. Having a fixed payment and term loan will give us the assurance that this will get better and we can start to invest in our kid's future. Thank you in advance.
I want to pull out words that are in bold above. Is this possible?
I think that it might be easier to create predictor variables (data columns) that act as indicators for a model. It seems as though most of the text is not informative for your purpose.
Your column could use something like:
Regex( :text, " balance | credit card " )
if you want a single character column that shows the matched term found in the same row of a column called text. Alternatively, you could use something like this:
Contains( :text, " balance " ) > 0
if you want a separate indicator column for each target string, such as balance above.
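As a rough sketch of the indicator approach for a whole list of key words, the script below adds one 0/1 formula column per term. The table, the column name text, the keyword list, and the "Has ..." column names are all made up for illustration, so substitute your own.

```jsl
Names Default To Here( 1 );

// Hypothetical example table; "text" stands in for your free-form column
dt = New Table( "Indicator Sketch",
	New Column( "text", Character, "Nominal",
		Set Values( {
			"we still carry a balance on the credit card",
			"this loan is for a home improvement project",
			"no target phrase in this row"
		} )
	)
);

// Assumed keyword list; replace with your own target phrases
keywords = {"balance", "credit card", "home improvement"};

// Add one indicator column per keyword.
// Eval( Eval Expr() ) inserts the current keyword into the stored formula as a
// literal string, so the formula does not depend on the loop variable later.
For( i = 1, i <= N Items( keywords ), i++,
	kw = keywords[i];
	Eval(
		Eval Expr(
			dt << New Column( Expr( "Has " || kw ), Numeric, "Nominal",
				Formula( Contains( :text, Expr( kw ) ) > 0 )
			)
		)
	);
);
```

Contains() is case sensitive and matches partial words. Padding a term with spaces, as in " balance " above, avoids partial-word matches but misses the term at the very start or end of the text or next to punctuation.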
Is either one of these suggestions going in the correct direction?
I think so. I've got a list of key words I would want to develop the model from. Would it be easier to create a dummy column with a formula to indicate where each of the key words was found in the row? I'm thinking that might be an easier way to identify what is in the text. Some of these responses are whole paragraphs, much like the example I posted.
Well, the two ways that I showed you produce either a single predictor with levels of the matched sub-string (or missing) or multiple predictors as indicators for the presence of a specific sub-string. The choice of one of these approaches won't make a difference in a regression model or a partition model but it might make a difference in a neural network model. What is the response variable?
I understand that the original text is unstructured data in a character data column. You want to extract specific terms from the text in new data columns (structured variables) to use as predictors in some model. You presumably have a known response for supervised learning or else you are going to use the new columns in an exploratory fashion, perhaps with other covariates in the data set?
I could use another column called Purpose of the Loan (the data set is loan data). It has defined reasons why someone is requesting a loan. However, the bucket of "Other" is where I want to look through Loan_Description (the free-form field with a reason) and pull those keywords out. For example, someone could have chosen Other as the purpose but explained in the description that the loan is for a home remodel. The ability to further define the rows with Other will help improve the overall prediction.
So, you might be able to create a new response (i.e., modified purpose) by substituting the 'other' level with a sub-string that you obtain from the unstructured text for those observations?
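One way that substitution might look is sketched below. It replaces "Other" only when a key term is actually found in the description and otherwise keeps the stated purpose. The three rows and the pattern are invented for illustration, and the column names are assumptions borrowed from the simulated table.

```jsl
Names Default To Here( 1 );

// Hypothetical rows; replace with your own table and column names
dt = New Table( "Modified Purpose Sketch",
	New Column( "Purpose of the Loan", Character, "Nominal",
		Set Values( {"Other", "Other", "Mortgage"} )
	),
	New Column( "Loan Description", Character, "Nominal",
		Set Values( {
			"we plan a home remodel this spring",
			"need some help this month",
			"refinancing our first house"
		} )
	)
);

// Key Term: the matched phrase, or empty when none of the phrases occur
dt << New Column( "Key Term", Character, "Nominal",
	Formula( Regex( :Loan Description, "home remodel|home improvement|credit card" ) )
);

// Substitute the key term for "Other" only when one was found;
// in every other case keep the purpose that the applicant selected
dt << New Column( "Modified Purpose of Loan", Character, "Nominal",
	Formula(
		If( :Purpose of the Loan == "Other" & !Is Missing( :Key Term ),
			:Key Term,
			:Purpose of the Loan
		)
	)
);
```

With these made-up rows, the intent is that only the first row changes (to home remodel), while the second stays Other because no key term is present in its description.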