Subscribe Bookmark RSS Feed

Find the Mode in a Column of Matrices

aallman

Contributor

Joined:

Feb 28, 2017

Hello!

 

I need some help writing a script to find the most common value across several matrices. I have a data table with hundreds of rows. Each row has a sequence column (eg ABCDABCD) and I have created a column to show the positions of a certain letter within that sequence (eg The location of A is [1,5,0,0]).

 

(I am fairly new at writing scripts so right now my matrices have 0 listed when there are no more positions containing that letter. If there is an obvious, more efficient way to find multiple positions within a string, please let me know. Otherwise, is there a way to exclude 0 as one of the modes?)

 

I now would like to look at all the positions called out in all rows and find the most common positions.

 

What I have:

row     matrix column

1    [2, 3, 6, 8, 0, 0]

2    [2, 4, 5, 10, 0, 0]

3    [2, 7, 9, 10, 0, 0]

 

What I want:

A list showing the modes in order of most occurrences to least 

position 2 - 3 occurrences

position 10 - 2 occurrences

 

I attached the script I currently have that contains the position finding formula,

Thank you!

4 REPLIES
txnelson

Super User

Joined:

Jun 22, 2012

Here is a script that uses a different method to get you to a point where you have all of the Letter values parsed into a new column from which you can do your analytics

dt=current data table();

// Create a new table with just the Sequence column in it
dtbuild = dt << subset(columns("Sequence"), selected rows(0));

// Find the longest sequence
Maxstring = col max(length(:sequence));

// Add a new column called Row Num that will let you keep track of what row
// each letter came from
dtBuild << New column("Row Num", formula(row()));
// Convert the data into real values
dtBuild:Row Num << delete property("formula");

// Create a new column for each possible position in the sequence number, and
// strip off the ith character and place it into the column
For(i=1,i<=Maxstring,i++,
	dtBuild << new column("C" || char(i),character, formula(substr(:sequence,eval(i),1)));
	column(dtBuild,i) << delete property("formula");
);

// Create a list of the columns in the dtBuild data table
TheColumns = dtBuild << get column names();
// We are only going to stack the columns created that have bits of the
// sequence in them, so get rid of the  names of the first 2 column in the data table
TheColumns = remove(TheColumns,1,2);

// Stack all of the columns in to a data table that can be worked with
dtFinal = dtBuild << Stack(
	columns( TheColumns),
	Source Label Column( "Position" ),
	Stacked Data Column( "Letter" ),
	Name( "Non-stacked columns" )(Keep( :Row Num ))
);

// Get rid of the interium table
close(dtBuild, nosave);

// Delete rows that have no letter value
dtFinal << select where(:Letter == "");
dtFinal << delete rows;
Jim
aallman

Contributor

Joined:

Feb 28, 2017

I need to be able to locate the positions of different letters by row and then compare the rows. For example I need to find the locations of all "A" in row 1's sequence and all the locations of "C" in row 2's sequence. Then I need to be able to see the most frequently occurring location in the table. 

From the script you have given me (thank you by the way!) I still am unable to do this table wide comparison.

txnelson

Super User

Joined:

Jun 22, 2012

dt << select where(:(letter == "A" & :Row Num == 1)  | (:Letter == "C" & :Row Num == 2);

 

Doesn't that give you your first requirement?

 

And your second requiremnt should be able to be calculated using Tabulate, if I am correctly understanding what you want 

Jim
aallman

Contributor

Joined:

Feb 28, 2017

Thank you!!