Search a list of strings for a partial string

Hello

I am trying to find a way to search through a list of strings for all items that contain a partial string. I know this can be done using a For loop, but the list that I will be running this script on is very large and For loops take a very long time. I am wondering if there is some function out there similar to Loc(list, string) that will find the items that have the partial string.

Just as an example:

```
list = {"football", "hockey", "baseball", "tennis"};

Loc( list, "ball" ); // wished-for behavior: a partial-string version of Loc
```

Would ideally return [1, 3].

Thanks!


Re: Search a list of strings for a partial string

I think the loop may be a good choice; most of the work is in the contains function, not the loop overhead.

```
// Build a large test list: 3 strings, doubled 19 times
x = {"34565467456745673456345634563456abca", "34563456456745674534563456346accca", "34456745674556345634563456accda"};
For( i = 1, i < 20, i += 1,
	x = x || x
);
N Items( x ); // 1572864

// Collect the index of every item that contains "ccc"
result = {};
start = Tick Seconds();
For( i = 1, i <= N Items( x ), i += 1,
	If( Contains( x[i], "ccc" ), Insert Into( result, i ) )
);
stop = Tick Seconds();
Show( N Items( x ) / N Items( result ), stop - start ); // 3:1, <1 second
```

One of the three strings contains the search pattern, so the result list is one third the size of the source. Under 1 second for about 1.5 million items seems reasonable. What list size, typical item length, and time requirement do you have?
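
As a minimal sketch, here is the same loop applied to the small example from the question. It assumes the list items are quoted strings, and `matches` is only an illustrative variable name:

```
// Sketch: indices of the items that contain the partial string "ball"
list = {"football", "hockey", "baseball", "tennis"};
matches = {};
For( i = 1, i <= N Items( list ), i += 1,
	// Contains returns the match position, or 0 when there is no match
	If( Contains( list[i], "ball" ), Insert Into( matches, i ) )
);
Show( matches ); // matches = {1, 3};
```

The result is a list of positions rather than a matrix like Loc returns, so convert it with Matrix( matches ) if a matrix is needed.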

Another approach, though not as good: it avoids the explicit loop but copies the data into a table and uses row selection:

```
// Build the same large test list: 3 strings, doubled 19 times
x = {"34565467456745673456345634563456abca", "34563456456745674534563456346accca", "34456745674556345634563456accda"};
For( i = 1, i < 20, i += 1,
	x = x || x
);
N Items( x ); // 1572864

// Time the copy of the list into a data table column
start = Tick Seconds();
dt = New Table( "Untitled",
	New Column( "Column 1", Character, "Nominal", Set Values( x ) )
);
stop = Tick Seconds();
Show( stop - start ); // 1.3 sec

// Time the row selection and retrieval of the matching row numbers
start = Tick Seconds();
dt << Select Where( Contains( :Column 1, "ccc" ) );
list = dt << Get Selected Rows;
stop = Tick Seconds();
Show( stop - start ); // 1 sec
```

JMP added better list support in JMP 13. If you are using an older version, this post describes how to make lists faster in JMP < 13:

https://community.jmp.com/t5/Uncharted/Fast-List/ba-p/28947

Craige