2,345 questions
Advice
0
votes
4
replies
87
views
How to create stable person identifiers when names vary across years
I am working with a university faculty salary dataset where the same person appears across many years, but their name strings are inconsistent. The dataset has about 8,000 unique people and years from ...
-1
votes
2
answers
167
views
Java regex - Optional Match Capturing Group [closed]
I'd like to process some input queries in 3 possible ways:
query: select * from People
query: select * from People exclude addresses
query: select * from People include department
I have two regex1 ...
0
votes
2
answers
87
views
Automatically map messy column names to a standard schema in pandas
I'm working with many tabular datasets (Excel, CSV) that contain inconsistent or messy column names due to typos, different naming conventions, spacing, punctuation, etc.
I have a standard schema (as ...
0
votes
1
answer
56
views
expect-5.45.4 shows unexpected spawn output, causing string match to fail; is it a bug?
In SLES15 SP6 on x86_64 I'm using a bash script and expect-5.45.4 to do automated program testing.
Basically I'm checking whether the program to test (./pwg.pl) outputs a specific string.
Starting to ...
-2
votes
1
answer
116
views
How to match German province names between 2 data sets in R?
I'm working with two datasets for German NUTS-3 level regions:
A shapefile from Eurostat via the giscoR package:
> library(giscoR)
> nuts3_germany <- gisco_get_nuts(country = "Germany...
4
votes
4
answers
169
views
Match start of line in multiline string in lua?
Let's say I want to match any sequence of the hash sign # at the start of a string; so I'd want to match ## here:
local mystr = "##First line\nSecond line\nThird line"
... and ### here:
...
2
votes
3
answers
123
views
Pandas DataFrame column partial match and extract matching value
I have a column in a pandas DataFrame (Names) with a large collection of names. I have another DataFrame (Title) with a text column, and the names from the Names frame appear within that text. What would be the ...
2
votes
0
answers
88
views
Find Substrings In A Dynamic Collection Of String
This question is a little complicated, so I'll try to describe it through an example.
First, we get a string foo, and put it into collection S.
Then we get a string sample, and put it into S too.
Next, ...
1
vote
1
answer
71
views
Match similar names [duplicate]
I have a database with three columns: name, occupation, and organization. In these columns, I have duplicates with slightly different names. For example, Anne Sue Frank and Anne S. Frank refer to the ...
0
votes
2
answers
86
views
How to match cross-referenced names from table without duplicates
savvy people,
I will have participants of an event sign up where they, aside from their personal details, also provide a duo partner's name or leave that blank. So, I will have two columns, ...
1
vote
3
answers
94
views
Find str.contains in two large Pandas DataFrames
I have large pandas DataFrames like the one below.
import pandas as pd
import numpy as np
df = pd.DataFrame(
[
("1", "Dixon Street", "Auckland"),
("2...
0
votes
1
answer
90
views
Full string matching in Pandas dataframes comparison
This seems like it should be an easy problem to solve, but I've been battling with it and cannot seem to find a solution.
I have two dataframes of different sizes and different column names. I am ...
1
vote
1
answer
79
views
How to match a function but exclude object methods without negative lookbehind
I'm trying to write a regex that matches every occurrence of some_function(...), but it should not match when it's part of an object method like my.some_function(...) or if it is a substring of ...
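One way to approach this without a lookbehind is to let the pattern consume the character before the name and capture the name separately. The following is only a minimal sketch in Python under stated assumptions: the helper name `find_bare_calls` is hypothetical, the truncated part of the question is not covered, and a lookahead for the opening parenthesis is assumed to be acceptable since only lookbehind is ruled out.

```python
import re


def find_bare_calls(source: str, name: str = "some_function") -> list[int]:
    """Return start offsets of `name(` not preceded by a word character or a dot.

    Hypothetical helper for illustration; it does not use a lookbehind.
    """
    # (?:^|[^\w.]) consumes either the start of the string or one character
    # that cannot belong to an identifier or a member access.
    # (?=\() checks for the opening parenthesis without consuming it, so a
    # nested call directly after it can still be found.
    pattern = re.compile(r"(?:^|[^\w.])(" + re.escape(name) + r")(?=\()")
    # Report where the captured name starts, not where the consumed prefix starts.
    return [m.start(1) for m in pattern.finditer(source)]


sample = "some_function(1); my.some_function(2); awesome_function(3); x = some_function(4)"
positions = find_bare_calls(sample)
```

The offsets refer to the captured group, so the consumed prefix character does not shift the reported position.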
2
votes
2
answers
88
views
Do Kotlin's List/Array data structures have a findSublist method analogous to String.indexOf(CharSequence)?
Do Kotlin's List/Array data structures have a findSublist method analogous to String.indexOf(CharSequence), that takes a List/Array/Sequence to match against the list?
1
vote
0
answers
78
views
Trying to fix names in my database with fuzzywuzzy
What I'm trying to do is find and correct similar names in my database, like 'Patrick Maxwell' and 'Patrick Maxwel.' However, the issue I'm facing is that the best match for each name is often itself, ...
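The self-match problem described here can be sidestepped by skipping identical strings when searching for the best candidate. The sketch below uses the standard library's `difflib.SequenceMatcher` as a stand-in for fuzzywuzzy, so the scores are not the ones fuzzywuzzy would report; the helper name `best_other_match` and the sample names beyond the two in the question are assumptions.

```python
from difflib import SequenceMatcher


def best_other_match(name, candidates):
    """Return (candidate, score) for the most similar candidate that differs from `name`.

    Hypothetical helper; score is difflib's ratio in [0, 1], not a fuzzywuzzy score.
    """
    best_candidate, best_score = None, 0.0
    for other in candidates:
        if other == name:
            # Skip the name itself so it cannot win as its own best match.
            continue
        score = SequenceMatcher(None, name, other).ratio()
        if score > best_score:
            best_candidate, best_score = other, score
    return best_candidate, best_score


names = ["Patrick Maxwell", "Patrick Maxwel", "Anne Frank"]  # illustrative data
match, score = best_other_match("Patrick Maxwell", names)
```

Skipping by equality also hides exact duplicates; if those matter, comparing by row index instead of by value would be the safer choice.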
1
vote
1
answer
273
views
List specific information of all LEAF certificates from java store JKS
I wish to list only the signed certificates for our application and not the chain signing certificate from a Java store, i.e. <jdk_home>/jre/lib/security/cacerts or any such JKS store.
The idea ...
1
vote
1
answer
2k
views
How can I check if a string contains another string in PowerShell?
I have a string that is returned from an API call; the string is something like
".\controllers\myaction c:\test\path"
I want to use PowerShell to check if the string contains c:\
...
0
votes
1
answer
123
views
Python 3.12 Pandas Difflib Get_Close_Matches to compare two strings in a dataframe and return a % match
Working with irregular Excel tables, I am trying to match questions by looking at a string in a column in a dataframe and if it is a close match to my target string, score the % match.
The way I tried ...
0
votes
2
answers
80
views
Dynamic approach to identify and standardize similar names automatically in pandas or data cleaning
I have a DataFrame with a column of publisher names that contains various minor variations of the same publisher. For example, entries such as "Harlequin Romance", "Harlequin Blaze"...
1
vote
2
answers
360
views
C++ function returns extremely slowly, far slower than functionally equivalent Python code
I have a function that is used in a script that I am writing to remove redundant blocking keywords from a list. Basically, with the input (in any order) of:
{"apple", "bapple", ...
0
votes
1
answer
62
views
Renaming dataframe column in Python with a string value in another dataframe by matching column/index names
Major edit:
Apparently it is difficult to understand my question, so I'll do my best to concretize it.
I have two dataframes, "df1" and "df2". These are quite large, larger than in ...
0
votes
2
answers
136
views
Finding the longest Dictionary.Key match in a phrase
I have a SortedDictionary<string, string>, ordered by key length descending, of the form:
red fox - address1
weasel - address2
foxes - address3
fox - address3
etc.
and a list of phrases e.g.
...
0
votes
2
answers
75
views
Is there a way to obtain a list separated by comma as the output of str_extract_all instead of the default output in R?
I have searched high and low and nobody seems to have asked that exact question, so I'm at a loss.
I have a data frame with a couple of columns. One of these columns contains various sentences that don't ...
0
votes
1
answer
67
views
Identifying Correct String Order in Pandas
I have a dataframe like the following, showing the relationship of different entities in each row.
Child | Parent | Ult_Parent | Full_Family
A032 | A001 | A039 | A001, A032, A039, A040, A041, A043, A043, A045, A046
...
2
votes
6
answers
143
views
Matching the start of a sequence in R
I have a series of strings in a vector and need to remove the matching starting pattern from each string. However, I don't know the pattern or how long it is.
stringa <- c("apple_tart", ...
1
vote
1
answer
353
views
Given a String count the possible Permutations that satisfy a condition. How to Optimize from O(N*N!)
Hi, I recently came across an interesting question and had a hard time trying to optimize it beyond O(N*N!).
Here is the question:
Given a string, return the number of possible combinations that satisfy ...
2
votes
2
answers
341
views
How can I find all exact occurrences of a string, or close matches of it, in a longer string in Python?
Goal:
I'd like to find all exact occurrences of a string, or close matches of it, in a longer string in Python.
I'd also like to know the location of these occurrences in the longer string.
To define ...
1
vote
1
answer
95
views
Why doesn't fuzzywuzzy's process.extractBests give a 100% score when the tested string 100% contains the query string?
I'm testing fuzzywuzzy's process.extractBests() as follows:
from fuzzywuzzy import process
# Define the query string
query = "Apple"
# Define the list of choices
choices = ["Apple...
0
votes
0
answers
85
views
How to efficiently compute similarity scores for prefixes of a string with another string in C?
I'm working on a problem involving string matching where I need to compute the similarity scores for each prefix of a string C against another string S. The similarity score for a prefix P of C and S ...
-2
votes
2
answers
101
views
Count If criteria partial text
How can I count the number of cells in a column that contain a partial text?
I want the result to be 6, since the text AB (A and B) can be found in all those rows except Row 4, which has only C in it.
COL
...
0
votes
1
answer
637
views
How to do fuzzy merge with 2 large pandas dataframes?
I have 2 pandas dataframes that both contain company names. I want to merge these 2 dataframes on company names using a fuzzy match. But the problem is 1 dataframe contains 5m rows and the other 1 ...
1
vote
0
answers
138
views
How to find best matching anchor texts from paragraph and list of titles?
I have a paragraph:
In today's world, keeping your personal information safe online is more important than ever. With cyber-attacks on the rise, having a strong cybersecurity strategy is essential.
...
2
votes
3
answers
171
views
How to Compare Hierarchy in 2 Pandas DataFrames? (New Sample Data Updated)
I have 2 dataframes that captured the hierarchy of the same dataset. Df1 is more complete compared to Df2, so I want to use Df1 as the standard to analyze if the hierarchy in Df2 is correct. However, ...
-1
votes
1
answer
77
views
Can I combine contain and startswith in order to match two columns from one dataframe to another's master column?
Master dataframe filled with a specific match's players and statistics.
34 columns and variable number of rows.
Column "Player" has full names
Player | Goals | Assists
Dominic Calvert-Lewin | 1 | 1
...
-1
votes
1
answer
92
views
How do I find the first # after an even number of "?
Reading a text file with the format:
e2c=["(vsim-86)" ,'kkk', "pppp",
"bbbbbb", #"old", "uio",
" sds # sds", #"old2",
" sds #...
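The rule in the title (the first # that follows an even number of double quotes) amounts to finding the first # that is outside a double-quoted string. A minimal sketch in Python, assuming each line is handled on its own, quotes are never escaped, and single quotes need no special treatment; the function name `find_comment_start` is hypothetical:

```python
def find_comment_start(line: str) -> int:
    """Return the index of the first '#' outside double quotes, or -1 if none.

    Assumes no escaped quotes and that quoting does not span lines.
    """
    in_quotes = False
    for index, char in enumerate(line):
        if char == '"':
            # Each double quote toggles the state, so an even count so far
            # means we are outside a quoted string.
            in_quotes = not in_quotes
        elif char == "#" and not in_quotes:
            return index
    return -1


line = '" sds # sds", #"old2",'
cut = find_comment_start(line)
code_part = line[:cut] if cut != -1 else line
```

Here `code_part` keeps everything before the comment marker, while the # inside the quoted string is left alone.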
0
votes
1
answer
76
views
Asymmetric partial matching of text strings between two dataframes
I have two dataframes:
df1 is based on survey responses and includes a non-restricted field for users to add their location in the UK (or refuse to do so) formatted as so (not real data):
Name
...
0
votes
0
answers
65
views
String Matching Function Not Matching Strings Despite Threshold Set to 0
I have implemented a string matching function in Python utilizing n-grams and similarity ratios. The function signature is as follows:
# concise version of the function
def match_strings(...
-2
votes
1
answer
58
views
Incorporating Phone Number Matching into Existing String based Name Matching Function
I have a Python function, match_strings, which is designed to match names from two different data sources. Here is the function definition:
python
def match_strings(strings1, strings2, ngram_n=2, ...
1
vote
1
answer
62
views
Is there a way to recode a vector of strings based on two key words or phrases that appear in every value into new vector with those two values?
As my question indicates, I would like to convert a vector of strings into a new vector containing one of two values that appears in every string. Here is an example of a very simple data frame I have:
data <-...
0
votes
1
answer
217
views
Filtering Range based on Multiple Criteria
I am trying to filter a list of properties based on multiple keywords (e.g. "Cool Interior," "Terrace/Patio"). Here's a basic interpretation:
The range I want to filter is on a ...
0
votes
0
answers
704
views
Google Sheets - Count if two cells have the same text
I'm trying to create a code to see if my predictions for games and the actual result of the games are the same. I was going to create a point value, like March Madness has, but I can't actually get ...
3
votes
1
answer
179
views
Aho-Corasick algorithm with C language
I have programmed an Aho-Corasick algorithm with a transition table that searches for a set of words in a text and displays the number of occurrences by using malloc(), but I am encountering this ...
1
vote
1
answer
321
views
module 'thefuzz' has no attribute 'partial_ratio' and other odd errors
Been trying to use thefuzz to compare two different lists, and got the above error, which doesn't seem right. I've commented everything else out in my code except the below two test lines and still ...
0
votes
1
answer
478
views
Searching for matching words in a PDF using page.search_for
I have a list of words which I am searching for in a PDF document using fitz in Python.
The code generally works for most of the words except for a few like "efficiency".
My code is given below:
...
0
votes
0
answers
34
views
PowerShell ilike operator not returning true [duplicate]
PS C:\Users\Administrator> $string = "hello world"
PS C:\Users\Administrator> $string -ilike "hello"
False
The above is outputting False, and not True. Not sure what I am ...
0
votes
0
answers
118
views
Why is Rabin-Karp algo seemingly less efficient than brute force algo for string matching
I am looking at the efficiency of various algorithms: not just big-O efficiency, but practical efficiency. Anyway, I was testing a Rabin-Karp algorithm I wrote against a brute force string comparison ...
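For reference, the rolling-hash search being compared can be sketched as follows. This is not the asker's implementation, which the excerpt does not show; the function name, the base of 256, and the modulus are assumptions chosen for illustration.

```python
def rabin_karp(text: str, pattern: str, base: int = 256, mod: int = 1_000_000_007) -> list[int]:
    """Return the start offsets of every occurrence of `pattern` in `text`.

    Illustrative sketch; an empty pattern is treated as having no matches.
    """
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Weight of the leading character in a window of length m.
    high = pow(base, m - 1, mod)
    pattern_hash = window_hash = 0
    for i in range(m):
        pattern_hash = (pattern_hash * base + ord(pattern[i])) % mod
        window_hash = (window_hash * base + ord(text[i])) % mod
    hits = []
    for start in range(n - m + 1):
        # Compare the characters only when the hashes agree, to rule out collisions.
        if window_hash == pattern_hash and text[start:start + m] == pattern:
            hits.append(start)
        if start < n - m:
            # Roll the window: drop text[start], append text[start + m].
            window_hash = ((window_hash - ord(text[start]) * high) * base
                           + ord(text[start + m])) % mod
    return hits


offsets = rabin_karp("abracadabra", "abra")
```

Every window costs a multiplication, a subtraction, and a modulo before any characters are compared, which is the per-step work a timing comparison against a brute force scan would be measuring.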
0
votes
2
answers
103
views
Is there a way in R to join between two columns based on whether a string in column 1 is contained within the string in column 2?
I am trying to join several messy datasets together without using "fuzzy matching".
In the core dataset (example dataset1 below), I have simple names for companies. In the datasets I would ...
-1
votes
2
answers
74
views
Compare two columns (with merged phone numbers) if any phone number from first column exists in the second column
I need to compare two columns in a resulting data frame; the two columns come from separate sources.
Now, I would like to compare them and have a resulting (tag) column based on ...
1
vote
1
answer
577
views
Split full address to contain only street name
I have a table with address1, city, state, and postal code. However, some address1 values also contain the city, state, and postal code (separated by a comma, a space, or both). Example:
Address1: 9999 ...
-1
votes
3
answers
234
views
Having trouble with regex in Java 11
Trying to strip server name from: //some.server.name/path/to/a/dir (finishing with /path/to/a/dir)
I have tried 3 different regexes (hardcoded works), but the other two look like they should work but ...