Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

The Power BI Data Visualization World Championships is back! It's time to submit your entry. Live now!

Reply
Anonymous
Not applicable

regmatches woes in R/Query Editor

Hi all

 

I'm trying to do some pattern matching in the query editor. The intention being that I can then merge a reformatted postcode to an enormous table, which I do not want to upload in it's entirety to PBI online.

 

I have got validation working fine with grepl, but I need to ensure a common 8 character format. But I am getting an error in the column, the R scri[pt executes, but the column returns "error" on all rows.

 

This is where I am:

 

# 'dataset' holds the input data for this script
pattern1 <- "([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]?))))"
pattern2 <- "[0-9][A-Za-z]{2}"
matches <- function(x) {gregexpr(pattern2 , as.character(x), perl=TRUE)}
part2 <- regmatches(dataset$post_code, matches(dataset$post_code))
PCDS <- within(dataset, {PCDS = part2})


I think the problem is that grepexpr returns a multidimensional array. I need the 1st index to go in the column "PCDS" but can't get the syntax right.

 

Please help! Thanks in advance!

1 ACCEPTED SOLUTION
Anonymous
Not applicable

Fixed it! The regex could be more comprehensive and it's very step by step, but this R script converts UK postcodes matching the regex pattern to an 8 character format in the query editor.

# 'dataset' holds the input data for this script
# returns string w/o leading or trailing whitespace
trim <- function (x) gsub("\\s+|\\s+", "", x)
ws <- function(x){ 
    if(x>0){strrep(" ",x)}
    else {""}
}

pattern <- "^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]?)))) ?[0-9][A-Za-z]{2}"
match <- function(x) {gregexpr(pattern , as.character(x), perl=TRUE)}
part <- function(x) {c(regmatches(x, match(x)))}

pCodeList <- part(dataset$post_code)
pCodeVector <- as.character(pCodeList)

for(i in 1:length(pCodeVector))
{
    pCodeVector[i] <- trim(pCodeVector[i])
    pCodeLength <- nchar(pCodeVector[i])
    pCodeRight <- toupper(substr(pCodeVector[i], pCodeLength - 2, pCodeLength))
    pCodeLeft <- toupper(substr(pCodeVector[i],1, pCodeLength -3))
    pCodeVector[i] <- paste(pCodeLeft, ws((8 - nchar(pCodeLeft)) - nchar(pCodeRight)), pCodeRight, sep='')
}

PCDS <- within(dataset, {PCDS = (pCodeVector)})

Development discussion warmly welcomed.

 

Beth

x

View solution in original post

4 REPLIES 4
v-yuta-msft
Community Support
Community Support

@Anonymous,

 

Don't know to much about R script and the error message you have encountered with,  however, you may modify the last statement with below:

dataset$PCDS <- within(dataset, {dataset$PCDS = part2})

 

Community Support Team _ Jimmy Tao
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly. 

Anonymous
Not applicable

Unfortunately putting "dataset$" in either of those positions returns an empty table which then causes an error in the query editor. 😞

Anonymous
Not applicable


I'm still stuck with:

# 'dataset' holds the input data for this script
pattern1 <- "([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]?))))"
pattern2 <- "[0-9][A-Za-z]{2}"
match1 <- function(x) {gregexpr(pattern1 , as.character(x), perl=TRUE)}
part1 <- function(x) {c(regmatches(x, match1(x)))}

PCDS <- within(dataset, {PCDS = part1(post_code)})

After playing around in the R shell:

> post_code<-c("A12 3BC","A123BC","A12    3BC","a12 3bc","1234")

> pattern1 <- "([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]?))))"

> match1 <- function(x) {gregexpr(pattern1 , as.character(x), perl=TRUE)}

> part1 <- function(x) {c(regmatches(x, match1(x)))}

> part1(post_code)
[[1]]
[1] "A12"

[[2]]
[1] "A12"

[[3]]
[1] "A12"

[[4]]
[1] "a12"

[[5]]
character(0)

> post_code
[1] "A12 3BC"    "A123BC"     "A12    3BC" "a12 3bc"    "1234"  

As you can see the data has a very different arrangement between post_code and part1(postcode). So I'm stilthinking down the lines that the vector just needs reorganising.

 

Problem is, I learned R last week. This didn't work!

output1<-vector(mode="character", length = 0)
for(val in part1){
    if(length(val)==0){
        output1<=c(part1,"Not UK Postcode")
    }
    else{
        output1<-val
    }
}

So, if you know anything about R vector manipulationm, please help!

xx

Anonymous
Not applicable

Fixed it! The regex could be more comprehensive and it's very step by step, but this R script converts UK postcodes matching the regex pattern to an 8 character format in the query editor.

# 'dataset' holds the input data for this script
# returns string w/o leading or trailing whitespace
trim <- function (x) gsub("\\s+|\\s+", "", x)
ws <- function(x){ 
    if(x>0){strrep(" ",x)}
    else {""}
}

pattern <- "^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]?)))) ?[0-9][A-Za-z]{2}"
match <- function(x) {gregexpr(pattern , as.character(x), perl=TRUE)}
part <- function(x) {c(regmatches(x, match(x)))}

pCodeList <- part(dataset$post_code)
pCodeVector <- as.character(pCodeList)

for(i in 1:length(pCodeVector))
{
    pCodeVector[i] <- trim(pCodeVector[i])
    pCodeLength <- nchar(pCodeVector[i])
    pCodeRight <- toupper(substr(pCodeVector[i], pCodeLength - 2, pCodeLength))
    pCodeLeft <- toupper(substr(pCodeVector[i],1, pCodeLength -3))
    pCodeVector[i] <- paste(pCodeLeft, ws((8 - nchar(pCodeLeft)) - nchar(pCodeRight)), pCodeRight, sep='')
}

PCDS <- within(dataset, {PCDS = (pCodeVector)})

Development discussion warmly welcomed.

 

Beth

x

Helpful resources

Announcements
FabCon Atlanta 2026 carousel

FabCon Atlanta 2026

Join us at FabCon Atlanta, March 16-20, for the ultimate Fabric, Power BI, AI and SQL community-led event. Save $200 with code FABCOMM.