Regular expressions in Google Search Console

This article is part of the Complete Guide on Google Search Console (GSC)

Google Search Console just started to support regular expressions (RegEx) in filters. Let’s learn how to leverage RegEx to analyse GSC data.

This article covers regular expressions that you can use in Google Search Console, which follows the RE2 syntax.

To learn more regular expressions, read RegEx for SEO.


    Getting Started with RegEx in Google Search Console


    Google Search Console uses the RE2 syntax and does not support every regular expression feature that you might know from other engines.

    Filtering by RegEx is available for Page and Query reports.

    To filter the Performance report with a regex, click New, select either Query or Page, and choose the Custom (regex) filter.

    Add your regular expression and filter your report.


    Character Limits


    Google Search Console limits a regular expression to 4096 characters. That is usually enough.

    If you get close to the limit, you can condense the pattern by grouping the alternatives that share a prefix.

    For example, this:

    example.com/aa|example.com/bb
    

    equals:

    example.com/(aa|bb)
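
    Before pasting a condensed pattern into Search Console, you may want to check that it still selects the same URLs. Here is a minimal Python sketch of that check. It assumes that Python's built-in re module behaves like RE2 for these simple patterns (the two engines differ on advanced features), and the URL list is made up for illustration.

```python
import re

# Hypothetical URLs, made up for this comparison.
urls = ["example.com/aa", "example.com/bb", "example.com/cc"]

long_pattern = r"example.com/aa|example.com/bb"
short_pattern = r"example.com/(aa|bb)"

# re.search looks for a match anywhere in the string.
long_hits = [u for u in urls if re.search(long_pattern, u)]
short_hits = [u for u in urls if re.search(short_pattern, u)]

MAX_PATTERN_LENGTH = 4096  # character limit mentioned above

print(long_hits == short_hits)                # both patterns select the same URLs
print(len(long_pattern), len(short_pattern))  # characters used by each pattern
print(len(short_pattern) <= MAX_PATTERN_LENGTH)
```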
    

    Match All Pages / Queries that Contain a Word


    To filter pages or queries that contain a word, wrap the word in .* on both sides.

    This matches anything before and after your string. Here, I match anything containing the word seo.

    .*seo.*
    
    • .* matches anything.
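
    As far as I understand Google's documentation, regex filters in Search Console use partial matching by default, so the surrounding .* is optional. The sketch below illustrates the idea in Python, where re.search plays the role of a partial match. The queries are invented, and Python's re module is only an approximation of RE2.

```python
import re

# Invented queries for illustration.
queries = ["seo tools", "what is seo", "technical SEO audit", "python tutorial"]

# re.search finds the pattern anywhere in the string (a partial match).
wrapped = [q for q in queries if re.search(r".*seo.*", q)]
bare = [q for q in queries if re.search(r"seo", q)]

print(wrapped)
print(wrapped == bare)  # the leading and trailing .* change nothing here
```

    Note that "technical SEO audit" is not matched because the pattern is case sensitive.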

    Match Specific Pages


    To match specific pages, write your property followed by a capture group () that lists the URIs. JR Oakes’ RegEx Builder in Google Sheets can help you build the pattern. Escape the dots in the domain (\.) so that they match a literal dot rather than any character.

    ^https://www\.jcchouinard\.com/(python-for-seo|google-search-console-api|reddit-api)/$
    
    • () capture group to group elements together
    • | means OR
    • ^ starts with
    • $ ends with
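
    If you have a long list of pages, you can generate this kind of pattern instead of typing it. Below is a rough sketch in Python. The build_page_regex helper is a name I made up, and I am assuming that the escaping produced by re.escape (for example \- and \.) is accepted by RE2.

```python
import re

def build_page_regex(prefix, slugs):
    """Build ^prefix(slug1|slug2|...)/$ with special characters escaped."""
    alternatives = "|".join(re.escape(slug) for slug in slugs)
    return "^" + re.escape(prefix) + "(" + alternatives + ")/$"

pattern = build_page_regex(
    "https://www.jcchouinard.com/",
    ["python-for-seo", "google-search-console-api", "reddit-api"],
)

print(pattern)
print(bool(re.search(pattern, "https://www.jcchouinard.com/reddit-api/")))
print(bool(re.search(pattern, "https://www.jcchouinard.com/reddit-api/comments/")))
```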

    Negative Filtering with RegEx


    The first reaction from the SEO community to the new regular expression filtering in Google Search Console was that RE2 does not support negative lookahead.

    Google quickly reacted and came up with negative filtering by regex.

    You can now use the Doesn’t match regex option of the Custom (regex) filter.
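
    To reproduce the same logic on exported data, you can invert a normal match instead of relying on a lookahead. This is only a sketch of the idea in Python. The doesnt_match helper and the page list are hypothetical.

```python
import re

def doesnt_match(pattern, values):
    """Keep the values in which the pattern is NOT found."""
    compiled = re.compile(pattern)
    return [v for v in values if not compiled.search(v)]

# Hypothetical pages.
pages = [
    "https://example.com/blog/regex/",
    "https://example.com/tag/seo/",
    "https://example.com/blog/python/",
]

non_tag_pages = doesnt_match(r"/tag/", pages)
print(non_tag_pages)
```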


    Match Query / URL Length with RegEx


    Match short queries or URLs of 10 characters or fewer.

    ^[\w\W\s\S]{1,10}$
    

    Results shown: queries of 10 characters or fewer

    Results filtered out: long-tail queries with more than 10 characters
    
    • [] matches any one character from the set
    • ^ starts with
    • $ ends with
    • \w matches an ASCII letter, digit or underscore. It is the same as [A-Za-z0-9_];
    • \s matches whitespace;
    • \W matches anything that is not an ASCII letter, digit or underscore;
    • \S matches anything that is not whitespace;
    • {1,10} repeats the preceding pattern from 1 to 10 times.
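
    Here is a quick way to see where the boundary falls, sketched in Python with invented queries (Python's re module stands in for RE2 here).

```python
import re

SHORT = re.compile(r"^[\w\W\s\S]{1,10}$")

# Invented queries; the third one is exactly 10 characters long.
queries = ["seo", "regex gsc", "ten chars!", "eleven char", "how to use regex in search console"]

short_queries = [q for q in queries if SHORT.search(q)]

for q in queries:
    print(len(q), q in short_queries, q)
```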

    Find long-tail queries with Regular Expressions

    The RegEx below matches any query of at least X characters (70 in this case).

    ^[\w\W\s\S]{70,}$
    

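    Another solution is to count the whitespace between words to estimate the number of words. The pattern below matches queries of eight words or more: seven words each followed by a whitespace character, then a final word.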

    (\w+\s){7,}\w+
    
    • ^ starts with
    • $ ends with
    • [\w\W\s\S] any character
    • {70,} 70 times or more
    • (\w+\s) a word (one or more word characters) followed by a whitespace character
    • {7,} 7 times or more
    • \w+ ending with a word
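
    The two approaches do not always agree, so it can help to compare them on a few queries. The sketch below does that in Python with made-up queries. It prints the character count, the word count, and whether each pattern matches.

```python
import re

BY_CHARS = re.compile(r"^[\w\W\s\S]{70,}$")
BY_WORDS = re.compile(r"(\w+\s){7,}\w+")

# Made-up queries.
queries = [
    "regex search console",
    "how do i filter long tail queries in google search console",
    "a b c d e f g",    # 7 words
    "a b c d e f g h",  # 8 words
]

for q in queries:
    print(len(q), len(q.split()), bool(BY_CHARS.search(q)), bool(BY_WORDS.search(q)))
```

    The second query has many words but fewer than 70 characters, so only the word-based pattern picks it up.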

    Find Very Long URLs

    Use this regular expression to filter page URLs that are 100 characters or longer.

    ^[\w\W\s\S]{100,}$
    

    URL Containing Special Characters


    Matches any URL that contains special characters.

    [^\/\.\-:0-9A-Za-z_]
    
    • [^...] matches any character that is not in the set
    • \/\.\-: characters that are common in URLs (the :// of the protocol, the dots and the dash - between words) and should not be flagged
    • 0-9A-Za-z_ word characters, which should not be flagged either
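
    To see which characters trigger the match, you can list them per URL. A small Python sketch with hypothetical URLs:

```python
import re

SPECIAL = re.compile(r"[^\/\.\-:0-9A-Za-z_]")

# Hypothetical URLs.
urls = [
    "https://example.com/blog/regex-guide/",
    "https://example.com/search?q=regex",
    "https://example.com/caf%C3%A9/",
]

# Map each flagged URL to the distinct special characters found in it.
flagged = {u: sorted(set(SPECIAL.findall(u))) for u in urls if SPECIAL.search(u)}

for url, chars in flagged.items():
    print(url, chars)
```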

    Show a Specific URL Path


    Sometimes, you just want to match a specific path.

    • /<category>/<sub-category>/<feature>
    .*/jobs/.*/melbourne$
    

    That could match

    • /jobs/sales/melbourne
    • /jobs/marketing/melbourne
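
    One thing to keep in mind is that .* can span several path segments. The sketch below compares the pattern above with a stricter variant that I am adding as an assumption: [^/]+ allows exactly one segment between /jobs/ and /melbourne. The URLs are invented.

```python
import re

PATH = re.compile(r".*/jobs/.*/melbourne$")
# Assumed stricter variant: exactly one segment between /jobs/ and /melbourne.
ONE_SEGMENT = re.compile(r"/jobs/[^/]+/melbourne$")

# Invented URLs.
urls = [
    "https://example.com/jobs/sales/melbourne",
    "https://example.com/jobs/marketing/melbourne",
    "https://example.com/jobs/sales/sydney",
    "https://example.com/jobs/sales/senior/melbourne",
]

loose = [u for u in urls if PATH.search(u)]
strict = [u for u in urls if ONE_SEGMENT.search(u)]

print(loose)
print(strict)
```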

    Ends With a Trailing Slash


    Show pages that end with a trailing slash. Use the Doesn’t match regex option to show the pages that do not.

    .*\/$
    

    Show HTTP/HTTPS/Subdomains Variations


    Although it is recommended to verify your site in Google Search Console both as a domain property and as individual URL-prefix properties, you might want a quick way to check your domain property for indexed subdomains or HTTP/HTTPS variations.

    https?\:\/\/.*example\.com\/?$
    

    This is a quick way to identify subdomains that you might not know are indexed.

    • https? matches http or https
    • \/?$ ends with trailing slash or not.
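
    On an export of your pages, the same pattern can be used to list the distinct protocol and host combinations. This is a sketch in Python with made-up URLs.

```python
import re
from urllib.parse import urlsplit

HOME_VARIANTS = re.compile(r"https?\:\/\/.*example\.com\/?$")

# Made-up URLs.
pages = [
    "https://example.com/",
    "http://example.com",
    "https://blog.example.com/",
    "https://shop.example.com",
    "https://example.com/blog/",
]

matched = [p for p in pages if HOME_VARIANTS.search(p)]
# Keep only scheme + host to see the distinct variations.
hosts = sorted({urlsplit(p).scheme + "://" + urlsplit(p).netloc for p in matched})
print(hosts)
```

    Since .* sits right before example, the pattern also lets through any host that merely ends with example.com (notexample.com, for instance), which is usually harmless inside a domain property.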

    Compare Regular Expressions


    You might want to compare pages or queries based on RegExes.

    You can use the compare filter with regular expressions too.


    Understand User Intent


    Show queries that reveal different user intents.

    Informational

    who|what|when|how|why
    

    and more coming from Steve Toth:

    who|what|where|when|why|how|was|did|do|is|are|aren’t|won’t|does|if|can|could|should|would|won’t|were|weren’t|shouldn’t|couldn’t|cannot|can’t|didn’t|did not|does|doesn’t|wouldn’t
    

    Navigational

    See (Match Branded Terms).

    .*brand.*
    

    Commercial

    .*(best|top|vs|reviews?).*
    

    Transactional

    .*(buy|cheap|price|purchase|order).*
    

    Case Insensitive Queries


    Want to make queries case insensitive? Add (?i) at the start of the expression. Thanks Charly Wargnier for the tip (buy the man a coffee).

    (?i)^(who|what|where|when|why|how)\s
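
    The intent patterns can also be applied to exported queries. Here is a rough Python sketch that uses simplified versions of the patterns above with the (?i) flag. The classify function and its whole_words option are my own additions, and \b (word boundary) is used to show how substring matches such as "how" inside "show" can be avoided.

```python
import re

# Simplified versions of the intent patterns above.
INTENT_PATTERNS = {
    "informational": r"who|what|when|how|why",
    "commercial": r"best|top|vs|review",
    "transactional": r"buy|cheap|price|purchase|order",
}

def classify(query, whole_words=False):
    """Return the intents whose pattern is found in the query."""
    intents = []
    for name, pattern in INTENT_PATTERNS.items():
        if whole_words:
            # \b restricts the match to whole words.
            pattern = r"\b(" + pattern + r")\b"
        if re.search("(?i)" + pattern, query):
            intents.append(name)
    return intents

for q in ["How to buy shoes", "best laptop price", "tv show schedule"]:
    print(q, classify(q), classify(q, whole_words=True))
```

    With whole words, review no longer matches reviews, so you would need reviews? in that case.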
    

    Match Branded Terms


    People’s searches often contain spelling mistakes. Regular expressions let you evaluate brand searches properly.

    Let’s take possible misspellings of LinkedIn as an example.

    • lnkedin, linkeidn, linkden, linedin, linkein, likedin, linkin, linkedin, linkd, amazon linkedn

    You could be strict with a long regex:

    .*lnked*in.*|linke*idn.*|linkd*en.*|lined*in.*|linke*in.*|liked*in.*|link*in.*|linked*in.*|.*linkedn.*|.*linkd.*
    

    You could use a very broad pattern, at the risk of matching unrelated queries, or be more specific:

    .*l(i|n){1,2}(k|e).*n.*
    

    More than one pattern can do the job. Here is a looser one:

    .*l.*(i|n).*(n|k).*(e|d|i).*n.*
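
    Fuzzy patterns like these are easy to get wrong, so it is worth testing them against the misspellings you want to catch and against a few unrelated queries. Below is a minimal sketch in Python. The missed helper and the unrelated query are made up, and Python's re module only approximates RE2.

```python
import re

misspellings = [
    "lnkedin", "linkeidn", "linkden", "linedin", "linkein",
    "likedin", "linkin", "linkedin", "linkd", "amazon linkedn",
]

candidates = {
    "specific": r".*l(i|n){1,2}(k|e).*n.*",
    "loose": r".*l.*(i|n).*(n|k).*(e|d|i).*n.*",
}

def missed(pattern, terms):
    """Return the terms that the pattern fails to match."""
    return [t for t in terms if not re.search(pattern, t)]

unrelated = "online banking"  # made-up non-brand query

for name, pattern in candidates.items():
    print(name, "misses:", missed(pattern, misspellings),
          "| matches unrelated query:", bool(re.search(pattern, unrelated)))
```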
    

    Compare Brand VS Non-Brand Traffic



    Check for Potential Content Injections


    Content injection is a hack that adds webpages targeting specific keywords to your site. Here is one way to check your site for the most common ones.

    Use this regex in the Page filter and see whether any URL matches.

    .*viagra.*|.*cialis.*|.*levitra.*|.*drugs.*|.*porn.*|.*www.*www.*
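
    Keyword checks like this can flag legitimate pages, so the results need a manual review. The Python sketch below runs the pattern on hypothetical URLs, one of which is a false positive because "specialist" contains "cialis".

```python
import re

INJECTION = re.compile(
    r".*viagra.*|.*cialis.*|.*levitra.*|.*drugs.*|.*porn.*|.*www.*www.*"
)

# Hypothetical URLs.
pages = [
    "https://example.com/blog/regex/",
    "https://example.com/cheap-viagra-online/",
    "https://example.com/redirect/www.spam.example/www.other.example/",
    "https://www.example.com/specialist-guide/",
]

suspicious = [p for p in pages if INJECTION.search(p)]
for p in suspicious:
    print(p)
```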
    

    You could also check for special characters. Looking for foreign characters in URLs would be another option, but it is not available yet (see Foreign Characters in URLs below).


    Check WordPress Admin URLs


    Pretty straightforward: check for WordPress admin pages that seem to be indexed.

    .*wp-.*
    

    Edge Case Scenarios

    Show Postcodes in URLs

    I came across a case where I needed to list URLs that ended with a postcode.

    https://example.com/Sales-jobs-in-Wembley-HA0
    https://example.com/Sales-jobs-in-Corstorphine-EH12
    https://example.com/Marketing-jobs-in-Corstorphine-EH12
    https://example.com/Retail-jobs-in-Bourton-BS22
    

    In the UK, the first part of a postcode consists of one or two letters, followed by one digit, two digits, or one digit and one letter.

    .*-[A-Za-z]{1,2}(\d{1,2}|\d[A-Za-z])$
    
    • [A-Za-z]{1,2}: Matches one or 2 letters
    • \d{1,2}: Matches one or 2 numbers
    • \d[A-Za-z]: Matches a number and a letter
    • ( pat1 | pat2 ): Grouping of OR patterns
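
    To sanity-check the pattern, here is a short Python sketch that runs it on the URLs above plus one URL without a postcode (that last URL is my own addition).

```python
import re

POSTCODE_SUFFIX = re.compile(r".*-[A-Za-z]{1,2}(\d{1,2}|\d[A-Za-z])$")

urls = [
    "https://example.com/Sales-jobs-in-Wembley-HA0",
    "https://example.com/Sales-jobs-in-Corstorphine-EH12",
    "https://example.com/Marketing-jobs-in-Corstorphine-EH12",
    "https://example.com/Retail-jobs-in-Bourton-BS22",
    "https://example.com/Sales-jobs-in-London",  # added: no postcode
]

matched = [u for u in urls if POSTCODE_SUFFIX.search(u)]
# The postcode is whatever follows the last dash in a matched URL.
postcodes = sorted({u.rsplit("-", 1)[-1] for u in matched})

print(len(matched), postcodes)
```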

    Should Work but Doesn’t

    Foreign Characters in URLs

    According to the RE2 documentation, Unicode character classes like this should work:

    \p{Greek}
    

    This would be useful to identify URL patterns that have some of the foreign characters that are the most commonly used in content injection hacks.

    .*\p{Hiragana}.*|.*\p{Cyrillic}.*|.*\p{Hangul}.*|.*\p{Han}.*|.*\p{Thai}.*
    

    It works with Query RegExes, but it doesn’t work with URLs.

    From the Community

    There are a lot of other great ideas out there. Share them with me and I’ll post them here.

    A/B Testing

    E-Commerce

    Here’s one for e-Commerce stores.

    Additional RegExes

    Some additional regular expressions for GSC from this post (thanks Mike Ciffone)

    # Matches URL slug
    ^[foo]+(?:-[bar]+)*$
    
    # All URLs within /page (dots escaped so that they match a literal dot)
    https?:\/\/www\.example\.com\/page\/.*
    
    # All URLs between a certain slug and ending
    https?:\/\/www\.example\.com\/slug\/[^\/]+\/page
    
    # Matches all queries containing a specific term (all work)
    \b(\w*foo\w*)\b
    \b(\w*foo\sbar\w*)\b
    ^hello\sworld$
    
    # Matches all queries containing "blue shoe" or "blue shoes" (works)
    (\W|^)blue\s{0,3}shoe(s){0,1}(\W|$)
    
    # Matches all queries that contain "Ciffone" or "Ciffone Digital"
    ^.*(ciffone|ciffone digital).*$
    
    # Match Word or Phrase in a List
    (?i)(\W|^)(foo|bar|foo\sbar)(\W|$)
    

    Other Resources

    Google’s examples of regular expressions for Google Search Console

    #performanceregex on twitter


    Conclusion

    That’s it. You now know how to use regular expressions with Google Search Console.
