Beginner Guide To Regex For SEO


In this guide, I will show you how to use Regex for SEO, even if you have no programming knowledge.

Regular expressions are easy to learn and amazingly useful, so make sure that you go through this entire tutorial: it is going to be one of the best investments of time in your SEO career.

This post is part of the complete Guide on Python for SEO

What Are Regular Expressions (Regex)?

Regex, or regular expressions, are used to detect patterns in strings of characters.

With RegEx, you can easily match many results that have the same pattern.

Basic Regular Expressions

  • . matches any character;
  • .* matches 0 or more characters;
  • .+ matches 1 or more characters;
  • ? makes the previous character optional;
  • ^ matches the beginning of a line;
  • $ matches the end of a line.

For example, one of the most common patterns that I use with Google Analytics is this one:

.*site1.*|.*site2.*

or the equivalent:

.*site(1|2).*

This way I can match any of those results:

#Match
site1.com
site1.fr
site2.ca
www.site2.com
site2.ca/url-path

#No Match
www.google.com
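
The same check can be reproduced outside Google Analytics. Here is a minimal Python sketch, assuming the hostnames listed above are the values being tested; the variable names are only illustrative.

```python
import re

# Pattern from the text: anything, then "site1" or "site2", then anything.
pattern = re.compile(r".*site(1|2).*")

hostnames = [
    "site1.com",
    "site1.fr",
    "site2.ca",
    "www.site2.com",
    "site2.ca/url-path",
    "www.google.com",
]

# fullmatch() requires the whole string to fit the pattern.
matched = [h for h in hostnames if pattern.fullmatch(h)]
not_matched = [h for h in hostnames if not pattern.fullmatch(h)]

print(matched)
print(not_matched)
```

Only www.google.com ends up in not_matched, because it contains neither site1 nor site2.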

RegEx is not specific to any programming language. So, whether you are using Google Analytics, or programming in Python, JavaScript or Java, you’ll need at some point to use Regular Expressions.

Regular expressions have different flavours from one programming language to the other.

However, if you learn how to use general regular expressions, you’ll have no problem using them in any of the programming languages.

Get Started With RegEx

This guide will walk you through the basics of RegEx. If you want to go further, make sure that you look at my favorite tool, Regex101, and this RegEx Cheat Sheet.


Why Learn RegEx for SEO?

SEOs mostly start using regex because they work with Google Analytics and data analysis.

Then, they start using it for crawling and scraping and, as their career and knowledge progress, for making API calls, until they use it everywhere.

A good example is when you want to filter a report in Google Analytics.

Someday, you'll surely want to keep only the organic traffic coming from Google, including Google Search and Google For Jobs, but excluding Google CPC.

In this case, you would go to Acquisition > All Traffic > Source/Medium > Advanced and use the .*google.*organic.* regular expression to filter your results.

You'd then get a report that shows only the matching sources.
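
To see what this filter keeps, here is a small Python sketch. The Source/Medium values below are hypothetical examples written for illustration, not data from a real report.

```python
import re

# Hypothetical Source/Medium values, made up for this example.
rows = [
    "google / organic",
    "google / cpc",
    "bing / organic",
    "google_jobs_apply / organic",
    "(direct) / (none)",
]

# Same pattern as in the text: "google", later followed by "organic".
pattern = re.compile(r".*google.*organic.*")

organic_google = [row for row in rows if pattern.search(row)]
print(organic_google)
```

The google / cpc row is dropped because it never contains organic, and bing / organic is dropped because it never contains google.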

I know that this is fairly basic, but I just wanted to show why you'll absolutely need regexes sooner or later in your SEO career.

Regular Expressions Basics

Let’s dive into the regular expression basics.

Flags

Flags change how the whole pattern is applied. You might want to ignore case when matching, or match more than once. Shorthand character classes, written with a backslash, match one kind of character, such as a digit.

In flavours that write the pattern between slashes, such as JavaScript, you add the flag after the closing slash like this:

/google/i

Matches google and Google.

The most useful flags and shorthand classes are:

  • i ignores case;
  • g matches more than once (JavaScript);
  • \d matches one digit from 0 to 9;
  • \w matches an ASCII letter, digit or underscore. It is the same as [A-Za-z0-9_];
  • \s matches whitespace;
  • \D matches anything that is not a digit from 0 to 9;
  • \W matches anything that is not an ASCII letter, digit or underscore;
  • \S matches anything that is not a whitespace.
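
In Python, ignoring case is requested with the re.IGNORECASE argument rather than a suffix, and re.findall() returns every match, which is close to what the g flag does in JavaScript. A minimal sketch, with sample strings invented for the example:

```python
import re

# Ignore case: the flag is passed as an argument.
ignore_case = re.findall(r"google", "Google and google", flags=re.IGNORECASE)

# \d matches a single digit, so each digit is returned separately.
digits = re.findall(r"\d", "page-2 of 15")

# \w+ matches runs of letters, digits or underscores.
words = re.findall(r"\w+", "seo_tips 2024!")

# \D+ matches runs of anything that is not a digit.
non_digits = re.findall(r"\D+", "id=42")

# \s+ matches runs of whitespace, handy for splitting messy text.
parts = re.split(r"\s+", "technical  seo\tguide")

print(ignore_case, digits, words, non_digits, parts)
```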

Match Characters

To match one or multiple characters, you can use the backslash classes we just saw, such as \d. You can also use wildcards, character sets and anchors.

  • . matches any single character. SE. will match SEO and SEM;
  • [aeiou] matches one of those vowels. b[aiu]g will match bag, big and bug. With the g flag, [aeiou] would match every vowel in the string;
  • [a-z] matches a range of characters. This would match any lowercase letter of the alphabet. To match both lowercase and uppercase letters, you could use [a-z] with the ignore-case flag, or [a-zA-Z];
  • [0-9] matches a range of digits from 0 to 9. You can combine ranges to match numbers and letters like this: [2-5b-h];
  • ^ only matches if the string starts with the pattern. ^SEO.* matches SEO is great but not I love SEO;
  • $ only matches if the string ends with the pattern. .*regex$ matches I love working with regex, but does not match regex are awesome;
  • ? makes the previous character optional. In Colou?r, the "u" is optional, so it matches Color and Colour.
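
The items in this list can be tried out in Python. This is a small sketch; the helper name full and the word lists are assumptions made for the example.

```python
import re

def full(pattern, text):
    """Return True when the whole text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

# . stands for exactly one character.
wildcard = [w for w in ["SEO", "SEM", "SERP"] if full(r"SE.", w)]

# [aiu] accepts one of the listed vowels.
vowels = [w for w in ["bag", "big", "bug", "bog"] if full(r"b[aiu]g", w)]

# Combined ranges: digits 2 to 5 and letters b to h.
mixed = re.findall(r"[2-5b-h]", "a1b2c3z9")

# ^ and $ anchor the pattern to the start and the end of the string.
starts_with_seo = re.search(r"^SEO.*", "SEO is great") is not None
not_at_start = re.search(r"^SEO.*", "I love SEO") is None
ends_with_regex = re.search(r".*regex$", "I love working with regex") is not None

# ? makes the "u" optional.
optional = [w for w in ["Color", "Colour", "Colouur"] if full(r"Colou?r", w)]

print(wildcard, vowels, mixed, optional)
```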

Logic

You'll often want a regular expression to accept one of several alternatives. You can do this with a logical OR.

Using the | symbol, you can match any one of several conditions. For example, dog|fish matches all results equalling dog or fish. Don't put spaces around the |, because a space in a pattern is matched literally.
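
Here is a short Python sketch of the OR operator. The word list is invented for the example, and the last line shows that spaces inside a pattern are treated as literal characters.

```python
import re

pets = ["dog", "fish", "cat", "dogfish"]

# fullmatch(): the whole value must equal one of the alternatives.
exact = [p for p in pets if re.fullmatch(r"dog|fish", p)]

# search(): the value only has to contain one of the alternatives.
contains = [p for p in pets if re.search(r"dog|fish", p)]

# With spaces, the alternatives become "dog " and " fish".
with_spaces = re.fullmatch(r"dog | fish", "dog")

print(exact, contains, with_spaces)
```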

Here are other useful logic syntaxes.

Quantifiers

Quantifiers, or quantity specifiers, tell the regex how many times the previous character or group can repeat.

  • + one or more times;
  • {2} twice;
  • {3,5} three to five times;
  • * zero or more times;
  • ? once or none.
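
Each quantifier can be checked against a few sample strings in Python. A minimal sketch; the helper name full and the test words are made up for illustration.

```python
import re

def full(pattern, text):
    """Return True when the whole text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

samples = ["ggle", "gogle", "google", "gooogle"]

# + : the "o" must appear one or more times.
one_or_more = [s for s in samples if full(r"go+gle", s)]

# {2} : the "o" must appear exactly twice.
exactly_two = [s for s in samples if full(r"go{2}gle", s)]

# * : the "o" may appear zero or more times.
zero_or_more = [s for s in samples if full(r"go*gle", s)]

# {3,5} : the "o" must appear three to five times.
three_to_five = [s for s in ["goo", "gooo", "gooooo", "goooooo"] if full(r"go{3,5}", s)]

# ? : the "s" may appear once or not at all.
optional = [s for s in ["http", "https", "httpss"] if full(r"https?", s)]

print(one_or_more, exactly_two, zero_or_more, three_to_five, optional)
```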

Negated Character Sets

When you want to create a set of characters that you don't want to match, you need to use a negated character set.

To create one, place a caret (^) right after the opening bracket of a character set.

  • [^aeiou] matches a single character that is not in the list [aeiou].
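
Negated sets are handy for cleaning data. Here is a minimal Python sketch; the slug-cleaning rule is an assumption chosen for illustration, not a rule from this guide.

```python
import re

# Every single character of "seo" that is not a vowel.
consonants = re.findall(r"[^aeiou]", "seo")

# Remove everything that is not a lowercase letter, a digit or a hyphen.
slug = re.sub(r"[^a-z0-9-]", "", "my-page_title!2")

print(consonants, slug)
```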

Positive And Negative Lookahead

Lookaheads tell the regex to look ahead in your string and check for a pattern that you specify, without including it in the match. There are positive lookaheads ((?=)) and negative lookaheads ((?!)).

se(?=o)
seo #match "se"
sem #no match 

se(?!o)
seo #no match
sem #match "se"
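
The same four cases can be verified in Python. A minimal sketch using re.search(), which returns None when nothing matches:

```python
import re

# Positive lookahead: "se" only when it is followed by "o".
positive_seo = re.search(r"se(?=o)", "seo")
positive_sem = re.search(r"se(?=o)", "sem")

# Negative lookahead: "se" only when it is NOT followed by "o".
negative_seo = re.search(r"se(?!o)", "seo")
negative_sem = re.search(r"se(?!o)", "sem")

# The matched text is "se": the character that was looked at is not included.
print(positive_seo.group(), positive_sem, negative_seo, negative_sem.group())
```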

Greedy and Lazy Matching

In regular expressions, a greedy match finds the longest possible part of a string that satisfies the regex. A lazy match is the opposite. It finds the smallest possible part of the string that matches the regex.

  • .* is a greedy match since it matches as many characters as possible. <.*> will match all of <h1>This is HTML</h1>;
  • Adding ? after a quantifier makes it lazy. <.*?> will match only <h1>.
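
The difference is easy to see in Python with the HTML snippet used above. A minimal sketch:

```python
import re

html = "<h1>This is HTML</h1>"

# Greedy: .* runs to the last ">" it can find.
greedy = re.search(r"<.*>", html).group()

# Lazy: .*? stops at the first ">" it can find.
lazy = re.search(r"<.*?>", html).group()

# With findall(), the lazy version returns each tag separately.
all_tags = re.findall(r"<.*?>", html)

print(greedy, lazy, all_tags)
```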

Group Elements of a RegEx

You can group elements of a RegEx with parentheses () in an element called a capture group.

  • sam.*(hunt|jackson) would match sam hunt and samuel l. jackson, but not sammy davis jr.
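
In Python, the text captured by a group can be read back with group(1). Here is a minimal sketch using the names from the example above; the dictionary is only there to display the results.

```python
import re

names = ["sam hunt", "samuel l. jackson", "sammy davis jr."]
pattern = re.compile(r"sam.*(hunt|jackson)")

# Map each matching name to the part captured by the parentheses.
captured = {}
for name in names:
    match = pattern.search(name)
    if match:
        captured[name] = match.group(1)

print(captured)
```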

Other Useful Regex

(?<=[\/])\d{2,} Matches any numeric ID of two or more digits preceded by a forward slash.


^\s+|\s+$ Selects all whitespace at the beginning and at the end of a string. This can be useful when doing data manipulation.

(?<=\.)(.*?)(?=\.) Lets you extract a domain name. This will match the shortest string between two dots, such as site2 in www.site2.com.

(?<=string)(.*) Matches anything after a string, excluding that string. Useful to clean up URLs.
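
The four patterns in this section can be tried in Python as follows. This is a sketch under assumptions: the URLs and strings are invented, and example\.com stands in for the placeholder string in the last pattern.

```python
import re

# Numeric ID of two or more digits right after a forward slash.
ids = re.findall(r"(?<=[\/])\d{2,}", "/product/12345/reviews/7")

# Strip whitespace at both ends of a string.
trimmed = re.sub(r"^\s+|\s+$", "", "  seo tips \n")

# Shortest text between two dots.
domain = re.search(r"(?<=\.)(.*?)(?=\.)", "www.site2.com").group(1)

# Everything after a known string, here the hostname.
path = re.search(r"(?<=example\.com)(.*)", "https://example.com/blog/regex").group(1)

print(ids, repr(trimmed), domain, path)
```

Note that Python only accepts lookbehinds of a fixed length, so the string placed inside (?<=) cannot contain quantifiers such as * or +.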

To learn more technical SEO, I strongly suggest that you start learning Python.