
How I Found Internal Linking Opportunities With Vector Embeddings

Everett Sizemore

The author's views are entirely their own (excluding the unlikely event of hypnosis) and may not always reflect the views of Moz.


I felt overwhelmed when I first read Mike King’s article on vector embeddings. The concepts seemed complex, and implementing them for SEO was intimidating. But with Screaming Frog’s new features and Gus Pelogia’s excellent guide, I saw the potential to improve internal link building using this method.

Based on the two resources above, I decided to create a detailed, step-by-step guide to make this process more approachable, even for people unfamiliar with Python or vector embeddings.

In this article, I’ll walk you through how I used vector embeddings to identify internal linking opportunities at scale so you can confidently apply these techniques to your SEO strategy.

Infographic with 10 points on finding internal linking opportunities with vector embeddings

What you’ll need to get started

To carry out this process, I used the following:

  • Screaming Frog SEO Spider (a paid license, since the process relies on custom JavaScript and JavaScript rendering)
  • An OpenAI API key
  • Google Sheets (or Excel)
  • Google Colab

By the end, I had a comprehensive spreadsheet that included:

  • Every important URL from my site listed in column A (target URL)
  • The URL of every page that links to the target URL (excluding navigation)
  • URLs to the top 5 most closely related pages based on cosine similarity (see the short sketch after this list if that term is new to you)
  • Opportunities where one or more of those 5 URLs are not linking to the target URL
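
A quick aside on cosine similarity: it measures how closely two embedding vectors point in the same direction, from -1 (opposite) to 1 (identical), which makes it a convenient proxy for how related two pages are. Here's a minimal Python sketch with hypothetical toy vectors, just to show the math:

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means identical direction
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 4-dimensional "embeddings" (real OpenAI embeddings have 1,500+ dimensions)
page_a = np.array([0.12, 0.85, 0.33, 0.05])
page_b = np.array([0.10, 0.80, 0.40, 0.01])
print(round(cosine_similarity(page_a, page_b), 3))  # ~0.995: very similar pages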

Example Spreadsheet

Here's the example spreadsheet I use in the screenshots below; your final version will look something like this:

Example spreadsheet showing internal link opportunities with URLs, related URLs, and missing links highlighted in pink cells.

Pink cells indicate where the related page doesn’t link to the target page.

Step 1: Get an OpenAI API key

I started by heading over to OpenAI’s website, clicking the button to create a new secret key, and copying that API key to use in Screaming Frog.

Screenshot showing the OpenAI website with the option to create a new secret key for the API
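
Before wiring the key into Screaming Frog, you can sanity-check it with a quick embeddings request. This is a minimal sketch assuming the text-embedding-3-small model; the model Screaming Frog's snippet uses may differ:

import requests

API_KEY = "sk-..."  # the secret key you just created

response = requests.post(
    "https://api.openai.com/v1/embeddings",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"model": "text-embedding-3-small", "input": "Hello, embeddings!"},
)
response.raise_for_status()  # raises an error if the key or request is bad
embedding = response.json()["data"][0]["embedding"]
print(len(embedding))  # 1536 dimensions for this model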

Step 2: Set up Screaming Frog

Next, I opened Screaming Frog and followed these steps:

  • Navigated to Configuration > Custom > Custom JavaScript.
Screenshot of Screaming Frog’s configuration menu with the ‘Custom JavaScript’ option selected
  • Clicked “Add from Library” and selected “(ChatGPT) Extract embeddings from page content.” This allowed Screaming Frog to extract the data needed for the internal link audit.
Screenshot of the ‘Add from Library’ window in Screaming Frog with the ‘(ChatGPT) Extract embeddings from page content’ script selected
  • I edited the custom JavaScript code to include my OpenAI API key, pasting the key I generated in Step 1 into the appropriate section of the code.
Screenshot showing the Custom JavaScript section in Screaming Frog with the Open JavaScript Snippet Editor button indicated.
Screenshot showing where to paste the Open AI API key in the Screaming Frog custom JavaScript editor.
  • I ran a quick test on a URL from my target site.

When I saw numbers populate in the “Custom Extraction” tab, I knew the setup was working correctly.

Screenshot of Screaming Frog displaying a successful test with numbers populating in the ‘Custom Extraction’ tab

Finish setting up Screaming Frog and turn on JavaScript rendering

To make the process easier, I adjusted several Screaming Frog settings to optimize the export of vector embeddings for internal link optimization.

To turn on JavaScript rendering:

  • Navigate to Configuration > Crawl Config > Spider > Rendering > JavaScript.
  • Turn on JavaScript rendering and leave the settings as-is. Click “OK.”

Optimize crawl settings

  • Go to Configuration > Crawl Config > Crawl.
  • I used settings that focused on crawling only textual page content and internal links. Depending on your site structure, you may choose to include subdomains.
Screenshot of Screaming Frog’s optimized crawl settings, showing options to focus on textual page content and internal links

Set data to extract

  • Head over to Configuration > Crawl Config > Spider > Extraction.
  • Only extract the data you need to keep the file size manageable. Refer to my screenshot to see the settings I used.
screenshot showing Screaming Frog’s settings for data extraction under the ‘Extraction’ tab

Check advanced settings

  • Navigate to Configuration > Crawl Config > Spider > Advanced.
  • Duplicate the settings as shown in the screenshot.
Screenshot displaying the ‘Advanced’ spider settings in Screaming Frog

Exclude header and footer links

Go to Configuration > Content > Content Area.
The goal is to avoid including header and footer navigation links in the opportunities we find. You can copy and paste this list into all three boxes:

  • nav
  • navigation
  • head
  • header
  • footer
  • menu

Inspect your website's header and footer to identify tags, classes, and IDs to exclude. Right-click on the header/footer section, select “inspect,” and look for the top-most Class or ID.

Screenshot of New York State website's header with option to choose inspect tool

This should show something like:

Screenshot of the browser’s inspect tool, highlighting the header navigation area to identify the top-most class and ID for exclusion.

For example, if I found a class like “nygov-unav” or ID “ny-universal-navigation,” I’d exclude these from the crawl.

Repeat the process for other navigation areas, including the footer. If unsure, start with my default list above or consult someone on your team for the navigation divs, classes, and IDs.

Adjust robots.txt settings

  • Navigate to Configuration > Robots.
  • Follow the settings shown in my screenshot.
Screenshot showing the ‘Robots’ configuration settings in Screaming Frog

Final settings check

At this point, you can click OK and be done with it. I didn’t connect to Google Analytics, Search Console, or any SEO API for this exercise since Screaming Frog, with JavaScript enabled and the XML sitemap, should find all the important pages.

Ready, set, crawl

Click “Start” and let the crawl run for about a minute.

Go to the “Custom JavaScript” tab and filter to show only “(ChatGPT) Extract embeddings from page content.” You should see numbers appearing beside most URLs.

Screenshot of the Screaming Frog ‘Custom JavaScript’ tab displaying the extracted embeddings data as numbers

If all looks good:

  • Click “Pause” and save the file.
  • Click “Resume” again. (This is an excellent habit to develop.)

Step 3: Export vector embeddings and all inlinks

Export “All Inlinks” from Screaming Frog

I started by exporting the “All Inlinks” data from Screaming Frog. This file contains every internal link on the site and can be quite large. For example, my file, all_inlinks.csv, was around 52 MB and represented 1,428 URLs.

Screenshot showing how to export the ‘All Inlinks’ data from Screaming Frog

Export vector embeddings from Screaming Frog

Next, I exported the vector embeddings as file.csv and saved them for later use.
At this point, I saved the Screaming Frog project one more time before closing the tool. Now, it’s time to move on to the next phase.

screenshot showing data imported into Google Sheets from the ‘all_inlinks.csv’ file

Step 4: Create spreadsheets

I used Google Sheets for this tutorial, but you can follow the same process in Excel. If needed, you can adjust the formulas using ChatGPT as a guide.

Import the two files you exported from Screaming Frog

  • Import the all_inlinks.csv file into one sheet and file.csv into another.
  • You can use the same workbook, but remember that CSV files only save a single tab of data when exporting.

Clean the data

This part is essential. I had to remove errors from the vector embeddings, simplify the internal link data to the bare essentials, and rename a few columns.

Clean the Custom JS (i.e., vector embeddings) and save over file.csv
  • Sort the “(ChatGPT) Extract embeddings from page content” column from Z to A
  • Delete any row where that column is not a string of numbers (e.g., cells labeled “timeout” or “error”)
  • Verify all URLs have a 200 status code and delete any rows that don’t. Then, delete the “Status Code” and “Status” columns.
  • Rename the remaining columns to “URL” and “Embeddings” (capitalization matters).
  • Export this tab and save it as “file.csv.”
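
If you'd rather script this cleanup than click through it, here's a rough pandas sketch of the same steps. The exact column names are assumptions; check them against your own Screaming Frog export:

import pandas as pd

df = pd.read_csv("file.csv")  # the raw Custom JavaScript export from Step 3
emb_col = "(ChatGPT) Extract embeddings from page content 1"  # name may vary

# Keep only rows where the embeddings cell starts with a number, not "timeout"/"error"
df = df[df[emb_col].astype(str).str.match(r"^-?\d")]

# Keep only 200s, then drop the status columns
df = df[df["Status Code"] == 200].drop(columns=["Status Code", "Status"])

# Rename to the exact headers the Colab script expects (capitalization matters)
df = df.rename(columns={"Address": "URL", emb_col: "Embeddings"})
df[["URL", "Embeddings"]].to_csv("file.csv", index=False)  # save over file.csv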

Clean up all inlinks

This step was a bit more involved but well worth the effort.

  • Sort column A (“Type”) and delete any rows that aren’t “Hyperlink.” Once verified, delete this column. This should make “Source” the first column.
  • Sort column F (“Status Code”) and delete any rows that don’t have a 200 status. Then, delete the “Status Code” and “Status” columns.
  • Delete the following columns:
    • Size (Bytes)
    • Follow
    • Target
    • Rel
    • Path Type
    • Link Path
    • Link Origin
  • Sort by Link Position
  • Delete any rows where the link is from navigation, header, or footer. This should leave you with “Content” and possibly “Aside.”
  • Sort by the “Source” column. Delete rows containing:
    • Home page URLs
    • Blog index page URLs
    • Category/tag index pages
    • Paginated URLs
    • Sitemap URLs
    • Any other non-unique content pages (e.g., internal search results, non-canonical URLs)
  • Sort by the “Destination” column and repeat the cleaning process you did for the “Source” column.
  • Sort by the “Alt Text” column (Z to A). Copy the alt text to the adjacent “Anchor” column and then delete the “Alt Text” column. 

Remove self-linking URLs

  • Create a new column called “links to self” between “Destination” and “Anchor,” making it column C. I’ve included a screenshot for reference.
  • Copy and paste this formula into C2
    • =IF(A2=B2, "Match", "No Match")
  • Copy it down for all rows and sort column C from A-Z to bring up rows marked “Match.”
  • Delete these rows as they represent source URLs linking to themselves.
  • Finally, delete the “links to self” column altogether.
Screenshot demonstrating how to clean up data in Google Sheets, showing the column sorting process

Column C helps you get rid of Source URLs that link to themselves.

After this cleanup, my original all_inlinks.csv file went from over 50 MB with 136,873 rows to a much leaner 2 MB file with 11,338 rows and four columns.
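
For reference, here's the same all_inlinks cleanup as a pandas sketch. The column names are standard Screaming Frog export headers, but the URL patterns for index-style pages are placeholders to adapt to your own site:

import pandas as pd

df = pd.read_csv("all_inlinks.csv")

# Hyperlinks with a 200 status only
df = df[(df["Type"] == "Hyperlink") & (df["Status Code"] == 200)]

# Keep links from the content area (and optionally asides); drops nav/header/footer
df = df[df["Link Position"].isin(["Content", "Aside"])]

# Drop index-style pages from Source and Destination (patterns are examples only)
skip = r"/page/|/category/|/tag/|sitemap|\?s="
df = df[~df["Source"].str.contains(skip, na=False)]
df = df[~df["Destination"].str.contains(skip, na=False)]

# Use image alt text as the anchor where the text anchor is blank
if "Alt Text" in df.columns:
    df["Anchor"] = df["Anchor"].fillna(df["Alt Text"])

# Remove self-links (the spreadsheet's "links to self" check)
df = df[df["Source"] != df["Destination"]]
df[["Source", "Destination", "Anchor"]].to_csv("all_inlinks_clean.csv", index=False)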

Step 5: Turn the vector embeddings into helpful information (i.e., related URLs)

Access Google Colab

To process the vector embeddings, I used Google Colab. Here’s what I did: I visited the Google Colab notebook created by Gus Pelogia and clicked “File” > “Save a copy in Drive.” The notebook is essentially Python running in your browser, so you don’t need to install anything.
Saving a copy gave me my own version of Gus’s Python script, which uses Pandas, Numpy, and Scikit-learn to process the file.csv I generated with Screaming Frog and the OpenAI API.
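
You don't need to touch the code, but it helps to know roughly what it does. Conceptually, the script works something like this sketch (my approximation, not Gus's exact code):

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

df = pd.read_csv("file.csv")  # the two columns: URL, Embeddings

# Each Embeddings cell is a comma-separated string of floats; parse into a matrix
matrix = np.array([np.fromstring(str(e).strip("[]"), sep=",") for e in df["Embeddings"]])

# Pairwise cosine similarity between every page and every other page
sims = cosine_similarity(matrix)
np.fill_diagonal(sims, -1)  # a page shouldn't count as its own related page

# For each URL, keep the five most similar URLs
top5 = [df["URL"].iloc[np.argsort(row)[::-1][:5]].tolist() for row in sims]
pd.DataFrame({"URL": df["URL"], "Related Pages": top5}).to_csv("related_pages.csv", index=False)

The bracketed, quoted lists this writes into the “Related Pages” column are what the SPLIT formula and find-and-replace steps below clean up.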

Running the script

If you’ve cleaned your data properly and your CSV file is named and formatted correctly, you should be able to:

  • Press the Play button in the Colab notebook.
Screenshot of the Google Colab notebook interface, where the Python script for processing vector embeddings is run

Upload your file.csv (the one with the “URL” and “Embeddings” columns).

Screenshot showing where to choose the CSV file after starting the script in Google Colab.

Wait for it to process without leaving the browser window.

Screenshot showing the Google Colab script at 50% done.

Troubleshooting errors

There’s usually some issue that pops up, but don’t worry: clicking “Explain Error” will typically guide you to the fix.

The explanation of the error helped me figure out that I needed to open the CSV file and look for irregularities in the Embeddings column. It turned out there was a blank cell.

Other examples of what might cause errors during this phase are:

  • Extra columns
  • The wrong file name
  • The wrong column names
  • Error values in the “Embeddings” column (which you’ll find if you sort the column and review it)
Screenshot showing an example error of a blank cell in the CSV output from Screaming Frog.

In my case, that blank cell in the “Embeddings” column was the culprit. I simply deleted that row, exported the cleaned file as file.csv again, refreshed the Google Colab notebook, and retried.
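
A quick pre-flight check in Python can catch most of these problems before you upload. This sketch assumes the two-column file.csv described above:

import pandas as pd

df = pd.read_csv("file.csv")

# The Colab script expects exactly these two column headers
assert list(df.columns) == ["URL", "Embeddings"], f"Unexpected columns: {list(df.columns)}"

# Flag blank cells and "timeout"/"error" values in the Embeddings column
blanks = df["Embeddings"].isna() | (df["Embeddings"].astype(str).str.strip() == "")
errors = df["Embeddings"].astype(str).str.contains("timeout|error", case=False)
print(f"{blanks.sum()} blank cells, {errors.sum()} error cells")

# Drop the bad rows and re-save
df[~blanks & ~errors].to_csv("file.csv", index=False)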

Save and import results into Google Sheets

Once processing was complete, I downloaded the resulting file and imported it into my existing Google Sheets workbook, alongside the “all_inlinks” tab. I named this new tab “related pages.” 

It looked like this:

Screenshot of the ‘Related Pages’ tab in Google Sheets, showing target URLs and their top 5 most closely related pages

Split the “Related Pages” column
  • The target URL is in column A, and the top 5 most closely related URLs are in column B. 
  • To separate these, I used this formula in cell C2:
=SPLIT(SUBSTITUTE(B2,"[",""),", ")
Screenshot of the ‘SPLIT’ formula being used in Google Sheets to separate related URLs into individual columns
  • I copied this formula down the column, which split the URLs into columns C through G. Next, I renamed columns C to G as “Related URL 1” through “Related URL 5.”

Copy and paste as values

Copy columns C to G and paste them back as “values only.”

Screenshot displaying the formula used in Google Sheets to pull the inlink data from the ‘all_inlinks’ tab

Do two find-and-replace operations:

  • Remove all single quotes (')
  • Remove all right brackets (])

Screenshot showing a Find and Replace operation in Google Sheets.

This image shows a left bracket, but you’ll be looking for right brackets ].

I deleted the original “Related URLs” column (column B), leaving me with six columns: URL and Related URLs 1-5.
Here’s what it looks like:

Spreadsheet screenshot showing each related URL in its own column.

Now we’re ready to put this information to practical use.

Step 6: Pull inlink data from the “all_inlinks” tab

Setting up columns and pulling inlinks data
Insert a new column between “URL” and “Related URL 1” and name it “Links to Target URL.” It should be in column B. Next, use this formula in cell B2 to pull inlink data:

=TEXTJOIN(", ", TRUE, FILTER(all_inlinks!A:A, all_inlinks!B:B = A2))

This formula gathers all URLs from the “all_inlinks” tab that link to the target URL in column A. Here’s what the result looks like:

Screenshot shows target URLs in column A with their respective internal links listed in column B
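
Incidentally, if your sheet slows to a crawl on thousands of rows, the same lookup is a short groupby in pandas. A sketch, assuming the cleaned filenames from my earlier examples:

import pandas as pd

inlinks = pd.read_csv("all_inlinks_clean.csv")  # Source, Destination, Anchor
related = pd.read_csv("related_pages.csv")      # URL plus the related-URL columns

# For each destination, join every source that links to it (same as TEXTJOIN + FILTER)
links_to = inlinks.groupby("Destination")["Source"].agg(", ".join)
related["Links to Target URL"] = related["URL"].map(links_to)
related.to_csv("related_pages_with_links.csv", index=False)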

Check results

#N/A means no links were found for that specific URL. Tip: Turn off text wrapping in column B, as the cells can get quite long if multiple links exist.

Copy column B and paste as values

Copy column B and paste it as “values only.” This step is crucial for ensuring the data doesn’t change as you make further edits.

Adding additional columns

Insert a new column after every “Related URL” column and call these “Related URL # Links to A?” as shown in the screenshot below.

Screenshot showing spreadsheet with each related URL separated into labelled columns.

Here’s how the columns should look:

  1. URL (the target URL)
  2. Links to Target URL (a comma-separated list of all URLs that link to the URL in column A)
  3. Related URL 1
  4. URL 1 links to A?
  5. Related URL 2
  6. URL 2 links to A?
  7. Related URL 3
  8. URL 3 links to A?
  9. Related URL 4
  10. URL 4 links to A?
  11. Related URL 5
  12. URL 5 links to A?

Step 7: Find unlinked related pages

Identify missing links
It’s time to check if the related pages are linking to my target page.
I used this formula in cell D2 and copied it down:

=IF(ISNUMBER(SEARCH(C2, B2)), "Exists", "Not Found")

It should look like this, with either “Not Found” or “Exists” in each cell of column D (URL 1 links to A?):
Screenshot showing how to identify unlinked related pages using a formula in Google Sheets

Do the same thing for each subsequent “URL # links to A?” column.
The reference to column B “Links to Target URL” isn’t going to change, but the reference to the related URL column will. For example:
In F2 (“URL 2 links to A?”) you will be looking for the E2 URL within the list of URLs in B2:

=IF(ISNUMBER(SEARCH(E2, B2)), "Exists", "Not Found")

Copy this formula down column F. In H2 you will be looking for the G2 URL within the list of URLs in B2:

=IF(ISNUMBER(SEARCH(G2, B2)), "Exists", "Not Found")

Copy this formula down column H. Repeat this process for each of the “URL # links to A?” columns.
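
One caveat worth knowing: SEARCH does substring matching, so a related URL like /blog can register as “Exists” inside a link to /blog-archive. If you spot suspicious matches, here's a hedged pandas version that splits column B on the comma separator and compares whole URLs (the filename carries over from my earlier sketches, and it assumes the spreadsheet layout above):

import pandas as pd

df = pd.read_csv("related_pages_with_links.csv")  # hypothetical file from earlier

def links_exactly(row, related_col):
    # "Exists" only if the related URL appears as a whole entry in column B
    linkers = str(row.get("Links to Target URL", "")).split(", ")
    return "Exists" if row[related_col] in linkers else "Not Found"

for i in range(1, 6):
    df[f"URL {i} links to A?"] = df.apply(links_exactly, axis=1, related_col=f"Related URL {i}")

df.to_csv("link_opportunities.csv", index=False)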

Highlight missing links for easy review

  • I selected columns D:L and went to Format > Conditional Formatting in Google Sheets (or Excel).
  • I set a rule to format cells containing “Not Found” in pink for easy identification.

This made it easy to spot where the internal links were missing.

Validate the data

I double-checked a few entries manually to ensure everything was accurate. Now, I have a complete list that shows each target URL in column A, the top 5 related URLs, and whether those URLs are linking back to the target URL.
My final spreadsheet looked like this, with “Exists” or “Not Found” indicating whether each related URL was linking back to the target URL:

Step 8: Build internal links

Now comes the final and most actionable part — building those internal links.
  • Identify the opportunities: I used the pink cells as indicators of where internal links were missing. Each pink cell represented a related page that wasn’t linking to the target URL, even though it should.
  • Add the links: I went to each related page (from the pink cells) and edited the content to include a relevant internal link to the target URL, using descriptive anchor text that aligns with the content on the target page.
  • Prioritize: I started with the highest-priority pages first, such as those with the most traffic.

Concluding thoughts: Create a cohesive internal linking structure with vector embeddings

This step-by-step guide transformed my internal linking process into a data-driven strategy built on vector embeddings. Take the time to build, analyze, and refine your internal link structure: the effort will pay off in improved rankings, a better user experience, and, ultimately, more organic traffic, because your most valuable pages end up connected in a way that both search engines and users understand.
After running this process on a client’s site, I was surprised. I thought we’d done a great job at internal linking, but there were hundreds of opportunities we’d missed. And I don’t just mean pages where a target keyword happens to appear. I mean opportunities to link pages that search engines would see as highly relevant to each other. Along the way, I was also able to disambiguate closely related concepts and fix a few unnoticed keyword cannibalization issues.

Everett Sizemore
Everett is an SEO and content strategy consultant with over 16 years of experience optimizing websites for search. For the last 6 years, he has focused on helping startups build the teams and processes that form the foundation of highly effective search engine optimized content marketing machines.
