![Chima Mmeje](https://moz.rankious.com/_moz/images/user/photo/10.jpg?w=160&h=160&auto=compress%2Cformat&fit=crop&dm=1695126153&s=9da8592a332927996bd1e563649d869e)
Ziff Davis's Study Reveals That LLMs Favor High DA Websites
For years, SEOs have relied on Domain Authority (DA) as a benchmark for assessing a website’s authority. While Moz has consistently stated that DA is not a Google ranking factor, the metric has remained a key point of discussion in the industry.
New research from Ziff Davis sheds more light on how Domain Authority correlates with LLM content preferences, suggesting that the future might not be so different from the present.
Why did Ziff Davis conduct this study?
Ziff Davis, a major publisher with brands like PCMag, Mashable, IGN, and Moz, faces the same challenges as other media companies. They suspect that Large Language Models (LLMs) are training on their content without licensing agreements, but it’s difficult to determine which content is being favored.
The study set out to address this issue. Researchers analyzed datasets like Common Crawl, C4, OpenWebText, and OpenWebText2 to understand how LLMs are trained, what types of content they prefer, and how these choices influence AI behavior and output.
You can read the full study report here.
Key takeaways from the Ziff Davis LLM Study
If you want to skip the rest of the article, I’ve summarized the key findings below:
- LLMs place a higher weighting on heavily curated, high-quality datasets than on raw web data
- Authoritative publishers dominate these curated datasets
- OpenWebText and OpenWebText2 feature a much higher proportion of high-DA content compared to uncurated datasets
- LLM developers prioritize commercial publisher content, reflecting a preference for quality and credibility
Which datasets were analyzed?
The Ziff Davis study examined four key datasets that are crucial in training large language models:
- Common Crawl: An uncurated repository of web text scraped from the entire internet with minimal quality control.
- C4: A cleaned version of Common Crawl that focuses on English pages and excludes duplicates and low-quality text. It offers a more refined dataset without strict curation.
- OpenWebText: A proxy for OpenAI’s WebText, emphasizing high-quality content linked from Reddit with a minimum upvote threshold.
- OpenWebText2: A follow-up to OpenWebText featuring an expanded and updated dataset while maintaining the same quality-focused approach.
It’s important to note that these datasets aren’t created equal. More curated datasets, like OpenWebText and OpenWebText2, contain a higher proportion of authoritative content, while unfiltered sources like Common Crawl pull from a much wider but lower-quality pool of web pages. These differences between datasets affect how LLMs learn and generate responses.
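To make the curation step concrete, here is a minimal sketch of the kind of upvote-threshold filter described for OpenWebText. The `Submission` type, the example URLs, and the default threshold of 3 are assumptions for illustration; the study does not spell out the exact filtering code.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Submission:
    """A hypothetical Reddit submission: the linked URL and its score."""
    url: str
    score: int


def filter_by_score(submissions: Iterable[Submission], min_score: int = 3) -> List[str]:
    """Keep each URL once, and only if a submission met the score threshold.

    The default threshold is an assumption for this sketch.
    """
    seen = set()
    kept = []
    for sub in submissions:
        if sub.score >= min_score and sub.url not in seen:
            seen.add(sub.url)
            kept.append(sub.url)
    return kept


# Made-up example data: one low-scoring link and one duplicate.
example = [
    Submission("https://a.example/1", 5),
    Submission("https://b.example/2", 1),
    Submission("https://a.example/1", 7),
    Submission("https://c.example/3", 3),
]
curated_urls = filter_by_score(example)
```

A filter like this uses community votes as a cheap proxy for quality, which is one reason curated datasets end up with a narrower, more authoritative pool of sources than a raw crawl.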
![](https://moz.rankious.com/_moz/images/assets/features/Ziff-Davis-study-reveals-that-LLMs-favor-high-DA-websites/1-datasets_analyzed_in_the_ziff_davis_study__1.png?w=1920&h=1080&auto=compress%2Cformat&fit=crop&dm=1738596796&s=2b6d2ef2ad5a98363bc862b8c2e52c19)
How were publishers chosen for the study?
The study used Comscore’s web traffic data to determine which publishers to analyze. Researchers focused on the top 15 portfolio publishers in the Media category as of August 2020, representing the most widely visited news and media organizations.
![](https://moz.rankious.com/_moz/images/assets/features/Ziff-Davis-study-reveals-that-LLMs-favor-high-DA-websites/2-19-featured-publishers.png?w=1920&h=800&auto=compress%2Cformat&fit=crop&dm=1738596806&s=0e3d606e5cb6e0f966bc9a48f66d9378)
The selection process excluded single-property publishers, non-media tech firms, and user-generated content platforms in favor of more established commercial publishers.
Which metric was used?
The study used Moz’s Domain Authority (DA) to measure the influence and quality of web content in LLM training datasets. While DA is not a search ranking factor, it’s a recognized metric that predicts a website’s likelihood to rank in SERPs based on factors like backlinks, domain history, and site size.
To analyze LLM content preferences, the study compiled Moz DA scores for all URLs found in Common Crawl, OpenWebText, OpenWebText2, and C4. The findings revealed a strong correlation between dataset curation and DA distribution: uncurated datasets contained mostly low-DA sites, while curated datasets were heavily weighted toward high-DA publishers.
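As a rough illustration of this kind of comparison, the sketch below computes the share of a dataset’s URLs that come from high-DA domains. The domain names, DA values, and the cutoff of 60 are all made up for the example; they are not figures from the study.

```python
from typing import Dict, Iterable
from urllib.parse import urlparse


def high_da_share(urls: Iterable[str], da_by_domain: Dict[str, int], threshold: int = 60) -> float:
    """Return the fraction of scored URLs whose domain has DA >= threshold.

    URLs whose domain has no known DA score are skipped.
    """
    scores = []
    for url in urls:
        host = urlparse(url).netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        if host in da_by_domain:
            scores.append(da_by_domain[host])
    if not scores:
        return 0.0
    return sum(1 for da in scores if da >= threshold) / len(scores)


# Hypothetical DA scores and two tiny stand-in "datasets".
da_by_domain = {"bignews.example": 92, "midsite.example": 55, "smallblog.example": 18}
raw_urls = [
    "https://www.bignews.example/a",
    "https://smallblog.example/b",
    "https://smallblog.example/c",
    "https://midsite.example/d",
]
curated_urls = [
    "https://bignews.example/a",
    "https://bignews.example/e",
    "https://www.bignews.example/f",
    "https://midsite.example/d",
]

raw_share = high_da_share(raw_urls, da_by_domain)
curated_share = high_da_share(curated_urls, da_by_domain)
```

Comparing `raw_share` with `curated_share` across real datasets is the sort of measurement that would show whether curation shifts the mix toward high-DA publishers.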
What did we learn from the Ziff Davis Study?
Most datasets are curated to improve the quality of AI output
The Ziff Davis study makes it clear that while LLM developers may scrape the web indiscriminately, they weight curated datasets more heavily to prioritize quality.
Curation shapes how LLMs process and generate content. Raw datasets like Common Crawl pull from the open web with a mix of high and low-quality sources. In contrast, curated datasets like OpenWebText and OpenWebText2 filter out low-quality content to create a higher concentration of reliable information.
This intentional, selective process improves model accuracy, response quality, and content relevance. It also explains why high-authority websites dominate AI outputs.
LLMs prefer high-quality content from commercial publishers with high Domain Authority
LLMs don’t treat all web content equally. The Ziff Davis study confirms that high-DA commercial publishers dominate curated datasets.
![](https://moz.rankious.com/_moz/images/assets/features/Ziff-Davis-study-reveals-that-LLMs-favor-high-DA-websites/3-Domain-Authority-Data-for-Featured-Publishers-Descending-DA-Value.png?w=1920&h=2200&auto=compress%2Cformat&fit=crop&dm=1738596813&s=ceea0ee11b867287e28d90bf9c0774fe)
We used the Moz API and Google Colab to run a bulk DA analysis for all URLs featured in the study.
You can view the custom script here.
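For readers who want a feel for how a bulk lookup can be organized, here is a minimal sketch that is not the script linked above. It deduplicates domains, splits them into batches, and passes each batch to a `lookup` callable. That callable is a hypothetical stand-in for whatever API request you use; its signature and the batch size of 50 are assumptions.

```python
from typing import Callable, Dict, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def bulk_domain_authority(
    domains: Sequence[str],
    lookup: Callable[[List[str]], Dict[str, int]],
    batch_size: int = 50,  # assumed batch size, adjust to your API's limits
) -> Dict[str, int]:
    """Look up a score for each unique domain, one batch at a time."""
    unique = list(dict.fromkeys(domains))  # dedupe while preserving order
    scores: Dict[str, int] = {}
    for batch in chunked(unique, batch_size):
        scores.update(lookup(list(batch)))
    return scores


def fake_lookup(batch: List[str]) -> Dict[str, int]:
    """Hypothetical stand-in for a real API call, returning made-up scores."""
    made_up = {"bignews.example": 92, "smallblog.example": 18}
    return {domain: made_up.get(domain, 0) for domain in batch}


scores = bulk_domain_authority(
    ["bignews.example", "smallblog.example", "bignews.example"], fake_lookup
)
```

Keeping the network call behind a single callable makes it easy to swap in a real client later and to test the batching logic without credentials.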
84.2% of analyzed publishers had an average DA of 60 or higher, showing a clear preference toward established media brands. As datasets become more curated, the proportion of high-DA content increases, with publishers like The New York Times and News Corp appearing more frequently.
An emerging trend of AI companies partnering with major publishers
Nothing is free in life, and AI companies know it. The backlash from publishers over copyrighted content has forced AI companies to broker exclusive licensing deals with a select group of publishers like News Corp and Axel Springer. Many of these publishers have seemingly used robots.txt rules as leverage in these negotiations.
Does this mean that publishers with licensing agreements feature more?
No. While publishers with AI partnerships appear more frequently in OpenWebText2 than in the WebText top 1000, the correlation isn’t absolute.
![](https://moz.rankious.com/_moz/images/assets/features/Ziff-Davis-study-reveals-that-LLMs-favor-high-DA-websites/5-NEW-Dataset-Representation-by-Publisher.png?w=1920&h=2200&auto=compress%2Cformat&fit=crop&dm=1738596828&s=1431fee031959f997a350d93dc60a35d)
Three of the top five publishers in OpenWebText2 (NYT, Advance, and Gannett) do not have licensing agreements with OpenAI. Also, the WebText top 1000 contains a higher percentage of these publishers than OpenWebText2 (13.47% vs. 12.04%). Suffice it to say that AI partnerships do not guarantee higher dataset representation. It’s also worth noting that The New York Times blocks almost all AI crawlers in its robots.txt, so its presence in this dataset indicates that the makers of these datasets wanted to use its content, not that they were able to do so.
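If you want to check how a robots.txt file treats a given AI crawler, Python’s standard library can parse the rules for you. The sketch below uses a made-up robots.txt (not any publisher’s actual file) that blocks the `GPTBot` user agent while allowing everyone else; the URL is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


article_url = "https://news.example/some-article"
gptbot_allowed = is_allowed(SAMPLE_ROBOTS_TXT, "GPTBot", article_url)
googlebot_allowed = is_allowed(SAMPLE_ROBOTS_TXT, "Googlebot", article_url)
```

Keep in mind that robots.txt is only a request to crawlers. A block recorded today also says nothing about whether a page was collected before the rule was added.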
What does the Ziff Davis study mean for SEO?
Content is still king
Every major publisher thrives on high-quality content—from breaking news and investigative journalism to data-led reports and expert analysis. Looking at the top publishers featured in the Ziff Davis study, we see household names like:
- The New York Times (nytimes.com)
- BuzzFeed, Inc. (buzzfeed.com, huffpost.com)
- Condé Nast (wired.com, newyorker.com, vogue.com)
- News Corp (wsj.com, thesun.co.uk, nypost.com)
These publishers dominate search, earn backlinks naturally, and are frequently used in LLM training datasets, reinforcing their credibility.
Despite volatile SERPs and the rise of AI-generated answers, content remains the foundation of a website’s authority.
Moz's DA metric is directionally accurate for gauging a website's authority
![](https://moz.rankious.com/_moz/images/assets/features/Ziff-Davis-study-reveals-that-LLMs-favor-high-DA-websites/6-NEW-Average-DA-Quote-Image.png?w=1920&h=1000&auto=compress%2Cformat&fit=crop&dm=1738596851&s=b6027adec49e78bb67254d6f9bca182d)
While Moz’s Domain Authority (DA) isn’t a ranking factor, the Ziff Davis study confirms it’s a strong directional indicator of site authority, which aligns with the high-quality sources favored in LLM training.
In a Moz roundup on the Google Leaks, Rand Fishkin pointed out, “Google has been misleading marketers for years when saying they don’t use any form of website authority.” Supporting this statement, a study by Tom Capper on Google's Helpful Content Update (HCU) found that websites with higher DA scores were more likely to be HCU winners.
While building authority is a combination of different elements, the central tenets remain the same:
- Helpful content from thought leaders that demonstrates a personal experience with the problem
- Topically relevant backlinks from authoritative websites
- Strong UX and engagement signals that show content is helpful to users
- Positive off-page signals that reinforce brand trust and authority
AI models face the same challenges with identifying authoritative sources as Google and may well solve them in the same way.
Low DA websites are unlikely to win in SERPs.
Building backlinks from authoritative sources strengthens site authority
If LLMs favor high-authority websites, then backlinks from these sites carry weight—not just in Google search rankings but potentially in generative AI visibility.
But the reality is that link building is getting harder. Spammy outreach and low-value links don’t move the needle. Instead, focus on creating content that naturally attracts media attention and citations.
High-value assets include:
- Industry reports with exclusive research and data
- Original surveys and case studies that provide unique insights
- Thought leadership content from recognized experts in your niche
- Interactive tools that offer a ton of value for users
While not mentioned in the study, most of these publishers have a higher Brand Authority than most websites
![](https://moz.rankious.com/_moz/images/assets/features/Ziff-Davis-study-reveals-that-LLMs-favor-high-DA-websites/7-NEW-Average-BA-Quote-Image.png?w=1920&h=1000&auto=compress%2Cformat&fit=crop&dm=1738596859&s=87a56caa784281861e02404240622e7d)
Brand Authority is shaping up to be just as important as Domain Authority. The numbers don’t lie: 57.9% of the publishers in the Ziff Davis study had a Brand Authority score of 40 or higher. Moz’s Jonathan Berthold used the Moz API and a custom Google Colab script to run a bulk URL analysis of Brand Authority scores.
The numbers align with Tom Capper’s study findings, which showed that sites with strong brand signals were more likely to benefit from Google’s algorithm changes, while weaker brands struggled to compete.
![](https://moz.rankious.com/_moz/images/assets/features/Ziff-Davis-study-reveals-that-LLMs-favor-high-DA-websites/8-Brand-Authority-Data-for-Featured-Publishers-Desending-BA-Value.png?w=1920&h=2200&auto=compress%2Cformat&fit=crop&dm=1738596867&s=f6d8672873b383980f92fbf62f40ab15)
According to Amanda Milligan, a few tactics that work for Brand Authority include:
- Creating newsworthy reports and studies
- Leveraging in-house experts to create content
- Highlighting proof of expertise on your website and in your content
- Co-marketing with authoritative brands in your vertical
- Giving value worth its weight in gold
Conclusion: High-quality content and Domain Authority are crucial elements to optimize for generative search
I’m not sure anyone is surprised about the outcome of the Ziff Davis study, as it confirms what we’ve long suspected. However, it’s important to note that these websites and publishers didn’t become giants overnight. They spent years investing in high-quality content, earning backlinks, and building credible brands. To optimize for generative AI search, SEOs should follow the same playbook: publish unique content that naturally attracts relevant backlinks and establishes topical authority.