How AI "Data Brokers" Take 100% of the Content — And Why Publishers Get Nothing

Digiday: Dozens of new third-party scrapers emerging as the new middlemen of content

A $1 billion market has formed — yet content creators receive zero

Inside the four-step pipeline: scrape → process → sell → reuse

"Not a tax — a hostile takeover"… "Napster is back, but iTunes is nowhere in sight"s

A new kind of middleman has emerged in the AI era. The familiar "ad tech tax" of the digital advertising age took a slice of revenue. The new "AI data broker" takes the entire pie — 100% of the content, with nothing paid in return. Worse: in some cases the same content is recycled into competing products that push the original publishers out of the market entirely. How exactly does this mechanism work? Drawing on Digiday's May 4 report on the alarm spreading through the U.S. publishing industry, this explainer walks through the pipeline step by step.

◆ What Is an "AI Data Broker"? — The Industry That Feeds AI

The term "data broker" is not new. Traditionally it referred to firms that collected personal and consumer data and sold it to marketers and advertisers. But the "AI data brokers" Digiday's reporting describes are a different breed. These firms automatically scrape news articles, blog posts, images and videos from across the open web, process them into structured training datasets, and sell them — or expose them via APIs — to AI companies such as OpenAI, Anthropic and Google.

Why does such an intermediary exist? Training a large language model (LLM) like ChatGPT requires billions to trillions of words of text. AI agents that browse, book and buy on behalf of users must continuously read fresh web content to function. Doing all of this collection in-house is technically and legally costly for AI firms. Outsourcing it to a specialist — that is the role AI data brokers fill.

[Summary] According to U.S. trade publication Digiday, somewhere between 21 and 40 AI data brokers (scrapers) now collect, process and sell publisher content without authorization, forming a $1 billion "scraper economy." The revenue returned to content creators is zero, prompting industry criticism that this is "not a tax but a hostile takeover built on publishers' own IP."

An anonymous publishing executive cited by Digiday compared them to the demand-side platforms (DSPs) of the ad tech world. Where a DSP automates ad-inventory buying on behalf of advertisers, an AI data broker automates content acquisition on behalf of AI companies. The executive counted 30 to 50 such DSP-like startups in the content space, "all of them effectively charging a 100% fee."

[Glossary ① Data Broker vs. Scraper] Strictly speaking, a "scraper" refers to the technology or firm that automatically collects content, while a "data broker" sells the collected and processed data downstream. In today's AI market, however, the two functions are usually vertically integrated within the same company, and the terms are used interchangeably. Digiday's reporting also treats "third-party web scraper" and "AI data broker" as effectively synonymous.

◆ How It Works — A Four-Step Pipeline: Scrape → Process → Sell → Reuse

The mechanism by which AI data brokers end up taking 100% of a publisher's content unfolds in four distinct steps. Tracing the flow makes clear why none of the resulting value flows back to the people who created the content. (A minimal code sketch follows the four steps.)

[STEP 1] Crawling / Scraping — Automated programs (crawlers, bots) operated by the data broker continuously visit websites and download text, images and video wholesale. Publishers' consent is not part of the process. Even when robots.txt explicitly states "no-crawl," a growing number of operators ignore or actively circumvent the directive.

[STEP 2] Processing / Structuring — The raw content is cleaned and reformatted for AI training: ads, navigation and other peripheral elements are stripped out; body text, headlines, bylines and publication dates are normalized; images are captioned and videos transcribed. At the end of this step, the original article has been transformed into a structured training dataset.

[STEP 3] Selling / API — The dataset is then sold to AI companies, or exposed through real-time APIs that supply content on demand. Customers range from large AI labs (OpenAI, Anthropic, Google) to AI search and agent startups. According to Mordor Intelligence, this market has already reached $1 billion in size.

[STEP 4] Reuse / Competing Products — AI firms train chatbots, search engines and summarization services on the data. Users now get article-derived answers directly inside the AI interface — without ever visiting the publisher's site. The original creators lose traffic, advertising revenue and subscriber acquisition.
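
To make the first three steps concrete, here is a minimal sketch in Python. It is an illustration only: the URL, user-agent string, selectors and output path are hypothetical placeholders, and no real broker's code is public. Note that even this toy version consults robots.txt first; the conduct described in this article skips or evades exactly that check.

    # Minimal sketch of steps 1-3, for illustration only. The URL,
    # user-agent and output path are hypothetical placeholders.
    import json
    import urllib.robotparser

    import requests
    from bs4 import BeautifulSoup

    ARTICLE_URL = "https://example.com/news/some-article"

    # STEP 1 -- crawl. A compliant bot checks robots.txt before fetching.
    rp = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")
    rp.read()
    if not rp.can_fetch("ExampleBot", ARTICLE_URL):
        raise SystemExit("robots.txt says no-crawl; no must mean no")
    html = requests.get(ARTICLE_URL,
                        headers={"User-Agent": "ExampleBot"}, timeout=10).text

    # STEP 2 -- process. Strip peripheral elements, keep a structured core.
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "aside", "footer"]):
        tag.decompose()
    record = {
        "url": ARTICLE_URL,
        "headline": soup.h1.get_text(strip=True) if soup.h1 else None,
        "body": soup.get_text(separator=" ", strip=True),
    }

    # STEP 3 -- sell. One JSON line per article is a common shape for
    # training datasets sold, or served via API, to AI companies.
    with open("dataset.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")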

In this four-step pipeline, the original creator appears in only one place — Step 1, and only as a target of extraction. The processing margin in Step 2 is captured by the data broker. The transaction revenue in Step 3 is captured by the data broker. The downstream value in Step 4 is captured by the AI company. The publisher's settlement share across the entire chain is, in practice, zero. That is why Digiday's framing — that AI data brokers "take 100% of the content and pay 0%" — describes the structure literally, not metaphorically.

◆ "Not a Tax — A Hostile Takeover": Candr Media's Diagnosis

Chris Dicker, CEO of Candr Media, frames the issue more bluntly. He told Digiday that with the old ad tech middlemen, publishers at least got something back. With scrapers, he said, the value extraction is total.

Dicker added that the scrapers take 100% of the content, pay 0%, and in some cases use that content to build competing products that erase the publisher altogether. He characterizes the model not as a tax but as "a hostile takeover funded by our own IP."

The phrase is deliberate. In M&A, a hostile takeover describes a buyer using capital to seize control of a target without management consent. By Dicker's analogy, the AI industry is doing something structurally similar — using publishers' own assets, their IP and content, without permission, to build the financial and technical foundation that may one day displace those publishers entirely.

The dynamic is easy to picture. An AI chatbot trains on tens of thousands of newspaper articles → a user asks "what happened with U.S. interest rates this week?" → the chatbot synthesizes and summarizes the newspapers' reporting → the user has no reason to visit the newspaper's website → the newspaper loses traffic, advertising and subscriber acquisition. A product built on the publisher's own content is structurally cannibalizing the publisher's revenue base.

◆ "No Means No" — The Collapse of an Internet Gentleman's Agreement

What Dicker considers more troubling than the extraction itself is the bad-faith conduct layered on top. He told Digiday that some firms deploy stealth, undeclared crawlers to bypass websites' no-crawl directives, while others publicly announce they will not comply at all.

Website operators have long used a file called robots.txt to tell crawlers "this site is off-limits." It is, in effect, a gentleman's agreement that has held up since 1994 — the dawn of the modern web. Search engines and most other operators have respected it, and that mutual trust became one of the foundational social contracts of the open internet.

Some AI data brokers are now breaking that contract in two distinct ways. The first is the stealth crawler: a bot disguised as an ordinary user, accessing content while hiding its identity. The second is open defiance: declaring publicly that the firm will not comply with robots.txt, and proceeding to scrape regardless. The first amounts to deception; the second, provocation.

Dicker, who also serves on the board of the Independent Media Alliance, called the conduct "active deception and abuse of scale designed to defeat the few defensive tools publishers have left." His message is uncompromising: when a website signals "no-crawl," no must mean no. The phrasing deliberately echoes the "no means no" framing from sexual-consent debates — borrowed here to underline the ethical wrong of scraping without permission.

[Glossary ② robots.txt and no-crawl] robots.txt is a text file placed at a website's root that tells web crawlers which paths are allowed or disallowed for collection. It carries little legal weight, but because most operators — including search engines — have honored it, it functions as a de facto norm of the open web. Recent additions allow publishers to specifically block AI training bots (e.g., User-agent: GPTBot, ClaudeBot).
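
For instance, a publisher that wants to remain visible to search engines while refusing AI-training bots can publish a robots.txt along these lines (GPTBot and ClaudeBot are the declared crawlers of OpenAI and Anthropic; the directives shown are illustrative, and the file itself has no enforcement power):

    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: *
    Allow: /

The two forms of bad faith described above map directly onto this file: a stealth crawler simply identifies itself as something other than GPTBot or ClaudeBot, while an openly defiant one reads the Disallow lines and fetches anyway.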

◆ A $1 Billion Scraper Economy — 21 to 40 Vendors Identified

A recent report by media analyst Matthew Scott Goldstein offers the first credible map of this market. Citing Mordor Intelligence data, the report values the scraper economy at $1 billion. It is also, Goldstein points out, an industry from which publishers earn essentially nothing.

Goldstein's report identifies 21 vendors operating in the space, including Firecrawl, Exa, Tavily, Brave, You.com, Perplexity Sonar and Bright Data. TollBit, which maintains a separate index of third-party scrapers, lists roughly 40. Either count points to a category that is rapidly diversifying.

Some of these names will be familiar to consumers. Perplexity is widely known as an AI answer engine. Bright Data is one of the largest global players in data-collection infrastructure. What looks to the user like a search or answer service is, on the back end, also a large-scale scraping operation.

◆ "Agentic Infrastructure" Rebrand — Legitimizing Extraction in Plain Sight

What worries Goldstein more than the scale is the industry's repositioning. In a blog post on April 29 and a follow-up LinkedIn post, he argued that third-party web scrapers are now rebranding as "agent infrastructure" so they can continue extracting value in plain sight. He named Parallel Web Systems as a representative case.

"Agentic infrastructure" literally means "the infrastructure required for AI agents to operate." The argument runs: as AI agents proliferate and act on behalf of users, they require a data supply chain — and that is what these firms provide. If that framing takes hold, what publishers see as unauthorized harvesting gets redefined as "legitimate infrastructure for the AI era." It is the equivalent of relabeling pirates as "explorers" or smuggling as "international trade."

Goldstein wrote that the technology has become more sophisticated and the enterprise sales pitch more polished, but the underlying economics have not changed. AI agents, he argued, will consume the web at a scale that dwarfs human behavior.

Until a real marketplace layer exists to price and govern that consumption, he concluded, the category is fundamentally a race to extract value the fastest. The question of who gets paid remains unresolved. In plainer terms: until something equivalent to iTunes is built for the content market, the extractive structure continues — no matter how the activity is rebranded.

[Glossary ③ AI Agent] An AI agent is an AI system that autonomously browses the web and performs tasks on behalf of a user. Example: instructed to "book a flight to Tokyo for next week's business trip," the agent visits airline websites, compares options and completes the booking. A single user request expands into the consumption of dozens or hundreds of pages. That is what Goldstein meant when he said "agents will consume the web at a scale that dwarfs human behavior."
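
The arithmetic behind that claim is simple. A rough sketch, with made-up but plausible numbers:

    # Hypothetical fan-out of a single agent request; all numbers are
    # illustrative assumptions, not measurements.
    airline_sites = 10      # sites the agent visits to compare fares
    dates_checked = 3       # departure dates within "next week"
    pages_per_fare = 2      # e.g., a fare page plus a seat-map page

    page_reads = airline_sites * dates_checked * pages_per_fare
    print(page_reads)       # 60 pages consumed for one human request

Sixty page views that, in the human-browsing era, might have been one or two visits, and none of which register as audience for the sites being read.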

Publishers have argued consistently that they are the hosts of the content ecosystem and are now being consumed by it. Without their content, they say, the next generation of LLMs simply could not exist. AI requires human-generated content to function — and yet the people generating that content are excluded from the value chain it powers.

Yet Digiday reports that recent licensing deals look less like recognition of that value and more like defensive moves by platforms seeking to limit legal exposure. For publishers, the argument increasingly feels like shouting into the void.

Major AI firms — OpenAI, Anthropic, Google and others — have signed licensing deals with select major publishers in recent months. But many in the industry read these deals as risk-management responses to ongoing copyright litigation, including The New York Times v. OpenAI, rather than as voluntary recognition of underlying value. They are closer to legal hedging than to a settled new business model.

The result is a polarization: a handful of major publishers receive compensation through bilateral deals, while the bulk of small and mid-sized publishers — without the resources to litigate or the standing to negotiate — remain only as targets of unauthorized scraping.

◆ "Napster Is Back. iTunes Is Nowhere in Sight."

The analogy publishers reach for most often is Napster. The peer-to-peer music-sharing service, launched in 1999, stripped value from the music industry at scale. The market eventually rebalanced when iTunes (2003) and Spotify (2008) established legitimate digital distribution layers.

The same anonymous executive told Digiday the world now has more and more Napsters, but neither an iTunes nor a Spotify yet. Publishers, he said, are running a race only with the pirates — and the pirates, as always, are faster.

The decisive difference between the music market and today's content-and-LLM situation, Goldstein argues, is the absence of a true marketplace layer that prices and governs consumption. Even at the height of Napster, music labels could begin recovering value only after iTunes provided a legitimate distribution path. The content market has not yet reached that stage.

There are early candidates. In the U.S., firms such as TollBit and Cloudflare are piloting "pay-per-crawl" infrastructure that automatically meters and charges AI bots when they consume content. Whether they evolve into the iTunes of the content market is one of the most consequential open questions in the industry.
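
How such metering could work can be sketched in a few lines. The following is a simplified illustration of the pay-per-crawl idea, not TollBit's or Cloudflare's actual product; the payment-token header is a hypothetical stand-in for real bot verification and settlement (Cloudflare's announced design reportedly centers on the HTTP 402 "Payment Required" status code):

    # Simplified pay-per-crawl gate, illustration only. Real systems
    # handle bot verification, pricing and settlement; the
    # "Crawler-Payment-Token" header is a hypothetical stand-in.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    AI_BOTS = {"GPTBot", "ClaudeBot"}     # declared AI user agents
    PAID_TOKENS = {"demo-token-123"}      # placeholder payment tokens

    class PayPerCrawlGate(BaseHTTPRequestHandler):
        def do_GET(self):
            ua = self.headers.get("User-Agent", "")
            token = self.headers.get("Crawler-Payment-Token", "")
            if any(bot in ua for bot in AI_BOTS) and token not in PAID_TOKENS:
                # Meter the bot: no payment, no content.
                self.send_response(402)   # 402 Payment Required
                self.end_headers()
                self.wfile.write(b"Payment required to crawl this content.\n")
                return
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(b"<html>article body...</html>\n")

    if __name__ == "__main__":
        HTTPServer(("localhost", 8080), PayPerCrawlGate).serve_forever()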

◆ The Syndication Trap — Blocked at the Source, Scraped at the Portal

Publishers that syndicate their content to third-party sites face an additional structural problem. Even when they block AI crawlers from their own domains, the same articles often resurface on portals and customer sites that carry their feeds — and become accessible there.

One publishing executive described this dynamic to Digiday as a game of whack-a-mole. Block one channel, and the content reappears in another.

When publishers challenge AI firms about this third-party scraping path, the standard response, Digiday reports, is that the issue lies with the portal's settings, not with the AI company's own crawling. Responsibility is pushed downstream, and the actual responsible party becomes increasingly difficult to pin down.

The same dynamic applies to South Korea's portal–CP (Content Provider) ecosystem. Even if a Korean newspaper blocks AI crawlers on its own site, identical articles supplied to portals like Naver and Daum may remain accessible through indirect paths. For Korean publishers, this is not a foreign problem.

■ Five Implications for Korea's Media Industry

While the immediate context is American, the structural lessons travel directly to Korea. Five implications stand out.

First, the scraper economy created by AI training and agent-infrastructure demand is forming as a single global market. Korean content is not exempt. Domestic news outlets and broadcasters should design their response on the assumption that their content is already being collected without consent.

Second, the "agentic infrastructure" rebrand is a textbook move to legitimize a category before regulatory and contractual frameworks catch up. Policy responses must look past the change of label and focus on the underlying question: how is the content being consumed?

Third, the same whack-a-mole dynamic is built into Korea's syndication structure, anchored on portal–CP relationships. Standard contract terms and policies that establish the legal force of no-crawl directives, explicit prohibitions on AI-training use within syndication contracts, and clear allocation of responsibility are urgently needed.

Fourth, whoever first designs and occupies the "AI content marketplace layer" — the iTunes/Spotify equivalent — will set the terms of negotiating leverage. TollBit and Cloudflare are already attempting versions of this infrastructure in the U.S. In Korea, organizations such as the Korea Press Foundation and the Korea Communications Commission (KCC) could lead a parallel effort, potentially building a unified marketplace at the K-content level.

Fifth, current Korean policy debates on the proposed Audiovisual Media Services Act and related frameworks should treat AI-training and agent-driven content consumption as a distinct policy track, with explicit definitions, compensation methodologies and accountability provisions. There is growing concern that existing copyright law alone cannot adequately address a new behavioral category — one in which agents consume the web on behalf of humans.


The pirates are always faster. But as the music market eventually demonstrated, legitimate marketplaces do get built in time. The real question is who builds that marketplace first and who writes the rules from it. Whether publishers, as the original hosts of the content ecosystem, and Korean media in particular, can claim seats at that rule-making table is the test now in front of the industry.


Source: https://digiday.com/media/from-ad-tech-tax-to-ai-data-brokers-the-new-middlemen-keep-100-publishers-say/