Who owns the words AI learns from?
When I first began publishing my own content, I thought of it mainly as a way to share ideas and test perspectives. But as I watched the rise of AI systems that ingest vast amounts of online writing, I found myself asking unsettling questions. What happens when my work, or yours, becomes part of a training dataset without permission or compensation? At first, I tried to reason by analogy. After all, humans learn by reading as well. A student studies a textbook, then applies formulas to solve new problems. A writer absorbs countless prior works and then creates something original. Why should AI be treated differently?
The more I dug in, the clearer the distinction became. Humans read, internalize, and selectively apply ideas within a social and legal framework that has long recognized the balance between inspiration and infringement. AI systems, on the other hand, do not merely learn from content in the abstract. They copy, compress, and reconstitute enormous volumes of data at industrial scale, often reproducing patterns or passages directly. That difference matters, because it shifts the economics of creativity. It made me realize that creators deserve a more serious conversation about rights and compensation in the AI era.
The core tension can be framed as a triangle: access versus control, credit versus value capture, and scale versus manageability. Creators and publishers demand more authority over how their work is used. AI labs want as much clean data as possible. Platforms and licensing intermediaries must thread the needle between fairness and efficiency. A market is now emerging, with multiple competing proposals; whether any of them survives depends heavily on how the AI labs choose to engage.
How the proposals could work in practice
As I explored this space, I was struck by the sheer number of competing frameworks, each with its own vision of fairness and enforceability.
RSL (Really Simple Licensing) standard
Imagine a blogger publishing an article tagged under RSL. The tag might specify: “Allowed for noncommercial research only, must attribute, no pay-per-inference unless negotiated.” When an AI crawler fetches the page, it encounters that metadata and is expected to observe the terms. Reddit, Medium, Quora, and Yahoo have expressed support, giving RSL growing legitimacy. But enforcement depends entirely on whether the labs choose to respect those tags. If they do not, the standard lacks teeth.
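To make the mechanics concrete, here is a rough sketch of how a compliant crawler might check such terms before ingesting a page. The license-file location and element names are my own illustrative assumptions, not the published RSL schema.

```python
# Illustrative only: the element names and license-file location are assumptions,
# not the actual RSL schema.
import xml.etree.ElementTree as ET

import requests


def fetch_license_terms(license_url: str) -> dict:
    """Download and parse a machine-readable license file for a page."""
    resp = requests.get(license_url, timeout=10)
    resp.raise_for_status()
    root = ET.fromstring(resp.text)
    return {
        "usage": root.findtext("usage", default="unspecified"),       # e.g. "noncommercial-research"
        "attribution": root.findtext("attribution", default="none"),  # e.g. "required"
        "payment": root.findtext("payment", default="none"),          # e.g. "negotiated"
    }


def may_ingest(terms: dict, purpose: str) -> bool:
    """A compliant crawler skips pages whose declared terms exclude its purpose."""
    if terms["usage"] == "noncommercial-research" and purpose != "research":
        return False
    return True


terms = fetch_license_terms("https://example-blog.com/.well-known/license.xml")
if may_ingest(terms, purpose="commercial-training"):
    print("Ingest, honoring attribution:", terms["attribution"])
else:
    print("Skip: declared terms do not permit this use")
```

The check itself is trivial to implement; the open question, as above, is whether any lab is obliged to run it.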
Cloudflare Pay-Per-Crawl / AI Crawl Control
Consider The Atlantic allowing AI crawlers to index its pages only if they pay $0.01 per page. Cloudflare acts as the gatekeeper, logging crawler behavior and enforcing payment automatically. Publishers gain real monetization and control over who crawls what. Some are already experimenting with this. The limitation is that it only works if the publisher uses Cloudflare’s infrastructure, and it does not solve attribution or downstream usage rights.
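Cloudflare has framed this exchange around the long-dormant HTTP 402 “Payment Required” status code. The sketch below shows how a cooperating crawler might react to such a response; the header names and the price budget are illustrative assumptions, not Cloudflare’s documented interface.

```python
# Illustrative only: the "crawler-price" / "crawler-max-price" headers are
# assumptions for the sake of the example, not Cloudflare's documented API.
import requests

MAX_PRICE_PER_PAGE = 0.01  # the most this crawler is willing to pay per page, in USD


def fetch_with_payment(url: str) -> str | None:
    resp = requests.get(url, timeout=10)
    if resp.status_code == 402:  # Payment Required: the gatekeeper quotes a price
        quoted = float(resp.headers.get("crawler-price", "inf"))
        if quoted > MAX_PRICE_PER_PAGE:
            return None  # too expensive; skip this page
        # Retry, signalling willingness to pay; the gatekeeper logs and bills the crawl.
        resp = requests.get(url, headers={"crawler-max-price": str(quoted)}, timeout=10)
    resp.raise_for_status()
    return resp.text


page = fetch_with_payment("https://news.example/some-article")
print("fetched" if page else "skipped: price above budget")
```

Because the gatekeeper sits in front of the content, the publisher does not have to trust the crawler’s good faith, which is exactly what RSL on its own cannot offer.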
Microsoft’s Publisher Content Marketplace (PCM)
Here, a smaller publisher joins PCM. When its content is used by Copilot or related AI systems, it receives payments tied to actual usage. The publisher does not need to manage individual licensing deals; the marketplace handles that. Microsoft is piloting this with select news outlets. The promise is scale within Microsoft’s ecosystem. The downside is that publishers outside or unwilling to join remain excluded.
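Microsoft has not published PCM’s payout mechanics, so the following is only an illustration of what “payments tied to actual usage” could mean in the simplest case: a fixed monthly pool split pro rata by usage counts. Every figure and name is invented.

```python
# Purely illustrative arithmetic: all figures and names are invented, not PCM terms.
def usage_based_payouts(usage_counts: dict[str, int], monthly_pool: float) -> dict[str, float]:
    """Split a fixed licensing pool across publishers in proportion to measured usage."""
    total = sum(usage_counts.values())
    if total == 0:
        return {publisher: 0.0 for publisher in usage_counts}
    return {
        publisher: monthly_pool * count / total
        for publisher, count in usage_counts.items()
    }


# Example: three publishers whose articles an assistant drew on this month.
counts = {"outlet-a": 12_000, "outlet-b": 3_000, "outlet-c": 5_000}
print(usage_based_payouts(counts, monthly_pool=100_000.0))
# {'outlet-a': 60000.0, 'outlet-b': 15000.0, 'outlet-c': 25000.0}
```

Real schemes would be more complicated, weighting by prominence, content type, or exclusivity, but the core idea of metering usage and paying against it is what distinguishes PCM from a flat archive deal.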
Adobe Content Credentials
A visual artist embeds credentials into images. Later, when an AI system uses those images, the metadata helps track provenance and attribution. This enhances recognition and trust, especially in media-heavy domains. But it does not guarantee payment, and adoption remains limited. AI labs have yet to commit to interpreting or honoring this metadata in licensing decisions.
Collective licensing models
In the U.K., the Copyright Licensing Agency (CLA) and the Authors’ Licensing and Collecting Society (ALCS) plan frameworks where many authors pool rights. AI systems seeking broad access can negotiate a single license that covers multiple works. A news publisher’s back catalog might be licensed this way, reducing friction. The trade-off is that authors lose granular control, and proceeds are distributed according to opaque formulas.
Intermediary licensing startups
New startups such as Calliope or Created by Humans allow bloggers and creators to package licensing offers with defined access, payment, and attribution terms. AI developers who use the content pay via the intermediary system. These startups are still small, but they offer tools that could scale, if AI labs decide to participate.
What struck me as I dug into these mechanisms was how fragmented the landscape is. Each proposal reflects a different balance of control, compensation, and efficiency. The fragmentation reminded me of the early days of music licensing, when Napster, iTunes, and record labels were all pulling in different directions. The eventual outcome was not just a technical standard but an economic compromise.
What the labs are doing (or not doing) right now
The success or failure of all these mechanisms depends on how labs like OpenAI, Anthropic, and Google (Gemini) decide to engage.
Looking at their different approaches, I found myself reflecting on the asymmetry of power. Individual creators must navigate opaque contracts or hope a collective represents them fairly. Meanwhile, labs can choose whether to embrace, ignore, or selectively comply with licensing systems. The imbalance raises an unsettling question: are we watching the birth of a sustainable market, or simply the formalization of gatekeeping by a handful of tech giants?
OpenAI. OpenAI has already inked multiple content licensing deals. For example, a multi-year deal with News Corp is reported to exceed $250 million. OpenAI also struck partnerships with publishers like Dotdash Meredith, Condé Nast, Axel Springer, the Financial Times, Vox Media, and Future (owner of Tom’s Guide, PC Gamer, TechRadar, etc.). Many of these deals enable OpenAI to include content in ChatGPT or other products with attribution or summaries, and to use publisher archives for training. Because of these deals, portions of OpenAI’s data pipeline already incorporate licensed content rather than purely scraped content.
That said, OpenAI’s deals are still a fraction of the total web. The company continues to rely heavily on large-scale crawling and ingestion of open web pages. Its licensing deals hint that it sees some risk in pure scraping and wants to reduce litigation exposure by converting some portion of its content base into licensed relationships.
Anthropic. Anthropic finds itself under fire in litigation. In 2025 it proposed a $1.5 billion settlement to resolve claims that it used pirated books, thus indirectly acknowledging that unlicensed content acquisition carries material financial risk. The settlement would pay authors roughly $3,000 per book.
At the same time, Reddit has sued Anthropic, accusing it of scraping Reddit data without permission while ignoring Reddit’s licensing deals with other parties like OpenAI and Google. Reducing legal exposure may prompt Anthropic to become more active in licensing, but thus far its posture suggests a mix of defensive settlement and selective partnerships rather than full embrace of systematic frameworks.
Gemini/Google. Google has large-scale strengths in web crawling, indexing, and search infrastructure. Its advantage is that it already controls much of the pipeline for acquiring data. If licensing frameworks become robust, Gemini or Google-backed models may treat them as optional extras or use them selectively. Publicly, Google has not announced large-scale content licensing deals as aggressively as OpenAI, likely because it can leverage its existing indexing infrastructure. Its strategy may be to wait and see whether ecosystem norms shift before committing extensively to licensing.
Lessons from first principles
Evaluating content licensing from first principles requires considering legal clarity, economic incentives, and incentive alignment within the digital ecosystem.
As I worked through these principles, I realized they offered a way to discipline my own gut reactions. Sympathy for creators is important, but principles like enforceability, efficiency, and ecosystem health matter just as much. Any solution that ignores one of these dimensions is unlikely to hold.
Legal clarity. Copyright and intellectual property law define who owns a work and how it can be used. Clear, enforceable rights reduce disputes for both creators and AI developers. Standards such as RSL or collective licensing attempt to operationalize these rights in a machine-readable format. If adoption is limited or jurisdictional differences create ambiguity, gaps emerge that leave creators uncompensated.
Economic incentives. Creators produce content only if the benefits outweigh the costs. Without compensation mechanisms, high-quality work is often undervalued, leading to underinvestment. Transactional systems such as Cloudflare Pay-Per-Crawl or Microsoft PCM align usage with financial reward. Attribution-focused systems like Adobe Content Credentials enhance recognition but do not ensure monetary benefit. Licensing must link usage to measurable benefits to sustain creative output.
Incentive alignment. The broader ecosystem requires balanced incentives. AI companies need access to diverse datasets, while creators require recognition and fair compensation. Misalignment, such as free access without attribution or payment, discourages participation, reduces dataset quality, and introduces bias toward unprotected works. Collective licensing or intermediary platforms attempt to reconcile these needs, but trade-offs exist. Systems should optimize creator benefit, AI performance, and enforceability, because over-prioritizing one dimension undermines the others.
Why this matters
The labs face a strategic calculus. Navigating dozens or hundreds of licensing schemes, each with its own terms, metadata formats, and enforcement rules, is operationally burdensome. If every site uses a different standard, a lab would need logic to interpret RSL tags, settle Cloudflare pay-per-crawl charges, respect attribution metadata, parse collective licensing deals, negotiate via intermediaries, and more. That complexity is nontrivial. Labs may respond with hierarchical strategies (a rough sketch follows the list):
- First, prioritize content licensed from large publishers.
- Next, use content from platforms that adopt RSL tags.
- Then, access sites with pay-per-crawl systems.
- Finally, rely on open crawling, subject to legal risk.
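That tiered logic is easy enough to express in code; the tier labels and metadata fields below are invented for illustration, not any lab’s actual policy.

```python
# Sketch of the tiered sourcing strategy described above; labels and fields are illustrative.
from enum import IntEnum


class Tier(IntEnum):
    LICENSED_PUBLISHER = 0  # direct deals: lowest legal risk
    RSL_TAGGED = 1          # machine-readable terms the crawler can honor
    PAY_PER_CRAWL = 2       # metered access via a gatekeeper
    OPEN_CRAWL = 3          # no declared terms: highest legal risk


def classify_source(source: dict) -> Tier:
    """Map a candidate source's metadata to a priority tier."""
    if source.get("licensed"):
        return Tier.LICENSED_PUBLISHER
    if source.get("rsl_terms"):
        return Tier.RSL_TAGGED
    if source.get("pay_per_crawl"):
        return Tier.PAY_PER_CRAWL
    return Tier.OPEN_CRAWL


def plan_ingestion(sources: list[dict]) -> list[dict]:
    """Order sources so clearly licensed, low-risk content is ingested first."""
    return sorted(sources, key=classify_source)


candidates = [
    {"url": "https://openweb.example/post"},
    {"url": "https://news.example/archive", "licensed": True},
    {"url": "https://blog.example/essay", "rsl_terms": "noncommercial-research"},
    {"url": "https://magazine.example/story", "pay_per_crawl": 0.01},
]
for source in plan_ingestion(candidates):
    print(classify_source(source).name, source["url"])
```

The ordering itself is the easy part; the hard part is deciding how much residual legal risk a lab is willing to carry at the bottom tier.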
Because of the Anthropic settlement and lawsuits such as Reddit v. Anthropic, labs are actively weighing whether ignoring licensing frameworks will prove costlier than compliance.
For creators, the implications are profound. If labs broadly respect licensing, creators regain leverage. Licensing becomes a currency: authors can define terms, capture value, and control downstream use. If labs ignore standards and only strike selective deals, licensing systems may devolve into optional metadata. In that world, creators fall back on litigation or technical barriers, and collaboration gives way to conflict.
When I mapped out this complexity, I could not help but feel the irony. The very technology that promises efficiency at planetary scale now finds itself tangled in a thicket of micro-contracts and metadata tags. It underscores that “AI eats the world” only works if the world consents to be eaten.
A future being written
We are in the middle of a transition. OpenAI’s deals signal a hybrid path: licensing where practical, scraping where tolerated. Anthropic’s settlement underscores the liability of ignoring provenance. Google’s strategy may be to wait and selectively adopt what proves enforceable.
After months of exploring this issue, I find myself both hopeful and uneasy. I believe in the value of open knowledge and the potential of AI to accelerate progress. But I can no longer accept the idea that AI ingestion is the same as human learning. It is not. If creators are to keep producing, they must see themselves as part of the future, not as collateral damage.
The outcome depends less on abstract principles and more on choices the labs make in the months ahead. The world is watching to see whether licensing becomes the default infrastructure of AI, or whether it remains a contested patchwork of deals and lawsuits.