On May 5, Hachette Book Group, Macmillan, McGraw Hill, Elsevier, and Cengage Learning, joined by bestselling author Scott Turow, filed a putative class action against Meta and CEO Mark Zuckerberg, alleging that the company willfully infringed their copyrights to train the Llama family of models. The complaint claims that Meta executives, including Zuckerberg personally, authorized torrenting more than 267 TB of pirated text from LibGen and Anna's Archive. The plaintiffs seek statutory damages, a permanent injunction against further use of their works, and an order to destroy the allegedly infringing training data.

The legal strategy is sharper than in earlier author suits. By naming Zuckerberg individually and pointing to internal authorization of torrenting, the plaintiffs are setting up an argument for willful infringement, which raises the statutory damages ceiling under U.S. copyright law from $30,000 to $150,000 per work and weakens any fair-use defense. The framing also echoes the distinction the Bartz v. Anthropic court drew last year: training on copyrighted material may be fair use, but storing pirated copies is not.

That distinction is the live edge of AI copyright law. Anthropic settled Bartz for $1.5 billion in 2025, with a final approval hearing scheduled for May 14. If a similar liability theory sticks against Meta, the industry's reliance on shadow-library corpora, long an open secret, becomes a balance-sheet item. Llama's open-weights release strategy was built on training data that may now require expensive retroactive licensing.

A note for learners: if you train or fine-tune models on bulk text, document your data provenance. Even hobby projects matter, because the distinction the courts are drawing is between 'how you used the data' (potentially fair use) and 'how you obtained it' (potentially willful piracy). A clean data trail, meaning a record of where each file came from, under what license, and how it was acquired, is a cheap insurance policy and increasingly a hiring filter for ML roles.
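
As a rough illustration of what documenting provenance can look like in practice, here is a minimal Python sketch that appends one record per dataset file to a JSON Lines manifest. Everything in it is an assumption made for illustration: the `ProvenanceRecord` fields, the helper names, and the manifest layout are hypothetical and do not come from the lawsuit or from any standard. A real project may need more fields, such as the terms-of-service version or a contact for the licensor.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from pathlib import Path


@dataclass
class ProvenanceRecord:
    """One manifest entry per dataset file (hypothetical schema)."""
    path: str
    sha256: str        # content hash, so the entry is tied to exact bytes
    size_bytes: int
    source_url: str    # where the file was obtained
    license: str       # e.g. "CC-BY-4.0", "public-domain", "licensed:<contract id>"
    acquired_via: str  # how it was obtained, e.g. "publisher API", "direct download"
    recorded_at: str   # UTC timestamp in ISO 8601 format


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large corpora do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_file(path, source_url: str, license: str, acquired_via: str) -> ProvenanceRecord:
    """Build a provenance record for a file that already exists on disk."""
    path = Path(path)
    return ProvenanceRecord(
        path=str(path),
        sha256=sha256_of(path),
        size_bytes=path.stat().st_size,
        source_url=source_url,
        license=license,
        acquired_via=acquired_via,
        recorded_at=datetime.now(timezone.utc).isoformat(),
    )


def append_to_manifest(record: ProvenanceRecord, manifest_path) -> None:
    """Append the record as one JSON object per line (JSON Lines)."""
    with Path(manifest_path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(record), sort_keys=True) + "\n")
```

Calling `record_file(...)` and then `append_to_manifest(...)` each time a file enters the corpus keeps the trail current. The append-only layout is a design choice in this sketch: earlier entries are never rewritten, so the manifest doubles as a simple acquisition log.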