Big news: Apple just quietly published (1 Jan) a webpage describing the training data used for its generative AI models.
This legal document is for compliance with the California Generative Artificial Intelligence Training Data Transparency Act.
🍿 Fascinating insights. Highlights:
"Apple trains generative AI models using a mixture of data that includes publicly available data, including publicly available information crawled by Apple’s web crawler Applebot, data licensed or purchased from third parties, open-sourced data, data obtained through user studies, and synthetic data."
"Applebot does not crawl data from websites that require login credentials or that are protected by a paywall. Applebot respects standard robots.txt directives"
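For anyone wondering what "respects standard robots.txt directives" means in practice, here is a minimal sketch using Python's standard-library parser. The robots.txt content and the `can_fetch` helper are made up for illustration; nothing here is Apple's actual crawler logic.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site (not from any real domain).
ROBOTS_TXT = """\
User-agent: Applebot
Disallow: /private/

User-agent: *
Disallow: /
"""


def can_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt allows user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("Applebot", "https://example.com/articles/1"),
        ("Applebot", "https://example.com/private/notes"),
        ("OtherBot", "https://example.com/articles/1"),
    ]:
        print(agent, url, can_fetch(agent, url))
```

A site owner who wants to opt out would add a `Disallow` rule under the relevant user-agent group, as in the hypothetical file above.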
"Data sets for model training include both data from the public domain and data subject to intellectual property rights. For example, data used to train generative AI models includes data that has been directly licensed to Apple and data made available pursuant to licenses, such as common open-source licenses, that permit use of the data in the development of generative AI systems."
"Apple does not use our users’ private personal data or user interactions when training our foundation models. Additionally, for content publicly available on the internet that has been crawled by Applebot, Apple takes steps to apply filters to remove certain categories of personally identifiable information, such as social security and credit card numbers, from training data."
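The document does not say how these PII filters work. As a rough sketch of one common approach, assuming simple pattern matching plus a Luhn checksum for card numbers, it could look like the following. The regexes, the `[SSN]`/`[CARD]` placeholders, and the function names are my own assumptions, not Apple's method.

```python
import re

# Assumed patterns: US SSN in NNN-NN-NNNN form, and 13 to 19 digit card
# numbers with optional spaces or hyphens between digits.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_ok(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, mod 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact_pii(text: str) -> str:
    """Replace SSN-shaped strings and Luhn-valid card numbers with placeholders."""
    text = SSN_RE.sub("[SSN]", text)

    def _card(match: re.Match) -> str:
        digits = re.sub(r"\D", "", match.group(0))
        # Leave digit runs that fail the checksum untouched.
        return "[CARD]" if luhn_ok(digits) else match.group(0)

    return CARD_RE.sub(_card, text)


if __name__ == "__main__":
    print(redact_pii("SSN 123-45-6789, card 4111 1111 1111 1111."))
```

The checksum step is there to cut false positives: long digit runs such as order IDs usually fail Luhn, so they are left alone.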
"Apple filters web-crawled data and publicly available datasets both at the time the data is crawled or imported and also as a part of post-acquisition processing prior to training. The data is managed both to limit the use of low-quality data and to remove content that is undesirable or unsafe. For example, Apple performs quality filtering and plain-text extraction on data crawled by Applebot, including safety, profanity, inappropriate content, spam, financial data, and quality filtering using heuristics and model-based classifiers, global fuzzy de-duplication using locality-sensitive n-gram hashing, decontamination against common pre-training benchmarks, and filtering against benchmark datasets. Different techniques are used to filter datasets, including manual and algorithmic ranking of content, use of heuristics, and use of machine learning models."
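"Fuzzy de-duplication using locality-sensitive n-gram hashing" is not spelled out in the document. One well-known technique matching that description is MinHash over word n-grams with LSH banding, sketched below. The parameters (`n=3`, 64 hash functions, 16 bands) and all function names are assumptions chosen for illustration, not Apple's pipeline.

```python
import hashlib
from collections import defaultdict


def shingles(text: str, n: int = 3) -> set:
    """Lowercased word n-grams; short texts collapse to a single shingle."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(seed: int, s: str) -> int:
    """Seeded 64-bit hash; each seed acts as one MinHash hash function."""
    h = hashlib.blake2b(s.encode("utf-8"), digest_size=8,
                        salt=seed.to_bytes(8, "little"))
    return int.from_bytes(h.digest(), "little")


def minhash(text: str, num_perm: int = 64, n: int = 3) -> list:
    """Signature: the minimum hash of the shingle set under each seed."""
    grams = shingles(text, n)
    return [min(_hash(seed, g) for g in grams) for seed in range(num_perm)]


def near_duplicate_groups(docs: dict, num_perm: int = 64, bands: int = 16) -> set:
    """Group doc ids whose signatures agree on every row of at least one band."""
    rows = num_perm // bands
    buckets = defaultdict(set)
    for doc_id, text in docs.items():
        sig = minhash(text, num_perm)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(doc_id)
    return {frozenset(ids) for ids in buckets.values() if len(ids) > 1}


if __name__ == "__main__":
    docs = {
        "a": "The quick brown fox jumps over the lazy dog",
        "b": "the quick  brown fox jumps over the LAZY dog",
        "c": "Apple publishes a page describing generative AI training data",
    }
    print(near_duplicate_groups(docs))
```

Documents with identical n-gram sets always land in the same buckets. Documents with high but not total overlap collide with high probability, which is what makes the matching "fuzzy". The same n-gram overlap idea is often used for benchmark decontamination, though the document gives no detail on that either.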
"Apple has been collecting textual data for training since 2018 and image data for training since 2020. Data collection remains ongoing."
"Apple uses generated text, images, audio, and other content to supplement datasets containing real-world data. This category of data is used to enhance the other corpora, including synthetic image caption data, question-answer pairs, and language data. Apple also uses synthetic data generation for post-training, including supervised fine-tuning."
I wonder what happens with respect to the Google-trained Apple LLM.
This document was discovered by Epi Internet Intelligence (@epiapp).