What I've been working on for the past year! blog.openai.com/p/7fa97c36-6…
Inspired by CoVe, ELMo, and ULMFiT, we show that a single transformer language model can be finetuned to a wide variety of NLP tasks and performs very well with little tuning/tweaking.
We are starting a new, nonprofit alignment organization, ⊢ Sequent Research, bringing together researchers previously on UK AISI’s Alignment Team, Timaeus, and elsewhere to research how to align superintelligence. We are hiring! 🧵
New work with @AlecRad and @DavidDuvenaud:
Have you ever dreamed of talking to someone from the past? Introducing talkie, a 13B model trained only on pre-1931 text.
Vintage models should help us to understand how LMs generalize (e.g., can we teach talkie to code?). Thread:
Announcing Talkie: a new, open-weight historical LLM! We trained and finetuned a 13B model on a newly-curated dataset of only pre-1930 data. Try it below!
with @AlecRad and @status_effects 🧵
We trained diffusion models on a billion LLM activations, and we want you to use them!
New preprint: Learning a Generative Meta-Model of LLM Activations
Joint work with @feng_jiahai, @trevordarrell, @AlecRad, @JacobSteinhardt.
More in thread 🧵
New paper, w/@AlecRad
Models acquire a lot of capabilities during pretraining.
We show that we can precisely shape what they learn simply by filtering their training data at the token level.
Extremely excited to share work I've been doing at OpenAI the past few months: MuseNet, a neural net music generator. It's been a huge team effort pulling this all together!
Introducing MuseNet, a neural network which discovered how to generate music using many different instruments and styles.
Listen & interact: openai.com/blog/musenet/
MuseNet will play an experimental concert today from 12–3pm PT on livestream: twitch.tv/openai
Releasing some work today with @scottgray76, @AlecRad, and @ilyasut. Contains some simple adaptations for Transformers that extend them to long sequences.
Releasing the Sparse Transformer, a network which sets records at predicting what comes next in a sequence — whether text, images, or sound. Improvements to neural 'attention' let it extract patterns from sequences 30x longer than possible previously: openai.com/blog/sparse-trans…
One commonly cited argument about the difficulty of learning common-sense reasoning is that "no-one writes down common sense". A counter-argument is "well, the web is big": instructables.com/id/How-To-…
First, reproducibility is not about rerunning code to get the same results. Science must be more robust, as naive copying has many flaws. Second, reproducibility should never be above public safety. We must publish responsibly, with hope and kindness in our minds.
Don't the benefits of increased reproducibility and rigor on the part of the authors greatly outweigh any potential misuses of their work, at least for the vast majority of ICML/ICLR papers? I think the current shift towards empirical work puts a greater need on releasing code.
I'd like to weigh in on the #GPT2 discussion. The decision not to release the trained model was carefully considered and important for norm-forming. Serving the public good requires us to draw lines on release somewhere: better long before catastrophe than after.
By the way - I think a valid (if extreme) take on GPT-2 is "lol you need 10,000x the data, 1 billion parameters, and a supercomputer to get current DL models to generalize to Penn Treebank."
It's interesting we're having this discussion upon releasing text models that _might_ have potential for misuse, yet we never engaged as fully as a community when many of the technologies powering visual Deep Fakes were being released, including hard-to-make pretrained models.
Shoutout to @katyanna_q who fed the system a curveball, which I always like to see. As you might expect by now after seeing AlphaStar, OpenAI Five etc. etc., if you drag the system away from its training data and into weirder territory, it begins to wobble. theregister.co.uk/2019/02/14…
The DL CV community is having an "oh wait, bags of local features are a really strong baseline for classification" moment with the BagNet paper.
This has always been clear for text classification due to n-gram baselines. It took an embarrassingly long time for nets to beat them.
So nets are stubbornly, begrudgingly, moving in the right direction and we're throwing ever larger amounts of compute and data at them and praying it's enough for them to figure out how to do things "the right way".
Will that work?
Don't know. Probably still worth checking?
Nice discussion of the progress in NLU that's happening with BERT, OpenAI GPT, ULMFiT, ELMo, and more, covered by @CadeMetz in the @nytimes. I'm super excited to see how far this line of research will be able to get in the next few years!
nytimes.com/2018/11/18/techn…