Surya here from the core Gemma team -- we can think of a distillation loss as learning to model the entire distribution of tokens that are likely to follow the prefix so far, instead of only the single token in the training example. A quick back-of-the-envelope calculation shows that learning to model the full distribution gives the student many more bits of information to learn from per example.
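To make that concrete, here is a minimal sketch of a soft-label distillation loss in PyTorch. It assumes the teacher and student share a tokenizer and vocabulary; the function name and temperature default are mine for illustration, not Gemma's actual training code:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    # Logits are [batch, seq_len, vocab_size]. The teacher supplies a full
    # probability distribution over the vocabulary at every position...
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # ...so the KL term carries gradient signal from every vocabulary entry,
    # not just the single token that appears in the training example.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2
```

Compare this with ordinary cross-entropy training, which is the same KL term with the teacher distribution collapsed to a one-hot vector: all but one vocabulary entry contributes nothing.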
This is really remarkable! How hard do you think it will be to support new models? That is, does the tooling you've built generalize well enough that you could easily serve other large-scale models?
Hi everyone! One of the creators of DFL here. In an attempt to more deeply understand fundamental concepts in machine learning, we designed Depth First Learning. It's a pedagogy for diving deep into machine learning by carefully tailoring a curriculum around a particular paper or concept and leading small, focused discussion groups. So far, we’ve created guides for InfoGAN, TRPO, AlphaGoZero, and DeepStack.
Since our launch, we’ve received very positive feedback from students and researchers. Now, we want to run new, online classes around the world.
We know from experience that curating a meaningful curriculum with reading materials, practice problems, and instructive discussion points can be very rewarding, but also time-consuming and difficult. We wanted the people compiling this content to know that their efforts are well worth their time, so we decided to launch a fellowship program.
Thanks to the generosity of Jane Street, we will provide four fellows with a $4,000 grant each to build a six-week curriculum and run weekly online discussions.
If you’d like to lead a class about an important paper in machine learning, please visit http://fellowship.depthfirstlearning.com to apply. We look forward to hearing from you, and I'm happy to answer any questions about it!
Many machine learning and reinforcement learning models are susceptible to adversarial attacks; the vulnerability is not unique to deep learning. However, because so many systems currently deployed in applications use deep learning, it is under particular scrutiny.
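As an illustration of the simplest such attack (not tied to any particular system discussed here), this is a sketch of the Fast Gradient Sign Method (Goodfellow et al., 2014) in PyTorch; the function name and epsilon default are mine:

```python
import torch

def fgsm_attack(model, loss_fn, x, y, epsilon=0.03):
    # Perturb the input one small step in the direction that most
    # increases the loss; inputs are assumed normalized to [0, 1].
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()   # signed-gradient step
    return x_adv.clamp(0, 1).detach()
```

A perturbation this small is typically imperceptible to a human, yet is often enough to flip the model's prediction.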
Are there any difficulties in generating programs in a standard language like Python? Did you choose a DSL because the neural network is sensitive to the output programming language?
It turns out the full grammar of Python (and of almost all real programming languages) is quite large; this is very early work in neural program synthesis, so we chose a fairly limited DSL to make sure we could at least solve this problem before moving on to more general languages that contain state, conditionals, for-loops, etc. In theory, however, we could apply the exact same architecture to Python programs and see what happens. We haven't tried yet. :)
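To give a feel for how small such a search space can be, here is a toy list-manipulation DSL in Python; the primitive names are mine for illustration, not the paper's actual DSL:

```python
# Every primitive maps a list to a list, so programs compose freely.
DSL = {
    "sort":    lambda xs: sorted(xs),
    "reverse": lambda xs: list(reversed(xs)),
    "tail":    lambda xs: xs[1:],
    "take2":   lambda xs: xs[:2],
    "sum":     lambda xs: [sum(xs)],
}

def run_program(program, xs):
    # A program is just a sequence of primitive names applied left to
    # right; no state, conditionals, or loops, so the space of programs
    # of length k is only len(DSL) ** k.
    for op in program:
        xs = DSL[op](xs)
    return xs

print(run_program(["sort", "reverse", "take2"], [4, 1, 3, 2]))  # [4, 3]
```

With five primitives there are only 125 programs of length three; with Python's full grammar, the equivalent space is unboundedly large.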
Why does the final example in figure 14 fail completely? The outputs are correct as far as they go, but they're all incomplete.
Is it because the scoring metric reaches a point where a good enough start outscores an alternative in the beam search that could lead to a more complete solution? In non-trivial real-world examples, would this be a major problem?
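For what it's worth, the effect the question describes is easy to reproduce: a hypothesis's summed log-probability strictly decreases with every token added, so a short, confident but incomplete output can outscore a longer, complete one. A tiny numerical sketch (made-up probabilities, mine for illustration):

```python
import math

incomplete = [0.9, 0.7, 0.9]   # per-token probabilities, stops early
complete   = [0.85] * 5        # lower per-token confidence, but finishes

def raw_score(probs):
    return sum(math.log(p) for p in probs)

def normalized_score(probs):
    return raw_score(probs) / len(probs)

print(raw_score(incomplete), raw_score(complete))
# -0.57 vs -0.81: the raw sum favors the incomplete hypothesis.
print(normalized_score(incomplete), normalized_score(complete))
# -0.19 vs -0.16: length normalization favors the complete one.
```

Length normalization is one common mitigation, though it does not make the bias disappear in general.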
Theorem proving is very closely related to program induction (we just change the grammar). Just as with Python, the underlying search space would be incredibly large; while in theory we could simply swap in a different grammar and it should work, it will probably take a few more iterations of the model, or other insights, to see this through (but it's definitely not impossible).
It looks like this is a more comprehensive version of FlashFill (it can handle more tasks), and it is based on deep learning instead of the rule-based techniques FlashFill used.
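For readers who haven't seen FlashFill (the programming-by-example feature in Excel), the flavor of task is: given a few input-output string pairs, synthesize a program that explains them. The transformation below is written by hand as a stand-in for what a synthesizer would search for; the examples are mine:

```python
# Input-output examples the user supplies:
examples = [
    ("John Smith", "J. Smith"),
    ("Jane Doe",   "J. Doe"),
]

# A hand-written program consistent with the examples; a synthesizer
# (rule-based like FlashFill, or neural as discussed above) would
# search a space of string operations to find something like it.
def transform(name):
    first, last = name.split(" ", 1)
    return f"{first[0]}. {last}"

assert all(transform(inp) == out for inp, out in examples)
```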