2 points by amadeuswoo 11 days ago | 3 comments

linolevan 11 days ago
For tiny models, the SFT data mixture is unbelievably critical to usability. They barely generalize beyond what the mixture covers. If the data has no multi-turn conversations, they will not be able to hold multi-turn conversations. If the multi-turn conversations are all just chatting and the math examples are all single-turn, they will be unable to do math in a multi-turn setting. This is much less true for bigger models.
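
One way to catch the gap described above before training is to count examples per (task, turn style) cell and list the empty cells. This is only a minimal sketch: the record layout (`task`, `messages`, `role`) and the helper names `coverage` and `missing_cells` are hypothetical, not taken from any particular SFT library.

```python
from collections import Counter

def coverage(conversations):
    """Count examples per (task, turn style) cell of the mixture."""
    counts = Counter()
    for conv in conversations:
        # A conversation is "multi" if the user speaks more than once.
        user_turns = sum(1 for m in conv["messages"] if m["role"] == "user")
        style = "multi" if user_turns > 1 else "single"
        counts[(conv["task"], style)] += 1
    return counts

def missing_cells(conversations):
    """Return (task, style) combinations that have no examples at all."""
    counts = coverage(conversations)
    tasks = {task for task, _ in counts}
    return sorted(
        (task, style)
        for task in tasks
        for style in ("single", "multi")
        if counts[(task, style)] == 0
    )

def msg(role, text):
    return {"role": role, "content": text}

# Hypothetical toy mixture: chat in both styles, math only single-turn.
mixture = [
    {"task": "chat", "messages": [msg("user", "hi"), msg("assistant", "hello")]},
    {"task": "chat", "messages": [msg("user", "hi"), msg("assistant", "hello"),
                                  msg("user", "how are you?"), msg("assistant", "fine")]},
    {"task": "math", "messages": [msg("user", "2+2?"), msg("assistant", "4")]},
]

print(missing_cells(mixture))
```

On this toy mixture the only empty cell is multi-turn math, which is exactly the case a tiny model would reportedly fail on.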
dlcarrier 10 days ago
Neural network development platforms are even more bloated and broken than FPGA development platforms, which previously held the record, and even more so than mobile phone development platforms.
baranmelik 10 days ago
That it's really easy to overfit a model.