Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models. With this in mind, we present the Pile: an 825 GiB English text corpus targeted at training large-scale language models. The Pile is constructed from 22 diverse high-quality subsets -- both existing and newly constructed -- many of which derive from academic or professional sources. Our evaluation of the untuned performance of GPT-2 and GPT-3 on the Pile shows that these models struggle on many of its components, such as academic writing. Conversely, models trained on the Pile improve significantly over both Raw CC and CC-100 on all components of the Pile, while improving performance on downstream evaluations. Through an in-depth exploratory analysis, we document potentially concerning aspects of the data for prospective users. We make publicly available the code used in its construction.

🗓️ Mark your calendar for September 20, 2024 at Subterranean. Pile is a celebration of passion, innovation, and community. Embrace the opportunity to explore the world of pop/rock in a dynamic setting in Illinois. Don't miss out on this remarkable event 🌈.
Event Venue & Nearby Stays
Subterranean, 2011 West North Ave, Chicago, Illinois, United States
Tickets