Reinforcement Learning with Verifiable Rewards (RLVR) has been successfully applied to significantly boost the capabilities of pretrained large language models, especially in the math and logic problem domains. However, current research and available training datasets remain English-centric. While multilingual training data and benchmarks have been created in the past, they were not created with RLVR and current model capability in mind, and their level of difficulty is often too low to provide appropriate training signals for current models. To address this gap, we provide mAceReason-Math, a dataset of high-quality translations of challenging math problems sourced from a corpus specifically curated for RLVR (AceReason-Math). We further take specific care to clean and improve our translations, resulting in a coverage of 14 languages with more than 10,000 samples per language. We release the dataset to facilitate multilingual RLVR research and benchmarking in the research community.
- †Hasso Plattner Institute & ELLIS Unit Potsdam
- ** Work done while at Apple
- ‡ Equal contribution







