How is packing implemented in your code? Have you tried using a 4D attention mask to avoid the overlap between samples that you mentioned?
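For concreteness, here is a minimal sketch of the kind of 4D mask the question refers to, assuming each packed row is described only by the lengths of the samples it contains. The function name `packed_causal_mask` and its arguments are hypothetical, not taken from the Open R1 code. It uses plain Python lists so the (batch, 1, seq, seq) layout is easy to inspect.

```python
def packed_causal_mask(seq_lens_per_row, total_len):
    """Build a (batch, 1, total_len, total_len) boolean mask for packed rows.

    mask[b][0][i][j] is True when query position i may attend to key
    position j: both positions belong to the same packed sample and j <= i.
    """
    batch_mask = []
    for seq_lens in seq_lens_per_row:
        if sum(seq_lens) > total_len:
            raise ValueError("packed samples exceed total_len")
        # Label every position with the index of its sample; padding gets -1.
        sample_id = []
        for idx, n in enumerate(seq_lens):
            sample_id.extend([idx] * n)
        sample_id.extend([-1] * (total_len - len(sample_id)))
        rows = [
            [
                sample_id[i] != -1
                and sample_id[i] == sample_id[j]
                and j <= i
                for j in range(total_len)
            ]
            for i in range(total_len)
        ]
        batch_mask.append([rows])  # singleton head dimension
    return batch_mask


# One packed row holding a 3-token sample, a 2-token sample, and 1 pad token.
mask = packed_causal_mask([[3, 2]], total_len=6)
for row in mask[0][0]:
    print("".join("1" if allowed else "0" for allowed in row))
```

The printed pattern is block-diagonal and lower-triangular, so tokens of the second sample never attend to the first. Padding rows are fully masked in this sketch. In a real attention kernel that usually needs extra handling, since a row with no allowed keys can produce NaNs after softmax.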
xiaoqijian
mx1024
Commented on the article "Open R1: Update #3" about 9 hours ago