This GitHub repository has a wide collection of high-quality datasets, tools, and concepts for LLM fine-tuning.
All the datasets listed here should be under permissive licensing (Apache 2.0, MIT, CC-BY-4.0, etc.).
They are categorized into segments such as Math & Logic, Code, Conversation & Role-Play, and Agent & Function calling.
github.com/mlabonne/llm-datasets