Datasets
Dataset is a collection of organized data. Datasets can be represented in various formats, such as tables (e.g. CSV files), databases, or specialized data formats.
Data format
I see that the most common formats used for storing data are Text Plain, Markdown, JSONL and Parquet.
JSONL is a text file, it is the simplest.
Parquet is a binary data storage.
Judging by the number of datasets, the Parquet format is in the lead for text data.
For example, below are two dataset structures in JSONL format:
{"messages":[{"role": "system", "content": "You are..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
{"messages":[{"role": "system", "content": "You are..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
{"messages":[{"role": "system", "content": "You are..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
Special tokens
- Beginning Of Sentence (bos) - is a special token that marks the beginning of a sentence. For Llama realm models is
<s>
. - End Of Sentence (eos) - is a special token that marks the end of a sentence. For Llama realm models is
</s>
. - Padding (pad) - special token for unifying the length of a sequence in a batch.
- Unknown (unk) - a special token that represents an unknown or rarely encountered word. For Llama family is
<unk>
.
<s>Lorem ipsum dolor sit amet, consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</s>
<s>Ut enim ad minim veniam, quis nostrud exercitation ullamco
laboris nisi ut aliquip ex ea commodo consequat.</s>
<s>Duis aute irure dolor in reprehenderit in voluptate velit
esse cillum dolore eu fugiat nulla pariatur.</s>
<s>Excepteur sint occaecat cupidatat non proident,
sunt in culpa qui officia deserunt mollit anim id est laborum.</s>
Dataset types
Instructional Dataset (instruct)
It is a type of dataset that contains commands or instructions for a model to perform a specific actions.
<s>[INST]
Who is Aleksey Nemiro?
[/INST]
This is a Russian developer, author of articles and projects in the field of information technology.
</s>
<s>[INST]
Show an example of a "Hello, world!" program in Python
[/INST]
This is an example of implementing the "Hello, world!" program in Python, which prints the phrase of the same name to the console:
'''python
print('Hello, world!')
'''
You can save this code to a file and execute it with the command 'python helloworld.py'.
</s>
Dialogue Dataset
It is a type of dataset that contains conversations between two or more participants.
{
"messages": [
{
"role": "system",
"content": "You are..."
},
{
"role": "user",
"content": "..."
},
{
"role": "assistant",
"content": "..."
}
]
}
<s>System: You are the supreme being, you have no equal in the Universe, you have incredible power.
User: Hello! Who are you?
Assistant: I am the supreme being, I have no equal.
User: Wow!
Assistant: I'm in shock myself.</s>
Parallel Corpora
It is a type of dataset that contains records in different languages. It is used to train machine translation models.
Source: Привет, мир!
Target: Hello, world!
{
"source": "Привет, мир!",
"target": "Hello, world!"
}
Code Dataset
Is is a type of dataset that contains code examples and comments about them.
<s>Language: Python
User: Show an example of searching for a string in an array
Assistant: a = ["C#", "TypeScript", "Python"]
if 'Python' in a:
print("String found!")
else:
print("String not found!")</s>