# Train a BERT classifier with synthetic data

Do we still need specifically trained classifiers in the age of zero-shot prompting and LLMs? The answer is: it depends. As of the time of writing (09/2024), even small LLMs with around 2bn parameters run quite slowly (if at all) on limited hardware without fast GPUs. Scalability and cost of operation are therefore one reason to use a specialized classifier.

Another big factor is the complexity of the problem. When there is a large number of classes that are close to each other and no additional data is available, our experience is that even base-level BERT models perform better after fine-tuning than SOTA LLMs in a zero-shot setting.

Currently, there are very few benchmarks that compare "classic" language models like BERT with LLMs. One such experiment was published by Hugging Face, and it shows that the RoBERTa-based model outperformed the LLM variants by a large margin while also being faster and cheaper.

{% hint style="info" %}
For a BERT-like classifier to perform well, the base model must have been pre-trained on data that contains the "knowledge" and "semantics" needed to understand the text. The base models released by Google, Meta and Microsoft are very good at general English. However, in our experiments with a complex set of labels they consistently performed only mediocre. There are fully pre-trained German models such as `deepset/gbert-base`, but depending on the use case, those are not trained on specialized domain knowledge either. We achieved the best results for very domain-specific classification tasks by using domain adaptation techniques such as continued pre-training on texts that contain the domain knowledge. Be aware that this requires a lot of data and resources.
{% endhint %}

### Coverage

In this tutorial we will cover how to structure a training dataset, augment it with synthetic data, and finally train a BERT-based classification model. We will use the Hugging Face libraries and ecosystem to fine-tune the model.

* Fine-tune a BERT-based classifier model
* Structure a training dataset
* Create synthetic samples to augment the training data

Finally, we will evaluate the fine-tuned model and explain the metrics used in doing so.

### Requirements

For this tutorial, you'll need access to a GPU. The amount of available VRAM is less important than it is for LLMs, since BERT-based models are usually relatively small. However, more VRAM allows for larger batch sizes.

On the software side, make sure all dependencies listed in the `requirements.txt` file are installed before proceeding.

```
pip install -r requirements.txt
```

### Data

The structure of the dataset is relatively simple: the model takes a text as input and produces a label as output, and the training dataset mirrors exactly that. We therefore use a CSV file with just two columns: `text` and `label`.

We will use question-answer pairs as input and the model that generated the answer as the label. Our classifier should therefore be able to detect which model answered the question. For simplicity, we use gpt-4o, gpt-4o-mini and gpt-3.5.

| text | label |
| --- | --- |
| `<div class="tzej">We repair your phone within 3 work days</div>` | gpt-4o |
| `<div class="footer"><a href="/contact">Contact us</a></div>` | gpt-4o-mini |
| hwejrh wuerzweuir | gpt-3.5 |
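
The two-column layout above can be sketched with Python's standard `csv` module. This is a minimal illustration, not the tutorial's own loading code: the example rows are invented placeholders, the helper names (`write_dataset`, `read_dataset`, `build_label_maps`) are hypothetical, and an in-memory buffer stands in for the CSV file. It also shows one common way to map the string labels to the integer ids that classification models typically expect.

```python
import csv
import io

# Placeholder rows in the text/label layout described above.
# The texts are invented for illustration only.
rows = [
    {"text": "Question: What is 2 + 2? Answer: 4, of course.", "label": "gpt-4o"},
    {"text": 'Question: Define "token". Answer: A unit of text.', "label": "gpt-4o-mini"},
    {"text": "Question: Is water wet? Answer: Yes.", "label": "gpt-3.5"},
]


def write_dataset(rows, handle):
    """Write rows as CSV with a header; the csv module quotes commas and quotes."""
    writer = csv.DictWriter(handle, fieldnames=["text", "label"])
    writer.writeheader()
    writer.writerows(rows)


def read_dataset(handle):
    """Read the CSV back into a list of {'text': ..., 'label': ...} dicts."""
    return [dict(row) for row in csv.DictReader(handle)]


def build_label_maps(labels):
    """Assign a stable integer id to each distinct label (sorted alphabetically)."""
    label2id = {label: idx for idx, label in enumerate(sorted(set(labels)))}
    id2label = {idx: label for label, idx in label2id.items()}
    return label2id, id2label


# An in-memory buffer stands in for a file such as a hypothetical train.csv.
buffer = io.StringIO(newline="")
write_dataset(rows, buffer)
buffer.seek(0)

loaded = read_dataset(buffer)
label2id, id2label = build_label_maps(row["label"] for row in loaded)

# Pair each text with the integer id of its label.
encoded = [(row["text"], label2id[row["label"]]) for row in loaded]
print(label2id)
```

Going through a CSV writer instead of joining strings by hand matters here because question-answer texts usually contain commas, quotes, or line breaks, which need proper quoting to survive a round trip.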

We can easily generate the data by passing a question to the corresponding model API.

```