Train LLMs with a Custom Dataset on a Laptop

Problem Statement

I want to train a Large Language Model (LLM) on some private documents and query them for various details.

Journey

There are open-source LLMs like Vicuna and LLaMA that can be adapted to custom data, but doing so is not a trivial task.

After trying out various methods, I ended up using privateGPT, which is quite easy to run on custom documents. There is no need to format or clean up the data, as privateGPT can directly consume documents in many formats such as txt, html, epub, and pdf.
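Under the hood, this works by mapping each file extension to a suitable document loader. The real mapping lives in privateGPT's ingest.py; the snippet below is only a minimal sketch of the idea, assuming the document-loader classes from a mid-2023 LangChain release (the loader names and options are assumptions, not a copy of privateGPT's actual code).

import os

# A minimal sketch of extension-based loader dispatch, in the spirit of
# privateGPT's ingest.py. The loader class names assume a mid-2023
# LangChain install and are not privateGPT's exact mapping.
from langchain.document_loaders import (
    PDFMinerLoader,
    TextLoader,
    UnstructuredEPubLoader,
    UnstructuredHTMLLoader,
)

LOADERS = {
    ".epub": (UnstructuredEPubLoader, {}),
    ".html": (UnstructuredHTMLLoader, {}),
    ".pdf": (PDFMinerLoader, {}),
    ".txt": (TextLoader, {"encoding": "utf8"}),
}

def load_document(path: str):
    # Pick a loader based on the file extension and return parsed documents.
    ext = os.path.splitext(path)[1].lower()
    if ext not in LOADERS:
        raise ValueError(f"Unsupported file type: {ext}")
    loader_cls, kwargs = LOADERS[ext]
    return loader_cls(path, **kwargs).load()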

Training

First, let's clone the repo, install the requirements, and download the default model.

$ git clone https://github.com/imartinez/privateGPT
$ cd privateGPT
$ pip3 install -r requirements.txt
$ wget https://gpt4all.io/models/ggml-gpt4all-j-v1.3-groovy.bin

$ cp example.env .env
$ cat .env
MODEL_TYPE=GPT4All
MODEL_PATH=ggml-gpt4all-j-v1.3-groovy.bin

I have sourced all the documents and kept them in a folder called docs. Let's ingest the data. Note that ingestion does not fine-tune the model's weights; privateGPT indexes the documents into a local vector store that is searched at query time.

$ cp ~/docs/* source_documents

$ python ingest.py

This will take a while depending on the number of documents we have. Once the ingestion is done, we can start querying the model.
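For the curious, ingestion is essentially chunk, embed, store: ingest.py splits each document into overlapping chunks, embeds them with a local sentence-transformers model, and persists the vectors in a Chroma database on disk. A simplified sketch, assuming mid-2023 LangChain APIs (the chunk sizes and embedding model name here are illustrative, not necessarily privateGPT's exact defaults):

# A simplified sketch of what ingest.py does: split, embed, persist.
# Assumes mid-2023 LangChain APIs; the chunk sizes and embedding model
# name are illustrative, not necessarily privateGPT's exact defaults.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

def ingest(documents, persist_directory="db"):
    # Split the loaded documents into overlapping chunks small enough
    # to fit into the model's context window alongside the question.
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    chunks = splitter.split_documents(documents)

    # Embed each chunk and persist the vectors to a local Chroma database.
    embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
    db = Chroma.from_documents(chunks, embeddings, persist_directory=persist_directory)
    db.persist()
    return db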

$ python privateGPT.py
Enter a query: Summarise about Gaaliveedu
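At query time, privateGPT embeds the question, pulls the most similar chunks out of the vector store, and stuffs them into the local model's prompt. Roughly, the question-answering side looks like the sketch below, again assuming mid-2023 LangChain APIs (the paths and retriever settings are illustrative assumptions):

# A rough sketch of the query side: retrieve the most relevant chunks,
# then ask the local model to answer from them. Assumes mid-2023
# LangChain APIs; paths and retriever settings are illustrative.
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.llms import GPT4All
from langchain.chains import RetrievalQA

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
db = Chroma(persist_directory="db", embedding_function=embeddings)
retriever = db.as_retriever(search_kwargs={"k": 4})  # top 4 chunks per query

llm = GPT4All(model="ggml-gpt4all-j-v1.3-groovy.bin")
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)
print(qa.run("Summarise about Gaaliveedu"))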

The default GPT4All-J v1.3-groovy model doesn't produce good results. We can easily swap it for a llama.cpp-compatible model such as OpenLLaMA. Let's download the model and convert it to the ggml format that llama.cpp expects.

$ git clone https://huggingface.co/openlm-research/open_llama_13b

$ git clone https://github.com/ggerganov/llama.cpp.git
$ cd llama.cpp
$ python convert.py ../open_llama_13b
Wrote ../open_llama_13b/ggml-model-f16.bin

We can now update the .env file to use the new model and start querying again. Keep in mind that the f16 weights of a 13B model take roughly 26 GB of RAM; on a memory-constrained laptop it may be worth shrinking the model with llama.cpp's quantize tool first.

$ cat .env
MODEL_TYPE=LlamaCpp
MODEL_PATH=/path/to/ggml-model-f16.bin

$ python privateGPT.py
Enter a query: Summarise about Gaaliveedu
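If the converted model fails to load or produces garbage, it can be sanity-checked outside privateGPT. A minimal smoke test, assuming the llama-cpp-python package is installed (the prompt and parameters are arbitrary):

# Minimal smoke test for the converted model, independent of privateGPT.
# Assumes `pip install llama-cpp-python`; the prompt and parameters are
# arbitrary.
from llama_cpp import Llama

llm = Llama(model_path="/path/to/ggml-model-f16.bin", n_ctx=1024)
out = llm("Q: What is a large language model? A:", max_tokens=64)
print(out["choices"][0]["text"])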

Conclusion

This makes it easy to build domain-specific assistants on top of open-source LLMs and use them for various tasks. I have used this setup to build a chatbot for my internal docs, and it works well.


Need further help with this? Feel free to send a message.