Train LLMs with a Custom Dataset on a Laptop

Problem Statement

I want to train a Large Language Model (LLM) on some private documents and then query it for various details.


There are open-source LLMs like Vicuna, LLaMA, etc. that can be trained on custom data. However, training these models on custom data is not a trivial task.

After trying out various methods, I ended up using privateGPT, which is quite easy to run on custom documents. There is no need to format or clean up the data, as privateGPT can directly consume documents in many formats like txt, html, epub, pdf, etc.


First, let's clone the repo, install the requirements, and download the default model (the repo path and model URL below are the ones the project pointed to at the time of writing):

$ git clone https://github.com/imartinez/privateGPT.git
$ cd privateGPT
$ pip3 install -r requirements.txt
$ wget -P models https://gpt4all.io/models/ggml-gpt4all-j-v1.3-groovy.bin

$ cp example.env .env
$ cat .env
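The defaults point the app at the GPT4All-J model we just downloaded. At the time of writing, the file looked roughly like this (field names come from the privateGPT repo; exact values may differ between versions):

PERSIST_DIRECTORY=db
MODEL_TYPE=GPT4All
MODEL_PATH=models/ggml-gpt4all-j-v1.3-groovy.bin
EMBEDDINGS_MODEL_NAME=all-MiniLM-L6-v2
MODEL_N_CTX=1000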

I have sourced all documents and kept them in a folder called docs. Let's ingest the data. Note that "training" here doesn't change the model's weights; privateGPT builds a local embedding index over the documents and uses it to answer queries.

$ cp ~/docs/* source_documents

$ python ingest.py
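It helps to know what the ingest step actually does: it loads the documents, splits them into chunks, embeds each chunk, and persists the embeddings in a local Chroma vector store. Here is a minimal sketch of that idea using the same LangChain pieces privateGPT is built on; it is an approximation of ingest.py, not a copy of it, and the directory names are illustrative:

# Minimal sketch of the ingestion step: build a local embedding index.
from langchain.document_loaders import DirectoryLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

# Load every document from the ingestion folder.
documents = DirectoryLoader("source_documents").load()

# Split into overlapping chunks so each one fits in the model's context.
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(documents)

# Embed the chunks and persist them in a local Chroma database.
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
db = Chroma.from_documents(chunks, embeddings, persist_directory="db")
db.persist()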

This will take a while depending on the number of documents we have. Once the ingestion is done, we can start querying the model.

$ python privateGPT.py
Enter a query: Summarise about Gaaliveedu
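Behind the prompt, the query path is retrieval-augmented QA: the question is embedded, the closest chunks are fetched from the Chroma index, and they are stuffed into the model's prompt as context. Roughly like this; again a sketch under the same assumptions as above, not the actual privateGPT.py:

# Minimal sketch of the query step: retrieval-augmented QA over the index.
from langchain.chains import RetrievalQA
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import GPT4All
from langchain.vectorstores import Chroma

# Reopen the persisted index and expose it as a retriever.
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
db = Chroma(persist_directory="db", embedding_function=embeddings)
retriever = db.as_retriever()

# Load the local model and stuff retrieved chunks into its prompt.
llm = GPT4All(model="models/ggml-gpt4all-j-v1.3-groovy.bin", n_ctx=1000, backend="gptj")
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)

print(qa.run("Summarise about Gaaliveedu"))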

The default GPT4All-J v1.3-groovy model doesn't provide good results. We can easily swap it for a model served through LlamaCpp, such as OpenLLaMA 13B. Let's download the model and convert it to the ggml format that llama.cpp understands.

$ git clone https://huggingface.co/openlm-research/open_llama_13b

$ git clone https://github.com/ggerganov/llama.cpp
$ cd llama.cpp
$ python convert.py ../open_llama_13b
Wrote ../open_llama_13b/ggml-model-f16.bin
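A 13B model in f16 weighs around 26 GB, which is a lot for a laptop. llama.cpp also ships a quantize tool that can shrink the converted file considerably at a small quality cost; the 4-bit output name below is illustrative:

$ make
$ ./quantize ../open_llama_13b/ggml-model-f16.bin ../open_llama_13b/ggml-model-q4_0.bin q4_0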

We can now update the .env file to use the new model and start querying again.

$ cat .env
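Only MODEL_TYPE and MODEL_PATH need to change; the rest can stay as before. The path below assumes the converted model was left in the open_llama_13b checkout next to the privateGPT directory:

PERSIST_DIRECTORY=db
MODEL_TYPE=LlamaCpp
MODEL_PATH=../open_llama_13b/ggml-model-f16.bin
EMBEDDINGS_MODEL_NAME=all-MiniLM-L6-v2
MODEL_N_CTX=1000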

$ python privateGPT.py
Enter a query: Summarise about Gaaliveedu


This makes it easy to build domain-specific question answering on top of LLMs and use it for various tasks. I have used this setup to build a chatbot over my internal docs, and it is working well.