Using Chat Mode

Chat mode lets you interact with the LLM in a simple, single-turn format. You can use it two ways: with a knowledge base for retrieval-augmented generation (RAG), or without one for a direct conversation with the model.

Try It: Direct Chat (No Knowledge Base)

Follow these steps to chat directly with the LLM without retrieval:

In the sidebar, set Mode to Chat.
Make sure no knowledge base is selected — leave the Knowledge Base dropdown unset.
Type a question in the input box and press Enter. For example:
```
"What is retrieval-augmented generation?"
```
Expected result — the model responds using only its built-in training data. The answer will be general-purpose and will not reference any uploaded documents. No source citations appear because there is no retrieval step.

Try It: Chat With a Knowledge Base (RAG)

Follow these steps to chat with RAG retrieval:

In the sidebar, set Mode to Chat.
Select a knowledge base from the Knowledge Base dropdown (e.g., one of the pre-loaded FantaCo collections).
Type a question that relates to the documents in that collection and press Enter. For example:
```
"What is the vacation policy for full-time employees?"
```
Expected result — the model returns an answer grounded in the retrieved document chunks. Source citations appear alongside the response so you can trace the answer back to the originating document. The response should closely mirror the language in the source documents.

Since the knowledge base is retrieved, you may need to increase Max Tokens in the sidebar. Retrieved chunks consume a significant portion of the token budget, and a low cap can cause the response to be cut short.

How RAG Retrieval Works

When a knowledge base is selected the system augments every query with relevant document context:

Your question is converted into a vector embedding using the loaded embedding model.
The top-k most relevant document chunks are retrieved from PGVector based on semantic similarity.
Those chunks are inserted into the LLM prompt as context.
The LLM (e.g., Llama-3.2-3B-Instruct) generates a response grounded in the retrieved context.
Responses include source references so you can trace the answer back to the originating document.

When to Use Each Approach

Direct Chat (No KB)	Chat With Knowledge Base
General-purpose questions that don’t need proprietary context	Questions about specific documents you’ve uploaded
Testing the model’s baseline capabilities before layering in RAG	Policy, procedure, or reference lookups
Creative or open-ended prompts	Any question where you need grounded, citable answers

Direct Chat (No KB)

Chat With Knowledge Base

General-purpose questions that don’t need proprietary context

Questions about specific documents you’ve uploaded

Testing the model’s baseline capabilities before layering in RAG

Policy, procedure, or reference lookups

Creative or open-ended prompts

Any question where you need grounded, citable answers

Tuning the Response

Use the sidebar sampling controls to influence output:

Parameter	Effect
Temperature	Higher values (e.g., 0.9) increase creativity; lower values (e.g., 0.1) make responses more deterministic.
Top-p	Controls nucleus sampling — lower values restrict the token pool.
Top-k	Restricts sampling to the k most likely tokens — lower values reduce randomness.
Max Tokens	Caps the length of the generated response. When RAG retrieval is active, set this as high as possible — retrieved chunks consume a significant portion of the available token budget and a low cap can cause the response to be cut short.
System Prompt	Override the assistant’s persona or add extra constraints (e.g., "Answer only in bullet points.").

Parameter

Effect

Temperature

Higher values (e.g., 0.9) increase creativity; lower values (e.g., 0.1) make responses more deterministic.

Top-p

Controls nucleus sampling — lower values restrict the token pool.

Top-k

Restricts sampling to the k most likely tokens — lower values reduce randomness.

Max Tokens

Caps the length of the generated response. When RAG retrieval is active, set this as high as possible — retrieved chunks consume a significant portion of the available token budget and a low cap can cause the response to be cut short.

System Prompt

Override the assistant’s persona or add extra constraints (e.g., "Answer only in bullet points.").