AI Programmes used in the chatbot
Niels Raunkjær Holm, Michael Alexander Harborg, and Andreas Holmer Bigom have used the following AI programmes:
Speech-to-text:
- We use a different model depending on the language of the spoken question. Danish: chcaa/xls-r-300m-danish-nst-cv9. English: OpenAI Whisper Base.
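The language-dependent choice of speech-to-text model can be illustrated with a small routing function. This is only a sketch: the names `STT_MODELS` and `select_stt_model` are hypothetical, and the identifier `openai/whisper-base` is an assumed Hugging Face name for Whisper Base, not something stated in this document.

```python
# Hypothetical routing table: language code -> speech-to-text model identifier.
STT_MODELS = {
    "da": "chcaa/xls-r-300m-danish-nst-cv9",  # Danish model named in the text
    "en": "openai/whisper-base",  # assumed identifier for OpenAI Whisper Base
}


def select_stt_model(language: str) -> str:
    """Return the model identifier for a language code such as 'da' or 'en-US'."""
    # Keep only the primary language subtag, so 'en-US' is treated as 'en'.
    code = language.strip().lower().split("-")[0]
    try:
        return STT_MODELS[code]
    except KeyError:
        raise ValueError(f"Unsupported language: {language!r}") from None
```

Keeping the mapping in one table means a further language could be added without touching the transcription code itself.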
Language models:
- For offline generation (i.e., done in advance and stored in the database) of synthetic summaries and relevant questions for each passage in our collection of documents about Anne Marie Carl-Nielsen, we used GPT-4-Turbo. These are generated to increase the accuracy of retrieval.
- For our relevancy evaluator, which assesses whether a question is relevant to the conversation (i.e., a conversation with Anne Marie Carl-Nielsen, AMCN), we used GPT-4o.
- For our text generation to answer questions, we used GPT-4o.
Retrieval Augmented Generation:
- Passage retrieval: Using an LLM, we automatically divided each document in our corpus into small passages of about 3-5 sentences. Using GPT-4-Turbo, we then generated a simplified summary for each passage, as well as 5 relevant questions that the passage answers.
- We then generated embeddings of each summary and each individual question using OpenAI Ada-002. The aim is to improve retrieval by increasing the similarity between the user’s question and relevant passages.
- When we run retrieval online (i.e., while a user is interacting with the system), we find the 3 most relevant passages based on the L2 distance between the embedding of the posed question and the embeddings of the synthetic questions (thus a total of 6 passages).
- Relevancy Evaluation: We use GPT-4o to assess whether a question is relevant. The input consists of a system prompt describing the role, the 6 relevant passages, the previous 2-6 conversation messages, the posed question, and a message asking whether the question is relevant. If the question is not relevant, we return a predefined message to the user.
- Prompts: Our prompts are primarily based on a qualitative analysis of the output. If we encounter scenarios where the output is unsatisfactory, we use ChatGPT-4 to optimise our prompt by giving it our current prompt and a description of the desired output. Additionally, 5 prompts for our relevancy evaluator were tested on 20 relevant and 10 non-relevant questions, after which we adopted the prompt with the highest accuracy against the desired output.
Text-to-speech:
- Google “da-DK-Wavenet-E” for Danish outputs and Google “en-US-Journey-F” for English outputs. Both are through Google’s API.
Voice conversion:
- We use FreeVC 24kHz. The underlying data comes from voice actor Lotte Andersen, who recorded 24 minutes of Danish speech for Danish VC and 33 minutes of English speech for English VC.
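The online passage retrieval step described above can be sketched as a nearest-neighbour search by L2 distance. This is a minimal illustration, not our production code: the function names and the in-memory list of `(passage_id, embedding)` pairs are hypothetical, and the sketch assumes that the total of 6 passages arises from taking the 3 nearest passages in each of two indexes (synthetic questions and summaries).

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
Index = List[Tuple[str, Vector]]  # (passage_id, embedding) pairs


def top_k_passages(query: Vector, index: Index, k: int = 3) -> List[str]:
    """Return the ids of the k distinct passages nearest to the query (L2 distance).

    A passage may appear several times in the index, e.g. once per synthetic
    question; only its smallest distance to the query is kept.
    """
    best: Dict[str, float] = {}
    for passage_id, embedding in index:
        distance = math.dist(query, embedding)  # Euclidean (L2) distance
        if distance < best.get(passage_id, math.inf):
            best[passage_id] = distance
    ranked = sorted(best, key=lambda pid: best[pid])
    return ranked[:k]


def retrieve(query: Vector, question_index: Index, summary_index: Index,
             k: int = 3) -> List[str]:
    """Concatenate the top-k hits from each index (assumption: 2 x 3 = 6 passages)."""
    return top_k_passages(query, question_index, k) + top_k_passages(query, summary_index, k)
```

In this sketch the same passage can be returned by both indexes; whether such duplicates should be merged is a design choice that the description above leaves open.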
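The prompt comparison for the relevancy evaluator described above amounts to measuring accuracy on a labelled question set and keeping the best candidate. The sketch below shows that arithmetic only. Each candidate prompt is represented by a hypothetical callable that would wrap the actual model call and return True when the question is judged relevant; the names `accuracy` and `best_prompt` are illustrative.

```python
from typing import Callable, Dict, List, Tuple

LabelledQuestion = Tuple[str, bool]  # (question text, desired "is relevant" label)
Classifier = Callable[[str], bool]   # hypothetical wrapper around one candidate prompt


def accuracy(classify: Classifier, labelled: List[LabelledQuestion]) -> float:
    """Fraction of questions on which the classifier matches the desired label."""
    if not labelled:
        raise ValueError("labelled set must not be empty")
    correct = sum(classify(question) == label for question, label in labelled)
    return correct / len(labelled)


def best_prompt(candidates: Dict[str, Classifier],
                labelled: List[LabelledQuestion]) -> str:
    """Return the name of the candidate with the highest accuracy."""
    return max(candidates, key=lambda name: accuracy(candidates[name], labelled))
```

With 20 relevant and 10 non-relevant questions, a prompt that labels every question as relevant would already score 20/30, so accuracy on such a small, unbalanced set is best read alongside the qualitative analysis.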