This page discusses some methods you can use to work with multimodal models and embedding models.
If you want to answer questions based on diagrams, LLMs with a text-in, text-out architecture will not help. GPT-4o and GPT-4o mini can take image inputs, and there are also open-source multimodal options worth considering.
In this setup, the initial text extraction only serves as something to run the (semantic) search against; once a page is retrieved, you run the multimodal model on the raw source page (image).
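As a rough sketch of that second step, the snippet below assumes your semantic search has already pointed you at a specific page image and sends it to a multimodal model. It uses the OpenAI Python client with GPT-4o mini as one concrete option; the file path and prompt are placeholders.

```python
import base64
from openai import OpenAI

client = OpenAI()

# Assume semantic search over the extracted text pointed us to this page.
page_image_path = "pages/report_page_12.png"  # placeholder path
with open(page_image_path, "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any multimodal model that accepts image input would do
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What does the diagram on this page show?"},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```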
If you are working in English, you can try the MS MARCO models from the sentence-transformers docs ↗.
MS MARCO ↗ is a collection of large-scale information retrieval datasets created from real user search queries on the Bing search engine. The provided models can be used for semantic search: given keywords, a search phrase, or a question, the model finds passages relevant to that query.
This means these models were specifically trained to put queries and relevant passages close together in embedding space.
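A minimal sketch with the sentence-transformers library, assuming `msmarco-distilbert-base-v4` (one of the MS MARCO models tuned for cosine similarity) and a few toy passages:

```python
from sentence_transformers import SentenceTransformer, util

# One of the MS MARCO models; pick any from the sentence-transformers docs.
model = SentenceTransformer("msmarco-distilbert-base-v4")

passages = [
    "Our API keys can be rotated on the account settings page.",
    "The quarterly report covers revenue, churn and headcount.",
    "Passwords must contain at least twelve characters.",
]
passage_embeddings = model.encode(passages, convert_to_tensor=True)

# A short query and a longer relevant passage end up close together in embedding space.
query = "how do I rotate an API key"
query_embedding = model.encode(query, convert_to_tensor=True)

hits = util.semantic_search(query_embedding, passage_embeddings, top_k=2)[0]
for hit in hits:
    print(round(hit["score"], 3), passages[hit["corpus_id"]])
```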
For that reason, these models may be a better fit than OpenAI's general-purpose Ada for semantic search workflows that start from a user query. When you embed a query directly with Ada and compare it to chunk embeddings, a short question and a long passage are not really the same kind of object; asymmetric embedding models bridge that gap. Alternatively, you can use an LLM to generate a hypothetical chunk first and embed that instead.
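The hypothetical-chunk idea could look roughly like the sketch below; the model names and prompt wording are assumptions rather than a fixed recipe.

```python
from openai import OpenAI

client = OpenAI()

def embed_via_hypothetical_chunk(question: str) -> list[float]:
    # Ask an LLM to write a passage that could plausibly answer the question ...
    draft = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable chat model works here
        messages=[{
            "role": "user",
            "content": f"Write a short, factual documentation paragraph that answers: {question}",
        }],
    ).choices[0].message.content

    # ... then embed that passage instead of the raw question, so we compare
    # chunk-like text against chunk embeddings.
    return client.embeddings.create(
        model="text-embedding-ada-002",
        input=draft,
    ).data[0].embedding

query_vector = embed_via_hypothetical_chunk("How do I rotate an API key?")
```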
Ada, in turn, makes more sense when your starting point is a chunk and you are searching for similar chunks. Note that most non-Ada embedding models only support 512 tokens, so you need to adapt your chunking strategy accordingly.
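If you adopt such a model, a token-aware chunker is one way to respect that limit. The sketch below uses the embedding model's own Hugging Face tokenizer; the 500-token budget and 50-token overlap are arbitrary choices, not recommendations, and the input file is a placeholder.

```python
from transformers import AutoTokenizer

# Use the embedding model's tokenizer so chunk sizes match its limit exactly.
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/msmarco-distilbert-base-v4")

def chunk_by_tokens(text: str, max_tokens: int = 500, overlap: int = 50) -> list[str]:
    # Stay a little below 512 to leave room for the special tokens
    # the model adds around each input.
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        window = token_ids[start:start + max_tokens]
        chunks.append(tokenizer.decode(window))
    return chunks

chunks = chunk_by_tokens(open("extracted_page.txt").read())  # placeholder file
```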
If you are working in German, for example, GPT is currently the only LLM that performs decently in that language. For a German document corpus, try Ada.