I have a lot of notes about research papers in a particular directory, and the number of files has grown beyond what I can remember off the top of my head. It will keep growing, and I have begun to wonder what the most efficient way to retrieve the information is. I could use ripgrep to search the notes efficiently, but I imagine that if the collection gets very large and I don't use the right regular expression, I might not retrieve the correct files.

Inspired by ChatGPT, I was impressed at how it retrieves info from the internet and cuts down the time I spend finding information even when I do not know the correct keywords. I figured an NLP model trained primarily on my notes would be an easier task, and I was wondering whether someone has already created something like this as open source, or how one would go about it?
So, a few things:

ChatGPT doesn't currently have access to the internet, although it's obviously working with data that was scraped in the recent past. I expect searching Wikipedia from 2021 is sufficient to answer a wide array of queries, which is why it *feels* like it has internet access when you ask it questions.

ChatGPT is effective because it's been trained on an unimaginably large set of data, and an unknown but large number of human hours have gone into supervised/interactive/online/reinforcement/(whatever) learning, where an army of contractors has trained it to deal well with arbitrary human prompts. You don't really want an AI trained just on your data set by itself.

But ChatGPT (or just plain GPT3) is great for summarizing bodies of text as it is right now. I expect you should be able to google how to nicely ask GPT3 to summarize your notes or answer questions with respect to them.
ChatGPT does NOT retrieve any data at all from the internet. It merely remembers statistical patterns of words coming one after another in typical texts. It has no knowledge of facts, and no means to get them whatsoever. It was also trained with data up to 2021, so it has seen nothing from after that. There was an older attempt with WebGPT, but it did not get anywhere AFAIK.

What you need is a semantic search model, which encodes the semantic information in each text as a vector and then performs a vector search based on your query. You can use a transformer-based model for the text vectorization, of course, which may work reasonably well. For specific searches, however, I am pretty sure that regexes will be just fine in your use case.

If you are sure that you need semantic search, use a domain-specific model like SciBERT for best results, or fine-tune a pretrained model from Hugging Face.
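To make the vector-search idea above concrete, here is a minimal sketch in plain Python. It is not a real semantic model: the `embed` function is a toy bag-of-words stand-in that only matches shared words, and the note names and contents are made up for illustration. In practice you would swap `embed` for a transformer encoder; the ranking step stays the same.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a transformer encoder: a bag-of-words count vector.

    Assumption for illustration only; a real setup would return a dense
    vector from a sentence-embedding model instead.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, notes, top_k=3):
    """Rank notes by similarity to the query and return the top_k (name, score) pairs."""
    q = embed(query)
    scored = [(cosine(q, embed(body)), name) for name, body in notes.items()]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:top_k]]


# Hypothetical notes; in practice these would be read from the notes directory.
notes = {
    "attention.md": "Transformer models use self attention over token sequences.",
    "gnn.md": "Graph neural networks pass messages along graph edges.",
    "rl.md": "Policy gradient methods optimize expected reward in reinforcement learning.",
}

results = search("how does attention work in transformer models", notes, top_k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

For a large collection you would embed every note once, store the vectors, and only embed the query at search time, rather than re-embedding all notes per query as this sketch does.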
The internet isn't accessed live by most of these models, as others have said.

You can fine-tune language models, but that doesn't add knowledge to them as such. It biases them to output more words in an order similar to your sample data; it won't add facts.

One approach you can take, though, is semantic search through your notes for a given topic/search query. You basically collect the relevant notes with meanings similar to your topic/search query, then populate a prompt with that text. The answer will use that information and any facts it contains, provided the model is big enough and RLHF-tuned (like the ChatGPT/Instruct/text-00x models from OpenAI).
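The "populate a prompt" step might look roughly like the sketch below. Everything here is an assumption for illustration: the `build_prompt` helper, the prompt wording, and the character budget (a crude stand-in for the model's real token limit) are hypothetical, and the retrieved notes are made up. The resulting string is what you would send to whichever language model you use.

```python
def build_prompt(question, retrieved_notes, max_chars=2000):
    """Assemble a prompt from retrieved note excerpts.

    retrieved_notes is a list of (name, body) pairs, best match first.
    max_chars is a rough budget standing in for the model's context limit.
    """
    parts = []
    used = 0
    for name, body in retrieved_notes:
        excerpt = f"[{name}]\n{body.strip()}\n"
        if used + len(excerpt) > max_chars:
            break  # stop once the next excerpt would exceed the budget
        parts.append(excerpt)
        used += len(excerpt)
    context = "\n".join(parts)
    return (
        "Answer the question using only the notes below.\n\n"
        f"Notes:\n{context}\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical output of a semantic search over the notes directory.
retrieved = [
    ("attention.md", "Self attention weighs every token against every other token."),
    ("rl.md", "Policy gradients optimize expected reward."),
]

prompt = build_prompt("What does self attention do?", retrieved)
print(prompt)
```

Putting the best matches first matters here, because lower-ranked notes are the ones dropped when the budget runs out.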

An open source module for this is GPTIndex. I also work on a commercial solution, which encompasses videos etc. too and has some optimisations. You can also add data/facts from the internet to the prompt (context) at generation time, using an approach like WebGPT.
If you want to go the semantic search route, make sure to check out the deepset.ai haystack framework in conjunction with a sentence-transformer. Together they make semantic document retrieval very easy to set up, and there are many high-performing pre-trained models for semantic search on Hugging Face.