The most efficient way to search millions of pages of OCR output

Hi!

We're looking to add OCR to our platform so that users can find the right document by searching for keywords in its content. For now we are leaning toward a simple search over the body of the extracted text, given the costs associated with the more advanced OCR features in AWS Textract.

However, I am worried about whether a simple search bar can scale to searching through millions of pages and still return the right results efficiently.
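
To make the scaling concern concrete, here is a rough sketch, not a recommendation of any particular product. It assumes the "simple search" is a linear scan over every page's text and contrasts it with a small in-memory inverted index, the data structure that full-text search engines are generally built on. All names here (`PAGES`, `linear_search`, `build_index`, `indexed_search`) and the sample pages are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical sample data: page id -> OCR text for that page.
PAGES = {
    "doc1-p1": "Invoice for server hosting",
    "doc1-p2": "Payment terms and invoice number",
    "doc2-p1": "Meeting notes about hosting migration",
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def linear_search(pages, word):
    """Naive approach: re-read every page on every query.
    The work per query grows with the total size of the corpus."""
    word = word.lower()
    return sorted(pid for pid, text in pages.items() if word in tokenize(text))

def build_index(pages):
    """One-time pass that maps each word to the set of pages containing it."""
    index = defaultdict(set)
    for pid, text in pages.items():
        for token in tokenize(text):
            index[token].add(pid)
    return index

def indexed_search(index, query):
    """Return pages containing ALL query words by intersecting posting sets.
    The work per query depends on the matching pages, not on the whole corpus."""
    tokens = tokenize(query)
    if not tokens:
        return []
    postings = [index.get(token, set()) for token in tokens]
    return sorted(set.intersection(*postings))

INDEX = build_index(PAGES)
print(linear_search(PAGES, "invoice"))
print(indexed_search(INDEX, "invoice hosting"))
```

The point of the sketch is only that the indexing cost is paid once up front, so each query becomes a few lookups instead of a pass over all the text. A real deployment at this scale would keep the index on disk and add ranking, which is what a dedicated search engine provides.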

What are some good options for setting up a text search engine that feels quick to the user and can handle this kind of task without minutes-long loading times?

Preferably keeping it within the AWS ecosystem.

Thanks!

submitted by /u/majorshimo

from Software Development – methodologies, techniques, and tools. Covering Agile, RUP, Waterfall + more! https://ift.tt/0dXqVYz
