Searching document text at scale using Azure Cognitive Search
This blog post is accompanied by another post entitled Creating custom skills in Azure Cognitive Search using Azure ML.
This is the second of the two posts. In the first post we explore the creation of a custom skill for use in enriching the document index we’ll be creating here.
Azure Cognitive Search is a Lucene-based search service (PaaS) available from Microsoft Azure.
In this post we’ll go through the process of:
- Creating an Azure Storage Account
- Uploading documents to our Azure Storage Account
- Creating an Azure Cognitive Search instance
- Connecting the Azure Cognitive Search instance to our data source
- Connecting the Azure Cognitive Search instance to Cognitive Services
- Defining a skillset including our custom skill
- Indexing our documents
- Querying Azure Search
- Formatting search highlights
The workflow for Azure Cognitive Search looks like the following diagram:
The tools we’ll be using in this post are the Azure CLI, to create our Azure resources, and Python, to make REST requests to our Cognitive Search instance.
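As a rough idea of what "Python making REST requests" looks like, here is a minimal sketch that builds (but does not send) a request to a Cognitive Search service. The service name, the key placeholder, the `build_search_request` helper and the choice of API version are all assumptions for illustration, and the standard library's `urllib` is used only to keep the snippet self-contained; any HTTP client would work.

```python
import json
import urllib.request

# Assumed API version; use whichever version your service supports.
API_VERSION = "2020-06-30"


def build_search_request(service_name, api_key, path, body=None, method="GET"):
    """Build (but do not send) a REST request for a Cognitive Search service.

    This helper is hypothetical; it only assembles the URL, headers and body.
    """
    url = (
        f"https://{service_name}.search.windows.net/"
        f"{path.lstrip('/')}?api-version={API_VERSION}"
    )
    # A JSON body is only attached when one is supplied (e.g. for PUT/POST).
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json", "api-key": api_key}
    return urllib.request.Request(url, data=data, headers=headers, method=method)


# Placeholder service name and key, not real values.
req = build_search_request("my-search-service", "<admin-key>", "indexes")
print(req.get_method(), req.full_url)
```

Passing the returned object to `urllib.request.urlopen` would actually send the request, which requires a real service name and key.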
Quick disclaimer: At the time of writing, I am a Microsoft employee.
Example Use Case
In this example we’ll upload PDF files, have Azure use the OCR service from Azure Cognitive Services to extract any text that isn’t machine-readable, and make the resulting text searchable using Azure Cognitive Search.
The PDF files to be used in this case are a set of 10 PDF files from the year 1980 from the Open Access Journal Nucleic Acids Research (NAR). They can be found here.
Although only 10 PDF files are used here, this can be done at a much larger scale, and Azure Cognitive Search supports a range of other file formats including: Microsoft Office (DOCX/DOC, XLSX/XLS, PPTX/PPT, MSG), HTML, XML, ZIP, and plain text files (including JSON).
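If you are pointing this at a mixed folder of files, it can help to check which ones fall into the formats listed above before uploading. The following is a small sketch only: the `select_indexable` helper is hypothetical, and the extension set is my own mapping of the formats named above (with `.txt` assumed for plain text), not an official list.

```python
from pathlib import Path

# Extensions assumed from the formats listed above; not an exhaustive or official list.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".xlsx", ".xls", ".pptx", ".ppt",
    ".msg", ".html", ".xml", ".zip", ".txt", ".json",
}


def select_indexable(filenames):
    """Return the file names whose extension is in the assumed supported set."""
    return [
        name for name in filenames
        if Path(name).suffix.lower() in SUPPORTED_EXTENSIONS
    ]


candidates = ["paper1.PDF", "figure.png", "notes.docx", "README"]
print(select_indexable(candidates))
```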
For reference, the author was previously a biochemist with 3 papers published in NAR, though not quite in 1980. Nucleic Acids Research tends to focus on the field of molecular biology, particularly DNA and RNA, and the proteins and other molecules that interact with these nucleic acids.
An example of what one of our files looks like can be seen here: