Creating custom skills for Azure Cognitive Search using Azure ML

This blog post is accompanied by another post entitled Searching document text at scale using Azure Cognitive Search. This is the first of the two blog posts and details the deployment of a custom skill for use with Azure Cognitive Search using Azure Machine Learning.

Azure cognitive search is a Lucene-based search PaaS service available from Microsoft Azure.

Custom skills are required for custom enrichment our document index in Azure Cognitive Search with additional data.

Quick disclaimer: At the time of writing, I am currently a Microsoft Employee

Example Use Case

The use case that is explored in this and the accompanying post is for searching through articles from Nucleic Acids Research (NAR).

As such we’ll be creating an Azure Cognitive Search Custom Skill to extract genetic codes from the journal articles.

There are 4 nucleotides that comprise DNA – adenine (A), cytosine (C), guanine (G) and thymine (T). In RNA thymine is replaced with uracil (U).

We’ll be using regular expressions to extract out genetic codes.

Setting Up

Using the Azure CLI, create a resource group (if not already created), I’ve named mine azure-search-nar-demo:

az group create --name azure-search-nar-demo --location westeurope

We’ll be using the following two external python packages in this post, so I’d recommend pip installing them into a python virtual environment:

pip install azureml-sdk
pip install requests

Creating Azure Custom Skill

Azure Custom Skills are just a REST API endpoint call, in which text or image data is passed to an API and insights from this text are extracted and returned from the API.

We’ll be deploying an API using Azure Machine Learning and, while not training any machine learning models in this post, this process will be directly transferable to another workflow using custom trained models.

Create Azure ML Workspace

We’ll need an Azure ML Workspace to deploy our API endpoint, run the cell below to create this.

If you want to use Azure CLI for authentication, you’ll need to install the associated package to your virtual environment with pip install azure-cli-core.

In [1]:

from azureml.core import Workspace
from azureml.core.authentication import AzureCliAuthentication

cli_auth = AzureCliAuthentication()

resource_group = 'azure-search-nar-demo'
subscription_id = '<subscription_id>'

ws = Workspace.create(
    name='myworkspace',
    subscription_id=subscription_id, 
    resource_group=resource_group,
    location='westeurope',
    auth=cli_auth
)

Deploying StorageAccount with name myworkspstorage04b3f5525.
Deploying AppInsights with name myworkspinsights572ddf2f.
Deployed AppInsights with name myworkspinsights572ddf2f. Took 18.83 seconds.
Deploying KeyVault with name myworkspkeyvault18249613.
Deployed KeyVault with name myworkspkeyvault18249613. Took 31.83 seconds.
Deployed StorageAccount with name myworkspstorage04b3f5525. Took 34.33 seconds.
Deploying Workspace with name myworkspace.
Deployed Workspace with name myworkspace. Took 47.96 seconds.

Scoring Script

Our scoring script will take input from Azure Cognitive Search and return an output that is parseable by Azure Cognitive Search.

The input is a JSON object that has a single key values, whose own values are an array of objects. Each object in this array has two keys recordId, which is a unique identifier for the document, and data, which contains data from the document.

In this example we’ll only take one input from the document text and will return one output genetic_codes, which will be an array of genetic codes extracted from the document text.

Errors and warnings can also be returned in the output per document.

An example input and output can be seen below:

Input

{
  "values": [
    {
      "recordId": "id1",
      "data": {
        "text": "... A 36 base pair AT-rich recognition site, typically TGTTGACAATTT is ..."
      }
    },
    {
      "recordId": "id2",
      "data": {
        "text": "...The DNA sequence AATAAA will, of course match with UUAUUU in RNA..."
      }
    },
    {
      "recordId": "id3",
      "data": {
        "text": null
      }
    }
  ]
}

Output

{
  "values": [
    {
      "recordId": "id1",
      "data": {
        "genetic_codes": [
          "TGTTGACAATTT"
        ]
      }
    },
    {
      "recordId": "id2",
      "data": {
        "genetic_codes": [
          "AATAAA",
          "UUAUUU"
        ]
      }
    },
    {
      "recordId": "id3",
      "data": {},
      "warnings": [
        {
          "message": "No genetic codes found in text field"
        }
      ]
    }
  ]
}

Our scoring script will need an init function and a run function, the init function is run once and the run function is run each time the API is called.

We’ll be using a regex to extract the genetic code, we’ll just use a very simple regex to do this:

[^0-9a-zA-Z] – a non-alphanumeric character, such as a space or a hyphen
([CAGTU]{3,}) – a capture group of 3 or more consecutive characters that are C, A, G, T, or U (A single codon is 3 nucleotides in length)
[^0-9a-zA-Z] – another non-alphanumeric character

We’ll then remove common journal article words cut, tag, act and acta (a word found in some journal names in the references).

In [2]:

%%writefile score.py

import json
import re

def init():
    pass

def run(raw_data):
    try:
        raw_data = json.loads(raw_data)
        output = {"values": []}
        genetic_code_regex = r"[^0-9a-zA-Z]([CAGTU]{3,})[^0-9a-zA-Z]"
        for doc in raw_data["values"]:
            record_id = doc.get('recordId')
            doc_data = doc.get("data")
            record_output = {
                'recordId': record_id,
                'data': {},
            }
            if doc_data is None:
                record_output['errors'] = [{ "message": "data not found"}]
                output['values'].append(record_output)
                continue
            doc_text = doc_data.get("text", "not found")
            if doc_text is "not found":
                record_output['errors'] = [{ "message": "text field not found"}]
            elif type(doc_text) is not str:
                record_output['warnings'] = [{ "message": "No genetic codes found in text field"}]
            else:
                words_to_exclude = ('cut', 'tag', 'act', 'acta')
                codes = re.findall(genetic_code_regex, doc_text, flags=re.I)
                codes = [code for code in codes if code.lower() not in words_to_exclude]
                record_output['data']['genetic_codes'] = codes
            output['values'].append(record_output)
        return output
    except Exception as e:
        result = str(e)
        return {"error": result}

Writing score.py

Create Container Image

As the scoring script only imports python builtin modules, json and re, we won’t need any additional config files to instruct the container image to install additional pip packages.

As we’re just using a regex, we won’t need any models in our example.

In [3]:

from azureml.core.image import ContainerImage

image_config = ContainerImage.image_configuration(
    execution_script = "score.py",
    runtime = "python"
)

image = ContainerImage.create(
    name = "cognitive-search-skill-image",
    image_config = image_config,
    models = [],
    workspace = ws
)

image.wait_for_creation(show_output = True)

Creating image
Running....................
Succeeded
Image creation operation finished for image cognitive-search-skill-image:1, operation "Succeeded"

Create Kubernetes Cluster

For Azure Cognitive Search, we’ll need an HTTPS endpoint, so we’ll need an SSL certificate.

For ease of use, instead of Azure Container Service, we’ll use Azure Kubernetes Service, in which we can enable SSL from the provisioning configuration.

In the cell below we’ll create our Kubernetes cluster.

In [5]:

from azureml.core.compute import AksCompute, ComputeTarget

# Config used to create a new AKS cluster and enable SSL
provisioning_config = AksCompute.provisioning_configuration(
    cluster_purpose=AksCompute.ClusterPurpose.DEV_TEST
)
provisioning_config.enable_ssl(leaf_domain_label = "nardemo")

aks_name = 'democluster'

aks_target = ComputeTarget.create(
    workspace = ws,
    name = aks_name,
    provisioning_configuration = provisioning_config
)

aks_target.wait_for_completion(show_output = True)

Creating..........................................................................................................................................................................
SucceededProvisioning operation finished, operation "Succeeded"

Deploy Web Service

Now that our Kubernetes cluster is provisioned, we can deploy our web service. Make sure to enable authentication on your API.

In [7]:

from azureml.core.model import Model
from azureml.core.webservice import AksWebservice

deployment_config = AksWebservice.deploy_configuration(
    cpu_cores=1,
    memory_gb=8,
    auth_enabled=True
)

service = AksWebservice.deploy_from_image(
    deployment_config=deployment_config,
    deployment_target=aks_target,
    image=image,
    name='cognitive-search-skill',
    workspace=ws
)

service.wait_for_deployment(show_output = True)
print(service.state)

Running........
Succeeded
AKS service creation operation finished, operation "Succeeded"
Healthy

In [8]:

print("Scoring API served at: {}".format(service.scoring_uri))

Scoring API served at: https://nardemo2dwwod3.westeurope.cloudapp.azure.com:443/api/v1/service/cognitive-search-skill/score

Test Azure Cognitive Search Skill

Now that our Azure ML Web Service is deployed, we can test the API.

We’ll use the input data in the example above and ensure we get the expected output required for Azure Cognitive Search.

In [9]:

primary_key, secondary_key = service.get_keys()

In [10]:

import json
import requests

input_data = {
  "values": [
    {
      "recordId": "id1",
      "data": {
        "text": "... A 36 base pair AT-rich recognition site, typically TGTTGACAATTT is ..."
      }
    },
    {
      "recordId": "id2",
      "data": {
        "text": "...The DNA sequence AATAAA will, of course match with UUAUUU in RNA..."
      }
    },
    {
      "recordId": "id3",
      "data": {
        "text": None
      }
    }
  ]
}

headers = {
    'Content-Type':'application/json',
    'Authorization': 'Bearer {}'.format(primary_key)
}

resp = requests.post(service.scoring_uri, json.dumps(input_data), headers=headers)

resp.json()

Out[10]:

{'values': [{'recordId': 'id1', 'data': {'genetic_codes': ['TGTTGACAATTT']}},
  {'recordId': 'id2', 'data': {'genetic_codes': ['AATAAA', 'UUAUUU']}},
  {'recordId': 'id3',
   'data': {},
   'warnings': [{'message': 'No genetic codes found in text field'}]}]}

Now that our Azure Cognitive Search Skill is deployed, we can start to use it in our Azure Cognitive Search Skillset and have the extracted genetic codes in our search index.

You’re ready to move on to the associated blog post at Searching document text at scale using Azure Cognitive Search.