Creating custom skills for Azure Cognitive Search using Azure ML
This blog post is accompanied by another post, entitled Searching document text at scale using Azure Cognitive Search. This is the first of the two posts, and details the deployment of a custom skill for use with Azure Cognitive Search, using Azure Machine Learning.
Azure Cognitive Search is a Lucene-based search PaaS available from Microsoft Azure.
Custom skills are required to enrich the documents in an Azure Cognitive Search index with additional data.
Quick disclaimer: at the time of writing, I am a Microsoft employee.
Example Use Case
The use case that is explored in this and the accompanying post is for searching through articles from Nucleic Acids Research (NAR).
As such we’ll be creating an Azure Cognitive Search Custom Skill to extract genetic codes from the journal articles.
There are 4 nucleotides that comprise DNA – adenine (A), cytosine (C), guanine (G) and thymine (T). In RNA thymine is replaced with uracil (U).
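This base pairing is worth a quick illustration, since one of the sample passages later relies on it: pairing a DNA sequence into its RNA complement swaps each base for its partner, with U standing in for T. A minimal sketch (the function name is mine, not from the post):

```python
# Map each DNA base to its RNA complement:
# T pairs with A, and A pairs with U (RNA uses uracil instead of thymine).
DNA_TO_RNA_COMPLEMENT = {"A": "U", "T": "A", "C": "G", "G": "C"}


def rna_complement(dna: str) -> str:
    """Return the RNA complement of a DNA sequence."""
    return "".join(DNA_TO_RNA_COMPLEMENT[base] for base in dna)


print(rna_complement("AATAAA"))  # -> UUAUUU
```

This reproduces the pairing mentioned in the sample text below, where the DNA sequence AATAAA matches UUAUUU in RNA.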
We’ll be using regular expressions to extract genetic codes.
Setting Up
Using the Azure CLI, create a resource group (if not already created). I’ve named mine azure-search-nar-demo:
az group create --name azure-search-nar-demo --location westeurope
We’ll be using the following two external Python packages in this post, so I’d recommend pip-installing them into a Python virtual environment:
pip install azureml-sdk
pip install requests
Creating Azure Custom Skill
An Azure Cognitive Search custom skill is just a REST API endpoint: text or image data is passed to the API, which extracts insights from it and returns them.
We’ll deploy the API using Azure Machine Learning. While we won’t train any machine learning models in this post, the same process transfers directly to workflows that serve custom-trained models.
Create Azure ML Workspace
We’ll need an Azure ML Workspace to deploy our API endpoint; run the cell below to create one.
If you want to use the Azure CLI for authentication, you’ll need to install the associated package into your virtual environment with pip install azure-cli-core.
from azureml.core import Workspace
from azureml.core.authentication import AzureCliAuthentication

cli_auth = AzureCliAuthentication()

resource_group = 'azure-search-nar-demo'
subscription_id = '<subscription_id>'

ws = Workspace.create(
    name='myworkspace',
    subscription_id=subscription_id,
    resource_group=resource_group,
    location='westeurope',
    auth=cli_auth
)
Scoring Script
Our scoring script will take input from Azure Cognitive Search and return an output that is parseable by Azure Cognitive Search.
The input is a JSON object with a single key, values, whose value is an array of objects. Each object in this array has two keys: recordId, a unique identifier for the document, and data, which contains data from the document.
In this example we’ll take a single input from each document, text, and return a single output, genetic_codes: an array of the genetic codes extracted from the document text.
Errors and warnings can also be returned in the output per document.
An example input and output can be seen below:
Input
{
    "values": [
        {
            "recordId": "id1",
            "data": {
                "text": "... A 36 base pair AT-rich recognition site, typically TGTTGACAATTT is ..."
            }
        },
        {
            "recordId": "id2",
            "data": {
                "text": "...The DNA sequence AATAAA will, of course match with UUAUUU in RNA..."
            }
        },
        {
            "recordId": "id3",
            "data": {
                "text": null
            }
        }
    ]
}
Output
{
    "values": [
        {
            "recordId": "id1",
            "data": {
                "genetic_codes": [
                    "TGTTGACAATTT"
                ]
            }
        },
        {
            "recordId": "id2",
            "data": {
                "genetic_codes": [
                    "AATAAA",
                    "UUAUUU"
                ]
            }
        },
        {
            "recordId": "id3",
            "data": {},
            "warnings": [
                {
                    "message": "No genetic codes found in text field"
                }
            ]
        }
    ]
}
Our scoring script needs an init function and a run function: init runs once, when the service starts, and run executes each time the API is called.
We’ll be using a regex to extract the genetic codes; a very simple one will do:
[^0-9a-zA-Z] – a non-alphanumeric character, such as a space or a hyphen
([CAGTU]{3,}) – a capture group of 3 or more consecutive characters that are C, A, G, T, or U (a single codon is 3 nucleotides in length)
[^0-9a-zA-Z] – another non-alphanumeric character
We’ll then remove common journal-article words: cut, tag, act and acta (a word found in some journal names in the references).
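Before wiring this into the scoring script, we can sanity-check the pattern and the exclusion list on the sample passages from the input above. A standalone sketch (the helper function name is mine):

```python
import re

GENETIC_CODE_REGEX = r"[^0-9a-zA-Z]([CAGTU]{3,})[^0-9a-zA-Z]"
WORDS_TO_EXCLUDE = ('cut', 'tag', 'act', 'acta')


def extract_genetic_codes(text):
    # Case-insensitive search, so sequences typeset in lowercase still match...
    codes = re.findall(GENETIC_CODE_REGEX, text, flags=re.I)
    # ...then drop ordinary English words built only from C/A/G/T/U.
    return [code for code in codes if code.lower() not in WORDS_TO_EXCLUDE]


print(extract_genetic_codes(
    "The DNA sequence AATAAA will, of course match with UUAUUU in RNA."))
# -> ['AATAAA', 'UUAUUU']
```

Note that because the pattern requires a non-alphanumeric character on both sides, sequences flush against the start or end of the string are not matched; that's acceptable for this demo.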
%%writefile score.py
import json
import re


def init():
    pass


def run(raw_data):
    try:
        raw_data = json.loads(raw_data)
        output = {"values": []}
        genetic_code_regex = r"[^0-9a-zA-Z]([CAGTU]{3,})[^0-9a-zA-Z]"
        for doc in raw_data["values"]:
            record_id = doc.get('recordId')
            doc_data = doc.get("data")
            record_output = {
                'recordId': record_id,
                'data': {},
            }
            if doc_data is None:
                record_output['errors'] = [{"message": "data not found"}]
                output['values'].append(record_output)
                continue
            doc_text = doc_data.get("text", "not found")
            if doc_text == "not found":  # the "text" key was missing entirely
                record_output['errors'] = [{"message": "text field not found"}]
            elif not isinstance(doc_text, str):  # e.g. "text" was null
                record_output['warnings'] = [{"message": "No genetic codes found in text field"}]
            else:
                words_to_exclude = ('cut', 'tag', 'act', 'acta')
                codes = re.findall(genetic_code_regex, doc_text, flags=re.I)
                codes = [code for code in codes if code.lower() not in words_to_exclude]
                record_output['data']['genetic_codes'] = codes
            output['values'].append(record_output)
        return output
    except Exception as e:
        return {"error": str(e)}
Create Container Image
As the scoring script only imports Python built-in modules, json and re, we won’t need any additional config files to instruct the container image to install extra pip packages.
As we’re just using a regex, we won’t need any models in our example.
from azureml.core.image import ContainerImage

image_config = ContainerImage.image_configuration(
    execution_script="score.py",
    runtime="python"
)

image = ContainerImage.create(
    name="cognitive-search-skill-image",
    image_config=image_config,
    models=[],
    workspace=ws
)

image.wait_for_creation(show_output=True)
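Had the scoring script needed extra pip packages (say, requests), we could have supplied a conda environment file to image_configuration via its conda_file parameter. A minimal sketch of generating such a file with only the standard library; the filename, package list, and exact schema here are illustrative assumptions, not taken from the post:

```python
# Hypothetical conda environment file for the container image.
# "myenv.yml" and the pinned versions are examples only.
CONDA_ENV = """\
name: project_environment
dependencies:
  - python=3.6
  - pip:
    - azureml-defaults
    - requests
"""

with open("myenv.yml", "w") as f:
    f.write(CONDA_ENV)
```

The file would then be referenced as conda_file="myenv.yml" when building the image configuration.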
Create Kubernetes Cluster
For Azure Cognitive Search, the skill must be served over HTTPS, so we’ll need an SSL certificate.
For ease of use, we’ll deploy to Azure Kubernetes Service rather than Azure Container Instances, since AKS lets us enable SSL directly from the provisioning configuration.
In the cell below we’ll create our Kubernetes cluster.
from azureml.core.compute import AksCompute, ComputeTarget

# Config used to create a new AKS cluster and enable SSL
provisioning_config = AksCompute.provisioning_configuration(
    cluster_purpose=AksCompute.ClusterPurpose.DEV_TEST
)
provisioning_config.enable_ssl(leaf_domain_label="nardemo")

aks_name = 'democluster'
aks_target = ComputeTarget.create(
    workspace=ws,
    name=aks_name,
    provisioning_configuration=provisioning_config
)

aks_target.wait_for_completion(show_output=True)
Deploy Web Service
Now that our Kubernetes cluster is provisioned, we can deploy our web service. Make sure to enable authentication on your API.
from azureml.core.webservice import AksWebservice

deployment_config = AksWebservice.deploy_configuration(
    cpu_cores=1,
    memory_gb=8,
    auth_enabled=True
)

service = AksWebservice.deploy_from_image(
    deployment_config=deployment_config,
    deployment_target=aks_target,
    image=image,
    name='cognitive-search-skill',
    workspace=ws
)

service.wait_for_deployment(show_output=True)
print(service.state)
print("Scoring API served at: {}".format(service.scoring_uri))
Test Azure Cognitive Search Skill
Now that our Azure ML Web Service is deployed, we can test the API.
We’ll use the input data in the example above and ensure we get the expected output required for Azure Cognitive Search.
import json

import requests

primary_key, secondary_key = service.get_keys()

input_data = {
    "values": [
        {
            "recordId": "id1",
            "data": {
                "text": "... A 36 base pair AT-rich recognition site, typically TGTTGACAATTT is ..."
            }
        },
        {
            "recordId": "id2",
            "data": {
                "text": "...The DNA sequence AATAAA will, of course match with UUAUUU in RNA..."
            }
        },
        {
            "recordId": "id3",
            "data": {
                "text": None
            }
        }
    ]
}

headers = {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer {}'.format(primary_key)
}

resp = requests.post(service.scoring_uri, json.dumps(input_data), headers=headers)
resp.json()
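Against the live service, resp.json() returns the structure shown in the example output earlier. A small sketch of collapsing that structure into a recordId-to-codes lookup; it uses the expected output from above rather than a live call, so it runs standalone:

```python
# Expected response body from the skill (copied from the example output above).
response_body = {
    "values": [
        {"recordId": "id1", "data": {"genetic_codes": ["TGTTGACAATTT"]}},
        {"recordId": "id2", "data": {"genetic_codes": ["AATAAA", "UUAUUU"]}},
        {"recordId": "id3", "data": {},
         "warnings": [{"message": "No genetic codes found in text field"}]},
    ]
}

# Map each recordId to its extracted codes (empty list when none were found).
codes_by_record = {
    record["recordId"]: record["data"].get("genetic_codes", [])
    for record in response_body["values"]
}

print(codes_by_record["id2"])  # -> ['AATAAA', 'UUAUUU']
```

The same comprehension works on resp.json() directly once the service is reachable.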
Now that our Azure Cognitive Search Skill is deployed, we can start to use it in our Azure Cognitive Search Skillset and have the extracted genetic codes in our search index.
You’re ready to move on to the associated blog post at Searching document text at scale using Azure Cognitive Search.