Quickstart: Train a Form Recognizer model and extract form data by using the REST API with Python

In this quickstart, you'll use the Azure Form Recognizer REST API with Python to train and score forms to extract key-value pairs and tables.

If you don't have an Azure subscription, create a free account before you begin.

Prerequisites

To complete this quickstart, you must have:

  • Python installed (if you want to run the sample locally).
  • A set of at least five forms of the same type. You'll use this data to train the model. Your forms can be of different file types but must be the same type of document. You can use a sample data set (download and extract sample_data.zip) for this quickstart. Upload the training files to the root of a blob storage container in a standard-performance-tier Azure Storage account.

Note

This quickstart uses remote documents accessed by URL. To use local files instead, see the reference documentation.
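
If you want to script the upload of your training files instead of using the portal, the following is a minimal sketch using the azure-storage-blob package (pip install azure-storage-blob). The environment variable, container name, and local folder path are placeholder assumptions; substitute your own values.

# Minimal upload sketch (assumptions: the azure-storage-blob package is
# installed, and the connection string, container name, and folder path
# below are placeholders for your own values).
import os
from azure.storage.blob import ContainerClient

conn_str = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
container = ContainerClient.from_connection_string(conn_str, "training-data")

local_folder = r"C:\temp\sample_data"  # where you extracted sample_data.zip
for name in os.listdir(local_folder):
    path = os.path.join(local_folder, name)
    if os.path.isfile(path):
        with open(path, "rb") as f:
            # Upload each form to the root of the container.
            container.upload_blob(name=name, data=f, overwrite=True)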

Create a Form Recognizer resource

Go to the Azure portal and create a new Form Recognizer resource. In the Create pane, provide the following information:

  • Name: A descriptive name for your resource, for example MyNameFormRecognizer.
  • Subscription: Select the Azure subscription which has been granted access.
  • Location: The location of your cognitive service instance. Different locations may introduce latency, but have no impact on the runtime availability of your resource.
  • Pricing tier: The cost of your resource depends on the pricing tier you choose and your usage. For more information, see the API pricing details.
  • Resource group: The Azure resource group that will contain your resource. You can create a new group or add it to a pre-existing group.

Note

Normally when you create a Cognitive Services resource in the Azure portal, you have the option to create a multi-service subscription key (used across multiple cognitive services) or a single-service subscription key (used only with a specific cognitive service). However, Form Recognizer isn't currently included in the multi-service subscription.

When your Form Recognizer resource finishes deploying, find and select it from the All resources list in the portal. Your key and endpoint are located on the resource's Keys and Endpoint page, under Resource Management. Save both of these to a temporary location before going forward.
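
To avoid pasting the key and endpoint into each script below, you can read them from environment variables. A minimal sketch; the variable names are arbitrary choices, not required by the service:

# Optional: load the key and endpoint from environment variables instead of
# hard-coding them. The variable names here are arbitrary.
import os

apim_key = os.environ["FORM_RECOGNIZER_KEY"]
endpoint = os.environ["FORM_RECOGNIZER_ENDPOINT"]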

Train a Form Recognizer model

First, you'll need a set of training data in an Azure Storage blob container. You should have a minimum of five filled-in forms (PDF documents and/or images) of the same type/structure as your main input data. Or, you can use a single empty form with two filled-in forms. The empty form's file name needs to include the word "empty."

Note

See Custom model input requirements to make sure your training data set follows the input requirements for Form Recognizer.

Note

You can use the labeled data feature to manually label some or all of your training data beforehand. This is a more complex process but results in a better trained model. See the Train with labels section of the overview to learn more.

To train a Form Recognizer model with the documents in your Azure blob container, call the Train Custom Model API by running the following Python code. Before you run the code, make these changes:

  1. Replace <SAS URL> with the Azure Blob storage container's shared access signature (SAS) URL. To retrieve the SAS URL for your custom model training data, go to your storage resource in the Azure portal and select the Storage Explorer tab. Navigate to your container, right-click, and select Get shared access signature. It's important to get the SAS for your container, not for the storage account itself. Make sure the Read and List permissions are checked, and click Create. Then copy the value in the URL section to a temporary location. It should have the form: https://<storage account>.blob.core.windows.net/<container name>?<SAS value>.

  2. Replace <subscription key> with the subscription key you copied from the previous step.

  3. Replace <endpoint> with the endpoint URL for your Form Recognizer resource.

  4. Replace <Blob folder name> with the path to the folder in blob storage where your forms are located. If your forms are at the root of your container, leave this string empty.

  5. Optionally replace <your model name> with the friendly name you'd like to give to your model.

    ########### Python Form Recognizer Labeled Async Train #############
    import json
    import time
    from requests import get, post
    
    # Endpoint URL
    endpoint = r"<endpoint>"
    post_url = endpoint + "/formrecognizer/v2.1/custom/models"
    source = r"<SAS URL>"
    prefix = "<Blob folder name>"
    includeSubFolders = False
    useLabelFile = False
    
    headers = {
        # Request headers
        'Content-Type': 'application/json',
        'Ocp-Apim-Subscription-Key': '<subscription key>',
    }
    
    body = {
        "source": source,
        "sourceFilter": {
            "prefix": prefix,
            "includeSubFolders": includeSubFolders
        },
        "modelName": "<your model name>",
        "useLabelFile": useLabelFile
    }
    
    try:
        resp = post(url = post_url, json = body, headers = headers)
        if resp.status_code != 201:
            print("POST model failed (%s):\n%s" % (resp.status_code, json.dumps(resp.json())))
            quit()
        print("POST model succeeded:\n%s" % resp.headers)
        get_url = resp.headers["location"]
    except Exception as e:
        print("POST model failed:\n%s" % str(e))
        quit()


  1. Save the code in a file with a .py extension. For example, form-recognizer-train.py.
  2. Open a command prompt window.
  3. At the prompt, use the python command to run the sample. For example, python form-recognizer-train.py.

Get training results

After you've started the train operation, use the returned ID to get the status of the operation. Add the following code to the bottom of your Python script. This code uses the ID value from the training call in a new API call. The training operation is asynchronous, so the script polls the API at regular intervals until the training status is completed. We recommend an interval of one second or more.

n_tries = 15
n_try = 0
wait_sec = 5
max_wait_sec = 60
while n_try < n_tries:
    try:
        resp = get(url = get_url, headers = headers)
        resp_json = resp.json()
        if resp.status_code != 200:
            print("GET model failed (%s):\n%s" % (resp.status_code, json.dumps(resp_json)))
            quit()
        model_status = resp_json["modelInfo"]["status"]
        if model_status == "ready":
            print("Training succeeded:\n%s" % json.dumps(resp_json))
            quit()
        if model_status == "invalid":
            print("Training failed. Model is invalid:\n%s" % json.dumps(resp_json))
            quit()
        # Training still running. Wait and retry.
        time.sleep(wait_sec)
        n_try += 1
        wait_sec = min(2*wait_sec, max_wait_sec)
    except Exception as e:
        msg = "GET model failed:\n%s" % str(e)
        print(msg)
        quit()
print("Train operation did not complete within the allocated time.")

When the training process is completed, you'll receive a 200 (Success) response with JSON content like the following:

{
  "modelInfo":{
    "status":"ready",
    "createdDateTime":"2019-10-08T10:20:31.957784",
    "lastUpdatedDateTime":"2019-10-08T14:20:41+00:00",
    "modelId":"1cfb372bab404ba3aa59481ab2c63da5"
  },
  "trainResult":{
    "trainingDocuments":[
      {
        "documentName":"invoices\\Invoice_1.pdf",
        "pages":1,
        "errors":[

        ],
        "status":"succeeded"
      },
      {
        "documentName":"invoices\\Invoice_2.pdf",
        "pages":1,
        "errors":[

        ],
        "status":"succeeded"
      },
      {
        "documentName":"invoices\\Invoice_3.pdf",
        "pages":1,
        "errors":[

        ],
        "status":"succeeded"
      },
      {
        "documentName":"invoices\\Invoice_4.pdf",
        "pages":1,
        "errors":[

        ],
        "status":"succeeded"
      },
      {
        "documentName":"invoices\\Invoice_5.pdf",
        "pages":1,
        "errors":[

        ],
        "status":"succeeded"
      }
    ],
    "errors":[

    ]
  },
  "keys":{
    "0":[
      "Address:",
      "Invoice For:",
      "Microsoft",
      "Page"
    ]
  }
}

Copy the "modelId" value for use in the following steps.
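
If you're scripting the flow end to end, you can also read the ID out of the training response instead of copying it by hand. A minimal sketch, assuming you let the polling loop above fall through on success (for example, by replacing quit() with break) so resp_json is still in scope:

# Pull the model ID out of the final training response. Assumes resp_json
# holds the response whose "modelInfo" status is "ready" (see the loop above).
model_id = resp_json["modelInfo"]["modelId"]
print("Trained model ID: %s" % model_id)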

Analyze forms for key-value pairs and tables

Next, you'll use your newly trained model to analyze a document and extract key-value pairs and tables.

Call the Analyze Form API by running the following code in a new Python script. Before you run the script, make these changes:

  1. Replace <file path> with the file path of your form (for example, C:\temp\file.pdf). This can also be the URL of a remote file; see the sketch after the code block. For this quickstart, you can use the files under the Test folder of the sample data set (download and extract sample_data.zip).

  2. Replace <model_id> with the model ID you received in the previous section.

  3. Replace <endpoint> with the endpoint URL for your Form Recognizer resource. You can find it on the resource's Overview tab in the Azure portal.

  4. Replace <file type> with the file type. Supported types: application/pdf, image/jpeg, image/png, image/tiff.

  5. Replace <subscription key> with your subscription key.

    ########### Python Form Recognizer Async Analyze #############
    import json
    import time
    from requests import get, post
    
    # Endpoint URL
    endpoint = r"<endpoint>"
    apim_key = "<subscription key>"
    model_id = "<model_id>"
    post_url = endpoint + "/formrecognizer/v2.1/custom/models/%s/analyze" % model_id
    source = r"<file path>"
    params = {
        "includeTextDetails": True
    }
    
    headers = {
        # Request headers
        'Content-Type': '<file type>',
        'Ocp-Apim-Subscription-Key': apim_key,
    }
    with open(source, "rb") as f:
        data_bytes = f.read()
    
    try:
        resp = post(url = post_url, data = data_bytes, headers = headers, params = params)
        if resp.status_code != 202:
            print("POST analyze failed:\n%s" % json.dumps(resp.json()))
            quit()
        print("POST analyze succeeded:\n%s" % resp.headers)
        get_url = resp.headers["operation-location"]
    except Exception as e:
        print("POST analyze failed:\n%s" % str(e))
        quit()
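
If your form is hosted at a URL rather than on disk, the same endpoint accepts a small JSON body in place of the raw bytes. A minimal sketch, assuming the file is reachable by the service; the URL below is a placeholder:

    # Remote-file variant: send a JSON body with a "source" URL instead of
    # uploading bytes. The URL is a placeholder for your own hosted form.
    remote_headers = {
        'Content-Type': 'application/json',
        'Ocp-Apim-Subscription-Key': apim_key,
    }
    remote_body = {"source": "https://example.com/forms/Invoice_6.pdf"}
    resp = post(url = post_url, json = remote_body, headers = remote_headers, params = params)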


  1. Save the code in a file with a .py extension. For example, form-recognizer-analyze.py.
  2. Open a command prompt window.
  3. At the prompt, use the python command to run the sample. For example, python form-recognizer-analyze.py.

When you call the Analyze Form API, you'll receive a 202 (Accepted) response with an Operation-Location header. The value of this header is an ID you'll use to track the results of the Analyze operation. The script above prints the value of this header to the console.

Get the Analyze results

Add the following code to the bottom of your Python script. This uses the ID value from the previous call in a new API call to retrieve the analysis results. The Analyze Form operation is asynchronous, so this script calls the API at regular intervals until the results are available. We recommend an interval of one second or more.

n_tries = 15
n_try = 0
wait_sec = 5
max_wait_sec = 60
while n_try < n_tries:
    try:
        resp = get(url = get_url, headers = {"Ocp-Apim-Subscription-Key": apim_key})
        resp_json = resp.json()
        if resp.status_code != 200:
            print("GET analyze results failed:\n%s" % json.dumps(resp_json))
            quit()
        status = resp_json["status"]
        if status == "succeeded":
            print("Analysis succeeded:\n%s" % json.dumps(resp_json))
            quit()
        if status == "failed":
            print("Analysis failed:\n%s" % json.dumps(resp_json))
            quit()
        # Analysis still running. Wait and retry.
        time.sleep(wait_sec)
        n_try += 1
        wait_sec = min(2*wait_sec, max_wait_sec)
    except Exception as e:
        msg = "GET analyze results failed:\n%s" % str(e)
        print(msg)
        quit()
print("Analyze operation did not complete within the allocated time.")

When the process is completed, you'll receive a 200 (Success) response with JSON content in the following format. The response has been shortened for simplicity. The main key/value pair associations and tables are in the "pageResults" node. If you also specified plain text extraction through the includeTextDetails URL parameter, then the "readResults" node will show the content and positions of all the text in the document.

{
  "status": "succeeded",
  "createdDateTime": "2020-08-21T01:13:28Z",
  "lastUpdatedDateTime": "2020-08-21T01:13:42Z",
  "analyzeResult": {
    "version": "2.1.0",
    "readResults": [
      {
        "page": 1,
        "angle": 0,
        "width": 8.5,
        "height": 11,
        "unit": "inch",
        "lines": [
          {
            "text": "Project Statement",
            "boundingBox": [
              5.0444,
              0.3613,
              8.0917,
              0.3613,
              8.0917,
              0.6718,
              5.0444,
              0.6718
            ],
            "words": [
              {
                "text": "Project",
                "boundingBox": [
                  5.0444,
                  0.3587,
                  6.2264,
                  0.3587,
                  6.2264,
                  0.708,
                  5.0444,
                  0.708
                ]
              },
              {
                "text": "Statement",
                "boundingBox": [
                  6.3361,
                  0.3635,
                  8.0917,
                  0.3635,
                  8.0917,
                  0.6396,
                  6.3361,
                  0.6396
                ]
              }
            ]
          },
          ...
        ]
      }
    ],
    "pageResults": [
      {
        "page": 1,
        "keyValuePairs": [
          {
            "key": {
              "text": "Date:",
              "boundingBox": [
                6.9833,
                1.0615,
                7.3333,
                1.0615,
                7.3333,
                1.1649,
                6.9833,
                1.1649
              ],
              "elements": [
                "#/readResults/0/lines/2/words/0"
              ]
            },
            "value": {
              "text": "9/10/2020",
              "boundingBox": [
                7.3833,
                1.0802,
                7.925,
                1.0802,
                7.925,
                1.174,
                7.3833,
                1.174
              ],
              "elements": [
                "#/readResults/0/lines/3/words/0"
              ]
            },
            "confidence": 1
          },
          ...
        ],
        "tables": [
          {
            "rows": 5,
            "columns": 5,
            "cells": [
              {
                "text": "Training Date",
                "rowIndex": 0,
                "columnIndex": 0,
                "boundingBox": [
                  0.6944,
                  4.2779,
                  1.5625,
                  4.2779,
                  1.5625,
                  4.4005,
                  0.6944,
                  4.4005
                ],
                "confidence": 1,
                "rowSpan": 1,
                "columnSpan": 1,
                "elements": [
                  "#/readResults/0/lines/15/words/0",
                  "#/readResults/0/lines/15/words/1"
                ],
                "isHeader": true,
                "isFooter": false
              },
              ...
            ]
          }
        ],
        "clusterId": 0
      }
    ],
    "documentResults": [],
    "errors": []
  }
}
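
The structure above maps directly onto a couple of loops. The following minimal sketch prints each key/value pair and table cell from the "pageResults" node, assuming you let the results-polling loop fall through on success so resp_json is still in scope:

# Walk the "pageResults" node of the analysis response. Assumes resp_json
# holds the succeeded response from the polling loop above.
for page in resp_json["analyzeResult"]["pageResults"]:
    print("Page %s" % page["page"])
    for pair in page.get("keyValuePairs", []):
        key_text = pair["key"]["text"]
        value_text = pair.get("value", {}).get("text", "")
        print("  %s %s (confidence: %s)" % (key_text, value_text, pair["confidence"]))
    for table in page.get("tables", []):
        print("  Table with %s rows and %s columns" % (table["rows"], table["columns"]))
        for cell in table["cells"]:
            print("    [%s,%s] %s" % (cell["rowIndex"], cell["columnIndex"], cell["text"]))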

Improve results

Examine the "confidence" values for each key/value result under the "pageResults" node. You should also look at the confidence scores in the "readResults" node, which correspond to the text read operation. The confidence of the read results does not affect the confidence of the key/value extraction results, so you should check both.

  • If the confidence scores for the read operation are low, try to improve the quality of your input documents (see Input requirements).
  • If the confidence scores for the key/value extraction operation are low, ensure that the documents being analyzed are of the same type as documents used in the training set. If the documents in the training set have variations in appearance, consider splitting them into different folders and training one model for each variation.

The confidence scores you target will depend on your use case, but generally it's a good practice to target a score of 80% or above. For more sensitive cases, like reading medical records or billing statements, a score of 100% is recommended.
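
As a concrete version of that guidance, this short sketch flags any extracted key/value pair that falls below an 80% threshold. The 0.8 cutoff mirrors the guideline above; it is a local choice, not an API parameter:

# Flag low-confidence key/value pairs. The 0.8 threshold reflects the 80%
# guideline above and is a local choice, not an API setting.
threshold = 0.8
for page in resp_json["analyzeResult"]["pageResults"]:
    for pair in page.get("keyValuePairs", []):
        if pair["confidence"] < threshold:
            print("Review: '%s' (confidence %s)" % (pair["key"]["text"], pair["confidence"]))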

Next steps

In this quickstart, you used the Form Recognizer REST API with Python to train a model and run it in a sample scenario. Next, see the reference documentation to explore the Form Recognizer API in more depth.