Image Understanding with OpenAI and C#

Feb 12, 2024
5 Minutes
John Ackerman

Introduction

Optical Character Recognition, or OCR, is a class of technology that can recognize text in digital images. For example, reading the name and address off a driver's license and extracting data from a scanned W-2 both fall under the umbrella of OCR. By combining information about the text in an image with the placement of that text, we can start to gain a basic level of understanding of the image.

OCR implementation is a problem that surfaces from time to time with our clients, and one we have solved numerous times over the past decade. That has given us plenty of opportunity to watch the technology grow and accelerate. Today's Large Language Models have started to show remarkable abilities in understanding images, and this post is intended to help C# developers leverage those abilities in their code.

Limitations of Past OCR

There are many OCR products available in the developer market, ranging from Python libraries like PyTesseract to hosted services like Azure Document Intelligence. While libraries work well for simple use cases such as extracting raw text from an image, you need more powerful models to extract true understanding from an image.

To work with a service like Document Intelligence, you need to provide a number of sample images and tag the kinds of data you want to extract. You quickly run into limitations if you need to extract the same kind of information from documents in many different formats. Resumes are a great example of this. While the data that we put into a resume is fairly standard, many people take creative license with the formatting, so actually finding the "Education" section can be tricky.

OpenAI Vision Service

The OpenAI Vision Service provides a powerful API for developers to extract information from images. It can function essentially as an AI-enabled OCR platform, without any time spent training a model. Further, the image understanding model recognizes data by what it means rather than relying on format and layout. This creates a one-stop shop for many OCR needs.
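As a rough sketch of what a request to the Vision Service can look like from C#, the helper below builds the JSON body for a Chat Completions call that carries one text prompt and one image inlined as a base64 data URL. The class and method names are hypothetical, and the `max_tokens` value is an arbitrary choice for illustration; the model name is passed in by the caller because available vision models change over time.

```csharp
using System;
using System.Text.Json;

// Hypothetical helper for illustration; not part of the img-to-json library.
public static class VisionRequest
{
    // Builds the JSON body for a Chat Completions request containing
    // a text prompt plus one image embedded as a base64 data URL.
    public static string BuildBody(string model, string prompt, byte[] imageBytes, string mimeType)
    {
        var dataUrl = $"data:{mimeType};base64,{Convert.ToBase64String(imageBytes)}";

        var body = new
        {
            model,
            max_tokens = 1000, // arbitrary cap chosen for this sketch
            messages = new object[]
            {
                new
                {
                    role = "user",
                    // A vision message mixes text parts and image parts.
                    content = new object[]
                    {
                        new { type = "text", text = prompt },
                        new { type = "image_url", image_url = new { url = dataUrl } }
                    }
                }
            }
        };

        return JsonSerializer.Serialize(body);
    }
}
```

The resulting string would be sent as the body of a POST to `https://api.openai.com/v1/chat/completions` with an `Authorization: Bearer` header holding your API key.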

The main limitation of the current implementation of the Vision Service is that you cannot use tooling (i.e. function calling) or specify that you want the API to return a JSON object. This makes it difficult to extract structured data from an image in a form that other parts of an application can use.

The img-to-json Library

The OpenAI GPT function calling model allows you to send the API function definitions, each consisting of a function name and a description of that function's parameters. This description essentially follows the [JSON Schema](https://json-schema.org/) standard. As web-focused C# developers, we are well versed in serializing and deserializing C# classes as JSON.
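To make the shape of a function definition concrete, here is a minimal sketch that wraps a JSON Schema for the parameters in the `type`/`function` envelope used by the Chat Completions `tools` parameter. The class name, method name, and the sample function used below are hypothetical.

```csharp
using System.Text.Json;
using System.Text.Json.Nodes;

// Hypothetical helper for illustration; not part of the img-to-json library.
public static class FunctionDefinition
{
    // Wraps a parameters schema (as a JSON string) in a function definition.
    public static string Build(string name, string description, string parametersSchemaJson)
    {
        var tool = new JsonObject
        {
            ["type"] = "function",
            ["function"] = new JsonObject
            {
                ["name"] = name,
                ["description"] = description,
                // The parameters are described with JSON Schema.
                ["parameters"] = JsonNode.Parse(parametersSchemaJson)
            }
        };

        return tool.ToJsonString(new JsonSerializerOptions { WriteIndented = true });
    }
}
```

For example, `FunctionDefinition.Build("save_contact", "Stores a contact", schemaJson)` would produce one entry for the `tools` array of a request, where `schemaJson` is an object schema listing the function's parameters.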

We started by creating a method that could take a C# class and output a basic JSON Schema definition for it. We quickly realized that the OpenAI API needed a bit more information to make the schema truly useful. Things like constrained values needed to be communicated and there were several cases where the property name alone was not sufficient to tell OpenAI how to populate our models. To solve this we introduced a few custom attributes that allow us to tell the model exactly how to treat our data in the result.
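The idea can be sketched with reflection. The example below is not the library's implementation: the attribute names (`SchemaDescription`, `SchemaEnum`), the `SchemaSketch` class, and the `IdDocument` sample type are all hypothetical, and the type mapping is deliberately simplified (nested objects and array item types are not described).

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;
using System.Text.Json;

// Hypothetical attributes; the real library's attribute names may differ.
[AttributeUsage(AttributeTargets.Property)]
public sealed class SchemaDescriptionAttribute : Attribute
{
    public string Text { get; }
    public SchemaDescriptionAttribute(string text) => Text = text;
}

[AttributeUsage(AttributeTargets.Property)]
public sealed class SchemaEnumAttribute : Attribute
{
    public string[] Values { get; }
    public SchemaEnumAttribute(params string[] values) => Values = values;
}

// Sample type used only to demonstrate the generator.
public class IdDocument
{
    [SchemaEnum("drivers_license", "passport", "other")]
    public string DocumentType { get; set; } = "";

    public string FullName { get; set; } = "";

    [SchemaDescription("Expiration date in ISO 8601 format (yyyy-MM-dd).")]
    public string ExpirationDate { get; set; } = "";
}

public static class SchemaSketch
{
    // Emits a basic JSON Schema for the public instance properties of a type.
    public static string For(Type type)
    {
        var properties = new Dictionary<string, object>();
        foreach (var prop in type.GetProperties(BindingFlags.Public | BindingFlags.Instance))
        {
            var schema = new Dictionary<string, object> { ["type"] = JsonTypeOf(prop.PropertyType) };

            // Extra guidance for the model when the property name is not enough.
            var description = prop.GetCustomAttribute<SchemaDescriptionAttribute>();
            if (description != null) schema["description"] = description.Text;

            // Constrained values are communicated through "enum".
            var allowed = prop.GetCustomAttribute<SchemaEnumAttribute>();
            if (allowed != null) schema["enum"] = allowed.Values;

            properties[prop.Name] = schema;
        }

        var root = new Dictionary<string, object> { ["type"] = "object", ["properties"] = properties };
        return JsonSerializer.Serialize(root, new JsonSerializerOptions { WriteIndented = true });
    }

    // Simplified mapping from CLR types to JSON Schema type names.
    private static string JsonTypeOf(Type t)
    {
        t = Nullable.GetUnderlyingType(t) ?? t;
        if (t == typeof(string)) return "string";
        if (t == typeof(bool)) return "boolean";
        if (t == typeof(int) || t == typeof(long)) return "integer";
        if (t == typeof(float) || t == typeof(double) || t == typeof(decimal)) return "number";
        if (typeof(System.Collections.IEnumerable).IsAssignableFrom(t)) return "array";
        return "object";
    }
}
```

Calling `SchemaSketch.For(typeof(IdDocument))` returns a schema in which `DocumentType` carries an `enum` list and `ExpirationDate` carries a `description`, which are the two kinds of hints the paragraph above describes.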

Once we were happy with the way this method worked for function calling, we tried applying it to the Vision Service as well by simply including the result object definition in our prompt to the model. It worked! While there were some formatting challenges, on the whole we have been able to get OpenAI to respond to our image queries with reliable JSON that we can then pass on to other parts of our application.
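A minimal sketch of that approach follows. It assumes one particular formatting challenge, namely that the model may wrap its JSON in Markdown fences or add commentary around it; the post does not say which challenges the library actually handles. The `VisionJson` class, its method names, and the prompt wording are hypothetical.

```csharp
using System;
using System.Text.Json;

// Hypothetical helper for illustration; not part of the img-to-json library.
public static class VisionJson
{
    // Appends the schema to the instruction so the model knows the expected shape.
    public static string BuildPrompt(string instruction, string jsonSchema) =>
        instruction
        + "\n\nRespond with a single JSON object that conforms to this JSON Schema. "
        + "Do not include any other text.\n\n"
        + jsonSchema;

    // Takes the span from the first '{' to the last '}', which drops any
    // Markdown fences or commentary around the object, then deserializes it.
    public static T Parse<T>(string modelReply)
    {
        int start = modelReply.IndexOf('{');
        int end = modelReply.LastIndexOf('}');
        if (start < 0 || end <= start)
            throw new FormatException("No JSON object found in the model reply.");

        var json = modelReply.Substring(start, end - start + 1);
        var options = new JsonSerializerOptions { PropertyNameCaseInsensitive = true };

        var result = JsonSerializer.Deserialize<T>(json, options);
        if (result is null)
            throw new FormatException("The model reply deserialized to null.");
        return result;
    }
}
```

The text returned by `BuildPrompt` would be used as the text part of the vision request, and `Parse<T>` would be applied to the message content of the reply. Malformed JSON inside the braces still surfaces as a `JsonException`, so callers may want to retry in that case.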

Example

Here is a simple example of the img-to-json library in use. Imagine you have a pile of resume images that you need to sort through for an "Administrative Assistant" job.

[Figure: An example resume]

We create a C# class representing a resume and the data we want to extract from it. Note our instructions to the OpenAI API specifically around handling the relevance and highest level of education.

[Figure: Simple C# class definition for a resume]
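The class in the original post is shown as an image. As a purely illustrative stand-in, a class of this kind might look like the sketch below. The property names and instruction wording are assumptions, and the standard `System.ComponentModel.Description` attribute is used here in place of the library's own custom attributes, whose names are not given in this post.

```csharp
using System.ComponentModel;

// Hypothetical reconstruction for illustration; not the class from the original post.
public class Resume
{
    [Description("The candidate's full name as printed on the resume.")]
    public string FullName { get; set; } = "";

    // Constrained value: the instruction lists the only acceptable answers.
    [Description("Highest level of education completed. Use exactly one of: "
        + "None, HighSchool, Associate, Bachelor, Master, Doctorate.")]
    public string HighestEducation { get; set; } = "";

    // The property name alone would not tell the model what to score against.
    [Description("How relevant the candidate's experience is to an Administrative "
        + "Assistant role, from 1 (not relevant) to 10 (highly relevant).")]
    public int Relevance { get; set; }

    [Description("A one-sentence justification for the relevance score.")]
    public string RelevanceReason { get; set; } = "";
}
```

The point of the instructions is that fields like relevance and highest education require judgment, so the model needs to be told the target job and the allowed values rather than inferring them from the property names.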

Running that image through the img-to-json library, we extract the following information:

[Figure: Resume parsing result]

Try It Yourself

If you're a C# developer interested in learning more about this method, the `img-to-json` library is freely available. You can check it out in our GitHub repository.

If you're a medium to large business, especially in the hospitality, healthcare, or government sectors, and you're looking to add visual understanding capabilities to your tools and platforms, be sure to get in touch with us and book a free consultation today!
