Optical Character Recognition using OpenCV and Tesseract
According to Wikipedia, optical character recognition or optical character reader (OCR) is the electronic or mechanical conversion of images of typed, handwritten, or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene photo (for example, the text on signs and billboards in a landscape photo), or from subtitle text superimposed on an image (for example, from a television broadcast).
This is the first in a series of articles I will be writing on OCR. The intent is to compare the performance of 4 different platforms for OCR on both typed and handwritten documents. We will be comparing the performance of the following platforms:
- Open-source tools: OpenCV and Tesseract
- Azure Cognitive Service Vision API
- Amazon Textract
- Google Cloud Vision API
We will be exploring option 1 today: OpenCV and Tesseract.
Let us start by importing some libraries:
import io
import cv2
import numpy as np
from IPython.display import clear_output, Image, display
import PIL.Image
import matplotlib.pyplot as plt
import imutils
import pytesseract
from pytesseract import Output
import json
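On Windows, you may also need to point pytesseract at the Tesseract executable if it is not on your PATH. The install location below is an assumption; adjust it to your machine:

# Hypothetical install path; change this to wherever Tesseract is installed on your system
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'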
Next we read in the data:
IMAGE_FILE_LOCATION = "C:/Users/aakogun/Desktop/OCR/sample.jpg"
input_img = cv2.imread(IMAGE_FILE_LOCATION)  # image read
# Plotting the image to see the output
plotting = plt.imshow(input_img)
plt.show()
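Note that OpenCV reads images in BGR channel order while Matplotlib expects RGB, so the colors in the plot above may look inverted. A minimal fix, if you want true-color display:

# OpenCV loads images as BGR; convert to RGB so Matplotlib renders the colors correctly
rgb_img = cv2.cvtColor(input_img, cv2.COLOR_BGR2RGB)
plt.imshow(rgb_img)
plt.show()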
Next we write some functions to pre-process the image using OpenCV:
# Convert to gray scale
def get_grayscale(image):
    return cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
# Median blur to smooth the image and remove noise
def remove_noise(image):
    return cv2.medianBlur(image, 5)
# Thresholding can help us remove lighter or darker regions and contours of images
def thresholding(image):
    return cv2.threshold(image, 225, 255, cv2.THRESH_BINARY_INV)[1]
# Erosions and dilations are typically used to reduce noise in binary images (a side effect of thresholding).
# dilation
def dilate(image):
    kernel = np.ones((5,5), np.uint8)
    return cv2.dilate(image, kernel, iterations=5)
# erosion
def erode(image):
    kernel = np.ones((5,5), np.uint8)
    return cv2.erode(image, kernel, iterations=5)
# opening: erosion followed by dilation
def opening(image):
    kernel = np.ones((5,5), np.uint8)
    return cv2.morphologyEx(image, cv2.MORPH_OPEN, kernel)
# Edge detection using the Canny algorithm
def canny(image):
    return cv2.Canny(image, 50, 200)
gray = get_grayscale(input_img)
plotting = plt.imshow(gray, cmap='gray')  # display as grayscale rather than a false-color map
plt.show()
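The remaining helpers can be chained in the same way. As an illustrative sketch (the exact order and parameters depend on your document), you might threshold the grayscale image and then clean it up with an opening:

# Threshold the grayscale image, then remove small specks with a morphological opening
thresh = thresholding(gray)
cleaned = opening(thresh)
plt.imshow(cleaned, cmap='gray')
plt.show()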
Bounding Boxes: pytesseract's image_to_boxes returns one recognized character per line along with its box coordinates, measured from the bottom-left corner of the image, which is why the y-coordinates are flipped with h - y below:
h, w, c = input_img.shape
boxes = pytesseract.image_to_boxes(input_img)
for b in boxes.splitlines():
    b = b.split(' ')
    img = cv2.rectangle(input_img, (int(b[1]), h - int(b[2])), (int(b[3]), h - int(b[4])), (0, 255, 0), 2)
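To see the result, plot the annotated image (converting BGR to RGB for Matplotlib, as above):

# Display the image with the character-level bounding boxes drawn on it
plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
plt.show()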
Next we write a function that performs some pre-processing on the input image and converts it to text:
### Convert image to text
def image_text(image_path, config):
    image = cv2.imread(image_path)
    ### Re-size image
    #image = cv2.resize(image, (1350, 1150))
    ### Change to gray scale image
    image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    #image = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
    ### Remove Noise
    #image = cv2.medianBlur(image, 5)
    #image = cv2.medianBlur(image, 3)
    ### Remove long vertical and horizontal lines (e.g. table borders) before running OCR
    kernel_vertical = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 50))
    temp1 = 255 - cv2.morphologyEx(image, cv2.MORPH_CLOSE, kernel_vertical)
    horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (50, 1))
    temp2 = 255 - cv2.morphologyEx(image, cv2.MORPH_CLOSE, horizontal_kernel)
    temp3 = cv2.add(temp1, temp2)
    result = cv2.add(temp3, image)
    return pytesseract.image_to_string(result, lang='eng', config=config)
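Calling the function is straightforward. A minimal example, assuming a page of mostly uniform text (--psm 6 tells Tesseract to treat the image as a single uniform block of text, and --oem 3 selects the default engine mode):

# Extract text from the sample image using a common Tesseract configuration
custom_config = r'--oem 3 --psm 6'
text = image_text(IMAGE_FILE_LOCATION, custom_config)
print(text)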
Next, a function that returns the OCR output as JSON:
## Image to JSON
def images_json(image_path, config):
    image = cv2.imread(image_path)
    ### Re-size image
    #image = cv2.resize(image, (1350, 1150))
    ### Change to gray scale image
    image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    ### Remove Noise
    #image = cv2.medianBlur(image, 5)
    ### Remove long vertical and horizontal lines (e.g. table borders) before running OCR
    kernel_vertical = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 50))
    temp1 = 255 - cv2.morphologyEx(image, cv2.MORPH_CLOSE, kernel_vertical)
    horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (50, 1))
    temp2 = 255 - cv2.morphologyEx(image, cv2.MORPH_CLOSE, horizontal_kernel)
    temp3 = cv2.add(temp1, temp2)
    result = cv2.add(temp3, image)
    ### Run OCR and collect word-level boxes, confidences and text
    d = pytesseract.image_to_data(result, config=config, output_type=Output.DICT)
    n_boxes = len(d['text'])
    dict_keys = ['left', 'top', 'width', 'height', 'conf', 'text']
    res = {}
    for i in range(n_boxes):
        dict_json = [d['left'][i], d['top'][i], d['width'][i], d['height'][i], d['conf'][i], d['text'][i]]
        res[i] = dict(zip(dict_keys, dict_json))
    return json.dumps(res)
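As with image_text, we can call this with the same configuration and inspect the word-level results. Note that json.dumps converts the integer keys to strings, so the parsed dictionary is indexed by "0", "1", and so on:

# Get word-level boxes, confidences and text as JSON, then parse it back
json_output = images_json(IMAGE_FILE_LOCATION, custom_config)
parsed = json.loads(json_output)
print(parsed["0"])  # first entry, e.g. {'left': ..., 'top': ..., 'width': ..., 'height': ..., 'conf': ..., 'text': ...}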
Sample Text Results:
Sample JSON Results:
Handwriting OCR Sample Result:
Pytesseract and OpenCV did a decent job of extracting text from typed documents after some pre-processing of the input image. They performed much worse on handwriting, however.
In my next article, I will write on OCR using the Microsoft Azure Cognitive Services Vision API, and we will compare its performance with that of Tesseract.
In this article, we did text extraction using OpenCV and Tesseract. Next we will explore cloud platforms for OCR. After that, we will dive into post-OCR processing of the output text, using techniques such as Named Entity Recognition (NER) and regular expressions.