MLUI Mobile: Autify OCR vs. Google OCR

Figure: Autify OCR vs. Google OCR comparison visualization

Autify for Mobile has multiple ML-based features built on top of MLUI-Mobile, our ML-based UI detection engine. We have integrated this engine with an OCR system that we developed in-house. It currently powers only Mobile features, but we are actively working on integrating it with some Autify for Web features. Because our use case is relatively narrow (rendered text on screenshots), our in-house OCR system has been able to outperform the available alternatives on that use case. In this blog post, we compare our OCR with open-source alternatives like EasyOCR as well as closed-source solutions like Google Cloud OCR.

Executive Summary

Autify OCR: 91%
Google OCR: 79%
EasyOCR: 29%

Methodology

Given the specificity of our use case, we tested only on a subset of our internal dataset, so the results may differ on other datasets. We also gave every OCR engine cropped images of text produced with our own production detector, so the results may be skewed compared with running each engine with its own detector.

Dataset Scope: Internal dataset specific to mobile screenshots with rendered text
Testing Approach: Cropped images provided to all engines using production detector
Evaluation Method: Agreement-based accuracy with manual annotation validation

Dataset

Initial Collection: A few thousand annotated screenshots with text bounding boxes (~100k text samples)
→ Preprocessing: Shuffled dataset with CLIP-based deduplication (~10k final samples)
→ Ground Truth: Agreement-based labeling with manual validation (70% agreement rate)

We have a few thousand annotated screenshots that contain bounding boxes for text, but the dataset does not contain text annotations. We crop all the text boxes from these screenshots, which yields ~100k samples. We then shuffle this dataset, use OpenAI CLIP to deduplicate the samples, and keep only ~10k. Since we do not have labels for these boxes, we compare the outputs of Autify OCR and Google OCR; if the outputs are similar, we consider them correct. There is a small probability that both engines are incorrect in the same way, but we consider it negligible.
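The shuffle-and-deduplicate step can be illustrated with a minimal sketch. It assumes each crop has already been turned into an embedding vector by an image encoder such as CLIP (computing the embeddings is out of scope here). The greedy strategy, the `deduplicate` function, and the `threshold` value are all hypothetical illustrations, not our production pipeline.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def deduplicate(embeddings, threshold=0.95, max_samples=None, seed=0):
    """Return indices of the samples kept after greedy near-duplicate removal.

    embeddings:  list of vectors, assumed to be precomputed image embeddings.
    threshold:   hypothetical cosine-similarity cutoff; two crops at or above
                 it are treated as duplicates.
    max_samples: optional cap on how many samples to keep.
    """
    order = list(range(len(embeddings)))
    random.Random(seed).shuffle(order)  # shuffle before deduplicating

    kept = []
    for idx in order:
        if max_samples is not None and len(kept) >= max_samples:
            break
        is_duplicate = any(
            cosine_similarity(embeddings[idx], embeddings[k]) >= threshold
            for k in kept
        )
        if not is_duplicate:
            kept.append(idx)
    return kept


# Toy example: the first two vectors are near-duplicates of each other.
toy_embeddings = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]]
kept_indices = deduplicate(toy_embeddings, threshold=0.95)
print(sorted(kept_indices))
```

In the toy example, only one of the first two vectors survives, together with the third. This pairwise loop compares every candidate against everything kept so far, so at the scale of ~100k crops an approximate nearest-neighbor index would likely be a better fit.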

Results

Initially, we planned to use the agreement of all three OCR engines for ground truth generation, which covered ~29% of the samples. However, after reviewing the results, we realized that EasyOCR was making a lot of mistakes. For the rest of the samples, EasyOCR agreed with Autify OCR on ~3% of the samples and with Google OCR on 5% of the samples, whereas Google OCR and Autify OCR agreed with each other on ~45% of the samples.
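The agreement breakdown described above can be sketched as follows. The comparison rule used here (exact match after collapsing whitespace) is an assumption made for illustration, since the similarity criterion is not spelled out above; the function and group names are hypothetical as well.

```python
from collections import Counter


def normalize(text):
    """Assumed normalization: trim and collapse runs of whitespace."""
    return " ".join(text.split())


def agreement_group(autify, google, easyocr):
    """Classify one sample by which engines produced the same text."""
    a, g, e = normalize(autify), normalize(google), normalize(easyocr)
    if a == g == e:
        return "all_three"
    if a == g:
        return "autify_google"
    if a == e:
        return "autify_easyocr"
    if g == e:
        return "google_easyocr"
    return "none"


def agreement_rates(predictions):
    """Fraction of samples in each agreement group.

    predictions: list of (autify, google, easyocr) output strings.
    """
    counts = Counter(agreement_group(*p) for p in predictions)
    total = len(predictions)
    return {group: count / total for group, count in counts.items()}


# Toy predictions, made up for illustration only.
toy_predictions = [
    ("v1.2", "v1.2", "v12"),   # Autify and Google agree
    ("of", "of", "of"),        # all three agree
    ("a", "b", "b"),           # Google and EasyOCR agree
    ("x", "y", "z"),           # no agreement
]
print(agreement_rates(toy_predictions))
```

With this grouping, the total agreement between Autify OCR and Google OCR is the sum of the `all_three` and `autify_google` fractions.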

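Thus, we dropped EasyOCR from ground truth estimation and relied only on Google OCR and Autify OCR. This time, the two engines were in agreement on ~70% of the samples. Next, we took ~150 random samples from the remaining 30% and annotated them in-house. On those samples, Autify OCR was correct 70% of the time, whereas Google OCR was correct 30% of the time. Counting the agreed samples as correct for both engines, this gives an estimated overall accuracy of 0.70 + 0.30 × 0.70 = 91% for Autify OCR and 0.70 + 0.30 × 0.30 = 79% for Google OCR.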
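As a worked example, the final accuracy estimates can be reproduced from the figures above. This sketch assumes that every sample on which the two engines agree counts as correct for both, and that the ~150 manually annotated samples are representative of the whole disagreement set; the function name is hypothetical.

```python
def estimated_accuracy(agreement_rate, accuracy_on_disagreement):
    """Combine the agreement rate with the accuracy measured on the rest.

    agreement_rate:           fraction of samples where both engines agree
                              (assumed correct for both engines).
    accuracy_on_disagreement: fraction of manually annotated disagreement
                              samples that the engine got right.
    """
    for value in (agreement_rate, accuracy_on_disagreement):
        if not 0.0 <= value <= 1.0:
            raise ValueError("rates must be between 0 and 1")
    return agreement_rate + (1.0 - agreement_rate) * accuracy_on_disagreement


autify_accuracy = estimated_accuracy(0.70, 0.70)
google_accuracy = estimated_accuracy(0.70, 0.30)
print(f"Autify OCR: {autify_accuracy:.0%}")  # Autify OCR: 91%
print(f"Google OCR: {google_accuracy:.0%}")  # Google OCR: 79%
```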

Agreement Analysis

All Three Engines: 29% (initial baseline)
Autify + Google: 70% (final ground truth)
Manual Validation: 150 samples annotated

Final Accuracy Estimates

Autify OCR: 91% (winner)
Google OCR: 79% (runner-up)

Analysis ✨

We analyzed these results to find some of the reasons behind the numbers. We found that one of the core strengths of Autify OCR is handling mixed Japanese and English content. Google OCR handled English-only and Japanese-only cases very well, sometimes better than Autify OCR. Another reason Autify OCR performs well is that its data domain is limited to screenshots and rendered text. We trained the model on a synthetic dataset with a very strong augmentation scheme, which generalizes very well to this use case.

Autify OCR Strengths

Excellent mixed Japanese & English content handling
Optimized for screenshot & rendered text domain
Strong augmentation scheme with synthetic data

Google OCR Strengths

Excellent English-only content recognition
Strong Japanese-only content performance
General-purpose OCR capabilities

Potential Limitations

One reason Google OCR and EasyOCR might not be performing well is the way the text boxes were cropped: we gave pre-annotated crops to each engine instead of using its own detector. Another reason, specific to EasyOCR, is that its pretrained model is optimized for real-world text; a lack of rendered text in its training data, or the breadth of its data domain, could be the cause of its low accuracy on this specific domain.

Samples

Result: Autify OCR ✓, Google OCR ✓, EasyOCR ✗

Autify OCR and Google OCR predicted this text correctly; however, EasyOCR did not recognize the 'Yen' symbol properly.

Result: Autify OCR ✓, Google OCR ✓, EasyOCR ✗

EasyOCR misclassifies 'of' as 'ot', possibly because of the very tight crop.

OCR Sample 3 - Version number punctuation

Result: Autify OCR ✓, Google OCR ✓, EasyOCR ✗

In this example, EasyOCR missed the dot '.' in the version number.

OCR Sample 4 - Digit recognition challenge

Result: Autify OCR ✓, Google OCR ✗, EasyOCR ✗

Here Google OCR also made a mistake, and EasyOCR did not recognize the digits properly.

Result: Autify OCR ✗, Google OCR ✓, EasyOCR ✓

Here Autify OCR made a mistake, confusing '.' with ':'. This is a systematic error caused by our data generation; we identified it recently and are working on fixing it.

Result: Autify OCR ✓, Google OCR ✗, EasyOCR ✓

Here Google OCR capitalized the last 's'.

Result: Autify OCR ✓, Google OCR ✓, EasyOCR ✗

An example of Korean text. Since we provided only 'en' and 'ja' as the languages to EasyOCR, it tried to match the text with Japanese characters.

Conclusion

For our initial version of MLUI, we used EasyOCR for a while. However, given the number of errors, we decided to move to another solution. We also explored Google OCR, but the challenge was that we could not improve it when it made mistakes. That is why we decided to build a highly optimized in-house solution: an encoder-decoder transformer trained on a massive synthetic dataset, with an architecture tuned for the trade-off between speed and accuracy.

Key Takeaways

Domain-Specific Training: Specialized training on screenshot and rendered text yields superior results for specific use cases
Multi-Language Support: Mixed Japanese & English content handling is a critical differentiator in mobile applications
Customization Advantage: In-house solutions allow for continuous improvement and error pattern correction
Performance Metrics: 91% accuracy achieved through synthetic data augmentation and transformer architecture
