OCR in the Cloud

This project was an assignment in a course I took during my bachelor's degree in computer science. It uses AWS services (EC2, S3, and SQS) to process a large amount of data.

For this assignment I built a real-world application that distributes OCR processing of images across multiple machines. The program's output is a webpage that displays each image alongside its recognized text.

The application is composed of a local application and instances running on the Amazon cloud. The application takes as input a text file containing a list of image URLs. Instances (workers) are then launched in AWS. Each worker downloads image files and uses an OCR library to identify any text in them, so that each image can be displayed with its text on a webpage.
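As a minimal sketch of the input handling described above, the class below turns the raw lines of the input file into a list of image URLs. The class name `ImageUrlList`, the one-URL-per-line layout, and the rule of skipping blank lines are assumptions made for illustration; the actual project may read the file differently.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: cleans the lines of the input file into image URLs. */
final class ImageUrlList {
    private ImageUrlList() {}

    /** Trims each line and keeps the non-empty ones, preserving their order. */
    static List<String> parse(List<String> lines) {
        List<String> urls = new ArrayList<>();
        for (String line : lines) {
            String url = line.trim();
            // Assumption: a blank line carries no URL and is skipped.
            if (!url.isEmpty()) {
                urls.add(url);
            }
        }
        return urls;
    }

    public static void main(String[] args) throws IOException {
        // Usage (hypothetical file name): java ImageUrlList.java images.txt
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        parse(lines).forEach(System.out::println);
    }
}
```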

The OCR tool we used is Tesseract (installed only on the workers).
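One way a worker could call Tesseract is through its command-line tool, sketched below. This assumes the `tesseract` executable is on the worker's `PATH` and that it is run as an external process; the project may instead use a Java wrapper library. The class name `TesseractCli` is hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

/** Hypothetical wrapper around the tesseract command-line tool. */
final class TesseractCli {
    private TesseractCli() {}

    /** Builds the command line; "stdout" as the output base prints the text instead of writing a file. */
    static List<String> command(Path image) {
        return Arrays.asList("tesseract", image.toString(), "stdout");
    }

    /** Runs tesseract on a local image file and returns the recognized text. */
    static String recognize(Path image) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(image))
                // Forward tesseract's diagnostics to this process's stderr.
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        // InputStream.readAllBytes requires Java 9 or later.
        byte[] output = process.getInputStream().readAllBytes();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        return new String(output, StandardCharsets.UTF_8).trim();
    }
}
```

Keeping the command construction separate from process execution makes it possible to check the arguments on a machine where Tesseract is not installed, which matches the setup here (Tesseract only on the workers).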

The Application Flow

Local Application uploads the file with the list of images to S3
Local Application sends a message to an SQS queue stating the location of the image list on S3
Local Application does one of the following:
- Starts the manager
- Checks if a manager is active and if not, starts it
Manager downloads list of images
Manager creates an SQS message for each URL in the list of images
Manager bootstraps nodes to process messages
Worker gets an image message from an SQS queue
Worker downloads the image indicated in the message
Worker applies OCR to the image
Worker puts a message in an SQS queue indicating the original URL of the image and the recognized text
Manager reads all the Workers' messages from SQS and creates one summary file
Manager uploads summary file to S3
Manager posts an SQS message about summary file
Local Application reads SQS message
Local Application downloads summary file from S3
Local Application creates HTML output files
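The message flow above can be sketched in a single process, with in-memory `BlockingQueue`s standing in for the SQS queues and a plain function standing in for the OCR step. This is only an illustration of the flow, not the project's code: the names `FlowSketch` and `Result`, the HTML layout, and the replacement of S3, SQS, and EC2 by local objects are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

/** Hypothetical in-memory model of the manager / worker / local application flow. */
final class FlowSketch {
    /** A worker's result message: the original image URL and the recognized text. */
    static final class Result {
        final String url;
        final String text;
        Result(String url, String text) { this.url = url; this.text = text; }
    }

    /** Manager: creates one task message per URL in the image list. */
    static void enqueueTasks(List<String> urls, BlockingQueue<String> tasks) {
        tasks.addAll(urls);
    }

    /** Worker: takes task messages until the queue is empty and posts a result for each. */
    static void runWorker(BlockingQueue<String> tasks, BlockingQueue<Result> results,
                          Function<String, String> ocr) {
        String url;
        while ((url = tasks.poll()) != null) {
            results.add(new Result(url, ocr.apply(url)));
        }
    }

    /** Manager: waits for the expected number of results and gathers them into one summary. */
    static List<Result> collect(BlockingQueue<Result> results, int expected) {
        List<Result> summary = new ArrayList<>();
        try {
            while (summary.size() < expected) {
                summary.add(results.take());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for results", e);
        }
        return summary;
    }

    /** Local application: renders the summary as an HTML page (layout is an assumption). */
    static String toHtml(List<Result> summary) {
        StringBuilder html = new StringBuilder("<html><body>\n");
        for (Result r : summary) {
            html.append("<p><img src=\"").append(escape(r.url)).append("\"><br>")
                .append(escape(r.text)).append("</p>\n");
        }
        return html.append("</body></html>\n").toString();
    }

    /** Escapes characters that are special in HTML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        BlockingQueue<String> tasks = new LinkedBlockingQueue<>();
        BlockingQueue<Result> results = new LinkedBlockingQueue<>();
        List<String> urls = Arrays.asList("http://example.com/a.png", "http://example.com/b.png");
        enqueueTasks(urls, tasks);
        // Placeholder OCR function; a real worker would download the image and run Tesseract.
        runWorker(tasks, results, url -> "recognized text for " + url);
        System.out.print(toHtml(collect(results, urls.size())));
    }
}
```

In the real application several workers would poll the task queue concurrently from separate EC2 instances; the manager knows how many results to wait for because it created exactly one task per URL.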

