This project was an assignment for a course in my computer science bachelor's degree. It uses AWS services (EC2, S3 and SQS) to process a large amount of data.
For this assignment I wrote a real-world application that applies OCR algorithms to images in a distributed fashion. The program's output is a webpage that displays each image with its recognized text.
The application is composed of a local application and instances running on the Amazon cloud. It takes as input a text file containing a list of image URLs. Worker instances are then launched on AWS. Each worker downloads image files and uses an OCR library to identify text in those images (if any), so that each image can be displayed with its text on a webpage.
The OCR tool we used is Tesseract (installed only on the workers).
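
The worker's part of this architecture can be sketched roughly as follows. This is a minimal in-memory illustration, not the project's actual code: Python's `queue.Queue` stands in for the SQS queues, and `run_worker`, `fetch` and `ocr` are hypothetical names, with `fetch` standing in for the image download and `ocr` for the call to Tesseract.

```python
import queue
from typing import Callable


def run_worker(
    tasks: "queue.Queue[str]",
    results: "queue.Queue[tuple[str, str]]",
    fetch: Callable[[str], bytes],
    ocr: Callable[[bytes], str],
) -> int:
    """Drain the task queue and return the number of images processed.

    `fetch` and `ocr` are injected so the loop itself does not depend on
    any particular HTTP client or OCR library (both are assumptions here).
    """
    processed = 0
    while True:
        try:
            # With SQS this would be a "receive message" call instead.
            url = tasks.get_nowait()
        except queue.Empty:
            return processed
        try:
            text = ocr(fetch(url))
        except Exception as exc:
            # A dead link or unreadable image should not stop the worker.
            text = f"ERROR: {exc}"
        # Report the original URL together with the recognized text.
        results.put((url, text))
        processed += 1
```

Passing the download and OCR steps in as functions keeps the loop easy to test without network access or a Tesseract installation; reporting failures as text is one possible choice, not necessarily what the assignment required.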
- Local Application uploads the file with the list of images to S3
- Local Application sends a message to an SQS queue stating the location of the images list on S3
- Local Application does one of the two:
  - Starts the manager
  - Checks if a manager is active and, if not, starts it
- Manager downloads list of images
- Manager creates an SQS message for each URL in the list of images
- Manager bootstraps nodes to process messages
- Worker gets an image message from an SQS queue
- Worker downloads the image indicated in the message
- Worker applies OCR to the image
- Worker puts a message in an SQS queue indicating the original URL of the image and the recognized text
- Manager reads all the Workers' messages from SQS and creates one summary file
- Manager uploads summary file to S3
- Manager posts an SQS message about summary file
- Local Application reads SQS message
- Local Application downloads summary file from S3
- Local Application creates HTML output files
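
The last steps of the flow above (the manager's summary file and the local application's HTML output) might look something like the sketch below. The summary format is an assumption: one JSON object per line with `url` and `text` keys. The function names `build_summary` and `render_html` are hypothetical and the page layout is only illustrative.

```python
import html
import json


def build_summary(results):
    """Serialize (url, text) pairs as one JSON object per line.

    JSON escapes newlines inside the OCR text, so every result stays
    on a single line of the summary.
    """
    return "\n".join(json.dumps({"url": u, "text": t}) for u, t in results)


def render_html(summary: str) -> str:
    """Turn a summary string into a simple HTML page."""
    rows = []
    for line in summary.splitlines():
        if not line.strip():
            continue  # skip blank lines
        item = json.loads(line)
        # Escape both values: OCR output and URLs may contain <, > or quotes.
        url = html.escape(item["url"], quote=True)
        text = html.escape(item["text"])
        rows.append(f'<p><img src="{url}"><br>{text}</p>')
    return "<html><body>\n" + "\n".join(rows) + "\n</body></html>"
```

In the real flow the summary string would be uploaded to S3 by the manager and downloaded by the local application before rendering; here it is simply passed between the two functions.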
