In this assignment we will code a real-world application that applies OCR algorithms to images in a distributed manner. Then we will display each image with its recognized text on a webpage.
- Create AWS role with the following:
  - EC2FullAccess
  - S3FullAccess
  - SQSFullAccess
  - AdministratorAccess
- You need to create a file `localApp/src/main/resources/secure_info.properties`
- The content of the file should be (one key per line):
  - `ami=<image to run for manager and workers>`
  - `arn=arn:aws:iam::<your AWS account number>:instance-profile/<The name of the profile>`
  - `keyName=<name of the keyPair>`
  - `securityGroupIds=<single security group id>`
- Create `managerApp.jar` and `workerApp.jar` and `scp` them to the instance to `/home/ubuntu`.
- Create new Amazon AMI based on the instance.
- Install JDK 15 or above
- Install Tesseract 4.00, and make sure that its data is in `/usr/share/tesseract-ocr/4.00/tessdata`
- After the AMI is available, copy its ID to `localApp/src/main/resources/secure_info.properties`
- Run the local app with 4 args:
  - `args[0]`: input image file path
  - `args[1]`: output file path
  - `args[2]`: number of workers needed
  - `args[3]`: (optional) must be equal to "terminate"
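The setup steps above could be checked in code before anything is launched. The following is a minimal sketch, not the project's actual implementation: `LocalAppSetup`, `loadSecureInfo`, `parseArgs` and `Args` are hypothetical names, and the rule that the worker count must be at least 1 is an assumption.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.Properties;

// Hypothetical helpers for illustration; none of these names come from the project.
final class LocalAppSetup {
    // The four keys the README lists for secure_info.properties.
    static final List<String> REQUIRED_KEYS =
            List.of("ami", "arn", "keyName", "securityGroupIds");

    // Loads the properties and fails fast if a required key is missing or blank.
    static Properties loadSecureInfo(Reader source) {
        Properties props = new Properties();
        try {
            props.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        for (String key : REQUIRED_KEYS) {
            String value = props.getProperty(key);
            if (value == null || value.isBlank()) {
                throw new IllegalArgumentException("Missing required key: " + key);
            }
        }
        return props;
    }

    // Parsed command line of the local app.
    static final class Args {
        final String inputPath;
        final String outputPath;
        final int workers;
        final boolean terminate;

        Args(String inputPath, String outputPath, int workers, boolean terminate) {
            this.inputPath = inputPath;
            this.outputPath = outputPath;
            this.workers = workers;
            this.terminate = terminate;
        }
    }

    // Accepts 3 or 4 arguments; the optional 4th must be exactly "terminate".
    static Args parseArgs(String[] args) {
        if (args.length < 3 || args.length > 4) {
            throw new IllegalArgumentException(
                    "usage: <input path> <output path> <workers> [terminate]");
        }
        int workers = Integer.parseInt(args[2]);
        if (workers < 1) { // assumption: at least one worker is required
            throw new IllegalArgumentException("workers must be at least 1");
        }
        boolean terminate = false;
        if (args.length == 4) {
            if (!"terminate".equals(args[3])) {
                throw new IllegalArgumentException("args[3] must be \"terminate\"");
            }
            terminate = true;
        }
        return new Args(args[0], args[1], workers, terminate);
    }
}
```

In real use, `loadSecureInfo` could be given a reader opened with `Files.newBufferedReader(Path.of("localApp/src/main/resources/secure_info.properties"))`.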
Yes. The Manager doesn't hold local app information in RAM, only the number of "connected" local apps.
- Per-app data, such as the number of URLs remaining and the return bucket name, is held in a temporary bucket whose name is the name of the local app's personal return SQS queue.
- The Manager app reads the links file that the local app uploaded line by line, so it never has to load the whole file into memory.
- The local app reads the final `<link, OCR output>` entries from the bucket one at a time when creating the output file.
- The Worker app doesn't save the image to disk; it reads it directly into memory (we assume that the image size won't exceed 800 MB).
- We assume that the OCR output for an image won't exceed 200 KB.
- We catch all exceptions that might be raised by the operation of our code; we do not protect against sudden termination by Amazon itself.
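The line-by-line reading described in the list above can be sketched as follows. This is a minimal illustration, assuming the links file is available as an `InputStream` (for instance the body of an S3 object). `LinkStreamer` and `forEachLink` are hypothetical names, and skipping blank lines is an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

// Hypothetical helper for illustration; not taken from the project.
final class LinkStreamer {
    // Reads one line at a time and passes each non-blank link to the callback.
    // Only the current line is held in memory. Returns the number of links seen.
    static int forEachLink(InputStream in, Consumer<String> onLink) {
        int count = 0;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String link = line.trim();
                if (link.isEmpty()) {
                    continue; // assumption: blank lines are not links
                }
                onLink.accept(link);
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }
}
```

The callback is where a task message per link would be sent to the workers, and the returned count corresponds to the "number of URLs remaining" that is tracked per local app.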
Only the Manager uses 2 threads:
- thread `main`: responsible for receiving messages (up to 10 at a time) from local apps and sending tasks to the workers
- thread `WorkerToManager`: responsible for receiving messages (up to 10 at a time) and uploading their content to the proper S3 bucket of the relevant local app
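The two receive loops above follow the same pattern: wait for a message, take up to 10, handle the batch, repeat. The sketch below illustrates that pattern with an in-memory `BlockingQueue` standing in for an SQS queue, so it makes no AWS calls. `BatchPoller`, `receiveBatch` and `startLoop` are hypothetical names, and the 200 ms wait is an arbitrary choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical helper for illustration; a BlockingQueue stands in for SQS.
final class BatchPoller {
    static final int MAX_BATCH = 10;

    // Waits up to timeoutMillis for a first message, then takes whatever else
    // is already queued, up to 10 messages in total. Returns an empty list on
    // timeout or if the thread is interrupted while waiting.
    static <T> List<T> receiveBatch(BlockingQueue<T> queue, long timeoutMillis) {
        List<T> batch = new ArrayList<>(MAX_BATCH);
        try {
            T first = queue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
            if (first != null) {
                batch.add(first);
                queue.drainTo(batch, MAX_BATCH - 1);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the flag for the caller
        }
        return batch;
    }

    // Starts a named thread that keeps polling and handling batches until
    // it is interrupted.
    static <T> Thread startLoop(String name, BlockingQueue<T> queue,
                                Consumer<List<T>> handler) {
        Thread thread = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                List<T> batch = receiveBatch(queue, 200);
                if (!batch.isEmpty()) {
                    handler.accept(batch);
                }
            }
        }, name);
        thread.start();
        return thread;
    }
}
```

Under these assumptions, the Manager would call `startLoop("WorkerToManager", workerResults, uploadBatch)` once and run the equivalent loop for local app messages on its `main` thread, where `workerResults` and `uploadBatch` are placeholders for the real queue and handler.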
- We used an Ubuntu 20.04 based AMI and the `t2.micro` instance type.
- With 10 workers and 24 links, a run takes about 3 minutes (including manager and worker startup time).