OCR In Cloud

In this assignment we will code a real-world application to distributively apply OCR algorithms on images. Then we will display each image with its recognized text on a webpage.

How to run our project

Create AWS role with the following:
1. EC2FullAccess
2. S3FullAccess
3. SQSFullAccess
4. AdministratorAccess
You need to create a file localApp/src/main/resources/secure_info.properties

The content of the file should be

ami=<image to run for manager and workers>
arn=arn:aws:iam::<your AWS account number>:instance-profile/<The name of the profile>
keyName=<name of the keyPair>
securityGroupIds=<single security group id>

Create managerApp.jar and workerApp.jar and scp them to the instance to /home/ubuntu.
Create new Amazon AMI based on the instance.
Install JDK 15 or above
Install Tesseract 4.00, and make sure that its data is in /usr/share/tesseract-ocr/4.00/tessdata
After the AMI is available, copy its ID to localApp/src/main/resources/secure_info.properties
run local app with 4 args:
1. args[0]: input image file path
2. args[1]: output file path
3. args[2]: number of workers needed
4. args[3]: (optional) must be equals to "terminate"

Considerations

Did you think about scalability?

Yes, on one hand, the Manager doesn't hold local app inforation on the ram, only the amount of "connected" local apps.

The personal data such as number of urls remaining and return bucket name are held in temp bucket, its name is the personal return SQS queue of the local app.
The Manager app reads the links file that the local app uploaded for him line after line dynamically such that it won't be resource consuming to read the hole file.
The local app reads the final <link, OCR output> entries from the bucket entry dynamically when creating the file.
Worker App won't save the image on disk, and reads it to memory dynamically (we assume that the image size won't pass 800 MB)
We assume that image OCR description won't pass 200 KB

What about persistence?

We're catching all possible exeptions that might rais from the operation of our code, we do not protect against sudden termination from Amazon itself.

Threads in our application

Only the Manager uses 2 thread:

thread main: responsible for receiving messages (up to 10 at a time) from local apps and sending tasks to the workers
thread WorkerToManager: responsible for receiving messages (up to 10 at a time) and upload their cotent to the proper S3 bucket of the relevant local app

Technical stuff

We used Ubuntu 20.04 based AMI, and instance type of T2-micro
when used with 10 Workers and 24 links it takes about 3 minutes (including manager and workers startup time)

Name		Name	Last commit message	Last commit date
Latest commit History 90 Commits
.github		.github
localApp		localApp
managerApp		managerApp
workerApp		workerApp
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

OCR In Cloud

How to run our project

Considerations

Did you think about scalability?

What about persistence?

Threads in our application

Technical stuff

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

OCR In Cloud

How to run our project

Considerations

Did you think about scalability?

What about persistence?

Threads in our application

Technical stuff

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages