Commit b17b68c

Using Tesseract.js to OCR every image on a page
1 parent 54b85db commit b17b68c

1 file changed: 27 additions, 0 deletions
@@ -0,0 +1,27 @@
# Using Tesseract.js to OCR every image on a page

Pasting this code into a DevTools console should load [Tesseract.js](https://github.com/naptha/tesseract.js) from a CDN, loop through every image loaded by that page (every PNG, GIF, JPG or JPEG), run OCR on each one and log the result to the console.

There's one major catch: the images need to be served in a context that allows JavaScript to read their content, either from the same origin as the page or from a different origin with a permissive CORS policy.

Very few sites do this! It worked on www.google.com for me, where it successfully OCRed the Google logo as containing the text "Google".

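To see which images depend on CORS headers, it can help to split the URLs by origin first. The following is a minimal sketch, not part of the original snippet: `isSameOrigin` is a hypothetical helper name, and it relies only on the standard `URL` class. It compares scheme, host and port, which is what the browser's same-origin check uses.

```javascript
// Hypothetical helper (not from the original note): returns true when
// imageUrl has the same origin (scheme + host + port) as pageUrl.
function isSameOrigin(imageUrl, pageUrl) {
  // Relative image URLs are resolved against the page URL before comparing.
  return new URL(imageUrl, pageUrl).origin === new URL(pageUrl).origin;
}
```

In a console, something like `imageUrls.filter(u => !isSameOrigin(u, location.href))` would list the images that can only be read if their server sends a permissive `Access-Control-Allow-Origin` header.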
```javascript
// Load Tesseract.js from a CDN; attach the handler before the script is added to the page.
var s = document.createElement("script");
s.src = "https://unpkg.com/[email protected]/dist/tesseract.min.js";
s.onload = async () => {
  // Collect the URL of every image resource the page has loaded so far.
  const imageUrls = performance.getEntries().map(f => f.name).filter(
    n => n.includes('.jpg') || n.includes('.gif') || n.includes('.png') || n.includes('.jpeg')
  );
  // Start an OCR worker and load the English language data.
  const worker = Tesseract.createWorker();
  await worker.load();
  await worker.loadLanguage('eng');
  await worker.initialize('eng');
  // OCR each image in turn, logging its URL followed by the recognized text.
  for (const url of imageUrls) {
    console.log(url);
    const { data: { text } } = await worker.recognize(url);
    console.log(text);
  }
  // Shut the worker down once every image has been processed.
  await worker.terminate();
};
document.head.appendChild(s);
```
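
The filter in the snippet matches `.png`, `.jpg` and so on anywhere in the entry name, so it can also pick up URLs such as `page.png.html` or ones that only mention an extension in the query string. A stricter variant might look like the sketch below. `hasImageExtension` is a hypothetical name introduced here for illustration, and the choice to ignore the query string and fragment is an assumption about what is wanted.

```javascript
// Hypothetical helper (not from the original note): true when the URL's path
// ends in one of the image extensions the snippet looks for. The match is
// case-insensitive and ignores any query string or fragment.
function hasImageExtension(url) {
  let pathname;
  try {
    pathname = new URL(url).pathname;
  } catch (e) {
    // Entry names that are not absolute URLs (for example paint timing
    // entries such as "first-paint") are skipped.
    return false;
  }
  return /\.(png|gif|jpe?g)$/i.test(pathname);
}
```

It could be dropped into the snippet as `performance.getEntries().map(f => f.name).filter(hasImageExtension)`.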