Okay, so today I’m gonna ramble about something I was messing with – ripping a chunk out of a book. Not literally ripping pages, chill out! I mean extracting text data, you know, for… reasons. Mostly curiosity, let’s be real.
First thing, I grabbed the book. It was a PDF, thankfully. Figured that’d be easier than trying to OCR an actual physical copy. I mean, who has time for that?
Then I hunted around for a decent PDF library in Python. Ended up settling on PyPDF2. Seemed simple enough. Installed it with pip install PyPDF2, the usual drill.
Next up, the code. I started by just trying to open the PDF and see if I could even read it. Something like this:
import PyPDF2

# Open the PDF in binary mode (swap in your own file path)
pdf_file = open('my_*', 'rb')
pdf_reader = PyPDF2.PdfReader(pdf_file)
print(len(pdf_reader.pages))  # Check number of pages
pdf_file.close()
That worked! I got the number of pages. Felt like a small victory. But getting the text out was the real challenge.
I looped through the pages, extracting text from each one. The basic idea was:
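import PyPDF2

pdf_file = open('my_*', 'rb')
pdf_reader = PyPDF2.PdfReader(pdf_file)
text = ""
# Walk every page and append whatever text PyPDF2 can pull out of it
for page_num in range(len(pdf_reader.pages)):
    page = pdf_reader.pages[page_num]
    text += page.extract_text()
pdf_file.close()
print(text)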
Alright, that printed a ton of text to the console. Success, right? Not quite. The formatting was all messed up. Line breaks in weird places, words getting split up… Ugh.
Spent a while trying to clean up the text. Tried replacing newline characters, joining lines based on certain patterns… It was a mess. Honestly, I didn’t get it perfect. PDF formatting is a beast.
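For what it's worth, here's a rough sketch of the kind of cleanup I mean. It's not the exact code I ran. The function name clean_extracted_text and the rules in it are made up for illustration: it glues back words that were hyphenated across a line break, treats blank lines as paragraph breaks, and unwraps everything else. It will also wrongly merge real hyphenated words that happen to land at a line end (think "well-" / "known"), so treat it as a starting point.

```python
import re


def clean_extracted_text(raw: str) -> str:
    """Tidy text pulled from a PDF: rejoin split words and unwrap lines.

    Hypothetical helper for illustration, not part of PyPDF2.
    """
    # Normalize Windows and old Mac line endings first.
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Rejoin words split by a hyphen at a line end: "extrac-\ntion" -> "extraction".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Blank lines count as paragraph breaks.
    paragraphs = re.split(r"\n\s*\n", text)
    # Inside each paragraph, collapse all whitespace runs (including
    # single newlines) into one space.
    cleaned = [" ".join(paragraph.split()) for paragraph in paragraphs]
    # Drop paragraphs that ended up empty.
    return "\n\n".join(paragraph for paragraph in cleaned if paragraph)
```

You'd call it on the big string from the loop, something like print(clean_extracted_text(text)).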
Then I decided to focus on extracting a specific chapter. I knew which pages it spanned. So, I modified the loop to only grab text from those pages:
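import PyPDF2

pdf_file = open('my_*', 'rb')
pdf_reader = PyPDF2.PdfReader(pdf_file)
# These are zero-based indices into pdf_reader.pages,
# so 10 through 20 means the 11th through 21st pages of the file
start_page = 10
end_page = 20
text = ""
for page_num in range(start_page, end_page + 1):  # + 1 so end_page is included
    page = pdf_reader.pages[page_num]
    text += page.extract_text()
pdf_file.close()
print(text)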
That gave me a more manageable chunk of text. Still needed cleaning, but at least it was focused.
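One thing that's easy to trip over here: the page list is indexed from zero, while the page numbers you'd count off in a viewer start at one. Below is a small sketch of a helper that takes 1-based, inclusive page numbers and does the conversion for you. The name extract_page_range and the HasText type are my own inventions, not PyPDF2 API. It only assumes each page object has an extract_text() method, and it assumes you're counting pages from the start of the file, which often won't match the numbers printed in the book because of front matter.

```python
from typing import Protocol, Sequence


class HasText(Protocol):
    """Anything with an extract_text() method, e.g. a PyPDF2 page object."""

    def extract_text(self) -> str: ...


def extract_page_range(pages: Sequence[HasText], first_page: int, last_page: int) -> str:
    """Return the text of pages first_page..last_page (1-based, inclusive).

    Hypothetical helper for illustration, not part of PyPDF2.
    """
    if not 1 <= first_page <= last_page <= len(pages):
        raise ValueError(
            f"page range {first_page}-{last_page} is outside 1-{len(pages)}"
        )
    chunks = []
    # Page N in 1-based counting lives at index N - 1.
    for index in range(first_page - 1, last_page):
        # Fall back to "" in case a page yields no text at all.
        chunks.append(pages[index].extract_text() or "")
    # Separate pages with a newline so words at page edges don't run together.
    return "\n".join(chunks)
```

With the reader from the code above, extract_page_range(pdf_reader.pages, 11, 21) should cover the same pages as the loop with start_page = 10 and end_page = 20, as long as you call it before closing the file.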
Finally, I saved the extracted text to a file:
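import PyPDF2

pdf_file = open('my_*', 'rb')
pdf_reader = PyPDF2.PdfReader(pdf_file)
start_page = 10  # zero-based index of the first page to grab
end_page = 20    # zero-based index of the last page to grab
text = ""
for page_num in range(start_page, end_page + 1):
    page = pdf_reader.pages[page_num]
    text += page.extract_text()
pdf_file.close()
# Write the excerpt as UTF-8 so non-ASCII characters survive
with open('chapter_*', 'w', encoding='utf-8') as output_file:
    output_file.write(text)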
And that was it! I had a text file containing an excerpt from the book. The formatting wasn’t perfect, but it was good enough for what I needed. Mainly just wanted to play around and see how it worked.
Learned a few things: PyPDF2 is okay for basic extraction, but PDF formatting is a pain. Might try a different library next time. Maybe something that handles more complex layouts better. Who knows, maybe I’ll even try OCR on a physical book someday. But probably not. Too much effort!