What's an Excerpt from a Book? A Simple Guide for Beginners


Okay, so today I’m gonna ramble about something I was messing with – ripping a chunk out of a book. Not literally ripping pages, chill out! I mean extracting text data, you know, for… reasons. Mostly curiosity, let’s be real.

First thing, I grabbed the book. It was a PDF, thankfully. Figured that’d be easier than trying to OCR an actual physical copy. I mean, who has time for that?

Then I hunted around for a decent PDF library in Python. Ended up settling on PyPDF2. Seemed simple enough. Installed it with pip install PyPDF2, the usual drill.

Next up, the code. I started by just trying to open the PDF and see if I could even read it. Something like this:


import PyPDF2

# Open in binary mode; the with-block closes the file when we're done
with open('my_*', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    print(len(pdf_reader.pages))  # Check number of pages

That worked! I got the number of pages. Felt like a small victory. But getting the text out was the real challenge.

I looped through the pages, extracting text from each one. The basic idea was:

import PyPDF2

text = ""

with open('my_*', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    # Walk every page and append whatever text it yields
    for page_num in range(len(pdf_reader.pages)):
        page = pdf_reader.pages[page_num]
        text += page.extract_text()

print(text)

Alright, that printed a ton of text to the console. Success, right? Not quite. The formatting was all messed up. Line breaks in weird places, words getting split up… Ugh.

Spent a while trying to clean up the text. Tried replacing newline characters, joining lines based on certain patterns… It was a mess. Honestly, I didn’t get it perfect. PDF formatting is a beast.
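For what it's worth, here's a rough sketch of the kind of cleanup I mean. It isn't the exact code I ran. The function name clean_extracted_text and the rules in it are just my assumptions about what usually helps: rejoin words split by a hyphen at a line end, unwrap single line breaks, and keep blank lines as paragraph breaks.

```python
import re


def clean_extracted_text(raw: str) -> str:
    """Tidy text pulled from a PDF (hypothetical helper, not part of PyPDF2)."""
    # Normalize Windows and old Mac line endings to "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Rejoin words split by a hyphen at a line end: "extrac-\ntion" -> "extraction".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Treat blank lines as paragraph boundaries.
    paragraphs = re.split(r"\n\s*\n", text)
    # Inside each paragraph, collapse line breaks and repeated spaces into one space.
    cleaned = [re.sub(r"\s+", " ", p).strip() for p in paragraphs]
    # Drop empty paragraphs and put one blank line between the rest.
    return "\n\n".join(p for p in cleaned if p)
```

Fair warning: the hyphen rule also glues together real compounds that happen to break across lines (like "well-" / "known"), so treat the output as a first pass, not a finished product.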

Then I decided to focus on extracting a specific chapter. I knew which pages it spanned. So, I modified the loop to only grab text from those pages:


import PyPDF2

# Indices are zero-based, so index 10 is the 11th page of the PDF
start_page = 10
end_page = 20

text = ""

with open('my_*', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    # end_page + 1 so the last page of the chapter is included
    for page_num in range(start_page, end_page + 1):
        page = pdf_reader.pages[page_num]
        text += page.extract_text()

print(text)

That gave me a more manageable chunk of text. Still needed cleaning, but at least it was focused.
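One thing that's easy to trip over here: the page indices are zero-based, while the page numbers you see in a PDF viewer start at 1. Below is a small sketch of a helper that takes 1-based page numbers and does the conversion for you. The name extract_page_range is made up for this post, and it only assumes that pages supports len() and indexing and that each page has an extract_text() method, which is how a PyPDF2 reader's pages behave.

```python
def extract_page_range(pages, first_page, last_page):
    """Join the text of pages first_page..last_page (1-based, inclusive).

    Hypothetical helper: `pages` is anything indexable whose items
    have an extract_text() method, e.g. a PyPDF2 reader's `pages`.
    """
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    if last_page > len(pages):
        raise ValueError(f"document only has {len(pages)} pages")

    parts = []
    # Convert 1-based page numbers to zero-based indices.
    for index in range(first_page - 1, last_page):
        # Fall back to "" in case a page yields no text at all.
        parts.append(pages[index].extract_text() or "")
    return "\n".join(parts)
```

With a reader open, the call would look something like extract_page_range(pdf_reader.pages, 11, 21) to cover the same pages as indices 10 through 20.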

Finally, I saved the extracted text to a file:

import PyPDF2

# Indices are zero-based, so index 10 is the 11th page of the PDF
start_page = 10
end_page = 20

text = ""

with open('my_*', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    for page_num in range(start_page, end_page + 1):
        page = pdf_reader.pages[page_num]
        text += page.extract_text()

# Write the excerpt out as UTF-8 so non-ASCII characters survive
with open('chapter_*', 'w', encoding='utf-8') as output_file:
    output_file.write(text)

And that was it! I had a text file containing an excerpt from the book. The formatting wasn’t perfect, but it was good enough for what I needed. Mainly just wanted to play around and see how it worked.

Learned a few things: PyPDF2 is okay for basic extraction, but PDF formatting is a pain. Might try a different library next time. Maybe something that handles more complex layouts better. Who knows, maybe I’ll even try OCR on a physical book someday. But probably not. Too much effort!
