What's an Excerpt from a Book? Simple Guide for Beginners

April 4, 2025

Okay, so today I’m gonna ramble about something I was messing with – ripping a chunk out of a book. Not literally ripping pages, chill out! I mean extracting text data, you know, for… reasons. Mostly curiosity, let’s be real.

First thing, I grabbed the book. It was a PDF, thankfully. Figured that’d be easier than trying to OCR an actual physical copy. I mean, who has time for that?

Then I hunted around for a decent PDF library in Python. Ended up settling on PyPDF2. Seemed simple enough. Installed it with pip install PyPDF2, the usual drill.

Next up, the code. I started by just trying to open the PDF and see if I could even read it. Something like this:


import PyPDF2

pdf_file = open('my_*', 'rb')  # replace 'my_*' with the path to your PDF

pdf_reader = PyPDF2.PdfReader(pdf_file)
print(len(pdf_reader.pages))  # Check number of pages
pdf_file.close()

That worked! I got the number of pages. Felt like a small victory. But getting the text out was the real challenge.

I looped through the pages, extracting text from each one. The basic idea was:


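import PyPDF2

pdf_file = open('my_*', 'rb')  # replace 'my_*' with the path to your PDF
pdf_reader = PyPDF2.PdfReader(pdf_file)

text = ""
for page_num in range(len(pdf_reader.pages)):
    page = pdf_reader.pages[page_num]
    text += page.extract_text()  # Append this page's text

pdf_file.close()
print(text)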

Alright, that printed a ton of text to the console. Success, right? Not quite. The formatting was all messed up. Line breaks in weird places, words getting split up… Ugh.

Spent a while trying to clean up the text. Tried replacing newline characters, joining lines based on certain patterns… It was a mess. Honestly, I didn’t get it perfect. PDF formatting is a beast.
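
For a feel of what that kind of cleanup can look like, here is a minimal sketch using only Python's built-in re module. It is not the exact code from this experiment: the function name clean_extracted_text and the sample string are made up for illustration, and the rules are heuristics. In particular, the hyphen rule will also glue together genuine compounds that happen to break across a line, such as "well-" / "known".

```python
import re


def clean_extracted_text(raw: str) -> str:
    """Tidy text pulled out of a PDF (heuristic; will not suit every layout)."""
    # Normalize Windows/old-Mac line endings to "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Rejoin words hyphenated across a line break: "stor-\nmy" -> "stormy".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Treat blank lines as paragraph breaks.
    paragraphs = re.split(r"\n\s*\n", text)
    # Inside each paragraph, collapse line breaks and repeated spaces to one space.
    cleaned = [" ".join(p.split()) for p in paragraphs]
    # Drop empty paragraphs and separate the rest with one blank line.
    return "\n\n".join(p for p in cleaned if p)


# Hypothetical sample, shaped like typical PDF extraction output.
sample = (
    "It was a dark and stor-\nmy night; the rain\nfell in torrents.\n\n\n"
    "Chapter two\nbegins here."
)
print(clean_extracted_text(sample))
```

This only handles the two problems mentioned above (stray line breaks and split words). Headers, footers, and multi-column pages would need their own rules.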

Then I decided to focus on extracting a specific chapter. I knew which pages it spanned. So, I modified the loop to only grab text from those pages:


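import PyPDF2

pdf_file = open('my_*', 'rb')  # replace 'my_*' with the path to your PDF
pdf_reader = PyPDF2.PdfReader(pdf_file)

# Page indices are zero-based, so index 10 is the 11th page of the PDF
start_page = 10
end_page = 20

text = ""
for page_num in range(start_page, end_page + 1):  # + 1 so end_page is included
    page = pdf_reader.pages[page_num]
    text += page.extract_text()

pdf_file.close()
print(text)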

That gave me a more manageable chunk of text. Still needed cleaning, but at least it was focused.
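
One thing that is easy to trip over here: the page list in PyPDF2 is zero-indexed, so the numbers in the loop are positions in the file, not the page numbers a PDF viewer shows (and printed page numbers can be off again because of front matter). A small helper can do the conversion. This is only a sketch, and page_indices is a made-up name, not part of PyPDF2.

```python
def page_indices(first_page: int, last_page: int, total_pages: int) -> range:
    """Convert a 1-based, inclusive page span into 0-based indices."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    # Clamp the end so a span running past the last page cannot cause an IndexError.
    last_page = min(last_page, total_pages)
    return range(first_page - 1, last_page)


print(list(page_indices(11, 13, 300)))    # [10, 11, 12]
print(list(page_indices(299, 305, 300)))  # [298, 299]
```

The returned range can be used directly in the extraction loop in place of range(start_page, end_page + 1).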

Finally, I saved the extracted text to a file:


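import PyPDF2

pdf_file = open('my_*', 'rb')  # replace 'my_*' with the path to your PDF
pdf_reader = PyPDF2.PdfReader(pdf_file)

start_page = 10
end_page = 20

text = ""
for page_num in range(start_page, end_page + 1):
    page = pdf_reader.pages[page_num]
    text += page.extract_text()

pdf_file.close()

# Write the excerpt out as UTF-8; replace 'chapter_*' with your output filename
with open('chapter_*', 'w', encoding='utf-8') as output_file:
    output_file.write(text)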

And that was it! I had a text file containing an excerpt from the book. The formatting wasn’t perfect, but it was good enough for what I needed. Mainly just wanted to play around and see how it worked.
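
If the steps above were bundled into something reusable, it might look like the sketch below. It assumes a recent PyPDF2 release that provides PdfReader. The names extract_range and save_excerpt are invented for this example, and unlike the snippets above it joins pages with a newline rather than gluing them together directly.

```python
from pathlib import Path


def extract_range(reader, start_index: int, end_index: int) -> str:
    """Join the text of pages start_index..end_index (0-based, inclusive).

    `reader` is anything with a `.pages` sequence whose items offer
    `.extract_text()`, such as a PyPDF2.PdfReader.
    """
    parts = []
    for index in range(start_index, end_index + 1):
        # Fall back to "" if a page yields no text (e.g. a scanned image).
        parts.append(reader.pages[index].extract_text() or "")
    return "\n".join(parts)


def save_excerpt(pdf_path: str, out_path: str, start_index: int, end_index: int) -> None:
    """Extract a page range from pdf_path and write it to out_path as UTF-8."""
    # Imported here (an assumption of this sketch) so extract_range stays
    # usable and testable even where PyPDF2 is not installed.
    import PyPDF2

    # The `with` block closes the PDF even if extraction raises an error.
    with open(pdf_path, "rb") as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        text = extract_range(reader, start_index, end_index)
    Path(out_path).write_text(text, encoding="utf-8")
```

Calling save_excerpt with your own input path, output path, and the indices 10 and 20 would cover the same span as the example above.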

Learned a few things: PyPDF2 is okay for basic extraction, but PDF formatting is a pain. Might try a different library next time. Maybe something that handles more complex layouts better. Who knows, maybe I’ll even try OCR on a physical book someday. But probably not. Too much effort!
