File Formats

File Palaces supports a wide range of file types. Text is extracted automatically during mining — you don't need to convert files before adding them to a Wing.

Supported formats

Format	Extensions	Library	Notes
PDF	`.pdf`	pypdf	Text-layer PDFs only. Scanned PDFs require OCR (not yet built-in).
Word	`.docx`	python-docx	Full text + tables extracted. Headers and footers included.
Excel	`.xlsx`	openpyxl	Each sheet is extracted separately as a Room.
Excel (legacy)	`.xls`	xlrd	Older binary format. Read-only extraction.
CSV	`.csv`	stdlib	Each row treated as a document unit.
Plain text	`.txt`	stdlib	UTF-8 and common encodings auto-detected.
Markdown	`.md`, `.mdx`	stdlib	Markdown syntax preserved in extracted text.
Audio	`.mp3`, `.wav`, `.ogg`, `.flac`, `.m4a`	OpenAI Whisper	Transcribed to text locally using the Whisper model.
Video	`.mp4`, `.mov`, `.mkv`	OpenAI Whisper	Audio track extracted, then transcribed.
Email	`.eml`, `.msg`	stdlib / extract-msg	Subject, sender, date, and body indexed.
ZIP archive	`.zip`	stdlib	Contents extracted and each file processed individually.
Web URL	URL input	httpx + BeautifulSoup	Page text scraped and indexed. JavaScript-heavy pages may be incomplete.
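
As an illustration of the Email row, the sketch below pulls the four indexed fields (subject, sender, date, body) out of an `.eml` file using only the standard library. The function name `extract_eml_fields` and the returned dictionary layout are hypothetical, chosen for this example; they are not File Palaces' internal API.

```python
from email import message_from_bytes, policy


def extract_eml_fields(raw: bytes) -> dict:
    """Return subject, sender, date, and plain-text body of a raw .eml message.

    Hypothetical helper for illustration only.
    """
    # policy.default yields EmailMessage objects with get_body()/get_content().
    msg = message_from_bytes(raw, policy=policy.default)

    # Prefer the text/plain part; fall back to an empty body if none exists.
    body_part = msg.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""

    return {
        "subject": str(msg["subject"] or ""),
        "sender": str(msg["from"] or ""),
        "date": str(msg["date"] or ""),
        "body": body,
    }


raw_message = (
    b"From: alice@example.com\r\n"
    b"To: bob@example.com\r\n"
    b"Subject: Quarterly report\r\n"
    b"Date: Mon, 01 Jan 2024 10:00:00 +0000\r\n"
    b"\r\n"
    b"Numbers attached.\r\n"
)
fields = extract_eml_fields(raw_message)
```

Outlook `.msg` files are a different binary format, which is why the table lists extract-msg alongside the standard library.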

WHISPER TRANSCRIPTION

Audio and video transcription uses OpenAI Whisper running locally: no API key is required and no audio leaves your machine. The default model is base (~74M parameters). For better accuracy on noisy recordings, switch to small or medium in Settings → Transcription.

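Transcription is significantly slower than text extraction: expect roughly 1× real time for the base model on a modern CPU, meaning a 30-minute recording takes about 30 minutes to transcribe. The larger small and medium models take longer.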
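
For readers who want to see what local transcription looks like, here is a minimal sketch using the open-source `openai-whisper` package (`whisper.load_model` and `model.transcribe`). The wrapper `transcribe_locally` and the `DOCUMENTED_MODELS` tuple are hypothetical names for this example, and restricting the choice to the three models named above is an assumption. How File Palaces wires Whisper in internally may differ.

```python
from pathlib import Path

# Model names mentioned in this section. Whisper ships others too;
# limiting to these three is an assumption made for the example.
DOCUMENTED_MODELS = ("base", "small", "medium")


def transcribe_locally(media_path, model_name="base"):
    """Transcribe an audio or video file on this machine and return the text.

    Hypothetical wrapper for illustration only.
    """
    if model_name not in DOCUMENTED_MODELS:
        raise ValueError(
            f"unknown model {model_name!r}; expected one of {DOCUMENTED_MODELS}"
        )

    # Imported here so the module loads even when Whisper is not installed.
    # Requires `pip install openai-whisper` and ffmpeg available on PATH.
    import whisper

    model = whisper.load_model(model_name)  # weights are downloaded on first use
    result = model.transcribe(str(Path(media_path)))
    return result["text"].strip()
```

A call such as `transcribe_locally("interview.mp3", model_name="small")` would run entirely on the local machine; the file name here is only a placeholder.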

Planned formats

The following formats are on the roadmap but not yet supported:

Format	Status
PowerPoint (`.pptx`)	Planned — v0.6
EPUB	Planned
HTML (local files)	Planned
RTF	Planned
Scanned PDF (OCR)	Planned — requires Tesseract or a cloud OCR option
Notion export	Planned

Unsupported files

Files with unsupported extensions are skipped during mining without interrupting it. They are recorded in the mining error log (if enabled) and do not block other files from being indexed.

Binary files (images, executables, compiled code) are always skipped.
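
The skip behavior can be pictured as a lookup against the extensions in the Supported formats table. The sketch below is an assumption about how such a check might work, not the actual implementation: `SUPPORTED` and `partition_by_support` are hypothetical names, and matching extensions case-insensitively is also an assumption.

```python
from pathlib import Path

# Extension -> format label, copied from the Supported formats table.
SUPPORTED = {
    ".pdf": "PDF",
    ".docx": "Word",
    ".xlsx": "Excel",
    ".xls": "Excel (legacy)",
    ".csv": "CSV",
    ".txt": "Plain text",
    ".md": "Markdown",
    ".mdx": "Markdown",
    ".mp3": "Audio",
    ".wav": "Audio",
    ".ogg": "Audio",
    ".flac": "Audio",
    ".m4a": "Audio",
    ".mp4": "Video",
    ".mov": "Video",
    ".mkv": "Video",
    ".eml": "Email",
    ".msg": "Email",
    ".zip": "ZIP archive",
}


def partition_by_support(paths):
    """Split paths into (path, format) pairs to mine and paths to skip."""
    supported, skipped = [], []
    for path in paths:
        # Assumption: extensions are compared case-insensitively.
        label = SUPPORTED.get(Path(path).suffix.lower())
        if label is None:
            skipped.append(str(path))  # would go to the error log, if enabled
        else:
            supported.append((str(path), label))
    return supported, skipped


to_mine, to_skip = partition_by_support(
    ["report.PDF", "notes.md", "photo.png", "setup.exe"]
)
```

Here `to_mine` holds the PDF and Markdown files with their format labels, while the image and the executable end up in `to_skip` and the rest of the batch proceeds.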

Encoding detection

For plain-text files, File Palaces uses chardet to detect encoding before decoding. Files that cannot be decoded as any known encoding are skipped.
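
A minimal sketch of the detect-then-decode step, using chardet's `detect` function, which returns a dictionary with an `encoding` key. The function `decode_text` and its injectable `detect` parameter are hypothetical, added so the example can be exercised without chardet installed; returning `None` stands in for "skip this file".

```python
from typing import Callable, Optional


def decode_text(
    raw: bytes, detect: Optional[Callable[[bytes], dict]] = None
) -> Optional[str]:
    """Decode raw bytes using a detected encoding, or return None to skip.

    Hypothetical helper for illustration only.
    """
    if detect is None:
        # Deferred import: requires `pip install chardet`.
        import chardet

        detect = chardet.detect

    guess = detect(raw)
    encoding = guess.get("encoding")
    if not encoding:
        return None  # detector could not identify any encoding

    try:
        return raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # Wrong guess, or a codec name Python does not know: skip the file.
        return None


text = decode_text("café".encode("utf-8"), detect=lambda b: {"encoding": "utf-8"})
```

In normal use the second argument is omitted and chardet makes the guess; the lambda above only fixes the detector's answer for the example.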

Large files

There is no hard file size limit, but very large files (>50 MB of text) can slow down mining significantly. Consider splitting large documents if mining speed is a concern.

The chunker caps individual chunks at 512 tokens, so even very long files are ingested correctly — they just produce more Drawers.
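
To make the 512-token cap concrete, the sketch below splits a token list into consecutive chunks of at most 512 tokens. It is an illustration under stated assumptions: whitespace-separated words stand in for tokens because the tokenizer is not specified here, chunks do not overlap, and `chunk_tokens` is a hypothetical name.

```python
MAX_TOKENS = 512  # per-chunk cap stated above


def chunk_tokens(tokens, max_tokens=MAX_TOKENS):
    """Split a token list into consecutive chunks of at most max_tokens each."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    return [tokens[i : i + max_tokens] for i in range(0, len(tokens), max_tokens)]


# Stand-in for a long document: 1,300 whitespace-separated "tokens".
document = " ".join(f"word{i}" for i in range(1300))
chunks = chunk_tokens(document.split())
chunk_sizes = [len(chunk) for chunk in chunks]
```

Under these assumptions a 1,300-token document yields three chunks (512, 512, and 276 tokens), so file length affects the number of Drawers rather than whether ingestion succeeds.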