File Formats
File Palaces supports a wide range of file types. Text is extracted automatically during mining — you don't need to convert files before adding them to a Wing.
Supported formats
| Format | Extensions | Library | Notes |
|---|---|---|---|
| PDF | .pdf | pypdf | Text-layer PDFs only. Scanned PDFs require OCR (not yet built-in). |
| Word | .docx | python-docx | Full text + tables extracted. Headers and footers included. |
| Excel | .xlsx | openpyxl | Each sheet is extracted separately as a Room. |
| Excel (legacy) | .xls | xlrd | Older binary format. Read-only extraction. |
| CSV | .csv | stdlib | Each row treated as a document unit. |
| Plain text | .txt | stdlib | UTF-8 and common encodings auto-detected. |
| Markdown | .md, .mdx | stdlib | Markdown syntax preserved in extracted text. |
| Audio | .mp3, .wav, .ogg, .flac, .m4a | OpenAI Whisper | Transcribed to text locally using the Whisper model. |
| Video | .mp4, .mov, .mkv | OpenAI Whisper | Audio track extracted, then transcribed. |
| Email | .eml, .msg | stdlib / extract-msg | Subject, sender, date, and body indexed. |
| ZIP archive | .zip | stdlib | Contents extracted and each file processed individually. |
| Web URL | URL input | httpx + BeautifulSoup | Page text scraped and indexed. JavaScript-heavy pages may be incomplete. |
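The table above can be read as a lookup from file extension to extraction library. The following is a minimal sketch of that lookup, not File Palaces' actual code: `EXTRACTORS` and `library_for` are hypothetical names, and case-insensitive extension matching is an assumption.

```python
from pathlib import Path
from typing import Optional

# Extension -> extraction library, copied from the table above.
# EXTRACTORS and library_for are hypothetical names used for illustration.
EXTRACTORS = {
    ".pdf": "pypdf",
    ".docx": "python-docx",
    ".xlsx": "openpyxl",
    ".xls": "xlrd",
    ".csv": "stdlib",
    ".txt": "stdlib",
    ".md": "stdlib",
    ".mdx": "stdlib",
    ".eml": "stdlib / extract-msg",
    ".msg": "stdlib / extract-msg",
    ".zip": "stdlib",
}

# Audio and video files all go through Whisper.
for ext in (".mp3", ".wav", ".ogg", ".flac", ".m4a", ".mp4", ".mov", ".mkv"):
    EXTRACTORS[ext] = "OpenAI Whisper"


def library_for(filename: str) -> Optional[str]:
    """Return the library listed for this file's extension, or None if unsupported."""
    # Lower-casing the suffix is an assumption; the docs do not say
    # whether matching is case-sensitive.
    return EXTRACTORS.get(Path(filename).suffix.lower())
```

For example, `library_for("notes.md")` returns `"stdlib"`, while a file such as `photo.png` returns `None` because its extension is not in the table.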
Audio and video transcription uses OpenAI Whisper running locally — no API key required and no audio leaves your machine. The default model is base (~74 MB). For better accuracy on noisy recordings, switch to small or medium in Settings → Transcription.
Transcription is significantly slower than text extraction: expect roughly 1× real-time for base on a modern CPU, so a one-hour recording takes about an hour to transcribe.
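As a rough illustration of local transcription, the sketch below calls the open-source `openai-whisper` Python package directly. It assumes that package and ffmpeg are installed. The `transcribe` wrapper and `DOCUMENTED_MODELS` are hypothetical names, and this is not necessarily how File Palaces invokes Whisper internally.

```python
# Model names mentioned in this section; Whisper itself ships other sizes too.
DOCUMENTED_MODELS = ("base", "small", "medium")


def transcribe(path: str, model_name: str = "base") -> str:
    """Transcribe an audio or video file locally and return the text."""
    if model_name not in DOCUMENTED_MODELS:
        raise ValueError(f"unexpected model name: {model_name!r}")

    # Imported lazily because loading the package is slow and it is only
    # needed for audio and video files.
    import whisper  # provided by the openai-whisper package

    model = whisper.load_model(model_name)  # downloads the weights on first use
    result = model.transcribe(path)         # runs on this machine, no API key
    return result["text"].strip()
```

Calling `transcribe("interview.mp3", model_name="small")` trades speed for accuracy in the same way as changing the model in Settings → Transcription.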
Planned formats
The following formats are on the roadmap but not yet supported:
| Format | Status |
|---|---|
| PowerPoint (.pptx) | Planned (v0.6) |
| EPUB | Planned |
| HTML (local files) | Planned |
| RTF | Planned |
| Scanned PDF (OCR) | Planned — requires Tesseract or a cloud OCR option |
| Notion export | Planned |
Unsupported files
Files with unsupported extensions are skipped during mining without interrupting the run. They are recorded in the mining error log (if enabled) and do not block other files from being indexed.
Binary files (images, executables, compiled code) are always skipped.
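The skip-and-continue behaviour described above might look something like the sketch below. The function name, the logger name, and passing the supported extensions in as a set are all assumptions made for illustration. The standard `logging` module stands in for the mining error log.

```python
import logging
from pathlib import Path
from typing import Iterable, List, Set, Tuple

log = logging.getLogger("mining")  # hypothetical logger name


def partition_files(
    paths: Iterable[str], supported: Set[str]
) -> Tuple[List[str], List[str]]:
    """Split paths into files to mine and files to skip, logging each skip."""
    to_mine: List[str] = []
    skipped: List[str] = []
    for path in paths:
        if Path(path).suffix.lower() in supported:
            to_mine.append(path)
        else:
            # A skipped file is logged but never raises, so the
            # remaining files are still processed.
            log.warning("skipped unsupported file: %s", path)
            skipped.append(path)
    return to_mine, skipped
```

With `supported = {".txt", ".pdf"}`, the input `["a.txt", "b.png", "c.PDF"]` yields `["a.txt", "c.PDF"]` to mine and `["b.png"]` skipped.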
Encoding detection
For plain-text files, File Palaces uses chardet to detect encoding before decoding. Files that cannot be decoded as any known encoding are skipped.
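A minimal sketch of detect-then-decode with chardet follows. `decode_text` is a hypothetical helper. Trying UTF-8 before consulting chardet is an assumption added here for speed and is not confirmed by this page. Returning `None` represents a file that would be skipped.

```python
from typing import Optional


def decode_text(raw: bytes) -> Optional[str]:
    """Decode raw file bytes to text, or return None if no encoding fits."""
    # Assumed fast path: most text files are valid UTF-8.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass

    import chardet  # third-party package

    guess = chardet.detect(raw)  # dict with "encoding" and "confidence" keys
    encoding = guess["encoding"]
    if encoding is None:
        return None  # chardet could not identify an encoding

    try:
        return raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # The guess was wrong, or Python has no codec under that name.
        return None
```

A caller would read the file in binary mode, pass the bytes to `decode_text`, and skip the file when the result is `None`.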
Large files
There is no hard file size limit, but very large files (>50 MB of text) can slow down mining significantly. Consider splitting large documents if mining speed is a concern.
The chunker caps individual chunks at 512 tokens, so even very long files are ingested correctly — they just produce more Drawers.
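To make the 512-token cap concrete, here is a simplified sketch of fixed-size chunking. Whitespace-separated words stand in for tokens because this page does not say which tokenizer the chunker uses. `chunk_tokens` is a hypothetical name, and the real chunker may also split on document structure.

```python
from typing import List

MAX_TOKENS = 512  # cap stated above


def chunk_tokens(tokens: List[str], max_tokens: int = MAX_TOKENS) -> List[List[str]]:
    """Split a token list into consecutive chunks of at most max_tokens each."""
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]


# Whitespace splitting is a stand-in for the real tokenizer.
tokens = ("word " * 1200).split()
chunks = chunk_tokens(tokens)
print([len(chunk) for chunk in chunks])  # [512, 512, 176]
```

A longer file does not change the size of any chunk. It only increases the number of chunks, and therefore the number of Drawers.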