Update README.md

This commit is contained in:
Nick Sweeting 2017-05-29 18:53:08 -05:00 committed by GitHub
parent 8e52c52fef
commit b2312a088d

View file

@ -1,6 +1,6 @@
# Pocket/Pinboard/Browser Bookmark Website Archiver <img src="https://getpocket.com/favicon.ico" height="22px"/> <img src="https://pinboard.in/favicon.ico" height="22px"/> [![Twitter URL](https://img.shields.io/twitter/url/http/shields.io.svg?style=social)](https://twitter.com/thesquashSH)
(Your own personal Way-Back Machine)
(Your own personal Way-Back Machine) [DEMO: sweeting.me/pocket](https://home.sweeting.me/pocket)
Save an archived copy of all websites you star using Pocket, Pinboard, or Browser bookmarks.
Outputs browsable html archives of each site, a PDF, a screenshot, and a link to a copy on archive.org, all indexed in a nice html file.
@ -12,41 +12,35 @@ NEW: Also submits each link to save on archive.org!
## Quickstart
`archive.py` is a script that takes a [Pocket](https://getpocket.com/export) export, and turns it into a browsable html archive that you can store locally or host online.
```bash
./archive.py pocket_export.html pocket # See below for how to install dependencies
```
**Runtime:** I've found it takes about an hour to download 1000 articles, and they'll take up roughly 1GB.
Those numbers are from running it single-threaded on my i5 machine with 50mbps down. YMMV.
`archive.py` is a script that takes a [Pocket](https://getpocket.com/export), [Pinboard](https://pinboard.in/export/), or [Browser Bookmark](https://support.google.com/chrome/answer/96816?hl=en) html export file, and turns it into a browsable archive that you can store locally or host online.
**Dependencies:** `google-chrome >= 59`,` wget >= 1.16`, `python3 >= 3.5` ([chromium](https://www.chromium.org/getting-involved/download-chromium) >= v59 also works well, yay open source!)
**1. Install dependencies:** `google-chrome >= 59`,` wget >= 1.16`, `python3 >= 3.5` ([chromium](https://www.chromium.org/getting-involved/download-chromium) >= v59 also works well, yay open source!)
```bash
# On Mac:
brew install Caskroom/versions/google-chrome-canary wget python3
echo -e '#!/bin/bash\n/Applications/Google\ Chrome\ Canary.app/Contents/MacOS/Google\ Chrome\ Canary "$@"' > /usr/local/bin/google-chrome
chmod +x /usr/local/bin/google-chrome
# On Linux:
wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | sudo apt-key add -
sudo sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list'
apt update; apt install google-chrome-beta python3 wget
# Check:
google-chrome --version && which wget && which python3 && echo "[√] All dependencies installed."
```
On some Linux distributions the python3 package might not be recent enough.
If this is the case for you, resort to installing a recent enough version manually.
```bash
add-apt-repository ppa:fkrull/deadsnakes && apt update && apt install python3.6
```
If you still need help, [the official Python docs](https://docs.python.org/3.6/using/unix.html) are a good place to start.
To swtich from Google Chrome to chromium, change the `CHROME_BINARY` variable at the top of `archive.py`.
If you're missing `wget` or `curl`, simply install them using `apt` or your package manager of choice.
**2. Run the archive script:**
**Archiving:**
1. Download your pocket export file `ril_export.html` from https://getpocket.com/export
2. Download this repo `git clone https://github.com/pirate/pocket-archive-stream`
1. Download your export file e.g. `ril_export.html` from https://getpocket.com/export
2. Clone the repo `git clone https://github.com/pirate/pocket-archive-stream`
3. `cd pocket-archive-stream/`
4. `./archive.py ~/Downloads/ril_export.html [pinboard|pocket]`
4. `./archive.py ~/Downloads/ril_export.html [pocket|pinboard|bookmarks]`
It produces a folder `pocket/` containing an `index.html`, and archived copies of all the sites,
organized by timestamp. For each sites it saves:
@ -60,7 +54,22 @@ You can tweak parameters like screenshot size, file paths, timeouts, etc. in `ar
You can also tweak the outputted html index in `index_template.html`. It just uses python
format strings (not a proper templating engine like jinja2), which is why the CSS is double-bracketed `{{...}}`.
**Live Updating:** (coming soon)
**Estimated Runtime:** I've found it takes about an hour to download 1000 articles, and they'll take up roughly 1GB.
Those numbers are from running it single-threaded on my i5 machine with 50mbps down. YMMV.
**Troubleshooting:**
On some Linux distributions the python3 package might not be recent enough.
If this is the case for you, resort to installing a recent enough version manually.
```bash
add-apt-repository ppa:fkrull/deadsnakes && apt update && apt install python3.6
```
If you still need help, [the official Python docs](https://docs.python.org/3.6/using/unix.html) are a good place to start.
To switch from Google Chrome to chromium, change the `CHROME_BINARY` variable at the top of `archive.py`.
If you're missing `wget` or `curl`, simply install them using `apt` or your package manager of choice.
**Live Updating:** (coming soon... maybe...)
It's possible to pull links via the pocket API or public pocket RSS feeds instead of downloading an html export.
Once I write a script to do that, we can stick this in `cron` and have it auto-update on it's own.
@ -101,15 +110,18 @@ Now I can rest soundly knowing important articles and resources I like wont diss
My published archive as an example: [sweeting.me/pocket](https://home.sweeting.me/pocket).
## Security WARNING
## Security WARNING & Content Disclaimer
Hosting other people's site content has security implications for your domain, make sure you understand
the dangers of hosting other people's CSS & JS files [on your domain](https://developer.mozilla.org/en-US/docs/Web/Security/Same-origin_policy). It's best to put this on a domain
of its own to slightly mitigate [CSRF attacks](https://en.wikipedia.org/wiki/Cross-site_request_forgery).
It might also be prudent to blacklist your archive in your `robots.txt` so that search engines dont index
You may also want to blacklist your archive in your `/robots.txt` so that search engines dont index
the content on your domain.
Be aware that some sites you archive may not allow you to rehost their content publicly for copyright reasons,
it's up to you to host responsibly and respond to takedown requests appropriately.
## TODO
- body text extraction using [fathom](https://hacks.mozilla.org/2017/04/fathom-a-framework-for-understanding-web-pages/)
@ -120,11 +132,13 @@ the content on your domain.
- automatic text summaries of article with summarization library
- feature image extraction
- http support (from my https-only domain)
- try getting dead links from archive.org (https://github.com/hartator/wayback-machine-downloader)
- try wgetting dead sites from archive.org (https://github.com/hartator/wayback-machine-downloader)
## Links
- [Hacker News Discussion](https://news.ycombinator.com/item?id=14272133)
- [Reddit r/selfhosted Discussion](https://www.reddit.com/r/selfhosted/comments/69eoi3/pocket_stream_archive_your_own_personal_wayback/)
- [Reddit r/datahoarder Discussion](https://www.reddit.com/r/DataHoarder/comments/69e6i9/archive_a_browseable_copy_of_your_saved_pocket/)
- https://wallabag.org + https://github.com/wallabag/wallabag
- https://webrecorder.io/
- https://github.com/ikreymer/webarchiveplayer#auto-load-warcs