diff --git a/README.md b/README.md
index e8492472..b8892b06 100644
--- a/README.md
+++ b/README.md
@@ -13,7 +13,7 @@
@@ -72,10 +72,9 @@ The goal is to sleep soundly knowing the part of the internet you care about wil
Expand for quick copy-pastable install commands... ⤵️
-mkdir ~/archivebox; cd ~/archivebox # create a dir somewhere for your archivebox data
-
-# Option A: Get ArchiveBox with Docker Compose (recommended):
-curl -sSL 'https://docker-compose.archivebox.io' > docker-compose.yml # edit options in this file as-needed
+# Option A: Get ArchiveBox with Docker Compose (recommended):
+mkdir -p ~/archivebox/data && cd ~/archivebox
+curl -fsSL 'https://docker-compose.archivebox.io' > docker-compose.yml # edit options in this file as-needed
docker compose run archivebox init --setup
# docker compose run archivebox add 'https://example.com'
# docker compose run archivebox help
@@ -83,6 +82,7 @@ docker compose run archivebox init --setup
# Option B: Or use it as a plain Docker container:
+mkdir -p ~/archivebox/data && cd ~/archivebox/data
docker run -it -v $PWD:/data archivebox/archivebox init --setup
# docker run -it -v $PWD:/data archivebox/archivebox add 'https://example.com'
# docker run -it -v $PWD:/data archivebox/archivebox help
@@ -91,6 +91,7 @@ docker run -it -v $PWD:/data archivebox/archivebox init --setup
# Option C: Or install it with your preferred pkg manager (see Quickstart below for apt, brew, and more)
pip install archivebox
+mkdir -p ~/archivebox/data && cd ~/archivebox/data
archivebox init --setup
# archivebox add 'https://example.com'
# archivebox help
@@ -98,7 +99,7 @@ archivebox init --setup
# Option D: Or use the optional auto setup script to install it
-curl -sSL 'https://get.archivebox.io' | sh
+curl -fsSL 'https://get.archivebox.io' | sh
Open http://localhost:8000
to see your server's Web UI ➡️
@@ -182,9 +183,9 @@ ArchiveBox is free for everyone to self-host, but we also provide support, secur
docker-compose.yml
file into a new empty directory (can be anywhere).
-mkdir ~/archivebox && cd ~/archivebox
+
mkdir -p ~/archivebox/data && cd ~/archivebox
# Read and edit docker-compose.yml options as-needed after downloading
-curl -sSL 'https://docker-compose.archivebox.io' > docker-compose.yml
+curl -fsSL 'https://docker-compose.archivebox.io' > docker-compose.yml
docker compose run archivebox init --setup
@@ -208,7 +209,7 @@ See below for more usage examples using the C
See below for more usage examples using the CLI, Web UI, or filesystem/SQL/Python to manage your archive.
mkdir ~/archivebox && cd ~/archivebox
+
mkdir -p ~/archivebox/data && cd ~/archivebox/data
docker run -v $PWD:/data -it archivebox/archivebox init --setup
@@ -256,12 +257,16 @@ See "Against curl | sh as a
+curl -sSL 'https://get.archivebox.io' | sh
curl -fsSL 'https://get.archivebox.io' | sh
pip3
(or pipx
).
+See the Install: Bare Metal Wiki for full install instructions for each OS...
pip3 install archivebox
+archivebox version
+# install any missing extras shown using apt/brew/pkg/etc.
+# python@3.10 node curl wget git ripgrep ...
mkdir ~/archivebox && cd ~/archivebox
-archivebox init --setup
-# install any missing extras like wget/git/ripgrep/etc. manually as needed
+
mkdir -p ~/archivebox/data && cd ~/archivebox/data # for example
+archivebox init --setup # initialize a new collection
+# (--setup auto-installs and links JS dependencies: singlefile, readability, etc.)
-See the pip-archivebox
repo for more details about this distribution.
+
+See the pip-archivebox
repo for more details about this distribution.
mkdir ~/archivebox && cd ~/archivebox
+mkdir -p ~/archivebox/data && cd ~/archivebox/data
archivebox init --setup # if any problems, install with pip instead
-Note: If you encounter issues with NPM/NodeJS, install a more recent version.
+Note: If you encounter issues or want more granular instructions, see the Install: Bare Metal Wiki.
archivebox server 0.0.0.0:8000
@@ -323,9 +329,10 @@ See the debian-a
brew tap archivebox/archivebox
brew install archivebox
+See the Install: Bare Metal Wiki for more granular instructions for macOS... ➡️
mkdir ~/archivebox && cd ~/archivebox
+mkdir -p ~/archivebox/data && cd ~/archivebox/data
archivebox init --setup # if any problems, install with pip instead
homebr
- Arch:
yay -S archivebox
(contributed by @imlonghao
)
-- FreeBSD:
curl -sSL 'https://get.archivebox.io' | sh
(uses pkg
+ pip3
under-the-hood)
+- FreeBSD:
curl -fsSL 'https://get.archivebox.io' | sh
(uses pkg
+ pip3
under-the-hood)
- Nix:
nix-env --install archivebox
(contributed by @siraben
)
- Guix:
guix install archivebox
(contributed by @rakino
)
- More: contribute another distribution...!
@@ -461,13 +468,14 @@ mkdir -p ~/archivebox/data # create a new data dir anywhere
cd ~/archivebox/data # IMPORTANT: cd into the directory
# archivebox [subcommand] [--help]
+archivebox version
archivebox help
# equivalent: docker compose run archivebox [subcommand] [--help]
docker compose run archivebox help
# equivalent: docker run -it -v $PWD:/data archivebox/archivebox [subcommand] [--help]
- docker run -it -v $PWD:/data archivebox/archivebox help
+docker run -it -v $PWD:/data archivebox/archivebox help
```
#### ArchiveBox Subcommands
@@ -677,7 +685,7 @@ It uses all available methods out-of-the-box, but you can disable extractors and
Expand to see the full list of ways it saves each page...
-./archive/{Snapshot.id}/
+data/archive/{Snapshot.id}/
- Index:
index.html
& index.json
HTML and JSON index files containing metadata and details
- Title, Favicon, Headers Response headers, site favicon, and parsed site title
@@ -808,18 +816,18 @@ All of ArchiveBox's state (SQLite DB, content, config, logs, etc.) is stored in
Expand to learn more about the layout of ArchiveBox's data on-disk...
-Data folders can be created anywhere (`~/archivebox` or `$PWD/data` as seen in our examples), and you can create as many data folders as you want to hold different collections.
+Data folders can be created anywhere (`~/archivebox/data` or `$PWD/data` as seen in our examples), and you can create as many data folders as you want to hold different collections.
All archivebox
CLI commands are designed to be run from inside an ArchiveBox data folder, starting with archivebox init
to initialize a new collection inside an empty directory.
-mkdir ~/archivebox && cd ~/archivebox # just an example, can be anywhere
+mkdir -p ~/archivebox/data && cd ~/archivebox/data # just an example, can be anywhere
archivebox init
-The on-disk layout is optimized to be easy to browse by hand and durable long-term. The main index is a standard index.sqlite3
database in the root of the data folder (it can also be exported as static JSON/HTML), and the archive snapshots are organized by date-added timestamp in the ./archive/
subfolder.
+The on-disk layout is optimized to be easy to browse by hand and durable long-term. The main index is a standard index.sqlite3
database in the root of the data folder (it can also be exported as static JSON/HTML), and the archive snapshots are organized by date-added timestamp in the data/archive/
subfolder.
-/data/
+data/
index.sqlite3
ArchiveBox.conf
archive/
@@ -834,7 +842,7 @@ The on-disk layout is optimized to be easy to browse by hand and durable long-te
...
-Each snapshot subfolder ./archive/TIMESTAMP/
includes a static index.json
and index.html
describing its contents, and the snapshot extractor outputs are plain files within the folder.
+Each snapshot subfolder data/archive/TIMESTAMP/
includes a static index.json
and index.html
describing its contents, and the snapshot extractor outputs are plain files within the folder.
Learn More
@@ -1048,9 +1056,9 @@ Because ArchiveBox is designed to ingest a large volume of URLs with multiple co
Disk usage can be reduced by using a compressed/deduplicated filesystem like ZFS/BTRFS, or by turning off extractor methods you don't need. You can also deduplicate content with a tool like [fdupes](https://github.com/adrianlopezroche/fdupes) or [rdfind](https://github.com/pauldreik/rdfind).
-**Don't store large collections on older filesystems like EXT3/FAT** as they may not be able to handle more than 50k directory entries in the `archive/` folder.
+**Don't store large collections on older filesystems like EXT3/FAT** as they may not be able to handle more than 50k directory entries in the `data/archive/` folder.
-**Try to keep the `index.sqlite3` file on local drive (not a network mount)** or SSD for maximum performance, however the `archive/` folder can be on a network mount or slower HDD.
+**Try to keep the `data/index.sqlite3` file on a local drive (not a network mount)** or SSD for maximum performance; the `data/archive/` folder can be on a network mount or slower HDD.
If using Docker or NFS/SMB/FUSE for the `data/archive/` folder, you may need to set [`PUID` & `PGID`](https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#puid--pgid) and [disable `root_squash`](https://github.com/ArchiveBox/ArchiveBox/issues/1304) on your fileshare server.
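
To see how close a collection is to the 50k directory-entry figure mentioned above, the snapshot folders can be counted directly. This is a minimal sketch, not part of ArchiveBox: `count_snapshot_dirs` is a hypothetical helper, and it assumes the `data/archive/` layout described in this README.

```python
from pathlib import Path


def count_snapshot_dirs(data_dir):
    """Count the immediate subdirectories of <data_dir>/archive/.

    Hypothetical helper: assumes one subdirectory per snapshot,
    as described in the README's on-disk layout section.
    """
    archive_dir = Path(data_dir) / "archive"
    if not archive_dir.is_dir():
        # No archive/ folder yet (e.g. before `archivebox init`).
        return 0
    # Only directories are counted; stray files are ignored.
    return sum(1 for entry in archive_dir.iterdir() if entry.is_dir())
```

For example, `count_snapshot_dirs(Path.home() / "archivebox" / "data")` would report the number of snapshot folders in the example data directory used throughout this README.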
@@ -1441,7 +1449,7 @@ https://stackoverflow.com/questions/1074212/how-can-i-see-the-raw-sql-queries-dj
ArchiveBox [`extractors`](https://github.com/ArchiveBox/ArchiveBox/blob/dev/archivebox/extractors/media.py) are external binaries or Python/Node scripts that ArchiveBox runs to archive content on a page.
-Extractors take the URL of a page to archive, write their output to the filesystem `archive/TIMESTAMP/EXTRACTOR/...`, and return an [`ArchiveResult`](https://github.com/ArchiveBox/ArchiveBox/blob/dev/archivebox/core/models.py#:~:text=return%20qs-,class%20ArchiveResult,-(models.Model)%3A) entry which is saved to the database (visible on the `Log` page in the UI).
+Extractors take the URL of a page to archive, write their output to the filesystem `data/archive/TIMESTAMP/EXTRACTOR/...`, and return an [`ArchiveResult`](https://github.com/ArchiveBox/ArchiveBox/blob/dev/archivebox/core/models.py#:~:text=return%20qs-,class%20ArchiveResult,-(models.Model)%3A) entry which is saved to the database (visible on the `Log` page in the UI).
*Check out how we added **[`archivebox/extractors/singlefile.py`](https://github.com/ArchiveBox/ArchiveBox/blob/dev/archivebox/extractors/singlefile.py)** as an example of the process: [Issue #399](https://github.com/ArchiveBox/ArchiveBox/issues/399) + [PR #403](https://github.com/ArchiveBox/ArchiveBox/pull/403).*
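
Because extractor outputs are written as plain files under `data/archive/TIMESTAMP/`, they can be inspected without going through ArchiveBox. The following is a rough sketch under that assumption: `list_snapshot_outputs` is a hypothetical helper (not an ArchiveBox API) that only lists what is on disk and makes no assumptions about file contents.

```python
from pathlib import Path


def list_snapshot_outputs(data_dir):
    """Map each snapshot folder name to the sorted names of its entries.

    Hypothetical helper: assumes the data/archive/TIMESTAMP/ layout
    described in the README.
    """
    archive_dir = Path(data_dir) / "archive"
    outputs = {}
    if not archive_dir.is_dir():
        return outputs
    for snapshot_dir in sorted(archive_dir.iterdir()):
        if snapshot_dir.is_dir():
            # Each entry is an index file or an extractor's output.
            outputs[snapshot_dir.name] = sorted(
                entry.name for entry in snapshot_dir.iterdir()
            )
    return outputs
```

Calling it on a data folder returns a plain dictionary, which makes it easy to spot snapshots where an expected extractor output is missing.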