32 commits

Author SHA1 Message Date
Nick Sweeting
457c42bf84 load EXTRACTORS dynamically using importlib.import_module 2024-05-11 22:28:59 -07:00
Nick Sweeting
c1fd2cfa42 tag URLs immediately once added instead of waiting until archival completes 2024-01-03 20:31:46 -08:00
Nick Sweeting
78d942ac22 show more detail in readabiliity error messages 2024-01-03 20:09:31 -08:00
Nick Sweeting
5b07a1126c add comment about why DOM is preferred over singlefile for readability parsing 2024-01-03 19:09:24 -08:00
Nick Sweeting
2c54e55697 prefer dom dump to singlefile for generating readability output 2024-01-02 19:50:56 -08:00
Nick Sweeting
82d8662c74 add more readability error output 2023-10-20 04:14:28 -07:00
prnake
011bd104cb remove unused import 2022-02-09 10:48:51 +08:00
papersnake
de8e22efb7 improve title extractor 2022-02-08 23:17:52 +08:00
Nick Sweeting
eb4d3bca9d Update readability.py 2021-05-13 00:13:32 -04:00
Nick Sweeting
a9986f1f05 add timezone support, tons of CSS and layout improvements, more detailed snapshot admin form info, ability to sort by recently updated, better grid view styling, better table layouts, better dark mode support 2021-04-10 04:21:36 -04:00
Nick Sweeting
bd6d9c165b enforce utf8 on literally all file operations because windows sucks 2021-03-27 01:16:29 -04:00
Nick Sweeting
acb932ba12 improve readability and mercury error handling and fix output path to be relative 2021-02-16 15:53:11 -05:00
Nick Sweeting
d0f8a5e710 change mercury atomic_write output order 2021-02-16 06:19:16 -05:00
Dan Arnfield
5420903102 Refactor should_save_extractor methods to accept overwrite parameter 2021-01-21 15:56:32 -06:00
JDC
b1f70b2197 Initial implementation 2020-12-06 01:12:45 +02:00
Nick Sweeting
a645f36b87 add comment about fake cmd 2020-09-01 19:42:22 -04:00
Cristian
66037535fd feat: Add curl command on readability as default command to debug 2020-09-01 10:16:24 -05:00
Cristian
bf3ea42141 fix: Add a default cmd value to handle case where the html cannot be retrieved 2020-08-27 09:51:33 -05:00
Nick Sweeting
a2c158e43e catch OSErrors due to missing path 2020-08-18 19:09:45 -04:00
Nick Sweeting
7144e0bdce search for node dependencies in output dir first 2020-08-18 18:40:19 -04:00
Nick Sweeting
92de20af15 better detect missing dependencies on startup 2020-08-18 04:38:13 -04:00
Cristian
05c71fc302 fix: Organize readability extractor so a timeout does not break the whole process 2020-08-17 08:34:40 -05:00
Nick Sweeting
03b73bfe77 Update archivebox/extractors/readability.py 2020-08-14 12:55:22 -04:00
Cristian
5dc7e63792 feat: Update dockerfile to support readability 2020-08-11 11:52:43 -05:00
Cristian
2a68af1b94 tests: Add readability tests 2020-08-11 11:15:15 -05:00
Cristian
8aa7b34de7 tests: Add readability to ignored methods in tests 2020-08-11 08:58:49 -05:00
Cristian
dc87d8b68c tests: Update failing tests 2020-08-11 08:48:13 -05:00
Cristian
0ec747f64e feat: Look in wget, singlefile or dom outputs before attempting to download the information again 2020-08-11 08:37:12 -05:00
Cristian
a14762640e feat: Avoid running readability when the target is a file 2020-08-11 08:37:12 -05:00
Cristian
61e08a7c43 docs: Update docs link 2020-08-11 08:37:12 -05:00
Cristian
b33c66a9f7 feat: Split output of readability into multiple files 2020-08-11 08:37:12 -05:00
Cristian
7e2b249388 feat: Initial version of readability extractor 2020-08-11 08:37:12 -05:00