Commit graph

15 commits

Author SHA1 Message Date
Nick Sweeting
a680724367 Merge branch 'dev' into search_index_extract_html_text 2023-10-27 23:09:28 -07:00
Ross Williams
310b4d1242 Add htmltotext extractor
Saves HTML text nodes and selected element attributes in
`htmltotext.txt` for each Snapshot. Primarily intended to be used
for search indexing.
2023-10-23 21:42:32 -04:00
Ross Williams
b44f7e68b1 Add URL-specific method allow/deny lists
Allows enabling only allow-listed extractors or disabling specific
deny-listed extractors for a regular expression matched against an added
site's URL.
2023-08-02 09:36:40 -04:00
Nick Sweeting
bd6d9c165b enforce utf8 on literally all file operations because windows sucks 2021-03-27 01:16:29 -04:00
Cristian
62ed11a5ca fix: Improve headers handling 2020-09-24 12:55:51 -05:00
Angel Rey
ee6caca3ca Added more asserts 2020-09-23 11:07:00 -05:00
Angel Rey
1cce786d6d Added test headers extractor 2020-09-23 11:07:00 -05:00
ttimasdf
e3329be291 tests: add test for mercury-parser 2020-09-22 18:44:12 -05:00
Cristian
cc0fa747ce feat: Add options to ease management of node related extractors 2020-08-18 10:34:28 -05:00
Cristian
2a68af1b94 tests: Add readability tests 2020-08-11 11:15:15 -05:00
Cristian
5429096c30 tests: Add mechanism to avoid using extractors that we are not testing 2020-08-04 08:42:30 -05:00
Nick Sweeting
5b6eb5e4ad make filenames consistent with program name 2020-08-03 13:23:05 -05:00
Cristian
37df00a08b tests: Add basic singlefile test 2020-08-03 13:22:36 -05:00
Cristian
e6c571beb2 fix: Remove title from extractors for oneshot 2020-07-31 10:24:58 -05:00
Cristian
23e6803f02 fix: Add change to calculate wget folder when there is a port present 2020-07-17 16:55:56 -05:00