All notable changes to this project are documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning (as of version 1.5.0).
- Add utility function to compute ZIM Tags #164, including deduplication #156
- Metadata does not automatically drop control characters #159
- New indexing.IndexData class to hold title, content and keywords to pass to libzim to index an item
- Automatically index PDF documents content #167
- Automatically set proper title on PDF documents #168
- Expose new optimization.get_optimization_method to get the proper optimization method to call for a given image format
- Add optimization.get_optimization_method to get the proper optimization method to call for a given image format
- New creator.Creator.convert_and_check_metadata to convert metadata to bytes or str for known use cases and check proper type is passed to libzim
- Add svg2png image conversion function #113
- Add conversion.convert_svg2png image conversion function + support for SVG in probing.format_for #113
- Add i18n.Lang class used as typed result of i18n operations #151
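The tag deduplication mentioned in the first entry (#164, #156) can be pictured with a minimal sketch. The function name `dedupe_tags` and the cleaning rules (strip whitespace, drop empty tags, keep first-seen order) are assumptions made for illustration, not the library's actual utility; the semicolon separator is the one commonly used for the ZIM Tags metadata.

```python
from collections.abc import Iterable


def dedupe_tags(tags: Iterable[str]) -> list[str]:
    """Strip whitespace, drop empty tags and remove duplicates, keeping first-seen order."""
    cleaned = (tag.strip() for tag in tags)
    # dict.fromkeys preserves insertion order, so the first occurrence of each tag wins
    return list(dict.fromkeys(tag for tag in cleaned if tag))


# Join with ";" to build a single Tags value (separator assumed here)
print(";".join(dedupe_tags(["wikipedia", " _ftindex:yes", "wikipedia", ""])))
```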
- BREAKING Renamed zimscraperlib.image.convertion to zimscraperlib.image.conversion to fix typo
- BREAKING Many changes in type hints to match the real underlying code
- BREAKING Force all boolean arguments (and some other non-obvious parameters) to be keyword-only in function calls for clarity / disambiguation (see ruff rule FBT002)
- Prefer IO[bytes] to io.BytesIO when possible since it is more generic
- BREAKING i18n.NotFound renamed i18n.NotFoundError
- BREAKING types.get_mime_for_name now returns str | None
- BREAKING creator.Creator.add_metadata and creator.Creator.validate_metadata now only accept bytes | str as value (it must have been converted before call)
- BREAKING second argument of creator.Creator.add_metadata has been renamed to value instead of content to align with other methods
- When a type issue arises in metadata checks, wrong value type is displayed in exception
- BREAKING i18n.get_language_details(), i18n.get_iso_lang_data(), i18n.find_language_names() and i18n.update_with_macro now process / return a new typed Lang class #151
- BREAKING Rename i18n.NotFound to i18n.NotFoundError
- BREAKING Remove translation features in i18n: Locale class + _ and setlocale functions #134
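The keyword-only change for boolean arguments (ruff rule FBT002) listed above affects call sites. A small sketch of the pattern follows; `save_entry` and its parameters are hypothetical and do not come from the library.

```python
def save_entry(path: str, content: bytes, *, overwrite: bool = False) -> str:
    """Hypothetical function: everything after the bare `*` must be passed by keyword."""
    mode = "overwrite" if overwrite else "create"
    return f"{mode}:{path}:{len(content)}"


# Accepted: the meaning of the flag is visible at the call site
print(save_entry("index.html", b"<html>", overwrite=True))

# Rejected: passing the boolean positionally raises TypeError
try:
    save_entry("index.html", b"<html>", True)
except TypeError as exc:
    print(f"TypeError: {exc}")
```

Callers that previously passed such flags positionally need to switch to the keyword form.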
- Metadata length validation is buggy for unicode strings #158
- Pillow 10.4.0 reveals improper type hints for image probing functions #177
- Enhance error when locale fails to setup #157
- zim.creator.Creator._log_metadata() to log (DEBUG) all metadata set on _metadata (prior to start()) #155
- New utility function to confirm ZIM can be created at given location / name #163
- Migrate the VideoWebmLow and VideoWebmHigh presets to VP9 for smaller file size #79
- New preset versions are v3 and v2 respectively
- Simplify type annotations by replacing Union and Optional with pipe character ("|") for improved readability and clarity #150
- Calling Creator._log_metadata() on Creator.start() if running in DEBUG #155
- Add back the --runinstalled flag for test execution to allow smooth testing on other build chains #139
- Add support for disable_metadata_checks and ignore_duplicates arguments in make_zim_file function ("zimwriterfs-mode")
- Relaxed constraints on Python dependencies
- Upgraded optional dependencies used for test and QA
- Set a user-agent for handle_user_provided_file #103
- Migrate to generic syntax in all std collections #140
- Do not modify the ffmpeg_args in reencode function #144
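The ffmpeg_args fix (#144) is about not mutating a list the caller still owns. The sketch below shows the general defensive-copy pattern; `build_ffmpeg_command` and its parameters are hypothetical, and the real reencode function may assemble its command differently.

```python
def build_ffmpeg_command(src: str, dst: str, ffmpeg_args: list[str], threads: int = 1) -> list[str]:
    """Hypothetical helper: assemble a command line without touching the caller's list."""
    args = list(ffmpeg_args)  # shallow copy; extending it leaves ffmpeg_args unchanged
    args += ["-threads", str(threads)]
    return ["ffmpeg", "-y", "-i", src, *args, dst]


preset_args = ["-codec:v", "libvpx-vp9"]
command = build_ffmpeg_command("in.mp4", "out.webm", preset_args)
print(command)
print(preset_args)  # the caller's list keeps only its two original items
```

Without the copy, reusing the same preset list for several videos would accumulate extra arguments on each call.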
- New disable_metadata_checks parameter in zimscraperlib.zim.creator.Creator initializer, which disables the metadata check at startup (assuming the user validates metadata on their own) #119
- Rework the VideoWebmLow preset for faster encoding and smaller file size #122
- preset has been bumped to version 2
- when using an S3 cache, all videos using this preset will be reencoded and uploaded to cache again (it will replace the same file encoded with preset version 1)
- When reencoding a video, ffmpeg now uses only 1 CPU thread by default (a new reencode argument allows overriding this default)
- Using openZIM Python bootstrap conventions (including hatch-openzim plugin) #120
- Add support for Python 3.12, drop Python 3.7 support #118
- Replace "iso-639" with "iso639-lang" library
- Replace "file-magic" by "python-magic" library for Alpine Linux support and better maintenance
- Fixed type hints of zimscraperlib.zim.Item and subclasses, and zimscraperlib.image.optimization:convert_image
- Add utility function to compute/check ZIM descriptions #110
- Using pylibzim 3.4.0
- Support for Python 3.7 (EOL)
- Fixed declared (hint) return type of download.stream_file #104
- Fixed declared (hint) type of content param for Creator.add_item_for #107
- Using pylibzim 3.1.0
- ZIM metadata check now allows multiple values (comma-separated) for Language
- Using yt_dlp instead of youtube_dl
- Dropped support for Python 3.6
zim.creator.Creator and zim.filesystem.make_zim_file
- zim.creator.Creator.config_metadata method (returning Self) exposing all mandatory metadata, all standard ones and allowing extra text metadata.
- zim.creator.Creator.config_dev_metadata method setting stub metadata for all mandatory ones (allowing overrides)
- zim.metadata module with a list of per-metadata validation functions
- zim.creator.Creator.validate_metadata (called on start) to verify metadata respects the spec (and its recommendations)
- zim.filesystem.make_zim_file accepts a new optional long_description param.
- i18n.is_valid_iso_639_3 to check ISO-639-3 codes
- image.probing.is_valid_image to check Image format and size
- zim.creator.Creator main_path argument now mandatory
- zim.creator.Creator.start now fails on missing required or invalid metadata
- zim.creator.Creator.add_metadata now enforces validation checks
- zim.filesystem.make_zim_file renamed its favicon_path param to illustration_path
- zim.creator.Creator.config_indexing language argument now optional when indexing=False
- zim.creator.Creator.config_indexing now validates language is ISO-639-3 when indexing=True
- zim.creator.Creator.update_metadata. See .config_metadata() instead
- zim.creator.Creator language argument. See .config_metadata() instead
- zim.creator.Creator keyword arguments. See .config_metadata() instead
- zim.creator.Creator.add_default_illustration. See .config_metadata() instead
- zim.archive.Archive.media_counter (deprecated in 2.0.0)
- zim.creator.Creator(language=) can be specified as List[str]. ["eng", "fra"], ["eng"], "eng,fra", "eng" are all valid values.
- Fixed zim.providers.URLProvider returning incomplete streams under certain circumstances (from openzim/kolibri#40)
- Fixed zim.creator.Creator not supporting multiple values for Language metadata, as required by the spec
- Using pylibzim v2.1.0 (using libzim 8.1.0)
- [libzim] Entry.get_redirect_entry()
- [libzim] Item.get_indexdata() to implement custom IndexData per entry (writer)
- [libzim] Archive.media_count
- [libzim] Archive.article_count updated to match scraperlib's version
- Archive.article_counter now deprecated. Now returns Archive.article_count
- Archive.media_counter now deprecated. Now returns Archive.media_count
- [libzim] lzma compression algorithm
- download.get_session() to build a new requests Session
- download.stream_file() accepts a session param to use instead of creating one
- zim.Creator now supports ignore_duplicates: bool parameter to prevent duplicates from raising exceptions
- zim.Creator.add_item, zim.Creator.add_redirect and zim.Creator.add_item_for now support a duplicate_ok: bool parameter to prevent an exception should this item/redirect be a duplicate
- download.stream_file() supports passing headers (scrapers were already using it)
- Fixed filesystem.get_content_mimetype() crashing on non-guessable byte stream
- Wider range of accepted lxml dependency version as 4.9.1 fixes a security issue
- Archive.get_metadata_item() to retrieve full item instead of just value
- Using pylibzim v1.1.0 (using libzim 7.2.1)
- Adding duplicate entries now raises RuntimeError
- filesize is fixed for larger ZIMs
- zim.Archive.tags and zim.Archive.get_tags() to retrieve parsed Tags with optional libkiwix param to include libkiwix's hints
- [tests] Counter tests now also use a libzim6 file.
- zim.Archive.article_counter follows libkiwix's new behavior of returning libzim's article_count for libzim 7+ ZIMs and returning previously returned (parsed) value for older ZIMs.
- Unreachable code removed in imaging module.
- [tests] "Sanskrit" removed from tests as output not predictable depending on platform.
- zim.Archive.counters won't fail on missing Counter metadata
- Fixed leak in zim.Archive's .counters
- New .get_text_metadata() method on zim.Archive to save UTF-8 decoding
- New Counter metadata based properties for Archive:
- .counters: parsed dict of the Counter metadata
- .article_counter: libkiwix's calculation for nb of articles
- .media_counter: libkiwix's calculation for nb of media
- Fixed i18n.find_language_names() failing on some languages
- Added uri module with rebuild_uri()
- Using new python-libzim based on libzim v7
- New Creator API
- Removed all namespace references
- Renamed url mentions to path
- Removed all links rewriting
- Removed Article/CSS/Binary segregation
- Kept zimwriterfs mode (except it doesn't rewrite for namespaces)
- New html module for HTML document manipulations
- New callback system on add_item_for() and add_item()
- New Archive API with easier search/suggestions and content access
- Changed download log level to DEBUG (was INFO)
- filesystem.get_file_mimetype now passes bytes to libmagic instead of filename due to release issue in libmagic
- safer inputs.handle_user_provided_file regarding input as str instead of Path
- image.presets and video.presets now all include ext and mimetype properties
- Video convert log now DEBUG instead of INFO
- Fixed image.save_image() saving to disk even when using a bytes stream
- Fixed image.transformation.resize_image() when resizing a byte stream without a dst
Intermediate release using unreleased libzim to support development of libzim7. Don't use it.
- requesting newer libzim version (not released ATM)
- New ZIM API for non-namespace libzim (v7)
- updated all requirements
- Fixed download test inconsistency
- fix_ogvjs mostly useless: only allows webm types
- exposing retry_adapter for refactoring
- Changed download log level to DEBUG (was INFO)
- guess more-defined mime from filename if magic says it's text
- get_file_mimetype now passes bytes to libmagic
- safer regarding input as str instead of Path
- fixed static item for empty content
- ext and mimetype properties for all presets
- Video convert log now DEBUG instead of INFO
- Added delete_fpath to add_item_for() and fixed StaticItem's auto remove
- Updated badges for new repo name
- add stream_file() to stream content from a URL into a file or a BytesIO object
- deprecated save_file()
- fixed add_binary when used without an fpath (#69)
- deprecated make_grayscale option in image optimization
- Added support for in-memory optimization for PNG, JPEG, and WebP images
- allows enabling debug logs via ZIMSCRAPERLIB_DEBUG environ
- added wait option in YoutubeDownloader to allow parallelism while using context manager
- do not use extension for finding format in ensure_matches() in image.optimization module
- added VideoWebmHigh and VideoMp4High presets for high quality WebM and Mp4 conversion respectively
- updated presets WebpHigh, JpegMedium, JpegLow and PngMedium in image.presets
- save_image moved from image to image.utils
- added convert_image, optimize_image and resize_image functions to image module
- added YoutubeDownloader to the download module to download YT videos using a capped number of threads
- fixed rewriting of links with empty target
- added support for image optimization using zimscraperlib.image.optimization for webp, gif, jpeg and png formats
- added format_for() in zimscraperlib.image.probing to get PIL image format from the suffix
- replaced BeautifulSoup parser in rewriting (html.parser -> lxml)
- detect mimetypes from filenames for all text files
- fixed non-filename based StaticArticle
- enable rewriting of links in poster attribute of audio element
- added find_language_in() and find_language_in_file() to get language from HTML content and HTML file respectively
- add a mime mapping to deal with inconsistencies in mimetypes detected by magic on different platforms
- convert_image signature changed:
- target_format positional argument removed. Replaced with optional fmt key of keyword arguments.
- colorspace optional positional argument removed. Replaced with optional colorspace key of keyword arguments.
- prevent rewriting of links with special schemes (mailto, tel, etc.) in HTML links rewriting
- replaced imaging module with exploded image module (convertion, probing, transformation)
- changed create_favicon() param names (source_image -> src, dest_ico -> dst)
- changed save_image() param names (image -> src)
- changed get_colors() param names (image_path -> src)
- changed resize_image() param names (fpath -> src)
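Skipping links with special schemes, as described in the first entry of this group, can be sketched with the standard library. The helper name and the extra schemes beyond mailto and tel are assumptions for illustration; the library's own list may differ.

```python
from urllib.parse import urlsplit

# Assumed set for illustration; the entry only names mailto and tel explicitly
SPECIAL_SCHEMES = {"mailto", "tel", "data", "javascript"}


def should_rewrite(link: str) -> bool:
    """Return False for links whose scheme marks them as not pointing to a page or file."""
    return urlsplit(link).scheme.lower() not in SPECIAL_SCHEMES


for link in ("mailto:contact@example.org", "tel:+33123456789", "images/logo.png"):
    print(link, should_rewrite(link))
```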
- fixed URL rewriting when running from /
- added support for link rewriting in <object> element
- prevent from raising error if element doesn't have the attribute with url
- use non greedy match for CSS URL links (shortest string matching url() format)
- fix namespace of target only if link doesn't have a netloc
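The difference between greedy and non-greedy matching of url() in CSS is easy to see on a line holding two declarations. The patterns below are simplified illustrations, not the expressions used by the rewriting module.

```python
import re

css = 'a { background: url(img/a.png) } b { background: url("img/b.png") }'

# Greedy: `.*` runs on to the last closing parenthesis in the string
greedy = re.findall(r"url\((.*)\)", css)
# Non-greedy: `.*?` stops at the first closing parenthesis after each url(
lazy = re.findall(r"url\((.*?)\)", css)

print(len(greedy), "greedy match")
print(len(lazy), "non-greedy matches:", lazy)
```

The greedy pattern swallows both declarations into one match, which is why the shortest match is the one wanted for link rewriting.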
- added UTF8 to constants
- added mime_type discovery via magic (filesystem)
- Added types: mime types guessing from file names
- Revamped zim API
- Removed ZimInfo whose role was to hold metadata for zimwriterfs call
- Removed calling zimwriterfs binary but kept function name
- Added zim.filesystem: zimwriterfs-like creation from a build folder
- Added zim.creator: create files by manually adding each article
- Added zim.rewriting: tools to rewrite links/urls in HTML/CSS
- add timeout and retries to save_file() and make it return headers
- fixed convert_image() which tried to use a closed file
- exposed reencode, Config and get_media_info in zimscraperlib.video
- added save_image() and convert_image() in zimscraperlib.imaging
- added support for upscaling in resize_image() via allow_upscaling
- resize_image() now supports params given by user and preserves image colorspace
- fixed tests for zimscraperlib.imaging
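The allow_upscaling option mentioned above amounts to a guard on the target size computation. The sketch below works on plain (width, height) tuples; `fit_width`, its default value and its rounding are assumptions, not the behaviour of resize_image().

```python
def fit_width(size: tuple[int, int], width: int, *, allow_upscaling: bool = True) -> tuple[int, int]:
    """Hypothetical helper: scale (w, h) to the requested width, keeping the aspect ratio."""
    src_w, src_h = size
    if width > src_w and not allow_upscaling:
        return size  # requested width is larger than the source: leave it untouched
    return width, max(1, round(src_h * width / src_w))


print(fit_width((400, 200), 800))
print(fit_width((400, 200), 800, allow_upscaling=False))
```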
- added video module with reencode, presets, config builder and video file probing
- make_zim_file() accepts extra kwargs for zimwriterfs
- added translation support to i18n
- added s3transfer to verbose dependencies list
- changed default log format to include module name
- verbose dependencies (urllib3, boto3) now logged at WARNING level by default
- ability to set verbose dependencies log level and add modules to the list
- zimscraperlib's logging level now aligned with scraper's requested one
- fix_ogvjs_dist script more generic (#1)
- updated zim to support other zimwriterfs params (#10)
- more flexible requirements for requests dependency
- fixed return value of get_language_details on non-existent language
- fixed crash on resize_image with method height
- fixed root logger level (now DEBUG)
- removed useless console=True getLogger param
- completed tests (100% coverage)
- added ./test script for quick local testing
- improved tox.ini
- added create_favicon to generate a squared favicon
- added handle_user_provided_file to handle user file/URL from param
- fixed fix_ogvjs_dist
- initial version providing
- download: save_file, save_large_file
- fix_ogvjs_dist
- i18n: setlocale, get_language_details
- imaging: get_colors, resize_image, is_hex_color
- zim: ZimInfo, make_zim_file