All notable changes to this project will be documented in this file.
This project adheres to Semantic Versioning.
- `pw.xpacks.llm.prompts.RAGPromptTemplate`, set of prompt utilities that enable verifying templates and creating UDFs from prompt strings or callables.
- `pw.xpacks.llm.question_answering.BaseContextProcessor` streamlines development and tuning of representing retrieved context documents to the LLM.
- `pw.io.kafka.read` now supports `with_metadata` flag, which makes it possible to attach the metadata of the Kafka messages to the table entries.
- `pw.io.deltalake.read` can now stream the tables with deletions, if no deletion vectors were used.
- `pw.io.sharepoint.read` now explicitly terminates with an error if it fails to read the data the specified number of times per row (the default is `8`).
- `pw.xpacks.llm.prompts.prompt_qa`, and other prompts expect 'context' and 'query' fields instead of 'docs'.
- Removed support for `short_prompt_template` and `long_prompt_template` in `pw.xpacks.llm.question_answering.BaseRAGQuestionAnswerer`. These prompt variants are no longer accepted during construction or in requests.
- `pw.xpacks.llm.question_answering.BaseRAGQuestionAnswerer` allows setting user created prompts. Templates are verified to include 'context' and 'query' placeholders.
- `pw.xpacks.llm.question_answering.BaseRAGQuestionAnswerer` can take a `BaseContextProcessor` that represents context documents to the LLM. Defaults to `pw.xpacks.llm.question_answering.SimpleContextProcessor` which filters metadata fields and joins the documents with new lines.
- The input of `pw.io.fs.read` and `pw.io.s3.read` is now correctly persisted in case deletions or modifications of already processed objects take place.
- `pw.io.s3.read` now monitors object deletions and modifications in the S3 source, when run in streaming mode. When an object is deleted in S3, it is also removed from the engine. Similarly, if an object is modified in S3, the engine updates its state to reflect those changes.
- `pw.io.s3.read` now supports `with_metadata` flag, which makes it possible to attach the metadata of the source object to the table entries.
- `pw.xpacks.llm.document_store.DocumentStore` no longer requires `_metadata` column in the input table.
- `pw.xpacks.llm.document_store.SlidesDocumentStore`, which is a subclass of `pw.xpacks.llm.document_store.DocumentStore` customized for retrieving slides from presentations.
- `pw.temporal.inactivity_detection` and `pw.temporal.utc_now` functions allowing for alerting and other time-dependent use cases.
- `pw.Table.concat`, `pw.Table.with_id`, `pw.Table.with_id_from` no longer perform checks if ids are unique. It improves memory usage.
- table operations that store values (like `pw.Table.join`, `pw.Table.update_cells`) no longer store columns that are not used downstream.
- `append_only` column property is now propagated better (there are more places where we can infer it).
- BREAKING: Parsers and parser utilities including `OpenParse`, `ParseUnstructured`, `ParseUtf8`, `parse_images` are now async. Parser interface in the `VectorStore` and `DocumentStore` remains unchanged.
- BREAKING: Unused arguments from the constructor `pw.xpacks.llm.question_answering.DeckRetriever` are no longer accepted.
- `query_as_of_now` of `pw.stdlib.indexing.DataIndex` and `pw.stdlib.indexing.HybridIndex` now work in constant memory for infinite query stream (no query-related data is kept after query is answered).
- `pw.io.kafka.read` now supports reading entries starting from a specified timestamp.
- `pw.io.nats.read` and `pw.io.nats.write` methods for reading from and writing Pathway tables to NATS.
- `pw.Table.diff` now supports setting `instance` parameter that allows computing differences for multiple groups.
- `pw.io.postgres.write_snapshot` now keeps the Postgres table fully in sync with the current state of the table in Pathway. This means that if an entry is deleted in Pathway, the same entry will also be deleted from the Postgres table managed by the output connector.
- `pw.PyObjectWrapper` is now picklable.
- `query_as_of_now` of `pw.stdlib.indexing.DataIndex` and `pw.stdlib.indexing.HybridIndex` now work in constant memory for infinite query stream (no query-related data is kept after query is answered).
- `pw.io.mongodb.write` connector for writing Pathway tables in MongoDB.
- `pw.io.s3.read` now supports downloading objects from an S3 bucket in parallel.
- `pw.io.fs.read` performance has been improved for directories containing a large number of files.
- `pw.io.deltalake.read` now supports custom S3 Delta Lakes with HTTP endpoints.
- `pw.io.deltalake.read` now supports specifying both a custom endpoint and a custom region for Delta Lakes via `pw.io.s3.AwsS3Settings`.
- Indices in `pathway.stdlib.indexing.nearest_neighbors` can now work also on numpy arrays. Previously they only accepted `list[float]`. Working with numpy arrays improves memory efficiency.
- `pw.io.s3.read` has been optimized to minimize new object requests whenever possible.
- It is now possible to set the size limit of cache in `pw.udfs.DiskCache`.
- State persistence now uses a single backend for both metadata and stream storage. The `pw.persistence.Config.simple_config` method is therefore deprecated. Now you can use the `pw.persistence.Config` constructor with the same parameters that were previously used in `simple_config`.
- `pw.io.bigquery.write` connector now correctly handles `pw.Json` columns.
- `pw.temporal.session` and `pw.temporal.asof_join` now correctly work with multiple entries with the same time.
- Fixed an issue in `pw.stdlib.indexing` where filters would cause runtime errors while using `HybridIndexFactory`.
- Experimental: A `pw.xpacks.llm.document_store.DocumentStore` to process and index documents.
- `pw.xpacks.llm.servers.DocumentStoreServer` used to expose REST server for retrieving documents from `pw.xpacks.llm.document_store.DocumentStore`.
- `pw.xpacks.stdlib.indexing.HybridIndex` used for querying multiple indices and combining their results.
- `pw.io.airbyte.read` now also supports streams that only operate in `full_refresh` mode.
- Running servers for answering queries is extracted from `pw.xpacks.llm.question_answering.BaseRAGQuestionAnswerer` into `pw.xpacks.llm.servers.QARestServer` and `pw.xpacks.llm.servers.QASummaryRestServer`.
- BREAKING: `query` and `query_as_of_now` of `pathway.stdlib.indexing.data_index.DataIndex` now produce an empty list instead of `None` if no match is found.
- `pw.io.deltalake.read` and `pw.io.deltalake.write` now correctly work with lakes hosted in S3 over min.io, Wasabi and Digital Ocean.
- The Pathway CLI command `spawn` can now execute code directly from a specified GitHub repository.
- A new CLI command, `spawn-from-env`, has been added. This command runs the Pathway CLI `spawn` command using arguments provided in the `PATHWAY_SPAWN_ARGS` environment variable.
- Switched `pw.xpacks.llm.embedders.GeminiEmbedder` to be sync to resolve compatibility issues with the Google Colab runs.
- Pinned `surya-ocr` module version for stability.
- `pw.xpacks.llm.embedders.GeminiEmbedder` which is a wrapper for Google Gemini Embedding services.
- `pw.debug.table_to_pandas` now exports `int | None` columns correctly.
- `pw.io.airbyte.read` can now be used with Airbyte connectors implemented in Python without requiring Docker.
- BREAKING: UDFs now verify the type of returned values at runtime. If it is possible to cast a returned value to a proper type, the value is cast. If the value does not match the expected type and can't be cast, an error is raised.
- BREAKING: `pw.reducers.ndarray` reducer requires input column to either have type `float`, `int` or `Array`.
- `pw.xpacks.llm.parsers.OpenParse` can now extract and parse images & diagrams from PDFs. This can be enabled by setting the `parse_images`. `processing_pipeline` can be also set to customize the post processing of doc elements.
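For the runtime return-type check on UDFs noted above, the following is a minimal plain-Python sketch of the idea, not Pathway's implementation. The decorator name `checked_return` and the choice of which casts count as "possible" (here only `int` to `float`) are assumptions made for illustration.

```python
import functools
from typing import Any, Callable


def checked_return(expected: type) -> Callable:
    """Hypothetical decorator: verify, or cast, the value a function returns."""

    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            value = fn(*args, **kwargs)
            if isinstance(value, expected):
                return value  # already the declared type
            # Assumption for this sketch: int -> float is the only allowed cast.
            is_plain_int = isinstance(value, int) and not isinstance(value, bool)
            if expected is float and is_plain_int:
                return float(value)
            raise TypeError(
                f"{fn.__name__} returned {type(value).__name__}, "
                f"expected {expected.__name__}"
            )

        return wrapper

    return decorator


@checked_return(float)
def half(x: int) -> float:
    return x // 2  # returns an int, which the wrapper casts to float


@checked_return(int)
def label(x: int) -> int:
    return "n/a"  # wrong type and not castable: the wrapper raises TypeError
```

Calling `half(5)` yields a `float`, while `label(1)` raises `TypeError` instead of silently passing a string downstream.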
- `pw.io.deltalake.read` now supports S3 data sources.
- `pw.xpacks.llm.parsers.ImageParser` which allows parsing images with the vision LMs.
- `pw.xpacks.llm.parsers.SlideParser` that enables parsing PDF and PPTX slides with the vision LMs.
- `pw.xpacks.llm.parsers.question_answering.RAGClient`, Python client for Pathway hosted RAG apps.
- `pw.xpacks.llm.parsers.question_answering.DeckRetriever`, a RAG app that enables searching through slide decks with visual-heavy elements.
- `pw.xpacks.llm.vector_store.VectorStoreServer` now uses new indexes.
- `pw.xpacks.llm.parsers.OpenParse` now supports any vision Language model including local and proprietary models via LiteLLM.
- `pw.io.kafka.read` now accepts an autogenerate_key flag. This flag determines the primary key generation policy to apply when reading raw data from the source. You can either use the key from the Kafka message or have Pathway autogenerate one.
- `pw.io.deltalake.read` input connector that fetches changes from DeltaLake into a Pathway table.
- `pw.xpacks.llm.parsers.OpenParse` which allows parsing tables and images in PDFs.
- All S3 input connectors (including S3, Min.io, Digital Ocean, and Wasabi) now automatically retry network operations if a failure occurs.
- The issue where the connection to the S3 source fails after partially ingesting an object has been resolved by downloading the object in full first.
- `pw.io.deltalake.write` now supports S3 destinations.
- `pw.debug.compute_and_print` now allows passing more than one table.
- BREAKING: `path` parameter in `pw.io.deltalake.write` renamed to `uri`.
- A bug in `pw.Table.deduplicate`. If `persistent_id` is not set, it is no longer generated in `pw.PersistenceMode.SELECTIVE_PERSISTING` mode.
- `pw.PyObjectWrapper` that enables passing python objects of any type to the engine.
- `cache_strategy` option added for `pw.io.http.rest_connector`. It enables cache configuration, which is useful for duplicated requests.
- `allow_misses` argument to `Table.ix` and `Table.ix_ref` methods which allows for filling rows with missing keys with None values.
- `pw.io.deltalake.write` output connector that streams the changes of a given table into a DeltaLake storage.
- `pw.io.airbyte.read` now supports data extraction with Google Cloud Runs.
- BREAKING: Removed `Table.having` method.
- BREAKING: Removed `pw.DATE_TIME_UTC`, `pw.DATE_TIME_NAIVE` and `pw.DURATION` as dtype markers. Instead, `pw.DateTimeUtc`, `pw.DateTimeNaive` and `pw.Duration` should be used, which are wrappers for corresponding pandas types.
- BREAKING: Removed class transformers from public API: `pw.ClassArg`, `pw.attribute`, `pw.input_attribute`, `pw.input_method`, `pw.method`, `pw.output_attribute` and `pw.transformer`.
- BREAKING: Removed several methods from `pw.indexing` module: `binsearch_oracle`, `filter_cmp_helper`, `filter_smallest_k` and `prefix_sum_oracle`.
- `pathway.assert_table_has_schema` and `pathway.table_transformer` now accept `allow_subtype` argument, which, if True, allows column types in the Table to be subtypes of types in the Schema.
- `next` method to `pw.io.python.ConnectorSubject` (python connector) that enables passing values of any type to the engine, not only values that are json-serializable. The `next` method should be the preferred way of passing values from the python connector.
- The `format` argument of `pw.io.python.read` is deprecated. A data format is inferred from the method used (`next_json`, `next_str`, `next_bytes`) and the provided schema.
- Removed `pw.numba_apply` and `numba` dependency.
- Fixed `pw.this` desugaring bug, where `__getitem__` in `.ix` context was not working properly.
- `pw.io.sqlite.read` now checks if the data matches the passed schema.
- `query` and `query_as_of_now` of `pathway.stdlib.indexing.data_index.DataIndex` now accept in `metadata_column` parameter a column with data of type `str | None`.
- `pathway.xpacks.connectors.sharepoint` module, available with Pathway Scale License.
- Embedders in the LLM xpack now have method `get_embedding_dimension` that returns the number of dimensions used by the chosen embedder.
- `pathway.stdlib.indexing.nearest_neighbors`, with implementations of `pathway.stdlib.indexing.data_index.InnerIndex` based on k-NN via LSH (implemented in Pathway), and k-NN provided by USearch library.
- `pathway.stdlib.indexing.vector_document_index`, with a few predefined instances of `pathway.stdlib.indexing.data_index.DataIndex`.
- `pathway.stdlib.indexing.bm25`, with implementations of `pathway.stdlib.indexing.data_index.InnerIndex` based on BM25 index provided by Tantivy.
- `pathway.stdlib.indexing.full_text_document_index`, with a predefined instance of `pathway.stdlib.indexing.data_index.DataIndex`.
- Introduced the `reranker` module under `llm.xpacks`. Includes a few re-ranking strategies and utility functions for RAG applications.
- BREAKING: `windowby` generates IDs of produced rows differently than in the previous version.
- BREAKING: `pw.io.csv.write` prints printable non-ascii characters as regular text, not `\u{xxxx}`.
- BREAKING: Connector methods `pw.io.elasticsearch.read`, `pw.io.debezium.read`, `pw.io.fs.read`, `pw.io.jsonlines.read`, `pw.io.kafka.read`, `pw.io.python.read`, `pw.io.redpanda.read`, `pw.io.s3.read` now check the type of the input data. Previously it was not checked if the provided format was `"json"`/`"jsonlines"`. If the data is inconsistent with the provided schema, the row is skipped and the error message is emitted.
- BREAKING: `query` and `query_as_of_now` methods of `pathway.stdlib.indexing.data_index.DataIndex` now return `pathway.JoinResult`, to allow resolving column name conflicts (between columns in the table with queries and table with index data).
- BREAKING: DataIndex methods `query` and `query_as_of_now` now return score in a column named `_pw_index_reply_score` (defined as `_SCORE` variable in `pathway.stdlib.indexing.colnames.py`).
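The "row is skipped and the error message is emitted" behavior for JSON input described above can be pictured with a small standard-library sketch. It is an illustration only: the `SCHEMA` dictionary, the `read_checked` helper and the logger name are hypothetical and do not mirror Pathway's connector internals.

```python
import json
import logging

logger = logging.getLogger("schema_check_sketch")  # hypothetical logger name

# Hypothetical schema: column name -> expected Python type.
SCHEMA = {"id": int, "name": str}


def read_checked(lines, schema):
    """Parse JSON lines and keep only rows whose fields match the schema."""
    rows = []
    for lineno, line in enumerate(lines, start=1):
        record = json.loads(line)
        bad = [key for key, typ in schema.items()
               if not isinstance(record.get(key), typ)]
        if bad:
            # Report the problem and skip the row instead of failing the run.
            logger.error("line %d skipped: fields %s do not match the schema",
                         lineno, bad)
            continue
        rows.append({key: record[key] for key in schema})
    return rows


lines = [
    '{"id": 1, "name": "a"}',
    '{"id": "2", "name": "b"}',  # "id" is a string, so this row is skipped
    '{"id": 3, "name": "c"}',
]
rows = read_checked(lines, SCHEMA)
```

Here the second line is logged and dropped, and `rows` holds only the two records that match the schema.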
- BREAKING: `pathway.stdlib.indexing.data_index.VectorDocumentIndex` class, some predefined instances are now meant to be obtained via methods provided in `pathway.stdlib.indexing.vector_document_index`.
- BREAKING: `with_distances` parameter of `query` and `query_as_of_now` methods in `pathway.stdlib.indexing.data_index.DataIndex`. Instead of 'distance', we now operate with a more general term 'score' (higher = better). For distance based indices score is usually defined as negative distance. Score is now always included in the answer, as long as underlying index returns something that indicates quality of a match.
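To make the "score is negative distance, higher is better" convention concrete, here is a short sketch in plain Python. The `rank_by_score` helper and the sample points are hypothetical; the sketch only shows why sorting by descending score returns the nearest points first.

```python
import math


def rank_by_score(query, points, k):
    """Return the k best (name, score) pairs, where score = -distance."""
    scored = [(-math.dist(query, coords), name)
              for name, coords in points.items()]
    scored.sort(reverse=True)  # higher score (smaller distance) comes first
    return [(name, score) for score, name in scored[:k]]


points = {"a": (0.0, 0.0), "b": (3.0, 4.0), "c": (1.0, 0.0)}
top = rank_by_score((0.0, 0.0), points, k=2)
```

With the query at the origin, `a` (distance 0) and `c` (distance 1) outrank `b` (distance 5), whose score is -5.0.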
- `query` method to VectorStoreServer to enable compatible API with `DataIndex`.
- `AdaptiveRAGQuestionAnswerer` to xpacks.question_answering. End-to-end pipeline and accompanying code for Private RAG showcase.
- Pathway now warns when unintentionally creating Table with empty universe.
- `pw.io.kafka.write` in `raw` and `plaintext` formats now supports output for tables with multiple columns. For such tables, it requires the specification of the column that must be used as a value of the produced Kafka messages and gives a possibility to provide column which must be used as a key.
- `pw.io.kafka.write` can now output values from the table using Kafka message headers in 'raw' and 'plaintext' output format.
- `instance` arguments to `groupby`, `join`, `with_id_from` now determine how entries are distributed between machines.
- `flatten` results remain on the same machine as their source entries.
- `join` sends each record between machines at most once.
- BREAKING: `flatten`, `join`, `groupby` (if used with `instance`), `with_id_from` (if used with `instance`) generate IDs of the produced rows differently than in the previous versions.
- `pathway spawn` with multiple workers prints only output from the first worker.
- `pw.reducers.latest` and `pw.reducers.earliest` that return the value with respectively maximal and minimal processing time assigned.
- `pw.io.kafka.write` can now produce messages containing raw bytes in case the table consists of a single binary column and `raw` mode is specified. Similarly, this method will provide plaintext messages if `plaintext` mode is chosen and the table consists of a single string-typed column.
- `pw.io.pubsub.write` connector for publishing Pathway tables into Google PubSub.
- Argument `strict_prompt` to `answer_with_geometric_rag_strategy` and `answer_with_geometric_rag_strategy_from_index` that allows optimizing prompts for smaller open-source LLM models.
- Temporarily switch LiteLLMChat's generation method to sync version due to a bug while using `json` mode with Ollama.
- BREAKING: `pw.io.kafka.read` will not parse the messages from UTF-8 in case `raw` mode was specified. To preserve this behavior you can use the `plaintext` mode.
- BREAKING: `Table.flatten` now flattens one column and spreads every other column of the table, instead of taking other columns from the argument list.
- `pw.io.bigquery.write` connector for writing Pathway tables into Google BigQuery.
- parameter `filepath_globpattern` to `query` method in `VectorStoreClient` for specifying which files should be considered in the query.
- Improved compatibility of `pw.Json` with standard methods such as `len()`, `int()`, `float()`, `bool()`, `iter()`, `reversed()` when feasible.
- `pw.io.postgres.write` can now parallelize writes to several threads if several workers are configured.
- Pathway now checks types of pointers rigorously. Indexing table with mismatched number/types of columns vs what was used to create index will now result in a TypeError.
- `pw.Json.as_float()` method now supports integer JSON values.
- New function `answer_with_geometric_rag_strategy_from_index`, which allows using `answer_with_geometric_rag_strategy` without the need to first retrieve documents from index.
- Added support for custom state serialization to `udf_reducer`.
- Introduced `instance` parameter in `AsyncTransformer`. All calls with a given `(instance, processing_time)` pair are returned at the same processing time. Ordering is preserved within a single instance.
- Added `successful`, `failed`, `finished` properties to `AsyncTransformer`. They return tables with successful calls, failed calls and all finished calls, respectively.
- Property `result` of `AsyncTransformer` is deprecated. Property `successful` should be used instead.
- `pw.io.csv.read`, `pw.io.jsonlines.read`, `pw.io.fs.read`, `pw.io.plaintext.read` now handle `path` as a glob pattern and read all matched files and directories recursively.
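As a rough picture of what "handle `path` as a glob pattern and read all matched files and directories recursively" means, the sketch below expands a pattern with the standard library only. The helper `expand_path` and the temporary file layout are hypothetical; Pathway's actual matching rules may differ in details such as ordering or hidden files.

```python
import glob
import os
import tempfile


def expand_path(pattern: str) -> list[str]:
    """Return every file matched by `pattern`, descending into matched directories."""
    files: set[str] = set()
    for match in glob.glob(pattern):
        if os.path.isdir(match):
            # A matched directory contributes all files below it.
            for root, _dirs, names in os.walk(match):
                files.update(os.path.join(root, name) for name in names)
        else:
            files.add(match)
    return sorted(files)


with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "data", "nested"))
    for rel in ("data/a.csv", "data/nested/b.csv", "notes.txt"):
        with open(os.path.join(tmp, *rel.split("/")), "w") as fh:
            fh.write("x\n")
    # "da*" matches the "data" directory, so both files under it are found.
    found = [os.path.relpath(path, tmp).replace(os.sep, "/")
             for path in expand_path(os.path.join(tmp, "da*"))]
```

The pattern `da*` matches the `data` directory, so both CSV files below it are returned while `notes.txt` is left out.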
- Pathway will only require `LiteLLM` package, if you use one of the wrappers for `LiteLLM`.
- Retries are implemented in `pw.io.airbyte.read`.
- State processing protocol is updated in `pw.io.airbyte.read`.
- New parameters of `pw.UDF` class and `pw.udf` decorator: `return_type`, `deterministic`, `propagate_none`, `executor`, `cache_strategy`.
- The LLM Xpack now provides integrations with LlamaIndex and LangChain for running the Pathway VectorStore server.
- Subclassing `UDFSync` and `UDFAsync` is deprecated. `UDF` should be subclassed to create a new UDF.
- Passing keyword arguments to `pw.apply`, `pw.apply_with_type`, `pw.apply_async` is deprecated. In the future, they'll be used for configuration, not passing data to the function.
- Fixed a minor bug with `Table.groupby()` method which sometimes prevented accessing certain columns in the following `reduce()`.
- Fixed warnings from using OpenAI Async embedding model in the VectorStore in Colab.
- `%:z` timezone format code to `strptime`.
- Support for Airbyte connectors `pw.io.airbyte`.
- Introduced the `send_alerts` function in the `pw.io.slack` namespace, enabling users to send messages from a specified column directly to a Slack channel.
- Enhanced the `pw.io.http.rest_connector` by introducing an additional argument called `request_validator`. This feature empowers users to validate payloads and raise an `HTTP 400` error if necessary.
- Addressed an issue in `pw.io.xpacks.llm.VectorStoreServer` where the computation of the last modification timestamp for an indexed document was incorrect.
- Improved the behavior of `pw.io.kafka.write`. It now includes retries when sending data to the output topic encounters failures.
- `pw.io.http.rest_connector` now supports multiple HTTP request types.
- `pw.io.http.PathwayWebserver` now allows Cross-Origin Resource Sharing (CORS) to be enabled on newly added endpoints.
- Wrappers for LiteLLM and HuggingFace chat services and SentenceTransformers embedding service are now added to Pathway xpack for LLMs.
- `pw.run` now includes an additional parameter `runtime_typechecking` that enables strict type checking at runtime.
- Embedders in pathway.xpacks.llm.embedders now correctly process empty strings as queries.
- BREAKING: `pw.run` and `pw.run_all` now only accept keyword arguments.
- `pw.Duration` can now be returned from User-Defined Functions (UDFs) or used as a constant value without resulting in errors.
- `pw.io.debezium.read` now correctly handles tables that do not have a primary key.
- `pw.io.http.rest_connector` can now generate Open API 3.0.3 schema that will be returned by the route `/_schema`.
- Wrappers for OpenAI Chat and Embedding services are now added to Pathway xpack for LLMs.
- A vector indexing pipeline that allows querying for the most similar documents. It is available as class `VectorStore` as part of Pathway xpack for LLMs.
- `pw.debug.table_from_markdown` now uses schema parameter (when set) to properly assign simple types (`int, bool, float, str, bytes`) and optional simple types to columns.
- `pw.io.http.rest_connector` now also accepts port as a string for backwards compatibility.
- `pw.stdlib.ml.index.KNNIndex` now sorts by distance by default.
- Support for comparisons of tuples has been added.
- Standalone versions of methods such as `pw.groupby`, `pw.join`, `pw.join_inner`, `pw.join_left`, `pw.join_right`, and `pw.join_outer` are now available.
- The `abs` function from Python can now be used on Pathway expressions.
- The `asof_join` method now has configurable temporal behavior. The `behavior` parameter can be used to pass the configuration.
- The state of the `deduplicate` operator can now be persisted.
- `interval_join` can now work with intervals of zero length.
- The `pw.io.http.rest_connector` can now open multiple endpoints on the same port using a new `pw.io.http.PathwayWebserver` class.
- The `pw.xpacks.connectors.sharepoint.read` and `pw.io.gdrive.read` methods now support the size limit for a single object. If set, it will exclude too large files and won't read them.
- `pathway.xpacks.llm.splitter.TokenCountSplitter`.
- Introducing new methods for strict conversion of `pw.Json` to desired types within a UDF body: `as_int()`, `as_float()`, `as_str()`, `as_bool()`, `as_list()`, `as_dict()`.
- Added `table.col.dt.utc_from_timestamp` method: Creates `DateTimeUtc` from timestamps represented as `int`s or `float`s.
- Enhanced the `table.col.dt.timestamp` method with a new `unit` argument to specify the unit of the returned timestamp.
- Introduced an experimental xpack with a Microsoft SharePoint input connector.
- Index operator (`[]`) can now be directly applied to `pw.Json` within UDFs to access elements of JSON objects, arrays, and strings.
- Enhanced the `table.col.dt.from_timestamp` method to create `DateTimeNaive` from timestamps represented as `int`s or `float`s.
- Deprecated not specifying the `unit` argument of the `table.col.dt.timestamp` method.
- `KNNIndex` now supports returning computed distances.
- Added support for cosine similarity in `KNNIndex`.
- The `offset` argument of `pw.stdlib.temporal.sliding` and `pw.stdlib.temporal.tumbling` is deprecated. Use `origin` instead, as it represents a point in time, not a duration.
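The point that `origin` is a moment in time rather than a duration can be illustrated with a small arithmetic sketch. The function `tumbling_window` below is hypothetical and works on plain numbers; it is not the `pw.stdlib.temporal.tumbling` implementation, and it simply rejects times before the origin, which is an assumption of this sketch.

```python
def tumbling_window(t, duration, origin=0):
    """Return the (start, end) of the tumbling window that contains time t.

    Windows are aligned to `origin`: [origin, origin + duration),
    [origin + duration, origin + 2 * duration), and so on.
    """
    if t < origin:
        raise ValueError("this sketch only covers times at or after the origin")
    index = (t - origin) // duration  # how many whole windows fit before t
    start = origin + index * duration
    return start, start + duration


print(tumbling_window(16, duration=5, origin=2))  # window aligned to origin 2
print(tumbling_window(3, duration=5))             # default origin of 0
```

With `origin=2` and `duration=5` the windows are `[2, 7)`, `[7, 12)`, `[12, 17)`, so time 16 falls into `(12, 17)`.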
- Sliding window now works correctly with UTC Datetimes.
- Temporal column in `asof_join` no longer has to be named `t`.
- `asof_join` includes rows with equal times for all values of the `direction` parameter.
- Fixed an issue with `pw.io.gdrive.read`: Shared folders support is now working seamlessly.
- Added Table.split() method for splitting table based on an expression into two tables.
- Columns with datatype duration can now be multiplied and divided by floats.
- Columns with datatype duration now support both true and floor division (`/` and `//`) by integers.
- Pathway is better at typing if_else expressions when optional types are involved.
- `table.flatten()` operator now supports Json array.
- Buffers (used to delay outputs, configured via delay in `common_behavior`) now flush the data when the computation is finished. The effect of this change can be seen when run in bounded (batch / multi-revision) mode.
- `pw.io.subscribe()` takes additional argument `on_time_end` - the callback function to be called on each closed time of computation.
- `pw.io.subscribe()` is now a single-worker operator, guaranteeing that `on_end` is triggered at most once.
- `KNNIndex` now supports metadata filtering. Each query can specify its own filter in the JMESPath format.
- Resolved an optimization bug causing `pw.iterate` to malfunction when handling columns effectively pointing to the same data.
- Pathway now keeps track of `array` column type better - it is able to keep track of Array dtype and number of dimensions, wherever applicable.
- Fixed issues with standalone panel+Bokeh dashboards to ensure optimal functionality and performance.
- A method `weekday` has been added to the `dt` namespace, that can be called on column expressions containing datetime data. This method returns an integer that represents the day of the week.
- EXPERIMENTAL: Methods `show` and `plot` on Tables, providing visualizations of data using HoloViz Panel.
- Added support for `instance` parameter to `groupby`, `join`, `windowby` and temporal join methods.
- `pw.PersistenceMode.UDF_CACHING` persistence mode enabling automatic caching of `AsyncTransformer` invocations.
- Methods `round` and `floor` on columns with datetimes now accept duration argument to be a string.
- `pw.debug.compute_and_print` and `pw.debug.compute_and_print_update_stream` have a new argument `n_rows` that limits the number of rows printed.
- `pw.debug.table_to_pandas` has a new argument `include_id` (by default `True`). If set to `False`, creates a new index for the Pandas DataFrame, rather than using the keys of the Pathway Table.
- `windowby` function `shard` argument is now deprecated and `instance` should be used.
- Special column name `_pw_shard` is now deprecated, and `_pw_instance` should be used.
- `pw.ReplayMode` now can be accessed as `pw.PersistenceMode`, while the `SPEEDRUN` and `REALTIME` variants are now accessible as `SPEEDRUN_REPLAY` and `REALTIME_REPLAY`.
- EXPERIMENTAL: `pw.io.gdrive.read` has a new argument `with_metadata` (by default `False`). If set to `True`, adds a `_metadata` column containing file metadata to the resulting table.
- Methods `get_nearest_items` and `get_nearest_items_asof_now` of `KNNIndex` allow specifying `k` (number of returned elements) separately in each query.
- Added ability of creating custom reducers using `pw.reducers.udf_reducer` decorator. Use `pw.BaseCustomAccumulator` as a base class for creating accumulators. Decorating accumulator returns reducer following custom logic.
- A function `pw.debug.compute_and_print_update_stream` that computes and prints the update stream of the table.
- SQLite input connector (`pw.io.sqlite`).
- `pw.debug.parse_to_table` is now deprecated, `pw.debug.table_from_markdown` should be used instead.
- `pw.schema_from_csv` now has `quote` and `double_quote_escapes` arguments.
- Schema returned from `pw.schema_from_csv` will have quotes removed from column names, so it will now work properly with `pw.io.csv.read`.
- Experimental Google Drive input connector.
- Stateful deduplication function (`pw.stateful.deduplicate`) allowing alerting on significant changes.
- The ability to split data into batches in `pw.debug.table_from_markdown` and `pw.debug.table_from_pandas`.
- class `Behavior`, a superclass of all behavior classes.
- class `ExactlyOnceBehavior` indicating we want to create a `CommonBehavior` that results in each window producing exactly one output (shifted in time by an optional `shift` parameter).
- function `exactly_once_behavior` creating an instance of `ExactlyOnceBehavior`.
- BREAKING: `WindowBehavior` is now called `CommonBehavior`, as it can be also used with interval joins.
- BREAKING: `window_behavior` is now called `common_behavior`, as it can be also used with interval joins.
- Deprecating parameter `keep_queries` in `pw.io.http.rest_connector`. Now `delete_completed_queries` with an opposite meaning should be used instead. The default is still `delete_completed_queries=True` (equivalent to `keep_queries=False`) but it will soon be required to be set explicitly.
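The relationship between the deprecated `keep_queries` flag and its replacement `delete_completed_queries` is a plain negation. The sketch below shows one way such a transition could be handled; the helper `resolve_delete_completed_queries` is hypothetical and is not part of `pw.io.http.rest_connector`.

```python
import warnings


def resolve_delete_completed_queries(keep_queries=None,
                                     delete_completed_queries=None):
    """Map the old flag onto the new one (hypothetical helper)."""
    if keep_queries is not None and delete_completed_queries is not None:
        raise ValueError("pass only one of the two flags")
    if keep_queries is not None:
        warnings.warn(
            "keep_queries is deprecated; use delete_completed_queries",
            DeprecationWarning,
            stacklevel=2,
        )
        return not keep_queries  # the new flag has the opposite meaning
    if delete_completed_queries is None:
        return True  # default stated in the entry above
    return delete_completed_queries
```

For example, `keep_queries=False` resolves to `delete_completed_queries=True`, which matches the default described in the entry.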
- A flag `with_metadata` for the filesystem-based connectors to attach the source file metadata to the table entries.
- Methods `pw.debug.table_from_list_of_batches` and `pw.debug.table_from_list_of_batches_by_workers` for creating tables with defined data being inserted over time.
- BREAKING: `pw.debug.table_from_pandas` and `pw.debug.table_from_markdown` now will create tables in the streaming mode, instead of static, if given table definition contains `_time` column.
- BREAKING: Renamed the parameter `keep_queries` in `pw.io.http.rest_connector` to `delete_queries` with the opposite meaning. It changes the default behavior - it was `keep_queries=False`, now it is `delete_queries=False`.
- A method `get_nearest_items_asof_now` in `KNNIndex` that allows getting nearest neighbors without updating old queries in the future.
- A method `asof_now_join` in `Table` to join rows from left side of the join with right side of the join at their processing time. Past rows from left side are not used when new data appears on the right side.
- `interval_join` now supports forgetting old entries. The configuration can be passed using `behavior` parameter of `interval_join` method.
- Decorator `@table_transformer` for marking that functions take Tables as arguments.
- Namespace for all columns `Table.C.*`.
- Output connectors now provide logs about the number of entries written and time taken.
- Filesystem connectors now support reading whole files as rows.
- Command line option for `pathway spawn` to record data and `pathway replay` command to replay data.
- `select` operates only on consistent states.
- `Schema` method `typehints` that returns dict of mypy-compatible typehints.
- Support for JSON parsing from CSV sources.
- `restrict` method in `Table` to restrict table universe to the universe of the other table.
- Better support for postgresql types in the output connector.
- BREAKING: renamed `Table` method `dtypes` to `typehints`. It now returns a `dict` of mypy-compatible typehints.
- BREAKING: `Schema.__getitem__` returns a data class `ColumnSchema` containing all related information on particular column.
- BREAKING: `tuple` reducer used after intervals_over window now sorts values by time.
- BREAKING: expressions used in `select`, `filter`, `flatten`, `with_columns`, `with_id`, `with_id_from` have to have the same universe as the table. Earlier it was possible to use an expression from a superset of a table universe. To use expressions from wider universes, one can use `restrict` on the expression source table.
- BREAKING: `pw.universes.promise_are_equal(t1, t2)` no longer allows using references from `t1` and `t2` in a single expression. To change the universe of a table, use `with_universe_of`.
- BREAKING: `ix` and `ix_ref` are temporarily broken inside joins (both temporal and ordinary).
- `select`, `filter`, `concat` keep columns as a single stream. The work for other operators is ongoing.
- Optional types other than string correctly output to PostgreSQL.
- Support for messages compressed with zstd in the Kafka connector.
- Support for JSON data format, including `pw.Json` type.
- Methods `as_int()`, `as_float()`, `as_str()`, `as_bool()` to convert values from `Json`.
- New argument `skip_nones` for `tuple` and `sorted_tuple` reducers.
- New argument `is_outer` for `intervals_over` window.
- `pw.schema_from_dict` and `pw.schema_from_csv` for generating schema based, respectively, on provided definition as a dictionary and CSV file with sample data.
- `generate_class` method in `Schema` class for generating schema class code.
- Method `get()` and `[]` to support accessing elements in Jsons.
- Function `pw.assert_table_has_schema` for writing asserts checking whether a given table has the same schema as the one that is given as an argument.
- BREAKING: `ix` and `ix_ref` operations are now standalone transformations of `pw.Table` into `pw.Table`. Most of the usages remain the same, but sometimes the user needs to provide a context (when e.g. using them inside `join` or `groupby` operations). `ix` and `ix_ref` are temporarily broken inside temporal joins.
- Fixed a bug where new-style optional types (e.g. `int | None`) were translated to `Any` dtype.
- Incompatible `beartype` version is now excluded from dependencies.
- Module `pathway.dt` to construct and manipulate DTypes.
- New argument `keep_queries` in `pw.io.http.rest_connector`.
- Internal representation of DTypes. Inputting types is backwards compatible.
- Temporal functions now accept arguments of mixed types (ints and floats). For example, `pw.temporal.interval` can use ints while columns it interacts with are floats.
- Single-element arrays are now treated as arrays, not as scalars.
- `to_string()` method on datetimes always prints 9 fractional digits.
- `%f` format code in `strptime()` parses fractional part of a second correctly regardless of the number of digits.
- `Table.cast_to_types()` function that can perform `pathway.cast` on multiple columns.
- `intervals_over` window, which allows getting data temporally close to given times.
- `demo.replay_csv_with_time` function that can replay a CSV file following the timestamps of a given column.
- Static data is now copied to ensure immutability.
- Improved error tracing mechanism to work with any type of error.
- `tuple` reducer, that returns a tuple with values.
- `ndarray` reducer, that returns an array with values.
- `numpy` arrays of `int32`, `uint32` and `float32` are now converted to their 64-bit variants instead of tuples.
- KNNIndex interface to take columns as inputs.
- Reducers now check types of their arguments.
- Fixed delayed reporting of output connector errors.
- Python objects are now freed more often, reducing peak memory usage.
- `@` (matrix multiplication) operator.
- Python version 3.10 or later is now required.
- Type checking is now more strict.
- Immediately forget queries in REST connector.
- Make type annotations mandatory in `Schema`.
- Fixed IDs coming from CSV source.
- Fixed indices of dataframes from pandas transformer.