Refactor class and duckdb connection relationships
Why these changes are being introduced:
This refactoring work was a long time coming, inspired by a recent need to gracefully
handle a read request for embeddings against a dataset without embeddings parquet
files. If we can normalize how and when tables are created, and the handling of
duckdb connections, we can normalize handling read requests for data that may not be
available (yet). As such, this refactoring work will help normalize read edge cases
now and going forward.
This library was built in stages. First was TIMDEXDataset, which read parquet files
directly. Then TIMDEXDatasetMetadata, which more formally introduced DuckDB. It
handled the connection creation and configuration. This connection was shared with
TIMDEXDataset as we leaned into DuckDB reading. Lastly, TIMDEXEmbeddings was added
as our first new "source" of data. This class shared the connection from TIMDEXDataset.
Both TIMDEXDatasetMetadata and TIMDEXEmbeddings were doing their own SQLAlchemy
table reflections. TIMDEXDatasetMetadata could be instantiated on its own, while
TIMDEXEmbeddings was assumed to take an instance of TIMDEXDataset.
At this point, while things worked, it was clear that a refactor would be beneficial.
We needed to clarify what creates and configures the DuckDB connection, establish that
TIMDEXDatasetMetadata and TIMDEXEmbeddings are components on TIMDEXDataset, and define
how and when SQLAlchemy reflection is performed. Aligning all these things will
make responding to these read and write edge cases much easier.
How this addresses that need:
- A new factory class, DuckDBConnectionFactory, is responsible for
creating and configuring any DuckDB connections used.
- Both TIMDEXDatasetMetadata and TIMDEXEmbeddings require a TIMDEXDataset
instance, and then themselves become components on TIMDEXDataset. We can then more
accurately call them "components" of the primary TIMDEXDataset.
- TIMDEXDataset handles the creation of a DuckDB connection via the new factory,
and this connection is then accessible to its components TIMDEXDatasetMetadata and
TIMDEXEmbeddings (maybe more in the future).
- TIMDEXDataset is also responsible for all SQLAlchemy reflection, saving to
self.sa_tables. In this way, any component that needs a SQLAlchemy table instance,
e.g. for reading, can get it from `self.timdex_dataset.get_sa_table(<schema>,
<table>)`.
- Refreshing of classes is greatly simplified: TIMDEXDataset is the true orchestrator
now, so a full re-init satisfies this. Components no longer have their own
`.refresh()` methods.
- Where possible, update all tests to use components like TIMDEXEmbeddings as part
of a TIMDEXDataset instance, not as a lone class instance.
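
The relationships listed above can be illustrated with a minimal sketch. This is
not the library's code: the class names (SketchConnectionFactory, SketchDataset,
SketchEmbeddings), the `get_table` helper, the `with_embeddings` flag, and the
`embeddings` table layout are all hypothetical, and the standard library's sqlite3
stands in for DuckDB and SQLAlchemy so the example runs without extra dependencies.
It shows only the shape of the design: one factory creates the connection, the
dataset owns the connection and the table reflection, and a component asks the
dataset for a table and returns nothing when that table is absent.

```python
import sqlite3


class SketchConnectionFactory:
    """Hypothetical: the single place where connections are created and configured."""

    def create_connection(self) -> sqlite3.Connection:
        conn = sqlite3.connect(":memory:")  # assumption: sqlite3 stands in for DuckDB
        conn.row_factory = sqlite3.Row
        return conn


class SketchEmbeddings:
    """Hypothetical component: requires a dataset and reuses its connection."""

    def __init__(self, dataset: "SketchDataset") -> None:
        self.dataset = dataset

    def read(self) -> list:
        # Ask the orchestrating dataset for the table; if it was never
        # created (no embeddings data), return an empty result.
        if self.dataset.get_table("embeddings") is None:
            return []
        rows = self.dataset.conn.execute(
            "SELECT record_id, dim FROM embeddings ORDER BY record_id"
        )
        return [tuple(row) for row in rows]


class SketchDataset:
    """Hypothetical orchestrator: owns the connection, reflection, and components."""

    def __init__(self, with_embeddings: bool = False) -> None:
        self.conn = SketchConnectionFactory().create_connection()
        if with_embeddings:
            # Stand-in for data that may or may not exist in a real dataset.
            self.conn.execute("CREATE TABLE embeddings (record_id TEXT, dim INTEGER)")
            self.conn.execute("INSERT INTO embeddings VALUES ('rec-1', 3)")
        # Reflection happens once, here, rather than in each component.
        self.tables = self._reflect_tables()
        self.embeddings = SketchEmbeddings(self)

    def _reflect_tables(self) -> dict:
        rows = self.conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
        return {row["name"]: row["name"] for row in rows}

    def get_table(self, name: str):
        return self.tables.get(name)
```

In this sketch, `SketchDataset().embeddings.read()` returns an empty list, while
`SketchDataset(with_embeddings=True).embeddings.read()` returns the stored row.
Refreshing amounts to constructing a new SketchDataset, since the components hold
no state of their own.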
Side effects of this change:
* It is not recommended to use TIMDEXDatasetMetadata or TIMDEXEmbeddings
by themselves; they are meant as components on a TIMDEXDataset.
Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-306