Commit 2fd2218

Begin rebuild of TIMDEXDatasetMetadata
Why these changes are being introduced:

The current overarching work is to support the creation and reading of a static metadata database file and append deltas. To get there, very little of the original TIMDEXDatasetMetadata class is needed or wanted. This commit begins the process of rebuilding TIMDEXDatasetMetadata, oriented around managing a static metadata database file and providing a read-only projection over that file and the append delta parquet files.

How this addresses that need:

TIMDEXDatasetMetadata is almost completely rebuilt. The first functionality added is the creation of the static metadata file by scanning the ETL records, followed by the ability to attach remotely, in read-only mode, to this metadata database file for reading.

Note: these changes are breaking. TIMDEXDataset cannot provide "current" records and many unit tests are broken. This will be addressed in future commits as we build this class back up with new functionality.

Side effects of this change:
* TIMDEXDataset cannot provide current records
* Unit tests are either temporarily skipped or failing

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-530
1 parent b907a15 commit 2fd2218
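
The commit message describes two steps: write a static metadata database file, then attach it read-only to a working connection. The sketch below illustrates that pattern only; it is not the code from this commit. It uses the standard library's sqlite3 as a stand-in for DuckDB so that it runs without dependencies. The function names `build_static_db` and `attach_static_db` and the `run_id` column are hypothetical. In DuckDB, the analogous attach statement would be along the lines of `ATTACH '<path>' AS static_db (READ_ONLY);`.

```python
import os
import sqlite3
import tempfile


def build_static_db(path: str, rows: list[tuple[str, str]]) -> None:
    """Create a standalone database file holding a 'records' table.

    Hypothetical helper: the real class scans ETL records to build this file.
    """
    conn = sqlite3.connect(path)
    try:
        conn.execute("CREATE TABLE records (timdex_record_id TEXT, run_id TEXT)")
        conn.executemany("INSERT INTO records VALUES (?, ?)", rows)
        conn.commit()
    finally:
        conn.close()


def attach_static_db(path: str) -> sqlite3.Connection:
    """Open an in-memory connection and attach the file read-only as 'static_db'."""
    # uri=True lets ATTACH interpret "file:...?mode=ro" as a read-only URI.
    conn = sqlite3.connect(":memory:", uri=True)
    conn.execute("ATTACH DATABASE ? AS static_db", (f"file:{path}?mode=ro",))
    return conn


if __name__ == "__main__":
    db_path = os.path.join(tempfile.mkdtemp(), "metadata.db")
    build_static_db(db_path, [("alma:0", "run-5"), ("alma:5", "run-4")])

    conn = attach_static_db(db_path)
    # Reads go through the attached schema, mirroring "static_db.records" in the tests.
    print(conn.execute("SELECT count(*) FROM static_db.records").fetchone()[0])

    try:
        conn.execute("INSERT INTO static_db.records VALUES ('alma:9', 'run-9')")
    except sqlite3.OperationalError as exc:
        # Writes are rejected because the file was attached with mode=ro.
        print(f"write rejected: {exc}")
```

Keeping the working connection in memory and attaching the file under its own schema name means readers can query the static records without any risk of modifying the file.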

3 files changed

Lines changed: 178 additions & 313 deletions

tests/test_metadata.py

Lines changed: 10 additions & 47 deletions
@@ -39,51 +39,14 @@ def test_tdm_get_duckdb_connection(timdex_dataset_metadata):
     assert isinstance(conn, duckdb.DuckDBPyConnection)


-def test_tdm_set_threads(timdex_dataset_metadata):
-    # set to 64
-    timdex_dataset_metadata.set_database_thread_usage(64)
-    sixty_four_thread_count = timdex_dataset_metadata.conn.query(
-        """SELECT current_setting('threads');"""
-    ).fetchone()[0]
-    assert sixty_four_thread_count == 64
-
-    # set to 12
-    timdex_dataset_metadata.set_database_thread_usage(12)
-    sixty_four_thread_count = timdex_dataset_metadata.conn.query(
-        """SELECT current_setting('threads');"""
-    ).fetchone()[0]
-    assert sixty_four_thread_count == 12
-
-
-def test_tdm_init_sets_up_database(timdex_dataset_metadata):
-    df = timdex_dataset_metadata.conn.query("show tables;").to_df()
-    assert set(df.name) == {"current_records", "records"}
-
-
-def test_tdm_get_current_parquet_files(timdex_dataset_metadata):
-    parquet_files = timdex_dataset_metadata.get_current_parquet_files()
-    # assert 5 total parquet files in dataset
-    # but only 3 contain current records
-    assert len(timdex_dataset_metadata.timdex_dataset.dataset.files) == 5
-    assert len(parquet_files) == 3
-
-
-def test_tdm_get_record_to_run_mapping(timdex_dataset_metadata):
-    record_map = timdex_dataset_metadata.get_current_record_to_run_map()
-
-    assert len(record_map) == 75
-    assert record_map["alma:0"] == "run-5"
-    assert record_map["alma:5"] == "run-4"
-    assert record_map["alma:19"] == "run-4"
-    assert "run-3" not in record_map.values()
-    assert record_map["alma:20"] == "run-2"
-
-
-def test_tdm_current_records_subset_of_all_records(timdex_dataset_metadata):
-    records_df = timdex_dataset_metadata.conn.query("select * from records;").to_df()
-    current_records_df = timdex_dataset_metadata.conn.query(
-        "select * from current_records;"
+def test_tdm_connection_has_static_database_attached(timdex_dataset_metadata):
+    assert set(
+        timdex_dataset_metadata.conn.query("""show databases;""").to_df().database_name
+    ) == {"memory", "static_db"}
+
+
+def test_tdm_connection_static_database_records_table_exists(timdex_dataset_metadata):
+    records_df = timdex_dataset_metadata.conn.query(
+        """select * from static_db.records;"""
     ).to_df()
-    assert set(current_records_df.timdex_record_id).issubset(
-        set(records_df.timdex_record_id)
-    )
+    assert len(records_df) > 0
