Commit 7e0e795
TIMDEXDataset capable of yielding current records only
Why these changes are being introduced:
With TIMDEXDataset capable of limiting to only parquet files
associated with current runs, the next logical step is providing
the ability to yield only the current version of a record.
This would support a "full refresh" of a TIMDEX source where an
application like TIM could yield only current records for a given
source and index those to Opensearch.
How this addresses that need:
When TIMDEXDataset is loaded with current_records=True, the private
attribute TIMDEXDataset._dedupe_on_read is set to True, informing
any read methods to dedupe during yielding. Because all read
methods rely on TIMDEXDataset.read_batches_iter() at the lowest level,
the deduping logic is required only there.
Because the ordering of the parquet files is already handled by
the load method, the read methods can be confident they are always
seeing the most recent version of a record first, and thus can
just maintain a "seen" list of records as they are encountered,
skipping any record already on it. This keeps
the deduplication effectively instant and memory safe; no large
in-memory reordering or deduplication is required.
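
A minimal sketch of the approach described above, not the code from this
commit. It assumes batches arrive newest-first, uses plain lists of dicts
in place of parquet record batches, and assumes records are identified by
a "timdex_record_id" key. The standalone functions only borrow the
read-method names for illustration, and a set stands in for the "seen"
list:

```python
from collections.abc import Iterable, Iterator

Record = dict[str, str]


def read_batches_iter(
    batches: Iterable[list[Record]],
    dedupe_on_read: bool = False,
    id_key: str = "timdex_record_id",  # assumed identifier column
) -> Iterator[list[Record]]:
    """Yield batches, optionally keeping only the first occurrence of each id."""
    seen: set[str] = set()  # holds identifiers only, never whole records
    for batch in batches:
        if not dedupe_on_read:
            yield batch
            continue
        current = []
        for record in batch:
            record_id = record[id_key]
            if record_id in seen:
                continue  # an earlier (more recent) version was already yielded
            seen.add(record_id)
            current.append(record)
        if current:
            yield current


def read_dicts_iter(
    batches: Iterable[list[Record]], dedupe_on_read: bool = False
) -> Iterator[Record]:
    """Higher-level reader; it needs no dedupe logic of its own."""
    for batch in read_batches_iter(batches, dedupe_on_read):
        yield from batch


# Hypothetical data: the most recent run comes first.
batches = [
    [
        {"timdex_record_id": "alma:1", "run_date": "2025-02-01"},
        {"timdex_record_id": "alma:2", "run_date": "2025-02-01"},
    ],
    [
        {"timdex_record_id": "alma:1", "run_date": "2025-01-01"},
        {"timdex_record_id": "alma:3", "run_date": "2025-01-01"},
    ],
]
current = list(read_dicts_iter(batches, dedupe_on_read=True))
```

In this sketch the only state kept between batches is the set of
identifiers, so the older "alma:1" record is dropped without sorting or
holding the full dataset in memory.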
Side effects of this change:
* Applications like TIM now have the option of yielding only current
records for a source, or all sources, supporting new functionality
like fully reindexing a source in Opensearch from parquet dataset
data alone.
Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-4941
parent 50fff12 commit 7e0e795
3 files changed
Lines changed: 110 additions & 6 deletions