Skip to content

Smurf-IV/Excel_PRIME

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

107 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Excel_PRIME 🌟

  • Excel_Performant Reader via Interfaces for Memory Efficiency.
  • Without using any external libraries.
  • Optimised for Range extraction.
  • Excel.png

What does that mean?

  • Yet another Excel reader ?,
    • Starting with .Net 8 as the performant Runtime (See Benchmarks)
    • .Net9 gives an extra 5% boost,
    • .Net10 Another 5% over .Net9 ;-)

Lets take each of the above elements and explain:

Excel 📈

  • Open Large 2007 (Onwards) XLSX file formats and XLSB (BIFF12) in V3.##
  • Zip Deflate format Only

Performant 🚀

  • Try to be as fast as possible, i.e.
    • Forward only Lazy loading
    • Only "Quick" decipher / convert of the cell(s) types to ease GC pressure
    • No attempt at "creating / using" datatables with headers etc.
    • Use IEnumerables with initial offset starts (Row / Column)
    • Allow CancellationTokens to be used to allow page transitioning cancellation (More on this later)
  • Now the fastest in Real world usage 2025-11-19 onwards

Q & A's

  • Q: There are others that are faster
  • A: Agreed, but then
    • They do not have range extraction.
    • Or optionally allow the use of the OS's TempFile System to store massive sheets
    • Or re-use of already extracted (massive) sheets
    • Or allow multiple sheets to be read at the same time
      • because others use global memory to represent the current row
      • Or have a single access into the Zip Excel file

Reader 📋

  • Read only
    • Therefore no calculations or updates to formula calls

Interfaces 🏗️

  • Will use the DotNet core functionality by default
  • But, if your target deployment allows for the use of native performant binaries, then via the use of interfaces these will be pluggable
    • i.e. Using Zlib.Net for getting the data streams out of the compressed Excel file faster. (Or SharpZipLib / PowerPlayZipper)
    • A faster / slimmer implementation for xml stream reading (i.e. TurboXml)
  • Allow the implementation of different source files (i.e. XLSB)

Q & A's

  • Q: Why?
  • A: As mentioned above, this is to allow a developer to replace with external nugets that might perform better XML speed etc.

Memory 🌐

  • The reason for this project, is to handle very large XSLX files (i.e. > 500K rows with > 180 columns per sheet, with multiple sheets of this size)
  • For ETL validation scenarios, i.e. make sure that the user modified data that has been transferred has interaction rules applied, before moving onto the T and L stages
  • Try not to hit / store in the LOH
  • No internal .Net memory of previously loaded sheets / rows.

Q & A's

  • Q: It appears that this uses more memory than other implementations
  • A: Currently yes, but it is being optimised for Range Extraction,
    • AND for allowing multiple rows (With cell data) to be stored in memory at the same time, (i.e. via ToList() call);
    • AND to allow multiple sheets to be read at the same time (Unlike some to of the others that use a single global memory to represent a row)
    • And it appears that the current benchmarks do not extract unless a ToString and a check on the result is used (Otherwise the Jit removes the unassigned dead code)
    • And, the memory used will actually be used in the ETL pipeline anyway, so it's just being truthful

Efficiency 📦

  • As hinted by the above statements, this is to be targetted at memory restricted environments (i.e. ASP Net VM's)
  • Use the OS's "Temp File" caching, so if the memory is tight then the Owner app will not have to worry about OOM exceptions, or having to use Swap Disk speeds.
  • Only unzip the sheet(s) when they are asked for
  • Only load the shared strings upto the current request number

Q & A's

  • Q: Sometimes the Async await s add too much overhead
  • A: true, that is why there are also the equivalent base interfaces that perform the same functionality without the need for the async await overheads.

Etc. 🔧

CancellationTokens

  • This is to allow the Large files to be Aborted
  • Make "Most" of the "Net Cores'" API's Asynchronous Tasks

IDisposable

  • Got to tidy up those Temp Files, and release the FileStream's

Challenges:

  • CellValue instances are returned to users
  • They must be thread-safe (multiple readers possible)
  • Each "Cell Type" / "Cell Instance" / " Row Instance"(string, numeric, boolean, datetime, error) have different lifecycle requirements

Caveats ⛔:

It will not be: Same sheet thread instance safe 📊

  • It will Not be same sheet Instance thread safe, because the xml reader will be locked (Forward only) to the sheet in use.
    • but you can Open the sheet more than once, and have different threads running over it,
    • And you can have Parallel threads access the Excel file
    • Just remember to set Options{ AccessExcelFileInForwardOnlyMode = false}

It will not do: Dynamic Ranges ⚠️

  • i.e. Ones that contain formulas:
    • <definedName name="Prices">OFFSET(Sheet1!$A$1,0,0,COUNTA(Sheet1!$A:$A),1)</definedName>

It will not do: Poco 🤖

  • A POCO / Type populator (Extensions can be written for that later)

It will not be: Writer / Modifier 📚

  • Totally beyond the scope of this project remit

Badge 🔄 Area
.NET Release build and tests

Flag Counter


Targets 🎯

Phase 0

  • ✅ Setup this github
  • ✅ Create the main project
  • ✅ Add Unit Test project
  • ✅ Add simple Test Data

Phase Alpha

  • ✅ Use Net Core Interface(s)
    • ✅ Use ZipArchive
    • ✅ Use XDocument
  • ✅ Implement Open / Dispose (Async)
    • ✅ Sheet Names
    • ✅ Shared Strings
  • ✅ Implement Sheet loading (unzip and be ready for use)
    • ✅ Use XDocument as POC only
  • ✅ Implement Row extraction
    • ✅ Skip
    • ✅ Delayed read - until a cell is actually needed
    • ✅ Deal with Null / Empty cells (Utilise sparse array?)
    • ✅ Keep last used offset (i.e. no need to reload sheet if the next range API startRow call is later)

Phase Beta - Benchmarks ⏱️

  • ✅ Benchmarks
    • ✅ Add Other "Excel readers" to the Benchmark project(s)
    • 🎉 Now With Sylvan.Data.Excel
    • 🎉 Now With XlsxHelper
  • ✅ More UnitTests

Phase 1 - MVP 🔍

  • ✅ Add IEnumerables and benchmark
  • ✅ Implement XmlReader.Create for
  • ✅ More Benchmarks
    • Now With FastExcel
    • ✅ Some Profiling Enahancements
  • ✅ Better Storage of the SharedStrings
  • ✅ Cell object type 📅
  • ✅ Use internal ZipEntry rented buffer
  • ✅ Investigation into the smallest function 💪
  • ✅ Optimise for CellConversion.None 💪
  • ✅ Parallel Sheet threads Access
  • ✅ Nuget
    • ✅ Beta etc.
    • 🎊 Released as Nuget V1.yyMM.dd -> 1.2511.14

Phase 2 - RC

  • ✅ Add IEnumerables All the way down ⤵️
  • ✅ Nuget
    • ✅ Manual workflow deploy Release
    • ✅ Manual workflow deploy Beta
  • ✅ Read "definedName"s (Ranges / Cell / Value / Dynamic) 📇
  • ✅ Deal with blank rows in a sheet 🗋
  • ✅ Deal with Empty cells in a row 🗅
  • ✅ Implement Sheet scoping of "definedName"s
  • ✅ Implement Row extraction 📟
  • ✅ Implement RangeExtraction 📲
  • ✅ Add Benchmarks for "Excel readers" That perform Range Extraction
    • ClosedXML Version="0.105.0"
    • EPPlus_LPGL Version="4.5.3.13"
    • ⚠️ FastExcel Version="3.0.13" -> Fails on Range Extraction
    • FreeSpire.XLS Version="14.2.0"
    • Aspose.Cells Version="25.11.0"
    • ⚠️ Extend benchmarks to cover the other large file types
  • ✅ Investigate memory usage(s) 🧑‍💻
  • ✅ Release as Nuget V2.2512-10 💨

V2 Changes ➡️ 2025-12-14


Phase V3 - XLSB 💾 (BIFF12)

  • ⛓️‍💥 Breaking Change(s)
    • FileType has been removed, and Open via the Public class type
    • IXmlReaderHelpers has become IOpenXmlReaderHelpers, with slightly different methods
    • IXmlWorkBookReader has become IOpenXmlWorkBookReader
    • IXmlSheetReader has become IOpenXmlSheetReader
    • Removal of the Conversion options Number###
    • Changed GetAllCells to return IReadOnlyList<ICell?>?
      • Watch out for those null rows !
  • ✅ Branch and beta yml
    • ✅ Convert test data in xlsb format
  • ✅ Implement Open / Dispose (Async)
    • ✅ Sheet Names
    • ✅ Shared Strings
  • ✅ Implement Sheet loading
  • ✅ Implement Row extraction
    • ✅ Skip
    • ✅ Delayed read - until a cell is actually needed
    • ✅ Deal with Null / Empty cells
  • ✅ Cell object type 📅
  • ✅ Benchmarks 🖲️
  • ✅ Read "definedName"s (Ranges / Cell / Value / Dynamic) 📇
    • ✅ Read from global
  • ✅ Strongly-typed accessors (AsInt32, AsDateTime, etc)
    • Slightly slower, but less memory pressure for xslb
    • 2026-01-02
  • ✅ Parallel Sheet threads Access
    • ✅ Multiple times (with locking)
  • ✅ Release as Nuget V3.yyMM.dd
    • 🎊 Released RC1 as Nuget V3.2601.04-RC1
  • ✅ Investigate Performance and edge cases, then Release as Stable
  • 🎊 Released V3 as Nuget V3.2601.11

V3 Changes ➡️ 2026-01-16

  • Remove some AggressiveOptimization and allow i-cache to do its job
  • Implement "Hot-Paths" for cell type access
  • Reduce some memory allocations for ReadOnly CellCollections

Phase V4 - Specific Cell value type(s) #️⃣

  • ⛓️‍💥 Breaking Change(s)
    • Removal of GetSheetFileName(int offsetSheetId);
    • Removal of GetDefinedRange via int sheetId
    • Removal of Index property from ISheet
    • Internal Creation of WorkBooks
    • Internal implementation of IOpenXmlWorkBookReader::GetSheetNames now returns the relative path to the "Sheet Name"
    • CellValue is now a class, therefore no need to use .Value
    • ICell.CellValue is now nullable
  • ✅ Cell object type 📅
    • ✅ "Best Effort" Operator based conversion
    • ✅ TryGetType will return out type, if stored as that type.
    • ✅ Unit Tests
  • ✅ Performance
    • ✅ Use ValueTask and reduce memory allocations in some hot paths
    • 🚀 Fix fallout from making CellValue is now a class
    • ArrayPool support has been added to ThreadStringBuilderPool using ArrayPool.
    • ✅ Release-specific optimizations added
      • ✅ EnableTrimAnalyzer: true
      • ✅ TieredCompilation: true
      • ✅ TieredCompilationQuickJit: true
      • ✅ TieredCompilationQuickJitForLoops: true
  • ✅ Implement System.DBNull return option, for empty cells
    • ✅ Implement INullRow return option, for empty rows
    • ✅ Update tests to use INullRow detection
  • ✅ Implement GetCell###(string columnLetters, ...) #8
  • 🚀 2026-05-05

Phase 5 - User Cell Value type formatting 💽 & Performance Optimizations 🏃‍➡️

  • ⛓️‍💥 Breaking Change(s)
    • None yet.
  • Cell object type 📅
    • Store cell style type (see Options enum)
    • Use of user defined column schema
    • Formatter applied -> CellConversion.ForceStyles
    • Unit Tests
    • Deal with DateOnly / TimeOnly fields -> CellConversion.NumberAndDates 💹 Code-Level Optimizations
    • Implement ISpanFormattable in CellValue
    • Use CollectionsMarshal for zero-copy operations
    • Add System.Runtime.Intrinsics for SIMD Advanced Scenarios
    • Enable PublishTrimmed=true with trim warnings resolved
    • Native AOT compilation testing
    • IAsyncEnumerable stream processing Monitoring
    • Add performance regression tests
    • Implement ETW profiling in CI/CD
    • Track JIT compilation metrics

Phase 6 - Third Party Nugets 📦

  • ⛓️‍💥 Breaking Change(s)
    • None yet.
  • Excercise the Implementation of Interfaces for other Libs (Xml / Zip)
    • Separate Nuget(s) ?
  • Benchmarks
    • e.g. search isages of Class PoolingArrayBufferWriter<T>
    • [ ]

Phase 7 - ideas 💡

  • Investigate a different way of storing the Shared strings to the Filesystem, when they are in the MB's

    • e.g. Search for Class FileBufferingWriter
  • Investigate possibility of using "Pipelining" to get data for Next row / cell population after yield?

    • Locking
    • How to deal with rows that are completely blank
    • fibres ?
  • Indicate that things may be Hidden 🖺

    • Sheet
    • Row
    • Column
    • Cell ?
  • Indicate that things may be Readonly

    • Sheet
    • Row
    • Column
    • Cell ?
  • More ideas to be added later, Please suggest... ;-)

About

Excel P.erformant R.eader via I.nterfaces for M.emory E.fficiency. Without using any external libraries. Optimised for Range extraction

Topics

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages