Skip to content

Operations Catalog — MakeIndex Port to Python

Purpose: Categorize all logical operations found in the original MakeIndex function (and its dependency tree) into defined groups, correlate each group to recommended Python modules/libraries, and flag improvements over the original implementation.
Audience: AI-first, Human-second
Scope: Covers MakeIndex, its 8 external pslib dependencies (Base64DecodeString, Date2UnixTime, DirectoryId, FileId, MetaFileRead, TempOpen, TempClose, Vbs), 2 external binaries (exiftool, jq), and all internal sub-functions.
Schema Reference: MakeIndex_OutputSchema.json — the Encoding key is intentionally excluded from the ported software per project requirements.


How to Read This Document

Each operation category below contains:

  • What It Does: A summary of the original behavior across MakeIndex and its dependencies.
  • Where It Lives (Original): The specific functions/sub-functions in the PowerShell source that implement the operation.
  • Python Modules: The recommended Python standard-library or third-party modules for the ported implementation.
  • Improvement Notes: Where the original logic was inefficient, fragile, or platform-locked, and how the Python port should diverge.

Categories are ordered by architectural importance (foundational operations first, output/presentation operations last).


Category Index

# Category Primary Python Modules
1 Filesystem Traversal & Discovery pathlib, os
2 Path Resolution & Manipulation pathlib
3 Hashing & Identity Generation hashlib
4 Symlink Detection pathlib
5 Filesystem Timestamps & Date Conversion datetime
6 EXIF / Embedded Metadata Extraction subprocess + exiftool, json
7 Sidecar Metadata File Handling json, re, pathlib
8 Output Object Construction & Schema dataclasses, json
9 JSON Serialization & Output Routing json (or orjson), pathlib, sys
10 File Rename & In-Place Write Operations pathlib, shutil
11 Configuration Management dataclasses, tomllib (or pyyaml)
12 Logging & Verbosity logging
13 Progress Reporting tqdm (or rich)
14 CLI Argument Parsing & Entry Point click (or argparse)
15 Temporary File Management tempfile

1. Filesystem Traversal & Discovery

What It Does

Enumerates files and directories within a target path. Supports recursive and non-recursive modes. Filters out system artifacts ($RECYCLE.BIN, System Volume Information). Separates child items into file and directory lists. Counts items for progress reporting. Handles the three input scenarios: single file, single directory (flat), and directory tree (recursive).

Where It Lives (Original)

Function Role
MakeDirectoryIndexRecursiveLogic Recursive tree walk via Get-ChildItem -Force -Recurse combined with self-recursive calls per child directory
MakeDirectoryIndexLogic Non-recursive single-directory enumeration via Get-ChildItem -Force
MakeFileIndex Single-file path; skips traversal entirely, calls Get-Item on one file
Where-Object filters Excludes $RECYCLE.BIN and System Volume Information by name match
[System.Collections.ArrayList] Combines file and directory child arrays into a single ordered collection

Python Modules

Module Usage
pathlib.Path.iterdir() Non-recursive directory listing
pathlib.Path.rglob('*') Recursive traversal (replaces Get-ChildItem -Recurse)
os.scandir() High-performance directory iteration when pathlib overhead matters (large trees)
pathlib.Path.is_file() / .is_dir() Item type classification (replaces Where-Object separation)

Improvement Notes

The original uses two entirely separate code paths for recursive vs. non-recursive traversal (MakeDirectoryIndexRecursiveLogic which calls itself, vs. MakeDirectoryIndexLogic). This created near-complete code duplication. The Python port should use a single traversal function parameterized by a recursive: bool flag. When recursive=True, use Path.rglob('*') or os.walk(). When recursive=False, use Path.iterdir(). Both paths feed into the same object-construction pipeline.

The original manually assembles an ArrayList from two separate Get-ChildItem calls (one for files, one for directories). Python's os.scandir() returns DirEntry objects that already expose .is_file() and .is_dir() without additional stat calls, making the separation trivially cheap in a single pass.

The hardcoded exclusion of $RECYCLE.BIN and System Volume Information is Windows-specific. The port should externalize the exclusion list into configuration (see Category 11) and default to a cross-platform set that also covers .DS_Store, .Spotlight-V100, .Trashes, and similar platform artifacts.


2. Path Resolution & Manipulation

What It Does

Resolves relative, hypothetical, or symbolic paths into absolute canonical forms. Extracts parent directories, basenames, extensions, and filenames from path strings. Constructs new paths for renamed files and sidecar metadata outputs. Validates extension strings against a regex pattern to reject malformed or suspiciously long extensions.

Where It Lives (Original)

Function Role
ResolvePath Resolves real paths via Resolve-Path; falls back to [System.IO.Path]::GetFullPath() for hypothetical (non-existent) paths
GetParentPath Extracts parent directory via Split-Path
MakeObject Extracts extension, basename, filename via [System.IO.Path] methods; validates extension with regex ^(([a-z0-9]){1,2}\|([a-z0-9]){1}([a-z0-9\-]){1,12}([a-z0-9]){1})$
FileId-ResolvePath / DirectoryId-ResolvePath Redundant copies of the same resolve logic inside FileId and DirectoryId
$Sep (global) [System.IO.Path]::DirectorySeparatorChar used for manual string concatenation of paths

Python Modules

Module Usage
pathlib.Path.resolve() Canonical absolute path resolution (replaces Resolve-Path + GetFullPath fallback)
pathlib.Path.parent Parent directory extraction (replaces Split-Path and GetParentPath)
pathlib.Path.name / .stem / .suffix Filename, basename, and extension extraction (replaces all [System.IO.Path] calls)
pathlib.PurePosixPath / PureWindowsPath Cross-platform path construction without touching the filesystem

Improvement Notes

The original has three independent copies of the "resolve path" logic: ResolvePath in MakeIndex, FileId-ResolvePath in FileId, and DirectoryId-ResolvePath in DirectoryId. All three do the same thing. The Python port should have exactly one path resolution utility.

The original constructs paths by string-concatenating components with $Sep as a manual separator character. This is brittle and unnecessary. pathlib operator overloading (parent / filename) handles path construction correctly across platforms with no separator management.

The extension validation regex in the original rejects extensions longer than 14 characters or those containing non-alphanumeric characters (beyond hyphens). This is reasonable but should be made configurable rather than hardcoded, as some legitimate extensions (e.g., .numbers, .download) may be affected.


3. Hashing & Identity Generation

What It Does

Computes cryptographic hashes of file contents and name strings to produce deterministic unique identifiers. Files get content-based IDs (prefixed y); directories get name-based IDs computed from a two-layer scheme: hash( hash(dirName) + hash(parentDirName) ) (prefixed x). Handles null/empty inputs with known null-hash constants. Handles symlinks by falling back to name hashing instead of content hashing.

The original only computes MD5 and SHA256 at runtime, despite the output schema defining fields for SHA1 and SHA512 as well.

Where It Lives (Original)

Function Role
FileId + 8 nested hash sub-functions File content hashing (MD5, SHA1, SHA256, SHA512) via [System.Security.Cryptography.*]::Create() + [System.IO.File]::OpenRead() stream hashing; name string hashing via same algorithms on UTF-8 byte encoding
DirectoryId + 8 nested hash sub-functions Directory name hashing using the two-layer hash(hash(name) + hash(parentName)) scheme
ReadMetaFile-GetNameHashMD5 / ReadMetaFile-GetNameHashSHA256 Yet more copies of string hashing logic, duplicated inside ReadMetaFile
MetaFileRead-Sha256-File / MetaFileRead-Sha256-String Even more copies inside MetaFileRead itself
Null-hash constants Hardcoded per-algorithm empty-string hash values for edge cases

Python Modules

Module Usage
hashlib All hash computation: hashlib.md5(), hashlib.sha1(), hashlib.sha256(), hashlib.sha512() — supports both file stream hashing (via .update() in chunks) and string hashing (via .update(s.encode('utf-8')))
hashlib.file_digest() Python 3.11+ convenience function for hashing file contents directly from a file object; ideal for the file content hashing path

Improvement Notes

Critical code duplication problem. The original has no fewer than four separate locations where hashing logic is independently implemented: FileId, DirectoryId, ReadMetaFile sub-functions, and MetaFileRead sub-functions. Each reimplements the same Create() → ComputeHash() → ToString() → replace('-','') pattern. The Python port should provide exactly one hashing utility module exposing functions like hash_file(path, algorithm) -> str and hash_string(value, algorithm) -> str, called from everywhere.

Expand runtime hash coverage. The original only computes MD5 and SHA256 at runtime, but the output schema defines SHA1 and SHA512 fields. Since hashlib can compute all four algorithms in a single file read pass (feeding the same byte chunks to four hash objects simultaneously), the Python port should compute all four by default with near-zero marginal cost. This fills the previously-empty schema fields and enables downstream consumers to select their preferred algorithm.

Chunked file reading. The original opens the entire file stream and calls ComputeHash() on it in one pass. For very large files this is fine at the .NET level (it streams internally), but the Python port should be explicit about chunked reads (e.g., 8 KB or 64 KB chunks fed to hashlib.update()) to keep memory usage bounded and to enable multi-algorithm hashing in a single pass.

The x/y prefix convention for directory vs. file IDs is a design choice carried forward from the original. It should be preserved for backward compatibility with the output schema.

Null-hash constants should not be hardcoded. Instead, the Python port can compute them once at module load time: hashlib.md5(b'').hexdigest().upper(), etc. This is self-documenting and eliminates the risk of copy-paste errors in long hex strings.


What It Does

Determines whether a file or directory is a symbolic link (reparse point). When a file is a symlink, the identity system falls back to hashing the file's name string rather than its content (because the link target may not be accessible). The IsLink boolean is included in the output schema.

Where It Lives (Original)

Function Role
FileId Checks (Get-Item).Attributes -band [System.IO.FileAttributes]::ReparsePoint
DirectoryId Same reparse point check for directories
ValidateIsLink Listed as a dependency but never directly called; FileId and DirectoryId perform the check inline
MakeObject Reads .IsLink from the FileId/DirectoryId return object; skips encoding detection and exiftool for symlinks

Python Modules

Module Usage
pathlib.Path.is_symlink() Single cross-platform call; returns True for both file and directory symlinks
os.path.islink() Alternative for string-path interfaces

Improvement Notes

The original checks for the ReparsePoint attribute, which is a Windows-specific concept that covers symlinks but also covers junction points and other reparse point types. pathlib.Path.is_symlink() is the correct cross-platform equivalent. On Windows it still detects reparse points; on Linux/macOS it detects POSIX symlinks. This is a strict improvement in portability.

The ValidateIsLink function listed in the original docstring but never called is dead code. It should not be carried forward.


5. Filesystem Timestamps & Date Conversion

What It Does

Reads filesystem timestamps (created, modified, accessed) from file/directory stat data. Formats them as ISO 8601 strings with timezone offset (format: yyyy-MM-ddTHH:mm:ss.fffffffzzz). Converts those formatted strings to Unix timestamps (milliseconds since epoch) via the external Date2UnixTime function.

Where It Lives (Original)

Function Role
MakeObject Reads .LastAccessTime, .CreationTime, .LastWriteTime from Get-Item; formats via .ToString($DateFormat)
Date2UnixTime (external pslib) Parses formatted date strings back into [DateTimeOffset] objects and calls .ToUnixTimeMilliseconds()

Python Modules

Module Usage
datetime.datetime Timestamp formatting via .isoformat() or .strftime()
datetime.datetime.timestamp() Direct conversion to Unix epoch seconds (multiply by 1000 for milliseconds)
os.stat_result / pathlib.Path.stat() Reading st_mtime, st_atime, st_ctime (or st_birthtime on macOS)

Improvement Notes

The original performs an unnecessary round-trip. It formats a datetime to a string, then passes that string to Date2UnixTime which parses it back into a datetime object just to call .ToUnixTimeMilliseconds(). The Python port should extract the Unix timestamp directly from the stat result's float value: int(stat_result.st_mtime * 1000). The formatted ISO string can be produced separately from the same source datetime. No round-trip parsing needed.

Creation time portability. Windows provides a true creation time (CreationTime). Linux typically does not expose birth time in os.stat() unless the filesystem and kernel support st_birthtime (available on some systems via os.stat_result.st_birthtime). The port should attempt st_birthtime and fall back to st_ctime (metadata change time on Linux) with a documented caveat. This is a platform reality, not a bug.

The date format string yyyy-MM-ddTHH:mm:ss.fffffffzzz uses .NET formatting tokens. The Python equivalent is %Y-%m-%dT%H:%M:%S.%f%z, noting that Python's %f gives microseconds (6 digits) rather than .NET's 7-digit fractional seconds. For backward compatibility, the port should zero-pad to 7 digits if exact schema match is required, or accept the 6-digit microsecond precision as a minor, acceptable deviation.


6. EXIF / Embedded Metadata Extraction

What It Does

Invokes the exiftool binary against individual files to extract embedded EXIF/XMP/IPTC metadata. The exiftool arguments are Base64-encoded in the source and decoded at runtime, written to a temporary argument file, and passed to exiftool via its -@ (argfile) switch. The raw JSON output from exiftool is piped through jq to compact it and strip unwanted system keys (ExifToolVersion, FileName, FilePath, Directory, FilePermissions, etc.). Certain file types (.csv, .htm, .html, .json, .tsv, .xml) are excluded because exiftool tends to dump their entire content into the metadata output.

Where It Lives (Original)

Function Role
GetFileExif Orchestrator: decodes args, manages temp file lifecycle, calls runner
GetFileExifArgsWrite Decodes Base64 argument strings via Base64DecodeString, writes them to a temp file via Add-Content
GetFileExifRun Executes exiftool -@ $ArgsFile and pipes output through jq -c '.[] \| .' then a second jq pass to delete unwanted keys
Base64DecodeString (external pslib) URL-decodes and Base64-decodes the argument strings with UTF-8 encoding support
TempOpen / TempClose (external pslib) Creates and deletes the temporary argument file

Python Modules

Module Usage
subprocess.run() Invoking exiftool with arguments passed directly (no temp file needed)
json Parsing exiftool's -json output directly (replaces jq entirely)
PyExifTool (third-party, optional) A Python wrapper around exiftool that manages a persistent exiftool process for batch operations; significantly faster for large directory trees

Improvement Notes

Eliminate jq dependency entirely. The original pipes exiftool output through jq for two purposes: JSON compaction and key deletion. Python's json.loads() handles the parsing natively, and unwanted keys can be removed with a simple dict comprehension: {k: v for k, v in data.items() if k not in EXCLUDED_KEYS}. This eliminates a binary dependency with zero functionality loss.

Eliminate the Base64 argument encoding scheme. The original stores exiftool arguments as Base64-encoded strings and decodes them at runtime via Base64DecodeString (which itself has a complex OpsCode-based branching pattern and calls certutil on Windows). This appears to have been a mechanism for safely embedding complex argument strings in the PowerShell source. In Python, we can simply define the arguments as a list of strings and pass them directly to subprocess.run(). This eliminates the entire Base64DecodeStringTempOpen → write-args → TempClose pipeline.

Eliminate the temporary argument file. The original writes decoded arguments to a temp .txt file and passes it to exiftool via -@. The Python port should pass arguments directly via subprocess.run(['exiftool', ...args, filepath]). If argument lists are very long, subprocess handles them correctly on all platforms.

Consider PyExifTool for batch mode. When indexing large directory trees, the original invokes exiftool once per file (a separate process spawn each time). PyExifTool keeps a single exiftool process alive and communicates with it via stdin/stdout, which is dramatically faster for batch operations. The port should support both modes: direct subprocess invocation for single-file use, and PyExifTool batch mode for directory traversal.

The extension exclusion list (.csv, .htm, .html, .json, .tsv, .xml) should be externalized into configuration (see Category 11) rather than hardcoded, so users can customize it.


7. Sidecar Metadata File Handling

What It Does

Discovers, reads, parses, and optionally merges external metadata "sidecar" files that live alongside the files they describe. Sidecar files are identified by regex patterns matching known suffixes (defined in $global:MetadataFileParser). Each sidecar file undergoes type detection (matching against the Identify configuration), format-specific reading (JSON, plain text, binary/Base64, subtitles, hash files, URL/LNK shortcuts), and construction of a metadata entry object with source attribution, type, name, name hashes, and data payload. When MetaMerge is active, sidecar metadata is folded into the parent item's Metadata array. When MetaMergeDelete is active, merged sidecar files are queued for deletion after processing.

Where It Lives (Original)

Function Role
GetFileMetaSiblings Scans the parent directory for files matching the target basename + known metadata suffixes; uses regex with [System.Text.RegularExpressions.Regex]::Escape() for basename escaping
ReadMetaFile Wrapper: calls MetaFileRead, adds name hashes, constructs standardized metadata entry objects
MetaFileRead (external pslib) The actual parser: type detection, format-specific readers (ReadJson, ReadText, ReadBinary, ReadText-Hash, ReadText-Subtitles, ReadLink), parent file resolution, SHA256 hashing
ReadMetaFile-GetNameHashMD5 / -SHA256 Hash the sidecar filename for the NameHashes field
$global:MetadataFileParser Configuration object defining suffix patterns, exclusion patterns, and type identification rules
$global:DeleteQueue Runtime accumulator for sidecar file paths to delete when MetaMergeDelete is active

Python Modules

Module Usage
re Regex pattern matching for sidecar file identification and type detection
json Reading JSON-format sidecar files (replaces jq -c '.' piped through ConvertFrom-Json)
pathlib Directory scanning for sibling files, basename extraction, suffix matching
hashlib Name hashing for sidecar files (shared with Category 3)
base64 Reading binary sidecar files as Base64-encoded data (replaces certutil -encode)

Improvement Notes

Eliminate certutil dependency. The original uses certutil -encode to convert binary sidecar file data to Base64 strings. Python's base64.b64encode() does this natively and portably.

Simplify type detection. The original iterates through all keys in $MetadataFileParser.Identify and matches each file against regex pattern arrays. This is fine algorithmically but the Python port should express this as a clean mapping structure (a dict of {type_name: [compiled_regex_patterns]}) rather than the deeply nested ordered hashtable structure of the original.

The DeleteQueue pattern (accumulate paths during traversal, delete after completion) is sound and should be preserved. In Python this is simply a list[Path] built up during traversal and iterated at the end with Path.unlink().


8. Output Object Construction & Schema

What It Does

Assembles a structured data object (the "index entry") for every file and directory processed. Each entry contains identity fields (_id, Ids, NameHashes, ContentHashes), filesystem metadata (Name, BaseName, Extension, StorageName, Size, IsDirectory, IsLink), relationship fields (ParentId, ParentIds, ParentName, ParentNameHashes), timestamps (TimeAccessed/Created/Modified, UnixTimeAccessed/Created/Modified), child items (Items array for directories), and extracted metadata (Metadata array).

The original output schema also includes an Encoding key (a complex object describing file encoding properties from BOM detection). This key is being intentionally dropped from the ported software.

Where It Lives (Original)

Function Role
MakeObject The core builder. Contains a massive switch on $ObjectType (makeobjectfile, makeobjectdirectory, makeobjectdirectoryrecursive) that constructs [PSCustomObject]@{...} with all schema fields. Repeated near-identically across 5+ switch branches.
VariableStringify Null-safe string conversion used before inserting values into the output object
GetFileEncoding BOM byte inspection for the Encoding field (being dropped)

Python Modules

Module Usage
dataclasses Define the output schema as @dataclass classes with type annotations; provides asdict() for JSON serialization. Strongly preferred for a project of this scope.
pydantic (third-party, optional) Alternative to dataclasses with built-in validation, JSON schema generation, and serialization. More powerful but heavier dependency. Worth considering if schema validation against legacy consumers is important.
typing Type annotations for nullable fields, union types, and recursive structures (Items referencing the same schema)

Improvement Notes

Eliminate the ObjectType switch duplication. The original MakeObject constructs the output object inside a switch statement with 5+ branches (makeobjectfile, makeobjectdirectory, makeobjectdirectoryrecursive, plus default branches). The actual fields are nearly identical across all branches — the only differences are: directories get Items = @() while files do not, and recursive directories get Items = @() for later population. The Python port should define one IndexEntry dataclass and conditionally populate Items and Metadata based on item type. One class, one construction path.

Drop the Encoding key. As specified. The GetFileEncoding sub-function (BOM byte inspection) and all $IEncoding variable assignments are omitted from the port. For backward compatibility, the output schema can include "Encoding": null for all items if legacy consumers expect the field to exist. Alternatively, omit it entirely and let legacy consumers handle the missing key. This is a project decision to be made when we address backward compatibility testing.

Typed schema definition. The original [PSCustomObject]@{...} has no compile-time type checking. Using Python dataclasses gives us type annotations, IDE support, and dataclasses.asdict() for clean JSON serialization. A rough sketch of the core structure:

IndexEntry:
    _id: str
    Ids: HashIds            # {MD5: str, SHA256: str, ...}
    Name: str
    NameHashes: HashIds | None
    ContentHashes: HashIds | None
    Extension: str | None
    BaseName: str
    StorageName: str
    Size: int
    IsDirectory: bool
    IsLink: bool
    ParentId: str | None
    ParentIds: HashIds | None
    ParentName: str | None
    ParentNameHashes: HashIds | None
    UnixTimeAccessed: int
    UnixTimeCreated: int
    UnixTimeModified: int
    TimeAccessed: str
    TimeCreated: str
    TimeModified: str
    Items: list[IndexEntry] | None   # Recursive reference; None for files
    Metadata: list[MetadataEntry] | None

This sketch is illustrative. The actual implementation should be derived from the output schema JSON with adjustments for the dropped Encoding key and expanded hash fields.


9. JSON Serialization & Output Routing

What It Does

Converts the assembled index tree into JSON format and routes it to one or more output destinations. The original supports 7 distinct output scenarios combining three flags: StandardOutput (write to stdout), OutFile (write to a single aggregate file), and OutFileInPlace (write individual _meta.json sidecar files alongside each processed item). The ConvertTo-Json -Depth 100 cmdlet is used for serialization, with a documented known issue: extremely large output trees can cause out-of-memory errors.

Where It Lives (Original)

Function Role
MakeIndex (top-level output logic) The 7-scenario routing switch at the end of the function; calls ConvertTo-Json -Depth 100 and Out-File -Encoding UTF8 or writes to stdout
MakeDirectoryIndexLogic / MakeDirectoryIndexRecursiveLogic In-place sidecar writing via ConvertTo-Json -Depth 100 \| Set-Content -LiteralPath $FileMetaPath

Python Modules

Module Usage
json Standard library JSON serialization via json.dumps() with indent and ensure_ascii=False for UTF-8 output
orjson (third-party, optional) Significantly faster JSON serialization for large trees; outputs bytes directly; handles dataclasses natively
sys.stdout Standard output routing
pathlib.Path.write_text() Writing JSON to output files and in-place sidecar files

Improvement Notes

The ConvertTo-Json memory problem does not exist in Python. Python's json.dumps() handles arbitrarily large nested structures without the memory ceiling that plagues PowerShell's ConvertTo-Json. If performance is a concern for very large trees (hundreds of thousands of items), orjson is a drop-in replacement that serializes 5-10x faster and produces bytes directly. The note in the original docstring about "good luck, take it up with Microsoft" can be retired.

Simplify the output routing model. The original's 7-scenario matrix is confusing. The Python port should express this as three independent boolean flags that compose naturally: --stdout, --outfile PATH, --inplace. Any combination is valid. The routing logic becomes a simple loop over enabled destinations after the index tree is built (or during traversal for in-place writes).

Streaming in-place writes. For in-place mode, the original writes each sidecar file as it processes items within the traversal loop. This is correct and should be preserved — it means partial results are available even if the process is interrupted. The aggregate output file and stdout writes happen after traversal completes, which is also correct.


10. File Rename & In-Place Write Operations

What It Does

When the Rename flag is active, processed files are renamed from their original name to their hash-based StorageName (format: <_id>.<extension>). The original file is destroyed and replaced with the renamed version. A sidecar _meta.json file is written alongside each renamed file. Directory items get _directorymeta.json sidecar files. The Rename flag implies OutFileInPlace.

Where It Lives (Original)

Function Role
MakeDirectoryIndexLogic / MakeDirectoryIndexRecursiveLogic Contains the rename-or-not branching logic using Move-Item -LiteralPath -Destination -Force for renames and Set-Content -LiteralPath -Force for sidecar writes
MakeIndex (parameter validation) Forces OutFileInPlace = $true when Rename = $true (safety measure)

Python Modules

Module Usage
pathlib.Path.rename() Atomic file rename (same filesystem)
shutil.move() Cross-filesystem move if needed (fallback)
pathlib.Path.write_text() Writing sidecar _meta.json files

Improvement Notes

The rename operation is destructive and irreversible in the original. The original docstring includes a "To-Do" note about adding a Revert parameter. The Python port should consider implementing revert capability from the start, since the in-place sidecar files contain the original filename in the Name field and can serve as the revert manifest.

Safety: MetaMergeDelete guard. The original has a $MMDSafe variable that prevents MetaMergeDelete from activating unless an output mechanism (OutFile or OutFileInPlace) is specified, protecting against accidental metadata file deletion when no output is being captured. This safety logic should be preserved.


11. Configuration Management

What It Does

Loads and provides access to the parser configuration that governs metadata file behavior: which file suffixes are recognized as sidecar files, which types they map to, which file extensions are excluded from exiftool processing, and the regex patterns used for identification. The original stores this in a large [ordered]@{} hashtable ($global:MetadataFileParser) defined at the script level and promoted to global scope for access by deeply nested sub-functions.

Where It Lives (Original)

Function Role
$global:MetadataFileParser (script-level variable, line ~16977) The master configuration object containing .Exiftool.Exclude, .Indexer.Include/Exclude/IncludeString/ExcludeString, .Identify.<Key> sub-objects
MakeIndex (parameter block) Copies sub-properties from $MetadataFileParser into global variables ($global:ExiftoolRejectList, $global:MetaSuffixInclude, etc.) at function start; cleans them up via Remove-Variable at function end

Python Modules

Module Usage
dataclasses Define configuration as typed @dataclass objects that can be validated at load time
tomllib (Python 3.11+ stdlib) Load configuration from a TOML file; human-readable, well-suited for this kind of structured config
pyyaml (third-party, optional) Alternative config format if TOML is insufficient
json Alternative config format; the least human-friendly but the most schema-compatible

Improvement Notes

Eliminate global variable promotion entirely. The original promotes configuration values to $global: scope because PowerShell's nested function scoping makes it difficult to pass data into deeply nested sub-functions cleanly. Python has no such limitation. The configuration object should be instantiated once and passed through the call chain via function parameters, or held on a class instance if using an OOP architecture. No global state needed.

Externalize the configuration to a file. The original hardcodes the configuration in the script source. The Python port should load it from an external file (TOML recommended) that ships alongside the tool as a default but can be overridden by the user. This makes the extension exclusion lists, sidecar suffix patterns, and type identification rules user-customizable without modifying source code.

Provide sensible defaults. The port should include a built-in default configuration that matches the original's behavior, so the tool works out of the box without requiring a config file. The external config file should only be needed for customization.


12. Logging & Verbosity

What It Does

Provides structured log output with severity levels (info, debug, warning, error, critical, success), caller identification via a colon-delimited call stack string, session IDs, colorized console output, and persistent log file writing. The Vbs function is the single most widely-called function in the entire pslib library — virtually every other function routes its output through it.

Where It Lives (Original)

Function Role
Vbs (external pslib) The terminal logging endpoint. Accepts Caller, Status, Message, Verbosity, LogDir, LibName, VbsSessionID. Writes to log files and optionally to colorized console output.
UpdateFunctionStack Maintains the colon-delimited call-stack string (e.g., "MakeIndex:MakeObject:GetFileExif") for the Caller parameter
VbsFormatter (inside MetaFileRead) Wrapper that prepends a progress string to messages before passing them to Vbs
$LibSessionID / $D_PSLIB_LOGS (global variables) Session identifier (GUID) and log directory path

Python Modules

Module Usage
logging Python's standard logging framework. Supports named loggers (replacing the manual call-stack string), severity levels, file handlers, console handlers with formatting, and session-scoped context via LogRecord attributes or logging.LoggerAdapter.
rich (third-party, optional) Colorized console output, progress bars, and structured logging. If the port wants to replicate the colorized console output of Vbs, rich.logging.RichHandler is an excellent drop-in handler.

Improvement Notes

Eliminate UpdateFunctionStack entirely. The original manually builds a colon-delimited string ("MakeIndex:MakeObject:GetFileExif") and passes it through every function call. Python's logging module automatically captures the call location via %(funcName)s, %(module)s, and %(pathname)s format tokens. For hierarchical logger names, the port should use Python's dotted-name logger convention (e.g., logging.getLogger('indexer.make_object.get_file_exif')). This gives the same traceability with zero manual bookkeeping.

Replace the Verbosity boolean with standard log levels. The original has a binary verbosity toggle ($true / $false) that gates console output. Python's logging already supports DEBUG, INFO, WARNING, ERROR, CRITICAL levels, controlled by configuring the handler's level. The CLI can expose --verbose / --debug / --quiet flags that map to log levels.

Session IDs are useful for correlating log entries across a single run. The Python port should generate one via uuid.uuid4().hex at startup and inject it into all log records using a logging.Filter or LoggerAdapter.


13. Progress Reporting

What It Does

Tracks and reports processing progress during directory traversal: counts total items, computes percentage complete, measures elapsed time, and formats progress strings for log messages.

Where It Lives (Original)

Function Role
MakeDirectoryIndexLogic / MakeDirectoryIndexRecursiveLogic Maintains counters ($ChildrenCountProcessed, $ChildrenCountTotal), computes [math]::Round() percentage, formats progress strings like "[42/100 (42%)]"
MakeIndex (top-level) Captures $TimeStart at the beginning, computes elapsed time at end using (Get-Date) - $TimeStart formatted as H:M:S.ms

Python Modules

Module Usage
tqdm (third-party) Progress bars for iterables. Clean, minimal, widely used. Wrap the item iterator in tqdm(items) and progress reporting is automatic.
rich.progress (third-party, optional) More visually sophisticated progress bars with elapsed time, ETA, and transfer rate. Pairs naturally with rich logging (see Category 12).
time.perf_counter() High-resolution elapsed time measurement (replaces Get-Date arithmetic)

Improvement Notes

The original manually formats progress strings and injects them into log messages, creating tight coupling between progress tracking and logging. The Python port should separate these concerns: use a progress bar library (tqdm or rich) for user-facing progress display, and use the logging system for structured log output. They can coexist cleanly — tqdm even has a tqdm.write() method for printing messages without disrupting the progress bar, and rich integrates both natively.


14. CLI Argument Parsing & Entry Point

What It Does

Accepts user input specifying the target path, output mode, recursion behavior, metadata options, rename flag, ID type, and verbosity level. Validates input combinations (e.g., File and Directory are mutually exclusive; Rename implies OutFileInPlace). Routes execution to the appropriate traversal entry point based on the resolved target type (file, directory flat, directory recursive).

Where It Lives (Original)

Function Role
MakeIndex (Param block) Declares 14 parameters with aliases, types, defaults, and switch behaviors
MakeIndex (validation and routing logic) ~200 lines of input validation, mutual exclusion checks, output scenario determination, and TargetTyp (0/1/2) routing

Python Modules

Module Usage
click (third-party) Decorator-based CLI framework. Clean syntax for defining commands, options, arguments, and mutual exclusions. Recommended for new Python CLI tools.
argparse (stdlib) Standard library alternative. More verbose but zero dependencies.

Improvement Notes

The original's parameter validation contains a significant amount of defensive logic to handle conflicting flags (e.g., Recursive with File, StandardOutput with NoStandardOutput). click handles mutual exclusions declaratively, reducing the validation boilerplate. The original's 7 output scenarios can be expressed as three independent --stdout / --outfile / --inplace boolean flags with natural composition rules.

The IdType parameter (selecting MD5 vs. SHA256 as the _id field source) should be preserved for backward compatibility. However, since the port will compute all four hash algorithms (see Category 3), this becomes a presentation choice rather than a computation toggle.


15. Temporary File Management

What It Does

Creates and deletes temporary files used as intermediaries during exiftool argument passing and Base64 encoding operations. Temp files use UUID-based naming in a dedicated temp directory.

Where It Lives (Original)

Function Role
TempOpen (external pslib) Creates a temp file using UUID+timestamp naming in $D_PSLIB_TEMP; supports type suffixes
TempClose (external pslib) Deletes a temp file by path with error suppression
MetaFileRead-Temp-Open / MetaFileRead-Temp-Close Duplicated temp file logic inside MetaFileRead

Python Modules

Module Usage
tempfile tempfile.NamedTemporaryFile() or tempfile.mkstemp() for creating temp files with automatic cleanup. Context manager support ensures cleanup even on exceptions.

Improvement Notes

This category may be largely unnecessary in the Python port. The primary consumers of temp files in the original are (a) exiftool argument passing (eliminated by passing args directly to subprocess) and (b) certutil Base64 encoding (eliminated by using base64.b64encode()). If no operations remain that require temp files, this entire category can be dropped. If temp files are needed for any future operation, tempfile.NamedTemporaryFile(delete=True) with a context manager provides automatic cleanup that is strictly superior to the manual TempOpen/TempClose pattern.


Eliminated Dependencies Summary

The following original dependencies are not carried forward into the Python port because their functionality is absorbed by Python's standard library or rendered unnecessary by architectural improvements:

Original Dependency Reason for Elimination Replaced By
jq (binary) JSON parsing and filtering done natively json stdlib
certutil (binary) Base64 encoding done natively base64 stdlib
Base64DecodeString (pslib function) Exiftool args passed directly; no encoding round-trip needed Direct argument lists
TempOpen / TempClose (pslib functions) No temp files needed for arg passing or Base64 ops tempfile (if needed at all)
Date2UnixTime (pslib function) Timestamp conversion done directly from stat results datetime stdlib
Vbs (pslib function) Replaced by Python's logging framework logging stdlib
ValidateIsLink (pslib function) Never actually called in the original; dead code (removed)
GetFileEncoding (internal sub-function) Encoding key dropped from output schema (removed)
UpdateFunctionStack (internal sub-function) Manual call-stack tracking replaced by logging's built-in caller info logging stdlib
VariableStringify (internal sub-function) Python's native str() and None handling cover this Built-in str() / repr()

Cross-Reference: Output Schema Fields → Operation Categories

This table maps every field in the output schema to the operation category that produces it, confirming full coverage.

Schema Field Category Notes
_id 3 (Hashing) Selected from Ids based on IdType parameter
Ids 3 (Hashing) {MD5, SHA1, SHA256, SHA512} — expanded from original's MD5+SHA256 only
Name 2 (Path Manipulation) Path.name
NameHashes 3 (Hashing) Hash of the Name string
ContentHashes 3 (Hashing) Hash of file contents; null for directories
Extension 2 (Path Manipulation) Path.suffix with validation
BaseName 2 (Path Manipulation) Path.stem
StorageName 3 (Hashing) + 2 (Path) f"{_id}{extension}" for files; _id for directories
~~Encoding~~ ~~(Dropped)~~ Intentionally omitted from port
Size 1 (Traversal) Path.stat().st_size for files; sum of children for directories
IsDirectory 1 (Traversal) Path.is_dir()
IsLink 4 (Symlink) Path.is_symlink()
ParentId 3 (Hashing) Directory ID of parent
ParentIds 3 (Hashing) Hash IDs of parent directory
ParentName 2 (Path Manipulation) Path.parent.name
ParentNameHashes 3 (Hashing) Hash of parent directory name
UnixTimeAccessed 5 (Timestamps) int(stat.st_atime * 1000)
UnixTimeCreated 5 (Timestamps) int(stat.st_birthtime * 1000) or st_ctime fallback
UnixTimeModified 5 (Timestamps) int(stat.st_mtime * 1000)
TimeAccessed 5 (Timestamps) ISO 8601 formatted string
TimeCreated 5 (Timestamps) ISO 8601 formatted string
TimeModified 5 (Timestamps) ISO 8601 formatted string
Items 1 (Traversal) + 8 (Object Construction) Recursive child entries for directories; null for files
Metadata 6 (EXIF) + 7 (Sidecar) Array of metadata entries from exiftool and sidecar files

Minimum Viable Dependency Set

For a Python port that achieves full feature parity (minus the intentionally dropped Encoding key) with the original MakeIndex:

Required (Standard Library Only)

Module Categories Served
pathlib 1, 2, 4, 7, 9, 10
hashlib 3
datetime 5
json 6, 7, 8, 9
subprocess 6
re 7, 11
logging 12
os 1, 5
sys 9
tempfile 15 (if needed)
base64 7
dataclasses 8, 11
typing 8
tomllib 11 (Python 3.11+)
uuid 12
time 13

Required (External)

Module Categories Served Notes
exiftool (binary) 6 Must be in system PATH; the only external binary dependency
Module Categories Served Justification
click 14 Cleaner CLI definition than argparse; widely adopted
tqdm 13 Progress bars with minimal code
orjson 9 5-10x faster JSON serialization for large trees
PyExifTool 6 Persistent exiftool process for batch performance
rich 12, 13 Colorized logging + progress bars in one package; alternative to tqdm + custom log formatting
pydantic 8, 11 Schema validation and JSON schema generation; heavier alternative to dataclasses