meipi.indexing package
Submodules
meipi.indexing.config module
App configuration
In addition to the default configuration defined here, values can be overridden through a .env file or directly through environment variables. The .env file should be located in the app's root directory and be named "config.env".
The global instance appconf is created once when meipi.indexing is imported and is immutable
from then on. Changes to config.env, environment variables, or MEIPI_CONFIG_ENV take effect
only after a full restart of the Python process (re-run the CLI, restart the notebook kernel,
restart the Streamlit app).
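The create-once, frozen behaviour can be illustrated with a small standard-library sketch. It is not the package's implementation (Config is a pydantic-settings BaseSettings class); MiniConfig and load_once are hypothetical names, and only the IND_ prefix and two of the defaults are taken from this page.

```python
import os
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class MiniConfig:
    """Hypothetical stand-in for Config with two of its fields."""

    pg_host: str = "localhost"
    pg_port: str = "5432"


def load_once(environ=os.environ) -> MiniConfig:
    """Read IND_-prefixed variables a single time and freeze the result."""
    defaults = MiniConfig()
    return MiniConfig(
        pg_host=environ.get("IND_PG_HOST", defaults.pg_host),
        pg_port=environ.get("IND_PG_PORT", defaults.pg_port),
    )


# Created once, comparable to appconf at import time.
env = {"IND_PG_HOST": "db.example.org"}
conf = load_once(env)

# Changing the environment afterwards does not touch the existing instance.
env["IND_PG_HOST"] = "other-host"
print(conf.pg_host)  # db.example.org

# Assigning to a frozen instance raises an error.
try:
    conf.pg_host = "other-host"
except FrozenInstanceError:
    print("conf is immutable; restart the process to pick up new values")
```

Only a fresh call to load_once, which corresponds to a new process in the real package, sees the changed value.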
Example .env file:
IND_PG_HOST=localhost #PostgreSQL host
IND_PG_PORT=5432 #PostgreSQL port
IND_PG_USER=postgres #PostgreSQL username
IND_PG_DATABASE=postgres #PostgreSQL database name
IND_PG_API_KEY=pg-docker #API key name for the DB password in the keyring
IND_DATADIR=./data #data directory
IND_DOCROOT=/home/rslsync/folders/ #document root directory
IND_LOGGER_NAME=sqlalchemy.engine #name of the logger for SQLAlchemy
IND_DOCSUF='{
".pdf",
".txt",
".md",
".docx",
".doc",
".html",
".htm",
".epub",
".odt"
}' #permitted document file extensions
IND_PICSUF='{
".jpg",
".jpeg",
".bmp",
".png",
".heic",
".tiff",
".tif"
}' #permitted picture file extensions
IND_VIDSUF='{
".mov",
".vob",
".mkv",
".avi",
".mp4",
".mcf"
}' #permitted video file extensions
IND_LOGLEVEL=20 #log level (e.g. 10=DEBUG, 20=INFO, 30=WARNING, 40=ERROR, 50=CRITICAL)
The DB password is fetched from the keyring; the API key name can be configured through the environment variable IND_PG_API_KEY (default: "pg-docker").
To use an alternative env file, set it before starting:
export MEIPI_CONFIG_ENV=/path/to/config.env
meipi-index schema-info
- class meipi.indexing.config.Config(_case_sensitive: bool | None = None, _nested_model_default_partial_update: bool | None = None, _env_prefix: str | None = None, _env_prefix_target: EnvPrefixTarget | None = None, _env_file: DotenvType | None = PosixPath('.'), _env_file_encoding: str | None = None, _env_ignore_empty: bool | None = None, _env_nested_delimiter: str | None = None, _env_nested_max_split: int | None = None, _env_parse_none_str: str | None = None, _env_parse_enums: bool | None = None, _cli_prog_name: str | None = None, _cli_parse_args: bool | list[str] | tuple[str, ...] | None = None, _cli_settings_source: CliSettingsSource[Any] | None = None, _cli_parse_none_str: str | None = None, _cli_hide_none_type: bool | None = None, _cli_avoid_json: bool | None = None, _cli_enforce_required: bool | None = None, _cli_use_class_docs_for_groups: bool | None = None, _cli_exit_on_error: bool | None = None, _cli_prefix: str | None = None, _cli_flag_prefix_char: str | None = None, _cli_implicit_flags: bool | Literal['dual', 'toggle'] | None = None, _cli_ignore_unknown_args: bool | None = None, _cli_kebab_case: bool | Literal['all', 'no_enums'] | None = None, _cli_shortcuts: Mapping[str, str | list[str]] | None = None, _secrets_dir: PathType | None = None, _build_sources: tuple[tuple[PydanticBaseSettingsSource, ...], dict[str, Any]] | None = None, *, envfile: str = 'config.env', pg_host: str = 'localhost', pg_port: str = '5432', pg_user: str = 'postgres', pg_passwd: str = 'postgres', pg_database: str = 'postgres', pg_schema: str = 'public', pg_api_key: str = 'pg-docker', tika_noocr_url: str = 'http://localhost:9998', tika_ocrurl: str = 'http://localhost:9997', datadir: str = './data', docsuf: Set[str] = {'.doc', '.docx', '.epub', '.htm', '.html', '.md', '.odt', '.pdf', '.txt'}, picsuf: Set[str] = {'.bmp', '.heic', '.jpeg', '.jpg', '.png', '.tif', '.tiff'}, vidsuf: Set[str] = {'.avi', '.mcf', '.mkv', '.mov', '.mp4', '.vob'}, logger_name: str = 'sqlalchemy.engine', 
loglevel: int = 20)
Bases: BaseSettings
Holds the app configuration.
Instances are immutable (frozen). Create a new Config for a different env file only in tests or before the first import of meipi.indexing; at runtime, only appconf applies.
- db_passwd_from_keyring() -> str
Fetch the DB password from the keyring, or fall back to the default value.
- get_ftype(suf: str) -> str
Returns the configured file type based on the file suffix.
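A minimal sketch of such a suffix lookup, using the default suffix sets shown in the Config signature. The returned labels ("doc", "pic", "vid", "other"), the lower-casing of the suffix, and the name get_ftype_sketch are assumptions for illustration; the values the real method returns are not documented on this page.

```python
DOCSUF = {".pdf", ".txt", ".md", ".docx", ".doc", ".html", ".htm", ".epub", ".odt"}
PICSUF = {".jpg", ".jpeg", ".bmp", ".png", ".heic", ".tiff", ".tif"}
VIDSUF = {".mov", ".vob", ".mkv", ".avi", ".mp4", ".mcf"}


def get_ftype_sketch(suf: str) -> str:
    """Map a file suffix to a type label (the labels are assumptions)."""
    suf = suf.lower()  # assumed: matching ignores case
    if suf in DOCSUF:
        return "doc"
    if suf in PICSUF:
        return "pic"
    if suf in VIDSUF:
        return "vid"
    return "other"


print(get_ftype_sketch(".PDF"))   # doc
print(get_ftype_sketch(".heic"))  # pic
print(get_ftype_sketch(".zip"))   # other
```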
- classmethod load(envfile: str = 'config.env') -> Config
Load settings from envfile (used at package import, not for hot reload).
- property logger
- model_config = {'arbitrary_types_allowed': True, 'case_sensitive': False, 'cli_avoid_json': False, 'cli_enforce_required': False, 'cli_exit_on_error': True, 'cli_flag_prefix_char': '-', 'cli_hide_none_type': False, 'cli_ignore_unknown_args': False, 'cli_implicit_flags': False, 'cli_kebab_case': False, 'cli_parse_args': None, 'cli_parse_none_str': None, 'cli_prefix': '', 'cli_prog_name': None, 'cli_shortcuts': None, 'cli_use_class_docs_for_groups': False, 'enable_decoding': True, 'env_file': 'config.env', 'env_file_encoding': 'utf-8', 'env_ignore_empty': False, 'env_nested_delimiter': None, 'env_nested_max_split': None, 'env_parse_enums': None, 'env_parse_none_str': None, 'env_prefix': 'IND_', 'env_prefix_target': 'variable', 'extra': 'forbid', 'frozen': True, 'json_file': None, 'json_file_encoding': None, 'nested_model_default_partial_update': False, 'protected_namespaces': ('model_validate', 'model_dump', 'settings_customise_sources'), 'secrets_dir': None, 'toml_file': None, 'validate_default': True, 'yaml_config_section': None, 'yaml_file': None, 'yaml_file_encoding': None}¶
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- class meipi.indexing.config.FTYPE
Bases: object
- meipi.indexing.config.reload_appconf() -> None
Configuration cannot be reloaded in-process.
meipi.indexing.model module
PostgreSQL database model for pictures, documents and their metadata. SQLAlchemy ORM is used to define and manage the database tables. The models comprise:
DBMeta: table for file metadata (including full-text content)
DBDoc: optional row per document file (e.g. for embedding chunks)
DBPic: table for pictures with thumbnail and perceptual hash
DBDinoV2Vector: table for DINO V2 image vectors
The mixins DBMetaMixin, DocVectorMixin and PicVectorMixin
provide shared fields and methods for the respective models.
The models contain methods for creating and dropping tables, for running full-text searches, and for computing perceptual hashes.
- class meipi.indexing.model.Base(**kwargs: Any)
Bases: MappedAsDataclass, DeclarativeBase
Base class for SQLAlchemy models.
- as_dict()
Creates a dictionary without _sa_instance_state.
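The idea behind as_dict can be sketched without a database. Row is a hypothetical stand-in for a mapped instance, and the dictionary comprehension is one plausible way to drop SQLAlchemy's bookkeeping attribute, not necessarily how Base implements it.

```python
class Row:
    """Hypothetical stand-in for a mapped ORM instance."""

    def __init__(self, path: str, fsize: int) -> None:
        self.path = path
        self.fsize = fsize
        # SQLAlchemy attaches this bookkeeping attribute to mapped instances.
        self._sa_instance_state = object()

    def as_dict(self) -> dict:
        """Return the instance attributes without SQLAlchemy's internal state."""
        return {k: v for k, v in vars(self).items() if k != "_sa_instance_state"}


row = Row(path="docs/report.pdf", fsize=1024)
print(row.as_dict())  # {'path': 'docs/report.pdf', 'fsize': 1024}
```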
- classmethod create_table(session: Session) -> None
Create the table in the database.
- classmethod drop_table(session: Session) -> None
Drop the table from the database.
- class meipi.indexing.model.CatalogBase(**kwargs: Any)
Bases: DeclarativeBase
Declarative base for PostgreSQL catalog views (read-only, no dataclass ORM).
- class meipi.indexing.model.DBCatalog(**kwargs)
Bases: CatalogBase
Read-only ORM mapping of pg_catalog.pg_tables.
- class meipi.indexing.model.DBDinoV2Vector(pic_id, vector)
Bases: Base, PicVectorMixin
SQLAlchemy model for DINO V2 image embeddings stored in PostgreSQL.
- vector
- class meipi.indexing.model.DBDoc
Bases: Base
Optional document row for doc filemeta (e.g. embedding chunks).
Full-text content lives on DBMeta (inhalt / ts_content).
- class meipi.indexing.model.DBMeta(pool_id, path, fname, suffix, sort_date, fdate, fsize, clength, ctype, ftype, md_keys, meta_data, sha256=None, doc=None, pic=None, vid=None, *, inhalt='')
Bases: Base
SQLAlchemy model for metadata stored in PostgreSQL.
- class meipi.indexing.model.DBPic(xmp=None, truncated=None, thumbarray=None, phash=None)
Bases: Base
SQLAlchemy model for picture data stored in PostgreSQL.
In addition to the file metadata, it contains fields for XMP metadata, a thumbnail as a numpy array, and a perceptual hash. The actual image data is not stored in the database, only the metadata and the hash. The set_phash method computes the perceptual hash from the thumbnail, if one is present. The perceptual hash is stored as BYTEA to allow efficient storage and search. An index is created on the phash field to enable fast similarity searches.
- classmethod calc_phash(im: Image) -> bytes
- set_phash()
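Which hashing algorithm calc_phash uses is not stated on this page. As a rough illustration of what a perceptual hash stored as bytes looks like, the following sketch computes a simple average hash over a small grayscale grid and compares hashes by Hamming distance; average_hash and hamming are hypothetical helpers, and the pixel grids are made-up data.

```python
def average_hash(pixels: list[list[int]]) -> bytes:
    """Pack one bit per pixel: 1 if the pixel is brighter than the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    # Round the bit count up to whole bytes.
    return bits.to_bytes((len(flat) + 7) // 8, "big")


def hamming(a: bytes, b: bytes) -> int:
    """Count differing bits between two equally long hashes."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# 4x4 grayscale "thumbnails": a brightened copy and an inverted copy.
img = [[10, 10, 200, 200] for _ in range(4)]
brighter = [[p + 5 for p in row] for row in img]
inverted = [[255 - p for p in row] for row in img]

print(hamming(average_hash(img), average_hash(brighter)))  # 0
print(hamming(average_hash(img), average_hash(inverted)))  # 16
```

A small distance indicates visually similar images, which is what makes such a hash useful for similarity search.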
- class meipi.indexing.model.DBPool(pool, rootpath, description, id=None)
Bases: Base
SQLAlchemy model for data pools stored in PostgreSQL.
This table manages data pools, which can be defined as logical groups of files. A data pool could, for example, represent a particular application area or a category of files.
- class meipi.indexing.model.DBVid
Bases: Base
SQLAlchemy model for video data stored in PostgreSQL.
- class meipi.indexing.model.DocVectorMixin(chunk_id: int = <sqlalchemy.orm.properties.MappedColumn object>, doc_id: int = <sqlalchemy.orm.properties.MappedColumn object>, content: str = <sqlalchemy.orm.properties.MappedColumn object>)
Bases: MappedAsDataclass
Mixin for DocVectorTables
- doc = <_RelationshipDeclared at 0x7f53cf1cc5f0; no key>
- vector
- class meipi.indexing.model.PILArray(*args: Any, **kwargs: Any)
Bases: TypeDecorator
Type for PIL Image as numpy array
This allows thumbnails to be stored in the database as numpy arrays without first converting them to another format. The database type is BYTEA, because the numpy arrays are stored as binary data.
- cache_ok = True
Indicate if statements using this ExternalType are "safe to cache".
The default value None will emit a warning and then not allow caching of a statement which includes this type. Set to False to disable statements using this type from being cached at all without a warning. When set to True, the object's class and selected elements from its state will be used as part of the cache key. For example, using a TypeDecorator:
    class MyType(TypeDecorator):
        impl = String
        cache_ok = True

        def __init__(self, choices):
            self.choices = tuple(choices)
            self.internal_only = True
The cache key for the above type would be equivalent to:
    >>> MyType(["a", "b", "c"])._static_cache_key
    (<class '__main__.MyType'>, ('choices', ('a', 'b', 'c')))
The caching scheme will extract attributes from the type that correspond to the names of parameters in the __init__() method. Above, the "choices" attribute becomes part of the cache key but "internal_only" does not, because there is no parameter named "internal_only".
The requirements for cacheable elements are that they are hashable and also that they indicate the same SQL rendered for expressions using this type every time for a given cache value.
To accommodate datatypes that refer to unhashable structures such as dictionaries, sets and lists, these objects can be made "cacheable" by assigning hashable structures to the attributes whose names correspond with the names of the arguments. For example, a datatype which accepts a dictionary of lookup values may publish this as a sorted series of tuples. Given a previously un-cacheable type as:
    class LookupType(UserDefinedType):
        """a custom type that accepts a dictionary as a parameter.

        this is the non-cacheable version, as "self.lookup" is not
        hashable.

        """

        def __init__(self, lookup):
            self.lookup = lookup

        def get_col_spec(self, **kw):
            return "VARCHAR(255)"

        def bind_processor(self, dialect): ...  # works with "self.lookup" ...
Where "lookup" is a dictionary. The type will not be able to generate a cache key:
    >>> type_ = LookupType({"a": 10, "b": 20})
    >>> type_._static_cache_key
    <stdin>:1: SAWarning: UserDefinedType LookupType({'a': 10, 'b': 20}) will not
    produce a cache key because the ``cache_ok`` flag is not set to True.
    Set this flag to True if this type object's state is safe to use
    in a cache key, or False to disable this warning.
    symbol('no_cache')
If we did set up such a cache key, it wouldn't be usable. We would get a tuple structure that contains a dictionary inside of it, which cannot itself be used as a key in a "cache dictionary" such as SQLAlchemy's statement cache, since Python dictionaries aren't hashable:
    >>> # set cache_ok = True
    >>> type_.cache_ok = True
    >>> # this is the cache key it would generate
    >>> key = type_._static_cache_key
    >>> key
    (<class '__main__.LookupType'>, ('lookup', {'a': 10, 'b': 20}))
    >>> # however this key is not hashable, will fail when used with
    >>> # SQLAlchemy statement cache
    >>> some_cache = {key: "some sql value"}
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: unhashable type: 'dict'
The type may be made cacheable by assigning a sorted tuple of tuples to the ".lookup" attribute:
    class LookupType(UserDefinedType):
        """a custom type that accepts a dictionary as a parameter.

        The dictionary is stored both as itself in a private variable,
        and published in a public variable as a sorted tuple of tuples,
        which is hashable and will also return the same value for any
        two equivalent dictionaries. Note it assumes the keys and
        values of the dictionary are themselves hashable.

        """

        cache_ok = True

        def __init__(self, lookup):
            self._lookup = lookup

            # assume keys/values of "lookup" are hashable; otherwise
            # they would also need to be converted in some way here
            self.lookup = tuple((key, lookup[key]) for key in sorted(lookup))

        def get_col_spec(self, **kw):
            return "VARCHAR(255)"

        def bind_processor(self, dialect): ...  # works with "self._lookup" ...
Where above, the cache key for LookupType({"a": 10, "b": 20}) will be:
    >>> LookupType({"a": 10, "b": 20})._static_cache_key
    (<class '__main__.LookupType'>, ('lookup', (('a', 10), ('b', 20))))
Added in version 1.4.14: added the cache_ok flag to allow some configurability of caching for TypeDecorator classes.
Added in version 1.4.28: added the ExternalType mixin which generalizes the cache_ok flag to both the TypeDecorator and UserDefinedType classes.
- coerce_compared_value(op, value)
Suggest a type for a 'coerced' Python value in an expression.
By default, returns self. This method is called by the expression system when an object using this type is on the left or right side of an expression against a plain Python object which does not yet have a SQLAlchemy type assigned:
    expr = table.c.somecolumn + 35
Where above, if somecolumn uses this type, this method will be called with the value operator.add and 35. The return value is whatever SQLAlchemy type should be used for 35 for this particular operation.
- process_bind_param(value: ndarray | None, dialect)
Receive a bound parameter value to be converted.
Custom subclasses of TypeDecorator should override this method to provide custom behaviors for incoming data values. This method is called at statement execution time and is passed the literal Python data value which is to be associated with a bound parameter in the statement.
The operation could be anything desired to perform custom behavior, such as transforming or serializing data. This could also be used as a hook for validating logic.
Parameters:
value – Data to operate upon, of any type expected by this method in the subclass. Can be None.
dialect – the Dialect in use.
- process_literal_param(value, dialect)
Receive a literal parameter value to be rendered inline within a statement.
Note
This method is called during the SQL compilation phase of a statement, when rendering a SQL string. Unlike other SQL compilation methods, it is passed a specific Python value to be rendered as a string. However it should not be confused with the TypeDecorator.process_bind_param() method, which is the more typical method that processes the actual value passed to a particular parameter at statement execution time.
Custom subclasses of TypeDecorator should override this method to provide custom behaviors for incoming data values that are in the special case of being rendered as literals.
The returned string will be rendered into the output string.
- process_result_value(value, dialect)
Receive a result-row column value to be converted.
Custom subclasses of TypeDecorator should override this method to provide custom behaviors for data values being received in result rows coming from the database. This method is called at result fetching time and is passed the literal Python data value that's extracted from a database result row.
The operation could be anything desired to perform custom behavior, such as transforming or deserializing data.
Parameters:
value – Data to operate upon, of any type expected by this method in the subclass. Can be None.
dialect – the Dialect in use.
- property python_type: type[ndarray]
Return the Python type object expected to be returned by instances of this type, if known.
Basically, for those types which enforce a return type, or are known across the board to do such for all common DBAPIs (like int for example), will return that type.
If a return type is not defined, raises NotImplementedError.
Note that any type also accommodates NULL in SQL which means you can also get back None from any type in practice.
- class meipi.indexing.model.PicVectorMixin(pic_id: int = <sqlalchemy.orm.properties.MappedColumn object>)
Bases: MappedAsDataclass
Mixin for PicVectorTables
It contains a special field vector that stores the vectors produced by an embedder model. The size of the vector is defined by the class that uses this mixin. A foreign key pic_id is defined that references the picture table.
- vector
meipi.indexing.operations module
- class meipi.indexing.operations.AsyncFileOperations(pool: DBPool, config: Config = Config(envfile='config.env', pg_host='localhost', pg_port='5432', pg_user='postgres', pg_passwd='postgres', pg_database='postgres', pg_schema='public', pg_api_key='pg-docker', tika_noocr_url='http://localhost:9998', tika_ocrurl='http://localhost:9997', datadir='./data', docsuf={'.epub', '.doc', '.htm', '.md', '.html', '.odt', '.docx', '.txt', '.pdf'}, picsuf={'.png', '.heic', '.tiff', '.jpeg', '.tif', '.jpg', '.bmp'}, vidsuf={'.mcf', '.mkv', '.mp4', '.mov', '.avi', '.vob'}, logger_name='sqlalchemy.engine', loglevel=20), skip_ocr: bool = True, timeout: float = 30, compress=True)
Bases: AsyncTikaClient
Async file parsing helpers built on top of tika_client.
- DBPic_from_DBMeta(dbmeta: DBMeta) -> DBPic
Build a DBPic row from image metadata and XMP tags.
- dir_tree(rel_path: str) -> Generator[str]
Recursively walks all files in the directory tree, extracts metadata and content, and creates DB objects.
- async file_to_db(rel_path: str) -> DBMeta | None
Create DBMeta and optional typed child rows for a file path.
- class meipi.indexing.operations.DBOperations(pool_id: int | None = None, pool: DBPool | None = None, *, allow_no_pool: bool = False, enginekwargs: dict | None = None, sessionkwargs: dict | None = None)
Bases: object
High-level PostgreSQL operations for one configured data pool.
- clear_pool()
Deletes all data from the given pool.
- create_pool(pool: DBPool)
Insert a new data pool and make it active for this instance.
- create_tables(entities: Sequence[type[Base]] | None = None)
Creates the tables in the database if they do not exist yet.
- get_pool(pool_id: int) -> DBPool
Get the pool with the given id.
- async insert_docs_from_meta(skipocr: bool = True)
Re-extract text for filemeta rows in the pool that have empty inhalt.
For doc filemeta, ensures a DBDoc row exists for embedding FKs.
Parameters:
skipocr (bool) – If True, OCR processing is skipped to save resources.
- insert_pics_from_meta()
Reads the metadata of all pictures in the given pool, creates the corresponding DBPic objects and adds them to the DB.
- recreate_tables(entities: Sequence[type[Base]] | None = None)
Recreate tables in the database.
- schema_info() -> dict[str, Any]
Get the schema information of the database.
- update_thumbs(thumblist: List[Tuple[ndarray, int]])
Update the thumbnails and perceptual hashes for the given list of pictures.
- update_thumbs_no_heic() -> List[int]
Update the thumbnails and perceptual hashes for the pictures in the pool that are not HEICs.
- update_thumbs_no_thumb() -> List[int]
Update the thumbnails and perceptual hashes for the pictures in the pool that have no thumbnail.
meipi.indexing.search module
PostgreSQL full-text search for indexed documents.
- class meipi.indexing.search.DocSearchHit(meta_id: int, path: str, fname: str, suffix: str, rank: float, snippet: str)
Bases: object
One filemeta row matching a full-text query.
- meipi.indexing.search.search_documents(session: Session, *, pool_id: int, query: str, lang: str = 'german', limit: int = 50, mode: Literal['plain', 'websearch', 'phrase'] = 'websearch') -> list[DocSearchHit]
Search file bodies and metadata with PostgreSQL full-text matching.
Matches rows where the query hits extracted content (ts_content / inhalt) or metadata (filename, path, content type, and Tika meta_data JSON).
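The three mode values plausibly correspond to PostgreSQL's plainto_tsquery, websearch_to_tsquery and phraseto_tsquery functions; that mapping is an assumption, as are the table name filemeta, the id column and the helper build_fts_sql. The sketch only builds a parametrised statement for the content match and leaves out the metadata matching and snippet generation.

```python
# Assumed mapping from the documented modes to PostgreSQL query parsers.
TSQUERY_FUNCS = {
    "plain": "plainto_tsquery",
    "websearch": "websearch_to_tsquery",
    "phrase": "phraseto_tsquery",
}


def build_fts_sql(mode: str = "websearch") -> str:
    """Return a parametrised full-text query for the given mode."""
    func = TSQUERY_FUNCS[mode]  # raises KeyError for unknown modes
    tsquery = f"{func}(%(lang)s::regconfig, %(query)s)"
    return (
        "SELECT id AS meta_id, path, fname, suffix, "
        f"ts_rank(ts_content, {tsquery}) AS rank "
        "FROM filemeta "  # hypothetical table name
        f"WHERE pool_id = %(pool_id)s AND ts_content @@ {tsquery} "
        "ORDER BY rank DESC LIMIT %(limit)s"
    )


params = {"lang": "german", "query": "rechnung 2023", "pool_id": 1, "limit": 50}
print(build_fts_sql("phrase"))
```

The user's query text is passed as a bound parameter rather than interpolated, so only the parser function name varies with mode.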
meipi.indexing.picture module
Image loading and resizing helpers based on DALI and PIL fallback.
The module provides a high-throughput resize pipeline for generating thumbnails from file paths plus picture ids. Failed DALI batches can be retried with smaller batches and optionally with PIL-backed loading.
- class meipi.indexing.picture.DALIImageResizer(files: Sequence[str] = (), labels: Sequence[int] = (), pipe_batch_size: int = 1, num_threads: int = 1, config: Config = Config(envfile='config.env', pg_host='localhost', pg_port='5432', pg_user='postgres', pg_passwd='postgres', pg_database='postgres', pg_schema='public', pg_api_key='pg-docker', tika_noocr_url='http://localhost:9998', tika_ocrurl='http://localhost:9997', datadir='./data', docsuf={'.epub', '.doc', '.htm', '.md', '.html', '.odt', '.docx', '.txt', '.pdf'}, picsuf={'.png', '.heic', '.tiff', '.jpeg', '.tif', '.jpg', '.bmp'}, vidsuf={'.mcf', '.mkv', '.mp4', '.mov', '.avi', '.vob'}, logger_name='sqlalchemy.engine', loglevel=20))
Bases: object
DALI image resizer class for loading and preprocessing images with DALI. Images can be loaded with DALI or PIL. The images are scaled to 224x224 and padded. The process() function processes the images in batches and returns the results. Input: list of file paths and labels, batch size in the pipeline, number of threads. Output: tuple of four lists: (images, labels, error file paths, error labels).
- pipePIL(batch_files, batch_labels)
Creates a DALI pipeline for loading and preprocessing images with PIL. The pipeline reads the images through an external iterator, decodes them, scales them to 224x224 and pads them. The function returns the pipeline that is used in process(). Like pipedali, but with an external iterator that loads the images with PIL and returns them as CuPy arrays.
- pipedali(batch_files, batch_labels)
Creates a DALI pipeline for loading and preprocessing images. The pipeline reads the images with the DALI file reader, decodes them, scales them to 224x224 and pads them. The function returns the pipeline that is used in process().
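The scale-then-pad step can be illustrated with plain arithmetic. The sketch assumes the aspect ratio is preserved (the longer side is scaled to 224 and the shorter side is padded); the page does not say so explicitly, and fit_and_pad is a hypothetical helper, not part of the pipeline.

```python
def fit_and_pad(width: int, height: int, target: int = 224):
    """Return the scaled size and the padding needed to reach target x target."""
    scale = target / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    return (new_w, new_h), (target - new_w, target - new_h)


# A 4000x3000 photo becomes 224x168 and needs 56 rows of padding.
print(fit_and_pad(4000, 3000))  # ((224, 168), (0, 56))
```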
- process(files: Sequence[str], labels: Sequence[int], batch_size: int = 1, use_PIL: bool = False, show_progress: bool = False) -> tuple[List[ndarray], List[int], List[str], List[int]]
Run one resize pass and return successes and failures.
Returns:
A tuple (images, labels, error_files, error_labels) where labels are pictures.id values.
- resize_pics(piclist: IdList, batch_size: int, use_PIL: bool) -> Tuple[List[ndarray], List[int], List[str], List[int]]
Create thumbnails for all pictures in piclist.
The DALI image resizer is considerably faster than the PIL image resizer, and larger batches are much faster than smaller ones. However, if one operation fails, the whole batch is aborted and processing continues with the next batch. Therefore, loading is first attempted with DALI and a large batch size. The failed images are then processed with batch size 1. Any remaining failures are then processed with PIL. The result is a list of images, labels, error file paths and error labels. Warning: the order of the images in the lists is not the same as in the input list.
Returns:
(thumbnails, pic_ids, failed_paths, failed_pic_ids).
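The retry strategy described for resize_pics can be sketched independently of DALI. resize_with_fallback and fake_process are hypothetical; the callable mirrors the (files, labels, batch_size, use_PIL) shape of process(), the batch size of 1 for the PIL pass is an assumption, and the toy backend returns strings where the real code returns arrays.

```python
from typing import Callable, List, Sequence, Tuple

Result = Tuple[List[str], List[int], List[str], List[int]]
# Modelled on process(): (files, labels, batch_size, use_PIL) -> Result
ProcessFn = Callable[[Sequence[str], Sequence[int], int, bool], Result]


def resize_with_fallback(process: ProcessFn, files, labels, batch_size: int) -> Result:
    """Retry failures: large DALI batches, then DALI one by one, then PIL."""
    images: List[str] = []
    done: List[int] = []
    # Batch size 1 for the PIL pass is an assumption.
    for size, use_pil in ((batch_size, False), (1, False), (1, True)):
        if not files:
            break
        ok_images, ok_labels, files, labels = process(files, labels, size, use_pil)
        images += ok_images
        done += ok_labels
    return images, done, list(files), list(labels)


def fake_process(files, labels, batch_size, use_pil) -> Result:
    """Toy backend: one bad file fails its whole batch."""
    images, done, bad_files, bad_labels = [], [], [], []
    for i in range(0, len(files), batch_size):
        chunk_f, chunk_l = files[i:i + batch_size], labels[i:i + batch_size]
        broken = any(
            f.startswith("broken") or (f.endswith(".heic") and not use_pil)
            for f in chunk_f
        )
        if broken:
            bad_files += chunk_f
            bad_labels += chunk_l
        else:
            images += [f"thumb:{f}" for f in chunk_f]
            done += chunk_l
    return images, done, bad_files, bad_labels


result = resize_with_fallback(
    fake_process, ["a.jpg", "x.heic", "b.jpg", "broken.png"], [1, 2, 3, 4], batch_size=2
)
print(result)
# (['thumb:a.jpg', 'thumb:b.jpg', 'thumb:x.heic'], [1, 3, 2], ['broken.png'], [4])
```

The output order differs from the input order, which is why the returned labels must be used to match thumbnails to picture ids.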
- class meipi.indexing.picture.PILLoader(files: Sequence[str], labels: Sequence[str], batch_size)
Bases: object
PIL loader for DALI external source. Loads images with PIL and returns them as CuPy arrays. More precisely, it returns a tuple of two lists of length batch_size: a list of images as CuPy arrays and a list of labels as CuPy arrays.
meipi.indexing.embedding module
Helpers for batching images and generating image embeddings.
This module provides small, model-agnostic utilities used by indexing jobs:
pre-processing image inputs into model-compatible batches
running batched forward passes on a selected device
collecting pooled vectors as numpy.ndarray
- meipi.indexing.embedding.check_cuda_memory() -> None
Print all currently alive CUDA tensors for debugging memory usage.
- meipi.indexing.embedding.create_image_batches(images: transformers.image_utils.ImageInput, model_name: str, batch_size: int) -> List[transformers.BatchFeature]
Create model-ready image batches using the matching HuggingFace processor.
Parameters:
images – Raw image inputs accepted by transformers.
model_name – HuggingFace model id used to resolve AutoImageProcessor.
batch_size – Number of samples per returned batch.
Returns:
A list of BatchFeature objects containing pixel_values tensors.
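The splitting into batches can be sketched without transformers. chunked is a hypothetical helper; that the last batch may be shorter is an assumption, and the image processor that turns each batch into pixel_values is deliberately left out.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], batch_size: int) -> List[Sequence[T]]:
    """Split items into consecutive batches; the last one may be shorter."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


print(chunked(["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg"], 2))
# [['a.jpg', 'b.jpg'], ['c.jpg', 'd.jpg'], ['e.jpg']]
```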
- meipi.indexing.embedding.generate_image_embeddings(model, inp_batches: List[transformers.BatchFeature], device='cuda') -> ndarray
Generate pooled embeddings for all batches and return one stacked array.
The model is temporarily moved to device for inference and restored to its original device afterwards.
meipi.indexing.srcdocs module
Models for storing the actual pictures and documents in the database.
Currently unused, because the pictures and documents are stored as files on disk.
- class meipi.indexing.srcdocs.PILImageType(*args: Any, **kwargs: Any)
Bases: TypeDecorator
Decorator for the picture attribute
- impl
alias of LargeBinary
- process_bind_param(value: Image | None, dialect)
Receive a bound parameter value to be converted.
Custom subclasses of TypeDecorator should override this method to provide custom behaviors for incoming data values. This method is called at statement execution time and is passed the literal Python data value which is to be associated with a bound parameter in the statement.
The operation could be anything desired to perform custom behavior, such as transforming or serializing data. This could also be used as a hook for validating logic.
Parameters:
value – Data to operate upon, of any type expected by this method in the subclass. Can be None.
dialect – the Dialect in use.
- process_result_value(value, dialect)
Receive a result-row column value to be converted.
Custom subclasses of TypeDecorator should override this method to provide custom behaviors for data values being received in result rows coming from the database. This method is called at result fetching time and is passed the literal Python data value that's extracted from a database result row.
The operation could be anything desired to perform custom behavior, such as transforming or deserializing data.
Parameters:
value – Data to operate upon, of any type expected by this method in the subclass. Can be None.
dialect – the Dialect in use.
meipi.indexing.langchain module
Deprecated LangChain integration stubs.
The active indexing pipeline no longer depends on this module. It remains in the repository as historical reference until the replacement workflow is fully stabilized.