meipi.indexing package¶

Submodules¶

meipi.indexing.config module¶

App configuration

In addition to the default configuration defined here, the values can be overridden through a .env file or directly through environment variables. The .env file should be located in the app's root directory and be named "config.env".

The global instance appconf is created once when meipi.indexing is imported and is immutable afterwards. Changes to config.env, environment variables, or MEIPI_CONFIG_ENV only take effect after a full restart of the Python process (rerun the CLI, restart the notebook kernel, restart the Streamlit app).

Example .env file:

IND_PG_HOST=localhost                   #PostgreSQL host
IND_PG_PORT=5432                        #PostgreSQL port
IND_PG_USER=postgres                    #PostgreSQL username
IND_PG_DATABASE=postgres                #PostgreSQL database name
IND_PG_API_KEY=pg-docker                #API key name for the DB password in the keyring
IND_DATADIR=./data                  #Data directory
IND_DOCROOT=/home/rslsync/folders/  #Document root directory
IND_LOGGER_NAME=sqlalchemy.engine          #Name of the logger for SQLAlchemy
IND_DOCSUF='{
    ".pdf",
    ".txt",
    ".md",
    ".docx",
    ".doc",
    ".html",
    ".htm",
    ".epub",
    ".odt"
}'  #allowed document file extensions
IND_PICSUF='{
    ".jpg",
    ".jpeg",
    ".bmp",
    ".png",
    ".heic",
    ".tiff",
    ".tif"
}'  #allowed picture file extensions
IND_VIDSUF='{
    ".mov",
    ".vob",
    ".mkv",
    ".avi",
    ".mp4",
    ".mcf"
}'  #allowed video file extensions
IND_LOGLEVEL=20  #Log level (e.g. 10=DEBUG, 20=INFO, 30=WARNING, 40=ERROR, 50=CRITICAL)

The DB password is fetched from the keyring; the API key name can be configured through the environment variable IND_PG_API_KEY (default: "pg-docker").

Set an alternative env file before startup:

export MEIPI_CONFIG_ENV=/path/to/config.env
meipi-index schema-info

class meipi.indexing.config.Config(_case_sensitive: bool | None = None, _nested_model_default_partial_update: bool | None = None, _env_prefix: str | None = None, _env_prefix_target: EnvPrefixTarget | None = None, _env_file: DotenvType | None = PosixPath('.'), _env_file_encoding: str | None = None, _env_ignore_empty: bool | None = None, _env_nested_delimiter: str | None = None, _env_nested_max_split: int | None = None, _env_parse_none_str: str | None = None, _env_parse_enums: bool | None = None, _cli_prog_name: str | None = None, _cli_parse_args: bool | list[str] | tuple[str, ...] | None = None, _cli_settings_source: CliSettingsSource[Any] | None = None, _cli_parse_none_str: str | None = None, _cli_hide_none_type: bool | None = None, _cli_avoid_json: bool | None = None, _cli_enforce_required: bool | None = None, _cli_use_class_docs_for_groups: bool | None = None, _cli_exit_on_error: bool | None = None, _cli_prefix: str | None = None, _cli_flag_prefix_char: str | None = None, _cli_implicit_flags: bool | Literal['dual', 'toggle'] | None = None, _cli_ignore_unknown_args: bool | None = None, _cli_kebab_case: bool | Literal['all', 'no_enums'] | None = None, _cli_shortcuts: Mapping[str, str | list[str]] | None = None, _secrets_dir: PathType | None = None, _build_sources: tuple[tuple[PydanticBaseSettingsSource, ...], dict[str, Any]] | None = None, *, envfile: str = 'config.env', pg_host: str = 'localhost', pg_port: str = '5432', pg_user: str = 'postgres', pg_passwd: str = 'postgres', pg_database: str = 'postgres', pg_schema: str = 'public', pg_api_key: str = 'pg-docker', tika_noocr_url: str = 'http://localhost:9998', tika_ocrurl: str = 'http://localhost:9997', datadir: str = './data', docsuf: Set[str] = {'.doc', '.docx', '.epub', '.htm', '.html', '.md', '.odt', '.pdf', '.txt'}, picsuf: Set[str] = {'.bmp', '.heic', '.jpeg', '.jpg', '.png', '.tif', '.tiff'}, vidsuf: Set[str] = {'.avi', '.mcf', '.mkv', '.mov', '.mp4', '.vob'}, logger_name: str = 'sqlalchemy.engine', loglevel: int = 20)[source]¶

Bases: BaseSettings

Holds the app configuration.

Instances are immutable (frozen). Create a new Config for a different env file only in tests or before the first import of meipi.indexing; at runtime, only appconf applies.
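The load-once, frozen behaviour can be illustrated with a small standard-library sketch. This is not the package's implementation (Config derives from pydantic's BaseSettings); DemoConfig, load_demo_config and the _CASTS table are hypothetical names, and only the IND_ prefix and the two field defaults are taken from this page.

```python
import dataclasses
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DemoConfig:
    """Hypothetical stand-in for Config with two of its documented fields."""

    pg_host: str = "localhost"
    loglevel: int = 20


# Converters from the raw environment string to each field's type.
_CASTS = {"pg_host": str, "loglevel": int}


def load_demo_config(environ=None, prefix="IND_"):
    """Build a frozen config; prefixed variables override the defaults."""
    environ = os.environ if environ is None else environ
    overrides = {}
    for name, cast in _CASTS.items():
        raw = environ.get(prefix + name.upper())
        if raw is not None:
            overrides[name] = cast(raw)
    return DemoConfig(**overrides)


# Created once; later changes to the environment do not affect this object.
conf = load_demo_config({"IND_PG_HOST": "db.example", "IND_LOGLEVEL": "10"})
```

Assigning to an attribute of conf raises dataclasses.FrozenInstanceError, which mirrors why a changed environment needs a fresh process rather than an in-place update.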

datadir: str¶

property db_conn_URL: URL¶: Builds the DB connection string from the configuration values.

db_passwd_from_keyring() → str[source]¶: Fetches the DB password from the keyring, or falls back to the default value.

docsuf: Set[str]¶

envfile: str¶

get_ftype(suf: str) → str[source]¶: Returns the configured file type for the given file suffix.

classmethod load(envfile: str = 'config.env') → Config[source]¶: Load settings from envfile (used at package import, not for hot reload).

property logger¶

logger_name: str¶

loglevel: int¶

model_config = {'arbitrary_types_allowed': True, 'case_sensitive': False, 'cli_avoid_json': False, 'cli_enforce_required': False, 'cli_exit_on_error': True, 'cli_flag_prefix_char': '-', 'cli_hide_none_type': False, 'cli_ignore_unknown_args': False, 'cli_implicit_flags': False, 'cli_kebab_case': False, 'cli_parse_args': None, 'cli_parse_none_str': None, 'cli_prefix': '', 'cli_prog_name': None, 'cli_shortcuts': None, 'cli_use_class_docs_for_groups': False, 'enable_decoding': True, 'env_file': 'config.env', 'env_file_encoding': 'utf-8', 'env_ignore_empty': False, 'env_nested_delimiter': None, 'env_nested_max_split': None, 'env_parse_enums': None, 'env_parse_none_str': None, 'env_prefix': 'IND_', 'env_prefix_target': 'variable', 'extra': 'forbid', 'frozen': True, 'json_file': None, 'json_file_encoding': None, 'nested_model_default_partial_update': False, 'protected_namespaces': ('model_validate', 'model_dump', 'settings_customise_sources'), 'secrets_dir': None, 'toml_file': None, 'validate_default': True, 'yaml_config_section': None, 'yaml_file': None, 'yaml_file_encoding': None}¶: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

pg_api_key: str¶

pg_database: str¶

pg_host: str¶

pg_passwd: str¶

pg_port: str¶

pg_schema: str¶

pg_user: str¶

picsuf: Set[str]¶

tika_noocr_url: str¶

tika_ocrurl: str¶

vidsuf: Set[str]¶

class meipi.indexing.config.FTYPE[source]¶

Bases: object

DOC: str = 'doc'¶

PIC: str = 'pic'¶

UNK: str = 'unk'¶

VID: str = 'vid'¶
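How Config.get_ftype might map a suffix onto the FTYPE constants can be sketched as follows. The suffix sets and the four constant values are the defaults listed on this page; the function name classify_suffix and the lower-casing of the suffix are assumptions, not documented behaviour.

```python
DOC, PIC, VID, UNK = "doc", "pic", "vid", "unk"

DOCSUF = {".pdf", ".txt", ".md", ".docx", ".doc", ".html", ".htm", ".epub", ".odt"}
PICSUF = {".jpg", ".jpeg", ".bmp", ".png", ".heic", ".tiff", ".tif"}
VIDSUF = {".mov", ".vob", ".mkv", ".avi", ".mp4", ".mcf"}


def classify_suffix(suf: str) -> str:
    """Return the file type for a suffix such as '.pdf' (hypothetical helper)."""
    suf = suf.lower()  # assumption: matching ignores case
    if suf in DOCSUF:
        return DOC
    if suf in PICSUF:
        return PIC
    if suf in VIDSUF:
        return VID
    return UNK  # anything not configured counts as unknown
```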

meipi.indexing.config.reload_appconf() → None[source]¶: Configuration cannot be reloaded in-process.

meipi.indexing.model module¶

PostgreSQL database model for pictures, documents and their metadata. SQLAlchemy ORM is used to define and manage the database tables. The models comprise

DBMeta: table for file metadata (including the full-text content inhalt)

DBDoc: optional row per document file (e.g. for embedding chunks)

DBPic: table for pictures with thumbnail and perceptual hash

DBDinoV2Vector: table for DINO V2 image vectors

The mixins DBMetaMixin, DocVectorMixin and PicVectorMixin provide shared fields and methods for the respective models.

The models include methods for creating and dropping tables, for running full-text searches, and for computing perceptual hashes.

class meipi.indexing.model.Base(**kwargs: Any)[source]¶

Bases: MappedAsDataclass, DeclarativeBase

Base class for SQLAlchemy models.

as_dict()[source]¶: Returns a dictionary without _sa_instance_state.

classmethod create_table(session: Session) → None[source]¶: Create the table in the database.

classmethod drop_table(session: Session) → None[source]¶: Drop the table from the database.

metadata: ClassVar[MetaData] = MetaData()¶: Refers to the MetaData collection that will be used for new Table objects.

See also

Accessing Table and Metadata

registry: ClassVar[registry] = <sqlalchemy.orm.decl_api.registry object>¶: Refers to the registry in use where new Mapper objects will be associated.
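What Base.as_dict is described as doing, a dictionary of the instance attributes minus SQLAlchemy's bookkeeping entry, can be sketched without a database. The helper as_dict_sketch and the FakeRow class are hypothetical; a real mapped instance carries _sa_instance_state in the same way.

```python
def as_dict_sketch(obj) -> dict:
    """Copy an object's attributes, leaving out the _sa_instance_state entry."""
    return {k: v for k, v in vars(obj).items() if k != "_sa_instance_state"}


class FakeRow:
    """Plain stand-in for a mapped instance (no database needed)."""

    def __init__(self):
        self.id = 1
        self.fname = "a.pdf"
        self._sa_instance_state = object()  # placeholder for the ORM state
```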

class meipi.indexing.model.CatalogBase(**kwargs: Any)[source]¶

Bases: DeclarativeBase

Declarative base for PostgreSQL catalog views (read-only, no dataclass ORM).

metadata: ClassVar[MetaData] = MetaData()¶: Refers to the MetaData collection that will be used for new Table objects.

See also

Accessing Table and Metadata

registry: ClassVar[registry] = <sqlalchemy.orm.decl_api.registry object>¶: Refers to the registry in use where new Mapper objects will be associated.

class meipi.indexing.model.DBCatalog(**kwargs)[source]¶

Bases: CatalogBase

Read-only ORM mapping of pg_catalog.pg_tables.

hasindexes: Mapped[bool]¶

hasrules: Mapped[bool]¶

hastriggers: Mapped[bool]¶

rowsecurity: Mapped[bool]¶

schemaname: Mapped[str]¶

tablename: Mapped[str]¶

tableowner: Mapped[str]¶

tablespace: Mapped[str | None]¶

class meipi.indexing.model.DBDinoV2Vector(pic_id, vector)[source]¶

Bases: Base, PicVectorMixin

SQLAlchemy model for DINO V2 image embeddings stored in PostgreSQL.

pic_id: Mapped[int]¶

vector¶

class meipi.indexing.model.DBDoc[source]¶

Bases: Base

Optional document row for doc filemeta (e.g. embedding chunks).

Full-text content lives on DBMeta (inhalt / ts_content).

id: Mapped[int]¶

meta: Mapped[DBMeta]¶

meta_id: Mapped[int]¶

class meipi.indexing.model.DBMeta(pool_id, path, fname, suffix, sort_date, fdate, fsize, clength, ctype, ftype, md_keys, meta_data, sha256=None, doc=None, pic=None, vid=None, *, inhalt='')[source]¶

Bases: Base

SQLAlchemy model for metadata stored in PostgreSQL.

clength: Mapped[int]¶: Content length, taken from the metadata

ctype: Mapped[str]¶: Content type, taken from the metadata

doc: Mapped[DBDoc | None]¶

fdate: Mapped[datetime]¶: File date as reported by the file system

fname: Mapped[str]¶: File name

fsize: Mapped[int]¶: File size as reported by the file system

ftype: Mapped[str]¶: File type, as configured in config.py

id: Mapped[int]¶

inhalt: Mapped[str]¶: Extracted text content (all file types)

md_keys: Mapped[list[str] | None]¶: Keys of the metadata

meta_data: Mapped[dict | None]¶: Metadata as a dictionary

path: Mapped[str]¶: Path to the file, relative to a root directory

pic: Mapped[DBPic | None]¶

pool_id: Mapped[int]¶: Application area or data pool, freely definable

sha256: Mapped[bytes | None]¶: File hash

sort_date: Mapped[datetime]¶: Date used for sorting

suffix: Mapped[str]¶: File suffix, including the dot

ts_content: Mapped[TSVECTOR]¶: Full-text search vector derived from inhalt

classmethod tsquery(query: str, session: Session, lang: str = 'german') → Sequence[Self][source]¶: Perform a full-text search on ts_content.

vid: Mapped[DBVid | None]¶
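The sha256 field holds raw bytes. A value of that shape could be produced as in the sketch below; file_sha256 is a hypothetical helper, and this page does not say how the package actually computes the hash.

```python
import hashlib


def file_sha256(path: str, chunk_size: int = 1 << 20) -> bytes:
    """Return the raw 32-byte SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops at the first empty read (end of file).
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.digest()
```

Reading in chunks keeps memory use flat for large videos or PDFs.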

class meipi.indexing.model.DBPic(xmp=None, truncated=None, thumbarray=None, phash=None)[source]¶

Bases: Base

SQLAlchemy model for Picture data stored in PostgreSQL.

In addition to the file metadata, it contains fields for XMP metadata, a thumbnail as a numpy array, and a perceptual hash. The actual image data is not stored in the database, only the metadata and the hash. The method set_phash computes the perceptual hash from the thumbnail, if one is present. The perceptual hash is stored as BYTEA to allow efficient storage and search. An index is created on the phash field to enable fast similarity searches.

classmethod calc_phash(im: Image) → bytes[source]¶

id: Mapped[int]¶

meta: Mapped[DBMeta]¶

meta_id: Mapped[int]¶

phash: Mapped[bytes | None]¶: Perceptual hash as bytes

set_phash()[source]¶

property thumb: Image | None¶

thumbarray: Mapped[ndarray | None]¶: Thumbnail 224x224x3 as ndarray

truncated: Mapped[bool | None]¶: Whether original image is truncated

xmp: Mapped[dict | None]¶: XMP-attributes of the image
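To make the idea of a byte-packed perceptual hash concrete, here is a sketch of a simple average hash together with a Hamming distance for comparing two hashes. This page does not state which algorithm calc_phash uses, so the average hash is a stand-in technique, and average_hash and hamming_distance are hypothetical names. The input is a plain list of grayscale rows rather than a PIL image or ndarray, to keep the example dependency-free.

```python
def average_hash(gray, hash_size: int = 8) -> bytes:
    """Pack an average hash of a square grayscale image into bytes.

    gray is a list of rows of 0..255 values whose side length is a
    multiple of hash_size (for example a 224x224 thumbnail).
    """
    block = len(gray) // hash_size
    # Downsample by averaging each block x block tile.
    cells = []
    for by in range(hash_size):
        for bx in range(hash_size):
            tile = [gray[by * block + y][bx * block + x]
                    for y in range(block) for x in range(block)]
            cells.append(sum(tile) / len(tile))
    mean = sum(cells) / len(cells)
    # One bit per cell: 1 if the cell is brighter than the overall mean.
    bits = 0
    for value in cells:
        bits = (bits << 1) | (value > mean)
    return bits.to_bytes(hash_size * hash_size // 8, "big")


def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the differing bits between two equally long hashes."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))
```

A small Hamming distance between two hashes suggests visually similar thumbnails; identical images give a distance of 0.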

class meipi.indexing.model.DBPool(pool, rootpath, description, id=None)[source]¶

Bases: Base

SQLAlchemy model for data pools stored in PostgreSQL.

This table manages data pools, which can be defined as logical groups of files. A data pool could, for example, represent a particular application area or a category of files.

description: Mapped[str | None]¶: Optional description of the data pool

id: Mapped[int]¶

pool: Mapped[str]¶: Name of the data pool

rootpath: Mapped[str]¶: Root path for the data pool, used to resolve file paths

class meipi.indexing.model.DBVid[source]¶

Bases: Base

SQLAlchemy model for Video data stored in PostgreSQL.

id: Mapped[int]¶

meta: Mapped[DBMeta]¶

meta_id: Mapped[int]¶

class meipi.indexing.model.DocVectorMixin(chunk_id: int = <sqlalchemy.orm.properties.MappedColumn object>, doc_id: int = <sqlalchemy.orm.properties.MappedColumn object>, content: str = <sqlalchemy.orm.properties.MappedColumn object>)[source]¶

Bases: MappedAsDataclass

Mixin for DocVectorTables

chunk_id: Mapped[int] = <sqlalchemy.orm.properties.MappedColumn object>¶

content: Mapped[str] = <sqlalchemy.orm.properties.MappedColumn object>¶

doc = <_RelationshipDeclared at 0x7f53cf1cc5f0; no key>¶

doc_id: Mapped[int] = <sqlalchemy.orm.properties.MappedColumn object>¶

vector¶

type meipi.indexing.model.IdList = Sequence[Tuple[str, int]]¶

class meipi.indexing.model.PILArray(*args: Any, **kwargs: Any)[source]¶

Bases: TypeDecorator

Type for PIL Image as numpy array

This allows thumbnails to be stored in the database as numpy arrays without first converting them to another format. The database type is BYTEA, because the numpy arrays are stored as binary data.

cache_ok = True¶

Indicate if statements using this ExternalType are "safe to cache".

The default value None will emit a warning and then not allow caching of a statement which includes this type. Set to False to disable statements using this type from being cached at all without a warning. When set to True, the object’s class and selected elements from its state will be used as part of the cache key. For example, using a TypeDecorator:

class MyType(TypeDecorator):
    impl = String

    cache_ok = True

    def __init__(self, choices):
        self.choices = tuple(choices)
        self.internal_only = True

The cache key for the above type would be equivalent to:

>>> MyType(["a", "b", "c"])._static_cache_key
(<class '__main__.MyType'>, ('choices', ('a', 'b', 'c')))

The caching scheme will extract attributes from the type that correspond to the names of parameters in the __init__() method. Above, the "choices" attribute becomes part of the cache key but "internal_only" does not, because there is no parameter named "internal_only".

The requirements for cacheable elements are that they are hashable and that they indicate the same SQL rendered for expressions using this type every time for a given cache value.

To accommodate datatypes that refer to unhashable structures such as dictionaries, sets and lists, these objects can be made "cacheable" by assigning hashable structures to the attributes whose names correspond with the names of the arguments. For example, a datatype which accepts a dictionary of lookup values may publish this as a sorted series of tuples. Given a previously un-cacheable type as:

class LookupType(UserDefinedType):
    """a custom type that accepts a dictionary as a parameter.

    this is the non-cacheable version, as "self.lookup" is not
    hashable.

    """

    def __init__(self, lookup):
        self.lookup = lookup

    def get_col_spec(self, **kw):
        return "VARCHAR(255)"

    def bind_processor(self, dialect): ...  # works with "self.lookup" ...

Where "lookup" is a dictionary. The type will not be able to generate a cache key:

>>> type_ = LookupType({"a": 10, "b": 20})
>>> type_._static_cache_key
<stdin>:1: SAWarning: UserDefinedType LookupType({'a': 10, 'b': 20}) will not
produce a cache key because the ``cache_ok`` flag is not set to True.
Set this flag to True if this type object's state is safe to use
in a cache key, or False to disable this warning.
symbol('no_cache')

If we did set up such a cache key, it wouldn't be usable. We would get a tuple structure that contains a dictionary inside of it, which cannot itself be used as a key in a "cache dictionary" such as SQLAlchemy's statement cache, since Python dictionaries aren't hashable:

>>> # set cache_ok = True
>>> type_.cache_ok = True

>>> # this is the cache key it would generate
>>> key = type_._static_cache_key
>>> key
(<class '__main__.LookupType'>, ('lookup', {'a': 10, 'b': 20}))

>>> # however this key is not hashable, will fail when used with
>>> # SQLAlchemy statement cache
>>> some_cache = {key: "some sql value"}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'dict'

The type may be made cacheable by assigning a sorted tuple of tuples to the ".lookup" attribute:

class LookupType(UserDefinedType):
    """a custom type that accepts a dictionary as a parameter.

    The dictionary is stored both as itself in a private variable,
    and published in a public variable as a sorted tuple of tuples,
    which is hashable and will also return the same value for any
    two equivalent dictionaries.  Note it assumes the keys and
    values of the dictionary are themselves hashable.

    """

    cache_ok = True

    def __init__(self, lookup):
        self._lookup = lookup

        # assume keys/values of "lookup" are hashable; otherwise
        # they would also need to be converted in some way here
        self.lookup = tuple((key, lookup[key]) for key in sorted(lookup))

    def get_col_spec(self, **kw):
        return "VARCHAR(255)"

    def bind_processor(self, dialect): ...  # works with "self._lookup" ...

Where above, the cache key for LookupType({"a": 10, "b": 20}) will be:

>>> LookupType({"a": 10, "b": 20})._static_cache_key
(<class '__main__.LookupType'>, ('lookup', (('a', 10), ('b', 20))))

Added in version 1.4.14: - added the cache_ok flag to allow some configurability of caching for TypeDecorator classes.

Added in version 1.4.28: - added the ExternalType mixin which generalizes the cache_ok flag to both the TypeDecorator and UserDefinedType classes.

See also

SQL Compilation Caching

coerce_compared_value(op, value)[source]¶

Suggest a type for a 'coerced' Python value in an expression.

By default, returns self. This method is called by the expression system when an object using this type is on the left or right side of an expression against a plain Python object which does not yet have a SQLAlchemy type assigned:

expr = table.c.somecolumn + 35

Where above, if somecolumn uses this type, this method will be called with the value operator.add and 35. The return value is whatever SQLAlchemy type should be used for 35 for this particular operation.

impl¶: alias of BYTEA

process_bind_param(value: ndarray | None, dialect)[source]¶

Receive a bound parameter value to be converted.

Custom subclasses of TypeDecorator should override this method to provide custom behaviors for incoming data values. This method is called at statement execution time and is passed the literal Python data value which is to be associated with a bound parameter in the statement.

The operation could be anything desired to perform custom behavior, such as transforming or serializing data. This could also be used as a hook for validating logic.

Parameters:

value – Data to operate upon, of any type expected by this method in the subclass. Can be None.
dialect – the Dialect in use.

See also

Augmenting Existing Types

TypeDecorator.process_result_value()

process_literal_param(value, dialect)[source]¶

Receive a literal parameter value to be rendered inline within a statement.

Note

This method is called during the SQL compilation phase of a statement, when rendering a SQL string. Unlike other SQL compilation methods, it is passed a specific Python value to be rendered as a string. However it should not be confused with the TypeDecorator.process_bind_param() method, which is the more typical method that processes the actual value passed to a particular parameter at statement execution time.

Custom subclasses of TypeDecorator should override this method to provide custom behaviors for incoming data values that are in the special case of being rendered as literals.

The returned string will be rendered into the output string.

process_result_value(value, dialect)[source]¶

Receive a result-row column value to be converted.

Custom subclasses of TypeDecorator should override this method to provide custom behaviors for data values being received in result rows coming from the database. This method is called at result fetching time and is passed the literal Python data value that's extracted from a database result row.

The operation could be anything desired to perform custom behavior, such as transforming or deserializing data.

Parameters:

value – Data to operate upon, of any type expected by this method in the subclass. Can be None.
dialect – the Dialect in use.

See also

Augmenting Existing Types

TypeDecorator.process_bind_param()

property python_type: type[ndarray]¶

Return the Python type object expected to be returned by instances of this type, if known.

Basically, for those types which enforce a return type, or are known across the board to do such for all common DBAPIs (like int for example), will return that type.

If a return type is not defined, raises NotImplementedError.

Note that any type also accommodates NULL in SQL which means you can also get back None from any type in practice.

class meipi.indexing.model.PicVectorMixin(pic_id: int = <sqlalchemy.orm.properties.MappedColumn object>)[source]¶

Bases: MappedAsDataclass

Mixin for PicVectorTables

It contains a special field vector that stores the vectors produced by an embedder model. The size of the vector is defined by the class that uses this mixin. A foreign key pic_id is defined that references the picture data table.

pic_id: Mapped[int] = <sqlalchemy.orm.properties.MappedColumn object>¶

vector¶

meipi.indexing.operations module¶

class meipi.indexing.operations.AsyncFileOperations(pool: DBPool, config: Config = Config(envfile='config.env', pg_host='localhost', pg_port='5432', pg_user='postgres', pg_passwd='postgres', pg_database='postgres', pg_schema='public', pg_api_key='pg-docker', tika_noocr_url='http://localhost:9998', tika_ocrurl='http://localhost:9997', datadir='./data', docsuf={'.epub', '.doc', '.htm', '.md', '.html', '.odt', '.docx', '.txt', '.pdf'}, picsuf={'.png', '.heic', '.tiff', '.jpeg', '.tif', '.jpg', '.bmp'}, vidsuf={'.mcf', '.mkv', '.mp4', '.mov', '.avi', '.vob'}, logger_name='sqlalchemy.engine', loglevel=20), skip_ocr: bool = True, timeout: float = 30, compress=True)[source]¶

Bases: AsyncTikaClient

Async file parsing helpers built on top of tika_client.

DBPic_from_DBMeta(dbmeta: DBMeta) → DBPic[source]¶: Build a DBPic row from image metadata and XMP tags.

dir_tree(rel_path: str) → Generator[str][source]¶: Recursively walks all files in the directory tree, extracts metadata and content, and creates DB objects.

async file_to_db(rel_path: str) → DBMeta | None[source]¶: Create DBMeta and optional typed child rows for a file path.

async tika_parse(rel_path: str) → Tuple[DBMeta | None, str][source]¶: Parse one file with Tika and convert metadata into DBMeta fields.
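The recursive traversal that dir_tree is described as performing can be sketched with os.walk. This covers only the walking and suffix filtering, not metadata extraction or DB object creation; walk_relative and its root and suffixes parameters are hypothetical, chosen to mirror a pool root path and the configured suffix sets.

```python
import os
from typing import Iterator, Set


def walk_relative(root: str, rel_path: str, suffixes: Set[str]) -> Iterator[str]:
    """Yield paths relative to root for files whose suffix is in suffixes."""
    start = os.path.join(root, rel_path)
    for dirpath, _dirnames, filenames in os.walk(start):
        for name in sorted(filenames):
            # Compare the lower-cased extension, including the leading dot.
            if os.path.splitext(name)[1].lower() in suffixes:
                yield os.path.relpath(os.path.join(dirpath, name), root)
```

Yielding relative paths keeps the results independent of where the root is mounted, in line with the path field being stored relative to a root directory.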

class meipi.indexing.operations.DBOperations(pool_id: int | None = None, pool: DBPool | None = None, *, allow_no_pool: bool = False, enginekwargs: dict | None = None, sessionkwargs: dict | None = None)[source]¶

Bases: object

High-level PostgreSQL operations for one configured data pool.

clear_pool()[source]¶: Deletes all data from the given pool.

create_pool(pool: DBPool)[source]¶: Insert a new data pool and make it active for this instance.

create_tables(entities: Sequence[type[Base]] | None = None)[source]¶: Creates the tables in the database if they do not exist yet.

get_pool(pool_id: int) → DBPool[source]¶: Get the pool with the given id.

async insert_docs_from_meta(skipocr: bool = True)[source]¶

Re-extract text for filemeta rows in the pool that have empty inhalt.

For doc filemeta, ensures a DBDoc row exists for embedding FKs.

Args: skipocr (bool): If True, OCR processing is skipped to save resources.

insert_pics_from_meta()[source]¶: Reads the metadata of all pictures in the given pool, creates the corresponding DBPic objects, and adds them to the DB.

recreate_tables(entities: Sequence[type[Base]] | None = None)[source]¶: Recreate tables in the database.

schema_info() → dict[str, Any][source]¶: Get the schema information of the database.

update_thumbs(thumblist: List[Tuple[ndarray, int]])[source]¶: Update the thumbnails and perceptual hashes for the given list of pictures.

update_thumbs_no_heic() → List[int][source]¶: Update the thumbnails and perceptual hashes for the pictures in the pool that are not HEICs.

update_thumbs_no_thumb() → List[int][source]¶: Update the thumbnails and perceptual hashes for the pictures in the pool that have no thumbnail.

meipi.indexing.search module¶

PostgreSQL full-text search for indexed documents.

class meipi.indexing.search.DocSearchHit(meta_id: int, path: str, fname: str, suffix: str, rank: float, snippet: str)[source]¶

Bases: object

One filemeta row matching a full-text query.

fname: str¶

meta_id: int¶

path: str¶

rank: float¶

snippet: str¶

suffix: str¶

meipi.indexing.search.search_documents(session: Session, *, pool_id: int, query: str, lang: str = 'german', limit: int = 50, mode: Literal['plain', 'websearch', 'phrase'] = 'websearch') → list[DocSearchHit][source]¶

Search file bodies and metadata with PostgreSQL full-text matching.

Matches rows where the query hits extracted content (ts_content / inhalt) or metadata (filename, path, content type, and Tika meta_data JSON).
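The three mode values line up naturally with PostgreSQL's plainto_tsquery, websearch_to_tsquery and phraseto_tsquery functions. That correspondence is an assumption, since this page lists only the mode names. A minimal sketch of selecting the function and building a parameterised match clause (build_match_sql is a hypothetical helper):

```python
TSQUERY_FUNCTIONS = {
    "plain": "plainto_tsquery",
    "websearch": "websearch_to_tsquery",
    "phrase": "phraseto_tsquery",
}


def build_match_sql(mode: str = "websearch") -> str:
    """Return a parameterised match clause for the given search mode."""
    try:
        func = TSQUERY_FUNCTIONS[mode]
    except KeyError:
        raise ValueError(f"unknown search mode: {mode!r}") from None
    # :lang and :query stay bound parameters; only the function name is
    # interpolated, and it comes from the fixed table above.
    return f"ts_content @@ {func}(CAST(:lang AS regconfig), :query)"
```

Keeping the user's query as a bound parameter avoids building SQL from untrusted text; the cast makes the language name an explicit text search configuration.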

meipi.indexing.picture module¶

Image loading and resizing helpers based on DALI and PIL fallback.

The module provides a high-throughput resize pipeline for generating thumbnails from file paths plus picture ids. Failed DALI batches can be retried with smaller batches and optionally with PIL-backed loading.

class meipi.indexing.picture.DALIImageResizer(files: Sequence[str] = (), labels: Sequence[int] = (), pipe_batch_size: int = 1, num_threads: int = 1, config: Config = Config(envfile='config.env', pg_host='localhost', pg_port='5432', pg_user='postgres', pg_passwd='postgres', pg_database='postgres', pg_schema='public', pg_api_key='pg-docker', tika_noocr_url='http://localhost:9998', tika_ocrurl='http://localhost:9997', datadir='./data', docsuf={'.epub', '.doc', '.htm', '.md', '.html', '.odt', '.docx', '.txt', '.pdf'}, picsuf={'.png', '.heic', '.tiff', '.jpeg', '.tif', '.jpg', '.bmp'}, vidsuf={'.mcf', '.mkv', '.mp4', '.mov', '.avi', '.vob'}, logger_name='sqlalchemy.engine', loglevel=20))[source]¶

Bases: object

DALI image resizer class for loading and preprocessing images with DALI. Images can be loaded with DALI or with PIL. The images are scaled to 224x224 and padded. The method process() processes the images in batches and returns the results. Input: list of file paths and labels, batch size in the pipeline, number of threads. Output: tuple of four lists: (images, labels, error file paths, error labels)
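One way to read "scaled to 224x224 and padded" is an aspect-preserving fit followed by padding up to the square target. The sketch below computes that geometry only. Both the aspect-preserving scale and the centred padding are assumptions (this page does not say where the padding goes), and letterbox_geometry is a hypothetical helper, not part of the DALI pipeline.

```python
def letterbox_geometry(width: int, height: int, target: int = 224):
    """Return (new_w, new_h, pad_left, pad_top) for an aspect-preserving fit."""
    # Scale so that the longer side exactly fills the target.
    scale = target / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    # Assumption: leftover space is split evenly (centred padding).
    pad_left = (target - new_w) // 2
    pad_top = (target - new_h) // 2
    return new_w, new_h, pad_left, pad_top
```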

pipePIL(batch_files, batch_labels)[source]¶: Creates a DALI pipeline for loading and preprocessing images with PIL. The pipeline reads the images through an external iterator, decodes them, scales them to 224x224, and pads them. The method returns the pipeline that process() uses. Like pipedali, but with an external iterator that loads the images with PIL and returns them as CuPy arrays.

pipedali(batch_files, batch_labels)[source]¶: Creates a DALI pipeline for loading and preprocessing images. The pipeline reads the images with the DALI file reader, decodes them, scales them to 224x224, and pads them. The method returns the pipeline that process() uses.

process(files: Sequence[str], labels: Sequence[int], batch_size: int = 1, use_PIL: bool = False, show_progress: bool = False) → tuple[List[ndarray], List[int], List[str], List[int]][source]¶

Run one resize pass and return successes and failures.

Returns: A tuple (images, labels, error_files, error_labels) where labels are pictures.id values.

resize_pics(piclist: IdList, batch_size: int, use_PIL: bool) → Tuple[List[ndarray], List[int], List[str], List[int]][source]¶

Create thumbnails for all pictures in piclist.

The DALI image resizer is considerably faster than the PIL image resizer, and larger batches are markedly faster than smaller ones. However, if a single operation fails, the whole batch is aborted and processing continues with the next batch. The images are therefore first loaded with DALI and a large batch size. The failed images are then processed with batch size 1, and any remaining failures are processed with PIL. The result is a list of images, labels, error file paths, and error labels. Caution: the order of the images in the lists is not the same as in the input list.

Parameters:

piclist (IdList) – (file path, pictures.id), not filemeta.id
batch_size (int) – Number of pictures to process in one batch
use_PIL (bool) – Whether the thumbnails are created with PIL (True) or with DALI (False)

Returns:

(thumbnails, pic_ids, failed_paths, failed_pic_ids).
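The three-stage strategy described for resize_pics (large fast batches, then single items, then a slower loader) can be sketched independently of DALI and PIL. The loaders here are plain callables that either return one result per item or raise for the whole batch; run_batches and resize_with_fallback are hypothetical names, not the package's functions.

```python
from typing import Callable, List, Sequence, Tuple

Item = Tuple[str, int]  # (file path, picture id)
Loader = Callable[[Sequence[Item]], list]  # one result per item, or raises


def run_batches(items: Sequence[Item], batch_size: int, loader: Loader):
    """Process items in batches; a failing batch is recorded whole as failed."""
    done, failed = [], []
    for start in range(0, len(items), batch_size):
        batch = list(items[start:start + batch_size])
        try:
            results = loader(batch)
        except Exception:
            failed.extend(batch)  # one bad file aborts the entire batch
        else:
            done.extend(zip(results, (pic_id for _path, pic_id in batch)))
    return done, failed


def resize_with_fallback(items, batch_size, fast_loader, slow_loader):
    """Large fast batches first, then single items, then the slow loader."""
    done, failed = run_batches(items, batch_size, fast_loader)
    retry_done, failed = run_batches(failed, 1, fast_loader)
    last_done, failed = run_batches(failed, 1, slow_loader)
    return done + retry_done + last_done, failed
```

Because retried items are appended after the first pass, the output order differs from the input order, which is why each result travels with its picture id.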

class meipi.indexing.picture.PILLoader(files: Sequence[str], labels: Sequence[str], batch_size)[source]¶

Bases: object

PIL loader for the DALI external source. Loads images with PIL and returns them as CuPy arrays. More precisely, it returns a tuple of two lists of length batch_size: a list of images as CuPy arrays and a list of labels as CuPy arrays.

meipi.indexing.embedding module¶

Helpers for batching images and generating image embeddings.

This module provides small, model-agnostic utilities used by indexing jobs:

pre-processing image inputs into model-compatible batches
running batched forward passes on a selected device
collecting pooled vectors as numpy.ndarray

meipi.indexing.embedding.check_cuda_memory() → None[source]¶: Print all currently alive CUDA tensors for debugging memory usage.

meipi.indexing.embedding.create_image_batches(images: transformers.image_utils.ImageInput, model_name: str, batch_size: int) → List[transformers.BatchFeature][source]¶

Create model-ready image batches using the matching HuggingFace processor.

Parameters:

images – Raw image inputs accepted by transformers.
model_name – HuggingFace model id used to resolve AutoImageProcessor.
batch_size – Number of samples per returned batch.

Returns:

A list of BatchFeature objects containing pixel_values tensors.
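The batching part of create_image_batches can be illustrated on its own. The sketch below only splits a sequence into consecutive slices; the image-processor step is left out, and chunked is a hypothetical helper. How the function actually splits its input is not stated on this page.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])
```

The last batch may be shorter than batch_size when the number of items is not an exact multiple.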

meipi.indexing.embedding.generate_image_embeddings(model, inp_batches: List[transformers.BatchFeature], device='cuda') → ndarray¶

Generate pooled embeddings for all batches and return one stacked array.

The model is temporarily moved to device for inference and restored to its original device afterwards.

meipi.indexing.srcdocs module¶

Models for storing the actual pictures and documents in the database.

Currently unused, because the pictures and documents are stored as files on disk.

class meipi.indexing.srcdocs.PILImageType(*args: Any, **kwargs: Any)[source]¶

Bases: TypeDecorator

Type decorator for the image attribute

impl¶: alias of LargeBinary

process_bind_param(value: Image | None, dialect)[source]¶

Receive a bound parameter value to be converted.

Custom subclasses of TypeDecorator should override this method to provide custom behaviors for incoming data values. This method is called at statement execution time and is passed the literal Python data value which is to be associated with a bound parameter in the statement.

The operation could be anything desired to perform custom behavior, such as transforming or serializing data. This could also be used as a hook for validating logic.

Parameters:

value – Data to operate upon, of any type expected by this method in the subclass. Can be None.
dialect – the Dialect in use.

See also

Augmenting Existing Types

TypeDecorator.process_result_value()

process_result_value(value, dialect)[source]¶

Receive a result-row column value to be converted.

Custom subclasses of TypeDecorator should override this method to provide custom behaviors for data values being received in result rows coming from the database. This method is called at result fetching time and is passed the literal Python data value that's extracted from a database result row.

The operation could be anything desired to perform custom behavior, such as transforming or deserializing data.

Parameters:

value – Data to operate upon, of any type expected by this method in the subclass. Can be None.
dialect – the Dialect in use.

See also

Augmenting Existing Types

TypeDecorator.process_bind_param()

class meipi.indexing.srcdocs.Photo(id, image)[source]¶

Bases: Base

Table for image data

id: Mapped[int]¶

image: Mapped[Image]¶

meipi.indexing.langchain module¶

Deprecated LangChain integration stubs.

The active indexing pipeline no longer depends on this module. It remains in the repository as historical reference until the replacement workflow is fully stabilized.
