Support direct indexing of json "vector" files #604

@JMMackenzie


It is now very common to use external tools or libraries to produce pre-computed document representations (like those based on learned sparse retrieval).

In these cases, we might see a document collection as a jsonl file, with one document per line.

Anserini already supports this format, for example: https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/collection/JsonVectorCollection.java

One example from Anserini with SPLADE-doc is:

{"id": 9, "contents": "", "vector": {"`": 0, "a": 183, "i": 25, "\u6e05": 46, "\uff5e": 30, "to": 29, "as": 16, "there": 34, "two": 19, "##i": 85, "most": 23, "##t": 34, "team": 70, "american": 43, "south": 46, "war": 32, "life": 100, "much": 60, "here": 8, "music": 3, "end": 33, "old": 25, "april": 62, "set": 39, "party": 68, "song": 6, "ve": 65, "population": 138, "top": 72, "book": 125, "door": 6, "st": 32, "received": 67, "##in": 90, "24": 55, "far": 44, "am": 4, "done": 12, "arms": 154, "summer": 153, "announced": 48, "records": 49, "design": 1, "considered": 10, "miles": 57, "points": 30, "person": 71, "china": 5, "official": 11, "wide": 26, "kept": 109, "##as": 106, "meet": 112, "goal": 20, "limited": 4, "sense": 2, "historic": 19, "lives": 7, "completely": 107, "annual": 0, "failed": 170, "##tion": 16, "expected": 25, "joe": 131, "##ba": 0, "daniel": 15, "mentioned": 71, "picked": 25, "settled": 186, "actress": 5, "reserve": 48, "jersey": 109, "remain": 36, "##go": 149, "##berg": 33, "sort": 0, "andrew": 28, "gets": 115, "sources": 138, "brand": 115, "documentary": 86, "lewis": 117, "##ding": 8, "promotion": 56, "soccer": 15, "5th": 81, "landing": 111, "journalist": 117, "familiar": 75, "productions": 60, "separated": 54, "##ker": 118, "amateur": 33, "li": 71, "membership": 55, "adapted": 89, "suggests": 10, "traveled": 12, "protest": 62, "baltimore": 72, "mitchell": 97, "beast": 12, "indicates": 199, "whisper": 0, "radar": 20, "isolated": 85, "slip": 7, "jefferson": 4, "grandson": 32, "reveals": 185, "##lon": 1, "ya": 4, "##bar": 60, "raced": 41, "halfway": 107, "manufacturers": 3, "dynamic": 11, "severely": 25, "cottage": 20, "ni": 8, "somerset": 40, "newport": 26, "forgot": 11, "chances": 43, "fees": 35, "saxon": 158, "kicking": 91, "testimony": 12, "genesis": 1, "charm": 7, "111": 34, "cart": 102, "mikhail": 4, "kirk": 40, "tanzania": 28, "##itt": 18, "russians": 59, "cnn": 59, "outlet": 202, "skeleton": 33, "##pling": 35, "##hol": 19, "##ipe": 27, 
"briggs": 50, "##right": 36, "workplace": 27, "alvarez": 8, "debbie": 118, "renee": 90, "reno": 67, "breuning": 33, "wiley": 38, "##fted": 2, "injected": 109, "##ego": 39, "maroon": 162, "kerman": 44, "minnie": 5, "merritt": 38, "goalscorer": 45}}
...

Ideally, we could build a single tool that wraps both index compression and WAND data creation for these file types. Alternatively, we could incorporate a JSON reader into each tool.
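
As a rough illustration of what such a JSON reader would need to do, here is a minimal Python sketch that turns lines in the format above into in-memory postings lists. It is not an existing tool or API: the function name `read_json_vectors` and the sample data are hypothetical, and it assumes each record sits on a single line, that weights are already quantized integers (as in the SPLADE-doc example), and that zero-weight terms can be dropped.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def read_json_vectors(
    lines: Iterable[str],
) -> Tuple[List[str], Dict[str, List[Tuple[int, int]]]]:
    """Build postings lists from JSONL records with "id" and "vector" fields.

    Hypothetical helper: returns the external document ids (in input order)
    and a mapping term -> [(internal docid, weight), ...].
    """
    doc_ids: List[str] = []
    postings: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        docid = len(doc_ids)  # internal ids are assigned in input order
        doc_ids.append(str(record["id"]))
        for term, weight in record["vector"].items():
            # Assumption: a zero weight contributes nothing and is not indexed.
            if weight > 0:
                postings[term].append((docid, int(weight)))
    return doc_ids, dict(postings)


# Made-up sample records in the same shape as the example above.
sample = [
    '{"id": 9, "contents": "", "vector": {"a": 183, "i": 25, "annual": 0}}',
    '{"id": 10, "contents": "", "vector": {"a": 7, "team": 70}}',
]
doc_ids, postings = read_json_vectors(sample)
```

Because documents are numbered in input order, each postings list comes out sorted by internal docid, which is what a compression step would typically expect. The stored weights are the precomputed impact scores, so the same pass could also collect the per-term maxima needed for WAND data.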


Labels: enhancement (New feature or request)
