Open
Labels
enhancement: New feature or request
Description
It is now very common to use external tools or libraries to produce pre-computed document representations (like those based on learned sparse retrieval).
In these cases, the document collection is often a JSONL file, with one document per line.
Anserini already supports this format, for example: https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/collection/JsonVectorCollection.java
One example from Anserini with SPLADE-doc is:
{"id": 9, "contents": "", "vector": {"`": 0, "a": 183, "i": 25, "\u6e05": 46, "\uff5e": 30, "to": 29, "as": 16, "there": 34, "two": 19, "##i": 85, "most": 23, "##t": 34, "team": 70, "american": 43, "south": 46, "war": 32, "life": 100, "much": 60, "here": 8, "music": 3, "end": 33, "old": 25, "april": 62, "set": 39, "party": 68, "song": 6, "ve": 65, "population": 138, "top": 72, "book": 125, "door": 6, "st": 32, "received": 67, "##in": 90, "24": 55, "far": 44, "am": 4, "done": 12, "arms": 154, "summer": 153, "announced": 48, "records": 49, "design": 1, "considered": 10, "miles": 57, "points": 30, "person": 71, "china": 5, "official": 11, "wide": 26, "kept": 109, "##as": 106, "meet": 112, "goal": 20, "limited": 4, "sense": 2, "historic": 19, "lives": 7, "completely": 107, "annual": 0, "failed": 170, "##tion": 16, "expected": 25, "joe": 131, "##ba": 0, "daniel": 15, "mentioned": 71, "picked": 25, "settled": 186, "actress": 5, "reserve": 48, "jersey": 109, "remain": 36, "##go": 149, "##berg": 33, "sort": 0, "andrew": 28, "gets": 115, "sources": 138, "brand": 115, "documentary": 86, "lewis": 117, "##ding": 8, "promotion": 56, "soccer": 15, "5th": 81, "landing": 111, "journalist": 117, "familiar": 75, "productions": 60, "separated": 54, "##ker": 118, "amateur": 33, "li": 71, "membership": 55, "adapted": 89, "suggests": 10, "traveled": 12, "protest": 62, "baltimore": 72, "mitchell": 97, "beast": 12, "indicates": 199, "whisper": 0, "radar": 20, "isolated": 85, "slip": 7, "jefferson": 4, "grandson": 32, "reveals": 185, "##lon": 1, "ya": 4, "##bar": 60, "raced": 41, "halfway": 107, "manufacturers": 3, "dynamic": 11, "severely": 25, "cottage": 20, "ni": 8, "somerset": 40, "newport": 26, "forgot": 11, "chances": 43, "fees": 35, "saxon": 158, "kicking": 91, "testimony": 12, "genesis": 1, "charm": 7, "111": 34, "cart": 102, "mikhail": 4, "kirk": 40, "tanzania": 28, "##itt": 18, "russians": 59, "cnn": 59, "outlet": 202, "skeleton": 33, "##pling": 35, "##hol": 19, "##ipe": 27, 
"briggs": 50, "##right": 36, "workplace": 27, "alvarez": 8, "debbie": 118, "renee": 90, "reno": 67, "breuning": 33, "wiley": 38, "##fted": 2, "injected": 109, "##ego": 39, "maroon": 162, "kerman": 44, "minnie": 5, "merritt": 38, "goalscorer": 45}}
...
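To make the record layout concrete, here is a minimal Python sketch of a reader for this format. It assumes each line carries an "id" and a "vector" object mapping terms to integer weights, as in the example above, and it ignores "contents". The function name `read_vector_docs` is hypothetical and not part of any existing tool.

```python
import json
from typing import Dict, Iterable, Iterator, Tuple


def read_vector_docs(lines: Iterable[str]) -> Iterator[Tuple[str, Dict[str, int]]]:
    """Yield (doc_id, {term: weight}) for each non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        # Assumption: only "id" and "vector" are needed; weights are
        # already quantized to integers, as in the SPLADE-doc example.
        vector = {term: int(weight) for term, weight in record["vector"].items()}
        yield str(record["id"]), vector
```

Since the reader accepts any iterable of strings, an open file handle can be passed directly, so documents are streamed one line at a time rather than loaded all at once.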
Ideally, we could build a tool that wraps both index compression and WAND data creation for these file types. Alternatively, we could incorporate a JSON reader into each tool.
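As a rough illustration of what such a wrapper would need to do before compression, the sketch below inverts the JSONL records into per-term posting lists and records each term's maximum weight, which is the kind of per-term upper bound WAND-style pruning relies on. This is not the project's API: `build_postings` is a hypothetical name, and dropping zero-weight entries is an assumption.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_postings(
    lines: Iterable[str],
) -> Tuple[List[str], Dict[str, List[Tuple[int, int]]], Dict[str, int]]:
    """Invert JSONL vector documents into postings and per-term max weights."""
    doc_ids: List[str] = []  # ordinal -> original document id
    postings: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        ordinal = len(doc_ids)  # documents are numbered in input order
        doc_ids.append(str(record["id"]))
        for term, weight in record["vector"].items():
            if weight > 0:  # assumption: zero weights carry no information
                postings[term].append((ordinal, weight))
    # Per-term maximum weight, usable as a score upper bound.
    max_weight = {term: max(w for _, w in plist) for term, plist in postings.items()}
    return doc_ids, dict(postings), max_weight
```

Because documents are assigned ordinals in input order, each posting list comes out sorted by document ordinal, which is what most compression schemes expect.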