diff --git a/README.md b/README.md
index b1a0946..e6cc74c 100644
--- a/README.md
+++ b/README.md
@@ -105,12 +105,12 @@ Input data format
 The input file should list all completions in *lexicographical* order.
-For example, see the the file `test_data/trec05_efficiency_queries/trec05_efficiency_queries.completions`.
+For example, see the the file `test_data/trec_05_efficiency_queries/trec_05_efficiency_queries.completions`.
 The first column represent the ID of the completion; the other columns contain the tokens separated by white spaces.
-(The IDs for the file `trec05_efficiency_queries.completions` are
+(The IDs for the file `trec_05_efficiency_queries.completions` are
 fake, i.e., they do not take into account any particular assignment.)
@@ -119,49 +119,49 @@ preparing the datasets for indexing:
 1. The command
- $ extract_dict.py trec05_efficiency_queries/trec05_efficiency_queries.completions
+ $ extract_dict.py trec_05_efficiency_queries/trec_05_efficiency_queries.completions
 extract the dictionary from a file listing all completions in textual form.
 2. The command
- $ python map_dataset.py trec05_efficiency_queries/trec05_efficiency_queries.completions
+ $ python map_dataset.py trec_05_efficiency_queries/trec_05_efficiency_queries.completions
 maps strings to integer ids.
 3. The command
- $ python build_stats.py trec05_efficiency_queries/trec05_efficiency_queries.completions.mapped
+ $ python build_stats.py trec_05_efficiency_queries/trec_05_efficiency_queries.completions.mapped
 calulcates the dataset statistics.
 4. The command
- $ python build_inverted_and_forward.py trec05_efficiency_queries/trec05_efficiency_queries.completions
+ $ python build_inverted_and_forward.py trec_05_efficiency_queries/trec_05_efficiency_queries.completions
 builds the inverted and forward files.
 If you run the scripts in the reported order, you will get:
-- `trec05_efficiency_queries.completions.dict`: lists all the distinct
+- `trec_05_efficiency_queries.completions.dict`: lists all the distinct
 tokens in the completions sorted in lexicographical order.
-- `trec05_efficiency_queries.completions.mapped`: lists all completions
+- `trec_05_efficiency_queries.completions.mapped`: lists all completions
 whose tokens have been mapped to integer ids as assigned by a lexicographically-sorted string dictionary (that should be built from the
-tokens listed in `trec05_efficiency_queries.completions.dict`).
+tokens listed in `trec_05_efficiency_queries.completions.dict`).
 Each completion terminates with the id `0`.
-- `trec05_efficiency_queries.completions.mapped.stats` contains some
+- `trec_05_efficiency_queries.completions.mapped.stats` contains some
 statistics about the datasets, needed to build the data structures more efficiently.
 - `trec05_efficiency_queries.completions.inverted` is the inverted file.
-- `trec05_efficiency_queries.completions.forward` is the forward file. Note that each list is *not* sorted, thus the lists are the same as the ones contained in `trec05_efficiency_queries.completions.mapped` but sorted in docID order.
+- `trec_05_efficiency_queries.completions.forward` is the forward file. Note that each list is *not* sorted, thus the lists are the same as the ones contained in `trec_05_efficiency_queries.completions.mapped` but sorted in docID order.
 Benchmarks
 ----------
@@ -174,4 +174,4 @@ Live demo
 ----------
 Start the web server with the program `./web_server ` and access the demo at
-`localhost:`.
\ No newline at end of file
+`localhost:`.