# Context Service
The Context Service is responsible for loading ACCESS courses from a specified directory each time a request is sent to the `/context/create` endpoint. The service processes each course by performing the following steps:
- Load Course Data: The service reads files from the exercise context directory that have supported parsers. Supported file types are listed in `app/model/mappers/reader_mapper.py`.
- Parse Content: It parses the content of the files.
- Generate Embeddings: The service creates embeddings for the parsed content.
- Store in Vector Database: The content is written to a vector database, partitioned by course slug. The hashed course slug is used as the collection name in the vector store.
- Return Context Processing Statistics: Statistics about the extracted files are available via the `/context/{course_slug}/status` endpoint.
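For illustration, deriving a collection name from the hashed course slug might look like this. This is only a sketch: `sha256`, the 16-character truncation, and the `course_` prefix are assumptions, not necessarily the service's actual hashing scheme.

```python
import hashlib


def collection_name(course_slug: str) -> str:
    """Derive a deterministic vector-store collection name from a course slug."""
    # sha256 is an assumption here; check the service's actual hashing scheme.
    digest = hashlib.sha256(course_slug.encode()).hexdigest()
    # Collection names typically must start with a letter or underscore,
    # so prefix the hex digest rather than using it raw.
    return f"course_{digest[:16]}"
```

The same slug always maps to the same collection, so repeated processing of a course targets one collection per course.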
- The services required by the Context Service need to be running. You can start them via the docker compose of the ACCESS infrastructure repository. The Context Service itself should not be started with that docker compose.
Follow these steps to set up and run the Context Service on your local machine:
- Create a Python Virtual Environment

  ```bash
  python -m venv venv
  ```
- Activate the Virtual Environment
  - On Windows:

    ```bash
    venv\Scripts\activate
    ```

  - On macOS and Linux:

    ```bash
    source venv/bin/activate
    ```
- Install Dependencies

  ```bash
  pip install -r requirements.txt
  ```
- Set Environment Variables

  Ensure the following environment variables are set in your shell or environment configuration:

  ```bash
  WORKDIR_INTERNAL={WORKDIR_INTERNAL}
  MISTRAL_API_KEY={MISTRAL_API_KEY}
  MISTRAL_EMBEDDING_MODEL={MISTRAL_EMBEDDING_MODEL}  # should be set to the same embedding model as in the backend, e.g. mistral-embed
  CHATBOT_DB_NAME={CHATBOT_DB_NAME}                  # Default: chatbot
  CHATBOT_DB_USER={CHATBOT_DB_USER}                  # Default: postgres
  CHATBOT_DB_PASSWORD={CHATBOT_DB_PASSWORD}          # Default: postgres
  CHATBOT_DB_HOST={CHATBOT_DB_HOST}                  # Default: localhost
  CHATBOT_DB_PORT={CHATBOT_DB_PORT}                  # Default: 5555
  VECTOR_STORE_HOST={VECTOR_STORE_HOST}              # Default: http://localhost:19530
  ```
- Run the Application

  Start the application using uvicorn:

  ```bash
  uvicorn app.main:app --port 3423 --reload
  ```
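Once the service is running, the status endpoint can be queried, for example as below. This is a minimal sketch: the base URL matches the uvicorn command above, but the response shape is not specified here; check the FastAPI docs of the running service for the actual schema.

```python
import json
import urllib.request

# Matches the uvicorn command above.
BASE_URL = "http://localhost:3423"


def status_url(course_slug: str) -> str:
    # Processing statistics for a course are exposed at this endpoint.
    return f"{BASE_URL}/context/{course_slug}/status"


def fetch_status(course_slug: str) -> dict:
    # Requires the service to be running, e.g. fetch_status("example-course").
    with urllib.request.urlopen(status_url(course_slug)) as resp:
        return json.load(resp)
```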
- Add this snippet to the `.devcontainer/devcontainer.json` and adjust the `WORKDIR` and the `NETWORK` name. The network name is usually the lowercase name of the directory containing the docker compose file, plus `_default` if no custom network was specified.

  ```json
  "mounts": [
      "source={WORKDIR},target=/usr/data,type=bind,consistency=consistent"
  ],
  "runArgs": [
      "--network={NETWORK}",
      "-p", "3423:3423"
  ]
  ```
- Reopen in Container (VS Code)
- Set Environment Variables

  ```bash
  export MISTRAL_API_KEY={MISTRAL_API_KEY}
  export MISTRAL_EMBEDDING_MODEL={MISTRAL_EMBEDDING_MODEL}
  export VECTOR_STORE_HOST={VECTOR_STORE_HOST}      # Default: http://localhost:19530 -> set to the service name of the vector store in the network, e.g. http://milvus-standalone:19530
  export CHATBOT_DB_NAME={CHATBOT_DB_NAME}          # Default: chatbot
  export CHATBOT_DB_HOST={CHATBOT_DB_HOST}          # Default: localhost -> set to the service name of the chatbot db in the network, e.g. chatbot_postgres
  export CHATBOT_DB_USER={CHATBOT_DB_USER}          # Default: postgres
  export CHATBOT_DB_PASSWORD={CHATBOT_DB_PASSWORD}  # Default: postgres
  export CHATBOT_DB_PORT={CHATBOT_DB_PORT}          # Default: 5555 -> set to the port of the chatbot db in the network, e.g. 5432
  ```
- Run the FastAPI Server

  ```bash
  uvicorn app.main:app --host 0.0.0.0 --port 3423 --reload
  ```
To connect to a different vector store, follow these steps:
- Create a new adapter in `app/model/output_adapter`. The adapter should extend `base_vectorstore` and have the same variables in the constructor as `milvus_vectorstore`.
- Specify the new adapter in `app/model/mappers/vectorstore_mapper.py`.
- In `app/config.yaml`, specify which vector store to use.
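A new adapter could be sketched roughly as follows. This is not the project's actual code: the `BaseVectorstore` stand-in and the `write_documents` method name are assumptions; mirror whatever `base_vectorstore` and `milvus_vectorstore` actually define.

```python
from abc import ABC, abstractmethod


class BaseVectorstore(ABC):
    """Stand-in for app/model/output_adapter/base_vectorstore (assumed interface)."""

    @abstractmethod
    def write_documents(self, collection_name: str, documents: list) -> int: ...


class InMemoryVectorstore(BaseVectorstore):
    """Toy adapter that keeps documents in a dict, keyed by collection name."""

    def __init__(self, host: str, embeddings, drop_old: bool = True):
        # Same constructor variables as the (assumed) milvus_vectorstore adapter.
        self.host = host
        self.embeddings = embeddings
        self.drop_old = drop_old
        self._collections: dict[str, list] = {}

    def write_documents(self, collection_name: str, documents: list) -> int:
        # Mimic the drop_old behavior: replace the old collection on each write.
        if self.drop_old:
            self._collections.pop(collection_name, None)
        self._collections.setdefault(collection_name, []).extend(documents)
        return len(self._collections[collection_name])
```

Keeping the constructor signature identical to the Milvus adapter lets the mapper swap implementations without touching the calling code.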
To connect to a different embedder, follow these steps:
- Create a new adapter in `app/model/embedder`. The adapter should extend `base_embedder`, have the same variables in the constructor as `mistral_embedder`, and return a LangChain `Embeddings` instance.
- Specify the new adapter in `app/model/mappers/embeddings_mapper.py`.
- In `app/config.yaml`, specify which embedding model to use.
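An embedder adapter might look roughly like this. The `BaseEmbedder` stand-in and the `get_embeddings` method name are assumptions; here a local `Embeddings` ABC stands in for LangChain's interface (which defines `embed_documents` and `embed_query`), and a deterministic fake replaces a real provider client.

```python
from abc import ABC, abstractmethod


class Embeddings(ABC):
    """Stand-in for LangChain's Embeddings interface."""

    @abstractmethod
    def embed_documents(self, texts: list[str]) -> list[list[float]]: ...

    @abstractmethod
    def embed_query(self, text: str) -> list[float]: ...


class FakeEmbeddings(Embeddings):
    """Deterministic toy embeddings in place of a real Mistral client."""

    def embed_query(self, text: str) -> list[float]:
        # Two toy features: text length and a bounded character checksum.
        return [float(len(text)), float(sum(map(ord, text)) % 97)]

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        return [self.embed_query(t) for t in texts]


class BaseEmbedder(ABC):
    """Stand-in for app/model/embedder/base_embedder (assumed interface)."""

    @abstractmethod
    def get_embeddings(self) -> Embeddings: ...


class FakeEmbedder(BaseEmbedder):
    def __init__(self, api_key: str, model: str):
        # Same constructor variables as the (assumed) mistral_embedder adapter.
        self.api_key = api_key
        self.model = model

    def get_embeddings(self) -> Embeddings:
        return FakeEmbeddings()
```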
To add a parser for other file extensions, follow these steps:
- Create a new reader in `app/model/data_reader` that extends `BaseReader`. The constructor should take the file path as an argument.
- Specify the new reader in `app/model/mappers/reader_mapper.py`.
- The text splitters are different for each file type, so you can change them per reader in `app/model/data_reader`.
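A reader for a new extension could be sketched like this. The `BaseReader` stand-in, the `read` method name, and the choice of `.csv` are assumptions for illustration; mirror the actual `BaseReader` in `app/model/data_reader`.

```python
import csv
from abc import ABC, abstractmethod
from pathlib import Path


class BaseReader(ABC):
    """Stand-in for the BaseReader in app/model/data_reader (assumed interface)."""

    def __init__(self, filepath: str):
        # The constructor takes the file path as an argument.
        self.filepath = Path(filepath)

    @abstractmethod
    def read(self) -> str: ...


class CsvReader(BaseReader):
    """Toy reader: flattens each CSV row into one comma-separated line of text."""

    def read(self) -> str:
        with open(self.filepath, newline="") as f:
            return "\n".join(", ".join(row) for row in csv.reader(f))
```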
- Since the course should be created without the backend waiting for the Context Service to finish, the backend returns a success message immediately after receiving a request, even if the Context Service encounters an error. This can result in false success messages.
- When a course is created or updated, the old vector store collection is deleted and a new one is created. This is inefficient. The setting `drop_old = True` in `app/model/output_adapter/milvus_store` specifies this behavior. However, the cost for embeddings is low, at $0.10 per 1 million tokens, which corresponds to approximately 1500 pages of written text.
- The text splitters split the text not based on semantics but by a fixed metric such as the number of tokens or pages. It would be beneficial if they could split based on semantics.
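The fixed-metric splitting described above can be illustrated with a simple token-count splitter. This is a toy sketch, not the project's splitter; the chunk size and whitespace tokenization are assumptions.

```python
def split_by_tokens(text: str, chunk_size: int = 4) -> list[str]:
    """Cut chunks after a fixed number of whitespace-separated tokens,
    regardless of sentence or semantic boundaries."""
    tokens = text.split()
    return [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]
```

Note how a chunk boundary can fall mid-sentence; a semantic splitter would instead look for topic or sentence boundaries before cutting.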
