- This should already be available in your subscription with an embeddings model deployed. The default in the labs is text-embedding-ada-002.
- In your Azure subscription, create an ADLS Gen 2 storage account.
- Create a container called "stackexchange"
- From this repository, choose one file for a site of your interest from the /data/ folder
- Upload it to the "stackexchange" container in ADLS Gen2.
(The zip file contains other data that may be useful to you after this lab; uploading more than one file can take two hours or more to index.)
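The file you upload should contain a JSON array, because the parsing mode chosen later in the wizard treats each element of the top-level array as one search document. A minimal stdlib sketch of that behavior (the field names below mirror a typical StackExchange posts file but are illustrative, not the exact schema of your chosen file):

```python
import json

# Hypothetical sample shaped like a StackExchange posts export: with the
# "JSON array" parsing mode, each element of the top-level array becomes
# one search document.
sample = """[
  {"Id": "1", "Title": "How do I ...?", "Body": "I am trying to ..."},
  {"Id": "2", "Title": "Why does ...?", "Body": "When I run ..."}
]"""

docs = json.loads(sample)
print(f"{len(docs)} documents will be indexed")
for doc in docs:
    # Body is the field chosen later as the column to vectorize.
    print(doc["Id"], doc["Body"][:20])
```

This is also why a single file is enough for the lab: every post in the array becomes its own indexed, vectorized document.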
- Create an Azure AI Search resource, if you don't have one already
- On the Overview blade of Azure AI Search, choose the option to "Import and vectorize your data".
- Choose Azure Data Lake Storage Gen2
- Select your Subscription, Storage account, and Blob container where the JSON files are located.
- Leave the Blob Folder option empty as you uploaded directly to the root folder
- Parsing mode - choose "JSON array", then leave the Document root option blank
- Click Next, which will initiate schema validation
- For the column to vectorize, select Body, which contains the text from the posts
- Leave Kind and Subscription with the defaults, which should be Azure OpenAI and your Subscription
- Select your Azure OpenAI Service
- Select the embeddings model which was already deployed, text-embedding-ada-002
- Leave the option for API Key selected, and check the box for acknowledgement of costs associated with using Azure OpenAI
- Leave the defaults for these settings
- Review the settings, then click Create, which will deploy the Index, Indexer, and configuration profiles to Azure AI Search. It will also initiate the indexing process.
- Back in Azure AI Search, observe the indexer's progress. When it completes, continue in Azure AI Foundry.
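Indexer progress can also be checked programmatically through the Azure AI Search REST API instead of the portal. A hedged, stdlib-only sketch: the service name, indexer name, and admin key below are placeholders, and the api-version is one GA version that exposes the status endpoint.

```python
import json
import urllib.request


def indexer_status_url(service: str, indexer: str,
                       api_version: str = "2023-11-01") -> str:
    """Build the REST URL for an indexer's status endpoint."""
    return (f"https://{service}.search.windows.net"
            f"/indexers/{indexer}/status?api-version={api_version}")


def get_indexer_status(service: str, indexer: str, api_key: str) -> dict:
    """Fetch the status document for the given indexer."""
    req = urllib.request.Request(
        indexer_status_url(service, indexer),
        headers={"api-key": api_key},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Placeholder values -- substitute your own resource names and key.
    status = get_indexer_status("my-search-service",
                                "stackexchange-indexer",
                                "<admin-api-key>")
    print(status.get("lastResult", {}).get("status"))
```

A `lastResult.status` of "success" corresponds to the green check you see on the Indexers blade in the portal.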
- Inside the Azure AI Foundry portal, click on the Chat option under Playground
- Enter the instructions from grounding.txt in "Give the model instructions and context", then click Apply Changes
- Expand the "Add your data" section and click "Add a data source"
- From the drop down, select Azure AI Search
- The subscription should already be populated, then select the existing Azure AI Search instance and the Index you just created
- Check the box for "Add vector search to this search resource" and select the already deployed embeddings model used earlier.
- Leave the Use custom field mapping unchecked
- Leave the Search type as the default option
- Select the existing vector search configuration from your Azure AI Search resource
- Select the API Key option
- After validation, click Save and Close
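Once the data source is wired up, the playground issues hybrid (keyword + vector) queries against your index on every turn. The same kind of query can be sent directly with the REST API. This is a hedged sketch: the vector field name "text_vector" is an assumption about what the wizard generated (check your index's field list), the resource names and key are placeholders, and the api-version shown is one that supports text-based vector queries, where the service vectorizes the query with the index's configured vectorizer.

```python
import json
import urllib.request


def build_hybrid_query(query: str, vector_field: str = "text_vector",
                       k: int = 5) -> dict:
    """Request body for a hybrid (keyword + vector) search.

    "kind": "text" asks the service to embed the query server-side using
    the vectorizer configured on the index (the embeddings deployment
    chosen earlier), so no client-side embedding call is needed.
    """
    return {
        "search": query,                  # keyword part of the hybrid query
        "vectorQueries": [{
            "kind": "text",
            "text": query,                # vectorized by the service
            "fields": vector_field,       # assumed wizard-generated field name
            "k": k,
        }],
        "top": k,
    }


def search(service: str, index: str, api_key: str, query: str) -> dict:
    url = (f"https://{service}.search.windows.net/indexes/{index}"
           f"/docs/search?api-version=2024-07-01")
    req = urllib.request.Request(
        url,
        data=json.dumps(build_hybrid_query(query)).encode(),
        headers={"api-key": api_key, "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Placeholder resource names and key.
    results = search("my-search-service", "stackexchange-index",
                     "<query-api-key>", "how do I tune my guitar")
    for doc in results.get("value", []):
        print(doc.get("@search.score"))
```

Trying a query like this directly is a quick way to confirm the index returns sensible posts before judging the playground's grounded answers.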
StackExchange Data Dump Licensing
https://meta.stackexchange.com/questions/401324/announcing-a-change-to-the-data-dump-process
This data is prepared by individuals for use with an LLM, not to train an LLM. Think of it this way: training an LLM is like teaching it new skills and knowledge, while accessing your own data with an LLM is like asking a trained expert to interpret or respond based on specific information you provide.
Training an LLM:
- Purpose: The data is used to improve or refine the model's ability to understand and generate human-like text.
- Method: The data is fed into the model during its training phase. This phase involves updating the model's parameters based on the patterns and information found in the data.
- Scale: Typically involves large volumes of diverse data to ensure the model learns a wide range of language patterns.
Accessing Your Own Data with an LLM:
- Purpose: The model is used to process, analyze, or generate responses based on your own specific data.
- Method: The data is input to the model at runtime (i.e., when the model is already trained), and the model uses its existing knowledge to interpret or generate responses related to that data.
- Scale: Usually involves a specific subset of data relevant to your needs, rather than the vast and diverse data used for training.
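The runtime pattern above can be sketched in a few lines: the retrieved documents never alter the model's weights; they simply ride along in the prompt for a single call. This is a simplified illustration of the idea, not the exact payload the playground sends.

```python
def build_grounded_prompt(instructions: str, retrieved_docs: list[str],
                          question: str) -> list[dict]:
    """Assemble a chat request grounded in retrieved documents.

    The model itself is unchanged; the data is supplied as context for
    this one request only -- the "accessing your own data" pattern.
    """
    context = "\n\n".join(f"[doc {i + 1}] {d}"
                          for i, d in enumerate(retrieved_docs))
    return [
        {"role": "system",
         "content": f"{instructions}\n\nSources:\n{context}"},
        {"role": "user", "content": question},
    ]


messages = build_grounded_prompt(
    "Answer only from the sources provided.",
    ["Posts about woodworking joints ...", "Posts about wood glue ..."],
    "What joint is strongest for a table leg?",
)
print(messages[0]["content"][:60])
```

Contrast this with training, where the same documents would be fed through the model to update its parameters rather than appearing in a prompt.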