
(Newer) Pygmalion 6Bv3 ggjt model appears to not be able to go over 500-600 tokens of context. #41

@rabidcopy

Description

Prerequisites

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

Prompt processing should not run out of space in the context's memory pool for any prompt or session that fits within the model's context size (n_ctx = 2048).

Current Behavior

I consistently run into this every time a session reaches 500+ tokens, or when giving a 500+ token starting scenario, with the more recent ggjt conversion of Pygmalion located here. It does not appear to affect standard llama.cpp models; I have not tested the other model types that koboldcpp supports. It DOES NOT AFFECT the older ggml conversion of the Pygmalion model located here, which handles a starting scenario of 1000+ tokens without this issue, e.g. Processing Prompt (864 / 1302 tokens). (A quick breakdown of the numbers in the error is sketched below.)

Processing Prompt (584 / 589 tokens)ggml_new_tensor_impl: not enough space in the context's memory pool (needed 269340800, available 268435456)
Processing Prompt (8 / 10 tokens)ggml_new_tensor_impl: not enough space in the context's memory pool (needed 268458928, available 268435456)
Processing Prompt (8 / 9 tokens)ggml_new_tensor_impl: not enough space in the context's memory pool (needed 269097088, available 268435456)

I have plenty of free system RAM available when it happens.

Edit: Also affects janeway-ggml-q4_0.bin.
Processing Prompt (584 / 673 tokens)ggml_new_tensor_impl: not enough space in the context's memory pool (needed 269340800, available 268435456)
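
For what it's worth, the "available" figure in the error is exactly 256 MiB, and the "needed" figure at 584/589 tokens only overshoots it by well under 1 MiB, which matches the failure starting right around the 500-600 token mark. A quick back-of-the-envelope check (plain Python, nothing koboldcpp-specific; both constants are copied from the error message above):

```python
# Illustrative arithmetic only; both constants come straight from the error message above.
POOL_BYTES = 256 * 1024 * 1024        # 268435456 -- the "available" figure
NEEDED_AT_589_TOKENS = 269_340_800    # the "needed" figure at Processing Prompt (584 / 589 tokens)

shortfall = NEEDED_AT_589_TOKENS - POOL_BYTES
print(f"pool = {POOL_BYTES} bytes (256 MiB)")
print(f"needed = {NEEDED_AT_589_TOKENS} bytes, short by {shortfall} bytes (~{shortfall // 1024} KiB)")
```

Since the "available" number is the same 268435456 in every occurrence regardless of token count, it looks like a fixed-size buffer in the GPT-J eval path rather than one scaled to the prompt length, though I have not confirmed that in the code.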

Environment and Context

  • Physical (or virtual) hardware you are using, e.g. for Linux:
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         43 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  12
  On-line CPU(s) list:   0-11
Vendor ID:               AuthenticAMD
  Model name:            AMD Ryzen 5 2600 Six-Core Processor
    CPU family:          23
    Model:               8
    Thread(s) per core:  2
    Core(s) per socket:  6
    Socket(s):           1
    Stepping:            2
    BogoMIPS:            7600.11
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr s
                         se sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop
                         _tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 
                         movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a
                          misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfc
                         tr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap
                          clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero irperf xsaveerptr arat npt lbrv svm_lock
                          nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_v
                         msave_vmload vgif overflow_recov succor smca sev sev_es
Virtualization features: 
  Virtualization:        AMD-V
Caches (sum of all):     
  L1d:                   192 KiB (6 instances)
  L1i:                   384 KiB (6 instances)
  L2:                    3 MiB (6 instances)
  L3:                    16 MiB (2 instances)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-11
Vulnerabilities:         
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Vulnerable
  Spec store bypass:     Vulnerable
  Spectre v1:            Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
  Spectre v2:            Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected
  • Operating System, e.g. for Linux:

Linux rabid-ms7b87 6.2.7-zen1-1-zen #1 ZEN SMP PREEMPT_DYNAMIC Sat, 18 Mar 2023 01:06:38 +0000 x86_64 GNU/Linux

Steps to Reproduce

Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.

  1. Load koboldcpp with a Pygmalion model in ggml/ggjt format. In this case, the model was taken from here.
  2. Enter a starting prompt exceeding 500-600 tokens, or let a session run past 500-600 tokens (a scripted way to do this against the API is sketched below this list).
  3. Observe the ggml_new_tensor_impl: not enough space in the context's memory pool (needed 269340800, available 268435456) message in the terminal.
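
The same failure can also be triggered from a script rather than through the Kobold Lite UI. Below is a minimal sketch; it assumes koboldcpp is already running on port 5001 and exposing the KoboldAI-compatible /api/v1/generate endpoint (the field names mirror the Input line in the failure log, and the padded prompt is just an easy way to exceed ~600 tokens):

```python
# Hypothetical reproduction script -- the endpoint and payload are assumptions based on
# the KoboldAI-compatible API and the Input line captured in the failure log below.
import requests

prompt = "You: hello there\nKoboldGPT: Hello! How can I help you today?\n" * 60  # well past 600 tokens

payload = {
    "prompt": prompt,
    "max_context_length": 1000,
    "max_length": 8,
    "temperature": 0.6,
    "top_p": 0.9,
    "top_k": 40,
    "rep_pen": 1.15,
}

r = requests.post("http://localhost:5001/api/v1/generate", json=payload)
print(r.status_code, r.text)
# With the ggjt Pygmalion model loaded, the terminal running koboldcpp prints
# "ggml_new_tensor_impl: not enough space in the context's memory pool ..." at this point.
```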

Failure Logs

Example run with the following Linux command:

[rabid@rabid-ms7b87 koboldcpp]$ python koboldcpp.py ../pygmalion-6b-v3-ggml-ggjt-q4_0.bin  --threads 6  --stream
Welcome to KoboldCpp - Version 1.3
Prebuilt OpenBLAS binaries only available for windows. Please manually build/link libopenblas from makefile with LLAMA_OPENBLAS=1
Initializing dynamic library: koboldcpp.dll
Loading model: /home/rabid/Desktop/pygmalion-6b-v3-ggml-ggjt-q4_0.bin 
[Parts: 1, Threads: 6]

---
Identified as GPT-J model: (ver 102)
Attempting to Load...
---
gptj_model_load: loading model from '/home/rabid/Desktop/pygmalion-6b-v3-ggml-ggjt-q4_0.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx   = 2048
gptj_model_load: n_embd  = 4096
gptj_model_load: n_head  = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot   = 64
gptj_model_load: f16     = 2
gptj_model_load: ggml ctx size = 4505.45 MB
gptj_model_load: memory_size =   896.00 MB, n_mem = 57344
gptj_model_load: ................................... done
gptj_model_load: model size =  3609.38 MB / num tensors = 285
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold HTTP Server on port 5001
Please connect to custom endpoint at http://localhost:5001?streaming=1
127.0.0.1 - - [10/Apr/2023 09:42:18] "GET /?streaming=1 HTTP/1.1" 200 -
127.0.0.1 - - [10/Apr/2023 09:42:18] "GET /api/latest/model HTTP/1.1" 200 -
127.0.0.1 - - [10/Apr/2023 09:42:18] "GET /sw.js HTTP/1.1" 404 -

Input: {"n": 1, "max_context_length": 1000, "max_length": 8, "rep_pen": 1.15, "temperature": 0.6, "top_p": 0.9, "top_k": 40, "top_a": 0, "typical": 1, "tfs": 1, "rep_pen_range": 1024, "rep_pen_slope": 0.7, "sampler_order": [0, 1, 2, 3, 4, 5, 6], "prompt": "[The following is a chat message log between you and an extremely intelligent and knowledgeable AI system named KoboldGPT. KoboldGPT is a state-of-the-art Artificial General Intelligence. You may ask any question, or request any task, and KoboldGPT will always be able to respond accurately and truthfully.]\n\nYou: What are german shepherds?\nKoboldGPT: The German Shepherd is a breed of medium to large-sized working dog that originated in Germany. In the English language, the breed's officially recognized name is German Shepherd Dog. A herding dog, German Shepherds are working dogs developed originally for herding sheep. Since that time, however, because of their strength, intelligence, trainability, and obedience, German Shepherds around the world are often the preferred breed for many types of work, including disability assistance, search-and-rescue, police and military roles and acting.\nYou: Which is heavier, a duck or a car?\nKoboldGPT: A car weighs around 1300 kilograms, while a duck only weighs about 2 kilograms. Therefore, a car is heavier than a duck.\nYou: What is the atomic number of Calcium?\nKoboldGPT: A calcium atom has 20 protons, 20 electrons and 20 neutrons. The atomic number or proton number of a chemical element is the number of protons found in the nucleus of every atom of that element. The atomic number uniquely identifies a chemical element. It is identical to the charge number of the nucleus. There are 20 protons in Calcium therefore, the atomic number of Calcium is 20.\nYou: What is 28 + 13?\nKoboldGPT: 28 + 13 is equal to 41.\nYou: Tell me a joke.\nKoboldGPT: Why did the chicken cross the road? To get to the other side!\nYou: How is an earthquake measured?\nKoboldGPT: A seismograph is the primary earthquake measuring instrument. The seismograph produces a digital graphic recording of the ground motion caused by the seismic waves. The digital recording is called seismogram. A network of worldwide seismographs detects and measures the strength and duration of the earthquake's waves. The magnitude of an earthquake and the intensity of shaking is usually reported on the Richter scale.\n\nKoboldGPT: Hello, I am KoboldGPT, your personal AI assistant. What would you like to know?\nYou: what's 3*5?\nKoboldGPT: 3 \u00d7 5 = 15\nYou: capital of russia?\nKoboldGPT: Moscow\nYou: 4*6\nKoboldGPT:", "quiet": true}

Processing Prompt (584 / 589 tokens)ggml_new_tensor_impl: not enough space in the context's memory pool (needed 269340800, available 268435456)

Metadata


    Labels

    bug (Something isn't working), help wanted (Extra attention is needed)
