INDEX
Explanations
references to contrasting or supplementary elements or ideas
Negative Logits
οÏĤ      -0.16
amba    -0.15
ling    -0.14
led     -0.14
vet     -0.14
ãģ£ãģı  -0.14
eya     -0.13
illes   -0.13
ottes   -0.13
les     -0.13
Positive Logits
rek     0.18
hand    0.17
two     0.17
ws      0.17
idge    0.17
world   0.16
iero    0.15
ero     0.15
est     0.15
woord   0.15
Activations Density 0.039%