Explanations
instances of words that may indicate emotional states or experiences
Negative Logits
orum              -0.18
inda              -0.16
heimer            -0.14
pop               -0.14
(missing token)   -0.14
yme               -0.14
303               -0.14
iba               -0.14
backward          -0.14
litter            -0.13
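Logit lists like this one, and the positive list that follows, are commonly produced by multiplying a feature's decoder direction by the model's unembedding matrix and sorting the result. The dashboard does not say exactly how its values were computed, so the following is a minimal sketch under that assumption (it ignores the final layer norm). The helper "feature_logit_effects" and the toy matrices are hypothetical.

```python
import numpy as np

def feature_logit_effects(w_dec, W_U, vocab, k=10):
    """Return the k most negative and k most positive (token, logit) pairs.

    Hypothetical helper. Assumed shapes:
      w_dec: the feature's decoder direction, shape (d_model,)
      W_U:   the model's unembedding matrix, shape (d_model, vocab_size)
      vocab: list of token strings indexed by token id
    """
    # Direct effect of the feature on every vocabulary logit
    # (assumption: the final layer norm is ignored).
    logits = w_dec @ W_U
    order = np.argsort(logits)  # ascending, so most negative first
    negative = [(vocab[i], float(logits[i])) for i in order[:k]]
    positive = [(vocab[i], float(logits[i])) for i in order[::-1][:k]]
    return negative, positive

# Toy example: a 2-dimensional model with a 4-token vocabulary.
W_U = np.array([[1.0, 0.0, -1.0, 0.5],
                [0.0, 1.0, 0.0, -1.0]])
w_dec = np.array([1.0, 0.5])
negative, positive = feature_logit_effects(w_dec, W_U, ["a", "b", "c", "d"], k=2)
print(negative)  # [('c', -1.0), ('d', 0.0)]
print(positive)  # [('a', 1.0), ('b', 0.5)]
```

Because these values reflect only the feature's direct path to the output, they can look unrelated to the feature's explanation, as the subword fragments in these lists do.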
Positive Logits
ĵåIJį              0.18
meli              0.17
achsen            0.16
ovÃŃ              0.15
ekte              0.15
ander             0.15
asset             0.15
omik              0.14
etrics            0.14
_PA               0.14
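Tokens such as "ovÃŃ" and "ĵåIJį" are probably not encoding damage. Their character shapes suggest GPT-2-style byte-level BPE, where each displayed character stands for one raw byte. Under that mapping "ovÃŃ" is the UTF-8 encoding of "oví". "ĵåIJį" begins with a stray UTF-8 continuation byte, so on its own it decodes only partially. The byte table in the following minimal sketch mirrors GPT-2's "bytes_to_unicode" function. The helper "decode_token" is hypothetical, and applying it to this model assumes the model really uses that tokenizer.

```python
def bytes_to_unicode():
    """GPT-2 byte-level BPE table: maps every byte 0-255 to a printable character."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Non-printable bytes are shifted to code points starting at U+0100.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

# Reverse table: displayed character -> raw byte.
BYTE_DECODER = {c: b for b, c in bytes_to_unicode().items()}

def decode_token(token):
    """Hypothetical helper: turn a displayed token back into text.

    Bytes that are not valid UTF-8 on their own (e.g. a leading
    continuation byte) are replaced with U+FFFD.
    """
    raw = bytes(BYTE_DECODER[c] for c in token)
    return raw.decode("utf-8", errors="replace")

print(decode_token("ovÃŃ"))  # oví
```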
Activation Density: 0.002%
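A density of 0.002% means the feature fires on roughly 2 of every 100,000 tokens (0.002 / 100 = 2e-5), which makes it very sparse. The dashboard does not state its token sample or firing threshold. The following sketch therefore assumes that density is the fraction of tokens with an activation above zero. The helper "activation_density" and the synthetic activations are hypothetical.

```python
import numpy as np

def activation_density(acts, threshold=0.0):
    """Fraction of tokens on which the feature's activation exceeds threshold."""
    acts = np.asarray(acts)
    return float((acts > threshold).mean())

# Synthetic example: 100,000 tokens, and the feature fires on 2 of them.
acts = np.zeros(100_000)
acts[[10, 500]] = 1.3
density = activation_density(acts)
print(f"{density:.3%}")  # 0.002%
```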