Explanations
words and phrases that signify evaluations or comparisons
Negative Logits
Token     Logit
ensch     -0.16
Shaft     -0.15
abeth     -0.15
qa        -0.15
Mell      -0.15
nect      -0.14
erno      -0.14
posite    -0.14
ist       -0.14
I         -0.14
Positive Logits
Token         Logit
_Lean         0.17
ewis          0.16
LEAN          0.15
ActionTypes   0.15
borg          0.15
chwitz        0.14
اخر           0.14
ÑĦÑĤ          0.14
aille         0.14
поба          0.14
Activations Density 0.002%