INDEX
Explanations
affirmations and expressions of agreement
Negative Logits
ès       -0.16
ogene    -0.15
acker    -0.15
Gro      -0.15
aggio    -0.15
Ober     -0.15
Gro      -0.14
ÙĨا      -0.14
Mog      -0.14
amas     -0.14
Positive Logits
boro     0.16
fully    0.15
ersh     0.15
ffa      0.14
idian    0.14
ÙĪÙĨÙĬØ©  0.14
orth     0.14
ARGET    0.14
¦¬       0.14
edii     0.14
Activations Density 0.036%