INDEX
Explanations
negative evaluations or descriptions of situations and experiences
Negative Logits
ssue        -0.16
789         -0.15
ISMATCH     -0.15
ider        -0.15
odd         -0.15
etty        -0.14
antar       -0.14
ovně        -0.14
å¬          -0.13
tered       -0.13
Positive Logits
-case        0.44
case         0.29
Case         0.27
case         0.25
_case        0.25
Case         0.23
luck         0.22
offenders    0.20
offender     0.20
worse        0.20
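In sparse-autoencoder dashboards of this kind, logit lists like the two above are typically produced by projecting the feature's decoder vector through the model's unembedding matrix and sorting the per-token scores. The sketch below assumes that setup. The names `logit_effects`, `top_and_bottom`, `w_dec`, and `w_u`, and the toy numbers, are hypothetical. The pipeline that actually generated this dashboard is not documented here.

```python
from typing import List, Sequence, Tuple


def logit_effects(w_dec: Sequence[float], w_u: Sequence[Sequence[float]]) -> List[float]:
    """Score every vocabulary token by projecting a feature's decoder
    direction (length d_model) through an unembedding matrix laid out
    as d_model rows by vocab_size columns (assumed layout)."""
    d_model = len(w_dec)
    vocab_size = len(w_u[0])
    return [
        sum(w_dec[d] * w_u[d][v] for d in range(d_model))
        for v in range(vocab_size)
    ]


def top_and_bottom(
    effects: Sequence[float], tokens: Sequence[str], k: int
) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return the k most negative and the k most positive (token, score) pairs."""
    ranked = sorted(zip(tokens, effects), key=lambda pair: pair[1])
    bottom = ranked[:k]          # ascending order: most negative first
    top = ranked[::-1][:k]       # descending order: most positive first
    return bottom, top


if __name__ == "__main__":
    # Hypothetical toy model: d_model = 2, vocabulary of 3 tokens.
    effects = logit_effects([1.0, -0.5], [[0.2, -0.1, 0.4], [0.0, 0.6, -0.2]])
    bottom, top = top_and_bottom(effects, ["a", "b", "c"], k=1)
    print(bottom[0][0], top[0][0])  # prints "b c"
```

Only the sign and ordering of the scores matter for building the two lists. Real dashboards do the same projection as a single matrix-vector product over the full vocabulary.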
Activations Density 0.025%
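A density of 0.025% means the feature fires on roughly 1 token in 4,000 (0.025 / 100 = 1 / 4,000). This sketch assumes density is measured as the percentage of sampled tokens with a strictly positive activation. The dashboard's exact threshold and sample size are not stated. The function `activation_density` and its `threshold` parameter are hypothetical names.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Percentage of sampled tokens whose feature activation exceeds `threshold`
    (assumed definition; a threshold of 0.0 counts any positive activation)."""
    if len(activations) == 0:
        raise ValueError("need at least one activation")
    firing = sum(1 for a in activations if a > threshold)
    return 100.0 * firing / len(activations)


if __name__ == "__main__":
    # Hypothetical sample: one firing token out of 4,000.
    sample = [0.0] * 3999 + [1.3]
    print(activation_density(sample))  # prints 0.025
```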