INDEX
Explanations
negative judgment or disapproval
Negative Logits
/  0.45
大型  0.43
lifecycle  0.40
üksek  0.40
ከፍተኛ  0.39
↵↵  0.38
including  0.38
进行  0.38
OD  0.38
काफी  0.38
Positive Logits
forbids  0.47
denunci  0.46
отрица  0.45
despised  0.44
diminution  0.43
amenaza  0.43
scorn  0.43
sufrimiento  0.43
あなた  0.43
ненави  0.42
Activations Density: 0.054%