Explanations
adjectives describing negative physical conditions or outcomes
negative descriptions of conditions or states
Negative Logits
ership      -0.77
inarily     -0.75
alist       -0.73
uality      -0.73
cript       -0.73
agy         -0.73
htaking     -0.71
itionally   -0.71
iferation   -0.71
iture       -0.70
Positive Logits
behaved     0.98
beaten      0.85
damaged     0.79
enough      0.79
mistaken    0.78
asses       0.78
suited      0.77
poisoned    0.76
bitten      0.76
needed      0.75
Activations Density: 0.021%