Explanations
words related to negativity or disgrace
words and phrases indicating moral judgment or condemnation
Negative Logits
ħĭ          -0.74
Leilan      -0.69
king        -0.65
izoph       -0.63
stood       -0.62
Immunity    -0.60
plane       -0.60
wan         -0.59
draw        -0.58
llan        -0.58
Positive Logits
ations      1.21
omin        1.21
atory       1.05
ational     0.99
ious        0.96
ifer        0.96
ator        0.96
itives      0.93
omial       0.93
atus        0.93
Activations Density 0.019%