INDEX
Explanations
loyalty to people and groups
Negative Logits
n     0.59
on    0.53
o     0.51
g     0.50
h     0.49
ah    0.46
f     0.46
ler   0.45
d     0.44
í     0.43
Positive Logits
loyalty        1.23
Loy            1.08
loyal          1.07
Loyalty        1.06
loy            0.95
忠             0.92
allegiance     0.91
Loyal          0.89
faithfulness   0.79
loy            0.76
Activations Density 0.046%