INDEX
Explanations
words or symbols related to strong emotional expressions or reactions
Negative Logits
Token       Logit
xual        -0.74
ngth        -0.68
jri         -0.68
wark        -0.65
wagen       -0.62
laund       -0.62
WARD        -0.61
Sapphire    -0.61
Butterfly   -0.59
Seym        -0.59
Positive Logits
Token       Logit
ļ           1.29
Ĺ           1.24
ij          1.24
ŀ           1.23
Ģ           1.20
«           1.15
ĥ           1.15
Ī           1.14
Ĩ           1.12
ĺ           1.11
Activations Density: 0.002%