INDEX
Explanations
phrases indicating options or choices
Negative Logits
vla        -0.15
anche      -0.15
sWith      -0.15
adin       -0.15
arel       -0.15
ÙĦÙĤ       -0.14
nackte     -0.14
erna       -0.14
achuset    -0.14
sci        -0.14
Positive Logits
wel        0.20
anged      0.20
theless    0.19
zeit       0.18
phans      0.18
-than      0.17
-sex       0.17
anges      0.15
许         0.15
456        0.14
Activations Density 0.023%