INDEX
Explanations
indications of concern or discussions about safety-related issues
followed by personal pronouns
expressing uncertainty or preference
Negative Logits
poichè     -0.82
ainfi      -0.80
!)         -0.75
깥         -0.75
feroit     -0.73
serupa     -0.73
آنان       -0.72
således    -0.72
已是       -0.70
几人       -0.70
Positive Logits
somebody    1.20
everybody   1.12
really      1.11
maybe       1.05
somebody    1.01
anybody     1.00
sort        0.99
[           0.97
basically   0.97
kind        0.95
Activations Density 0.320%