INDEX
Explanations
mentions of private or privileged matters
Negative Logits
rers     -0.74
grass    -0.73
Ducks    -0.70
LER      -0.67
Ryder    -0.66
Shake    -0.65
ORN      -0.65
WAYS     -0.64
calling  -0.63
balls    -0.62
Positive Logits
ilege    1.56
ileged   1.48
ately    1.19
acies    1.17
atis     1.01
aband    1.00
ession   0.99
acy      0.97
atism    0.96
ility    0.94
Activations Density: 0.019%