INDEX
Explanations
mentions or references to unspecified individuals
references to an undefined or generic 'someone'
Negative Logits
Token      Logit
ories      -0.79
osterone   -0.78
heny       -0.74
inders     -0.68
interest   -0.68
enegger    -0.66
èª         -0.66
ory        -0.65
irth       -0.65
ortex      -0.63
Positive Logits
Token      Logit
else       1.60
Else       1.28
Else       1.00
else       0.95
WithNo     0.89
ĪĴ         0.75
who        0.74
smugg      0.73
rency      0.71
clicked    0.70
Activations Density 0.036%