INDEX
Explanations
phrases indicating risk and implications of actions on individuals or groups
Negative Logits
refuses      -0.63
uttered      -0.62
ulously      -0.61
nce          -0.61
motions      -0.61
leys         -0.60
forbids      -0.60
naires       -0.59
gans         -0.58
lasts        -0.58
Positive Logits
jeopardy       1.00
pmwiki         0.84
peril          0.83
ゴン           0.78
ワン           0.76
advant         0.73
unwelcome      0.72
uncomfortable  0.71
scape          0.70
ェ             0.68
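Token strings from byte-level BPE vocabularies are sometimes displayed in their raw byte-mapped form, for example `ãĤ´ãĥ³` where the underlying text is ゴン. The sketch below shows one way to map such strings back to readable text. It assumes the vocabulary uses the GPT-2 style byte-to-character table, which this page does not state; `bytes_to_unicode` and `decode_token` are illustrative helper names, not part of any dashboard API.

```python
def bytes_to_unicode():
    """Build the GPT-2 style byte -> printable character table (assumed here)."""
    # Bytes that are already printable map to themselves.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAD))
        + list(range(0xAE, 0x100))
    )
    cs = bs[:]
    n = 0
    # Every remaining byte is shifted to a code point at 256 and above.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


# Inverse table: displayed character -> original byte value.
UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(token: str) -> str:
    """Turn a byte-mapped token string back into UTF-8 text."""
    raw = bytes(UNICODE_TO_BYTE[ch] for ch in token)
    # A single token may hold only part of a multi-byte character,
    # so undecodable bytes are replaced rather than raising.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    for tok in ["ãĤ´ãĥ³", "ãĥ¯ãĥ³", "ãĤ§", "peril"]:
        print(repr(tok), "->", decode_token(tok))
```

Plain ASCII tokens such as `peril` pass through unchanged, so the helper can be applied to a whole logit list without special-casing.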
Activations Density 0.064%