INDEX
Explanations
verbs related to deception or misleading actions
Negative Logits
rises      -0.79
foreseen   -0.67
capacity   -0.66
ynski      -0.65
ateur      -0.65
airo       -0.64
hens       -0.64
area       -0.64
joining    -0.63
riot       -0.62
Positive Logits
perpetrated    0.93
deceive        0.93
ulent          0.91
ulence         0.86
ABOUT          0.85
esty           0.83
uten           0.82
omission       0.82
misrepresent   0.82
falsely        0.81
Activations Density: 0.111%