INDEX
Explanations
questions containing the word "if"
conditional phrases and questions
Negative Logits
abal          -0.72
_-            -0.68
lines         -0.67
etheless      -0.64
ulic          -0.64
alde          -0.62
advertising   -0.62
line          -0.62
tc            -0.61
iege          -0.61
Positive Logits
suspic        0.86
Daddy         0.74
amera         0.70
ihad          0.68
mosqu         0.67
hairc         0.67
passwords     0.67
explan        0.66
millenn       0.65
indecent      0.64
Activations Density 0.085%