INDEX
Explanations
phrases indicating various forms of action or requests
Negative Logits
ffa           -0.16
ĽĪ            -0.15
TestCategory  -0.15
agogue        -0.15
شت            -0.14
usercontent   -0.14
pires         -0.14
idth          -0.14
jedn          -0.14
:///          -0.14
Positive Logits
cue      0.30
beating  0.28
liking   0.28
cues     0.28
step     0.24
stance   0.24
shine    0.24
toll     0.23
look     0.23
risks    0.23
Activations Density: 0.055%