Explanations
phrases that express specific quantities or degrees of actions or feelings
Negative Logits
icker         -0.15
fad           -0.15
TestCategory  -0.14
ĽĪ            -0.14
:///          -0.14
ÑĤаÑħ        -0.14
agal          -0.14
yon           -0.14
師            -0.14
agogue        -0.13
Positive Logits
liking        0.27
cue           0.26
cues          0.25
beating       0.25
stance        0.24
toll          0.23
risks         0.23
step          0.23
look          0.22
baths         0.22
Activations Density: 0.058%