INDEX
Explanations
expressions related to explaining or clarifying something
phrases indicating intentions or propositions
Negative Logits
ADRA        -0.68
watershed   -0.64
Lieberman   -0.63
vigilance   -0.62
meet        -0.60
novelty     -0.60
rule        -0.59
Dub         -0.59
TIM         -0.59
extrad      -0.58
Positive Logits
orah        0.96
aucus       0.71
ften        0.70
hower       0.70
eous        0.70
hesive      0.69
soType      0.69
oice        0.68
ptoms       0.67
oke         0.67
Activations Density: 0.082%