INDEX
Explanations
phrases indicating causal relationships or conditions
Negative Logits
ãĢħ         -0.17
aign        -0.16
emies       -0.15
ût          -0.15
''"         -0.14
ulis        -0.14
(blank)     -0.14
enance      -0.14
ags         -0.14
declspec    -0.14
Positive Logits
more            0.20
attention       0.20
temperatures    0.20
they            0.19
awareness       0.19
we              0.18
pressure        0.18
fears           0.17
things          0.17
society         0.17
Activations Density 0.086%