INDEX
Explanations
phrases indicating reasoning or cause-and-effect relationships
conjunctions and phrases that introduce conditions or reasons
Negative Logits
egu         -0.50
erenn       -0.50
abre        -0.49
ワン        -0.48
Further     -0.47
Basic       -0.47
Eye         -0.45
ゴン        -0.44
Versions    -0.44
uty         -0.44
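Byte-level BPE vocabularies in the GPT-2 family store non-ASCII tokens in a printable display alphabet, which is why the Japanese tokens ワン and ゴン show up in raw vocabulary dumps as `ãĥ¯ãĥ³` and `ãĤ´ãĥ³`. The sketch below rebuilds that byte-to-character table and inverts it to recover the readable text. It assumes this feature's model uses a GPT-2-style byte-level tokenizer, which the page itself does not state; the names `bytes_to_unicode` and `decode_display_token` are illustrative helpers, not part of any dashboard API.

```python
def bytes_to_unicode():
    """Rebuild the GPT-2-style map from byte values to printable characters."""
    # Bytes that are already printable keep their own code point.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    cs = bs[:]
    n = 0
    # Every remaining byte is shifted to an unused code point at 256 and above.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}


# Inverse table: display character -> original byte value.
UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_display_token(token: str) -> str:
    """Turn a byte-level display string back into UTF-8 text."""
    raw = bytes(UNICODE_TO_BYTE[ch] for ch in token)
    # A single token may hold only part of a multi-byte character,
    # so undecodable bytes are replaced instead of raising.
    return raw.decode("utf-8", errors="replace")


print(decode_display_token("ãĥ¯ãĥ³"))  # ワン
print(decode_display_token("ãĤ´ãĥ³"))  # ゴン
```

Plain ASCII tokens such as `egu` or `Further` pass through unchanged, since those bytes map to themselves in the table.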
Positive Logits
she         0.84
he          0.81
you         0.75
THEY        0.75
they        0.74
we          0.73
i           0.72
I           0.72
SHE         0.72
thou        0.70
Activations Density 0.663%