INDEX
Explanations
direct references to cause-and-effect relationships or consequences
Negative Logits
Token        Logit
gerald       -0.77
anners       -0.75
glers        -0.71
Frie         -0.65
cautiously   -0.64
mell         -0.63
abies        -0.62
Chocobo      -0.62
Sleep        -0.62
safely       -0.60
Positive Logits
Token          Logit
forward        0.84
ebted          0.81
contradicted   0.80
contradicts    0.73
sunlight       0.71
observable     0.71
orship         0.70
oire           0.69
achable        0.68
aneous         0.68
Activations Density: 0.534%