INDEX
Explanations
personal reflections and religious expressions
Negative Logits
Marino         -0.82
anyahu         -0.65
loophole       -0.63
caveat         -0.63
typo           -0.62
Didn           -0.62
Moreno         -0.61
Siren          -0.58
Shark          -0.57
herry          -0.57
Positive Logits
interact        1.04
perce           0.89
interacts       0.88
perceive        0.87
interacting     0.86
communicate     0.85
environments    0.74
interactions    0.74
interpersonal   0.72
interacted      0.72
Activations Density 0.620%