INDEX
Explanations
personal reflections or opinions expressed through language
personal reflections on identity and beliefs
Negative Logits
mire      -0.69
usky      -0.66
ortium    -0.65
anyahu    -0.63
herry     -0.63
flix      -0.63
demon     -0.63
elsen     -0.63
lehem     -0.61
Uncommon  -0.61
Positive Logits
perce         0.93
decisions     0.92
choices       0.90
interactions  0.84
conduct       0.83
dealings      0.79
behavior      0.78
interact      0.76
environments  0.75
behaviors     0.74
Activations Density 0.977%