Explanations
questions related to experiences and emotions
questions directed at individuals about their feelings, experiences, or opinions
Negative Logits
seless      -0.78
acity       -0.73
$$$$        -0.73
ospace      -0.67
Worse       -0.67
cession     -0.66
Stupid      -0.66
.")         -0.66
Godd        -0.65
udicrous    -0.62
Positive Logits
yourselves   0.99
yourself     0.99
?」          0.93
)?           0.87
your         0.76
experien     0.76
autobi       0.73
?:           0.73
.?           0.72
?            0.71
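The logit lists above can be illustrated with a minimal sketch. It assumes, as is common in sparse-autoencoder dashboards but not stated on this page, that each token's score is the feature's decoder direction multiplied by the model's unembedding matrix, and that the lists show the k lowest and k highest scores. The function names, the toy vocabulary, and the matrices are hypothetical.

```python
from typing import List, Sequence, Tuple

def logit_effects(decoder_dir: Sequence[float],
                  unembed: Sequence[Sequence[float]]) -> List[float]:
    """Score every vocabulary token by projecting one feature's decoder
    direction (length d_model) through an unembedding matrix laid out as
    d_model rows by vocab_size columns. (Assumed scoring rule.)"""
    vocab_size = len(unembed[0])
    return [
        sum(decoder_dir[i] * unembed[i][t] for i in range(len(decoder_dir)))
        for t in range(vocab_size)
    ]

def top_and_bottom(scores: Sequence[float], vocab: Sequence[str],
                   k: int) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return the k most negative and the k most positive (token, score) pairs."""
    ascending = sorted(range(len(scores)), key=lambda t: scores[t])
    negative = [(vocab[t], scores[t]) for t in ascending[:k]]
    descending = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    positive = [(vocab[t], scores[t]) for t in descending[:k]]
    return negative, positive

# Toy example: hypothetical 2-dimensional model with a 4-token vocabulary.
vocab = ["a", "b", "c", "d"]
unembed = [[0.8, 0.9, -0.6, -0.5],
           [0.1, 0.2, 0.3, 0.4]]
negative, positive = top_and_bottom(logit_effects([1.0, 0.0], unembed), vocab, k=2)
print(negative)  # [('c', -0.6), ('d', -0.5)]
print(positive)  # [('b', 0.9), ('a', 0.8)]
```

Sorting token indices rather than the scores themselves keeps each score paired with its token string, which is what a two-column logit list needs.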
Activation density: 0.261%
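A density of 0.261% means the feature is active on roughly 2.6 of every 1,000 tokens. Below is a minimal sketch of how such a figure could be computed. It assumes that density is the percentage of tokens, over some evaluation corpus, whose activation is above zero; the threshold and the helper name `activation_density` are assumptions and are not documented on this page.

```python
from typing import Sequence

def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Percentage of tokens whose feature activation exceeds `threshold`
    (assumed definition of activation density)."""
    if not activations:
        raise ValueError("need at least one token activation")
    fired = sum(1 for a in activations if a > threshold)
    return 100.0 * fired / len(activations)

# Toy example: the feature fires on 3 of 1,000 tokens.
acts = [0.0] * 997 + [1.2, 0.5, 3.0]
print(activation_density(acts))  # 0.3
```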