INDEX
Explanations
first-person statements expressing personal thoughts or actions
expressions of personal reflection and subjective opinions
Negative Logits
conformity   -0.58
harms        -0.55
MpServer     -0.53
delinqu      -0.51
forms        -0.51
deeds        -0.50
subsistence  -0.49
Klux         -0.49
vitality     -0.49
Samar        -0.48
Positive Logits
recommend   0.65
curious     0.62
delve       0.61
appreciate  0.60
uno         0.59
imagine     0.59
fond        0.59
admittedly  0.57
wondered    0.57
wondering   0.56
Activations Density 0.840%