INDEX
Explanations
mentions of personal experiences or inner struggles
Negative Logits
protected    -0.74
DRAGON       -0.68
circ         -0.65
manship      -0.63
retirees     -0.63
populated    -0.60
disabled     -0.60
guided       -0.59
Herm         -0.58
couch        -0.57
Positive Logits
't           1.47
ÃŃ           1.05
ned          0.92
nt           0.91
itive        0.91
iting        0.91
etsk         0.90
NT           0.89
kered        0.89
ge           0.86
Activations Density: 0.064%