INDEX
Explanations
names related to the topic at hand
references to personal identity or self-referential phrases
Negative Logits
rama        -0.79
hips        -0.78
olor        -0.74
rieved      -0.74
iosity      -0.74
roads       -0.71
iary        -0.70
aughters    -0.70
inus        -0.69
ulation     -0.68
Positive Logits
asure       1.15
anwhile     1.11
lda         1.04
zzo         0.99
leon        0.98
ister       0.91
asuring     0.86
eting       0.85
asured      0.84
gging       0.83
Activations Density 0.017%
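
The density figure above is usually read as the share of token positions on which the feature fires at all. The page does not say how it was computed, so the following is only a minimal sketch under that assumption. The function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and do not come from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of positions whose activation exceeds `threshold`.

    Assumption: "density" means the fraction of token positions with a
    nonzero (above-threshold) activation. The source page does not confirm this.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical data: 17 active positions out of 100,000 tokens.
sample = [0.0] * 99_983 + [1.5] * 17
print(f"{activation_density(sample):.3f}%")
```

Under this reading, a density of 0.017% would correspond to roughly 17 active positions per 100,000 tokens.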