INDEX
Explanations
mentions of specific names or terms related to individuals
Negative Logits
ATRIX     -0.18
aris      -0.17
iyan      -0.17
sworth    -0.16
rega      -0.15
arf       -0.15
lez       -0.15
iou       -0.15
quo       -0.15
esity     -0.15
Positive Logits
iforn     0.23
ifornia   0.23
pan       0.20
isp       0.19
ining     0.19
ervo      0.19
orama     0.18
aign      0.18
indi      0.17
bf        0.17
Activations Density: 0.006%