Explanations
instances of the word "Nar" or "nar" at different activations
mentions of the character Naruto
Negative Logits
Pixie       -0.84
Canary      -0.77
Democr      -0.72
Extrem      -0.71
xual        -0.69
ORED        -0.68
eous        -0.67
Giuliani    -0.66
VIDEOS      -0.65
UCT         -0.65
Positive Logits
ration      1.07
rils        0.97
ayan        0.94
uno         0.90
ril         0.90
opa         0.89
vik         0.89
stal        0.89
aku         0.88
vati        0.86
Activations Density: 0.026%