INDEX
Explanations
the specific phrase or phrases mentioned in the activation
repeated mentions of phrases and their variations
Negative Logits
Token         Logit
ÄŁ            -0.81
DERR          -0.77
Emirates      -0.72
llah          -0.70
Thro          -0.69
hemor         -0.68
fman          -0.67
Fal           -0.66
Indies        -0.65
Brotherhood   -0.62
Positive Logits
Token         Logit
phrase        1.06
ology         1.03
phrases       1.01
phrase        0.91
witz          0.89
terday        0.84
stress        0.82
mith          0.81
atre          0.78
uttered       0.78
Activations Density: 0.021%