INDEX
Explanations
mentions of a specific name related to religious or community figures
Negative Logits
ragon       -0.72
lessly      -0.71
LAND        -0.68
cam         -0.66
lund        -0.65
DERR        -0.64
LSD         -0.64
REDACTED    -0.63
lings       -0.63
detail      -0.63
Positive Logits
plain       1.41
plin        1.37
isson       1.12
otic        0.97
umann       0.97
ften        0.95
isel        0.94
ussian      0.92
ise         0.90
isen        0.89
Activations Density: 0.012%