INDEX
Explanations
No Explanations Found
Negative Logits
counters     -0.69
rapists      -0.66
lesbians     -0.64
ought        -0.63
incarcer     -0.61
Sexual       -0.60
sixty        -0.59
asketball    -0.58
icts         -0.58
accompanied  -0.58
Positive Logits
oglu    0.94
ariat   0.78
uve     0.77
pora    0.73
outube  0.72
aer     0.71
oros    0.69
oire    0.68
ĵĺ      0.67
Verge   0.66
Activations Density 0.000%
No Known Activations
This feature has no known activations.