Explanations
No Explanations Found
Negative Logits
roma        -0.77
oba         -0.66
obin        -0.66
Oversight   -0.65
osate       -0.64
atism       -0.64
ilege       -0.64
onomy       -0.64
uly         -0.63
Powder      -0.63
Positive Logits
hower           0.81
Username        0.70
unsuccessfully  0.69
enegger         0.64
wikipedia       0.63
Downloadha      0.61
sten            0.61
nurse           0.60
glac            0.59
goodbye         0.59
Activations Density: 0.000%
No Known Activations
This feature has no known activations.