INDEX
Explanations
No Explanations Found
Negative Logits
undo     -0.76
ouf      -0.71
very     -0.68
enty     -0.66
any      -0.66
elong    -0.65
fair     -0.63
lore     -0.63
oos      -0.62
Champ    -0.61
Positive Logits
regard      1.11
regards     1.10
stood       0.88
standing    0.87
impunity    0.84
caveats     0.82
ategory     0.73
intent      0.72
roomm       0.71
respect     0.68
Activations Density 0.000%
No Known Activations
This feature has no known activations.