INDEX
Explanations
No Explanations Found
Negative Logits
Token     Logit
Spoiler   -0.71
Survive   -0.65
eric      -0.65
SUR       -0.64
Profit    -0.63
RAG       -0.62
Answer    -0.61
rage      -0.59
Aware     -0.59
Copy      -0.59
Positive Logits
Token     Logit
bilt      0.90
cham      0.81
Palest    0.79
tob       0.74
suspic    0.70
cyl       0.68
assian    0.68
Mutant    0.67
corrid    0.67
mosqu     0.67
Activations Density 0.000%
No Known Activations
This feature has no known activations.