Explanations
AI assistant refusing requests
Negative Logits
abhavam        0.40
enchymal       0.39
ské            0.39
Neurons        0.39
льт            0.38
jenis          0.38
Breakpoint     0.38
𝘭              0.38
toggleClass    0.38
육             0.38
Positive Logits
assistant      0.50
avoid          0.49
Avoiding       0.46
assistants     0.45
Avoid          0.42
避免           0.42
assist         0.41
gaf            0.41
lounge         0.41
avoids         0.41
Activations Density 0.006%