INDEX
Explanations
boundaries
The neuron detects terms referring to personal consent and limits, such as “boundaries,” “preferences,” and “autonomy.”
Negative Logits
Iso        -0.07
console    -0.07
_MALLOC    -0.06
luluk      -0.06
tours      -0.06
fast       -0.06
($"{       -0.06
science    -0.06
convin     -0.06
heating    -0.06
Positive Logits
寸          0.06
імеч       0.06
орд        0.06
boundaries 0.06
постро     0.06
Accept     0.06
=torch     0.06
Bought     0.06
방법        0.06
upal       0.06
Activations Density 0.007%