Explanations
large language model created by
The neuron activates strongly when the model identifies itself as “a large language model,” for example in phrases that begin “As a large language model…”
Negative Logits
capables  0.48
означа  0.41
offrant  0.40
capaces  0.40
схема  0.39
tribulations  0.39
система  0.38
வரவே  0.38
없고  0.38
hereby  0.38
Positive Logits
goes  0.42
Rainbow  0.42
科  0.42
رفت  0.41
itself  0.40
我是  0.40
моего  0.40
অস্বাভাবিক  0.39
Southwest  0.39
cleaned  0.38
Activations Density 0.017%
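
For readers unfamiliar with the density figure, here is a minimal sketch of one common way to compute it, assuming that density means the fraction of sampled token positions on which the neuron's activation is above zero. The function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and are not taken from this page or its tooling.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" is fired positions divided by all sampled positions.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical sample: 100,000 token positions, 17 of which fire.
sample = [0.0] * 99_983 + [1.0] * 17
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.017%
```

Under this reading, a density of 0.017% means the neuron fires on roughly 17 of every 100,000 sampled tokens, which fits a narrow trigger such as a self-identification phrase.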