Explanations
phrases indicating falsehood or deception
oai_token-act-pair · gpt-4o-mini
Triggered by @bot
No Scores
Configuration
Juliushanhanhan/llama-3-8b-it-res/blocks.25.hook_resid_post
Features: 65,536
Data type: float32
Hook name: blocks.25.hook_resid_post
Hook layer: 25
Architecture: gated
Context size: 1,024
Dataset: Juliushanhanhan/openwebtext-1b-llama3-tokenized-cxt-1024
Activation function: relu
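The configuration lists a gated architecture with a ReLU activation. The sketch below is a minimal NumPy version of the gated SAE forward pass. It assumes the published Gated SAE formulation, in which a binary gate decides which features fire, a ReLU path sets their magnitudes, and the magnitude weights reuse the gate weights rescaled per feature by exp(r_mag). The parameter names (`W_gate`, `r_mag`, `b_mag`, and so on) are illustrative and are not read from this SAE's checkpoint. The dimensions are toy sizes.

```python
import numpy as np

def gated_sae_encode(x, W_gate, b_gate, r_mag, b_mag, b_dec):
    """Gated SAE encoder (assumed formulation): gate picks WHICH features fire,
    a ReLU magnitude path picks HOW STRONGLY they fire."""
    centered = x - b_dec                       # subtract decoder bias first
    gate_pre = centered @ W_gate + b_gate      # (batch, d_sae)
    # Magnitude weights share W_gate, rescaled per feature (column) by exp(r_mag)
    mag_pre = centered @ (W_gate * np.exp(r_mag)) + b_mag
    gate = (gate_pre > 0).astype(x.dtype)      # Heaviside step
    return gate * np.maximum(mag_pre, 0.0)     # ReLU magnitudes, zeroed where gate is off

def gated_sae_decode(f, W_dec, b_dec):
    """Linear decoder back to the residual stream."""
    return f @ W_dec + b_dec

# Toy dimensions; the SAE described above has 65,536 features.
rng = np.random.default_rng(0)
d_model, d_sae = 8, 32
x = rng.normal(size=(3, d_model)).astype(np.float32)
W_gate = rng.normal(size=(d_model, d_sae)).astype(np.float32)
b_gate = np.zeros(d_sae, dtype=np.float32)
r_mag = np.zeros(d_sae, dtype=np.float32)
b_mag = np.zeros(d_sae, dtype=np.float32)
b_dec = np.zeros(d_model, dtype=np.float32)
W_dec = rng.normal(size=(d_sae, d_model)).astype(np.float32)

f = gated_sae_encode(x, W_gate, b_gate, r_mag, b_mag, b_dec)
x_hat = gated_sae_decode(f, W_dec, b_dec)
print(f.shape, x_hat.shape)  # (3, 32) (3, 8)
```

With `r_mag`, `b_mag`, and `b_gate` all zero, the gate and magnitude paths coincide and the encoder reduces to a plain ReLU of `x @ W_gate`. This makes a convenient sanity check.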
Embeds
IFrame
<iframe src="https://www.neuronpedia.org/llama3-8b-it/25-res-jh/23610?embed=true&embedexplanation=true&embedplots=true&embedtest=true" title="Neuronpedia" style="height: 300px; width: 540px;"></iframe>
Link
https://www.neuronpedia.org/llama3-8b-it/25-res-jh/23610?embed=true&embedexplanation=true&embedplots=true&embedtest=true
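The embed link is a feature path (`model/sae/index`) followed by boolean query flags. Below is a small sketch of assembling such a URL. The flag names are copied from the link itself. The helper `neuronpedia_embed_url` is hypothetical, and the choice to omit a flag entirely when its option is turned off is an assumption, not documented Neuronpedia behavior.

```python
from urllib.parse import urlencode

def neuronpedia_embed_url(model, sae_id, index, explanation=True, plots=True, test=True):
    """Hypothetical helper: build a feature embed URL from the query flags
    that appear in the Neuronpedia link (disabled options are simply omitted)."""
    params = {"embed": "true"}
    if explanation:
        params["embedexplanation"] = "true"
    if plots:
        params["embedplots"] = "true"
    if test:
        params["embedtest"] = "true"
    return f"https://www.neuronpedia.org/{model}/{sae_id}/{index}?{urlencode(params)}"

print(neuronpedia_embed_url("llama3-8b-it", "25-res-jh", 23610))
```

With all options enabled, this reproduces the link for feature 23610 exactly.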
LLAMA3-8B-IT · 25-RES-JH · INDEX 23610
Negative Logits
luk       -0.16
mlin      -0.15
CHandle   -0.15
opic      -0.15
輪        -0.15
題        -0.14
_Null     -0.14
avor      -0.14
etz       -0.14
alnız     -0.14
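Llama 3's tokenizer is byte-level BPE. In its raw vocabulary, non-ASCII bytes are stored as printable stand-in characters rather than literal text, so a raw entry such as `é¡Į` is mojibake for an ordinary token. The sketch below reverses the GPT-2 style byte-to-Unicode table. It assumes this is the same mapping used by the Hugging Face ByteLevel tokenizer for this model. Under that assumption, `é¡Į` is the bytes E9 A1 8C, which decode as UTF-8 to 題.

```python
def bytes_to_unicode() -> dict[int, str]:
    """GPT-2 style table: printable bytes map to themselves,
    the remaining bytes map to code points 256, 257, ... in byte order."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}

def decode_byte_level(token: str) -> str:
    """Turn a raw byte-level BPE token string back into readable text."""
    inverse = {ch: b for b, ch in bytes_to_unicode().items()}
    raw = bytes(inverse[ch] for ch in token)
    return raw.decode("utf-8", errors="replace")

print(decode_byte_level("é¡Į"))  # 題
print(repr(decode_byte_level("Ġlie")))  # ' lie' (Ġ stands for a leading space)
```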
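Positive Logits
lie       0.97
lies      0.94
lying     0.87
Lie       0.82
Lies      0.78
Lie       0.74
lie       0.70
lied      0.66
lies      0.64
liar      0.64
(Repeated strings such as "lie" and "Lie" are presumably distinct vocabulary tokens, for example with and without a leading space, whose whitespace was lost when the page was extracted.)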
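Logit lists like these are commonly produced by projecting a feature's decoder direction through the model's unembedding matrix. Each token is then ranked by the resulting score: the highest scores form the positive list and the lowest form the negative list. The sketch below shows that ranking. It is an assumption about how these numbers were computed. In particular, it ignores any handling of the final RMSNorm. All matrices and the vocabulary are made-up toy values.

```python
import numpy as np

def feature_logits(w_dec_row, W_U, vocab, k=2):
    """Rank tokens by the feature's direct effect on the output logits."""
    scores = w_dec_row @ W_U               # (vocab_size,): one score per token
    order = np.argsort(scores)             # ascending order of token indices
    positive = [(vocab[i], float(scores[i])) for i in order[::-1][:k]]
    negative = [(vocab[i], float(scores[i])) for i in order[:k]]
    return positive, negative

# Toy example: d_model = 2, four-token vocabulary (values are made up).
vocab = ["lie", "luk", "the", "liar"]
W_U = np.array([[0.9, -0.2, 0.1, 0.5],
                [0.0,  1.0, 1.0, 1.0]])
w_dec_row = np.array([1.0, 0.0])           # hypothetical decoder direction
positive, negative = feature_logits(w_dec_row, W_U, vocab)
print(positive)  # [('lie', 0.9), ('liar', 0.5)]
print(negative)  # [('luk', -0.2), ('the', 0.1)]
```

Because the score is a single dot product per token, the lists describe only the feature's direct path to the output. They do not describe effects mediated by later layers.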
Activations
Density: 0.101%
No Known Activations
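Activation density is presumably the fraction of sampled token positions on which the feature's activation is positive. Under that reading, 0.101% means the feature fired on roughly 1 token in 990 (1 / 0.00101 ≈ 990) of the sample it was measured on. Below is a minimal sketch of the calculation, with made-up activations.

```python
import numpy as np

def activation_density(acts):
    """Fraction of token positions where the feature's activation is > 0."""
    acts = np.asarray(acts)
    return np.count_nonzero(acts > 0) / acts.size

acts = np.zeros(10_000)
acts[[17, 4_210, 9_999]] = [1.3, 0.4, 2.8]  # made-up activations on 3 tokens
print(f"{activation_density(acts):.3%}")     # 0.030%
```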