Explanations

CWE, bugs, CSRF, Partner, kala, abetes, Phys

np_acts-logits-general · gemini-2.5-flash-lite

references to software security vulnerabilities and attacks (like XSS, CSRF, bugs).

oai_token-act-pair · claude-3-7-sonnet-20250219 Triggered by @neilrathi

abbreviations and acronyms related to security vulnerabilities and technical terms (like XSS, CSRF, CWE).

oai_token-act-pair · claude-4-5-sonnet Triggered by @bfgcn7frf5

Configuration

google/gemma-scope-27b-pt-res/layer_10/width_131k
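The identifier above appears to pack several fields into one path. A minimal sketch of splitting it apart, assuming the shape is organisation/release, then a `layer_N` segment, then a `width_W` segment; the `SaeId` class and `parse_sae_id` helper are hypothetical names used only for illustration and are not part of any library:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SaeId:
    """Hypothetical container for the fields of the identifier."""
    release: str  # e.g. "google/gemma-scope-27b-pt-res"
    layer: int    # e.g. 10
    width: str    # e.g. "131k" (kept as text; the exact feature count is not stated)


def parse_sae_id(path: str) -> SaeId:
    # Assumed layout: <org>/<release>/layer_<N>/width_<W>
    parts = path.split("/")
    if len(parts) != 4:
        raise ValueError(f"expected 4 path segments, got {len(parts)}: {path!r}")
    org, repo, layer_part, width_part = parts
    if not layer_part.startswith("layer_") or not width_part.startswith("width_"):
        raise ValueError(f"unexpected identifier layout: {path!r}")
    return SaeId(
        release=f"{org}/{repo}",
        layer=int(layer_part[len("layer_"):]),
        width=width_part[len("width_"):],
    )


sae = parse_sae_id("google/gemma-scope-27b-pt-res/layer_10/width_131k")
print(sae.release, sae.layer, sae.width)
```

Keeping the width as a string avoids guessing whether "131k" means 131,000 or 131,072 features, which the page does not say.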

Prompts (Dashboard)

24,576 prompts, 128 tokens each

Dataset (Dashboard)

monology/pile-uncopyrighted

Negative Logits

Token        Logit
             -4.47
             -2.83
“            -2.56
<bos>        -2.33
us           -2.28
             -2.23
”            -2.20
‘            -2.17
,’           -2.16
             -2.13

Positive Logits

Token        Logit
."           2.50
and          2.45
 querían     2.44
鿬           2.41
鹛           2.38
蟥           2.31
႐            2.28
eryllium     2.27
賅           2.27
三重         2.25

Activations Density 0.003%
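To put the density figure in context, it can be combined with the prompt counts listed under Prompts (Dashboard). The sketch below assumes that "density" means the fraction of dashboard tokens on which the feature has a nonzero activation; the page does not define the term, so treat the result as a rough estimate only.

```python
# Figures copied from the dashboard. The meaning of "density" is an assumption:
# here it is read as the share of all dashboard tokens where the feature fires.
NUM_PROMPTS = 24_576
TOKENS_PER_PROMPT = 128
DENSITY_PERCENT = 0.003  # as displayed; the page rounds this value

total_tokens = NUM_PROMPTS * TOKENS_PER_PROMPT
expected_active = total_tokens * DENSITY_PERCENT / 100

print(f"total tokens: {total_tokens:,}")
print(f"estimated active tokens: {round(expected_active)}")
```

Under that reading the feature fires on only about 94 of roughly 3.1 million tokens. Because the displayed percentage is rounded, the true count could differ by a few tens of tokens either way.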

No Known Activations