Explanations

except

np_acts-logits-general · gemini-2.5-flash-lite

Words indicating a positive or neutral assessment of quality, correctness, or acceptability, often used to describe conditions, states, or judgments as satisfactory or meeting expected standards.

eleuther_acts_top20 · claude-4-5-sonnet · Triggered by @jamesnaruto04

Words indicating positive evaluation, correctness, or satisfactory status.

oai_token-act-pair · claude-4-5-sonnet · Triggered by @jamesnaruto04

Configuration

google/gemma-scope-2-27b-it/resid_post/layer_31_width_262k_l0_medium
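The source identifier above appears to pack several fields into one path: organization, SAE release, hook point, and a variant string carrying layer, width, and L0 setting. A minimal sketch of splitting it apart follows. The field names, the `SaeSource` type, and the `parse_source` helper are hypothetical and inferred only from the shape of this one string, not from any documented schema.

```python
import re
from typing import NamedTuple


class SaeSource(NamedTuple):
    """Hypothetical container; field names are guesses from the path layout."""
    org: str
    release: str
    hook: str
    layer: int
    width: str
    l0: str


def parse_source(path: str) -> SaeSource:
    # Assumes exactly four slash-separated parts, as in the identifier above.
    org, release, hook, variant = path.split("/")
    # Assumes the variant looks like "layer_<int>_width_<size>_l0_<label>".
    match = re.fullmatch(r"layer_(\d+)_width_([0-9a-z]+)_l0_(\w+)", variant)
    if match is None:
        raise ValueError(f"unrecognized variant string: {variant!r}")
    layer, width, l0 = match.groups()
    return SaeSource(org, release, hook, int(layer), width, l0)


source = parse_source(
    "google/gemma-scope-2-27b-it/resid_post/layer_31_width_262k_l0_medium"
)
print(source)
```

Read this way, the feature comes from the residual stream after layer 31, in a dictionary of width 262k.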

Prompts (Dashboard)

238,145 prompts, 512 tokens each

Dataset (Dashboard)

lmsys + oasst1

Negative Logits

Token (quoted to show leading spaces)  Value
" невозможно"  0.50
" impossibile"  0.45
" impossible"  0.42
" inevitably"  0.41
"很不"  0.38
" imposible"  0.38
"尴"  0.38
" Чтобы"  0.37
" unavoid"  0.36
" unwelcome"  0.36

Positive Logits

Token (quoted to show leading spaces)  Value
" except"  0.73
"except"  0.66
" kecuali"  0.64
" 대부분"  0.63
" relatively"  0.61
" EXCEPT"  0.61
"relatively"  0.61
" মোটামুটি"  0.60
" Except"  0.57
"大部分"  0.57

Activations Density 0.369%
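To give the density figure some scale, the sketch below combines it with the prompt counts listed under Prompts (Dashboard). It assumes, without confirmation from the page, that the density is the fraction of token positions where the feature is active and that it was measured over that same prompt set. The variable names are illustrative only.

```python
# Figures copied from the dashboard text.
NUM_PROMPTS = 238_145
TOKENS_PER_PROMPT = 512
DENSITY_PERCENT = 0.369  # reported to three significant figures

# Total token positions in the dashboard prompt set.
total_tokens = NUM_PROMPTS * TOKENS_PER_PROMPT

# Assumption: density = active positions / total positions on this same set.
approx_active = round(total_tokens * DENSITY_PERCENT / 100)

print(f"total token positions: {total_tokens:,}")
print(f"approx. active positions: {approx_active:,}")
```

Under those assumptions the feature would fire on roughly 450 thousand of about 122 million token positions. Since the density is rounded, the count is only an order-of-magnitude estimate.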

No Known Activations
