OpenAI o1 – A downgraded version of o1-preview

OpenAI has recently released the final version of its o1 LLM.

o1 is much faster than o1-preview, but in my opinion it is blatantly, obviously NOT as good as o1-preview.

This seems to be a common theme over the last year or so, and not only at OpenAI: showcase something powerful, then deploy a scaled-down version that is much less capable.

o1 ignores key details in the request (prompt) in order to serve something more generic. A “Turbo” version of o1-preview.

Example 1 – Chain of thought gone?

The thought-process method seems either removed entirely or simplified to the point where it is not really helpful. o1-preview would self-evaluate, self-iterate, and self-improve to a degree where it was actually useful. I don’t see that in o1.

In o1-preview, the chained self-prompting was noticeable (not because it wrote out what it was doing, but because the final result was brutally better than that of 4o).
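
To illustrate what I mean by chained self-prompting, here is a minimal sketch of doing the same thing manually against the Chat Completions API. To be clear: the loop shape and the stand-in model name are my assumptions for illustration, not OpenAI’s actual internals.

```python
# A minimal draft -> critique -> revise loop, approximating the kind
# of self-iteration o1-preview appeared to do internally. This is an
# assumed pattern, NOT OpenAI's actual reasoning implementation.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    # Plain single-turn call to the Chat Completions API.
    response = client.chat.completions.create(
        model="gpt-4o",  # arbitrary stand-in model for the sketch
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def self_iterate(task: str, rounds: int = 2) -> str:
    draft = ask(task)
    for _ in range(rounds):
        # Ask the model to critique its own draft against the task...
        critique = ask(
            f"Task: {task}\n\nDraft answer:\n{draft}\n\n"
            "List concrete flaws, missed details, and weak reasoning."
        )
        # ...then revise the draft using that critique.
        draft = ask(
            f"Task: {task}\n\nDraft answer:\n{draft}\n\n"
            f"Critique:\n{critique}\n\nRewrite the answer, fixing every flaw."
        )
    return draft
```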

Example 2 – Explaining by example is not understood

In o1-preview, you could include examples from one industry to demonstrate your idea or goal, and then request a result for a different industry built on the same principles. (This was a helpful way to find actionable methodologies for your own line of business using ideas borrowed from another.)
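
For concreteness, here is a sketch of this prompting pattern. The industries and examples are hypothetical, made up purely for illustration:

```python
# The "explain by example" pattern: worked examples from one industry,
# plus an explicit request to transfer the underlying principles to
# another. All details below are hypothetical.
prompt = """Here are two ways grocery chains reduce waste:

1. Dynamic markdowns: prices drop automatically as expiry approaches.
2. Demand-driven ordering: store-level sales data drives daily orders.

Do NOT elaborate on grocery retail. Extract the underlying principles
and apply them to a small B2B SaaS business instead."""
```

o1-preview would return SaaS-specific methodologies built on those principles.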

In o1 this does not work at all; o1 is unable to interpret the goal of the prompt. Instead, it tends to expand the examples further, making them more detailed, instead of extracting the principles the examples represent.

Where o1-preview clearly acted as a multi-step agent, creating an actionable plan through multiple self-prompts, o1 behaves more as if it simplifies the input prompt before running it (removing important details in the process).

Example 3 – Constructive disagreeableness gone?

o1-preview was quite capable of rejecting bad ideas. For example, if you strongly suggested bad practices (in coding, schema design, business methodologies), o1-preview would strongly and instantly disagree and present better options instead.
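
A sketch of the kind of test I mean. The scenario is hypothetical; plaintext password storage is simply an obviously bad practice that a capable model should refuse to endorse:

```python
# The disagreement test: strongly insist on a well-known anti-pattern
# and see whether the model pushes back. Hypothetical scenario.
prompt = """We have decided to store user passwords in plaintext in a
shared passwords.txt file so support staff can look them up quickly.
This is the best approach for us. Write the Python code for it."""
```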

o1 is yet again agreeable, just as in example #2.

The overall impression is that o1 ignores instructions and just rants on. Word count over quality.

***

I noticed a similar degradation in quality in 4o about a month ago. It suddenly became weirdly bad at programming and started to ignore specific instructions.

***

I use ChatGPT for very specific, repetitive tasks. If something suddenly works substantially worse, I notice fast.

A common counter is: “You are just less impressed over time, and that makes you think it is getting worse.” I don’t think so, because these degradations are sudden, and they show up for identical prompts.

I see no such sudden changes in the API when using specific model builds.
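
This is why I pin dated model snapshots in the API rather than relying on floating aliases. A minimal sketch; the snapshot name below is one published example, so check the current model list:

```python
# Why API behavior stays stable: requests can pin a dated model
# snapshot instead of a floating alias that the vendor may repoint.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    # "gpt-4o" is an alias that can be repointed to newer builds;
    # a dated snapshot like the one below stays fixed.
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": "Same prompt, every day."}],
)
print(response.choices[0].message.content)
```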