
Curious to hear what folks are doing with Gemini outside of the coding space and why you chose it. Are you building your app so you can swap the underlying GenAI easily? Do you "load balance" your usage across other providers for redundancy or cost savings? What would happen if there was ever some kind of spot market for LLMs?


In my experience, Gemini 2.5 Pro really shines in some non-coding use cases such as translation and summarization via Canvas. The gigantic context window and large usage limits help in this regard.

I also believe Gemini is much better than ChatGPT at generating deep research reports. Google has an edge in web search and it shows. Gemini’s reports draw on a vast number of sources and thus tend to be more accurate. In general, I even prefer its writing style, and I like being able to export reports to Google Docs.

One thing that I don’t like about Gemini is its UI, which is miles behind the competition. Custom instructions, projects, temporary chats… these things either have no equivalent in Gemini or are underdeveloped.


If you're a power user, you should probably be using Gemini through AI Studio rather than the "basic user" version. That lets you set system instructions, temperature, structured output, etc. There's also NotebookLM. Google seems to be spinning up a bunch of side projects based on Gemini and seeing what sticks, and the generic Gemini app/webchat is just one of those.
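For what it's worth, the same knobs are also exposed programmatically; a minimal sketch with the google-genai Python SDK (the model name, prompt, and values are just placeholders):

    from google import genai
    from google.genai import types

    # Assumes the google-genai SDK and an AI Studio API key.
    client = genai.Client(api_key="YOUR_API_KEY")

    response = client.models.generate_content(
        model="gemini-2.5-pro",  # example model name
        contents="Summarize the attached meeting notes in five bullet points.",
        config=types.GenerateContentConfig(
            system_instruction="You are a terse technical summarizer.",
            temperature=0.2,
            # response_mime_type / response_schema would go here for structured output
        ),
    )
    print(response.text)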


My complaint is that any data within AI Studio can be kept by Google and used for training purposes — even if using the paid tier of the API, as far as I know. Because of that, I end up only using it rarely, when I don’t care about the fate of the data.


This is only true for the free tier. Paid AI Studio users have strong privacy protections.


Can you elaborate on “paid”? Because I honestly still have no idea whether my usage of AI Studio is used for training purposes.

I have Google Workspace Business Standard, which comes with some pro AI features. E.g., Gemini chat clearly shows “Pro” and says something like “chats in your organization won’t be used for training”. On AI Studio it’s not clear at all. I do have some version of paid AI services through Google, but no idea if it applies to AI Studio. I did create a dummy Google Cloud project, which allowed me to generate an API key, but afaik I still haven’t authorized any billing method.


Thank you for clarifying that. I’ve researched this once again and confirmed that Google treats all AI Studio usage as private if there’s at least one API project with billing enabled in an account.


For translation you'll still be limited on longer texts by the 65K-token output limit, though, I suppose?


Yes. I haven't had problems with the output limit so far, as I translate longer texts iteratively, section by section.

What I like most about translating with Gemini is that its default performance is already good enough, and it can be improved further via the one-million-token context window. I load into the context my private databases of idiomatic translations, separated by language pair and subject area. After doing that, the need to manually review Gemini's translations is greatly diminished.
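To sketch the workflow (assuming the google-genai SDK; the file names, language pair, model, and section splitting are placeholders):

    from google import genai
    from google.genai import types

    client = genai.Client(api_key="YOUR_API_KEY")

    # Hypothetical glossary of idiomatic translations for one language pair / subject area.
    glossary = open("glossary_en_pt_legal.txt", encoding="utf-8").read()

    def translate_section(section: str) -> str:
        # Each call carries the glossary in context but translates only one section,
        # so the output stays well under the output limit.
        prompt = (
            "Translate the following section from English to Portuguese. "
            "Follow the idiomatic choices in this glossary whenever they apply:\n\n"
            f"{glossary}\n\n---\n\n{section}"
        )
        response = client.models.generate_content(
            model="gemini-2.5-pro",
            contents=prompt,
            config=types.GenerateContentConfig(temperature=0.3),
        )
        return response.text

    # Naive sectioning by blank lines, just to illustrate the iteration.
    sections = open("source_text.txt", encoding="utf-8").read().split("\n\n")
    translated = "\n\n".join(translate_section(s) for s in sections)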


I tried swapping for my project, which involves having the LLM summarize and critique medical research, and didn't have great results. The prompt that works best with my main LLM fucks up the intended format when fed to other LLMs. I'm thinking about refining prompts for each LLM but haven't gotten there yet.

My favorite personal use of Gemini right now is basically as a book club. Of course it's not as good as my real one, but I often can't get them to read the books I want, and Gemini is always ready when I want to explore themes. It's often more profound than the book club too, and seems a bit less likely to tunnel-vision. Before LLMs I found exploring book themes pretty tedious; often I would have to wait a while to find someone who had read the book, but now I can get into it as soon as I'm done reading.


I can throw a pile of NDAs at it and it neatly pulls out the relevant parts within a few seconds. The huge context window and excellent needle-in-a-haystack performance are great for this kind of task.


The NIAH performance is a misleading indicator for performance on the tasks people really want the long context for. It's great as a smoke/regression test. If you're bad on NIAH, you're not gonna do well on the more holistic evals.

But the long-context eval they used (MRCR) is limited. It's multi-needle, so that's a start, but it's not evaluating long-range dependency resolution or topic modeling, which are the things you actually care about beyond raw retrieval for downstream tasks. Better than nothing, but not great for just throwing a pile of text at it and hoping for the best, particularly for out-of-distribution token sequences.

I do give google some credit though, they didn't try to hide how poorly they did on that eval. But there's a reason you don't see them adding RULER, HELMET, or LongProc to this. The performance is abysmal after ~32k.

EDIT: I still love using 2.5 Pro for a ton of different tasks. I just tend to have all my custom agents compress the context aggressively for any long context or long horizon tasks.


> The performance is abysmal after ~32k.

Huh. We've not seen this in real-world use. 2.5 Pro has been the only model where you can throw a bunch of docs into it, give it a "template" document (report, proposal, etc.), even some example material from other projects, and tell it to gather all relevant context from each file and produce the "template", and it does surprisingly well. We couldn't reproduce this with any other top-tier model at this level of quality.


Are you by any chance a lawyer? I’m asking because I’m genuinely curious whether lawyers are starting to use the SOTA LLMs in day-to-day drafting and review work. I use the LLMs as a CEO as a poor substitute for my in-house counsel when I just need _an_ answer quickly (i.e. when counsel is literally asleep); however, for anything serious, I always defer to them because I know LLMs make mistakes and obviously cannot offer professional liability cover.


We're a G Suite shop, so I set aside a ton of time trying to get 2.5 Pro to work for us. I'm not entirely unhappy with it; it's a highly capable model, but the long-context implosion significantly limits it for the majority of task domains.

We have long-context evals on internal data that we leverage for this (modeled after LongProc specifically), and performance across the board is pretty bad. Task-wise it's about as real-world as it gets for us, using production data: summarization, Q&A, coding, reasoning, etc.

But I think this is where the in-distribution vs out-of-distribution distinction really carries weight. If the model has seen more instances of your token sequences in training and thus has more stable semantic representations of them in latent space, it would make sense that it would perform better on average.

In my case, the public evals align very closely with performance on internal enterprise data. They both tank pretty hard. Notably, this is true for all models after a certain context cliff. The flagship frontier models predictably do the best.


MRCR does go significantly beyond multi-needle retrieval - that's why the performance drops off as a function of context length. It's still a very simple task (reproduce the i^th essay about rocks), but it's very much not solved.

See contextarena.ai and the original paper https://arxiv.org/abs/2409.12640

It also seems to match up well with evals like https://fiction.live/stories/Fiction-liveBench-Feb-21-2025/o...

The other evals you mention are not necessarily harder than this relatively simple one.


Sure. I didn't mean to imply that I thought MRCR was solved; I was only pointing out that it's closer to testing raw retrieval than to testing long-range dependency resolution the way LongProc does. If retrieval is great but the model still implodes on the downstream task, the benchmark doesn't tell you the whole story. The point of my original comment was that even the frontier models are nowhere near as good at long-context tasks as what I see anecdotally claimed about them in the wild.

> The other evals you mention are not necessarily harder than this relatively simple one.

If you're comparing MRCR to, for example, LongProc, I do think the latter is much harder, or at least much more applicable to long-horizon task domains where long context accumulates over time. But I think it's probably more accurate to say it's a more holistic, granular eval by comparison.

The tasks require the model to synthesize and reason over information that is scattered throughout the input context and across previously generated output segments. Additionally, the required output is lengthy (up to 8K tokens) and must adhere to a specific, structured format. The scoring is also more flexible than MRCR: you can use row-level F1 scores for tables, execution-based checks for code, or exact matches for formatted traces.

Just like NIAH, I don't think MRCR should be thrown out wholesale. I just don't think it can be pressed into the service of representing a more realistic long context performance measure.

EDIT: I also wanted to note that using both types of evals in tandem is very useful for research and training/finetuning. If LongProc tanks and you don't have the NIAH/MRCR context, it's hard to know which capabilities are regressing, so using both in a hybrid eval approach is valuable in certain contexts. For end users who are only trying to gauge current inference-time performance, I think evals like RULER and LongProc have much higher value.


Right, the way I see it, MRCR isn't a retrieval task in the same vein as RULER. It’s less about finding one (or multiple) specific facts and more about piecing together scattered information to figure out the ordering of a set of relevant keys. Of course, it’s still a fairly simple challenge in the grand scheme of things.

LongProc looks like a fantastic test for a different but related problem, getting models to generate long answers. It seems to measure a skill the others don't. Meanwhile, RULER feels even more artificial than MRCR, since it's almost entirely focused on that simple "find the fact" skill.

But I think you're spot-on with the main takeaway, and the best frontier models are still struggling with long context. The DeepMind team points this out in the paper with that Pokemon example and the MRCR evaluation scores themselves.


Gemini Flash 2.0 is an absolute workhorse of a model at extremely low cost. It's obviously not going to measure up to frontier models in terms of intelligence but the combination of low cost, extreme speed, and highly reliable structured output generation make it really pleasant to develop with. I'll probably test against 2.5 Lite for an upgrade here.


I want to know what use cases you're using it for, if it's not confidential.


We use it by having a Large Model delegate to Flash 2.0. Let's say you have a big collection of objects and a SOTA model identifies the need to edit some properties of one of them. Rather than have the Large Model perform a tool call or structured output itself (potentially slow/costly at scale), it can create a small summary of the context and change needed.

You can then provide this to Flash 2.0 and have it generate the full object, or a diffed object, in a safe way using the OpenAPI schema that Gemini accepts. The controlled generation is quite powerful, especially if you create the schema dynamically: you can generate an arbitrarily complex object with full typing, restrict valid values by enum, etc. And it's super fast, cheap, and easily parallelizable. Have 100 objects to edit? No problem, send 100 simultaneous Flash 2.0 calls. It's Google; they can handle it.
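A rough sketch of the delegation step (assuming the google-genai SDK and Pydantic; the schema, field names, and edit summaries are made up):

    import concurrent.futures
    from pydantic import BaseModel
    from google import genai
    from google.genai import types

    client = genai.Client(api_key="YOUR_API_KEY")

    # Hypothetical object type that the large model asked us to edit.
    class Listing(BaseModel):
        title: str
        price_usd: float
        status: str

    def apply_edit(summary: str) -> Listing:
        # Flash regenerates the full object, constrained by the schema.
        response = client.models.generate_content(
            model="gemini-2.0-flash",
            contents=f"Apply this change and return the updated object:\n{summary}",
            config=types.GenerateContentConfig(
                response_mime_type="application/json",
                response_schema=Listing,
            ),
        )
        return Listing.model_validate_json(response.text)

    # Edit summaries produced upstream by the large model (placeholders here).
    summaries = ["Mark 'Oak St duplex' as sold", "Drop the price of 'Elm Ave condo' to 450000"]

    # Fan out: one cheap Flash call per object, in parallel.
    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
        updated = list(pool.map(apply_edit, summaries))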


I’ve found the 2.5 pro to be pretty insane at math. Having a lot of fun doing math that normally I wouldn’t be able to touch. I’ve always been good at math, but it’s one of those things where you have to do a LOT of learning to do anything. Being able to breeze through topics I don’t know with the help of AI and a good CAS + sympy and Mathematica verification lets me chew on problems I have no right to be even thinking about considering my mathematical background. (I did minor in math.. but the kinds of problems I’m chewing on are things people spend lifetimes working on. That I can even poke at the edges of them thanks to Gemini is really neat.)
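The verification side can be as small as a sympy assert; for example, checking a closed form the model might propose:

    import sympy as sp

    n, k = sp.symbols("n k", positive=True, integer=True)

    # Verify a closed form symbolically: 1^3 + 2^3 + ... + n^3 == (n(n+1)/2)^2.
    lhs = sp.summation(k**3, (k, 1, n))
    rhs = (n * (n + 1) / 2) ** 2
    assert sp.simplify(lhs - rhs) == 0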


I use it extensively for https://lexikon.ai - in particular one part of what Lexikon does involves processing large amounts of images, and the way Google charges for vision is vastly cheaper compared to the big alternatives (OpenAI, Anthropic)


Wow, if I knew that someone was using your product on my conversation with them I'd probably have to block them.


I mean I've copy pasted conversations and emails into ChatGPT as well, it often gives good advice on tricky problems (essentially like your own personalized r/AmITheAsshole chat). This service seems to just automate that process.


I use Gemini 2.5 Flash (non thinking) as a thought partner. It helps me organize my thoughts or maybe even give some new input I didn't think of before.

I also really like to use it for self-reflection, where I just input my thoughts and concerns and see what it has to say.


Simple unstructured to structured data transformation.

I find Flash and Flash Lite are more consistent than others as well as being really fast and cheap.

I could swap to other providers fairly easily, but don't intend to at this point. I don't operate at a large scale.


It basically made a university physics exam for me, and it nearly one-shotted it as well. I just uploaded some exams from previous years together with a LaTeX template and told it to make me a similar one. Worked great. I also made it do the solutions.


I use it for https://toolong.link (YouTube summaries with images), because only Gemini has easy access to YouTube, and it has a gigantic context window.
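For anyone curious, the Gemini API accepts a public YouTube URL directly as part of the request; roughly like this with the google-genai SDK (the exact Part construction may differ by SDK version, and the model name is just an example):

    from google import genai
    from google.genai import types

    client = genai.Client(api_key="YOUR_API_KEY")

    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=types.Content(parts=[
            # The video goes in as file_data pointing at the YouTube URL.
            types.Part(file_data=types.FileData(
                file_uri="https://www.youtube.com/watch?v=VIDEO_ID")),
            types.Part(text="Summarize this video with timestamps for the key sections."),
        ]),
    )
    print(response.text)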


It's very good at automatically segmenting and recognizing handwritten and badly scanned text. I use it to make spreadsheets out of handwritten petitions.
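Roughly what that looks like in code (a sketch with the google-genai SDK; the columns, file name, and model are made up):

    import csv, io
    from google import genai
    from google.genai import types

    client = genai.Client(api_key="YOUR_API_KEY")

    with open("petition_page_01.jpg", "rb") as f:
        image_bytes = f.read()

    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[
            types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
            "Transcribe every row of this handwritten petition as CSV with the "
            "columns name, address, signature_present (yes/no). Output CSV only.",
        ],
    )

    # Parse the model's CSV output into rows for the spreadsheet.
    rows = list(csv.reader(io.StringIO(response.text)))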


Turning local real estate agents' websites into RSS feeds, to see new properties on the market before they get uploaded to real estate marketplace platforms.

I give it the HTML, it finds the appropriate selector for the property items, and then I use an HTML-to-RSS tool to publish the feed.
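The selector-finding step is basically a single prompt; a sketch (the google-genai SDK plus BeautifulSoup is just one way to wire it up, and the URL is a placeholder):

    import requests
    from bs4 import BeautifulSoup
    from google import genai

    client = genai.Client(api_key="YOUR_API_KEY")

    html = requests.get("https://example-agent.com/listings").text  # placeholder URL

    # Ask Gemini once per site for a CSS selector matching each property card.
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents="Return only a CSS selector that matches each individual property "
                 "listing in this HTML, and nothing else:\n\n" + html[:100_000],
    )
    selector = response.text.strip()

    # On every poll, reuse the cached selector; no LLM call needed after the first run.
    items = BeautifulSoup(html, "html.parser").select(selector)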


Web scraping - creating semi-structured data from a wide variety of horrific HTML soups.

Absolutely do swap out models sometimes, but Gemini 2.0 Flash is the right price/performance mix for me right now. Will test Gemini 2.5 Flash-Lite tomorrow though.


I've yet to run out of free image gen credits with Gemini, so I use it for any low-effort image gen like when my kids want to play with it or for testing prompts before committing my o4 tokens for better quality results.


Yes, we implemented a separate internal service that interfaces with the LLM, so callers can be agnostic as to which provider or model is being used. We haven't needed to load balance between models, though.
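A minimal sketch of that kind of interface (the names and methods are made up; Gemini is just one implementation behind it):

    from typing import Protocol

    class ChatClient(Protocol):
        # What callers depend on; they never import a provider SDK directly.
        def complete(self, system: str, prompt: str) -> str: ...

    class GeminiClient:
        def __init__(self, api_key: str, model: str = "gemini-2.5-flash"):
            from google import genai
            from google.genai import types
            self._client = genai.Client(api_key=api_key)
            self._types = types
            self._model = model

        def complete(self, system: str, prompt: str) -> str:
            response = self._client.models.generate_content(
                model=self._model,
                contents=prompt,
                config=self._types.GenerateContentConfig(system_instruction=system),
            )
            return response.text

    # Swapping providers (or load balancing later) just means adding another class
    # that satisfies ChatClient; callers like this one don't change.
    def summarize(llm: ChatClient, text: str) -> str:
        return llm.complete("You are a concise summarizer.", f"Summarize:\n{text}")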


Low-latency LLM for my home automation. Anecdotally, Gemini was much quicker than OpenAI in responding to simple commands.

In general, when I need "cheap and fast" I choose Gemini.



