Martin Gauch
@gauchm.bsky.social
Deep learning & earth science @ Google Research
Pinned
gauchm.bsky.social
Starting on bsky with a new preprint: "How to deal w___ missing input data"
doi.org/10.31223/X50...

Missing input data is a very common challenge in deep learning for hydrology: weather providers have outages, some data products start later than others, some only exist for certain regions, etc.
Different scenarios for missing input data: outages at individual time steps (top), data products starting at different points in time (middle), and local data products that are not available for all basins (bottom). All of these scenarios reduce the number of training samples for models that cannot cope with missing data (yellow, small box), while the models presented in this paper can be trained on all samples with valid targets (purple, large box).
Reposted by Martin Gauch
hs.egu.eu
Congratulations to Frederik Kratzert on winning this year's Arne Richter Award for outstanding research by an early career scientist. Fantastic presentation at #EGU25 this afternoon!
gauchm.bsky.social
I can't compete with @kratzert.bsky.social's swag game, but I'll contribute a few old NeuralHydrology stickers that I found recently :)
Reposted by Martin Gauch
kratzert.bsky.social
Now on HESSD for open discussion: egusphere.copernicus.org/preprints/20...

They even let us keep the paper title (for now?!) 🙄
gauchm.bsky.social
NeuralHydrology just got a little better, especially if you're building custom models :)
gauchm.bsky.social
All of the approaches work pretty well! Masked mean tends to perform a little better, but it's often quite close.

More details, experiments, figures, etc. in the paper.

All of this is joint work with @kratzert.bsky.social, @danklotz.bsky.social, Grey Nearing, Debby Cohen, and Oren Gilon.
Median NSE and KGE across 531 basins at different amounts of missing input time steps. The dotted horizontal line provides the baseline of a model that cannot deal with missing data but is trained to ingest all three forcing groups at every time step. The dashed line represents the baseline of a model that uses the worst individual set of forcings (NLDAS). The shaded areas indicate the spread between minimum and maximum values across three seeds; the solid lines represent the median.
gauchm.bsky.social
3) Attention: A more general variant of the masked mean that uses an attention mechanism to dynamically weight the embeddings in the average based on additional information, e.g., the basins' static attributes.
Illustration of the attention embedding strategy. Each forcing provider is projected to the same size through its own embedding network. The resulting embedding vectors become the keys and values. The static attributes, together with a binary flag for each provider, serve as the query. The attention-weighted average of embeddings is passed on to the LSTM.
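A minimal NumPy sketch of that attention pooling. The shapes and the scaled dot-product scoring are assumptions for illustration, not the paper's exact architecture: provider embeddings serve as both keys and values, a query vector (built from static attributes plus availability flags) scores them, and missing providers are masked out before the softmax.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(embeddings, query, available):
    """Attention strategy (illustrative sketch).

    embeddings: (n_providers, embed_dim) array of keys/values.
    query: (embed_dim,) vector, e.g., derived from static attributes
           and per-provider availability flags.
    available: boolean mask of length n_providers; at least one True.
    """
    scores = embeddings @ query / np.sqrt(query.size)  # scaled dot-product
    scores = np.where(available, scores, -np.inf)      # mask missing providers
    weights = softmax(scores)                          # masked entries get weight 0
    return weights @ embeddings                        # weighted average of values

# With only the first provider available, the pooled vector is its embedding.
emb = np.array([[1.0, 2.0], [3.0, 4.0]])
out = attention_pool(emb, np.array([1.0, 1.0]), np.array([True, False]))
# -> [1., 2.]
```

Unlike the plain masked mean, the weights here can vary per basin and time step, which is what lets the model learn which provider to trust where.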
gauchm.bsky.social
2) Masked mean: Embed each group of inputs separately (a group being the inputs from one data provider) and average the embeddings that are available at a given time step. This is what we currently do in Google's operational flood forecasting model.
Illustration of the masked mean strategy. Each forcing provider is projected to the same size through its own embedding network. The resulting embeddings of valid providers are averaged and passed on to the LSTM.
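The averaging step can be sketched in a few lines of NumPy (shapes and names are assumptions for illustration; the operational model embeds each provider with its own network first):

```python
import numpy as np

def masked_mean(embeddings, available):
    """Masked-mean strategy (sketch): average only the embeddings of
    the forcing providers available at this time step.

    embeddings: (n_providers, embed_dim) array, one embedding per provider.
    available: boolean mask of length n_providers; assumes at least one
               provider is available.
    """
    mask = np.asarray(available, dtype=float)[:, None]  # (n_providers, 1)
    return (embeddings * mask).sum(axis=0) / mask.sum()

emb = np.array([[1.0, 3.0],
                [5.0, 7.0],
                [9.0, 11.0]])
# Third provider missing -> average of the first two embeddings.
pooled = masked_mean(emb, [True, True, False])
# -> [3., 5.]
```

Because the average is taken only over valid providers, the result stays on the same scale no matter how many providers drop out.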
gauchm.bsky.social
In the paper we present and compare three ways to deal with those situations:
1) Input replacing: Just replace NaNs with some fixed value and concatenate the inputs with a flag to indicate missing data.
Illustration of the input replacing strategy. NaNs in the input data for a given time step are replaced by zeros, and all forcings are concatenated together with one binary flag per forcing group that indicates whether that group was NaN. The resulting vector is passed through an embedding network to the LSTM.
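A minimal sketch of the input-replacing preprocessing for a single time step (the fixed replacement value, flag convention, and function name are assumptions for illustration):

```python
import numpy as np

def input_replacing(forcings):
    """Input-replacing strategy (sketch): fill NaNs with a fixed value
    (zero here) and append a binary availability flag per forcing group,
    then concatenate everything into one vector for the embedding network.

    forcings: list of 1-D arrays, one per forcing group, for one time step.
    """
    parts = []
    for group in forcings:
        missing = np.isnan(group).any()             # group is missing if any NaN
        filled = np.nan_to_num(group, nan=0.0)      # fixed replacement value: zero
        flag = np.array([0.0 if missing else 1.0])  # 1 = available, 0 = missing
        parts.append(np.concatenate([filled, flag]))
    return np.concatenate(parts)

# Example: two forcing groups, the second one missing at this time step.
x = input_replacing([np.array([1.0, 2.0]), np.array([np.nan, np.nan])])
# -> [1., 2., 1., 0., 0., 0.]
```

The flag lets the network distinguish a genuinely zero-valued input from a filled-in missing one.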