pointlessone
They definitely did not implement PDF parsing, not even a subset of it. They make some assumptions that will definitely result in incorrect parsing. For instance, they assume objects are tightly packed. Objects are not required to be: they should be, to save space, but nothing mandates it. Moreover, it is possible to place objects inside other objects. It's not advised but not prohibited. As far as I can tell this is where their PDF parsing ends. They don't parse the objects themselves (neither regular objects nor stream objects). So they've chosen PDF "because it is the most complicated format to our knowledge" but ended up just (incorrectly) chunking the stream by the offset table.
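
To illustrate the tight-packing problem, here is a minimal sketch on a toy byte string (not a valid PDF, and not the paper's grammar). The helper `chunk_by_offsets` is a hypothetical name; it cuts the buffer at the listed offsets the way a chunk-by-offset-table approach would.

```python
# Toy byte string, not a valid PDF: two objects separated by a comment line.
data = (
    b"1 0 obj\n<< /A 1 >>\nendobj\n"
    b"% padding that belongs to neither object\n"
    b"2 0 obj\n<< /B 2 >>\nendobj\n"
)

# Offsets as an offset table would list them (computed here for the toy data).
offsets = sorted([data.index(b"1 0 obj"), data.index(b"2 0 obj")])


def chunk_by_offsets(buf, offs):
    """Naive chunking: assume each object runs until the next offset."""
    ends = offs[1:] + [len(buf)]
    return [buf[start:end] for start, end in zip(offs, ends)]


chunks = chunk_by_offsets(data, offsets)
# The first chunk swallows the comment line, which is not part of object 1.
print(chunks[0])
```

Finding where an object really ends requires parsing the object itself, which is exactly the part that appears to be missing.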
w10-1
Snippets from the summary about the most promising aspects

> With attributes and intervals, IPGs allow the specification of data dependence as well as the dependence between control and data.

> Moreover, parser termination checking becomes possible.

> To further utilize the idea of intervals, an interval-based, monadic parser combinator library is proposed.

This sounds like a well-behaved variant. Adding local attribute references simplifies the grammar and is tractably implemented.

This might support classifying and implementing formats by severability + composability, i.e., whether you can parse one part at the same time as another, or at least find/prioritize precursor structures like indexes.

The yet-unaddressed streaming case is most interesting:

> We can first have an analysis that determines if it is possible to generate a stream parser from an IPG: within each production rule, it checks if the attribute dependency is only from left to right. After this analysis, a stream parser can be generated to parse in a bottom-up way
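
A rough sketch of what that check could look like, under a made-up rule representation (the `Term` class and its `defines`/`uses` fields are my assumptions, not the paper's data structures): a rule passes if every attribute a term reads was defined by a term to its left.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One term of a production rule (hypothetical model)."""
    name: str
    defines: set = field(default_factory=set)  # attributes this term produces
    uses: set = field(default_factory=set)     # attributes this term reads


def is_left_to_right(rule_terms):
    """True if each term only reads attributes defined by earlier terms."""
    seen = set()
    for term in rule_terms:
        if not term.uses <= seen:
            return False  # reads something defined later (or never defined)
        seen |= term.defines
    return True


# Length-prefixed layout: the header comes first, so the body can be streamed.
streamable = [Term("header", defines={"len"}), Term("body", uses={"len"})]

# Trailer-indexed layout (ZIP-like): the body depends on a footer at the end.
not_streamable = [Term("body", uses={"dir_offset"}),
                  Term("footer", defines={"dir_offset"})]

print(is_left_to_right(streamable), is_left_to_right(not_streamable))
```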

For parallel composition, you'd want to distinguish the attributes required by the consuming/combining (whole-assembly) operation from those only used in the part-parsing operation to plan the interfaces.

Aside from their mid-level parser combinators, you might want some binary-specific lowering operations (as they did with Int) tailored to your target architecture and binary encodings.

For the overall architecture it seems wise for flatbuffers et al. to expressly avoid unbounded hierarchy. Perhaps three phases (prelude+split, process, merge+finish) would be more manageable than the fully general dependency stages that arbitrary attribute dependencies make possible.
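
A minimal sketch of the three-phase idea, assuming an invented toy layout (a little-endian u32 record count, then one (u32 offset, u32 length) pair per record, then the payloads). The function names and the layout are hypothetical and only meant to show the shape: the prelude yields intervals, each interval is processed independently, and the merge is an ordered collect.

```python
import struct
from concurrent.futures import ThreadPoolExecutor


def build_archive(records):
    """Write the toy layout: count, index of (offset, length), payloads."""
    pos = 4 + 8 * len(records)  # payloads start right after the index
    index = []
    for rec in records:
        index.append((pos, len(rec)))
        pos += len(rec)
    out = struct.pack("<I", len(records))
    for off, length in index:
        out += struct.pack("<II", off, length)
    return out + b"".join(records)


def prelude_and_split(data):
    """Phase 1: read the archive-resident index and return the intervals."""
    (count,) = struct.unpack_from("<I", data, 0)
    return [struct.unpack_from("<II", data, 4 + 8 * i) for i in range(count)]


def process(data, interval):
    """Phase 2: parse one interval; it depends on nothing outside itself."""
    off, length = interval
    return data[off:off + length].decode("ascii").upper()


def parse(data):
    intervals = prelude_and_split(data)
    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(lambda iv: process(data, iv), intervals))
    return parts  # Phase 3: merge+finish, here just the ordered results


print(parse(build_archive([b"abc", b"de"])))
```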

I would hate to see a parser technology discounted because it doesn't handle the crap of PDF or even MS xml. I'd be very interested in a language that could constrain/direct us to more performant data formats, particularly for data archives like genomics or semantics where an archive-resident index can avoid full-archive scans in most use-cases.

andrybak
> We have used IPGs to specify a number of file formats including ZIP, ELF, GIF, PE, and part of PDF

For PDF, that's fair. Video "Types of PDF - Computerphile" covers this: https://www.youtube.com/watch?v=K7oxZCgO1dY

quotemstr
> ZIP files that are prefixed by random garbage can still be extracted by unzip but fail to be recognized by a parser that conforms to the format specification

To be fair, the ability to stick a ZIP file at the end of any other kind of file enables all sorts of neat tricks (like the old self-extracting zips).
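
A small sketch of why that works, assuming a single-disk, non-ZIP64 archive with no archive comment: extractors locate the end-of-central-directory record by searching backwards from the end of the file, so whatever sits in front of the archive only shifts offsets by a constant. `find_eocd` and `make_zip` are helper names made up for this example.

```python
import io
import struct
import zipfile

EOCD_SIG = b"PK\x05\x06"
EOCD_FMT = "<4s4H2LH"  # signature, 4 x u16, CD size, CD offset, comment length
EOCD_SIZE = struct.calcsize(EOCD_FMT)  # 22 bytes


def find_eocd(data):
    """Locate the end-of-central-directory record by scanning from the end."""
    pos = data.rfind(EOCD_SIG)
    if pos < 0 or pos + EOCD_SIZE > len(data):
        raise ValueError("no end-of-central-directory record found")
    fields = struct.unpack_from(EOCD_FMT, data, pos)
    n_total, cd_size, cd_offset = fields[4], fields[5], fields[6]
    # The central directory sits right before the EOCD record, so the gap
    # between its real position and its recorded offset is the prefix length.
    prefix_len = pos - cd_size - cd_offset
    return n_total, cd_size, cd_offset, prefix_len


def make_zip():
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("hello.txt", "hi")
    return buf.getvalue()


garbage = b"\x00random garbage\xff" * 3
plain = make_zip()
prefixed = garbage + plain

print(find_eocd(plain)[3], find_eocd(prefixed)[3], len(garbage))
# Python's zipfile also tolerates the prefix:
print(zipfile.ZipFile(io.BytesIO(prefixed)).namelist())
```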

khaledh
Reminds me of binarylang[0] (a library for binary parsing written in Nim). I used it in a small hobby project to parse ELF binaries in a declarative manner (well, just the headers + string table)[1].

[0] https://github.com/sealmove/binarylang

[1] https://github.com/khaledh/elfdump/blob/master/elfparse.nim

aappleby
Is this really a new thing? It feels like they've just crammed a sliver of the same bog-standard parsing we've been doing for decades back into the CFG.

I guess that's good for preventing off-by-one-based parsing errors, but surely there's prior art from long ago.

matheusmoreira
So cool to see that binary format parsing is finally being formalized.

I once asked a question related to this on the Computer Science Stack Exchange:

https://cs.stackexchange.com/q/60718

Would someone like to add this as an answer?

revskill
How about MS Office documents?