Ticket #229 (new task)

Opened 2 years ago

Last modified 2 years ago

Optimized inline blob handling and other blob thoughts

Reported by: bruno Owned by: evert
Priority: major Milestone:
Component: Repository Version:
Keywords: Cc:

Description

This is about the treatment of inline blobs and other blob related thoughts.

Some observations:

  • dealing with inline blobs in the Java client is much more lightweight than in the REST interface, since everything happens within the client JVM.
  • an upcoming change in the Java API will mean that blobs are retrieved using their {record id, field id, version} coordinates, rather than directly by blob key. For inline blobs, this means the advantage of faster retrieval will be lost, since a call to the repository and HBase will become necessary.
  • the previous point means that the blob.value property will, upon record GET, become mostly uninteresting. The only purpose it still serves is that, when changing some other fields (not the blob) and resaving the record, the identity of the blob is still known.
  • the decision to use inline blobs is made by Lily when uploading the blob, but nothing prevents the client from making this decision itself: one can simply construct a blob value according to the rules for inline blobs, and this way store a blob of any size as an inline blob. There is no protection against this. I don't think this is a bad thing; it can be interesting to let the client make such decisions. While this does allow for abuse (storing large values as inline blobs), the same is possible with other field types such as strings, even more so with multi-value, hierarchical strings. We should have some other protection against this (e.g. a globally configurable absolute field size limit; HBase also has some limit IIRC).
  • for the Java client, it would in theory also be possible to directly create blobs on HBase or HDFS as long as the rules are followed, in other words, these are publicly accessible, unprotected resources (on purpose, since we want to enable direct retrieval to/from these storages). The idea is that the Java client is used within trusted environments.
  • the advantage of inline blobs stems mostly from the fact that they use the same API as other blobs. Otherwise, we could just as well introduce a 'bytes' value type to allow storing raw bytes in a field (still interesting for cases where blobs don't make sense at all, that is, when you simply want to store a small binary structure; accessing e.g. 3 bytes via streams is clumsy and inefficient). Keeping the blob-storage API independent from the storage mechanism is important.

Other things I meanwhile observed:

  • the encoding of the blob value consists of the byte concatenation of type length (integer = 4 bytes), type name (e.g. INLINE) and data. So for inline blobs, 10 bytes (4 for the length plus 6 for the name) are already lost to this overhead. Maybe we can change the type identification to just one byte, which drops the need to store the type length.
  • the Blob object is currently semi-immutable, which does not seem to make much sense. For example, if you want to change the mediaType or name, you have to construct a new blob value. So I think it makes more sense to have setters for those other properties too.
  • Currently one has to specify the size of the blob when creating a blob. However, there is no check whatsoever that a user does not write more or less data to the OutputStream, and in that case the size in the Blob metadata will not correspond to the real size. Also, afterwards one can update a blob with a new Blob object that has the value of an existing blob (from a previous version of the same field) but a different size. This should not be allowed: we should check that the size corresponds. For new blobs, the only way is to let the repository check with the blob storage (an extra storage operation); for reused blobs, we can check with the previous version(s) of the field.
  • BlobStoreAccessRegistry does not check that all supplied BlobStoreAccess instances have different IDs.
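
To make the overhead in the encoding observation concrete, here is a rough sketch of the current layout next to the proposed one-byte variant. The class and method names, the UTF-8 encoding of the type name, the big-endian length and the type id value 1 are all assumptions for illustration, not the actual Lily code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class BlobValueEncoding {
    /** Layout as described in this ticket: 4-byte type-name length, type name, data. */
    public static byte[] encode(String typeName, byte[] data) {
        // Assumption: the type name is stored as UTF-8 and the length is big-endian.
        byte[] type = typeName.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(4 + type.length + data.length)
                .putInt(type.length)
                .put(type)
                .put(data)
                .array();
    }

    /** Proposed layout: a single type byte followed by the data. */
    public static byte[] encodeCompact(byte typeId, byte[] data) {
        return ByteBuffer.allocate(1 + data.length).put(typeId).put(data).array();
    }

    public static void main(String[] args) {
        byte[] data = {1, 2, 3};
        // Overhead of the current layout for an inline blob: prints 10
        System.out.println(encode("INLINE", data).length - data.length);
        // Overhead of the one-byte layout (type id 1 is a made-up value): prints 1
        System.out.println(encodeCompact((byte) 1, data).length - data.length);
    }
}
```

For the 3-byte example from the observations above, the current layout more than quadruples the stored size, which is where the one-byte type id would help most.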

Concerning the inline blob optimization, some proposals below. The other things should go into separate issues or just be handled immediately.

Java API

Repository.getInputStream(Record record, fieldName [, version, mvIndex, hierIndex]) : BlobInputStream

Because the record object is supplied, this can optimize the case where the blob is in the record object and is an inline blob (everything can be done without repository access).
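
A minimal sketch of what that local shortcut could look like, assuming the blob value uses the encoding described above (4-byte type length, type name, data). InlineBlobReader and tryOpenLocally are hypothetical names, and the UTF-8 type name is an assumption; an empty result means the caller still has to go to the repository.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

public class InlineBlobReader {
    /**
     * Returns a stream over the inline data if the blob value is an inline blob,
     * or empty if the blob lives elsewhere and the repository has to be called.
     */
    public static Optional<InputStream> tryOpenLocally(byte[] blobValue) {
        ByteBuffer buf = ByteBuffer.wrap(blobValue);
        if (buf.remaining() < 4) {
            return Optional.empty();
        }
        int typeLength = buf.getInt();
        if (typeLength < 0 || typeLength > buf.remaining()) {
            return Optional.empty();
        }
        byte[] type = new byte[typeLength];
        buf.get(type);
        // Assumption: the type name is stored as UTF-8 text.
        if (!"INLINE".equals(new String(type, StandardCharsets.UTF_8))) {
            return Optional.empty();
        }
        // Everything after the type name is the blob data itself.
        return Optional.of(new ByteArrayInputStream(blobValue, buf.position(), buf.remaining()));
    }
}
```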

The following variant is to allow explicitly creating inline blobs. While this is not necessary for optimization (writing an inline blob stays within the local JVM), it does allow for explicitly choosing the inline mechanism:

Repository.createInlineBlob(byte[] data, String mediaType): Blob

But then we should probably go for the more generic variant:

Repository.getOutputStream(Blob blob, String storage [one of inline,hdfs,hbase])

With the above method, specifying the size in the Blob should not be required; the system can fill it in for you.
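
One way the system could fill in the size is to count bytes as they pass through the stream and report the total on close. The sketch below is only an illustration: SizeRecordingOutputStream and its onClose hook are hypothetical names, and how the count ends up in the Blob (for example through a size setter) is left open.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.function.LongConsumer;

public class SizeRecordingOutputStream extends FilterOutputStream {
    private final LongConsumer onClose; // hypothetical hook that receives the final size
    private long count;
    private boolean closed;

    public SizeRecordingOutputStream(OutputStream out, LongConsumer onClose) {
        super(out);
        this.onClose = onClose;
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
        count++;
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        // Write in bulk instead of using FilterOutputStream's byte-by-byte loop.
        out.write(b, off, len);
        count += len;
    }

    @Override
    public void close() throws IOException {
        if (closed) {
            return;
        }
        closed = true;
        super.close();
        // Report the number of bytes actually written.
        onClose.accept(count);
    }
}
```

The same counter could also be compared against a size the user did supply, which would cover the missing size check mentioned in the observations.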

And the following variants, which address #41:

Repository.createBlob(Blob blob, byte[] data, String storage [one of inline,hdfs,hbase])
Repository.createBlob(Blob blob, InputStream is, String storage [one of inline,hdfs,hbase])

REST API

To allow immediate creation of inline blobs without server roundtrip, and without having to know the structure of the blob value attribute, allow submitting a record with blobs with a data field instead of a value field:

{
   data: base-64 encoded data
   ...
}

On the GET side of things, if the blob is an inline blob, return it as follows:

{
  data: base-64 encoded data,
  (no value attribute)
  ...
}

There would be no value attribute, since it is not needed: the client cannot do anything with it, and upon PUT we can reconstruct it from the provided data.
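
As a rough client-side illustration of the data field, the sketch below builds and reads it using java.util.Base64. InlineBlobJson and its methods are hypothetical names, the JSON is hand-built to avoid a library dependency, and the other blob properties (the "..." above) are left out.

```java
import java.util.Base64;

public class InlineBlobJson {
    /** Builds the blob object for a PUT: a "data" field instead of a "value" field. */
    public static String toJson(byte[] data) {
        // Base-64 output contains no characters that need JSON escaping.
        String encoded = Base64.getEncoder().encodeToString(data);
        return "{\"data\": \"" + encoded + "\"}";
    }

    /** Decodes the content of the "data" field of a blob returned by GET. */
    public static byte[] fromBase64(String dataField) {
        return Base64.getDecoder().decode(dataField);
    }

    public static void main(String[] args) {
        // prints {"data": "AQID"}
        System.out.println(toJson(new byte[] {1, 2, 3}));
    }
}
```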

To allow creating a blob on any storage, we would need to add an extra parameter to the /repository/blob resource, e.g. a request parameter ?storage=hdfs|hbase|inline
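
On the server side, the storage parameter could be mapped onto an enum along these lines. This is only a sketch: BlobStorage and fromParam are hypothetical names, and treating a missing parameter as "let Lily decide" is an assumption.

```java
import java.util.Locale;
import java.util.Optional;

public enum BlobStorage {
    INLINE, HDFS, HBASE;

    /**
     * Parses the value of the ?storage= request parameter.
     * An absent or empty parameter yields empty, meaning no explicit choice was made.
     */
    public static Optional<BlobStorage> fromParam(String value) {
        if (value == null || value.isEmpty()) {
            return Optional.empty();
        }
        try {
            return Optional.of(valueOf(value.toUpperCase(Locale.ROOT)));
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException(
                    "Unknown storage '" + value + "', expected one of inline, hdfs, hbase");
        }
    }
}
```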

Change History

comment:1 Changed 2 years ago by evert

  • Milestone changed from 0.3 to 1.0

The API call
Repository.getInputStream(Record record, fieldName [, version, mvIndex, hierIndex]) : BlobInputStream
is handled by #230.

All other aspects of this ticket are moved to 1.0.

comment:2 Changed 2 years ago by evert

Keeping this issue open for the inline blob optimizations, see the Java API and REST API sections.

For the other issues, separate tickets have been made :

comment:3 Changed 2 years ago by evert

  • Milestone 1.0 deleted

Taking this ticket out of 1.0.
There is currently no demand for these extra API calls. This ticket can be picked up when there is a need for it.