
RFC - Unified Structured Data#11

Merged
dirvine merged 9 commits into maidsafe:master from dirvine:unified-structured-data
Jul 2, 2015

Conversation

@dirvine
Member

@dirvine dirvine commented Jun 24, 2015


@chandraprakash

Q1. How are the Backup and Sacrificial types interpreted at the Vaults layer? (The Client Manager needs to replicate Puts to the backup and sacrificial copies.)
Probably Vaults need to have tag_types denoting backup and sacrificial. This means we somehow need to reserve those tag_types for vaults only?


Minor typo: e2559 -> ed25519 (I think).
It would be nice to have a link included: http://ed25519.cr.yp.to/

Member Author


Cheers, will fix now.

@dirvine
Member Author

dirvine commented Jun 24, 2015

@chandraprakash Backup and Sacrificial are sub-types of ImmutableData and will be defined in routing. All clients will use only these types (most clients may ignore these two; our vaults will use them for sure). So we can have examples as we have right now with a simple put/get (of any data size), and even an example where we assign data types as functions (add/subtract etc.) and show a decentralised calculator, to give people a feel for what this is.


minor typo
ImmutableData has already What parts of the design are still to be done?two sub-types (Backup and Sacrificial)
->
ImmutableData has already two sub-types (Backup and Sacrificial)

Contributor


What is the default behaviour?

Member Author


Default is that these are stored (see below) when version == 0, and updated when a copy exists and the owner of the update is identified as a previous owner of the existing data element.
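The store/update rule described above could be sketched roughly as follows. This is a hypothetical illustration only; `PublicKey` and the `StructuredData` fields here are stand-ins, not the RFC's actual API.

```rust
// A rough sketch of the default behaviour: store when version == 0,
// update only when the signer is an owner of the existing copy.
// All types here are illustrative stand-ins.

#[derive(Clone, PartialEq)]
pub struct PublicKey(pub u64); // stand-in for a real signing key

pub struct StructuredData {
    pub version: u64,
    pub owner_keys: Vec<PublicKey>,
}

/// Whether a vault may accept `incoming`, given the copy it already holds.
pub fn accept(
    existing: Option<&StructuredData>,
    incoming: &StructuredData,
    signer: &PublicKey,
) -> bool {
    match existing {
        // First store: only version 0 is accepted.
        None => incoming.version == 0,
        // Update: the signer must be an owner of the existing copy.
        Some(current) => current.owner_keys.contains(signer),
    }
}
```

Note the owner check runs against the *existing* data element, so a stranger cannot simply upload a replacement naming themselves as owner.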

Contributor


On another note, the usual approach to type_tags is that the protocol defines only those types that are essential for its internal working (see TCP or any other protocol). We might, e.g. encode all of our types using just 3 bits. Then, if users need to add tags to their data, they should do it as part of the payload. Is there a reason we want to deviate from this approach?

Member Author


I see these type_tags more like IANA lists or similar. So a way to identify types for general use as well as internal use (and reserved). Is this what you mean?

Contributor


I think I understand the intentions. I guess the main argument against it, and the one which itches me most, is simplicity; that is, there already is an implicit way for users to use type tags for their data (inside the data member variable), so why have two approaches for the same thing?

dirvine added 3 commits June 27, 2015 20:58
…o unified-structured-data

Conflicts:
	proposed/0000-Unified-structured-data.md
@Fraser999
Contributor

I started writing comments here, but it quickly became too large. I've made a repo for my proposal and put my comments into the docs.

If you want any or all of it moved into comments here, I can have a go. I guess any discussion on my proposal should go in here though - not on my repo?

@dirvine
Member Author

dirvine commented Jun 28, 2015

I think if it is close enough, but still a large departure, then it's perhaps better to make another RFC. If it's a lot of comments, then perhaps add these comments in a way people can grasp them, if possible a bit at a time. I would say we need to be careful about thoughts of deleting ImmutableData from these older versions though, as it's a hack if I create a StructuredData and add well-known ImmutableData (or worse) identifiers in the StructuredData.

I think a repo to back up / explain your ideas/suggestions is a good way to go when there is a lot to decide. If possible bits at a time works best though.

Perhaps start with changes here, or if you feel there is a large departure then a new RFC.

@Fraser999
Contributor

OK - I'll put the main points for discussion in here. I guess we can still refer to the sample implementation without pasting it here.

  1. I'd prefer to have the immutable and mutable parts split into separate structs for clarity.
  2. I'm not sure we need to keep the signatures in the Data - only required in the messages requesting the data to be mutated.
  3. I think we can easily implement more fine-grained, flexible control of how a mutating request is authorised without increasing complexity. This is described in a bit more detail in my repo, but essentially, each owner's key is given a weight value when it is initially added, and a mutating request must be signed by enough keys to exceed a specified minimum weight.
  4. I think we'd benefit in a few ways from keeping a collection of the most recent versions rather than just the single most recent. The benefits are that the data becomes easier to work with if you want to use it for versioned data, it allows for future efficiency gains, and it allows for future auto-archiving of old versions.

I think these are the main points. I guess we can either discuss all points at the same time, or try and just stick to one topic at a time?

One other thought I had (let's call this point 5) was that it might be worth renaming StructuredData to DynamicData or VersionedData, and renaming ImmutableData to StaticData? While "StructuredData" is an appropriate name, really its main reason for existing is to hold changing data. I don't think we should use ImmutableData and MutableData together; while the terms make sense, they're easily misheard. I guess I'd plump for Immutable and Versioned - I think they're most representative of the two types.

I'll leave out the points about the data expiring, and about any archiving mechanism for now. I think both of these can be added later without major disruption to an existing implementation.

@dirvine
Member Author

dirvine commented Jun 28, 2015

@Fraser999 Nice feedback (a bit of a huge lump of questions in one point though ;-) )

  1. How would you refer to the Immutable part though? Would that not create another StructuredData to add the reference?
  2. The signature is required when there is a multisig (more than one owner) to ensure a single owner cannot overwrite the multi-sig data. As the message is passed from owner to owner collecting signatures then a single owner can upload the data.
  3. Do you mean making some owners more special than others here? I feel this is introducing complexity that could lead to even further complexity. I am not aware of any multisig technology that offers that (not that that means we should not); I am just curious as to why. I am also very wary of complexity like this without all the edge effects being known. I see this like people voting: if some have weighted votes then the whole system seems to ultimately fail.
  4. I agree here with the sentiment, but does the network need to care if this is done for any type? What I mean here is that client apps can keep as many or as few versions as they want. The single part I see the network should do is have a backup login session (as we always have, but which has been overlooked recently). As far as any other data is concerned, client-side apps could handle that without the network stringently deciding how many versions each data type will have.

As an addition to this point, I would say that network-versioned data adds some fantastic benefits client-versioned data does not. This includes the ability for any user (not only the owners) to go back through versions. The main issue is handling such versions, which could be done for sure by having the version as 2 fields (for instance): one is recent versions and the second being the links to chunks (which themselves are just versions). This allows a more global approach to versions that is secure (using non-encrypted ImmutableData) and extensible, where if users used the versions there would be payment for each number of versions over the 100KB limit here. I do see this as a separate RFC, which is not an altogether complex RFC, but could answer that specific issue for sure.

Perhaps the answer here is to not block such an RFC, nor require that such an RFC break the APIs we introduce.

5: I think the industry recognises Structured Data (as data with a structure, i.e. Cassandra) and Unstructured Data (which is what we call ImmutableData, i.e. Hadoop). I think that should be part of our thinking with regards to naming (even if the whole industry is wrong). In saying that, structured data does not necessarily have to be versioned; it just has to follow a data model (arguably what we have is semi-structured data).

I agree expiration and archiving are beyond the scope of this RFC.

@Fraser999
Contributor

  1. How would you refer to the Immutable part though? Would that not create another StructuredData to add the reference?

Not sure what you mean here. It would be part of the StructuredData struct.

@Fraser999
Contributor

2 The signature is required when there is a multisig (more than one owner) to ensure a single owner cannot overwrite the multi-sig data. As the message is passed from owner to owner collecting signatures then a single owner can upload the data.

Since this goes via a Post message, we can have the signatures as part of the message but not part of the data.

@Fraser999
Contributor

3 Do you mean here making some owners more special than others here?

Yes

I feel this is introducing complexity that could lead to even further complexity.

I don't feel this is any more complex than the original proposal. Instead of counting sigs, we're simply adding integers.

I am not aware of any multisig technology that offers that (not that means we should not), I am just curious as to why,

Flexibility. Say an owner of a git repo wants to have full rights to modify it, but wants to require at least 3 out of 10 devs to authorise a change. His key is given a value of 3, all devs a value of 1. Minimum weight to validate a request is 3. Not complex - and certainly not complex for the network.
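The git-repo example could be sketched as a tiny weight check. This is a hypothetical illustration; `KeyAndWeight` here uses a plain integer as a stand-in key, and `authorised` is not part of the proposal's API.

```rust
// Minimal sketch of weighted multi-sig authorisation (illustrative only).

pub struct KeyAndWeight {
    pub key: u64, // stand-in for a real public key
    pub weight: u64,
}

/// True if the signers' combined weight meets the minimum.
pub fn authorised(owners: &[KeyAndWeight], signers: &[u64], min_weight: u64) -> bool {
    // Sum the weights of every owner key that actually signed.
    let total: u64 = owners
        .iter()
        .filter(|o| signers.contains(&o.key))
        .map(|o| o.weight)
        .sum();
    !signers.is_empty() && total >= min_weight
}
```

With the owner's key at weight 3, ten dev keys at weight 1, and a minimum of 3: the owner alone passes, any three devs pass, two devs do not.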

I am also very wary of complexity like this without all the edge effects being known.

Me too. I guess I don't see any edge cases for this though, but this is a good place to try to find them I think - if others can spot any we should document them here.

I see this like people voting, if some have weighted votes then the whole system seems to ultimately fail.

I disagree - I don't think this is like people voting. The network will enforce an equal weight per key if that's how the creator of the data type wants it to be - so for real-life voting, yes it should be equal. But in other cases, not all people have equal rights.

If we enforce equal weight to all keys as per the original proposal, the only way bias can be achieved is to create many keys per human. So in my example mentioned above, the owner of the git repo would create 11 keys for himself, and one each for his devs. That way he can always outvote his devs.

@Fraser999
Contributor

4 I agree here with the sentiment, but does the network need to care if this is done for any type? What I mean here is that client apps can keep as many or as few versions as they want. The single part I see the network should do is have a backup login session (as we always have, but which has been overlooked recently). As far as any other data is concerned, client-side apps could handle that without the network stringently deciding how many versions each data type will have.

I guess the client apps can keep however many versions they want, but I think that's better done on the network.

We haven't dealt with the Get requests, but anything other than getting the full piece of StructuredData will need to go via a Post won't it? So for the example you mentioned of session backup - we'll need to have a second Post call if the contents of the initial Get don't parse as a session. If the StructuredData holds both versions, they're both available after just one call.

@Fraser999
Contributor

I do see this as a separate RFC, which is not an altogether complex RFC, but could answer that specific issue for sure.

Agreed :)

@Fraser999
Contributor

With regards to renaming StructuredData and ImmutableData, the suggestions were personal preferences - I don't feel strongly enough about the issue to argue. I don't think the current names are wrong, I just feel they could be better.

@dirvine
Member Author

dirvine commented Jun 29, 2015

3 Do you mean here making some owners more special than others here?

Yes

I suppose what I am saying here is, is this going further than multi-sig? Nobody has asked for this so far, and I am not aware of any other multi-sig that does this. So in essence I see us adding, to an existing mechanism, some complexity that is not what people expect in multi-sig. My feeling is if we just provide the expected actions first then we keep it simple, expected and comfortable.

I appreciate that creating multiple identities to add yourself weight etc. is available in today's multi-sig systems; AFAIK nobody does that. They could, but ...

I feel we really just want to do what people expect in these cases, or abandon creating multi-sig and create our own system and name to go with it; but we have promised multi-sig, and it does make sense in any crypto-currency type situation for sure. I think here we are making all data able to act the same, which is again a known mechanism used by document signatories and several businesses today (DocuSign etc.).

i.e. we are proposing more here than people expect or have asked for, under the umbrella of a known name (multi-sig). So we are not implementing a known algorithm, but something more than that!

@dirvine
Member Author

dirvine commented Jun 29, 2015

I guess the client apps can keep however many versions they want, but I think that's better done on the network.

I agree for some types (not for all). I do feel this RFC proposes the first step, not the final solution. If the first step is simple then we ensure it works and add to it as we see opportunities to do so.

We haven't dealt with the Get requests, but anything other than getting the full piece of StructuredData will need to go via a Post won't it? So for the example you mentioned of session backup - we'll need to have a second Post call if the contents of the initial Get don't parse as a session. If the StructuredData holds both versions, they're both available after just one call.

I cannot see us doing anything else than a Get in this case, even to update you will need to Get the original data to overwrite it. I again think trying to get parts of data is too complex and not something we should be looking at, especially at network version 1.

@Fraser999
Contributor

I suppose what I am saying here is, is this going further than multi-sig? Nobody has asked for this so far

Well... I am :)

and I am not aware of any other multi-sig that does this. So in essence I see us adding to an existing mechanism, some complexity that is not what people expect in multi-sig. My feeling is if we just provide the expected actions first then we keep it simple, expected and comfortable.

I feel this proposal is simple and comfortable. As for expected, I think the community has handled several massive changes to expectations and seems to be generally well able to take in the changes.

I appreciate that creating multiple identities to add yourself weight etc. is available in today's multi-sig systems; AFAIK nobody does that. They could, but ...

That's partly my point. Maybe they would, but either they figured out how to do it and it's too much work, or it never occurred to them that they could. This proposal fixes both of these issues.

I feel we really just want to do what people expect in these cases, or abandon creating multi-sig and create our own system and name to go with it; but we have promised multi-sig, and it does make sense in any crypto-currency type situation for sure. I think here we are making all data able to act the same, which is again a known mechanism used by document signatories and several businesses today (DocuSign etc.).

We also have many well-known and understood cases where not all users are equal - e.g. several products we use in-house allow users to be given owner, admin, member, read-only status (google accounts, cdash, jira, etc.)

@Fraser999
Contributor

I agree for some types (not for all). I do feel this RFC proposes the first step, not the final solution.

Agreed - but I see that being true for both proposals.

If the first step is simple then we ensure it works and add to it as we see opportunities to do so.

Again agreed. What I'm proposing is simple.

I cannot see us doing anything else than a Get in this case, even to update you will need to Get the original data to overwrite it.

So under your proposal, for the session stuff, we need to include the backup session location in the data field, yes? Or if not the location, then both the encrypted current session and the backup?

This seems about the same or even slightly more complex to me than just encrypting the session as a piece of ImmutableData, then adding that chunk's name to the list of versions in the session StructuredData.

I again think trying to get parts of data is too complex and not something we should be looking at, especially at network version 1.

I'm not proposing that we should be able to cherry-pick parts of data to receive. Not sure what you mean here?

@dirvine
Member Author

dirvine commented Jun 29, 2015

So under your proposal, for the session stuff, we need to include the backup session location in the data field, yes? Or if not the location, then both the encrypted current session and the backup?

No, you would have a backup session packet (a different type) in this case.

The data element can contain the whole session if it fits in the size limit. Otherwise the data element would contain DataMaps, or just keys to ImmutableData if it's encrypted differently (this would be a new type of ImmutableData, if we want further efficiency; a later RFC).

So everything stays simple this way.

@Fraser999
Contributor

No, you would have a backup session packet (a different type) in this case.

The data element can contain the whole session if it fits in the size limit. Otherwise the data element would contain DataMaps, or just keys to ImmutableData if it's encrypted differently (this would be a new type of ImmutableData, if we want further efficiency; a later RFC).

So everything stays simple this way.

Ah - I see. We did this before, way back. It wasn't simple to roll back then, but not having different key types to sign these different packets now will probably mean it's not too hard.

I still think my approach is simpler than this and makes the StructuredData more usable. I guess other people will need to chip in to give their impressions of the complexity of my proposal - I just can't see where it's not simple.

@dirvine
Member Author

dirvine commented Jun 29, 2015

I still think my approach is simpler than this and makes the StructuredData more usable. I guess other people will need to chip in to give their impressions of the complexity of my proposal - I just can't see where it's not simple.

If possible, could you show it, maybe by commenting exactly what you mean to change here? Perhaps directly in the file itself: https://github.com/maidsafe/rfcs/pull/11/files. Maybe folks will find that easier to discuss? It also shows whether it's complex or not, I hope.

@Fraser999
Contributor

Sure thing.

Contributor


I'd propose to change this to:

/// Top-level type: a representation of "Structured Data".
pub struct Data {
    /// Immutable attributes which apply to the entire `Data` instance.
    fixed_attributes: FixedAttributes,
    /// Attributes which apply to the entire `Data` instance, but which can be changed with proper
    /// authorisation.
    mutable_attributes: MutableAttributes,
    /// The most recent (which could encompass all) versions of the `Data` instance.  Cannot be
    /// empty.
    versions: Vec<Version>,
}

/// Attributes of the `Data` which can never change once initially set.  These define the identity,
/// type and some of the rules the network will employ for handling the `Data`.  It can also hold
/// arbitrary data which will likely be meaningless to the network.
pub struct FixedAttributes {
    /// Identifier of the `Data` type.
    type_tag: u64,
    /// Identity of the piece of `Data`.
    id: NameType,
    /// Maximum number of versions allowed.
    max_versions: u64,
    /// Number of versions to retain when archiving a "full" piece of `Data` (minimum value of 1).
    min_retained_count: u8,
    /// Arbitrary, immutable, `Data`-wide information.  May be empty.
    data: Vec<u8>,
}

/// A representation of an owner's public key and the bias which should be given to that key when
/// a mutating request is received by the network.
pub struct KeyAndWeight {
    /// Owner's public key.
    key: sign::PublicKey,
    /// Bias given to this public key (minimum value of 1).
    weight: u64
}

/// Attributes of the `Data` which can be changed via a properly-authorised request to the network.
/// These define the current owner's public keys, further rules the network will employ for handling
/// the `Data` and also arbitrary data which will likely be meaningless to the network.
pub struct MutableAttributes {
    /// Current owner or owners' public keys.  Cannot be empty.
    owner_keys: Vec<KeyAndWeight>,
    /// Minimum total weight of signatories' keys to allow a mutation of the piece of `Data` (at
    /// least one signature will be required regardless of this minimum).
    min_weight_for_consensus: u64,
    /// Coarse-grained expiry date around which time the piece of `Data` will be removed from the
    /// network.
    expiry_date: time::Tm,
    /// Arbitrary, mutable, `Data`-wide information.  May be empty.
    data: Vec<u8>,
}

/// A representation of a single version.  The `index` allows provision of strict total ordering of
/// the `Version`s.  It can also hold arbitrary data specific to that particular `Version`, e.g.
/// encrypted content or the name of a piece of "Immutable Data".
pub struct Version {
    /// Sequential number to provide strict total order of versions.
    index: u64,
    /// Arbitrary, version-specific information.  May be empty.
    data: Vec<u8>,
}

More detailed description is available in the docs.


To avoid a number of owners removing one of the current owners, for some types of data leaving ownership can be limited to only a signed request from the owner itself.

Member Author


It would be great to hear folks' ideas here. So far mine would be:

  1. Versions are nice, but already achievable; i.e. if a client app wants roll-back versioning then it can use the Identifier field, so in its data element it can add a prev Identity. This way, if versions are large (i.e. contain a whole directory listing) then users pay for every new version they create (they can also Delete old versions if they want). Alternatively they could keep a Vec<Identity, Data etc.> if they wished to use a more complex mechanism for this, and create many ImmutableData chunks (as large data would not fit). This payment then encourages good behaviour and ensures the network is paid for the extra work of handling versions.
  2. Weighted multi-sig to me is more complex than plain multi-sig. Users already find multi-sig hard to understand; weighting makes it harder for them and may also open up fraud, with people not really understanding the contract they are getting into. People can already grasp M signatures from N (i.e. for cheques etc.) and would find this harder to understand. I do not feel we need further complexity like this.
  3. expiry_date: Not using time has been fundamental to the network from day 1, in my presentations and discussions, and this would go completely against that. I have no issues if client apps do this with local time on local machines, or nodes use local time periods (counts) for very local events etc., but for the network to start considering time like this would be way beyond anything I could calculate in terms of completely bursting the whole design. I am completely relaxed about Lamport clocks or even vector clocks for ordering, but not time. I see so many edge effects and opportunities to break such time-based timestamps, as well as this introducing the network's first centralised state, system wide (i.e. ultimate state).

I do still feel this is a more complex design. As I say, it would be good to hear as many opinions as possible here. The less the network has to do the better, I feel. If it can have extremely simple rules then it can be made more secure than with complex rules.

Member Author


The issue there, @mmoadeli, is you may be removing an owner for bad behaviour, death, or no longer being on the network, etc. So if we go with a majority, it means more want to remove the owner than keep them; so then it's more democratic perhaps?


That's correct for some types, for sure. But if it comes to the ownership of something of value, then democratic approaches may not be the best call. Maybe different types can have different owner-removal policies.

Contributor


@Fraser999, about the min_weight_for_consensus: u64 field. The idea behind requiring that this value is set to floor(owner_count/2) + 1 is that with this condition you cannot have two disjoint subsets of owners which decide on two different outcomes. Say you have four owners {A, B, C, D} and you allow this value to be less than floor(4/2) + 1 = 3, say 2; then you can have {A, B} decide on one outcome and {C, D} decide on a different outcome. I don't think this can be called consensus anymore.
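The counting argument above (for the equal-weight case) can be shown in a couple of lines. The function names here are illustrative, not part of any proposal.

```rust
// Illustration of the floor(n/2) + 1 argument, assuming equal weights.

/// Smallest quorum guaranteeing any two authorising subsets intersect.
pub fn majority(owner_count: u64) -> u64 {
    owner_count / 2 + 1
}

/// Two disjoint subsets of size `quorum` both fit among `owner_count`
/// owners iff 2 * quorum <= owner_count, i.e. iff conflicting
/// "agreements" are possible.
pub fn disjoint_quorums_possible(owner_count: u64, quorum: u64) -> bool {
    2 * quorum <= owner_count
}
```

For four owners, a quorum of 2 admits the disjoint pairs {A, B} and {C, D}, while the majority quorum of 3 forces any two authorising subsets to share at least one owner.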

Contributor


@inetic Then both mutating requests are valid. If they're both trying to modify the same version number, one request will arbitrarily "win". This is the same scenario for the original proposal too by the way. It could also be the scenario where only a single owner exists, if he sends two conflicting requests concurrently (not improbable if he's using the same app concurrently on multiple devices and it does background Puts to the network).

Contributor


It could also be the scenario where only a single owner exists, if he sends two conflicting requests concurrently

I did not think of this one. I think this is a good reason to use a vector clock for versioning the chunk (i.e. instead of Vec&lt;OwnerPubKey&gt; for owners we would use Vec&lt;(OwnerPubKey, VersionNumber)&gt;); that way we wouldn't rely on consensus, which seems non-deterministic anyway given the above. Instead we would gain the opportunity to deterministically tell whether the update took place or not. (EDIT: sorry for the noise, I need to think about it more)

@inetic
Contributor

inetic commented Jun 30, 2015

@Fraser999, about the weight proposal, I'm still reading through it, but it just came to my mind whether it wouldn't be possible to achieve the same result by just adding multiple ownerships to one person?

@Fraser999
Contributor

@inetic Yes, that came up in the discussion as being possible even with existing multisig protocols, but I feel it's overly cumbersome for users.

@inetic
Contributor

inetic commented Jun 30, 2015

Yes, that came up in the discussion as being possible even with existing multisig protocols

Sorry, still catching up.

but I feel it's overly cumbersome for users.

The thing is, we can't get rid of the cumbersome approach because it's implicit, so together with the weighted approach we would have two approaches to do the same thing. That means more work for no benefit other than being less cumbersome (which I think is subjective).

@Fraser999
Contributor

The thing is, we can't get rid of the cumbersome approach because it's implicit, so together with the weighted approach we would have two approaches to do the same thing. That means more work for no benefit other than being less cumbersome (which I think is subjective).

To me it would be like using repeated addition as a way to do multiplication. It's do-able, but not as easy. I can't see a user creating 10 keys when they can create just 1 to achieve the same end. It's more code for them to write, it's more computationally expensive to run and it's more expensive for the network to handle in terms of space, messages and computation.

@inetic
Contributor

inetic commented Jun 30, 2015

it's more computationally expensive to run and it's more expensive for the network to handle in terms of space

Efficiency could be a good argument for sure, but should be backed with numbers I think. It also cuts both ways, i.e. if most use cases will use weight 1 per account, then we're adding (NUM_OF_OWNERS * 8) bytes to each message.

Contributor


What is Data::type, given we already (and only) have type_tag during the Puts?

Member Author


There will be a Data enum in routing which will be like

enum Data {
    StructuredData(u64),
    ImmutableData(u64),
}

With the u64 being the type tag for that data. For ImmutableData we have 0, 1 & 2 for Plain, Backup and Sacrificial.

I also think we should have a PlainData type where the content is Vec<u8> for folks who use routing as a normal DHT and do not want to be able to set the rules we can here. So our examples may use PlainData to show a simple put/get DHT as we have at the moment, but we can show more complex use cases for sure.
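A minimal sketch of what that enum might look like with the suggested PlainData variant included — the names and shape are illustrative, not the final routing API:

```rust
// Hypothetical sketch of the routing Data enum, extended with the
// PlainData variant suggested above. Not the final API.
enum Data {
    /// Rule-bearing structured data, tagged with a u64 type tag.
    StructuredData(u64),
    /// Immutable content; tag 0 = Plain, 1 = Backup, 2 = Sacrificial.
    ImmutableData(u64),
    /// Raw bytes for clients using routing as a plain put/get DHT.
    PlainData(Vec<u8>),
}
```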

@inetic
Contributor

inetic commented Jun 30, 2015

@dirvine, where exactly are the StructuredData and ImmutableData structures intended to be used inside Routing? Are they supposed to be passed as arguments to get, put and post methods and then serialised as bodies inside corresponding messages? GetData currently doesn't have any payload field.

@Fraser999
Contributor

I can't back it up with numbers, since we don't have any implementations to compare :) However, I think this is another issue which isn't addressed here. The Structured Data has at least three forms - one while it's being sent over the wire, another while it's being "used" (i.e. it's parsed and is in memory) and another when it's held in storage (serialised in the database).

I think we're treating the struct as shown in this RFC as being the object used in all three cases. It needn't be. Many of the fields can be eliminated to zero-bytes when serialising this, including the weight per key which will likely be 1 in most cases.
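One way to picture that point — a hypothetical encoding, not anything specified in the RFC — is a flag byte that elides the per-owner weights entirely in the wire form when they are all 1, the expected common case:

```rust
// Hypothetical sketch: elide per-owner weights from the serialised
// form when every weight is 1 (the likely common case).
fn serialise_weights(weights: &[u64]) -> Vec<u8> {
    if weights.iter().all(|&w| w == 1) {
        // Flag 0: all weights are 1; store nothing further.
        vec![0]
    } else {
        // Flag 1: explicit little-endian u64 weights follow.
        let mut out = vec![1];
        for &w in weights {
            out.extend_from_slice(&w.to_le_bytes());
        }
        out
    }
}
```

With this scheme the common case costs one byte instead of `NUM_OF_OWNERS * 8`, which is the kind of wire/memory/storage split being described.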

Contributor

So the network is going to store this as Key = (Identity, type_tag) => Value = data, which is what I got from the initial discussion. If that is right, then here is the question: so everybody starts by creating Structured Data (SD) and only if there is an overflow (beyond 100 KB) does one need to turn to Immutable Data. Now say for storing JPEGs/texts etc. in the network, most might create an SD with the same Identifier (say sha512(data) for example) and type_tag (say 0, as most people will like to start from it if that is not reserved and available). So the storage is going to have a conflict, and the client user must change one of these (Identifier or type_tag) and retry. However, isn't this going to kill the concept of de-duplication for all data < 100 KB? Previously it would be stored as Immutable Data, which the network would happily de-duplicate.


Member Author

@ustulation Yes, but these will be paid for at the same price as a chunk, for much less than 1/10th of the space. This is due to the extra work on the network's part, but it makes up for any loss of de-duplication. Small data in datamaps does not deduplicate at the moment.

So for structured data < 100KB this makes sense; it will prove infeasible for somebody to store non-structured (Immutable) data in this manner, and that is probably good.

Member Author

So everybody starts by creating Structured Data (SD) and only if there is an overflow (beyond 100 KB) does one need to turn to Immutable Data.

This RFC is only for StructuredData, so if you have unstructured data like jpegs and files you would not look at this as an approach; you just use our existing self_encrypt methods. This proposal is for data that requires structure (more complex, more expensive than plain old immutable content): StructuredData, i.e. a database schema, Directory types, safecoin, login sessions, RDF / semantic info and so on. Content is handled as efficiently as it can be with self encrypt and datamaps (i.e. ImmutableData).

Contributor

OK. So are these flows correct then?
<1> For an unversioned Directory Listing (DL): form the usual hybrid-encrypted Data Map of the DL, which will be ImmutableData (ImD) -> the name of the immutable data will replace (if any) the existing name in the data field of the newly proposed Structured Data (SD) -> do the usual PUT/POST of this SD and PUT of the ImD into the network. Of course, if the Data Map is small then store/replace it directly in the data field of the SD.
<2> For a versioned DL: same as <1> above, except that the word replace changes to append. Further, if this append takes the size of the SD to > 100KB, then the last pointer in the data field will point to another ImD (newly created at this point) which will continue holding the names of new Data Maps.
So from the DL's perspective, is this the only difference, and is this correct?

Member Author

  1. Yes, this can be directly in the SD and may or may not be encrypted (for private or public data)
  2. Yes, although maybe the other way round, so the vector fills the data element, all the current versions become ImmutableData, and only the last one exists, with a pointer to older versions?

@dirvine
Member Author

dirvine commented Jun 30, 2015

@inetic

@dirvine, where exactly are the StructuredData and ImmutableData structures intended to be used inside Routing? Are they supposed to be passed as arguments to get, put and post methods and then serialised as bodies inside corresponding messages? GetData currently doesn't have any payload field.

I hope the answer to @ustulation's question above answers this; if not, shout.

@dirvine
Member Author

dirvine commented Jul 1, 2015

I am calling this now for a 12 hr merge alert: unless there are strong objections, this will be merged a minimum of 12 hours from now.

Contributor

This is just for seeking confirmation. Say the user wants versioning of the DirectoryListing. Now with every update, names of new datamaps (ImmutableData) are appended to the existing ones in the StructuredData. If this grows beyond 100KB, the last 64-byte Name will be the name of a new StructuredData (instead of the name of a datamap/ImmutableData) which will contain further versions (a sort of chaining). There can be many ways to do it because the data field completely belongs to the client. So can any rule be put there, or does there need to be a standardised process for this? E.g., one way is data = serialised(Option<NameType of previous StructuredData>, Vec<NameTypes of DataMaps>, Option<NameType of next StructuredData>). So does the client have flexibility to do all these tricks?
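The chaining layout described above could be sketched client-side like this — `NameType` and `VersionChain` are purely illustrative names, not anything from the RFC, with `NameType` standing in for a 64-byte network name:

```rust
// Hypothetical client-side layout for a versioned listing chained
// across multiple StructuredData blocks.
type NameType = [u8; 64];

struct VersionChain {
    /// Earlier StructuredData in the chain, if this block overflowed.
    previous: Option<NameType>,
    /// Names of the datamaps (ImmutableData), one per version.
    versions: Vec<NameType>,
    /// Later StructuredData in the chain, if one was created.
    next: Option<NameType>,
}
```

Since the data field belongs entirely to the client, this struct would simply be serialised into it; vaults never need to understand the layout.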

Member Author

So does the client have flexibility to do all these tricks?

Yes, especially here as Vaults require no info, so these are completely under the control of the client lib :-)

dirvine added a commit that referenced this pull request Jul 2, 2015
@dirvine dirvine merged commit 31a3568 into maidsafe:master Jul 2, 2015