This repository was archived by the owner on Nov 26, 2021. It is now read-only.

Conversation

@joelagnel

Pandas is extremely fast at parsing CSV into data frames. Astonishingly, it takes < 1s to serialize/deserialize 100MB worth of traces with 430,000 events to/from CSV. We leverage this by writing the data frames out to CSV files when they are first created; on subsequent parses we read them back from the cache if it exists. To make sure the cache isn't stale, we take the md5sum of the trace file and also ensure all CSVs exist before reading from the cache. I get a speed-up from 16s to 1s when parsing a 100MB trace.

Signed-off-by: Joel Fernandes <joelaf@google.com>
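
[For readers skimming the thread: a minimal sketch of the freshness check described above, with hypothetical helper names; the real trappy code evolves over the course of this review. Each event class's DataFrame is written to a per-event .csv under a cache directory on the first parse and read back with pandas on later parses, guarded by a check like this:

import hashlib
import os

def _trace_md5(trace_file):
    # Checksum the raw trace so that a regenerated trace invalidates the cache.
    with open(trace_file, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

def cache_is_fresh(cache_path, trace_file):
    # The cache is usable only if a stored md5sum exists and matches the
    # current trace file.
    md5file = os.path.join(cache_path, 'md5sum')
    if not os.path.exists(md5file):
        return False
    with open(md5file) as f:
        return f.read().strip() == _trace_md5(trace_file)]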

@joelagnel
Author

CC @bjackman

Contributor

@bjackman bjackman left a comment

This looks awesome, can't wait to start using it :)

A few comments from me:

  • We need a way to disable this for the unit tests, otherwise we won't really test the trace parsing code when we run them in a dirty environment. Or we could just have the tests delete all the relevant cache files before they run. I guess I'd prefer the former, since that could also be useful when testing interactively while hacking on trappy (a sketch of one way to do this follows this list).
  • Should add some unit tests for the caching functionality itself.
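
[One way to implement the opt-out from the first bullet; a hedged sketch where the environment-variable name is illustrative, not necessarily what trappy ended up with:

import os

def caching_enabled():
    # Lets the unit tests (or interactive hacking sessions) bypass the
    # cache entirely by exporting TRAPPY_DISABLE_CACHE=1 before running.
    return os.getenv('TRAPPY_DISABLE_CACHE', '0') != '1']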

trappy/ftrace.py Outdated

def _trace_cache_path(trace_file):
    cache_dir = '.' + os.path.basename(trace_file) + '.cache'
    cache_path = os.path.dirname(trace_file) + '/' + cache_dir
Contributor

@bjackman bjackman May 15, 2017

  • os.path.join (a sketch follows below)
  • This seems to be broken in the Travis CI; I'm not sure why, though
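
[A sketch of what is being asked for, combining os.path.join() with the abspath fix Joel identifies just below; the helper name matches the snippet above:

import os

def _trace_cache_path(trace_file):
    # abspath matters: for a bare filename, dirname() returns '', so the
    # original dirname + '/' + cache_dir built an absolute '/.foo.cache'
    # path and the later mkdir failed (the Travis breakage noted above).
    trace_file = os.path.abspath(trace_file)
    cache_dir = '.' + os.path.basename(trace_file) + '.cache'
    return os.path.join(os.path.dirname(trace_file), cache_dir)]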

Author

yeah will check why :(

Contributor

Also, please use os.path.join()

Author

Found why it broke them: I wasn't using abspath, which caused the mkdir to fail. Fixed it in the next rev.


md5sum = hashlib.md5(open(trace_file, 'rb').read()).hexdigest()
open(md5file, 'w').write(md5sum)

Contributor

How come these aren't methods of GenericFTrace or BareTrace?

Author

I felt these are just utility methods that don't really belong to those classes, so I'm just as happy keeping them in the module rather than within the classes.

Author

I think I can put them in GenericFTrace, that'll be fine.

Contributor

Cool, doesn't really matter, I was just curious.

trappy/ftrace.py Outdated
try:
    # Read csv into frames
    for trace_class in self.trace_classes:
        csv_file = cache_path + '/' + trace_class.unique_word[:-1] + '.csv'
Contributor

@bjackman bjackman May 15, 2017

  • os.path.join
  • I'm not too comfortable with the use of unique_word[:-1] - I wonder if we could just use trace_class.__name__ or something like that?

Author

@joelagnel joelagnel May 18, 2017

yes, great idea. I'll try to use the class name. The only issue is if the class name changes between trappy versions for some reason. But I guess that's rare.

Contributor

@bjackman bjackman May 18, 2017

Yeah, I don't think it's too much of a problem if the cache gets "invalidated" (especially since the cache parsing could theoretically change anyway). That actually got me thinking: maybe we should also add a ".cache_version" file in the cache directory. Then we could write "0" into that file and check it matches when reading back, so that if we decide to change the layout/format we can just change that 0 and not worry about compatibility? Worth the effort?
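
[A sketch of the versioning idea; the file name and the "0" constant follow the suggestion above, so this is illustrative, not the merged code:

import os

CACHE_VERSION = '0'  # bump whenever the cache layout/format changes

def _write_cache_version(cache_path):
    with open(os.path.join(cache_path, '.cache_version'), 'w') as f:
        f.write(CACHE_VERSION)

def _cache_version_matches(cache_path):
    # Treat a missing or mismatched version file as a stale cache.
    version_file = os.path.join(cache_path, '.cache_version')
    if not os.path.exists(version_file):
        return False
    with open(version_file) as f:
        return f.read().strip() == CACHE_VERSION]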

trappy/ftrace.py Outdated
for trace_class in self.trace_classes:
    csv_file = cache_path + '/' + trace_class.unique_word[:-1] + '.csv'
    trace_class.read_csv(csv_file)
except:
Contributor

@bjackman bjackman May 15, 2017

Maybe we should log (with the warnings module) a warning here? Would be annoying if we later broke this code and didn't notice for ages because an AttributeError was getting thrown away or something.

Author

ok sure

trappy/ftrace.py Outdated
cache_good = False

if cache_good:
    return
Contributor

If I'm not mistaken you can get rid of cache_good:

if _check_trace_cache(self.trace_path):
    try:
        for trace_class in self.trace_classes:
            csv_file = cache_path + '/' + trace_class.unique_word[:-1] + '.csv'
            trace_class.read_csv(csv_file)
        return
    except:
        pass

Also what if we e.g. successfully parse the CSV for trace class A (so its data_frame member is populated), then fail while parsing the CSV for class B? It looks like all the events for A will be duplicated in __parse_trace_file below.

Author

duplicated how?
data_frame will just be overwritten in that case
https://github.com/joelagnel/trappy/blob/97a29442e2c19bc5c09f73bee83f08d35406c907/trappy/base.py#L225
So I don't see how this is an issue

Contributor

@bjackman bjackman May 18, 2017

Ahhh yes you're quite right, I misunderstood the code, I thought it would keep appending to .data_frame

trappy/ftrace.py Outdated

_create_trace_cache(self.trace_path)
for trace_class in self.trace_classes:
    trace_class.write_csv(cache_path + '/' + trace_class.unique_word[:-1] + '.csv')
Contributor

@bjackman bjackman May 15, 2017

  • os.path.join
  • The creation of the CSV path is duplicated here, should probably put it in a method
  • This hunk should be a try/except too IMO (again maybe with a warning).

Author

Ok.

Author

Actually, I don't agree with the last point: why should this be a try/except? If we fail to write to the cache for some reason, I'd rather treat it as a failure and consider it a bug. Are you worried about the system running out of space?

Contributor

@JaviMerino JaviMerino left a comment

CSV is unnecessarily complicated. Can't we use cPickle or pickle instead? It would be much faster.

Taking it to another level, why not pickle the whole class (the GenericFTrace derivative)? This may be harder to do in code, but it would be more efficient.
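
[For reference, the per-DataFrame variant of this suggestion would look roughly like the following; pandas ships its own pickle helpers for DataFrames, and the sample frame here is illustrative. The thread below settles on CSV instead, mainly for inspectability:

import pandas as pd

df = pd.DataFrame({'ts': [0.1, 0.2], 'cpu': [0, 1]})

# Pickle round-trip in place of the CSV one:
df.to_pickle('cached_event.pkl')               # instead of df.to_csv(...)
restored = pd.read_pickle('cached_event.pkl')  # instead of pd.read_csv(...)
assert restored.equals(df)]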

trappy/ftrace.py Outdated
import os
import re
import pandas as pd
import hashlib, shutil
Contributor

One import per line please.

trappy/ftrace.py Outdated

def _trace_cache_path(trace_file):
    cache_dir = '.' + os.path.basename(trace_file) + '.cache'
    cache_path = os.path.dirname(trace_file) + '/' + cache_dir
Contributor

Also, please use os.path.join()

trappy/ftrace.py Outdated

def _check_trace_cache(trace_file):
    cache_path = _trace_cache_path(trace_file)
    md5file = cache_path + '/md5sum'
Contributor

os.path.join()

@joelagnel
Author

All great suggestions. Will fix up.

About pickle: I still prefer CSV because it's easier to inspect the cache in clear text, and I think it's more immune to the internal format of pandas objects than a binary representation would be. Also, I'm not sure how much faster we could get, since CSV is already so fast.

@joelagnel
Author

Also, from this link it looks like CSV is faster than pickle:
http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization

@joelagnel
Author

So this is the new version with all review comments addressed.

@@ -0,0 +1,52 @@
# Copyright 2015-2017 ARM Limited
Contributor

I'm not a lawyer but this doesn't look right to me 😛

Author

Hehe, will fix. In fact, is it OK to make this "ARM Limited, Google and Contributors" even?

Contributor

That sounds fine to my inexpert ears - actually it looks like the tests require "ARM Limited" to be in the copyright line so you'll need to change that test if you just want to mark something as solely Google copyright (which would also be fine IIUC).

Author

Ah ok, there's even a Google check just after that:
https://github.com/ARM-software/trappy/blob/master/tests/test_copyright.py#L48
So probably some files already have Google in their copyrights; I'll double-check.

trappy/ftrace.py Outdated
    trace_class.read_csv(csv_file)
except:
    warnings.warn("TRAPpy: Caching disabled due to cache parsing error")
    cache_good = False
Contributor

You can still get rid of cache_good (previous comment here)

Author

Yes, sorry will do.

for trace_class in self.trace_classes:
    csv_file = self._get_csv_path(trace_class)
    trace_class.write_csv(csv_file)

Contributor

[Continuing from previous thread, which Github won't let me reply to any more]

why this should be a try/except? If we fail to write to the cache for some reason, I'd rather treat it as a failure and consider it a bug. Are you worried about the system running out of space?

I was actually thinking of something like a permissions error - e.g. sometimes at ARM people share files via NFS mounts where only they have write permissions. Just in general the fact that you intuitively expect to be able to parse traces from read-only places.

Maybe just catch OSErrors?

Author

Ok, I'll catch OSError and print a warning in this case instead of just failing.
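
[Roughly what that looks like, reusing the helper names already quoted in this thread; a sketch of the method body, not the exact merged code:

import warnings

def _write_cache(self):
    try:
        _create_trace_cache(self.trace_path)
        for trace_class in self.trace_classes:
            trace_class.write_csv(self._get_csv_path(trace_class))
    except OSError as err:
        # e.g. a read-only NFS mount: warn and skip caching rather than
        # failing the whole parse.
        warnings.warn("TRAPpy: could not write cache: {}".format(err))]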

for trace_class in self.trace_classes:
    csv_file = self._get_csv_path(trace_class)
    trace_class.write_csv(csv_file)

Contributor

@bjackman bjackman May 30, 2017

That actually got me thinking: would Trappy be a bad OS citizen by dropping these ".cache" directories all around people's filesystems? E.g. on Linux, shouldn't we rather use /var/cache? That would introduce platform-specific stuff, so maybe it's not worth the hassle.

Author

@joelagnel joelagnel May 31, 2017

Yes, I think management of that cache directory (configuring its path etc.) would be a pain; I'd much rather just drop them into the current directory. The CSVs are also much smaller than the actual trace, so it's not a big space consumer in my experience.

trace_file = os.path.basename(trace_path)
cache_dir = '.' + trace_file + '.cache'

self.assertFalse(cache_dir in os.listdir(trace_dir))
Contributor

Would be good to also test (a sketch of the first follows this list):

  • That the data parsed from a valid cache is exactly the same as the data from the corresponding trace (I've noticed it's quite easy to do things like mess up column names when doing DataFrame<->CSV).
  • That the checksum test really works, i.e. that you don't get bogus data when there's an invalid cache.
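
[A sketch of the first of these, assuming a trace containing sched_switch events; the function name and assertion style are illustrative, not trappy's actual test code:

import trappy

def check_cache_round_trip(trace_path):
    # Cold parse: builds the CSV cache as a side effect.
    cold = trappy.FTrace(trace_path).sched_switch.data_frame.copy()
    # Warm parse of the same file: should be served from the cache.
    warm = trappy.FTrace(trace_path).sched_switch.data_frame
    # Same columns and same values, catching the column-name mixups
    # mentioned in the first bullet.
    assert list(warm.columns) == list(cold.columns)
    assert warm.equals(cold)]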

Author

I'll add these. Could you please share what trouble you had when converting CSV to DataFrame and vice versa?

Contributor

Sorry, I can't actually remember 😖 I just had a post-it note scribbled on the inside of my brain that said "double check pd.read_csv did what you thought it did".

Author

I added the test for the data, but I'm afraid for the checksum test I'm a bit worn out (I've worked every night for the last few nights on this) - I'd appreciate any help if you want to add more test cases. I think for now the tests I have, which check that the cached/uncached data are the same, that the column names match, and that caching can be disabled, give most of the coverage. I'm sure more things can be creatively added here.

Contributor

Yeah, this looks good to me. Thanks for adding it; I think it's really useful to have coverage like this.

@joelagnel
Author

Hopefully this one is merge-ready

Contributor

@bjackman bjackman left a comment

Thanks a lot @joelagnel, all of my (very demanding) comments are addressed! Looking forward to a SNAPpy TRAPpy (sorry..)


@bjackman
Contributor

bjackman commented Jun 2, 2017

I pushed an extra test, along with some microscopic fixes for stuff I noticed while writing it, to the branch on my repo here: https://github.com/bjackman/trappy/tree/for-trappy-cache

So I guess feel free to cherry-pick/squash that into your branch? Or you can leave it and I'll submit it separately once this is merged (I'm ill this week and cannot think clearly at all so the code may be nonsense).

@joelagnel
Author

joelagnel commented Jun 5, 2017

Squashed @bjackman's fixups except his new test case (which IMO can be added after this set is merged; I had to drop it because the test case itself had a bug). Pushed a new version. @JaviMerino @bjackman does it look good to you?

@bjackman
Contributor

bjackman commented Jun 5, 2017

This is good from my point of view, let's see what others think.

@bjackman
Contributor

bjackman commented Jun 8, 2017

I updated my branch (https://github.com/bjackman/trappy/tree/for-trappy-cache) with the missing file for the test I added. I also added another test for something I was confused about, and an extra check in _check_trace_cache, to avoid a warning if the set of parsed events has changed since the cache was written.

Also, AFAICT this currently breaks the normalize_time and window/abs_window params. Probably the simplest solution is just to add them to the md5sum or something like that?

edit: Does it also break the events parameter? I think this case is actually broken (as @credp suggested):

  1. Parse the trace with events=['foo']: the cache is created, but all .csv files except the one for "foo" are empty.
  2. Parse the trace again with events=['bar']: the DataFrame for "bar" is empty.

Maybe it's safest to just store & check the whole parameter list for __init__ (a sketch follows below).
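
[That is roughly what the eventual fix does - see the commit message near the end of this thread, which stores a JSON file of parse parameters in the cache. A minimal sketch, with an assumed file name:

import json
import os

def _write_cache_params(cache_path, params):
    # params: e.g. {'events': ['foo'], 'window': [0, None],
    #               'normalize_time': True}
    with open(os.path.join(cache_path, 'params.json'), 'w') as f:
        json.dump(params, f)

def _cache_params_match(cache_path, params):
    # Any mismatch (different events, window, etc.) invalidates the cache.
    params_file = os.path.join(cache_path, 'params.json')
    if not os.path.exists(params_file):
        return False
    with open(params_file) as f:
        return json.load(f) == params]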

@joelagnel
Author

CC @sinkap

@joelagnel
Author

Updated with @bjackman's latest commits. Thanks!
Will work on the other corner cases now (new event passed after cache creation). @bjackman, while I do that, could you fix the normalize_time issue you mentioned, if you have time? I am new to that code so I could use some help there. Thanks!

@joelagnel
Author

joelagnel commented Jun 14, 2017

@bjackman so I dropped "ftrace: Add explicit check for presence of each .csv file" and added a new commit which reads the cache partially. Your new test case for dynamic events added after caching passes with that.

@bjackman
Contributor

I updated my branch (https://github.com/bjackman/trappy/tree/for-trappy-cache) with

  • A couple more minor fixups I noticed
  • A couple more simple tests (which fail without the following commits)
  • A commit to re-order the caching operations, to fix normalize_time (and incidentally clean up the impact on finalize_objects)
  • A commit to store and check the window parameter.

@sinkap sinkap self-assigned this Jun 14, 2017
@sinkap sinkap force-pushed the master branch 3 times, most recently from 6f40d93 to a84dd23, June 14, 2017 21:41
@sinkap
Collaborator

sinkap commented Jun 16, 2017

@bjackman I will review your branch today, since it seems to have some more fixes. @joelagnel

@joelagnel
Author

joelagnel commented Jun 16, 2017

@bjackman I think you didn't update your branch (the last update was June 8 to https://github.com/bjackman/trappy/tree/for-trappy-cache, and I already picked up those fixes). Also, could you rebase on top of master when you push again? Thanks!

@bjackman
Contributor

Eek.. you're right I didn't push, and I can't find the commits on my desktop... I will have to check when I get to my laptop 😨

@sinkap
Collaborator

sinkap commented Jun 21, 2017

@bjackman feel free to ping when you have the updated commits.

@joelagnel
Author

@bjackman not having caching is killing me, can't wait to have this goodness in with your fixes ;D

@credp
Contributor

credp commented Jun 22, 2017

I think @joelagnel and @bjackman addressed Javi's change request, so I am going to dismiss that review to remove the big red cross 👍

@credp credp dismissed JaviMerino’s stale review June 22, 2017 08:01

The request to consider pickle instead of CSV has been considered and addressed.

@joelagnel
Author

thanks @credp

@credp
Contributor

credp commented Jun 26, 2017

So where are we with this PR? Looks just about ready, pending a push from @bjackman? (cc @joelagnel)

@sinkap
Collaborator

sinkap commented Jun 26, 2017 via email

@bjackman
Contributor

bjackman commented Jun 27, 2017

Hey guys, really sorry for causing delays here, I forgot about it then went on holiday. Will update tonight when I get home (where the lost commits are on my computer).

@sinkap
Collaborator

sinkap commented Jun 27, 2017

@bjackman Enjoy your holiday! 😄

I am merging #253. It would be best to rebase this :)

@bjackman
Contributor

OK, I've pushed the commits now: https://github.com/bjackman/trappy/tree/for-trappy-cache

Forgotten the details so I'll do a self-review tomorrow :)

@joelagnel
Author

joelagnel commented Jun 27, 2017 via email

@bjackman
Contributor

OK I updated the branch. The rebase was pretty awkward so I think the patch stack is messed up, but TBH I think we should squash them all anyway instead of having "add X", "fix X" in a single PR.

Joel Fernandes and others added 3 commits June 29, 2017 13:35
Pandas is extremely fast at parsing CSV into data frames. Astonishingly, it takes < 1s to serialize/deserialize 100MB worth of traces with 430,000 events to/from CSV. We leverage this by writing the data frames out to CSV files when they are first created; on subsequent parses we read them back from the cache if it exists. To make sure the cache isn't stale, we take the md5sum of the trace file and also ensure all CSVs exist before reading from the cache. I get a speed-up from 16s to 1s when parsing a 100MB trace.

Co-developed-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Joel Fernandes <joelaf@google.com>
Store a JSON file with the parameters used to parse a trace in the cache. If, on
retrieval, we find a mismatch then simply invalidate the cache.

A richer implementation might detect whether we can still use the cache (e.g. if
the requested window is a subset of the cached one), but for now I don't think
that's worth the effort/complexity.
Add test for invalid cache
Add test for caching with extra events
Add test for normalize_time and window parameters

Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Joel Fernandes <joelaf@google.com>
@joelagnel
Author

Thanks @bjackman, squashed as many as made sense to squash. HAPpy SNAPpy TRAPpy!

@sinkap
Collaborator

sinkap commented Jun 29, 2017

Merged!

@sinkap sinkap closed this Jun 29, 2017
valschneider pushed a commit to valschneider/trappy that referenced this pull request Sep 28, 2018
* libs/utils/android: reset airplane mode for Jankbench

Jankbench activates airplane mode before executing.
Set airplane mode back OFF after execution finishes.

Signed-off-by: Juri Lelli <juri.lelli@arm.com>