This repository was archived by the owner on Jun 7, 2023. It is now read-only.

Conversation

@bjones1 (Collaborator) commented Apr 24, 2021

I created this PR as a way to provide a glimpse of my progress in reviewing the server. I plan to squash a lot of these commits before it becomes a final PR.

class Config:
    orm_mode = True
    # This tells Pydantic to try to read anything with attributes as a model.
    # TODO: This doesn't make sense to me -- we're not using an ORM. What does this do?
@bnmnetp (Member)

I don't recall. It may be cruft from something I was experimenting with. I don't see anywhere in our code where we are using this class.

@bjones1 (Collaborator Author)

I'd suggest removing it -- shall I do so?

@bnmnetp (Member)

A little more research reminded me that this is how you configure Pydantic. See: https://pydantic-docs.helpmanual.io/usage/models/#orm-mode-aka-arbitrary-class-instances I'm not sure we want to remove it. I think a better understanding of how we are using Pydantic will inform the decision.

@bjones1 (Collaborator Author)

I saw that page, but don't see how it's relevant to what we're doing. My understanding of the approach:

  1. Given a SQLAlchemy table, auto-generate a Pydantic schema for it (see the sketch after this list).
  2. From an endpoint, put data from Ajax into the validator and check it before writing to the db. The validator will also handle conversion from strings to other data types if the data arrived as query params.
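For illustration, here's a minimal sketch of what step 1 could look like. Everything here is hypothetical -- the useinfo table, its columns, and table_to_pydantic are stand-ins, and a real version would need to handle Optional types for nullable columns, server defaults, and richer column types:

from pydantic import create_model
from sqlalchemy import Column, Integer, MetaData, String, Table

metadata = MetaData()

# A hypothetical table, purely for illustration.
useinfo = Table(
    "useinfo",
    metadata,
    Column("id", Integer, primary_key=True),
    Column("sid", String, nullable=False),
    Column("event", String, nullable=False),
)

def table_to_pydantic(table: Table):
    # Map each column to a (type, default) pair; Ellipsis marks the field required.
    fields = {
        col.name: (col.type.python_type, None if col.nullable else ...)
        for col in table.columns
    }
    return create_model(table.name.capitalize(), **fields)

UseinfoValidator = table_to_pydantic(useinfo)
# UseinfoValidator(id="1", sid="alice", event="page") coerces id to an int and
# raises a ValidationError if a required field is missing or the wrong type.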

@bnmnetp (Member)

So, we are not doing 1 yet. But I think we agree on that as the direction.

For 2, I think the other point is that the schema can define what we get from the client as well as validate it. Right now in web2py you really have no way of knowing what data might be in request.vars. Using one of these schema objects tells you immediately what you get. But I think that means we will end up creating additional schema objects that do not have a one-to-one mapping with the database.
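As a concrete illustration of that point, a request schema might look like the sketch below. This is only a guess at its shape, with the field list inferred from the AssessmentRequest usage that appears later in this thread:

from pydantic import BaseModel

class AssessmentRequest(BaseModel):
    event: str
    div_id: str
    course: str
    sid: str

An endpoint that declares this as its parameter type documents exactly what the client must send -- the visibility that request.vars never gave us.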

@bjones1 (Collaborator Author)

I agree -- creating additional schemas for things that don't map directly to a db table makes sense. The lack of any kind of validation in web2py code produced a lot of bugs/problems.

I mainly wanted to avoid duplicating schemas when the sqlalchemy Table structure already has that data (keep the code DRY and avoid mistakes where one place is updated but the other isn't).

@bjones1 (Collaborator Author) commented Apr 26, 2021

I'd like to sketch out an overall documentation approach with you -- I'll create a page on the Wiki for now as a jointly-edited space. Let's plan how the docs should look there, then move that structure to the code once we agree on it.

@bjones1 (Collaborator Author) commented Apr 26, 2021

I wrote a first pass at the outline at https://github.com/bnmnetp/BookServer/wiki/Overall-documentation-structure. Would you review and edit this?

rslogger.debug(f"Going to fetch {user_id}")
user = await fetch_user(user_id)
if user:
    # TODO: I don't understand -- why do this here? Are we validating an existing user object?
@bnmnetp (Member)

The issue this example raises is whether we should be returning database objects or whether we should be returning pydantic objects. Either will work of course, but I think we should decide what to work with and then be consistent.

Using Pydantic objects everywhere, except for what we get immediately after a query, provides some separation from the database representation. But that may be overkill.

@bjones1 (Collaborator Author)

I don't see what benefit we get from separation; the db data has already been validated and put in a known format, so wrapping it in Pydantic objects seems like the same result with additional steps.

@bnmnetp (Member)

One benefit we do get is that a Pydantic object is mypy-compatible as well as supporting autocomplete in some editors. Database results remain opaque. But that may not be enough of a benefit.
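A small illustration of that benefit (a sketch, with hypothetical names), which is also where the orm_mode Config from the earlier thread comes in -- from_orm reads attributes off an arbitrary object, such as a database result:

from pydantic import BaseModel

class UserSchema(BaseModel):
    id: int
    username: str

    class Config:
        orm_mode = True  # lets from_orm read attributes off non-dict objects

class FakeRow:
    # Stand-in for a database result with attribute access.
    id = 1
    username = "alice"

user = UserSchema.from_orm(FakeRow())
user.username.upper()  # mypy and editors know this is a str; a raw row is opaque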

@bjones1 (Collaborator Author)

I need to read up more on how FastAPI integrates with sqlalchemy. I'd also like to see an example of validating HTTP parameters, then storing that in the db, to see how the code ends up looking. That will help shape the "right" way to do things.

@bnmnetp (Member)

In routers/rslogging.py, log_book_event does just that, as does get_assessment_results in routers/assessment.py.

@bjones1 (Collaborator Author)

Thanks, I'll take a look.

@bjones1 (Collaborator Author)

I like the way these schemas work -- FastAPI is really nice. You did a great job finding a really solid framework!

@bjones1 force-pushed the main_review branch 4 times, most recently from d45a4e4 to c1e5a6e on April 27, 2021
@bjones1 (Collaborator Author) commented Apr 28, 2021

I spent some time digging through SQLAlchemy docs last night, and I think it's best to stay with the Core, instead of using the ORM:

  1. The Core stays closer to the underlying SQL. I find that to do something with the ORM, I have to understand the underlying SQL, then understand how to say that in the ORM. I'd rather avoid the second step!
  2. The ORM produces a lot of complexity -- assignment implies a future database update, then a read-back of the id, etc. It requires a nested async with instead of a single async with for the Core.
  3. The ORM requires extra care with async to avoid implicit I/O, another source of bugs.
  4. We can still use Tables to auto-create Pydantic models -- I played with this last night and got a rough prototype working.

Would you be willing to switch back to Core instead of using the ORM?

@bjones1 (Collaborator Author) commented Apr 28, 2021

I updated https://github.com/bnmnetp/BookServer/wiki/Overall-documentation-structure to add user docs and developer docs sections. Feedback would be great!

@bnmnetp (Member) commented Apr 28, 2021

Regarding Core vs ORM -- I didn't change the queries at all. I did change the inserts, but that was trivial.

The only messing around I had to do was to get the async stuff right. My understanding -- which seems to have played out -- is that 1.4 / 2.0 are unifying the Core and ORM APIs.

One thing I wanted to experiment with today was to see if the second async with block is really needed for the inserts.

Other than a day and a half of work, I'm not highly invested in either Core or ORM. But before we decide to switch back, I think it's worth looking at whether the automated mapping to Pydantic will even work with tables.

@bjones1 (Collaborator Author) commented Apr 28, 2021

Sounds good. I'll write up the mapper and test on tables, then let you know.

@bnmnetp (Member) commented Apr 28, 2021

The nested async with session.begin() seems like overkill. The tests I have still pass without it and the db is populated correctly.
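For reference, the two patterns being compared look roughly like this (a sketch; async_session and new_entry stand in for the project's actual session factory and model instance):

# Nested form: an explicit transaction block inside the session.
async with async_session() as session:
    async with session.begin():  # commits automatically on successful exit
        session.add(new_entry)

# Flat form: a plain session with an explicit commit.
async with async_session() as session:
    session.add(new_entry)
    await session.commit()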

@bjones1 (Collaborator Author) left a comment

Neat, thanks for checking. That's a good step forward toward simpler code!

@bjones1 (Collaborator Author) commented Apr 28, 2021

I just pushed code with the auto Pydantic model generation working from a Table, not a declarative base. (The code supports both styles, though.) If you're fine with that, let's move to Core instead of ORM / declarative.

@bjones1 (Collaborator Author) commented Apr 28, 2021

I'm getting a flake8 failure that session isn't defined in assessment.py. What's the correct fix?

@bjones1 (Collaborator Author) commented Apr 28, 2021

Also, this PR is now pretty much what I want it to be -- a good first pass through the code. What other changes would you like to see in it?

When you approve, I can squash this down to a nicer set of commits.

@bnmnetp (Member) commented Apr 28, 2021

You can comment out line 57 for now.... None of that code in the try block will work until I make further progress. I don't know why flake8 isn't complaining for me.... It clearly should be.

So, given that we do not need the double layer of async with, that the query API is the same for both Core and ORM, that I only converted to the ORM because that's what you had done with the Flask models, and that the examples of using async are written that way... what are your objections to the ORM? I'm just not seeing that much difference, and I want to be sure before I throw out the work I just did.

One thing that I actually like about the ORM is that it makes creating a new thing very easy. I want to look a bit more at the insert-or-update scenario, but that seems simpler to me also.

I think we should merge this PR before switching back if that is the final decision.

@bnmnetp marked this pull request as ready for review on April 28, 2021
@bjones1 (Collaborator Author) commented Apr 28, 2021

I now have the mypy checks passing, plus fixed a fair number of actual problems that mypy found.

@bnmnetp (Member) commented Apr 28, 2021

Doing a little more research on ORM vs Core.... I want to make the best decision using the 1.4 / 2.0 future as our reference... I'm certainly not a SQLAlchemy expert.

The select interface is unified and is the preferred interface for 2.0 -- I think this is exactly what you want, as it is the same as it was in Core for previous versions. And again, the queries I have in this code, which were written thinking we were going with Core, did not need to be modified! I think the ORM way of doing queries was its biggest drawback, and now that is gone.

I think session.merge(Object) would save us a fair amount of work and be far less error-prone in the insert-or-update use case. You modify the object, call merge on it, and commit. If the object does not exist (i.e., no id), then a new object is automatically created.

I think many of the issues with the ORM come from overuse or misunderstanding of relationships, which I have always found to be quite ugly. But this seems to be an area where we can do our own thing and not use that feature if we don't want to.

I don't see us running into the implicit I/O scenario, but I've just read about it and maybe I'm not seeing it clearly. Do you have a specific example from Runestone code where you see a risk of that?

Just a quick FYI -- the async with session.begin(): context manager eliminates the need for an explicit await session.commit() call; it does exactly what you would expect a context manager to do here and commits the transaction on exit.

We will need to be much more mindful of doing commits with SQLAlchemy, as we don't have the autocommit behavior of web2py's tight integration with the DAL, which implicitly committed after a successful API call. I think this is another reason why I like isolating as much of the database functionality in crud.py as we can.

@bjones1 (Collaborator Author) commented Apr 29, 2021

I am also not a SQLAlchemy expert -- I'm also reading about it. I've used it in the past, but much of that experience is no longer relevant. I'm also very excited about the new 1.4 / 2.0 approach where the Core and ORM are more unified. That's a big step forward.

Here are some rambling thoughts on databases with Python:

Design goals

  1. I want to stay as close to the underlying SQL as possible, since this is the language in which everything is actually implemented. Constructing difficult queries or debugging difficult problems always happens at the SQL level. Therefore, I want a Python interface that keeps me as close to the SQL as possible. Any time I spend translating from Python to SQL or SQL to Python is time wasted. The amount of time we've already spent trying to understand SQLAlchemy is just a foretaste of what we'll both be doing to make this work. The size of the docs is another hint of a lot of complexity to handle.
  2. I want to write in some sort of vendor-neutral SQL. Here, I like what SQLAlchemy / DAL / etc. do -- I can define tables/indices/queries in one way, then run those on many different databases. (Practically, I want my queries to work on both SQLite and Postgres.)
  3. I want an async connection to the database. SQLAlchemy certainly shines in this regard.
  4. I want to be able to compose queries programmatically, rather than gluing pieces of textual SQL together. I want everything automatically quoted to avoid SQL injection attacks. Again, most ORMs do well here.

So, my major design goal that's not clearly met by an ORM is staying as close as possible to the underlying SQL. To me, the SQLAlchemy section on Data Manipulation with the ORM provides a great example of how widely the two are separated. For example, a select actually means an update (to flush pending changes) followed by a select, with a RETURNING clause to get the new IDs of the objects. Aack! When something goes wrong, debugging this will be quite difficult. Both PostgreSQL and SQLite offer upsert-like syntax; I want something that provides this at the Python level, instead of a large amount of complex Python library code that produces difficult-to-follow SQL.

I agree, and am excited about the new SQLAlchemy features that use Python well -- for example, context managers that commit when closing are fantastic! I also agree that using relationships takes us farther from the underlying SQL. As you say, by not including these relationships in the tables, we can avoid this problem entirely. A plus for the SQLAlchemy ORM is that I prefer its style of defining tables over the Core style.

So, I think I'm still looking for a mythical third option. However, of existing implementations, either SQLAlchemy Core or DAL are probably our best alternatives.

It might also be helpful to look at examples of the common cases. Here's code I like:

from sqlalchemy import and_, select

async def fetch_last_answer_table_entry(query_data: schemas.AssessmentRequest):
    assessment = EVENT2TABLE[query_data.event]
    tbl = answer_tables[assessment]
    query = (
        select(tbl)
        .where(
            and_(
                tbl.div_id == query_data.div_id,
                tbl.course_name == query_data.course,
                tbl.sid == query_data.sid,
            )
        )
        .order_by(tbl.timestamp.desc())
    )
    async with async_session() as session:
        res = await session.execute(query)

    return res.scalars().first()

This is a great example of using programmatic composition to build a query with an obvious SQL outcome: a clear select statement, etc. In contrast, I'm not sure what SQL this generates:

async def create_answer_table_entry(log_entry: schemas.LogItem):
    values = { # omitted
    }
    tbl = answer_tables[EVENT2TABLE[log_entry.event]]
    new_entry = tbl(**values)
    async with async_session() as session:
        session.add(new_entry)
        await session.commit()  # without this, the add is never persisted
    return new_entry

I would prefer to write this as:

async def create_answer_table_entry(log_entry: schemas.LogItem):
    values = { # omitted
    }
    tbl = answer_tables[EVENT2TABLE[log_entry.event]]
    async with async_session() as session:
        await session.execute(insert(tbl), [values])
        await session.commit()

An upsert would be a more interesting case to look at.
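For reference, here's a minimal sketch of what a Core-style upsert can look like with the PostgreSQL dialect's insert construct (the id conflict target is a hypothetical choice; SQLite gained a similar on_conflict_do_update construct in SQLAlchemy 1.4):

from sqlalchemy.dialects.postgresql import insert

async def upsert_entry(session, tbl, values):
    stmt = insert(tbl).values(**values)
    # On a conflicting id, update the existing row in place instead of raising.
    stmt = stmt.on_conflict_do_update(
        index_elements=["id"],
        set_={k: v for k, v in values.items() if k != "id"},
    )
    await session.execute(stmt)
    await session.commit()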

@bnmnetp (Member) commented Apr 29, 2021

Thanks,

I think our disagreement may come down to a missing design goal and how much priority we place on it: I want a tool like SQLAlchemy to make me productive and reduce the amount of repetitive code I (and other less experienced developers) need to write.

So the example you liked is possible with ORM or Core and, unless I'm missing something, meets all four of your criteria as well as my additional one.

I actually prefer the first version of create_answer_table_entry. I know that add does an insert of the new object into the database. If I want to see the SQL, I can turn on the flag that will spew that out. But I think it is pretty apparent that you are creating a new object and adding it to the database.

For the upsert use case I prefer something like:

my_entry.value = newvalue
await session.merge(my_entry)  # inserts if my_entry has no id, otherwise updates
await session.commit()

This means I don't have to write a lot of error-prone, repetitive code that amounts to update my_table set my_entry = newvalue where id = whatever. It also means I don't have to compose a query first to see if the thing is already in the database.

If you really want to do an explicit insert, you can still do that with the ORM just by using tbl.__table__. I don't think that would be a good idea most of the time, but it is there.

I'd like to resolve this so we can get moving again. Any suggestions for resolving this decision?

@bnmnetp (Member) commented Apr 29, 2021

One more quick note: just from an aesthetic point of view, I really like that I don't have to say tbl.c.div_id everywhere. I can just say tbl.div_id when using the ORM.

@bjones1 (Collaborator Author) commented Apr 29, 2021

Hmm. In the end, we can't really know how this will all turn out; we can only make our best educated guesses. Let's go with the ORM and see what happens.

@bjones1 (Collaborator Author) commented Apr 29, 2021

I'd like to clean up this PR before merging it. What else would you like done here before the merge?

@bnmnetp (Member) commented Apr 29, 2021

I think this is fine as it is. We will continue to evolve.

I did think of one experiment I want to run that could tip the scales for me: performance. If inserting new rows into useinfo and the xxx_answers tables is significantly slower in ORM than in Core, then that is a big factor.

Our top three API calls are:

  1. hsblog -- insert into useinfo and the answers tables
  2. published -- look up a user and a course to serve a book page
  3. getAssessResults -- retrieve answers from the xxx_answers table

The top two each get double the traffic of the third.

I will check both ways and see how performance compares.

I also know the average response time on academy.

@bjones1 (Collaborator Author) commented Apr 29, 2021

Sounds good!

@bnmnetp (Member) commented Apr 29, 2021

Well, performance testing was pretty uninteresting....

I timed the insertion of 10,000 rows into the useinfo table.

I started by testing whether the extra async with versus an explicit commit made any difference in ORM mode; it did not. I then defined the useinfo table in Core mode and reran the test using await session.execute(insert(logtable).values(**new_log)).
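A rough sketch of what that timing loop might have looked like -- this is a guess at the harness, with async_session, logtable, and new_log standing in for the actual session factory, table, and row data:

from datetime import datetime

from sqlalchemy import insert

async def time_inserts(n=10_000):
    start = datetime.now()
    async with async_session() as session:
        for _ in range(n):
            await session.execute(insert(logtable).values(**new_log))
        await session.commit()
    print(f"For {n:,} on an empty useinfo table with explicit commit {datetime.now() - start}")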

I then reran the tests on a PostgreSQL database and was mildly surprised when they failed immediately on a foreign key violation... 🤣 yeah, SQLite does not enforce those. So PostgreSQL, with foreign keys enforced, was actually only slightly slower in this test than SQLite!

I had kind of expected the ORM to lose out because of the object creation, but was surprised that there wasn't really any difference. See the numbers below.

ORM, SQLite:      10,000 inserts into an empty useinfo table, context manager   0:00:58.647466
ORM, SQLite:      10,000 inserts into an empty useinfo table, explicit commit   0:00:58.679958
ORM, PostgreSQL:  10,000 inserts into an empty useinfo table, explicit commit   0:01:03.433731 (foreign keys enforced)
Core, SQLite:     10,000 inserts into an empty useinfo table, explicit commit   0:00:59.070297
Core, PostgreSQL: 10,000 inserts into an empty useinfo table, explicit commit   0:01:02.985901 (foreign keys enforced)

@bjones1 (Collaborator Author) commented Apr 29, 2021

Neat, good data. I'm glad to know there's not much performance difference.

@bnmnetp (Member) commented Apr 29, 2021

Is this ready to merge, or were you still going to squash commits?

@bjones1 (Collaborator Author) commented Apr 29, 2021

Let me do some squashing first...

@bjones1 merged commit d3defb1 into RunestoneInteractive:main on Apr 29, 2021
@bjones1 (Collaborator Author) commented Apr 29, 2021

OK, I squashed / cleaned the commits.

@bjones1 deleted the main_review branch on April 29, 2021