Add automated type checking#538

Draft
nateprewitt wants to merge 6 commits into fsspec:main from nateprewitt:type_checking

Conversation

@nateprewitt
Collaborator

This is a draft PR to serve as a talking point on how we'd enable automated type checking for current and future PRs. That will help enforce the existing typing and make sure we're keeping types up to date.

This PR touches a number of places in the code base, so I've split it into 3 distinct commits.

  • Enabling automated type checking with pyright; for now this is non-blocking for merge.
  • Fixing existing bugs/inaccurate typing in the code base.
  • Migrating remaining typing to PEP 585 and PEP 604.

There are some cases where we're adding inline ignores, so I'm going to comment on those specifically with context. We can discuss the tradeoffs there. We'll also need to get the other outstanding typing PRs merged before this is ready for final review.

Collaborator Author

@nateprewitt nateprewitt left a comment

Here's an initial overview of the changes.

adlfs/gen1.py Outdated
  def __setstate__(self, state):
      logger.debug("De-serialize with state: %s", state)
-     self.__dict__.update(state)
+     self.__dict__.update(state)  # type: ignore[reportAttributeAccessIssue]
Collaborator Author

This is an interaction between pyright and inheriting from an untyped third-party library. While this should be a generic Mapping/dict on an instance of the class, pyright is inferring it to be MappingProxyType[str, Any], which isn't guaranteed to have update.

I believe we either need to get this addressed at the pyright level or with typing in fsspec. Neither of those is likely to happen soon, so I've added the inline ignore.

I also know we may remove this gen1.py file at some point but this same issue occurs elsewhere in the changes.
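A minimal sketch of the workaround shape (the class and attribute names are hypothetical, not the adlfs code): a runtime no-op cast tells the checker that `__dict__` is a mutable dict rather than `MappingProxyType[str, Any]`.

```python
import pickle
from typing import Any, cast


class Restorable:
    """Hypothetical stand-in for a class inheriting from an untyped base."""

    def __init__(self, name: str) -> None:
        self.name = name

    def __setstate__(self, state: "dict[str, Any]") -> None:
        # cast() is a no-op at runtime; it only tells the type checker
        # that __dict__ here is a mutable dict, not MappingProxyType.
        cast("dict[str, Any]", self.__dict__).update(state)


obj = pickle.loads(pickle.dumps(Restorable("gen1")))
print(obj.name)  # -> gen1
```

An explicit cast like this is an alternative to the inline ignore; both change nothing at runtime.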

Comment on lines +129 to 132
  service_client_kwargs: dict[str, Any] = {
      "account_url": account_url,
      "user_agent": _USER_AGENT,
  }
Collaborator Author

This needs to be explicitly typed because values are later mutated to types beyond the inferred dict[str, str]. This is the most sensible future-proof option since it's kwargs. If we really want to be strict, we can enumerate all of the current types, but that may need to be updated in the future.
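A minimal illustration of the inference problem (the account values here are fake, not adlfs's):

```python
from typing import Any

# Without the explicit dict[str, Any] annotation, the checker would infer
# dict[str, str] from the literal and reject the int-valued mutation below.
service_client_kwargs: "dict[str, Any]" = {
    "account_url": "https://example.blob.core.windows.net",
    "user_agent": "example-agent",
}

# Mutating with a non-str value is fine under dict[str, Any]:
service_client_kwargs["max_single_put_size"] = 4 * 1024 * 1024
```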


  if max_concurrency is None:
-     batch_size = _get_batch_size()
+     batch_size: int = _get_batch_size()  # type: ignore[assignment]
Collaborator Author

We know this is a number in all cases (we do a numerical comparison on the next line), but _get_batch_size() can technically return None at the fsspec layer. This declares the type we know it to be and that the code is built around.

We could add special handling for None if we want; I just wasn't able to find a way to actually trigger it for adlfs.

Comment on lines +392 to +393
  else:
      max_concurrency = 1
Collaborator Author

This is resolving a configuration edge case. self.max_concurrency is the fallback if max_concurrency is None. That will be an int in all cases except if the operating system returns a batch_size < 1. At that point we never initialize the value correctly, and the SDK errors out because it didn't receive an int.

This change moves us to single request throughput (also the SDK default) if we can't determine a batch_size from the OS. That is what would have been intended by None (use the default) before.
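The fallback described above could be condensed roughly like this (the function name and shape are illustrative, not the adlfs code):

```python
from typing import Optional


def resolve_max_concurrency(max_concurrency: Optional[int], batch_size: int) -> int:
    # Hypothetical condensation of the fallback logic described above.
    if max_concurrency is None:
        if batch_size > 1:
            max_concurrency = batch_size
        else:
            # No usable batch size from the OS: fall back to a single
            # request (the SDK default), matching what None intended.
            max_concurrency = 1
    return max_concurrency


print(resolve_max_concurrency(None, 0))  # -> 1
```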

Collaborator

I'm a little hesitant to change this fallback value as part of supporting type checking, especially if type checking can pass without this change. Mainly I would want to understand the intention of a return value of zero or negative from _get_batch_size() and compare with other fsspec implementations before committing to how it is interpreted.

Collaborator Author

I have a PR open with the Azure SDK that should help fix this issue. The APIs are inconsistently typed: some allow None, others don't. A different subset actually works with None (this is one of them).

Once we get that aligned, we should be fine to remove this piece.

Collaborator Author

We got 3 PRs merged upstream. Once the next Python SDK release goes out, we should be able to remove this.

"""
if isinstance(path, list):
return [cls._strip_protocol(p) for p in path]
return [cls._strip_protocol(p) for p in path] # type: ignore[return-value]
Collaborator Author

We already explicitly require path to be a str. The upstream fsspec version of this also only uses a str but has this code path in the event we somehow get a list. This block keeps parity with upstream but isn't possible through the current public API for adlfs.

We can introduce a change that records a return type of str | list[str], but it significantly complicates the typing story without providing more accuracy.
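For reference, recording both return types would look roughly like this with overloads (the helper name and protocol prefix are illustrative, not the adlfs signature):

```python
from typing import overload


@overload
def strip_protocol(path: str) -> str: ...
@overload
def strip_protocol(path: "list[str]") -> "list[str]": ...
def strip_protocol(path):
    # Recurse for lists, mirroring the upstream fsspec code path.
    if isinstance(path, list):
        return [strip_protocol(p) for p in path]
    return path.removeprefix("abfs://")


print(strip_protocol("abfs://container/file"))  # -> container/file
```

Callers with a str still get a str back from the checker, which is why the overloads are more precise than a plain str | list[str] return type.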

- def expand_path(self, path, recursive=False, maxdepth=None, skip_noexist=True):
-     return sync(
+ def expand_path(self, path, recursive=False, maxdepth=None, skip_noexist=True) -> list[str]:
+     return sync(  # type: ignore[return-value]
Collaborator Author

sync is defined in fsspec without a return type. It's effectively acting as a proxy from async to sync, so we could likely do something with generics for this. The only issue is that the best ways to do that require 3.12+.

For the time being, we know this is the same return type, so it's ignored.

Comment on lines +2038 to +2041
  if block_size == "default" or block_size is None:
      self.blocksize: int = self.DEFAULT_BLOCK_SIZE
  else:
      self.blocksize = block_size
Collaborator Author

This is a limitation of pyright. The change doesn't have any notable performance impact; it just allows the type checker to reason about the assignment. pyright isn't currently able to extrapolate the in check to the equivalent chain of comparisons.
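A minimal illustration of the pattern (the constant value and the Literal annotation are assumptions for the sketch, not the adlfs code):

```python
from typing import Literal, Union

DEFAULT_BLOCK_SIZE = 5 * 2**20  # illustrative value


def pick_blocksize(block_size: Union[int, Literal["default"], None]) -> int:
    # Written as explicit comparisons instead of
    # `block_size in ("default", None)` so pyright can narrow
    # block_size down to int on the fall-through path.
    if block_size == "default" or block_size is None:
        return DEFAULT_BLOCK_SIZE
    return block_size
```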

  try:
      if hasattr(self.fs, "account_host"):
-         self.fs.account_url: str = f"https://{self.fs.account_host}"
+         self.fs.account_url = f"https://{self.fs.account_host}"
Collaborator Author

These aren't valid type declarations because they're annotating self.fs, not a member of the class directly. The correct typing already lives on the fs implementation.

  await bc.stage_block(
      block_id=block_id,
-     data=data[start:end],
+     data=data[start:end],  # type: ignore[arg-type]
Collaborator Author

I talked with the Azure Python SDK folks and am going to open a PR about this. The current definition doesn't support memoryview or bytearray, which are both valid inputs. Once that's fixed there, this ignore is no longer needed.

      data=data,
      length=length,
-     blob_type=BlobType.AppendBlob,
+     blob_type=BlobType.APPENDBLOB,
Collaborator Author

AppendBlob is an alias pointing to this Enum but isn't explicitly declared as a member. That leads the type checker to infer it as Unknown. We'll use the Enum member directly instead of the alias; it's semantically identical.
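A minimal illustration of the alias behavior (this is a toy enum, not the Azure SDK's BlobType):

```python
from enum import Enum


class BlobType(str, Enum):
    APPENDBLOB = "AppendBlob"
    BLOCKBLOB = "BlockBlob"


# A module-level alias like this is the kind of thing a library exposes for
# backwards compatibility; because it isn't a declared enum member, a type
# checker may infer its type as Unknown even though at runtime it is
# exactly the same object as the member.
AppendBlob = BlobType.APPENDBLOB

print(AppendBlob is BlobType.APPENDBLOB)  # -> True
```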

Comment on lines +1 to +6
version: str
__version__: str
version_tuple: tuple[int | str, ...]
__version_tuple__: tuple[int | str, ...]
commit_id: str | None
__commit_id__: str | None
Collaborator Author

This covers the _version.py generated by setuptools_scm at installation time. That file doesn't exist in the repository, so CI complains it can't resolve typing from the utils method. You can find a similar setup in Black to handle this.

Collaborator

@kyleknap kyleknap left a comment

Looks good. I just had a few comments.


  type-check:
    runs-on: "ubuntu-22.04"
    continue-on-error: true
Collaborator

Is there a particular reason this was not added to the pre-commit hook and requirements/dev.txt file? Was it to allow continuing when there is an error? That seems to be where most of the code checkers are, and it would also mean anyone who uses the pre-commit hooks will be able to run the type checker as well.

Collaborator Author

No, we can definitely add them there. This was originally written to get testing bootstrapped on a machine other than my local one, but we should make sure it's easily reproducible everywhere.

The continue-on-error was to avoid blocking PRs just because there was a typing issue. I'm indifferent on whether we want to keep that. What I wanted to avoid was people adding ignores or making the types potentially worse just to get around the type checker.

