Conversation
nateprewitt
left a comment
There was a problem hiding this comment.
Here's an initial overview of the changes.
adlfs/gen1.py
Outdated
| def __setstate__(self, state): | ||
| logger.debug("De-serialize with state: %s", state) | ||
| self.__dict__.update(state) | ||
| self.__dict__.update(state) # type: ignore[reportAttributeAccessIssue] |
There was a problem hiding this comment.
This is an interaction with pyright and inheriting from a 3p untyped library. While this should be a generic Mapping/dict as an instance of the class, pyright is inferring it to be MappingProxyType[str, Any] which isn't guaranteed to have update.
I believe we either need to get this addressed at the pyright level or with typing in fsspec. Neither of those are likely to happen soon, so I've commented this out.
I also know we may remove this gen1.py file at some point but this same issue occurs elsewhere in the changes.
| service_client_kwargs: dict[str, Any] = { | ||
| "account_url": account_url, | ||
| "user_agent": _USER_AGENT, | ||
| } |
There was a problem hiding this comment.
This needs to be explicitly typed because values are mutated later to more than the inferred type of dict[str, str]. This is the most sensible future proof option since it's kwargs. If we really want to be strict, we can enumerate all of the current types but that may need to be updated in the future.
|
|
||
| if max_concurrency is None: | ||
| batch_size = _get_batch_size() | ||
| batch_size: int = _get_batch_size() # type: ignore[assignment] |
There was a problem hiding this comment.
We know this is an number in all cases (we do numerical comparison on the next line) but _get_batch_size() can technically return None at the fsspec layer. This declares the type we know it to be and the code is built around.
We could add special handling for None if we want, I just wasn't able to find a way to actually trigger it for adlfs.
| else: | ||
| max_concurrency = 1 |
There was a problem hiding this comment.
This is resolving a configuration edge case. self.max_concurrency is the fallback if max_concurrency is None. That will be an int in all cases except if the operating system returns a batch_size < 1. At that point we never initialize the value correctly, and the SDK errors out because it didn't receive an int.
This change moves us to single request throughput (also the SDK default) if we can't determine a batch_size from the OS. That is what would have been intended by None (use the default) before.
There was a problem hiding this comment.
I'm a little hesitant to change this fallback value as part of supporting type checking, especially if type checking can pass without this change. Mainly I would want to understand the intention of a return value of zero or negative from _get_batch_size() and compare with other fsspec implementations before committing to how it is interpreted.
There was a problem hiding this comment.
I have a PR open with the Azure SDK to hopefully help fix this issue. The APIs are inconsistently typed, some allow None, others don't. A different subset actually works with None (this is one of them).
Once we get that aligned, we should be fine to remove this piece.
There was a problem hiding this comment.
We got 3 PRs merged upstream. Once the next Python SDK release goes out, we should be able to remove this.
| """ | ||
| if isinstance(path, list): | ||
| return [cls._strip_protocol(p) for p in path] | ||
| return [cls._strip_protocol(p) for p in path] # type: ignore[return-value] |
There was a problem hiding this comment.
We already explicitly require path to be a str. The upstream fsspec version of this also only uses a str but has this code path in the event we somehow get a list. This block keeps parity with upstream but isn't possible through the current public API for adlfs.
We can introduce a change that records a return type of str | list[str] but it significantly complicates the typing story without providing more accuracy.
| def expand_path(self, path, recursive=False, maxdepth=None, skip_noexist=True): | ||
| return sync( | ||
| def expand_path(self, path, recursive=False, maxdepth=None, skip_noexist=True) -> list[str]: | ||
| return sync( # type: ignore[return-value] |
There was a problem hiding this comment.
sync is defined in fsspec without a return type. It's effectively acting as a proxy from async -> sync so we could likely do something with generics for this. The only issue is the best ways to do that require 3.12+.
For the time being, we know this is the same return type, so it's ignored.
| if block_size == "default" or block_size is None: | ||
| self.blocksize: int = self.DEFAULT_BLOCK_SIZE | ||
| else: | ||
| self.blocksize = block_size |
There was a problem hiding this comment.
This is a limitation of pyright. The code doesn't have any notable performance changes, it just allows the type checker to reason. pyright isn't currently able to extrapolate the in check to equivalence.
| try: | ||
| if hasattr(self.fs, "account_host"): | ||
| self.fs.account_url: str = f"https://{self.fs.account_host}" | ||
| self.fs.account_url = f"https://{self.fs.account_host}" |
There was a problem hiding this comment.
These aren't valid type definitions because they're modifying self.fs not a member of the class directly. The correct typing already lives on the fs implementation.
| await bc.stage_block( | ||
| block_id=block_id, | ||
| data=data[start:end], | ||
| data=data[start:end], # type: ignore[arg-type] |
There was a problem hiding this comment.
I talked with the Azure Python SDK folks and am going to open a PR about this. The current definition doesn't support memoryview or bytearray which are both valid inputs. Once that's fixed there, this ignore is no longer needed.
| data=data, | ||
| length=length, | ||
| blob_type=BlobType.AppendBlob, | ||
| blob_type=BlobType.APPENDBLOB, |
There was a problem hiding this comment.
AppendBlob is an alias pointing to this Enum but isn't explicitly declared as a member. That's leading the type checker to infer it's Unknown. We'll use the Enum directly instead of the alias. It's semantically identical.
| version: str | ||
| __version__: str | ||
| version_tuple: tuple[int | str, ...] | ||
| __version_tuple__: tuple[int | str, ...] | ||
| commit_id: str | None | ||
| __commit_id__: str | None |
There was a problem hiding this comment.
This covers the _version.py generated by setuptools_scm at installation time. That doesn't exist in the repository so CI complains it can't resolve typing from the utils method. You can find a similar setup in Black to handle this.
kyleknap
left a comment
There was a problem hiding this comment.
Looks good. I just had a few comments.
|
|
||
| type-check: | ||
| runs-on: "ubuntu-22.04" | ||
| continue-on-error: true |
There was a problem hiding this comment.
Is there a particular reason this was not added to the pre-commit hook and requirements/dev.txt file? Was it for allowing to continue when there is an error? Seems like that is where most of the code checkers are and it would also mean anyone who is use the pre-commit hooks will be able to run the type checker as well.
There was a problem hiding this comment.
No we can definitely add them there. This was originally written to get testing bootstrapped on a machine other than my local one but we should make sure it's easily reproducible everywhere.
The continue-on-error was to not block PRs just because there was a typing issue. Whether we want to keep that, I'm indifferent on. What I wanted to avoid was people adding #ignore or making the types potentially worse to get around the typechecker.
| else: | ||
| max_concurrency = 1 |
There was a problem hiding this comment.
I'm a little hesitant to change this fallback value as part of supporting type checking, especially if type checking can pass without this change. Mainly I would want to understand the intention of a return value of zero or negative from _get_batch_size() and compare with other fsspec implementations before committing to how it is interpreted.
This is a draft PR to serve as a talking point on how we'd enable automated type checking for current and future PRs. That will help enforce the existing typing and make sure we're keeping types up to date.
This PR touches a number of places in the code base so I've split it into 3 distinct commits.
There are some cases where we're adding inline ignores so I'm going to comment on those specifically with context. We can discuss the tradeoffs there. We'll also need to get the other outstanding type PRs merged before this is ready for final review.