Commit a82cfa3

Add support for large file uploads in multi-part mode (#106)
- Implement LFS support for large files
- Add command to enable LFS in repos
- Create multipart upload functionality
- Update CLI documentation for new commands
- Revise README for command usage examples
1 parent 42a3b2a commit a82cfa3
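
The commit message names multi-part upload, but this view does not include its implementation. As a rough sketch of the general technique only (not the code in this commit), a large file can be split into fixed-size byte ranges that are uploaded independently. `plan_parts`, `read_part`, and the 64 MiB part size are hypothetical names and values chosen for illustration.

```python
import math
from pathlib import Path
from typing import List, Tuple

# Hypothetical part size; the value used by the actual commit is not shown here.
PART_SIZE = 64 * 1024 * 1024


def plan_parts(file_size: int, part_size: int = PART_SIZE) -> List[Tuple[int, int, int]]:
    """Return (part_number, offset, length) for each part, numbered from 1."""
    if file_size < 0 or part_size <= 0:
        raise ValueError("file_size must be >= 0 and part_size must be > 0")
    count = math.ceil(file_size / part_size)
    return [
        (i + 1, i * part_size, min(part_size, file_size - i * part_size))
        for i in range(count)
    ]


def read_part(path: Path, offset: int, length: int) -> bytes:
    """Read one part from disk so it can be sent as a separate request."""
    with path.open("rb") as fh:
        fh.seek(offset)
        return fh.read(length)
```

Each tuple would map to one upload request. An uploader built this way could record the numbers of completed parts and skip them after an interruption.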

10 files changed: +636 -461 lines changed

README.md

Lines changed: 2 additions & 227 deletions
@@ -85,236 +85,11 @@ pip install '.[train]'
 
 ## Use cases of command line
 
-```shell
-export CSGHUB_TOKEN=your_access_token
-
-# download model
-csghub-cli download OpenCSG/csg-wukong-1B
-
-# download model with allow patterns '*.json' and ignore '*_config.json' pattern of files
-csghub-cli download OpenCSG/csg-wukong-1B --allow-patterns "*.json" --ignore-patterns "tokenizer.json"
-
-# download model with ignore patterns '*.json' and '*.bin' pattern of files to /Users/hhwang/temp/wukong
-csghub-cli download OpenCSG/csg-wukong-1B --allow-patterns "*.json" --ignore-patterns "tokenizer.json" --local-dir /Users/hhwang/temp/wukong
-
-# download dataset
-csghub-cli download OpenCSG/GitLab-DataSets-V1 -t dataset
-
-# download space
-csghub-cli download OpenCSG/csg-wukong-1B -t space
-
-# upload local large folder '/Users/hhwang/temp/abc' to model repo 'wanghh2000/model05'
-csghub-cli upload-large-folder wanghh2000/model05 /Users/hhwang/temp/abc
-
-# list inference instances for user 'wanghh2000'
-csghub-cli inference list -u wanghh2000
-
-# start inference instance for model repo 'wanghh2000/Qwen2.5-0.5B-Instruct' with ID '1358'
-csghub-cli inference start wanghh2000/Qwen2.5-0.5B-Instruct 1358
-
-# stop inference instance for model repo 'wanghh2000/Qwen2.5-0.5B-Instruct' with ID '1358'
-csghub-cli inference stop wanghh2000/Qwen2.5-0.5B-Instruct 1358
-
-# list fine-tuning instances for user 'wanghh2000'
-csghub-cli finetune list -u wanghh2000
-
-# start fine-tuning instance for model repo 'OpenCSG/csg-wukong-1B' with ID '326'
-csghub-cli finetune start OpenCSG/csg-wukong-1B 326
-
-# stop fine-tuning instance for model repo 'OpenCSG/csg-wukong-1B' with ID '326'
-csghub-cli finetune stop OpenCSG/csg-wukong-1B 326
-
-# upload a single file to folder1
-csghub-cli upload wanghh2000/myprivate1 abc/3.txt folder1
-
-# upload local folder '/Users/hhwang/temp/jsonl' to root path of repo 'wanghh2000/m01' with default branch
-csghub-cli upload wanghh2000/m01 /Users/hhwang/temp/jsonl
-
-# upload local folder '/Users/hhwang/temp/jsonl' to root path of repo 'wanghh2000/m04' with token 'xxxxxx' and v2 branch
-csghub-cli upload wanghh2000/m04 /Users/hhwang/temp/jsonl -k xxxxxx --revision v2
-
-# upload local folder '/Users/hhwang/temp/jsonl' to path 'test/files' of repo 'wanghh2000/m01' with branch v1
-csghub-cli upload wanghh2000/m01 /Users/hhwang/temp/jsonl test/files --revision v1
-
-# upload local folder '/Users/hhwang/temp/jsonl' to path 'test/files' of repo 'wanghh2000/m01' with token 'xxxxxx'
-csghub-cli upload wanghh2000/m01 /Users/hhwang/temp/jsonl test/files -k xxxxxx
-```
-
-Notes:
-- `csghub-cli upload` will create repo and its branch if they do not exist. The default branch is `main`. If you want to upload to a specific branch, you can use the `--revision` option. If the branch does not exist, it will be created. If the branch already exists, the files will be uploaded to that branch.
-- `csghub-cli upload` has a limitation of the file size to 4GB. If you need to upload larger files, you can use the `csghub-cli upload-large-folder` command.
-
-When using the `upload-large-folder` command to upload a folder, the upload progress will be recorded in the `.cache` folder within the upload directory to support resumable uploads. Do not delete the `.cache` folder before the upload is complete.
-
-Download location is `~/.cache/csg/` by default.
+For detailed command line usage examples, including downloading models/datasets, uploading files/folders, and managing inference/fine-tuning instances, please refer to our [CLI documentation](doc/cli.md).
 
 ## Use cases of SDK
 
-For more detailed instructions, including API documentation and usage examples, please refer to the Use case.
-
-### Download model
-
-```python
-from pycsghub.snapshot_download import snapshot_download
-token = "your_access_token"
-
-endpoint = "https://hub.opencsg.com"
-repo_id = 'OpenCSG/csg-wukong-1B'
-cache_dir = '/Users/hhwang/temp/'
-result = snapshot_download(repo_id, cache_dir=cache_dir, endpoint=endpoint, token=token)
-```
-
-### Download model with allow patterns '*.json' and ignore '*_config.json' pattern of files
-
-```python
-from pycsghub.snapshot_download import snapshot_download
-token = "your_access_token"
-
-endpoint = "https://hub.opencsg.com"
-repo_id = 'OpenCSG/csg-wukong-1B'
-cache_dir = '/Users/hhwang/temp/'
-allow_patterns = ["*.json"]
-ignore_patterns = ["*_config.json"]
-result = snapshot_download(repo_id, cache_dir=cache_dir, endpoint=endpoint, token=token, allow_patterns=allow_patterns, ignore_patterns=ignore_patterns)
-```
-
-### Download dataset
-```python
-from pycsghub.snapshot_download import snapshot_download
-token="xxxx"
-endpoint = "https://hub.opencsg.com"
-repo_id = 'AIWizards/tmmluplus'
-repo_type="dataset"
-cache_dir = '/Users/xiangzhen/Downloads/'
-result = snapshot_download(repo_id, repo_type=repo_type, cache_dir=cache_dir, endpoint=endpoint, token=token)
-```
-
-### Download single file
-
-Use `http_get` function to download single file
-
-```python
-from pycsghub.file_download import http_get
-token = "your_access_token"
-
-url = "https://hub.opencsg.com/api/v1/models/OpenCSG/csg-wukong-1B/resolve/tokenizer.model"
-local_dir = '/home/test/'
-file_name = 'test.txt'
-headers = None
-cookies = None
-http_get(url=url, token=token, local_dir=local_dir, file_name=file_name, headers=headers, cookies=cookies)
-```
-
-use `file_download` function to download single file from a repository
-
-```python
-from pycsghub.file_download import file_download
-token = "your_access_token"
-
-endpoint = "https://hub.opencsg.com"
-repo_id = 'OpenCSG/csg-wukong-1B'
-cache_dir = '/home/test/'
-result = file_download(repo_id, file_name='README.md', cache_dir=cache_dir, endpoint=endpoint, token=token)
-```
-
-### Upload file
-
-```python
-from pycsghub.file_upload import http_upload_file
-
-token = "your_access_token"
-
-endpoint = "https://hub.opencsg.com"
-repo_type = "model"
-repo_id = 'wanghh2000/myprivate1'
-result = http_upload_file(repo_id, endpoint=endpoint, token=token, repo_type='model', file_path='test1.txt')
-```
-
-### Upload multi-files
-
-```python
-from pycsghub.file_upload import http_upload_file
-
-token = "your_access_token"
-
-endpoint = "https://hub.opencsg.com"
-repo_type = "model"
-repo_id = 'wanghh2000/myprivate1'
-
-repo_files = ["1.txt", "2.txt"]
-for item in repo_files:
-    http_upload_file(repo_id=repo_id, repo_type=repo_type, file_path=item, endpoint=endpoint, token=token)
-```
-
-### Upload the local path to repo
-
-Before starting, please make sure you have Git-LFS installed (see [here](https://git-lfs.github.com/) for installation instructions).
-
-```python
-from pycsghub.repository import Repository
-
-token = "your access token"
-
-r = Repository(
-    repo_id="wanghh2003/ds15",
-    upload_path="/Users/hhwang/temp/bbb/jsonl",
-    user_name="wanghh2003",
-    token=token,
-    repo_type="dataset",
-)
-
-r.upload()
-```
-
-### Upload the local path to the specified path in the repo
-
-Before starting, please make sure you have Git-LFS installed (see [here](https://git-lfs.github.com/) for installation instructions).
-
-```python
-from pycsghub.repository import Repository
-
-token = "your access token"
-
-r = Repository(
-    repo_id="wanghh2000/model01",
-    upload_path="/Users/hhwang/temp/jsonl",
-    path_in_repo="test/abc",
-    user_name="wanghh2000",
-    token=token,
-    repo_type="model",
-    branch_name="v1",
-)
-
-r.upload()
-```
-
-### Model loading compatible with huggingface
-
-The transformers library supports directly inputting the repo_id from Hugging Face to download and load related models, as shown below:
-
-```python
-from transformers import AutoModelForCausalLM
-model = AutoModelForCausalLM.from_pretrained('model/repoid')
-```
-
-In this code, the Hugging Face Transformers library first downloads the model to a local cache folder, then reads the configuration, and loads the model by dynamically selecting the relevant class for instantiation.
-
-To ensure compatibility with Hugging Face, version 0.2 of the CSGHub SDK now includes the most commonly features: downloading and loading models. Models can be downloaded and loaded as follows:
-
-```python
-# import os
-# os.environ['CSGHUB_TOKEN'] = 'your_access_token'
-from pycsghub.repo_reader import AutoModelForCausalLM
-model = AutoModelForCausalLM.from_pretrained('model/repoid')
-```
-
-This code:
-
-1. Use the `snapshot_download` from the CSGHub SDK library to download the related files.
-
-2. By generating batch classes dynamically and using class name reflection mechanism, a large number of classes with the same names as those automatically loaded by transformers are created in batches.
-
-3. Assign it with the from_pretrained method, so the model read out will be an hf-transformers model.
+For detailed SDK usage examples, including model/dataset downloading, file uploading, directory uploading, and Hugging Face compatible model loading, please refer to our [SDK documentation](doc/sdk.md).
 
 ## Roadmap
 
