Coverage for /pythoncovmergedfiles/medio/medio/usr/local/lib/python3.11/site-packages/dulwich/line_ending.py: 55%

84 statements  

# line_ending.py -- Line ending conversion functions
# Copyright (C) 2018-2018 Boris Feld <boris.feld@comet.ml>
#
# SPDX-License-Identifier: Apache-2.0 OR GPL-2.0-or-later
# Dulwich is dual-licensed under the Apache License, Version 2.0 and the GNU
# General Public License as published by the Free Software Foundation; version 2.0
# or (at your option) any later version. You can redistribute it and/or
# modify it under the terms of either of these two licenses.
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# You should have received a copy of the licenses; if not, see
# <http://www.gnu.org/licenses/> for a copy of the GNU General Public License
# and <http://www.apache.org/licenses/LICENSE-2.0> for a copy of the Apache
# License, Version 2.0.
#

r"""All line-ending related functions, from conversions to config processing.

Line-ending normalization is a complex beast. Here are some notes and details
about how it seems to work.

The normalization is a two-fold process that happens at two moments:

- When reading a file from the index into the working directory, for example
  when doing a ``git clone`` or ``git checkout`` call. We call this process the
  read filter in this module.
- When writing a file to the index from the working directory, for example
  when doing a ``git add`` call. We call this process the write filter in this
  module.

Note that when checking status (getting unstaged changes), whether or not
normalization is done on write depends on whether or not the file in the
working dir has also been normalized on read:

- For autocrlf=true all files are always normalized on both read and write.
- For autocrlf=input files are only normalized on write if they are newly
  "added". Since files which are already committed are not normalized on
  checkout into the working tree, they are also left alone when staging
  modifications into the index.

One thing to know is that Git does line-ending normalization only on text
files. How does Git know that a file is text? We can either mark a file as a
text file or a binary file, or ask Git to decide automatically. Git has a
heuristic to detect if a file is a text file or a binary file. It seems to be
based on the percentage of non-printable characters in the file.

The code for this heuristic is here:
https://git.kernel.org/pub/scm/git/git.git/tree/convert.c#n46

Dulwich has an implementation with a slightly different heuristic, the
`dulwich.patch.is_binary` function.

The binary detection heuristic implementation is close to the one in JGit:
https://github.com/eclipse/jgit/blob/f6873ffe522bbc3536969a3a3546bf9a819b92bf/org.eclipse.jgit/src/org/eclipse/jgit/diff/RawText.java#L300

There are multiple variables that impact the normalization.

First, a repository can contain a ``.gitattributes`` file (or more than one...)
that can further customize the operation on some file patterns, for example:

    \*.txt text

Force all ``.txt`` files to be treated as text files and to have their line
endings normalized.

    \*.jpg -text

Force all ``.jpg`` files to be treated as binary files and to not have their
line endings converted.

    \*.vcproj text eol=crlf

Force all ``.vcproj`` files to be treated as text files and to have their line
endings converted into ``CRLF`` in the working directory no matter the native
EOL of the platform.

    \*.sh text eol=lf

Force all ``.sh`` files to be treated as text files and to have their line
endings converted into ``LF`` in the working directory no matter the native
EOL of the platform.

If the ``eol`` attribute is not defined, Git uses the ``core.eol`` configuration
value described later.

    \* text=auto

Force all files to be scanned by the text file heuristic detection and to have
their line endings normalized in case they are detected as text files.

Git also has an obsolete attribute named ``crlf`` that can be translated to the
corresponding text attribute value.

Then there are some configuration options (that can be defined at the
repository or user level):

- core.autocrlf
- core.eol

``core.autocrlf`` is taken into account for all files that don't have a ``text``
attribute defined in ``.gitattributes``; it takes three possible values:

    - ``true``: This forces all files to have CRLF line-endings in the
      working directory and converts line-endings to LF when writing to the
      index. When autocrlf is set to true, the eol value is ignored.
    - ``input``: Quite similar to the ``true`` value but only forces the write
      filter, i.e. new files added to the index will get their line-endings
      converted to LF.
    - ``false`` (default): No normalization is done.

``core.eol`` is the top-level configuration to define the line-ending to use
when applying the read filter. It takes three possible values:

    - ``lf``: When normalization is done, force line-endings to be ``LF`` in the
      working directory.
    - ``crlf``: When normalization is done, force line-endings to be ``CRLF`` in
      the working directory.
    - ``native`` (default): When normalization is done, force line-endings to be
      the platform's native line ending.

One thing to remember is that when line-ending normalization is done on a
file, Git always normalizes line-endings to ``LF`` when writing to the index.

There are sources that seem to indicate that Git won't do line-ending
normalization when a file contains mixed line-endings. I think this logic
might be in the text / binary detection heuristic but couldn't find it yet.

Sources:
- https://git-scm.com/docs/git-config#git-config-coreeol
- https://git-scm.com/docs/git-config#git-config-coreautocrlf
- https://git-scm.com/docs/gitattributes#_checking_out_and_checking_in
- https://adaptivepatchwork.com/2012/03/01/mind-the-end-of-your-line/
"""
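
The precedence described in the notes above (a per-path ``eol`` attribute first, then ``core.eol``, then the platform default) could be sketched roughly as follows. The helper name ``resolve_checkout_eol`` and its parameters are hypothetical and not part of this module. The sketch also leaves out ``core.autocrlf``, which according to the notes overrides the eol value when set to true.

```python
import os
from typing import Optional

CRLF = b"\r\n"
LF = b"\n"


def resolve_checkout_eol(attr_eol: Optional[str], core_eol: str = "native") -> bytes:
    """Pick the working-directory line ending (hypothetical helper).

    attr_eol is the value of the ``eol`` attribute for a path, or None when
    the attribute is not set; core_eol is the ``core.eol`` config value.
    """
    # The per-path attribute wins; otherwise fall back to core.eol.
    choice = attr_eol if attr_eol is not None else core_eol
    if choice == "crlf":
        return CRLF
    if choice == "lf":
        return LF
    if choice == "native":
        # os.linesep is "\r\n" on Windows and "\n" on POSIX platforms.
        return os.linesep.encode("ascii")
    raise ValueError(f"unsupported eol value: {choice!r}")


print(resolve_checkout_eol("crlf", "lf"))  # attribute overrides core.eol
print(resolve_checkout_eol(None, "lf"))    # falls back to core.eol
```

Returning bytes keeps the result directly usable with the ``CRLF`` and ``LF`` constants defined below.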


from typing import TYPE_CHECKING, Any, Callable, Optional, Union

if TYPE_CHECKING:
    from .config import StackedConfig
    from .object_store import BaseObjectStore

from .object_store import iter_tree_contents
from .objects import Blob, ObjectID
from .patch import is_binary

CRLF = b"\r\n"
LF = b"\n"


def convert_crlf_to_lf(text_hunk: bytes) -> bytes:
    """Convert CRLF in text hunk into LF.

    Args:
      text_hunk: A bytes string representing a text hunk
    Returns: The text hunk with the same type, with CRLF replaced by LF
    """
    return text_hunk.replace(CRLF, LF)


def convert_lf_to_crlf(text_hunk: bytes) -> bytes:
    """Convert LF in text hunk into CRLF.

    Args:
      text_hunk: A bytes string representing a text hunk
    Returns: The text hunk with the same type, with LF replaced by CRLF
    """
    # Single-pass conversion: split on LF and join with CRLF.
    # This avoids the double replacement issue.
    parts = text_hunk.split(LF)
    # Remove one trailing CR from each part that was followed by an LF,
    # so that an existing CRLF does not become CRCRLF.
    cleaned_parts = []
    for i, part in enumerate(parts):
        if i < len(parts) - 1 and part.endswith(b"\r"):
            cleaned_parts.append(part[:-1])
        else:
            cleaned_parts.append(part)
    return CRLF.join(cleaned_parts)

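
The "double replacement issue" mentioned in the comments can be seen by comparing a naive ``replace`` with the split-and-join approach. This is a minimal standalone sketch: ``naive_lf_to_crlf`` is a hypothetical counter-example, and ``split_join_lf_to_crlf`` re-implements the same idea as ``convert_lf_to_crlf`` so the block runs on its own.

```python
CRLF = b"\r\n"
LF = b"\n"


def naive_lf_to_crlf(data: bytes) -> bytes:
    # Hypothetical naive version: every LF gets a CR in front of it,
    # including LFs that already belong to a CRLF pair.
    return data.replace(LF, CRLF)


def split_join_lf_to_crlf(data: bytes) -> bytes:
    # Split on LF, drop one trailing CR from every part that was followed
    # by an LF, then join the parts with CRLF.
    parts = data.split(LF)
    cleaned = [
        part[:-1] if i < len(parts) - 1 and part.endswith(b"\r") else part
        for i, part in enumerate(parts)
    ]
    return CRLF.join(cleaned)


mixed = b"one\r\ntwo\nthree"
print(naive_lf_to_crlf(mixed))       # b'one\r\r\ntwo\r\nthree'
print(split_join_lf_to_crlf(mixed))  # b'one\r\ntwo\r\nthree'
```

Because existing CRLF pairs are left intact, applying the split-and-join version twice gives the same result as applying it once.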

def get_checkout_filter(
    core_eol: str, core_autocrlf: Union[bool, str], git_attributes: dict[str, Any]
) -> Optional[Callable[[bytes], bytes]]:
    """Returns the correct checkout filter based on the passed arguments."""
    # TODO this function should process the git_attributes for the path and if
    # the text attribute is not defined, fallback on the
    # get_checkout_filter_autocrlf function with the autocrlf value
    if isinstance(core_autocrlf, bool):
        autocrlf_bytes = b"true" if core_autocrlf else b"false"
    else:
        autocrlf_bytes = (
            core_autocrlf.encode("ascii")
            if isinstance(core_autocrlf, str)
            else core_autocrlf
        )
    return get_checkout_filter_autocrlf(autocrlf_bytes)


def get_checkin_filter(
    core_eol: str, core_autocrlf: Union[bool, str], git_attributes: dict[str, Any]
) -> Optional[Callable[[bytes], bytes]]:
    """Returns the correct checkin filter based on the passed arguments."""
    # TODO this function should process the git_attributes for the path and if
    # the text attribute is not defined, fallback on the
    # get_checkin_filter_autocrlf function with the autocrlf value
    if isinstance(core_autocrlf, bool):
        autocrlf_bytes = b"true" if core_autocrlf else b"false"
    else:
        autocrlf_bytes = (
            core_autocrlf.encode("ascii")
            if isinstance(core_autocrlf, str)
            else core_autocrlf
        )
    return get_checkin_filter_autocrlf(autocrlf_bytes)


def get_checkout_filter_autocrlf(
    core_autocrlf: bytes,
) -> Optional[Callable[[bytes], bytes]]:
    """Returns the correct checkout filter based on the autocrlf value.

    Args:
      core_autocrlf: The bytes configuration value of core.autocrlf.
        Valid values are: b'true', b'false' or b'input'.
    Returns: Either None if no filter has to be applied or a function
      accepting a single argument, a binary text hunk
    """
    if core_autocrlf == b"true":
        return convert_lf_to_crlf

    return None


def get_checkin_filter_autocrlf(
    core_autocrlf: bytes,
) -> Optional[Callable[[bytes], bytes]]:
    """Returns the correct checkin filter based on the autocrlf value.

    Args:
      core_autocrlf: The bytes configuration value of core.autocrlf.
        Valid values are: b'true', b'false' or b'input'.
    Returns: Either None if no filter has to be applied or a function
      accepting a single argument, a binary text hunk
    """
    if core_autocrlf == b"true" or core_autocrlf == b"input":
        return convert_crlf_to_lf

    # The checkin filter should never be `convert_lf_to_crlf`
    return None

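
Taken together, the two ``*_autocrlf`` functions amount to a small lookup from the ``core.autocrlf`` value to a (checkout filter, checkin filter) pair. The table below is only an illustrative sketch: ``AUTOCRLF_FILTERS``, ``filters_for``, ``to_lf`` and ``to_crlf`` are hypothetical names, and the two converters are simplified stand-ins for the module's own functions.

```python
from typing import Callable, Optional

Filter = Optional[Callable[[bytes], bytes]]


def to_lf(data: bytes) -> bytes:
    # Simplified stand-in for convert_crlf_to_lf.
    return data.replace(b"\r\n", b"\n")


def to_crlf(data: bytes) -> bytes:
    # Simplified stand-in for convert_lf_to_crlf: normalize to LF first so
    # that existing CRLF pairs are not doubled.
    return to_lf(data).replace(b"\n", b"\r\n")


# (checkout filter, checkin filter) for each core.autocrlf value
AUTOCRLF_FILTERS: dict[bytes, tuple[Filter, Filter]] = {
    b"true": (to_crlf, to_lf),
    b"input": (None, to_lf),
    b"false": (None, None),
}


def filters_for(core_autocrlf: bytes) -> tuple[Filter, Filter]:
    # Unknown values fall back to "no filter", as in the functions above.
    return AUTOCRLF_FILTERS.get(core_autocrlf, (None, None))


checkout, checkin = filters_for(b"input")
print(checkout)               # None: nothing is converted on checkout
print(checkin(b"a\r\nb\r\n"))  # b'a\nb\n'
```

The checkin column never contains the CRLF converter, which matches the comment in ``get_checkin_filter_autocrlf``.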

class BlobNormalizer:
    """An object that stores the result of computing which filter to apply,
    based on configuration, gitattributes, path and operation (checkin or
    checkout).
    """

    def __init__(
        self, config_stack: "StackedConfig", gitattributes: dict[str, Any]
    ) -> None:
        self.config_stack = config_stack
        self.gitattributes = gitattributes

        # Compute which filters we need based on parameters
        try:
            core_eol_raw = config_stack.get("core", "eol")
            core_eol: str = (
                core_eol_raw.decode("ascii")
                if isinstance(core_eol_raw, bytes)
                else core_eol_raw
            )
        except KeyError:
            core_eol = "native"

        try:
            core_autocrlf_raw = config_stack.get("core", "autocrlf")
            if isinstance(core_autocrlf_raw, bytes):
                core_autocrlf: Union[bool, str] = core_autocrlf_raw.decode(
                    "ascii"
                ).lower()
            else:
                core_autocrlf = core_autocrlf_raw.lower()
        except KeyError:
            core_autocrlf = False

        self.fallback_read_filter = get_checkout_filter(
            core_eol, core_autocrlf, self.gitattributes
        )
        self.fallback_write_filter = get_checkin_filter(
            core_eol, core_autocrlf, self.gitattributes
        )

    def checkin_normalize(self, blob: Blob, tree_path: bytes) -> Blob:
        """Normalize a blob during a checkin operation."""
        if self.fallback_write_filter is not None:
            return normalize_blob(
                blob, self.fallback_write_filter, binary_detection=True
            )

        return blob

    def checkout_normalize(self, blob: Blob, tree_path: bytes) -> Blob:
        """Normalize a blob during a checkout operation."""
        if self.fallback_read_filter is not None:
            return normalize_blob(
                blob, self.fallback_read_filter, binary_detection=True
            )

        return blob

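
The defaulting behaviour in ``BlobNormalizer.__init__`` (``core.eol`` falls back to ``"native"`` and ``core.autocrlf`` to ``False`` when unset) can be tried in isolation with a stand-in config object. ``DictConfig`` and ``read_line_ending_settings`` are hypothetical names used only for this sketch. The sketch assumes, as the code above does, that a missing key raises ``KeyError``, and it only handles bytes values.

```python
from typing import Union


class DictConfig:
    """Hypothetical stand-in for a config stack; raises KeyError when unset."""

    def __init__(self, values: dict[tuple[str, str], bytes]) -> None:
        self._values = values

    def get(self, section: str, name: str) -> bytes:
        return self._values[(section, name)]


def read_line_ending_settings(config: DictConfig) -> tuple[str, Union[bool, str]]:
    # core.eol defaults to "native" when it is not configured.
    try:
        core_eol = config.get("core", "eol").decode("ascii")
    except KeyError:
        core_eol = "native"
    # core.autocrlf is lower-cased so that b"Input" and b"input" match;
    # it defaults to False when it is not configured.
    try:
        core_autocrlf: Union[bool, str] = (
            config.get("core", "autocrlf").decode("ascii").lower()
        )
    except KeyError:
        core_autocrlf = False
    return core_eol, core_autocrlf


print(read_line_ending_settings(DictConfig({})))
print(read_line_ending_settings(DictConfig({("core", "autocrlf"): b"Input"})))
```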

def normalize_blob(
    blob: Blob, conversion: Callable[[bytes], bytes], binary_detection: bool
) -> Blob:
    """Normalize a blob with the given conversion function.

    Returns the original blob if binary_detection is True and the blob
    content looks binary; otherwise returns a new blob with the converted
    data.
    """
    # Read the original blob
    data = blob.data

    # If we need to detect if a file is binary and the file is detected as
    # binary, do not apply the conversion function and return the original
    # blob
    if binary_detection is True:
        if is_binary(data):
            return blob

    # Now apply the conversion
    converted_data = conversion(data)

    new_blob = Blob()
    new_blob.data = converted_data

    return new_blob

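
The effect of ``binary_detection`` can be illustrated on raw bytes, without ``Blob`` objects. In this sketch ``looks_binary`` is an assumed heuristic (a NUL byte within the first few thousand bytes) standing in for ``dulwich.patch.is_binary``, whose exact rule is not shown in this module. ``normalize_bytes`` and ``FIRST_FEW_BYTES`` are likewise hypothetical names.

```python
from typing import Callable

FIRST_FEW_BYTES = 8000  # assumed scan window, not taken from this module


def looks_binary(data: bytes) -> bool:
    # Assumed heuristic: a NUL byte near the start marks the content as binary.
    return b"\0" in data[:FIRST_FEW_BYTES]


def normalize_bytes(
    data: bytes, conversion: Callable[[bytes], bytes], binary_detection: bool
) -> bytes:
    # Binary-looking content is returned untouched when detection is enabled.
    if binary_detection and looks_binary(data):
        return data
    return conversion(data)


def to_lf(data: bytes) -> bytes:
    return data.replace(b"\r\n", b"\n")


text = b"hello\r\nworld\r\n"
image = b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"

print(normalize_bytes(text, to_lf, binary_detection=True))            # b'hello\nworld\n'
print(normalize_bytes(image, to_lf, binary_detection=True) == image)  # True
```

Skipping binary-looking data matters because a CRLF byte pair inside, for example, an image header is not a line ending and must not be rewritten.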

class TreeBlobNormalizer(BlobNormalizer):
    """A blob normalizer that knows which paths already exist in a tree."""

    def __init__(
        self,
        config_stack: "StackedConfig",
        git_attributes: dict[str, Any],
        object_store: "BaseObjectStore",
        tree: Optional[ObjectID] = None,
    ) -> None:
        super().__init__(config_stack, git_attributes)
        if tree:
            self.existing_paths = {
                name for name, _, _ in iter_tree_contents(object_store, tree)
            }
        else:
            self.existing_paths = set()

    def checkin_normalize(self, blob: Blob, tree_path: bytes) -> Blob:
        # An existing file should only be normalized on checkin if it was
        # previously normalized on checkout
        if (
            self.fallback_read_filter is not None
            or tree_path not in self.existing_paths
        ):
            return super().checkin_normalize(blob, tree_path)
        return blob
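
The condition in ``TreeBlobNormalizer.checkin_normalize`` reduces to a small predicate, sketched below under a hypothetical name, ``should_normalize_on_checkin``. It mirrors the rule from the module notes: with autocrlf=input there is no read filter, so files that are already committed are left alone while newly added files are still normalized.

```python
def should_normalize_on_checkin(
    has_read_filter: bool, tree_path: bytes, existing_paths: set[bytes]
) -> bool:
    # Normalize when the file was also normalized on checkout (a read filter
    # exists) or when the path is new to the tree.
    return has_read_filter or tree_path not in existing_paths


existing = {b"README.md"}

# autocrlf=input: no read filter, so committed files are left alone ...
print(should_normalize_on_checkin(False, b"README.md", existing))  # False
# ... but newly added files are still normalized.
print(should_normalize_on_checkin(False, b"new.txt", existing))    # True
# autocrlf=true: a read filter exists, so every file is normalized.
print(should_normalize_on_checkin(True, b"README.md", existing))   # True
```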