// SPDX-License-Identifier: GPL-2.0
#include <linux/fs.h>
#include <linux/random.h>
#include <linux/buffer_head.h>
#include <linux/utsname.h>
#include <linux/kthread.h>

#include "ext4.h"

/* Checksumming functions */
static __le32 ext4_mmp_csum(struct super_block *sb, struct mmp_struct *mmp)
{
	struct ext4_sb_info *sbi = EXT4_SB(sb);
	int offset = offsetof(struct mmp_struct, mmp_checksum);
	__u32 csum;

	csum = ext4_chksum(sbi, sbi->s_csum_seed, (char *)mmp, offset);

	return cpu_to_le32(csum);
}

static int ext4_mmp_csum_verify(struct super_block *sb, struct mmp_struct *mmp)
{
	if (!ext4_has_metadata_csum(sb))
		return 1;

	return mmp->mmp_checksum == ext4_mmp_csum(sb, mmp);
}

static void ext4_mmp_csum_set(struct super_block *sb, struct mmp_struct *mmp)
{
	if (!ext4_has_metadata_csum(sb))
		return;

	mmp->mmp_checksum = ext4_mmp_csum(sb, mmp);
}

/*
 * Write the MMP block using REQ_SYNC to try to get the block on-disk
 * faster.
 */
static int write_mmp_block_thawed(struct super_block *sb,
				  struct buffer_head *bh)
{
	struct mmp_struct *mmp = (struct mmp_struct *)(bh->b_data);

	ext4_mmp_csum_set(sb, mmp);
	lock_buffer(bh);
	bh->b_end_io = end_buffer_write_sync;
	get_bh(bh);
	submit_bh(REQ_OP_WRITE | REQ_SYNC | REQ_META | REQ_PRIO, bh);
	wait_on_buffer(bh);
	if (unlikely(!buffer_uptodate(bh)))
		return -EIO;
	return 0;
}

static int write_mmp_block(struct super_block *sb, struct buffer_head *bh)
{
	int err;

	/*
	 * We protect against freezing so that we don't create dirty buffers
	 * on frozen filesystem.
	 */
	sb_start_write(sb);
	err = write_mmp_block_thawed(sb, bh);
	sb_end_write(sb);
	return err;
}

/*
 * Read the MMP block. It _must_ be read from disk and hence we clear the
 * uptodate flag on the buffer.
*/ static int read_mmp_block(struct super_block *sb, struct buffer_head **bh, ext4_fsblk_t mmp_block) { struct mmp_struct *mmp; int ret; if (*bh) clear_buffer_uptodate(*bh); /* This would be sb_bread(sb, mmp_block), except we need to be sure * that the MD RAID device cache has been bypassed, and that the read * is not blocked in the elevator. */ if (!*bh) { *bh = sb_getblk(sb, mmp_block); if (!*bh) { ret = -ENOMEM; goto warn_exit; } } lock_buffer(*bh); ret = ext4_read_bh(*bh, REQ_META | REQ_PRIO, NULL, false); if (ret) goto warn_exit; mmp = (struct mmp_struct *)((*bh)->b_data); if (le32_to_cpu(mmp->mmp_magic) != EXT4_MMP_MAGIC) { ret = -EFSCORRUPTED; goto warn_exit; } if (!ext4_mmp_csum_verify(sb, mmp)) { ret = -EFSBADCRC; goto warn_exit; } return 0; warn_exit: brelse(*bh); *bh = NULL; ext4_warning(sb, "Error %d while reading MMP block %llu", ret, mmp_block); return ret; } /* * Dump as much information as possible to help the admin. */ void __dump_mmp_msg(struct super_block *sb, struct mmp_struct *mmp, const char *function, unsigned int line, const char *msg) { __ext4_warning(sb, function, line, "%s", msg); __ext4_warning(sb, function, line, "MMP failure info: last update time: %llu, last update node: %.*s, last update device: %.*s", (unsigned long long)le64_to_cpu(mmp->mmp_time), (int)sizeof(mmp->mmp_nodename), mmp->mmp_nodename, (int)sizeof(mmp->mmp_bdevname), mmp->mmp_bdevname); } /* * kmmpd will update the MMP sequence every s_mmp_update_interval seconds */ static int kmmpd(void *data) { struct super_block *sb = data; struct ext4_super_block *es = EXT4_SB(sb)->s_es; struct buffer_head *bh = EXT4_SB(sb)->s_mmp_bh; struct mmp_struct *mmp; ext4_fsblk_t mmp_block; u32 seq = 0; unsigned long failed_writes = 0; int mmp_update_interval = le16_to_cpu(es->s_mmp_update_interval); unsigned mmp_check_interval; unsigned long last_update_time; unsigned long diff; int retval = 0; mmp_block = le64_to_cpu(es->s_mmp_block); mmp = (struct mmp_struct *)(bh->b_data); mmp->mmp_time = cpu_to_le64(ktime_get_real_seconds()); /* * Start with the higher mmp_check_interval and reduce it if * the MMP block is being updated on time. */ mmp_check_interval = max(EXT4_MMP_CHECK_MULT * mmp_update_interval, EXT4_MMP_MIN_CHECK_INTERVAL); mmp->mmp_check_interval = cpu_to_le16(mmp_check_interval); memcpy(mmp->mmp_nodename, init_utsname()->nodename, sizeof(mmp->mmp_nodename)); while (!kthread_should_stop() && !ext4_forced_shutdown(sb)) { if (!ext4_has_feature_mmp(sb)) { ext4_warning(sb, "kmmpd being stopped since MMP feature" " has been disabled."); goto wait_to_exit; } if (++seq > EXT4_MMP_SEQ_MAX) seq = 1; mmp->mmp_seq = cpu_to_le32(seq); mmp->mmp_time = cpu_to_le64(ktime_get_real_seconds()); last_update_time = jiffies; retval = write_mmp_block(sb, bh); /* * Don't spew too many error messages. Print one every * (s_mmp_update_interval * 60) seconds. */ if (retval) { if ((failed_writes % 60) == 0) { ext4_error_err(sb, -retval, "Error writing to MMP block"); } failed_writes++; } diff = jiffies - last_update_time; if (diff < mmp_update_interval * HZ) schedule_timeout_interruptible(mmp_update_interval * HZ - diff); /* * We need to make sure that more than mmp_check_interval * seconds have not passed since writing. If that has happened * we need to check if the MMP block is as we left it. 
*/ diff = jiffies - last_update_time; if (diff > mmp_check_interval * HZ) { struct buffer_head *bh_check = NULL; struct mmp_struct *mmp_check; retval = read_mmp_block(sb, &bh_check, mmp_block); if (retval) { ext4_error_err(sb, -retval, "error reading MMP data: %d", retval); goto wait_to_exit; } mmp_check = (struct mmp_struct *)(bh_check->b_data); if (mmp->mmp_seq != mmp_check->mmp_seq || memcmp(mmp->mmp_nodename, mmp_check->mmp_nodename, sizeof(mmp->mmp_nodename))) { dump_mmp_msg(sb, mmp_check, "Error while updating MMP info. " "The filesystem seems to have been" " multiply mounted."); ext4_error_err(sb, EBUSY, "abort"); put_bh(bh_check); retval = -EBUSY; goto wait_to_exit; } put_bh(bh_check); } /* * Adjust the mmp_check_interval depending on how much time * it took for the MMP block to be written. */ mmp_check_interval = max(min(EXT4_MMP_CHECK_MULT * diff / HZ, EXT4_MMP_MAX_CHECK_INTERVAL), EXT4_MMP_MIN_CHECK_INTERVAL); mmp->mmp_check_interval = cpu_to_le16(mmp_check_interval); } /* * Unmount seems to be clean. */ mmp->mmp_seq = cpu_to_le32(EXT4_MMP_SEQ_CLEAN); mmp->mmp_time = cpu_to_le64(ktime_get_real_seconds()); retval = write_mmp_block(sb, bh); wait_to_exit: while (!kthread_should_stop()) { set_current_state(TASK_INTERRUPTIBLE); if (!kthread_should_stop()) schedule(); } set_current_state(TASK_RUNNING); return retval; } void ext4_stop_mmpd(struct ext4_sb_info *sbi) { if (sbi->s_mmp_tsk) { kthread_stop(sbi->s_mmp_tsk); brelse(sbi->s_mmp_bh); sbi->s_mmp_tsk = NULL; } } /* * Get a random new sequence number but make sure it is not greater than * EXT4_MMP_SEQ_MAX. */ static unsigned int mmp_new_seq(void) { return get_random_u32_below(EXT4_MMP_SEQ_MAX + 1); } /* * Protect the filesystem from being mounted more than once. */ int ext4_multi_mount_protect(struct super_block *sb, ext4_fsblk_t mmp_block) { struct ext4_super_block *es = EXT4_SB(sb)->s_es; struct buffer_head *bh = NULL; struct mmp_struct *mmp = NULL; u32 seq; unsigned int mmp_check_interval = le16_to_cpu(es->s_mmp_update_interval); unsigned int wait_time = 0; int retval; if (mmp_block < le32_to_cpu(es->s_first_data_block) || mmp_block >= ext4_blocks_count(es)) { ext4_warning(sb, "Invalid MMP block in superblock"); retval = -EINVAL; goto failed; } retval = read_mmp_block(sb, &bh, mmp_block); if (retval) goto failed; mmp = (struct mmp_struct *)(bh->b_data); if (mmp_check_interval < EXT4_MMP_MIN_CHECK_INTERVAL) mmp_check_interval = EXT4_MMP_MIN_CHECK_INTERVAL; /* * If check_interval in MMP block is larger, use that instead of * update_interval from the superblock. */ if (le16_to_cpu(mmp->mmp_check_interval) > mmp_check_interval) mmp_check_interval = le16_to_cpu(mmp->mmp_check_interval); seq = le32_to_cpu(mmp->mmp_seq); if (seq == EXT4_MMP_SEQ_CLEAN) goto skip; if (seq == EXT4_MMP_SEQ_FSCK) { dump_mmp_msg(sb, mmp, "fsck is running on the filesystem"); retval = -EBUSY; goto failed; } wait_time = min(mmp_check_interval * 2 + 1, mmp_check_interval + 60); /* Print MMP interval if more than 20 secs. 
	 */
	if (wait_time > EXT4_MMP_MIN_CHECK_INTERVAL * 4)
		ext4_warning(sb, "MMP interval %u higher than expected, please"
			     " wait.\n", wait_time * 2);

	if (schedule_timeout_interruptible(HZ * wait_time) != 0) {
		ext4_warning(sb, "MMP startup interrupted, failing mount\n");
		retval = -ETIMEDOUT;
		goto failed;
	}

	retval = read_mmp_block(sb, &bh, mmp_block);
	if (retval)
		goto failed;
	mmp = (struct mmp_struct *)(bh->b_data);
	if (seq != le32_to_cpu(mmp->mmp_seq)) {
		dump_mmp_msg(sb, mmp,
			     "Device is already active on another node.");
		retval = -EBUSY;
		goto failed;
	}

skip:
	/*
	 * write a new random sequence number.
	 */
	seq = mmp_new_seq();
	mmp->mmp_seq = cpu_to_le32(seq);

	/*
	 * On mount / remount we are protected against fs freezing (by s_umount
	 * semaphore) and grabbing freeze protection upsets lockdep
	 */
	retval = write_mmp_block_thawed(sb, bh);
	if (retval)
		goto failed;

	/*
	 * wait for MMP interval and check mmp_seq.
	 */
	if (schedule_timeout_interruptible(HZ * wait_time) != 0) {
		ext4_warning(sb, "MMP startup interrupted, failing mount");
		retval = -ETIMEDOUT;
		goto failed;
	}

	retval = read_mmp_block(sb, &bh, mmp_block);
	if (retval)
		goto failed;
	mmp = (struct mmp_struct *)(bh->b_data);
	if (seq != le32_to_cpu(mmp->mmp_seq)) {
		dump_mmp_msg(sb, mmp,
			     "Device is already active on another node.");
		retval = -EBUSY;
		goto failed;
	}

	EXT4_SB(sb)->s_mmp_bh = bh;

	BUILD_BUG_ON(sizeof(mmp->mmp_bdevname) < BDEVNAME_SIZE);
	snprintf(mmp->mmp_bdevname, sizeof(mmp->mmp_bdevname),
		 "%pg", bh->b_bdev);

	/*
	 * Start a kernel thread to update the MMP block periodically.
	 */
	EXT4_SB(sb)->s_mmp_tsk = kthread_run(kmmpd, sb, "kmmpd-%.*s",
					     (int)sizeof(mmp->mmp_bdevname),
					     mmp->mmp_bdevname);
	if (IS_ERR(EXT4_SB(sb)->s_mmp_tsk)) {
		EXT4_SB(sb)->s_mmp_tsk = NULL;
		ext4_warning(sb, "Unable to create kmmpd thread for %s.",
			     sb->s_id);
		retval = -ENOMEM;
		goto failed;
	}

	return 0;

failed:
	brelse(bh);
	return retval;
}
// SPDX-License-Identifier: GPL-2.0-or-later
/* AFS vlserver list management.
 *
 * Copyright (C) 2018 Red Hat, Inc. All Rights Reserved.
 * Written by David Howells (dhowells@redhat.com)
 */

#include <linux/kernel.h>
#include <linux/slab.h>
#include "internal.h"

struct afs_vlserver *afs_alloc_vlserver(const char *name, size_t name_len,
					unsigned short port)
{
	struct afs_vlserver *vlserver;
	static atomic_t debug_ids;

	vlserver = kzalloc(struct_size(vlserver, name, name_len + 1),
			   GFP_KERNEL);
	if (vlserver) {
		refcount_set(&vlserver->ref, 1);
		rwlock_init(&vlserver->lock);
		init_waitqueue_head(&vlserver->probe_wq);
		spin_lock_init(&vlserver->probe_lock);
		vlserver->debug_id = atomic_inc_return(&debug_ids);
		vlserver->rtt = UINT_MAX;
		vlserver->name_len = name_len;
		vlserver->service_id = VL_SERVICE;
		vlserver->port = port;
		memcpy(vlserver->name, name, name_len);
	}
	return vlserver;
}

static void afs_vlserver_rcu(struct rcu_head *rcu)
{
	struct afs_vlserver *vlserver = container_of(rcu, struct afs_vlserver,
						     rcu);

	afs_put_addrlist(rcu_access_pointer(vlserver->addresses),
			 afs_alist_trace_put_vlserver);
	kfree_rcu(vlserver, rcu);
}

void afs_put_vlserver(struct afs_net *net, struct afs_vlserver *vlserver)
{
	if (vlserver && refcount_dec_and_test(&vlserver->ref))
		call_rcu(&vlserver->rcu, afs_vlserver_rcu);
}

struct afs_vlserver_list *afs_alloc_vlserver_list(unsigned int nr_servers)
{
	struct afs_vlserver_list *vllist;

	vllist = kzalloc(struct_size(vllist, servers, nr_servers), GFP_KERNEL);
	if (vllist) {
		refcount_set(&vllist->ref, 1);
		rwlock_init(&vllist->lock);
	}

	return vllist;
}

void afs_put_vlserverlist(struct afs_net *net, struct afs_vlserver_list *vllist)
{
	if (vllist) {
		if (refcount_dec_and_test(&vllist->ref)) {
			int i;

			for (i = 0; i < vllist->nr_servers; i++) {
				afs_put_vlserver(net,
						 vllist->servers[i].server);
			}
			kfree_rcu(vllist, rcu);
		}
	}
}

static u16 afs_extract_le16(const u8 **_b)
{
	u16 val;

	val = (u16)*(*_b)++ << 0;
	val |= (u16)*(*_b)++ << 8;
	return val;
}

/*
 * Build a VL server address list from a DNS queried server list.
*/ static struct afs_addr_list *afs_extract_vl_addrs(struct afs_net *net, const u8 **_b, const u8 *end, u8 nr_addrs, u16 port) { struct afs_addr_list *alist; const u8 *b = *_b; int ret = -EINVAL; alist = afs_alloc_addrlist(nr_addrs); if (!alist) return ERR_PTR(-ENOMEM); if (nr_addrs == 0) return alist; for (; nr_addrs > 0 && end - b >= nr_addrs; nr_addrs--) { struct dns_server_list_v1_address hdr; __be32 x[4]; hdr.address_type = *b++; switch (hdr.address_type) { case DNS_ADDRESS_IS_IPV4: if (end - b < 4) { _leave(" = -EINVAL [short inet]"); goto error; } memcpy(x, b, 4); ret = afs_merge_fs_addr4(net, alist, x[0], port); if (ret < 0) goto error; b += 4; break; case DNS_ADDRESS_IS_IPV6: if (end - b < 16) { _leave(" = -EINVAL [short inet6]"); goto error; } memcpy(x, b, 16); ret = afs_merge_fs_addr6(net, alist, x, port); if (ret < 0) goto error; b += 16; break; default: _leave(" = -EADDRNOTAVAIL [unknown af %u]", hdr.address_type); ret = -EADDRNOTAVAIL; goto error; } } /* Start with IPv6 if available. */ if (alist->nr_ipv4 < alist->nr_addrs) alist->preferred = alist->nr_ipv4; *_b = b; return alist; error: *_b = b; afs_put_addrlist(alist, afs_alist_trace_put_parse_error); return ERR_PTR(ret); } /* * Build a VL server list from a DNS queried server list. */ struct afs_vlserver_list *afs_extract_vlserver_list(struct afs_cell *cell, const void *buffer, size_t buffer_size) { const struct dns_server_list_v1_header *hdr = buffer; struct dns_server_list_v1_server bs; struct afs_vlserver_list *vllist, *previous; struct afs_addr_list *addrs; struct afs_vlserver *server; const u8 *b = buffer, *end = buffer + buffer_size; int ret = -ENOMEM, nr_servers, i, j; _enter(""); /* Check that it's a server list, v1 */ if (end - b < sizeof(*hdr) || hdr->hdr.content != DNS_PAYLOAD_IS_SERVER_LIST || hdr->hdr.version != 1) { pr_notice("kAFS: Got DNS record [%u,%u] len %zu\n", hdr->hdr.content, hdr->hdr.version, end - b); ret = -EDESTADDRREQ; goto dump; } nr_servers = hdr->nr_servers; vllist = afs_alloc_vlserver_list(nr_servers); if (!vllist) return ERR_PTR(-ENOMEM); vllist->source = (hdr->source < NR__dns_record_source) ? hdr->source : NR__dns_record_source; vllist->status = (hdr->status < NR__dns_lookup_status) ? 
hdr->status : NR__dns_lookup_status; read_lock(&cell->vl_servers_lock); previous = afs_get_vlserverlist( rcu_dereference_protected(cell->vl_servers, lockdep_is_held(&cell->vl_servers_lock))); read_unlock(&cell->vl_servers_lock); b += sizeof(*hdr); while (end - b >= sizeof(bs)) { bs.name_len = afs_extract_le16(&b); bs.priority = afs_extract_le16(&b); bs.weight = afs_extract_le16(&b); bs.port = afs_extract_le16(&b); bs.source = *b++; bs.status = *b++; bs.protocol = *b++; bs.nr_addrs = *b++; _debug("extract %u %u %u %u %u %u %*.*s", bs.name_len, bs.priority, bs.weight, bs.port, bs.protocol, bs.nr_addrs, bs.name_len, bs.name_len, b); if (end - b < bs.name_len) break; ret = -EPROTONOSUPPORT; if (bs.protocol == DNS_SERVER_PROTOCOL_UNSPECIFIED) { bs.protocol = DNS_SERVER_PROTOCOL_UDP; } else if (bs.protocol != DNS_SERVER_PROTOCOL_UDP) { _leave(" = [proto %u]", bs.protocol); goto error; } if (bs.port == 0) bs.port = AFS_VL_PORT; if (bs.source > NR__dns_record_source) bs.source = NR__dns_record_source; if (bs.status > NR__dns_lookup_status) bs.status = NR__dns_lookup_status; /* See if we can update an old server record */ server = NULL; for (i = 0; i < previous->nr_servers; i++) { struct afs_vlserver *p = previous->servers[i].server; if (p->name_len == bs.name_len && p->port == bs.port && strncasecmp(b, p->name, bs.name_len) == 0) { server = afs_get_vlserver(p); break; } } if (!server) { ret = -ENOMEM; server = afs_alloc_vlserver(b, bs.name_len, bs.port); if (!server) goto error; } b += bs.name_len; /* Extract the addresses - note that we can't skip this as we * have to advance the payload pointer. */ addrs = afs_extract_vl_addrs(cell->net, &b, end, bs.nr_addrs, bs.port); if (IS_ERR(addrs)) { ret = PTR_ERR(addrs); goto error_2; } if (vllist->nr_servers >= nr_servers) { _debug("skip %u >= %u", vllist->nr_servers, nr_servers); afs_put_addrlist(addrs, afs_alist_trace_put_parse_empty); afs_put_vlserver(cell->net, server); continue; } addrs->source = bs.source; addrs->status = bs.status; if (addrs->nr_addrs == 0) { afs_put_addrlist(addrs, afs_alist_trace_put_parse_empty); if (!rcu_access_pointer(server->addresses)) { afs_put_vlserver(cell->net, server); continue; } } else { struct afs_addr_list *old = addrs; write_lock(&server->lock); old = rcu_replace_pointer(server->addresses, old, lockdep_is_held(&server->lock)); write_unlock(&server->lock); afs_put_addrlist(old, afs_alist_trace_put_vlserver_old); } /* TODO: Might want to check for duplicates */ /* Insertion-sort by priority and weight */ for (j = 0; j < vllist->nr_servers; j++) { if (bs.priority < vllist->servers[j].priority) break; /* Lower preferable */ if (bs.priority == vllist->servers[j].priority && bs.weight > vllist->servers[j].weight) break; /* Higher preferable */ } if (j < vllist->nr_servers) { memmove(vllist->servers + j + 1, vllist->servers + j, (vllist->nr_servers - j) * sizeof(struct afs_vlserver_entry)); } clear_bit(AFS_VLSERVER_FL_PROBED, &server->flags); vllist->servers[j].priority = bs.priority; vllist->servers[j].weight = bs.weight; vllist->servers[j].server = server; vllist->nr_servers++; } if (b != end) { _debug("parse error %zd", b - end); goto error; } afs_put_vlserverlist(cell->net, previous); _leave(" = ok [%u]", vllist->nr_servers); return vllist; error_2: afs_put_vlserver(cell->net, server); error: afs_put_vlserverlist(cell->net, vllist); afs_put_vlserverlist(cell->net, previous); dump: if (ret != -ENOMEM) { printk(KERN_DEBUG "DNS: at %zu\n", (const void *)b - buffer); print_hex_dump_bytes("DNS: ", DUMP_PREFIX_NONE, 
				     buffer, buffer_size);
	}
	return ERR_PTR(ret);
}
// SPDX-License-Identifier: GPL-2.0
/*
 * Copyright (C) 1991, 1992 Linus Torvalds
 *
 * Added support for a Unix98-style ptmx device.
 * -- C. Scott Ananian <cananian@alumni.princeton.edu>, 14-Jan-1998
 *
 */

#include <linux/module.h>
#include <linux/errno.h>
#include <linux/interrupt.h>
#include <linux/tty.h>
#include <linux/tty_flip.h>
#include <linux/fcntl.h>
#include <linux/sched/signal.h>
#include <linux/string.h>
#include <linux/major.h>
#include <linux/mm.h>
#include <linux/init.h>
#include <linux/device.h>
#include <linux/uaccess.h>
#include <linux/bitops.h>
#include <linux/devpts_fs.h>
#include <linux/slab.h>
#include <linux/mutex.h>
#include <linux/poll.h>
#include <linux/mount.h>
#include <linux/file.h>
#include <linux/ioctl.h>
#include <linux/compat.h>
#include "tty.h"

#undef TTY_DEBUG_HANGUP
#ifdef TTY_DEBUG_HANGUP
# define tty_debug_hangup(tty, f, args...)	tty_debug(tty, f, ##args)
#else
# define tty_debug_hangup(tty, f, args...)	do {} while (0)
#endif

#ifdef CONFIG_UNIX98_PTYS
static struct tty_driver *ptm_driver;
static struct tty_driver *pts_driver;
static DEFINE_MUTEX(devpts_mutex);
#endif

static void pty_close(struct tty_struct *tty, struct file *filp)
{
	if (tty->driver->subtype == PTY_TYPE_MASTER)
		WARN_ON(tty->count > 1);
	else {
		if (tty_io_error(tty))
			return;
		if (tty->count > 2)
			return;
	}
	set_bit(TTY_IO_ERROR, &tty->flags);
	wake_up_interruptible(&tty->read_wait);
	wake_up_interruptible(&tty->write_wait);
	spin_lock_irq(&tty->ctrl.lock);
	tty->ctrl.packet = false;
	spin_unlock_irq(&tty->ctrl.lock);
	/* Review - krefs on tty_link ?? */
	if (!tty->link)
		return;
	set_bit(TTY_OTHER_CLOSED, &tty->link->flags);
	wake_up_interruptible(&tty->link->read_wait);
	wake_up_interruptible(&tty->link->write_wait);
	if (tty->driver->subtype == PTY_TYPE_MASTER) {
		set_bit(TTY_OTHER_CLOSED, &tty->flags);
#ifdef CONFIG_UNIX98_PTYS
		if (tty->driver == ptm_driver) {
			mutex_lock(&devpts_mutex);
			if (tty->link->driver_data)
				devpts_pty_kill(tty->link->driver_data);
			mutex_unlock(&devpts_mutex);
		}
#endif
		tty_vhangup(tty->link);
	}
}

/*
 * The unthrottle routine is called by the line discipline to signal
 * that it can receive more characters. For PTY's, the TTY_THROTTLED
 * flag is always set, to force the line discipline to always call the
 * unthrottle routine when there are fewer than TTY_THRESHOLD_UNTHROTTLE
 * characters in the queue. This is necessary since each time this
 * happens, we need to wake up any sleeping processes that could be
 * (1) trying to send data to the pty, or (2) waiting in wait_until_sent()
 * for the pty buffer to be drained.
 */
static void pty_unthrottle(struct tty_struct *tty)
{
	tty_wakeup(tty->link);
	set_bit(TTY_THROTTLED, &tty->flags);
}

/**
 * pty_write - write to a pty
 * @tty: the tty we write from
 * @buf: kernel buffer of data
 * @c: bytes to write
 *
 * Our "hardware" write method. Data is coming from the ldisc which
 * may be in a non sleeping state. We simply throw this at the other
 * end of the link as if we were an IRQ handler receiving stuff for
 * the other side of the pty/tty pair.
*/ static ssize_t pty_write(struct tty_struct *tty, const u8 *buf, size_t c) { struct tty_struct *to = tty->link; if (tty->flow.stopped || !c) return 0; return tty_insert_flip_string_and_push_buffer(to->port, buf, c); } /** * pty_write_room - write space * @tty: tty we are writing from * * Report how many bytes the ldisc can send into the queue for * the other device. */ static unsigned int pty_write_room(struct tty_struct *tty) { if (tty->flow.stopped) return 0; return tty_buffer_space_avail(tty->link->port); } /* Set the lock flag on a pty */ static int pty_set_lock(struct tty_struct *tty, int __user *arg) { int val; if (get_user(val, arg)) return -EFAULT; if (val) set_bit(TTY_PTY_LOCK, &tty->flags); else clear_bit(TTY_PTY_LOCK, &tty->flags); return 0; } static int pty_get_lock(struct tty_struct *tty, int __user *arg) { int locked = test_bit(TTY_PTY_LOCK, &tty->flags); return put_user(locked, arg); } /* Set the packet mode on a pty */ static int pty_set_pktmode(struct tty_struct *tty, int __user *arg) { int pktmode; if (get_user(pktmode, arg)) return -EFAULT; spin_lock_irq(&tty->ctrl.lock); if (pktmode) { if (!tty->ctrl.packet) { tty->link->ctrl.pktstatus = 0; smp_mb(); tty->ctrl.packet = true; } } else tty->ctrl.packet = false; spin_unlock_irq(&tty->ctrl.lock); return 0; } /* Get the packet mode of a pty */ static int pty_get_pktmode(struct tty_struct *tty, int __user *arg) { int pktmode = tty->ctrl.packet; return put_user(pktmode, arg); } /* Send a signal to the slave */ static int pty_signal(struct tty_struct *tty, int sig) { struct pid *pgrp; if (sig != SIGINT && sig != SIGQUIT && sig != SIGTSTP) return -EINVAL; if (tty->link) { pgrp = tty_get_pgrp(tty->link); if (pgrp) kill_pgrp(pgrp, sig, 1); put_pid(pgrp); } return 0; } static void pty_flush_buffer(struct tty_struct *tty) { struct tty_struct *to = tty->link; if (!to) return; tty_buffer_flush(to, NULL); if (to->ctrl.packet) { spin_lock_irq(&tty->ctrl.lock); tty->ctrl.pktstatus |= TIOCPKT_FLUSHWRITE; wake_up_interruptible(&to->read_wait); spin_unlock_irq(&tty->ctrl.lock); } } static int pty_open(struct tty_struct *tty, struct file *filp) { if (!tty || !tty->link) return -ENODEV; if (test_bit(TTY_OTHER_CLOSED, &tty->flags)) goto out; if (test_bit(TTY_PTY_LOCK, &tty->link->flags)) goto out; if (tty->driver->subtype == PTY_TYPE_SLAVE && tty->link->count != 1) goto out; clear_bit(TTY_IO_ERROR, &tty->flags); clear_bit(TTY_OTHER_CLOSED, &tty->link->flags); set_bit(TTY_THROTTLED, &tty->flags); return 0; out: set_bit(TTY_IO_ERROR, &tty->flags); return -EIO; } static void pty_set_termios(struct tty_struct *tty, const struct ktermios *old_termios) { /* See if packet mode change of state. 
*/ if (tty->link && tty->link->ctrl.packet) { int extproc = (old_termios->c_lflag & EXTPROC) | L_EXTPROC(tty); int old_flow = ((old_termios->c_iflag & IXON) && (old_termios->c_cc[VSTOP] == '\023') && (old_termios->c_cc[VSTART] == '\021')); int new_flow = (I_IXON(tty) && STOP_CHAR(tty) == '\023' && START_CHAR(tty) == '\021'); if ((old_flow != new_flow) || extproc) { spin_lock_irq(&tty->ctrl.lock); if (old_flow != new_flow) { tty->ctrl.pktstatus &= ~(TIOCPKT_DOSTOP | TIOCPKT_NOSTOP); if (new_flow) tty->ctrl.pktstatus |= TIOCPKT_DOSTOP; else tty->ctrl.pktstatus |= TIOCPKT_NOSTOP; } if (extproc) tty->ctrl.pktstatus |= TIOCPKT_IOCTL; spin_unlock_irq(&tty->ctrl.lock); wake_up_interruptible(&tty->link->read_wait); } } tty->termios.c_cflag &= ~(CSIZE | PARENB); tty->termios.c_cflag |= (CS8 | CREAD); } /** * pty_resize - resize event * @tty: tty being resized * @ws: window size being set. * * Update the termios variables and send the necessary signals to * peform a terminal resize correctly */ static int pty_resize(struct tty_struct *tty, struct winsize *ws) { struct pid *pgrp, *rpgrp; struct tty_struct *pty = tty->link; /* For a PTY we need to lock the tty side */ mutex_lock(&tty->winsize_mutex); if (!memcmp(ws, &tty->winsize, sizeof(*ws))) goto done; /* Signal the foreground process group of both ptys */ pgrp = tty_get_pgrp(tty); rpgrp = tty_get_pgrp(pty); if (pgrp) kill_pgrp(pgrp, SIGWINCH, 1); if (rpgrp != pgrp && rpgrp) kill_pgrp(rpgrp, SIGWINCH, 1); put_pid(pgrp); put_pid(rpgrp); tty->winsize = *ws; pty->winsize = *ws; /* Never used so will go away soon */ done: mutex_unlock(&tty->winsize_mutex); return 0; } /** * pty_start - start() handler * pty_stop - stop() handler * @tty: tty being flow-controlled * * Propagates the TIOCPKT status to the master pty. * * NB: only the master pty can be in packet mode so only the slave * needs start()/stop() handlers */ static void pty_start(struct tty_struct *tty) { unsigned long flags; if (tty->link && tty->link->ctrl.packet) { spin_lock_irqsave(&tty->ctrl.lock, flags); tty->ctrl.pktstatus &= ~TIOCPKT_STOP; tty->ctrl.pktstatus |= TIOCPKT_START; spin_unlock_irqrestore(&tty->ctrl.lock, flags); wake_up_interruptible_poll(&tty->link->read_wait, EPOLLIN); } } static void pty_stop(struct tty_struct *tty) { unsigned long flags; if (tty->link && tty->link->ctrl.packet) { spin_lock_irqsave(&tty->ctrl.lock, flags); tty->ctrl.pktstatus &= ~TIOCPKT_START; tty->ctrl.pktstatus |= TIOCPKT_STOP; spin_unlock_irqrestore(&tty->ctrl.lock, flags); wake_up_interruptible_poll(&tty->link->read_wait, EPOLLIN); } } /** * pty_common_install - set up the pty pair * @driver: the pty driver * @tty: the tty being instantiated * @legacy: true if this is BSD style * * Perform the initial set up for the tty/pty pair. Called from the * tty layer when the port is first opened. 
* * Locking: the caller must hold the tty_mutex */ static int pty_common_install(struct tty_driver *driver, struct tty_struct *tty, bool legacy) { struct tty_struct *o_tty; struct tty_port *ports[2]; int idx = tty->index; int retval = -ENOMEM; /* Opening the slave first has always returned -EIO */ if (driver->subtype != PTY_TYPE_MASTER) return -EIO; ports[0] = kmalloc(sizeof **ports, GFP_KERNEL); ports[1] = kmalloc(sizeof **ports, GFP_KERNEL); if (!ports[0] || !ports[1]) goto err; if (!try_module_get(driver->other->owner)) { /* This cannot in fact currently happen */ goto err; } o_tty = alloc_tty_struct(driver->other, idx); if (!o_tty) goto err_put_module; tty_set_lock_subclass(o_tty); lockdep_set_subclass(&o_tty->termios_rwsem, TTY_LOCK_SLAVE); if (legacy) { /* We always use new tty termios data so we can do this the easy way .. */ tty_init_termios(tty); tty_init_termios(o_tty); driver->other->ttys[idx] = o_tty; driver->ttys[idx] = tty; } else { memset(&tty->termios_locked, 0, sizeof(tty->termios_locked)); tty->termios = driver->init_termios; memset(&o_tty->termios_locked, 0, sizeof(tty->termios_locked)); o_tty->termios = driver->other->init_termios; } /* * Everything allocated ... set up the o_tty structure. */ tty_driver_kref_get(driver->other); /* Establish the links in both directions */ tty->link = o_tty; o_tty->link = tty; tty_port_init(ports[0]); tty_port_init(ports[1]); tty_buffer_set_limit(ports[0], 8192); tty_buffer_set_limit(ports[1], 8192); o_tty->port = ports[0]; tty->port = ports[1]; o_tty->port->itty = o_tty; tty_buffer_set_lock_subclass(o_tty->port); tty_driver_kref_get(driver); tty->count++; o_tty->count++; return 0; err_put_module: module_put(driver->other->owner); err: kfree(ports[0]); kfree(ports[1]); return retval; } static void pty_cleanup(struct tty_struct *tty) { tty_port_put(tty->port); } /* Traditional BSD devices */ #ifdef CONFIG_LEGACY_PTYS static int pty_install(struct tty_driver *driver, struct tty_struct *tty) { return pty_common_install(driver, tty, true); } static void pty_remove(struct tty_driver *driver, struct tty_struct *tty) { struct tty_struct *pair = tty->link; driver->ttys[tty->index] = NULL; if (pair) pair->driver->ttys[pair->index] = NULL; } static int pty_bsd_ioctl(struct tty_struct *tty, unsigned int cmd, unsigned long arg) { switch (cmd) { case TIOCSPTLCK: /* Set PT Lock (disallow slave open) */ return pty_set_lock(tty, (int __user *) arg); case TIOCGPTLCK: /* Get PT Lock status */ return pty_get_lock(tty, (int __user *)arg); case TIOCPKT: /* Set PT packet mode */ return pty_set_pktmode(tty, (int __user *)arg); case TIOCGPKT: /* Get PT packet mode */ return pty_get_pktmode(tty, (int __user *)arg); case TIOCSIG: /* Send signal to other side of pty */ return pty_signal(tty, (int) arg); case TIOCGPTN: /* TTY returns ENOTTY, but glibc expects EINVAL here */ return -EINVAL; } return -ENOIOCTLCMD; } #ifdef CONFIG_COMPAT static long pty_bsd_compat_ioctl(struct tty_struct *tty, unsigned int cmd, unsigned long arg) { /* * PTY ioctls don't require any special translation between 32-bit and * 64-bit userspace, they are already compatible. */ return pty_bsd_ioctl(tty, cmd, (unsigned long)compat_ptr(arg)); } #else #define pty_bsd_compat_ioctl NULL #endif static int legacy_count = CONFIG_LEGACY_PTY_COUNT; /* * not really modular, but the easiest way to keep compat with existing * bootargs behaviour is to continue using module_param here. */ module_param(legacy_count, int, 0); /* * The master side of a pty can do TIOCSPTLCK and thus * has pty_bsd_ioctl. 
*/ static const struct tty_operations master_pty_ops_bsd = { .install = pty_install, .open = pty_open, .close = pty_close, .write = pty_write, .write_room = pty_write_room, .flush_buffer = pty_flush_buffer, .unthrottle = pty_unthrottle, .ioctl = pty_bsd_ioctl, .compat_ioctl = pty_bsd_compat_ioctl, .cleanup = pty_cleanup, .resize = pty_resize, .remove = pty_remove }; static const struct tty_operations slave_pty_ops_bsd = { .install = pty_install, .open = pty_open, .close = pty_close, .write = pty_write, .write_room = pty_write_room, .flush_buffer = pty_flush_buffer, .unthrottle = pty_unthrottle, .set_termios = pty_set_termios, .cleanup = pty_cleanup, .resize = pty_resize, .start = pty_start, .stop = pty_stop, .remove = pty_remove }; static void __init legacy_pty_init(void) { struct tty_driver *pty_driver, *pty_slave_driver; if (legacy_count <= 0) return; pty_driver = tty_alloc_driver(legacy_count, TTY_DRIVER_RESET_TERMIOS | TTY_DRIVER_REAL_RAW | TTY_DRIVER_DYNAMIC_ALLOC); if (IS_ERR(pty_driver)) panic("Couldn't allocate pty driver"); pty_slave_driver = tty_alloc_driver(legacy_count, TTY_DRIVER_RESET_TERMIOS | TTY_DRIVER_REAL_RAW | TTY_DRIVER_DYNAMIC_ALLOC); if (IS_ERR(pty_slave_driver)) panic("Couldn't allocate pty slave driver"); pty_driver->driver_name = "pty_master"; pty_driver->name = "pty"; pty_driver->major = PTY_MASTER_MAJOR; pty_driver->minor_start = 0; pty_driver->type = TTY_DRIVER_TYPE_PTY; pty_driver->subtype = PTY_TYPE_MASTER; pty_driver->init_termios = tty_std_termios; pty_driver->init_termios.c_iflag = 0; pty_driver->init_termios.c_oflag = 0; pty_driver->init_termios.c_cflag = B38400 | CS8 | CREAD; pty_driver->init_termios.c_lflag = 0; pty_driver->init_termios.c_ispeed = 38400; pty_driver->init_termios.c_ospeed = 38400; pty_driver->other = pty_slave_driver; tty_set_operations(pty_driver, &master_pty_ops_bsd); pty_slave_driver->driver_name = "pty_slave"; pty_slave_driver->name = "ttyp"; pty_slave_driver->major = PTY_SLAVE_MAJOR; pty_slave_driver->minor_start = 0; pty_slave_driver->type = TTY_DRIVER_TYPE_PTY; pty_slave_driver->subtype = PTY_TYPE_SLAVE; pty_slave_driver->init_termios = tty_std_termios; pty_slave_driver->init_termios.c_cflag = B38400 | CS8 | CREAD; pty_slave_driver->init_termios.c_ispeed = 38400; pty_slave_driver->init_termios.c_ospeed = 38400; pty_slave_driver->other = pty_driver; tty_set_operations(pty_slave_driver, &slave_pty_ops_bsd); if (tty_register_driver(pty_driver)) panic("Couldn't register pty driver"); if (tty_register_driver(pty_slave_driver)) panic("Couldn't register pty slave driver"); } #else static inline void legacy_pty_init(void) { } #endif /* Unix98 devices */ #ifdef CONFIG_UNIX98_PTYS static struct cdev ptmx_cdev; /** * ptm_open_peer - open the peer of a pty * @master: the open struct file of the ptmx device node * @tty: the master of the pty being opened * @flags: the flags for open * * Provide a race free way for userspace to open the slave end of a pty * (where they have the master fd and cannot access or trust the mount * namespace /dev/pts was mounted inside). 
*/ int ptm_open_peer(struct file *master, struct tty_struct *tty, int flags) { int fd; struct file *filp; int retval = -EINVAL; struct path path; if (tty->driver != ptm_driver) return -EIO; fd = get_unused_fd_flags(flags); if (fd < 0) { retval = fd; goto err; } /* Compute the slave's path */ path.mnt = devpts_mntget(master, tty->driver_data); if (IS_ERR(path.mnt)) { retval = PTR_ERR(path.mnt); goto err_put; } path.dentry = tty->link->driver_data; filp = dentry_open(&path, flags, current_cred()); mntput(path.mnt); if (IS_ERR(filp)) { retval = PTR_ERR(filp); goto err_put; } fd_install(fd, filp); return fd; err_put: put_unused_fd(fd); err: return retval; } static int pty_unix98_ioctl(struct tty_struct *tty, unsigned int cmd, unsigned long arg) { switch (cmd) { case TIOCSPTLCK: /* Set PT Lock (disallow slave open) */ return pty_set_lock(tty, (int __user *)arg); case TIOCGPTLCK: /* Get PT Lock status */ return pty_get_lock(tty, (int __user *)arg); case TIOCPKT: /* Set PT packet mode */ return pty_set_pktmode(tty, (int __user *)arg); case TIOCGPKT: /* Get PT packet mode */ return pty_get_pktmode(tty, (int __user *)arg); case TIOCGPTN: /* Get PT Number */ return put_user(tty->index, (unsigned int __user *)arg); case TIOCSIG: /* Send signal to other side of pty */ return pty_signal(tty, (int) arg); } return -ENOIOCTLCMD; } #ifdef CONFIG_COMPAT static long pty_unix98_compat_ioctl(struct tty_struct *tty, unsigned int cmd, unsigned long arg) { /* * PTY ioctls don't require any special translation between 32-bit and * 64-bit userspace, they are already compatible. */ return pty_unix98_ioctl(tty, cmd, cmd == TIOCSIG ? arg : (unsigned long)compat_ptr(arg)); } #else #define pty_unix98_compat_ioctl NULL #endif /** * ptm_unix98_lookup - find a pty master * @driver: ptm driver * @file: unused * @idx: tty index * * Look up a pty master device. Called under the tty_mutex for now. * This provides our locking. */ static struct tty_struct *ptm_unix98_lookup(struct tty_driver *driver, struct file *file, int idx) { /* Master must be open via /dev/ptmx */ return ERR_PTR(-EIO); } /** * pts_unix98_lookup - find a pty slave * @driver: pts driver * @file: file pointer to tty * @idx: tty index * * Look up a pty master device. Called under the tty_mutex for now. * This provides our locking for the tty pointer. 
*/ static struct tty_struct *pts_unix98_lookup(struct tty_driver *driver, struct file *file, int idx) { struct tty_struct *tty; mutex_lock(&devpts_mutex); tty = devpts_get_priv(file->f_path.dentry); mutex_unlock(&devpts_mutex); /* Master must be open before slave */ if (!tty) return ERR_PTR(-EIO); return tty; } static int pty_unix98_install(struct tty_driver *driver, struct tty_struct *tty) { return pty_common_install(driver, tty, false); } /* this is called once with whichever end is closed last */ static void pty_unix98_remove(struct tty_driver *driver, struct tty_struct *tty) { struct pts_fs_info *fsi; if (tty->driver->subtype == PTY_TYPE_MASTER) fsi = tty->driver_data; else fsi = tty->link->driver_data; if (fsi) { devpts_kill_index(fsi, tty->index); devpts_release(fsi); } } static void pty_show_fdinfo(struct tty_struct *tty, struct seq_file *m) { seq_printf(m, "tty-index:\t%d\n", tty->index); } static const struct tty_operations ptm_unix98_ops = { .lookup = ptm_unix98_lookup, .install = pty_unix98_install, .remove = pty_unix98_remove, .open = pty_open, .close = pty_close, .write = pty_write, .write_room = pty_write_room, .flush_buffer = pty_flush_buffer, .unthrottle = pty_unthrottle, .ioctl = pty_unix98_ioctl, .compat_ioctl = pty_unix98_compat_ioctl, .resize = pty_resize, .cleanup = pty_cleanup, .show_fdinfo = pty_show_fdinfo, }; static const struct tty_operations pty_unix98_ops = { .lookup = pts_unix98_lookup, .install = pty_unix98_install, .remove = pty_unix98_remove, .open = pty_open, .close = pty_close, .write = pty_write, .write_room = pty_write_room, .flush_buffer = pty_flush_buffer, .unthrottle = pty_unthrottle, .set_termios = pty_set_termios, .start = pty_start, .stop = pty_stop, .cleanup = pty_cleanup, }; /** * ptmx_open - open a unix 98 pty master * @inode: inode of device file * @filp: file pointer to tty * * Allocate a unix98 pty master device from the ptmx driver. * * Locking: tty_mutex protects the init_dev work. tty->count should * protect the rest. * allocated_ptys_lock handles the list of free pty numbers */ static int ptmx_open(struct inode *inode, struct file *filp) { struct pts_fs_info *fsi; struct tty_struct *tty; struct dentry *dentry; int retval; int index; nonseekable_open(inode, filp); /* We refuse fsnotify events on ptmx, since it's a shared resource */ file_set_fsnotify_mode(filp, FMODE_NONOTIFY); retval = tty_alloc_file(filp); if (retval) return retval; fsi = devpts_acquire(filp); if (IS_ERR(fsi)) { retval = PTR_ERR(fsi); goto out_free_file; } /* find a device that is not in use. 
*/ mutex_lock(&devpts_mutex); index = devpts_new_index(fsi); mutex_unlock(&devpts_mutex); retval = index; if (index < 0) goto out_put_fsi; mutex_lock(&tty_mutex); tty = tty_init_dev(ptm_driver, index); /* The tty returned here is locked so we can safely drop the mutex */ mutex_unlock(&tty_mutex); retval = PTR_ERR(tty); if (IS_ERR(tty)) goto out; /* * From here on out, the tty is "live", and the index and * fsi will be killed/put by the tty_release() */ set_bit(TTY_PTY_LOCK, &tty->flags); /* LOCK THE SLAVE */ tty->driver_data = fsi; tty_add_file(tty, filp); dentry = devpts_pty_new(fsi, index, tty->link); if (IS_ERR(dentry)) { retval = PTR_ERR(dentry); goto err_release; } tty->link->driver_data = dentry; retval = ptm_driver->ops->open(tty, filp); if (retval) goto err_release; tty_debug_hangup(tty, "opening (count=%d)\n", tty->count); tty_unlock(tty); return 0; err_release: tty_unlock(tty); // This will also put-ref the fsi tty_release(inode, filp); return retval; out: devpts_kill_index(fsi, index); out_put_fsi: devpts_release(fsi); out_free_file: tty_free_file(filp); return retval; } static struct file_operations ptmx_fops __ro_after_init; static void __init unix98_pty_init(void) { ptm_driver = tty_alloc_driver(NR_UNIX98_PTY_MAX, TTY_DRIVER_RESET_TERMIOS | TTY_DRIVER_REAL_RAW | TTY_DRIVER_DYNAMIC_DEV | TTY_DRIVER_DEVPTS_MEM | TTY_DRIVER_DYNAMIC_ALLOC); if (IS_ERR(ptm_driver)) panic("Couldn't allocate Unix98 ptm driver"); pts_driver = tty_alloc_driver(NR_UNIX98_PTY_MAX, TTY_DRIVER_RESET_TERMIOS | TTY_DRIVER_REAL_RAW | TTY_DRIVER_DYNAMIC_DEV | TTY_DRIVER_DEVPTS_MEM | TTY_DRIVER_DYNAMIC_ALLOC); if (IS_ERR(pts_driver)) panic("Couldn't allocate Unix98 pts driver"); ptm_driver->driver_name = "pty_master"; ptm_driver->name = "ptm"; ptm_driver->major = UNIX98_PTY_MASTER_MAJOR; ptm_driver->minor_start = 0; ptm_driver->type = TTY_DRIVER_TYPE_PTY; ptm_driver->subtype = PTY_TYPE_MASTER; ptm_driver->init_termios = tty_std_termios; ptm_driver->init_termios.c_iflag = 0; ptm_driver->init_termios.c_oflag = 0; ptm_driver->init_termios.c_cflag = B38400 | CS8 | CREAD; ptm_driver->init_termios.c_lflag = 0; ptm_driver->init_termios.c_ispeed = 38400; ptm_driver->init_termios.c_ospeed = 38400; ptm_driver->other = pts_driver; tty_set_operations(ptm_driver, &ptm_unix98_ops); pts_driver->driver_name = "pty_slave"; pts_driver->name = "pts"; pts_driver->major = UNIX98_PTY_SLAVE_MAJOR; pts_driver->minor_start = 0; pts_driver->type = TTY_DRIVER_TYPE_PTY; pts_driver->subtype = PTY_TYPE_SLAVE; pts_driver->init_termios = tty_std_termios; pts_driver->init_termios.c_cflag = B38400 | CS8 | CREAD; pts_driver->init_termios.c_ispeed = 38400; pts_driver->init_termios.c_ospeed = 38400; pts_driver->other = ptm_driver; tty_set_operations(pts_driver, &pty_unix98_ops); if (tty_register_driver(ptm_driver)) panic("Couldn't register Unix98 ptm driver"); if (tty_register_driver(pts_driver)) panic("Couldn't register Unix98 pts driver"); /* Now create the /dev/ptmx special device */ tty_default_fops(&ptmx_fops); ptmx_fops.open = ptmx_open; cdev_init(&ptmx_cdev, &ptmx_fops); if (cdev_add(&ptmx_cdev, MKDEV(TTYAUX_MAJOR, 2), 1) || register_chrdev_region(MKDEV(TTYAUX_MAJOR, 2), 1, "/dev/ptmx") < 0) panic("Couldn't register /dev/ptmx driver"); device_create(&tty_class, NULL, MKDEV(TTYAUX_MAJOR, 2), NULL, "ptmx"); } #else static inline void unix98_pty_init(void) { } #endif static int __init pty_init(void) { legacy_pty_init(); unix98_pty_init(); return 0; } device_initcall(pty_init); |
// SPDX-License-Identifier: GPL-2.0-or-later
/*
 * RTC class driver for "CMOS RTC": PCs, ACPI, etc
 *
 * Copyright (C) 1996 Paul Gortmaker (drivers/char/rtc.c)
 * Copyright (C) 2006 David Brownell (convert to new framework)
 */

/*
 * The
original "cmos clock" chip was an MC146818 chip, now obsolete. * That defined the register interface now provided by all PCs, some * non-PC systems, and incorporated into ACPI. Modern PC chipsets * integrate an MC146818 clone in their southbridge, and boards use * that instead of discrete clones like the DS12887 or M48T86. There * are also clones that connect using the LPC bus. * * That register API is also used directly by various other drivers * (notably for integrated NVRAM), infrastructure (x86 has code to * bypass the RTC framework, directly reading the RTC during boot * and updating minutes/seconds for systems using NTP synch) and * utilities (like userspace 'hwclock', if no /dev node exists). * * So **ALL** calls to CMOS_READ and CMOS_WRITE must be done with * interrupts disabled, holding the global rtc_lock, to exclude those * other drivers and utilities on correctly configured systems. */ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt #include <linux/kernel.h> #include <linux/module.h> #include <linux/init.h> #include <linux/interrupt.h> #include <linux/spinlock.h> #include <linux/platform_device.h> #include <linux/log2.h> #include <linux/pm.h> #include <linux/of.h> #include <linux/of_platform.h> #ifdef CONFIG_X86 #include <asm/i8259.h> #include <asm/processor.h> #include <linux/dmi.h> #endif /* this is for "generic access to PC-style RTC" using CMOS_READ/CMOS_WRITE */ #include <linux/mc146818rtc.h> #ifdef CONFIG_ACPI /* * Use ACPI SCI to replace HPET interrupt for RTC Alarm event * * If cleared, ACPI SCI is only used to wake up the system from suspend * * If set, ACPI SCI is used to handle UIE/AIE and system wakeup */ static bool use_acpi_alarm; module_param(use_acpi_alarm, bool, 0444); static inline int cmos_use_acpi_alarm(void) { return use_acpi_alarm; } #else /* !CONFIG_ACPI */ static inline int cmos_use_acpi_alarm(void) { return 0; } #endif struct cmos_rtc { struct rtc_device *rtc; struct device *dev; int irq; struct resource *iomem; time64_t alarm_expires; void (*wake_on)(struct device *); void (*wake_off)(struct device *); u8 enabled_wake; u8 suspend_ctrl; /* newer hardware extends the original register set */ u8 day_alrm; u8 mon_alrm; u8 century; struct rtc_wkalrm saved_wkalrm; }; /* both platform and pnp busses use negative numbers for invalid irqs */ #define is_valid_irq(n) ((n) > 0) static const char driver_name[] = "rtc_cmos"; /* The RTC_INTR register may have e.g. RTC_PF set even if RTC_PIE is clear; * always mask it against the irq enable bits in RTC_CONTROL. Bit values * are the same: PF==PIE, AF=AIE, UF=UIE; so RTC_IRQMASK works with both. */ #define RTC_IRQMASK (RTC_PF | RTC_AF | RTC_UF) static inline int is_intr(u8 rtc_intr) { if (!(rtc_intr & RTC_IRQF)) return 0; return rtc_intr & RTC_IRQMASK; } /*----------------------------------------------------------------*/ /* Much modern x86 hardware has HPETs (10+ MHz timers) which, because * many BIOS programmers don't set up "sane mode" IRQ routing, are mostly * used in a broken "legacy replacement" mode. The breakage includes * HPET #1 hijacking the IRQ for this RTC, and being unavailable for * other (better) use. * * When that broken mode is in use, platform glue provides a partial * emulation of hardware RTC IRQ facilities using HPET #1. We don't * want to use HPET for anything except those IRQs though... 
*/ #ifdef CONFIG_HPET_EMULATE_RTC #include <asm/hpet.h> #else static inline int is_hpet_enabled(void) { return 0; } static inline int hpet_mask_rtc_irq_bit(unsigned long mask) { return 0; } static inline int hpet_set_rtc_irq_bit(unsigned long mask) { return 0; } static inline int hpet_set_alarm_time(unsigned char hrs, unsigned char min, unsigned char sec) { return 0; } static inline int hpet_set_periodic_freq(unsigned long freq) { return 0; } static inline int hpet_rtc_timer_init(void) { return 0; } extern irq_handler_t hpet_rtc_interrupt; static inline int hpet_register_irq_handler(irq_handler_t handler) { return 0; } static inline int hpet_unregister_irq_handler(irq_handler_t handler) { return 0; } #endif /* Don't use HPET for RTC Alarm event if ACPI Fixed event is used */ static inline int use_hpet_alarm(void) { return is_hpet_enabled() && !cmos_use_acpi_alarm(); } /*----------------------------------------------------------------*/ #ifdef RTC_PORT /* Most newer x86 systems have two register banks, the first used * for RTC and NVRAM and the second only for NVRAM. Caller must * own rtc_lock ... and we won't worry about access during NMI. */ #define can_bank2 true static inline unsigned char cmos_read_bank2(unsigned char addr) { outb(addr, RTC_PORT(2)); return inb(RTC_PORT(3)); } static inline void cmos_write_bank2(unsigned char val, unsigned char addr) { outb(addr, RTC_PORT(2)); outb(val, RTC_PORT(3)); } #else #define can_bank2 false static inline unsigned char cmos_read_bank2(unsigned char addr) { return 0; } static inline void cmos_write_bank2(unsigned char val, unsigned char addr) { } #endif /*----------------------------------------------------------------*/ static int cmos_read_time(struct device *dev, struct rtc_time *t) { int ret; /* * If pm_trace abused the RTC for storage, set the timespec to 0, * which tells the caller that this RTC value is unusable. */ if (!pm_trace_rtc_valid()) return -EIO; ret = mc146818_get_time(t, 1000); if (ret < 0) { dev_err_ratelimited(dev, "unable to read current time\n"); return ret; } return 0; } static int cmos_set_time(struct device *dev, struct rtc_time *t) { /* NOTE: this ignores the issue whereby updating the seconds * takes effect exactly 500ms after we write the register. * (Also queueing and other delays before we get this far.) */ return mc146818_set_time(t); } struct cmos_read_alarm_callback_param { struct cmos_rtc *cmos; struct rtc_time *time; unsigned char rtc_control; }; static void cmos_read_alarm_callback(unsigned char __always_unused seconds, void *param_in) { struct cmos_read_alarm_callback_param *p = (struct cmos_read_alarm_callback_param *)param_in; struct rtc_time *time = p->time; time->tm_sec = CMOS_READ(RTC_SECONDS_ALARM); time->tm_min = CMOS_READ(RTC_MINUTES_ALARM); time->tm_hour = CMOS_READ(RTC_HOURS_ALARM); if (p->cmos->day_alrm) { /* ignore upper bits on readback per ACPI spec */ time->tm_mday = CMOS_READ(p->cmos->day_alrm) & 0x3f; if (!time->tm_mday) time->tm_mday = -1; if (p->cmos->mon_alrm) { time->tm_mon = CMOS_READ(p->cmos->mon_alrm); if (!time->tm_mon) time->tm_mon = -1; } } p->rtc_control = CMOS_READ(RTC_CONTROL); } static int cmos_read_alarm(struct device *dev, struct rtc_wkalrm *t) { struct cmos_rtc *cmos = dev_get_drvdata(dev); struct cmos_read_alarm_callback_param p = { .cmos = cmos, .time = &t->time, }; /* This not only a rtc_op, but also called directly */ if (!is_valid_irq(cmos->irq)) return -ETIMEDOUT; /* Basic alarms only support hour, minute, and seconds fields. 
* Some also support day and month, for alarms up to a year in * the future. */ /* Some Intel chipsets disconnect the alarm registers when the clock * update is in progress - during this time reads return bogus values * and writes may fail silently. See for example "7th Generation Intel® * Processor Family I/O for U/Y Platforms [...] Datasheet", section * 27.7.1 * * Use the mc146818_avoid_UIP() function to avoid this. */ if (!mc146818_avoid_UIP(cmos_read_alarm_callback, 10, &p)) return -EIO; if (!(p.rtc_control & RTC_DM_BINARY) || RTC_ALWAYS_BCD) { if (((unsigned)t->time.tm_sec) < 0x60) t->time.tm_sec = bcd2bin(t->time.tm_sec); else t->time.tm_sec = -1; if (((unsigned)t->time.tm_min) < 0x60) t->time.tm_min = bcd2bin(t->time.tm_min); else t->time.tm_min = -1; if (((unsigned)t->time.tm_hour) < 0x24) t->time.tm_hour = bcd2bin(t->time.tm_hour); else t->time.tm_hour = -1; if (cmos->day_alrm) { if (((unsigned)t->time.tm_mday) <= 0x31) t->time.tm_mday = bcd2bin(t->time.tm_mday); else t->time.tm_mday = -1; if (cmos->mon_alrm) { if (((unsigned)t->time.tm_mon) <= 0x12) t->time.tm_mon = bcd2bin(t->time.tm_mon)-1; else t->time.tm_mon = -1; } } } t->enabled = !!(p.rtc_control & RTC_AIE); t->pending = 0; return 0; } static void cmos_checkintr(struct cmos_rtc *cmos, unsigned char rtc_control) { unsigned char rtc_intr; /* NOTE after changing RTC_xIE bits we always read INTR_FLAGS; * allegedly some older rtcs need that to handle irqs properly */ rtc_intr = CMOS_READ(RTC_INTR_FLAGS); if (use_hpet_alarm()) return; rtc_intr &= (rtc_control & RTC_IRQMASK) | RTC_IRQF; if (is_intr(rtc_intr)) rtc_update_irq(cmos->rtc, 1, rtc_intr); } static void cmos_irq_enable(struct cmos_rtc *cmos, unsigned char mask) { unsigned char rtc_control; /* flush any pending IRQ status, notably for update irqs, * before we enable new IRQs */ rtc_control = CMOS_READ(RTC_CONTROL); cmos_checkintr(cmos, rtc_control); rtc_control |= mask; CMOS_WRITE(rtc_control, RTC_CONTROL); if (use_hpet_alarm()) hpet_set_rtc_irq_bit(mask); if ((mask & RTC_AIE) && cmos_use_acpi_alarm()) { if (cmos->wake_on) cmos->wake_on(cmos->dev); } cmos_checkintr(cmos, rtc_control); } static void cmos_irq_disable(struct cmos_rtc *cmos, unsigned char mask) { unsigned char rtc_control; rtc_control = CMOS_READ(RTC_CONTROL); rtc_control &= ~mask; CMOS_WRITE(rtc_control, RTC_CONTROL); if (use_hpet_alarm()) hpet_mask_rtc_irq_bit(mask); if ((mask & RTC_AIE) && cmos_use_acpi_alarm()) { if (cmos->wake_off) cmos->wake_off(cmos->dev); } cmos_checkintr(cmos, rtc_control); } static int cmos_validate_alarm(struct device *dev, struct rtc_wkalrm *t) { struct cmos_rtc *cmos = dev_get_drvdata(dev); struct rtc_time now; cmos_read_time(dev, &now); if (!cmos->day_alrm) { time64_t t_max_date; time64_t t_alrm; t_max_date = rtc_tm_to_time64(&now); t_max_date += 24 * 60 * 60 - 1; t_alrm = rtc_tm_to_time64(&t->time); if (t_alrm > t_max_date) { dev_err(dev, "Alarms can be up to one day in the future\n"); return -EINVAL; } } else if (!cmos->mon_alrm) { struct rtc_time max_date = now; time64_t t_max_date; time64_t t_alrm; int max_mday; if (max_date.tm_mon == 11) { max_date.tm_mon = 0; max_date.tm_year += 1; } else { max_date.tm_mon += 1; } max_mday = rtc_month_days(max_date.tm_mon, max_date.tm_year); if (max_date.tm_mday > max_mday) max_date.tm_mday = max_mday; t_max_date = rtc_tm_to_time64(&max_date); t_max_date -= 1; t_alrm = rtc_tm_to_time64(&t->time); if (t_alrm > t_max_date) { dev_err(dev, "Alarms can be up to one month in the future\n"); return -EINVAL; } } else { struct rtc_time max_date = 
now; time64_t t_max_date; time64_t t_alrm; int max_mday; max_date.tm_year += 1; max_mday = rtc_month_days(max_date.tm_mon, max_date.tm_year); if (max_date.tm_mday > max_mday) max_date.tm_mday = max_mday; t_max_date = rtc_tm_to_time64(&max_date); t_max_date -= 1; t_alrm = rtc_tm_to_time64(&t->time); if (t_alrm > t_max_date) { dev_err(dev, "Alarms can be up to one year in the future\n"); return -EINVAL; } } return 0; } struct cmos_set_alarm_callback_param { struct cmos_rtc *cmos; unsigned char mon, mday, hrs, min, sec; struct rtc_wkalrm *t; }; /* Note: this function may be executed by mc146818_avoid_UIP() more then * once */ static void cmos_set_alarm_callback(unsigned char __always_unused seconds, void *param_in) { struct cmos_set_alarm_callback_param *p = (struct cmos_set_alarm_callback_param *)param_in; /* next rtc irq must not be from previous alarm setting */ cmos_irq_disable(p->cmos, RTC_AIE); /* update alarm */ CMOS_WRITE(p->hrs, RTC_HOURS_ALARM); CMOS_WRITE(p->min, RTC_MINUTES_ALARM); CMOS_WRITE(p->sec, RTC_SECONDS_ALARM); /* the system may support an "enhanced" alarm */ if (p->cmos->day_alrm) { CMOS_WRITE(p->mday, p->cmos->day_alrm); if (p->cmos->mon_alrm) CMOS_WRITE(p->mon, p->cmos->mon_alrm); } if (use_hpet_alarm()) { /* * FIXME the HPET alarm glue currently ignores day_alrm * and mon_alrm ... */ hpet_set_alarm_time(p->t->time.tm_hour, p->t->time.tm_min, p->t->time.tm_sec); } if (p->t->enabled) cmos_irq_enable(p->cmos, RTC_AIE); } static int cmos_set_alarm(struct device *dev, struct rtc_wkalrm *t) { struct cmos_rtc *cmos = dev_get_drvdata(dev); struct cmos_set_alarm_callback_param p = { .cmos = cmos, .t = t }; unsigned char rtc_control; int ret; /* This not only a rtc_op, but also called directly */ if (!is_valid_irq(cmos->irq)) return -EIO; ret = cmos_validate_alarm(dev, t); if (ret < 0) return ret; p.mon = t->time.tm_mon + 1; p.mday = t->time.tm_mday; p.hrs = t->time.tm_hour; p.min = t->time.tm_min; p.sec = t->time.tm_sec; spin_lock_irq(&rtc_lock); rtc_control = CMOS_READ(RTC_CONTROL); spin_unlock_irq(&rtc_lock); if (!(rtc_control & RTC_DM_BINARY) || RTC_ALWAYS_BCD) { /* Writing 0xff means "don't care" or "match all". */ p.mon = (p.mon <= 12) ? bin2bcd(p.mon) : 0xff; p.mday = (p.mday >= 1 && p.mday <= 31) ? bin2bcd(p.mday) : 0xff; p.hrs = (p.hrs < 24) ? bin2bcd(p.hrs) : 0xff; p.min = (p.min < 60) ? bin2bcd(p.min) : 0xff; p.sec = (p.sec < 60) ? bin2bcd(p.sec) : 0xff; } /* * Some Intel chipsets disconnect the alarm registers when the clock * update is in progress - during this time writes fail silently. * * Use mc146818_avoid_UIP() to avoid this. */ if (!mc146818_avoid_UIP(cmos_set_alarm_callback, 10, &p)) return -ETIMEDOUT; cmos->alarm_expires = rtc_tm_to_time64(&t->time); return 0; } static int cmos_alarm_irq_enable(struct device *dev, unsigned int enabled) { struct cmos_rtc *cmos = dev_get_drvdata(dev); unsigned long flags; spin_lock_irqsave(&rtc_lock, flags); if (enabled) cmos_irq_enable(cmos, RTC_AIE); else cmos_irq_disable(cmos, RTC_AIE); spin_unlock_irqrestore(&rtc_lock, flags); return 0; } #if IS_ENABLED(CONFIG_RTC_INTF_PROC) static int cmos_procfs(struct device *dev, struct seq_file *seq) { struct cmos_rtc *cmos = dev_get_drvdata(dev); unsigned char rtc_control, valid; spin_lock_irq(&rtc_lock); rtc_control = CMOS_READ(RTC_CONTROL); valid = CMOS_READ(RTC_VALID); spin_unlock_irq(&rtc_lock); /* NOTE: at least ICH6 reports battery status using a different * (non-RTC) bit; and SQWE is ignored on many current systems. 
*/ seq_printf(seq, "periodic_IRQ\t: %s\n" "update_IRQ\t: %s\n" "HPET_emulated\t: %s\n" // "square_wave\t: %s\n" "BCD\t\t: %s\n" "DST_enable\t: %s\n" "periodic_freq\t: %d\n" "batt_status\t: %s\n", (rtc_control & RTC_PIE) ? "yes" : "no", (rtc_control & RTC_UIE) ? "yes" : "no", use_hpet_alarm() ? "yes" : "no", // (rtc_control & RTC_SQWE) ? "yes" : "no", (rtc_control & RTC_DM_BINARY) ? "no" : "yes", (rtc_control & RTC_DST_EN) ? "yes" : "no", cmos->rtc->irq_freq, (valid & RTC_VRT) ? "okay" : "dead"); return 0; } #else #define cmos_procfs NULL #endif static const struct rtc_class_ops cmos_rtc_ops = { .read_time = cmos_read_time, .set_time = cmos_set_time, .read_alarm = cmos_read_alarm, .set_alarm = cmos_set_alarm, .proc = cmos_procfs, .alarm_irq_enable = cmos_alarm_irq_enable, }; /*----------------------------------------------------------------*/ /* * All these chips have at least 64 bytes of address space, shared by * RTC registers and NVRAM. Most of those bytes of NVRAM are used * by boot firmware. Modern chips have 128 or 256 bytes. */ #define NVRAM_OFFSET (RTC_REG_D + 1) static int cmos_nvram_read(void *priv, unsigned int off, void *val, size_t count) { unsigned char *buf = val; off += NVRAM_OFFSET; for (; count; count--, off++, buf++) { guard(spinlock_irq)(&rtc_lock); if (off < 128) *buf = CMOS_READ(off); else if (can_bank2) *buf = cmos_read_bank2(off); else return -EIO; } return 0; } static int cmos_nvram_write(void *priv, unsigned int off, void *val, size_t count) { struct cmos_rtc *cmos = priv; unsigned char *buf = val; /* NOTE: on at least PCs and Ataris, the boot firmware uses a * checksum on part of the NVRAM data. That's currently ignored * here. If userspace is smart enough to know what fields of * NVRAM to update, updating checksums is also part of its job. */ off += NVRAM_OFFSET; for (; count; count--, off++, buf++) { /* don't trash RTC registers */ if (off == cmos->day_alrm || off == cmos->mon_alrm || off == cmos->century) continue; guard(spinlock_irq)(&rtc_lock); if (off < 128) CMOS_WRITE(*buf, off); else if (can_bank2) cmos_write_bank2(*buf, off); else return -EIO; } return 0; } /*----------------------------------------------------------------*/ static struct cmos_rtc cmos_rtc; static irqreturn_t cmos_interrupt(int irq, void *p) { u8 irqstat; u8 rtc_control; spin_lock(&rtc_lock); /* When the HPET interrupt handler calls us, the interrupt * status is passed as arg1 instead of the irq number. But * always clear irq status, even when HPET is in the way. * * Note that HPET and RTC are almost certainly out of phase, * giving different IRQ status ... */ irqstat = CMOS_READ(RTC_INTR_FLAGS); rtc_control = CMOS_READ(RTC_CONTROL); if (use_hpet_alarm()) irqstat = (unsigned long)irq & 0xF0; /* If we were suspended, RTC_CONTROL may not be accurate since the * bios may have cleared it. */ if (!cmos_rtc.suspend_ctrl) irqstat &= (rtc_control & RTC_IRQMASK) | RTC_IRQF; else irqstat &= (cmos_rtc.suspend_ctrl & RTC_IRQMASK) | RTC_IRQF; /* All Linux RTC alarms should be treated as if they were oneshot. * Similar code may be needed in system wakeup paths, in case the * alarm woke the system. 
*/ if (irqstat & RTC_AIE) { cmos_rtc.suspend_ctrl &= ~RTC_AIE; rtc_control &= ~RTC_AIE; CMOS_WRITE(rtc_control, RTC_CONTROL); if (use_hpet_alarm()) hpet_mask_rtc_irq_bit(RTC_AIE); CMOS_READ(RTC_INTR_FLAGS); } spin_unlock(&rtc_lock); if (is_intr(irqstat)) { rtc_update_irq(p, 1, irqstat); return IRQ_HANDLED; } else return IRQ_NONE; } #ifdef CONFIG_ACPI #include <linux/acpi.h> static u32 rtc_handler(void *context) { struct device *dev = context; struct cmos_rtc *cmos = dev_get_drvdata(dev); unsigned char rtc_control = 0; unsigned char rtc_intr; unsigned long flags; /* * Always update rtc irq when ACPI is used as RTC Alarm. * Or else, ACPI SCI is enabled during suspend/resume only, * update rtc irq in that case. */ if (cmos_use_acpi_alarm()) cmos_interrupt(0, (void *)cmos->rtc); else { /* Fix me: can we use cmos_interrupt() here as well? */ spin_lock_irqsave(&rtc_lock, flags); if (cmos_rtc.suspend_ctrl) rtc_control = CMOS_READ(RTC_CONTROL); if (rtc_control & RTC_AIE) { cmos_rtc.suspend_ctrl &= ~RTC_AIE; CMOS_WRITE(rtc_control, RTC_CONTROL); rtc_intr = CMOS_READ(RTC_INTR_FLAGS); rtc_update_irq(cmos->rtc, 1, rtc_intr); } spin_unlock_irqrestore(&rtc_lock, flags); } pm_wakeup_hard_event(dev); acpi_clear_event(ACPI_EVENT_RTC); acpi_disable_event(ACPI_EVENT_RTC, 0); return ACPI_INTERRUPT_HANDLED; } static void acpi_rtc_event_setup(struct device *dev) { if (acpi_disabled) return; acpi_install_fixed_event_handler(ACPI_EVENT_RTC, rtc_handler, dev); /* * After the RTC handler is installed, the Fixed_RTC event should * be disabled. Only when the RTC alarm is set will it be enabled. */ acpi_clear_event(ACPI_EVENT_RTC); acpi_disable_event(ACPI_EVENT_RTC, 0); } static void acpi_rtc_event_cleanup(void) { if (acpi_disabled) return; acpi_remove_fixed_event_handler(ACPI_EVENT_RTC, rtc_handler); } static void rtc_wake_on(struct device *dev) { acpi_clear_event(ACPI_EVENT_RTC); acpi_enable_event(ACPI_EVENT_RTC, 0); } static void rtc_wake_off(struct device *dev) { acpi_disable_event(ACPI_EVENT_RTC, 0); } #ifdef CONFIG_X86 static void use_acpi_alarm_quirks(void) { switch (boot_cpu_data.x86_vendor) { case X86_VENDOR_INTEL: if (dmi_get_bios_year() < 2015) return; break; case X86_VENDOR_AMD: case X86_VENDOR_HYGON: if (dmi_get_bios_year() < 2021) return; break; default: return; } if (!is_hpet_enabled()) return; use_acpi_alarm = true; } #else static inline void use_acpi_alarm_quirks(void) { } #endif static void acpi_cmos_wake_setup(struct device *dev) { if (acpi_disabled) return; use_acpi_alarm_quirks(); cmos_rtc.wake_on = rtc_wake_on; cmos_rtc.wake_off = rtc_wake_off; /* ACPI tables bug workaround. 
*/ if (acpi_gbl_FADT.month_alarm && !acpi_gbl_FADT.day_alarm) { dev_dbg(dev, "bogus FADT month_alarm (%d)\n", acpi_gbl_FADT.month_alarm); acpi_gbl_FADT.month_alarm = 0; } cmos_rtc.day_alrm = acpi_gbl_FADT.day_alarm; cmos_rtc.mon_alrm = acpi_gbl_FADT.month_alarm; cmos_rtc.century = acpi_gbl_FADT.century; if (acpi_gbl_FADT.flags & ACPI_FADT_S4_RTC_WAKE) dev_info(dev, "RTC can wake from S4\n"); /* RTC always wakes from S1/S2/S3, and often S4/STD */ device_init_wakeup(dev, true); } static void cmos_check_acpi_rtc_status(struct device *dev, unsigned char *rtc_control) { struct cmos_rtc *cmos = dev_get_drvdata(dev); acpi_event_status rtc_status; acpi_status status; if (acpi_gbl_FADT.flags & ACPI_FADT_FIXED_RTC) return; status = acpi_get_event_status(ACPI_EVENT_RTC, &rtc_status); if (ACPI_FAILURE(status)) { dev_err(dev, "Could not get RTC status\n"); } else if (rtc_status & ACPI_EVENT_FLAG_SET) { unsigned char mask; *rtc_control &= ~RTC_AIE; CMOS_WRITE(*rtc_control, RTC_CONTROL); mask = CMOS_READ(RTC_INTR_FLAGS); rtc_update_irq(cmos->rtc, 1, mask); } } #else /* !CONFIG_ACPI */ static inline void acpi_rtc_event_setup(struct device *dev) { } static inline void acpi_rtc_event_cleanup(void) { } static inline void acpi_cmos_wake_setup(struct device *dev) { } static inline void cmos_check_acpi_rtc_status(struct device *dev, unsigned char *rtc_control) { } #endif /* CONFIG_ACPI */ #ifdef CONFIG_PNP #define INITSECTION #else #define INITSECTION __init #endif #define SECS_PER_DAY (24 * 60 * 60) #define SECS_PER_MONTH (28 * SECS_PER_DAY) #define SECS_PER_YEAR (365 * SECS_PER_DAY) static int INITSECTION cmos_do_probe(struct device *dev, struct resource *ports, int rtc_irq) { struct cmos_rtc_board_info *info = dev_get_platdata(dev); int retval = 0; unsigned char rtc_control; unsigned address_space; u32 flags = 0; struct nvmem_config nvmem_cfg = { .name = "cmos_nvram", .word_size = 1, .stride = 1, .reg_read = cmos_nvram_read, .reg_write = cmos_nvram_write, .priv = &cmos_rtc, }; /* there can be only one ... */ if (cmos_rtc.dev) return -EBUSY; if (!ports) return -ENODEV; /* Claim I/O ports ASAP, minimizing conflict with legacy driver. * * REVISIT non-x86 systems may instead use memory space resources * (needing ioremap etc), not i/o space resources like this ... */ if (RTC_IOMAPPED) ports = request_region(ports->start, resource_size(ports), driver_name); else ports = request_mem_region(ports->start, resource_size(ports), driver_name); if (!ports) { dev_dbg(dev, "i/o registers already in use\n"); return -EBUSY; } cmos_rtc.irq = rtc_irq; cmos_rtc.iomem = ports; /* Heuristic to deduce NVRAM size ... do what the legacy NVRAM * driver did, but don't reject unknown configs. Old hardware * won't address 128 bytes. Newer chips have multiple banks, * though they may not be listed in one I/O resource. */ #if defined(CONFIG_ATARI) address_space = 64; #elif defined(__i386__) || defined(__x86_64__) || defined(__arm__) \ || defined(__sparc__) || defined(__mips__) \ || defined(__powerpc__) address_space = 128; #else #warning Assuming 128 bytes of RTC+NVRAM address space, not 64 bytes. address_space = 128; #endif if (can_bank2 && ports->end > (ports->start + 1)) address_space = 256; /* For ACPI systems extension info comes from the FADT. On others, * board specific setup provides it as appropriate. Systems where * the alarm IRQ isn't automatically a wakeup IRQ (like ACPI, and * some almost-clones) can provide hooks to make that behave. 
* * Note that ACPI doesn't preclude putting these registers into * "extended" areas of the chip, including some that we won't yet * expect CMOS_READ and friends to handle. */ if (info) { if (info->flags) flags = info->flags; if (info->address_space) address_space = info->address_space; cmos_rtc.day_alrm = info->rtc_day_alarm; cmos_rtc.mon_alrm = info->rtc_mon_alarm; cmos_rtc.century = info->rtc_century; if (info->wake_on && info->wake_off) { cmos_rtc.wake_on = info->wake_on; cmos_rtc.wake_off = info->wake_off; } } else { acpi_cmos_wake_setup(dev); } if (cmos_rtc.day_alrm >= 128) cmos_rtc.day_alrm = 0; if (cmos_rtc.mon_alrm >= 128) cmos_rtc.mon_alrm = 0; if (cmos_rtc.century >= 128) cmos_rtc.century = 0; cmos_rtc.dev = dev; dev_set_drvdata(dev, &cmos_rtc); cmos_rtc.rtc = devm_rtc_allocate_device(dev); if (IS_ERR(cmos_rtc.rtc)) { retval = PTR_ERR(cmos_rtc.rtc); goto cleanup0; } if (cmos_rtc.mon_alrm) cmos_rtc.rtc->alarm_offset_max = SECS_PER_YEAR - 1; else if (cmos_rtc.day_alrm) cmos_rtc.rtc->alarm_offset_max = SECS_PER_MONTH - 1; else cmos_rtc.rtc->alarm_offset_max = SECS_PER_DAY - 1; rename_region(ports, dev_name(&cmos_rtc.rtc->dev)); if (!mc146818_does_rtc_work()) { dev_warn(dev, "broken or not accessible\n"); retval = -ENXIO; goto cleanup1; } spin_lock_irq(&rtc_lock); if (!(flags & CMOS_RTC_FLAGS_NOFREQ)) { /* force periodic irq to CMOS reset default of 1024Hz; * * REVISIT it's been reported that at least one x86_64 ALI * mobo doesn't use 32KHz here ... for portability we might * need to do something about other clock frequencies. */ cmos_rtc.rtc->irq_freq = 1024; if (use_hpet_alarm()) hpet_set_periodic_freq(cmos_rtc.rtc->irq_freq); CMOS_WRITE(RTC_REF_CLCK_32KHZ | 0x06, RTC_FREQ_SELECT); } /* disable irqs */ if (is_valid_irq(rtc_irq)) cmos_irq_disable(&cmos_rtc, RTC_PIE | RTC_AIE | RTC_UIE); rtc_control = CMOS_READ(RTC_CONTROL); spin_unlock_irq(&rtc_lock); if (is_valid_irq(rtc_irq) && !(rtc_control & RTC_24H)) { dev_warn(dev, "only 24-hr supported\n"); retval = -ENXIO; goto cleanup1; } if (use_hpet_alarm()) hpet_rtc_timer_init(); if (is_valid_irq(rtc_irq)) { irq_handler_t rtc_cmos_int_handler; if (use_hpet_alarm()) { rtc_cmos_int_handler = hpet_rtc_interrupt; retval = hpet_register_irq_handler(cmos_interrupt); if (retval) { hpet_mask_rtc_irq_bit(RTC_IRQMASK); dev_warn(dev, "hpet_register_irq_handler " " failed in rtc_init()."); goto cleanup1; } } else rtc_cmos_int_handler = cmos_interrupt; retval = request_irq(rtc_irq, rtc_cmos_int_handler, 0, dev_name(&cmos_rtc.rtc->dev), cmos_rtc.rtc); if (retval < 0) { dev_dbg(dev, "IRQ %d is already in use\n", rtc_irq); goto cleanup1; } } else { clear_bit(RTC_FEATURE_ALARM, cmos_rtc.rtc->features); } cmos_rtc.rtc->ops = &cmos_rtc_ops; retval = devm_rtc_register_device(cmos_rtc.rtc); if (retval) goto cleanup2; /* Set the sync offset for the periodic 11min update correct */ cmos_rtc.rtc->set_offset_nsec = NSEC_PER_SEC / 2; /* export at least the first block of NVRAM */ nvmem_cfg.size = address_space - NVRAM_OFFSET; devm_rtc_nvmem_register(cmos_rtc.rtc, &nvmem_cfg); /* * Everything has gone well so far, so by default register a handler for * the ACPI RTC fixed event. */ if (!info) acpi_rtc_event_setup(dev); dev_info(dev, "%s%s, %d bytes nvram%s\n", !is_valid_irq(rtc_irq) ? "no alarms" : cmos_rtc.mon_alrm ? "alarms up to one year" : cmos_rtc.day_alrm ? "alarms up to one month" : "alarms up to one day", cmos_rtc.century ? ", y3k" : "", nvmem_cfg.size, use_hpet_alarm() ? 
", hpet irqs" : ""); return 0; cleanup2: if (is_valid_irq(rtc_irq)) free_irq(rtc_irq, cmos_rtc.rtc); cleanup1: cmos_rtc.dev = NULL; cleanup0: if (RTC_IOMAPPED) release_region(ports->start, resource_size(ports)); else release_mem_region(ports->start, resource_size(ports)); return retval; } static void cmos_do_shutdown(int rtc_irq) { spin_lock_irq(&rtc_lock); if (is_valid_irq(rtc_irq)) cmos_irq_disable(&cmos_rtc, RTC_IRQMASK); spin_unlock_irq(&rtc_lock); } static void cmos_do_remove(struct device *dev) { struct cmos_rtc *cmos = dev_get_drvdata(dev); struct resource *ports; cmos_do_shutdown(cmos->irq); if (is_valid_irq(cmos->irq)) { free_irq(cmos->irq, cmos->rtc); if (use_hpet_alarm()) hpet_unregister_irq_handler(cmos_interrupt); } if (!dev_get_platdata(dev)) acpi_rtc_event_cleanup(); cmos->rtc = NULL; ports = cmos->iomem; if (RTC_IOMAPPED) release_region(ports->start, resource_size(ports)); else release_mem_region(ports->start, resource_size(ports)); cmos->iomem = NULL; cmos->dev = NULL; } static int cmos_aie_poweroff(struct device *dev) { struct cmos_rtc *cmos = dev_get_drvdata(dev); struct rtc_time now; time64_t t_now; int retval = 0; unsigned char rtc_control; if (!cmos->alarm_expires) return -EINVAL; spin_lock_irq(&rtc_lock); rtc_control = CMOS_READ(RTC_CONTROL); spin_unlock_irq(&rtc_lock); /* We only care about the situation where AIE is disabled. */ if (rtc_control & RTC_AIE) return -EBUSY; cmos_read_time(dev, &now); t_now = rtc_tm_to_time64(&now); /* * When enabling "RTC wake-up" in BIOS setup, the machine reboots * automatically right after shutdown on some buggy boxes. * This automatic rebooting issue won't happen when the alarm * time is larger than now+1 seconds. * * If the alarm time is equal to now+1 seconds, the issue can be * prevented by cancelling the alarm. */ if (cmos->alarm_expires == t_now + 1) { struct rtc_wkalrm alarm; /* Cancel the AIE timer by configuring the past time. */ rtc_time64_to_tm(t_now - 1, &alarm.time); alarm.enabled = 0; retval = cmos_set_alarm(dev, &alarm); } else if (cmos->alarm_expires > t_now + 1) { retval = -EBUSY; } return retval; } static int cmos_suspend(struct device *dev) { struct cmos_rtc *cmos = dev_get_drvdata(dev); unsigned char tmp; /* only the alarm might be a wakeup event source */ spin_lock_irq(&rtc_lock); cmos->suspend_ctrl = tmp = CMOS_READ(RTC_CONTROL); if (tmp & (RTC_PIE|RTC_AIE|RTC_UIE)) { unsigned char mask; if (device_may_wakeup(dev)) mask = RTC_IRQMASK & ~RTC_AIE; else mask = RTC_IRQMASK; tmp &= ~mask; CMOS_WRITE(tmp, RTC_CONTROL); if (use_hpet_alarm()) hpet_mask_rtc_irq_bit(mask); cmos_checkintr(cmos, tmp); } spin_unlock_irq(&rtc_lock); if ((tmp & RTC_AIE) && !cmos_use_acpi_alarm()) { cmos->enabled_wake = 1; if (cmos->wake_on) cmos->wake_on(dev); else enable_irq_wake(cmos->irq); } memset(&cmos->saved_wkalrm, 0, sizeof(struct rtc_wkalrm)); cmos_read_alarm(dev, &cmos->saved_wkalrm); dev_dbg(dev, "suspend%s, ctrl %02x\n", (tmp & RTC_AIE) ? ", alarm may wake" : "", tmp); return 0; } /* We want RTC alarms to wake us from e.g. ACPI G2/S5 "soft off", even * after a detour through G3 "mechanical off", although the ACPI spec * says wakeup should only work from G1/S4 "hibernate". To most users, * distinctions between S4 and S5 are pointless. So when the hardware * allows, don't draw that distinction. 
*/ static inline int cmos_poweroff(struct device *dev) { if (!IS_ENABLED(CONFIG_PM)) return -ENOSYS; return cmos_suspend(dev); } static void cmos_check_wkalrm(struct device *dev) { struct cmos_rtc *cmos = dev_get_drvdata(dev); struct rtc_wkalrm current_alarm; time64_t t_now; time64_t t_current_expires; time64_t t_saved_expires; struct rtc_time now; /* Check if we have RTC Alarm armed */ if (!(cmos->suspend_ctrl & RTC_AIE)) return; cmos_read_time(dev, &now); t_now = rtc_tm_to_time64(&now); /* * ACPI RTC wake event is cleared after resume from STR, * ACK the rtc irq here */ if (t_now >= cmos->alarm_expires && cmos_use_acpi_alarm()) { local_irq_disable(); cmos_interrupt(0, (void *)cmos->rtc); local_irq_enable(); return; } memset(&current_alarm, 0, sizeof(struct rtc_wkalrm)); cmos_read_alarm(dev, &current_alarm); t_current_expires = rtc_tm_to_time64(&current_alarm.time); t_saved_expires = rtc_tm_to_time64(&cmos->saved_wkalrm.time); if (t_current_expires != t_saved_expires || cmos->saved_wkalrm.enabled != current_alarm.enabled) { cmos_set_alarm(dev, &cmos->saved_wkalrm); } } static int __maybe_unused cmos_resume(struct device *dev) { struct cmos_rtc *cmos = dev_get_drvdata(dev); unsigned char tmp; if (cmos->enabled_wake && !cmos_use_acpi_alarm()) { if (cmos->wake_off) cmos->wake_off(dev); else disable_irq_wake(cmos->irq); cmos->enabled_wake = 0; } /* The BIOS might have changed the alarm, restore it */ cmos_check_wkalrm(dev); spin_lock_irq(&rtc_lock); tmp = cmos->suspend_ctrl; cmos->suspend_ctrl = 0; /* re-enable any irqs previously active */ if (tmp & RTC_IRQMASK) { unsigned char mask; if (device_may_wakeup(dev) && use_hpet_alarm()) hpet_rtc_timer_init(); do { CMOS_WRITE(tmp, RTC_CONTROL); if (use_hpet_alarm()) hpet_set_rtc_irq_bit(tmp & RTC_IRQMASK); mask = CMOS_READ(RTC_INTR_FLAGS); mask &= (tmp & RTC_IRQMASK) | RTC_IRQF; if (!use_hpet_alarm() || !is_intr(mask)) break; /* force one-shot behavior if HPET blocked * the wake alarm's irq */ rtc_update_irq(cmos->rtc, 1, mask); tmp &= ~RTC_AIE; hpet_mask_rtc_irq_bit(RTC_AIE); } while (mask & RTC_AIE); if (tmp & RTC_AIE) cmos_check_acpi_rtc_status(dev, &tmp); } spin_unlock_irq(&rtc_lock); dev_dbg(dev, "resume, ctrl %02x\n", tmp); return 0; } static SIMPLE_DEV_PM_OPS(cmos_pm_ops, cmos_suspend, cmos_resume); /*----------------------------------------------------------------*/ /* On non-x86 systems, a "CMOS" RTC lives most naturally on platform_bus. * ACPI systems always list these as PNPACPI devices, and pre-ACPI PCs * probably list them in similar PNPBIOS tables; so PNP is more common. * * We don't use legacy "poke at the hardware" probing. Ancient PCs that * predate even PNPBIOS should set up platform_bus devices. */ #ifdef CONFIG_PNP #include <linux/pnp.h> static int cmos_pnp_probe(struct pnp_dev *pnp, const struct pnp_device_id *id) { int irq; if (pnp_port_start(pnp, 0) == 0x70 && !pnp_irq_valid(pnp, 0)) { irq = 0; #ifdef CONFIG_X86 /* Some machines contain a PNP entry for the RTC, but * don't define the IRQ. It should always be safe to * hardcode it on systems with a legacy PIC.
*/ if (nr_legacy_irqs()) irq = RTC_IRQ; #endif } else { irq = pnp_irq(pnp, 0); } return cmos_do_probe(&pnp->dev, pnp_get_resource(pnp, IORESOURCE_IO, 0), irq); } static void cmos_pnp_remove(struct pnp_dev *pnp) { cmos_do_remove(&pnp->dev); } static void cmos_pnp_shutdown(struct pnp_dev *pnp) { struct device *dev = &pnp->dev; struct cmos_rtc *cmos = dev_get_drvdata(dev); if (system_state == SYSTEM_POWER_OFF) { int retval = cmos_poweroff(dev); if (cmos_aie_poweroff(dev) < 0 && !retval) return; } cmos_do_shutdown(cmos->irq); } static const struct pnp_device_id rtc_ids[] = { { .id = "PNP0b00", }, { .id = "PNP0b01", }, { .id = "PNP0b02", }, { }, }; MODULE_DEVICE_TABLE(pnp, rtc_ids); static struct pnp_driver cmos_pnp_driver = { .name = driver_name, .id_table = rtc_ids, .probe = cmos_pnp_probe, .remove = cmos_pnp_remove, .shutdown = cmos_pnp_shutdown, /* flag ensures resume() gets called, and stops syslog spam */ .flags = PNP_DRIVER_RES_DO_NOT_CHANGE, .driver = { .pm = &cmos_pm_ops, }, }; #endif /* CONFIG_PNP */ #ifdef CONFIG_OF static const struct of_device_id of_cmos_match[] = { { .compatible = "motorola,mc146818", }, { }, }; MODULE_DEVICE_TABLE(of, of_cmos_match); static __init void cmos_of_init(struct platform_device *pdev) { struct device_node *node = pdev->dev.of_node; const __be32 *val; if (!node) return; val = of_get_property(node, "ctrl-reg", NULL); if (val) CMOS_WRITE(be32_to_cpup(val), RTC_CONTROL); val = of_get_property(node, "freq-reg", NULL); if (val) CMOS_WRITE(be32_to_cpup(val), RTC_FREQ_SELECT); } #else static inline void cmos_of_init(struct platform_device *pdev) {} #endif /*----------------------------------------------------------------*/ /* Platform setup should have set up an RTC device, when PNP is * unavailable ... this could happen even on (older) PCs. 
*/ static int __init cmos_platform_probe(struct platform_device *pdev) { struct resource *resource; int irq; cmos_of_init(pdev); if (RTC_IOMAPPED) resource = platform_get_resource(pdev, IORESOURCE_IO, 0); else resource = platform_get_resource(pdev, IORESOURCE_MEM, 0); irq = platform_get_irq(pdev, 0); if (irq < 0) irq = -1; return cmos_do_probe(&pdev->dev, resource, irq); } static void cmos_platform_remove(struct platform_device *pdev) { cmos_do_remove(&pdev->dev); } static void cmos_platform_shutdown(struct platform_device *pdev) { struct device *dev = &pdev->dev; struct cmos_rtc *cmos = dev_get_drvdata(dev); if (system_state == SYSTEM_POWER_OFF) { int retval = cmos_poweroff(dev); if (cmos_aie_poweroff(dev) < 0 && !retval) return; } cmos_do_shutdown(cmos->irq); } /* work with hotplug and coldplug */ MODULE_ALIAS("platform:rtc_cmos"); static struct platform_driver cmos_platform_driver = { .remove = cmos_platform_remove, .shutdown = cmos_platform_shutdown, .driver = { .name = driver_name, .pm = &cmos_pm_ops, .of_match_table = of_match_ptr(of_cmos_match), } }; #ifdef CONFIG_PNP static bool pnp_driver_registered; #endif static bool platform_driver_registered; static int __init cmos_init(void) { int retval = 0; #ifdef CONFIG_PNP retval = pnp_register_driver(&cmos_pnp_driver); if (retval == 0) pnp_driver_registered = true; #endif if (!cmos_rtc.dev) { retval = platform_driver_probe(&cmos_platform_driver, cmos_platform_probe); if (retval == 0) platform_driver_registered = true; } if (retval == 0) return 0; #ifdef CONFIG_PNP if (pnp_driver_registered) pnp_unregister_driver(&cmos_pnp_driver); #endif return retval; } module_init(cmos_init); static void __exit cmos_exit(void) { #ifdef CONFIG_PNP if (pnp_driver_registered) pnp_unregister_driver(&cmos_pnp_driver); #endif if (platform_driver_registered) platform_driver_unregister(&cmos_platform_driver); } module_exit(cmos_exit); MODULE_AUTHOR("David Brownell"); MODULE_DESCRIPTION("Driver for PC-style 'CMOS' RTCs"); MODULE_LICENSE("GPL"); |
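/*
 * Illustrative userspace sketch (not part of the driver above): how the
 * alarm path implemented by cmos_read_time()/cmos_set_alarm() is normally
 * exercised through the RTC class character device. The device node name
 * (/dev/rtc0) and the 5 second offset are assumptions for the example; the
 * RTC_RD_TIME and RTC_WKALM_SET ioctls come from <linux/rtc.h>. Hour/day
 * rollover of the alarm time is deliberately ignored to keep the sketch short.
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/rtc.h>

int main(void)
{
	struct rtc_time tm;
	struct rtc_wkalrm alrm = { 0 };
	int fd = open("/dev/rtc0", O_RDONLY);	/* assumed device node */

	if (fd < 0) {
		perror("open /dev/rtc0");
		return 1;
	}

	/* RTC_RD_TIME reaches the driver's read_time op via the RTC core */
	if (ioctl(fd, RTC_RD_TIME, &tm) < 0) {
		perror("RTC_RD_TIME");
		close(fd);
		return 1;
	}
	printf("RTC time: %04d-%02d-%02d %02d:%02d:%02d UTC\n",
	       tm.tm_year + 1900, tm.tm_mon + 1, tm.tm_mday,
	       tm.tm_hour, tm.tm_min, tm.tm_sec);

	/* Arm a wake alarm a few seconds out; this lands in the set_alarm op */
	alrm.time = tm;
	alrm.time.tm_sec += 5;
	if (alrm.time.tm_sec >= 60) {	/* minute rollover only, see note above */
		alrm.time.tm_sec -= 60;
		alrm.time.tm_min = (alrm.time.tm_min + 1) % 60;
	}
	alrm.enabled = 1;
	if (ioctl(fd, RTC_WKALM_SET, &alrm) < 0)
		perror("RTC_WKALM_SET");

	close(fd);
	return 0;
}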
/* Netfilter messages via netlink socket.
Allows for user space * protocol helpers and general trouble making from userspace. * * (C) 2001 by Jay Schulist <jschlst@samba.org>, * (C) 2002-2005 by Harald Welte <laforge@gnumonks.org> * (C) 2005-2017 by Pablo Neira Ayuso <pablo@netfilter.org> * * Initial netfilter messages via netlink development funded and * generally made possible by Network Robots, Inc. (www.networkrobots.com) * * Further development of this code funded by Astaro AG (http://www.astaro.com) * * This software may be used and distributed according to the terms * of the GNU General Public License, incorporated herein by reference. */ #include <linux/module.h> #include <linux/types.h> #include <linux/socket.h> #include <linux/kernel.h> #include <linux/string.h> #include <linux/sockios.h> #include <linux/net.h> #include <linux/skbuff.h> #include <linux/uaccess.h> #include <net/sock.h> #include <linux/init.h> #include <linux/sched/signal.h> #include <net/netlink.h> #include <net/netns/generic.h> #include <linux/netfilter.h> #include <linux/netfilter/nfnetlink.h> MODULE_LICENSE("GPL"); MODULE_AUTHOR("Harald Welte <laforge@netfilter.org>"); MODULE_ALIAS_NET_PF_PROTO(PF_NETLINK, NETLINK_NETFILTER); MODULE_DESCRIPTION("Netfilter messages via netlink socket"); #define nfnl_dereference_protected(id) \ rcu_dereference_protected(table[(id)].subsys, \ lockdep_nfnl_is_held((id))) #define NFNL_MAX_ATTR_COUNT 32 static unsigned int nfnetlink_pernet_id __read_mostly; #ifdef CONFIG_NF_CONNTRACK_EVENTS static DEFINE_SPINLOCK(nfnl_grp_active_lock); #endif struct nfnl_net { struct sock *nfnl; }; static struct { struct mutex mutex; const struct nfnetlink_subsystem __rcu *subsys; } table[NFNL_SUBSYS_COUNT]; static struct lock_class_key nfnl_lockdep_keys[NFNL_SUBSYS_COUNT]; static const char *const nfnl_lockdep_names[NFNL_SUBSYS_COUNT] = { [NFNL_SUBSYS_NONE] = "nfnl_subsys_none", [NFNL_SUBSYS_CTNETLINK] = "nfnl_subsys_ctnetlink", [NFNL_SUBSYS_CTNETLINK_EXP] = "nfnl_subsys_ctnetlink_exp", [NFNL_SUBSYS_QUEUE] = "nfnl_subsys_queue", [NFNL_SUBSYS_ULOG] = "nfnl_subsys_ulog", [NFNL_SUBSYS_OSF] = "nfnl_subsys_osf", [NFNL_SUBSYS_IPSET] = "nfnl_subsys_ipset", [NFNL_SUBSYS_ACCT] = "nfnl_subsys_acct", [NFNL_SUBSYS_CTNETLINK_TIMEOUT] = "nfnl_subsys_cttimeout", [NFNL_SUBSYS_CTHELPER] = "nfnl_subsys_cthelper", [NFNL_SUBSYS_NFTABLES] = "nfnl_subsys_nftables", [NFNL_SUBSYS_NFT_COMPAT] = "nfnl_subsys_nftcompat", [NFNL_SUBSYS_HOOK] = "nfnl_subsys_hook", }; static const int nfnl_group2type[NFNLGRP_MAX+1] = { [NFNLGRP_CONNTRACK_NEW] = NFNL_SUBSYS_CTNETLINK, [NFNLGRP_CONNTRACK_UPDATE] = NFNL_SUBSYS_CTNETLINK, [NFNLGRP_CONNTRACK_DESTROY] = NFNL_SUBSYS_CTNETLINK, [NFNLGRP_CONNTRACK_EXP_NEW] = NFNL_SUBSYS_CTNETLINK_EXP, [NFNLGRP_CONNTRACK_EXP_UPDATE] = NFNL_SUBSYS_CTNETLINK_EXP, [NFNLGRP_CONNTRACK_EXP_DESTROY] = NFNL_SUBSYS_CTNETLINK_EXP, [NFNLGRP_NFTABLES] = NFNL_SUBSYS_NFTABLES, [NFNLGRP_ACCT_QUOTA] = NFNL_SUBSYS_ACCT, [NFNLGRP_NFTRACE] = NFNL_SUBSYS_NFTABLES, }; static struct nfnl_net *nfnl_pernet(struct net *net) { return net_generic(net, nfnetlink_pernet_id); } void nfnl_lock(__u8 subsys_id) { mutex_lock(&table[subsys_id].mutex); } EXPORT_SYMBOL_GPL(nfnl_lock); void nfnl_unlock(__u8 subsys_id) { mutex_unlock(&table[subsys_id].mutex); } EXPORT_SYMBOL_GPL(nfnl_unlock); #ifdef CONFIG_PROVE_LOCKING bool lockdep_nfnl_is_held(u8 subsys_id) { return lockdep_is_held(&table[subsys_id].mutex); } EXPORT_SYMBOL_GPL(lockdep_nfnl_is_held); #endif int nfnetlink_subsys_register(const struct nfnetlink_subsystem *n) { u8 cb_id; /* Sanity-check attr_count size to avoid stack 
buffer overflow. */ for (cb_id = 0; cb_id < n->cb_count; cb_id++) if (WARN_ON(n->cb[cb_id].attr_count > NFNL_MAX_ATTR_COUNT)) return -EINVAL; nfnl_lock(n->subsys_id); if (table[n->subsys_id].subsys) { nfnl_unlock(n->subsys_id); return -EBUSY; } rcu_assign_pointer(table[n->subsys_id].subsys, n); nfnl_unlock(n->subsys_id); return 0; } EXPORT_SYMBOL_GPL(nfnetlink_subsys_register); int nfnetlink_subsys_unregister(const struct nfnetlink_subsystem *n) { nfnl_lock(n->subsys_id); table[n->subsys_id].subsys = NULL; nfnl_unlock(n->subsys_id); synchronize_rcu(); return 0; } EXPORT_SYMBOL_GPL(nfnetlink_subsys_unregister); static inline const struct nfnetlink_subsystem *nfnetlink_get_subsys(u16 type) { u8 subsys_id = NFNL_SUBSYS_ID(type); if (subsys_id >= NFNL_SUBSYS_COUNT) return NULL; return rcu_dereference(table[subsys_id].subsys); } static inline const struct nfnl_callback * nfnetlink_find_client(u16 type, const struct nfnetlink_subsystem *ss) { u8 cb_id = NFNL_MSG_TYPE(type); if (cb_id >= ss->cb_count) return NULL; return &ss->cb[cb_id]; } int nfnetlink_has_listeners(struct net *net, unsigned int group) { struct nfnl_net *nfnlnet = nfnl_pernet(net); return netlink_has_listeners(nfnlnet->nfnl, group); } EXPORT_SYMBOL_GPL(nfnetlink_has_listeners); int nfnetlink_send(struct sk_buff *skb, struct net *net, u32 portid, unsigned int group, int echo, gfp_t flags) { struct nfnl_net *nfnlnet = nfnl_pernet(net); return nlmsg_notify(nfnlnet->nfnl, skb, portid, group, echo, flags); } EXPORT_SYMBOL_GPL(nfnetlink_send); int nfnetlink_set_err(struct net *net, u32 portid, u32 group, int error) { struct nfnl_net *nfnlnet = nfnl_pernet(net); return netlink_set_err(nfnlnet->nfnl, portid, group, error); } EXPORT_SYMBOL_GPL(nfnetlink_set_err); int nfnetlink_unicast(struct sk_buff *skb, struct net *net, u32 portid) { struct nfnl_net *nfnlnet = nfnl_pernet(net); int err; err = nlmsg_unicast(nfnlnet->nfnl, skb, portid); if (err == -EAGAIN) err = -ENOBUFS; return err; } EXPORT_SYMBOL_GPL(nfnetlink_unicast); void nfnetlink_broadcast(struct net *net, struct sk_buff *skb, __u32 portid, __u32 group, gfp_t allocation) { struct nfnl_net *nfnlnet = nfnl_pernet(net); netlink_broadcast(nfnlnet->nfnl, skb, portid, group, allocation); } EXPORT_SYMBOL_GPL(nfnetlink_broadcast); /* Process one complete nfnetlink message. 
*/ static int nfnetlink_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh, struct netlink_ext_ack *extack) { struct net *net = sock_net(skb->sk); const struct nfnl_callback *nc; const struct nfnetlink_subsystem *ss; int type, err; /* All the messages must at least contain nfgenmsg */ if (nlmsg_len(nlh) < sizeof(struct nfgenmsg)) return 0; type = nlh->nlmsg_type; replay: rcu_read_lock(); ss = nfnetlink_get_subsys(type); if (!ss) { #ifdef CONFIG_MODULES rcu_read_unlock(); request_module("nfnetlink-subsys-%d", NFNL_SUBSYS_ID(type)); rcu_read_lock(); ss = nfnetlink_get_subsys(type); if (!ss) #endif { rcu_read_unlock(); return -EINVAL; } } nc = nfnetlink_find_client(type, ss); if (!nc) { rcu_read_unlock(); return -EINVAL; } { int min_len = nlmsg_total_size(sizeof(struct nfgenmsg)); struct nfnl_net *nfnlnet = nfnl_pernet(net); u8 cb_id = NFNL_MSG_TYPE(nlh->nlmsg_type); struct nlattr *cda[NFNL_MAX_ATTR_COUNT + 1]; struct nlattr *attr = (void *)nlh + min_len; int attrlen = nlh->nlmsg_len - min_len; __u8 subsys_id = NFNL_SUBSYS_ID(type); struct nfnl_info info = { .net = net, .sk = nfnlnet->nfnl, .nlh = nlh, .nfmsg = nlmsg_data(nlh), .extack = extack, }; /* Sanity-check NFNL_MAX_ATTR_COUNT */ if (ss->cb[cb_id].attr_count > NFNL_MAX_ATTR_COUNT) { rcu_read_unlock(); return -ENOMEM; } err = nla_parse_deprecated(cda, ss->cb[cb_id].attr_count, attr, attrlen, ss->cb[cb_id].policy, extack); if (err < 0) { rcu_read_unlock(); return err; } if (!nc->call) { rcu_read_unlock(); return -EINVAL; } switch (nc->type) { case NFNL_CB_RCU: err = nc->call(skb, &info, (const struct nlattr **)cda); rcu_read_unlock(); break; case NFNL_CB_MUTEX: rcu_read_unlock(); nfnl_lock(subsys_id); if (nfnl_dereference_protected(subsys_id) != ss || nfnetlink_find_client(type, ss) != nc) { nfnl_unlock(subsys_id); err = -EAGAIN; break; } err = nc->call(skb, &info, (const struct nlattr **)cda); nfnl_unlock(subsys_id); break; default: rcu_read_unlock(); err = -EINVAL; break; } if (err == -EAGAIN) goto replay; return err; } } struct nfnl_err { struct list_head head; struct nlmsghdr *nlh; int err; struct netlink_ext_ack extack; }; static int nfnl_err_add(struct list_head *list, struct nlmsghdr *nlh, int err, const struct netlink_ext_ack *extack) { struct nfnl_err *nfnl_err; nfnl_err = kmalloc(sizeof(struct nfnl_err), GFP_KERNEL); if (nfnl_err == NULL) return -ENOMEM; nfnl_err->nlh = nlh; nfnl_err->err = err; nfnl_err->extack = *extack; list_add_tail(&nfnl_err->head, list); return 0; } static void nfnl_err_del(struct nfnl_err *nfnl_err) { list_del(&nfnl_err->head); kfree(nfnl_err); } static void nfnl_err_reset(struct list_head *err_list) { struct nfnl_err *nfnl_err, *next; list_for_each_entry_safe(nfnl_err, next, err_list, head) nfnl_err_del(nfnl_err); } static void nfnl_err_deliver(struct list_head *err_list, struct sk_buff *skb) { struct nfnl_err *nfnl_err, *next; list_for_each_entry_safe(nfnl_err, next, err_list, head) { netlink_ack(skb, nfnl_err->nlh, nfnl_err->err, &nfnl_err->extack); nfnl_err_del(nfnl_err); } } enum { NFNL_BATCH_FAILURE = (1 << 0), NFNL_BATCH_DONE = (1 << 1), NFNL_BATCH_REPLAY = (1 << 2), }; static void nfnetlink_rcv_batch(struct sk_buff *skb, struct nlmsghdr *nlh, u16 subsys_id, u32 genid) { struct sk_buff *oskb = skb; struct net *net = sock_net(skb->sk); const struct nfnetlink_subsystem *ss; const struct nfnl_callback *nc; struct netlink_ext_ack extack; LIST_HEAD(err_list); u32 status; int err; if (subsys_id >= NFNL_SUBSYS_COUNT) return netlink_ack(skb, nlh, -EINVAL, NULL); replay: status = 0; replay_abort: skb = 
netlink_skb_clone(oskb, GFP_KERNEL); if (!skb) return netlink_ack(oskb, nlh, -ENOMEM, NULL); nfnl_lock(subsys_id); ss = nfnl_dereference_protected(subsys_id); if (!ss) { #ifdef CONFIG_MODULES nfnl_unlock(subsys_id); request_module("nfnetlink-subsys-%d", subsys_id); nfnl_lock(subsys_id); ss = nfnl_dereference_protected(subsys_id); if (!ss) #endif { nfnl_unlock(subsys_id); netlink_ack(oskb, nlh, -EOPNOTSUPP, NULL); return consume_skb(skb); } } if (!ss->valid_genid || !ss->commit || !ss->abort) { nfnl_unlock(subsys_id); netlink_ack(oskb, nlh, -EOPNOTSUPP, NULL); return consume_skb(skb); } if (!try_module_get(ss->owner)) { nfnl_unlock(subsys_id); netlink_ack(oskb, nlh, -EOPNOTSUPP, NULL); return consume_skb(skb); } if (!ss->valid_genid(net, genid)) { module_put(ss->owner); nfnl_unlock(subsys_id); netlink_ack(oskb, nlh, -ERESTART, NULL); return consume_skb(skb); } nfnl_unlock(subsys_id); if (nlh->nlmsg_flags & NLM_F_ACK) { memset(&extack, 0, sizeof(extack)); nfnl_err_add(&err_list, nlh, 0, &extack); } while (skb->len >= nlmsg_total_size(0)) { int msglen, type; if (fatal_signal_pending(current)) { nfnl_err_reset(&err_list); err = -EINTR; status = NFNL_BATCH_FAILURE; goto done; } memset(&extack, 0, sizeof(extack)); nlh = nlmsg_hdr(skb); err = 0; if (nlh->nlmsg_len < NLMSG_HDRLEN || skb->len < nlh->nlmsg_len || nlmsg_len(nlh) < sizeof(struct nfgenmsg)) { nfnl_err_reset(&err_list); status |= NFNL_BATCH_FAILURE; goto done; } /* Only requests are handled by the kernel */ if (!(nlh->nlmsg_flags & NLM_F_REQUEST)) { err = -EINVAL; goto ack; } type = nlh->nlmsg_type; if (type == NFNL_MSG_BATCH_BEGIN) { /* Malformed: Batch begin twice */ nfnl_err_reset(&err_list); status |= NFNL_BATCH_FAILURE; goto done; } else if (type == NFNL_MSG_BATCH_END) { status |= NFNL_BATCH_DONE; goto done; } else if (type < NLMSG_MIN_TYPE) { err = -EINVAL; goto ack; } /* We only accept a batch with messages for the same * subsystem. */ if (NFNL_SUBSYS_ID(type) != subsys_id) { err = -EINVAL; goto ack; } nc = nfnetlink_find_client(type, ss); if (!nc) { err = -EINVAL; goto ack; } if (nc->type != NFNL_CB_BATCH) { err = -EINVAL; goto ack; } { int min_len = nlmsg_total_size(sizeof(struct nfgenmsg)); struct nfnl_net *nfnlnet = nfnl_pernet(net); struct nlattr *cda[NFNL_MAX_ATTR_COUNT + 1]; struct nlattr *attr = (void *)nlh + min_len; u8 cb_id = NFNL_MSG_TYPE(nlh->nlmsg_type); int attrlen = nlh->nlmsg_len - min_len; struct nfnl_info info = { .net = net, .sk = nfnlnet->nfnl, .nlh = nlh, .nfmsg = nlmsg_data(nlh), .extack = &extack, }; /* Sanity-check NFTA_MAX_ATTR */ if (ss->cb[cb_id].attr_count > NFNL_MAX_ATTR_COUNT) { err = -ENOMEM; goto ack; } err = nla_parse_deprecated(cda, ss->cb[cb_id].attr_count, attr, attrlen, ss->cb[cb_id].policy, &extack); if (err < 0) goto ack; err = nc->call(skb, &info, (const struct nlattr **)cda); /* The lock was released to autoload some module, we * have to abort and start from scratch using the * original skb. */ if (err == -EAGAIN) { status |= NFNL_BATCH_REPLAY; goto done; } } ack: if (nlh->nlmsg_flags & NLM_F_ACK || err) { /* Errors are delivered once the full batch has been * processed, this avoids that the same error is * reported several times when replaying the batch. */ if (err == -ENOMEM || nfnl_err_add(&err_list, nlh, err, &extack) < 0) { /* We failed to enqueue an error, reset the * list of errors and send OOM to userspace * pointing to the batch header. 
*/ nfnl_err_reset(&err_list); netlink_ack(oskb, nlmsg_hdr(oskb), -ENOMEM, NULL); status |= NFNL_BATCH_FAILURE; goto done; } /* We don't stop processing the batch on errors, thus, * userspace gets all the errors that the batch * triggers. */ if (err) status |= NFNL_BATCH_FAILURE; } msglen = NLMSG_ALIGN(nlh->nlmsg_len); if (msglen > skb->len) msglen = skb->len; skb_pull(skb, msglen); } done: if (status & NFNL_BATCH_REPLAY) { ss->abort(net, oskb, NFNL_ABORT_AUTOLOAD); nfnl_err_reset(&err_list); consume_skb(skb); module_put(ss->owner); goto replay; } else if (status == NFNL_BATCH_DONE) { err = ss->commit(net, oskb); if (err == -EAGAIN) { status |= NFNL_BATCH_REPLAY; goto done; } else if (err) { ss->abort(net, oskb, NFNL_ABORT_NONE); netlink_ack(oskb, nlmsg_hdr(oskb), err, NULL); } else if (nlh->nlmsg_flags & NLM_F_ACK) { memset(&extack, 0, sizeof(extack)); nfnl_err_add(&err_list, nlh, 0, &extack); } } else { enum nfnl_abort_action abort_action; if (status & NFNL_BATCH_FAILURE) abort_action = NFNL_ABORT_NONE; else abort_action = NFNL_ABORT_VALIDATE; err = ss->abort(net, oskb, abort_action); if (err == -EAGAIN) { nfnl_err_reset(&err_list); consume_skb(skb); module_put(ss->owner); status |= NFNL_BATCH_FAILURE; goto replay_abort; } } nfnl_err_deliver(&err_list, oskb); consume_skb(skb); module_put(ss->owner); } static const struct nla_policy nfnl_batch_policy[NFNL_BATCH_MAX + 1] = { [NFNL_BATCH_GENID] = { .type = NLA_U32 }, }; static void nfnetlink_rcv_skb_batch(struct sk_buff *skb, struct nlmsghdr *nlh) { int min_len = nlmsg_total_size(sizeof(struct nfgenmsg)); struct nlattr *attr = (void *)nlh + min_len; struct nlattr *cda[NFNL_BATCH_MAX + 1]; int attrlen = nlh->nlmsg_len - min_len; struct nfgenmsg *nfgenmsg; int msglen, err; u32 gen_id = 0; u16 res_id; msglen = NLMSG_ALIGN(nlh->nlmsg_len); if (msglen > skb->len) msglen = skb->len; if (skb->len < NLMSG_HDRLEN + sizeof(struct nfgenmsg)) return; err = nla_parse_deprecated(cda, NFNL_BATCH_MAX, attr, attrlen, nfnl_batch_policy, NULL); if (err < 0) { netlink_ack(skb, nlh, err, NULL); return; } if (cda[NFNL_BATCH_GENID]) gen_id = ntohl(nla_get_be32(cda[NFNL_BATCH_GENID])); nfgenmsg = nlmsg_data(nlh); skb_pull(skb, msglen); /* Work around old nft using host byte order */ if (nfgenmsg->res_id == (__force __be16)NFNL_SUBSYS_NFTABLES) res_id = NFNL_SUBSYS_NFTABLES; else res_id = ntohs(nfgenmsg->res_id); nfnetlink_rcv_batch(skb, nlh, res_id, gen_id); } static void nfnetlink_rcv(struct sk_buff *skb) { struct nlmsghdr *nlh = nlmsg_hdr(skb); if (skb->len < NLMSG_HDRLEN || nlh->nlmsg_len < NLMSG_HDRLEN || skb->len < nlh->nlmsg_len) return; if (!netlink_net_capable(skb, CAP_NET_ADMIN)) { netlink_ack(skb, nlh, -EPERM, NULL); return; } if (nlh->nlmsg_type == NFNL_MSG_BATCH_BEGIN) nfnetlink_rcv_skb_batch(skb, nlh); else netlink_rcv_skb(skb, nfnetlink_rcv_msg); } static void nfnetlink_bind_event(struct net *net, unsigned int group) { #ifdef CONFIG_NF_CONNTRACK_EVENTS int type, group_bit; u8 v; /* All NFNLGRP_CONNTRACK_* group bits fit into u8. * The other groups are not relevant and can be ignored. */ if (group >= 8) return; type = nfnl_group2type[group]; switch (type) { case NFNL_SUBSYS_CTNETLINK: break; case NFNL_SUBSYS_CTNETLINK_EXP: break; default: return; } group_bit = (1 << group); spin_lock(&nfnl_grp_active_lock); v = READ_ONCE(nf_ctnetlink_has_listener); if ((v & group_bit) == 0) { v |= group_bit; /* read concurrently without nfnl_grp_active_lock held. 
*/ WRITE_ONCE(nf_ctnetlink_has_listener, v); } spin_unlock(&nfnl_grp_active_lock); #endif } static int nfnetlink_bind(struct net *net, int group) { const struct nfnetlink_subsystem *ss; int type; if (group <= NFNLGRP_NONE || group > NFNLGRP_MAX) return 0; type = nfnl_group2type[group]; rcu_read_lock(); ss = nfnetlink_get_subsys(type << 8); rcu_read_unlock(); if (!ss) request_module_nowait("nfnetlink-subsys-%d", type); nfnetlink_bind_event(net, group); return 0; } static void nfnetlink_unbind(struct net *net, int group) { #ifdef CONFIG_NF_CONNTRACK_EVENTS int type, group_bit; if (group <= NFNLGRP_NONE || group > NFNLGRP_MAX) return; type = nfnl_group2type[group]; switch (type) { case NFNL_SUBSYS_CTNETLINK: break; case NFNL_SUBSYS_CTNETLINK_EXP: break; default: return; } /* ctnetlink_has_listener is u8 */ if (group >= 8) return; group_bit = (1 << group); spin_lock(&nfnl_grp_active_lock); if (!nfnetlink_has_listeners(net, group)) { u8 v = READ_ONCE(nf_ctnetlink_has_listener); v &= ~group_bit; /* read concurrently without nfnl_grp_active_lock held. */ WRITE_ONCE(nf_ctnetlink_has_listener, v); } spin_unlock(&nfnl_grp_active_lock); #endif } static int __net_init nfnetlink_net_init(struct net *net) { struct nfnl_net *nfnlnet = nfnl_pernet(net); struct netlink_kernel_cfg cfg = { .groups = NFNLGRP_MAX, .input = nfnetlink_rcv, .bind = nfnetlink_bind, .unbind = nfnetlink_unbind, }; nfnlnet->nfnl = netlink_kernel_create(net, NETLINK_NETFILTER, &cfg); if (!nfnlnet->nfnl) return -ENOMEM; return 0; } static void __net_exit nfnetlink_net_exit_batch(struct list_head *net_exit_list) { struct nfnl_net *nfnlnet; struct net *net; list_for_each_entry(net, net_exit_list, exit_list) { nfnlnet = nfnl_pernet(net); netlink_kernel_release(nfnlnet->nfnl); } } static struct pernet_operations nfnetlink_net_ops = { .init = nfnetlink_net_init, .exit_batch = nfnetlink_net_exit_batch, .id = &nfnetlink_pernet_id, .size = sizeof(struct nfnl_net), }; static int __init nfnetlink_init(void) { int i; for (i = NFNLGRP_NONE + 1; i <= NFNLGRP_MAX; i++) BUG_ON(nfnl_group2type[i] == NFNL_SUBSYS_NONE); for (i=0; i<NFNL_SUBSYS_COUNT; i++) __mutex_init(&table[i].mutex, nfnl_lockdep_names[i], &nfnl_lockdep_keys[i]); return register_pernet_subsys(&nfnetlink_net_ops); } static void __exit nfnetlink_exit(void) { unregister_pernet_subsys(&nfnetlink_net_ops); } module_init(nfnetlink_init); module_exit(nfnetlink_exit); |
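/*
 * Illustrative userspace sketch (not part of the module above): joining an
 * NFNLGRP_CONNTRACK_* multicast group on a NETLINK_NETFILTER socket is what
 * drives nfnetlink_bind()/nfnetlink_unbind() and the nf_ctnetlink_has_listener
 * bitmap handled in nfnetlink_bind_event(). Typically needs CAP_NET_ADMIN;
 * the group choice and the sleep are only for the example. SOL_NETLINK is
 * defined as a fallback in case the libc headers do not provide it.
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/netfilter/nfnetlink.h>

#ifndef SOL_NETLINK
#define SOL_NETLINK 270
#endif

int main(void)
{
	unsigned int grp = NFNLGRP_CONNTRACK_NEW;	/* group 1, fits the u8 bitmap */
	int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_NETFILTER);

	if (fd < 0) {
		perror("socket(NETLINK_NETFILTER)");
		return 1;
	}

	/* Subscribing ends up in nfnetlink_bind() -> nfnetlink_bind_event() */
	if (setsockopt(fd, SOL_NETLINK, NETLINK_ADD_MEMBERSHIP,
		       &grp, sizeof(grp)) < 0) {
		perror("NETLINK_ADD_MEMBERSHIP");
		close(fd);
		return 1;
	}

	/* A real listener would recvmsg() ctnetlink event messages here. */
	sleep(1);

	/* Dropping membership (or closing the socket) reaches nfnetlink_unbind() */
	setsockopt(fd, SOL_NETLINK, NETLINK_DROP_MEMBERSHIP, &grp, sizeof(grp));
	close(fd);
	return 0;
}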
// SPDX-License-Identifier: GPL-2.0 /* * INET An implementation of the TCP/IP protocol suite for the LINUX * operating system. INET is implemented using the BSD Socket * interface as the means of communication with the user level.
* * The options processing module for ip.c * * Authors: A.N.Kuznetsov * */ #define pr_fmt(fmt) "IPv4: " fmt #include <linux/capability.h> #include <linux/module.h> #include <linux/slab.h> #include <linux/types.h> #include <linux/uaccess.h> #include <linux/unaligned.h> #include <linux/skbuff.h> #include <linux/ip.h> #include <linux/icmp.h> #include <linux/netdevice.h> #include <linux/rtnetlink.h> #include <net/sock.h> #include <net/ip.h> #include <net/icmp.h> #include <net/route.h> #include <net/cipso_ipv4.h> #include <net/ip_fib.h> /* * Write options to IP header, record destination address to * source route option, address of outgoing interface * (we should already know it, so that this function is allowed be * called only after routing decision) and timestamp, * if we originate this datagram. * * daddr is real destination address, next hop is recorded in IP header. * saddr is address of outgoing interface. */ void ip_options_build(struct sk_buff *skb, struct ip_options *opt, __be32 daddr, struct rtable *rt) { unsigned char *iph = skb_network_header(skb); memcpy(&(IPCB(skb)->opt), opt, sizeof(struct ip_options)); memcpy(iph + sizeof(struct iphdr), opt->__data, opt->optlen); opt = &(IPCB(skb)->opt); if (opt->srr) memcpy(iph + opt->srr + iph[opt->srr + 1] - 4, &daddr, 4); if (opt->rr_needaddr) ip_rt_get_source(iph + opt->rr + iph[opt->rr + 2] - 5, skb, rt); if (opt->ts_needaddr) ip_rt_get_source(iph + opt->ts + iph[opt->ts + 2] - 9, skb, rt); if (opt->ts_needtime) { __be32 midtime; midtime = inet_current_timestamp(); memcpy(iph + opt->ts + iph[opt->ts + 2] - 5, &midtime, 4); } } /* * Provided (sopt, skb) points to received options, * build in dopt compiled option set appropriate for answering. * i.e. invert SRR option, copy anothers, * and grab room in RR/TS options. * * NOTE: dopt cannot point to skb. 
*/ int __ip_options_echo(struct net *net, struct ip_options *dopt, struct sk_buff *skb, const struct ip_options *sopt) { unsigned char *sptr, *dptr; int soffset, doffset; int optlen; memset(dopt, 0, sizeof(struct ip_options)); if (sopt->optlen == 0) return 0; sptr = skb_network_header(skb); dptr = dopt->__data; if (sopt->rr) { optlen = sptr[sopt->rr+1]; soffset = sptr[sopt->rr+2]; dopt->rr = dopt->optlen + sizeof(struct iphdr); memcpy(dptr, sptr+sopt->rr, optlen); if (sopt->rr_needaddr && soffset <= optlen) { if (soffset + 3 > optlen) return -EINVAL; dptr[2] = soffset + 4; dopt->rr_needaddr = 1; } dptr += optlen; dopt->optlen += optlen; } if (sopt->ts) { optlen = sptr[sopt->ts+1]; soffset = sptr[sopt->ts+2]; dopt->ts = dopt->optlen + sizeof(struct iphdr); memcpy(dptr, sptr+sopt->ts, optlen); if (soffset <= optlen) { if (sopt->ts_needaddr) { if (soffset + 3 > optlen) return -EINVAL; dopt->ts_needaddr = 1; soffset += 4; } if (sopt->ts_needtime) { if (soffset + 3 > optlen) return -EINVAL; if ((dptr[3]&0xF) != IPOPT_TS_PRESPEC) { dopt->ts_needtime = 1; soffset += 4; } else { dopt->ts_needtime = 0; if (soffset + 7 <= optlen) { __be32 addr; memcpy(&addr, dptr+soffset-1, 4); if (inet_addr_type(net, addr) != RTN_UNICAST) { dopt->ts_needtime = 1; soffset += 8; } } } } dptr[2] = soffset; } dptr += optlen; dopt->optlen += optlen; } if (sopt->srr) { unsigned char *start = sptr+sopt->srr; __be32 faddr; optlen = start[1]; soffset = start[2]; doffset = 0; if (soffset > optlen) soffset = optlen + 1; soffset -= 4; if (soffset > 3) { memcpy(&faddr, &start[soffset-1], 4); for (soffset -= 4, doffset = 4; soffset > 3; soffset -= 4, doffset += 4) memcpy(&dptr[doffset-1], &start[soffset-1], 4); /* * RFC1812 requires to fix illegal source routes. */ if (memcmp(&ip_hdr(skb)->saddr, &start[soffset + 3], 4) == 0) doffset -= 4; } if (doffset > 3) { dopt->faddr = faddr; dptr[0] = start[0]; dptr[1] = doffset+3; dptr[2] = 4; dptr += doffset+3; dopt->srr = dopt->optlen + sizeof(struct iphdr); dopt->optlen += doffset+3; dopt->is_strictroute = sopt->is_strictroute; } } if (sopt->cipso) { optlen = sptr[sopt->cipso+1]; dopt->cipso = dopt->optlen+sizeof(struct iphdr); memcpy(dptr, sptr+sopt->cipso, optlen); dptr += optlen; dopt->optlen += optlen; } while (dopt->optlen & 3) { *dptr++ = IPOPT_END; dopt->optlen++; } return 0; } /* * Options "fragmenting", just fill options not * allowed in fragments with NOOPs. * Simple and stupid 8), but the most efficient way. */ void ip_options_fragment(struct sk_buff *skb) { unsigned char *optptr = skb_network_header(skb) + sizeof(struct iphdr); struct ip_options *opt = &(IPCB(skb)->opt); int l = opt->optlen; int optlen; while (l > 0) { switch (*optptr) { case IPOPT_END: return; case IPOPT_NOOP: l--; optptr++; continue; } optlen = optptr[1]; if (optlen < 2 || optlen > l) return; if (!IPOPT_COPIED(*optptr)) memset(optptr, IPOPT_NOOP, optlen); l -= optlen; optptr += optlen; } opt->ts = 0; opt->rr = 0; opt->rr_needaddr = 0; opt->ts_needaddr = 0; opt->ts_needtime = 0; } /* helper used by ip_options_compile() to call fib_compute_spec_dst() * at most one time. */ static void spec_dst_fill(__be32 *spec_dst, struct sk_buff *skb) { if (*spec_dst == htonl(INADDR_ANY)) *spec_dst = fib_compute_spec_dst(skb); } /* * Verify options and fill pointers in struct options. * Caller should clear *opt, and set opt->data. * If opt == NULL, then skb->data should point to IP header. 
*/ int __ip_options_compile(struct net *net, struct ip_options *opt, struct sk_buff *skb, __be32 *info) { __be32 spec_dst = htonl(INADDR_ANY); unsigned char *pp_ptr = NULL; struct rtable *rt = NULL; unsigned char *optptr; unsigned char *iph; int optlen, l; if (skb) { rt = skb_rtable(skb); optptr = (unsigned char *)&(ip_hdr(skb)[1]); } else optptr = opt->__data; iph = optptr - sizeof(struct iphdr); for (l = opt->optlen; l > 0; ) { switch (*optptr) { case IPOPT_END: for (optptr++, l--; l > 0; optptr++, l--) { if (*optptr != IPOPT_END) { *optptr = IPOPT_END; opt->is_changed = 1; } } goto eol; case IPOPT_NOOP: l--; optptr++; continue; } if (unlikely(l < 2)) { pp_ptr = optptr; goto error; } optlen = optptr[1]; if (optlen < 2 || optlen > l) { pp_ptr = optptr; goto error; } switch (*optptr) { case IPOPT_SSRR: case IPOPT_LSRR: if (optlen < 3) { pp_ptr = optptr + 1; goto error; } if (optptr[2] < 4) { pp_ptr = optptr + 2; goto error; } /* NB: cf RFC-1812 5.2.4.1 */ if (opt->srr) { pp_ptr = optptr; goto error; } if (!skb) { if (optptr[2] != 4 || optlen < 7 || ((optlen-3) & 3)) { pp_ptr = optptr + 1; goto error; } memcpy(&opt->faddr, &optptr[3], 4); if (optlen > 7) memmove(&optptr[3], &optptr[7], optlen-7); } opt->is_strictroute = (optptr[0] == IPOPT_SSRR); opt->srr = optptr - iph; break; case IPOPT_RR: if (opt->rr) { pp_ptr = optptr; goto error; } if (optlen < 3) { pp_ptr = optptr + 1; goto error; } if (optptr[2] < 4) { pp_ptr = optptr + 2; goto error; } if (optptr[2] <= optlen) { if (optptr[2]+3 > optlen) { pp_ptr = optptr + 2; goto error; } if (rt) { spec_dst_fill(&spec_dst, skb); memcpy(&optptr[optptr[2]-1], &spec_dst, 4); opt->is_changed = 1; } optptr[2] += 4; opt->rr_needaddr = 1; } opt->rr = optptr - iph; break; case IPOPT_TIMESTAMP: if (opt->ts) { pp_ptr = optptr; goto error; } if (optlen < 4) { pp_ptr = optptr + 1; goto error; } if (optptr[2] < 5) { pp_ptr = optptr + 2; goto error; } if (optptr[2] <= optlen) { unsigned char *timeptr = NULL; if (optptr[2]+3 > optlen) { pp_ptr = optptr + 2; goto error; } switch (optptr[3]&0xF) { case IPOPT_TS_TSONLY: if (skb) timeptr = &optptr[optptr[2]-1]; opt->ts_needtime = 1; optptr[2] += 4; break; case IPOPT_TS_TSANDADDR: if (optptr[2]+7 > optlen) { pp_ptr = optptr + 2; goto error; } if (rt) { spec_dst_fill(&spec_dst, skb); memcpy(&optptr[optptr[2]-1], &spec_dst, 4); timeptr = &optptr[optptr[2]+3]; } opt->ts_needaddr = 1; opt->ts_needtime = 1; optptr[2] += 8; break; case IPOPT_TS_PRESPEC: if (optptr[2]+7 > optlen) { pp_ptr = optptr + 2; goto error; } { __be32 addr; memcpy(&addr, &optptr[optptr[2]-1], 4); if (inet_addr_type(net, addr) == RTN_UNICAST) break; if (skb) timeptr = &optptr[optptr[2]+3]; } opt->ts_needtime = 1; optptr[2] += 8; break; default: if (!skb && !ns_capable(net->user_ns, CAP_NET_RAW)) { pp_ptr = optptr + 3; goto error; } break; } if (timeptr) { __be32 midtime; midtime = inet_current_timestamp(); memcpy(timeptr, &midtime, 4); opt->is_changed = 1; } } else if ((optptr[3]&0xF) != IPOPT_TS_PRESPEC) { unsigned int overflow = optptr[3]>>4; if (overflow == 15) { pp_ptr = optptr + 3; goto error; } if (skb) { optptr[3] = (optptr[3]&0xF)|((overflow+1)<<4); opt->is_changed = 1; } } opt->ts = optptr - iph; break; case IPOPT_RA: if (optlen < 4) { pp_ptr = optptr + 1; goto error; } if (optptr[2] == 0 && optptr[3] == 0) opt->router_alert = optptr - iph; break; case IPOPT_CIPSO: if ((!skb && !ns_capable(net->user_ns, CAP_NET_RAW)) || opt->cipso) { pp_ptr = optptr; goto error; } opt->cipso = optptr - iph; if (cipso_v4_validate(skb, &optptr)) { pp_ptr = 
optptr; goto error; } break; case IPOPT_SEC: case IPOPT_SID: default: if (!skb && !ns_capable(net->user_ns, CAP_NET_RAW)) { pp_ptr = optptr; goto error; } break; } l -= optlen; optptr += optlen; } eol: if (!pp_ptr) return 0; error: if (info) *info = htonl((pp_ptr-iph)<<24); return -EINVAL; } EXPORT_SYMBOL(__ip_options_compile); int ip_options_compile(struct net *net, struct ip_options *opt, struct sk_buff *skb) { int ret; __be32 info; ret = __ip_options_compile(net, opt, skb, &info); if (ret != 0 && skb) icmp_send(skb, ICMP_PARAMETERPROB, 0, info); return ret; } EXPORT_SYMBOL(ip_options_compile); /* * Undo all the changes done by ip_options_compile(). */ void ip_options_undo(struct ip_options *opt) { if (opt->srr) { unsigned char *optptr = opt->__data + opt->srr - sizeof(struct iphdr); memmove(optptr + 7, optptr + 3, optptr[1] - 7); memcpy(optptr + 3, &opt->faddr, 4); } if (opt->rr_needaddr) { unsigned char *optptr = opt->__data + opt->rr - sizeof(struct iphdr); optptr[2] -= 4; memset(&optptr[optptr[2] - 1], 0, 4); } if (opt->ts) { unsigned char *optptr = opt->__data + opt->ts - sizeof(struct iphdr); if (opt->ts_needtime) { optptr[2] -= 4; memset(&optptr[optptr[2] - 1], 0, 4); if ((optptr[3] & 0xF) == IPOPT_TS_PRESPEC) optptr[2] -= 4; } if (opt->ts_needaddr) { optptr[2] -= 4; memset(&optptr[optptr[2] - 1], 0, 4); } } } int ip_options_get(struct net *net, struct ip_options_rcu **optp, sockptr_t data, int optlen) { struct ip_options_rcu *opt; opt = kzalloc(sizeof(struct ip_options_rcu) + ((optlen + 3) & ~3), GFP_KERNEL); if (!opt) return -ENOMEM; if (optlen && copy_from_sockptr(opt->opt.__data, data, optlen)) { kfree(opt); return -EFAULT; } while (optlen & 3) opt->opt.__data[optlen++] = IPOPT_END; opt->opt.optlen = optlen; if (optlen && ip_options_compile(net, &opt->opt, NULL)) { kfree(opt); return -EINVAL; } kfree(*optp); *optp = opt; return 0; } void ip_forward_options(struct sk_buff *skb) { struct ip_options *opt = &(IPCB(skb)->opt); unsigned char *optptr; struct rtable *rt = skb_rtable(skb); unsigned char *raw = skb_network_header(skb); if (opt->rr_needaddr) { optptr = (unsigned char *)raw + opt->rr; ip_rt_get_source(&optptr[optptr[2]-5], skb, rt); opt->is_changed = 1; } if (opt->srr_is_hit) { int srrptr, srrspace; optptr = raw + opt->srr; for ( srrptr = optptr[2], srrspace = optptr[1]; srrptr <= srrspace; srrptr += 4 ) { if (srrptr + 3 > srrspace) break; if (memcmp(&opt->nexthop, &optptr[srrptr-1], 4) == 0) break; } if (srrptr + 3 <= srrspace) { opt->is_changed = 1; ip_hdr(skb)->daddr = opt->nexthop; ip_rt_get_source(&optptr[srrptr-1], skb, rt); optptr[2] = srrptr+4; } else { net_crit_ratelimited("%s(): Argh! 
Destination lost!\n", __func__); } if (opt->ts_needaddr) { optptr = raw + opt->ts; ip_rt_get_source(&optptr[optptr[2]-9], skb, rt); opt->is_changed = 1; } } if (opt->is_changed) { opt->is_changed = 0; ip_send_check(ip_hdr(skb)); } } int ip_options_rcv_srr(struct sk_buff *skb, struct net_device *dev) { struct ip_options *opt = &(IPCB(skb)->opt); int srrspace, srrptr; __be32 nexthop; struct iphdr *iph = ip_hdr(skb); unsigned char *optptr = skb_network_header(skb) + opt->srr; struct rtable *rt = skb_rtable(skb); struct rtable *rt2; unsigned long orefdst; int err; if (!rt) return 0; if (skb->pkt_type != PACKET_HOST) return -EINVAL; if (rt->rt_type == RTN_UNICAST) { if (!opt->is_strictroute) return 0; icmp_send(skb, ICMP_PARAMETERPROB, 0, htonl(16<<24)); return -EINVAL; } if (rt->rt_type != RTN_LOCAL) return -EINVAL; for (srrptr = optptr[2], srrspace = optptr[1]; srrptr <= srrspace; srrptr += 4) { if (srrptr + 3 > srrspace) { icmp_send(skb, ICMP_PARAMETERPROB, 0, htonl((opt->srr+2)<<24)); return -EINVAL; } memcpy(&nexthop, &optptr[srrptr-1], 4); orefdst = skb->_skb_refdst; skb_dst_set(skb, NULL); err = ip_route_input(skb, nexthop, iph->saddr, ip4h_dscp(iph), dev) ? -EINVAL : 0; rt2 = skb_rtable(skb); if (err || (rt2->rt_type != RTN_UNICAST && rt2->rt_type != RTN_LOCAL)) { skb_dst_drop(skb); skb->_skb_refdst = orefdst; return -EINVAL; } refdst_drop(orefdst); if (rt2->rt_type != RTN_LOCAL) break; /* Superfast 8) loopback forward */ iph->daddr = nexthop; opt->is_changed = 1; } if (srrptr <= srrspace) { opt->srr_is_hit = 1; opt->nexthop = nexthop; opt->is_changed = 1; } return 0; } EXPORT_SYMBOL(ip_options_rcv_srr); |
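For context on how the skb == NULL branches in __ip_options_compile() above get exercised: setsockopt(IPPROTO_IP, IP_OPTIONS, ...) is the user-space entry point that funnels into ip_options_get() and ip_options_compile() without an skb. The sketch below is an illustrative example only, not code from this file; it installs a Record Route option using the RFC 791 layout (type, length, pointer, then slots for up to nine recorded addresses, padded to a multiple of four).

/* Illustrative sketch: set an IP Record Route option on a socket. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/ip.h>

int main(void)
{
        unsigned char rr[3 + 9 * 4 + 1];        /* type, len, ptr, 9 slots, pad */
        int fd;

        fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0) {
                perror("socket");
                return 1;
        }

        memset(rr, 0, sizeof(rr));
        rr[0] = IPOPT_RR;       /* record route */
        rr[1] = 3 + 9 * 4;      /* option length */
        rr[2] = 4;              /* pointer to the first free slot */
        /* rr[39] stays 0 (IPOPT_END), matching the padding that
         * ip_options_get() would otherwise append */

        if (setsockopt(fd, IPPROTO_IP, IP_OPTIONS, rr, sizeof(rr)) < 0) {
                perror("setsockopt(IP_OPTIONS)");
                close(fd);
                return 1;
        }

        /* datagrams sent on fd now carry the RR option in their IP header */
        close(fd);
        return 0;
}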
// SPDX-License-Identifier: GPL-2.0-only /* * Copyright (c) 2015, Sony Mobile Communications Inc. * Copyright (c) 2013, The Linux Foundation. All rights reserved.
*/ #include <linux/module.h> #include <linux/netlink.h> #include <linux/qrtr.h> #include <linux/termios.h> /* For TIOCINQ/OUTQ */ #include <linux/spinlock.h> #include <linux/wait.h> #include <net/sock.h> #include "qrtr.h" #define QRTR_PROTO_VER_1 1 #define QRTR_PROTO_VER_2 3 /* auto-bind range */ #define QRTR_MIN_EPH_SOCKET 0x4000 #define QRTR_MAX_EPH_SOCKET 0x7fff #define QRTR_EPH_PORT_RANGE \ XA_LIMIT(QRTR_MIN_EPH_SOCKET, QRTR_MAX_EPH_SOCKET) #define QRTR_PORT_CTRL_LEGACY 0xffff /** * struct qrtr_hdr_v1 - (I|R)PCrouter packet header version 1 * @version: protocol version * @type: packet type; one of QRTR_TYPE_* * @src_node_id: source node * @src_port_id: source port * @confirm_rx: boolean; whether a resume-tx packet should be send in reply * @size: length of packet, excluding this header * @dst_node_id: destination node * @dst_port_id: destination port */ struct qrtr_hdr_v1 { __le32 version; __le32 type; __le32 src_node_id; __le32 src_port_id; __le32 confirm_rx; __le32 size; __le32 dst_node_id; __le32 dst_port_id; } __packed; /** * struct qrtr_hdr_v2 - (I|R)PCrouter packet header later versions * @version: protocol version * @type: packet type; one of QRTR_TYPE_* * @flags: bitmask of QRTR_FLAGS_* * @optlen: length of optional header data * @size: length of packet, excluding this header and optlen * @src_node_id: source node * @src_port_id: source port * @dst_node_id: destination node * @dst_port_id: destination port */ struct qrtr_hdr_v2 { u8 version; u8 type; u8 flags; u8 optlen; __le32 size; __le16 src_node_id; __le16 src_port_id; __le16 dst_node_id; __le16 dst_port_id; }; #define QRTR_FLAGS_CONFIRM_RX BIT(0) struct qrtr_cb { u32 src_node; u32 src_port; u32 dst_node; u32 dst_port; u8 type; u8 confirm_rx; }; #define QRTR_HDR_MAX_SIZE max_t(size_t, sizeof(struct qrtr_hdr_v1), \ sizeof(struct qrtr_hdr_v2)) struct qrtr_sock { /* WARNING: sk must be the first member */ struct sock sk; struct sockaddr_qrtr us; struct sockaddr_qrtr peer; }; static inline struct qrtr_sock *qrtr_sk(struct sock *sk) { BUILD_BUG_ON(offsetof(struct qrtr_sock, sk) != 0); return container_of(sk, struct qrtr_sock, sk); } static unsigned int qrtr_local_nid = 1; /* for node ids */ static RADIX_TREE(qrtr_nodes, GFP_ATOMIC); static DEFINE_SPINLOCK(qrtr_nodes_lock); /* broadcast list */ static LIST_HEAD(qrtr_all_nodes); /* lock for qrtr_all_nodes and node reference */ static DEFINE_MUTEX(qrtr_node_lock); /* local port allocation management */ static DEFINE_XARRAY_ALLOC(qrtr_ports); /** * struct qrtr_node - endpoint node * @ep_lock: lock for endpoint management and callbacks * @ep: endpoint * @ref: reference count for node * @nid: node id * @qrtr_tx_flow: tree of qrtr_tx_flow, keyed by node << 32 | port * @qrtr_tx_lock: lock for qrtr_tx_flow inserts * @rx_queue: receive queue * @item: list item for broadcast list */ struct qrtr_node { struct mutex ep_lock; struct qrtr_endpoint *ep; struct kref ref; unsigned int nid; struct radix_tree_root qrtr_tx_flow; struct mutex qrtr_tx_lock; /* for qrtr_tx_flow */ struct sk_buff_head rx_queue; struct list_head item; }; /** * struct qrtr_tx_flow - tx flow control * @resume_tx: waiters for a resume tx from the remote * @pending: number of waiting senders * @tx_failed: indicates that a message with confirm_rx flag was lost */ struct qrtr_tx_flow { struct wait_queue_head resume_tx; int pending; int tx_failed; }; #define QRTR_TX_FLOW_HIGH 10 #define QRTR_TX_FLOW_LOW 5 static int qrtr_local_enqueue(struct qrtr_node *node, struct sk_buff *skb, int type, struct sockaddr_qrtr *from, struct 
sockaddr_qrtr *to); static int qrtr_bcast_enqueue(struct qrtr_node *node, struct sk_buff *skb, int type, struct sockaddr_qrtr *from, struct sockaddr_qrtr *to); static struct qrtr_sock *qrtr_port_lookup(int port); static void qrtr_port_put(struct qrtr_sock *ipc); /* Release node resources and free the node. * * Do not call directly, use qrtr_node_release. To be used with * kref_put_mutex. As such, the node mutex is expected to be locked on call. */ static void __qrtr_node_release(struct kref *kref) { struct qrtr_node *node = container_of(kref, struct qrtr_node, ref); struct radix_tree_iter iter; struct qrtr_tx_flow *flow; unsigned long flags; void __rcu **slot; spin_lock_irqsave(&qrtr_nodes_lock, flags); /* If the node is a bridge for other nodes, there are possibly * multiple entries pointing to our released node, delete them all. */ radix_tree_for_each_slot(slot, &qrtr_nodes, &iter, 0) { if (*slot == node) radix_tree_iter_delete(&qrtr_nodes, &iter, slot); } spin_unlock_irqrestore(&qrtr_nodes_lock, flags); list_del(&node->item); mutex_unlock(&qrtr_node_lock); skb_queue_purge(&node->rx_queue); /* Free tx flow counters */ radix_tree_for_each_slot(slot, &node->qrtr_tx_flow, &iter, 0) { flow = *slot; radix_tree_iter_delete(&node->qrtr_tx_flow, &iter, slot); kfree(flow); } kfree(node); } /* Increment reference to node. */ static struct qrtr_node *qrtr_node_acquire(struct qrtr_node *node) { if (node) kref_get(&node->ref); return node; } /* Decrement reference to node and release as necessary. */ static void qrtr_node_release(struct qrtr_node *node) { if (!node) return; kref_put_mutex(&node->ref, __qrtr_node_release, &qrtr_node_lock); } /** * qrtr_tx_resume() - reset flow control counter * @node: qrtr_node that the QRTR_TYPE_RESUME_TX packet arrived on * @skb: resume_tx packet */ static void qrtr_tx_resume(struct qrtr_node *node, struct sk_buff *skb) { struct qrtr_ctrl_pkt *pkt = (struct qrtr_ctrl_pkt *)skb->data; u64 remote_node = le32_to_cpu(pkt->client.node); u32 remote_port = le32_to_cpu(pkt->client.port); struct qrtr_tx_flow *flow; unsigned long key; key = remote_node << 32 | remote_port; rcu_read_lock(); flow = radix_tree_lookup(&node->qrtr_tx_flow, key); rcu_read_unlock(); if (flow) { spin_lock(&flow->resume_tx.lock); flow->pending = 0; spin_unlock(&flow->resume_tx.lock); wake_up_interruptible_all(&flow->resume_tx); } consume_skb(skb); } /** * qrtr_tx_wait() - flow control for outgoing packets * @node: qrtr_node that the packet is to be send to * @dest_node: node id of the destination * @dest_port: port number of the destination * @type: type of message * * The flow control scheme is based around the low and high "watermarks". When * the low watermark is passed the confirm_rx flag is set on the outgoing * message, which will trigger the remote to send a control message of the type * QRTR_TYPE_RESUME_TX to reset the counter. If the high watermark is hit * further transmision should be paused. 
* * Return: 1 if confirm_rx should be set, 0 otherwise or errno failure */ static int qrtr_tx_wait(struct qrtr_node *node, int dest_node, int dest_port, int type) { unsigned long key = (u64)dest_node << 32 | dest_port; struct qrtr_tx_flow *flow; int confirm_rx = 0; int ret; /* Never set confirm_rx on non-data packets */ if (type != QRTR_TYPE_DATA) return 0; mutex_lock(&node->qrtr_tx_lock); flow = radix_tree_lookup(&node->qrtr_tx_flow, key); if (!flow) { flow = kzalloc(sizeof(*flow), GFP_KERNEL); if (flow) { init_waitqueue_head(&flow->resume_tx); if (radix_tree_insert(&node->qrtr_tx_flow, key, flow)) { kfree(flow); flow = NULL; } } } mutex_unlock(&node->qrtr_tx_lock); /* Set confirm_rx if we where unable to find and allocate a flow */ if (!flow) return 1; spin_lock_irq(&flow->resume_tx.lock); ret = wait_event_interruptible_locked_irq(flow->resume_tx, flow->pending < QRTR_TX_FLOW_HIGH || flow->tx_failed || !node->ep); if (ret < 0) { confirm_rx = ret; } else if (!node->ep) { confirm_rx = -EPIPE; } else if (flow->tx_failed) { flow->tx_failed = 0; confirm_rx = 1; } else { flow->pending++; confirm_rx = flow->pending == QRTR_TX_FLOW_LOW; } spin_unlock_irq(&flow->resume_tx.lock); return confirm_rx; } /** * qrtr_tx_flow_failed() - flag that tx of confirm_rx flagged messages failed * @node: qrtr_node that the packet is to be send to * @dest_node: node id of the destination * @dest_port: port number of the destination * * Signal that the transmission of a message with confirm_rx flag failed. The * flow's "pending" counter will keep incrementing towards QRTR_TX_FLOW_HIGH, * at which point transmission would stall forever waiting for the resume TX * message associated with the dropped confirm_rx message. * Work around this by marking the flow as having a failed transmission and * cause the next transmission attempt to be sent with the confirm_rx. */ static void qrtr_tx_flow_failed(struct qrtr_node *node, int dest_node, int dest_port) { unsigned long key = (u64)dest_node << 32 | dest_port; struct qrtr_tx_flow *flow; rcu_read_lock(); flow = radix_tree_lookup(&node->qrtr_tx_flow, key); rcu_read_unlock(); if (flow) { spin_lock_irq(&flow->resume_tx.lock); flow->tx_failed = 1; spin_unlock_irq(&flow->resume_tx.lock); } } /* Pass an outgoing packet socket buffer to the endpoint driver. 
*/ static int qrtr_node_enqueue(struct qrtr_node *node, struct sk_buff *skb, int type, struct sockaddr_qrtr *from, struct sockaddr_qrtr *to) { struct qrtr_hdr_v1 *hdr; size_t len = skb->len; int rc, confirm_rx; confirm_rx = qrtr_tx_wait(node, to->sq_node, to->sq_port, type); if (confirm_rx < 0) { kfree_skb(skb); return confirm_rx; } hdr = skb_push(skb, sizeof(*hdr)); hdr->version = cpu_to_le32(QRTR_PROTO_VER_1); hdr->type = cpu_to_le32(type); hdr->src_node_id = cpu_to_le32(from->sq_node); hdr->src_port_id = cpu_to_le32(from->sq_port); if (to->sq_port == QRTR_PORT_CTRL) { hdr->dst_node_id = cpu_to_le32(node->nid); hdr->dst_port_id = cpu_to_le32(QRTR_PORT_CTRL); } else { hdr->dst_node_id = cpu_to_le32(to->sq_node); hdr->dst_port_id = cpu_to_le32(to->sq_port); } hdr->size = cpu_to_le32(len); hdr->confirm_rx = !!confirm_rx; rc = skb_put_padto(skb, ALIGN(len, 4) + sizeof(*hdr)); if (!rc) { mutex_lock(&node->ep_lock); rc = -ENODEV; if (node->ep) rc = node->ep->xmit(node->ep, skb); else kfree_skb(skb); mutex_unlock(&node->ep_lock); } /* Need to ensure that a subsequent message carries the otherwise lost * confirm_rx flag if we dropped this one */ if (rc && confirm_rx) qrtr_tx_flow_failed(node, to->sq_node, to->sq_port); return rc; } /* Lookup node by id. * * callers must release with qrtr_node_release() */ static struct qrtr_node *qrtr_node_lookup(unsigned int nid) { struct qrtr_node *node; unsigned long flags; mutex_lock(&qrtr_node_lock); spin_lock_irqsave(&qrtr_nodes_lock, flags); node = radix_tree_lookup(&qrtr_nodes, nid); node = qrtr_node_acquire(node); spin_unlock_irqrestore(&qrtr_nodes_lock, flags); mutex_unlock(&qrtr_node_lock); return node; } /* Assign node id to node. * * This is mostly useful for automatic node id assignment, based on * the source id in the incoming packet. 
*/ static void qrtr_node_assign(struct qrtr_node *node, unsigned int nid) { unsigned long flags; if (nid == QRTR_EP_NID_AUTO) return; spin_lock_irqsave(&qrtr_nodes_lock, flags); radix_tree_insert(&qrtr_nodes, nid, node); if (node->nid == QRTR_EP_NID_AUTO) node->nid = nid; spin_unlock_irqrestore(&qrtr_nodes_lock, flags); } /** * qrtr_endpoint_post() - post incoming data * @ep: endpoint handle * @data: data pointer * @len: size of data in bytes * * Return: 0 on success; negative error code on failure */ int qrtr_endpoint_post(struct qrtr_endpoint *ep, const void *data, size_t len) { struct qrtr_node *node = ep->node; const struct qrtr_hdr_v1 *v1; const struct qrtr_hdr_v2 *v2; struct qrtr_sock *ipc; struct sk_buff *skb; struct qrtr_cb *cb; size_t size; unsigned int ver; size_t hdrlen; if (len == 0 || len & 3) return -EINVAL; skb = __netdev_alloc_skb(NULL, len, GFP_ATOMIC | __GFP_NOWARN); if (!skb) return -ENOMEM; cb = (struct qrtr_cb *)skb->cb; /* Version field in v1 is little endian, so this works for both cases */ ver = *(u8*)data; switch (ver) { case QRTR_PROTO_VER_1: if (len < sizeof(*v1)) goto err; v1 = data; hdrlen = sizeof(*v1); cb->type = le32_to_cpu(v1->type); cb->src_node = le32_to_cpu(v1->src_node_id); cb->src_port = le32_to_cpu(v1->src_port_id); cb->confirm_rx = !!v1->confirm_rx; cb->dst_node = le32_to_cpu(v1->dst_node_id); cb->dst_port = le32_to_cpu(v1->dst_port_id); size = le32_to_cpu(v1->size); break; case QRTR_PROTO_VER_2: if (len < sizeof(*v2)) goto err; v2 = data; hdrlen = sizeof(*v2) + v2->optlen; cb->type = v2->type; cb->confirm_rx = !!(v2->flags & QRTR_FLAGS_CONFIRM_RX); cb->src_node = le16_to_cpu(v2->src_node_id); cb->src_port = le16_to_cpu(v2->src_port_id); cb->dst_node = le16_to_cpu(v2->dst_node_id); cb->dst_port = le16_to_cpu(v2->dst_port_id); if (cb->src_port == (u16)QRTR_PORT_CTRL) cb->src_port = QRTR_PORT_CTRL; if (cb->dst_port == (u16)QRTR_PORT_CTRL) cb->dst_port = QRTR_PORT_CTRL; size = le32_to_cpu(v2->size); break; default: pr_err("qrtr: Invalid version %d\n", ver); goto err; } if (cb->dst_port == QRTR_PORT_CTRL_LEGACY) cb->dst_port = QRTR_PORT_CTRL; if (!size || len != ALIGN(size, 4) + hdrlen) goto err; if ((cb->type == QRTR_TYPE_NEW_SERVER || cb->type == QRTR_TYPE_RESUME_TX) && size < sizeof(struct qrtr_ctrl_pkt)) goto err; if (cb->dst_port != QRTR_PORT_CTRL && cb->type != QRTR_TYPE_DATA && cb->type != QRTR_TYPE_RESUME_TX) goto err; skb_put_data(skb, data + hdrlen, size); qrtr_node_assign(node, cb->src_node); if (cb->type == QRTR_TYPE_NEW_SERVER) { /* Remote node endpoint can bridge other distant nodes */ const struct qrtr_ctrl_pkt *pkt; pkt = data + hdrlen; qrtr_node_assign(node, le32_to_cpu(pkt->server.node)); } if (cb->type == QRTR_TYPE_RESUME_TX) { qrtr_tx_resume(node, skb); } else { ipc = qrtr_port_lookup(cb->dst_port); if (!ipc) goto err; if (sock_queue_rcv_skb(&ipc->sk, skb)) { qrtr_port_put(ipc); goto err; } qrtr_port_put(ipc); } return 0; err: kfree_skb(skb); return -EINVAL; } EXPORT_SYMBOL_GPL(qrtr_endpoint_post); /** * qrtr_alloc_ctrl_packet() - allocate control packet skb * @pkt: reference to qrtr_ctrl_pkt pointer * @flags: the type of memory to allocate * * Returns newly allocated sk_buff, or NULL on failure * * This function allocates a sk_buff large enough to carry a qrtr_ctrl_pkt and * on success returns a reference to the control packet in @pkt. 
*/ static struct sk_buff *qrtr_alloc_ctrl_packet(struct qrtr_ctrl_pkt **pkt, gfp_t flags) { const int pkt_len = sizeof(struct qrtr_ctrl_pkt); struct sk_buff *skb; skb = alloc_skb(QRTR_HDR_MAX_SIZE + pkt_len, flags); if (!skb) return NULL; skb_reserve(skb, QRTR_HDR_MAX_SIZE); *pkt = skb_put_zero(skb, pkt_len); return skb; } /** * qrtr_endpoint_register() - register a new endpoint * @ep: endpoint to register * @nid: desired node id; may be QRTR_EP_NID_AUTO for auto-assignment * Return: 0 on success; negative error code on failure * * The specified endpoint must have the xmit function pointer set on call. */ int qrtr_endpoint_register(struct qrtr_endpoint *ep, unsigned int nid) { struct qrtr_node *node; if (!ep || !ep->xmit) return -EINVAL; node = kzalloc(sizeof(*node), GFP_KERNEL); if (!node) return -ENOMEM; kref_init(&node->ref); mutex_init(&node->ep_lock); skb_queue_head_init(&node->rx_queue); node->nid = QRTR_EP_NID_AUTO; node->ep = ep; INIT_RADIX_TREE(&node->qrtr_tx_flow, GFP_KERNEL); mutex_init(&node->qrtr_tx_lock); qrtr_node_assign(node, nid); mutex_lock(&qrtr_node_lock); list_add(&node->item, &qrtr_all_nodes); mutex_unlock(&qrtr_node_lock); ep->node = node; return 0; } EXPORT_SYMBOL_GPL(qrtr_endpoint_register); /** * qrtr_endpoint_unregister - unregister endpoint * @ep: endpoint to unregister */ void qrtr_endpoint_unregister(struct qrtr_endpoint *ep) { struct qrtr_node *node = ep->node; struct sockaddr_qrtr src = {AF_QIPCRTR, node->nid, QRTR_PORT_CTRL}; struct sockaddr_qrtr dst = {AF_QIPCRTR, qrtr_local_nid, QRTR_PORT_CTRL}; struct radix_tree_iter iter; struct qrtr_ctrl_pkt *pkt; struct qrtr_tx_flow *flow; struct sk_buff *skb; unsigned long flags; void __rcu **slot; mutex_lock(&node->ep_lock); node->ep = NULL; mutex_unlock(&node->ep_lock); /* Notify the local controller about the event */ spin_lock_irqsave(&qrtr_nodes_lock, flags); radix_tree_for_each_slot(slot, &qrtr_nodes, &iter, 0) { if (*slot != node) continue; src.sq_node = iter.index; skb = qrtr_alloc_ctrl_packet(&pkt, GFP_ATOMIC); if (skb) { pkt->cmd = cpu_to_le32(QRTR_TYPE_BYE); qrtr_local_enqueue(NULL, skb, QRTR_TYPE_BYE, &src, &dst); } } spin_unlock_irqrestore(&qrtr_nodes_lock, flags); /* Wake up any transmitters waiting for resume-tx from the node */ mutex_lock(&node->qrtr_tx_lock); radix_tree_for_each_slot(slot, &node->qrtr_tx_flow, &iter, 0) { flow = *slot; wake_up_interruptible_all(&flow->resume_tx); } mutex_unlock(&node->qrtr_tx_lock); qrtr_node_release(node); ep->node = NULL; } EXPORT_SYMBOL_GPL(qrtr_endpoint_unregister); /* Lookup socket by port. * * Callers must release with qrtr_port_put() */ static struct qrtr_sock *qrtr_port_lookup(int port) { struct qrtr_sock *ipc; if (port == QRTR_PORT_CTRL) port = 0; rcu_read_lock(); ipc = xa_load(&qrtr_ports, port); if (ipc) sock_hold(&ipc->sk); rcu_read_unlock(); return ipc; } /* Release acquired socket. */ static void qrtr_port_put(struct qrtr_sock *ipc) { sock_put(&ipc->sk); } /* Remove port assignment. 
*/ static void qrtr_port_remove(struct qrtr_sock *ipc) { struct qrtr_ctrl_pkt *pkt; struct sk_buff *skb; int port = ipc->us.sq_port; struct sockaddr_qrtr to; to.sq_family = AF_QIPCRTR; to.sq_node = QRTR_NODE_BCAST; to.sq_port = QRTR_PORT_CTRL; skb = qrtr_alloc_ctrl_packet(&pkt, GFP_KERNEL); if (skb) { pkt->cmd = cpu_to_le32(QRTR_TYPE_DEL_CLIENT); pkt->client.node = cpu_to_le32(ipc->us.sq_node); pkt->client.port = cpu_to_le32(ipc->us.sq_port); skb_set_owner_w(skb, &ipc->sk); qrtr_bcast_enqueue(NULL, skb, QRTR_TYPE_DEL_CLIENT, &ipc->us, &to); } if (port == QRTR_PORT_CTRL) port = 0; __sock_put(&ipc->sk); xa_erase(&qrtr_ports, port); /* Ensure that if qrtr_port_lookup() did enter the RCU read section we * wait for it to up increment the refcount */ synchronize_rcu(); } /* Assign port number to socket. * * Specify port in the integer pointed to by port, and it will be adjusted * on return as necesssary. * * Port may be: * 0: Assign ephemeral port in [QRTR_MIN_EPH_SOCKET, QRTR_MAX_EPH_SOCKET] * <QRTR_MIN_EPH_SOCKET: Specified; requires CAP_NET_ADMIN * >QRTR_MIN_EPH_SOCKET: Specified; available to all */ static int qrtr_port_assign(struct qrtr_sock *ipc, int *port) { int rc; if (!*port) { rc = xa_alloc(&qrtr_ports, port, ipc, QRTR_EPH_PORT_RANGE, GFP_KERNEL); } else if (*port < QRTR_MIN_EPH_SOCKET && !capable(CAP_NET_ADMIN)) { rc = -EACCES; } else if (*port == QRTR_PORT_CTRL) { rc = xa_insert(&qrtr_ports, 0, ipc, GFP_KERNEL); } else { rc = xa_insert(&qrtr_ports, *port, ipc, GFP_KERNEL); } if (rc == -EBUSY) return -EADDRINUSE; else if (rc < 0) return rc; sock_hold(&ipc->sk); return 0; } /* Reset all non-control ports */ static void qrtr_reset_ports(void) { struct qrtr_sock *ipc; unsigned long index; rcu_read_lock(); xa_for_each_start(&qrtr_ports, index, ipc, 1) { sock_hold(&ipc->sk); ipc->sk.sk_err = ENETRESET; sk_error_report(&ipc->sk); sock_put(&ipc->sk); } rcu_read_unlock(); } /* Bind socket to address. * * Socket should be locked upon call. */ static int __qrtr_bind(struct socket *sock, const struct sockaddr_qrtr *addr, int zapped) { struct qrtr_sock *ipc = qrtr_sk(sock->sk); struct sock *sk = sock->sk; int port; int rc; /* rebinding ok */ if (!zapped && addr->sq_port == ipc->us.sq_port) return 0; port = addr->sq_port; rc = qrtr_port_assign(ipc, &port); if (rc) return rc; /* unbind previous, if any */ if (!zapped) qrtr_port_remove(ipc); ipc->us.sq_port = port; sock_reset_flag(sk, SOCK_ZAPPED); /* Notify all open ports about the new controller */ if (port == QRTR_PORT_CTRL) qrtr_reset_ports(); return 0; } /* Auto bind to an ephemeral port. */ static int qrtr_autobind(struct socket *sock) { struct sock *sk = sock->sk; struct sockaddr_qrtr addr; if (!sock_flag(sk, SOCK_ZAPPED)) return 0; addr.sq_family = AF_QIPCRTR; addr.sq_node = qrtr_local_nid; addr.sq_port = 0; return __qrtr_bind(sock, &addr, 1); } /* Bind socket to specified sockaddr. */ static int qrtr_bind(struct socket *sock, struct sockaddr *saddr, int len) { DECLARE_SOCKADDR(struct sockaddr_qrtr *, addr, saddr); struct qrtr_sock *ipc = qrtr_sk(sock->sk); struct sock *sk = sock->sk; int rc; if (len < sizeof(*addr) || addr->sq_family != AF_QIPCRTR) return -EINVAL; if (addr->sq_node != ipc->us.sq_node) return -EINVAL; lock_sock(sk); rc = __qrtr_bind(sock, addr, sock_flag(sk, SOCK_ZAPPED)); release_sock(sk); return rc; } /* Queue packet to local peer socket. 
*/ static int qrtr_local_enqueue(struct qrtr_node *node, struct sk_buff *skb, int type, struct sockaddr_qrtr *from, struct sockaddr_qrtr *to) { struct qrtr_sock *ipc; struct qrtr_cb *cb; ipc = qrtr_port_lookup(to->sq_port); if (!ipc || &ipc->sk == skb->sk) { /* do not send to self */ if (ipc) qrtr_port_put(ipc); kfree_skb(skb); return -ENODEV; } cb = (struct qrtr_cb *)skb->cb; cb->src_node = from->sq_node; cb->src_port = from->sq_port; if (sock_queue_rcv_skb(&ipc->sk, skb)) { qrtr_port_put(ipc); kfree_skb(skb); return -ENOSPC; } qrtr_port_put(ipc); return 0; } /* Queue packet for broadcast. */ static int qrtr_bcast_enqueue(struct qrtr_node *node, struct sk_buff *skb, int type, struct sockaddr_qrtr *from, struct sockaddr_qrtr *to) { struct sk_buff *skbn; mutex_lock(&qrtr_node_lock); list_for_each_entry(node, &qrtr_all_nodes, item) { skbn = pskb_copy(skb, GFP_KERNEL); if (!skbn) break; skb_set_owner_w(skbn, skb->sk); qrtr_node_enqueue(node, skbn, type, from, to); } mutex_unlock(&qrtr_node_lock); qrtr_local_enqueue(NULL, skb, type, from, to); return 0; } static int qrtr_sendmsg(struct socket *sock, struct msghdr *msg, size_t len) { DECLARE_SOCKADDR(struct sockaddr_qrtr *, addr, msg->msg_name); int (*enqueue_fn)(struct qrtr_node *, struct sk_buff *, int, struct sockaddr_qrtr *, struct sockaddr_qrtr *); __le32 qrtr_type = cpu_to_le32(QRTR_TYPE_DATA); struct qrtr_sock *ipc = qrtr_sk(sock->sk); struct sock *sk = sock->sk; struct qrtr_node *node; struct sk_buff *skb; size_t plen; u32 type; int rc; if (msg->msg_flags & ~(MSG_DONTWAIT)) return -EINVAL; if (len > 65535) return -EMSGSIZE; lock_sock(sk); if (addr) { if (msg->msg_namelen < sizeof(*addr)) { release_sock(sk); return -EINVAL; } if (addr->sq_family != AF_QIPCRTR) { release_sock(sk); return -EINVAL; } rc = qrtr_autobind(sock); if (rc) { release_sock(sk); return rc; } } else if (sk->sk_state == TCP_ESTABLISHED) { addr = &ipc->peer; } else { release_sock(sk); return -ENOTCONN; } node = NULL; if (addr->sq_node == QRTR_NODE_BCAST) { if (addr->sq_port != QRTR_PORT_CTRL && qrtr_local_nid != QRTR_NODE_BCAST) { release_sock(sk); return -ENOTCONN; } enqueue_fn = qrtr_bcast_enqueue; } else if (addr->sq_node == ipc->us.sq_node) { enqueue_fn = qrtr_local_enqueue; } else { node = qrtr_node_lookup(addr->sq_node); if (!node) { release_sock(sk); return -ECONNRESET; } enqueue_fn = qrtr_node_enqueue; } plen = (len + 3) & ~3; skb = sock_alloc_send_skb(sk, plen + QRTR_HDR_MAX_SIZE, msg->msg_flags & MSG_DONTWAIT, &rc); if (!skb) { rc = -ENOMEM; goto out_node; } skb_reserve(skb, QRTR_HDR_MAX_SIZE); rc = memcpy_from_msg(skb_put(skb, len), msg, len); if (rc) { kfree_skb(skb); goto out_node; } if (ipc->us.sq_port == QRTR_PORT_CTRL) { if (len < 4) { rc = -EINVAL; kfree_skb(skb); goto out_node; } /* control messages already require the type as 'command' */ skb_copy_bits(skb, 0, &qrtr_type, 4); } type = le32_to_cpu(qrtr_type); rc = enqueue_fn(node, skb, type, &ipc->us, addr); if (rc >= 0) rc = len; out_node: qrtr_node_release(node); release_sock(sk); return rc; } static int qrtr_send_resume_tx(struct qrtr_cb *cb) { struct sockaddr_qrtr remote = { AF_QIPCRTR, cb->src_node, cb->src_port }; struct sockaddr_qrtr local = { AF_QIPCRTR, cb->dst_node, cb->dst_port }; struct qrtr_ctrl_pkt *pkt; struct qrtr_node *node; struct sk_buff *skb; int ret; node = qrtr_node_lookup(remote.sq_node); if (!node) return -EINVAL; skb = qrtr_alloc_ctrl_packet(&pkt, GFP_KERNEL); if (!skb) return -ENOMEM; pkt->cmd = cpu_to_le32(QRTR_TYPE_RESUME_TX); pkt->client.node = cpu_to_le32(cb->dst_node); 
pkt->client.port = cpu_to_le32(cb->dst_port); ret = qrtr_node_enqueue(node, skb, QRTR_TYPE_RESUME_TX, &local, &remote); qrtr_node_release(node); return ret; } static int qrtr_recvmsg(struct socket *sock, struct msghdr *msg, size_t size, int flags) { DECLARE_SOCKADDR(struct sockaddr_qrtr *, addr, msg->msg_name); struct sock *sk = sock->sk; struct sk_buff *skb; struct qrtr_cb *cb; int copied, rc; lock_sock(sk); if (sock_flag(sk, SOCK_ZAPPED)) { release_sock(sk); return -EADDRNOTAVAIL; } skb = skb_recv_datagram(sk, flags, &rc); if (!skb) { release_sock(sk); return rc; } cb = (struct qrtr_cb *)skb->cb; copied = skb->len; if (copied > size) { copied = size; msg->msg_flags |= MSG_TRUNC; } rc = skb_copy_datagram_msg(skb, 0, msg, copied); if (rc < 0) goto out; rc = copied; if (addr) { /* There is an anonymous 2-byte hole after sq_family, * make sure to clear it. */ memset(addr, 0, sizeof(*addr)); addr->sq_family = AF_QIPCRTR; addr->sq_node = cb->src_node; addr->sq_port = cb->src_port; msg->msg_namelen = sizeof(*addr); } out: if (cb->confirm_rx) qrtr_send_resume_tx(cb); skb_free_datagram(sk, skb); release_sock(sk); return rc; } static int qrtr_connect(struct socket *sock, struct sockaddr *saddr, int len, int flags) { DECLARE_SOCKADDR(struct sockaddr_qrtr *, addr, saddr); struct qrtr_sock *ipc = qrtr_sk(sock->sk); struct sock *sk = sock->sk; int rc; if (len < sizeof(*addr) || addr->sq_family != AF_QIPCRTR) return -EINVAL; lock_sock(sk); sk->sk_state = TCP_CLOSE; sock->state = SS_UNCONNECTED; rc = qrtr_autobind(sock); if (rc) { release_sock(sk); return rc; } ipc->peer = *addr; sock->state = SS_CONNECTED; sk->sk_state = TCP_ESTABLISHED; release_sock(sk); return 0; } static int qrtr_getname(struct socket *sock, struct sockaddr *saddr, int peer) { struct qrtr_sock *ipc = qrtr_sk(sock->sk); struct sockaddr_qrtr qaddr; struct sock *sk = sock->sk; lock_sock(sk); if (peer) { if (sk->sk_state != TCP_ESTABLISHED) { release_sock(sk); return -ENOTCONN; } qaddr = ipc->peer; } else { qaddr = ipc->us; } release_sock(sk); qaddr.sq_family = AF_QIPCRTR; memcpy(saddr, &qaddr, sizeof(qaddr)); return sizeof(qaddr); } static int qrtr_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg) { void __user *argp = (void __user *)arg; struct qrtr_sock *ipc = qrtr_sk(sock->sk); struct sock *sk = sock->sk; struct sockaddr_qrtr *sq; struct sk_buff *skb; struct ifreq ifr; long len = 0; int rc = 0; lock_sock(sk); switch (cmd) { case TIOCOUTQ: len = sk->sk_sndbuf - sk_wmem_alloc_get(sk); if (len < 0) len = 0; rc = put_user(len, (int __user *)argp); break; case TIOCINQ: skb = skb_peek(&sk->sk_receive_queue); if (skb) len = skb->len; rc = put_user(len, (int __user *)argp); break; case SIOCGIFADDR: if (get_user_ifreq(&ifr, NULL, argp)) { rc = -EFAULT; break; } sq = (struct sockaddr_qrtr *)&ifr.ifr_addr; *sq = ipc->us; if (put_user_ifreq(&ifr, argp)) { rc = -EFAULT; break; } break; case SIOCADDRT: case SIOCDELRT: case SIOCSIFADDR: case SIOCGIFDSTADDR: case SIOCSIFDSTADDR: case SIOCGIFBRDADDR: case SIOCSIFBRDADDR: case SIOCGIFNETMASK: case SIOCSIFNETMASK: rc = -EINVAL; break; default: rc = -ENOIOCTLCMD; break; } release_sock(sk); return rc; } static int qrtr_release(struct socket *sock) { struct sock *sk = sock->sk; struct qrtr_sock *ipc; if (!sk) return 0; lock_sock(sk); ipc = qrtr_sk(sk); sk->sk_shutdown = SHUTDOWN_MASK; if (!sock_flag(sk, SOCK_DEAD)) sk->sk_state_change(sk); sock_set_flag(sk, SOCK_DEAD); sock_orphan(sk); sock->sk = NULL; if (!sock_flag(sk, SOCK_ZAPPED)) qrtr_port_remove(ipc); 
skb_queue_purge(&sk->sk_receive_queue); release_sock(sk); sock_put(sk); return 0; } static const struct proto_ops qrtr_proto_ops = { .owner = THIS_MODULE, .family = AF_QIPCRTR, .bind = qrtr_bind, .connect = qrtr_connect, .socketpair = sock_no_socketpair, .accept = sock_no_accept, .listen = sock_no_listen, .sendmsg = qrtr_sendmsg, .recvmsg = qrtr_recvmsg, .getname = qrtr_getname, .ioctl = qrtr_ioctl, .gettstamp = sock_gettstamp, .poll = datagram_poll, .shutdown = sock_no_shutdown, .release = qrtr_release, .mmap = sock_no_mmap, }; static struct proto qrtr_proto = { .name = "QIPCRTR", .owner = THIS_MODULE, .obj_size = sizeof(struct qrtr_sock), }; static int qrtr_create(struct net *net, struct socket *sock, int protocol, int kern) { struct qrtr_sock *ipc; struct sock *sk; if (sock->type != SOCK_DGRAM) return -EPROTOTYPE; sk = sk_alloc(net, AF_QIPCRTR, GFP_KERNEL, &qrtr_proto, kern); if (!sk) return -ENOMEM; sock_set_flag(sk, SOCK_ZAPPED); sock_init_data(sock, sk); sock->ops = &qrtr_proto_ops; ipc = qrtr_sk(sk); ipc->us.sq_family = AF_QIPCRTR; ipc->us.sq_node = qrtr_local_nid; ipc->us.sq_port = 0; return 0; } static const struct net_proto_family qrtr_family = { .owner = THIS_MODULE, .family = AF_QIPCRTR, .create = qrtr_create, }; static int __init qrtr_proto_init(void) { int rc; rc = proto_register(&qrtr_proto, 1); if (rc) return rc; rc = sock_register(&qrtr_family); if (rc) goto err_proto; rc = qrtr_ns_init(); if (rc) goto err_sock; return 0; err_sock: sock_unregister(qrtr_family.family); err_proto: proto_unregister(&qrtr_proto); return rc; } postcore_initcall(qrtr_proto_init); static void __exit qrtr_proto_fini(void) { qrtr_ns_remove(); sock_unregister(qrtr_family.family); proto_unregister(&qrtr_proto); } module_exit(qrtr_proto_fini); MODULE_DESCRIPTION("Qualcomm IPC-router driver"); MODULE_LICENSE("GPL v2"); MODULE_ALIAS_NETPROTO(PF_QIPCRTR); |
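A minimal user-space counterpart may help when reading qrtr_sendmsg() and qrtr_autobind() above. The sketch below is illustrative only: the destination node/port pair is a placeholder rather than a real service address, and the AF_QIPCRTR fallback define is only for older libc headers (42 is the standard value).

/* Illustrative sketch: a minimal AF_QIPCRTR datagram client. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/qrtr.h>

#ifndef AF_QIPCRTR
#define AF_QIPCRTR 42           /* older libc headers may lack this */
#endif

int main(void)
{
        struct sockaddr_qrtr self, peer;
        socklen_t len = sizeof(self);
        const char payload[] = "ping";
        int fd;

        fd = socket(AF_QIPCRTR, SOCK_DGRAM, 0);
        if (fd < 0) {
                perror("socket(AF_QIPCRTR)");
                return 1;
        }

        memset(&peer, 0, sizeof(peer));
        peer.sq_family = AF_QIPCRTR;
        peer.sq_node = 3;       /* placeholder remote node id */
        peer.sq_port = 0x4000;  /* placeholder remote port */

        /* sendto() enters qrtr_sendmsg(), which auto-binds the socket and
         * then picks the local, broadcast or per-node enqueue path */
        if (sendto(fd, payload, sizeof(payload), 0,
                   (struct sockaddr *)&peer, sizeof(peer)) < 0)
                perror("sendto");       /* e.g. fails if the node is unknown */

        /* the send above auto-bound us to an ephemeral port */
        if (getsockname(fd, (struct sockaddr *)&self, &len) == 0)
                printf("bound to node %u port %u\n", self.sq_node, self.sq_port);

        close(fd);
        return 0;
}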
/* SPDX-License-Identifier: GPL-2.0-only */ /* * net busy poll support * Copyright(c) 2013 Intel Corporation. * * Author: Eliezer Tamir * * Contact Information: * e1000-devel Mailing List <e1000-devel@lists.sourceforge.net> */ #ifndef _LINUX_NET_BUSY_POLL_H #define _LINUX_NET_BUSY_POLL_H #include <linux/netdevice.h> #include <linux/sched/clock.h> #include <linux/sched/signal.h> #include <net/ip.h> #include <net/xdp.h> /* 0 - Reserved to indicate value not set * 1..NR_CPUS - Reserved for sender_cpu * NR_CPUS+1..~0 - Region available for NAPI IDs */ #define MIN_NAPI_ID ((unsigned int)(NR_CPUS + 1)) #define BUSY_POLL_BUDGET 8 #ifdef CONFIG_NET_RX_BUSY_POLL struct napi_struct; extern unsigned int sysctl_net_busy_read __read_mostly; extern unsigned int sysctl_net_busy_poll __read_mostly; static inline bool net_busy_loop_on(void) { return READ_ONCE(sysctl_net_busy_poll); } static inline bool sk_can_busy_loop(const struct sock *sk) { return READ_ONCE(sk->sk_ll_usec) && !signal_pending(current); } bool sk_busy_loop_end(void *p, unsigned long start_time); void napi_busy_loop(unsigned int napi_id, bool (*loop_end)(void *, unsigned long), void *loop_end_arg, bool prefer_busy_poll, u16 budget); void napi_busy_loop_rcu(unsigned int napi_id, bool (*loop_end)(void *, unsigned long), void *loop_end_arg, bool prefer_busy_poll, u16 budget); void napi_suspend_irqs(unsigned int napi_id); void napi_resume_irqs(unsigned int napi_id); #else /* CONFIG_NET_RX_BUSY_POLL */ static inline unsigned long net_busy_loop_on(void) { return 0; } static inline bool sk_can_busy_loop(struct sock *sk) { return false; } #endif /* CONFIG_NET_RX_BUSY_POLL */ static inline unsigned long busy_loop_current_time(void) { #ifdef CONFIG_NET_RX_BUSY_POLL return (unsigned long)(ktime_get_ns() >> 10); #else return 0; #endif } /* in poll/select we use the global sysctl_net_ll_poll value */ static inline bool busy_loop_timeout(unsigned long start_time) { #ifdef CONFIG_NET_RX_BUSY_POLL unsigned long bp_usec = READ_ONCE(sysctl_net_busy_poll); if (bp_usec) { unsigned long end_time = start_time + bp_usec; unsigned long now = busy_loop_current_time(); return time_after(now, end_time); } #endif return true; } static inline bool sk_busy_loop_timeout(struct sock *sk, unsigned long start_time) { #ifdef CONFIG_NET_RX_BUSY_POLL unsigned long bp_usec = READ_ONCE(sk->sk_ll_usec); if (bp_usec) { unsigned long end_time = start_time + bp_usec; unsigned long now = busy_loop_current_time(); return time_after(now, end_time); } #endif return true; } static inline void sk_busy_loop(struct sock *sk, int nonblock) { #ifdef CONFIG_NET_RX_BUSY_POLL unsigned int napi_id = READ_ONCE(sk->sk_napi_id); if (napi_id >= MIN_NAPI_ID) napi_busy_loop(napi_id, nonblock ?
NULL : sk_busy_loop_end, sk, READ_ONCE(sk->sk_prefer_busy_poll), READ_ONCE(sk->sk_busy_poll_budget) ?: BUSY_POLL_BUDGET); #endif } /* used in the NIC receive handler to mark the skb */ static inline void skb_mark_napi_id(struct sk_buff *skb, struct napi_struct *napi) { #ifdef CONFIG_NET_RX_BUSY_POLL /* If the skb was already marked with a valid NAPI ID, avoid overwriting * it. */ if (skb->napi_id < MIN_NAPI_ID) skb->napi_id = napi->napi_id; #endif } /* used in the protocol handler to propagate the napi_id to the socket */ static inline void sk_mark_napi_id(struct sock *sk, const struct sk_buff *skb) { #ifdef CONFIG_NET_RX_BUSY_POLL if (unlikely(READ_ONCE(sk->sk_napi_id) != skb->napi_id)) WRITE_ONCE(sk->sk_napi_id, skb->napi_id); #endif sk_rx_queue_update(sk, skb); } /* Variant of sk_mark_napi_id() for passive flow setup, * as sk->sk_napi_id and sk->sk_rx_queue_mapping content * needs to be set. */ static inline void sk_mark_napi_id_set(struct sock *sk, const struct sk_buff *skb) { #ifdef CONFIG_NET_RX_BUSY_POLL WRITE_ONCE(sk->sk_napi_id, skb->napi_id); #endif sk_rx_queue_set(sk, skb); } static inline void __sk_mark_napi_id_once(struct sock *sk, unsigned int napi_id) { #ifdef CONFIG_NET_RX_BUSY_POLL if (!READ_ONCE(sk->sk_napi_id)) WRITE_ONCE(sk->sk_napi_id, napi_id); #endif } /* variant used for unconnected sockets */ static inline void sk_mark_napi_id_once(struct sock *sk, const struct sk_buff *skb) { #ifdef CONFIG_NET_RX_BUSY_POLL __sk_mark_napi_id_once(sk, skb->napi_id); #endif } #endif /* _LINUX_NET_BUSY_POLL_H */ |
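The per-socket values consulted by sk_can_busy_loop(), sk_busy_loop_timeout() and sk_busy_loop() above are set from user space through socket options: SO_BUSY_POLL feeds sk->sk_ll_usec, while the two newer options feed the sk_prefer_busy_poll and sk_busy_poll_budget fields read in sk_busy_loop(). The following sketch is an illustrative example, not part of the header; the 50-microsecond and 8-packet values are arbitrary choices.

/* Illustrative sketch: enable busy polling on a socket. */
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>

int main(void)
{
        int busy_usec = 50;     /* microseconds to busy poll per blocking read */
        int prefer = 1;         /* prefer busy polling over IRQ-driven rx */
        int budget = 8;         /* packets per NAPI poll while busy polling */
        int fd;

        fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0) {
                perror("socket");
                return 1;
        }

        if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL,
                       &busy_usec, sizeof(busy_usec)) < 0)
                perror("SO_BUSY_POLL");

        /* Newer options; may be missing from older headers and kernels. */
#ifdef SO_PREFER_BUSY_POLL
        if (setsockopt(fd, SOL_SOCKET, SO_PREFER_BUSY_POLL,
                       &prefer, sizeof(prefer)) < 0)
                perror("SO_PREFER_BUSY_POLL");
#endif
#ifdef SO_BUSY_POLL_BUDGET
        if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL_BUDGET,
                       &budget, sizeof(budget)) < 0)
                perror("SO_BUSY_POLL_BUDGET");
#endif

        /* blocking receives on fd will now busy poll the NAPI context that
         * last delivered packets to this socket (sk->sk_napi_id) */
        close(fd);
        return 0;
}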
/* SPDX-License-Identifier: GPL-2.0 */ #ifndef _BLOCK_BLK_PM_H_ #define _BLOCK_BLK_PM_H_ #include <linux/pm_runtime.h> #ifdef CONFIG_PM static inline int blk_pm_resume_queue(const bool pm, struct request_queue *q) { if (!q->dev || !blk_queue_pm_only(q)) return 1; /* Nothing to do */ if (pm && q->rpm_status != RPM_SUSPENDED) return 1; /* Request allowed */ pm_request_resume(q->dev); return 0; } static inline void blk_pm_mark_last_busy(struct request *rq) { if (rq->q->dev && !(rq->rq_flags & RQF_PM)) pm_runtime_mark_last_busy(rq->q->dev); } #else static inline int blk_pm_resume_queue(const bool pm, struct request_queue *q) { return 1; } static inline void blk_pm_mark_last_busy(struct request *rq) { } #endif #endif /* _BLOCK_BLK_PM_H_ */
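For orientation, the helpers above only act once a driver has tied its request queue to a runtime-PM capable device. The sketch below is a hedged illustration of that setup using the public blk_pm_runtime_init() declared in include/linux/blk-pm.h; the function name and the 5-second autosuspend delay are example choices, not code from this header.

/* Illustrative sketch: opt a block driver's queue into runtime PM. */
#include <linux/blk-pm.h>
#include <linux/blkdev.h>
#include <linux/pm_runtime.h>

static void example_enable_runtime_pm(struct request_queue *q,
                                      struct device *dev)
{
        /* Sets q->dev and q->rpm_status; once the device runtime-suspends,
         * blk_pm_resume_queue() above will kick a resume for ordinary
         * (non-RQF_PM) requests. */
        blk_pm_runtime_init(q, dev);

        pm_runtime_set_autosuspend_delay(dev, 5000);    /* 5s idle timeout */
        pm_runtime_use_autosuspend(dev);
        pm_runtime_allow(dev);
}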
// SPDX-License-Identifier: GPL-2.0-only /* * Pid namespaces * * Authors: * (C) 2007 Pavel Emelyanov <xemul@openvz.org>, OpenVZ, SWsoft Inc. * (C) 2007 Sukadev Bhattiprolu <sukadev@us.ibm.com>, IBM * Many thanks to Oleg Nesterov for comments and help * */ #include <linux/pid.h> #include <linux/pid_namespace.h> #include <linux/user_namespace.h> #include <linux/syscalls.h> #include <linux/cred.h> #include <linux/err.h> #include <linux/acct.h> #include <linux/slab.h> #include <linux/proc_ns.h> #include <linux/reboot.h> #include <linux/export.h> #include <linux/sched/task.h> #include <linux/sched/signal.h> #include <linux/idr.h> #include <uapi/linux/wait.h> #include "pid_sysctl.h" static DEFINE_MUTEX(pid_caches_mutex); static struct kmem_cache *pid_ns_cachep; /* Write once array, filled from the beginning. */ static struct kmem_cache *pid_cache[MAX_PID_NS_LEVEL]; /* * creates the kmem cache to allocate pids from. * @level: pid namespace level */ static struct kmem_cache *create_pid_cachep(unsigned int level) { /* Level 0 is init_pid_ns.pid_cachep */ struct kmem_cache **pkc = &pid_cache[level - 1]; struct kmem_cache *kc; char name[4 + 10 + 1]; unsigned int len; kc = READ_ONCE(*pkc); if (kc) return kc; snprintf(name, sizeof(name), "pid_%u", level + 1); len = struct_size_t(struct pid, numbers, level + 1); mutex_lock(&pid_caches_mutex); /* Name collision forces to do allocation under mutex. */ if (!*pkc) *pkc = kmem_cache_create(name, len, 0, SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT, NULL); mutex_unlock(&pid_caches_mutex); /* current can fail, but someone else can succeed.
*/ return READ_ONCE(*pkc); } static struct ucounts *inc_pid_namespaces(struct user_namespace *ns) { return inc_ucount(ns, current_euid(), UCOUNT_PID_NAMESPACES); } static void dec_pid_namespaces(struct ucounts *ucounts) { dec_ucount(ucounts, UCOUNT_PID_NAMESPACES); } static void destroy_pid_namespace_work(struct work_struct *work); static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns, struct pid_namespace *parent_pid_ns) { struct pid_namespace *ns; unsigned int level = parent_pid_ns->level + 1; struct ucounts *ucounts; int err; err = -EINVAL; if (!in_userns(parent_pid_ns->user_ns, user_ns)) goto out; err = -ENOSPC; if (level > MAX_PID_NS_LEVEL) goto out; ucounts = inc_pid_namespaces(user_ns); if (!ucounts) goto out; err = -ENOMEM; ns = kmem_cache_zalloc(pid_ns_cachep, GFP_KERNEL); if (ns == NULL) goto out_dec; idr_init(&ns->idr); ns->pid_cachep = create_pid_cachep(level); if (ns->pid_cachep == NULL) goto out_free_idr; err = ns_alloc_inum(&ns->ns); if (err) goto out_free_idr; ns->ns.ops = &pidns_operations; ns->pid_max = parent_pid_ns->pid_max; err = register_pidns_sysctls(ns); if (err) goto out_free_inum; refcount_set(&ns->ns.count, 1); ns->level = level; ns->parent = get_pid_ns(parent_pid_ns); ns->user_ns = get_user_ns(user_ns); ns->ucounts = ucounts; ns->pid_allocated = PIDNS_ADDING; INIT_WORK(&ns->work, destroy_pid_namespace_work); #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE) ns->memfd_noexec_scope = pidns_memfd_noexec_scope(parent_pid_ns); #endif return ns; out_free_inum: ns_free_inum(&ns->ns); out_free_idr: idr_destroy(&ns->idr); kmem_cache_free(pid_ns_cachep, ns); out_dec: dec_pid_namespaces(ucounts); out: return ERR_PTR(err); } static void delayed_free_pidns(struct rcu_head *p) { struct pid_namespace *ns = container_of(p, struct pid_namespace, rcu); dec_pid_namespaces(ns->ucounts); put_user_ns(ns->user_ns); kmem_cache_free(pid_ns_cachep, ns); } static void destroy_pid_namespace(struct pid_namespace *ns) { unregister_pidns_sysctls(ns); ns_free_inum(&ns->ns); idr_destroy(&ns->idr); call_rcu(&ns->rcu, delayed_free_pidns); } static void destroy_pid_namespace_work(struct work_struct *work) { struct pid_namespace *ns = container_of(work, struct pid_namespace, work); do { struct pid_namespace *parent; parent = ns->parent; destroy_pid_namespace(ns); ns = parent; } while (ns != &init_pid_ns && refcount_dec_and_test(&ns->ns.count)); } struct pid_namespace *copy_pid_ns(unsigned long flags, struct user_namespace *user_ns, struct pid_namespace *old_ns) { if (!(flags & CLONE_NEWPID)) return get_pid_ns(old_ns); if (task_active_pid_ns(current) != old_ns) return ERR_PTR(-EINVAL); return create_pid_namespace(user_ns, old_ns); } void put_pid_ns(struct pid_namespace *ns) { if (ns && ns != &init_pid_ns && refcount_dec_and_test(&ns->ns.count)) schedule_work(&ns->work); } EXPORT_SYMBOL_GPL(put_pid_ns); void zap_pid_ns_processes(struct pid_namespace *pid_ns) { int nr; int rc; struct task_struct *task, *me = current; int init_pids = thread_group_leader(me) ? 1 : 2; struct pid *pid; /* Don't allow any more processes into the pid namespace */ disable_pid_allocation(pid_ns); /* * Ignore SIGCHLD causing any terminated children to autoreap. * This speeds up the namespace shutdown, plus see the comment * below. */ spin_lock_irq(&me->sighand->siglock); me->sighand->action[SIGCHLD - 1].sa.sa_handler = SIG_IGN; spin_unlock_irq(&me->sighand->siglock); /* * The last thread in the cgroup-init thread group is terminating. 
* Find remaining pid_ts in the namespace, signal and wait for them * to exit. * * Note: This signals each threads in the namespace - even those that * belong to the same thread group, To avoid this, we would have * to walk the entire tasklist looking a processes in this * namespace, but that could be unnecessarily expensive if the * pid namespace has just a few processes. Or we need to * maintain a tasklist for each pid namespace. * */ rcu_read_lock(); read_lock(&tasklist_lock); nr = 2; idr_for_each_entry_continue(&pid_ns->idr, pid, nr) { task = pid_task(pid, PIDTYPE_PID); if (task && !__fatal_signal_pending(task)) group_send_sig_info(SIGKILL, SEND_SIG_PRIV, task, PIDTYPE_MAX); } read_unlock(&tasklist_lock); rcu_read_unlock(); /* * Reap the EXIT_ZOMBIE children we had before we ignored SIGCHLD. * kernel_wait4() will also block until our children traced from the * parent namespace are detached and become EXIT_DEAD. */ do { clear_thread_flag(TIF_SIGPENDING); clear_thread_flag(TIF_NOTIFY_SIGNAL); rc = kernel_wait4(-1, NULL, __WALL, NULL); } while (rc != -ECHILD); /* * kernel_wait4() misses EXIT_DEAD children, and EXIT_ZOMBIE * process whose parents processes are outside of the pid * namespace. Such processes are created with setns()+fork(). * * If those EXIT_ZOMBIE processes are not reaped by their * parents before their parents exit, they will be reparented * to pid_ns->child_reaper. Thus pidns->child_reaper needs to * stay valid until they all go away. * * The code relies on the pid_ns->child_reaper ignoring * SIGCHILD to cause those EXIT_ZOMBIE processes to be * autoreaped if reparented. * * Semantically it is also desirable to wait for EXIT_ZOMBIE * processes before allowing the child_reaper to be reaped, as * that gives the invariant that when the init process of a * pid namespace is reaped all of the processes in the pid * namespace are gone. * * Once all of the other tasks are gone from the pid_namespace * free_pid() will awaken this task. 
*/ for (;;) { set_current_state(TASK_INTERRUPTIBLE); if (pid_ns->pid_allocated == init_pids) break; schedule(); } __set_current_state(TASK_RUNNING); if (pid_ns->reboot) current->signal->group_exit_code = pid_ns->reboot; acct_exit_ns(pid_ns); return; } #ifdef CONFIG_CHECKPOINT_RESTORE static int pid_ns_ctl_handler(const struct ctl_table *table, int write, void *buffer, size_t *lenp, loff_t *ppos) { struct pid_namespace *pid_ns = task_active_pid_ns(current); struct ctl_table tmp = *table; int ret, next; if (write && !checkpoint_restore_ns_capable(pid_ns->user_ns)) return -EPERM; next = idr_get_cursor(&pid_ns->idr) - 1; tmp.data = &next; tmp.extra2 = &pid_ns->pid_max; ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos); if (!ret && write) idr_set_cursor(&pid_ns->idr, next + 1); return ret; } static const struct ctl_table pid_ns_ctl_table[] = { { .procname = "ns_last_pid", .maxlen = sizeof(int), .mode = 0666, /* permissions are checked in the handler */ .proc_handler = pid_ns_ctl_handler, .extra1 = SYSCTL_ZERO, .extra2 = &init_pid_ns.pid_max, }, }; #endif /* CONFIG_CHECKPOINT_RESTORE */ int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd) { if (pid_ns == &init_pid_ns) return 0; switch (cmd) { case LINUX_REBOOT_CMD_RESTART2: case LINUX_REBOOT_CMD_RESTART: pid_ns->reboot = SIGHUP; break; case LINUX_REBOOT_CMD_POWER_OFF: case LINUX_REBOOT_CMD_HALT: pid_ns->reboot = SIGINT; break; default: return -EINVAL; } read_lock(&tasklist_lock); send_sig(SIGKILL, pid_ns->child_reaper, 1); read_unlock(&tasklist_lock); do_exit(0); /* Not reached */ return 0; } static inline struct pid_namespace *to_pid_ns(struct ns_common *ns) { return container_of(ns, struct pid_namespace, ns); } static struct ns_common *pidns_get(struct task_struct *task) { struct pid_namespace *ns; rcu_read_lock(); ns = task_active_pid_ns(task); if (ns) get_pid_ns(ns); rcu_read_unlock(); return ns ? &ns->ns : NULL; } static struct ns_common *pidns_for_children_get(struct task_struct *task) { struct pid_namespace *ns = NULL; task_lock(task); if (task->nsproxy) { ns = task->nsproxy->pid_ns_for_children; get_pid_ns(ns); } task_unlock(task); if (ns) { read_lock(&tasklist_lock); if (!ns->child_reaper) { put_pid_ns(ns); ns = NULL; } read_unlock(&tasklist_lock); } return ns ? &ns->ns : NULL; } static void pidns_put(struct ns_common *ns) { put_pid_ns(to_pid_ns(ns)); } static int pidns_install(struct nsset *nsset, struct ns_common *ns) { struct nsproxy *nsproxy = nsset->nsproxy; struct pid_namespace *active = task_active_pid_ns(current); struct pid_namespace *ancestor, *new = to_pid_ns(ns); if (!ns_capable(new->user_ns, CAP_SYS_ADMIN) || !ns_capable(nsset->cred->user_ns, CAP_SYS_ADMIN)) return -EPERM; /* * Only allow entering the current active pid namespace * or a child of the current active pid namespace. * * This is required for fork to return a usable pid value and * this maintains the property that processes and their * children can not escape their current pid namespace. 
*/ if (new->level < active->level) return -EINVAL; ancestor = new; while (ancestor->level > active->level) ancestor = ancestor->parent; if (ancestor != active) return -EINVAL; put_pid_ns(nsproxy->pid_ns_for_children); nsproxy->pid_ns_for_children = get_pid_ns(new); return 0; } static struct ns_common *pidns_get_parent(struct ns_common *ns) { struct pid_namespace *active = task_active_pid_ns(current); struct pid_namespace *pid_ns, *p; /* See if the parent is in the current namespace */ pid_ns = p = to_pid_ns(ns)->parent; for (;;) { if (!p) return ERR_PTR(-EPERM); if (p == active) break; p = p->parent; } return &get_pid_ns(pid_ns)->ns; } static struct user_namespace *pidns_owner(struct ns_common *ns) { return to_pid_ns(ns)->user_ns; } const struct proc_ns_operations pidns_operations = { .name = "pid", .type = CLONE_NEWPID, .get = pidns_get, .put = pidns_put, .install = pidns_install, .owner = pidns_owner, .get_parent = pidns_get_parent, }; const struct proc_ns_operations pidns_for_children_operations = { .name = "pid_for_children", .real_ns_name = "pid", .type = CLONE_NEWPID, .get = pidns_for_children_get, .put = pidns_put, .install = pidns_install, .owner = pidns_owner, .get_parent = pidns_get_parent, }; static __init int pid_namespaces_init(void) { pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC | SLAB_ACCOUNT); #ifdef CONFIG_CHECKPOINT_RESTORE register_sysctl_init("kernel", pid_ns_ctl_table); #endif register_pid_ns_sysctl_table_vm(); return 0; } __initcall(pid_namespaces_init); |
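For context on the code above, a hedged userspace sketch: create_pid_namespace() runs when a task that has done unshare(CLONE_NEWPID) (or clone() with CLONE_NEWPID) creates its first child, and that child becomes pid 1 (the child reaper) of the new namespace. Requires CAP_SYS_ADMIN in the owning user namespace.

/* Illustrative userspace example; not part of pid_namespace.c. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	if (unshare(CLONE_NEWPID))
		return 1;

	pid_t child = fork();	/* first child becomes pid 1 in the new ns */
	if (child == 0) {
		printf("in new pid namespace, my pid is %d\n", (int)getpid());
		return 0;
	}
	waitpid(child, NULL, 0);
	return 0;
}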
// SPDX-License-Identifier: GPL-2.0-only
/*
 * lowlevel.c
 *
 * PURPOSE
 *  Low Level Device Routines for the UDF filesystem
 *
 * COPYRIGHT
 *  (C) 1999-2001 Ben Fennema
 *
 * HISTORY
 *
 *  03/26/99 blf  Created.
 */

#include "udfdecl.h"

#include <linux/blkdev.h>
#include <linux/cdrom.h>
#include <linux/uaccess.h>

#include "udf_sb.h"

unsigned int udf_get_last_session(struct super_block *sb)
{
	struct cdrom_device_info *cdi = disk_to_cdi(sb->s_bdev->bd_disk);
	struct cdrom_multisession ms_info;

	if (!cdi) {
		udf_debug("CDROMMULTISESSION not supported.\n");
		return 0;
	}

	ms_info.addr_format = CDROM_LBA;
	if (cdrom_multisession(cdi, &ms_info) == 0) {
		udf_debug("XA disk: %s, vol_desc_start=%d\n",
			  ms_info.xa_flag ? "yes" : "no", ms_info.addr.lba);
		if (ms_info.xa_flag) /* necessary for a valid ms_info.addr */
			return ms_info.addr.lba;
	}
	return 0;
}

udf_pblk_t udf_get_last_block(struct super_block *sb)
{
	struct cdrom_device_info *cdi = disk_to_cdi(sb->s_bdev->bd_disk);
	unsigned long lblock = 0;

	/*
	 * The cdrom layer call failed or returned obviously bogus value?
	 * Try using the device size...
	 */
	if (!cdi || cdrom_get_last_written(cdi, &lblock) || lblock == 0) {
		if (sb_bdev_nr_blocks(sb) > ~(udf_pblk_t)0)
			return 0;
		lblock = sb_bdev_nr_blocks(sb);
	}

	if (lblock)
		return lblock - 1;
	return 0;
}
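A hedged sketch of a caller, only to show intent: the UDF mount path uses the last-session start as an offset when probing for volume descriptors and the last block to bound its scan. The helper below is hypothetical and exists purely for illustration.

/* Hypothetical caller; not part of lowlevel.c. */
static void example_udf_probe_info(struct super_block *sb)
{
	unsigned int session_start = udf_get_last_session(sb);
	udf_pblk_t last_block = udf_get_last_block(sb);

	udf_debug("last session starts at %u, last block is %u\n",
		  session_start, (unsigned int)last_block);
}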
| 15 15 14 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 | // SPDX-License-Identifier: GPL-2.0+ /* * Copyright (C) 2003-2008 Takahiro Hirofuchi * Copyright (C) 2015 Nobuo Iwata */ #include <linux/kthread.h> #include <linux/export.h> #include <linux/slab.h> #include <linux/workqueue.h> #include "usbip_common.h" struct usbip_event { struct list_head node; struct usbip_device *ud; }; static DEFINE_SPINLOCK(event_lock); static LIST_HEAD(event_list); static void set_event(struct usbip_device *ud, unsigned long event) { unsigned long flags; spin_lock_irqsave(&ud->lock, flags); ud->event |= event; spin_unlock_irqrestore(&ud->lock, flags); } static void unset_event(struct usbip_device *ud, unsigned long event) { unsigned long flags; spin_lock_irqsave(&ud->lock, flags); ud->event &= ~event; spin_unlock_irqrestore(&ud->lock, flags); } static struct usbip_device *get_event(void) { struct usbip_event *ue = NULL; struct usbip_device *ud = NULL; unsigned long flags; spin_lock_irqsave(&event_lock, flags); if (!list_empty(&event_list)) { ue = list_first_entry(&event_list, struct usbip_event, node); list_del(&ue->node); } spin_unlock_irqrestore(&event_lock, flags); if (ue) { ud = ue->ud; kfree(ue); } return ud; } static struct task_struct *worker_context; static void event_handler(struct work_struct *work) { struct usbip_device *ud; if (worker_context == NULL) { worker_context = current; } while ((ud = get_event()) != NULL) { usbip_dbg_eh("pending event %lx\n", ud->event); mutex_lock(&ud->sysfs_lock); /* * NOTE: shutdown must come first. * Shutdown the device. */ if (ud->event & USBIP_EH_SHUTDOWN) { ud->eh_ops.shutdown(ud); unset_event(ud, USBIP_EH_SHUTDOWN); } /* Reset the device. */ if (ud->event & USBIP_EH_RESET) { ud->eh_ops.reset(ud); unset_event(ud, USBIP_EH_RESET); } /* Mark the device as unusable. 
*/ if (ud->event & USBIP_EH_UNUSABLE) { ud->eh_ops.unusable(ud); unset_event(ud, USBIP_EH_UNUSABLE); } mutex_unlock(&ud->sysfs_lock); wake_up(&ud->eh_waitq); } } int usbip_start_eh(struct usbip_device *ud) { init_waitqueue_head(&ud->eh_waitq); ud->event = 0; return 0; } EXPORT_SYMBOL_GPL(usbip_start_eh); void usbip_stop_eh(struct usbip_device *ud) { unsigned long pending = ud->event & ~USBIP_EH_BYE; if (!(ud->event & USBIP_EH_BYE)) usbip_dbg_eh("usbip_eh stopping but not removed\n"); if (pending) usbip_dbg_eh("usbip_eh waiting completion %lx\n", pending); wait_event_interruptible(ud->eh_waitq, !(ud->event & ~USBIP_EH_BYE)); usbip_dbg_eh("usbip_eh has stopped\n"); } EXPORT_SYMBOL_GPL(usbip_stop_eh); #define WORK_QUEUE_NAME "usbip_event" static struct workqueue_struct *usbip_queue; static DECLARE_WORK(usbip_work, event_handler); int usbip_init_eh(void) { usbip_queue = create_singlethread_workqueue(WORK_QUEUE_NAME); if (usbip_queue == NULL) { pr_err("failed to create usbip_event\n"); return -ENOMEM; } return 0; } void usbip_finish_eh(void) { destroy_workqueue(usbip_queue); usbip_queue = NULL; } void usbip_event_add(struct usbip_device *ud, unsigned long event) { struct usbip_event *ue; unsigned long flags; if (ud->event & USBIP_EH_BYE) return; set_event(ud, event); spin_lock_irqsave(&event_lock, flags); list_for_each_entry_reverse(ue, &event_list, node) { if (ue->ud == ud) goto out; } ue = kmalloc(sizeof(struct usbip_event), GFP_ATOMIC); if (ue == NULL) goto out; ue->ud = ud; list_add_tail(&ue->node, &event_list); queue_work(usbip_queue, &usbip_work); out: spin_unlock_irqrestore(&event_lock, flags); } EXPORT_SYMBOL_GPL(usbip_event_add); int usbip_event_happened(struct usbip_device *ud) { int happened = 0; unsigned long flags; spin_lock_irqsave(&ud->lock, flags); if (ud->event != 0) happened = 1; spin_unlock_irqrestore(&ud->lock, flags); return happened; } EXPORT_SYMBOL_GPL(usbip_event_happened); int usbip_in_eh(struct task_struct *task) { if (task == worker_context) return 1; return 0; } EXPORT_SYMBOL_GPL(usbip_in_eh); |
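A hedged sketch of the producer side of this event machinery: a usbip driver reports a fatal condition with usbip_event_add(), and the single-threaded workqueue above then runs the registered eh_ops in shutdown/reset/unusable order. The exact flag combinations used by the vhci/stub drivers differ; the one below is illustrative.

/* Illustrative only: queue a shutdown + mark-unusable event. */
static void example_report_fatal_error(struct usbip_device *ud)
{
	usbip_event_add(ud, USBIP_EH_SHUTDOWN | USBIP_EH_UNUSABLE);
}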
| 12 12 4 4 4 12 5 355 345 12 7 5 6 368 366 13 18 17 1 14 1 2 2 2 2 2 117 112 5 4 4 5 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 | // SPDX-License-Identifier: GPL-2.0 #include "bcachefs.h" #include "eytzinger.h" #include "journal.h" #include "journal_seq_blacklist.h" #include "super-io.h" /* * journal_seq_blacklist machinery: * * To guarantee order of btree updates after a crash, we need to detect when a * btree node entry (bset) is newer than the newest journal entry that was * successfully written, and ignore it - effectively ignoring any btree updates * that didn't make it into the journal. * * If we didn't do this, we might have two btree nodes, a and b, both with * updates that weren't written to the journal yet: if b was updated after a, * but b was flushed and not a - oops; on recovery we'll find that the updates * to b happened, but not the updates to a that happened before it. * * Ignoring bsets that are newer than the newest journal entry is always safe, * because everything they contain will also have been journalled - and must * still be present in the journal on disk until a journal entry has been * written _after_ that bset was written. * * To accomplish this, bsets record the newest journal sequence number they * contain updates for; then, on startup, the btree code queries the journal * code to ask "Is this sequence number newer than the newest journal entry? If * so, ignore it." * * When this happens, we must blacklist that journal sequence number: the * journal must not write any entries with that sequence number, and it must * record that it was blacklisted so that a) on recovery we don't think we have * missing journal entries and b) so that the btree code continues to ignore * that bset, until that btree node is rewritten. 
*/ static unsigned sb_blacklist_u64s(unsigned nr) { struct bch_sb_field_journal_seq_blacklist *bl; return (sizeof(*bl) + sizeof(bl->start[0]) * nr) / sizeof(u64); } int bch2_journal_seq_blacklist_add(struct bch_fs *c, u64 start, u64 end) { struct bch_sb_field_journal_seq_blacklist *bl; unsigned i = 0, nr; int ret = 0; mutex_lock(&c->sb_lock); bl = bch2_sb_field_get(c->disk_sb.sb, journal_seq_blacklist); nr = blacklist_nr_entries(bl); while (i < nr) { struct journal_seq_blacklist_entry *e = bl->start + i; if (end < le64_to_cpu(e->start)) break; if (start > le64_to_cpu(e->end)) { i++; continue; } /* * Entry is contiguous or overlapping with new entry: merge it * with new entry, and delete: */ start = min(start, le64_to_cpu(e->start)); end = max(end, le64_to_cpu(e->end)); array_remove_item(bl->start, nr, i); } bl = bch2_sb_field_resize(&c->disk_sb, journal_seq_blacklist, sb_blacklist_u64s(nr + 1)); if (!bl) { ret = -BCH_ERR_ENOSPC_sb_journal_seq_blacklist; goto out; } array_insert_item(bl->start, nr, i, ((struct journal_seq_blacklist_entry) { .start = cpu_to_le64(start), .end = cpu_to_le64(end), })); c->disk_sb.sb->features[0] |= cpu_to_le64(1ULL << BCH_FEATURE_journal_seq_blacklist_v3); ret = bch2_write_super(c); out: mutex_unlock(&c->sb_lock); return ret ?: bch2_blacklist_table_initialize(c); } static int journal_seq_blacklist_table_cmp(const void *_l, const void *_r) { const struct journal_seq_blacklist_table_entry *l = _l; const struct journal_seq_blacklist_table_entry *r = _r; return cmp_int(l->start, r->start); } bool bch2_journal_seq_is_blacklisted(struct bch_fs *c, u64 seq, bool dirty) { struct journal_seq_blacklist_table *t = c->journal_seq_blacklist_table; struct journal_seq_blacklist_table_entry search = { .start = seq }; int idx; if (!t) return false; idx = eytzinger0_find_le(t->entries, t->nr, sizeof(t->entries[0]), journal_seq_blacklist_table_cmp, &search); if (idx < 0) return false; BUG_ON(t->entries[idx].start > seq); if (seq >= t->entries[idx].end) return false; if (dirty) t->entries[idx].dirty = true; return true; } int bch2_blacklist_table_initialize(struct bch_fs *c) { struct bch_sb_field_journal_seq_blacklist *bl = bch2_sb_field_get(c->disk_sb.sb, journal_seq_blacklist); struct journal_seq_blacklist_table *t; unsigned i, nr = blacklist_nr_entries(bl); if (!bl) return 0; t = kzalloc(struct_size(t, entries, nr), GFP_KERNEL); if (!t) return -BCH_ERR_ENOMEM_blacklist_table_init; t->nr = nr; for (i = 0; i < nr; i++) { t->entries[i].start = le64_to_cpu(bl->start[i].start); t->entries[i].end = le64_to_cpu(bl->start[i].end); } eytzinger0_sort(t->entries, t->nr, sizeof(t->entries[0]), journal_seq_blacklist_table_cmp, NULL); kfree(c->journal_seq_blacklist_table); c->journal_seq_blacklist_table = t; return 0; } static int bch2_sb_journal_seq_blacklist_validate(struct bch_sb *sb, struct bch_sb_field *f, enum bch_validate_flags flags, struct printbuf *err) { struct bch_sb_field_journal_seq_blacklist *bl = field_to_type(f, journal_seq_blacklist); unsigned i, nr = blacklist_nr_entries(bl); for (i = 0; i < nr; i++) { struct journal_seq_blacklist_entry *e = bl->start + i; if (le64_to_cpu(e->start) >= le64_to_cpu(e->end)) { prt_printf(err, "entry %u start >= end (%llu >= %llu)", i, le64_to_cpu(e->start), le64_to_cpu(e->end)); return -BCH_ERR_invalid_sb_journal_seq_blacklist; } if (i + 1 < nr && le64_to_cpu(e[0].end) > le64_to_cpu(e[1].start)) { prt_printf(err, "entry %u out of order with next entry (%llu > %llu)", i + 1, le64_to_cpu(e[0].end), le64_to_cpu(e[1].start)); return 
-BCH_ERR_invalid_sb_journal_seq_blacklist; } } return 0; } static void bch2_sb_journal_seq_blacklist_to_text(struct printbuf *out, struct bch_sb *sb, struct bch_sb_field *f) { struct bch_sb_field_journal_seq_blacklist *bl = field_to_type(f, journal_seq_blacklist); struct journal_seq_blacklist_entry *i; unsigned nr = blacklist_nr_entries(bl); for (i = bl->start; i < bl->start + nr; i++) { if (i != bl->start) prt_printf(out, " "); prt_printf(out, "%llu-%llu", le64_to_cpu(i->start), le64_to_cpu(i->end)); } prt_newline(out); } const struct bch_sb_field_ops bch_sb_field_ops_journal_seq_blacklist = { .validate = bch2_sb_journal_seq_blacklist_validate, .to_text = bch2_sb_journal_seq_blacklist_to_text }; bool bch2_blacklist_entries_gc(struct bch_fs *c) { struct journal_seq_blacklist_entry *src, *dst; struct bch_sb_field_journal_seq_blacklist *bl = bch2_sb_field_get(c->disk_sb.sb, journal_seq_blacklist); if (!bl) return false; unsigned nr = blacklist_nr_entries(bl); dst = bl->start; struct journal_seq_blacklist_table *t = c->journal_seq_blacklist_table; BUG_ON(nr != t->nr); unsigned i; for (src = bl->start, i = t->nr == 0 ? 0 : eytzinger0_first(t->nr); src < bl->start + nr; src++, i = eytzinger0_next(i, nr)) { BUG_ON(t->entries[i].start != le64_to_cpu(src->start)); BUG_ON(t->entries[i].end != le64_to_cpu(src->end)); if (t->entries[i].dirty || t->entries[i].end >= c->journal.oldest_seq_found_ondisk) *dst++ = *src; } unsigned new_nr = dst - bl->start; if (new_nr == nr) return false; bch_verbose(c, "nr blacklist entries was %u, now %u", nr, new_nr); bl = bch2_sb_field_resize(&c->disk_sb, journal_seq_blacklist, new_nr ? sb_blacklist_u64s(new_nr) : 0); BUG_ON(new_nr && !bl); return true; } |
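A hedged sketch of the query side described in the comment at the top of this file: during recovery the btree read path asks whether a bset's journal sequence number is blacklisted and, if so, ignores that bset. The helper name below is illustrative; the real callers live in the btree code.

/*
 * Illustrative only.  Passing dirty=true records that on-disk data
 * still refers to this blacklist entry, so bch2_blacklist_entries_gc()
 * will keep it.
 */
static bool example_should_drop_bset(struct bch_fs *c, u64 bset_journal_seq)
{
	return bch2_journal_seq_is_blacklisted(c, bset_journal_seq, true);
}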
| 30 9 14 9 9 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 | /* SPDX-License-Identifier: GPL-2.0-only */ /* * Copyright (c) 1999-2002 Vojtech Pavlik */ #ifndef _INPUT_H #define _INPUT_H #include <linux/time.h> #include <linux/list.h> #include <uapi/linux/input.h> /* Implementation details, userspace should not care about these */ #define ABS_MT_FIRST ABS_MT_TOUCH_MAJOR #define ABS_MT_LAST ABS_MT_TOOL_Y /* * In-kernel definitions. */ #include <linux/device.h> #include <linux/fs.h> #include <linux/timer.h> #include <linux/mod_devicetable.h> struct input_dev_poller; /** * struct input_value - input value representation * @type: type of value (EV_KEY, EV_ABS, etc) * @code: the value code * @value: the value */ struct input_value { __u16 type; __u16 code; __s32 value; }; enum input_clock_type { INPUT_CLK_REAL = 0, INPUT_CLK_MONO, INPUT_CLK_BOOT, INPUT_CLK_MAX }; /** * struct input_dev - represents an input device * @name: name of the device * @phys: physical path to the device in the system hierarchy * @uniq: unique identification code for the device (if device has it) * @id: id of the device (struct input_id) * @propbit: bitmap of device properties and quirks * @evbit: bitmap of types of events supported by the device (EV_KEY, * EV_REL, etc.) 
* @keybit: bitmap of keys/buttons this device has * @relbit: bitmap of relative axes for the device * @absbit: bitmap of absolute axes for the device * @mscbit: bitmap of miscellaneous events supported by the device * @ledbit: bitmap of leds present on the device * @sndbit: bitmap of sound effects supported by the device * @ffbit: bitmap of force feedback effects supported by the device * @swbit: bitmap of switches present on the device * @hint_events_per_packet: average number of events generated by the * device in a packet (between EV_SYN/SYN_REPORT events). Used by * event handlers to estimate size of the buffer needed to hold * events. * @keycodemax: size of keycode table * @keycodesize: size of elements in keycode table * @keycode: map of scancodes to keycodes for this device * @getkeycode: optional legacy method to retrieve current keymap. * @setkeycode: optional method to alter current keymap, used to implement * sparse keymaps. If not supplied default mechanism will be used. * The method is being called while holding event_lock and thus must * not sleep * @ff: force feedback structure associated with the device if device * supports force feedback effects * @poller: poller structure associated with the device if device is * set up to use polling mode * @repeat_key: stores key code of the last key pressed; used to implement * software autorepeat * @timer: timer for software autorepeat * @rep: current values for autorepeat parameters (delay, rate) * @mt: pointer to multitouch state * @absinfo: array of &struct input_absinfo elements holding information * about absolute axes (current value, min, max, flat, fuzz, * resolution) * @key: reflects current state of device's keys/buttons * @led: reflects current state of device's LEDs * @snd: reflects current state of sound effects * @sw: reflects current state of device's switches * @open: this method is called when the very first user calls * input_open_device(). The driver must prepare the device * to start generating events (start polling thread, * request an IRQ, submit URB, etc.). The meaning of open() is * to start providing events to the input core. * @close: this method is called when the very last user calls * input_close_device(). The meaning of close() is to stop * providing events to the input core. * @flush: purges the device. Most commonly used to get rid of force * feedback effects loaded into the device when disconnecting * from it * @event: event handler for events sent _to_ the device, like EV_LED * or EV_SND. The device is expected to carry out the requested * action (turn on a LED, play sound, etc.) The call is protected * by @event_lock and must not sleep * @grab: input handle that currently has the device grabbed (via * EVIOCGRAB ioctl). When a handle grabs a device it becomes sole * recipient for all input events coming from the device * @event_lock: this spinlock is taken when input core receives * and processes a new event for the device (in input_event()). * Code that accesses and/or modifies parameters of a device * (such as keymap or absmin, absmax, absfuzz, etc.) after device * has been registered with input core must take this lock. * @mutex: serializes calls to open(), close() and flush() methods * @users: stores number of users (input handlers) that opened this * device. 
It is used by input_open_device() and input_close_device() * to make sure that dev->open() is only called when the first * user opens device and dev->close() is called when the very * last user closes the device * @going_away: marks devices that are in a middle of unregistering and * causes input_open_device*() fail with -ENODEV. * @dev: driver model's view of this device * @h_list: list of input handles associated with the device. When * accessing the list dev->mutex must be held * @node: used to place the device onto input_dev_list * @num_vals: number of values queued in the current frame * @max_vals: maximum number of values queued in a frame * @vals: array of values queued in the current frame * @devres_managed: indicates that devices is managed with devres framework * and needs not be explicitly unregistered or freed. * @timestamp: storage for a timestamp set by input_set_timestamp called * by a driver * @inhibited: indicates that the input device is inhibited. If that is * the case then input core ignores any events generated by the device. * Device's close() is called when it is being inhibited and its open() * is called when it is being uninhibited. */ struct input_dev { const char *name; const char *phys; const char *uniq; struct input_id id; unsigned long propbit[BITS_TO_LONGS(INPUT_PROP_CNT)]; unsigned long evbit[BITS_TO_LONGS(EV_CNT)]; unsigned long keybit[BITS_TO_LONGS(KEY_CNT)]; unsigned long relbit[BITS_TO_LONGS(REL_CNT)]; unsigned long absbit[BITS_TO_LONGS(ABS_CNT)]; unsigned long mscbit[BITS_TO_LONGS(MSC_CNT)]; unsigned long ledbit[BITS_TO_LONGS(LED_CNT)]; unsigned long sndbit[BITS_TO_LONGS(SND_CNT)]; unsigned long ffbit[BITS_TO_LONGS(FF_CNT)]; unsigned long swbit[BITS_TO_LONGS(SW_CNT)]; unsigned int hint_events_per_packet; unsigned int keycodemax; unsigned int keycodesize; void *keycode; int (*setkeycode)(struct input_dev *dev, const struct input_keymap_entry *ke, unsigned int *old_keycode); int (*getkeycode)(struct input_dev *dev, struct input_keymap_entry *ke); struct ff_device *ff; struct input_dev_poller *poller; unsigned int repeat_key; struct timer_list timer; int rep[REP_CNT]; struct input_mt *mt; struct input_absinfo *absinfo; unsigned long key[BITS_TO_LONGS(KEY_CNT)]; unsigned long led[BITS_TO_LONGS(LED_CNT)]; unsigned long snd[BITS_TO_LONGS(SND_CNT)]; unsigned long sw[BITS_TO_LONGS(SW_CNT)]; int (*open)(struct input_dev *dev); void (*close)(struct input_dev *dev); int (*flush)(struct input_dev *dev, struct file *file); int (*event)(struct input_dev *dev, unsigned int type, unsigned int code, int value); struct input_handle __rcu *grab; spinlock_t event_lock; struct mutex mutex; unsigned int users; bool going_away; struct device dev; struct list_head h_list; struct list_head node; unsigned int num_vals; unsigned int max_vals; struct input_value *vals; bool devres_managed; ktime_t timestamp[INPUT_CLK_MAX]; bool inhibited; }; #define to_input_dev(d) container_of(d, struct input_dev, dev) /* * Verify that we are in sync with input_device_id mod_devicetable.h #defines */ #if EV_MAX != INPUT_DEVICE_ID_EV_MAX #error "EV_MAX and INPUT_DEVICE_ID_EV_MAX do not match" #endif #if KEY_MIN_INTERESTING != INPUT_DEVICE_ID_KEY_MIN_INTERESTING #error "KEY_MIN_INTERESTING and INPUT_DEVICE_ID_KEY_MIN_INTERESTING do not match" #endif #if KEY_MAX != INPUT_DEVICE_ID_KEY_MAX #error "KEY_MAX and INPUT_DEVICE_ID_KEY_MAX do not match" #endif #if REL_MAX != INPUT_DEVICE_ID_REL_MAX #error "REL_MAX and INPUT_DEVICE_ID_REL_MAX do not match" #endif #if ABS_MAX != INPUT_DEVICE_ID_ABS_MAX #error 
"ABS_MAX and INPUT_DEVICE_ID_ABS_MAX do not match" #endif #if MSC_MAX != INPUT_DEVICE_ID_MSC_MAX #error "MSC_MAX and INPUT_DEVICE_ID_MSC_MAX do not match" #endif #if LED_MAX != INPUT_DEVICE_ID_LED_MAX #error "LED_MAX and INPUT_DEVICE_ID_LED_MAX do not match" #endif #if SND_MAX != INPUT_DEVICE_ID_SND_MAX #error "SND_MAX and INPUT_DEVICE_ID_SND_MAX do not match" #endif #if FF_MAX != INPUT_DEVICE_ID_FF_MAX #error "FF_MAX and INPUT_DEVICE_ID_FF_MAX do not match" #endif #if SW_MAX != INPUT_DEVICE_ID_SW_MAX #error "SW_MAX and INPUT_DEVICE_ID_SW_MAX do not match" #endif #if INPUT_PROP_MAX != INPUT_DEVICE_ID_PROP_MAX #error "INPUT_PROP_MAX and INPUT_DEVICE_ID_PROP_MAX do not match" #endif #define INPUT_DEVICE_ID_MATCH_DEVICE \ (INPUT_DEVICE_ID_MATCH_BUS | INPUT_DEVICE_ID_MATCH_VENDOR | INPUT_DEVICE_ID_MATCH_PRODUCT) #define INPUT_DEVICE_ID_MATCH_DEVICE_AND_VERSION \ (INPUT_DEVICE_ID_MATCH_DEVICE | INPUT_DEVICE_ID_MATCH_VERSION) struct input_handle; /** * struct input_handler - implements one of interfaces for input devices * @private: driver-specific data * @event: event handler. This method is being called by input core with * interrupts disabled and dev->event_lock spinlock held and so * it may not sleep * @events: event sequence handler. This method is being called by * input core with interrupts disabled and dev->event_lock * spinlock held and so it may not sleep. The method must return * number of events passed to it. * @filter: similar to @event; separates normal event handlers from * "filters". * @match: called after comparing device's id with handler's id_table * to perform fine-grained matching between device and handler * @connect: called when attaching a handler to an input device * @disconnect: disconnects a handler from input device * @start: starts handler for given handle. This function is called by * input core right after connect() method and also when a process * that "grabbed" a device releases it * @passive_observer: set to %true by drivers only interested in observing * data stream from devices if there are other users present. Such * drivers will not result in starting underlying hardware device * when input_open_device() is called for their handles * @legacy_minors: set to %true by drivers using legacy minor ranges * @minor: beginning of range of 32 legacy minors for devices this driver * can provide * @name: name of the handler, to be shown in /proc/bus/input/handlers * @id_table: pointer to a table of input_device_ids this driver can * handle * @h_list: list of input handles associated with the handler * @node: for placing the driver onto input_handler_list * * Input handlers attach to input devices and create input handles. There * are likely several handlers attached to any given input device at the * same time. All of them will get their copy of input event generated by * the device. * * The very same structure is used to implement input filters. Input core * allows filters to run first and will not pass event to regular handlers * if any of the filters indicate that the event should be filtered (by * returning %true from their filter() method). * * Note that input core serializes calls to connect() and disconnect() * methods. 
*/ struct input_handler { void *private; void (*event)(struct input_handle *handle, unsigned int type, unsigned int code, int value); unsigned int (*events)(struct input_handle *handle, struct input_value *vals, unsigned int count); bool (*filter)(struct input_handle *handle, unsigned int type, unsigned int code, int value); bool (*match)(struct input_handler *handler, struct input_dev *dev); int (*connect)(struct input_handler *handler, struct input_dev *dev, const struct input_device_id *id); void (*disconnect)(struct input_handle *handle); void (*start)(struct input_handle *handle); bool passive_observer; bool legacy_minors; int minor; const char *name; const struct input_device_id *id_table; struct list_head h_list; struct list_head node; }; /** * struct input_handle - links input device with an input handler * @private: handler-specific data * @open: counter showing whether the handle is 'open', i.e. should deliver * events from its device * @name: name given to the handle by handler that created it * @dev: input device the handle is attached to * @handler: handler that works with the device through this handle * @handle_events: event sequence handler. It is set up by the input core * according to event handling method specified in the @handler. See * input_handle_setup_event_handler(). * This method is being called by the input core with interrupts disabled * and dev->event_lock spinlock held and so it may not sleep. * @d_node: used to put the handle on device's list of attached handles * @h_node: used to put the handle on handler's list of handles from which * it gets events */ struct input_handle { void *private; int open; const char *name; struct input_dev *dev; struct input_handler *handler; unsigned int (*handle_events)(struct input_handle *handle, struct input_value *vals, unsigned int count); struct list_head d_node; struct list_head h_node; }; struct input_dev __must_check *input_allocate_device(void); struct input_dev __must_check *devm_input_allocate_device(struct device *); void input_free_device(struct input_dev *dev); static inline struct input_dev *input_get_device(struct input_dev *dev) { return dev ? 
to_input_dev(get_device(&dev->dev)) : NULL; } static inline void input_put_device(struct input_dev *dev) { if (dev) put_device(&dev->dev); } static inline void *input_get_drvdata(struct input_dev *dev) { return dev_get_drvdata(&dev->dev); } static inline void input_set_drvdata(struct input_dev *dev, void *data) { dev_set_drvdata(&dev->dev, data); } int __must_check input_register_device(struct input_dev *); void input_unregister_device(struct input_dev *); void input_reset_device(struct input_dev *); int input_setup_polling(struct input_dev *dev, void (*poll_fn)(struct input_dev *dev)); void input_set_poll_interval(struct input_dev *dev, unsigned int interval); void input_set_min_poll_interval(struct input_dev *dev, unsigned int interval); void input_set_max_poll_interval(struct input_dev *dev, unsigned int interval); int input_get_poll_interval(struct input_dev *dev); int __must_check input_register_handler(struct input_handler *); void input_unregister_handler(struct input_handler *); int __must_check input_get_new_minor(int legacy_base, unsigned int legacy_num, bool allow_dynamic); void input_free_minor(unsigned int minor); int input_handler_for_each_handle(struct input_handler *, void *data, int (*fn)(struct input_handle *, void *)); int input_register_handle(struct input_handle *); void input_unregister_handle(struct input_handle *); int input_grab_device(struct input_handle *); void input_release_device(struct input_handle *); int input_open_device(struct input_handle *); void input_close_device(struct input_handle *); int input_flush_device(struct input_handle *handle, struct file *file); void input_set_timestamp(struct input_dev *dev, ktime_t timestamp); ktime_t *input_get_timestamp(struct input_dev *dev); void input_event(struct input_dev *dev, unsigned int type, unsigned int code, int value); void input_inject_event(struct input_handle *handle, unsigned int type, unsigned int code, int value); static inline void input_report_key(struct input_dev *dev, unsigned int code, int value) { input_event(dev, EV_KEY, code, !!value); } static inline void input_report_rel(struct input_dev *dev, unsigned int code, int value) { input_event(dev, EV_REL, code, value); } static inline void input_report_abs(struct input_dev *dev, unsigned int code, int value) { input_event(dev, EV_ABS, code, value); } static inline void input_report_ff_status(struct input_dev *dev, unsigned int code, int value) { input_event(dev, EV_FF_STATUS, code, value); } static inline void input_report_switch(struct input_dev *dev, unsigned int code, int value) { input_event(dev, EV_SW, code, !!value); } static inline void input_sync(struct input_dev *dev) { input_event(dev, EV_SYN, SYN_REPORT, 0); } static inline void input_mt_sync(struct input_dev *dev) { input_event(dev, EV_SYN, SYN_MT_REPORT, 0); } void input_set_capability(struct input_dev *dev, unsigned int type, unsigned int code); /** * input_set_events_per_packet - tell handlers about the driver event rate * @dev: the input device used by the driver * @n_events: the average number of events between calls to input_sync() * * If the event rate sent from a device is unusually large, use this * function to set the expected event rate. This will allow handlers * to set up an appropriate buffer size for the event stream, in order * to minimize information loss. 
*/ static inline void input_set_events_per_packet(struct input_dev *dev, int n_events) { dev->hint_events_per_packet = n_events; } void input_alloc_absinfo(struct input_dev *dev); void input_set_abs_params(struct input_dev *dev, unsigned int axis, int min, int max, int fuzz, int flat); void input_copy_abs(struct input_dev *dst, unsigned int dst_axis, const struct input_dev *src, unsigned int src_axis); #define INPUT_GENERATE_ABS_ACCESSORS(_suffix, _item) \ static inline int input_abs_get_##_suffix(struct input_dev *dev, \ unsigned int axis) \ { \ return dev->absinfo ? dev->absinfo[axis]._item : 0; \ } \ \ static inline void input_abs_set_##_suffix(struct input_dev *dev, \ unsigned int axis, int val) \ { \ input_alloc_absinfo(dev); \ if (dev->absinfo) \ dev->absinfo[axis]._item = val; \ } INPUT_GENERATE_ABS_ACCESSORS(val, value) INPUT_GENERATE_ABS_ACCESSORS(min, minimum) INPUT_GENERATE_ABS_ACCESSORS(max, maximum) INPUT_GENERATE_ABS_ACCESSORS(fuzz, fuzz) INPUT_GENERATE_ABS_ACCESSORS(flat, flat) INPUT_GENERATE_ABS_ACCESSORS(res, resolution) int input_scancode_to_scalar(const struct input_keymap_entry *ke, unsigned int *scancode); int input_get_keycode(struct input_dev *dev, struct input_keymap_entry *ke); int input_set_keycode(struct input_dev *dev, const struct input_keymap_entry *ke); bool input_match_device_id(const struct input_dev *dev, const struct input_device_id *id); void input_enable_softrepeat(struct input_dev *dev, int delay, int period); bool input_device_enabled(struct input_dev *dev); extern const struct class input_class; /** * struct ff_device - force-feedback part of an input device * @upload: Called to upload an new effect into device * @erase: Called to erase an effect from device * @playback: Called to request device to start playing specified effect * @set_gain: Called to set specified gain * @set_autocenter: Called to auto-center device * @destroy: called by input core when parent input device is being * destroyed * @private: driver-specific data, will be freed automatically * @ffbit: bitmap of force feedback capabilities truly supported by * device (not emulated like ones in input_dev->ffbit) * @mutex: mutex for serializing access to the device * @max_effects: maximum number of effects supported by device * @effects: pointer to an array of effects currently loaded into device * @effect_owners: array of effect owners; when file handle owning * an effect gets closed the effect is automatically erased * * Every force-feedback device must implement upload() and playback() * methods; erase() is optional. set_gain() and set_autocenter() need * only be implemented if driver sets up FF_GAIN and FF_AUTOCENTER * bits. * * Note that playback(), set_gain() and set_autocenter() are called with * dev->event_lock spinlock held and interrupts off and thus may not * sleep. 
*/ struct ff_device { int (*upload)(struct input_dev *dev, struct ff_effect *effect, struct ff_effect *old); int (*erase)(struct input_dev *dev, int effect_id); int (*playback)(struct input_dev *dev, int effect_id, int value); void (*set_gain)(struct input_dev *dev, u16 gain); void (*set_autocenter)(struct input_dev *dev, u16 magnitude); void (*destroy)(struct ff_device *); void *private; unsigned long ffbit[BITS_TO_LONGS(FF_CNT)]; struct mutex mutex; int max_effects; struct ff_effect *effects; struct file *effect_owners[] __counted_by(max_effects); }; int input_ff_create(struct input_dev *dev, unsigned int max_effects); void input_ff_destroy(struct input_dev *dev); int input_ff_event(struct input_dev *dev, unsigned int type, unsigned int code, int value); int input_ff_upload(struct input_dev *dev, struct ff_effect *effect, struct file *file); int input_ff_erase(struct input_dev *dev, int effect_id, struct file *file); int input_ff_flush(struct input_dev *dev, struct file *file); int input_ff_create_memless(struct input_dev *dev, void *data, int (*play_effect)(struct input_dev *, void *, struct ff_effect *)); #endif |
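A hedged, minimal driver sketch using the interface declared above: allocate an input device, advertise one key, register it, and report press/release events. The device name, the simplified error handling, and the function names are illustrative only.

/* Illustrative sketch; not part of input.h. */
static struct input_dev *example_dev;

static int __init example_button_init(void)
{
	int err;

	example_dev = input_allocate_device();
	if (!example_dev)
		return -ENOMEM;

	example_dev->name = "example-button";
	input_set_capability(example_dev, EV_KEY, KEY_ENTER);

	err = input_register_device(example_dev);
	if (err)
		input_free_device(example_dev);
	return err;
}

static void example_button_press(void)
{
	input_report_key(example_dev, KEY_ENTER, 1);
	input_sync(example_dev);
	input_report_key(example_dev, KEY_ENTER, 0);
	input_sync(example_dev);
}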
| 13 13 4 4 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 | // SPDX-License-Identifier: GPL-2.0-or-later /* * * Copyright (C) Jonathan Naylor G4KLX (g4klx@g4klx.demon.co.uk) */ #include <linux/module.h> #include <linux/proc_fs.h> #include <linux/kernel.h> #include <linux/interrupt.h> #include <linux/fs.h> #include <linux/types.h> #include <linux/sysctl.h> #include <linux/string.h> #include <linux/socket.h> #include <linux/errno.h> #include <linux/fcntl.h> #include <linux/in.h> #include <linux/if_ether.h> #include <linux/slab.h> #include <asm/io.h> #include <linux/inet.h> #include <linux/netdevice.h> #include <linux/etherdevice.h> #include <linux/if_arp.h> #include <linux/skbuff.h> #include <net/ip.h> #include <net/arp.h> #include <net/ax25.h> #include <net/rose.h> static int rose_header(struct sk_buff *skb, struct net_device *dev, unsigned short type, const void *daddr, const void *saddr, unsigned int len) { unsigned char *buff = skb_push(skb, ROSE_MIN_LEN + 2); if (daddr) memcpy(buff + 7, daddr, dev->addr_len); *buff++ = ROSE_GFI | ROSE_Q_BIT; *buff++ = 0x00; *buff++ = ROSE_DATA; *buff++ = 0x7F; *buff++ = AX25_P_IP; if (daddr != NULL) return 37; return -37; } static int rose_set_mac_address(struct net_device *dev, void *addr) { struct sockaddr *sa = addr; int err; if (!memcmp(dev->dev_addr, sa->sa_data, dev->addr_len)) return 0; if (dev->flags & IFF_UP) { err = rose_add_loopback_node((rose_address *)sa->sa_data); if (err) return err; rose_del_loopback_node((const rose_address *)dev->dev_addr); } dev_addr_set(dev, sa->sa_data); return 0; } static int rose_open(struct net_device *dev) { int err; err = rose_add_loopback_node((const rose_address *)dev->dev_addr); if (err) return err; netif_start_queue(dev); return 0; } static int rose_close(struct net_device *dev) { netif_stop_queue(dev); rose_del_loopback_node((const rose_address *)dev->dev_addr); return 0; } static netdev_tx_t rose_xmit(struct sk_buff *skb, struct net_device *dev) { struct net_device_stats *stats = &dev->stats; unsigned int len = skb->len; if (!netif_running(dev)) { printk(KERN_ERR "ROSE: rose_xmit - called when iface is down\n"); return NETDEV_TX_BUSY; } if (!rose_route_frame(skb, NULL)) { dev_kfree_skb(skb); stats->tx_errors++; return NETDEV_TX_OK; } stats->tx_packets++; stats->tx_bytes += len; return NETDEV_TX_OK; } static const struct header_ops rose_header_ops = { .create = rose_header, }; static const struct net_device_ops rose_netdev_ops = { .ndo_open = rose_open, .ndo_stop = rose_close, .ndo_start_xmit = rose_xmit, .ndo_set_mac_address = rose_set_mac_address, }; void rose_setup(struct net_device *dev) { dev->mtu = ROSE_MAX_PACKET_SIZE - 2; dev->netdev_ops = &rose_netdev_ops; dev->header_ops = &rose_header_ops; dev->hard_header_len = AX25_BPQ_HEADER_LEN + AX25_MAX_HEADER_LEN + ROSE_MIN_LEN; dev->addr_len = ROSE_ADDR_LEN; dev->type = ARPHRD_ROSE; /* New-style flags. */ dev->flags = IFF_NOARP; } |
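A hedged sketch of how rose_setup() is consumed: the ROSE core creates its virtual interfaces with alloc_netdev() and this setup callback. The wrapper below is illustrative and not a quote of the actual af_rose.c code; error handling is abbreviated.

/* Illustrative only. */
static struct net_device *example_create_rose_dev(const char *name)
{
	struct net_device *dev;

	dev = alloc_netdev(0, name, NET_NAME_UNKNOWN, rose_setup);
	if (!dev)
		return NULL;

	if (register_netdev(dev)) {
		free_netdev(dev);
		return NULL;
	}
	return dev;
}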
| 191 196 172 169 10 205 73 40 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 | /* SPDX-License-Identifier: GPL-2.0-only */ /* * vivid-core.h - core datastructures * * Copyright 2014 Cisco Systems, Inc. and/or its affiliates. All rights reserved. */ #ifndef _VIVID_CORE_H_ #define _VIVID_CORE_H_ #include <linux/fb.h> #include <linux/workqueue.h> #include <media/cec.h> #include <media/videobuf2-v4l2.h> #include <media/v4l2-device.h> #include <media/v4l2-dev.h> #include <media/v4l2-ctrls.h> #include <media/tpg/v4l2-tpg.h> #include "vivid-rds-gen.h" #include "vivid-vbi-gen.h" #define dprintk(dev, level, fmt, arg...) 
\ v4l2_dbg(level, vivid_debug, &dev->v4l2_dev, fmt, ## arg) /* The maximum number of inputs */ #define MAX_INPUTS 16 /* The maximum number of outputs */ #define MAX_OUTPUTS 16 /* The maximum number of video capture buffers */ #define MAX_VID_CAP_BUFFERS 64 /* The maximum up or down scaling factor is 4 */ #define MAX_ZOOM 4 /* The maximum image width/height are set to 4K DMT */ #define MAX_WIDTH 4096 #define MAX_HEIGHT 2160 /* The minimum image width/height */ #define MIN_WIDTH 16 #define MIN_HEIGHT MIN_WIDTH /* Pixel Array control divider */ #define PIXEL_ARRAY_DIV MIN_WIDTH /* The data_offset of plane 0 for the multiplanar formats */ #define PLANE0_DATA_OFFSET 128 /* The supported TV frequency range in MHz */ #define MIN_TV_FREQ (44U * 16U) #define MAX_TV_FREQ (958U * 16U) /* The number of samples returned in every SDR buffer */ #define SDR_CAP_SAMPLES_PER_BUF 0x4000 /* used by the threads to know when to resync internal counters */ #define JIFFIES_PER_DAY (3600U * 24U * HZ) #define JIFFIES_RESYNC (JIFFIES_PER_DAY * (0xf0000000U / JIFFIES_PER_DAY)) /* * Maximum number of HDMI inputs allowed by vivid, due to limitations * of the Physical Address in the EDID and used by CEC we stop at 15 * inputs and outputs. */ #define MAX_HDMI_INPUTS 15 #define MAX_HDMI_OUTPUTS 15 /* Maximum number of S-Video inputs allowed by vivid */ #define MAX_SVID_INPUTS 16 /* The maximum number of items in a menu control */ #define MAX_MENU_ITEMS BITS_PER_LONG_LONG /* Number of fixed menu items in the 'Connected To' menu controls */ #define FIXED_MENU_ITEMS 2 /* The maximum number of vivid devices */ #define VIVID_MAX_DEVS CONFIG_VIDEO_VIVID_MAX_DEVS extern const struct v4l2_rect vivid_min_rect; extern const struct v4l2_rect vivid_max_rect; extern unsigned vivid_debug; /* * NULL-terminated string array for the HDMI 'Connected To' menu controls * with the list of possible HDMI outputs. * * The first two items are fixed ("TPG" and "None"). */ extern char *vivid_ctrl_hdmi_to_output_strings[1 + MAX_MENU_ITEMS]; /* Menu control skip mask of all HDMI outputs that are in use */ extern u64 hdmi_to_output_menu_skip_mask; /* * Bitmask of which vivid instances need to update any connected * HDMI outputs. */ extern u64 hdmi_input_update_outputs_mask; /* * Spinlock for access to hdmi_to_output_menu_skip_mask and * hdmi_input_update_outputs_mask. */ extern spinlock_t hdmi_output_skip_mask_lock; /* * Workqueue that updates the menu controls whenever the HDMI menu skip mask * changes. */ extern struct workqueue_struct *update_hdmi_ctrls_workqueue; /* * The HDMI menu control value (index in the menu list) maps to an HDMI * output that is part of the given vivid_dev instance and has the given * output index (as returned by VIDIOC_G_OUTPUT). * * NULL/0 if not available. */ extern struct vivid_dev *vivid_ctrl_hdmi_to_output_instance[MAX_MENU_ITEMS]; extern unsigned int vivid_ctrl_hdmi_to_output_index[MAX_MENU_ITEMS]; /* * NULL-terminated string array for the S-Video 'Connected To' menu controls * with the list of possible S-Video outputs. * * The first two items are fixed ("TPG" and "None"). */ extern char *vivid_ctrl_svid_to_output_strings[1 + MAX_MENU_ITEMS]; /* Menu control skip mask of all S-Video outputs that are in use */ extern u64 svid_to_output_menu_skip_mask; /* Spinlock for access to svid_to_output_menu_skip_mask */ extern spinlock_t svid_output_skip_mask_lock; /* * Workqueue that updates the menu controls whenever the S-Video menu skip mask * changes. 
*/ extern struct workqueue_struct *update_svid_ctrls_workqueue; /* * The S-Video menu control value (index in the menu list) maps to an S-Video * output that is part of the given vivid_dev instance and has the given * output index (as returned by VIDIOC_G_OUTPUT). * * NULL/0 if not available. */ extern struct vivid_dev *vivid_ctrl_svid_to_output_instance[MAX_MENU_ITEMS]; extern unsigned int vivid_ctrl_svid_to_output_index[MAX_MENU_ITEMS]; extern struct vivid_dev *vivid_devs[VIVID_MAX_DEVS]; extern unsigned int n_devs; struct vivid_fmt { u32 fourcc; /* v4l2 format id */ enum tgp_color_enc color_enc; bool can_do_overlay; u8 vdownsampling[TPG_MAX_PLANES]; u32 alpha_mask; u8 planes; u8 buffers; u32 data_offset[TPG_MAX_PLANES]; u32 bit_depth[TPG_MAX_PLANES]; }; extern struct vivid_fmt vivid_formats[]; /* buffer for one video frame */ struct vivid_buffer { /* common v4l buffer stuff -- must be first */ struct vb2_v4l2_buffer vb; struct list_head list; }; enum vivid_input { WEBCAM, TV, SVID, HDMI, }; enum vivid_signal_mode { CURRENT_DV_TIMINGS, CURRENT_STD = CURRENT_DV_TIMINGS, NO_SIGNAL, NO_LOCK, OUT_OF_RANGE, SELECTED_DV_TIMINGS, SELECTED_STD = SELECTED_DV_TIMINGS, CYCLE_DV_TIMINGS, CYCLE_STD = CYCLE_DV_TIMINGS, CUSTOM_DV_TIMINGS, }; enum vivid_colorspace { VIVID_CS_170M, VIVID_CS_709, VIVID_CS_SRGB, VIVID_CS_OPRGB, VIVID_CS_2020, VIVID_CS_DCI_P3, VIVID_CS_240M, VIVID_CS_SYS_M, VIVID_CS_SYS_BG, }; #define VIVID_INVALID_SIGNAL(mode) \ ((mode) == NO_SIGNAL || (mode) == NO_LOCK || (mode) == OUT_OF_RANGE) struct vivid_cec_xfer { struct cec_adapter *adap; u8 msg[CEC_MAX_MSG_SIZE]; u32 len; u32 sft; }; struct vivid_dev { u8 inst; struct v4l2_device v4l2_dev; #ifdef CONFIG_MEDIA_CONTROLLER struct media_device mdev; struct media_pad vid_cap_pad; struct media_pad vid_out_pad; struct media_pad vbi_cap_pad; struct media_pad vbi_out_pad; struct media_pad sdr_cap_pad; struct media_pad meta_cap_pad; struct media_pad meta_out_pad; struct media_pad touch_cap_pad; #endif struct v4l2_ctrl_handler ctrl_hdl_user_gen; struct v4l2_ctrl_handler ctrl_hdl_user_vid; struct v4l2_ctrl_handler ctrl_hdl_user_aud; struct v4l2_ctrl_handler ctrl_hdl_streaming; struct v4l2_ctrl_handler ctrl_hdl_sdtv_cap; struct v4l2_ctrl_handler ctrl_hdl_loop_cap; struct v4l2_ctrl_handler ctrl_hdl_fb; struct video_device vid_cap_dev; struct v4l2_ctrl_handler ctrl_hdl_vid_cap; struct video_device vid_out_dev; struct v4l2_ctrl_handler ctrl_hdl_vid_out; struct video_device vbi_cap_dev; struct v4l2_ctrl_handler ctrl_hdl_vbi_cap; struct video_device vbi_out_dev; struct v4l2_ctrl_handler ctrl_hdl_vbi_out; struct video_device radio_rx_dev; struct v4l2_ctrl_handler ctrl_hdl_radio_rx; struct video_device radio_tx_dev; struct v4l2_ctrl_handler ctrl_hdl_radio_tx; struct video_device sdr_cap_dev; struct v4l2_ctrl_handler ctrl_hdl_sdr_cap; struct video_device meta_cap_dev; struct v4l2_ctrl_handler ctrl_hdl_meta_cap; struct video_device meta_out_dev; struct v4l2_ctrl_handler ctrl_hdl_meta_out; struct video_device touch_cap_dev; struct v4l2_ctrl_handler ctrl_hdl_touch_cap; spinlock_t slock; struct mutex mutex; struct work_struct update_hdmi_ctrl_work; struct work_struct update_svid_ctrl_work; /* capabilities */ u32 vid_cap_caps; u32 vid_out_caps; u32 vbi_cap_caps; u32 vbi_out_caps; u32 sdr_cap_caps; u32 radio_rx_caps; u32 radio_tx_caps; u32 meta_cap_caps; u32 meta_out_caps; u32 touch_cap_caps; /* supported features */ bool multiplanar; u8 num_inputs; u8 num_hdmi_inputs; u8 num_svid_inputs; u8 input_type[MAX_INPUTS]; u8 input_name_counter[MAX_INPUTS]; u8 
num_outputs; u8 num_hdmi_outputs; u8 output_type[MAX_OUTPUTS]; u8 output_name_counter[MAX_OUTPUTS]; bool has_audio_inputs; bool has_audio_outputs; bool has_vid_cap; bool has_vid_out; bool has_vbi_cap; bool has_raw_vbi_cap; bool has_sliced_vbi_cap; bool has_vbi_out; bool has_raw_vbi_out; bool has_sliced_vbi_out; bool has_radio_rx; bool has_radio_tx; bool has_sdr_cap; bool has_fb; bool has_meta_cap; bool has_meta_out; bool has_tv_tuner; bool has_touch_cap; /* Output index (0-MAX_OUTPUTS) to vivid instance of connected input */ struct vivid_dev *output_to_input_instance[MAX_OUTPUTS]; /* Output index (0-MAX_OUTPUTS) to input index (0-MAX_INPUTS) of connected input */ u8 output_to_input_index[MAX_OUTPUTS]; /* Output index (0-MAX_OUTPUTS) to HDMI or S-Video output index (0-MAX_HDMI/SVID_OUTPUTS) */ u8 output_to_iface_index[MAX_OUTPUTS]; /* ctrl_hdmi_to_output or ctrl_svid_to_output control value for each input */ s32 input_is_connected_to_output[MAX_INPUTS]; /* HDMI index (0-MAX_HDMI_OUTPUTS) to output index (0-MAX_OUTPUTS) */ u8 hdmi_index_to_output_index[MAX_HDMI_OUTPUTS]; /* HDMI index (0-MAX_HDMI_INPUTS) to input index (0-MAX_INPUTS) */ u8 hdmi_index_to_input_index[MAX_HDMI_INPUTS]; /* S-Video index (0-MAX_SVID_INPUTS) to input index (0-MAX_INPUTS) */ u8 svid_index_to_input_index[MAX_SVID_INPUTS]; /* controls */ struct v4l2_ctrl *brightness; struct v4l2_ctrl *contrast; struct v4l2_ctrl *saturation; struct v4l2_ctrl *hue; struct { /* autogain/gain cluster */ struct v4l2_ctrl *autogain; struct v4l2_ctrl *gain; }; struct v4l2_ctrl *volume; struct v4l2_ctrl *mute; struct v4l2_ctrl *alpha; struct v4l2_ctrl *button; struct v4l2_ctrl *boolean; struct v4l2_ctrl *int32; struct v4l2_ctrl *int64; struct v4l2_ctrl *menu; struct v4l2_ctrl *string; struct v4l2_ctrl *bitmask; struct v4l2_ctrl *int_menu; struct v4l2_ctrl *ro_int32; struct v4l2_ctrl *pixel_array; struct v4l2_ctrl *test_pattern; struct v4l2_ctrl *colorspace; struct v4l2_ctrl *rgb_range_cap; struct v4l2_ctrl *real_rgb_range_cap; struct { /* std_signal_mode/standard cluster */ struct v4l2_ctrl *ctrl_std_signal_mode; struct v4l2_ctrl *ctrl_standard; }; struct { /* dv_timings_signal_mode/timings cluster */ struct v4l2_ctrl *ctrl_dv_timings_signal_mode; struct v4l2_ctrl *ctrl_dv_timings; }; struct v4l2_ctrl *ctrl_has_crop_cap; struct v4l2_ctrl *ctrl_has_compose_cap; struct v4l2_ctrl *ctrl_has_scaler_cap; struct v4l2_ctrl *ctrl_has_crop_out; struct v4l2_ctrl *ctrl_has_compose_out; struct v4l2_ctrl *ctrl_has_scaler_out; struct v4l2_ctrl *ctrl_tx_mode; struct v4l2_ctrl *ctrl_tx_rgb_range; struct v4l2_ctrl *ctrl_tx_edid_present; struct v4l2_ctrl *ctrl_tx_hotplug; struct v4l2_ctrl *ctrl_tx_rxsense; struct v4l2_ctrl *ctrl_rx_power_present; struct v4l2_ctrl *radio_tx_rds_pi; struct v4l2_ctrl *radio_tx_rds_pty; struct v4l2_ctrl *radio_tx_rds_mono_stereo; struct v4l2_ctrl *radio_tx_rds_art_head; struct v4l2_ctrl *radio_tx_rds_compressed; struct v4l2_ctrl *radio_tx_rds_dyn_pty; struct v4l2_ctrl *radio_tx_rds_ta; struct v4l2_ctrl *radio_tx_rds_tp; struct v4l2_ctrl *radio_tx_rds_ms; struct v4l2_ctrl *radio_tx_rds_psname; struct v4l2_ctrl *radio_tx_rds_radiotext; struct v4l2_ctrl *radio_rx_rds_pty; struct v4l2_ctrl *radio_rx_rds_ta; struct v4l2_ctrl *radio_rx_rds_tp; struct v4l2_ctrl *radio_rx_rds_ms; struct v4l2_ctrl *radio_rx_rds_psname; struct v4l2_ctrl *radio_rx_rds_radiotext; struct v4l2_ctrl *ctrl_hdmi_to_output[MAX_HDMI_INPUTS]; char ctrl_hdmi_to_output_names[MAX_HDMI_INPUTS][32]; struct v4l2_ctrl *ctrl_svid_to_output[MAX_SVID_INPUTS]; char 
ctrl_svid_to_output_names[MAX_SVID_INPUTS][32]; unsigned input_brightness[MAX_INPUTS]; unsigned osd_mode; unsigned button_pressed; bool sensor_hflip; bool sensor_vflip; bool hflip; bool vflip; bool vbi_cap_interlaced; bool loop_video; bool reduced_fps; /* Framebuffer */ unsigned long video_pbase; void *video_vbase; u32 video_buffer_size; int display_width; int display_height; int display_byte_stride; int bits_per_pixel; int bytes_per_pixel; struct fb_info fb_info; struct fb_var_screeninfo fb_defined; struct fb_fix_screeninfo fb_fix; /* Error injection */ bool disconnect_error; bool queue_setup_error; bool buf_prepare_error; bool start_streaming_error; bool dqbuf_error; bool req_validate_error; bool seq_wrap; u64 time_wrap; u64 time_wrap_offset; unsigned perc_dropped_buffers; enum vivid_signal_mode std_signal_mode[MAX_INPUTS]; unsigned int query_std_last[MAX_INPUTS]; v4l2_std_id query_std[MAX_INPUTS]; enum tpg_video_aspect std_aspect_ratio[MAX_INPUTS]; enum vivid_signal_mode dv_timings_signal_mode[MAX_INPUTS]; char **query_dv_timings_qmenu; char *query_dv_timings_qmenu_strings; unsigned query_dv_timings_size; unsigned int query_dv_timings_last[MAX_INPUTS]; unsigned int query_dv_timings[MAX_INPUTS]; enum tpg_video_aspect dv_timings_aspect_ratio[MAX_INPUTS]; /* Input */ unsigned input; v4l2_std_id std_cap[MAX_INPUTS]; struct v4l2_dv_timings dv_timings_cap[MAX_INPUTS]; int dv_timings_cap_sel[MAX_INPUTS]; u32 service_set_cap; struct vivid_vbi_gen_data vbi_gen; u8 *edid; unsigned edid_blocks; unsigned edid_max_blocks; unsigned webcam_size_idx; unsigned webcam_ival_idx; unsigned tv_freq; unsigned tv_audmode; unsigned tv_field_cap; unsigned tv_audio_input; u32 power_present; /* Output */ unsigned output; v4l2_std_id std_out; struct v4l2_dv_timings dv_timings_out; u32 colorspace_out; u32 ycbcr_enc_out; u32 hsv_enc_out; u32 quantization_out; u32 xfer_func_out; u32 service_set_out; unsigned bytesperline_out[TPG_MAX_PLANES]; unsigned tv_field_out; unsigned tv_audio_output; bool vbi_out_have_wss; u8 vbi_out_wss[2]; bool vbi_out_have_cc[2]; u8 vbi_out_cc[2][2]; bool dvi_d_out; u8 *scaled_line; u8 *blended_line; unsigned cur_scaled_line; /* Output Overlay */ void *fb_vbase_out; bool overlay_out_enabled; int overlay_out_top, overlay_out_left; unsigned fbuf_out_flags; u32 chromakey_out; u8 global_alpha_out; /* video capture */ struct tpg_data tpg; unsigned ms_vid_cap; bool must_blank[MAX_VID_CAP_BUFFERS]; const struct vivid_fmt *fmt_cap; struct v4l2_fract timeperframe_vid_cap; enum v4l2_field field_cap; struct v4l2_rect src_rect; struct v4l2_rect fmt_cap_rect; struct v4l2_rect crop_cap; struct v4l2_rect compose_cap; struct v4l2_rect crop_bounds_cap; struct vb2_queue vb_vid_cap_q; struct list_head vid_cap_active; struct vb2_queue vb_vbi_cap_q; struct list_head vbi_cap_active; struct vb2_queue vb_meta_cap_q; struct list_head meta_cap_active; struct vb2_queue vb_touch_cap_q; struct list_head touch_cap_active; /* thread for generating video capture stream */ struct task_struct *kthread_vid_cap; unsigned long jiffies_vid_cap; u64 cap_stream_start; u64 cap_frame_period; u64 cap_frame_eof_offset; u32 cap_seq_offset; u32 cap_seq_count; bool cap_seq_resync; u32 vid_cap_seq_start; u32 vid_cap_seq_count; bool vid_cap_streaming; u32 vbi_cap_seq_start; u32 vbi_cap_seq_count; bool vbi_cap_streaming; u32 meta_cap_seq_start; u32 meta_cap_seq_count; bool meta_cap_streaming; /* Touch capture */ struct task_struct *kthread_touch_cap; unsigned long jiffies_touch_cap; u64 touch_cap_stream_start; u32 touch_cap_seq_offset; bool 
touch_cap_seq_resync; u32 touch_cap_seq_start; u32 touch_cap_seq_count; u32 touch_cap_with_seq_wrap_count; bool touch_cap_streaming; struct v4l2_fract timeperframe_tch_cap; struct v4l2_pix_format tch_format; int tch_pat_random; /* video output */ const struct vivid_fmt *fmt_out; struct v4l2_fract timeperframe_vid_out; enum v4l2_field field_out; struct v4l2_rect sink_rect; struct v4l2_rect fmt_out_rect; struct v4l2_rect crop_out; struct v4l2_rect compose_out; struct v4l2_rect compose_bounds_out; struct vb2_queue vb_vid_out_q; struct list_head vid_out_active; struct vb2_queue vb_vbi_out_q; struct list_head vbi_out_active; struct vb2_queue vb_meta_out_q; struct list_head meta_out_active; /* video loop precalculated rectangles */ /* * Intersection between what the output side composes and the capture side * crops. I.e., what actually needs to be copied from the output buffer to * the capture buffer. */ struct v4l2_rect loop_vid_copy; /* The part of the output buffer that (after scaling) corresponds to loop_vid_copy. */ struct v4l2_rect loop_vid_out; /* The part of the capture buffer that (after scaling) corresponds to loop_vid_copy. */ struct v4l2_rect loop_vid_cap; /* * The intersection of the framebuffer, the overlay output window and * loop_vid_copy. I.e., the part of the framebuffer that actually should be * blended with the compose_out rectangle. This uses the framebuffer origin. */ struct v4l2_rect loop_fb_copy; /* The same as loop_fb_copy but with compose_out origin. */ struct v4l2_rect loop_vid_overlay; /* * The part of the capture buffer that (after scaling) corresponds * to loop_vid_overlay. */ struct v4l2_rect loop_vid_overlay_cap; /* thread for generating video output stream */ struct task_struct *kthread_vid_out; unsigned long jiffies_vid_out; u32 out_seq_offset; u32 out_seq_count; bool out_seq_resync; u32 vid_out_seq_start; u32 vid_out_seq_count; bool vid_out_streaming; u32 vbi_out_seq_start; u32 vbi_out_seq_count; bool vbi_out_streaming; bool stream_sliced_vbi_out; u32 meta_out_seq_start; u32 meta_out_seq_count; bool meta_out_streaming; /* SDR capture */ struct vb2_queue vb_sdr_cap_q; struct list_head sdr_cap_active; u32 sdr_pixelformat; /* v4l2 format id */ unsigned sdr_buffersize; unsigned sdr_adc_freq; unsigned sdr_fm_freq; unsigned sdr_fm_deviation; int sdr_fixp_src_phase; int sdr_fixp_mod_phase; bool tstamp_src_is_soe; bool has_crop_cap; bool has_compose_cap; bool has_scaler_cap; bool has_crop_out; bool has_compose_out; bool has_scaler_out; /* thread for generating SDR stream */ struct task_struct *kthread_sdr_cap; unsigned long jiffies_sdr_cap; u32 sdr_cap_seq_offset; u32 sdr_cap_seq_start; u32 sdr_cap_seq_count; u32 sdr_cap_with_seq_wrap_count; bool sdr_cap_seq_resync; /* RDS generator */ struct vivid_rds_gen rds_gen; /* Radio receiver */ unsigned radio_rx_freq; unsigned radio_rx_audmode; int radio_rx_sig_qual; unsigned radio_rx_hw_seek_mode; bool radio_rx_hw_seek_prog_lim; bool radio_rx_rds_controls; bool radio_rx_rds_enabled; unsigned radio_rx_rds_use_alternates; unsigned radio_rx_rds_last_block; struct v4l2_fh *radio_rx_rds_owner; /* Radio transmitter */ unsigned radio_tx_freq; unsigned radio_tx_subchans; bool radio_tx_rds_controls; unsigned radio_tx_rds_last_block; struct v4l2_fh *radio_tx_rds_owner; /* Shared between radio receiver and transmitter */ bool radio_rds_loop; ktime_t radio_rds_init_time; /* CEC */ struct cec_adapter *cec_rx_adap; struct cec_adapter *cec_tx_adap[MAX_HDMI_OUTPUTS]; struct task_struct *kthread_cec; wait_queue_head_t kthread_waitq_cec; struct 
vivid_cec_xfer xfers[MAX_OUTPUTS]; spinlock_t cec_xfers_slock; /* read and write cec messages */ u32 cec_sft; /* bus signal free time, in bit periods */ u8 last_initiator; /* CEC OSD String */ char osd[14]; unsigned long osd_jiffies; bool meta_pts; bool meta_scr; }; static inline bool vivid_is_webcam(const struct vivid_dev *dev) { return dev->input_type[dev->input] == WEBCAM; } static inline bool vivid_is_tv_cap(const struct vivid_dev *dev) { return dev->input_type[dev->input] == TV; } static inline bool vivid_is_svid_cap(const struct vivid_dev *dev) { return dev->input_type[dev->input] == SVID; } static inline bool vivid_is_hdmi_cap(const struct vivid_dev *dev) { return dev->input_type[dev->input] == HDMI; } static inline bool vivid_is_sdtv_cap(const struct vivid_dev *dev) { return vivid_is_tv_cap(dev) || vivid_is_svid_cap(dev); } static inline bool vivid_is_svid_out(const struct vivid_dev *dev) { return dev->output_type[dev->output] == SVID; } static inline bool vivid_is_hdmi_out(const struct vivid_dev *dev) { return dev->output_type[dev->output] == HDMI; } #endif
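/*
 * Editorial note: the sketch below is not part of vivid-core.h. It is a
 * minimal, hypothetical example of how the inline helpers defined above
 * (vivid_is_webcam(), vivid_is_sdtv_cap(), vivid_is_hdmi_cap()) are meant to
 * be used to branch on the currently selected input; the function name
 * vivid_describe_input() is invented for illustration.
 */
#include "vivid-core.h"

static const char *vivid_describe_input(const struct vivid_dev *dev)
{
	/* dev->input indexes input_type[], which all of the helpers consult */
	if (vivid_is_webcam(dev))
		return "webcam";
	if (vivid_is_sdtv_cap(dev))	/* TV or S-Video capture */
		return "SDTV";
	if (vivid_is_hdmi_cap(dev))
		return "HDMI";
	return "unknown";
}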
// SPDX-License-Identifier: GPL-2.0-or-later /* * NetLabel CIPSO/IPv4 Support * * This file defines the CIPSO/IPv4 functions for the NetLabel system. The * NetLabel system manages static and dynamic label mappings for network * protocols such as CIPSO and RIPSO.
* * Author: Paul Moore <paul@paul-moore.com> */ /* * (c) Copyright Hewlett-Packard Development Company, L.P., 2006 */ #include <linux/types.h> #include <linux/socket.h> #include <linux/string.h> #include <linux/skbuff.h> #include <linux/audit.h> #include <linux/slab.h> #include <net/sock.h> #include <net/netlink.h> #include <net/genetlink.h> #include <net/netlabel.h> #include <net/cipso_ipv4.h> #include <linux/atomic.h> #include "netlabel_user.h" #include "netlabel_cipso_v4.h" #include "netlabel_mgmt.h" #include "netlabel_domainhash.h" /* Argument struct for cipso_v4_doi_walk() */ struct netlbl_cipsov4_doiwalk_arg { struct netlink_callback *nl_cb; struct sk_buff *skb; u32 seq; }; /* Argument struct for netlbl_domhsh_walk() */ struct netlbl_domhsh_walk_arg { struct netlbl_audit *audit_info; u32 doi; }; /* NetLabel Generic NETLINK CIPSOv4 family */ static struct genl_family netlbl_cipsov4_gnl_family; /* NetLabel Netlink attribute policy */ static const struct nla_policy netlbl_cipsov4_genl_policy[NLBL_CIPSOV4_A_MAX + 1] = { [NLBL_CIPSOV4_A_DOI] = { .type = NLA_U32 }, [NLBL_CIPSOV4_A_MTYPE] = { .type = NLA_U32 }, [NLBL_CIPSOV4_A_TAG] = { .type = NLA_U8 }, [NLBL_CIPSOV4_A_TAGLST] = { .type = NLA_NESTED }, [NLBL_CIPSOV4_A_MLSLVLLOC] = { .type = NLA_U32 }, [NLBL_CIPSOV4_A_MLSLVLREM] = { .type = NLA_U32 }, [NLBL_CIPSOV4_A_MLSLVL] = { .type = NLA_NESTED }, [NLBL_CIPSOV4_A_MLSLVLLST] = { .type = NLA_NESTED }, [NLBL_CIPSOV4_A_MLSCATLOC] = { .type = NLA_U32 }, [NLBL_CIPSOV4_A_MLSCATREM] = { .type = NLA_U32 }, [NLBL_CIPSOV4_A_MLSCAT] = { .type = NLA_NESTED }, [NLBL_CIPSOV4_A_MLSCATLST] = { .type = NLA_NESTED }, }; /* * Helper Functions */ /** * netlbl_cipsov4_add_common - Parse the common sections of a ADD message * @info: the Generic NETLINK info block * @doi_def: the CIPSO V4 DOI definition * * Description: * Parse the common sections of a ADD message and fill in the related values * in @doi_def. Returns zero on success, negative values on failure. * */ static int netlbl_cipsov4_add_common(struct genl_info *info, struct cipso_v4_doi *doi_def) { struct nlattr *nla; int nla_rem; u32 iter = 0; doi_def->doi = nla_get_u32(info->attrs[NLBL_CIPSOV4_A_DOI]); if (nla_validate_nested_deprecated(info->attrs[NLBL_CIPSOV4_A_TAGLST], NLBL_CIPSOV4_A_MAX, netlbl_cipsov4_genl_policy, NULL) != 0) return -EINVAL; nla_for_each_nested(nla, info->attrs[NLBL_CIPSOV4_A_TAGLST], nla_rem) if (nla_type(nla) == NLBL_CIPSOV4_A_TAG) { if (iter >= CIPSO_V4_TAG_MAXCNT) return -EINVAL; doi_def->tags[iter++] = nla_get_u8(nla); } while (iter < CIPSO_V4_TAG_MAXCNT) doi_def->tags[iter++] = CIPSO_V4_TAG_INVALID; return 0; } /* * NetLabel Command Handlers */ /** * netlbl_cipsov4_add_std - Adds a CIPSO V4 DOI definition * @info: the Generic NETLINK info block * @audit_info: NetLabel audit information * * Description: * Create a new CIPSO_V4_MAP_TRANS DOI definition based on the given ADD * message and add it to the CIPSO V4 engine. Return zero on success and * non-zero on error. 
* */ static int netlbl_cipsov4_add_std(struct genl_info *info, struct netlbl_audit *audit_info) { int ret_val = -EINVAL; struct cipso_v4_doi *doi_def = NULL; struct nlattr *nla_a; struct nlattr *nla_b; int nla_a_rem; int nla_b_rem; u32 iter; if (!info->attrs[NLBL_CIPSOV4_A_TAGLST] || !info->attrs[NLBL_CIPSOV4_A_MLSLVLLST]) return -EINVAL; if (nla_validate_nested_deprecated(info->attrs[NLBL_CIPSOV4_A_MLSLVLLST], NLBL_CIPSOV4_A_MAX, netlbl_cipsov4_genl_policy, NULL) != 0) return -EINVAL; doi_def = kmalloc(sizeof(*doi_def), GFP_KERNEL); if (doi_def == NULL) return -ENOMEM; doi_def->map.std = kzalloc(sizeof(*doi_def->map.std), GFP_KERNEL); if (doi_def->map.std == NULL) { kfree(doi_def); return -ENOMEM; } doi_def->type = CIPSO_V4_MAP_TRANS; ret_val = netlbl_cipsov4_add_common(info, doi_def); if (ret_val != 0) goto add_std_failure; ret_val = -EINVAL; nla_for_each_nested(nla_a, info->attrs[NLBL_CIPSOV4_A_MLSLVLLST], nla_a_rem) if (nla_type(nla_a) == NLBL_CIPSOV4_A_MLSLVL) { if (nla_validate_nested_deprecated(nla_a, NLBL_CIPSOV4_A_MAX, netlbl_cipsov4_genl_policy, NULL) != 0) goto add_std_failure; nla_for_each_nested(nla_b, nla_a, nla_b_rem) switch (nla_type(nla_b)) { case NLBL_CIPSOV4_A_MLSLVLLOC: if (nla_get_u32(nla_b) > CIPSO_V4_MAX_LOC_LVLS) goto add_std_failure; if (nla_get_u32(nla_b) >= doi_def->map.std->lvl.local_size) doi_def->map.std->lvl.local_size = nla_get_u32(nla_b) + 1; break; case NLBL_CIPSOV4_A_MLSLVLREM: if (nla_get_u32(nla_b) > CIPSO_V4_MAX_REM_LVLS) goto add_std_failure; if (nla_get_u32(nla_b) >= doi_def->map.std->lvl.cipso_size) doi_def->map.std->lvl.cipso_size = nla_get_u32(nla_b) + 1; break; } } doi_def->map.std->lvl.local = kcalloc(doi_def->map.std->lvl.local_size, sizeof(u32), GFP_KERNEL | __GFP_NOWARN); if (doi_def->map.std->lvl.local == NULL) { ret_val = -ENOMEM; goto add_std_failure; } doi_def->map.std->lvl.cipso = kcalloc(doi_def->map.std->lvl.cipso_size, sizeof(u32), GFP_KERNEL | __GFP_NOWARN); if (doi_def->map.std->lvl.cipso == NULL) { ret_val = -ENOMEM; goto add_std_failure; } for (iter = 0; iter < doi_def->map.std->lvl.local_size; iter++) doi_def->map.std->lvl.local[iter] = CIPSO_V4_INV_LVL; for (iter = 0; iter < doi_def->map.std->lvl.cipso_size; iter++) doi_def->map.std->lvl.cipso[iter] = CIPSO_V4_INV_LVL; nla_for_each_nested(nla_a, info->attrs[NLBL_CIPSOV4_A_MLSLVLLST], nla_a_rem) if (nla_type(nla_a) == NLBL_CIPSOV4_A_MLSLVL) { struct nlattr *lvl_loc; struct nlattr *lvl_rem; lvl_loc = nla_find_nested(nla_a, NLBL_CIPSOV4_A_MLSLVLLOC); lvl_rem = nla_find_nested(nla_a, NLBL_CIPSOV4_A_MLSLVLREM); if (lvl_loc == NULL || lvl_rem == NULL) goto add_std_failure; doi_def->map.std->lvl.local[nla_get_u32(lvl_loc)] = nla_get_u32(lvl_rem); doi_def->map.std->lvl.cipso[nla_get_u32(lvl_rem)] = nla_get_u32(lvl_loc); } if (info->attrs[NLBL_CIPSOV4_A_MLSCATLST]) { if (nla_validate_nested_deprecated(info->attrs[NLBL_CIPSOV4_A_MLSCATLST], NLBL_CIPSOV4_A_MAX, netlbl_cipsov4_genl_policy, NULL) != 0) goto add_std_failure; nla_for_each_nested(nla_a, info->attrs[NLBL_CIPSOV4_A_MLSCATLST], nla_a_rem) if (nla_type(nla_a) == NLBL_CIPSOV4_A_MLSCAT) { if (nla_validate_nested_deprecated(nla_a, NLBL_CIPSOV4_A_MAX, netlbl_cipsov4_genl_policy, NULL) != 0) goto add_std_failure; nla_for_each_nested(nla_b, nla_a, nla_b_rem) switch (nla_type(nla_b)) { case NLBL_CIPSOV4_A_MLSCATLOC: if (nla_get_u32(nla_b) > CIPSO_V4_MAX_LOC_CATS) goto add_std_failure; if (nla_get_u32(nla_b) >= doi_def->map.std->cat.local_size) doi_def->map.std->cat.local_size = nla_get_u32(nla_b) + 1; break; case NLBL_CIPSOV4_A_MLSCATREM: 
if (nla_get_u32(nla_b) > CIPSO_V4_MAX_REM_CATS) goto add_std_failure; if (nla_get_u32(nla_b) >= doi_def->map.std->cat.cipso_size) doi_def->map.std->cat.cipso_size = nla_get_u32(nla_b) + 1; break; } } doi_def->map.std->cat.local = kcalloc( doi_def->map.std->cat.local_size, sizeof(u32), GFP_KERNEL | __GFP_NOWARN); if (doi_def->map.std->cat.local == NULL) { ret_val = -ENOMEM; goto add_std_failure; } doi_def->map.std->cat.cipso = kcalloc( doi_def->map.std->cat.cipso_size, sizeof(u32), GFP_KERNEL | __GFP_NOWARN); if (doi_def->map.std->cat.cipso == NULL) { ret_val = -ENOMEM; goto add_std_failure; } for (iter = 0; iter < doi_def->map.std->cat.local_size; iter++) doi_def->map.std->cat.local[iter] = CIPSO_V4_INV_CAT; for (iter = 0; iter < doi_def->map.std->cat.cipso_size; iter++) doi_def->map.std->cat.cipso[iter] = CIPSO_V4_INV_CAT; nla_for_each_nested(nla_a, info->attrs[NLBL_CIPSOV4_A_MLSCATLST], nla_a_rem) if (nla_type(nla_a) == NLBL_CIPSOV4_A_MLSCAT) { struct nlattr *cat_loc; struct nlattr *cat_rem; cat_loc = nla_find_nested(nla_a, NLBL_CIPSOV4_A_MLSCATLOC); cat_rem = nla_find_nested(nla_a, NLBL_CIPSOV4_A_MLSCATREM); if (cat_loc == NULL || cat_rem == NULL) goto add_std_failure; doi_def->map.std->cat.local[ nla_get_u32(cat_loc)] = nla_get_u32(cat_rem); doi_def->map.std->cat.cipso[ nla_get_u32(cat_rem)] = nla_get_u32(cat_loc); } } ret_val = cipso_v4_doi_add(doi_def, audit_info); if (ret_val != 0) goto add_std_failure; return 0; add_std_failure: cipso_v4_doi_free(doi_def); return ret_val; } /** * netlbl_cipsov4_add_pass - Adds a CIPSO V4 DOI definition * @info: the Generic NETLINK info block * @audit_info: NetLabel audit information * * Description: * Create a new CIPSO_V4_MAP_PASS DOI definition based on the given ADD message * and add it to the CIPSO V4 engine. Return zero on success and non-zero on * error. * */ static int netlbl_cipsov4_add_pass(struct genl_info *info, struct netlbl_audit *audit_info) { int ret_val; struct cipso_v4_doi *doi_def = NULL; if (!info->attrs[NLBL_CIPSOV4_A_TAGLST]) return -EINVAL; doi_def = kmalloc(sizeof(*doi_def), GFP_KERNEL); if (doi_def == NULL) return -ENOMEM; doi_def->type = CIPSO_V4_MAP_PASS; ret_val = netlbl_cipsov4_add_common(info, doi_def); if (ret_val != 0) goto add_pass_failure; ret_val = cipso_v4_doi_add(doi_def, audit_info); if (ret_val != 0) goto add_pass_failure; return 0; add_pass_failure: cipso_v4_doi_free(doi_def); return ret_val; } /** * netlbl_cipsov4_add_local - Adds a CIPSO V4 DOI definition * @info: the Generic NETLINK info block * @audit_info: NetLabel audit information * * Description: * Create a new CIPSO_V4_MAP_LOCAL DOI definition based on the given ADD * message and add it to the CIPSO V4 engine. Return zero on success and * non-zero on error. 
* */ static int netlbl_cipsov4_add_local(struct genl_info *info, struct netlbl_audit *audit_info) { int ret_val; struct cipso_v4_doi *doi_def = NULL; if (!info->attrs[NLBL_CIPSOV4_A_TAGLST]) return -EINVAL; doi_def = kmalloc(sizeof(*doi_def), GFP_KERNEL); if (doi_def == NULL) return -ENOMEM; doi_def->type = CIPSO_V4_MAP_LOCAL; ret_val = netlbl_cipsov4_add_common(info, doi_def); if (ret_val != 0) goto add_local_failure; ret_val = cipso_v4_doi_add(doi_def, audit_info); if (ret_val != 0) goto add_local_failure; return 0; add_local_failure: cipso_v4_doi_free(doi_def); return ret_val; } /** * netlbl_cipsov4_add - Handle an ADD message * @skb: the NETLINK buffer * @info: the Generic NETLINK info block * * Description: * Create a new DOI definition based on the given ADD message and add it to the * CIPSO V4 engine. Returns zero on success, negative values on failure. * */ static int netlbl_cipsov4_add(struct sk_buff *skb, struct genl_info *info) { int ret_val = -EINVAL; struct netlbl_audit audit_info; if (!info->attrs[NLBL_CIPSOV4_A_DOI] || !info->attrs[NLBL_CIPSOV4_A_MTYPE]) return -EINVAL; netlbl_netlink_auditinfo(&audit_info); switch (nla_get_u32(info->attrs[NLBL_CIPSOV4_A_MTYPE])) { case CIPSO_V4_MAP_TRANS: ret_val = netlbl_cipsov4_add_std(info, &audit_info); break; case CIPSO_V4_MAP_PASS: ret_val = netlbl_cipsov4_add_pass(info, &audit_info); break; case CIPSO_V4_MAP_LOCAL: ret_val = netlbl_cipsov4_add_local(info, &audit_info); break; } if (ret_val == 0) atomic_inc(&netlabel_mgmt_protocount); return ret_val; } /** * netlbl_cipsov4_list - Handle a LIST message * @skb: the NETLINK buffer * @info: the Generic NETLINK info block * * Description: * Process a user generated LIST message and respond accordingly. While the * response message generated by the kernel is straightforward, determining * before hand the size of the buffer to allocate is not (we have to generate * the message to know the size). In order to keep this function sane what we * do is allocate a buffer of NLMSG_GOODSIZE and try to fit the response in * that size, if we fail then we restart with a larger buffer and try again. * We continue in this manner until we hit a limit of failed attempts then we * give up and just send an error message. Returns zero on success and * negative values on error. 
* */ static int netlbl_cipsov4_list(struct sk_buff *skb, struct genl_info *info) { int ret_val; struct sk_buff *ans_skb = NULL; u32 nlsze_mult = 1; void *data; u32 doi; struct nlattr *nla_a; struct nlattr *nla_b; struct cipso_v4_doi *doi_def; u32 iter; if (!info->attrs[NLBL_CIPSOV4_A_DOI]) { ret_val = -EINVAL; goto list_failure; } list_start: ans_skb = nlmsg_new(NLMSG_DEFAULT_SIZE * nlsze_mult, GFP_KERNEL); if (ans_skb == NULL) { ret_val = -ENOMEM; goto list_failure; } data = genlmsg_put_reply(ans_skb, info, &netlbl_cipsov4_gnl_family, 0, NLBL_CIPSOV4_C_LIST); if (data == NULL) { ret_val = -ENOMEM; goto list_failure; } doi = nla_get_u32(info->attrs[NLBL_CIPSOV4_A_DOI]); rcu_read_lock(); doi_def = cipso_v4_doi_getdef(doi); if (doi_def == NULL) { ret_val = -EINVAL; goto list_failure_lock; } ret_val = nla_put_u32(ans_skb, NLBL_CIPSOV4_A_MTYPE, doi_def->type); if (ret_val != 0) goto list_failure_lock; nla_a = nla_nest_start_noflag(ans_skb, NLBL_CIPSOV4_A_TAGLST); if (nla_a == NULL) { ret_val = -ENOMEM; goto list_failure_lock; } for (iter = 0; iter < CIPSO_V4_TAG_MAXCNT && doi_def->tags[iter] != CIPSO_V4_TAG_INVALID; iter++) { ret_val = nla_put_u8(ans_skb, NLBL_CIPSOV4_A_TAG, doi_def->tags[iter]); if (ret_val != 0) goto list_failure_lock; } nla_nest_end(ans_skb, nla_a); switch (doi_def->type) { case CIPSO_V4_MAP_TRANS: nla_a = nla_nest_start_noflag(ans_skb, NLBL_CIPSOV4_A_MLSLVLLST); if (nla_a == NULL) { ret_val = -ENOMEM; goto list_failure_lock; } for (iter = 0; iter < doi_def->map.std->lvl.local_size; iter++) { if (doi_def->map.std->lvl.local[iter] == CIPSO_V4_INV_LVL) continue; nla_b = nla_nest_start_noflag(ans_skb, NLBL_CIPSOV4_A_MLSLVL); if (nla_b == NULL) { ret_val = -ENOMEM; goto list_retry; } ret_val = nla_put_u32(ans_skb, NLBL_CIPSOV4_A_MLSLVLLOC, iter); if (ret_val != 0) goto list_retry; ret_val = nla_put_u32(ans_skb, NLBL_CIPSOV4_A_MLSLVLREM, doi_def->map.std->lvl.local[iter]); if (ret_val != 0) goto list_retry; nla_nest_end(ans_skb, nla_b); } nla_nest_end(ans_skb, nla_a); nla_a = nla_nest_start_noflag(ans_skb, NLBL_CIPSOV4_A_MLSCATLST); if (nla_a == NULL) { ret_val = -ENOMEM; goto list_retry; } for (iter = 0; iter < doi_def->map.std->cat.local_size; iter++) { if (doi_def->map.std->cat.local[iter] == CIPSO_V4_INV_CAT) continue; nla_b = nla_nest_start_noflag(ans_skb, NLBL_CIPSOV4_A_MLSCAT); if (nla_b == NULL) { ret_val = -ENOMEM; goto list_retry; } ret_val = nla_put_u32(ans_skb, NLBL_CIPSOV4_A_MLSCATLOC, iter); if (ret_val != 0) goto list_retry; ret_val = nla_put_u32(ans_skb, NLBL_CIPSOV4_A_MLSCATREM, doi_def->map.std->cat.local[iter]); if (ret_val != 0) goto list_retry; nla_nest_end(ans_skb, nla_b); } nla_nest_end(ans_skb, nla_a); break; } cipso_v4_doi_putdef(doi_def); rcu_read_unlock(); genlmsg_end(ans_skb, data); return genlmsg_reply(ans_skb, info); list_retry: /* XXX - this limit is a guesstimate */ if (nlsze_mult < 4) { cipso_v4_doi_putdef(doi_def); rcu_read_unlock(); kfree_skb(ans_skb); nlsze_mult *= 2; goto list_start; } list_failure_lock: cipso_v4_doi_putdef(doi_def); rcu_read_unlock(); list_failure: kfree_skb(ans_skb); return ret_val; } /** * netlbl_cipsov4_listall_cb - cipso_v4_doi_walk() callback for LISTALL * @doi_def: the CIPSOv4 DOI definition * @arg: the netlbl_cipsov4_doiwalk_arg structure * * Description: * This function is designed to be used as a callback to the * cipso_v4_doi_walk() function for use in generating a response for a LISTALL * message. Returns the size of the message on success, negative values on * failure. 
* */ static int netlbl_cipsov4_listall_cb(struct cipso_v4_doi *doi_def, void *arg) { int ret_val = -ENOMEM; struct netlbl_cipsov4_doiwalk_arg *cb_arg = arg; void *data; data = genlmsg_put(cb_arg->skb, NETLINK_CB(cb_arg->nl_cb->skb).portid, cb_arg->seq, &netlbl_cipsov4_gnl_family, NLM_F_MULTI, NLBL_CIPSOV4_C_LISTALL); if (data == NULL) goto listall_cb_failure; ret_val = nla_put_u32(cb_arg->skb, NLBL_CIPSOV4_A_DOI, doi_def->doi); if (ret_val != 0) goto listall_cb_failure; ret_val = nla_put_u32(cb_arg->skb, NLBL_CIPSOV4_A_MTYPE, doi_def->type); if (ret_val != 0) goto listall_cb_failure; genlmsg_end(cb_arg->skb, data); return 0; listall_cb_failure: genlmsg_cancel(cb_arg->skb, data); return ret_val; } /** * netlbl_cipsov4_listall - Handle a LISTALL message * @skb: the NETLINK buffer * @cb: the NETLINK callback * * Description: * Process a user generated LISTALL message and respond accordingly. Returns * zero on success and negative values on error. * */ static int netlbl_cipsov4_listall(struct sk_buff *skb, struct netlink_callback *cb) { struct netlbl_cipsov4_doiwalk_arg cb_arg; u32 doi_skip = cb->args[0]; cb_arg.nl_cb = cb; cb_arg.skb = skb; cb_arg.seq = cb->nlh->nlmsg_seq; cipso_v4_doi_walk(&doi_skip, netlbl_cipsov4_listall_cb, &cb_arg); cb->args[0] = doi_skip; return skb->len; } /** * netlbl_cipsov4_remove_cb - netlbl_cipsov4_remove() callback for REMOVE * @entry: LSM domain mapping entry * @arg: the netlbl_domhsh_walk_arg structure * * Description: * This function is intended for use by netlbl_cipsov4_remove() as the callback * for the netlbl_domhsh_walk() function; it removes LSM domain map entries * which are associated with the CIPSO DOI specified in @arg. Returns zero on * success, negative values on failure. * */ static int netlbl_cipsov4_remove_cb(struct netlbl_dom_map *entry, void *arg) { struct netlbl_domhsh_walk_arg *cb_arg = arg; if (entry->def.type == NETLBL_NLTYPE_CIPSOV4 && entry->def.cipso->doi == cb_arg->doi) return netlbl_domhsh_remove_entry(entry, cb_arg->audit_info); return 0; } /** * netlbl_cipsov4_remove - Handle a REMOVE message * @skb: the NETLINK buffer * @info: the Generic NETLINK info block * * Description: * Process a user generated REMOVE message and respond accordingly. Returns * zero on success, negative values on failure. 
* */ static int netlbl_cipsov4_remove(struct sk_buff *skb, struct genl_info *info) { int ret_val = -EINVAL; struct netlbl_domhsh_walk_arg cb_arg; struct netlbl_audit audit_info; u32 skip_bkt = 0; u32 skip_chain = 0; if (!info->attrs[NLBL_CIPSOV4_A_DOI]) return -EINVAL; netlbl_netlink_auditinfo(&audit_info); cb_arg.doi = nla_get_u32(info->attrs[NLBL_CIPSOV4_A_DOI]); cb_arg.audit_info = &audit_info; ret_val = netlbl_domhsh_walk(&skip_bkt, &skip_chain, netlbl_cipsov4_remove_cb, &cb_arg); if (ret_val == 0 || ret_val == -ENOENT) { ret_val = cipso_v4_doi_remove(cb_arg.doi, &audit_info); if (ret_val == 0) atomic_dec(&netlabel_mgmt_protocount); } return ret_val; } /* * NetLabel Generic NETLINK Command Definitions */ static const struct genl_small_ops netlbl_cipsov4_ops[] = { { .cmd = NLBL_CIPSOV4_C_ADD, .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP, .flags = GENL_ADMIN_PERM, .doit = netlbl_cipsov4_add, .dumpit = NULL, }, { .cmd = NLBL_CIPSOV4_C_REMOVE, .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP, .flags = GENL_ADMIN_PERM, .doit = netlbl_cipsov4_remove, .dumpit = NULL, }, { .cmd = NLBL_CIPSOV4_C_LIST, .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP, .flags = 0, .doit = netlbl_cipsov4_list, .dumpit = NULL, }, { .cmd = NLBL_CIPSOV4_C_LISTALL, .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP, .flags = 0, .doit = NULL, .dumpit = netlbl_cipsov4_listall, }, }; static struct genl_family netlbl_cipsov4_gnl_family __ro_after_init = { .hdrsize = 0, .name = NETLBL_NLTYPE_CIPSOV4_NAME, .version = NETLBL_PROTO_VERSION, .maxattr = NLBL_CIPSOV4_A_MAX, .policy = netlbl_cipsov4_genl_policy, .module = THIS_MODULE, .small_ops = netlbl_cipsov4_ops, .n_small_ops = ARRAY_SIZE(netlbl_cipsov4_ops), .resv_start_op = NLBL_CIPSOV4_C_LISTALL + 1, }; /* * NetLabel Generic NETLINK Protocol Functions */ /** * netlbl_cipsov4_genl_init - Register the CIPSOv4 NetLabel component * * Description: * Register the CIPSOv4 packet NetLabel component with the Generic NETLINK * mechanism. Returns zero on success, negative values on failure. * */ int __init netlbl_cipsov4_genl_init(void) { return genl_register_family(&netlbl_cipsov4_gnl_family); }
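/*
 * Editorial note: the sketch below is not part of netlabel_cipso_v4.c. It
 * strips the "allocate, try to fit, double and retry" strategy described in
 * the netlbl_cipsov4_list() comment down to plain C so the control flow is
 * easier to see. try_fill_reply() and build_reply() are invented stand-ins;
 * the real code uses nlmsg_new()/genlmsg_reply() and caps the retries at a
 * nlsze_mult of 4.
 */
#include <linux/errno.h>
#include <linux/slab.h>

/* Hypothetical helper: returns 0 on success, -EMSGSIZE if @len was too small. */
int try_fill_reply(void *buf, size_t len);

static int build_reply(size_t base_size)
{
	size_t mult = 1;

	for (;;) {
		void *buf = kmalloc(base_size * mult, GFP_KERNEL);
		int ret;

		if (!buf)
			return -ENOMEM;
		ret = try_fill_reply(buf, base_size * mult);
		kfree(buf);	/* a real caller would send the reply before freeing it */
		if (ret == 0)
			return 0;
		if (mult >= 4)	/* mirrors the "nlsze_mult < 4" guesstimate limit */
			return ret;
		mult *= 2;	/* restart with a larger buffer, as list_start/list_retry do */
	}
}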
// SPDX-License-Identifier: GPL-2.0 #include <linux/compiler.h> #include <linux/export.h> #include <linux/list_sort.h> #include <linux/list.h> /* * Returns a list organized in an intermediate format suited * to chaining of merge() calls: null-terminated, no reserved or * sentinel head node, "prev" links not maintained. */ __attribute__((nonnull(2,3,4))) static struct list_head *merge(void *priv, list_cmp_func_t cmp, struct list_head *a, struct list_head *b) { struct list_head *head, **tail = &head; for (;;) { /* if equal, take 'a' -- important for sort stability */ if (cmp(priv, a, b) <= 0) { *tail = a; tail = &a->next; a = a->next; if (!a) { *tail = b; break; } } else { *tail = b; tail = &b->next; b = b->next; if (!b) { *tail = a; break; } } } return head; } /* * Combine final list merge with restoration of standard doubly-linked * list structure. This approach duplicates code from merge(), but * runs faster than the tidier alternatives of either a separate final * prev-link restoration pass, or maintaining the prev links * throughout. */ __attribute__((nonnull(2,3,4,5))) static void merge_final(void *priv, list_cmp_func_t cmp, struct list_head *head, struct list_head *a, struct list_head *b) { struct list_head *tail = head; u8 count = 0; for (;;) { /* if equal, take 'a' -- important for sort stability */ if (cmp(priv, a, b) <= 0) { tail->next = a; a->prev = tail; tail = a; a = a->next; if (!a) break; } else { tail->next = b; b->prev = tail; tail = b; b = b->next; if (!b) { b = a; break; } } } /* Finish linking remainder of list b on to tail */ tail->next = b; do { /* * If the merge is highly unbalanced (e.g. the input is * already sorted), this loop may run many iterations. * Continue callbacks to the client even though no * element comparison is needed, so the client's cmp() * routine can invoke cond_resched() periodically. */ if (unlikely(!++count)) cmp(priv, b, b); b->prev = tail; tail = b; b = b->next; } while (b); /* And the final links to make a circular doubly-linked list */ tail->next = head; head->prev = tail; } /** * list_sort - sort a list * @priv: private data, opaque to list_sort(), passed to @cmp * @head: the list to sort * @cmp: the elements comparison function * * The comparison function @cmp must return > 0 if @a should sort after * @b ("@a > @b" if you want an ascending sort), and <= 0 if @a should * sort before @b *or* their original order should be preserved.
It is * always called with the element that came first in the input in @a, * and list_sort is a stable sort, so it is not necessary to distinguish * the @a < @b and @a == @b cases. * * The comparison function must adhere to specific mathematical properties * to ensure correct and stable sorting: * - Antisymmetry: cmp(@a, @b) must return the opposite sign of * cmp(@b, @a). * - Transitivity: if cmp(@a, @b) <= 0 and cmp(@b, @c) <= 0, then * cmp(@a, @c) <= 0. * * This is compatible with two styles of @cmp function: * - The traditional style which returns <0 / =0 / >0, or * - Returning a boolean 0/1. * The latter offers a chance to save a few cycles in the comparison * (which is used by e.g. plug_ctx_cmp() in block/blk-mq.c). * * A good way to write a multi-word comparison is:: * * if (a->high != b->high) * return a->high > b->high; * if (a->middle != b->middle) * return a->middle > b->middle; * return a->low > b->low; * * * This mergesort is as eager as possible while always performing at least * 2:1 balanced merges. Given two pending sublists of size 2^k, they are * merged to a size-2^(k+1) list as soon as we have 2^k following elements. * * Thus, it will avoid cache thrashing as long as 3*2^k elements can * fit into the cache. Not quite as good as a fully-eager bottom-up * mergesort, but it does use 0.2*n fewer comparisons, so is faster in * the common case that everything fits into L1. * * * The merging is controlled by "count", the number of elements in the * pending lists. This is beautifully simple code, but rather subtle. * * Each time we increment "count", we set one bit (bit k) and clear * bits k-1 .. 0. Each time this happens (except the very first time * for each bit, when count increments to 2^k), we merge two lists of * size 2^k into one list of size 2^(k+1). * * This merge happens exactly when the count reaches an odd multiple of * 2^k, which is when we have 2^k elements pending in smaller lists, * so it's safe to merge away two lists of size 2^k. * * After this happens twice, we have created two lists of size 2^(k+1), * which will be merged into a list of size 2^(k+2) before we create * a third list of size 2^(k+1), so there are never more than two pending. * * The number of pending lists of size 2^k is determined by the * state of bit k of "count" plus two extra pieces of information: * * - The state of bit k-1 (when k == 0, consider bit -1 always set), and * - Whether the higher-order bits are zero or non-zero (i.e. * is count >= 2^(k+1)). * * There are six states we distinguish. "x" represents some arbitrary * bits, and "y" represents some arbitrary non-zero bits: * 0: 00x: 0 pending of size 2^k; x pending of sizes < 2^k * 1: 01x: 0 pending of size 2^k; 2^(k-1) + x pending of sizes < 2^k * 2: x10x: 0 pending of size 2^k; 2^k + x pending of sizes < 2^k * 3: x11x: 1 pending of size 2^k; 2^(k-1) + x pending of sizes < 2^k * 4: y00x: 1 pending of size 2^k; 2^k + x pending of sizes < 2^k * 5: y01x: 2 pending of size 2^k; 2^(k-1) + x pending of sizes < 2^k * (merge and loop back to state 2) * * We gain lists of size 2^k in the 2->3 and 4->5 transitions (because * bit k-1 is set while the more significant bits are non-zero) and * merge them away in the 5->2 transition. Note in particular that just * before the 5->2 transition, all lower-order bits are 11 (state 3), * so there is one list of each smaller size. * * When we reach the end of the input, we merge all the pending * lists, from smallest to largest. 
If you work through cases 2 to * 5 above, you can see that the number of elements we merge with a list * of size 2^k varies from 2^(k-1) (cases 3 and 5 when x == 0) to * 2^(k+1) - 1 (second merge of case 5 when x == 2^(k-1) - 1). */ __attribute__((nonnull(2,3))) void list_sort(void *priv, struct list_head *head, list_cmp_func_t cmp) { struct list_head *list = head->next, *pending = NULL; size_t count = 0; /* Count of pending */ if (list == head->prev) /* Zero or one elements */ return; /* Convert to a null-terminated singly-linked list. */ head->prev->next = NULL; /* * Data structure invariants: * - All lists are singly linked and null-terminated; prev * pointers are not maintained. * - pending is a prev-linked "list of lists" of sorted * sublists awaiting further merging. * - Each of the sorted sublists is power-of-two in size. * - Sublists are sorted by size and age, smallest & newest at front. * - There are zero to two sublists of each size. * - A pair of pending sublists are merged as soon as the number * of following pending elements equals their size (i.e. * each time count reaches an odd multiple of that size). * That ensures each later final merge will be at worst 2:1. * - Each round consists of: * - Merging the two sublists selected by the highest bit * which flips when count is incremented, and * - Adding an element from the input as a size-1 sublist. */ do { size_t bits; struct list_head **tail = &pending; /* Find the least-significant clear bit in count */ for (bits = count; bits & 1; bits >>= 1) tail = &(*tail)->prev; /* Do the indicated merge */ if (likely(bits)) { struct list_head *a = *tail, *b = a->prev; a = merge(priv, cmp, b, a); /* Install the merged result in place of the inputs */ a->prev = b->prev; *tail = a; } /* Move one element from input list to pending */ list->prev = pending; pending = list; list = list->next; pending->next = NULL; count++; } while (list); /* End of input; merge together all the pending lists. */ list = pending; pending = pending->prev; for (;;) { struct list_head *next = pending->prev; if (!next) break; list = merge(priv, cmp, pending, list); pending = next; } /* The final merge, rebuilding prev links */ merge_final(priv, cmp, head, pending, list); } EXPORT_SYMBOL(list_sort);
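/*
 * Editorial note: the sketch below is not part of lib/list_sort.c. It is a
 * minimal usage example of list_sort() with a cmp() written in the
 * multi-word, boolean-return style recommended in the comment above.
 * struct item, item_cmp() and sort_items() are invented; the callback
 * prototype assumes the const-qualified list_cmp_func_t used by recent
 * kernels.
 */
#include <linux/list.h>
#include <linux/list_sort.h>
#include <linux/types.h>

struct item {
	struct list_head node;
	u32 high;
	u32 low;
};

static int item_cmp(void *priv, const struct list_head *a,
		    const struct list_head *b)
{
	const struct item *ia = list_entry(a, struct item, node);
	const struct item *ib = list_entry(b, struct item, node);

	/* return 1 ("a sorts after b") or 0; stability handles the == case */
	if (ia->high != ib->high)
		return ia->high > ib->high;
	return ia->low > ib->low;
}

static void sort_items(struct list_head *items)
{
	/* stable ascending sort by (high, low); priv is unused here */
	list_sort(NULL, items, item_cmp);
}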
// SPDX-License-Identifier: GPL-2.0-or-later /* * Handle incoming frames * Linux ethernet bridge * * Authors: * Lennert Buytenhek <buytenh@gnu.org> */ #include <linux/slab.h> #include <linux/kernel.h> #include <linux/netdevice.h> #include <linux/etherdevice.h> #include <linux/netfilter_bridge.h> #ifdef CONFIG_NETFILTER_FAMILY_BRIDGE #include <net/netfilter/nf_queue.h> #endif #include <linux/neighbour.h> #include <net/arp.h> #include <net/dsa.h> #include <linux/export.h> #include <linux/rculist.h> #include "br_private.h" #include "br_private_tunnel.h" static int br_netif_receive_skb(struct net *net, struct sock *sk, struct sk_buff *skb) { br_drop_fake_rtable(skb); return netif_receive_skb(skb); } static int br_pass_frame_up(struct sk_buff *skb, bool promisc) { struct net_device *indev, *brdev = BR_INPUT_SKB_CB(skb)->brdev; struct net_bridge *br = netdev_priv(brdev); struct net_bridge_vlan_group *vg; dev_sw_netstats_rx_add(brdev, skb->len); vg = br_vlan_group_rcu(br); /* Reset the offload_fwd_mark because there could be a stacked * bridge above, and it should not think this bridge is doing * that bridge's work forwarding out its ports. */ br_switchdev_frame_unmark(skb); /* Bridge is just like any other port. Make sure the * packet is allowed except in promisc mode when someone * may be running packet capture.
*/ if (!(brdev->flags & IFF_PROMISC) && !br_allowed_egress(vg, skb)) { kfree_skb(skb); return NET_RX_DROP; } indev = skb->dev; skb->dev = brdev; skb = br_handle_vlan(br, NULL, vg, skb); if (!skb) return NET_RX_DROP; /* update the multicast stats if the packet is IGMP/MLD */ br_multicast_count(br, NULL, skb, br_multicast_igmp_type(skb), BR_MCAST_DIR_TX); BR_INPUT_SKB_CB(skb)->promisc = promisc; return NF_HOOK(NFPROTO_BRIDGE, NF_BR_LOCAL_IN, dev_net(indev), NULL, skb, indev, NULL, br_netif_receive_skb); } /* note: already called with rcu_read_lock */ int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb) { enum skb_drop_reason reason = SKB_DROP_REASON_NOT_SPECIFIED; struct net_bridge_port *p = br_port_get_rcu(skb->dev); enum br_pkt_type pkt_type = BR_PKT_UNICAST; struct net_bridge_fdb_entry *dst = NULL; struct net_bridge_mcast_port *pmctx; struct net_bridge_mdb_entry *mdst; bool local_rcv, mcast_hit = false; struct net_bridge_mcast *brmctx; struct net_bridge_vlan *vlan; struct net_bridge *br; bool promisc; u16 vid = 0; u8 state; if (!p) goto drop; br = p->br; if (br_mst_is_enabled(br)) { state = BR_STATE_FORWARDING; } else { if (p->state == BR_STATE_DISABLED) { reason = SKB_DROP_REASON_BRIDGE_INGRESS_STP_STATE; goto drop; } state = p->state; } brmctx = &p->br->multicast_ctx; pmctx = &p->multicast_ctx; if (!br_allowed_ingress(p->br, nbp_vlan_group_rcu(p), skb, &vid, &state, &vlan)) goto out; if (p->flags & BR_PORT_LOCKED) { struct net_bridge_fdb_entry *fdb_src = br_fdb_find_rcu(br, eth_hdr(skb)->h_source, vid); if (!fdb_src) { /* FDB miss. Create locked FDB entry if MAB is enabled * and drop the packet. */ if (p->flags & BR_PORT_MAB) br_fdb_update(br, p, eth_hdr(skb)->h_source, vid, BIT(BR_FDB_LOCKED)); goto drop; } else if (READ_ONCE(fdb_src->dst) != p || test_bit(BR_FDB_LOCAL, &fdb_src->flags)) { /* FDB mismatch. Drop the packet without roaming. */ goto drop; } else if (test_bit(BR_FDB_LOCKED, &fdb_src->flags)) { /* FDB match, but entry is locked. Refresh it and drop * the packet. 
*/ br_fdb_update(br, p, eth_hdr(skb)->h_source, vid, BIT(BR_FDB_LOCKED)); goto drop; } } nbp_switchdev_frame_mark(p, skb); /* insert into forwarding database after filtering to avoid spoofing */ if (p->flags & BR_LEARNING) br_fdb_update(br, p, eth_hdr(skb)->h_source, vid, 0); promisc = !!(br->dev->flags & IFF_PROMISC); local_rcv = promisc; if (is_multicast_ether_addr(eth_hdr(skb)->h_dest)) { /* by definition the broadcast is also a multicast address */ if (is_broadcast_ether_addr(eth_hdr(skb)->h_dest)) { pkt_type = BR_PKT_BROADCAST; local_rcv = true; } else { pkt_type = BR_PKT_MULTICAST; if (br_multicast_rcv(&brmctx, &pmctx, vlan, skb, vid)) goto drop; } } if (state == BR_STATE_LEARNING) { reason = SKB_DROP_REASON_BRIDGE_INGRESS_STP_STATE; goto drop; } BR_INPUT_SKB_CB(skb)->brdev = br->dev; BR_INPUT_SKB_CB(skb)->src_port_isolated = !!(p->flags & BR_ISOLATED); if (IS_ENABLED(CONFIG_INET) && (skb->protocol == htons(ETH_P_ARP) || skb->protocol == htons(ETH_P_RARP))) { br_do_proxy_suppress_arp(skb, br, vid, p); } else if (IS_ENABLED(CONFIG_IPV6) && skb->protocol == htons(ETH_P_IPV6) && br_opt_get(br, BROPT_NEIGH_SUPPRESS_ENABLED) && pskb_may_pull(skb, sizeof(struct ipv6hdr) + sizeof(struct nd_msg)) && ipv6_hdr(skb)->nexthdr == IPPROTO_ICMPV6) { struct nd_msg *msg, _msg; msg = br_is_nd_neigh_msg(skb, &_msg); if (msg) br_do_suppress_nd(skb, br, vid, p, msg); } switch (pkt_type) { case BR_PKT_MULTICAST: mdst = br_mdb_entry_skb_get(brmctx, skb, vid); if ((mdst || BR_INPUT_SKB_CB_MROUTERS_ONLY(skb)) && br_multicast_querier_exists(brmctx, eth_hdr(skb), mdst)) { if ((mdst && mdst->host_joined) || br_multicast_is_router(brmctx, skb)) { local_rcv = true; DEV_STATS_INC(br->dev, multicast); } mcast_hit = true; } else { local_rcv = true; DEV_STATS_INC(br->dev, multicast); } break; case BR_PKT_UNICAST: dst = br_fdb_find_rcu(br, eth_hdr(skb)->h_dest, vid); break; default: break; } if (dst) { unsigned long now = jiffies; if (test_bit(BR_FDB_LOCAL, &dst->flags)) return br_pass_frame_up(skb, false); if (now != dst->used) dst->used = now; br_forward(dst->dst, skb, local_rcv, false); } else { if (!mcast_hit) br_flood(br, skb, pkt_type, local_rcv, false, vid); else br_multicast_flood(mdst, skb, brmctx, local_rcv, false); } if (local_rcv) return br_pass_frame_up(skb, promisc); out: return 0; drop: kfree_skb_reason(skb, reason); goto out; } EXPORT_SYMBOL_GPL(br_handle_frame_finish); static void __br_handle_local_finish(struct sk_buff *skb) { struct net_bridge_port *p = br_port_get_rcu(skb->dev); u16 vid = 0; /* check if vlan is allowed, to avoid spoofing */ if ((p->flags & BR_LEARNING) && nbp_state_should_learn(p) && !br_opt_get(p->br, BROPT_NO_LL_LEARN) && br_should_learn(p, skb, &vid)) br_fdb_update(p->br, p, eth_hdr(skb)->h_source, vid, 0); } /* note: already called with rcu_read_lock */ static int br_handle_local_finish(struct net *net, struct sock *sk, struct sk_buff *skb) { __br_handle_local_finish(skb); /* return 1 to signal the okfn() was called so it's ok to use the skb */ return 1; } static int nf_hook_bridge_pre(struct sk_buff *skb, struct sk_buff **pskb) { #ifdef CONFIG_NETFILTER_FAMILY_BRIDGE struct nf_hook_entries *e = NULL; struct nf_hook_state state; unsigned int verdict, i; struct net *net; int ret; net = dev_net(skb->dev); #ifdef HAVE_JUMP_LABEL if (!static_key_false(&nf_hooks_needed[NFPROTO_BRIDGE][NF_BR_PRE_ROUTING])) goto frame_finish; #endif e = rcu_dereference(net->nf.hooks_bridge[NF_BR_PRE_ROUTING]); if (!e) goto frame_finish; nf_hook_state_init(&state, NF_BR_PRE_ROUTING, NFPROTO_BRIDGE, 
skb->dev, NULL, NULL, net, br_handle_frame_finish); for (i = 0; i < e->num_hook_entries; i++) { verdict = nf_hook_entry_hookfn(&e->hooks[i], skb, &state); switch (verdict & NF_VERDICT_MASK) { case NF_ACCEPT: if (BR_INPUT_SKB_CB(skb)->br_netfilter_broute) { *pskb = skb; return RX_HANDLER_PASS; } break; case NF_DROP: kfree_skb(skb); return RX_HANDLER_CONSUMED; case NF_QUEUE: ret = nf_queue(skb, &state, i, verdict); if (ret == 1) continue; return RX_HANDLER_CONSUMED; default: /* STOLEN */ return RX_HANDLER_CONSUMED; } } frame_finish: net = dev_net(skb->dev); br_handle_frame_finish(net, NULL, skb); #else br_handle_frame_finish(dev_net(skb->dev), NULL, skb); #endif return RX_HANDLER_CONSUMED; } /* Return 0 if the frame was not processed otherwise 1 * note: already called with rcu_read_lock */ static int br_process_frame_type(struct net_bridge_port *p, struct sk_buff *skb) { struct br_frame_type *tmp; hlist_for_each_entry_rcu(tmp, &p->br->frame_type_list, list) if (unlikely(tmp->type == skb->protocol)) return tmp->frame_handler(p, skb); return 0; } /* * Return NULL if skb is handled * note: already called with rcu_read_lock */ static rx_handler_result_t br_handle_frame(struct sk_buff **pskb) { enum skb_drop_reason reason = SKB_DROP_REASON_NOT_SPECIFIED; struct net_bridge_port *p; struct sk_buff *skb = *pskb; const unsigned char *dest = eth_hdr(skb)->h_dest; if (unlikely(skb->pkt_type == PACKET_LOOPBACK)) return RX_HANDLER_PASS; if (!is_valid_ether_addr(eth_hdr(skb)->h_source)) { reason = SKB_DROP_REASON_MAC_INVALID_SOURCE; goto drop; } skb = skb_share_check(skb, GFP_ATOMIC); if (!skb) return RX_HANDLER_CONSUMED; memset(skb->cb, 0, sizeof(struct br_input_skb_cb)); br_tc_skb_miss_set(skb, false); p = br_port_get_rcu(skb->dev); if (p->flags & BR_VLAN_TUNNEL) br_handle_ingress_vlan_tunnel(skb, p, nbp_vlan_group_rcu(p)); if (unlikely(is_link_local_ether_addr(dest))) { u16 fwd_mask = p->br->group_fwd_mask_required; /* * See IEEE 802.1D Table 7-10 Reserved addresses * * Assignment Value * Bridge Group Address 01-80-C2-00-00-00 * (MAC Control) 802.3 01-80-C2-00-00-01 * (Link Aggregation) 802.3 01-80-C2-00-00-02 * 802.1X PAE address 01-80-C2-00-00-03 * * 802.1AB LLDP 01-80-C2-00-00-0E * * Others reserved for future standardization */ fwd_mask |= p->group_fwd_mask; switch (dest[5]) { case 0x00: /* Bridge Group Address */ /* If STP is turned off, then must forward to keep loop detection */ if (p->br->stp_enabled == BR_NO_STP || fwd_mask & (1u << dest[5])) goto forward; *pskb = skb; __br_handle_local_finish(skb); return RX_HANDLER_PASS; case 0x01: /* IEEE MAC (Pause) */ reason = SKB_DROP_REASON_MAC_IEEE_MAC_CONTROL; goto drop; case 0x0E: /* 802.1AB LLDP */ fwd_mask |= p->br->group_fwd_mask; if (fwd_mask & (1u << dest[5])) goto forward; *pskb = skb; __br_handle_local_finish(skb); return RX_HANDLER_PASS; default: /* Allow selective forwarding for most other protocols */ fwd_mask |= p->br->group_fwd_mask; if (fwd_mask & (1u << dest[5])) goto forward; } BR_INPUT_SKB_CB(skb)->promisc = false; /* The else clause should be hit when nf_hook(): * - returns < 0 (drop/error) * - returns = 0 (stolen/nf_queue) * Thus return 1 from the okfn() to signal the skb is ok to pass */ if (NF_HOOK(NFPROTO_BRIDGE, NF_BR_LOCAL_IN, dev_net(skb->dev), NULL, skb, skb->dev, NULL, br_handle_local_finish) == 1) { return RX_HANDLER_PASS; } else { return RX_HANDLER_CONSUMED; } } if (unlikely(br_process_frame_type(p, skb))) return RX_HANDLER_PASS; forward: if (br_mst_is_enabled(p->br)) goto defer_stp_filtering; switch (p->state) { case 
BR_STATE_FORWARDING: case BR_STATE_LEARNING: defer_stp_filtering: if (ether_addr_equal(p->br->dev->dev_addr, dest)) skb->pkt_type = PACKET_HOST; return nf_hook_bridge_pre(skb, pskb); default: reason = SKB_DROP_REASON_BRIDGE_INGRESS_STP_STATE; drop: kfree_skb_reason(skb, reason); } return RX_HANDLER_CONSUMED; } /* This function has no purpose other than to appease the br_port_get_rcu/rtnl * helpers which identify bridged ports according to the rx_handler installed * on them (so there _needs_ to be a bridge rx_handler even if we don't need it * to do anything useful). This bridge won't support traffic to/from the stack, * but only hardware bridging. So return RX_HANDLER_PASS so we don't steal * frames from the ETH_P_XDSA packet_type handler. */ static rx_handler_result_t br_handle_frame_dummy(struct sk_buff **pskb) { return RX_HANDLER_PASS; } rx_handler_func_t *br_get_rx_handler(const struct net_device *dev) { if (netdev_uses_dsa(dev)) return br_handle_frame_dummy; return br_handle_frame; } void br_add_frame(struct net_bridge *br, struct br_frame_type *ft) { hlist_add_head_rcu(&ft->list, &br->frame_type_list); } void br_del_frame(struct net_bridge *br, struct br_frame_type *ft) { struct br_frame_type *tmp; hlist_for_each_entry(tmp, &br->frame_type_list, list) if (ft == tmp) { hlist_del_rcu(&ft->list); return; } }
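/*
 * A minimal sketch of how a subsystem can claim an EtherType on bridge
 * ingress with the hooks above.  The example_* names and ETH_P_EXAMPLE are
 * hypothetical, and the br_frame_type fields (type, frame_handler, list)
 * are assumed from the lookup in br_process_frame_type().  A non-zero
 * return from the handler makes br_handle_frame() skip bridging and return
 * RX_HANDLER_PASS, so the skb stays owned by the core stack and must not
 * be freed by the handler.
 */
static int example_frame_handler(struct net_bridge_port *p,
				 struct sk_buff *skb)
{
	/* Inspect the frame here; claim it so it is not flooded/forwarded. */
	return 1;
}

static struct br_frame_type example_frame_type __read_mostly = {
	.type		= cpu_to_be16(ETH_P_EXAMPLE),
	.frame_handler	= example_frame_handler,
};

static void example_frame_type_register(struct net_bridge *br)
{
	/* RCU-safe insert into br->frame_type_list, see br_add_frame(). */
	br_add_frame(br, &example_frame_type);
}

static void example_frame_type_unregister(struct net_bridge *br)
{
	br_del_frame(br, &example_frame_type);
}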
// SPDX-License-Identifier: GPL-2.0+ /* * cdc-acm.c * * Copyright (c) 1999 Armin Fuerst <fuerst@in.tum.de> * Copyright (c) 1999 Pavel Machek <pavel@ucw.cz> * Copyright (c) 1999 Johannes Erdfelt <johannes@erdfelt.com> * Copyright (c) 2000 Vojtech Pavlik <vojtech@suse.cz> * Copyright (c) 2004 Oliver Neukum <oliver@neukum.name> * Copyright (c) 2005 David Kubicek <dave@awk.cz> * Copyright (c) 2011 Johan Hovold <jhovold@gmail.com> * * USB Abstract Control Model driver for USB modems and ISDN adapters * * Sponsored by SuSE */ #undef DEBUG #undef VERBOSE_DEBUG #include <linux/kernel.h> #include <linux/sched/signal.h> #include <linux/errno.h> #include <linux/init.h> #include <linux/slab.h> #include <linux/log2.h> #include <linux/tty.h> #include <linux/serial.h> #include <linux/tty_driver.h> #include <linux/tty_flip.h> #include <linux/tty_ldisc.h> #include <linux/module.h> #include <linux/mutex.h> #include <linux/uaccess.h> #include <linux/usb.h> #include <linux/usb/cdc.h> #include <asm/byteorder.h> #include <linux/unaligned.h> #include <linux/idr.h> #include <linux/list.h> #include
"cdc-acm.h" #define DRIVER_AUTHOR "Armin Fuerst, Pavel Machek, Johannes Erdfelt, Vojtech Pavlik, David Kubicek, Johan Hovold" #define DRIVER_DESC "USB Abstract Control Model driver for USB modems and ISDN adapters" static struct usb_driver acm_driver; static struct tty_driver *acm_tty_driver; static DEFINE_IDR(acm_minors); static DEFINE_MUTEX(acm_minors_lock); static void acm_tty_set_termios(struct tty_struct *tty, const struct ktermios *termios_old); /* * acm_minors accessors */ /* * Look up an ACM structure by minor. If found and not disconnected, increment * its refcount and return it with its mutex held. */ static struct acm *acm_get_by_minor(unsigned int minor) { struct acm *acm; mutex_lock(&acm_minors_lock); acm = idr_find(&acm_minors, minor); if (acm) { mutex_lock(&acm->mutex); if (acm->disconnected) { mutex_unlock(&acm->mutex); acm = NULL; } else { tty_port_get(&acm->port); mutex_unlock(&acm->mutex); } } mutex_unlock(&acm_minors_lock); return acm; } /* * Try to find an available minor number and if found, associate it with 'acm'. */ static int acm_alloc_minor(struct acm *acm) { int minor; mutex_lock(&acm_minors_lock); minor = idr_alloc(&acm_minors, acm, 0, ACM_TTY_MINORS, GFP_KERNEL); mutex_unlock(&acm_minors_lock); return minor; } /* Release the minor number associated with 'acm'. */ static void acm_release_minor(struct acm *acm) { mutex_lock(&acm_minors_lock); idr_remove(&acm_minors, acm->minor); mutex_unlock(&acm_minors_lock); } /* * Functions for ACM control messages. */ static int acm_ctrl_msg(struct acm *acm, int request, int value, void *buf, int len) { int retval; retval = usb_autopm_get_interface(acm->control); if (retval) return retval; retval = usb_control_msg(acm->dev, usb_sndctrlpipe(acm->dev, 0), request, USB_RT_ACM, value, acm->control->altsetting[0].desc.bInterfaceNumber, buf, len, USB_CTRL_SET_TIMEOUT); dev_dbg(&acm->control->dev, "%s - rq 0x%02x, val %#x, len %#x, result %d\n", __func__, request, value, len, retval); usb_autopm_put_interface(acm->control); return retval < 0 ? retval : 0; } /* devices aren't required to support these requests. * the cdc acm descriptor tells whether they do... */ static inline int acm_set_control(struct acm *acm, int control) { if (acm->quirks & QUIRK_CONTROL_LINE_STATE) return -EOPNOTSUPP; return acm_ctrl_msg(acm, USB_CDC_REQ_SET_CONTROL_LINE_STATE, control, NULL, 0); } #define acm_set_line(acm, line) \ acm_ctrl_msg(acm, USB_CDC_REQ_SET_LINE_CODING, 0, line, sizeof *(line)) #define acm_send_break(acm, ms) \ acm_ctrl_msg(acm, USB_CDC_REQ_SEND_BREAK, ms, NULL, 0) static void acm_poison_urbs(struct acm *acm) { int i; usb_poison_urb(acm->ctrlurb); for (i = 0; i < ACM_NW; i++) usb_poison_urb(acm->wb[i].urb); for (i = 0; i < acm->rx_buflimit; i++) usb_poison_urb(acm->read_urbs[i]); } static void acm_unpoison_urbs(struct acm *acm) { int i; for (i = 0; i < acm->rx_buflimit; i++) usb_unpoison_urb(acm->read_urbs[i]); for (i = 0; i < ACM_NW; i++) usb_unpoison_urb(acm->wb[i].urb); usb_unpoison_urb(acm->ctrlurb); } /* * Write buffer management. * All of these assume proper locks taken by the caller. 
*/ static int acm_wb_alloc(struct acm *acm) { int i, wbn; struct acm_wb *wb; wbn = 0; i = 0; for (;;) { wb = &acm->wb[wbn]; if (!wb->use) { wb->use = true; wb->len = 0; return wbn; } wbn = (wbn + 1) % ACM_NW; if (++i >= ACM_NW) return -1; } } static int acm_wb_is_avail(struct acm *acm) { int i, n; unsigned long flags; n = ACM_NW; spin_lock_irqsave(&acm->write_lock, flags); for (i = 0; i < ACM_NW; i++) if(acm->wb[i].use) n--; spin_unlock_irqrestore(&acm->write_lock, flags); return n; } /* * Finish write. Caller must hold acm->write_lock */ static void acm_write_done(struct acm *acm, struct acm_wb *wb) { wb->use = false; acm->transmitting--; usb_autopm_put_interface_async(acm->control); } /* * Poke write. * * the caller is responsible for locking */ static int acm_start_wb(struct acm *acm, struct acm_wb *wb) { int rc; acm->transmitting++; wb->urb->transfer_buffer = wb->buf; wb->urb->transfer_dma = wb->dmah; wb->urb->transfer_buffer_length = wb->len; wb->urb->dev = acm->dev; rc = usb_submit_urb(wb->urb, GFP_ATOMIC); if (rc < 0) { if (rc != -EPERM) dev_err(&acm->data->dev, "%s - usb_submit_urb(write bulk) failed: %d\n", __func__, rc); acm_write_done(acm, wb); } return rc; } /* * attributes exported through sysfs */ static ssize_t bmCapabilities_show (struct device *dev, struct device_attribute *attr, char *buf) { struct usb_interface *intf = to_usb_interface(dev); struct acm *acm = usb_get_intfdata(intf); return sprintf(buf, "%d", acm->ctrl_caps); } static DEVICE_ATTR_RO(bmCapabilities); static ssize_t wCountryCodes_show (struct device *dev, struct device_attribute *attr, char *buf) { struct usb_interface *intf = to_usb_interface(dev); struct acm *acm = usb_get_intfdata(intf); memcpy(buf, acm->country_codes, acm->country_code_size); return acm->country_code_size; } static DEVICE_ATTR_RO(wCountryCodes); static ssize_t iCountryCodeRelDate_show (struct device *dev, struct device_attribute *attr, char *buf) { struct usb_interface *intf = to_usb_interface(dev); struct acm *acm = usb_get_intfdata(intf); return sprintf(buf, "%d", acm->country_rel_date); } static DEVICE_ATTR_RO(iCountryCodeRelDate); /* * Interrupt handlers for various ACM device responses */ static void acm_process_notification(struct acm *acm, unsigned char *buf) { int newctrl; int difference; unsigned long flags; struct usb_cdc_notification *dr = (struct usb_cdc_notification *)buf; unsigned char *data = buf + sizeof(struct usb_cdc_notification); switch (dr->bNotificationType) { case USB_CDC_NOTIFY_NETWORK_CONNECTION: dev_dbg(&acm->control->dev, "%s - network connection: %d\n", __func__, dr->wValue); break; case USB_CDC_NOTIFY_SERIAL_STATE: if (le16_to_cpu(dr->wLength) != 2) { dev_dbg(&acm->control->dev, "%s - malformed serial state\n", __func__); break; } newctrl = get_unaligned_le16(data); dev_dbg(&acm->control->dev, "%s - serial state: 0x%x\n", __func__, newctrl); if (!acm->clocal && (acm->ctrlin & ~newctrl & USB_CDC_SERIAL_STATE_DCD)) { dev_dbg(&acm->control->dev, "%s - calling hangup\n", __func__); tty_port_tty_hangup(&acm->port, false); } difference = acm->ctrlin ^ newctrl; if ((difference & USB_CDC_SERIAL_STATE_DCD) && acm->port.tty) { struct tty_ldisc *ld = tty_ldisc_ref(acm->port.tty); if (ld) { if (ld->ops->dcd_change) ld->ops->dcd_change(acm->port.tty, newctrl & USB_CDC_SERIAL_STATE_DCD); tty_ldisc_deref(ld); } } spin_lock_irqsave(&acm->read_lock, flags); acm->ctrlin = newctrl; acm->oldcount = acm->iocount; if (difference & USB_CDC_SERIAL_STATE_DSR) acm->iocount.dsr++; if (difference & USB_CDC_SERIAL_STATE_DCD) 
acm->iocount.dcd++; if (newctrl & USB_CDC_SERIAL_STATE_BREAK) { acm->iocount.brk++; tty_insert_flip_char(&acm->port, 0, TTY_BREAK); } if (newctrl & USB_CDC_SERIAL_STATE_RING_SIGNAL) acm->iocount.rng++; if (newctrl & USB_CDC_SERIAL_STATE_FRAMING) acm->iocount.frame++; if (newctrl & USB_CDC_SERIAL_STATE_PARITY) acm->iocount.parity++; if (newctrl & USB_CDC_SERIAL_STATE_OVERRUN) acm->iocount.overrun++; spin_unlock_irqrestore(&acm->read_lock, flags); if (newctrl & USB_CDC_SERIAL_STATE_BREAK) tty_flip_buffer_push(&acm->port); if (difference) wake_up_all(&acm->wioctl); break; default: dev_dbg(&acm->control->dev, "%s - unknown notification %d received: index %d len %d\n", __func__, dr->bNotificationType, dr->wIndex, dr->wLength); } } /* control interface reports status changes with "interrupt" transfers */ static void acm_ctrl_irq(struct urb *urb) { struct acm *acm = urb->context; struct usb_cdc_notification *dr; unsigned int current_size = urb->actual_length; unsigned int expected_size, copy_size, alloc_size; int retval; int status = urb->status; switch (status) { case 0: /* success */ break; case -ECONNRESET: case -ENOENT: case -ESHUTDOWN: /* this urb is terminated, clean up */ dev_dbg(&acm->control->dev, "%s - urb shutting down with status: %d\n", __func__, status); return; default: dev_dbg(&acm->control->dev, "%s - nonzero urb status received: %d\n", __func__, status); goto exit; } usb_mark_last_busy(acm->dev); if (acm->nb_index == 0) { /* * The first chunk of a message must contain at least the * notification header with the length field, otherwise we * can't get an expected_size. */ if (current_size < sizeof(struct usb_cdc_notification)) { dev_dbg(&acm->control->dev, "urb too short\n"); goto exit; } dr = urb->transfer_buffer; } else { dr = (struct usb_cdc_notification *)acm->notification_buffer; } /* size = notification-header + (optional) data */ expected_size = sizeof(struct usb_cdc_notification) + le16_to_cpu(dr->wLength); if (acm->nb_index != 0 || current_size < expected_size) { /* notification is transmitted fragmented, reassemble */ if (acm->nb_size < expected_size) { u8 *new_buffer; alloc_size = roundup_pow_of_two(expected_size); /* Final freeing is done on disconnect. 
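 * A notification longer than a single interrupt transfer arrives split
 * across several URB completions; the fragments are accumulated in this
 * buffer until expected_size bytes have been gathered.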
*/ new_buffer = krealloc(acm->notification_buffer, alloc_size, GFP_ATOMIC); if (!new_buffer) { acm->nb_index = 0; goto exit; } acm->notification_buffer = new_buffer; acm->nb_size = alloc_size; dr = (struct usb_cdc_notification *)acm->notification_buffer; } copy_size = min(current_size, expected_size - acm->nb_index); memcpy(&acm->notification_buffer[acm->nb_index], urb->transfer_buffer, copy_size); acm->nb_index += copy_size; current_size = acm->nb_index; } if (current_size >= expected_size) { /* notification complete */ acm_process_notification(acm, (unsigned char *)dr); acm->nb_index = 0; } exit: retval = usb_submit_urb(urb, GFP_ATOMIC); if (retval && retval != -EPERM && retval != -ENODEV) dev_err(&acm->control->dev, "%s - usb_submit_urb failed: %d\n", __func__, retval); else dev_vdbg(&acm->control->dev, "control resubmission terminated %d\n", retval); } static int acm_submit_read_urb(struct acm *acm, int index, gfp_t mem_flags) { int res; if (!test_and_clear_bit(index, &acm->read_urbs_free)) return 0; res = usb_submit_urb(acm->read_urbs[index], mem_flags); if (res) { if (res != -EPERM && res != -ENODEV) { dev_err(&acm->data->dev, "urb %d failed submission with %d\n", index, res); } else { dev_vdbg(&acm->data->dev, "intended failure %d\n", res); } set_bit(index, &acm->read_urbs_free); return res; } else { dev_vdbg(&acm->data->dev, "submitted urb %d\n", index); } return 0; } static int acm_submit_read_urbs(struct acm *acm, gfp_t mem_flags) { int res; int i; for (i = 0; i < acm->rx_buflimit; ++i) { res = acm_submit_read_urb(acm, i, mem_flags); if (res) return res; } return 0; } static void acm_process_read_urb(struct acm *acm, struct urb *urb) { unsigned long flags; if (!urb->actual_length) return; spin_lock_irqsave(&acm->read_lock, flags); tty_insert_flip_string(&acm->port, urb->transfer_buffer, urb->actual_length); spin_unlock_irqrestore(&acm->read_lock, flags); tty_flip_buffer_push(&acm->port); } static void acm_read_bulk_callback(struct urb *urb) { struct acm_rb *rb = urb->context; struct acm *acm = rb->instance; int status = urb->status; bool stopped = false; bool stalled = false; bool cooldown = false; dev_vdbg(&acm->data->dev, "got urb %d, len %d, status %d\n", rb->index, urb->actual_length, status); switch (status) { case 0: usb_mark_last_busy(acm->dev); acm_process_read_urb(acm, urb); break; case -EPIPE: set_bit(EVENT_RX_STALL, &acm->flags); stalled = true; break; case -ENOENT: case -ECONNRESET: case -ESHUTDOWN: dev_dbg(&acm->data->dev, "%s - urb shutting down with status: %d\n", __func__, status); stopped = true; break; case -EOVERFLOW: case -EPROTO: dev_dbg(&acm->data->dev, "%s - cooling babbling device\n", __func__); usb_mark_last_busy(acm->dev); set_bit(rb->index, &acm->urbs_in_error_delay); set_bit(ACM_ERROR_DELAY, &acm->flags); cooldown = true; break; default: dev_dbg(&acm->data->dev, "%s - nonzero urb status received: %d\n", __func__, status); break; } /* * Make sure URB processing is done before marking as free to avoid * racing with unthrottle() on another CPU. Matches the barriers * implied by the test_and_clear_bit() in acm_submit_read_urb(). */ smp_mb__before_atomic(); set_bit(rb->index, &acm->read_urbs_free); /* * Make sure URB is marked as free before checking the throttled flag * to avoid racing with unthrottle() on another CPU. Matches the * smp_mb() in unthrottle(). 
*/ smp_mb__after_atomic(); if (stopped || stalled || cooldown) { if (stalled) schedule_delayed_work(&acm->dwork, 0); else if (cooldown) schedule_delayed_work(&acm->dwork, HZ / 2); return; } if (test_bit(ACM_THROTTLED, &acm->flags)) return; acm_submit_read_urb(acm, rb->index, GFP_ATOMIC); } /* data interface wrote those outgoing bytes */ static void acm_write_bulk(struct urb *urb) { struct acm_wb *wb = urb->context; struct acm *acm = wb->instance; unsigned long flags; int status = urb->status; if (status || (urb->actual_length != urb->transfer_buffer_length)) dev_vdbg(&acm->data->dev, "wrote len %d/%d, status %d\n", urb->actual_length, urb->transfer_buffer_length, status); spin_lock_irqsave(&acm->write_lock, flags); acm_write_done(acm, wb); spin_unlock_irqrestore(&acm->write_lock, flags); set_bit(EVENT_TTY_WAKEUP, &acm->flags); schedule_delayed_work(&acm->dwork, 0); } static void acm_softint(struct work_struct *work) { int i; struct acm *acm = container_of(work, struct acm, dwork.work); if (test_bit(EVENT_RX_STALL, &acm->flags)) { smp_mb(); /* against acm_suspend() */ if (!acm->susp_count) { for (i = 0; i < acm->rx_buflimit; i++) usb_kill_urb(acm->read_urbs[i]); usb_clear_halt(acm->dev, acm->in); acm_submit_read_urbs(acm, GFP_KERNEL); clear_bit(EVENT_RX_STALL, &acm->flags); } } if (test_and_clear_bit(ACM_ERROR_DELAY, &acm->flags)) { for (i = 0; i < acm->rx_buflimit; i++) if (test_and_clear_bit(i, &acm->urbs_in_error_delay)) acm_submit_read_urb(acm, i, GFP_KERNEL); } if (test_and_clear_bit(EVENT_TTY_WAKEUP, &acm->flags)) tty_port_tty_wakeup(&acm->port); } /* * TTY handlers */ static int acm_tty_install(struct tty_driver *driver, struct tty_struct *tty) { struct acm *acm; int retval; acm = acm_get_by_minor(tty->index); if (!acm) return -ENODEV; retval = tty_standard_install(driver, tty); if (retval) goto error_init_termios; /* * Suppress initial echoing for some devices which might send data * immediately after acm driver has been installed. */ if (acm->quirks & DISABLE_ECHO) tty->termios.c_lflag &= ~ECHO; tty->driver_data = acm; return 0; error_init_termios: tty_port_put(&acm->port); return retval; } static int acm_tty_open(struct tty_struct *tty, struct file *filp) { struct acm *acm = tty->driver_data; return tty_port_open(&acm->port, tty, filp); } static void acm_port_dtr_rts(struct tty_port *port, bool active) { struct acm *acm = container_of(port, struct acm, port); int val; int res; if (active) val = USB_CDC_CTRL_DTR | USB_CDC_CTRL_RTS; else val = 0; /* FIXME: add missing ctrlout locking throughout driver */ acm->ctrlout = val; res = acm_set_control(acm, val); if (res && (acm->ctrl_caps & USB_CDC_CAP_LINE)) /* This is broken in too many devices to spam the logs */ dev_dbg(&acm->control->dev, "failed to set dtr/rts\n"); } static int acm_port_activate(struct tty_port *port, struct tty_struct *tty) { struct acm *acm = container_of(port, struct acm, port); int retval = -ENODEV; int i; mutex_lock(&acm->mutex); if (acm->disconnected) goto disconnected; retval = usb_autopm_get_interface(acm->control); if (retval) goto error_get_interface; set_bit(TTY_NO_WRITE_SPLIT, &tty->flags); acm->control->needs_remote_wakeup = 1; acm->ctrlurb->dev = acm->dev; retval = usb_submit_urb(acm->ctrlurb, GFP_KERNEL); if (retval) { dev_err(&acm->control->dev, "%s - usb_submit_urb(ctrl irq) failed\n", __func__); goto error_submit_urb; } acm_tty_set_termios(tty, NULL); /* * Unthrottle device in case the TTY was closed while throttled. 
*/ clear_bit(ACM_THROTTLED, &acm->flags); retval = acm_submit_read_urbs(acm, GFP_KERNEL); if (retval) goto error_submit_read_urbs; usb_autopm_put_interface(acm->control); mutex_unlock(&acm->mutex); return 0; error_submit_read_urbs: for (i = 0; i < acm->rx_buflimit; i++) usb_kill_urb(acm->read_urbs[i]); usb_kill_urb(acm->ctrlurb); error_submit_urb: usb_autopm_put_interface(acm->control); error_get_interface: disconnected: mutex_unlock(&acm->mutex); return usb_translate_errors(retval); } static void acm_port_destruct(struct tty_port *port) { struct acm *acm = container_of(port, struct acm, port); if (acm->minor != ACM_MINOR_INVALID) acm_release_minor(acm); usb_put_intf(acm->control); kfree(acm->country_codes); kfree(acm); } static void acm_port_shutdown(struct tty_port *port) { struct acm *acm = container_of(port, struct acm, port); struct urb *urb; struct acm_wb *wb; /* * Need to grab write_lock to prevent race with resume, but no need to * hold it due to the tty-port initialised flag. */ acm_poison_urbs(acm); spin_lock_irq(&acm->write_lock); spin_unlock_irq(&acm->write_lock); usb_autopm_get_interface_no_resume(acm->control); acm->control->needs_remote_wakeup = 0; usb_autopm_put_interface(acm->control); for (;;) { urb = usb_get_from_anchor(&acm->delayed); if (!urb) break; wb = urb->context; wb->use = false; usb_autopm_put_interface_async(acm->control); } acm_unpoison_urbs(acm); } static void acm_tty_cleanup(struct tty_struct *tty) { struct acm *acm = tty->driver_data; tty_port_put(&acm->port); } static void acm_tty_hangup(struct tty_struct *tty) { struct acm *acm = tty->driver_data; tty_port_hangup(&acm->port); } static void acm_tty_close(struct tty_struct *tty, struct file *filp) { struct acm *acm = tty->driver_data; tty_port_close(&acm->port, tty, filp); } static ssize_t acm_tty_write(struct tty_struct *tty, const u8 *buf, size_t count) { struct acm *acm = tty->driver_data; int stat; unsigned long flags; int wbn; struct acm_wb *wb; if (!count) return 0; dev_vdbg(&acm->data->dev, "%zu bytes from tty layer\n", count); spin_lock_irqsave(&acm->write_lock, flags); wbn = acm_wb_alloc(acm); if (wbn < 0) { spin_unlock_irqrestore(&acm->write_lock, flags); return 0; } wb = &acm->wb[wbn]; if (!acm->dev) { wb->use = false; spin_unlock_irqrestore(&acm->write_lock, flags); return -ENODEV; } count = (count > acm->writesize) ? acm->writesize : count; dev_vdbg(&acm->data->dev, "writing %zu bytes\n", count); memcpy(wb->buf, buf, count); wb->len = count; stat = usb_autopm_get_interface_async(acm->control); if (stat) { wb->use = false; spin_unlock_irqrestore(&acm->write_lock, flags); return stat; } if (acm->susp_count) { usb_anchor_urb(wb->urb, &acm->delayed); spin_unlock_irqrestore(&acm->write_lock, flags); return count; } stat = acm_start_wb(acm, wb); spin_unlock_irqrestore(&acm->write_lock, flags); if (stat < 0) return stat; return count; } static unsigned int acm_tty_write_room(struct tty_struct *tty) { struct acm *acm = tty->driver_data; /* * Do not let the line discipline to know that we have a reserve, * or it might get too enthusiastic. */ return acm_wb_is_avail(acm) ? 
acm->writesize : 0; } static void acm_tty_flush_buffer(struct tty_struct *tty) { struct acm *acm = tty->driver_data; unsigned long flags; int i; spin_lock_irqsave(&acm->write_lock, flags); for (i = 0; i < ACM_NW; i++) if (acm->wb[i].use) usb_unlink_urb(acm->wb[i].urb); spin_unlock_irqrestore(&acm->write_lock, flags); } static unsigned int acm_tty_chars_in_buffer(struct tty_struct *tty) { struct acm *acm = tty->driver_data; /* * if the device was unplugged then any remaining characters fell out * of the connector ;) */ if (acm->disconnected) return 0; /* * This is inaccurate (overcounts), but it works. */ return (ACM_NW - acm_wb_is_avail(acm)) * acm->writesize; } static void acm_tty_throttle(struct tty_struct *tty) { struct acm *acm = tty->driver_data; set_bit(ACM_THROTTLED, &acm->flags); } static void acm_tty_unthrottle(struct tty_struct *tty) { struct acm *acm = tty->driver_data; clear_bit(ACM_THROTTLED, &acm->flags); /* Matches the smp_mb__after_atomic() in acm_read_bulk_callback(). */ smp_mb(); acm_submit_read_urbs(acm, GFP_KERNEL); } static int acm_tty_break_ctl(struct tty_struct *tty, int state) { struct acm *acm = tty->driver_data; int retval; if (!(acm->ctrl_caps & USB_CDC_CAP_BRK)) return -EOPNOTSUPP; retval = acm_send_break(acm, state ? 0xffff : 0); if (retval < 0) dev_dbg(&acm->control->dev, "%s - send break failed\n", __func__); return retval; } static int acm_tty_tiocmget(struct tty_struct *tty) { struct acm *acm = tty->driver_data; return (acm->ctrlout & USB_CDC_CTRL_DTR ? TIOCM_DTR : 0) | (acm->ctrlout & USB_CDC_CTRL_RTS ? TIOCM_RTS : 0) | (acm->ctrlin & USB_CDC_SERIAL_STATE_DSR ? TIOCM_DSR : 0) | (acm->ctrlin & USB_CDC_SERIAL_STATE_RING_SIGNAL ? TIOCM_RI : 0) | (acm->ctrlin & USB_CDC_SERIAL_STATE_DCD ? TIOCM_CD : 0) | TIOCM_CTS; } static int acm_tty_tiocmset(struct tty_struct *tty, unsigned int set, unsigned int clear) { struct acm *acm = tty->driver_data; unsigned int newctrl; newctrl = acm->ctrlout; set = (set & TIOCM_DTR ? USB_CDC_CTRL_DTR : 0) | (set & TIOCM_RTS ? USB_CDC_CTRL_RTS : 0); clear = (clear & TIOCM_DTR ? USB_CDC_CTRL_DTR : 0) | (clear & TIOCM_RTS ? USB_CDC_CTRL_RTS : 0); newctrl = (newctrl & ~clear) | set; if (acm->ctrlout == newctrl) return 0; return acm_set_control(acm, acm->ctrlout = newctrl); } static int get_serial_info(struct tty_struct *tty, struct serial_struct *ss) { struct acm *acm = tty->driver_data; ss->line = acm->minor; mutex_lock(&acm->port.mutex); ss->close_delay = jiffies_to_msecs(acm->port.close_delay) / 10; ss->closing_wait = acm->port.closing_wait == ASYNC_CLOSING_WAIT_NONE ? ASYNC_CLOSING_WAIT_NONE : jiffies_to_msecs(acm->port.closing_wait) / 10; mutex_unlock(&acm->port.mutex); return 0; } static int set_serial_info(struct tty_struct *tty, struct serial_struct *ss) { struct acm *acm = tty->driver_data; unsigned int closing_wait, close_delay; int retval = 0; close_delay = msecs_to_jiffies(ss->close_delay * 10); closing_wait = ss->closing_wait == ASYNC_CLOSING_WAIT_NONE ? 
ASYNC_CLOSING_WAIT_NONE : msecs_to_jiffies(ss->closing_wait * 10); mutex_lock(&acm->port.mutex); if (!capable(CAP_SYS_ADMIN)) { if ((close_delay != acm->port.close_delay) || (closing_wait != acm->port.closing_wait)) retval = -EPERM; } else { acm->port.close_delay = close_delay; acm->port.closing_wait = closing_wait; } mutex_unlock(&acm->port.mutex); return retval; } static int wait_serial_change(struct acm *acm, unsigned long arg) { int rv = 0; DECLARE_WAITQUEUE(wait, current); struct async_icount old, new; do { spin_lock_irq(&acm->read_lock); old = acm->oldcount; new = acm->iocount; acm->oldcount = new; spin_unlock_irq(&acm->read_lock); if ((arg & TIOCM_DSR) && old.dsr != new.dsr) break; if ((arg & TIOCM_CD) && old.dcd != new.dcd) break; if ((arg & TIOCM_RI) && old.rng != new.rng) break; add_wait_queue(&acm->wioctl, &wait); set_current_state(TASK_INTERRUPTIBLE); schedule(); remove_wait_queue(&acm->wioctl, &wait); if (acm->disconnected) { if (arg & TIOCM_CD) break; else rv = -ENODEV; } else { if (signal_pending(current)) rv = -ERESTARTSYS; } } while (!rv); return rv; } static int acm_tty_get_icount(struct tty_struct *tty, struct serial_icounter_struct *icount) { struct acm *acm = tty->driver_data; icount->dsr = acm->iocount.dsr; icount->rng = acm->iocount.rng; icount->dcd = acm->iocount.dcd; icount->frame = acm->iocount.frame; icount->overrun = acm->iocount.overrun; icount->parity = acm->iocount.parity; icount->brk = acm->iocount.brk; return 0; } static int acm_tty_ioctl(struct tty_struct *tty, unsigned int cmd, unsigned long arg) { struct acm *acm = tty->driver_data; int rv = -ENOIOCTLCMD; switch (cmd) { case TIOCMIWAIT: rv = usb_autopm_get_interface(acm->control); if (rv < 0) { rv = -EIO; break; } rv = wait_serial_change(acm, arg); usb_autopm_put_interface(acm->control); break; } return rv; } static void acm_tty_set_termios(struct tty_struct *tty, const struct ktermios *termios_old) { struct acm *acm = tty->driver_data; struct ktermios *termios = &tty->termios; struct usb_cdc_line_coding newline; int newctrl = acm->ctrlout; newline.dwDTERate = cpu_to_le32(tty_get_baud_rate(tty)); newline.bCharFormat = termios->c_cflag & CSTOPB ? 2 : 0; newline.bParityType = termios->c_cflag & PARENB ? (termios->c_cflag & PARODD ? 1 : 2) + (termios->c_cflag & CMSPAR ? 2 : 0) : 0; newline.bDataBits = tty_get_char_size(termios->c_cflag); /* FIXME: Needs to clear unsupported bits in the termios */ acm->clocal = ((termios->c_cflag & CLOCAL) != 0); if (C_BAUD(tty) == B0) { newline.dwDTERate = acm->line.dwDTERate; newctrl &= ~USB_CDC_CTRL_DTR; } else if (termios_old && (termios_old->c_cflag & CBAUD) == B0) { newctrl |= USB_CDC_CTRL_DTR; } if (newctrl != acm->ctrlout) acm_set_control(acm, acm->ctrlout = newctrl); if (memcmp(&acm->line, &newline, sizeof newline)) { memcpy(&acm->line, &newline, sizeof newline); dev_dbg(&acm->control->dev, "%s - set line: %d %d %d %d\n", __func__, le32_to_cpu(newline.dwDTERate), newline.bCharFormat, newline.bParityType, newline.bDataBits); acm_set_line(acm, &acm->line); } } static const struct tty_port_operations acm_port_ops = { .dtr_rts = acm_port_dtr_rts, .shutdown = acm_port_shutdown, .activate = acm_port_activate, .destruct = acm_port_destruct, }; /* * USB probe and disconnect routines. 
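 * acm_probe() locates the control and data interfaces (via the CDC union
 * descriptor, the call management descriptor, or per-device quirks),
 * allocates the acm state, URBs and DMA buffers, and registers the tty
 * device; acm_disconnect() poisons the URBs and tears everything down.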
*/ /* Little helpers: write/read buffers free */ static void acm_write_buffers_free(struct acm *acm) { int i; struct acm_wb *wb; for (wb = &acm->wb[0], i = 0; i < ACM_NW; i++, wb++) usb_free_coherent(acm->dev, acm->writesize, wb->buf, wb->dmah); } static void acm_read_buffers_free(struct acm *acm) { int i; for (i = 0; i < acm->rx_buflimit; i++) usb_free_coherent(acm->dev, acm->readsize, acm->read_buffers[i].base, acm->read_buffers[i].dma); } /* Little helper: write buffers allocate */ static int acm_write_buffers_alloc(struct acm *acm) { int i; struct acm_wb *wb; for (wb = &acm->wb[0], i = 0; i < ACM_NW; i++, wb++) { wb->buf = usb_alloc_coherent(acm->dev, acm->writesize, GFP_KERNEL, &wb->dmah); if (!wb->buf) { while (i != 0) { --i; --wb; usb_free_coherent(acm->dev, acm->writesize, wb->buf, wb->dmah); } return -ENOMEM; } } return 0; } static int acm_probe(struct usb_interface *intf, const struct usb_device_id *id) { struct usb_cdc_union_desc *union_header = NULL; struct usb_cdc_call_mgmt_descriptor *cmgmd = NULL; unsigned char *buffer = intf->altsetting->extra; int buflen = intf->altsetting->extralen; struct usb_interface *control_interface; struct usb_interface *data_interface; struct usb_endpoint_descriptor *epctrl = NULL; struct usb_endpoint_descriptor *epread = NULL; struct usb_endpoint_descriptor *epwrite = NULL; struct usb_device *usb_dev = interface_to_usbdev(intf); struct usb_cdc_parsed_header h; struct acm *acm; int minor; int ctrlsize, readsize; u8 *buf; int call_intf_num = -1; int data_intf_num = -1; unsigned long quirks; int num_rx_buf; int i; int combined_interfaces = 0; struct device *tty_dev; int rv = -ENOMEM; int res; /* normal quirks */ quirks = (unsigned long)id->driver_info; if (quirks == IGNORE_DEVICE) return -ENODEV; memset(&h, 0x00, sizeof(struct usb_cdc_parsed_header)); num_rx_buf = (quirks == SINGLE_RX_URB) ? 
1 : ACM_NR; /* handle quirks deadly to normal probing*/ if (quirks == NO_UNION_NORMAL) { data_interface = usb_ifnum_to_if(usb_dev, 1); control_interface = usb_ifnum_to_if(usb_dev, 0); /* we would crash */ if (!data_interface || !control_interface) return -ENODEV; goto skip_normal_probe; } /* normal probing*/ if (!buffer) { dev_err(&intf->dev, "Weird descriptor references\n"); return -EINVAL; } if (!buflen) { if (intf->cur_altsetting->endpoint && intf->cur_altsetting->endpoint->extralen && intf->cur_altsetting->endpoint->extra) { dev_dbg(&intf->dev, "Seeking extra descriptors on endpoint\n"); buflen = intf->cur_altsetting->endpoint->extralen; buffer = intf->cur_altsetting->endpoint->extra; } else { dev_err(&intf->dev, "Zero length descriptor references\n"); return -EINVAL; } } cdc_parse_cdc_header(&h, intf, buffer, buflen); union_header = h.usb_cdc_union_desc; cmgmd = h.usb_cdc_call_mgmt_descriptor; if (cmgmd) call_intf_num = cmgmd->bDataInterface; if (!union_header) { if (intf->cur_altsetting->desc.bNumEndpoints == 3) { dev_dbg(&intf->dev, "No union descriptor, assuming single interface\n"); combined_interfaces = 1; control_interface = data_interface = intf; goto look_for_collapsed_interface; } else if (call_intf_num > 0) { dev_dbg(&intf->dev, "No union descriptor, using call management descriptor\n"); data_intf_num = call_intf_num; data_interface = usb_ifnum_to_if(usb_dev, data_intf_num); control_interface = intf; } else { dev_dbg(&intf->dev, "No union descriptor, giving up\n"); return -ENODEV; } } else { int class = -1; data_intf_num = union_header->bSlaveInterface0; control_interface = usb_ifnum_to_if(usb_dev, union_header->bMasterInterface0); data_interface = usb_ifnum_to_if(usb_dev, data_intf_num); if (control_interface) class = control_interface->cur_altsetting->desc.bInterfaceClass; if (class != USB_CLASS_COMM && class != USB_CLASS_CDC_DATA) { dev_dbg(&intf->dev, "Broken union descriptor, assuming single interface\n"); combined_interfaces = 1; control_interface = data_interface = intf; goto look_for_collapsed_interface; } } if (!control_interface || !data_interface) { dev_dbg(&intf->dev, "no interfaces\n"); return -ENODEV; } if (data_intf_num != call_intf_num) dev_dbg(&intf->dev, "Separate call control interface. 
That is not fully supported.\n"); if (control_interface == data_interface) { /* some broken devices designed for windows work this way */ dev_warn(&intf->dev,"Control and data interfaces are not separated!\n"); combined_interfaces = 1; /* a popular other OS doesn't use it */ quirks |= NO_CAP_LINE; if (data_interface->cur_altsetting->desc.bNumEndpoints != 3) { dev_err(&intf->dev, "This needs exactly 3 endpoints\n"); return -EINVAL; } look_for_collapsed_interface: res = usb_find_common_endpoints(data_interface->cur_altsetting, &epread, &epwrite, &epctrl, NULL); if (res) return res; goto made_compressed_probe; } skip_normal_probe: /*workaround for switched interfaces */ if (data_interface->cur_altsetting->desc.bInterfaceClass != USB_CLASS_CDC_DATA) { if (control_interface->cur_altsetting->desc.bInterfaceClass == USB_CLASS_CDC_DATA) { dev_dbg(&intf->dev, "Your device has switched interfaces.\n"); swap(control_interface, data_interface); } else { return -EINVAL; } } /* Accept probe requests only for the control interface */ if (!combined_interfaces && intf != control_interface) return -ENODEV; if (data_interface->cur_altsetting->desc.bNumEndpoints < 2 || control_interface->cur_altsetting->desc.bNumEndpoints == 0) return -EINVAL; epctrl = &control_interface->cur_altsetting->endpoint[0].desc; epread = &data_interface->cur_altsetting->endpoint[0].desc; epwrite = &data_interface->cur_altsetting->endpoint[1].desc; /* workaround for switched endpoints */ if (!usb_endpoint_dir_in(epread)) { /* descriptors are swapped */ dev_dbg(&intf->dev, "The data interface has switched endpoints\n"); swap(epread, epwrite); } made_compressed_probe: dev_dbg(&intf->dev, "interfaces are valid\n"); acm = kzalloc(sizeof(struct acm), GFP_KERNEL); if (!acm) return -ENOMEM; tty_port_init(&acm->port); acm->port.ops = &acm_port_ops; ctrlsize = usb_endpoint_maxp(epctrl); readsize = usb_endpoint_maxp(epread) * (quirks == SINGLE_RX_URB ? 
1 : 2); acm->combined_interfaces = combined_interfaces; acm->writesize = usb_endpoint_maxp(epwrite) * 20; acm->control = control_interface; acm->data = data_interface; usb_get_intf(acm->control); /* undone in destruct() */ minor = acm_alloc_minor(acm); if (minor < 0) { acm->minor = ACM_MINOR_INVALID; goto err_put_port; } acm->minor = minor; acm->dev = usb_dev; if (h.usb_cdc_acm_descriptor) acm->ctrl_caps = h.usb_cdc_acm_descriptor->bmCapabilities; if (quirks & NO_CAP_LINE) acm->ctrl_caps &= ~USB_CDC_CAP_LINE; acm->ctrlsize = ctrlsize; acm->readsize = readsize; acm->rx_buflimit = num_rx_buf; INIT_DELAYED_WORK(&acm->dwork, acm_softint); init_waitqueue_head(&acm->wioctl); spin_lock_init(&acm->write_lock); spin_lock_init(&acm->read_lock); mutex_init(&acm->mutex); if (usb_endpoint_xfer_int(epread)) { acm->bInterval = epread->bInterval; acm->in = usb_rcvintpipe(usb_dev, epread->bEndpointAddress); } else { acm->in = usb_rcvbulkpipe(usb_dev, epread->bEndpointAddress); } if (usb_endpoint_xfer_int(epwrite)) acm->out = usb_sndintpipe(usb_dev, epwrite->bEndpointAddress); else acm->out = usb_sndbulkpipe(usb_dev, epwrite->bEndpointAddress); init_usb_anchor(&acm->delayed); acm->quirks = quirks; buf = usb_alloc_coherent(usb_dev, ctrlsize, GFP_KERNEL, &acm->ctrl_dma); if (!buf) goto err_put_port; acm->ctrl_buffer = buf; if (acm_write_buffers_alloc(acm) < 0) goto err_free_ctrl_buffer; acm->ctrlurb = usb_alloc_urb(0, GFP_KERNEL); if (!acm->ctrlurb) goto err_free_write_buffers; for (i = 0; i < num_rx_buf; i++) { struct acm_rb *rb = &(acm->read_buffers[i]); struct urb *urb; rb->base = usb_alloc_coherent(acm->dev, readsize, GFP_KERNEL, &rb->dma); if (!rb->base) goto err_free_read_urbs; rb->index = i; rb->instance = acm; urb = usb_alloc_urb(0, GFP_KERNEL); if (!urb) goto err_free_read_urbs; urb->transfer_flags |= URB_NO_TRANSFER_DMA_MAP; urb->transfer_dma = rb->dma; if (usb_endpoint_xfer_int(epread)) usb_fill_int_urb(urb, acm->dev, acm->in, rb->base, acm->readsize, acm_read_bulk_callback, rb, acm->bInterval); else usb_fill_bulk_urb(urb, acm->dev, acm->in, rb->base, acm->readsize, acm_read_bulk_callback, rb); acm->read_urbs[i] = urb; __set_bit(i, &acm->read_urbs_free); } for (i = 0; i < ACM_NW; i++) { struct acm_wb *snd = &(acm->wb[i]); snd->urb = usb_alloc_urb(0, GFP_KERNEL); if (!snd->urb) goto err_free_write_urbs; if (usb_endpoint_xfer_int(epwrite)) usb_fill_int_urb(snd->urb, usb_dev, acm->out, NULL, acm->writesize, acm_write_bulk, snd, epwrite->bInterval); else usb_fill_bulk_urb(snd->urb, usb_dev, acm->out, NULL, acm->writesize, acm_write_bulk, snd); snd->urb->transfer_flags |= URB_NO_TRANSFER_DMA_MAP; if (quirks & SEND_ZERO_PACKET) snd->urb->transfer_flags |= URB_ZERO_PACKET; snd->instance = acm; } usb_set_intfdata(intf, acm); i = device_create_file(&intf->dev, &dev_attr_bmCapabilities); if (i < 0) goto err_free_write_urbs; if (h.usb_cdc_country_functional_desc) { /* export the country data */ struct usb_cdc_country_functional_desc * cfd = h.usb_cdc_country_functional_desc; acm->country_codes = kmalloc(cfd->bLength - 4, GFP_KERNEL); if (!acm->country_codes) goto skip_countries; acm->country_code_size = cfd->bLength - 4; memcpy(acm->country_codes, (u8 *)&cfd->wCountyCode0, cfd->bLength - 4); acm->country_rel_date = cfd->iCountryCodeRelDate; i = device_create_file(&intf->dev, &dev_attr_wCountryCodes); if (i < 0) { kfree(acm->country_codes); acm->country_codes = NULL; acm->country_code_size = 0; goto skip_countries; } i = device_create_file(&intf->dev, &dev_attr_iCountryCodeRelDate); if (i < 0) { 
device_remove_file(&intf->dev, &dev_attr_wCountryCodes); kfree(acm->country_codes); acm->country_codes = NULL; acm->country_code_size = 0; goto skip_countries; } } skip_countries: usb_fill_int_urb(acm->ctrlurb, usb_dev, usb_rcvintpipe(usb_dev, epctrl->bEndpointAddress), acm->ctrl_buffer, ctrlsize, acm_ctrl_irq, acm, /* works around buggy devices */ epctrl->bInterval ? epctrl->bInterval : 16); acm->ctrlurb->transfer_flags |= URB_NO_TRANSFER_DMA_MAP; acm->ctrlurb->transfer_dma = acm->ctrl_dma; acm->notification_buffer = NULL; acm->nb_index = 0; acm->nb_size = 0; acm->line.dwDTERate = cpu_to_le32(9600); acm->line.bDataBits = 8; acm_set_line(acm, &acm->line); if (!acm->combined_interfaces) { rv = usb_driver_claim_interface(&acm_driver, data_interface, acm); if (rv) goto err_remove_files; } tty_dev = tty_port_register_device(&acm->port, acm_tty_driver, minor, &control_interface->dev); if (IS_ERR(tty_dev)) { rv = PTR_ERR(tty_dev); goto err_release_data_interface; } if (quirks & CLEAR_HALT_CONDITIONS) { usb_clear_halt(usb_dev, acm->in); usb_clear_halt(usb_dev, acm->out); } dev_info(&intf->dev, "ttyACM%d: USB ACM device\n", minor); return 0; err_release_data_interface: if (!acm->combined_interfaces) { /* Clear driver data so that disconnect() returns early. */ usb_set_intfdata(data_interface, NULL); usb_driver_release_interface(&acm_driver, data_interface); } err_remove_files: if (acm->country_codes) { device_remove_file(&acm->control->dev, &dev_attr_wCountryCodes); device_remove_file(&acm->control->dev, &dev_attr_iCountryCodeRelDate); } device_remove_file(&acm->control->dev, &dev_attr_bmCapabilities); err_free_write_urbs: for (i = 0; i < ACM_NW; i++) usb_free_urb(acm->wb[i].urb); err_free_read_urbs: for (i = 0; i < num_rx_buf; i++) usb_free_urb(acm->read_urbs[i]); acm_read_buffers_free(acm); usb_free_urb(acm->ctrlurb); err_free_write_buffers: acm_write_buffers_free(acm); err_free_ctrl_buffer: usb_free_coherent(usb_dev, ctrlsize, acm->ctrl_buffer, acm->ctrl_dma); err_put_port: tty_port_put(&acm->port); return rv; } static void acm_disconnect(struct usb_interface *intf) { struct acm *acm = usb_get_intfdata(intf); struct tty_struct *tty; int i; /* sibling interface is already cleaning up */ if (!acm) return; acm->disconnected = true; /* * there is a circular dependency. acm_softint() can resubmit * the URBs in error handling so we need to block any * submission right away */ acm_poison_urbs(acm); mutex_lock(&acm->mutex); if (acm->country_codes) { device_remove_file(&acm->control->dev, &dev_attr_wCountryCodes); device_remove_file(&acm->control->dev, &dev_attr_iCountryCodeRelDate); } wake_up_all(&acm->wioctl); device_remove_file(&acm->control->dev, &dev_attr_bmCapabilities); usb_set_intfdata(acm->control, NULL); usb_set_intfdata(acm->data, NULL); mutex_unlock(&acm->mutex); tty = tty_port_tty_get(&acm->port); if (tty) { tty_vhangup(tty); tty_kref_put(tty); } cancel_delayed_work_sync(&acm->dwork); tty_unregister_device(acm_tty_driver, acm->minor); usb_free_urb(acm->ctrlurb); for (i = 0; i < ACM_NW; i++) usb_free_urb(acm->wb[i].urb); for (i = 0; i < acm->rx_buflimit; i++) usb_free_urb(acm->read_urbs[i]); acm_write_buffers_free(acm); usb_free_coherent(acm->dev, acm->ctrlsize, acm->ctrl_buffer, acm->ctrl_dma); acm_read_buffers_free(acm); kfree(acm->notification_buffer); if (!acm->combined_interfaces) usb_driver_release_interface(&acm_driver, intf == acm->control ? 
acm->data : acm->control); tty_port_put(&acm->port); } #ifdef CONFIG_PM static int acm_suspend(struct usb_interface *intf, pm_message_t message) { struct acm *acm = usb_get_intfdata(intf); int cnt; spin_lock_irq(&acm->write_lock); if (PMSG_IS_AUTO(message)) { if (acm->transmitting) { spin_unlock_irq(&acm->write_lock); return -EBUSY; } } cnt = acm->susp_count++; spin_unlock_irq(&acm->write_lock); if (cnt) return 0; acm_poison_urbs(acm); cancel_delayed_work_sync(&acm->dwork); acm->urbs_in_error_delay = 0; return 0; } static int acm_resume(struct usb_interface *intf) { struct acm *acm = usb_get_intfdata(intf); struct urb *urb; int rv = 0; spin_lock_irq(&acm->write_lock); if (--acm->susp_count) goto out; acm_unpoison_urbs(acm); if (tty_port_initialized(&acm->port)) { rv = usb_submit_urb(acm->ctrlurb, GFP_ATOMIC); for (;;) { urb = usb_get_from_anchor(&acm->delayed); if (!urb) break; acm_start_wb(acm, urb->context); } /* * delayed error checking because we must * do the write path at all cost */ if (rv < 0) goto out; rv = acm_submit_read_urbs(acm, GFP_ATOMIC); } out: spin_unlock_irq(&acm->write_lock); return rv; } static int acm_reset_resume(struct usb_interface *intf) { struct acm *acm = usb_get_intfdata(intf); if (tty_port_initialized(&acm->port)) tty_port_tty_hangup(&acm->port, false); return acm_resume(intf); } #endif /* CONFIG_PM */ static int acm_pre_reset(struct usb_interface *intf) { struct acm *acm = usb_get_intfdata(intf); clear_bit(EVENT_RX_STALL, &acm->flags); acm->nb_index = 0; /* pending control transfers are lost */ return 0; } #define NOKIA_PCSUITE_ACM_INFO(x) \ USB_DEVICE_AND_INTERFACE_INFO(0x0421, x, \ USB_CLASS_COMM, USB_CDC_SUBCLASS_ACM, \ USB_CDC_ACM_PROTO_VENDOR) #define SAMSUNG_PCSUITE_ACM_INFO(x) \ USB_DEVICE_AND_INTERFACE_INFO(0x04e7, x, \ USB_CLASS_COMM, USB_CDC_SUBCLASS_ACM, \ USB_CDC_ACM_PROTO_VENDOR) /* * USB driver structure. */ static const struct usb_device_id acm_ids[] = { /* quirky and broken devices */ { USB_DEVICE(0x0424, 0x274e), /* Microchip Technology, Inc. 
(formerly SMSC) */ .driver_info = DISABLE_ECHO, }, /* DISABLE ECHO in termios flag */ { USB_DEVICE(0x076d, 0x0006), /* Denso Cradle CU-321 */ .driver_info = NO_UNION_NORMAL, },/* has no union descriptor */ { USB_DEVICE(0x17ef, 0x7000), /* Lenovo USB modem */ .driver_info = NO_UNION_NORMAL, },/* has no union descriptor */ { USB_DEVICE(0x0870, 0x0001), /* Metricom GS Modem */ .driver_info = NO_UNION_NORMAL, /* has no union descriptor */ }, { USB_DEVICE(0x045b, 0x023c), /* Renesas R-Car H3 USB Download mode */ .driver_info = DISABLE_ECHO, /* Don't echo banner */ }, { USB_DEVICE(0x045b, 0x0247), /* Renesas R-Car D3 USB Download mode */ .driver_info = DISABLE_ECHO, /* Don't echo banner */ }, { USB_DEVICE(0x045b, 0x0248), /* Renesas R-Car M3-N USB Download mode */ .driver_info = DISABLE_ECHO, /* Don't echo banner */ }, { USB_DEVICE(0x045b, 0x024D), /* Renesas R-Car E3 USB Download mode */ .driver_info = DISABLE_ECHO, /* Don't echo banner */ }, { USB_DEVICE(0x0e8d, 0x0003), /* FIREFLY, MediaTek Inc; andrey.arapov@gmail.com */ .driver_info = NO_UNION_NORMAL, /* has no union descriptor */ }, { USB_DEVICE(0x0e8d, 0x2000), /* MediaTek Inc Preloader */ .driver_info = DISABLE_ECHO, /* DISABLE ECHO in termios flag */ }, { USB_DEVICE(0x0e8d, 0x3329), /* MediaTek Inc GPS */ .driver_info = NO_UNION_NORMAL, /* has no union descriptor */ }, { USB_DEVICE(0x0482, 0x0203), /* KYOCERA AH-K3001V */ .driver_info = NO_UNION_NORMAL, /* has no union descriptor */ }, { USB_DEVICE(0x079b, 0x000f), /* BT On-Air USB MODEM */ .driver_info = NO_UNION_NORMAL, /* has no union descriptor */ }, { USB_DEVICE(0x0ace, 0x1602), /* ZyDAS 56K USB MODEM */ .driver_info = SINGLE_RX_URB, }, { USB_DEVICE(0x0ace, 0x1608), /* ZyDAS 56K USB MODEM */ .driver_info = SINGLE_RX_URB, /* firmware bug */ }, { USB_DEVICE(0x0ace, 0x1611), /* ZyDAS 56K USB MODEM - new version */ .driver_info = SINGLE_RX_URB, /* firmware bug */ }, { USB_DEVICE(0x11ca, 0x0201), /* VeriFone Mx870 Gadget Serial */ .driver_info = SINGLE_RX_URB, }, { USB_DEVICE(0x1901, 0x0006), /* GE Healthcare Patient Monitor UI Controller */ .driver_info = DISABLE_ECHO, /* DISABLE ECHO in termios flag */ }, { USB_DEVICE(0x1965, 0x0018), /* Uniden UBC125XLT */ .driver_info = NO_UNION_NORMAL, /* has no union descriptor */ }, { USB_DEVICE(0x22b8, 0x7000), /* Motorola Q Phone */ .driver_info = NO_UNION_NORMAL, /* has no union descriptor */ }, { USB_DEVICE(0x0803, 0x3095), /* Zoom Telephonics Model 3095F USB MODEM */ .driver_info = NO_UNION_NORMAL, /* has no union descriptor */ }, { USB_DEVICE(0x0572, 0x1321), /* Conexant USB MODEM CX93010 */ .driver_info = NO_UNION_NORMAL, /* has no union descriptor */ }, { USB_DEVICE(0x0572, 0x1324), /* Conexant USB MODEM RD02-D400 */ .driver_info = NO_UNION_NORMAL, /* has no union descriptor */ }, { USB_DEVICE(0x0572, 0x1328), /* Shiro / Aztech USB MODEM UM-3100 */ .driver_info = NO_UNION_NORMAL, /* has no union descriptor */ }, { USB_DEVICE(0x0572, 0x1349), /* Hiro (Conexant) USB MODEM H50228 */ .driver_info = NO_UNION_NORMAL, /* has no union descriptor */ }, { USB_DEVICE(0x20df, 0x0001), /* Simtec Electronics Entropy Key */ .driver_info = QUIRK_CONTROL_LINE_STATE, }, { USB_DEVICE(0x2184, 0x001c) }, /* GW Instek AFG-2225 */ { USB_DEVICE(0x2184, 0x0036) }, /* GW Instek AFG-125 */ { USB_DEVICE(0x22b8, 0x6425), /* Motorola MOTOMAGX phones */ }, /* Motorola H24 HSPA module: */ { USB_DEVICE(0x22b8, 0x2d91) }, /* modem */ { USB_DEVICE(0x22b8, 0x2d92), /* modem + diagnostics */ .driver_info = NO_UNION_NORMAL, /* handle only modem interface */ }, { 
USB_DEVICE(0x22b8, 0x2d93), /* modem + AT port */ .driver_info = NO_UNION_NORMAL, /* handle only modem interface */ }, { USB_DEVICE(0x22b8, 0x2d95), /* modem + AT port + diagnostics */ .driver_info = NO_UNION_NORMAL, /* handle only modem interface */ }, { USB_DEVICE(0x22b8, 0x2d96), /* modem + NMEA */ .driver_info = NO_UNION_NORMAL, /* handle only modem interface */ }, { USB_DEVICE(0x22b8, 0x2d97), /* modem + diagnostics + NMEA */ .driver_info = NO_UNION_NORMAL, /* handle only modem interface */ }, { USB_DEVICE(0x22b8, 0x2d99), /* modem + AT port + NMEA */ .driver_info = NO_UNION_NORMAL, /* handle only modem interface */ }, { USB_DEVICE(0x22b8, 0x2d9a), /* modem + AT port + diagnostics + NMEA */ .driver_info = NO_UNION_NORMAL, /* handle only modem interface */ }, { USB_DEVICE(0x0572, 0x1329), /* Hummingbird huc56s (Conexant) */ .driver_info = NO_UNION_NORMAL, /* union descriptor misplaced on data interface instead of communications interface. Maybe we should define a new quirk for this. */ }, { USB_DEVICE(0x0572, 0x1340), /* Conexant CX93010-2x UCMxx */ .driver_info = NO_UNION_NORMAL, }, { USB_DEVICE(0x05f9, 0x4002), /* PSC Scanning, Magellan 800i */ .driver_info = NO_UNION_NORMAL, }, { USB_DEVICE(0x1bbb, 0x0003), /* Alcatel OT-I650 */ .driver_info = NO_UNION_NORMAL, /* reports zero length descriptor */ }, { USB_DEVICE(0x1576, 0x03b1), /* Maretron USB100 */ .driver_info = NO_UNION_NORMAL, /* reports zero length descriptor */ }, { USB_DEVICE(0xfff0, 0x0100), /* DATECS FP-2000 */ .driver_info = NO_UNION_NORMAL, /* reports zero length descriptor */ }, { USB_DEVICE(0x09d8, 0x0320), /* Elatec GmbH TWN3 */ .driver_info = NO_UNION_NORMAL, /* has misplaced union descriptor */ }, { USB_DEVICE(0x0c26, 0x0020), /* Icom ICF3400 Serie */ .driver_info = NO_UNION_NORMAL, /* reports zero length descriptor */ }, { USB_DEVICE(0x0ca6, 0xa050), /* Castles VEGA3000 */ .driver_info = NO_UNION_NORMAL, /* reports zero length descriptor */ }, { USB_DEVICE(0x2912, 0x0001), /* ATOL FPrint */ .driver_info = CLEAR_HALT_CONDITIONS, }, /* Nokia S60 phones expose two ACM channels. The first is * a modem and is picked up by the standard AT-command * information below. The second is 'vendor-specific' but * is treated as a serial device at the S60 end, so we want * to expose it on Linux too. 
*/ { NOKIA_PCSUITE_ACM_INFO(0x042D), }, /* Nokia 3250 */ { NOKIA_PCSUITE_ACM_INFO(0x04D8), }, /* Nokia 5500 Sport */ { NOKIA_PCSUITE_ACM_INFO(0x04C9), }, /* Nokia E50 */ { NOKIA_PCSUITE_ACM_INFO(0x0419), }, /* Nokia E60 */ { NOKIA_PCSUITE_ACM_INFO(0x044D), }, /* Nokia E61 */ { NOKIA_PCSUITE_ACM_INFO(0x0001), }, /* Nokia E61i */ { NOKIA_PCSUITE_ACM_INFO(0x0475), }, /* Nokia E62 */ { NOKIA_PCSUITE_ACM_INFO(0x0508), }, /* Nokia E65 */ { NOKIA_PCSUITE_ACM_INFO(0x0418), }, /* Nokia E70 */ { NOKIA_PCSUITE_ACM_INFO(0x0425), }, /* Nokia N71 */ { NOKIA_PCSUITE_ACM_INFO(0x0486), }, /* Nokia N73 */ { NOKIA_PCSUITE_ACM_INFO(0x04DF), }, /* Nokia N75 */ { NOKIA_PCSUITE_ACM_INFO(0x000e), }, /* Nokia N77 */ { NOKIA_PCSUITE_ACM_INFO(0x0445), }, /* Nokia N80 */ { NOKIA_PCSUITE_ACM_INFO(0x042F), }, /* Nokia N91 & N91 8GB */ { NOKIA_PCSUITE_ACM_INFO(0x048E), }, /* Nokia N92 */ { NOKIA_PCSUITE_ACM_INFO(0x0420), }, /* Nokia N93 */ { NOKIA_PCSUITE_ACM_INFO(0x04E6), }, /* Nokia N93i */ { NOKIA_PCSUITE_ACM_INFO(0x04B2), }, /* Nokia 5700 XpressMusic */ { NOKIA_PCSUITE_ACM_INFO(0x0134), }, /* Nokia 6110 Navigator (China) */ { NOKIA_PCSUITE_ACM_INFO(0x046E), }, /* Nokia 6110 Navigator */ { NOKIA_PCSUITE_ACM_INFO(0x002f), }, /* Nokia 6120 classic & */ { NOKIA_PCSUITE_ACM_INFO(0x0088), }, /* Nokia 6121 classic */ { NOKIA_PCSUITE_ACM_INFO(0x00fc), }, /* Nokia 6124 classic */ { NOKIA_PCSUITE_ACM_INFO(0x0042), }, /* Nokia E51 */ { NOKIA_PCSUITE_ACM_INFO(0x00b0), }, /* Nokia E66 */ { NOKIA_PCSUITE_ACM_INFO(0x00ab), }, /* Nokia E71 */ { NOKIA_PCSUITE_ACM_INFO(0x0481), }, /* Nokia N76 */ { NOKIA_PCSUITE_ACM_INFO(0x0007), }, /* Nokia N81 & N81 8GB */ { NOKIA_PCSUITE_ACM_INFO(0x0071), }, /* Nokia N82 */ { NOKIA_PCSUITE_ACM_INFO(0x04F0), }, /* Nokia N95 & N95-3 NAM */ { NOKIA_PCSUITE_ACM_INFO(0x0070), }, /* Nokia N95 8GB */ { NOKIA_PCSUITE_ACM_INFO(0x0099), }, /* Nokia 6210 Navigator, RM-367 */ { NOKIA_PCSUITE_ACM_INFO(0x0128), }, /* Nokia 6210 Navigator, RM-419 */ { NOKIA_PCSUITE_ACM_INFO(0x008f), }, /* Nokia 6220 Classic */ { NOKIA_PCSUITE_ACM_INFO(0x00a0), }, /* Nokia 6650 */ { NOKIA_PCSUITE_ACM_INFO(0x007b), }, /* Nokia N78 */ { NOKIA_PCSUITE_ACM_INFO(0x0094), }, /* Nokia N85 */ { NOKIA_PCSUITE_ACM_INFO(0x003a), }, /* Nokia N96 & N96-3 */ { NOKIA_PCSUITE_ACM_INFO(0x00e9), }, /* Nokia 5320 XpressMusic */ { NOKIA_PCSUITE_ACM_INFO(0x0108), }, /* Nokia 5320 XpressMusic 2G */ { NOKIA_PCSUITE_ACM_INFO(0x01f5), }, /* Nokia N97, RM-505 */ { NOKIA_PCSUITE_ACM_INFO(0x02e3), }, /* Nokia 5230, RM-588 */ { NOKIA_PCSUITE_ACM_INFO(0x0178), }, /* Nokia E63 */ { NOKIA_PCSUITE_ACM_INFO(0x010e), }, /* Nokia E75 */ { NOKIA_PCSUITE_ACM_INFO(0x02d9), }, /* Nokia 6760 Slide */ { NOKIA_PCSUITE_ACM_INFO(0x01d0), }, /* Nokia E52 */ { NOKIA_PCSUITE_ACM_INFO(0x0223), }, /* Nokia E72 */ { NOKIA_PCSUITE_ACM_INFO(0x0275), }, /* Nokia X6 */ { NOKIA_PCSUITE_ACM_INFO(0x026c), }, /* Nokia N97 Mini */ { NOKIA_PCSUITE_ACM_INFO(0x0154), }, /* Nokia 5800 XpressMusic */ { NOKIA_PCSUITE_ACM_INFO(0x04ce), }, /* Nokia E90 */ { NOKIA_PCSUITE_ACM_INFO(0x01d4), }, /* Nokia E55 */ { NOKIA_PCSUITE_ACM_INFO(0x0302), }, /* Nokia N8 */ { NOKIA_PCSUITE_ACM_INFO(0x0335), }, /* Nokia E7 */ { NOKIA_PCSUITE_ACM_INFO(0x03cd), }, /* Nokia C7 */ { SAMSUNG_PCSUITE_ACM_INFO(0x6651), }, /* Samsung GTi8510 (INNOV8) */ /* Support for Owen devices */ { USB_DEVICE(0x03eb, 0x0030), }, /* Owen SI30 */ /* NOTE: non-Nokia COMM/ACM/0xff is likely MSFT RNDIS... NOT a modem! 
*/ #if IS_ENABLED(CONFIG_INPUT_IMS_PCU) { USB_DEVICE(0x04d8, 0x0082), /* Application mode */ .driver_info = IGNORE_DEVICE, }, { USB_DEVICE(0x04d8, 0x0083), /* Bootloader mode */ .driver_info = IGNORE_DEVICE, }, #endif #if IS_ENABLED(CONFIG_IR_TOY) { USB_DEVICE(0x04d8, 0xfd08), .driver_info = IGNORE_DEVICE, }, { USB_DEVICE(0x04d8, 0xf58b), .driver_info = IGNORE_DEVICE, }, #endif #if IS_ENABLED(CONFIG_USB_SERIAL_XR) { USB_DEVICE(0x04e2, 0x1400), .driver_info = IGNORE_DEVICE }, { USB_DEVICE(0x04e2, 0x1401), .driver_info = IGNORE_DEVICE }, { USB_DEVICE(0x04e2, 0x1402), .driver_info = IGNORE_DEVICE }, { USB_DEVICE(0x04e2, 0x1403), .driver_info = IGNORE_DEVICE }, { USB_DEVICE(0x04e2, 0x1410), .driver_info = IGNORE_DEVICE }, { USB_DEVICE(0x04e2, 0x1411), .driver_info = IGNORE_DEVICE }, { USB_DEVICE(0x04e2, 0x1412), .driver_info = IGNORE_DEVICE }, { USB_DEVICE(0x04e2, 0x1414), .driver_info = IGNORE_DEVICE }, { USB_DEVICE(0x04e2, 0x1420), .driver_info = IGNORE_DEVICE }, { USB_DEVICE(0x04e2, 0x1422), .driver_info = IGNORE_DEVICE }, { USB_DEVICE(0x04e2, 0x1424), .driver_info = IGNORE_DEVICE }, #endif /*Samsung phone in firmware update mode */ { USB_DEVICE(0x04e8, 0x685d), .driver_info = IGNORE_DEVICE, }, /* Exclude Infineon Flash Loader utility */ { USB_DEVICE(0x058b, 0x0041), .driver_info = IGNORE_DEVICE, }, /* Exclude ETAS ES58x */ { USB_DEVICE(0x108c, 0x0159), /* ES581.4 */ .driver_info = IGNORE_DEVICE, }, { USB_DEVICE(0x108c, 0x0168), /* ES582.1 */ .driver_info = IGNORE_DEVICE, }, { USB_DEVICE(0x108c, 0x0169), /* ES584.1 */ .driver_info = IGNORE_DEVICE, }, { USB_DEVICE(0x1bc7, 0x0021), /* Telit 3G ACM only composition */ .driver_info = SEND_ZERO_PACKET, }, { USB_DEVICE(0x1bc7, 0x0023), /* Telit 3G ACM + ECM composition */ .driver_info = SEND_ZERO_PACKET, }, /* Exclude Goodix Fingerprint Reader */ { USB_DEVICE(0x27c6, 0x5395), .driver_info = IGNORE_DEVICE, }, /* Exclude Heimann Sensor GmbH USB appset demo */ { USB_DEVICE(0x32a7, 0x0000), .driver_info = IGNORE_DEVICE, }, /* control interfaces without any protocol set */ { USB_INTERFACE_INFO(USB_CLASS_COMM, USB_CDC_SUBCLASS_ACM, USB_CDC_PROTO_NONE) }, /* control interfaces with various AT-command sets */ { USB_INTERFACE_INFO(USB_CLASS_COMM, USB_CDC_SUBCLASS_ACM, USB_CDC_ACM_PROTO_AT_V25TER) }, { USB_INTERFACE_INFO(USB_CLASS_COMM, USB_CDC_SUBCLASS_ACM, USB_CDC_ACM_PROTO_AT_PCCA101) }, { USB_INTERFACE_INFO(USB_CLASS_COMM, USB_CDC_SUBCLASS_ACM, USB_CDC_ACM_PROTO_AT_PCCA101_WAKE) }, { USB_INTERFACE_INFO(USB_CLASS_COMM, USB_CDC_SUBCLASS_ACM, USB_CDC_ACM_PROTO_AT_GSM) }, { USB_INTERFACE_INFO(USB_CLASS_COMM, USB_CDC_SUBCLASS_ACM, USB_CDC_ACM_PROTO_AT_3G) }, { USB_INTERFACE_INFO(USB_CLASS_COMM, USB_CDC_SUBCLASS_ACM, USB_CDC_ACM_PROTO_AT_CDMA) }, { USB_DEVICE(0x1519, 0x0452), /* Intel 7260 modem */ .driver_info = SEND_ZERO_PACKET, }, { } }; MODULE_DEVICE_TABLE(usb, acm_ids); static struct usb_driver acm_driver = { .name = "cdc_acm", .probe = acm_probe, .disconnect = acm_disconnect, #ifdef CONFIG_PM .suspend = acm_suspend, .resume = acm_resume, .reset_resume = acm_reset_resume, #endif .pre_reset = acm_pre_reset, .id_table = acm_ids, #ifdef CONFIG_PM .supports_autosuspend = 1, #endif .disable_hub_initiated_lpm = 1, }; /* * TTY driver structures. 
*/ static const struct tty_operations acm_ops = { .install = acm_tty_install, .open = acm_tty_open, .close = acm_tty_close, .cleanup = acm_tty_cleanup, .hangup = acm_tty_hangup, .write = acm_tty_write, .write_room = acm_tty_write_room, .flush_buffer = acm_tty_flush_buffer, .ioctl = acm_tty_ioctl, .throttle = acm_tty_throttle, .unthrottle = acm_tty_unthrottle, .chars_in_buffer = acm_tty_chars_in_buffer, .break_ctl = acm_tty_break_ctl, .set_termios = acm_tty_set_termios, .tiocmget = acm_tty_tiocmget, .tiocmset = acm_tty_tiocmset, .get_serial = get_serial_info, .set_serial = set_serial_info, .get_icount = acm_tty_get_icount, }; /* * Init / exit. */ static int __init acm_init(void) { int retval; acm_tty_driver = tty_alloc_driver(ACM_TTY_MINORS, TTY_DRIVER_REAL_RAW | TTY_DRIVER_DYNAMIC_DEV); if (IS_ERR(acm_tty_driver)) return PTR_ERR(acm_tty_driver); acm_tty_driver->driver_name = "acm", acm_tty_driver->name = "ttyACM", acm_tty_driver->major = ACM_TTY_MAJOR, acm_tty_driver->minor_start = 0, acm_tty_driver->type = TTY_DRIVER_TYPE_SERIAL, acm_tty_driver->subtype = SERIAL_TYPE_NORMAL, acm_tty_driver->init_termios = tty_std_termios; acm_tty_driver->init_termios.c_cflag = B9600 | CS8 | CREAD | HUPCL | CLOCAL; tty_set_operations(acm_tty_driver, &acm_ops); retval = tty_register_driver(acm_tty_driver); if (retval) { tty_driver_kref_put(acm_tty_driver); return retval; } retval = usb_register(&acm_driver); if (retval) { tty_unregister_driver(acm_tty_driver); tty_driver_kref_put(acm_tty_driver); return retval; } printk(KERN_INFO KBUILD_MODNAME ": " DRIVER_DESC "\n"); return 0; } static void __exit acm_exit(void) { usb_deregister(&acm_driver); tty_unregister_driver(acm_tty_driver); tty_driver_kref_put(acm_tty_driver); idr_destroy(&acm_minors); } module_init(acm_init); module_exit(acm_exit); MODULE_AUTHOR(DRIVER_AUTHOR); MODULE_DESCRIPTION(DRIVER_DESC); MODULE_LICENSE("GPL"); MODULE_ALIAS_CHARDEV_MAJOR(ACM_TTY_MAJOR); |
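/*
 * The driver above registers its tty class as "ttyACM", so each bound ACM
 * function shows up in userspace as /dev/ttyACMn.  A minimal userspace
 * sketch (not part of the kernel sources; assumes an AT-command capable
 * modem is bound as /dev/ttyACM0): open the node, switch the line to raw
 * 115200 8N1 and send a probe command.  The termios settings reach the
 * device through acm_tty_set_termios().
 */
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <termios.h>
#include <unistd.h>

int main(void)
{
	struct termios tio;
	char buf[256];
	ssize_t n;
	int fd = open("/dev/ttyACM0", O_RDWR | O_NOCTTY);

	if (fd < 0) {
		perror("open /dev/ttyACM0");
		return 1;
	}

	tcgetattr(fd, &tio);
	cfmakeraw(&tio);			/* raw mode: no echo, no line editing */
	cfsetispeed(&tio, B115200);
	cfsetospeed(&tio, B115200);
	tcsetattr(fd, TCSANOW, &tio);

	write(fd, "AT\r", 3);			/* simple AT probe */
	n = read(fd, buf, sizeof(buf) - 1);	/* blocking read of the reply */
	if (n > 0) {
		buf[n] = '\0';
		printf("modem replied: %s\n", buf);
	}

	close(fd);
	return 0;
}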
| 1 3 3 2 2 1 2 4 3 2 4 1 1 2 1 2 4 2 4 4 4 4 3 1 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 | // SPDX-License-Identifier: GPL-2.0-only /* * TCP Low Priority (TCP-LP) * * TCP Low Priority is a distributed algorithm whose goal is to utilize only * the excess network bandwidth as compared to the ``fair share`` of * bandwidth as targeted by TCP. * * As of 2.6.13, Linux supports pluggable congestion control algorithms. * Due to the limitation of the API, we take the following changes from * the original TCP-LP implementation: * o We use newReno in most core CA handling. Only add some checking * within cong_avoid. * o Error correcting in remote HZ, therefore remote HZ will be keeped * on checking and updating. * o Handling calculation of One-Way-Delay (OWD) within rtt_sample, since * OWD have a similar meaning as RTT. Also correct the buggy formular. * o Handle reaction for Early Congestion Indication (ECI) within * pkts_acked, as mentioned within pseudo code. * o OWD is handled in relative format, where local time stamp will in * tcp_time_stamp format. * * Original Author: * Aleksandar Kuzmanovic <akuzma@northwestern.edu> * Available from: * http://www.ece.rice.edu/~akuzma/Doc/akuzma/TCP-LP.pdf * Original implementation for 2.4.19: * http://www-ece.rice.edu/networks/TCP-LP/ * * 2.6.x module Authors: * Wong Hoi Sing, Edison <hswong3i@gmail.com> * Hung Hing Lun, Mike <hlhung3i@gmail.com> * SourceForge project page: * http://tcp-lp-mod.sourceforge.net/ */ #include <linux/module.h> #include <net/tcp.h> /* resolution of owd */ #define LP_RESOL TCP_TS_HZ /** * enum tcp_lp_state * @LP_VALID_RHZ: is remote HZ valid? * @LP_VALID_OWD: is OWD valid? * @LP_WITHIN_THR: are we within threshold? * @LP_WITHIN_INF: are we within inference? * * TCP-LP's state flags. * We create this set of state flag mainly for debugging. 
*/ enum tcp_lp_state { LP_VALID_RHZ = (1 << 0), LP_VALID_OWD = (1 << 1), LP_WITHIN_THR = (1 << 3), LP_WITHIN_INF = (1 << 4), }; /** * struct lp * @flag: TCP-LP state flag * @sowd: smoothed OWD << 3 * @owd_min: min OWD * @owd_max: max OWD * @owd_max_rsv: reserved max owd * @remote_hz: estimated remote HZ * @remote_ref_time: remote reference time * @local_ref_time: local reference time * @last_drop: time for last active drop * @inference: current inference * * TCP-LP's private struct. * We get the idea from original TCP-LP implementation where only left those we * found are really useful. */ struct lp { u32 flag; u32 sowd; u32 owd_min; u32 owd_max; u32 owd_max_rsv; u32 remote_hz; u32 remote_ref_time; u32 local_ref_time; u32 last_drop; u32 inference; }; /** * tcp_lp_init * @sk: socket to initialize congestion control algorithm for * * Init all required variables. * Clone the handling from Vegas module implementation. */ static void tcp_lp_init(struct sock *sk) { struct lp *lp = inet_csk_ca(sk); lp->flag = 0; lp->sowd = 0; lp->owd_min = 0xffffffff; lp->owd_max = 0; lp->owd_max_rsv = 0; lp->remote_hz = 0; lp->remote_ref_time = 0; lp->local_ref_time = 0; lp->last_drop = 0; lp->inference = 0; } /** * tcp_lp_cong_avoid * @sk: socket to avoid congesting * * Implementation of cong_avoid. * Will only call newReno CA when away from inference. * From TCP-LP's paper, this will be handled in additive increasement. */ static void tcp_lp_cong_avoid(struct sock *sk, u32 ack, u32 acked) { struct lp *lp = inet_csk_ca(sk); if (!(lp->flag & LP_WITHIN_INF)) tcp_reno_cong_avoid(sk, ack, acked); } /** * tcp_lp_remote_hz_estimator * @sk: socket which needs an estimate for the remote HZs * * Estimate remote HZ. * We keep on updating the estimated value, where original TCP-LP * implementation only guest it for once and use forever. */ static u32 tcp_lp_remote_hz_estimator(struct sock *sk) { struct tcp_sock *tp = tcp_sk(sk); struct lp *lp = inet_csk_ca(sk); s64 rhz = lp->remote_hz << 6; /* remote HZ << 6 */ s64 m = 0; /* not yet record reference time * go away!! record it before come back!! */ if (lp->remote_ref_time == 0 || lp->local_ref_time == 0) goto out; /* we can't calc remote HZ with no different!! */ if (tp->rx_opt.rcv_tsval == lp->remote_ref_time || tp->rx_opt.rcv_tsecr == lp->local_ref_time) goto out; m = TCP_TS_HZ * (tp->rx_opt.rcv_tsval - lp->remote_ref_time) / (tp->rx_opt.rcv_tsecr - lp->local_ref_time); if (m < 0) m = -m; if (rhz > 0) { m -= rhz >> 6; /* m is now error in remote HZ est */ rhz += m; /* 63/64 old + 1/64 new */ } else rhz = m << 6; out: /* record time for successful remote HZ calc */ if ((rhz >> 6) > 0) lp->flag |= LP_VALID_RHZ; else lp->flag &= ~LP_VALID_RHZ; /* record reference time stamp */ lp->remote_ref_time = tp->rx_opt.rcv_tsval; lp->local_ref_time = tp->rx_opt.rcv_tsecr; return rhz >> 6; } /** * tcp_lp_owd_calculator * @sk: socket to calculate one way delay for * * Calculate one way delay (in relative format). * Original implement OWD as minus of remote time difference to local time * difference directly. As this time difference just simply equal to RTT, when * the network status is stable, remote RTT will equal to local RTT, and result * OWD into zero. * It seems to be a bug and so we fixed it. 
*/ static u32 tcp_lp_owd_calculator(struct sock *sk) { struct tcp_sock *tp = tcp_sk(sk); struct lp *lp = inet_csk_ca(sk); s64 owd = 0; lp->remote_hz = tcp_lp_remote_hz_estimator(sk); if (lp->flag & LP_VALID_RHZ) { owd = tp->rx_opt.rcv_tsval * (LP_RESOL / lp->remote_hz) - tp->rx_opt.rcv_tsecr * (LP_RESOL / TCP_TS_HZ); if (owd < 0) owd = -owd; } if (owd > 0) lp->flag |= LP_VALID_OWD; else lp->flag &= ~LP_VALID_OWD; return owd; } /** * tcp_lp_rtt_sample * @sk: socket to add a rtt sample to * @rtt: round trip time, which is ignored! * * Implementation or rtt_sample. * Will take the following action, * 1. calc OWD, * 2. record the min/max OWD, * 3. calc smoothed OWD (SOWD). * Most ideas come from the original TCP-LP implementation. */ static void tcp_lp_rtt_sample(struct sock *sk, u32 rtt) { struct lp *lp = inet_csk_ca(sk); s64 mowd = tcp_lp_owd_calculator(sk); /* sorry that we don't have valid data */ if (!(lp->flag & LP_VALID_RHZ) || !(lp->flag & LP_VALID_OWD)) return; /* record the next min owd */ if (mowd < lp->owd_min) lp->owd_min = mowd; /* always forget the max of the max * we just set owd_max as one below it */ if (mowd > lp->owd_max) { if (mowd > lp->owd_max_rsv) { if (lp->owd_max_rsv == 0) lp->owd_max = mowd; else lp->owd_max = lp->owd_max_rsv; lp->owd_max_rsv = mowd; } else lp->owd_max = mowd; } /* calc for smoothed owd */ if (lp->sowd != 0) { mowd -= lp->sowd >> 3; /* m is now error in owd est */ lp->sowd += mowd; /* owd = 7/8 owd + 1/8 new */ } else lp->sowd = mowd << 3; /* take the measured time be owd */ } /** * tcp_lp_pkts_acked * @sk: socket requiring congestion avoidance calculations * * Implementation of pkts_acked. * Deal with active drop under Early Congestion Indication. * Only drop to half and 1 will be handle, because we hope to use back * newReno in increase case. 
* We work it out by following the idea from TCP-LP's paper directly */ static void tcp_lp_pkts_acked(struct sock *sk, const struct ack_sample *sample) { struct tcp_sock *tp = tcp_sk(sk); struct lp *lp = inet_csk_ca(sk); u32 now = tcp_time_stamp_ts(tp); u32 delta; if (sample->rtt_us > 0) tcp_lp_rtt_sample(sk, sample->rtt_us); /* calc inference */ delta = now - tp->rx_opt.rcv_tsecr; if ((s32)delta > 0) lp->inference = 3 * delta; /* test if within inference */ if (lp->last_drop && (now - lp->last_drop < lp->inference)) lp->flag |= LP_WITHIN_INF; else lp->flag &= ~LP_WITHIN_INF; /* test if within threshold */ if (lp->sowd >> 3 < lp->owd_min + 15 * (lp->owd_max - lp->owd_min) / 100) lp->flag |= LP_WITHIN_THR; else lp->flag &= ~LP_WITHIN_THR; pr_debug("TCP-LP: %05o|%5u|%5u|%15u|%15u|%15u\n", lp->flag, tcp_snd_cwnd(tp), lp->remote_hz, lp->owd_min, lp->owd_max, lp->sowd >> 3); if (lp->flag & LP_WITHIN_THR) return; /* FIXME: try to reset owd_min and owd_max here * so decrease the chance the min/max is no longer suitable * and will usually within threshold when within inference */ lp->owd_min = lp->sowd >> 3; lp->owd_max = lp->sowd >> 2; lp->owd_max_rsv = lp->sowd >> 2; /* happened within inference * drop snd_cwnd into 1 */ if (lp->flag & LP_WITHIN_INF) tcp_snd_cwnd_set(tp, 1U); /* happened after inference * cut snd_cwnd into half */ else tcp_snd_cwnd_set(tp, max(tcp_snd_cwnd(tp) >> 1U, 1U)); /* record this drop time */ lp->last_drop = now; } static struct tcp_congestion_ops tcp_lp __read_mostly = { .init = tcp_lp_init, .ssthresh = tcp_reno_ssthresh, .undo_cwnd = tcp_reno_undo_cwnd, .cong_avoid = tcp_lp_cong_avoid, .pkts_acked = tcp_lp_pkts_acked, .owner = THIS_MODULE, .name = "lp" }; static int __init tcp_lp_register(void) { BUILD_BUG_ON(sizeof(struct lp) > ICSK_CA_PRIV_SIZE); return tcp_register_congestion_control(&tcp_lp); } static void __exit tcp_lp_unregister(void) { tcp_unregister_congestion_control(&tcp_lp); } module_init(tcp_lp_register); module_exit(tcp_lp_unregister); MODULE_AUTHOR("Wong Hoi Sing Edison, Hung Hing Lun Mike"); MODULE_LICENSE("GPL"); MODULE_DESCRIPTION("TCP Low Priority"); |
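/*
 * tcp_lp registers itself above as a pluggable congestion control named
 * "lp".  A minimal userspace sketch (not part of the kernel sources) of
 * selecting it per socket through the TCP_CONGESTION socket option; this
 * assumes the module is loaded and "lp" is permitted by
 * /proc/sys/net/ipv4/tcp_allowed_congestion_control.
 */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	char name[16] = "lp";
	char chosen[16];
	socklen_t len = sizeof(chosen);
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0) {
		perror("socket");
		return 1;
	}

	if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, name, strlen(name))) {
		perror("setsockopt(TCP_CONGESTION)");	/* e.g. module not loaded */
		close(fd);
		return 1;
	}

	/* read the option back to confirm which algorithm the socket uses */
	if (!getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, chosen, &len))
		printf("congestion control: %.*s\n", (int)len, chosen);

	close(fd);
	return 0;
}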
| 2 625 73 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 | /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _LINUX_PID_NS_H #define _LINUX_PID_NS_H #include <linux/sched.h> #include <linux/bug.h> #include <linux/mm.h> #include <linux/workqueue.h> #include <linux/threads.h> #include <linux/nsproxy.h> #include <linux/ns_common.h> #include <linux/idr.h> /* MAX_PID_NS_LEVEL is needed for limiting size of 'struct pid' */ #define MAX_PID_NS_LEVEL 32 struct fs_pin; #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE) /* modes for vm.memfd_noexec sysctl */ #define MEMFD_NOEXEC_SCOPE_EXEC 0 /* MFD_EXEC implied if unset */ #define MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL 1 /* MFD_NOEXEC_SEAL implied if unset */ #define MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED 2 /* same as 1, except MFD_EXEC rejected */ #endif struct pid_namespace { struct idr idr; struct rcu_head rcu; unsigned int pid_allocated; struct task_struct *child_reaper; struct kmem_cache *pid_cachep; unsigned int level; int pid_max; struct pid_namespace *parent; #ifdef CONFIG_BSD_PROCESS_ACCT struct fs_pin *bacct; #endif struct user_namespace *user_ns; struct ucounts *ucounts; int reboot; /* group exit code if this pidns was rebooted */ struct ns_common ns; struct work_struct work; #ifdef CONFIG_SYSCTL struct ctl_table_set set; struct ctl_table_header *sysctls; #if defined(CONFIG_MEMFD_CREATE) int memfd_noexec_scope; #endif #endif } __randomize_layout; extern struct pid_namespace init_pid_ns; #define PIDNS_ADDING (1U << 31) #ifdef CONFIG_PID_NS static inline struct pid_namespace *get_pid_ns(struct pid_namespace *ns) { if (ns != &init_pid_ns) refcount_inc(&ns->ns.count); return ns; } #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE) static inline int pidns_memfd_noexec_scope(struct pid_namespace *ns) { int scope = MEMFD_NOEXEC_SCOPE_EXEC; for (; ns; ns = ns->parent) scope = max(scope, READ_ONCE(ns->memfd_noexec_scope)); return scope; } #else static inline int pidns_memfd_noexec_scope(struct pid_namespace *ns) { return 0; } #endif extern struct pid_namespace *copy_pid_ns(unsigned long flags, struct user_namespace *user_ns, struct pid_namespace *ns); extern void zap_pid_ns_processes(struct pid_namespace *pid_ns); extern int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd); extern void put_pid_ns(struct pid_namespace *ns); #else /* !CONFIG_PID_NS */ #include <linux/err.h> static inline struct pid_namespace *get_pid_ns(struct pid_namespace *ns) { return ns; } static inline int pidns_memfd_noexec_scope(struct pid_namespace *ns) { return 0; } static inline struct pid_namespace *copy_pid_ns(unsigned long flags, struct user_namespace *user_ns, struct pid_namespace *ns) { if (flags & CLONE_NEWPID) ns = ERR_PTR(-EINVAL); return ns; } static inline void put_pid_ns(struct pid_namespace *ns) { } static inline void zap_pid_ns_processes(struct pid_namespace *ns) { BUG(); } static inline int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd) { return 0; } #endif /* CONFIG_PID_NS */ extern struct pid_namespace *task_active_pid_ns(struct task_struct *tsk); void pidhash_init(void); void pid_idr_init(void); int register_pidns_sysctls(struct pid_namespace *pidns); void 
unregister_pidns_sysctls(struct pid_namespace *pidns); static inline bool task_is_in_init_pid_ns(struct task_struct *tsk) { return task_active_pid_ns(tsk) == &init_pid_ns; } #endif /* _LINUX_PID_NS_H */ |
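/*
 * The declarations above back the pid-namespace behaviour userspace sees:
 * copy_pid_ns() is reached via CLONE_NEWPID, and the first task forked into
 * the new namespace becomes its child_reaper (PID 1 inside it).  A minimal
 * userspace sketch of that, not kernel code; requires CAP_SYS_ADMIN.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	pid_t child;

	if (unshare(CLONE_NEWPID)) {	/* the next fork() lands in a fresh pid ns */
		perror("unshare(CLONE_NEWPID)");
		return 1;
	}

	child = fork();
	if (child == 0) {
		/* inside the new namespace this prints 1: the child_reaper */
		printf("pid inside new namespace: %d\n", getpid());
		return 0;
	}

	/* seen from the parent's namespace the same task has an ordinary pid */
	printf("pid as seen from the parent namespace: %d\n", child);
	waitpid(child, NULL, 0);
	return 0;
}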
| 9 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 | /* SPDX-License-Identifier: GPL-2.0 */ #undef TRACE_SYSTEM #define TRACE_SYSTEM tlb #if !defined(_TRACE_TLB_H) || defined(TRACE_HEADER_MULTI_READ) #define _TRACE_TLB_H #include <linux/mm_types.h> #include <linux/tracepoint.h> #define TLB_FLUSH_REASON \ EM( TLB_FLUSH_ON_TASK_SWITCH, "flush on task switch" ) \ EM( TLB_REMOTE_SHOOTDOWN, "remote shootdown" ) \ EM( TLB_LOCAL_SHOOTDOWN, "local shootdown" ) \ EM( TLB_LOCAL_MM_SHOOTDOWN, "local mm shootdown" ) \ EMe( TLB_REMOTE_SEND_IPI, "remote ipi send" ) /* * First define the enums in TLB_FLUSH_REASON to be exported to userspace * via TRACE_DEFINE_ENUM(). */ #undef EM #undef EMe #define EM(a,b) TRACE_DEFINE_ENUM(a); #define EMe(a,b) TRACE_DEFINE_ENUM(a); TLB_FLUSH_REASON /* * Now redefine the EM() and EMe() macros to map the enums to the strings * that will be printed in the output. */ #undef EM #undef EMe #define EM(a,b) { a, b }, #define EMe(a,b) { a, b } TRACE_EVENT(tlb_flush, TP_PROTO(int reason, unsigned long pages), TP_ARGS(reason, pages), TP_STRUCT__entry( __field( int, reason) __field(unsigned long, pages) ), TP_fast_assign( __entry->reason = reason; __entry->pages = pages; ), TP_printk("pages:%ld reason:%s (%d)", __entry->pages, __print_symbolic(__entry->reason, TLB_FLUSH_REASON), __entry->reason) ); #endif /* _TRACE_TLB_H */ /* This part must be outside protection */ #include <trace/define_trace.h> |
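/*
 * The EM()/EMe() dance above is a plain X-macro: the single TLB_FLUSH_REASON
 * list is expanded twice with different definitions of EM/EMe, once to
 * register the enum values and once to build the value-to-string table that
 * __print_symbolic() consumes.  A standalone sketch of the same pattern,
 * with hypothetical names and no tracing infrastructure:
 */
#include <stdio.h>

#define REASON_LIST \
	EM(REASON_TASK_SWITCH,	"flush on task switch")	\
	EMe(REASON_REMOTE_IPI,	"remote ipi send")

/* first pass: generate the enum itself */
#define EM(a, b)	a,
#define EMe(a, b)	a
enum reason { REASON_LIST };
#undef EM
#undef EMe

/* second pass: generate a value -> string table from the same list */
#define EM(a, b)	{ a, b },
#define EMe(a, b)	{ a, b }
static const struct { int val; const char *name; } reason_names[] = { REASON_LIST };

int main(void)
{
	for (unsigned int i = 0; i < sizeof(reason_names) / sizeof(reason_names[0]); i++)
		printf("%d -> %s\n", reason_names[i].val, reason_names[i].name);
	return 0;
}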
| 358 353 2 352 6 20 117 1 4 135 136 136 41 96 1 98 98 2 1 3 96 202 195 40 18 5 1 3 1 19 186 174 174 166 174 175 174 324 255 87 8 5 8 22 3 3 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 | /* * linux/fs/nls/nls_base.c * * Native language support--charsets and unicode translations. * By Gordon Chaffee 1996, 1997 * * Unicode based case conversion 1999 by Wolfram Pienkoss * */ #include <linux/module.h> #include <linux/string.h> #include <linux/nls.h> #include <linux/kernel.h> #include <linux/errno.h> #include <linux/kmod.h> #include <linux/spinlock.h> #include <asm/byteorder.h> static struct nls_table default_table; static struct nls_table *tables = &default_table; static DEFINE_SPINLOCK(nls_lock); /* * Sample implementation from Unicode home page. 
* http://www.stonehand.com/unicode/standard/fss-utf.html */ struct utf8_table { int cmask; int cval; int shift; long lmask; long lval; }; static const struct utf8_table utf8_table[] = { {0x80, 0x00, 0*6, 0x7F, 0, /* 1 byte sequence */}, {0xE0, 0xC0, 1*6, 0x7FF, 0x80, /* 2 byte sequence */}, {0xF0, 0xE0, 2*6, 0xFFFF, 0x800, /* 3 byte sequence */}, {0xF8, 0xF0, 3*6, 0x1FFFFF, 0x10000, /* 4 byte sequence */}, {0xFC, 0xF8, 4*6, 0x3FFFFFF, 0x200000, /* 5 byte sequence */}, {0xFE, 0xFC, 5*6, 0x7FFFFFFF, 0x4000000, /* 6 byte sequence */}, {0, /* end of table */} }; #define UNICODE_MAX 0x0010ffff #define PLANE_SIZE 0x00010000 #define SURROGATE_MASK 0xfffff800 #define SURROGATE_PAIR 0x0000d800 #define SURROGATE_LOW 0x00000400 #define SURROGATE_BITS 0x000003ff int utf8_to_utf32(const u8 *s, int inlen, unicode_t *pu) { unsigned long l; int c0, c, nc; const struct utf8_table *t; nc = 0; c0 = *s; l = c0; for (t = utf8_table; t->cmask; t++) { nc++; if ((c0 & t->cmask) == t->cval) { l &= t->lmask; if (l < t->lval || l > UNICODE_MAX || (l & SURROGATE_MASK) == SURROGATE_PAIR) return -1; *pu = (unicode_t) l; return nc; } if (inlen <= nc) return -1; s++; c = (*s ^ 0x80) & 0xFF; if (c & 0xC0) return -1; l = (l << 6) | c; } return -1; } EXPORT_SYMBOL(utf8_to_utf32); int utf32_to_utf8(unicode_t u, u8 *s, int maxout) { unsigned long l; int c, nc; const struct utf8_table *t; if (!s) return 0; l = u; if (l > UNICODE_MAX || (l & SURROGATE_MASK) == SURROGATE_PAIR) return -1; nc = 0; for (t = utf8_table; t->cmask && maxout; t++, maxout--) { nc++; if (l <= t->lmask) { c = t->shift; *s = (u8) (t->cval | (l >> c)); while (c > 0) { c -= 6; s++; *s = (u8) (0x80 | ((l >> c) & 0x3F)); } return nc; } } return -1; } EXPORT_SYMBOL(utf32_to_utf8); static inline void put_utf16(wchar_t *s, unsigned c, enum utf16_endian endian) { switch (endian) { default: *s = (wchar_t) c; break; case UTF16_LITTLE_ENDIAN: *s = __cpu_to_le16(c); break; case UTF16_BIG_ENDIAN: *s = __cpu_to_be16(c); break; } } int utf8s_to_utf16s(const u8 *s, int inlen, enum utf16_endian endian, wchar_t *pwcs, int maxout) { u16 *op; int size; unicode_t u; op = pwcs; while (inlen > 0 && maxout > 0 && *s) { if (*s & 0x80) { size = utf8_to_utf32(s, inlen, &u); if (size < 0) return -EINVAL; s += size; inlen -= size; if (u >= PLANE_SIZE) { if (maxout < 2) break; u -= PLANE_SIZE; put_utf16(op++, SURROGATE_PAIR | ((u >> 10) & SURROGATE_BITS), endian); put_utf16(op++, SURROGATE_PAIR | SURROGATE_LOW | (u & SURROGATE_BITS), endian); maxout -= 2; } else { put_utf16(op++, u, endian); maxout--; } } else { put_utf16(op++, *s++, endian); inlen--; maxout--; } } return op - pwcs; } EXPORT_SYMBOL(utf8s_to_utf16s); static inline unsigned long get_utf16(unsigned c, enum utf16_endian endian) { switch (endian) { default: return c; case UTF16_LITTLE_ENDIAN: return __le16_to_cpu(c); case UTF16_BIG_ENDIAN: return __be16_to_cpu(c); } } int utf16s_to_utf8s(const wchar_t *pwcs, int inlen, enum utf16_endian endian, u8 *s, int maxout) { u8 *op; int size; unsigned long u, v; op = s; while (inlen > 0 && maxout > 0) { u = get_utf16(*pwcs, endian); if (!u) break; pwcs++; inlen--; if (u > 0x7f) { if ((u & SURROGATE_MASK) == SURROGATE_PAIR) { if (u & SURROGATE_LOW) { /* Ignore character and move on */ continue; } if (inlen <= 0) break; v = get_utf16(*pwcs, endian); if ((v & SURROGATE_MASK) != SURROGATE_PAIR || !(v & SURROGATE_LOW)) { /* Ignore character and move on */ continue; } u = PLANE_SIZE + ((u & SURROGATE_BITS) << 10) + (v & SURROGATE_BITS); pwcs++; inlen--; } size = utf32_to_utf8(u, op, 
maxout); if (size == -1) { /* Ignore character and move on */ } else { op += size; maxout -= size; } } else { *op++ = (u8) u; maxout--; } } return op - s; } EXPORT_SYMBOL(utf16s_to_utf8s); int __register_nls(struct nls_table *nls, struct module *owner) { struct nls_table ** tmp = &tables; if (nls->next) return -EBUSY; nls->owner = owner; spin_lock(&nls_lock); while (*tmp) { if (nls == *tmp) { spin_unlock(&nls_lock); return -EBUSY; } tmp = &(*tmp)->next; } nls->next = tables; tables = nls; spin_unlock(&nls_lock); return 0; } EXPORT_SYMBOL(__register_nls); int unregister_nls(struct nls_table * nls) { struct nls_table ** tmp = &tables; spin_lock(&nls_lock); while (*tmp) { if (nls == *tmp) { *tmp = nls->next; spin_unlock(&nls_lock); return 0; } tmp = &(*tmp)->next; } spin_unlock(&nls_lock); return -EINVAL; } static struct nls_table *find_nls(const char *charset) { struct nls_table *nls; spin_lock(&nls_lock); for (nls = tables; nls; nls = nls->next) { if (!strcmp(nls->charset, charset)) break; if (nls->alias && !strcmp(nls->alias, charset)) break; } if (nls && !try_module_get(nls->owner)) nls = NULL; spin_unlock(&nls_lock); return nls; } struct nls_table *load_nls(const char *charset) { return try_then_request_module(find_nls(charset), "nls_%s", charset); } void unload_nls(struct nls_table *nls) { if (nls) module_put(nls->owner); } static const wchar_t charset2uni[256] = { /* 0x00*/ 0x0000, 0x0001, 0x0002, 0x0003, 0x0004, 0x0005, 0x0006, 0x0007, 0x0008, 0x0009, 0x000a, 0x000b, 0x000c, 0x000d, 0x000e, 0x000f, /* 0x10*/ 0x0010, 0x0011, 0x0012, 0x0013, 0x0014, 0x0015, 0x0016, 0x0017, 0x0018, 0x0019, 0x001a, 0x001b, 0x001c, 0x001d, 0x001e, 0x001f, /* 0x20*/ 0x0020, 0x0021, 0x0022, 0x0023, 0x0024, 0x0025, 0x0026, 0x0027, 0x0028, 0x0029, 0x002a, 0x002b, 0x002c, 0x002d, 0x002e, 0x002f, /* 0x30*/ 0x0030, 0x0031, 0x0032, 0x0033, 0x0034, 0x0035, 0x0036, 0x0037, 0x0038, 0x0039, 0x003a, 0x003b, 0x003c, 0x003d, 0x003e, 0x003f, /* 0x40*/ 0x0040, 0x0041, 0x0042, 0x0043, 0x0044, 0x0045, 0x0046, 0x0047, 0x0048, 0x0049, 0x004a, 0x004b, 0x004c, 0x004d, 0x004e, 0x004f, /* 0x50*/ 0x0050, 0x0051, 0x0052, 0x0053, 0x0054, 0x0055, 0x0056, 0x0057, 0x0058, 0x0059, 0x005a, 0x005b, 0x005c, 0x005d, 0x005e, 0x005f, /* 0x60*/ 0x0060, 0x0061, 0x0062, 0x0063, 0x0064, 0x0065, 0x0066, 0x0067, 0x0068, 0x0069, 0x006a, 0x006b, 0x006c, 0x006d, 0x006e, 0x006f, /* 0x70*/ 0x0070, 0x0071, 0x0072, 0x0073, 0x0074, 0x0075, 0x0076, 0x0077, 0x0078, 0x0079, 0x007a, 0x007b, 0x007c, 0x007d, 0x007e, 0x007f, /* 0x80*/ 0x0080, 0x0081, 0x0082, 0x0083, 0x0084, 0x0085, 0x0086, 0x0087, 0x0088, 0x0089, 0x008a, 0x008b, 0x008c, 0x008d, 0x008e, 0x008f, /* 0x90*/ 0x0090, 0x0091, 0x0092, 0x0093, 0x0094, 0x0095, 0x0096, 0x0097, 0x0098, 0x0099, 0x009a, 0x009b, 0x009c, 0x009d, 0x009e, 0x009f, /* 0xa0*/ 0x00a0, 0x00a1, 0x00a2, 0x00a3, 0x00a4, 0x00a5, 0x00a6, 0x00a7, 0x00a8, 0x00a9, 0x00aa, 0x00ab, 0x00ac, 0x00ad, 0x00ae, 0x00af, /* 0xb0*/ 0x00b0, 0x00b1, 0x00b2, 0x00b3, 0x00b4, 0x00b5, 0x00b6, 0x00b7, 0x00b8, 0x00b9, 0x00ba, 0x00bb, 0x00bc, 0x00bd, 0x00be, 0x00bf, /* 0xc0*/ 0x00c0, 0x00c1, 0x00c2, 0x00c3, 0x00c4, 0x00c5, 0x00c6, 0x00c7, 0x00c8, 0x00c9, 0x00ca, 0x00cb, 0x00cc, 0x00cd, 0x00ce, 0x00cf, /* 0xd0*/ 0x00d0, 0x00d1, 0x00d2, 0x00d3, 0x00d4, 0x00d5, 0x00d6, 0x00d7, 0x00d8, 0x00d9, 0x00da, 0x00db, 0x00dc, 0x00dd, 0x00de, 0x00df, /* 0xe0*/ 0x00e0, 0x00e1, 0x00e2, 0x00e3, 0x00e4, 0x00e5, 0x00e6, 0x00e7, 0x00e8, 0x00e9, 0x00ea, 0x00eb, 0x00ec, 0x00ed, 0x00ee, 0x00ef, /* 0xf0*/ 0x00f0, 0x00f1, 0x00f2, 0x00f3, 0x00f4, 0x00f5, 0x00f6, 0x00f7, 0x00f8, 0x00f9, 
0x00fa, 0x00fb, 0x00fc, 0x00fd, 0x00fe, 0x00ff, }; static const unsigned char page00[256] = { 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, /* 0x00-0x07 */ 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, /* 0x08-0x0f */ 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, /* 0x10-0x17 */ 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f, /* 0x18-0x1f */ 0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26, 0x27, /* 0x20-0x27 */ 0x28, 0x29, 0x2a, 0x2b, 0x2c, 0x2d, 0x2e, 0x2f, /* 0x28-0x2f */ 0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, /* 0x30-0x37 */ 0x38, 0x39, 0x3a, 0x3b, 0x3c, 0x3d, 0x3e, 0x3f, /* 0x38-0x3f */ 0x40, 0x41, 0x42, 0x43, 0x44, 0x45, 0x46, 0x47, /* 0x40-0x47 */ 0x48, 0x49, 0x4a, 0x4b, 0x4c, 0x4d, 0x4e, 0x4f, /* 0x48-0x4f */ 0x50, 0x51, 0x52, 0x53, 0x54, 0x55, 0x56, 0x57, /* 0x50-0x57 */ 0x58, 0x59, 0x5a, 0x5b, 0x5c, 0x5d, 0x5e, 0x5f, /* 0x58-0x5f */ 0x60, 0x61, 0x62, 0x63, 0x64, 0x65, 0x66, 0x67, /* 0x60-0x67 */ 0x68, 0x69, 0x6a, 0x6b, 0x6c, 0x6d, 0x6e, 0x6f, /* 0x68-0x6f */ 0x70, 0x71, 0x72, 0x73, 0x74, 0x75, 0x76, 0x77, /* 0x70-0x77 */ 0x78, 0x79, 0x7a, 0x7b, 0x7c, 0x7d, 0x7e, 0x7f, /* 0x78-0x7f */ 0x80, 0x81, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87, /* 0x80-0x87 */ 0x88, 0x89, 0x8a, 0x8b, 0x8c, 0x8d, 0x8e, 0x8f, /* 0x88-0x8f */ 0x90, 0x91, 0x92, 0x93, 0x94, 0x95, 0x96, 0x97, /* 0x90-0x97 */ 0x98, 0x99, 0x9a, 0x9b, 0x9c, 0x9d, 0x9e, 0x9f, /* 0x98-0x9f */ 0xa0, 0xa1, 0xa2, 0xa3, 0xa4, 0xa5, 0xa6, 0xa7, /* 0xa0-0xa7 */ 0xa8, 0xa9, 0xaa, 0xab, 0xac, 0xad, 0xae, 0xaf, /* 0xa8-0xaf */ 0xb0, 0xb1, 0xb2, 0xb3, 0xb4, 0xb5, 0xb6, 0xb7, /* 0xb0-0xb7 */ 0xb8, 0xb9, 0xba, 0xbb, 0xbc, 0xbd, 0xbe, 0xbf, /* 0xb8-0xbf */ 0xc0, 0xc1, 0xc2, 0xc3, 0xc4, 0xc5, 0xc6, 0xc7, /* 0xc0-0xc7 */ 0xc8, 0xc9, 0xca, 0xcb, 0xcc, 0xcd, 0xce, 0xcf, /* 0xc8-0xcf */ 0xd0, 0xd1, 0xd2, 0xd3, 0xd4, 0xd5, 0xd6, 0xd7, /* 0xd0-0xd7 */ 0xd8, 0xd9, 0xda, 0xdb, 0xdc, 0xdd, 0xde, 0xdf, /* 0xd8-0xdf */ 0xe0, 0xe1, 0xe2, 0xe3, 0xe4, 0xe5, 0xe6, 0xe7, /* 0xe0-0xe7 */ 0xe8, 0xe9, 0xea, 0xeb, 0xec, 0xed, 0xee, 0xef, /* 0xe8-0xef */ 0xf0, 0xf1, 0xf2, 0xf3, 0xf4, 0xf5, 0xf6, 0xf7, /* 0xf0-0xf7 */ 0xf8, 0xf9, 0xfa, 0xfb, 0xfc, 0xfd, 0xfe, 0xff, /* 0xf8-0xff */ }; static const unsigned char *const page_uni2charset[256] = { page00 }; static const unsigned char charset2lower[256] = { 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, /* 0x00-0x07 */ 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, /* 0x08-0x0f */ 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, /* 0x10-0x17 */ 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f, /* 0x18-0x1f */ 0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26, 0x27, /* 0x20-0x27 */ 0x28, 0x29, 0x2a, 0x2b, 0x2c, 0x2d, 0x2e, 0x2f, /* 0x28-0x2f */ 0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, /* 0x30-0x37 */ 0x38, 0x39, 0x3a, 0x3b, 0x3c, 0x3d, 0x3e, 0x3f, /* 0x38-0x3f */ 0x40, 0x61, 0x62, 0x63, 0x64, 0x65, 0x66, 0x67, /* 0x40-0x47 */ 0x68, 0x69, 0x6a, 0x6b, 0x6c, 0x6d, 0x6e, 0x6f, /* 0x48-0x4f */ 0x70, 0x71, 0x72, 0x73, 0x74, 0x75, 0x76, 0x77, /* 0x50-0x57 */ 0x78, 0x79, 0x7a, 0x5b, 0x5c, 0x5d, 0x5e, 0x5f, /* 0x58-0x5f */ 0x60, 0x61, 0x62, 0x63, 0x64, 0x65, 0x66, 0x67, /* 0x60-0x67 */ 0x68, 0x69, 0x6a, 0x6b, 0x6c, 0x6d, 0x6e, 0x6f, /* 0x68-0x6f */ 0x70, 0x71, 0x72, 0x73, 0x74, 0x75, 0x76, 0x77, /* 0x70-0x77 */ 0x78, 0x79, 0x7a, 0x7b, 0x7c, 0x7d, 0x7e, 0x7f, /* 0x78-0x7f */ 0x80, 0x81, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87, /* 0x80-0x87 */ 0x88, 0x89, 0x8a, 0x8b, 0x8c, 0x8d, 0x8e, 0x8f, /* 0x88-0x8f */ 0x90, 0x91, 0x92, 0x93, 0x94, 0x95, 0x96, 0x97, /* 0x90-0x97 */ 0x98, 0x99, 0x9a, 0x9b, 0x9c, 0x9d, 0x9e, 0x9f, /* 0x98-0x9f */ 0xa0, 
0xa1, 0xa2, 0xa3, 0xa4, 0xa5, 0xa6, 0xa7, /* 0xa0-0xa7 */ 0xa8, 0xa9, 0xaa, 0xab, 0xac, 0xad, 0xae, 0xaf, /* 0xa8-0xaf */ 0xb0, 0xb1, 0xb2, 0xb3, 0xb4, 0xb5, 0xb6, 0xb7, /* 0xb0-0xb7 */ 0xb8, 0xb9, 0xba, 0xbb, 0xbc, 0xbd, 0xbe, 0xbf, /* 0xb8-0xbf */ 0xc0, 0xc1, 0xc2, 0xc3, 0xc4, 0xc5, 0xc6, 0xc7, /* 0xc0-0xc7 */ 0xc8, 0xc9, 0xca, 0xcb, 0xcc, 0xcd, 0xce, 0xcf, /* 0xc8-0xcf */ 0xd0, 0xd1, 0xd2, 0xd3, 0xd4, 0xd5, 0xd6, 0xd7, /* 0xd0-0xd7 */ 0xd8, 0xd9, 0xda, 0xdb, 0xdc, 0xdd, 0xde, 0xdf, /* 0xd8-0xdf */ 0xe0, 0xe1, 0xe2, 0xe3, 0xe4, 0xe5, 0xe6, 0xe7, /* 0xe0-0xe7 */ 0xe8, 0xe9, 0xea, 0xeb, 0xec, 0xed, 0xee, 0xef, /* 0xe8-0xef */ 0xf0, 0xf1, 0xf2, 0xf3, 0xf4, 0xf5, 0xf6, 0xf7, /* 0xf0-0xf7 */ 0xf8, 0xf9, 0xfa, 0xfb, 0xfc, 0xfd, 0xfe, 0xff, /* 0xf8-0xff */ }; static const unsigned char charset2upper[256] = { 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, /* 0x00-0x07 */ 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, /* 0x08-0x0f */ 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, /* 0x10-0x17 */ 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f, /* 0x18-0x1f */ 0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26, 0x27, /* 0x20-0x27 */ 0x28, 0x29, 0x2a, 0x2b, 0x2c, 0x2d, 0x2e, 0x2f, /* 0x28-0x2f */ 0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, /* 0x30-0x37 */ 0x38, 0x39, 0x3a, 0x3b, 0x3c, 0x3d, 0x3e, 0x3f, /* 0x38-0x3f */ 0x40, 0x41, 0x42, 0x43, 0x44, 0x45, 0x46, 0x47, /* 0x40-0x47 */ 0x48, 0x49, 0x4a, 0x4b, 0x4c, 0x4d, 0x4e, 0x4f, /* 0x48-0x4f */ 0x50, 0x51, 0x52, 0x53, 0x54, 0x55, 0x56, 0x57, /* 0x50-0x57 */ 0x58, 0x59, 0x5a, 0x5b, 0x5c, 0x5d, 0x5e, 0x5f, /* 0x58-0x5f */ 0x60, 0x41, 0x42, 0x43, 0x44, 0x45, 0x46, 0x47, /* 0x60-0x67 */ 0x48, 0x49, 0x4a, 0x4b, 0x4c, 0x4d, 0x4e, 0x4f, /* 0x68-0x6f */ 0x50, 0x51, 0x52, 0x53, 0x54, 0x55, 0x56, 0x57, /* 0x70-0x77 */ 0x58, 0x59, 0x5a, 0x7b, 0x7c, 0x7d, 0x7e, 0x7f, /* 0x78-0x7f */ 0x80, 0x81, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87, /* 0x80-0x87 */ 0x88, 0x89, 0x8a, 0x8b, 0x8c, 0x8d, 0x8e, 0x8f, /* 0x88-0x8f */ 0x90, 0x91, 0x92, 0x93, 0x94, 0x95, 0x96, 0x97, /* 0x90-0x97 */ 0x98, 0x99, 0x9a, 0x9b, 0x9c, 0x9d, 0x9e, 0x9f, /* 0x98-0x9f */ 0xa0, 0xa1, 0xa2, 0xa3, 0xa4, 0xa5, 0xa6, 0xa7, /* 0xa0-0xa7 */ 0xa8, 0xa9, 0xaa, 0xab, 0xac, 0xad, 0xae, 0xaf, /* 0xa8-0xaf */ 0xb0, 0xb1, 0xb2, 0xb3, 0xb4, 0xb5, 0xb6, 0xb7, /* 0xb0-0xb7 */ 0xb8, 0xb9, 0xba, 0xbb, 0xbc, 0xbd, 0xbe, 0xbf, /* 0xb8-0xbf */ 0xc0, 0xc1, 0xc2, 0xc3, 0xc4, 0xc5, 0xc6, 0xc7, /* 0xc0-0xc7 */ 0xc8, 0xc9, 0xca, 0xcb, 0xcc, 0xcd, 0xce, 0xcf, /* 0xc8-0xcf */ 0xd0, 0xd1, 0xd2, 0xd3, 0xd4, 0xd5, 0xd6, 0xd7, /* 0xd0-0xd7 */ 0xd8, 0xd9, 0xda, 0xdb, 0xdc, 0xdd, 0xde, 0xdf, /* 0xd8-0xdf */ 0xe0, 0xe1, 0xe2, 0xe3, 0xe4, 0xe5, 0xe6, 0xe7, /* 0xe0-0xe7 */ 0xe8, 0xe9, 0xea, 0xeb, 0xec, 0xed, 0xee, 0xef, /* 0xe8-0xef */ 0xf0, 0xf1, 0xf2, 0xf3, 0xf4, 0xf5, 0xf6, 0xf7, /* 0xf0-0xf7 */ 0xf8, 0xf9, 0xfa, 0xfb, 0xfc, 0xfd, 0xfe, 0xff, /* 0xf8-0xff */ }; static int uni2char(wchar_t uni, unsigned char *out, int boundlen) { const unsigned char *uni2charset; unsigned char cl = uni & 0x00ff; unsigned char ch = (uni & 0xff00) >> 8; if (boundlen <= 0) return -ENAMETOOLONG; uni2charset = page_uni2charset[ch]; if (uni2charset && uni2charset[cl]) out[0] = uni2charset[cl]; else return -EINVAL; return 1; } static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) { *uni = charset2uni[*rawstring]; if (*uni == 0x0000) return -EINVAL; return 1; } static struct nls_table default_table = { .charset = "default", .uni2char = uni2char, .char2uni = char2uni, .charset2lower = charset2lower, .charset2upper = charset2upper, }; /* 
Returns a simple default translation table */ struct nls_table *load_nls_default(void) { struct nls_table *default_nls; default_nls = load_nls(CONFIG_NLS_DEFAULT); if (default_nls != NULL) return default_nls; else return &default_table; } EXPORT_SYMBOL(unregister_nls); EXPORT_SYMBOL(unload_nls); EXPORT_SYMBOL(load_nls); EXPORT_SYMBOL(load_nls_default); MODULE_DESCRIPTION("Base file system native language support"); MODULE_LICENSE("Dual BSD/GPL"); |
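/*
 * The surrogate handling in utf8s_to_utf16s()/utf16s_to_utf8s() above comes
 * down to a small arithmetic identity: code points at or above PLANE_SIZE
 * (U+10000) are rebased and split into a 10-bit high half tagged with
 * SURROGATE_PAIR and a 10-bit low half tagged with SURROGATE_PAIR |
 * SURROGATE_LOW.  A standalone round-trip sketch reusing the same constants
 * (userspace illustration only, not part of the NLS code):
 */
#include <assert.h>
#include <stdio.h>

#define PLANE_SIZE	0x00010000
#define SURROGATE_PAIR	0x0000d800
#define SURROGATE_LOW	0x00000400
#define SURROGATE_BITS	0x000003ff

int main(void)
{
	unsigned int u = 0x1F600;	/* any code point >= U+10000 */
	unsigned int v = u - PLANE_SIZE;

	/* split: the high surrogate carries bits 19..10, the low one bits 9..0 */
	unsigned int hi = SURROGATE_PAIR | ((v >> 10) & SURROGATE_BITS);
	unsigned int lo = SURROGATE_PAIR | SURROGATE_LOW | (v & SURROGATE_BITS);

	/* join: the inverse computation done when decoding UTF-16 input */
	unsigned int back = PLANE_SIZE +
			    ((hi & SURROGATE_BITS) << 10) + (lo & SURROGATE_BITS);

	printf("U+%X -> 0x%04x 0x%04x -> U+%X\n", u, hi, lo, back);
	assert(back == u);
	return 0;
}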
907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 | // SPDX-License-Identifier: GPL-2.0-only /* * linux/fs/befs/linuxvfs.c * * Copyright (C) 2001 Will Dyson <will_dyson@pobox.com * */ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt #include <linux/module.h> #include <linux/slab.h> #include <linux/fs.h> #include <linux/fs_context.h> #include <linux/fs_parser.h> #include <linux/errno.h> #include <linux/stat.h> #include <linux/nls.h> #include <linux/buffer_head.h> #include <linux/vfs.h> #include <linux/namei.h> #include <linux/sched.h> #include <linux/cred.h> #include <linux/exportfs.h> #include <linux/seq_file.h> #include <linux/blkdev.h> #include "befs.h" #include "btree.h" #include "inode.h" #include "datastream.h" #include "super.h" #include "io.h" MODULE_DESCRIPTION("BeOS File System (BeFS) driver"); MODULE_AUTHOR("Will Dyson"); MODULE_LICENSE("GPL"); /* The units the vfs expects inode->i_blocks to be in */ #define VFS_BLOCK_SIZE 512 static int befs_readdir(struct file *, struct dir_context *); static int befs_get_block(struct inode *, sector_t, struct buffer_head *, int); static int befs_read_folio(struct file *file, struct folio *folio); static sector_t befs_bmap(struct address_space *mapping, sector_t block); static struct dentry *befs_lookup(struct inode *, struct dentry *, unsigned int); static struct inode *befs_iget(struct super_block *, unsigned long); static struct inode *befs_alloc_inode(struct super_block *sb); static void befs_free_inode(struct inode *inode); static void befs_destroy_inodecache(void); static int befs_symlink_read_folio(struct file *, struct folio *); static int befs_utf2nls(struct super_block *sb, const char *in, int in_len, char **out, int *out_len); static int befs_nls2utf(struct super_block *sb, const char *in, int in_len, char **out, int *out_len); static void befs_put_super(struct super_block *); static int befs_statfs(struct dentry *, struct kstatfs *); static int befs_show_options(struct seq_file *, struct dentry *); static struct dentry *befs_fh_to_dentry(struct super_block *sb, struct fid *fid, int fh_len, int fh_type); static struct dentry *befs_fh_to_parent(struct super_block *sb, struct fid *fid, int fh_len, int fh_type); static struct dentry *befs_get_parent(struct dentry *child); static void befs_free_fc(struct fs_context *fc); static const struct super_operations befs_sops = { .alloc_inode = befs_alloc_inode, /* allocate a new inode */ .free_inode = befs_free_inode, /* deallocate an inode */ .put_super = befs_put_super, /* uninit super */ .statfs = befs_statfs, /* statfs */ .show_options = befs_show_options, }; /* slab cache for befs_inode_info objects */ static struct kmem_cache *befs_inode_cachep; static const struct file_operations befs_dir_operations = { .read = generic_read_dir, .iterate_shared = befs_readdir, .llseek = generic_file_llseek, }; static const struct inode_operations befs_dir_inode_operations = { .lookup = befs_lookup, }; static const struct address_space_operations befs_aops = { .read_folio = befs_read_folio, .bmap = befs_bmap, }; static const struct address_space_operations 
befs_symlink_aops = { .read_folio = befs_symlink_read_folio, }; static const struct export_operations befs_export_operations = { .encode_fh = generic_encode_ino32_fh, .fh_to_dentry = befs_fh_to_dentry, .fh_to_parent = befs_fh_to_parent, .get_parent = befs_get_parent, }; /* * Called by generic_file_read() to read a folio of data * * In turn, simply calls a generic block read function and * passes it the address of befs_get_block, for mapping file * positions to disk blocks. */ static int befs_read_folio(struct file *file, struct folio *folio) { return block_read_full_folio(folio, befs_get_block); } static sector_t befs_bmap(struct address_space *mapping, sector_t block) { return generic_block_bmap(mapping, block, befs_get_block); } /* * Generic function to map a file position (block) to a * disk offset (passed back in bh_result). * * Used by many higher level functions. * * Calls befs_fblock2brun() in datastream.c to do the real work. */ static int befs_get_block(struct inode *inode, sector_t block, struct buffer_head *bh_result, int create) { struct super_block *sb = inode->i_sb; befs_data_stream *ds = &BEFS_I(inode)->i_data.ds; befs_block_run run = BAD_IADDR; int res; ulong disk_off; befs_debug(sb, "---> befs_get_block() for inode %lu, block %ld", (unsigned long)inode->i_ino, (long)block); if (create) { befs_error(sb, "befs_get_block() was asked to write to " "block %ld in inode %lu", (long)block, (unsigned long)inode->i_ino); return -EPERM; } res = befs_fblock2brun(sb, ds, block, &run); if (res != BEFS_OK) { befs_error(sb, "<--- %s for inode %lu, block %ld ERROR", __func__, (unsigned long)inode->i_ino, (long)block); return -EFBIG; } disk_off = (ulong) iaddr2blockno(sb, &run); map_bh(bh_result, inode->i_sb, disk_off); befs_debug(sb, "<--- %s for inode %lu, block %ld, disk address %lu", __func__, (unsigned long)inode->i_ino, (long)block, (unsigned long)disk_off); return 0; } static struct dentry * befs_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags) { struct inode *inode; struct super_block *sb = dir->i_sb; const befs_data_stream *ds = &BEFS_I(dir)->i_data.ds; befs_off_t offset; int ret; int utfnamelen; char *utfname; const char *name = dentry->d_name.name; befs_debug(sb, "---> %s name %pd inode %ld", __func__, dentry, dir->i_ino); /* Convert to UTF-8 */ if (BEFS_SB(sb)->nls) { ret = befs_nls2utf(sb, name, strlen(name), &utfname, &utfnamelen); if (ret < 0) { befs_debug(sb, "<--- %s ERROR", __func__); return ERR_PTR(ret); } ret = befs_btree_find(sb, ds, utfname, &offset); kfree(utfname); } else { ret = befs_btree_find(sb, ds, name, &offset); } if (ret == BEFS_BT_NOT_FOUND) { befs_debug(sb, "<--- %s %pd not found", __func__, dentry); inode = NULL; } else if (ret != BEFS_OK || offset == 0) { befs_error(sb, "<--- %s Error", __func__); inode = ERR_PTR(-ENODATA); } else { inode = befs_iget(dir->i_sb, (ino_t) offset); } befs_debug(sb, "<--- %s", __func__); return d_splice_alias(inode, dentry); } static int befs_readdir(struct file *file, struct dir_context *ctx) { struct inode *inode = file_inode(file); struct super_block *sb = inode->i_sb; const befs_data_stream *ds = &BEFS_I(inode)->i_data.ds; befs_off_t value; int result; size_t keysize; char keybuf[BEFS_NAME_LEN + 1]; befs_debug(sb, "---> %s name %pD, inode %ld, ctx->pos %lld", __func__, file, inode->i_ino, ctx->pos); while (1) { result = befs_btree_read(sb, ds, ctx->pos, BEFS_NAME_LEN + 1, keybuf, &keysize, &value); if (result == BEFS_ERR) { befs_debug(sb, "<--- %s ERROR", __func__); befs_error(sb, "IO error reading %pD 
(inode %lu)", file, inode->i_ino); return -EIO; } else if (result == BEFS_BT_END) { befs_debug(sb, "<--- %s END", __func__); return 0; } else if (result == BEFS_BT_EMPTY) { befs_debug(sb, "<--- %s Empty directory", __func__); return 0; } /* Convert to NLS */ if (BEFS_SB(sb)->nls) { char *nlsname; int nlsnamelen; result = befs_utf2nls(sb, keybuf, keysize, &nlsname, &nlsnamelen); if (result < 0) { befs_debug(sb, "<--- %s ERROR", __func__); return result; } if (!dir_emit(ctx, nlsname, nlsnamelen, (ino_t) value, DT_UNKNOWN)) { kfree(nlsname); return 0; } kfree(nlsname); } else { if (!dir_emit(ctx, keybuf, keysize, (ino_t) value, DT_UNKNOWN)) return 0; } ctx->pos++; } } static struct inode * befs_alloc_inode(struct super_block *sb) { struct befs_inode_info *bi; bi = alloc_inode_sb(sb, befs_inode_cachep, GFP_KERNEL); if (!bi) return NULL; return &bi->vfs_inode; } static void befs_free_inode(struct inode *inode) { kmem_cache_free(befs_inode_cachep, BEFS_I(inode)); } static void init_once(void *foo) { struct befs_inode_info *bi = (struct befs_inode_info *) foo; inode_init_once(&bi->vfs_inode); } static struct inode *befs_iget(struct super_block *sb, unsigned long ino) { struct buffer_head *bh; befs_inode *raw_inode; struct befs_sb_info *befs_sb = BEFS_SB(sb); struct befs_inode_info *befs_ino; struct inode *inode; befs_debug(sb, "---> %s inode = %lu", __func__, ino); inode = iget_locked(sb, ino); if (!inode) return ERR_PTR(-ENOMEM); if (!(inode->i_state & I_NEW)) return inode; befs_ino = BEFS_I(inode); /* convert from vfs's inode number to befs's inode number */ befs_ino->i_inode_num = blockno2iaddr(sb, inode->i_ino); befs_debug(sb, " real inode number [%u, %hu, %hu]", befs_ino->i_inode_num.allocation_group, befs_ino->i_inode_num.start, befs_ino->i_inode_num.len); bh = sb_bread(sb, inode->i_ino); if (!bh) { befs_error(sb, "unable to read inode block - " "inode = %lu", inode->i_ino); goto unacquire_none; } raw_inode = (befs_inode *) bh->b_data; befs_dump_inode(sb, raw_inode); if (befs_check_inode(sb, raw_inode, inode->i_ino) != BEFS_OK) { befs_error(sb, "Bad inode: %lu", inode->i_ino); goto unacquire_bh; } inode->i_mode = (umode_t) fs32_to_cpu(sb, raw_inode->mode); /* * set uid and gid. But since current BeOS is single user OS, so * you can change by "uid" or "gid" options. */ inode->i_uid = befs_sb->mount_opts.use_uid ? befs_sb->mount_opts.uid : make_kuid(&init_user_ns, fs32_to_cpu(sb, raw_inode->uid)); inode->i_gid = befs_sb->mount_opts.use_gid ? befs_sb->mount_opts.gid : make_kgid(&init_user_ns, fs32_to_cpu(sb, raw_inode->gid)); set_nlink(inode, 1); /* * BEFS's time is 64 bits, but current VFS is 32 bits... * BEFS don't have access time. Nor inode change time. VFS * doesn't have creation time. * Also, the lower 16 bits of the last_modified_time and * create_time are just a counter to help ensure uniqueness * for indexing purposes. 
(PFD, page 54) */ inode_set_mtime(inode, fs64_to_cpu(sb, raw_inode->last_modified_time) >> 16, 0);/* lower 16 bits are not a time */ inode_set_ctime_to_ts(inode, inode_get_mtime(inode)); inode_set_atime_to_ts(inode, inode_get_mtime(inode)); befs_ino->i_inode_num = fsrun_to_cpu(sb, raw_inode->inode_num); befs_ino->i_parent = fsrun_to_cpu(sb, raw_inode->parent); befs_ino->i_attribute = fsrun_to_cpu(sb, raw_inode->attributes); befs_ino->i_flags = fs32_to_cpu(sb, raw_inode->flags); if (S_ISLNK(inode->i_mode) && !(befs_ino->i_flags & BEFS_LONG_SYMLINK)){ inode->i_size = 0; inode->i_blocks = befs_sb->block_size / VFS_BLOCK_SIZE; strscpy(befs_ino->i_data.symlink, raw_inode->data.symlink, BEFS_SYMLINK_LEN); } else { int num_blks; befs_ino->i_data.ds = fsds_to_cpu(sb, &raw_inode->data.datastream); num_blks = befs_count_blocks(sb, &befs_ino->i_data.ds); inode->i_blocks = num_blks * (befs_sb->block_size / VFS_BLOCK_SIZE); inode->i_size = befs_ino->i_data.ds.size; } inode->i_mapping->a_ops = &befs_aops; if (S_ISREG(inode->i_mode)) { inode->i_fop = &generic_ro_fops; } else if (S_ISDIR(inode->i_mode)) { inode->i_op = &befs_dir_inode_operations; inode->i_fop = &befs_dir_operations; } else if (S_ISLNK(inode->i_mode)) { if (befs_ino->i_flags & BEFS_LONG_SYMLINK) { inode->i_op = &page_symlink_inode_operations; inode_nohighmem(inode); inode->i_mapping->a_ops = &befs_symlink_aops; } else { inode->i_link = befs_ino->i_data.symlink; inode->i_op = &simple_symlink_inode_operations; } } else { befs_error(sb, "Inode %lu is not a regular file, " "directory or symlink. THAT IS WRONG! BeFS has no " "on disk special files", inode->i_ino); goto unacquire_bh; } brelse(bh); befs_debug(sb, "<--- %s", __func__); unlock_new_inode(inode); return inode; unacquire_bh: brelse(bh); unacquire_none: iget_failed(inode); befs_debug(sb, "<--- %s - Bad inode", __func__); return ERR_PTR(-EIO); } /* Initialize the inode cache. Called at fs setup. * * Taken from NFS implementation by Al Viro. */ static int __init befs_init_inodecache(void) { befs_inode_cachep = kmem_cache_create_usercopy("befs_inode_cache", sizeof(struct befs_inode_info), 0, SLAB_RECLAIM_ACCOUNT | SLAB_ACCOUNT, offsetof(struct befs_inode_info, i_data.symlink), sizeof_field(struct befs_inode_info, i_data.symlink), init_once); if (befs_inode_cachep == NULL) return -ENOMEM; return 0; } /* Called at fs teardown. * * Taken from NFS implementation by Al Viro. */ static void befs_destroy_inodecache(void) { /* * Make sure all delayed rcu free inodes are flushed before we * destroy cache. */ rcu_barrier(); kmem_cache_destroy(befs_inode_cachep); } /* * The inode of symbolic link is different to data stream. * The data stream become link name. Unless the LONG_SYMLINK * flag is set. 
*/ static int befs_symlink_read_folio(struct file *unused, struct folio *folio) { struct inode *inode = folio->mapping->host; struct super_block *sb = inode->i_sb; struct befs_inode_info *befs_ino = BEFS_I(inode); befs_data_stream *data = &befs_ino->i_data.ds; befs_off_t len = data->size; char *link = folio_address(folio); int err = -EIO; if (len == 0 || len > PAGE_SIZE) { befs_error(sb, "Long symlink with illegal length"); goto fail; } befs_debug(sb, "Follow long symlink"); if (befs_read_lsymlink(sb, data, link, len) != len) { befs_error(sb, "Failed to read entire long symlink"); goto fail; } link[len - 1] = '\0'; err = 0; fail: folio_end_read(folio, err == 0); return err; } /* * UTF-8 to NLS charset convert routine * * Uses uni2char() / char2uni() rather than the nls tables directly */ static int befs_utf2nls(struct super_block *sb, const char *in, int in_len, char **out, int *out_len) { struct nls_table *nls = BEFS_SB(sb)->nls; int i, o; unicode_t uni; int unilen, utflen; char *result; /* The utf8->nls conversion won't make the final nls string bigger * than the utf one, but if the string is pure ascii they'll have the * same width and an extra char is needed to save the additional \0 */ int maxlen = in_len + 1; befs_debug(sb, "---> %s", __func__); if (!nls) { befs_error(sb, "%s called with no NLS table loaded", __func__); return -EINVAL; } *out = result = kmalloc(maxlen, GFP_NOFS); if (!*out) return -ENOMEM; for (i = o = 0; i < in_len; i += utflen, o += unilen) { /* convert from UTF-8 to Unicode */ utflen = utf8_to_utf32(&in[i], in_len - i, &uni); if (utflen < 0) goto conv_err; /* convert from Unicode to nls */ if (uni > MAX_WCHAR_T) goto conv_err; unilen = nls->uni2char(uni, &result[o], in_len - o); if (unilen < 0) goto conv_err; } result[o] = '\0'; *out_len = o; befs_debug(sb, "<--- %s", __func__); return o; conv_err: befs_error(sb, "Name using character set %s contains a character that " "cannot be converted to unicode.", nls->charset); befs_debug(sb, "<--- %s", __func__); kfree(result); return -EILSEQ; } /** * befs_nls2utf - Convert NLS string to utf8 encodeing * @sb: Superblock * @in: Input string buffer in NLS format * @in_len: Length of input string in bytes * @out: The output string in UTF-8 format * @out_len: Length of the output buffer * * Converts input string @in, which is in the format of the loaded NLS map, * into a utf8 string. * * The destination string @out is allocated by this function and the caller is * responsible for freeing it with kfree() * * On return, *@out_len is the length of @out in bytes. * * On success, the return value is the number of utf8 characters written to * the output buffer @out. * * On Failure, a negative number coresponding to the error code is returned. 
*/ static int befs_nls2utf(struct super_block *sb, const char *in, int in_len, char **out, int *out_len) { struct nls_table *nls = BEFS_SB(sb)->nls; int i, o; wchar_t uni; int unilen, utflen; char *result; /* * There are nls characters that will translate to 3-chars-wide UTF-8 * characters, an additional byte is needed to save the final \0 * in special cases */ int maxlen = (3 * in_len) + 1; befs_debug(sb, "---> %s\n", __func__); if (!nls) { befs_error(sb, "%s called with no NLS table loaded.", __func__); return -EINVAL; } *out = result = kmalloc(maxlen, GFP_NOFS); if (!*out) { *out_len = 0; return -ENOMEM; } for (i = o = 0; i < in_len; i += unilen, o += utflen) { /* convert from nls to unicode */ unilen = nls->char2uni(&in[i], in_len - i, &uni); if (unilen < 0) goto conv_err; /* convert from unicode to UTF-8 */ utflen = utf32_to_utf8(uni, &result[o], 3); if (utflen <= 0) goto conv_err; } result[o] = '\0'; *out_len = o; befs_debug(sb, "<--- %s", __func__); return i; conv_err: befs_error(sb, "Name using character set %s contains a character that " "cannot be converted to unicode.", nls->charset); befs_debug(sb, "<--- %s", __func__); kfree(result); return -EILSEQ; } static struct inode *befs_nfs_get_inode(struct super_block *sb, uint64_t ino, uint32_t generation) { /* No need to handle i_generation */ return befs_iget(sb, ino); } /* * Map a NFS file handle to a corresponding dentry */ static struct dentry *befs_fh_to_dentry(struct super_block *sb, struct fid *fid, int fh_len, int fh_type) { return generic_fh_to_dentry(sb, fid, fh_len, fh_type, befs_nfs_get_inode); } /* * Find the parent for a file specified by NFS handle */ static struct dentry *befs_fh_to_parent(struct super_block *sb, struct fid *fid, int fh_len, int fh_type) { return generic_fh_to_parent(sb, fid, fh_len, fh_type, befs_nfs_get_inode); } static struct dentry *befs_get_parent(struct dentry *child) { struct inode *parent; struct befs_inode_info *befs_ino = BEFS_I(d_inode(child)); parent = befs_iget(child->d_sb, (unsigned long)befs_ino->i_parent.start); return d_obtain_alias(parent); } enum { Opt_uid, Opt_gid, Opt_charset, Opt_debug, }; static const struct fs_parameter_spec befs_param_spec[] = { fsparam_uid ("uid", Opt_uid), fsparam_gid ("gid", Opt_gid), fsparam_string ("iocharset", Opt_charset), fsparam_flag ("debug", Opt_debug), {} }; static int befs_parse_param(struct fs_context *fc, struct fs_parameter *param) { struct befs_mount_options *opts = fc->fs_private; int token; struct fs_parse_result result; /* befs ignores all options on remount */ if (fc->purpose == FS_CONTEXT_FOR_RECONFIGURE) return 0; token = fs_parse(fc, befs_param_spec, param, &result); if (token < 0) return token; switch (token) { case Opt_uid: opts->uid = result.uid; opts->use_uid = 1; break; case Opt_gid: opts->gid = result.gid; opts->use_gid = 1; break; case Opt_charset: kfree(opts->iocharset); opts->iocharset = param->string; param->string = NULL; break; case Opt_debug: opts->debug = 1; break; default: return -EINVAL; } return 0; } static int befs_show_options(struct seq_file *m, struct dentry *root) { struct befs_sb_info *befs_sb = BEFS_SB(root->d_sb); struct befs_mount_options *opts = &befs_sb->mount_opts; if (!uid_eq(opts->uid, GLOBAL_ROOT_UID)) seq_printf(m, ",uid=%u", from_kuid_munged(&init_user_ns, opts->uid)); if (!gid_eq(opts->gid, GLOBAL_ROOT_GID)) seq_printf(m, ",gid=%u", from_kgid_munged(&init_user_ns, opts->gid)); if (opts->iocharset) seq_printf(m, ",charset=%s", opts->iocharset); if (opts->debug) seq_puts(m, ",debug"); return 0; } /* This 
function has the responsibility of getting the * filesystem ready for unmounting. * Basically, we free everything that we allocated in * befs_read_inode */ static void befs_put_super(struct super_block *sb) { kfree(BEFS_SB(sb)->mount_opts.iocharset); BEFS_SB(sb)->mount_opts.iocharset = NULL; unload_nls(BEFS_SB(sb)->nls); kfree(sb->s_fs_info); sb->s_fs_info = NULL; } /* * Copy the parsed options into the sbi mount_options member */ static void befs_set_options(struct befs_sb_info *sbi, struct befs_mount_options *opts) { sbi->mount_opts.uid = opts->uid; sbi->mount_opts.gid = opts->gid; sbi->mount_opts.use_uid = opts->use_uid; sbi->mount_opts.use_gid = opts->use_gid; sbi->mount_opts.debug = opts->debug; sbi->mount_opts.iocharset = opts->iocharset; opts->iocharset = NULL; } /* Allocate private field of the superblock, fill it. * * Finish filling the public superblock fields * Make the root directory * Load a set of NLS translations if needed. */ static int befs_fill_super(struct super_block *sb, struct fs_context *fc) { struct buffer_head *bh; struct befs_sb_info *befs_sb; befs_super_block *disk_sb; struct inode *root; long ret = -EINVAL; const unsigned long sb_block = 0; const off_t x86_sb_off = 512; int blocksize; struct befs_mount_options *parsed_opts = fc->fs_private; int silent = fc->sb_flags & SB_SILENT; sb->s_fs_info = kzalloc(sizeof(*befs_sb), GFP_KERNEL); if (sb->s_fs_info == NULL) goto unacquire_none; befs_sb = BEFS_SB(sb); befs_set_options(befs_sb, parsed_opts); befs_debug(sb, "---> %s", __func__); if (!sb_rdonly(sb)) { befs_warning(sb, "No write support. Marking filesystem read-only"); sb->s_flags |= SB_RDONLY; } /* * Set dummy blocksize to read super block. * Will be set to real fs blocksize later. * * Linux 2.4.10 and later refuse to read blocks smaller than * the logical block size for the device. But we also need to read at * least 1k to get the second 512 bytes of the volume.
*/ blocksize = sb_min_blocksize(sb, 1024); if (!blocksize) { if (!silent) befs_error(sb, "unable to set blocksize"); goto unacquire_priv_sbp; } bh = sb_bread(sb, sb_block); if (!bh) { if (!silent) befs_error(sb, "unable to read superblock"); goto unacquire_priv_sbp; } /* account for offset of super block on x86 */ disk_sb = (befs_super_block *) bh->b_data; if ((disk_sb->magic1 == BEFS_SUPER_MAGIC1_LE) || (disk_sb->magic1 == BEFS_SUPER_MAGIC1_BE)) { befs_debug(sb, "Using PPC superblock location"); } else { befs_debug(sb, "Using x86 superblock location"); disk_sb = (befs_super_block *) ((void *) bh->b_data + x86_sb_off); } if ((befs_load_sb(sb, disk_sb) != BEFS_OK) || (befs_check_sb(sb) != BEFS_OK)) goto unacquire_bh; befs_dump_super_block(sb, disk_sb); brelse(bh); if (befs_sb->num_blocks > ~((sector_t)0)) { if (!silent) befs_error(sb, "blocks count: %llu is larger than the host can use", befs_sb->num_blocks); goto unacquire_priv_sbp; } /* * set up enough so that it can read an inode * Fill in kernel superblock fields from private sb */ sb->s_magic = BEFS_SUPER_MAGIC; /* Set real blocksize of fs */ sb_set_blocksize(sb, (ulong) befs_sb->block_size); sb->s_op = &befs_sops; sb->s_export_op = &befs_export_operations; sb->s_time_min = 0; sb->s_time_max = 0xffffffffffffll; root = befs_iget(sb, iaddr2blockno(sb, &(befs_sb->root_dir))); if (IS_ERR(root)) { ret = PTR_ERR(root); goto unacquire_priv_sbp; } sb->s_root = d_make_root(root); if (!sb->s_root) { if (!silent) befs_error(sb, "get root inode failed"); goto unacquire_priv_sbp; } /* load nls library */ if (befs_sb->mount_opts.iocharset) { befs_debug(sb, "Loading nls: %s", befs_sb->mount_opts.iocharset); befs_sb->nls = load_nls(befs_sb->mount_opts.iocharset); if (!befs_sb->nls) { befs_warning(sb, "Cannot load nls %s" " loading default nls", befs_sb->mount_opts.iocharset); befs_sb->nls = load_nls_default(); } /* load default nls if none is specified in mount options */ } else { befs_debug(sb, "Loading default nls"); befs_sb->nls = load_nls_default(); } return 0; unacquire_bh: brelse(bh); unacquire_priv_sbp: kfree(befs_sb->mount_opts.iocharset); kfree(sb->s_fs_info); sb->s_fs_info = NULL; unacquire_none: return ret; } static int befs_reconfigure(struct fs_context *fc) { sync_filesystem(fc->root->d_sb); if (!(fc->sb_flags & SB_RDONLY)) return -EINVAL; return 0; } static int befs_statfs(struct dentry *dentry, struct kstatfs *buf) { struct super_block *sb = dentry->d_sb; u64 id = huge_encode_dev(sb->s_bdev->bd_dev); befs_debug(sb, "---> %s", __func__); buf->f_type = BEFS_SUPER_MAGIC; buf->f_bsize = sb->s_blocksize; buf->f_blocks = BEFS_SB(sb)->num_blocks; buf->f_bfree = BEFS_SB(sb)->num_blocks - BEFS_SB(sb)->used_blocks; buf->f_bavail = buf->f_bfree; buf->f_files = 0; /* UNKNOWN */ buf->f_ffree = 0; /* UNKNOWN */ buf->f_fsid = u64_to_fsid(id); buf->f_namelen = BEFS_NAME_LEN; befs_debug(sb, "<--- %s", __func__); return 0; } static int befs_get_tree(struct fs_context *fc) { return get_tree_bdev(fc, befs_fill_super); } static const struct fs_context_operations befs_context_ops = { .parse_param = befs_parse_param, .get_tree = befs_get_tree, .reconfigure = befs_reconfigure, .free = befs_free_fc, }; static int befs_init_fs_context(struct fs_context *fc) { struct befs_mount_options *opts; opts = kzalloc(sizeof(*opts), GFP_KERNEL); if (!opts) return -ENOMEM; /* Initialize options */ opts->uid = GLOBAL_ROOT_UID; opts->gid = GLOBAL_ROOT_GID; fc->fs_private = opts; fc->ops = &befs_context_ops; return 0; } static void befs_free_fc(struct fs_context *fc) { struct 
befs_mount_options *opts = fc->fs_private; kfree(opts->iocharset); kfree(fc->fs_private); } static struct file_system_type befs_fs_type = { .owner = THIS_MODULE, .name = "befs", .kill_sb = kill_block_super, .fs_flags = FS_REQUIRES_DEV, .init_fs_context = befs_init_fs_context, .parameters = befs_param_spec, }; MODULE_ALIAS_FS("befs"); static int __init init_befs_fs(void) { int err; pr_info("version: %s\n", BEFS_VERSION); err = befs_init_inodecache(); if (err) goto unacquire_none; err = register_filesystem(&befs_fs_type); if (err) goto unacquire_inodecache; return 0; unacquire_inodecache: befs_destroy_inodecache(); unacquire_none: return err; } static void __exit exit_befs_fs(void) { befs_destroy_inodecache(); unregister_filesystem(&befs_fs_type); } /* * Macros that typecheck the init and exit functions, * ensure that they are called at init and cleanup, * and eliminate warnings about unused functions. */ module_init(init_befs_fs) module_exit(exit_befs_fs)
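The superblock probing done in befs_fill_super() above (read at least 1 KiB, try the native location at byte 0, then the x86 location 512 bytes in) can be illustrated with a small user-space sketch. The magic constant, the byte-swap fallback and the helper names below are assumptions made for this example only; the real constants live in the kernel's befs headers.

/* Illustrative user-space probe of the two possible BeFS superblock offsets. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define PROBE_MAGIC1 0x42465331u	/* assumed "BFS1" magic, placeholder */
#define X86_SB_OFF   512		/* x86 superblock offset, as above */

static long befs_sb_offset(int fd)
{
	unsigned char buf[1024];	/* mirrors the 1k minimum read above */
	uint32_t magic;

	if (pread(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf))
		return -1;

	/* Native/PPC location: superblock starts at byte 0 */
	memcpy(&magic, buf, sizeof(magic));
	if (magic == PROBE_MAGIC1 || __builtin_bswap32(magic) == PROBE_MAGIC1)
		return 0;

	/* x86 location: superblock starts 512 bytes into the first block */
	memcpy(&magic, buf + X86_SB_OFF, sizeof(magic));
	if (magic == PROBE_MAGIC1 || __builtin_bswap32(magic) == PROBE_MAGIC1)
		return X86_SB_OFF;

	return -1;			/* no BeFS superblock found */
}

int main(int argc, char **argv)
{
	int fd;
	long off;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <device>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	off = befs_sb_offset(fd);
	printf("superblock offset: %ld\n", off);
	close(fd);
	return off < 0;
}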
// SPDX-License-Identifier: (BSD-3-Clause OR GPL-2.0-only) /* Copyright(c) 2014 - 2020 Intel Corporation */ #include <crypto/algapi.h> #include <linux/module.h> #include <linux/mutex.h> #include <linux/slab.h> #include <linux/fs.h> #include <linux/bitops.h> #include <linux/pci.h> #include <linux/cdev.h> #include <linux/uaccess.h> #include "adf_accel_devices.h" #include "adf_common_drv.h" #include "adf_cfg.h" #include "adf_cfg_common.h" #include "adf_cfg_user.h" #define ADF_CFG_MAX_SECTION 512 #define ADF_CFG_MAX_KEY_VAL 256 #define DEVICE_NAME "qat_adf_ctl" static DEFINE_MUTEX(adf_ctl_lock); static long adf_ctl_ioctl(struct file *fp, unsigned int cmd, unsigned long arg); static const struct file_operations adf_ctl_ops = { .owner = THIS_MODULE, .unlocked_ioctl = adf_ctl_ioctl, .compat_ioctl = compat_ptr_ioctl, }; static const struct class adf_ctl_class = { .name = DEVICE_NAME, }; struct adf_ctl_drv_info { unsigned int major; struct cdev drv_cdev; }; static struct adf_ctl_drv_info adf_ctl_drv; static void adf_chr_drv_destroy(void) { device_destroy(&adf_ctl_class, MKDEV(adf_ctl_drv.major, 0)); cdev_del(&adf_ctl_drv.drv_cdev); class_unregister(&adf_ctl_class); unregister_chrdev_region(MKDEV(adf_ctl_drv.major, 0), 1); } static int adf_chr_drv_create(void) { dev_t dev_id; struct device *drv_device; int ret; if (alloc_chrdev_region(&dev_id, 0, 1, DEVICE_NAME)) { pr_err("QAT: unable to allocate chrdev region\n"); return -EFAULT; } ret = class_register(&adf_ctl_class); if (ret) goto err_chrdev_unreg; adf_ctl_drv.major = MAJOR(dev_id); cdev_init(&adf_ctl_drv.drv_cdev, &adf_ctl_ops); if (cdev_add(&adf_ctl_drv.drv_cdev, dev_id, 1)) { pr_err("QAT: cdev add failed\n"); goto
err_class_destr; } drv_device = device_create(&adf_ctl_class, NULL, MKDEV(adf_ctl_drv.major, 0), NULL, DEVICE_NAME); if (IS_ERR(drv_device)) { pr_err("QAT: failed to create device\n"); goto err_cdev_del; } return 0; err_cdev_del: cdev_del(&adf_ctl_drv.drv_cdev); err_class_destr: class_unregister(&adf_ctl_class); err_chrdev_unreg: unregister_chrdev_region(dev_id, 1); return -EFAULT; } static int adf_ctl_alloc_resources(struct adf_user_cfg_ctl_data **ctl_data, unsigned long arg) { struct adf_user_cfg_ctl_data *cfg_data; cfg_data = kzalloc(sizeof(*cfg_data), GFP_KERNEL); if (!cfg_data) return -ENOMEM; /* Initialize device id to NO DEVICE as 0 is a valid device id */ cfg_data->device_id = ADF_CFG_NO_DEVICE; if (copy_from_user(cfg_data, (void __user *)arg, sizeof(*cfg_data))) { pr_err("QAT: failed to copy from user cfg_data.\n"); kfree(cfg_data); return -EIO; } *ctl_data = cfg_data; return 0; } static int adf_add_key_value_data(struct adf_accel_dev *accel_dev, const char *section, const struct adf_user_cfg_key_val *key_val) { if (key_val->type == ADF_HEX) { long *ptr = (long *)key_val->val; long val = *ptr; if (adf_cfg_add_key_value_param(accel_dev, section, key_val->key, (void *)val, key_val->type)) { dev_err(&GET_DEV(accel_dev), "failed to add hex keyvalue.\n"); return -EFAULT; } } else { if (adf_cfg_add_key_value_param(accel_dev, section, key_val->key, key_val->val, key_val->type)) { dev_err(&GET_DEV(accel_dev), "failed to add keyvalue.\n"); return -EFAULT; } } return 0; } static int adf_copy_key_value_data(struct adf_accel_dev *accel_dev, struct adf_user_cfg_ctl_data *ctl_data) { struct adf_user_cfg_key_val key_val; struct adf_user_cfg_key_val *params_head; struct adf_user_cfg_section section, *section_head; int i, j; section_head = ctl_data->config_section; for (i = 0; section_head && i < ADF_CFG_MAX_SECTION; i++) { if (copy_from_user(&section, (void __user *)section_head, sizeof(*section_head))) { dev_err(&GET_DEV(accel_dev), "failed to copy section info\n"); goto out_err; } if (adf_cfg_section_add(accel_dev, section.name)) { dev_err(&GET_DEV(accel_dev), "failed to add section.\n"); goto out_err; } params_head = section.params; for (j = 0; params_head && j < ADF_CFG_MAX_KEY_VAL; j++) { if (copy_from_user(&key_val, (void __user *)params_head, sizeof(key_val))) { dev_err(&GET_DEV(accel_dev), "Failed to copy keyvalue.\n"); goto out_err; } if (adf_add_key_value_data(accel_dev, section.name, &key_val)) { goto out_err; } params_head = key_val.next; } section_head = section.next; } return 0; out_err: adf_cfg_del_all(accel_dev); return -EFAULT; } static int adf_ctl_ioctl_dev_config(struct file *fp, unsigned int cmd, unsigned long arg) { int ret; struct adf_user_cfg_ctl_data *ctl_data; struct adf_accel_dev *accel_dev; ret = adf_ctl_alloc_resources(&ctl_data, arg); if (ret) return ret; accel_dev = adf_devmgr_get_dev_by_id(ctl_data->device_id); if (!accel_dev) { ret = -EFAULT; goto out; } if (adf_dev_started(accel_dev)) { ret = -EFAULT; goto out; } if (adf_copy_key_value_data(accel_dev, ctl_data)) { ret = -EFAULT; goto out; } set_bit(ADF_STATUS_CONFIGURED, &accel_dev->status); out: kfree(ctl_data); return ret; } static int adf_ctl_is_device_in_use(int id) { struct adf_accel_dev *dev; list_for_each_entry(dev, adf_devmgr_get_head(), list) { if (id == dev->accel_id || id == ADF_CFG_ALL_DEVICES) { if (adf_devmgr_in_reset(dev) || adf_dev_in_use(dev)) { dev_info(&GET_DEV(dev), "device qat_dev%d is busy\n", dev->accel_id); return -EBUSY; } } } return 0; } static void adf_ctl_stop_devices(u32 id) { struct
adf_accel_dev *accel_dev; list_for_each_entry(accel_dev, adf_devmgr_get_head(), list) { if (id == accel_dev->accel_id || id == ADF_CFG_ALL_DEVICES) { if (!adf_dev_started(accel_dev)) continue; /* First stop all VFs */ if (!accel_dev->is_vf) continue; adf_dev_down(accel_dev); } } list_for_each_entry(accel_dev, adf_devmgr_get_head(), list) { if (id == accel_dev->accel_id || id == ADF_CFG_ALL_DEVICES) { if (!adf_dev_started(accel_dev)) continue; adf_dev_down(accel_dev); } } } static int adf_ctl_ioctl_dev_stop(struct file *fp, unsigned int cmd, unsigned long arg) { int ret; struct adf_user_cfg_ctl_data *ctl_data; ret = adf_ctl_alloc_resources(&ctl_data, arg); if (ret) return ret; if (adf_devmgr_verify_id(ctl_data->device_id)) { pr_err("QAT: Device %d not found\n", ctl_data->device_id); ret = -ENODEV; goto out; } ret = adf_ctl_is_device_in_use(ctl_data->device_id); if (ret) goto out; if (ctl_data->device_id == ADF_CFG_ALL_DEVICES) pr_info("QAT: Stopping all acceleration devices.\n"); else pr_info("QAT: Stopping acceleration device qat_dev%d.\n", ctl_data->device_id); adf_ctl_stop_devices(ctl_data->device_id); out: kfree(ctl_data); return ret; } static int adf_ctl_ioctl_dev_start(struct file *fp, unsigned int cmd, unsigned long arg) { int ret; struct adf_user_cfg_ctl_data *ctl_data; struct adf_accel_dev *accel_dev; ret = adf_ctl_alloc_resources(&ctl_data, arg); if (ret) return ret; ret = -ENODEV; accel_dev = adf_devmgr_get_dev_by_id(ctl_data->device_id); if (!accel_dev) goto out; dev_info(&GET_DEV(accel_dev), "Starting acceleration device qat_dev%d.\n", ctl_data->device_id); ret = adf_dev_up(accel_dev, false); if (ret) { dev_err(&GET_DEV(accel_dev), "Failed to start qat_dev%d\n", ctl_data->device_id); adf_dev_down(accel_dev); } out: kfree(ctl_data); return ret; } static int adf_ctl_ioctl_get_num_devices(struct file *fp, unsigned int cmd, unsigned long arg) { u32 num_devices = 0; adf_devmgr_get_num_dev(&num_devices); if (copy_to_user((void __user *)arg, &num_devices, sizeof(num_devices))) return -EFAULT; return 0; } static int adf_ctl_ioctl_get_status(struct file *fp, unsigned int cmd, unsigned long arg) { struct adf_hw_device_data *hw_data; struct adf_dev_status_info dev_info; struct adf_accel_dev *accel_dev; if (copy_from_user(&dev_info, (void __user *)arg, sizeof(struct adf_dev_status_info))) { pr_err("QAT: failed to copy from user.\n"); return -EFAULT; } accel_dev = adf_devmgr_get_dev_by_id(dev_info.accel_id); if (!accel_dev) return -ENODEV; hw_data = accel_dev->hw_device; dev_info.state = adf_dev_started(accel_dev) ? 
DEV_UP : DEV_DOWN; dev_info.num_ae = hw_data->get_num_aes(hw_data); dev_info.num_accel = hw_data->get_num_accels(hw_data); dev_info.num_logical_accel = hw_data->num_logical_accel; dev_info.banks_per_accel = hw_data->num_banks / hw_data->num_logical_accel; strscpy(dev_info.name, hw_data->dev_class->name, sizeof(dev_info.name)); dev_info.instance_id = hw_data->instance_id; dev_info.type = hw_data->dev_class->type; dev_info.bus = accel_to_pci_dev(accel_dev)->bus->number; dev_info.dev = PCI_SLOT(accel_to_pci_dev(accel_dev)->devfn); dev_info.fun = PCI_FUNC(accel_to_pci_dev(accel_dev)->devfn); if (copy_to_user((void __user *)arg, &dev_info, sizeof(struct adf_dev_status_info))) { dev_err(&GET_DEV(accel_dev), "failed to copy status.\n"); return -EFAULT; } return 0; } static long adf_ctl_ioctl(struct file *fp, unsigned int cmd, unsigned long arg) { int ret; if (mutex_lock_interruptible(&adf_ctl_lock)) return -EFAULT; switch (cmd) { case IOCTL_CONFIG_SYS_RESOURCE_PARAMETERS: ret = adf_ctl_ioctl_dev_config(fp, cmd, arg); break; case IOCTL_STOP_ACCEL_DEV: ret = adf_ctl_ioctl_dev_stop(fp, cmd, arg); break; case IOCTL_START_ACCEL_DEV: ret = adf_ctl_ioctl_dev_start(fp, cmd, arg); break; case IOCTL_GET_NUM_DEVICES: ret = adf_ctl_ioctl_get_num_devices(fp, cmd, arg); break; case IOCTL_STATUS_ACCEL_DEV: ret = adf_ctl_ioctl_get_status(fp, cmd, arg); break; default: pr_err_ratelimited("QAT: Invalid ioctl %d\n", cmd); ret = -EFAULT; break; } mutex_unlock(&adf_ctl_lock); return ret; } static int __init adf_register_ctl_device_driver(void) { if (adf_chr_drv_create()) goto err_chr_dev; if (adf_init_misc_wq()) goto err_misc_wq; if (adf_init_aer()) goto err_aer; if (adf_init_pf_wq()) goto err_pf_wq; if (adf_init_vf_wq()) goto err_vf_wq; if (qat_crypto_register()) goto err_crypto_register; if (qat_compression_register()) goto err_compression_register; return 0; err_compression_register: qat_crypto_unregister(); err_crypto_register: adf_exit_vf_wq(); err_vf_wq: adf_exit_pf_wq(); err_pf_wq: adf_exit_aer(); err_aer: adf_exit_misc_wq(); err_misc_wq: adf_chr_drv_destroy(); err_chr_dev: mutex_destroy(&adf_ctl_lock); return -EFAULT; } static void __exit adf_unregister_ctl_device_driver(void) { adf_chr_drv_destroy(); adf_exit_misc_wq(); adf_exit_aer(); adf_exit_vf_wq(); adf_exit_pf_wq(); qat_crypto_unregister(); qat_compression_unregister(); adf_clean_vf_map(false); mutex_destroy(&adf_ctl_lock); } module_init(adf_register_ctl_device_driver); module_exit(adf_unregister_ctl_device_driver); MODULE_LICENSE("Dual BSD/GPL"); MODULE_AUTHOR("Intel"); MODULE_DESCRIPTION("Intel(R) QuickAssist Technology"); MODULE_ALIAS_CRYPTO("intel_qat"); MODULE_VERSION(ADF_DRV_VERSION); MODULE_IMPORT_NS("CRYPTO_INTERNAL"); |
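For context on how the ioctl interface above is exercised, here is a minimal user-space sketch that queries the number of detected accelerators. It assumes the IOCTL_GET_NUM_DEVICES command macro comes from the driver's configuration headers (adf_cfg_common.h / adf_cfg_user.h in the kernel tree) and that udev exposes the character device created by device_create() as /dev/qat_adf_ctl; treat it as an illustration rather than an officially supported tool.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

#include "adf_cfg_common.h"	/* assumed source of IOCTL_GET_NUM_DEVICES */

int main(void)
{
	uint32_t num_devices = 0;
	int fd = open("/dev/qat_adf_ctl", O_RDONLY);

	if (fd < 0) {
		perror("open /dev/qat_adf_ctl");
		return 1;
	}

	/* Mirrors adf_ctl_ioctl_get_num_devices(): the kernel copies a u32 back. */
	if (ioctl(fd, IOCTL_GET_NUM_DEVICES, &num_devices) < 0) {
		perror("IOCTL_GET_NUM_DEVICES");
		close(fd);
		return 1;
	}

	printf("QAT acceleration devices present: %u\n", num_devices);
	close(fd);
	return 0;
}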
// SPDX-License-Identifier: GPL-2.0-or-later /* * Memory-to-memory device framework for Video for Linux 2 and vb2. * * Helper functions for devices that use vb2 buffers for both their * source and destination. * * Copyright (c) 2009-2010 Samsung Electronics Co., Ltd. * Pawel Osciak, <pawel@osciak.com> * Marek Szyprowski, <m.szyprowski@samsung.com> */ #include <linux/module.h> #include <linux/sched.h> #include <linux/slab.h> #include <media/media-device.h> #include <media/videobuf2-v4l2.h> #include <media/v4l2-mem2mem.h> #include <media/v4l2-dev.h> #include <media/v4l2-device.h> #include <media/v4l2-fh.h> #include <media/v4l2-event.h> MODULE_DESCRIPTION("Mem to mem device framework for vb2"); MODULE_AUTHOR("Pawel Osciak, <pawel@osciak.com>"); MODULE_LICENSE("GPL"); static bool debug; module_param(debug, bool, 0644); #define dprintk(fmt, arg...) \ do { \ if (debug) \ printk(KERN_DEBUG "%s: " fmt, __func__, ## arg);\ } while (0) /* Instance is already queued on the job_queue */ #define TRANS_QUEUED (1 << 0) /* Instance is currently running in hardware */ #define TRANS_RUNNING (1 << 1) /* Instance is currently aborting */ #define TRANS_ABORT (1 << 2) /* The job queue is not running new jobs */ #define QUEUE_PAUSED (1 << 0) /* Offset base for buffers on the destination queue - used to distinguish * between source and destination buffers when mmapping - they receive the same * offsets but for different queues */ #define DST_QUEUE_OFF_BASE (1 << 30) enum v4l2_m2m_entity_type { MEM2MEM_ENT_TYPE_SOURCE, MEM2MEM_ENT_TYPE_SINK, MEM2MEM_ENT_TYPE_PROC }; static const char * const m2m_entity_name[] = { "source", "sink", "proc" }; /** * struct v4l2_m2m_dev - per-device context * @source: &struct media_entity pointer with the source entity * Used only when the M2M device is registered via * v4l2_m2m_register_media_controller(). * @source_pad: &struct media_pad with the source pad. * Used only when the M2M device is registered via * v4l2_m2m_register_media_controller(). * @sink: &struct media_entity pointer with the sink entity * Used only when the M2M device is registered via * v4l2_m2m_register_media_controller(). * @sink_pad: &struct media_pad with the sink pad. * Used only when the M2M device is registered via * v4l2_m2m_register_media_controller(). * @proc: &struct media_entity pointer with the M2M device itself. * @proc_pads: &struct media_pad with the @proc pads. * Used only when the M2M device is registered via * v4l2_m2m_unregister_media_controller(). * @intf_devnode: &struct media_intf devnode pointer with the interface * with controls the M2M device. * @curr_ctx: currently running instance * @job_queue: instances queued to run * @job_spinlock: protects job_queue * @job_work: worker to run queued jobs. * @job_queue_flags: flags of the queue status, %QUEUE_PAUSED.
* @m2m_ops: driver callbacks */ struct v4l2_m2m_dev { struct v4l2_m2m_ctx *curr_ctx; #ifdef CONFIG_MEDIA_CONTROLLER struct media_entity *source; struct media_pad source_pad; struct media_entity sink; struct media_pad sink_pad; struct media_entity proc; struct media_pad proc_pads[2]; struct media_intf_devnode *intf_devnode; #endif struct list_head job_queue; spinlock_t job_spinlock; struct work_struct job_work; unsigned long job_queue_flags; const struct v4l2_m2m_ops *m2m_ops; }; static struct v4l2_m2m_queue_ctx *get_queue_ctx(struct v4l2_m2m_ctx *m2m_ctx, enum v4l2_buf_type type) { if (V4L2_TYPE_IS_OUTPUT(type)) return &m2m_ctx->out_q_ctx; else return &m2m_ctx->cap_q_ctx; } struct vb2_queue *v4l2_m2m_get_vq(struct v4l2_m2m_ctx *m2m_ctx, enum v4l2_buf_type type) { struct v4l2_m2m_queue_ctx *q_ctx; q_ctx = get_queue_ctx(m2m_ctx, type); if (!q_ctx) return NULL; return &q_ctx->q; } EXPORT_SYMBOL(v4l2_m2m_get_vq); struct vb2_v4l2_buffer *v4l2_m2m_next_buf(struct v4l2_m2m_queue_ctx *q_ctx) { struct v4l2_m2m_buffer *b; unsigned long flags; spin_lock_irqsave(&q_ctx->rdy_spinlock, flags); if (list_empty(&q_ctx->rdy_queue)) { spin_unlock_irqrestore(&q_ctx->rdy_spinlock, flags); return NULL; } b = list_first_entry(&q_ctx->rdy_queue, struct v4l2_m2m_buffer, list); spin_unlock_irqrestore(&q_ctx->rdy_spinlock, flags); return &b->vb; } EXPORT_SYMBOL_GPL(v4l2_m2m_next_buf); struct vb2_v4l2_buffer *v4l2_m2m_last_buf(struct v4l2_m2m_queue_ctx *q_ctx) { struct v4l2_m2m_buffer *b; unsigned long flags; spin_lock_irqsave(&q_ctx->rdy_spinlock, flags); if (list_empty(&q_ctx->rdy_queue)) { spin_unlock_irqrestore(&q_ctx->rdy_spinlock, flags); return NULL; } b = list_last_entry(&q_ctx->rdy_queue, struct v4l2_m2m_buffer, list); spin_unlock_irqrestore(&q_ctx->rdy_spinlock, flags); return &b->vb; } EXPORT_SYMBOL_GPL(v4l2_m2m_last_buf); struct vb2_v4l2_buffer *v4l2_m2m_buf_remove(struct v4l2_m2m_queue_ctx *q_ctx) { struct v4l2_m2m_buffer *b; unsigned long flags; spin_lock_irqsave(&q_ctx->rdy_spinlock, flags); if (list_empty(&q_ctx->rdy_queue)) { spin_unlock_irqrestore(&q_ctx->rdy_spinlock, flags); return NULL; } b = list_first_entry(&q_ctx->rdy_queue, struct v4l2_m2m_buffer, list); list_del(&b->list); q_ctx->num_rdy--; spin_unlock_irqrestore(&q_ctx->rdy_spinlock, flags); return &b->vb; } EXPORT_SYMBOL_GPL(v4l2_m2m_buf_remove); void v4l2_m2m_buf_remove_by_buf(struct v4l2_m2m_queue_ctx *q_ctx, struct vb2_v4l2_buffer *vbuf) { struct v4l2_m2m_buffer *b; unsigned long flags; spin_lock_irqsave(&q_ctx->rdy_spinlock, flags); b = container_of(vbuf, struct v4l2_m2m_buffer, vb); list_del(&b->list); q_ctx->num_rdy--; spin_unlock_irqrestore(&q_ctx->rdy_spinlock, flags); } EXPORT_SYMBOL_GPL(v4l2_m2m_buf_remove_by_buf); struct vb2_v4l2_buffer * v4l2_m2m_buf_remove_by_idx(struct v4l2_m2m_queue_ctx *q_ctx, unsigned int idx) { struct v4l2_m2m_buffer *b, *tmp; struct vb2_v4l2_buffer *ret = NULL; unsigned long flags; spin_lock_irqsave(&q_ctx->rdy_spinlock, flags); list_for_each_entry_safe(b, tmp, &q_ctx->rdy_queue, list) { if (b->vb.vb2_buf.index == idx) { list_del(&b->list); q_ctx->num_rdy--; ret = &b->vb; break; } } spin_unlock_irqrestore(&q_ctx->rdy_spinlock, flags); return ret; } EXPORT_SYMBOL_GPL(v4l2_m2m_buf_remove_by_idx); /* * Scheduling handlers */ void *v4l2_m2m_get_curr_priv(struct v4l2_m2m_dev *m2m_dev) { unsigned long flags; void *ret = NULL; spin_lock_irqsave(&m2m_dev->job_spinlock, flags); if (m2m_dev->curr_ctx) ret = m2m_dev->curr_ctx->priv; spin_unlock_irqrestore(&m2m_dev->job_spinlock, flags); return ret; } 
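/*
 * Illustrative sketch (not part of this framework): a typical driver
 * .device_run callback built on the ready-queue helpers above. It peeks at
 * the next source and destination buffers of the context chosen by the
 * scheduler, propagates metadata, and starts the hardware. The names
 * mydrv_ctx and mydrv_hw_start(), and the private-data layout, are
 * hypothetical and exist only for this example.
 */
#if 0	/* example only, never compiled here */
static void mydrv_device_run(void *priv)
{
	struct mydrv_ctx *ctx = priv;
	struct vb2_v4l2_buffer *src, *dst;

	/* Peek at (do not remove) the buffers this job will operate on. */
	src = v4l2_m2m_next_src_buf(ctx->fh.m2m_ctx);
	dst = v4l2_m2m_next_dst_buf(ctx->fh.m2m_ctx);

	/* Carry timestamp and flags from the OUTPUT to the CAPTURE buffer. */
	v4l2_m2m_buf_copy_metadata(src, dst, true);

	/*
	 * Kick the hardware. On completion the driver's interrupt handler is
	 * expected to remove both buffers, mark them done with
	 * v4l2_m2m_buf_done(), and call v4l2_m2m_job_finish() so the next
	 * queued job can be scheduled.
	 */
	mydrv_hw_start(ctx, &src->vb2_buf, &dst->vb2_buf);
}
#endif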
EXPORT_SYMBOL(v4l2_m2m_get_curr_priv); /** * v4l2_m2m_try_run() - select next job to perform and run it if possible * @m2m_dev: per-device context * * Get next transaction (if present) from the waiting jobs list and run it. * * Note that this function can run on a given v4l2_m2m_ctx context, * but call .device_run for another context. */ static void v4l2_m2m_try_run(struct v4l2_m2m_dev *m2m_dev) { unsigned long flags; spin_lock_irqsave(&m2m_dev->job_spinlock, flags); if (NULL != m2m_dev->curr_ctx) { spin_unlock_irqrestore(&m2m_dev->job_spinlock, flags); dprintk("Another instance is running, won't run now\n"); return; } if (list_empty(&m2m_dev->job_queue)) { spin_unlock_irqrestore(&m2m_dev->job_spinlock, flags); dprintk("No job pending\n"); return; } if (m2m_dev->job_queue_flags & QUEUE_PAUSED) { spin_unlock_irqrestore(&m2m_dev->job_spinlock, flags); dprintk("Running new jobs is paused\n"); return; } m2m_dev->curr_ctx = list_first_entry(&m2m_dev->job_queue, struct v4l2_m2m_ctx, queue); m2m_dev->curr_ctx->job_flags |= TRANS_RUNNING; spin_unlock_irqrestore(&m2m_dev->job_spinlock, flags); dprintk("Running job on m2m_ctx: %p\n", m2m_dev->curr_ctx); m2m_dev->m2m_ops->device_run(m2m_dev->curr_ctx->priv); } /* * __v4l2_m2m_try_queue() - queue a job * @m2m_dev: m2m device * @m2m_ctx: m2m context * * Check if this context is ready to queue a job. * * This function can run in interrupt context. */ static void __v4l2_m2m_try_queue(struct v4l2_m2m_dev *m2m_dev, struct v4l2_m2m_ctx *m2m_ctx) { unsigned long flags_job; struct vb2_v4l2_buffer *dst, *src; dprintk("Trying to schedule a job for m2m_ctx: %p\n", m2m_ctx); if (!m2m_ctx->out_q_ctx.q.streaming || (!m2m_ctx->cap_q_ctx.q.streaming && !m2m_ctx->ignore_cap_streaming)) { if (!m2m_ctx->ignore_cap_streaming) dprintk("Streaming needs to be on for both queues\n"); else dprintk("Streaming needs to be on for the OUTPUT queue\n"); return; } spin_lock_irqsave(&m2m_dev->job_spinlock, flags_job); /* If the context is aborted then don't schedule it */ if (m2m_ctx->job_flags & TRANS_ABORT) { dprintk("Aborted context\n"); goto job_unlock; } if (m2m_ctx->job_flags & TRANS_QUEUED) { dprintk("On job queue already\n"); goto job_unlock; } src = v4l2_m2m_next_src_buf(m2m_ctx); dst = v4l2_m2m_next_dst_buf(m2m_ctx); if (!src && !m2m_ctx->out_q_ctx.buffered) { dprintk("No input buffers available\n"); goto job_unlock; } if (!dst && !m2m_ctx->cap_q_ctx.buffered) { dprintk("No output buffers available\n"); goto job_unlock; } m2m_ctx->new_frame = true; if (src && dst && dst->is_held && dst->vb2_buf.copied_timestamp && dst->vb2_buf.timestamp != src->vb2_buf.timestamp) { dprintk("Timestamp mismatch, returning held capture buffer\n"); dst->is_held = false; v4l2_m2m_dst_buf_remove(m2m_ctx); v4l2_m2m_buf_done(dst, VB2_BUF_STATE_DONE); dst = v4l2_m2m_next_dst_buf(m2m_ctx); if (!dst && !m2m_ctx->cap_q_ctx.buffered) { dprintk("No output buffers available after returning held buffer\n"); goto job_unlock; } } if (src && dst && (m2m_ctx->out_q_ctx.q.subsystem_flags & VB2_V4L2_FL_SUPPORTS_M2M_HOLD_CAPTURE_BUF)) m2m_ctx->new_frame = !dst->vb2_buf.copied_timestamp || dst->vb2_buf.timestamp != src->vb2_buf.timestamp; if (m2m_ctx->has_stopped) { dprintk("Device has stopped\n"); goto job_unlock; } if (m2m_dev->m2m_ops->job_ready && (!m2m_dev->m2m_ops->job_ready(m2m_ctx->priv))) { dprintk("Driver not ready\n"); goto job_unlock; } list_add_tail(&m2m_ctx->queue, &m2m_dev->job_queue); m2m_ctx->job_flags |= TRANS_QUEUED; job_unlock: spin_unlock_irqrestore(&m2m_dev->job_spinlock, flags_job); } /** * 
v4l2_m2m_try_schedule() - schedule and possibly run a job for any context * @m2m_ctx: m2m context * * Check if this context is ready to queue a job. If suitable, * run the next queued job on the mem2mem device. * * This function shouldn't run in interrupt context. * * Note that v4l2_m2m_try_schedule() can schedule one job for this context, * and then run another job for another context. */ void v4l2_m2m_try_schedule(struct v4l2_m2m_ctx *m2m_ctx) { struct v4l2_m2m_dev *m2m_dev = m2m_ctx->m2m_dev; __v4l2_m2m_try_queue(m2m_dev, m2m_ctx); v4l2_m2m_try_run(m2m_dev); } EXPORT_SYMBOL_GPL(v4l2_m2m_try_schedule); /** * v4l2_m2m_device_run_work() - run pending jobs for the context * @work: Work structure used for scheduling the execution of this function. */ static void v4l2_m2m_device_run_work(struct work_struct *work) { struct v4l2_m2m_dev *m2m_dev = container_of(work, struct v4l2_m2m_dev, job_work); v4l2_m2m_try_run(m2m_dev); } /** * v4l2_m2m_cancel_job() - cancel pending jobs for the context * @m2m_ctx: m2m context with jobs to be canceled * * In case of streamoff or release called on any context, * 1] If the context is currently running, then abort job will be called * 2] If the context is queued, then the context will be removed from * the job_queue */ static void v4l2_m2m_cancel_job(struct v4l2_m2m_ctx *m2m_ctx) { struct v4l2_m2m_dev *m2m_dev; unsigned long flags; m2m_dev = m2m_ctx->m2m_dev; spin_lock_irqsave(&m2m_dev->job_spinlock, flags); m2m_ctx->job_flags |= TRANS_ABORT; if (m2m_ctx->job_flags & TRANS_RUNNING) { spin_unlock_irqrestore(&m2m_dev->job_spinlock, flags); if (m2m_dev->m2m_ops->job_abort) m2m_dev->m2m_ops->job_abort(m2m_ctx->priv); dprintk("m2m_ctx %p running, will wait to complete\n", m2m_ctx); wait_event(m2m_ctx->finished, !(m2m_ctx->job_flags & TRANS_RUNNING)); } else if (m2m_ctx->job_flags & TRANS_QUEUED) { list_del(&m2m_ctx->queue); m2m_ctx->job_flags &= ~(TRANS_QUEUED | TRANS_RUNNING); spin_unlock_irqrestore(&m2m_dev->job_spinlock, flags); dprintk("m2m_ctx: %p had been on queue and was removed\n", m2m_ctx); } else { /* Do nothing, was not on queue/running */ spin_unlock_irqrestore(&m2m_dev->job_spinlock, flags); } } /* * Schedule the next job, called from v4l2_m2m_job_finish() or * v4l2_m2m_buf_done_and_job_finish(). */ static void v4l2_m2m_schedule_next_job(struct v4l2_m2m_dev *m2m_dev, struct v4l2_m2m_ctx *m2m_ctx) { /* * This instance might have more buffers ready, but since we do not * allow more than one job on the job_queue per instance, each has * to be scheduled separately after the previous one finishes. */ __v4l2_m2m_try_queue(m2m_dev, m2m_ctx); /* * We might be running in atomic context, * but the job must be run in non-atomic context. */ schedule_work(&m2m_dev->job_work); } /* * Assumes job_spinlock is held, called from v4l2_m2m_job_finish() or * v4l2_m2m_buf_done_and_job_finish(). */ static bool _v4l2_m2m_job_finish(struct v4l2_m2m_dev *m2m_dev, struct v4l2_m2m_ctx *m2m_ctx) { if (!m2m_dev->curr_ctx || m2m_dev->curr_ctx != m2m_ctx) { dprintk("Called by an instance not currently running\n"); return false; } list_del(&m2m_dev->curr_ctx->queue); m2m_dev->curr_ctx->job_flags &= ~(TRANS_QUEUED | TRANS_RUNNING); wake_up(&m2m_dev->curr_ctx->finished); m2m_dev->curr_ctx = NULL; return true; } void v4l2_m2m_job_finish(struct v4l2_m2m_dev *m2m_dev, struct v4l2_m2m_ctx *m2m_ctx) { unsigned long flags; bool schedule_next; /* * This function should not be used for drivers that support * holding capture buffers. Those should use * v4l2_m2m_buf_done_and_job_finish() instead. 
*/ WARN_ON(m2m_ctx->out_q_ctx.q.subsystem_flags & VB2_V4L2_FL_SUPPORTS_M2M_HOLD_CAPTURE_BUF); spin_lock_irqsave(&m2m_dev->job_spinlock, flags); schedule_next = _v4l2_m2m_job_finish(m2m_dev, m2m_ctx); spin_unlock_irqrestore(&m2m_dev->job_spinlock, flags); if (schedule_next) v4l2_m2m_schedule_next_job(m2m_dev, m2m_ctx); } EXPORT_SYMBOL(v4l2_m2m_job_finish); void v4l2_m2m_buf_done_and_job_finish(struct v4l2_m2m_dev *m2m_dev, struct v4l2_m2m_ctx *m2m_ctx, enum vb2_buffer_state state) { struct vb2_v4l2_buffer *src_buf, *dst_buf; bool schedule_next = false; unsigned long flags; spin_lock_irqsave(&m2m_dev->job_spinlock, flags); src_buf = v4l2_m2m_src_buf_remove(m2m_ctx); dst_buf = v4l2_m2m_next_dst_buf(m2m_ctx); if (WARN_ON(!src_buf || !dst_buf)) goto unlock; dst_buf->is_held = src_buf->flags & V4L2_BUF_FLAG_M2M_HOLD_CAPTURE_BUF; if (!dst_buf->is_held) { v4l2_m2m_dst_buf_remove(m2m_ctx); v4l2_m2m_buf_done(dst_buf, state); } /* * If the request API is being used, returning the OUTPUT * (src) buffer will wake-up any process waiting on the * request file descriptor. * * Therefore, return the CAPTURE (dst) buffer first, * to avoid signalling the request file descriptor * before the CAPTURE buffer is done. */ v4l2_m2m_buf_done(src_buf, state); schedule_next = _v4l2_m2m_job_finish(m2m_dev, m2m_ctx); unlock: spin_unlock_irqrestore(&m2m_dev->job_spinlock, flags); if (schedule_next) v4l2_m2m_schedule_next_job(m2m_dev, m2m_ctx); } EXPORT_SYMBOL(v4l2_m2m_buf_done_and_job_finish); void v4l2_m2m_suspend(struct v4l2_m2m_dev *m2m_dev) { unsigned long flags; struct v4l2_m2m_ctx *curr_ctx; spin_lock_irqsave(&m2m_dev->job_spinlock, flags); m2m_dev->job_queue_flags |= QUEUE_PAUSED; curr_ctx = m2m_dev->curr_ctx; spin_unlock_irqrestore(&m2m_dev->job_spinlock, flags); if (curr_ctx) wait_event(curr_ctx->finished, !(curr_ctx->job_flags & TRANS_RUNNING)); } EXPORT_SYMBOL(v4l2_m2m_suspend); void v4l2_m2m_resume(struct v4l2_m2m_dev *m2m_dev) { unsigned long flags; spin_lock_irqsave(&m2m_dev->job_spinlock, flags); m2m_dev->job_queue_flags &= ~QUEUE_PAUSED; spin_unlock_irqrestore(&m2m_dev->job_spinlock, flags); v4l2_m2m_try_run(m2m_dev); } EXPORT_SYMBOL(v4l2_m2m_resume); int v4l2_m2m_reqbufs(struct file *file, struct v4l2_m2m_ctx *m2m_ctx, struct v4l2_requestbuffers *reqbufs) { struct vb2_queue *vq; int ret; vq = v4l2_m2m_get_vq(m2m_ctx, reqbufs->type); ret = vb2_reqbufs(vq, reqbufs); /* If count == 0, then the owner has released all buffers and he is no longer owner of the queue. Otherwise we have an owner. */ if (ret == 0) vq->owner = reqbufs->count ? file->private_data : NULL; return ret; } EXPORT_SYMBOL_GPL(v4l2_m2m_reqbufs); static void v4l2_m2m_adjust_mem_offset(struct vb2_queue *vq, struct v4l2_buffer *buf) { /* Adjust MMAP memory offsets for the CAPTURE queue */ if (buf->memory == V4L2_MEMORY_MMAP && V4L2_TYPE_IS_CAPTURE(vq->type)) { if (V4L2_TYPE_IS_MULTIPLANAR(vq->type)) { unsigned int i; for (i = 0; i < buf->length; ++i) buf->m.planes[i].m.mem_offset += DST_QUEUE_OFF_BASE; } else { buf->m.offset += DST_QUEUE_OFF_BASE; } } } int v4l2_m2m_querybuf(struct file *file, struct v4l2_m2m_ctx *m2m_ctx, struct v4l2_buffer *buf) { struct vb2_queue *vq; int ret; vq = v4l2_m2m_get_vq(m2m_ctx, buf->type); ret = vb2_querybuf(vq, buf); if (ret) return ret; /* Adjust MMAP memory offsets for the CAPTURE queue */ v4l2_m2m_adjust_mem_offset(vq, buf); return 0; } EXPORT_SYMBOL_GPL(v4l2_m2m_querybuf); /* * This will add the LAST flag and mark the buffer management * state as stopped. 
* This is called when the last capture buffer must be flagged as LAST * in draining mode from the encoder/decoder driver buf_queue() callback * or from v4l2_update_last_buf_state() when a capture buffer is available. */ void v4l2_m2m_last_buffer_done(struct v4l2_m2m_ctx *m2m_ctx, struct vb2_v4l2_buffer *vbuf) { vbuf->flags |= V4L2_BUF_FLAG_LAST; vb2_buffer_done(&vbuf->vb2_buf, VB2_BUF_STATE_DONE); v4l2_m2m_mark_stopped(m2m_ctx); } EXPORT_SYMBOL_GPL(v4l2_m2m_last_buffer_done); /* When stop command is issued, update buffer management state */ static int v4l2_update_last_buf_state(struct v4l2_m2m_ctx *m2m_ctx) { struct vb2_v4l2_buffer *next_dst_buf; if (m2m_ctx->is_draining) return -EBUSY; if (m2m_ctx->has_stopped) return 0; m2m_ctx->last_src_buf = v4l2_m2m_last_src_buf(m2m_ctx); m2m_ctx->is_draining = true; /* * The processing of the last output buffer queued before * the STOP command is expected to mark the buffer management * state as stopped with v4l2_m2m_mark_stopped(). */ if (m2m_ctx->last_src_buf) return 0; /* * In case the output queue is empty, try to mark the last capture * buffer as LAST. */ next_dst_buf = v4l2_m2m_dst_buf_remove(m2m_ctx); if (!next_dst_buf) { /* * Wait for the next queued one in encoder/decoder driver * buf_queue() callback using the v4l2_m2m_dst_buf_is_last() * helper or in v4l2_m2m_qbuf() if encoder/decoder is not yet * streaming. */ m2m_ctx->next_buf_last = true; return 0; } v4l2_m2m_last_buffer_done(m2m_ctx, next_dst_buf); return 0; } /* * Updates the encoding/decoding buffer management state, should * be called from encoder/decoder drivers start_streaming() */ void v4l2_m2m_update_start_streaming_state(struct v4l2_m2m_ctx *m2m_ctx, struct vb2_queue *q) { /* If start streaming again, untag the last output buffer */ if (V4L2_TYPE_IS_OUTPUT(q->type)) m2m_ctx->last_src_buf = NULL; } EXPORT_SYMBOL_GPL(v4l2_m2m_update_start_streaming_state); /* * Updates the encoding/decoding buffer management state, should * be called from encoder/decoder driver stop_streaming() */ void v4l2_m2m_update_stop_streaming_state(struct v4l2_m2m_ctx *m2m_ctx, struct vb2_queue *q) { if (V4L2_TYPE_IS_OUTPUT(q->type)) { /* * If in draining state, either mark next dst buffer as * done or flag next one to be marked as done either * in encoder/decoder driver buf_queue() callback using * the v4l2_m2m_dst_buf_is_last() helper or in v4l2_m2m_qbuf() * if encoder/decoder is not yet streaming */ if (m2m_ctx->is_draining) { struct vb2_v4l2_buffer *next_dst_buf; m2m_ctx->last_src_buf = NULL; next_dst_buf = v4l2_m2m_dst_buf_remove(m2m_ctx); if (!next_dst_buf) m2m_ctx->next_buf_last = true; else v4l2_m2m_last_buffer_done(m2m_ctx, next_dst_buf); } } else { v4l2_m2m_clear_state(m2m_ctx); } } EXPORT_SYMBOL_GPL(v4l2_m2m_update_stop_streaming_state); static void v4l2_m2m_force_last_buf_done(struct v4l2_m2m_ctx *m2m_ctx, struct vb2_queue *q) { struct vb2_buffer *vb; struct vb2_v4l2_buffer *vbuf; unsigned int i; if (WARN_ON(q->is_output)) return; if (list_empty(&q->queued_list)) return; vb = list_first_entry(&q->queued_list, struct vb2_buffer, queued_entry); for (i = 0; i < vb->num_planes; i++) vb2_set_plane_payload(vb, i, 0); /* * Since the buffer hasn't been queued to the ready queue, * mark is active and owned before marking it LAST and DONE */ vb->state = VB2_BUF_STATE_ACTIVE; atomic_inc(&q->owned_by_drv_count); vbuf = to_vb2_v4l2_buffer(vb); vbuf->field = V4L2_FIELD_NONE; v4l2_m2m_last_buffer_done(m2m_ctx, vbuf); } int v4l2_m2m_qbuf(struct file *file, struct v4l2_m2m_ctx *m2m_ctx, struct v4l2_buffer *buf) 
{ struct video_device *vdev = video_devdata(file); struct vb2_queue *vq; int ret; vq = v4l2_m2m_get_vq(m2m_ctx, buf->type); if (V4L2_TYPE_IS_CAPTURE(vq->type) && (buf->flags & V4L2_BUF_FLAG_REQUEST_FD)) { dprintk("%s: requests cannot be used with capture buffers\n", __func__); return -EPERM; } ret = vb2_qbuf(vq, vdev->v4l2_dev->mdev, buf); if (ret) return ret; /* Adjust MMAP memory offsets for the CAPTURE queue */ v4l2_m2m_adjust_mem_offset(vq, buf); /* * If the capture queue is streaming, but streaming hasn't started * on the device, but was asked to stop, mark the previously queued * buffer as DONE with LAST flag since it won't be queued on the * device. */ if (V4L2_TYPE_IS_CAPTURE(vq->type) && vb2_is_streaming(vq) && !vb2_start_streaming_called(vq) && (v4l2_m2m_has_stopped(m2m_ctx) || v4l2_m2m_dst_buf_is_last(m2m_ctx))) v4l2_m2m_force_last_buf_done(m2m_ctx, vq); else if (!(buf->flags & V4L2_BUF_FLAG_IN_REQUEST)) v4l2_m2m_try_schedule(m2m_ctx); return 0; } EXPORT_SYMBOL_GPL(v4l2_m2m_qbuf); int v4l2_m2m_dqbuf(struct file *file, struct v4l2_m2m_ctx *m2m_ctx, struct v4l2_buffer *buf) { struct vb2_queue *vq; int ret; vq = v4l2_m2m_get_vq(m2m_ctx, buf->type); ret = vb2_dqbuf(vq, buf, file->f_flags & O_NONBLOCK); if (ret) return ret; /* Adjust MMAP memory offsets for the CAPTURE queue */ v4l2_m2m_adjust_mem_offset(vq, buf); return 0; } EXPORT_SYMBOL_GPL(v4l2_m2m_dqbuf); int v4l2_m2m_prepare_buf(struct file *file, struct v4l2_m2m_ctx *m2m_ctx, struct v4l2_buffer *buf) { struct video_device *vdev = video_devdata(file); struct vb2_queue *vq; int ret; vq = v4l2_m2m_get_vq(m2m_ctx, buf->type); ret = vb2_prepare_buf(vq, vdev->v4l2_dev->mdev, buf); if (ret) return ret; /* Adjust MMAP memory offsets for the CAPTURE queue */ v4l2_m2m_adjust_mem_offset(vq, buf); return 0; } EXPORT_SYMBOL_GPL(v4l2_m2m_prepare_buf); int v4l2_m2m_create_bufs(struct file *file, struct v4l2_m2m_ctx *m2m_ctx, struct v4l2_create_buffers *create) { struct vb2_queue *vq; vq = v4l2_m2m_get_vq(m2m_ctx, create->format.type); return vb2_create_bufs(vq, create); } EXPORT_SYMBOL_GPL(v4l2_m2m_create_bufs); int v4l2_m2m_expbuf(struct file *file, struct v4l2_m2m_ctx *m2m_ctx, struct v4l2_exportbuffer *eb) { struct vb2_queue *vq; vq = v4l2_m2m_get_vq(m2m_ctx, eb->type); return vb2_expbuf(vq, eb); } EXPORT_SYMBOL_GPL(v4l2_m2m_expbuf); int v4l2_m2m_streamon(struct file *file, struct v4l2_m2m_ctx *m2m_ctx, enum v4l2_buf_type type) { struct vb2_queue *vq; int ret; vq = v4l2_m2m_get_vq(m2m_ctx, type); ret = vb2_streamon(vq, type); if (!ret) v4l2_m2m_try_schedule(m2m_ctx); return ret; } EXPORT_SYMBOL_GPL(v4l2_m2m_streamon); int v4l2_m2m_streamoff(struct file *file, struct v4l2_m2m_ctx *m2m_ctx, enum v4l2_buf_type type) { struct v4l2_m2m_dev *m2m_dev; struct v4l2_m2m_queue_ctx *q_ctx; unsigned long flags_job, flags; int ret; /* wait until the current context is dequeued from job_queue */ v4l2_m2m_cancel_job(m2m_ctx); q_ctx = get_queue_ctx(m2m_ctx, type); ret = vb2_streamoff(&q_ctx->q, type); if (ret) return ret; m2m_dev = m2m_ctx->m2m_dev; spin_lock_irqsave(&m2m_dev->job_spinlock, flags_job); /* We should not be scheduled anymore, since we're dropping a queue. */ if (m2m_ctx->job_flags & TRANS_QUEUED) list_del(&m2m_ctx->queue); m2m_ctx->job_flags = 0; spin_lock_irqsave(&q_ctx->rdy_spinlock, flags); /* Drop queue, since streamoff returns device to the same state as after * calling reqbufs. 
*/ INIT_LIST_HEAD(&q_ctx->rdy_queue); q_ctx->num_rdy = 0; spin_unlock_irqrestore(&q_ctx->rdy_spinlock, flags); if (m2m_dev->curr_ctx == m2m_ctx) { m2m_dev->curr_ctx = NULL; wake_up(&m2m_ctx->finished); } spin_unlock_irqrestore(&m2m_dev->job_spinlock, flags_job); return 0; } EXPORT_SYMBOL_GPL(v4l2_m2m_streamoff); static __poll_t v4l2_m2m_poll_for_data(struct file *file, struct v4l2_m2m_ctx *m2m_ctx, struct poll_table_struct *wait) { struct vb2_queue *src_q, *dst_q; __poll_t rc = 0; unsigned long flags; src_q = v4l2_m2m_get_src_vq(m2m_ctx); dst_q = v4l2_m2m_get_dst_vq(m2m_ctx); /* * There has to be at least one buffer queued on each queued_list, which * means either in driver already or waiting for driver to claim it * and start processing. */ if ((!vb2_is_streaming(src_q) || src_q->error || list_empty(&src_q->queued_list)) && (!vb2_is_streaming(dst_q) || dst_q->error || (list_empty(&dst_q->queued_list) && !dst_q->last_buffer_dequeued))) return EPOLLERR; spin_lock_irqsave(&src_q->done_lock, flags); if (!list_empty(&src_q->done_list)) rc |= EPOLLOUT | EPOLLWRNORM; spin_unlock_irqrestore(&src_q->done_lock, flags); spin_lock_irqsave(&dst_q->done_lock, flags); /* * If the last buffer was dequeued from the capture queue, signal * userspace. DQBUF(CAPTURE) will return -EPIPE. */ if (!list_empty(&dst_q->done_list) || dst_q->last_buffer_dequeued) rc |= EPOLLIN | EPOLLRDNORM; spin_unlock_irqrestore(&dst_q->done_lock, flags); return rc; } __poll_t v4l2_m2m_poll(struct file *file, struct v4l2_m2m_ctx *m2m_ctx, struct poll_table_struct *wait) { struct video_device *vfd = video_devdata(file); struct vb2_queue *src_q = v4l2_m2m_get_src_vq(m2m_ctx); struct vb2_queue *dst_q = v4l2_m2m_get_dst_vq(m2m_ctx); __poll_t req_events = poll_requested_events(wait); __poll_t rc = 0; /* * poll_wait() MUST be called on the first invocation on all the * potential queues of interest, even if we are not interested in their * events during this first call. Failure to do so will result in * queue's events to be ignored because the poll_table won't be capable * of adding new wait queues thereafter. 
*/ poll_wait(file, &src_q->done_wq, wait); poll_wait(file, &dst_q->done_wq, wait); if (req_events & (EPOLLOUT | EPOLLWRNORM | EPOLLIN | EPOLLRDNORM)) rc = v4l2_m2m_poll_for_data(file, m2m_ctx, wait); if (test_bit(V4L2_FL_USES_V4L2_FH, &vfd->flags)) { struct v4l2_fh *fh = file->private_data; poll_wait(file, &fh->wait, wait); if (v4l2_event_pending(fh)) rc |= EPOLLPRI; } return rc; } EXPORT_SYMBOL_GPL(v4l2_m2m_poll); int v4l2_m2m_mmap(struct file *file, struct v4l2_m2m_ctx *m2m_ctx, struct vm_area_struct *vma) { unsigned long offset = vma->vm_pgoff << PAGE_SHIFT; struct vb2_queue *vq; if (offset < DST_QUEUE_OFF_BASE) { vq = v4l2_m2m_get_src_vq(m2m_ctx); } else { vq = v4l2_m2m_get_dst_vq(m2m_ctx); vma->vm_pgoff -= (DST_QUEUE_OFF_BASE >> PAGE_SHIFT); } return vb2_mmap(vq, vma); } EXPORT_SYMBOL(v4l2_m2m_mmap); #ifndef CONFIG_MMU unsigned long v4l2_m2m_get_unmapped_area(struct file *file, unsigned long addr, unsigned long len, unsigned long pgoff, unsigned long flags) { struct v4l2_fh *fh = file->private_data; unsigned long offset = pgoff << PAGE_SHIFT; struct vb2_queue *vq; if (offset < DST_QUEUE_OFF_BASE) { vq = v4l2_m2m_get_src_vq(fh->m2m_ctx); } else { vq = v4l2_m2m_get_dst_vq(fh->m2m_ctx); pgoff -= (DST_QUEUE_OFF_BASE >> PAGE_SHIFT); } return vb2_get_unmapped_area(vq, addr, len, pgoff, flags); } EXPORT_SYMBOL_GPL(v4l2_m2m_get_unmapped_area); #endif #if defined(CONFIG_MEDIA_CONTROLLER) void v4l2_m2m_unregister_media_controller(struct v4l2_m2m_dev *m2m_dev) { media_remove_intf_links(&m2m_dev->intf_devnode->intf); media_devnode_remove(m2m_dev->intf_devnode); media_entity_remove_links(m2m_dev->source); media_entity_remove_links(&m2m_dev->sink); media_entity_remove_links(&m2m_dev->proc); media_device_unregister_entity(m2m_dev->source); media_device_unregister_entity(&m2m_dev->sink); media_device_unregister_entity(&m2m_dev->proc); kfree(m2m_dev->source->name); kfree(m2m_dev->sink.name); kfree(m2m_dev->proc.name); } EXPORT_SYMBOL_GPL(v4l2_m2m_unregister_media_controller); static int v4l2_m2m_register_entity(struct media_device *mdev, struct v4l2_m2m_dev *m2m_dev, enum v4l2_m2m_entity_type type, struct video_device *vdev, int function) { struct media_entity *entity; struct media_pad *pads; char *name; unsigned int len; int num_pads; int ret; switch (type) { case MEM2MEM_ENT_TYPE_SOURCE: entity = m2m_dev->source; pads = &m2m_dev->source_pad; pads[0].flags = MEDIA_PAD_FL_SOURCE; num_pads = 1; break; case MEM2MEM_ENT_TYPE_SINK: entity = &m2m_dev->sink; pads = &m2m_dev->sink_pad; pads[0].flags = MEDIA_PAD_FL_SINK; num_pads = 1; break; case MEM2MEM_ENT_TYPE_PROC: entity = &m2m_dev->proc; pads = m2m_dev->proc_pads; pads[0].flags = MEDIA_PAD_FL_SINK; pads[1].flags = MEDIA_PAD_FL_SOURCE; num_pads = 2; break; default: return -EINVAL; } entity->obj_type = MEDIA_ENTITY_TYPE_BASE; if (type != MEM2MEM_ENT_TYPE_PROC) { entity->info.dev.major = VIDEO_MAJOR; entity->info.dev.minor = vdev->minor; } len = strlen(vdev->name) + 2 + strlen(m2m_entity_name[type]); name = kmalloc(len, GFP_KERNEL); if (!name) return -ENOMEM; snprintf(name, len, "%s-%s", vdev->name, m2m_entity_name[type]); entity->name = name; entity->function = function; ret = media_entity_pads_init(entity, num_pads, pads); if (ret) { kfree(entity->name); entity->name = NULL; return ret; } ret = media_device_register_entity(mdev, entity); if (ret) { kfree(entity->name); entity->name = NULL; return ret; } return 0; } int v4l2_m2m_register_media_controller(struct v4l2_m2m_dev *m2m_dev, struct video_device *vdev, int function) { struct media_device *mdev = 
vdev->v4l2_dev->mdev; struct media_link *link; int ret; if (!mdev) return 0; /* A memory-to-memory device consists in two * DMA engine and one video processing entities. * The DMA engine entities are linked to a V4L interface */ /* Create the three entities with their pads */ m2m_dev->source = &vdev->entity; ret = v4l2_m2m_register_entity(mdev, m2m_dev, MEM2MEM_ENT_TYPE_SOURCE, vdev, MEDIA_ENT_F_IO_V4L); if (ret) return ret; ret = v4l2_m2m_register_entity(mdev, m2m_dev, MEM2MEM_ENT_TYPE_PROC, vdev, function); if (ret) goto err_rel_entity0; ret = v4l2_m2m_register_entity(mdev, m2m_dev, MEM2MEM_ENT_TYPE_SINK, vdev, MEDIA_ENT_F_IO_V4L); if (ret) goto err_rel_entity1; /* Connect the three entities */ ret = media_create_pad_link(m2m_dev->source, 0, &m2m_dev->proc, 0, MEDIA_LNK_FL_IMMUTABLE | MEDIA_LNK_FL_ENABLED); if (ret) goto err_rel_entity2; ret = media_create_pad_link(&m2m_dev->proc, 1, &m2m_dev->sink, 0, MEDIA_LNK_FL_IMMUTABLE | MEDIA_LNK_FL_ENABLED); if (ret) goto err_rm_links0; /* Create video interface */ m2m_dev->intf_devnode = media_devnode_create(mdev, MEDIA_INTF_T_V4L_VIDEO, 0, VIDEO_MAJOR, vdev->minor); if (!m2m_dev->intf_devnode) { ret = -ENOMEM; goto err_rm_links1; } /* Connect the two DMA engines to the interface */ link = media_create_intf_link(m2m_dev->source, &m2m_dev->intf_devnode->intf, MEDIA_LNK_FL_IMMUTABLE | MEDIA_LNK_FL_ENABLED); if (!link) { ret = -ENOMEM; goto err_rm_devnode; } link = media_create_intf_link(&m2m_dev->sink, &m2m_dev->intf_devnode->intf, MEDIA_LNK_FL_IMMUTABLE | MEDIA_LNK_FL_ENABLED); if (!link) { ret = -ENOMEM; goto err_rm_intf_link; } return 0; err_rm_intf_link: media_remove_intf_links(&m2m_dev->intf_devnode->intf); err_rm_devnode: media_devnode_remove(m2m_dev->intf_devnode); err_rm_links1: media_entity_remove_links(&m2m_dev->sink); err_rm_links0: media_entity_remove_links(&m2m_dev->proc); media_entity_remove_links(m2m_dev->source); err_rel_entity2: media_device_unregister_entity(&m2m_dev->proc); kfree(m2m_dev->proc.name); err_rel_entity1: media_device_unregister_entity(&m2m_dev->sink); kfree(m2m_dev->sink.name); err_rel_entity0: media_device_unregister_entity(m2m_dev->source); kfree(m2m_dev->source->name); return ret; return 0; } EXPORT_SYMBOL_GPL(v4l2_m2m_register_media_controller); #endif struct v4l2_m2m_dev *v4l2_m2m_init(const struct v4l2_m2m_ops *m2m_ops) { struct v4l2_m2m_dev *m2m_dev; if (!m2m_ops || WARN_ON(!m2m_ops->device_run)) return ERR_PTR(-EINVAL); m2m_dev = kzalloc(sizeof *m2m_dev, GFP_KERNEL); if (!m2m_dev) return ERR_PTR(-ENOMEM); m2m_dev->curr_ctx = NULL; m2m_dev->m2m_ops = m2m_ops; INIT_LIST_HEAD(&m2m_dev->job_queue); spin_lock_init(&m2m_dev->job_spinlock); INIT_WORK(&m2m_dev->job_work, v4l2_m2m_device_run_work); return m2m_dev; } EXPORT_SYMBOL_GPL(v4l2_m2m_init); void v4l2_m2m_release(struct v4l2_m2m_dev *m2m_dev) { kfree(m2m_dev); } EXPORT_SYMBOL_GPL(v4l2_m2m_release); struct v4l2_m2m_ctx *v4l2_m2m_ctx_init(struct v4l2_m2m_dev *m2m_dev, void *drv_priv, int (*queue_init)(void *priv, struct vb2_queue *src_vq, struct vb2_queue *dst_vq)) { struct v4l2_m2m_ctx *m2m_ctx; struct v4l2_m2m_queue_ctx *out_q_ctx, *cap_q_ctx; int ret; m2m_ctx = kzalloc(sizeof *m2m_ctx, GFP_KERNEL); if (!m2m_ctx) return ERR_PTR(-ENOMEM); m2m_ctx->priv = drv_priv; m2m_ctx->m2m_dev = m2m_dev; init_waitqueue_head(&m2m_ctx->finished); out_q_ctx = &m2m_ctx->out_q_ctx; cap_q_ctx = &m2m_ctx->cap_q_ctx; INIT_LIST_HEAD(&out_q_ctx->rdy_queue); INIT_LIST_HEAD(&cap_q_ctx->rdy_queue); spin_lock_init(&out_q_ctx->rdy_spinlock); spin_lock_init(&cap_q_ctx->rdy_spinlock); 
INIT_LIST_HEAD(&m2m_ctx->queue); ret = queue_init(drv_priv, &out_q_ctx->q, &cap_q_ctx->q); if (ret) goto err; /* * Both queues should use same the mutex to lock the m2m context. * This lock is used in some v4l2_m2m_* helpers. */ if (WARN_ON(out_q_ctx->q.lock != cap_q_ctx->q.lock)) { ret = -EINVAL; goto err; } m2m_ctx->q_lock = out_q_ctx->q.lock; return m2m_ctx; err: kfree(m2m_ctx); return ERR_PTR(ret); } EXPORT_SYMBOL_GPL(v4l2_m2m_ctx_init); void v4l2_m2m_ctx_release(struct v4l2_m2m_ctx *m2m_ctx) { /* wait until the current context is dequeued from job_queue */ v4l2_m2m_cancel_job(m2m_ctx); vb2_queue_release(&m2m_ctx->cap_q_ctx.q); vb2_queue_release(&m2m_ctx->out_q_ctx.q); kfree(m2m_ctx); } EXPORT_SYMBOL_GPL(v4l2_m2m_ctx_release); void v4l2_m2m_buf_queue(struct v4l2_m2m_ctx *m2m_ctx, struct vb2_v4l2_buffer *vbuf) { struct v4l2_m2m_buffer *b = container_of(vbuf, struct v4l2_m2m_buffer, vb); struct v4l2_m2m_queue_ctx *q_ctx; unsigned long flags; q_ctx = get_queue_ctx(m2m_ctx, vbuf->vb2_buf.vb2_queue->type); if (!q_ctx) return; spin_lock_irqsave(&q_ctx->rdy_spinlock, flags); list_add_tail(&b->list, &q_ctx->rdy_queue); q_ctx->num_rdy++; spin_unlock_irqrestore(&q_ctx->rdy_spinlock, flags); } EXPORT_SYMBOL_GPL(v4l2_m2m_buf_queue); void v4l2_m2m_buf_copy_metadata(const struct vb2_v4l2_buffer *out_vb, struct vb2_v4l2_buffer *cap_vb, bool copy_frame_flags) { u32 mask = V4L2_BUF_FLAG_TIMECODE | V4L2_BUF_FLAG_TSTAMP_SRC_MASK; if (copy_frame_flags) mask |= V4L2_BUF_FLAG_KEYFRAME | V4L2_BUF_FLAG_PFRAME | V4L2_BUF_FLAG_BFRAME; cap_vb->vb2_buf.timestamp = out_vb->vb2_buf.timestamp; if (out_vb->flags & V4L2_BUF_FLAG_TIMECODE) cap_vb->timecode = out_vb->timecode; cap_vb->field = out_vb->field; cap_vb->flags &= ~mask; cap_vb->flags |= out_vb->flags & mask; cap_vb->vb2_buf.copied_timestamp = 1; } EXPORT_SYMBOL_GPL(v4l2_m2m_buf_copy_metadata); void v4l2_m2m_request_queue(struct media_request *req) { struct media_request_object *obj, *obj_safe; struct v4l2_m2m_ctx *m2m_ctx = NULL; /* * Queue all objects. Note that buffer objects are at the end of the * objects list, after all other object types. Once buffer objects * are queued, the driver might delete them immediately (if the driver * processes the buffer at once), so we have to use * list_for_each_entry_safe() to handle the case where the object we * queue is deleted. */ list_for_each_entry_safe(obj, obj_safe, &req->objects, list) { struct v4l2_m2m_ctx *m2m_ctx_obj; struct vb2_buffer *vb; if (!obj->ops->queue) continue; if (vb2_request_object_is_buffer(obj)) { /* Sanity checks */ vb = container_of(obj, struct vb2_buffer, req_obj); WARN_ON(!V4L2_TYPE_IS_OUTPUT(vb->vb2_queue->type)); m2m_ctx_obj = container_of(vb->vb2_queue, struct v4l2_m2m_ctx, out_q_ctx.q); WARN_ON(m2m_ctx && m2m_ctx_obj != m2m_ctx); m2m_ctx = m2m_ctx_obj; } /* * The buffer we queue here can in theory be immediately * unbound, hence the use of list_for_each_entry_safe() * above and why we call the queue op last. 
*/ obj->ops->queue(obj); } WARN_ON(!m2m_ctx); if (m2m_ctx) v4l2_m2m_try_schedule(m2m_ctx); } EXPORT_SYMBOL_GPL(v4l2_m2m_request_queue); /* Videobuf2 ioctl helpers */ int v4l2_m2m_ioctl_reqbufs(struct file *file, void *priv, struct v4l2_requestbuffers *rb) { struct v4l2_fh *fh = file->private_data; return v4l2_m2m_reqbufs(file, fh->m2m_ctx, rb); } EXPORT_SYMBOL_GPL(v4l2_m2m_ioctl_reqbufs); int v4l2_m2m_ioctl_create_bufs(struct file *file, void *priv, struct v4l2_create_buffers *create) { struct v4l2_fh *fh = file->private_data; return v4l2_m2m_create_bufs(file, fh->m2m_ctx, create); } EXPORT_SYMBOL_GPL(v4l2_m2m_ioctl_create_bufs); int v4l2_m2m_ioctl_remove_bufs(struct file *file, void *priv, struct v4l2_remove_buffers *remove) { struct v4l2_fh *fh = file->private_data; struct vb2_queue *q = v4l2_m2m_get_vq(fh->m2m_ctx, remove->type); if (!q) return -EINVAL; if (q->type != remove->type) return -EINVAL; return vb2_core_remove_bufs(q, remove->index, remove->count); } EXPORT_SYMBOL_GPL(v4l2_m2m_ioctl_remove_bufs); int v4l2_m2m_ioctl_querybuf(struct file *file, void *priv, struct v4l2_buffer *buf) { struct v4l2_fh *fh = file->private_data; return v4l2_m2m_querybuf(file, fh->m2m_ctx, buf); } EXPORT_SYMBOL_GPL(v4l2_m2m_ioctl_querybuf); int v4l2_m2m_ioctl_qbuf(struct file *file, void *priv, struct v4l2_buffer *buf) { struct v4l2_fh *fh = file->private_data; return v4l2_m2m_qbuf(file, fh->m2m_ctx, buf); } EXPORT_SYMBOL_GPL(v4l2_m2m_ioctl_qbuf); int v4l2_m2m_ioctl_dqbuf(struct file *file, void *priv, struct v4l2_buffer *buf) { struct v4l2_fh *fh = file->private_data; return v4l2_m2m_dqbuf(file, fh->m2m_ctx, buf); } EXPORT_SYMBOL_GPL(v4l2_m2m_ioctl_dqbuf); int v4l2_m2m_ioctl_prepare_buf(struct file *file, void *priv, struct v4l2_buffer *buf) { struct v4l2_fh *fh = file->private_data; return v4l2_m2m_prepare_buf(file, fh->m2m_ctx, buf); } EXPORT_SYMBOL_GPL(v4l2_m2m_ioctl_prepare_buf); int v4l2_m2m_ioctl_expbuf(struct file *file, void *priv, struct v4l2_exportbuffer *eb) { struct v4l2_fh *fh = file->private_data; return v4l2_m2m_expbuf(file, fh->m2m_ctx, eb); } EXPORT_SYMBOL_GPL(v4l2_m2m_ioctl_expbuf); int v4l2_m2m_ioctl_streamon(struct file *file, void *priv, enum v4l2_buf_type type) { struct v4l2_fh *fh = file->private_data; return v4l2_m2m_streamon(file, fh->m2m_ctx, type); } EXPORT_SYMBOL_GPL(v4l2_m2m_ioctl_streamon); int v4l2_m2m_ioctl_streamoff(struct file *file, void *priv, enum v4l2_buf_type type) { struct v4l2_fh *fh = file->private_data; return v4l2_m2m_streamoff(file, fh->m2m_ctx, type); } EXPORT_SYMBOL_GPL(v4l2_m2m_ioctl_streamoff); int v4l2_m2m_ioctl_try_encoder_cmd(struct file *file, void *fh, struct v4l2_encoder_cmd *ec) { if (ec->cmd != V4L2_ENC_CMD_STOP && ec->cmd != V4L2_ENC_CMD_START) return -EINVAL; ec->flags = 0; return 0; } EXPORT_SYMBOL_GPL(v4l2_m2m_ioctl_try_encoder_cmd); int v4l2_m2m_ioctl_try_decoder_cmd(struct file *file, void *fh, struct v4l2_decoder_cmd *dc) { if (dc->cmd != V4L2_DEC_CMD_STOP && dc->cmd != V4L2_DEC_CMD_START) return -EINVAL; dc->flags = 0; if (dc->cmd == V4L2_DEC_CMD_STOP) { dc->stop.pts = 0; } else if (dc->cmd == V4L2_DEC_CMD_START) { dc->start.speed = 0; dc->start.format = V4L2_DEC_START_FMT_NONE; } return 0; } EXPORT_SYMBOL_GPL(v4l2_m2m_ioctl_try_decoder_cmd); /* * Updates the encoding state on ENC_CMD_STOP/ENC_CMD_START * Should be called from the encoder driver encoder_cmd() callback */ int v4l2_m2m_encoder_cmd(struct file *file, struct v4l2_m2m_ctx *m2m_ctx, struct v4l2_encoder_cmd *ec) { if (ec->cmd != V4L2_ENC_CMD_STOP && ec->cmd != 
V4L2_ENC_CMD_START) return -EINVAL; if (ec->cmd == V4L2_ENC_CMD_STOP) return v4l2_update_last_buf_state(m2m_ctx); if (m2m_ctx->is_draining) return -EBUSY; if (m2m_ctx->has_stopped) m2m_ctx->has_stopped = false; return 0; } EXPORT_SYMBOL_GPL(v4l2_m2m_encoder_cmd); /* * Updates the decoding state on DEC_CMD_STOP/DEC_CMD_START * Should be called from the decoder driver decoder_cmd() callback */ int v4l2_m2m_decoder_cmd(struct file *file, struct v4l2_m2m_ctx *m2m_ctx, struct v4l2_decoder_cmd *dc) { if (dc->cmd != V4L2_DEC_CMD_STOP && dc->cmd != V4L2_DEC_CMD_START) return -EINVAL; if (dc->cmd == V4L2_DEC_CMD_STOP) return v4l2_update_last_buf_state(m2m_ctx); if (m2m_ctx->is_draining) return -EBUSY; if (m2m_ctx->has_stopped) m2m_ctx->has_stopped = false; return 0; } EXPORT_SYMBOL_GPL(v4l2_m2m_decoder_cmd); int v4l2_m2m_ioctl_encoder_cmd(struct file *file, void *priv, struct v4l2_encoder_cmd *ec) { struct v4l2_fh *fh = file->private_data; return v4l2_m2m_encoder_cmd(file, fh->m2m_ctx, ec); } EXPORT_SYMBOL_GPL(v4l2_m2m_ioctl_encoder_cmd); int v4l2_m2m_ioctl_decoder_cmd(struct file *file, void *priv, struct v4l2_decoder_cmd *dc) { struct v4l2_fh *fh = file->private_data; return v4l2_m2m_decoder_cmd(file, fh->m2m_ctx, dc); } EXPORT_SYMBOL_GPL(v4l2_m2m_ioctl_decoder_cmd); int v4l2_m2m_ioctl_stateless_try_decoder_cmd(struct file *file, void *fh, struct v4l2_decoder_cmd *dc) { if (dc->cmd != V4L2_DEC_CMD_FLUSH) return -EINVAL; dc->flags = 0; return 0; } EXPORT_SYMBOL_GPL(v4l2_m2m_ioctl_stateless_try_decoder_cmd); int v4l2_m2m_ioctl_stateless_decoder_cmd(struct file *file, void *priv, struct v4l2_decoder_cmd *dc) { struct v4l2_fh *fh = file->private_data; struct vb2_v4l2_buffer *out_vb, *cap_vb; struct v4l2_m2m_dev *m2m_dev = fh->m2m_ctx->m2m_dev; unsigned long flags; int ret; ret = v4l2_m2m_ioctl_stateless_try_decoder_cmd(file, priv, dc); if (ret < 0) return ret; spin_lock_irqsave(&m2m_dev->job_spinlock, flags); out_vb = v4l2_m2m_last_src_buf(fh->m2m_ctx); cap_vb = v4l2_m2m_last_dst_buf(fh->m2m_ctx); /* * If there is an out buffer pending, then clear any HOLD flag. * * By clearing this flag we ensure that when this output * buffer is processed any held capture buffer will be released. */ if (out_vb) { out_vb->flags &= ~V4L2_BUF_FLAG_M2M_HOLD_CAPTURE_BUF; } else if (cap_vb && cap_vb->is_held) { /* * If there were no output buffers, but there is a * capture buffer that is held, then release that * buffer. */ cap_vb->is_held = false; v4l2_m2m_dst_buf_remove(fh->m2m_ctx); v4l2_m2m_buf_done(cap_vb, VB2_BUF_STATE_DONE); } spin_unlock_irqrestore(&m2m_dev->job_spinlock, flags); return 0; } EXPORT_SYMBOL_GPL(v4l2_m2m_ioctl_stateless_decoder_cmd); /* * v4l2_file_operations helpers. It is assumed here same lock is used * for the output and the capture buffer queue. */ int v4l2_m2m_fop_mmap(struct file *file, struct vm_area_struct *vma) { struct v4l2_fh *fh = file->private_data; return v4l2_m2m_mmap(file, fh->m2m_ctx, vma); } EXPORT_SYMBOL_GPL(v4l2_m2m_fop_mmap); __poll_t v4l2_m2m_fop_poll(struct file *file, poll_table *wait) { struct v4l2_fh *fh = file->private_data; struct v4l2_m2m_ctx *m2m_ctx = fh->m2m_ctx; __poll_t ret; if (m2m_ctx->q_lock) mutex_lock(m2m_ctx->q_lock); ret = v4l2_m2m_poll(file, m2m_ctx, wait); if (m2m_ctx->q_lock) mutex_unlock(m2m_ctx->q_lock); return ret; } EXPORT_SYMBOL_GPL(v4l2_m2m_fop_poll); |
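/*
 * Illustrative sketch (not part of the framework above): one way a
 * hypothetical mem2mem driver could plug the v4l2_m2m_ioctl_* and
 * v4l2_m2m_fop_* helpers into its ioctl and file-operation tables.
 * Everything prefixed "foo_" (foo_dev, foo_ctx, foo_queue_init, ...) is an
 * assumption invented for this example; only the v4l2_m2m_* helpers come
 * from the code above, which expects file->private_data to point at a
 * struct v4l2_fh whose m2m_ctx was set up with v4l2_m2m_ctx_init().
 */
#include <linux/err.h>
#include <linux/module.h>
#include <linux/slab.h>
#include <media/v4l2-dev.h>
#include <media/v4l2-fh.h>
#include <media/v4l2-ioctl.h>
#include <media/v4l2-mem2mem.h>

struct foo_dev {				/* hypothetical per-device state */
	struct v4l2_m2m_dev *m2m_dev;
};

struct foo_ctx {				/* hypothetical per-open state */
	struct v4l2_fh fh;
};

/* Assumed to be provided by the driver: vb2 setup for both queues. */
static int foo_queue_init(void *priv, struct vb2_queue *src_vq,
			  struct vb2_queue *dst_vq);

static int foo_open(struct file *file)
{
	struct foo_dev *foo = video_drvdata(file);
	struct foo_ctx *ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);

	if (!ctx)
		return -ENOMEM;
	v4l2_fh_init(&ctx->fh, video_devdata(file));
	ctx->fh.m2m_ctx = v4l2_m2m_ctx_init(foo->m2m_dev, ctx, foo_queue_init);
	if (IS_ERR(ctx->fh.m2m_ctx)) {
		int ret = PTR_ERR(ctx->fh.m2m_ctx);

		v4l2_fh_exit(&ctx->fh);
		kfree(ctx);
		return ret;
	}
	/* The helpers above fetch fh->m2m_ctx through file->private_data. */
	file->private_data = &ctx->fh;
	v4l2_fh_add(&ctx->fh);
	return 0;
}

static int foo_release(struct file *file)
{
	struct v4l2_fh *fh = file->private_data;
	struct foo_ctx *ctx = container_of(fh, struct foo_ctx, fh);

	v4l2_fh_del(&ctx->fh);
	v4l2_fh_exit(&ctx->fh);
	v4l2_m2m_ctx_release(ctx->fh.m2m_ctx);
	kfree(ctx);
	return 0;
}

static const struct v4l2_ioctl_ops foo_ioctl_ops = {
	.vidioc_reqbufs		= v4l2_m2m_ioctl_reqbufs,
	.vidioc_create_bufs	= v4l2_m2m_ioctl_create_bufs,
	.vidioc_querybuf	= v4l2_m2m_ioctl_querybuf,
	.vidioc_qbuf		= v4l2_m2m_ioctl_qbuf,
	.vidioc_dqbuf		= v4l2_m2m_ioctl_dqbuf,
	.vidioc_prepare_buf	= v4l2_m2m_ioctl_prepare_buf,
	.vidioc_expbuf		= v4l2_m2m_ioctl_expbuf,
	.vidioc_streamon	= v4l2_m2m_ioctl_streamon,
	.vidioc_streamoff	= v4l2_m2m_ioctl_streamoff,
};

static const struct v4l2_file_operations foo_fops = {
	.owner		= THIS_MODULE,
	.open		= foo_open,
	.release	= foo_release,
	.poll		= v4l2_m2m_fop_poll,
	.mmap		= v4l2_m2m_fop_mmap,
	.unlocked_ioctl	= video_ioctl2,
};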
872 873 874 875 876 877 878 879 880 881 882 883 | /* * Copyright (c) 2005 Voltaire Inc. All rights reserved. * Copyright (c) 2002-2005, Network Appliance, Inc. All rights reserved. * Copyright (c) 1999-2005, Mellanox Technologies, Inc. All rights reserved. * Copyright (c) 2005 Intel Corporation. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU * General Public License (GPL) Version 2, available from the file * COPYING in the main directory of this source tree, or the * OpenIB.org BSD license below: * * Redistribution and use in source and binary forms, with or * without modification, are permitted provided that the following * conditions are met: * * - Redistributions of source code must retain the above * copyright notice, this list of conditions and the following * disclaimer. * * - Redistributions in binary form must reproduce the above * copyright notice, this list of conditions and the following * disclaimer in the documentation and/or other materials * provided with the distribution. * * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. */ #include <linux/mutex.h> #include <linux/inetdevice.h> #include <linux/slab.h> #include <linux/workqueue.h> #include <net/arp.h> #include <net/neighbour.h> #include <net/route.h> #include <net/netevent.h> #include <net/ipv6_stubs.h> #include <net/ip6_route.h> #include <rdma/ib_addr.h> #include <rdma/ib_cache.h> #include <rdma/ib_sa.h> #include <rdma/ib.h> #include <rdma/rdma_netlink.h> #include <net/netlink.h> #include "core_priv.h" struct addr_req { struct list_head list; struct sockaddr_storage src_addr; struct sockaddr_storage dst_addr; struct rdma_dev_addr *addr; void *context; void (*callback)(int status, struct sockaddr *src_addr, struct rdma_dev_addr *addr, void *context); unsigned long timeout; struct delayed_work work; bool resolve_by_gid_attr; /* Consider gid attr in resolve phase */ int status; u32 seq; }; static atomic_t ib_nl_addr_request_seq = ATOMIC_INIT(0); static DEFINE_SPINLOCK(lock); static LIST_HEAD(req_list); static struct workqueue_struct *addr_wq; static const struct nla_policy ib_nl_addr_policy[LS_NLA_TYPE_MAX] = { [LS_NLA_TYPE_DGID] = {.type = NLA_BINARY, .len = sizeof(struct rdma_nla_ls_gid), .validation_type = NLA_VALIDATE_MIN, .min = sizeof(struct rdma_nla_ls_gid)}, }; static inline bool ib_nl_is_good_ip_resp(const struct nlmsghdr *nlh) { struct nlattr *tb[LS_NLA_TYPE_MAX] = {}; int ret; if (nlh->nlmsg_flags & RDMA_NL_LS_F_ERR) return false; ret = nla_parse_deprecated(tb, LS_NLA_TYPE_MAX - 1, nlmsg_data(nlh), nlmsg_len(nlh), ib_nl_addr_policy, NULL); if (ret) return false; return true; } static void ib_nl_process_good_ip_rsep(const struct nlmsghdr *nlh) { const struct nlattr *head, *curr; union ib_gid gid; struct addr_req *req; int len, rem; int found = 0; head = (const struct nlattr *)nlmsg_data(nlh); len = nlmsg_len(nlh); nla_for_each_attr(curr, head, len, rem) { if (curr->nla_type == LS_NLA_TYPE_DGID) memcpy(&gid, nla_data(curr), nla_len(curr)); } spin_lock_bh(&lock); list_for_each_entry(req, 
&req_list, list) { if (nlh->nlmsg_seq != req->seq) continue; /* We set the DGID part, the rest was set earlier */ rdma_addr_set_dgid(req->addr, &gid); req->status = 0; found = 1; break; } spin_unlock_bh(&lock); if (!found) pr_info("Couldn't find request waiting for DGID: %pI6\n", &gid); } int ib_nl_handle_ip_res_resp(struct sk_buff *skb, struct nlmsghdr *nlh, struct netlink_ext_ack *extack) { if ((nlh->nlmsg_flags & NLM_F_REQUEST) || !(NETLINK_CB(skb).sk)) return -EPERM; if (ib_nl_is_good_ip_resp(nlh)) ib_nl_process_good_ip_rsep(nlh); return 0; } static int ib_nl_ip_send_msg(struct rdma_dev_addr *dev_addr, const void *daddr, u32 seq, u16 family) { struct sk_buff *skb = NULL; struct nlmsghdr *nlh; struct rdma_ls_ip_resolve_header *header; void *data; size_t size; int attrtype; int len; if (family == AF_INET) { size = sizeof(struct in_addr); attrtype = RDMA_NLA_F_MANDATORY | LS_NLA_TYPE_IPV4; } else { size = sizeof(struct in6_addr); attrtype = RDMA_NLA_F_MANDATORY | LS_NLA_TYPE_IPV6; } len = nla_total_size(sizeof(size)); len += NLMSG_ALIGN(sizeof(*header)); skb = nlmsg_new(len, GFP_KERNEL); if (!skb) return -ENOMEM; data = ibnl_put_msg(skb, &nlh, seq, 0, RDMA_NL_LS, RDMA_NL_LS_OP_IP_RESOLVE, NLM_F_REQUEST); if (!data) { nlmsg_free(skb); return -ENODATA; } /* Construct the family header first */ header = skb_put(skb, NLMSG_ALIGN(sizeof(*header))); header->ifindex = dev_addr->bound_dev_if; nla_put(skb, attrtype, size, daddr); /* Repair the nlmsg header length */ nlmsg_end(skb, nlh); rdma_nl_multicast(&init_net, skb, RDMA_NL_GROUP_LS, GFP_KERNEL); /* Make the request retry, so when we get the response from userspace * we will have something. */ return -ENODATA; } int rdma_addr_size(const struct sockaddr *addr) { switch (addr->sa_family) { case AF_INET: return sizeof(struct sockaddr_in); case AF_INET6: return sizeof(struct sockaddr_in6); case AF_IB: return sizeof(struct sockaddr_ib); default: return 0; } } EXPORT_SYMBOL(rdma_addr_size); int rdma_addr_size_in6(struct sockaddr_in6 *addr) { int ret = rdma_addr_size((struct sockaddr *) addr); return ret <= sizeof(*addr) ? ret : 0; } EXPORT_SYMBOL(rdma_addr_size_in6); int rdma_addr_size_kss(struct __kernel_sockaddr_storage *addr) { int ret = rdma_addr_size((struct sockaddr *) addr); return ret <= sizeof(*addr) ? ret : 0; } EXPORT_SYMBOL(rdma_addr_size_kss); /** * rdma_copy_src_l2_addr - Copy netdevice source addresses * @dev_addr: Destination address pointer where to copy the addresses * @dev: Netdevice whose source addresses to copy * * rdma_copy_src_l2_addr() copies source addresses from the specified netdevice. * This includes unicast address, broadcast address, device type and * interface index. */ void rdma_copy_src_l2_addr(struct rdma_dev_addr *dev_addr, const struct net_device *dev) { dev_addr->dev_type = dev->type; memcpy(dev_addr->src_dev_addr, dev->dev_addr, MAX_ADDR_LEN); memcpy(dev_addr->broadcast, dev->broadcast, MAX_ADDR_LEN); dev_addr->bound_dev_if = dev->ifindex; } EXPORT_SYMBOL(rdma_copy_src_l2_addr); static struct net_device * rdma_find_ndev_for_src_ip_rcu(struct net *net, const struct sockaddr *src_in) { struct net_device *dev = NULL; int ret = -EADDRNOTAVAIL; switch (src_in->sa_family) { case AF_INET: dev = __ip_dev_find(net, ((const struct sockaddr_in *)src_in)->sin_addr.s_addr, false); if (dev) ret = 0; break; #if IS_ENABLED(CONFIG_IPV6) case AF_INET6: for_each_netdev_rcu(net, dev) { if (ipv6_chk_addr(net, &((const struct sockaddr_in6 *)src_in)->sin6_addr, dev, 1)) { ret = 0; break; } } break; #endif } return ret ? 
ERR_PTR(ret) : dev; } int rdma_translate_ip(const struct sockaddr *addr, struct rdma_dev_addr *dev_addr) { struct net_device *dev; if (dev_addr->bound_dev_if) { dev = dev_get_by_index(dev_addr->net, dev_addr->bound_dev_if); if (!dev) return -ENODEV; rdma_copy_src_l2_addr(dev_addr, dev); dev_put(dev); return 0; } rcu_read_lock(); dev = rdma_find_ndev_for_src_ip_rcu(dev_addr->net, addr); if (!IS_ERR(dev)) rdma_copy_src_l2_addr(dev_addr, dev); rcu_read_unlock(); return PTR_ERR_OR_ZERO(dev); } EXPORT_SYMBOL(rdma_translate_ip); static void set_timeout(struct addr_req *req, unsigned long time) { unsigned long delay; delay = time - jiffies; if ((long)delay < 0) delay = 0; mod_delayed_work(addr_wq, &req->work, delay); } static void queue_req(struct addr_req *req) { spin_lock_bh(&lock); list_add_tail(&req->list, &req_list); set_timeout(req, req->timeout); spin_unlock_bh(&lock); } static int ib_nl_fetch_ha(struct rdma_dev_addr *dev_addr, const void *daddr, u32 seq, u16 family) { if (!rdma_nl_chk_listeners(RDMA_NL_GROUP_LS)) return -EADDRNOTAVAIL; return ib_nl_ip_send_msg(dev_addr, daddr, seq, family); } static int dst_fetch_ha(const struct dst_entry *dst, struct rdma_dev_addr *dev_addr, const void *daddr) { struct neighbour *n; int ret = 0; n = dst_neigh_lookup(dst, daddr); if (!n) return -ENODATA; if (!(n->nud_state & NUD_VALID)) { neigh_event_send(n, NULL); ret = -ENODATA; } else { neigh_ha_snapshot(dev_addr->dst_dev_addr, n, dst->dev); } neigh_release(n); return ret; } static bool has_gateway(const struct dst_entry *dst, sa_family_t family) { if (family == AF_INET) return dst_rtable(dst)->rt_uses_gateway; return dst_rt6_info(dst)->rt6i_flags & RTF_GATEWAY; } static int fetch_ha(const struct dst_entry *dst, struct rdma_dev_addr *dev_addr, const struct sockaddr *dst_in, u32 seq) { const struct sockaddr_in *dst_in4 = (const struct sockaddr_in *)dst_in; const struct sockaddr_in6 *dst_in6 = (const struct sockaddr_in6 *)dst_in; const void *daddr = (dst_in->sa_family == AF_INET) ? 
(const void *)&dst_in4->sin_addr.s_addr : (const void *)&dst_in6->sin6_addr; sa_family_t family = dst_in->sa_family; might_sleep(); /* If we have a gateway in IB mode then it must be an IB network */ if (has_gateway(dst, family) && dev_addr->network == RDMA_NETWORK_IB) return ib_nl_fetch_ha(dev_addr, daddr, seq, family); else return dst_fetch_ha(dst, dev_addr, daddr); } static int addr4_resolve(struct sockaddr *src_sock, const struct sockaddr *dst_sock, struct rdma_dev_addr *addr, struct rtable **prt) { struct sockaddr_in *src_in = (struct sockaddr_in *)src_sock; const struct sockaddr_in *dst_in = (const struct sockaddr_in *)dst_sock; __be32 src_ip = src_in->sin_addr.s_addr; __be32 dst_ip = dst_in->sin_addr.s_addr; struct rtable *rt; struct flowi4 fl4; int ret; memset(&fl4, 0, sizeof(fl4)); fl4.daddr = dst_ip; fl4.saddr = src_ip; fl4.flowi4_oif = addr->bound_dev_if; rt = ip_route_output_key(addr->net, &fl4); ret = PTR_ERR_OR_ZERO(rt); if (ret) return ret; src_in->sin_addr.s_addr = fl4.saddr; addr->hoplimit = ip4_dst_hoplimit(&rt->dst); *prt = rt; return 0; } #if IS_ENABLED(CONFIG_IPV6) static int addr6_resolve(struct sockaddr *src_sock, const struct sockaddr *dst_sock, struct rdma_dev_addr *addr, struct dst_entry **pdst) { struct sockaddr_in6 *src_in = (struct sockaddr_in6 *)src_sock; const struct sockaddr_in6 *dst_in = (const struct sockaddr_in6 *)dst_sock; struct flowi6 fl6; struct dst_entry *dst; memset(&fl6, 0, sizeof fl6); fl6.daddr = dst_in->sin6_addr; fl6.saddr = src_in->sin6_addr; fl6.flowi6_oif = addr->bound_dev_if; dst = ipv6_stub->ipv6_dst_lookup_flow(addr->net, NULL, &fl6, NULL); if (IS_ERR(dst)) return PTR_ERR(dst); if (ipv6_addr_any(&src_in->sin6_addr)) src_in->sin6_addr = fl6.saddr; addr->hoplimit = ip6_dst_hoplimit(dst); *pdst = dst; return 0; } #else static int addr6_resolve(struct sockaddr *src_sock, const struct sockaddr *dst_sock, struct rdma_dev_addr *addr, struct dst_entry **pdst) { return -EADDRNOTAVAIL; } #endif static int addr_resolve_neigh(const struct dst_entry *dst, const struct sockaddr *dst_in, struct rdma_dev_addr *addr, unsigned int ndev_flags, u32 seq) { int ret = 0; if (ndev_flags & IFF_LOOPBACK) { memcpy(addr->dst_dev_addr, addr->src_dev_addr, MAX_ADDR_LEN); } else { if (!(ndev_flags & IFF_NOARP)) { /* If the device doesn't do ARP internally */ ret = fetch_ha(dst, addr, dst_in, seq); } } return ret; } static int copy_src_l2_addr(struct rdma_dev_addr *dev_addr, const struct sockaddr *dst_in, const struct dst_entry *dst, const struct net_device *ndev) { int ret = 0; if (dst->dev->flags & IFF_LOOPBACK) ret = rdma_translate_ip(dst_in, dev_addr); else rdma_copy_src_l2_addr(dev_addr, dst->dev); /* * If there's a gateway and type of device not ARPHRD_INFINIBAND, * we're definitely in RoCE v2 (as RoCE v1 isn't routable) set the * network type accordingly. */ if (has_gateway(dst, dst_in->sa_family) && ndev->type != ARPHRD_INFINIBAND) dev_addr->network = dst_in->sa_family == AF_INET ? RDMA_NETWORK_IPV4 : RDMA_NETWORK_IPV6; else dev_addr->network = RDMA_NETWORK_IB; return ret; } static int rdma_set_src_addr_rcu(struct rdma_dev_addr *dev_addr, unsigned int *ndev_flags, const struct sockaddr *dst_in, const struct dst_entry *dst) { struct net_device *ndev = READ_ONCE(dst->dev); *ndev_flags = ndev->flags; /* A physical device must be the RDMA device to use */ if (ndev->flags & IFF_LOOPBACK) { /* * RDMA (IB/RoCE, iWarp) doesn't run on lo interface or * loopback IP address. 
So if route is resolved to loopback * interface, translate that to a real ndev based on non * loopback IP address. */ ndev = rdma_find_ndev_for_src_ip_rcu(dev_net(ndev), dst_in); if (IS_ERR(ndev)) return -ENODEV; } return copy_src_l2_addr(dev_addr, dst_in, dst, ndev); } static int set_addr_netns_by_gid_rcu(struct rdma_dev_addr *addr) { struct net_device *ndev; ndev = rdma_read_gid_attr_ndev_rcu(addr->sgid_attr); if (IS_ERR(ndev)) return PTR_ERR(ndev); /* * Since we are holding the rcu, reading net and ifindex * are safe without any additional reference; because * change_net_namespace() in net/core/dev.c does rcu sync * after it changes the state to IFF_DOWN and before * updating netdev fields {net, ifindex}. */ addr->net = dev_net(ndev); addr->bound_dev_if = ndev->ifindex; return 0; } static void rdma_addr_set_net_defaults(struct rdma_dev_addr *addr) { addr->net = &init_net; addr->bound_dev_if = 0; } static int addr_resolve(struct sockaddr *src_in, const struct sockaddr *dst_in, struct rdma_dev_addr *addr, bool resolve_neigh, bool resolve_by_gid_attr, u32 seq) { struct dst_entry *dst = NULL; unsigned int ndev_flags = 0; struct rtable *rt = NULL; int ret; if (!addr->net) { pr_warn_ratelimited("%s: missing namespace\n", __func__); return -EINVAL; } rcu_read_lock(); if (resolve_by_gid_attr) { if (!addr->sgid_attr) { rcu_read_unlock(); pr_warn_ratelimited("%s: missing gid_attr\n", __func__); return -EINVAL; } /* * If the request is for a specific gid attribute of the * rdma_dev_addr, derive net from the netdevice of the * GID attribute. */ ret = set_addr_netns_by_gid_rcu(addr); if (ret) { rcu_read_unlock(); return ret; } } if (src_in->sa_family == AF_INET) { ret = addr4_resolve(src_in, dst_in, addr, &rt); dst = &rt->dst; } else { ret = addr6_resolve(src_in, dst_in, addr, &dst); } if (ret) { rcu_read_unlock(); goto done; } ret = rdma_set_src_addr_rcu(addr, &ndev_flags, dst_in, dst); rcu_read_unlock(); /* * Resolve neighbor destination address if requested and * only if src addr translation didn't fail. */ if (!ret && resolve_neigh) ret = addr_resolve_neigh(dst, dst_in, addr, ndev_flags, seq); if (src_in->sa_family == AF_INET) ip_rt_put(rt); else dst_release(dst); done: /* * Clear the addr net to go back to its original state, only if it was * derived from GID attribute in this context. */ if (resolve_by_gid_attr) rdma_addr_set_net_defaults(addr); return ret; } static void process_one_req(struct work_struct *_work) { struct addr_req *req; struct sockaddr *src_in, *dst_in; req = container_of(_work, struct addr_req, work.work); if (req->status == -ENODATA) { src_in = (struct sockaddr *)&req->src_addr; dst_in = (struct sockaddr *)&req->dst_addr; req->status = addr_resolve(src_in, dst_in, req->addr, true, req->resolve_by_gid_attr, req->seq); if (req->status && time_after_eq(jiffies, req->timeout)) { req->status = -ETIMEDOUT; } else if (req->status == -ENODATA) { /* requeue the work for retrying again */ spin_lock_bh(&lock); if (!list_empty(&req->list)) set_timeout(req, req->timeout); spin_unlock_bh(&lock); return; } } req->callback(req->status, (struct sockaddr *)&req->src_addr, req->addr, req->context); req->callback = NULL; spin_lock_bh(&lock); /* * Although the work will normally have been canceled by the workqueue, * it can still be requeued as long as it is on the req_list. 
*/ cancel_delayed_work(&req->work); if (!list_empty(&req->list)) { list_del_init(&req->list); kfree(req); } spin_unlock_bh(&lock); } int rdma_resolve_ip(struct sockaddr *src_addr, const struct sockaddr *dst_addr, struct rdma_dev_addr *addr, unsigned long timeout_ms, void (*callback)(int status, struct sockaddr *src_addr, struct rdma_dev_addr *addr, void *context), bool resolve_by_gid_attr, void *context) { struct sockaddr *src_in, *dst_in; struct addr_req *req; int ret = 0; req = kzalloc(sizeof *req, GFP_KERNEL); if (!req) return -ENOMEM; src_in = (struct sockaddr *) &req->src_addr; dst_in = (struct sockaddr *) &req->dst_addr; if (src_addr) { if (src_addr->sa_family != dst_addr->sa_family) { ret = -EINVAL; goto err; } memcpy(src_in, src_addr, rdma_addr_size(src_addr)); } else { src_in->sa_family = dst_addr->sa_family; } memcpy(dst_in, dst_addr, rdma_addr_size(dst_addr)); req->addr = addr; req->callback = callback; req->context = context; req->resolve_by_gid_attr = resolve_by_gid_attr; INIT_DELAYED_WORK(&req->work, process_one_req); req->seq = (u32)atomic_inc_return(&ib_nl_addr_request_seq); req->status = addr_resolve(src_in, dst_in, addr, true, req->resolve_by_gid_attr, req->seq); switch (req->status) { case 0: req->timeout = jiffies; queue_req(req); break; case -ENODATA: req->timeout = msecs_to_jiffies(timeout_ms) + jiffies; queue_req(req); break; default: ret = req->status; goto err; } return ret; err: kfree(req); return ret; } EXPORT_SYMBOL(rdma_resolve_ip); int roce_resolve_route_from_path(struct sa_path_rec *rec, const struct ib_gid_attr *attr) { union { struct sockaddr _sockaddr; struct sockaddr_in _sockaddr_in; struct sockaddr_in6 _sockaddr_in6; } sgid, dgid; struct rdma_dev_addr dev_addr = {}; int ret; might_sleep(); if (rec->roce.route_resolved) return 0; rdma_gid2ip((struct sockaddr *)&sgid, &rec->sgid); rdma_gid2ip((struct sockaddr *)&dgid, &rec->dgid); if (sgid._sockaddr.sa_family != dgid._sockaddr.sa_family) return -EINVAL; if (!attr || !attr->ndev) return -EINVAL; dev_addr.net = &init_net; dev_addr.sgid_attr = attr; ret = addr_resolve((struct sockaddr *)&sgid, (struct sockaddr *)&dgid, &dev_addr, false, true, 0); if (ret) return ret; if ((dev_addr.network == RDMA_NETWORK_IPV4 || dev_addr.network == RDMA_NETWORK_IPV6) && rec->rec_type != SA_PATH_REC_TYPE_ROCE_V2) return -EINVAL; rec->roce.route_resolved = true; return 0; } /** * rdma_addr_cancel - Cancel resolve ip request * @addr: Pointer to address structure given previously * during rdma_resolve_ip(). * rdma_addr_cancel() is synchronous function which cancels any pending * request if there is any. */ void rdma_addr_cancel(struct rdma_dev_addr *addr) { struct addr_req *req, *temp_req; struct addr_req *found = NULL; spin_lock_bh(&lock); list_for_each_entry_safe(req, temp_req, &req_list, list) { if (req->addr == addr) { /* * Removing from the list means we take ownership of * the req */ list_del_init(&req->list); found = req; break; } } spin_unlock_bh(&lock); if (!found) return; /* * sync canceling the work after removing it from the req_list * guarentees no work is running and none will be started. 
*/ cancel_delayed_work_sync(&found->work); kfree(found); } EXPORT_SYMBOL(rdma_addr_cancel); struct resolve_cb_context { struct completion comp; int status; }; static void resolve_cb(int status, struct sockaddr *src_addr, struct rdma_dev_addr *addr, void *context) { ((struct resolve_cb_context *)context)->status = status; complete(&((struct resolve_cb_context *)context)->comp); } int rdma_addr_find_l2_eth_by_grh(const union ib_gid *sgid, const union ib_gid *dgid, u8 *dmac, const struct ib_gid_attr *sgid_attr, int *hoplimit) { struct rdma_dev_addr dev_addr; struct resolve_cb_context ctx; union { struct sockaddr_in _sockaddr_in; struct sockaddr_in6 _sockaddr_in6; } sgid_addr, dgid_addr; int ret; rdma_gid2ip((struct sockaddr *)&sgid_addr, sgid); rdma_gid2ip((struct sockaddr *)&dgid_addr, dgid); memset(&dev_addr, 0, sizeof(dev_addr)); dev_addr.net = &init_net; dev_addr.sgid_attr = sgid_attr; init_completion(&ctx.comp); ret = rdma_resolve_ip((struct sockaddr *)&sgid_addr, (struct sockaddr *)&dgid_addr, &dev_addr, 1000, resolve_cb, true, &ctx); if (ret) return ret; wait_for_completion(&ctx.comp); ret = ctx.status; if (ret) return ret; memcpy(dmac, dev_addr.dst_dev_addr, ETH_ALEN); *hoplimit = dev_addr.hoplimit; return 0; } static int netevent_callback(struct notifier_block *self, unsigned long event, void *ctx) { struct addr_req *req; if (event == NETEVENT_NEIGH_UPDATE) { struct neighbour *neigh = ctx; if (neigh->nud_state & NUD_VALID) { spin_lock_bh(&lock); list_for_each_entry(req, &req_list, list) set_timeout(req, jiffies); spin_unlock_bh(&lock); } } return 0; } static struct notifier_block nb = { .notifier_call = netevent_callback }; int addr_init(void) { addr_wq = alloc_ordered_workqueue("ib_addr", 0); if (!addr_wq) return -ENOMEM; register_netevent_notifier(&nb); return 0; } void addr_cleanup(void) { unregister_netevent_notifier(&nb); destroy_workqueue(addr_wq); WARN_ON(!list_empty(&req_list)); } |
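/*
 * Illustrative sketch only (not part of the file above): how a caller in
 * process context might drive the asynchronous rdma_resolve_ip() API and
 * wait for its result.  The example_* names are assumptions made up for
 * this example; the callback-plus-completion pattern mirrors resolve_cb()
 * and rdma_addr_find_l2_eth_by_grh() above.
 */
#include <linux/completion.h>
#include <linux/errno.h>
#include <net/net_namespace.h>
#include <rdma/ib_addr.h>

struct example_resolve_ctx {
	struct completion done;
	int status;
};

static void example_resolve_done(int status, struct sockaddr *src_addr,
				 struct rdma_dev_addr *addr, void *context)
{
	struct example_resolve_ctx *ctx = context;

	/* 0 on success; -ETIMEDOUT if process_one_req() gave up retrying. */
	ctx->status = status;
	complete(&ctx->done);
}

static int example_resolve(struct sockaddr *src, const struct sockaddr *dst,
			   struct rdma_dev_addr *dev_addr)
{
	struct example_resolve_ctx ctx = { .status = -EINPROGRESS };
	int ret;

	init_completion(&ctx.done);
	/* addr_resolve() refuses to run without a namespace; a real caller
	 * would set the namespace it actually operates in. */
	if (!dev_addr->net)
		dev_addr->net = &init_net;

	ret = rdma_resolve_ip(src, dst, dev_addr, 2000 /* ms */,
			      example_resolve_done, false, &ctx);
	if (ret)
		return ret;	/* e.g. -ENOMEM or mismatched address families */

	/*
	 * Even when the address resolves immediately, the callback is still
	 * invoked from process_one_req() on the ib_addr workqueue, so simply
	 * wait for it.  rdma_addr_cancel(dev_addr) could be used instead to
	 * abandon a pending request.
	 */
	wait_for_completion(&ctx.done);
	return ctx.status;
}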
// SPDX-License-Identifier: GPL-2.0 /* * linux/fs/ext4/dir.c * * Copyright (C) 1992, 1993, 1994, 1995 * Remy Card (card@masi.ibp.fr) * Laboratoire MASI - Institut Blaise Pascal * Universite Pierre et Marie Curie (Paris VI) * * from * * linux/fs/minix/dir.c * * Copyright (C) 1991, 1992 Linus Torvalds * * ext4 directory handling functions * * Big-endian to little-endian byte-swapping/bitmaps by * David S.
Miller (davem@caip.rutgers.edu), 1995 * * Hash Tree Directory indexing (c) 2001 Daniel Phillips * */ #include <linux/fs.h> #include <linux/buffer_head.h> #include <linux/slab.h> #include <linux/iversion.h> #include <linux/unicode.h> #include "ext4.h" #include "xattr.h" static int ext4_dx_readdir(struct file *, struct dir_context *); /** * is_dx_dir() - check if a directory is using htree indexing * @inode: directory inode * * Check if the given dir-inode refers to an htree-indexed directory * (or a directory which could potentially get converted to use htree * indexing). * * Return 1 if it is a dx dir, 0 if not */ static int is_dx_dir(struct inode *inode) { struct super_block *sb = inode->i_sb; if (ext4_has_feature_dir_index(inode->i_sb) && ((ext4_test_inode_flag(inode, EXT4_INODE_INDEX)) || ((inode->i_size >> sb->s_blocksize_bits) == 1) || ext4_has_inline_data(inode))) return 1; return 0; } static bool is_fake_dir_entry(struct ext4_dir_entry_2 *de) { /* Check if . or .. , or skip if namelen is 0 */ if ((de->name_len > 0) && (de->name_len <= 2) && (de->name[0] == '.') && (de->name[1] == '.' || de->name[1] == '\0')) return true; /* Check if this is a csum entry */ if (de->file_type == EXT4_FT_DIR_CSUM) return true; return false; } /* * Return 0 if the directory entry is OK, and 1 if there is a problem * * Note: this is the opposite of what ext2 and ext3 historically returned... * * bh passed here can be an inode block or a dir data block, depending * on the inode inline data flag. */ int __ext4_check_dir_entry(const char *function, unsigned int line, struct inode *dir, struct file *filp, struct ext4_dir_entry_2 *de, struct buffer_head *bh, char *buf, int size, unsigned int offset) { const char *error_msg = NULL; const int rlen = ext4_rec_len_from_disk(de->rec_len, dir->i_sb->s_blocksize); const int next_offset = ((char *) de - buf) + rlen; bool fake = is_fake_dir_entry(de); bool has_csum = ext4_has_metadata_csum(dir->i_sb); if (unlikely(rlen < ext4_dir_rec_len(1, fake ? NULL : dir))) error_msg = "rec_len is smaller than minimal"; else if (unlikely(rlen % 4 != 0)) error_msg = "rec_len % 4 != 0"; else if (unlikely(rlen < ext4_dir_rec_len(de->name_len, fake ? NULL : dir))) error_msg = "rec_len is too small for name_len"; else if (unlikely(next_offset > size)) error_msg = "directory entry overrun"; else if (unlikely(next_offset > size - ext4_dir_rec_len(1, has_csum ? 
NULL : dir) && next_offset != size)) error_msg = "directory entry too close to block end"; else if (unlikely(le32_to_cpu(de->inode) > le32_to_cpu(EXT4_SB(dir->i_sb)->s_es->s_inodes_count))) error_msg = "inode out of bounds"; else return 0; if (filp) ext4_error_file(filp, function, line, bh->b_blocknr, "bad entry in directory: %s - offset=%u, " "inode=%u, rec_len=%d, size=%d fake=%d", error_msg, offset, le32_to_cpu(de->inode), rlen, size, fake); else ext4_error_inode(dir, function, line, bh->b_blocknr, "bad entry in directory: %s - offset=%u, " "inode=%u, rec_len=%d, size=%d fake=%d", error_msg, offset, le32_to_cpu(de->inode), rlen, size, fake); return 1; } static int ext4_readdir(struct file *file, struct dir_context *ctx) { unsigned int offset; int i; struct ext4_dir_entry_2 *de; int err; struct inode *inode = file_inode(file); struct super_block *sb = inode->i_sb; struct buffer_head *bh = NULL; struct fscrypt_str fstr = FSTR_INIT(NULL, 0); struct dir_private_info *info = file->private_data; err = fscrypt_prepare_readdir(inode); if (err) return err; if (is_dx_dir(inode)) { err = ext4_dx_readdir(file, ctx); if (err != ERR_BAD_DX_DIR) return err; /* Can we just clear INDEX flag to ignore htree information? */ if (!ext4_has_metadata_csum(sb)) { /* * We don't set the inode dirty flag since it's not * critical that it gets flushed back to the disk. */ ext4_clear_inode_flag(inode, EXT4_INODE_INDEX); } } if (ext4_has_inline_data(inode)) { int has_inline_data = 1; err = ext4_read_inline_dir(file, ctx, &has_inline_data); if (has_inline_data) return err; } if (IS_ENCRYPTED(inode)) { err = fscrypt_fname_alloc_buffer(EXT4_NAME_LEN, &fstr); if (err < 0) return err; } while (ctx->pos < inode->i_size) { struct ext4_map_blocks map; if (fatal_signal_pending(current)) { err = -ERESTARTSYS; goto errout; } cond_resched(); offset = ctx->pos & (sb->s_blocksize - 1); map.m_lblk = ctx->pos >> EXT4_BLOCK_SIZE_BITS(sb); map.m_len = 1; err = ext4_map_blocks(NULL, inode, &map, 0); if (err == 0) { /* m_len should never be zero but let's avoid * an infinite loop if it somehow is */ if (map.m_len == 0) map.m_len = 1; ctx->pos += map.m_len * sb->s_blocksize; continue; } if (err > 0) { pgoff_t index = map.m_pblk >> (PAGE_SHIFT - inode->i_blkbits); if (!ra_has_index(&file->f_ra, index)) page_cache_sync_readahead( sb->s_bdev->bd_mapping, &file->f_ra, file, index, 1); file->f_ra.prev_pos = (loff_t)index << PAGE_SHIFT; bh = ext4_bread(NULL, inode, map.m_lblk, 0); if (IS_ERR(bh)) { err = PTR_ERR(bh); bh = NULL; goto errout; } } if (!bh) { /* corrupt size? Maybe no more blocks to read */ if (ctx->pos > inode->i_blocks << 9) break; ctx->pos += sb->s_blocksize - offset; continue; } /* Check the checksum */ if (!buffer_verified(bh) && !ext4_dirblock_csum_verify(inode, bh)) { EXT4_ERROR_FILE(file, 0, "directory fails checksum " "at offset %llu", (unsigned long long)ctx->pos); ctx->pos += sb->s_blocksize - offset; brelse(bh); bh = NULL; continue; } set_buffer_verified(bh); /* If the dir block has changed since the last call to * readdir(2), then we might be pointing to an invalid * dirent right now. Scan from the start of the block * to make sure. */ if (!inode_eq_iversion(inode, info->cookie)) { for (i = 0; i < sb->s_blocksize && i < offset; ) { de = (struct ext4_dir_entry_2 *) (bh->b_data + i); /* It's too expensive to do a full * dirent test each time round this * loop, but we do have to test at * least that it is non-zero. A * failure will be detected in the * dirent test below. 
*/ if (ext4_rec_len_from_disk(de->rec_len, sb->s_blocksize) < ext4_dir_rec_len(1, inode)) break; i += ext4_rec_len_from_disk(de->rec_len, sb->s_blocksize); } offset = i; ctx->pos = (ctx->pos & ~(sb->s_blocksize - 1)) | offset; info->cookie = inode_query_iversion(inode); } while (ctx->pos < inode->i_size && offset < sb->s_blocksize) { de = (struct ext4_dir_entry_2 *) (bh->b_data + offset); if (ext4_check_dir_entry(inode, file, de, bh, bh->b_data, bh->b_size, offset)) { /* * On error, skip to the next block */ ctx->pos = (ctx->pos | (sb->s_blocksize - 1)) + 1; break; } offset += ext4_rec_len_from_disk(de->rec_len, sb->s_blocksize); if (le32_to_cpu(de->inode)) { if (!IS_ENCRYPTED(inode)) { if (!dir_emit(ctx, de->name, de->name_len, le32_to_cpu(de->inode), get_dtype(sb, de->file_type))) goto done; } else { int save_len = fstr.len; struct fscrypt_str de_name = FSTR_INIT(de->name, de->name_len); u32 hash; u32 minor_hash; if (IS_CASEFOLDED(inode)) { hash = EXT4_DIRENT_HASH(de); minor_hash = EXT4_DIRENT_MINOR_HASH(de); } else { hash = 0; minor_hash = 0; } /* Directory is encrypted */ err = fscrypt_fname_disk_to_usr(inode, hash, minor_hash, &de_name, &fstr); de_name = fstr; fstr.len = save_len; if (err) goto errout; if (!dir_emit(ctx, de_name.name, de_name.len, le32_to_cpu(de->inode), get_dtype(sb, de->file_type))) goto done; } } ctx->pos += ext4_rec_len_from_disk(de->rec_len, sb->s_blocksize); } if ((ctx->pos < inode->i_size) && !dir_relax_shared(inode)) goto done; brelse(bh); bh = NULL; } done: err = 0; errout: fscrypt_fname_free_buffer(&fstr); brelse(bh); return err; } static inline int is_32bit_api(void) { #ifdef CONFIG_COMPAT return in_compat_syscall(); #else return (BITS_PER_LONG == 32); #endif } /* * These functions convert from the major/minor hash to an f_pos * value for dx directories * * Upper layer (for example NFS) should specify FMODE_32BITHASH or * FMODE_64BITHASH explicitly. On the other hand, we allow ext4 to be mounted * directly on both 32-bit and 64-bit nodes, under such case, neither * FMODE_32BITHASH nor FMODE_64BITHASH is specified. */ static inline loff_t hash2pos(struct file *filp, __u32 major, __u32 minor) { if ((filp->f_mode & FMODE_32BITHASH) || (!(filp->f_mode & FMODE_64BITHASH) && is_32bit_api())) return major >> 1; else return ((__u64)(major >> 1) << 32) | (__u64)minor; } static inline __u32 pos2maj_hash(struct file *filp, loff_t pos) { if ((filp->f_mode & FMODE_32BITHASH) || (!(filp->f_mode & FMODE_64BITHASH) && is_32bit_api())) return (pos << 1) & 0xffffffff; else return ((pos >> 32) << 1) & 0xffffffff; } static inline __u32 pos2min_hash(struct file *filp, loff_t pos) { if ((filp->f_mode & FMODE_32BITHASH) || (!(filp->f_mode & FMODE_64BITHASH) && is_32bit_api())) return 0; else return pos & 0xffffffff; } /* * Return 32- or 64-bit end-of-file for dx directories */ static inline loff_t ext4_get_htree_eof(struct file *filp) { if ((filp->f_mode & FMODE_32BITHASH) || (!(filp->f_mode & FMODE_64BITHASH) && is_32bit_api())) return EXT4_HTREE_EOF_32BIT; else return EXT4_HTREE_EOF_64BIT; } /* * ext4_dir_llseek() calls generic_file_llseek_size to handle htree * directories, where the "offset" is in terms of the filename hash * value instead of the byte offset. * * Because we may return a 64-bit hash that is well beyond offset limits, * we need to pass the max hash as the maximum allowable offset in * the htree directory case. * * For non-htree, ext4_llseek already chooses the proper max offset. 
*/ static loff_t ext4_dir_llseek(struct file *file, loff_t offset, int whence) { struct inode *inode = file->f_mapping->host; struct dir_private_info *info = file->private_data; int dx_dir = is_dx_dir(inode); loff_t ret, htree_max = ext4_get_htree_eof(file); if (likely(dx_dir)) ret = generic_file_llseek_size(file, offset, whence, htree_max, htree_max); else ret = ext4_llseek(file, offset, whence); info->cookie = inode_peek_iversion(inode) - 1; return ret; } /* * This structure holds the nodes of the red-black tree used to store * the directory entry in hash order. */ struct fname { __u32 hash; __u32 minor_hash; struct rb_node rb_hash; struct fname *next; __u32 inode; __u8 name_len; __u8 file_type; char name[] __counted_by(name_len); }; /* * This function implements a non-recursive way of freeing all of the * nodes in the red-black tree. */ static void free_rb_tree_fname(struct rb_root *root) { struct fname *fname, *next; rbtree_postorder_for_each_entry_safe(fname, next, root, rb_hash) while (fname) { struct fname *old = fname; fname = fname->next; kfree(old); } *root = RB_ROOT; } static void ext4_htree_init_dir_info(struct file *filp, loff_t pos) { struct dir_private_info *p = filp->private_data; if (is_dx_dir(file_inode(filp)) && !p->initialized) { p->curr_hash = pos2maj_hash(filp, pos); p->curr_minor_hash = pos2min_hash(filp, pos); p->initialized = true; } } void ext4_htree_free_dir_info(struct dir_private_info *p) { free_rb_tree_fname(&p->root); kfree(p); } /* * Given a directory entry, enter it into the fname rb tree. * * When filename encryption is enabled, the dirent will hold the * encrypted filename, while the htree will hold decrypted filename. * The decrypted filename is passed in via ent_name. parameter. */ int ext4_htree_store_dirent(struct file *dir_file, __u32 hash, __u32 minor_hash, struct ext4_dir_entry_2 *dirent, struct fscrypt_str *ent_name) { struct rb_node **p, *parent = NULL; struct fname *fname, *new_fn; struct dir_private_info *info; info = dir_file->private_data; p = &info->root.rb_node; /* Create and allocate the fname structure */ new_fn = kzalloc(struct_size(new_fn, name, ent_name->len + 1), GFP_KERNEL); if (!new_fn) return -ENOMEM; new_fn->hash = hash; new_fn->minor_hash = minor_hash; new_fn->inode = le32_to_cpu(dirent->inode); new_fn->name_len = ent_name->len; new_fn->file_type = dirent->file_type; memcpy(new_fn->name, ent_name->name, ent_name->len); while (*p) { parent = *p; fname = rb_entry(parent, struct fname, rb_hash); /* * If the hash and minor hash match up, then we put * them on a linked list. This rarely happens... */ if ((new_fn->hash == fname->hash) && (new_fn->minor_hash == fname->minor_hash)) { new_fn->next = fname->next; fname->next = new_fn; return 0; } if (new_fn->hash < fname->hash) p = &(*p)->rb_left; else if (new_fn->hash > fname->hash) p = &(*p)->rb_right; else if (new_fn->minor_hash < fname->minor_hash) p = &(*p)->rb_left; else /* if (new_fn->minor_hash > fname->minor_hash) */ p = &(*p)->rb_right; } rb_link_node(&new_fn->rb_hash, parent, p); rb_insert_color(&new_fn->rb_hash, &info->root); return 0; } /* * This is a helper function for ext4_dx_readdir. It calls filldir * for all entries on the fname linked list. (Normally there is only * one entry on the linked list, unless there are 62 bit hash collisions.) 
*/ static int call_filldir(struct file *file, struct dir_context *ctx, struct fname *fname) { struct dir_private_info *info = file->private_data; struct inode *inode = file_inode(file); struct super_block *sb = inode->i_sb; if (!fname) { ext4_msg(sb, KERN_ERR, "%s:%d: inode #%lu: comm %s: " "called with null fname?!?", __func__, __LINE__, inode->i_ino, current->comm); return 0; } ctx->pos = hash2pos(file, fname->hash, fname->minor_hash); while (fname) { if (!dir_emit(ctx, fname->name, fname->name_len, fname->inode, get_dtype(sb, fname->file_type))) { info->extra_fname = fname; return 1; } fname = fname->next; } return 0; } static int ext4_dx_readdir(struct file *file, struct dir_context *ctx) { struct dir_private_info *info = file->private_data; struct inode *inode = file_inode(file); struct fname *fname; int ret = 0; ext4_htree_init_dir_info(file, ctx->pos); if (ctx->pos == ext4_get_htree_eof(file)) return 0; /* EOF */ /* Some one has messed with f_pos; reset the world */ if (info->last_pos != ctx->pos) { free_rb_tree_fname(&info->root); info->curr_node = NULL; info->extra_fname = NULL; info->curr_hash = pos2maj_hash(file, ctx->pos); info->curr_minor_hash = pos2min_hash(file, ctx->pos); } /* * If there are any leftover names on the hash collision * chain, return them first. */ if (info->extra_fname) { if (call_filldir(file, ctx, info->extra_fname)) goto finished; info->extra_fname = NULL; goto next_node; } else if (!info->curr_node) info->curr_node = rb_first(&info->root); while (1) { /* * Fill the rbtree if we have no more entries, * or the inode has changed since we last read in the * cached entries. */ if ((!info->curr_node) || !inode_eq_iversion(inode, info->cookie)) { info->curr_node = NULL; free_rb_tree_fname(&info->root); info->cookie = inode_query_iversion(inode); ret = ext4_htree_fill_tree(file, info->curr_hash, info->curr_minor_hash, &info->next_hash); if (ret < 0) goto finished; if (ret == 0) { ctx->pos = ext4_get_htree_eof(file); break; } info->curr_node = rb_first(&info->root); } fname = rb_entry(info->curr_node, struct fname, rb_hash); info->curr_hash = fname->hash; info->curr_minor_hash = fname->minor_hash; if (call_filldir(file, ctx, fname)) break; next_node: info->curr_node = rb_next(info->curr_node); if (info->curr_node) { fname = rb_entry(info->curr_node, struct fname, rb_hash); info->curr_hash = fname->hash; info->curr_minor_hash = fname->minor_hash; } else { if (info->next_hash == ~0) { ctx->pos = ext4_get_htree_eof(file); break; } info->curr_hash = info->next_hash; info->curr_minor_hash = 0; } } finished: info->last_pos = ctx->pos; return ret < 0 ? 
ret : 0; } static int ext4_release_dir(struct inode *inode, struct file *filp) { if (filp->private_data) ext4_htree_free_dir_info(filp->private_data); return 0; } int ext4_check_all_de(struct inode *dir, struct buffer_head *bh, void *buf, int buf_size) { struct ext4_dir_entry_2 *de; int rlen; unsigned int offset = 0; char *top; de = buf; top = buf + buf_size; while ((char *) de < top) { if (ext4_check_dir_entry(dir, NULL, de, bh, buf, buf_size, offset)) return -EFSCORRUPTED; rlen = ext4_rec_len_from_disk(de->rec_len, buf_size); de = (struct ext4_dir_entry_2 *)((char *)de + rlen); offset += rlen; } if ((char *) de > top) return -EFSCORRUPTED; return 0; } static int ext4_dir_open(struct inode *inode, struct file *file) { struct dir_private_info *info; info = kzalloc(sizeof(*info), GFP_KERNEL); if (!info) return -ENOMEM; file->private_data = info; return 0; } const struct file_operations ext4_dir_operations = { .open = ext4_dir_open, .llseek = ext4_dir_llseek, .read = generic_read_dir, .iterate_shared = ext4_readdir, .unlocked_ioctl = ext4_ioctl, #ifdef CONFIG_COMPAT .compat_ioctl = ext4_compat_ioctl, #endif .fsync = ext4_sync_file, .release = ext4_release_dir, };
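/*
 * Illustration (not from the ext4 sources): the hash2pos(), pos2maj_hash()
 * and pos2min_hash() helpers above pack the htree major/minor hash into the
 * directory f_pos, dropping the low bit of the major hash so that the value
 * also fits the legacy 32-bit encoding. A stand-alone user-space sketch of
 * that round trip; is_32bit stands in for the FMODE_32BITHASH /
 * is_32bit_api() decision, and the helpers are reimplemented here purely
 * for illustration.
 */
#include <stdint.h>
#include <stdio.h>
#include <inttypes.h>

static uint64_t hash2pos(int is_32bit, uint32_t major, uint32_t minor)
{
	if (is_32bit)
		return major >> 1;
	return ((uint64_t)(major >> 1) << 32) | minor;
}

static uint32_t pos2maj_hash(int is_32bit, uint64_t pos)
{
	if (is_32bit)
		return (pos << 1) & 0xffffffff;
	return ((pos >> 32) << 1) & 0xffffffff;
}

static uint32_t pos2min_hash(int is_32bit, uint64_t pos)
{
	return is_32bit ? 0 : (pos & 0xffffffff);
}

int main(void)
{
	uint32_t major = 0xdeadbeef, minor = 0x12345678;
	uint64_t pos64 = hash2pos(0, major, minor);
	uint64_t pos32 = hash2pos(1, major, minor);

	/* Only the low bit of the major hash is sacrificed by the encoding. */
	printf("64-bit pos %#" PRIx64 " -> major %#" PRIx32 " minor %#" PRIx32 "\n",
	       pos64, pos2maj_hash(0, pos64), pos2min_hash(0, pos64));
	printf("32-bit pos %#" PRIx64 " -> major %#" PRIx32 " minor %#" PRIx32 "\n",
	       pos32, pos2maj_hash(1, pos32), pos2min_hash(1, pos32));
	return 0;
}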
// SPDX-License-Identifier: GPL-2.0-only /* * Copyright (C) 1995 Linus Torvalds * * Pentium III FXSR, SSE support * Gareth Hughes <gareth@valinux.com>, May 2000 * * X86-64 port * Andi Kleen. * * CPU hotplug support - ashok.raj@intel.com */ /* * This file handles the architecture-dependent parts of process handling.. */ #include <linux/cpu.h> #include <linux/errno.h> #include <linux/sched.h> #include <linux/sched/task.h> #include <linux/sched/task_stack.h> #include <linux/fs.h> #include <linux/kernel.h> #include <linux/mm.h> #include <linux/elfcore.h> #include <linux/smp.h> #include <linux/slab.h> #include <linux/user.h> #include <linux/interrupt.h> #include <linux/delay.h> #include <linux/export.h> #include <linux/ptrace.h> #include <linux/notifier.h> #include <linux/kprobes.h> #include <linux/kdebug.h> #include <linux/prctl.h> #include <linux/uaccess.h> #include <linux/io.h> #include <linux/ftrace.h> #include <linux/syscalls.h> #include <linux/iommu.h> #include <asm/processor.h> #include <asm/pkru.h> #include <asm/fpu/sched.h> #include <asm/mmu_context.h> #include <asm/prctl.h> #include <asm/desc.h> #include <asm/proto.h> #include <asm/ia32.h> #include <asm/debugreg.h> #include <asm/switch_to.h> #include <asm/xen/hypervisor.h> #include <asm/vdso.h> #include <asm/resctrl.h> #include <asm/unistd.h> #include <asm/fsgsbase.h> #include <asm/fred.h> #ifdef CONFIG_IA32_EMULATION /* Not included via unistd.h */ #include <asm/unistd_32_ia32.h> #endif #include "process.h" /* Prints also some state that isn't saved in the pt_regs */ void __show_regs(struct pt_regs *regs, enum show_regs_mode mode, const char *log_lvl) { unsigned long cr0 = 0L, cr2 = 0L, cr3 = 0L, cr4 = 0L, fs, gs, shadowgs; unsigned long d0, d1, d2, d3, d6, d7; unsigned int fsindex, gsindex; unsigned int ds, es; show_iret_regs(regs, log_lvl); if (regs->orig_ax != -1) pr_cont(" ORIG_RAX: %016lx\n", regs->orig_ax); else pr_cont("\n"); printk("%sRAX: %016lx RBX: %016lx RCX: %016lx\n", log_lvl, regs->ax, regs->bx, regs->cx); printk("%sRDX: %016lx RSI: %016lx RDI: %016lx\n", log_lvl, regs->dx, regs->si, regs->di); printk("%sRBP: %016lx R08: %016lx R09: %016lx\n", log_lvl, regs->bp, regs->r8, regs->r9); printk("%sR10: %016lx R11: %016lx R12: %016lx\n", log_lvl, regs->r10, regs->r11, regs->r12); printk("%sR13: %016lx R14: %016lx R15: %016lx\n", log_lvl, regs->r13, regs->r14, regs->r15); if (mode == SHOW_REGS_SHORT) return; if (mode == SHOW_REGS_USER) { rdmsrl(MSR_FS_BASE, fs); rdmsrl(MSR_KERNEL_GS_BASE, shadowgs); printk("%sFS: %016lx GS: %016lx\n", log_lvl, fs, shadowgs); return; } asm("movl %%ds,%0" : "=r" (ds)); asm("movl %%es,%0" : "=r" (es)); asm("movl %%fs,%0" : "=r" (fsindex)); asm("movl %%gs,%0" : "=r" (gsindex)); rdmsrl(MSR_FS_BASE, fs); rdmsrl(MSR_GS_BASE, gs); rdmsrl(MSR_KERNEL_GS_BASE, shadowgs); cr0 = read_cr0(); cr2 = read_cr2(); cr3 = __read_cr3(); cr4 = __read_cr4(); printk("%sFS: %016lx(%04x) GS:%016lx(%04x) knlGS:%016lx\n", log_lvl, fs, fsindex, gs, gsindex, shadowgs); printk("%sCS: %04x DS: %04x ES: %04x CR0: %016lx\n", log_lvl, regs->cs, ds,
es, cr0); printk("%sCR2: %016lx CR3: %016lx CR4: %016lx\n", log_lvl, cr2, cr3, cr4); get_debugreg(d0, 0); get_debugreg(d1, 1); get_debugreg(d2, 2); get_debugreg(d3, 3); get_debugreg(d6, 6); get_debugreg(d7, 7); /* Only print out debug registers if they are in their non-default state. */ if (!((d0 == 0) && (d1 == 0) && (d2 == 0) && (d3 == 0) && (d6 == DR6_RESERVED) && (d7 == 0x400))) { printk("%sDR0: %016lx DR1: %016lx DR2: %016lx\n", log_lvl, d0, d1, d2); printk("%sDR3: %016lx DR6: %016lx DR7: %016lx\n", log_lvl, d3, d6, d7); } if (cr4 & X86_CR4_PKE) printk("%sPKRU: %08x\n", log_lvl, read_pkru()); } void release_thread(struct task_struct *dead_task) { WARN_ON(dead_task->mm); } enum which_selector { FS, GS }; /* * Out of line to be protected from kprobes and tracing. If this would be * traced or probed than any access to a per CPU variable happens with * the wrong GS. * * It is not used on Xen paravirt. When paravirt support is needed, it * needs to be renamed with native_ prefix. */ static noinstr unsigned long __rdgsbase_inactive(void) { unsigned long gsbase; lockdep_assert_irqs_disabled(); /* * SWAPGS is no longer needed thus NOT allowed with FRED because * FRED transitions ensure that an operating system can _always_ * operate with its own GS base address: * - For events that occur in ring 3, FRED event delivery swaps * the GS base address with the IA32_KERNEL_GS_BASE MSR. * - ERETU (the FRED transition that returns to ring 3) also swaps * the GS base address with the IA32_KERNEL_GS_BASE MSR. * * And the operating system can still setup the GS segment for a * user thread without the need of loading a user thread GS with: * - Using LKGS, available with FRED, to modify other attributes * of the GS segment without compromising its ability always to * operate with its own GS base address. * - Accessing the GS segment base address for a user thread as * before using RDMSR or WRMSR on the IA32_KERNEL_GS_BASE MSR. * * Note, LKGS loads the GS base address into the IA32_KERNEL_GS_BASE * MSR instead of the GS segment’s descriptor cache. As such, the * operating system never changes its runtime GS base address. */ if (!cpu_feature_enabled(X86_FEATURE_FRED) && !cpu_feature_enabled(X86_FEATURE_XENPV)) { native_swapgs(); gsbase = rdgsbase(); native_swapgs(); } else { instrumentation_begin(); rdmsrl(MSR_KERNEL_GS_BASE, gsbase); instrumentation_end(); } return gsbase; } /* * Out of line to be protected from kprobes and tracing. If this would be * traced or probed than any access to a per CPU variable happens with * the wrong GS. * * It is not used on Xen paravirt. When paravirt support is needed, it * needs to be renamed with native_ prefix. */ static noinstr void __wrgsbase_inactive(unsigned long gsbase) { lockdep_assert_irqs_disabled(); if (!cpu_feature_enabled(X86_FEATURE_FRED) && !cpu_feature_enabled(X86_FEATURE_XENPV)) { native_swapgs(); wrgsbase(gsbase); native_swapgs(); } else { instrumentation_begin(); wrmsrl(MSR_KERNEL_GS_BASE, gsbase); instrumentation_end(); } } /* * Saves the FS or GS base for an outgoing thread if FSGSBASE extensions are * not available. The goal is to be reasonably fast on non-FSGSBASE systems. * It's forcibly inlined because it'll generate better code and this function * is hot. */ static __always_inline void save_base_legacy(struct task_struct *prev_p, unsigned short selector, enum which_selector which) { if (likely(selector == 0)) { /* * On Intel (without X86_BUG_NULL_SEG), the segment base could * be the pre-existing saved base or it could be zero. 
On AMD * (with X86_BUG_NULL_SEG), the segment base could be almost * anything. * * This branch is very hot (it's hit twice on almost every * context switch between 64-bit programs), and avoiding * the RDMSR helps a lot, so we just assume that whatever * value is already saved is correct. This matches historical * Linux behavior, so it won't break existing applications. * * To avoid leaking state, on non-X86_BUG_NULL_SEG CPUs, if we * report that the base is zero, it needs to actually be zero: * see the corresponding logic in load_seg_legacy. */ } else { /* * If the selector is 1, 2, or 3, then the base is zero on * !X86_BUG_NULL_SEG CPUs and could be anything on * X86_BUG_NULL_SEG CPUs. In the latter case, Linux * has never attempted to preserve the base across context * switches. * * If selector > 3, then it refers to a real segment, and * saving the base isn't necessary. */ if (which == FS) prev_p->thread.fsbase = 0; else prev_p->thread.gsbase = 0; } } static __always_inline void save_fsgs(struct task_struct *task) { savesegment(fs, task->thread.fsindex); savesegment(gs, task->thread.gsindex); if (static_cpu_has(X86_FEATURE_FSGSBASE)) { /* * If FSGSBASE is enabled, we can't make any useful guesses * about the base, and user code expects us to save the current * value. Fortunately, reading the base directly is efficient. */ task->thread.fsbase = rdfsbase(); task->thread.gsbase = __rdgsbase_inactive(); } else { save_base_legacy(task, task->thread.fsindex, FS); save_base_legacy(task, task->thread.gsindex, GS); } } /* * While a process is running,current->thread.fsbase and current->thread.gsbase * may not match the corresponding CPU registers (see save_base_legacy()). */ void current_save_fsgs(void) { unsigned long flags; /* Interrupts need to be off for FSGSBASE */ local_irq_save(flags); save_fsgs(current); local_irq_restore(flags); } #if IS_ENABLED(CONFIG_KVM) EXPORT_SYMBOL_GPL(current_save_fsgs); #endif static __always_inline void loadseg(enum which_selector which, unsigned short sel) { if (which == FS) loadsegment(fs, sel); else load_gs_index(sel); } static __always_inline void load_seg_legacy(unsigned short prev_index, unsigned long prev_base, unsigned short next_index, unsigned long next_base, enum which_selector which) { if (likely(next_index <= 3)) { /* * The next task is using 64-bit TLS, is not using this * segment at all, or is having fun with arcane CPU features. */ if (next_base == 0) { /* * Nasty case: on AMD CPUs, we need to forcibly zero * the base. */ if (static_cpu_has_bug(X86_BUG_NULL_SEG)) { loadseg(which, __USER_DS); loadseg(which, next_index); } else { /* * We could try to exhaustively detect cases * under which we can skip the segment load, * but there's really only one case that matters * for performance: if both the previous and * next states are fully zeroed, we can skip * the load. * * (This assumes that prev_base == 0 has no * false positives. This is the case on * Intel-style CPUs.) */ if (likely(prev_index | next_index | prev_base)) loadseg(which, next_index); } } else { if (prev_index != next_index) loadseg(which, next_index); wrmsrl(which == FS ? MSR_FS_BASE : MSR_KERNEL_GS_BASE, next_base); } } else { /* * The next task is using a real segment. Loading the selector * is sufficient. */ loadseg(which, next_index); } } /* * Store prev's PKRU value and load next's PKRU value if they differ. PKRU * is not XSTATE managed on context switch because that would require a * lookup in the task's FPU xsave buffer and require to keep that updated * in various places. 
*/ static __always_inline void x86_pkru_load(struct thread_struct *prev, struct thread_struct *next) { if (!cpu_feature_enabled(X86_FEATURE_OSPKE)) return; /* Stash the prev task's value: */ prev->pkru = rdpkru(); /* * PKRU writes are slightly expensive. Avoid them when not * strictly necessary: */ if (prev->pkru != next->pkru) wrpkru(next->pkru); } static __always_inline void x86_fsgsbase_load(struct thread_struct *prev, struct thread_struct *next) { if (static_cpu_has(X86_FEATURE_FSGSBASE)) { /* Update the FS and GS selectors if they could have changed. */ if (unlikely(prev->fsindex || next->fsindex)) loadseg(FS, next->fsindex); if (unlikely(prev->gsindex || next->gsindex)) loadseg(GS, next->gsindex); /* Update the bases. */ wrfsbase(next->fsbase); __wrgsbase_inactive(next->gsbase); } else { load_seg_legacy(prev->fsindex, prev->fsbase, next->fsindex, next->fsbase, FS); load_seg_legacy(prev->gsindex, prev->gsbase, next->gsindex, next->gsbase, GS); } } unsigned long x86_fsgsbase_read_task(struct task_struct *task, unsigned short selector) { unsigned short idx = selector >> 3; unsigned long base; if (likely((selector & SEGMENT_TI_MASK) == 0)) { if (unlikely(idx >= GDT_ENTRIES)) return 0; /* * There are no user segments in the GDT with nonzero bases * other than the TLS segments. */ if (idx < GDT_ENTRY_TLS_MIN || idx > GDT_ENTRY_TLS_MAX) return 0; idx -= GDT_ENTRY_TLS_MIN; base = get_desc_base(&task->thread.tls_array[idx]); } else { #ifdef CONFIG_MODIFY_LDT_SYSCALL struct ldt_struct *ldt; /* * If performance here mattered, we could protect the LDT * with RCU. This is a slow path, though, so we can just * take the mutex. */ mutex_lock(&task->mm->context.lock); ldt = task->mm->context.ldt; if (unlikely(!ldt || idx >= ldt->nr_entries)) base = 0; else base = get_desc_base(ldt->entries + idx); mutex_unlock(&task->mm->context.lock); #else base = 0; #endif } return base; } unsigned long x86_gsbase_read_cpu_inactive(void) { unsigned long gsbase; if (boot_cpu_has(X86_FEATURE_FSGSBASE)) { unsigned long flags; local_irq_save(flags); gsbase = __rdgsbase_inactive(); local_irq_restore(flags); } else { rdmsrl(MSR_KERNEL_GS_BASE, gsbase); } return gsbase; } void x86_gsbase_write_cpu_inactive(unsigned long gsbase) { if (boot_cpu_has(X86_FEATURE_FSGSBASE)) { unsigned long flags; local_irq_save(flags); __wrgsbase_inactive(gsbase); local_irq_restore(flags); } else { wrmsrl(MSR_KERNEL_GS_BASE, gsbase); } } unsigned long x86_fsbase_read_task(struct task_struct *task) { unsigned long fsbase; if (task == current) fsbase = x86_fsbase_read_cpu(); else if (boot_cpu_has(X86_FEATURE_FSGSBASE) || (task->thread.fsindex == 0)) fsbase = task->thread.fsbase; else fsbase = x86_fsgsbase_read_task(task, task->thread.fsindex); return fsbase; } unsigned long x86_gsbase_read_task(struct task_struct *task) { unsigned long gsbase; if (task == current) gsbase = x86_gsbase_read_cpu_inactive(); else if (boot_cpu_has(X86_FEATURE_FSGSBASE) || (task->thread.gsindex == 0)) gsbase = task->thread.gsbase; else gsbase = x86_fsgsbase_read_task(task, task->thread.gsindex); return gsbase; } void x86_fsbase_write_task(struct task_struct *task, unsigned long fsbase) { WARN_ON_ONCE(task == current); task->thread.fsbase = fsbase; } void x86_gsbase_write_task(struct task_struct *task, unsigned long gsbase) { WARN_ON_ONCE(task == current); task->thread.gsbase = gsbase; } static void start_thread_common(struct pt_regs *regs, unsigned long new_ip, unsigned long new_sp, u16 _cs, u16 _ss, u16 _ds) { WARN_ON_ONCE(regs != current_pt_regs()); if 
(static_cpu_has(X86_BUG_NULL_SEG)) { /* Loading zero below won't clear the base. */ loadsegment(fs, __USER_DS); load_gs_index(__USER_DS); } reset_thread_features(); loadsegment(fs, 0); loadsegment(es, _ds); loadsegment(ds, _ds); load_gs_index(0); regs->ip = new_ip; regs->sp = new_sp; regs->csx = _cs; regs->ssx = _ss; /* * Allow single-step trap and NMI when starting a new task, thus * once the new task enters user space, single-step trap and NMI * are both enabled immediately. * * Entering a new task is logically speaking a return from a * system call (exec, fork, clone, etc.). As such, if ptrace * enables single stepping a single step exception should be * allowed to trigger immediately upon entering user space. * This is not optional. * * NMI should *never* be disabled in user space. As such, this * is an optional, opportunistic way to catch errors. * * Paranoia: High-order 48 bits above the lowest 16 bit SS are * discarded by the legacy IRET instruction on all Intel, AMD, * and Cyrix/Centaur/VIA CPUs, thus can be set unconditionally, * even when FRED is not enabled. But we choose the safer side * to use these bits only when FRED is enabled. */ if (cpu_feature_enabled(X86_FEATURE_FRED)) { regs->fred_ss.swevent = true; regs->fred_ss.nmi = true; } regs->flags = X86_EFLAGS_IF | X86_EFLAGS_FIXED; } void start_thread(struct pt_regs *regs, unsigned long new_ip, unsigned long new_sp) { start_thread_common(regs, new_ip, new_sp, __USER_CS, __USER_DS, 0); } EXPORT_SYMBOL_GPL(start_thread); #ifdef CONFIG_COMPAT void compat_start_thread(struct pt_regs *regs, u32 new_ip, u32 new_sp, bool x32) { start_thread_common(regs, new_ip, new_sp, x32 ? __USER_CS : __USER32_CS, __USER_DS, __USER_DS); } #endif /* * switch_to(x,y) should switch tasks from x to y. * * This could still be optimized: * - fold all the options into a flag word and test it with a single test. * - could test fs/gs bitsliced * * Kprobes not supported here. Set the probe on schedule instead. * Function graph tracer not supported too. */ __no_kmsan_checks __visible __notrace_funcgraph struct task_struct * __switch_to(struct task_struct *prev_p, struct task_struct *next_p) { struct thread_struct *prev = &prev_p->thread; struct thread_struct *next = &next_p->thread; int cpu = smp_processor_id(); WARN_ON_ONCE(IS_ENABLED(CONFIG_DEBUG_ENTRY) && this_cpu_read(pcpu_hot.hardirq_stack_inuse)); if (!test_tsk_thread_flag(prev_p, TIF_NEED_FPU_LOAD)) switch_fpu_prepare(prev_p, cpu); /* We must save %fs and %gs before load_TLS() because * %fs and %gs may be cleared by load_TLS(). * * (e.g. xen_load_tls()) */ save_fsgs(prev_p); /* * Load TLS before restoring any segments so that segment loads * reference the correct GDT entries. */ load_TLS(next, cpu); /* * Leave lazy mode, flushing any hypercalls made here. This * must be done after loading TLS entries in the GDT but before * loading segments that might reference them. */ arch_end_context_switch(next_p); /* Switch DS and ES. * * Reading them only returns the selectors, but writing them (if * nonzero) loads the full descriptor from the GDT or LDT. The * LDT for next is loaded in switch_mm, and the GDT is loaded * above. * * We therefore need to write new values to the segment * registers on every context switch unless both the new and old * values are zero. * * Note that we don't need to do anything for CS and SS, as * those are saved and restored as part of pt_regs. 
*/ savesegment(es, prev->es); if (unlikely(next->es | prev->es)) loadsegment(es, next->es); savesegment(ds, prev->ds); if (unlikely(next->ds | prev->ds)) loadsegment(ds, next->ds); x86_fsgsbase_load(prev, next); x86_pkru_load(prev, next); /* * Switch the PDA and FPU contexts. */ raw_cpu_write(pcpu_hot.current_task, next_p); raw_cpu_write(pcpu_hot.top_of_stack, task_top_of_stack(next_p)); switch_fpu_finish(next_p); /* Reload sp0. */ update_task_stack(next_p); switch_to_extra(prev_p, next_p); if (static_cpu_has_bug(X86_BUG_SYSRET_SS_ATTRS)) { /* * AMD CPUs have a misfeature: SYSRET sets the SS selector but * does not update the cached descriptor. As a result, if we * do SYSRET while SS is NULL, we'll end up in user mode with * SS apparently equal to __USER_DS but actually unusable. * * The straightforward workaround would be to fix it up just * before SYSRET, but that would slow down the system call * fast paths. Instead, we ensure that SS is never NULL in * system call context. We do this by replacing NULL SS * selectors at every context switch. SYSCALL sets up a valid * SS, so the only way to get NULL is to re-enter the kernel * from CPL 3 through an interrupt. Since that can't happen * in the same task as a running syscall, we are guaranteed to * context switch between every interrupt vector entry and a * subsequent SYSRET. * * We read SS first because SS reads are much faster than * writes. Out of caution, we force SS to __KERNEL_DS even if * it previously had a different non-NULL value. */ unsigned short ss_sel; savesegment(ss, ss_sel); if (ss_sel != __KERNEL_DS) loadsegment(ss, __KERNEL_DS); } /* Load the Intel cache allocation PQR MSR. */ resctrl_sched_in(next_p); return prev_p; } void set_personality_64bit(void) { /* inherit personality from parent */ /* Make sure to be in 64bit mode */ clear_thread_flag(TIF_ADDR32); /* Pretend that this comes from a 64bit execve */ task_pt_regs(current)->orig_ax = __NR_execve; current_thread_info()->status &= ~TS_COMPAT; if (current->mm) __set_bit(MM_CONTEXT_HAS_VSYSCALL, &current->mm->context.flags); /* TBD: overwrites user setup. Should have two bits. But 64bit processes have always behaved this way, so it's not too bad. The main problem is just that 32bit children are affected again. */ current->personality &= ~READ_IMPLIES_EXEC; } static void __set_personality_x32(void) { #ifdef CONFIG_X86_X32_ABI if (current->mm) current->mm->context.flags = 0; current->personality &= ~READ_IMPLIES_EXEC; /* * in_32bit_syscall() uses the presence of the x32 syscall bit * flag to determine compat status. The x86 mmap() code relies on * the syscall bitness so set x32 syscall bit right here to make * in_32bit_syscall() work during exec(). * * Pretend to come from a x32 execve. */ task_pt_regs(current)->orig_ax = __NR_x32_execve | __X32_SYSCALL_BIT; current_thread_info()->status &= ~TS_COMPAT; #endif } static void __set_personality_ia32(void) { #ifdef CONFIG_IA32_EMULATION if (current->mm) { /* * uprobes applied to this MM need to know this and * cannot use user_64bit_mode() at that time.
*/ __set_bit(MM_CONTEXT_UPROBE_IA32, &current->mm->context.flags); } current->personality |= force_personality32; /* Prepare the first "return" to user space */ task_pt_regs(current)->orig_ax = __NR_ia32_execve; current_thread_info()->status |= TS_COMPAT; #endif } void set_personality_ia32(bool x32) { /* Make sure to be in 32bit mode */ set_thread_flag(TIF_ADDR32); if (x32) __set_personality_x32(); else __set_personality_ia32(); } EXPORT_SYMBOL_GPL(set_personality_ia32); #ifdef CONFIG_CHECKPOINT_RESTORE static long prctl_map_vdso(const struct vdso_image *image, unsigned long addr) { int ret; ret = map_vdso_once(image, addr); if (ret) return ret; return (long)image->size; } #endif #ifdef CONFIG_ADDRESS_MASKING #define LAM_U57_BITS 6 static void enable_lam_func(void *__mm) { struct mm_struct *mm = __mm; unsigned long lam; if (this_cpu_read(cpu_tlbstate.loaded_mm) == mm) { lam = mm_lam_cr3_mask(mm); write_cr3(__read_cr3() | lam); cpu_tlbstate_update_lam(lam, mm_untag_mask(mm)); } } static void mm_enable_lam(struct mm_struct *mm) { mm->context.lam_cr3_mask = X86_CR3_LAM_U57; mm->context.untag_mask = ~GENMASK(62, 57); /* * Even though the process must still be single-threaded at this * point, kernel threads may be using the mm. IPI those kernel * threads if they exist. */ on_each_cpu_mask(mm_cpumask(mm), enable_lam_func, mm, true); set_bit(MM_CONTEXT_LOCK_LAM, &mm->context.flags); } static int prctl_enable_tagged_addr(struct mm_struct *mm, unsigned long nr_bits) { if (!cpu_feature_enabled(X86_FEATURE_LAM)) return -ENODEV; /* PTRACE_ARCH_PRCTL */ if (current->mm != mm) return -EINVAL; if (mm_valid_pasid(mm) && !test_bit(MM_CONTEXT_FORCE_TAGGED_SVA, &mm->context.flags)) return -EINVAL; if (mmap_write_lock_killable(mm)) return -EINTR; /* * MM_CONTEXT_LOCK_LAM is set on clone. Prevent LAM from * being enabled unless the process is single threaded: */ if (test_bit(MM_CONTEXT_LOCK_LAM, &mm->context.flags)) { mmap_write_unlock(mm); return -EBUSY; } if (!nr_bits || nr_bits > LAM_U57_BITS) { mmap_write_unlock(mm); return -EINVAL; } mm_enable_lam(mm); mmap_write_unlock(mm); return 0; } #endif long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2) { int ret = 0; switch (option) { case ARCH_SET_GS: { if (unlikely(arg2 >= TASK_SIZE_MAX)) return -EPERM; preempt_disable(); /* * ARCH_SET_GS has always overwritten the index * and the base. Zero is the most sensible value * to put in the index, and is the only value that * makes any sense if FSGSBASE is unavailable. */ if (task == current) { loadseg(GS, 0); x86_gsbase_write_cpu_inactive(arg2); /* * On non-FSGSBASE systems, save_base_legacy() expects * that we also fill in thread.gsbase. */ task->thread.gsbase = arg2; } else { task->thread.gsindex = 0; x86_gsbase_write_task(task, arg2); } preempt_enable(); break; } case ARCH_SET_FS: { /* * Not strictly needed for %fs, but do it for symmetry * with %gs */ if (unlikely(arg2 >= TASK_SIZE_MAX)) return -EPERM; preempt_disable(); /* * Set the selector to 0 for the same reason * as %gs above. */ if (task == current) { loadseg(FS, 0); x86_fsbase_write_cpu(arg2); /* * On non-FSGSBASE systems, save_base_legacy() expects * that we also fill in thread.fsbase.
*/ task->thread.fsbase = arg2; } else { task->thread.fsindex = 0; x86_fsbase_write_task(task, arg2); } preempt_enable(); break; } case ARCH_GET_FS: { unsigned long base = x86_fsbase_read_task(task); ret = put_user(base, (unsigned long __user *)arg2); break; } case ARCH_GET_GS: { unsigned long base = x86_gsbase_read_task(task); ret = put_user(base, (unsigned long __user *)arg2); break; } #ifdef CONFIG_CHECKPOINT_RESTORE # ifdef CONFIG_X86_X32_ABI case ARCH_MAP_VDSO_X32: return prctl_map_vdso(&vdso_image_x32, arg2); # endif # if defined CONFIG_X86_32 || defined CONFIG_IA32_EMULATION case ARCH_MAP_VDSO_32: return prctl_map_vdso(&vdso_image_32, arg2); # endif case ARCH_MAP_VDSO_64: return prctl_map_vdso(&vdso_image_64, arg2); #endif #ifdef CONFIG_ADDRESS_MASKING case ARCH_GET_UNTAG_MASK: return put_user(task->mm->context.untag_mask, (unsigned long __user *)arg2); case ARCH_ENABLE_TAGGED_ADDR: return prctl_enable_tagged_addr(task->mm, arg2); case ARCH_FORCE_TAGGED_SVA: if (current != task) return -EINVAL; set_bit(MM_CONTEXT_FORCE_TAGGED_SVA, &task->mm->context.flags); return 0; case ARCH_GET_MAX_TAG_BITS: if (!cpu_feature_enabled(X86_FEATURE_LAM)) return put_user(0, (unsigned long __user *)arg2); else return put_user(LAM_U57_BITS, (unsigned long __user *)arg2); #endif case ARCH_SHSTK_ENABLE: case ARCH_SHSTK_DISABLE: case ARCH_SHSTK_LOCK: case ARCH_SHSTK_UNLOCK: case ARCH_SHSTK_STATUS: return shstk_prctl(task, option, arg2); default: ret = -EINVAL; break; } return ret; } SYSCALL_DEFINE2(arch_prctl, int, option, unsigned long, arg2) { long ret; ret = do_arch_prctl_64(current, option, arg2); if (ret == -EINVAL) ret = do_arch_prctl_common(option, arg2); return ret; } #ifdef CONFIG_IA32_EMULATION COMPAT_SYSCALL_DEFINE2(arch_prctl, int, option, unsigned long, arg2) { return do_arch_prctl_common(option, arg2); } #endif unsigned long KSTK_ESP(struct task_struct *task) { return task_pt_regs(task)->sp; } |
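/*
 * Illustration (not part of process_64.c): do_arch_prctl_64() above is what
 * backs arch_prctl(2) for ARCH_SET_FS/ARCH_GET_FS/ARCH_SET_GS/ARCH_GET_GS.
 * A minimal user-space sketch exercising that path on x86-64 Linux: it reads
 * the FS base (glibc's TLS pointer) and round-trips a GS base, which 64-bit
 * user space normally leaves unused. Assumes the UAPI <asm/prctl.h> header
 * is available; error handling is trimmed to perror().
 */
#include <asm/prctl.h>		/* ARCH_GET_FS, ARCH_SET_GS, ARCH_GET_GS */
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	unsigned long fsbase = 0, gsbase = 0;
	static uint64_t scratch[4];

	/* Ends up in x86_fsbase_read_task() / x86_fsbase_read_cpu(). */
	if (syscall(SYS_arch_prctl, ARCH_GET_FS, &fsbase))
		perror("ARCH_GET_FS");

	/* ARCH_SET_GS zeroes the GS selector and writes the inactive GS base,
	 * which becomes the user GS base on return to user space. */
	if (syscall(SYS_arch_prctl, ARCH_SET_GS, (unsigned long)scratch))
		perror("ARCH_SET_GS");
	if (syscall(SYS_arch_prctl, ARCH_GET_GS, &gsbase))
		perror("ARCH_GET_GS");

	printf("FS base: %#lx\n", fsbase);
	printf("GS base: %#lx (expected %#lx)\n", gsbase, (unsigned long)scratch);
	return 0;
}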
// SPDX-License-Identifier: GPL-2.0 #include <linux/memcontrol.h> #include <linux/rwsem.h> #include
<linux/shrinker.h> #include <linux/rculist.h> #include <trace/events/vmscan.h> #include "internal.h" LIST_HEAD(shrinker_list); DEFINE_MUTEX(shrinker_mutex); #ifdef CONFIG_MEMCG static int shrinker_nr_max; static inline int shrinker_unit_size(int nr_items) { return (DIV_ROUND_UP(nr_items, SHRINKER_UNIT_BITS) * sizeof(struct shrinker_info_unit *)); } static inline void shrinker_unit_free(struct shrinker_info *info, int start) { struct shrinker_info_unit **unit; int nr, i; if (!info) return; unit = info->unit; nr = DIV_ROUND_UP(info->map_nr_max, SHRINKER_UNIT_BITS); for (i = start; i < nr; i++) { if (!unit[i]) break; kfree(unit[i]); unit[i] = NULL; } } static inline int shrinker_unit_alloc(struct shrinker_info *new, struct shrinker_info *old, int nid) { struct shrinker_info_unit *unit; int nr = DIV_ROUND_UP(new->map_nr_max, SHRINKER_UNIT_BITS); int start = old ? DIV_ROUND_UP(old->map_nr_max, SHRINKER_UNIT_BITS) : 0; int i; for (i = start; i < nr; i++) { unit = kzalloc_node(sizeof(*unit), GFP_KERNEL, nid); if (!unit) { shrinker_unit_free(new, start); return -ENOMEM; } new->unit[i] = unit; } return 0; } void free_shrinker_info(struct mem_cgroup *memcg) { struct mem_cgroup_per_node *pn; struct shrinker_info *info; int nid; for_each_node(nid) { pn = memcg->nodeinfo[nid]; info = rcu_dereference_protected(pn->shrinker_info, true); shrinker_unit_free(info, 0); kvfree(info); rcu_assign_pointer(pn->shrinker_info, NULL); } } int alloc_shrinker_info(struct mem_cgroup *memcg) { int nid, ret = 0; int array_size = 0; mutex_lock(&shrinker_mutex); array_size = shrinker_unit_size(shrinker_nr_max); for_each_node(nid) { struct shrinker_info *info = kvzalloc_node(sizeof(*info) + array_size, GFP_KERNEL, nid); if (!info) goto err; info->map_nr_max = shrinker_nr_max; if (shrinker_unit_alloc(info, NULL, nid)) { kvfree(info); goto err; } rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info); } mutex_unlock(&shrinker_mutex); return ret; err: mutex_unlock(&shrinker_mutex); free_shrinker_info(memcg); return -ENOMEM; } static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg, int nid) { return rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_info, lockdep_is_held(&shrinker_mutex)); } static int expand_one_shrinker_info(struct mem_cgroup *memcg, int new_size, int old_size, int new_nr_max) { struct shrinker_info *new, *old; struct mem_cgroup_per_node *pn; int nid; for_each_node(nid) { pn = memcg->nodeinfo[nid]; old = shrinker_info_protected(memcg, nid); /* Not yet online memcg */ if (!old) return 0; /* Already expanded this shrinker_info */ if (new_nr_max <= old->map_nr_max) continue; new = kvzalloc_node(sizeof(*new) + new_size, GFP_KERNEL, nid); if (!new) return -ENOMEM; new->map_nr_max = new_nr_max; memcpy(new->unit, old->unit, old_size); if (shrinker_unit_alloc(new, old, nid)) { kvfree(new); return -ENOMEM; } rcu_assign_pointer(pn->shrinker_info, new); kvfree_rcu(old, rcu); } return 0; } static int expand_shrinker_info(int new_id) { int ret = 0; int new_nr_max = round_up(new_id + 1, SHRINKER_UNIT_BITS); int new_size, old_size = 0; struct mem_cgroup *memcg; if (!root_mem_cgroup) goto out; lockdep_assert_held(&shrinker_mutex); new_size = shrinker_unit_size(new_nr_max); old_size = shrinker_unit_size(shrinker_nr_max); memcg = mem_cgroup_iter(NULL, NULL, NULL); do { ret = expand_one_shrinker_info(memcg, new_size, old_size, new_nr_max); if (ret) { mem_cgroup_iter_break(NULL, memcg); goto out; } } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL); out: if (!ret) shrinker_nr_max = 
new_nr_max; return ret; } static inline int shrinker_id_to_index(int shrinker_id) { return shrinker_id / SHRINKER_UNIT_BITS; } static inline int shrinker_id_to_offset(int shrinker_id) { return shrinker_id % SHRINKER_UNIT_BITS; } static inline int calc_shrinker_id(int index, int offset) { return index * SHRINKER_UNIT_BITS + offset; } void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id) { if (shrinker_id >= 0 && memcg && !mem_cgroup_is_root(memcg)) { struct shrinker_info *info; struct shrinker_info_unit *unit; rcu_read_lock(); info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info); unit = info->unit[shrinker_id_to_index(shrinker_id)]; if (!WARN_ON_ONCE(shrinker_id >= info->map_nr_max)) { /* Pairs with smp mb in shrink_slab() */ smp_mb__before_atomic(); set_bit(shrinker_id_to_offset(shrinker_id), unit->map); } rcu_read_unlock(); } } static DEFINE_IDR(shrinker_idr); static int shrinker_memcg_alloc(struct shrinker *shrinker) { int id, ret = -ENOMEM; if (mem_cgroup_disabled()) return -ENOSYS; mutex_lock(&shrinker_mutex); id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL); if (id < 0) goto unlock; if (id >= shrinker_nr_max) { if (expand_shrinker_info(id)) { idr_remove(&shrinker_idr, id); goto unlock; } } shrinker->id = id; ret = 0; unlock: mutex_unlock(&shrinker_mutex); return ret; } static void shrinker_memcg_remove(struct shrinker *shrinker) { int id = shrinker->id; BUG_ON(id < 0); lockdep_assert_held(&shrinker_mutex); idr_remove(&shrinker_idr, id); } static long xchg_nr_deferred_memcg(int nid, struct shrinker *shrinker, struct mem_cgroup *memcg) { struct shrinker_info *info; struct shrinker_info_unit *unit; long nr_deferred; rcu_read_lock(); info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info); unit = info->unit[shrinker_id_to_index(shrinker->id)]; nr_deferred = atomic_long_xchg(&unit->nr_deferred[shrinker_id_to_offset(shrinker->id)], 0); rcu_read_unlock(); return nr_deferred; } static long add_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker, struct mem_cgroup *memcg) { struct shrinker_info *info; struct shrinker_info_unit *unit; long nr_deferred; rcu_read_lock(); info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info); unit = info->unit[shrinker_id_to_index(shrinker->id)]; nr_deferred = atomic_long_add_return(nr, &unit->nr_deferred[shrinker_id_to_offset(shrinker->id)]); rcu_read_unlock(); return nr_deferred; } void reparent_shrinker_deferred(struct mem_cgroup *memcg) { int nid, index, offset; long nr; struct mem_cgroup *parent; struct shrinker_info *child_info, *parent_info; struct shrinker_info_unit *child_unit, *parent_unit; parent = parent_mem_cgroup(memcg); if (!parent) parent = root_mem_cgroup; /* Prevent from concurrent shrinker_info expand */ mutex_lock(&shrinker_mutex); for_each_node(nid) { child_info = shrinker_info_protected(memcg, nid); parent_info = shrinker_info_protected(parent, nid); for (index = 0; index < shrinker_id_to_index(child_info->map_nr_max); index++) { child_unit = child_info->unit[index]; parent_unit = parent_info->unit[index]; for (offset = 0; offset < SHRINKER_UNIT_BITS; offset++) { nr = atomic_long_read(&child_unit->nr_deferred[offset]); atomic_long_add(nr, &parent_unit->nr_deferred[offset]); } } } mutex_unlock(&shrinker_mutex); } #else static int shrinker_memcg_alloc(struct shrinker *shrinker) { return -ENOSYS; } static void shrinker_memcg_remove(struct shrinker *shrinker) { } static long xchg_nr_deferred_memcg(int nid, struct shrinker *shrinker, struct mem_cgroup *memcg) { return 0; } static long 
add_nr_deferred_memcg(long nr, int nid, struct shrinker *shrinker, struct mem_cgroup *memcg) { return 0; } #endif /* CONFIG_MEMCG */ static long xchg_nr_deferred(struct shrinker *shrinker, struct shrink_control *sc) { int nid = sc->nid; if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) nid = 0; if (sc->memcg && (shrinker->flags & SHRINKER_MEMCG_AWARE)) return xchg_nr_deferred_memcg(nid, shrinker, sc->memcg); return atomic_long_xchg(&shrinker->nr_deferred[nid], 0); } static long add_nr_deferred(long nr, struct shrinker *shrinker, struct shrink_control *sc) { int nid = sc->nid; if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) nid = 0; if (sc->memcg && (shrinker->flags & SHRINKER_MEMCG_AWARE)) return add_nr_deferred_memcg(nr, nid, shrinker, sc->memcg); return atomic_long_add_return(nr, &shrinker->nr_deferred[nid]); } #define SHRINK_BATCH 128 static unsigned long do_shrink_slab(struct shrink_control *shrinkctl, struct shrinker *shrinker, int priority) { unsigned long freed = 0; unsigned long long delta; long total_scan; long freeable; long nr; long new_nr; long batch_size = shrinker->batch ? shrinker->batch : SHRINK_BATCH; long scanned = 0, next_deferred; freeable = shrinker->count_objects(shrinker, shrinkctl); if (freeable == 0 || freeable == SHRINK_EMPTY) return freeable; /* * copy the current shrinker scan count into a local variable * and zero it so that other concurrent shrinker invocations * don't also do this scanning work. */ nr = xchg_nr_deferred(shrinker, shrinkctl); if (shrinker->seeks) { delta = freeable >> priority; delta *= 4; do_div(delta, shrinker->seeks); } else { /* * These objects don't require any IO to create. Trim * them aggressively under memory pressure to keep * them from causing refetches in the IO caches. */ delta = freeable / 2; } total_scan = nr >> priority; total_scan += delta; total_scan = min(total_scan, (2 * freeable)); trace_mm_shrink_slab_start(shrinker, shrinkctl, nr, freeable, delta, total_scan, priority); /* * Normally, we should not scan less than batch_size objects in one * pass to avoid too frequent shrinker calls, but if the slab has less * than batch_size objects in total and we are really tight on memory, * we will try to reclaim all available objects, otherwise we can end * up failing allocations although there are plenty of reclaimable * objects spread over several slabs with usage less than the * batch_size. * * We detect the "tight on memory" situations by looking at the total * number of objects we want to scan (total_scan). If it is greater * than the total number of objects on slab (freeable), we must be * scanning at high prio and therefore should try to reclaim as much as * possible. */ while (total_scan >= batch_size || total_scan >= freeable) { unsigned long ret; unsigned long nr_to_scan = min(batch_size, total_scan); shrinkctl->nr_to_scan = nr_to_scan; shrinkctl->nr_scanned = nr_to_scan; ret = shrinker->scan_objects(shrinker, shrinkctl); if (ret == SHRINK_STOP) break; freed += ret; count_vm_events(SLABS_SCANNED, shrinkctl->nr_scanned); total_scan -= shrinkctl->nr_scanned; scanned += shrinkctl->nr_scanned; cond_resched(); } /* * The deferred work is increased by any new work (delta) that wasn't * done, decreased by old deferred work that was done now. * * And it is capped to two times of the freeable items. */ next_deferred = max_t(long, (nr + delta - scanned), 0); next_deferred = min(next_deferred, (2 * freeable)); /* * move the unused scan count back into the shrinker in a * manner that handles concurrent updates. 
*/ new_nr = add_nr_deferred(next_deferred, shrinker, shrinkctl); trace_mm_shrink_slab_end(shrinker, shrinkctl->nid, freed, nr, new_nr, total_scan); return freed; } #ifdef CONFIG_MEMCG static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg, int priority) { struct shrinker_info *info; unsigned long ret, freed = 0; int offset, index = 0; if (!mem_cgroup_online(memcg)) return 0; /* * lockless algorithm of memcg shrink. * * The shrinker_info may be freed asynchronously via RCU in the * expand_one_shrinker_info(), so the rcu_read_lock() needs to be used * to ensure the existence of the shrinker_info. * * The shrinker_info_unit is never freed unless its corresponding memcg * is destroyed. Here we already hold the refcount of memcg, so the * memcg will not be destroyed, and of course shrinker_info_unit will * not be freed. * * So in the memcg shrink: * step 1: use rcu_read_lock() to guarantee existence of the * shrinker_info. * step 2: after getting shrinker_info_unit we can safely release the * RCU lock. * step 3: traverse the bitmap and calculate shrinker_id * step 4: use rcu_read_lock() to guarantee existence of the shrinker. * step 5: use shrinker_id to find the shrinker, then use * shrinker_try_get() to guarantee existence of the shrinker, * then we can release the RCU lock to do do_shrink_slab() that * may sleep. * step 6: do shrinker_put() paired with step 5 to put the refcount, * if the refcount reaches 0, then wake up the waiter in * shrinker_free() by calling complete(). * Note: here is different from the global shrink, we don't * need to acquire the RCU lock to guarantee existence of * the shrinker, because we don't need to use this * shrinker to traverse the next shrinker in the bitmap. * step 7: we have already exited the read-side of rcu critical section * before calling do_shrink_slab(), the shrinker_info may be * released in expand_one_shrinker_info(), so go back to step 1 * to reacquire the shrinker_info. */ again: rcu_read_lock(); info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info); if (unlikely(!info)) goto unlock; if (index < shrinker_id_to_index(info->map_nr_max)) { struct shrinker_info_unit *unit; unit = info->unit[index]; rcu_read_unlock(); for_each_set_bit(offset, unit->map, SHRINKER_UNIT_BITS) { struct shrink_control sc = { .gfp_mask = gfp_mask, .nid = nid, .memcg = memcg, }; struct shrinker *shrinker; int shrinker_id = calc_shrinker_id(index, offset); rcu_read_lock(); shrinker = idr_find(&shrinker_idr, shrinker_id); if (unlikely(!shrinker || !shrinker_try_get(shrinker))) { clear_bit(offset, unit->map); rcu_read_unlock(); continue; } rcu_read_unlock(); /* Call non-slab shrinkers even though kmem is disabled */ if (!memcg_kmem_online() && !(shrinker->flags & SHRINKER_NONSLAB)) continue; ret = do_shrink_slab(&sc, shrinker, priority); if (ret == SHRINK_EMPTY) { clear_bit(offset, unit->map); /* * After the shrinker reported that it had no objects to * free, but before we cleared the corresponding bit in * the memcg shrinker map, a new object might have been * added. To make sure, we have the bit set in this * case, we invoke the shrinker one more time and reset * the bit if it reports that it is not empty anymore. 
* The memory barrier here pairs with the barrier in * set_shrinker_bit(): * * list_lru_add() shrink_slab_memcg() * list_add_tail() clear_bit() * <MB> <MB> * set_bit() do_shrink_slab() */ smp_mb__after_atomic(); ret = do_shrink_slab(&sc, shrinker, priority); if (ret == SHRINK_EMPTY) ret = 0; else set_shrinker_bit(memcg, nid, shrinker_id); } freed += ret; shrinker_put(shrinker); } index++; goto again; } unlock: rcu_read_unlock(); return freed; } #else /* !CONFIG_MEMCG */ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg, int priority) { return 0; } #endif /* CONFIG_MEMCG */ /** * shrink_slab - shrink slab caches * @gfp_mask: allocation context * @nid: node whose slab caches to target * @memcg: memory cgroup whose slab caches to target * @priority: the reclaim priority * * Call the shrink functions to age shrinkable caches. * * @nid is passed along to shrinkers with SHRINKER_NUMA_AWARE set, * unaware shrinkers will receive a node id of 0 instead. * * @memcg specifies the memory cgroup to target. Unaware shrinkers * are called only if it is the root cgroup. * * @priority is sc->priority, we take the number of objects and >> by priority * in order to get the scan target. * * Returns the number of reclaimed slab objects. */ unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg, int priority) { unsigned long ret, freed = 0; struct shrinker *shrinker; /* * The root memcg might be allocated even though memcg is disabled * via "cgroup_disable=memory" boot parameter. This could make * mem_cgroup_is_root() return false, then just run memcg slab * shrink, but skip global shrink. This may result in premature * oom. */ if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg)) return shrink_slab_memcg(gfp_mask, nid, memcg, priority); /* * lockless algorithm of global shrink. * * In the unregistration setp, the shrinker will be freed asynchronously * via RCU after its refcount reaches 0. So both rcu_read_lock() and * shrinker_try_get() can be used to ensure the existence of the shrinker. * * So in the global shrink: * step 1: use rcu_read_lock() to guarantee existence of the shrinker * and the validity of the shrinker_list walk. * step 2: use shrinker_try_get() to try get the refcount, if successful, * then the existence of the shrinker can also be guaranteed, * so we can release the RCU lock to do do_shrink_slab() that * may sleep. * step 3: *MUST* to reacquire the RCU lock before calling shrinker_put(), * which ensures that neither this shrinker nor the next shrinker * will be freed in the next traversal operation. * step 4: do shrinker_put() paired with step 2 to put the refcount, * if the refcount reaches 0, then wake up the waiter in * shrinker_free() by calling complete(). */ rcu_read_lock(); list_for_each_entry_rcu(shrinker, &shrinker_list, list) { struct shrink_control sc = { .gfp_mask = gfp_mask, .nid = nid, .memcg = memcg, }; if (!shrinker_try_get(shrinker)) continue; rcu_read_unlock(); ret = do_shrink_slab(&sc, shrinker, priority); if (ret == SHRINK_EMPTY) ret = 0; freed += ret; rcu_read_lock(); shrinker_put(shrinker); } rcu_read_unlock(); cond_resched(); return freed; } struct shrinker *shrinker_alloc(unsigned int flags, const char *fmt, ...) 
{ struct shrinker *shrinker; unsigned int size; va_list ap; int err; shrinker = kzalloc(sizeof(struct shrinker), GFP_KERNEL); if (!shrinker) return NULL; va_start(ap, fmt); err = shrinker_debugfs_name_alloc(shrinker, fmt, ap); va_end(ap); if (err) goto err_name; shrinker->flags = flags | SHRINKER_ALLOCATED; shrinker->seeks = DEFAULT_SEEKS; if (flags & SHRINKER_MEMCG_AWARE) { err = shrinker_memcg_alloc(shrinker); if (err == -ENOSYS) { /* Memcg is not supported, fallback to non-memcg-aware shrinker. */ shrinker->flags &= ~SHRINKER_MEMCG_AWARE; goto non_memcg; } if (err) goto err_flags; return shrinker; } non_memcg: /* * The nr_deferred is available on per memcg level for memcg aware * shrinkers, so only allocate nr_deferred in the following cases: * - non-memcg-aware shrinkers * - !CONFIG_MEMCG * - memcg is disabled by kernel command line */ size = sizeof(*shrinker->nr_deferred); if (flags & SHRINKER_NUMA_AWARE) size *= nr_node_ids; shrinker->nr_deferred = kzalloc(size, GFP_KERNEL); if (!shrinker->nr_deferred) goto err_flags; return shrinker; err_flags: shrinker_debugfs_name_free(shrinker); err_name: kfree(shrinker); return NULL; } EXPORT_SYMBOL_GPL(shrinker_alloc); void shrinker_register(struct shrinker *shrinker) { if (unlikely(!(shrinker->flags & SHRINKER_ALLOCATED))) { pr_warn("Must use shrinker_alloc() to dynamically allocate the shrinker"); return; } mutex_lock(&shrinker_mutex); list_add_tail_rcu(&shrinker->list, &shrinker_list); shrinker->flags |= SHRINKER_REGISTERED; shrinker_debugfs_add(shrinker); mutex_unlock(&shrinker_mutex); init_completion(&shrinker->done); /* * Now the shrinker is fully set up, take the first reference to it to * indicate that lookup operations are now allowed to use it via * shrinker_try_get(). */ refcount_set(&shrinker->refcount, 1); } EXPORT_SYMBOL_GPL(shrinker_register); static void shrinker_free_rcu_cb(struct rcu_head *head) { struct shrinker *shrinker = container_of(head, struct shrinker, rcu); kfree(shrinker->nr_deferred); kfree(shrinker); } void shrinker_free(struct shrinker *shrinker) { struct dentry *debugfs_entry = NULL; int debugfs_id; if (!shrinker) return; if (shrinker->flags & SHRINKER_REGISTERED) { /* drop the initial refcount */ shrinker_put(shrinker); /* * Wait for all lookups of the shrinker to complete, after that, * no shrinker is running or will run again, then we can safely * free it asynchronously via RCU and safely free the structure * where the shrinker is located, such as super_block etc. */ wait_for_completion(&shrinker->done); } mutex_lock(&shrinker_mutex); if (shrinker->flags & SHRINKER_REGISTERED) { /* * Now we can safely remove it from the shrinker_list and then * free it. */ list_del_rcu(&shrinker->list); debugfs_entry = shrinker_debugfs_detach(shrinker, &debugfs_id); shrinker->flags &= ~SHRINKER_REGISTERED; } shrinker_debugfs_name_free(shrinker); if (shrinker->flags & SHRINKER_MEMCG_AWARE) shrinker_memcg_remove(shrinker); mutex_unlock(&shrinker_mutex); if (debugfs_entry) shrinker_debugfs_remove(debugfs_entry, debugfs_id); call_rcu(&shrinker->rcu, shrinker_free_rcu_cb); } EXPORT_SYMBOL_GPL(shrinker_free); |
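/*
 * Illustration (not part of mm/shrinker.c): the scan budgeting in
 * do_shrink_slab() above boils down to a little integer arithmetic: the new
 * work is (freeable >> priority) scaled by 4/seeks (or freeable/2 for
 * zero-seek caches), the deferred backlog contributes (nr >> priority), and
 * the sum is capped at twice the freeable count. A stand-alone model of just
 * that arithmetic, with made-up numbers:
 */
#include <stdio.h>

#define DEFAULT_SEEKS	2
#define SHRINK_BATCH	128

static void model(long freeable, long deferred, int priority, int seeks)
{
	unsigned long long delta;
	long total_scan;

	if (seeks)
		delta = (unsigned long long)(freeable >> priority) * 4 / seeks;
	else
		delta = freeable / 2;	/* cheap objects, trim aggressively */

	total_scan = (deferred >> priority) + delta;
	if (total_scan > 2 * freeable)	/* same cap as do_shrink_slab() */
		total_scan = 2 * freeable;

	printf("freeable=%ld deferred=%ld prio=%d seeks=%d -> scan %ld (batch %d)\n",
	       freeable, deferred, priority, seeks, total_scan, SHRINK_BATCH);
}

int main(void)
{
	model(10000, 0, 12, DEFAULT_SEEKS);	/* light memory pressure */
	model(10000, 50000, 2, DEFAULT_SEEKS);	/* heavy pressure plus backlog */
	model(10000, 0, 4, 0);			/* zero-seek cache */
	return 0;
}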
// SPDX-License-Identifier: GPL-2.0-or-later /* * * Copyright (C) Jonathan Naylor G4KLX (g4klx@g4klx.demon.co.uk) */ #include <linux/errno.h> #include <linux/types.h> #include <linux/socket.h> #include <linux/in.h> #include <linux/kernel.h> #include <linux/module.h> #include <linux/spinlock.h> #include <linux/timer.h> #include <linux/string.h> #include <linux/sockios.h> #include <linux/net.h> #include <linux/slab.h> #include <net/ax25.h> #include <linux/inet.h> #include <linux/netdevice.h> #include <linux/skbuff.h> #include <net/sock.h> #include <linux/uaccess.h> #include <linux/fcntl.h> #include <linux/mm.h> #include <linux/interrupt.h> static struct ax25_protocol *protocol_list; static DEFINE_RWLOCK(protocol_list_lock); static HLIST_HEAD(ax25_linkfail_list); static DEFINE_SPINLOCK(linkfail_lock); static struct listen_struct { struct listen_struct *next; ax25_address callsign; struct net_device *dev; } *listen_list = NULL; static DEFINE_SPINLOCK(listen_lock); /* * Do not register the internal protocols AX25_P_TEXT, AX25_P_SEGMENT, * AX25_P_IP or AX25_P_ARP ...
*/ void ax25_register_pid(struct ax25_protocol *ap) { write_lock_bh(&protocol_list_lock); ap->next = protocol_list; protocol_list = ap; write_unlock_bh(&protocol_list_lock); } EXPORT_SYMBOL_GPL(ax25_register_pid); void ax25_protocol_release(unsigned int pid) { struct ax25_protocol *protocol; write_lock_bh(&protocol_list_lock); protocol = protocol_list; if (protocol == NULL) goto out; if (protocol->pid == pid) { protocol_list = protocol->next; goto out; } while (protocol != NULL && protocol->next != NULL) { if (protocol->next->pid == pid) { protocol->next = protocol->next->next; goto out; } protocol = protocol->next; } out: write_unlock_bh(&protocol_list_lock); } EXPORT_SYMBOL(ax25_protocol_release); void ax25_linkfail_register(struct ax25_linkfail *lf) { spin_lock_bh(&linkfail_lock); hlist_add_head(&lf->lf_node, &ax25_linkfail_list); spin_unlock_bh(&linkfail_lock); } EXPORT_SYMBOL(ax25_linkfail_register); void ax25_linkfail_release(struct ax25_linkfail *lf) { spin_lock_bh(&linkfail_lock); hlist_del_init(&lf->lf_node); spin_unlock_bh(&linkfail_lock); } EXPORT_SYMBOL(ax25_linkfail_release); int ax25_listen_register(const ax25_address *callsign, struct net_device *dev) { struct listen_struct *listen; if (ax25_listen_mine(callsign, dev)) return 0; if ((listen = kmalloc(sizeof(*listen), GFP_ATOMIC)) == NULL) return -ENOMEM; listen->callsign = *callsign; listen->dev = dev; spin_lock_bh(&listen_lock); listen->next = listen_list; listen_list = listen; spin_unlock_bh(&listen_lock); return 0; } EXPORT_SYMBOL(ax25_listen_register); void ax25_listen_release(const ax25_address *callsign, struct net_device *dev) { struct listen_struct *s, *listen; spin_lock_bh(&listen_lock); listen = listen_list; if (listen == NULL) { spin_unlock_bh(&listen_lock); return; } if (ax25cmp(&listen->callsign, callsign) == 0 && listen->dev == dev) { listen_list = listen->next; spin_unlock_bh(&listen_lock); kfree(listen); return; } while (listen != NULL && listen->next != NULL) { if (ax25cmp(&listen->next->callsign, callsign) == 0 && listen->next->dev == dev) { s = listen->next; listen->next = listen->next->next; spin_unlock_bh(&listen_lock); kfree(s); return; } listen = listen->next; } spin_unlock_bh(&listen_lock); } EXPORT_SYMBOL(ax25_listen_release); int (*ax25_protocol_function(unsigned int pid))(struct sk_buff *, ax25_cb *) { int (*res)(struct sk_buff *, ax25_cb *) = NULL; struct ax25_protocol *protocol; read_lock(&protocol_list_lock); for (protocol = protocol_list; protocol != NULL; protocol = protocol->next) if (protocol->pid == pid) { res = protocol->func; break; } read_unlock(&protocol_list_lock); return res; } int ax25_listen_mine(const ax25_address *callsign, struct net_device *dev) { struct listen_struct *listen; spin_lock_bh(&listen_lock); for (listen = listen_list; listen != NULL; listen = listen->next) if (ax25cmp(&listen->callsign, callsign) == 0 && (listen->dev == dev || listen->dev == NULL)) { spin_unlock_bh(&listen_lock); return 1; } spin_unlock_bh(&listen_lock); return 0; } void ax25_link_failed(ax25_cb *ax25, int reason) { struct ax25_linkfail *lf; spin_lock_bh(&linkfail_lock); hlist_for_each_entry(lf, &ax25_linkfail_list, lf_node) lf->func(ax25, reason); spin_unlock_bh(&linkfail_lock); } int ax25_protocol_is_registered(unsigned int pid) { struct ax25_protocol *protocol; int res = 0; read_lock_bh(&protocol_list_lock); for (protocol = protocol_list; protocol != NULL; protocol = protocol->next) if (protocol->pid == pid) { res = 1; break; } read_unlock_bh(&protocol_list_lock); return res; } |
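/*
 * Example (not part of the file above): a minimal sketch of a layer-3 module
 * hooking into the registration API above, in the spirit of NET/ROM or ROSE.
 * The PID value, the handler body and its return convention are illustrative
 * assumptions, not taken from the code above.
 */
#include <linux/module.h>
#include <linux/skbuff.h>
#include <net/ax25.h>

#define FOO_AX25_PID	0xCE	/* hypothetical protocol id */

static int foo_rcv(struct sk_buff *skb, ax25_cb *ax25)
{
	/* Process the frame; here we simply consume it. */
	kfree_skb(skb);
	return 1;
}

static struct ax25_protocol foo_pid = {
	.pid	= FOO_AX25_PID,
	.func	= foo_rcv,
};

static int __init foo_init(void)
{
	/* After this, ax25_protocol_function(FOO_AX25_PID) resolves to foo_rcv. */
	ax25_register_pid(&foo_pid);
	return 0;
}

static void __exit foo_exit(void)
{
	ax25_protocol_release(FOO_AX25_PID);
}

module_init(foo_init);
module_exit(foo_exit);
MODULE_LICENSE("GPL");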
// SPDX-License-Identifier: GPL-2.0 /* * linux/fs/ext4/readpage.c * * Copyright (C) 2002, Linus Torvalds. * Copyright (C) 2015, Google, Inc. * * This was originally taken from fs/mpage.c * * The ext4_mpage_readpages() function here is intended to * replace mpage_readahead() in the general case, not just for * encrypted files. It has some limitations (see below), where it * will fall back to read_block_full_page(), but these limitations * should only be hit when page_size != block_size. * * This will allow us to attach a callback function to support ext4 * encryption. * * If anything unusual happens, such as: * * - encountering a page which has buffers * - encountering a page which has a non-hole after a hole * - encountering a page with non-contiguous blocks * * then this code just gives up and calls the buffer_head-based read function. * It does handle a page which has holes at the end - that is a common case: * the end-of-file on blocksize < PAGE_SIZE setups.
* */ #include <linux/kernel.h> #include <linux/export.h> #include <linux/mm.h> #include <linux/kdev_t.h> #include <linux/gfp.h> #include <linux/bio.h> #include <linux/fs.h> #include <linux/buffer_head.h> #include <linux/blkdev.h> #include <linux/highmem.h> #include <linux/prefetch.h> #include <linux/mpage.h> #include <linux/writeback.h> #include <linux/backing-dev.h> #include <linux/pagevec.h> #include "ext4.h" #define NUM_PREALLOC_POST_READ_CTXS 128 static struct kmem_cache *bio_post_read_ctx_cache; static mempool_t *bio_post_read_ctx_pool; /* postprocessing steps for read bios */ enum bio_post_read_step { STEP_INITIAL = 0, STEP_DECRYPT, STEP_VERITY, STEP_MAX, }; struct bio_post_read_ctx { struct bio *bio; struct work_struct work; unsigned int cur_step; unsigned int enabled_steps; }; static void __read_end_io(struct bio *bio) { struct folio_iter fi; bio_for_each_folio_all(fi, bio) folio_end_read(fi.folio, bio->bi_status == 0); if (bio->bi_private) mempool_free(bio->bi_private, bio_post_read_ctx_pool); bio_put(bio); } static void bio_post_read_processing(struct bio_post_read_ctx *ctx); static void decrypt_work(struct work_struct *work) { struct bio_post_read_ctx *ctx = container_of(work, struct bio_post_read_ctx, work); struct bio *bio = ctx->bio; if (fscrypt_decrypt_bio(bio)) bio_post_read_processing(ctx); else __read_end_io(bio); } static void verity_work(struct work_struct *work) { struct bio_post_read_ctx *ctx = container_of(work, struct bio_post_read_ctx, work); struct bio *bio = ctx->bio; /* * fsverity_verify_bio() may call readahead() again, and although verity * will be disabled for that, decryption may still be needed, causing * another bio_post_read_ctx to be allocated. So to guarantee that * mempool_alloc() never deadlocks we must free the current ctx first. * This is safe because verity is the last post-read step. */ BUILD_BUG_ON(STEP_VERITY + 1 != STEP_MAX); mempool_free(ctx, bio_post_read_ctx_pool); bio->bi_private = NULL; fsverity_verify_bio(bio); __read_end_io(bio); } static void bio_post_read_processing(struct bio_post_read_ctx *ctx) { /* * We use different work queues for decryption and for verity because * verity may require reading metadata pages that need decryption, and * we shouldn't recurse to the same workqueue. */ switch (++ctx->cur_step) { case STEP_DECRYPT: if (ctx->enabled_steps & (1 << STEP_DECRYPT)) { INIT_WORK(&ctx->work, decrypt_work); fscrypt_enqueue_decrypt_work(&ctx->work); return; } ctx->cur_step++; fallthrough; case STEP_VERITY: if (ctx->enabled_steps & (1 << STEP_VERITY)) { INIT_WORK(&ctx->work, verity_work); fsverity_enqueue_verify_work(&ctx->work); return; } ctx->cur_step++; fallthrough; default: __read_end_io(ctx->bio); } } static bool bio_post_read_required(struct bio *bio) { return bio->bi_private && !bio->bi_status; } /* * I/O completion handler for multipage BIOs. * * The mpage code never puts partial pages into a BIO (except for end-of-file). * If a page does not map to a contiguous run of blocks then it simply falls * back to block_read_full_folio(). * * Why is this? If a page's completion depends on a number of different BIOs * which can complete in any order (or at the same time) then determining the * status of that page is hard. See end_buffer_async_read() for the details. * There is no point in duplicating all that complexity. 
*/ static void mpage_end_io(struct bio *bio) { if (bio_post_read_required(bio)) { struct bio_post_read_ctx *ctx = bio->bi_private; ctx->cur_step = STEP_INITIAL; bio_post_read_processing(ctx); return; } __read_end_io(bio); } static inline bool ext4_need_verity(const struct inode *inode, pgoff_t idx) { return fsverity_active(inode) && idx < DIV_ROUND_UP(inode->i_size, PAGE_SIZE); } static void ext4_set_bio_post_read_ctx(struct bio *bio, const struct inode *inode, pgoff_t first_idx) { unsigned int post_read_steps = 0; if (fscrypt_inode_uses_fs_layer_crypto(inode)) post_read_steps |= 1 << STEP_DECRYPT; if (ext4_need_verity(inode, first_idx)) post_read_steps |= 1 << STEP_VERITY; if (post_read_steps) { /* Due to the mempool, this never fails. */ struct bio_post_read_ctx *ctx = mempool_alloc(bio_post_read_ctx_pool, GFP_NOFS); ctx->bio = bio; ctx->enabled_steps = post_read_steps; bio->bi_private = ctx; } } static inline loff_t ext4_readpage_limit(struct inode *inode) { if (IS_ENABLED(CONFIG_FS_VERITY) && IS_VERITY(inode)) return inode->i_sb->s_maxbytes; return i_size_read(inode); } int ext4_mpage_readpages(struct inode *inode, struct readahead_control *rac, struct folio *folio) { struct bio *bio = NULL; sector_t last_block_in_bio = 0; const unsigned blkbits = inode->i_blkbits; const unsigned blocks_per_page = PAGE_SIZE >> blkbits; const unsigned blocksize = 1 << blkbits; sector_t next_block; sector_t block_in_file; sector_t last_block; sector_t last_block_in_file; sector_t first_block; unsigned page_block; struct block_device *bdev = inode->i_sb->s_bdev; int length; unsigned relative_block = 0; struct ext4_map_blocks map; unsigned int nr_pages = rac ? readahead_count(rac) : 1; map.m_pblk = 0; map.m_lblk = 0; map.m_len = 0; map.m_flags = 0; for (; nr_pages; nr_pages--) { int fully_mapped = 1; unsigned first_hole = blocks_per_page; if (rac) folio = readahead_folio(rac); prefetchw(&folio->flags); if (folio_buffers(folio)) goto confused; block_in_file = next_block = (sector_t)folio->index << (PAGE_SHIFT - blkbits); last_block = block_in_file + nr_pages * blocks_per_page; last_block_in_file = (ext4_readpage_limit(inode) + blocksize - 1) >> blkbits; if (last_block > last_block_in_file) last_block = last_block_in_file; page_block = 0; /* * Map blocks using the previous result first. */ if ((map.m_flags & EXT4_MAP_MAPPED) && block_in_file > map.m_lblk && block_in_file < (map.m_lblk + map.m_len)) { unsigned map_offset = block_in_file - map.m_lblk; unsigned last = map.m_len - map_offset; first_block = map.m_pblk + map_offset; for (relative_block = 0; ; relative_block++) { if (relative_block == last) { /* needed? */ map.m_flags &= ~EXT4_MAP_MAPPED; break; } if (page_block == blocks_per_page) break; page_block++; block_in_file++; } } /* * Then do more ext4_map_blocks() calls until we are * done with this folio. */ while (page_block < blocks_per_page) { if (block_in_file < last_block) { map.m_lblk = block_in_file; map.m_len = last_block - block_in_file; if (ext4_map_blocks(NULL, inode, &map, 0) < 0) { set_error_page: folio_zero_segment(folio, 0, folio_size(folio)); folio_unlock(folio); goto next_page; } } if ((map.m_flags & EXT4_MAP_MAPPED) == 0) { fully_mapped = 0; if (first_hole == blocks_per_page) first_hole = page_block; page_block++; block_in_file++; continue; } if (first_hole != blocks_per_page) goto confused; /* hole -> non-hole */ /* Contiguous blocks? 
*/ if (!page_block) first_block = map.m_pblk; else if (first_block + page_block != map.m_pblk) goto confused; for (relative_block = 0; ; relative_block++) { if (relative_block == map.m_len) { /* needed? */ map.m_flags &= ~EXT4_MAP_MAPPED; break; } else if (page_block == blocks_per_page) break; page_block++; block_in_file++; } } if (first_hole != blocks_per_page) { folio_zero_segment(folio, first_hole << blkbits, folio_size(folio)); if (first_hole == 0) { if (ext4_need_verity(inode, folio->index) && !fsverity_verify_folio(folio)) goto set_error_page; folio_end_read(folio, true); continue; } } else if (fully_mapped) { folio_set_mappedtodisk(folio); } /* * This folio will go to BIO. Do we need to send this * BIO off first? */ if (bio && (last_block_in_bio != first_block - 1 || !fscrypt_mergeable_bio(bio, inode, next_block))) { submit_and_realloc: submit_bio(bio); bio = NULL; } if (bio == NULL) { /* * bio_alloc will _always_ be able to allocate a bio if * __GFP_DIRECT_RECLAIM is set, see bio_alloc_bioset(). */ bio = bio_alloc(bdev, bio_max_segs(nr_pages), REQ_OP_READ, GFP_KERNEL); fscrypt_set_bio_crypt_ctx(bio, inode, next_block, GFP_KERNEL); ext4_set_bio_post_read_ctx(bio, inode, folio->index); bio->bi_iter.bi_sector = first_block << (blkbits - 9); bio->bi_end_io = mpage_end_io; if (rac) bio->bi_opf |= REQ_RAHEAD; } length = first_hole << blkbits; if (!bio_add_folio(bio, folio, length, 0)) goto submit_and_realloc; if (((map.m_flags & EXT4_MAP_BOUNDARY) && (relative_block == map.m_len)) || (first_hole != blocks_per_page)) { submit_bio(bio); bio = NULL; } else last_block_in_bio = first_block + blocks_per_page - 1; continue; confused: if (bio) { submit_bio(bio); bio = NULL; } if (!folio_test_uptodate(folio)) block_read_full_folio(folio, ext4_get_block); else folio_unlock(folio); next_page: ; /* A label shall be followed by a statement until C23 */ } if (bio) submit_bio(bio); return 0; } int __init ext4_init_post_read_processing(void) { bio_post_read_ctx_cache = KMEM_CACHE(bio_post_read_ctx, SLAB_RECLAIM_ACCOUNT); if (!bio_post_read_ctx_cache) goto fail; bio_post_read_ctx_pool = mempool_create_slab_pool(NUM_PREALLOC_POST_READ_CTXS, bio_post_read_ctx_cache); if (!bio_post_read_ctx_pool) goto fail_free_cache; return 0; fail_free_cache: kmem_cache_destroy(bio_post_read_ctx_cache); fail: return -ENOMEM; } void ext4_exit_post_read_processing(void) { mempool_destroy(bio_post_read_ctx_pool); kmem_cache_destroy(bio_post_read_ctx_cache); } |
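/*
 * Example (not part of the file above): a rough sketch of the caller side.
 * In ext4 itself the ->read_folio and ->readahead address_space operations
 * (fs/ext4/inode.c) funnel into ext4_mpage_readpages(); the inline-data and
 * error special cases are omitted here, so treat the names and wiring below
 * as illustrative only.
 */
#include <linux/fs.h>
#include <linux/pagemap.h>

static int example_read_folio(struct file *file, struct folio *folio)
{
	/* Single-folio read: no readahead_control, pass the folio directly. */
	return ext4_mpage_readpages(folio->mapping->host, NULL, folio);
}

static void example_readahead(struct readahead_control *rac)
{
	/* Readahead: ext4_mpage_readpages() pulls folios via readahead_folio(). */
	ext4_mpage_readpages(rac->mapping->host, rac, NULL);
}

static const struct address_space_operations example_aops = {
	.read_folio	= example_read_folio,
	.readahead	= example_readahead,
	/* write and invalidation hooks elided */
};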
/* SPDX-License-Identifier: GPL-2.0 */ #undef TRACE_SYSTEM #define TRACE_SYSTEM vmscan #if !defined(_TRACE_VMSCAN_H) || defined(TRACE_HEADER_MULTI_READ) #define _TRACE_VMSCAN_H #include <linux/types.h> #include <linux/tracepoint.h> #include <linux/mm.h> #include <linux/memcontrol.h> #include <trace/events/mmflags.h> #define RECLAIM_WB_ANON 0x0001u #define RECLAIM_WB_FILE 0x0002u #define RECLAIM_WB_MIXED 0x0010u #define RECLAIM_WB_SYNC 0x0004u /* Unused, all reclaim async */ #define RECLAIM_WB_ASYNC 0x0008u #define RECLAIM_WB_LRU (RECLAIM_WB_ANON|RECLAIM_WB_FILE) #define show_reclaim_flags(flags) \ (flags) ? __print_flags(flags, "|", \ {RECLAIM_WB_ANON, "RECLAIM_WB_ANON"}, \ {RECLAIM_WB_FILE, "RECLAIM_WB_FILE"}, \ {RECLAIM_WB_MIXED, "RECLAIM_WB_MIXED"}, \ {RECLAIM_WB_SYNC, "RECLAIM_WB_SYNC"}, \ {RECLAIM_WB_ASYNC, "RECLAIM_WB_ASYNC"} \ ) : "RECLAIM_WB_NONE" #define _VMSCAN_THROTTLE_WRITEBACK (1 << VMSCAN_THROTTLE_WRITEBACK) #define _VMSCAN_THROTTLE_ISOLATED (1 << VMSCAN_THROTTLE_ISOLATED) #define _VMSCAN_THROTTLE_NOPROGRESS (1 << VMSCAN_THROTTLE_NOPROGRESS) #define _VMSCAN_THROTTLE_CONGESTED (1 << VMSCAN_THROTTLE_CONGESTED) #define show_throttle_flags(flags) \ (flags) ?
__print_flags(flags, "|", \ {_VMSCAN_THROTTLE_WRITEBACK, "VMSCAN_THROTTLE_WRITEBACK"}, \ {_VMSCAN_THROTTLE_ISOLATED, "VMSCAN_THROTTLE_ISOLATED"}, \ {_VMSCAN_THROTTLE_NOPROGRESS, "VMSCAN_THROTTLE_NOPROGRESS"}, \ {_VMSCAN_THROTTLE_CONGESTED, "VMSCAN_THROTTLE_CONGESTED"} \ ) : "VMSCAN_THROTTLE_NONE" #define trace_reclaim_flags(file) ( \ (file ? RECLAIM_WB_FILE : RECLAIM_WB_ANON) | \ (RECLAIM_WB_ASYNC) \ ) TRACE_EVENT(mm_vmscan_kswapd_sleep, TP_PROTO(int nid), TP_ARGS(nid), TP_STRUCT__entry( __field( int, nid ) ), TP_fast_assign( __entry->nid = nid; ), TP_printk("nid=%d", __entry->nid) ); TRACE_EVENT(mm_vmscan_kswapd_wake, TP_PROTO(int nid, int zid, int order), TP_ARGS(nid, zid, order), TP_STRUCT__entry( __field( int, nid ) __field( int, zid ) __field( int, order ) ), TP_fast_assign( __entry->nid = nid; __entry->zid = zid; __entry->order = order; ), TP_printk("nid=%d order=%d", __entry->nid, __entry->order) ); TRACE_EVENT(mm_vmscan_wakeup_kswapd, TP_PROTO(int nid, int zid, int order, gfp_t gfp_flags), TP_ARGS(nid, zid, order, gfp_flags), TP_STRUCT__entry( __field( int, nid ) __field( int, zid ) __field( int, order ) __field( unsigned long, gfp_flags ) ), TP_fast_assign( __entry->nid = nid; __entry->zid = zid; __entry->order = order; __entry->gfp_flags = (__force unsigned long)gfp_flags; ), TP_printk("nid=%d order=%d gfp_flags=%s", __entry->nid, __entry->order, show_gfp_flags(__entry->gfp_flags)) ); DECLARE_EVENT_CLASS(mm_vmscan_direct_reclaim_begin_template, TP_PROTO(int order, gfp_t gfp_flags), TP_ARGS(order, gfp_flags), TP_STRUCT__entry( __field( int, order ) __field( unsigned long, gfp_flags ) ), TP_fast_assign( __entry->order = order; __entry->gfp_flags = (__force unsigned long)gfp_flags; ), TP_printk("order=%d gfp_flags=%s", __entry->order, show_gfp_flags(__entry->gfp_flags)) ); DEFINE_EVENT(mm_vmscan_direct_reclaim_begin_template, mm_vmscan_direct_reclaim_begin, TP_PROTO(int order, gfp_t gfp_flags), TP_ARGS(order, gfp_flags) ); #ifdef CONFIG_MEMCG DEFINE_EVENT(mm_vmscan_direct_reclaim_begin_template, mm_vmscan_memcg_reclaim_begin, TP_PROTO(int order, gfp_t gfp_flags), TP_ARGS(order, gfp_flags) ); DEFINE_EVENT(mm_vmscan_direct_reclaim_begin_template, mm_vmscan_memcg_softlimit_reclaim_begin, TP_PROTO(int order, gfp_t gfp_flags), TP_ARGS(order, gfp_flags) ); #endif /* CONFIG_MEMCG */ DECLARE_EVENT_CLASS(mm_vmscan_direct_reclaim_end_template, TP_PROTO(unsigned long nr_reclaimed), TP_ARGS(nr_reclaimed), TP_STRUCT__entry( __field( unsigned long, nr_reclaimed ) ), TP_fast_assign( __entry->nr_reclaimed = nr_reclaimed; ), TP_printk("nr_reclaimed=%lu", __entry->nr_reclaimed) ); DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_direct_reclaim_end, TP_PROTO(unsigned long nr_reclaimed), TP_ARGS(nr_reclaimed) ); #ifdef CONFIG_MEMCG DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_memcg_reclaim_end, TP_PROTO(unsigned long nr_reclaimed), TP_ARGS(nr_reclaimed) ); DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_memcg_softlimit_reclaim_end, TP_PROTO(unsigned long nr_reclaimed), TP_ARGS(nr_reclaimed) ); #endif /* CONFIG_MEMCG */ TRACE_EVENT(mm_shrink_slab_start, TP_PROTO(struct shrinker *shr, struct shrink_control *sc, long nr_objects_to_shrink, unsigned long cache_items, unsigned long long delta, unsigned long total_scan, int priority), TP_ARGS(shr, sc, nr_objects_to_shrink, cache_items, delta, total_scan, priority), TP_STRUCT__entry( __field(struct shrinker *, shr) __field(void *, shrink) __field(int, nid) __field(long, nr_objects_to_shrink) __field(unsigned long, 
gfp_flags) __field(unsigned long, cache_items) __field(unsigned long long, delta) __field(unsigned long, total_scan) __field(int, priority) ), TP_fast_assign( __entry->shr = shr; __entry->shrink = shr->scan_objects; __entry->nid = sc->nid; __entry->nr_objects_to_shrink = nr_objects_to_shrink; __entry->gfp_flags = (__force unsigned long)sc->gfp_mask; __entry->cache_items = cache_items; __entry->delta = delta; __entry->total_scan = total_scan; __entry->priority = priority; ), TP_printk("%pS %p: nid: %d objects to shrink %ld gfp_flags %s cache items %ld delta %lld total_scan %ld priority %d", __entry->shrink, __entry->shr, __entry->nid, __entry->nr_objects_to_shrink, show_gfp_flags(__entry->gfp_flags), __entry->cache_items, __entry->delta, __entry->total_scan, __entry->priority) ); TRACE_EVENT(mm_shrink_slab_end, TP_PROTO(struct shrinker *shr, int nid, int shrinker_retval, long unused_scan_cnt, long new_scan_cnt, long total_scan), TP_ARGS(shr, nid, shrinker_retval, unused_scan_cnt, new_scan_cnt, total_scan), TP_STRUCT__entry( __field(struct shrinker *, shr) __field(int, nid) __field(void *, shrink) __field(long, unused_scan) __field(long, new_scan) __field(int, retval) __field(long, total_scan) ), TP_fast_assign( __entry->shr = shr; __entry->nid = nid; __entry->shrink = shr->scan_objects; __entry->unused_scan = unused_scan_cnt; __entry->new_scan = new_scan_cnt; __entry->retval = shrinker_retval; __entry->total_scan = total_scan; ), TP_printk("%pS %p: nid: %d unused scan count %ld new scan count %ld total_scan %ld last shrinker return val %d", __entry->shrink, __entry->shr, __entry->nid, __entry->unused_scan, __entry->new_scan, __entry->total_scan, __entry->retval) ); TRACE_EVENT(mm_vmscan_lru_isolate, TP_PROTO(int highest_zoneidx, int order, unsigned long nr_requested, unsigned long nr_scanned, unsigned long nr_skipped, unsigned long nr_taken, int lru), TP_ARGS(highest_zoneidx, order, nr_requested, nr_scanned, nr_skipped, nr_taken, lru), TP_STRUCT__entry( __field(int, highest_zoneidx) __field(int, order) __field(unsigned long, nr_requested) __field(unsigned long, nr_scanned) __field(unsigned long, nr_skipped) __field(unsigned long, nr_taken) __field(int, lru) ), TP_fast_assign( __entry->highest_zoneidx = highest_zoneidx; __entry->order = order; __entry->nr_requested = nr_requested; __entry->nr_scanned = nr_scanned; __entry->nr_skipped = nr_skipped; __entry->nr_taken = nr_taken; __entry->lru = lru; ), /* * classzone is previous name of the highest_zoneidx. * Reason not to change it is the ABI requirement of the tracepoint. 
*/ TP_printk("classzone=%d order=%d nr_requested=%lu nr_scanned=%lu nr_skipped=%lu nr_taken=%lu lru=%s", __entry->highest_zoneidx, __entry->order, __entry->nr_requested, __entry->nr_scanned, __entry->nr_skipped, __entry->nr_taken, __print_symbolic(__entry->lru, LRU_NAMES)) ); TRACE_EVENT(mm_vmscan_write_folio, TP_PROTO(struct folio *folio), TP_ARGS(folio), TP_STRUCT__entry( __field(unsigned long, pfn) __field(int, reclaim_flags) ), TP_fast_assign( __entry->pfn = folio_pfn(folio); __entry->reclaim_flags = trace_reclaim_flags( folio_is_file_lru(folio)); ), TP_printk("page=%p pfn=0x%lx flags=%s", pfn_to_page(__entry->pfn), __entry->pfn, show_reclaim_flags(__entry->reclaim_flags)) ); TRACE_EVENT(mm_vmscan_reclaim_pages, TP_PROTO(int nid, unsigned long nr_scanned, unsigned long nr_reclaimed, struct reclaim_stat *stat), TP_ARGS(nid, nr_scanned, nr_reclaimed, stat), TP_STRUCT__entry( __field(int, nid) __field(unsigned long, nr_scanned) __field(unsigned long, nr_reclaimed) __field(unsigned long, nr_dirty) __field(unsigned long, nr_writeback) __field(unsigned long, nr_congested) __field(unsigned long, nr_immediate) __field(unsigned int, nr_activate0) __field(unsigned int, nr_activate1) __field(unsigned long, nr_ref_keep) __field(unsigned long, nr_unmap_fail) ), TP_fast_assign( __entry->nid = nid; __entry->nr_scanned = nr_scanned; __entry->nr_reclaimed = nr_reclaimed; __entry->nr_dirty = stat->nr_dirty; __entry->nr_writeback = stat->nr_writeback; __entry->nr_congested = stat->nr_congested; __entry->nr_immediate = stat->nr_immediate; __entry->nr_activate0 = stat->nr_activate[0]; __entry->nr_activate1 = stat->nr_activate[1]; __entry->nr_ref_keep = stat->nr_ref_keep; __entry->nr_unmap_fail = stat->nr_unmap_fail; ), TP_printk("nid=%d nr_scanned=%ld nr_reclaimed=%ld nr_dirty=%ld nr_writeback=%ld nr_congested=%ld nr_immediate=%ld nr_activate_anon=%d nr_activate_file=%d nr_ref_keep=%ld nr_unmap_fail=%ld", __entry->nid, __entry->nr_scanned, __entry->nr_reclaimed, __entry->nr_dirty, __entry->nr_writeback, __entry->nr_congested, __entry->nr_immediate, __entry->nr_activate0, __entry->nr_activate1, __entry->nr_ref_keep, __entry->nr_unmap_fail) ); TRACE_EVENT(mm_vmscan_lru_shrink_inactive, TP_PROTO(int nid, unsigned long nr_scanned, unsigned long nr_reclaimed, struct reclaim_stat *stat, int priority, int file), TP_ARGS(nid, nr_scanned, nr_reclaimed, stat, priority, file), TP_STRUCT__entry( __field(int, nid) __field(unsigned long, nr_scanned) __field(unsigned long, nr_reclaimed) __field(unsigned long, nr_dirty) __field(unsigned long, nr_writeback) __field(unsigned long, nr_congested) __field(unsigned long, nr_immediate) __field(unsigned int, nr_activate0) __field(unsigned int, nr_activate1) __field(unsigned long, nr_ref_keep) __field(unsigned long, nr_unmap_fail) __field(int, priority) __field(int, reclaim_flags) ), TP_fast_assign( __entry->nid = nid; __entry->nr_scanned = nr_scanned; __entry->nr_reclaimed = nr_reclaimed; __entry->nr_dirty = stat->nr_dirty; __entry->nr_writeback = stat->nr_writeback; __entry->nr_congested = stat->nr_congested; __entry->nr_immediate = stat->nr_immediate; __entry->nr_activate0 = stat->nr_activate[0]; __entry->nr_activate1 = stat->nr_activate[1]; __entry->nr_ref_keep = stat->nr_ref_keep; __entry->nr_unmap_fail = stat->nr_unmap_fail; __entry->priority = priority; __entry->reclaim_flags = trace_reclaim_flags(file); ), TP_printk("nid=%d nr_scanned=%ld nr_reclaimed=%ld nr_dirty=%ld nr_writeback=%ld nr_congested=%ld nr_immediate=%ld nr_activate_anon=%d nr_activate_file=%d nr_ref_keep=%ld 
nr_unmap_fail=%ld priority=%d flags=%s", __entry->nid, __entry->nr_scanned, __entry->nr_reclaimed, __entry->nr_dirty, __entry->nr_writeback, __entry->nr_congested, __entry->nr_immediate, __entry->nr_activate0, __entry->nr_activate1, __entry->nr_ref_keep, __entry->nr_unmap_fail, __entry->priority, show_reclaim_flags(__entry->reclaim_flags)) ); TRACE_EVENT(mm_vmscan_lru_shrink_active, TP_PROTO(int nid, unsigned long nr_taken, unsigned long nr_active, unsigned long nr_deactivated, unsigned long nr_referenced, int priority, int file), TP_ARGS(nid, nr_taken, nr_active, nr_deactivated, nr_referenced, priority, file), TP_STRUCT__entry( __field(int, nid) __field(unsigned long, nr_taken) __field(unsigned long, nr_active) __field(unsigned long, nr_deactivated) __field(unsigned long, nr_referenced) __field(int, priority) __field(int, reclaim_flags) ), TP_fast_assign( __entry->nid = nid; __entry->nr_taken = nr_taken; __entry->nr_active = nr_active; __entry->nr_deactivated = nr_deactivated; __entry->nr_referenced = nr_referenced; __entry->priority = priority; __entry->reclaim_flags = trace_reclaim_flags(file); ), TP_printk("nid=%d nr_taken=%ld nr_active=%ld nr_deactivated=%ld nr_referenced=%ld priority=%d flags=%s", __entry->nid, __entry->nr_taken, __entry->nr_active, __entry->nr_deactivated, __entry->nr_referenced, __entry->priority, show_reclaim_flags(__entry->reclaim_flags)) ); TRACE_EVENT(mm_vmscan_node_reclaim_begin, TP_PROTO(int nid, int order, gfp_t gfp_flags), TP_ARGS(nid, order, gfp_flags), TP_STRUCT__entry( __field(int, nid) __field(int, order) __field(unsigned long, gfp_flags) ), TP_fast_assign( __entry->nid = nid; __entry->order = order; __entry->gfp_flags = (__force unsigned long)gfp_flags; ), TP_printk("nid=%d order=%d gfp_flags=%s", __entry->nid, __entry->order, show_gfp_flags(__entry->gfp_flags)) ); DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_node_reclaim_end, TP_PROTO(unsigned long nr_reclaimed), TP_ARGS(nr_reclaimed) ); TRACE_EVENT(mm_vmscan_throttled, TP_PROTO(int nid, int usec_timeout, int usec_delayed, int reason), TP_ARGS(nid, usec_timeout, usec_delayed, reason), TP_STRUCT__entry( __field(int, nid) __field(int, usec_timeout) __field(int, usec_delayed) __field(int, reason) ), TP_fast_assign( __entry->nid = nid; __entry->usec_timeout = usec_timeout; __entry->usec_delayed = usec_delayed; __entry->reason = 1U << reason; ), TP_printk("nid=%d usec_timeout=%d usect_delayed=%d reason=%s", __entry->nid, __entry->usec_timeout, __entry->usec_delayed, show_throttle_flags(__entry->reason)) ); #endif /* _TRACE_VMSCAN_H */ /* This part must be outside protection */ #include <trace/define_trace.h> |
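/*
 * Example (not part of the header above): each TRACE_EVENT()/DEFINE_EVENT()
 * above expands into a trace_<name>() inline that the reclaim code calls at
 * the matching point.  Exactly one .c file defines CREATE_TRACE_POINTS before
 * including the header to emit the tracepoint definitions; the call sites
 * below are schematic, not copies of mm/vmscan.c.
 */
#define CREATE_TRACE_POINTS
#include <trace/events/vmscan.h>

static void example_wake_kswapd(int nid, int zid, int order, gfp_t gfp_mask)
{
	/* Compiles down to a static-branch no-op unless the event is enabled. */
	trace_mm_vmscan_wakeup_kswapd(nid, zid, order, gfp_mask);
}

static unsigned long example_direct_reclaim(int order, gfp_t gfp_mask)
{
	unsigned long nr_reclaimed = 0;

	trace_mm_vmscan_direct_reclaim_begin(order, gfp_mask);
	/* ... actual page reclaim work would happen here ... */
	trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
	return nr_reclaimed;
}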
// SPDX-License-Identifier: GPL-2.0 #include "bcachefs.h" #include "alloc_foreground.h" #include "btree_gc.h" #include "btree_io.h" #include "btree_iter.h" #include "btree_journal_iter.h" #include "btree_key_cache.h" #include "btree_update_interior.h" #include "btree_write_buffer.h" #include "buckets.h" #include "disk_accounting.h" #include "errcode.h" #include "error.h" #include "journal.h" #include "journal_io.h" #include "journal_reclaim.h" #include "replicas.h" #include "snapshot.h" #include <linux/prefetch.h> static const char * const trans_commit_flags_strs[] = { #define x(n, ...)
#n, BCH_TRANS_COMMIT_FLAGS() #undef x NULL }; void bch2_trans_commit_flags_to_text(struct printbuf *out, enum bch_trans_commit_flags flags) { enum bch_watermark watermark = flags & BCH_WATERMARK_MASK; prt_printf(out, "watermark=%s", bch2_watermarks[watermark]); flags >>= BCH_WATERMARK_BITS; if (flags) { prt_char(out, ' '); bch2_prt_bitflags(out, trans_commit_flags_strs, flags); } } static void verify_update_old_key(struct btree_trans *trans, struct btree_insert_entry *i) { #ifdef CONFIG_BCACHEFS_DEBUG struct bch_fs *c = trans->c; struct bkey u; struct bkey_s_c k = bch2_btree_path_peek_slot_exact(trans->paths + i->path, &u); if (unlikely(trans->journal_replay_not_finished)) { struct bkey_i *j_k = bch2_journal_keys_peek_slot(c, i->btree_id, i->level, i->k->k.p); if (j_k) k = bkey_i_to_s_c(j_k); } u = *k.k; u.needs_whiteout = i->old_k.needs_whiteout; BUG_ON(memcmp(&i->old_k, &u, sizeof(struct bkey))); BUG_ON(i->old_v != k.v); #endif } static inline struct btree_path_level *insert_l(struct btree_trans *trans, struct btree_insert_entry *i) { return (trans->paths + i->path)->l + i->level; } static inline bool same_leaf_as_prev(struct btree_trans *trans, struct btree_insert_entry *i) { return i != trans->updates && insert_l(trans, &i[0])->b == insert_l(trans, &i[-1])->b; } static inline bool same_leaf_as_next(struct btree_trans *trans, struct btree_insert_entry *i) { return i + 1 < trans->updates + trans->nr_updates && insert_l(trans, &i[0])->b == insert_l(trans, &i[1])->b; } inline void bch2_btree_node_prep_for_write(struct btree_trans *trans, struct btree_path *path, struct btree *b) { struct bch_fs *c = trans->c; if (unlikely(btree_node_just_written(b)) && bch2_btree_post_write_cleanup(c, b)) bch2_trans_node_reinit_iter(trans, b); /* * If the last bset has been written, or if it's gotten too big - start * a new bset to insert into: */ if (want_new_bset(c, b)) bch2_btree_init_next(trans, b); } static noinline int trans_lock_write_fail(struct btree_trans *trans, struct btree_insert_entry *i) { while (--i >= trans->updates) { if (same_leaf_as_prev(trans, i)) continue; bch2_btree_node_unlock_write(trans, trans->paths + i->path, insert_l(trans, i)->b); } trace_and_count(trans->c, trans_restart_would_deadlock_write, trans); return btree_trans_restart(trans, BCH_ERR_transaction_restart_would_deadlock_write); } static inline int bch2_trans_lock_write(struct btree_trans *trans) { EBUG_ON(trans->write_locked); trans_for_each_update(trans, i) { if (same_leaf_as_prev(trans, i)) continue; if (bch2_btree_node_lock_write(trans, trans->paths + i->path, &insert_l(trans, i)->b->c)) return trans_lock_write_fail(trans, i); if (!i->cached) bch2_btree_node_prep_for_write(trans, trans->paths + i->path, insert_l(trans, i)->b); } trans->write_locked = true; return 0; } static inline void bch2_trans_unlock_updates_write(struct btree_trans *trans) { if (likely(trans->write_locked)) { trans_for_each_update(trans, i) if (btree_node_locked_type(trans->paths + i->path, i->level) == BTREE_NODE_WRITE_LOCKED) bch2_btree_node_unlock_write_inlined(trans, trans->paths + i->path, insert_l(trans, i)->b); trans->write_locked = false; } } /* Inserting into a given leaf node (last stage of insert): */ /* Handle overwrites and do insert, for non extents: */ bool bch2_btree_bset_insert_key(struct btree_trans *trans, struct btree_path *path, struct btree *b, struct btree_node_iter *node_iter, struct bkey_i *insert) { struct bkey_packed *k; unsigned clobber_u64s = 0, new_u64s = 0; EBUG_ON(btree_node_just_written(b)); 
EBUG_ON(bset_written(b, btree_bset_last(b))); EBUG_ON(bkey_deleted(&insert->k) && bkey_val_u64s(&insert->k)); EBUG_ON(bpos_lt(insert->k.p, b->data->min_key)); EBUG_ON(bpos_gt(insert->k.p, b->data->max_key)); EBUG_ON(insert->k.u64s > bch2_btree_keys_u64s_remaining(b)); EBUG_ON(!b->c.level && !bpos_eq(insert->k.p, path->pos)); k = bch2_btree_node_iter_peek_all(node_iter, b); if (k && bkey_cmp_left_packed(b, k, &insert->k.p)) k = NULL; /* @k is the key being overwritten/deleted, if any: */ EBUG_ON(k && bkey_deleted(k)); /* Deleting, but not found? nothing to do: */ if (bkey_deleted(&insert->k) && !k) return false; if (bkey_deleted(&insert->k)) { /* Deleting: */ btree_account_key_drop(b, k); k->type = KEY_TYPE_deleted; if (k->needs_whiteout) push_whiteout(b, insert->k.p); k->needs_whiteout = false; if (k >= btree_bset_last(b)->start) { clobber_u64s = k->u64s; bch2_bset_delete(b, k, clobber_u64s); goto fix_iter; } else { bch2_btree_path_fix_key_modified(trans, b, k); } return true; } if (k) { /* Overwriting: */ btree_account_key_drop(b, k); k->type = KEY_TYPE_deleted; insert->k.needs_whiteout = k->needs_whiteout; k->needs_whiteout = false; if (k >= btree_bset_last(b)->start) { clobber_u64s = k->u64s; goto overwrite; } else { bch2_btree_path_fix_key_modified(trans, b, k); } } k = bch2_btree_node_iter_bset_pos(node_iter, b, bset_tree_last(b)); overwrite: bch2_bset_insert(b, k, insert, clobber_u64s); new_u64s = k->u64s; fix_iter: if (clobber_u64s != new_u64s) bch2_btree_node_iter_fix(trans, path, b, node_iter, k, clobber_u64s, new_u64s); return true; } static int __btree_node_flush(struct journal *j, struct journal_entry_pin *pin, unsigned i, u64 seq) { struct bch_fs *c = container_of(j, struct bch_fs, journal); struct btree_write *w = container_of(pin, struct btree_write, journal); struct btree *b = container_of(w, struct btree, writes[i]); struct btree_trans *trans = bch2_trans_get(c); unsigned long old, new; unsigned idx = w - b->writes; btree_node_lock_nopath_nofail(trans, &b->c, SIX_LOCK_read); old = READ_ONCE(b->flags); do { new = old; if (!(old & (1 << BTREE_NODE_dirty)) || !!(old & (1 << BTREE_NODE_write_idx)) != idx || w->journal.seq != seq) break; new &= ~BTREE_WRITE_TYPE_MASK; new |= BTREE_WRITE_journal_reclaim; new |= 1 << BTREE_NODE_need_write; } while (!try_cmpxchg(&b->flags, &old, new)); btree_node_write_if_need(trans, b, SIX_LOCK_read); six_unlock_read(&b->c.lock); bch2_trans_put(trans); return 0; } int bch2_btree_node_flush0(struct journal *j, struct journal_entry_pin *pin, u64 seq) { return __btree_node_flush(j, pin, 0, seq); } int bch2_btree_node_flush1(struct journal *j, struct journal_entry_pin *pin, u64 seq) { return __btree_node_flush(j, pin, 1, seq); } inline void bch2_btree_add_journal_pin(struct bch_fs *c, struct btree *b, u64 seq) { struct btree_write *w = btree_current_write(b); bch2_journal_pin_add(&c->journal, seq, &w->journal, btree_node_write_idx(b) == 0 ? 
bch2_btree_node_flush0 : bch2_btree_node_flush1); } /** * bch2_btree_insert_key_leaf() - insert a key one key into a leaf node * @trans: btree transaction object * @path: path pointing to @insert's pos * @insert: key to insert * @journal_seq: sequence number of journal reservation */ inline void bch2_btree_insert_key_leaf(struct btree_trans *trans, struct btree_path *path, struct bkey_i *insert, u64 journal_seq) { struct bch_fs *c = trans->c; struct btree *b = path_l(path)->b; struct bset_tree *t = bset_tree_last(b); struct bset *i = bset(b, t); int old_u64s = bset_u64s(t); int old_live_u64s = b->nr.live_u64s; int live_u64s_added, u64s_added; if (unlikely(!bch2_btree_bset_insert_key(trans, path, b, &path_l(path)->iter, insert))) return; i->journal_seq = cpu_to_le64(max(journal_seq, le64_to_cpu(i->journal_seq))); bch2_btree_add_journal_pin(c, b, journal_seq); if (unlikely(!btree_node_dirty(b))) { EBUG_ON(test_bit(BCH_FS_clean_shutdown, &c->flags)); set_btree_node_dirty_acct(c, b); } live_u64s_added = (int) b->nr.live_u64s - old_live_u64s; u64s_added = (int) bset_u64s(t) - old_u64s; if (b->sib_u64s[0] != U16_MAX && live_u64s_added < 0) b->sib_u64s[0] = max(0, (int) b->sib_u64s[0] + live_u64s_added); if (b->sib_u64s[1] != U16_MAX && live_u64s_added < 0) b->sib_u64s[1] = max(0, (int) b->sib_u64s[1] + live_u64s_added); if (u64s_added > live_u64s_added && bch2_maybe_compact_whiteouts(c, b)) bch2_trans_node_reinit_iter(trans, b); } /* Cached btree updates: */ /* Normal update interface: */ static inline void btree_insert_entry_checks(struct btree_trans *trans, struct btree_insert_entry *i) { struct btree_path *path = trans->paths + i->path; BUG_ON(!bpos_eq(i->k->k.p, path->pos)); BUG_ON(i->cached != path->cached); BUG_ON(i->level != path->level); BUG_ON(i->btree_id != path->btree_id); EBUG_ON(!i->level && btree_type_has_snapshots(i->btree_id) && !(i->flags & BTREE_UPDATE_internal_snapshot_node) && test_bit(JOURNAL_replay_done, &trans->c->journal.flags) && i->k->k.p.snapshot && bch2_snapshot_is_internal_node(trans->c, i->k->k.p.snapshot) > 0); } static __always_inline int bch2_trans_journal_res_get(struct btree_trans *trans, unsigned flags) { return bch2_journal_res_get(&trans->c->journal, &trans->journal_res, trans->journal_u64s, flags, trans); } #define JSET_ENTRY_LOG_U64s 4 static noinline void journal_transaction_name(struct btree_trans *trans) { struct bch_fs *c = trans->c; struct journal *j = &c->journal; struct jset_entry *entry = bch2_journal_add_entry(j, &trans->journal_res, BCH_JSET_ENTRY_log, 0, 0, JSET_ENTRY_LOG_U64s); struct jset_entry_log *l = container_of(entry, struct jset_entry_log, entry); strncpy(l->d, trans->fn, JSET_ENTRY_LOG_U64s * sizeof(u64)); } static inline int btree_key_can_insert(struct btree_trans *trans, struct btree *b, unsigned u64s) { if (!bch2_btree_node_insert_fits(b, u64s)) return -BCH_ERR_btree_insert_btree_node_full; return 0; } noinline static int btree_key_can_insert_cached_slowpath(struct btree_trans *trans, unsigned flags, struct btree_path *path, unsigned new_u64s) { struct bkey_cached *ck = (void *) path->l[0].b; struct bkey_i *new_k; int ret; bch2_trans_unlock_updates_write(trans); bch2_trans_unlock(trans); new_k = kmalloc(new_u64s * sizeof(u64), GFP_KERNEL); if (!new_k) { bch_err(trans->c, "error allocating memory for key cache key, btree %s u64s %u", bch2_btree_id_str(path->btree_id), new_u64s); return -BCH_ERR_ENOMEM_btree_key_cache_insert; } ret = bch2_trans_relock(trans) ?: bch2_trans_lock_write(trans); if (unlikely(ret)) { kfree(new_k); return 
ret; } memcpy(new_k, ck->k, ck->u64s * sizeof(u64)); trans_for_each_update(trans, i) if (i->old_v == &ck->k->v) i->old_v = &new_k->v; kfree(ck->k); ck->u64s = new_u64s; ck->k = new_k; return 0; } static int btree_key_can_insert_cached(struct btree_trans *trans, unsigned flags, struct btree_path *path, unsigned u64s) { struct bch_fs *c = trans->c; struct bkey_cached *ck = (void *) path->l[0].b; unsigned new_u64s; struct bkey_i *new_k; unsigned watermark = flags & BCH_WATERMARK_MASK; EBUG_ON(path->level); if (watermark < BCH_WATERMARK_reclaim && !test_bit(BKEY_CACHED_DIRTY, &ck->flags) && bch2_btree_key_cache_must_wait(c)) return -BCH_ERR_btree_insert_need_journal_reclaim; /* * bch2_varint_decode can read past the end of the buffer by at most 7 * bytes (it won't be used): */ u64s += 1; if (u64s <= ck->u64s) return 0; new_u64s = roundup_pow_of_two(u64s); new_k = krealloc(ck->k, new_u64s * sizeof(u64), GFP_NOWAIT|__GFP_NOWARN); if (unlikely(!new_k)) return btree_key_can_insert_cached_slowpath(trans, flags, path, new_u64s); trans_for_each_update(trans, i) if (i->old_v == &ck->k->v) i->old_v = &new_k->v; ck->u64s = new_u64s; ck->k = new_k; return 0; } /* Triggers: */ static int run_one_mem_trigger(struct btree_trans *trans, struct btree_insert_entry *i, unsigned flags) { verify_update_old_key(trans, i); if (unlikely(flags & BTREE_TRIGGER_norun)) return 0; struct bkey_s_c old = { &i->old_k, i->old_v }; struct bkey_i *new = i->k; const struct bkey_ops *old_ops = bch2_bkey_type_ops(old.k->type); const struct bkey_ops *new_ops = bch2_bkey_type_ops(i->k->k.type); if (old_ops->trigger == new_ops->trigger) return bch2_key_trigger(trans, i->btree_id, i->level, old, bkey_i_to_s(new), BTREE_TRIGGER_insert|BTREE_TRIGGER_overwrite|flags); else return bch2_key_trigger_new(trans, i->btree_id, i->level, bkey_i_to_s(new), flags) ?: bch2_key_trigger_old(trans, i->btree_id, i->level, old, flags); } static int run_one_trans_trigger(struct btree_trans *trans, struct btree_insert_entry *i) { verify_update_old_key(trans, i); if ((i->flags & BTREE_TRIGGER_norun) || !btree_node_type_has_trans_triggers(i->bkey_type)) return 0; /* * Transactional triggers create new btree_insert_entries, so we can't * pass them a pointer to a btree_insert_entry, that memory is going to * move: */ struct bkey old_k = i->old_k; struct bkey_s_c old = { &old_k, i->old_v }; const struct bkey_ops *old_ops = bch2_bkey_type_ops(old.k->type); const struct bkey_ops *new_ops = bch2_bkey_type_ops(i->k->k.type); unsigned flags = i->flags|BTREE_TRIGGER_transactional; if (!i->insert_trigger_run && !i->overwrite_trigger_run && old_ops->trigger == new_ops->trigger) { i->overwrite_trigger_run = true; i->insert_trigger_run = true; return bch2_key_trigger(trans, i->btree_id, i->level, old, bkey_i_to_s(i->k), BTREE_TRIGGER_insert| BTREE_TRIGGER_overwrite|flags) ?: 1; } else if (!i->overwrite_trigger_run) { i->overwrite_trigger_run = true; return bch2_key_trigger_old(trans, i->btree_id, i->level, old, flags) ?: 1; } else if (!i->insert_trigger_run) { i->insert_trigger_run = true; return bch2_key_trigger_new(trans, i->btree_id, i->level, bkey_i_to_s(i->k), flags) ?: 1; } else { return 0; } } static int run_btree_triggers(struct btree_trans *trans, enum btree_id btree_id, unsigned *btree_id_updates_start) { bool trans_trigger_run; /* * Running triggers will append more updates to the list of updates as * we're walking it: */ do { trans_trigger_run = false; for (unsigned i = *btree_id_updates_start; i < trans->nr_updates && trans->updates[i].btree_id <= 
btree_id; i++) { if (trans->updates[i].btree_id < btree_id) { *btree_id_updates_start = i; continue; } int ret = run_one_trans_trigger(trans, trans->updates + i); if (ret < 0) return ret; if (ret) trans_trigger_run = true; } } while (trans_trigger_run); trans_for_each_update(trans, i) BUG_ON(!(i->flags & BTREE_TRIGGER_norun) && i->btree_id == btree_id && btree_node_type_has_trans_triggers(i->bkey_type) && (!i->insert_trigger_run || !i->overwrite_trigger_run)); return 0; } static int bch2_trans_commit_run_triggers(struct btree_trans *trans) { unsigned btree_id = 0, btree_id_updates_start = 0; int ret = 0; /* * * For a given btree, this algorithm runs insert triggers before * overwrite triggers: this is so that when extents are being moved * (e.g. by FALLOCATE_FL_INSERT_RANGE), we don't drop references before * they are re-added. */ for (btree_id = 0; btree_id < BTREE_ID_NR; btree_id++) { if (btree_id == BTREE_ID_alloc) continue; ret = run_btree_triggers(trans, btree_id, &btree_id_updates_start); if (ret) return ret; } btree_id_updates_start = 0; ret = run_btree_triggers(trans, BTREE_ID_alloc, &btree_id_updates_start); if (ret) return ret; #ifdef CONFIG_BCACHEFS_DEBUG trans_for_each_update(trans, i) BUG_ON(!(i->flags & BTREE_TRIGGER_norun) && btree_node_type_has_trans_triggers(i->bkey_type) && (!i->insert_trigger_run || !i->overwrite_trigger_run)); #endif return 0; } static noinline int bch2_trans_commit_run_gc_triggers(struct btree_trans *trans) { trans_for_each_update(trans, i) if (btree_node_type_has_triggers(i->bkey_type) && gc_visited(trans->c, gc_pos_btree(i->btree_id, i->level, i->k->k.p))) { int ret = run_one_mem_trigger(trans, i, i->flags|BTREE_TRIGGER_gc); if (ret) return ret; } return 0; } static inline int bch2_trans_commit_write_locked(struct btree_trans *trans, unsigned flags, struct btree_insert_entry **stopped_at, unsigned long trace_ip) { struct bch_fs *c = trans->c; struct btree_trans_commit_hook *h; unsigned u64s = 0; int ret = 0; bch2_trans_verify_not_unlocked_or_in_restart(trans); if (race_fault()) { trace_and_count(c, trans_restart_fault_inject, trans, trace_ip); return btree_trans_restart(trans, BCH_ERR_transaction_restart_fault_inject); } /* * Check if the insert will fit in the leaf node with the write lock * held, otherwise another thread could write the node changing the * amount of space available: */ prefetch(&trans->c->journal.flags); trans_for_each_update(trans, i) { /* Multiple inserts might go to same leaf: */ if (!same_leaf_as_prev(trans, i)) u64s = 0; u64s += i->k->k.u64s; ret = !i->cached ? 
btree_key_can_insert(trans, insert_l(trans, i)->b, u64s) : btree_key_can_insert_cached(trans, flags, trans->paths + i->path, u64s); if (ret) { *stopped_at = i; return ret; } i->k->k.needs_whiteout = false; } /* * Don't get journal reservation until after we know insert will * succeed: */ if (likely(!(flags & BCH_TRANS_COMMIT_no_journal_res))) { ret = bch2_trans_journal_res_get(trans, (flags & BCH_WATERMARK_MASK)| JOURNAL_RES_GET_NONBLOCK); if (ret) return ret; if (unlikely(trans->journal_transaction_names)) journal_transaction_name(trans); } /* * Not allowed to fail after we've gotten our journal reservation - we * have to use it: */ if (IS_ENABLED(CONFIG_BCACHEFS_DEBUG) && !(flags & BCH_TRANS_COMMIT_no_journal_res)) { if (bch2_journal_seq_verify) trans_for_each_update(trans, i) i->k->k.bversion.lo = trans->journal_res.seq; else if (bch2_inject_invalid_keys) trans_for_each_update(trans, i) i->k->k.bversion = MAX_VERSION; } h = trans->hooks; while (h) { ret = h->fn(trans, h); if (ret) return ret; h = h->next; } struct jset_entry *entry = trans->journal_entries; percpu_down_read(&c->mark_lock); for (entry = trans->journal_entries; entry != (void *) ((u64 *) trans->journal_entries + trans->journal_entries_u64s); entry = vstruct_next(entry)) if (entry->type == BCH_JSET_ENTRY_write_buffer_keys && entry->start->k.type == KEY_TYPE_accounting) { ret = bch2_accounting_trans_commit_hook(trans, bkey_i_to_accounting(entry->start), flags); if (ret) goto revert_fs_usage; } percpu_up_read(&c->mark_lock); /* XXX: we only want to run this if deltas are nonzero */ bch2_trans_account_disk_usage_change(trans); trans_for_each_update(trans, i) if (btree_node_type_has_atomic_triggers(i->bkey_type)) { ret = run_one_mem_trigger(trans, i, BTREE_TRIGGER_atomic|i->flags); if (ret) goto fatal_err; } if (unlikely(c->gc_pos.phase)) { ret = bch2_trans_commit_run_gc_triggers(trans); if (ret) goto fatal_err; } struct bkey_validate_context validate_context = { .from = BKEY_VALIDATE_commit }; if (!(flags & BCH_TRANS_COMMIT_no_journal_res)) validate_context.flags = BCH_VALIDATE_write|BCH_VALIDATE_commit; for (struct jset_entry *i = trans->journal_entries; i != (void *) ((u64 *) trans->journal_entries + trans->journal_entries_u64s); i = vstruct_next(i)) { ret = bch2_journal_entry_validate(c, NULL, i, bcachefs_metadata_version_current, CPU_BIG_ENDIAN, validate_context); if (unlikely(ret)) { bch2_trans_inconsistent(trans, "invalid journal entry on insert from %s\n", trans->fn); goto fatal_err; } } trans_for_each_update(trans, i) { validate_context.level = i->level; validate_context.btree = i->btree_id; ret = bch2_bkey_validate(c, bkey_i_to_s_c(i->k), validate_context); if (unlikely(ret)){ bch2_trans_inconsistent(trans, "invalid bkey on insert from %s -> %ps\n", trans->fn, (void *) i->ip_allocated); goto fatal_err; } btree_insert_entry_checks(trans, i); } if (likely(!(flags & BCH_TRANS_COMMIT_no_journal_res))) { struct journal *j = &c->journal; struct jset_entry *entry; trans_for_each_update(trans, i) { if (i->key_cache_already_flushed) continue; if (i->flags & BTREE_UPDATE_nojournal) continue; verify_update_old_key(trans, i); if (trans->journal_transaction_names) { entry = bch2_journal_add_entry(j, &trans->journal_res, BCH_JSET_ENTRY_overwrite, i->btree_id, i->level, i->old_k.u64s); bkey_reassemble((struct bkey_i *) entry->start, (struct bkey_s_c) { &i->old_k, i->old_v }); } entry = bch2_journal_add_entry(j, &trans->journal_res, BCH_JSET_ENTRY_btree_keys, i->btree_id, i->level, i->k->k.u64s); bkey_copy((struct bkey_i *) 
entry->start, i->k); } memcpy_u64s_small(journal_res_entry(&c->journal, &trans->journal_res), trans->journal_entries, trans->journal_entries_u64s); trans->journal_res.offset += trans->journal_entries_u64s; trans->journal_res.u64s -= trans->journal_entries_u64s; if (trans->journal_seq) *trans->journal_seq = trans->journal_res.seq; } trans_for_each_update(trans, i) { struct btree_path *path = trans->paths + i->path; if (!i->cached) bch2_btree_insert_key_leaf(trans, path, i->k, trans->journal_res.seq); else if (!i->key_cache_already_flushed) bch2_btree_insert_key_cached(trans, flags, i); else bch2_btree_key_cache_drop(trans, path); } return 0; fatal_err: bch2_fs_fatal_error(c, "fatal error in transaction commit: %s", bch2_err_str(ret)); percpu_down_read(&c->mark_lock); revert_fs_usage: for (struct jset_entry *entry2 = trans->journal_entries; entry2 != entry; entry2 = vstruct_next(entry2)) if (entry2->type == BCH_JSET_ENTRY_write_buffer_keys && entry2->start->k.type == KEY_TYPE_accounting) bch2_accounting_trans_commit_revert(trans, bkey_i_to_accounting(entry2->start), flags); percpu_up_read(&c->mark_lock); return ret; } static noinline void bch2_drop_overwrites_from_journal(struct btree_trans *trans) { /* * Accounting keys aren't deduped in the journal: we have to compare * each individual update against what's in the btree to see if it has * been applied yet, and accounting updates also don't overwrite, * they're deltas that accumulate. */ trans_for_each_update(trans, i) if (i->k->k.type != KEY_TYPE_accounting) bch2_journal_key_overwritten(trans->c, i->btree_id, i->level, i->k->k.p); } static int bch2_trans_commit_journal_pin_flush(struct journal *j, struct journal_entry_pin *_pin, u64 seq) { return 0; } /* * Get journal reservation, take write locks, and attempt to do btree update(s): */ static inline int do_bch2_trans_commit(struct btree_trans *trans, unsigned flags, struct btree_insert_entry **stopped_at, unsigned long trace_ip) { struct bch_fs *c = trans->c; int ret = 0, u64s_delta = 0; for (unsigned idx = 0; idx < trans->nr_updates; idx++) { struct btree_insert_entry *i = trans->updates + idx; if (i->cached) continue; u64s_delta += !bkey_deleted(&i->k->k) ? 
i->k->k.u64s : 0; u64s_delta -= i->old_btree_u64s; if (!same_leaf_as_next(trans, i)) { if (u64s_delta <= 0) { ret = bch2_foreground_maybe_merge(trans, i->path, i->level, flags); if (unlikely(ret)) return ret; } u64s_delta = 0; } } ret = bch2_trans_lock_write(trans); if (unlikely(ret)) return ret; ret = bch2_trans_commit_write_locked(trans, flags, stopped_at, trace_ip); if (!ret && unlikely(trans->journal_replay_not_finished)) bch2_drop_overwrites_from_journal(trans); bch2_trans_unlock_updates_write(trans); if (!ret && trans->journal_pin) bch2_journal_pin_add(&c->journal, trans->journal_res.seq, trans->journal_pin, bch2_trans_commit_journal_pin_flush); /* * Drop journal reservation after dropping write locks, since dropping * the journal reservation may kick off a journal write: */ if (likely(!(flags & BCH_TRANS_COMMIT_no_journal_res))) bch2_journal_res_put(&c->journal, &trans->journal_res); return ret; } static int journal_reclaim_wait_done(struct bch_fs *c) { int ret = bch2_journal_error(&c->journal) ?: bch2_btree_key_cache_wait_done(c); if (!ret) journal_reclaim_kick(&c->journal); return ret; } static noinline int bch2_trans_commit_error(struct btree_trans *trans, unsigned flags, struct btree_insert_entry *i, int ret, unsigned long trace_ip) { struct bch_fs *c = trans->c; enum bch_watermark watermark = flags & BCH_WATERMARK_MASK; switch (ret) { case -BCH_ERR_btree_insert_btree_node_full: ret = bch2_btree_split_leaf(trans, i->path, flags); if (bch2_err_matches(ret, BCH_ERR_transaction_restart)) trace_and_count(c, trans_restart_btree_node_split, trans, trace_ip, trans->paths + i->path); break; case -BCH_ERR_btree_insert_need_mark_replicas: ret = drop_locks_do(trans, bch2_accounting_update_sb(trans)); break; case -BCH_ERR_journal_res_get_blocked: /* * XXX: this should probably be a separate BTREE_INSERT_NONBLOCK * flag */ if ((flags & BCH_TRANS_COMMIT_journal_reclaim) && watermark < BCH_WATERMARK_reclaim) { ret = -BCH_ERR_journal_reclaim_would_deadlock; break; } ret = drop_locks_do(trans, bch2_trans_journal_res_get(trans, (flags & BCH_WATERMARK_MASK)| JOURNAL_RES_GET_CHECK)); break; case -BCH_ERR_btree_insert_need_journal_reclaim: bch2_trans_unlock(trans); trace_and_count(c, trans_blocked_journal_reclaim, trans, trace_ip); track_event_change(&c->times[BCH_TIME_blocked_key_cache_flush], true); wait_event_freezable(c->journal.reclaim_wait, (ret = journal_reclaim_wait_done(c))); track_event_change(&c->times[BCH_TIME_blocked_key_cache_flush], false); if (ret < 0) break; ret = bch2_trans_relock(trans); break; default: BUG_ON(ret >= 0); break; } BUG_ON(bch2_err_matches(ret, BCH_ERR_transaction_restart) != !!trans->restarted); bch2_fs_inconsistent_on(bch2_err_matches(ret, ENOSPC) && (flags & BCH_TRANS_COMMIT_no_enospc), c, "%s: incorrectly got %s\n", __func__, bch2_err_str(ret)); return ret; } /* * This is for updates done in the early part of fsck - btree_gc - before we've * gone RW. we only add the new key to the list of keys for journal replay to * do. 
*/ static noinline int do_bch2_trans_commit_to_journal_replay(struct btree_trans *trans) { struct bch_fs *c = trans->c; BUG_ON(current != c->recovery_task); trans_for_each_update(trans, i) { int ret = bch2_journal_key_insert(c, i->btree_id, i->level, i->k); if (ret) return ret; } for (struct jset_entry *i = trans->journal_entries; i != (void *) ((u64 *) trans->journal_entries + trans->journal_entries_u64s); i = vstruct_next(i)) if (i->type == BCH_JSET_ENTRY_btree_keys || i->type == BCH_JSET_ENTRY_write_buffer_keys) { int ret = bch2_journal_key_insert(c, i->btree_id, i->level, i->start); if (ret) return ret; } return 0; } int __bch2_trans_commit(struct btree_trans *trans, unsigned flags) { struct btree_insert_entry *errored_at = NULL; struct bch_fs *c = trans->c; int ret = 0; bch2_trans_verify_not_unlocked_or_in_restart(trans); ret = trans_maybe_inject_restart(trans, _RET_IP_); if (unlikely(ret)) goto out_reset; if (!trans->nr_updates && !trans->journal_entries_u64s) goto out_reset; ret = bch2_trans_commit_run_triggers(trans); if (ret) goto out_reset; if (!(flags & BCH_TRANS_COMMIT_no_check_rw) && unlikely(!bch2_write_ref_tryget(c, BCH_WRITE_REF_trans))) { if (unlikely(!test_bit(BCH_FS_may_go_rw, &c->flags))) ret = do_bch2_trans_commit_to_journal_replay(trans); else ret = -BCH_ERR_erofs_trans_commit; goto out_reset; } EBUG_ON(test_bit(BCH_FS_clean_shutdown, &c->flags)); trans->journal_u64s = trans->journal_entries_u64s; trans->journal_transaction_names = READ_ONCE(c->opts.journal_transaction_names); if (trans->journal_transaction_names) trans->journal_u64s += jset_u64s(JSET_ENTRY_LOG_U64s); trans_for_each_update(trans, i) { struct btree_path *path = trans->paths + i->path; EBUG_ON(!path->should_be_locked); ret = bch2_btree_path_upgrade(trans, path, i->level + 1); if (unlikely(ret)) goto out; EBUG_ON(!btree_node_intent_locked(path, i->level)); if (i->key_cache_already_flushed) continue; if (i->flags & BTREE_UPDATE_nojournal) continue; /* we're going to journal the key being updated: */ trans->journal_u64s += jset_u64s(i->k->k.u64s); /* and we're also going to log the overwrite: */ if (trans->journal_transaction_names) trans->journal_u64s += jset_u64s(i->old_k.u64s); } if (trans->extra_disk_res) { ret = bch2_disk_reservation_add(c, trans->disk_res, trans->extra_disk_res, (flags & BCH_TRANS_COMMIT_no_enospc) ? BCH_DISK_RESERVATION_NOFAIL : 0); if (ret) goto err; } retry: errored_at = NULL; bch2_trans_verify_not_unlocked_or_in_restart(trans); if (likely(!(flags & BCH_TRANS_COMMIT_no_journal_res))) memset(&trans->journal_res, 0, sizeof(trans->journal_res)); memset(&trans->fs_usage_delta, 0, sizeof(trans->fs_usage_delta)); ret = do_bch2_trans_commit(trans, flags, &errored_at, _RET_IP_); /* make sure we didn't drop or screw up locks: */ bch2_trans_verify_locks(trans); if (ret) goto err; trace_and_count(c, transaction_commit, trans, _RET_IP_); out: if (likely(!(flags & BCH_TRANS_COMMIT_no_check_rw))) bch2_write_ref_put(c, BCH_WRITE_REF_trans); out_reset: if (!ret) bch2_trans_downgrade(trans); bch2_trans_reset_updates(trans); return ret; err: ret = bch2_trans_commit_error(trans, flags, errored_at, ret, _RET_IP_); if (ret) goto out; /* * We might have done another transaction commit in the error path - * i.e. 
btree write buffer flush - which will have made use of * trans->journal_res, but with BCH_TRANS_COMMIT_no_journal_res that is * how the journal sequence number to pin is passed in - so we must * restart: */ if (flags & BCH_TRANS_COMMIT_no_journal_res) { ret = -BCH_ERR_transaction_restart_nested; goto out; } goto retry; }
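/*
 * Illustrative sketch, not part of this file: the commit path above can fail
 * with BCH_ERR_transaction_restart errors, which callers are expected to
 * handle by restarting the whole transaction from the top. A minimal retry
 * loop around __bch2_trans_commit() might look like the following; the
 * queue_my_updates() helper is hypothetical and stands in for whatever
 * bch2_trans_update() calls the caller needs to make.
 */
static int example_commit_with_retry(struct btree_trans *trans, unsigned flags)
{
	int ret;

	do {
		bch2_trans_begin(trans);		/* reset transaction state before retrying */

		ret = queue_my_updates(trans) ?:	/* hypothetical: queue the btree updates */
		      __bch2_trans_commit(trans, flags);
	} while (bch2_err_matches(ret, BCH_ERR_transaction_restart));

	return ret;
}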
// SPDX-License-Identifier: GPL-2.0-or-later /* * Linux I2C core SMBus and SMBus emulation code * * This file contains the SMBus functions which are always included in the I2C * core because they can be emulated via I2C. SMBus specific extensions * (e.g. smbalert) are handled in a separate i2c-smbus module.
* * All SMBus-related things are written by Frodo Looijaard <frodol@dds.nl> * SMBus 2.0 support by Mark Studebaker <mdsxyz123@yahoo.com> and * Jean Delvare <jdelvare@suse.de> */ #include <linux/device.h> #include <linux/err.h> #include <linux/i2c.h> #include <linux/i2c-smbus.h> #include <linux/property.h> #include <linux/slab.h> #include "i2c-core.h" #define CREATE_TRACE_POINTS #include <trace/events/smbus.h> /* The SMBus parts */ #define POLY (0x1070U << 3) static u8 crc8(u16 data) { int i; for (i = 0; i < 8; i++) { if (data & 0x8000) data = data ^ POLY; data = data << 1; } return (u8)(data >> 8); } /** * i2c_smbus_pec - Incremental CRC8 over the given input data array * @crc: previous return crc8 value * @p: pointer to data buffer. * @count: number of bytes in data buffer. * * Incremental CRC8 over count bytes in the array pointed to by p */ u8 i2c_smbus_pec(u8 crc, u8 *p, size_t count) { int i; for (i = 0; i < count; i++) crc = crc8((crc ^ p[i]) << 8); return crc; } EXPORT_SYMBOL(i2c_smbus_pec); /* Assume a 7-bit address, which is reasonable for SMBus */ static u8 i2c_smbus_msg_pec(u8 pec, struct i2c_msg *msg) { /* The address will be sent first */ u8 addr = i2c_8bit_addr_from_msg(msg); pec = i2c_smbus_pec(pec, &addr, 1); /* The data buffer follows */ return i2c_smbus_pec(pec, msg->buf, msg->len); } /* Used for write only transactions */ static inline void i2c_smbus_add_pec(struct i2c_msg *msg) { msg->buf[msg->len] = i2c_smbus_msg_pec(0, msg); msg->len++; } /* Return <0 on CRC error If there was a write before this read (most cases) we need to take the partial CRC from the write part into account. Note that this function does modify the message (we need to decrease the message length to hide the CRC byte from the caller). */ static int i2c_smbus_check_pec(u8 cpec, struct i2c_msg *msg) { u8 rpec = msg->buf[--msg->len]; cpec = i2c_smbus_msg_pec(cpec, msg); if (rpec != cpec) { pr_debug("Bad PEC 0x%02x vs. 0x%02x\n", rpec, cpec); return -EBADMSG; } return 0; } /** * i2c_smbus_read_byte - SMBus "receive byte" protocol * @client: Handle to slave device * * This executes the SMBus "receive byte" protocol, returning negative errno * else the byte received from the device. */ s32 i2c_smbus_read_byte(const struct i2c_client *client) { union i2c_smbus_data data; int status; status = i2c_smbus_xfer(client->adapter, client->addr, client->flags, I2C_SMBUS_READ, 0, I2C_SMBUS_BYTE, &data); return (status < 0) ? status : data.byte; } EXPORT_SYMBOL(i2c_smbus_read_byte); /** * i2c_smbus_write_byte - SMBus "send byte" protocol * @client: Handle to slave device * @value: Byte to be sent * * This executes the SMBus "send byte" protocol, returning negative errno * else zero on success. */ s32 i2c_smbus_write_byte(const struct i2c_client *client, u8 value) { return i2c_smbus_xfer(client->adapter, client->addr, client->flags, I2C_SMBUS_WRITE, value, I2C_SMBUS_BYTE, NULL); } EXPORT_SYMBOL(i2c_smbus_write_byte); /** * i2c_smbus_read_byte_data - SMBus "read byte" protocol * @client: Handle to slave device * @command: Byte interpreted by slave * * This executes the SMBus "read byte" protocol, returning negative errno * else a data byte received from the device. */ s32 i2c_smbus_read_byte_data(const struct i2c_client *client, u8 command) { union i2c_smbus_data data; int status; status = i2c_smbus_xfer(client->adapter, client->addr, client->flags, I2C_SMBUS_READ, command, I2C_SMBUS_BYTE_DATA, &data); return (status < 0) ? 
status : data.byte; } EXPORT_SYMBOL(i2c_smbus_read_byte_data); /** * i2c_smbus_write_byte_data - SMBus "write byte" protocol * @client: Handle to slave device * @command: Byte interpreted by slave * @value: Byte being written * * This executes the SMBus "write byte" protocol, returning negative errno * else zero on success. */ s32 i2c_smbus_write_byte_data(const struct i2c_client *client, u8 command, u8 value) { union i2c_smbus_data data; data.byte = value; return i2c_smbus_xfer(client->adapter, client->addr, client->flags, I2C_SMBUS_WRITE, command, I2C_SMBUS_BYTE_DATA, &data); } EXPORT_SYMBOL(i2c_smbus_write_byte_data); /** * i2c_smbus_read_word_data - SMBus "read word" protocol * @client: Handle to slave device * @command: Byte interpreted by slave * * This executes the SMBus "read word" protocol, returning negative errno * else a 16-bit unsigned "word" received from the device. */ s32 i2c_smbus_read_word_data(const struct i2c_client *client, u8 command) { union i2c_smbus_data data; int status; status = i2c_smbus_xfer(client->adapter, client->addr, client->flags, I2C_SMBUS_READ, command, I2C_SMBUS_WORD_DATA, &data); return (status < 0) ? status : data.word; } EXPORT_SYMBOL(i2c_smbus_read_word_data); /** * i2c_smbus_write_word_data - SMBus "write word" protocol * @client: Handle to slave device * @command: Byte interpreted by slave * @value: 16-bit "word" being written * * This executes the SMBus "write word" protocol, returning negative errno * else zero on success. */ s32 i2c_smbus_write_word_data(const struct i2c_client *client, u8 command, u16 value) { union i2c_smbus_data data; data.word = value; return i2c_smbus_xfer(client->adapter, client->addr, client->flags, I2C_SMBUS_WRITE, command, I2C_SMBUS_WORD_DATA, &data); } EXPORT_SYMBOL(i2c_smbus_write_word_data); /** * i2c_smbus_read_block_data - SMBus "block read" protocol * @client: Handle to slave device * @command: Byte interpreted by slave * @values: Byte array into which data will be read; big enough to hold * the data returned by the slave. SMBus allows at most 32 bytes. * * This executes the SMBus "block read" protocol, returning negative errno * else the number of data bytes in the slave's response. * * Note that using this function requires that the client's adapter support * the I2C_FUNC_SMBUS_READ_BLOCK_DATA functionality. Not all adapter drivers * support this; its emulation through I2C messaging relies on a specific * mechanism (I2C_M_RECV_LEN) which may not be implemented. */ s32 i2c_smbus_read_block_data(const struct i2c_client *client, u8 command, u8 *values) { union i2c_smbus_data data; int status; status = i2c_smbus_xfer(client->adapter, client->addr, client->flags, I2C_SMBUS_READ, command, I2C_SMBUS_BLOCK_DATA, &data); if (status) return status; memcpy(values, &data.block[1], data.block[0]); return data.block[0]; } EXPORT_SYMBOL(i2c_smbus_read_block_data); /** * i2c_smbus_write_block_data - SMBus "block write" protocol * @client: Handle to slave device * @command: Byte interpreted by slave * @length: Size of data block; SMBus allows at most 32 bytes * @values: Byte array which will be written. * * This executes the SMBus "block write" protocol, returning negative errno * else zero on success. 
*/ s32 i2c_smbus_write_block_data(const struct i2c_client *client, u8 command, u8 length, const u8 *values) { union i2c_smbus_data data; if (length > I2C_SMBUS_BLOCK_MAX) length = I2C_SMBUS_BLOCK_MAX; data.block[0] = length; memcpy(&data.block[1], values, length); return i2c_smbus_xfer(client->adapter, client->addr, client->flags, I2C_SMBUS_WRITE, command, I2C_SMBUS_BLOCK_DATA, &data); } EXPORT_SYMBOL(i2c_smbus_write_block_data); /* Returns the number of read bytes */ s32 i2c_smbus_read_i2c_block_data(const struct i2c_client *client, u8 command, u8 length, u8 *values) { union i2c_smbus_data data; int status; if (length > I2C_SMBUS_BLOCK_MAX) length = I2C_SMBUS_BLOCK_MAX; data.block[0] = length; status = i2c_smbus_xfer(client->adapter, client->addr, client->flags, I2C_SMBUS_READ, command, I2C_SMBUS_I2C_BLOCK_DATA, &data); if (status < 0) return status; memcpy(values, &data.block[1], data.block[0]); return data.block[0]; } EXPORT_SYMBOL(i2c_smbus_read_i2c_block_data); s32 i2c_smbus_write_i2c_block_data(const struct i2c_client *client, u8 command, u8 length, const u8 *values) { union i2c_smbus_data data; if (length > I2C_SMBUS_BLOCK_MAX) length = I2C_SMBUS_BLOCK_MAX; data.block[0] = length; memcpy(data.block + 1, values, length); return i2c_smbus_xfer(client->adapter, client->addr, client->flags, I2C_SMBUS_WRITE, command, I2C_SMBUS_I2C_BLOCK_DATA, &data); } EXPORT_SYMBOL(i2c_smbus_write_i2c_block_data); static void i2c_smbus_try_get_dmabuf(struct i2c_msg *msg, u8 init_val) { bool is_read = msg->flags & I2C_M_RD; unsigned char *dma_buf; dma_buf = kzalloc(I2C_SMBUS_BLOCK_MAX + (is_read ? 2 : 3), GFP_KERNEL); if (!dma_buf) return; msg->buf = dma_buf; msg->flags |= I2C_M_DMA_SAFE; if (init_val) msg->buf[0] = init_val; } /* * Simulate a SMBus command using the I2C protocol. * No checking of parameters is done! */ static s32 i2c_smbus_xfer_emulated(struct i2c_adapter *adapter, u16 addr, unsigned short flags, char read_write, u8 command, int size, union i2c_smbus_data *data) { /* * So we need to generate a series of msgs. In the case of writing, we * need to use only one message; when reading, we need two. We * initialize most things with sane defaults, to keep the code below * somewhat simpler. */ unsigned char msgbuf0[I2C_SMBUS_BLOCK_MAX+3]; unsigned char msgbuf1[I2C_SMBUS_BLOCK_MAX+2]; int nmsgs = read_write == I2C_SMBUS_READ ? 2 : 1; u8 partial_pec = 0; int status; struct i2c_msg msg[2] = { { .addr = addr, .flags = flags, .len = 1, .buf = msgbuf0, }, { .addr = addr, .flags = flags | I2C_M_RD, .len = 0, .buf = msgbuf1, }, }; bool wants_pec = ((flags & I2C_CLIENT_PEC) && size != I2C_SMBUS_QUICK && size != I2C_SMBUS_I2C_BLOCK_DATA); msgbuf0[0] = command; switch (size) { case I2C_SMBUS_QUICK: msg[0].len = 0; /* Special case: The read/write field is used as data */ msg[0].flags = flags | (read_write == I2C_SMBUS_READ ? I2C_M_RD : 0); nmsgs = 1; break; case I2C_SMBUS_BYTE: if (read_write == I2C_SMBUS_READ) { /* Special case: only a read! 
*/ msg[0].flags = I2C_M_RD | flags; nmsgs = 1; } break; case I2C_SMBUS_BYTE_DATA: if (read_write == I2C_SMBUS_READ) msg[1].len = 1; else { msg[0].len = 2; msgbuf0[1] = data->byte; } break; case I2C_SMBUS_WORD_DATA: if (read_write == I2C_SMBUS_READ) msg[1].len = 2; else { msg[0].len = 3; msgbuf0[1] = data->word & 0xff; msgbuf0[2] = data->word >> 8; } break; case I2C_SMBUS_PROC_CALL: nmsgs = 2; /* Special case */ read_write = I2C_SMBUS_READ; msg[0].len = 3; msg[1].len = 2; msgbuf0[1] = data->word & 0xff; msgbuf0[2] = data->word >> 8; break; case I2C_SMBUS_BLOCK_DATA: if (read_write == I2C_SMBUS_READ) { msg[1].flags |= I2C_M_RECV_LEN; msg[1].len = 1; /* block length will be added by the underlying bus driver */ i2c_smbus_try_get_dmabuf(&msg[1], 0); } else { msg[0].len = data->block[0] + 2; if (msg[0].len > I2C_SMBUS_BLOCK_MAX + 2) { dev_err(&adapter->dev, "Invalid block write size %d\n", data->block[0]); return -EINVAL; } i2c_smbus_try_get_dmabuf(&msg[0], command); memcpy(msg[0].buf + 1, data->block, msg[0].len - 1); } break; case I2C_SMBUS_BLOCK_PROC_CALL: nmsgs = 2; /* Another special case */ read_write = I2C_SMBUS_READ; if (data->block[0] > I2C_SMBUS_BLOCK_MAX) { dev_err(&adapter->dev, "Invalid block write size %d\n", data->block[0]); return -EINVAL; } msg[0].len = data->block[0] + 2; i2c_smbus_try_get_dmabuf(&msg[0], command); memcpy(msg[0].buf + 1, data->block, msg[0].len - 1); msg[1].flags |= I2C_M_RECV_LEN; msg[1].len = 1; /* block length will be added by the underlying bus driver */ i2c_smbus_try_get_dmabuf(&msg[1], 0); break; case I2C_SMBUS_I2C_BLOCK_DATA: if (data->block[0] > I2C_SMBUS_BLOCK_MAX) { dev_err(&adapter->dev, "Invalid block %s size %d\n", read_write == I2C_SMBUS_READ ? "read" : "write", data->block[0]); return -EINVAL; } if (read_write == I2C_SMBUS_READ) { msg[1].len = data->block[0]; i2c_smbus_try_get_dmabuf(&msg[1], 0); } else { msg[0].len = data->block[0] + 1; i2c_smbus_try_get_dmabuf(&msg[0], command); memcpy(msg[0].buf + 1, data->block + 1, data->block[0]); } break; default: dev_err(&adapter->dev, "Unsupported transaction %d\n", size); return -EOPNOTSUPP; } if (wants_pec) { /* Compute PEC if first message is a write */ if (!(msg[0].flags & I2C_M_RD)) { if (nmsgs == 1) /* Write only */ i2c_smbus_add_pec(&msg[0]); else /* Write followed by read */ partial_pec = i2c_smbus_msg_pec(0, &msg[0]); } /* Ask for PEC if last message is a read */ if (msg[nmsgs - 1].flags & I2C_M_RD) msg[nmsgs - 1].len++; } status = __i2c_transfer(adapter, msg, nmsgs); if (status < 0) goto cleanup; if (status != nmsgs) { status = -EIO; goto cleanup; } status = 0; /* Check PEC if last message is a read */ if (wants_pec && (msg[nmsgs - 1].flags & I2C_M_RD)) { status = i2c_smbus_check_pec(partial_pec, &msg[nmsgs - 1]); if (status < 0) goto cleanup; } if (read_write == I2C_SMBUS_READ) switch (size) { case I2C_SMBUS_BYTE: data->byte = msgbuf0[0]; break; case I2C_SMBUS_BYTE_DATA: data->byte = msgbuf1[0]; break; case I2C_SMBUS_WORD_DATA: case I2C_SMBUS_PROC_CALL: data->word = msgbuf1[0] | (msgbuf1[1] << 8); break; case I2C_SMBUS_I2C_BLOCK_DATA: memcpy(data->block + 1, msg[1].buf, data->block[0]); break; case I2C_SMBUS_BLOCK_DATA: case I2C_SMBUS_BLOCK_PROC_CALL: if (msg[1].buf[0] > I2C_SMBUS_BLOCK_MAX) { dev_err(&adapter->dev, "Invalid block size returned: %d\n", msg[1].buf[0]); status = -EPROTO; goto cleanup; } memcpy(data->block, msg[1].buf, msg[1].buf[0] + 1); break; } cleanup: if (msg[0].flags & I2C_M_DMA_SAFE) kfree(msg[0].buf); if (msg[1].flags & I2C_M_DMA_SAFE) kfree(msg[1].buf); return status; } 
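/*
 * Illustrative sketch, not part of this file: for an SMBus "read byte data"
 * transaction, the emulation above boils down to a command-byte write
 * followed by a one-byte read, issued as a single combined I2C transfer
 * (PEC handling and DMA-safe buffers are omitted here for brevity). The
 * example_* name is hypothetical.
 */
static int example_emulated_read_byte_data(struct i2c_adapter *adap, u16 addr,
					   u8 command, u8 *value)
{
	struct i2c_msg msgs[2] = {
		{ .addr = addr, .flags = 0,        .len = 1, .buf = &command },
		{ .addr = addr, .flags = I2C_M_RD, .len = 1, .buf = value },
	};
	int ret = i2c_transfer(adap, msgs, ARRAY_SIZE(msgs));

	if (ret < 0)
		return ret;
	return ret == ARRAY_SIZE(msgs) ? 0 : -EIO;
}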
/** * i2c_smbus_xfer - execute SMBus protocol operations * @adapter: Handle to I2C bus * @addr: Address of SMBus slave on that bus * @flags: I2C_CLIENT_* flags (usually zero or I2C_CLIENT_PEC) * @read_write: I2C_SMBUS_READ or I2C_SMBUS_WRITE * @command: Byte interpreted by slave, for protocols which use such bytes * @protocol: SMBus protocol operation to execute, such as I2C_SMBUS_PROC_CALL * @data: Data to be read or written * * This executes an SMBus protocol operation, and returns a negative * errno code else zero on success. */ s32 i2c_smbus_xfer(struct i2c_adapter *adapter, u16 addr, unsigned short flags, char read_write, u8 command, int protocol, union i2c_smbus_data *data) { s32 res; res = __i2c_lock_bus_helper(adapter); if (res) return res; res = __i2c_smbus_xfer(adapter, addr, flags, read_write, command, protocol, data); i2c_unlock_bus(adapter, I2C_LOCK_SEGMENT); return res; } EXPORT_SYMBOL(i2c_smbus_xfer); s32 __i2c_smbus_xfer(struct i2c_adapter *adapter, u16 addr, unsigned short flags, char read_write, u8 command, int protocol, union i2c_smbus_data *data) { int (*xfer_func)(struct i2c_adapter *adap, u16 addr, unsigned short flags, char read_write, u8 command, int size, union i2c_smbus_data *data); unsigned long orig_jiffies; int try; s32 res; res = __i2c_check_suspended(adapter); if (res) return res; /* If enabled, the following two tracepoints are conditional on * read_write and protocol. */ trace_smbus_write(adapter, addr, flags, read_write, command, protocol, data); trace_smbus_read(adapter, addr, flags, read_write, command, protocol); flags &= I2C_M_TEN | I2C_CLIENT_PEC | I2C_CLIENT_SCCB; xfer_func = adapter->algo->smbus_xfer; if (i2c_in_atomic_xfer_mode()) { if (adapter->algo->smbus_xfer_atomic) xfer_func = adapter->algo->smbus_xfer_atomic; else if (adapter->algo->master_xfer_atomic) xfer_func = NULL; /* fallback to I2C emulation */ } if (xfer_func) { /* Retry automatically on arbitration loss */ orig_jiffies = jiffies; for (res = 0, try = 0; try <= adapter->retries; try++) { res = xfer_func(adapter, addr, flags, read_write, command, protocol, data); if (res != -EAGAIN) break; if (time_after(jiffies, orig_jiffies + adapter->timeout)) break; } if (res != -EOPNOTSUPP || !adapter->algo->master_xfer) goto trace; /* * Fall back to i2c_smbus_xfer_emulated if the adapter doesn't * implement native support for the SMBus operation. */ } res = i2c_smbus_xfer_emulated(adapter, addr, flags, read_write, command, protocol, data); trace: /* If enabled, the reply tracepoint is conditional on read_write. */ trace_smbus_reply(adapter, addr, flags, read_write, command, protocol, data, res); trace_smbus_result(adapter, addr, flags, read_write, command, protocol, res); return res; } EXPORT_SYMBOL(__i2c_smbus_xfer); /** * i2c_smbus_read_i2c_block_data_or_emulated - read block or emulate * @client: Handle to slave device * @command: Byte interpreted by slave * @length: Size of data block; SMBus allows at most I2C_SMBUS_BLOCK_MAX bytes * @values: Byte array into which data will be read; big enough to hold * the data returned by the slave. SMBus allows at most * I2C_SMBUS_BLOCK_MAX bytes. * * This executes the SMBus "block read" protocol if supported by the adapter. * If block read is not supported, it emulates it using either word or byte * read protocols depending on availability. * * The addresses of the I2C slave device that are accessed with this function * must be mapped to a linear region, so that a block read will have the same * effect as a byte read. 
Before using this function you must double-check * if the I2C slave does support exchanging a block transfer with a byte * transfer. */ s32 i2c_smbus_read_i2c_block_data_or_emulated(const struct i2c_client *client, u8 command, u8 length, u8 *values) { u8 i = 0; int status; if (length > I2C_SMBUS_BLOCK_MAX) length = I2C_SMBUS_BLOCK_MAX; if (i2c_check_functionality(client->adapter, I2C_FUNC_SMBUS_READ_I2C_BLOCK)) return i2c_smbus_read_i2c_block_data(client, command, length, values); if (!i2c_check_functionality(client->adapter, I2C_FUNC_SMBUS_READ_BYTE_DATA)) return -EOPNOTSUPP; if (i2c_check_functionality(client->adapter, I2C_FUNC_SMBUS_READ_WORD_DATA)) { while ((i + 2) <= length) { status = i2c_smbus_read_word_data(client, command + i); if (status < 0) return status; values[i] = status & 0xff; values[i + 1] = status >> 8; i += 2; } } while (i < length) { status = i2c_smbus_read_byte_data(client, command + i); if (status < 0) return status; values[i] = status; i++; } return i; } EXPORT_SYMBOL(i2c_smbus_read_i2c_block_data_or_emulated); /** * i2c_new_smbus_alert_device - get ara client for SMBus alert support * @adapter: the target adapter * @setup: setup data for the SMBus alert handler * Context: can sleep * * Setup handling of the SMBus alert protocol on a given I2C bus segment. * * Handling can be done either through our IRQ handler, or by the * adapter (from its handler, periodic polling, or whatever). * * This returns the ara client, which should be saved for later use with * i2c_handle_smbus_alert() and ultimately i2c_unregister_device(); or an * ERRPTR to indicate an error. */ struct i2c_client *i2c_new_smbus_alert_device(struct i2c_adapter *adapter, struct i2c_smbus_alert_setup *setup) { struct i2c_board_info ara_board_info = { I2C_BOARD_INFO("smbus_alert", 0x0c), .platform_data = setup, }; return i2c_new_client_device(adapter, &ara_board_info); } EXPORT_SYMBOL_GPL(i2c_new_smbus_alert_device); #if IS_ENABLED(CONFIG_I2C_SMBUS) int i2c_setup_smbus_alert(struct i2c_adapter *adapter) { struct device *parent = adapter->dev.parent; int irq; /* Adapter instantiated without parent, skip the SMBus alert setup */ if (!parent) return 0; /* Report serious errors */ irq = device_property_match_string(parent, "interrupt-names", "smbus_alert"); if (irq < 0 && irq != -EINVAL && irq != -ENODATA) return irq; /* Skip setup when no irq was found */ if (irq < 0 && !device_property_present(parent, "smbalert-gpios")) return 0; return PTR_ERR_OR_ZERO(i2c_new_smbus_alert_device(adapter, NULL)); } #endif
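/*
 * Illustrative sketch, not part of this file: the PEC handled above is a
 * plain CRC-8 with polynomial x^8 + x^2 + x + 1 (0x07), no reflection and no
 * final XOR, computed over every byte on the wire including the address
 * byte(s). The bit-at-a-time crc8() above is equivalent to this byte-wise
 * form; for an SMBus "write byte data" to a 7-bit address the PEC covers
 * { (addr << 1) | 0, command, value }, which is what i2c_smbus_msg_pec()
 * accumulates from the outgoing i2c_msg. The example_* name is hypothetical.
 */
static u8 example_smbus_pec(u8 crc, const u8 *buf, size_t len)
{
	size_t i;
	int bit;

	for (i = 0; i < len; i++) {
		crc ^= buf[i];
		for (bit = 0; bit < 8; bit++)
			crc = (crc & 0x80) ? (crc << 1) ^ 0x07 : crc << 1;
	}
	return crc;
}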
/* SPDX-License-Identifier: GPL-2.0 */ /* * Portions Copyright (C) 1992 Drew Eckhardt */ #ifndef _LINUX_BLKDEV_H #define _LINUX_BLKDEV_H #include <linux/types.h> #include <linux/blk_types.h> #include <linux/device.h> #include <linux/list.h> #include <linux/llist.h> #include <linux/minmax.h> #include <linux/timer.h> #include <linux/workqueue.h> #include <linux/wait.h> #include <linux/bio.h> #include <linux/gfp.h> #include <linux/kdev_t.h> #include <linux/rcupdate.h> #include <linux/percpu-refcount.h> #include <linux/blkzoned.h> #include <linux/sched.h> #include <linux/sbitmap.h> #include <linux/uuid.h> #include <linux/xarray.h> #include <linux/file.h> #include <linux/lockdep.h> struct module; struct request_queue; struct elevator_queue; struct blk_trace; struct request; struct sg_io_hdr; struct blkcg_gq; struct blk_flush_queue; struct kiocb; struct pr_ops; struct rq_qos; struct blk_queue_stats; struct blk_stat_callback; struct blk_crypto_profile; extern const struct device_type disk_type; extern const struct device_type part_type; extern const struct class block_class; /* * Maximum number of blkcg policies allowed to be registered concurrently. * Defined here to simplify include dependency. */ #define BLKCG_MAX_POLS 6 #define DISK_MAX_PARTS 256 #define DISK_NAME_LEN 32 #define PARTITION_META_INFO_VOLNAMELTH 64 /* * Enough for the string representation of any kind of UUID plus NULL. * EFI UUID is 36 characters. MSDOS UUID is 11 characters. */ #define PARTITION_META_INFO_UUIDLTH (UUID_STRING_LEN + 1) struct partition_meta_info { char uuid[PARTITION_META_INFO_UUIDLTH]; u8 volname[PARTITION_META_INFO_VOLNAMELTH]; }; /** * DOC: genhd capability flags * * ``GENHD_FL_REMOVABLE``: indicates that the block device gives access to * removable media. When set, the device remains present even when media is not * inserted. Shall not be set for devices which are removed entirely when the * media is removed. * * ``GENHD_FL_HIDDEN``: the block device is hidden; it doesn't produce events, * doesn't appear in sysfs, and can't be opened from userspace or using * blkdev_get*. Used for the underlying components of multipath devices. * * ``GENHD_FL_NO_PART``: partition support is disabled. The kernel will not * scan for partitions from add_disk, and users can't add partitions manually.
* */ enum { GENHD_FL_REMOVABLE = 1 << 0, GENHD_FL_HIDDEN = 1 << 1, GENHD_FL_NO_PART = 1 << 2, }; enum { DISK_EVENT_MEDIA_CHANGE = 1 << 0, /* media changed */ DISK_EVENT_EJECT_REQUEST = 1 << 1, /* eject requested */ }; enum { /* Poll even if events_poll_msecs is unset */ DISK_EVENT_FLAG_POLL = 1 << 0, /* Forward events to udev */ DISK_EVENT_FLAG_UEVENT = 1 << 1, /* Block event polling when open for exclusive write */ DISK_EVENT_FLAG_BLOCK_ON_EXCL_WRITE = 1 << 2, }; struct disk_events; struct badblocks; enum blk_integrity_checksum { BLK_INTEGRITY_CSUM_NONE = 0, BLK_INTEGRITY_CSUM_IP = 1, BLK_INTEGRITY_CSUM_CRC = 2, BLK_INTEGRITY_CSUM_CRC64 = 3, } __packed ; struct blk_integrity { unsigned char flags; enum blk_integrity_checksum csum_type; unsigned char tuple_size; unsigned char pi_offset; unsigned char interval_exp; unsigned char tag_size; }; typedef unsigned int __bitwise blk_mode_t; /* open for reading */ #define BLK_OPEN_READ ((__force blk_mode_t)(1 << 0)) /* open for writing */ #define BLK_OPEN_WRITE ((__force blk_mode_t)(1 << 1)) /* open exclusively (vs other exclusive openers */ #define BLK_OPEN_EXCL ((__force blk_mode_t)(1 << 2)) /* opened with O_NDELAY */ #define BLK_OPEN_NDELAY ((__force blk_mode_t)(1 << 3)) /* open for "writes" only for ioctls (specialy hack for floppy.c) */ #define BLK_OPEN_WRITE_IOCTL ((__force blk_mode_t)(1 << 4)) /* open is exclusive wrt all other BLK_OPEN_WRITE opens to the device */ #define BLK_OPEN_RESTRICT_WRITES ((__force blk_mode_t)(1 << 5)) /* return partition scanning errors */ #define BLK_OPEN_STRICT_SCAN ((__force blk_mode_t)(1 << 6)) struct gendisk { /* * major/first_minor/minors should not be set by any new driver, the * block core will take care of allocating them automatically. */ int major; int first_minor; int minors; char disk_name[DISK_NAME_LEN]; /* name of major driver */ unsigned short events; /* supported events */ unsigned short event_flags; /* flags related to event processing */ struct xarray part_tbl; struct block_device *part0; const struct block_device_operations *fops; struct request_queue *queue; void *private_data; struct bio_set bio_split; int flags; unsigned long state; #define GD_NEED_PART_SCAN 0 #define GD_READ_ONLY 1 #define GD_DEAD 2 #define GD_NATIVE_CAPACITY 3 #define GD_ADDED 4 #define GD_SUPPRESS_PART_SCAN 5 #define GD_OWNS_QUEUE 6 struct mutex open_mutex; /* open/close mutex */ unsigned open_partitions; /* number of open partitions */ struct backing_dev_info *bdi; struct kobject queue_kobj; /* the queue/ directory */ struct kobject *slave_dir; #ifdef CONFIG_BLOCK_HOLDER_DEPRECATED struct list_head slave_bdevs; #endif struct timer_rand_state *random; atomic_t sync_io; /* RAID */ struct disk_events *ev; #ifdef CONFIG_BLK_DEV_ZONED /* * Zoned block device information. Reads of this information must be * protected with blk_queue_enter() / blk_queue_exit(). Modifying this * information is only allowed while no requests are being processed. * See also blk_mq_freeze_queue() and blk_mq_unfreeze_queue(). 
*/ unsigned int nr_zones; unsigned int zone_capacity; unsigned int last_zone_capacity; unsigned long __rcu *conv_zones_bitmap; unsigned int zone_wplugs_hash_bits; spinlock_t zone_wplugs_lock; struct mempool_s *zone_wplugs_pool; struct hlist_head *zone_wplugs_hash; struct workqueue_struct *zone_wplugs_wq; #endif /* CONFIG_BLK_DEV_ZONED */ #if IS_ENABLED(CONFIG_CDROM) struct cdrom_device_info *cdi; #endif int node_id; struct badblocks *bb; struct lockdep_map lockdep_map; u64 diskseq; blk_mode_t open_mode; /* * Independent sector access ranges. This is always NULL for * devices that do not have multiple independent access ranges. */ struct blk_independent_access_ranges *ia_ranges; }; /** * disk_openers - returns how many openers are there for a disk * @disk: disk to check * * This returns the number of openers for a disk. Note that this value is only * stable if disk->open_mutex is held. * * Note: Due to a quirk in the block layer open code, each open partition is * only counted once even if there are multiple openers. */ static inline unsigned int disk_openers(struct gendisk *disk) { return atomic_read(&disk->part0->bd_openers); } /** * disk_has_partscan - return %true if partition scanning is enabled on a disk * @disk: disk to check * * Returns %true if partitions scanning is enabled for @disk, or %false if * partition scanning is disabled either permanently or temporarily. */ static inline bool disk_has_partscan(struct gendisk *disk) { return !(disk->flags & (GENHD_FL_NO_PART | GENHD_FL_HIDDEN)) && !test_bit(GD_SUPPRESS_PART_SCAN, &disk->state); } /* * The gendisk is refcounted by the part0 block_device, and the bd_device * therein is also used for device model presentation in sysfs. */ #define dev_to_disk(device) \ (dev_to_bdev(device)->bd_disk) #define disk_to_dev(disk) \ (&((disk)->part0->bd_device)) #if IS_REACHABLE(CONFIG_CDROM) #define disk_to_cdi(disk) ((disk)->cdi) #else #define disk_to_cdi(disk) NULL #endif static inline dev_t disk_devt(struct gendisk *disk) { return MKDEV(disk->major, disk->first_minor); } /* blk_validate_limits() validates bsize, so drivers don't usually need to */ static inline int blk_validate_block_size(unsigned long bsize) { if (bsize < 512 || bsize > PAGE_SIZE || !is_power_of_2(bsize)) return -EINVAL; return 0; } static inline bool blk_op_is_passthrough(blk_opf_t op) { op &= REQ_OP_MASK; return op == REQ_OP_DRV_IN || op == REQ_OP_DRV_OUT; } /* flags set by the driver in queue_limits.features */ typedef unsigned int __bitwise blk_features_t; /* supports a volatile write cache */ #define BLK_FEAT_WRITE_CACHE ((__force blk_features_t)(1u << 0)) /* supports passing on the FUA bit */ #define BLK_FEAT_FUA ((__force blk_features_t)(1u << 1)) /* rotational device (hard drive or floppy) */ #define BLK_FEAT_ROTATIONAL ((__force blk_features_t)(1u << 2)) /* contributes to the random number pool */ #define BLK_FEAT_ADD_RANDOM ((__force blk_features_t)(1u << 3)) /* do disk/partitions IO accounting */ #define BLK_FEAT_IO_STAT ((__force blk_features_t)(1u << 4)) /* don't modify data until writeback is done */ #define BLK_FEAT_STABLE_WRITES ((__force blk_features_t)(1u << 5)) /* always completes in submit context */ #define BLK_FEAT_SYNCHRONOUS ((__force blk_features_t)(1u << 6)) /* supports REQ_NOWAIT */ #define BLK_FEAT_NOWAIT ((__force blk_features_t)(1u << 7)) /* supports DAX */ #define BLK_FEAT_DAX ((__force blk_features_t)(1u << 8)) /* supports I/O polling */ #define BLK_FEAT_POLL ((__force blk_features_t)(1u << 9)) /* is a zoned device */ #define BLK_FEAT_ZONED 
((__force blk_features_t)(1u << 10)) /* supports PCI(e) p2p requests */ #define BLK_FEAT_PCI_P2PDMA ((__force blk_features_t)(1u << 12)) /* skip this queue in blk_mq_(un)quiesce_tagset */ #define BLK_FEAT_SKIP_TAGSET_QUIESCE ((__force blk_features_t)(1u << 13)) /* bounce all highmem pages */ #define BLK_FEAT_BOUNCE_HIGH ((__force blk_features_t)(1u << 14)) /* undocumented magic for bcache */ #define BLK_FEAT_RAID_PARTIAL_STRIPES_EXPENSIVE \ ((__force blk_features_t)(1u << 15)) /* atomic writes enabled */ #define BLK_FEAT_ATOMIC_WRITES \ ((__force blk_features_t)(1u << 16)) /* * Flags automatically inherited when stacking limits. */ #define BLK_FEAT_INHERIT_MASK \ (BLK_FEAT_WRITE_CACHE | BLK_FEAT_FUA | BLK_FEAT_ROTATIONAL | \ BLK_FEAT_STABLE_WRITES | BLK_FEAT_ZONED | BLK_FEAT_BOUNCE_HIGH | \ BLK_FEAT_RAID_PARTIAL_STRIPES_EXPENSIVE) /* internal flags in queue_limits.flags */ typedef unsigned int __bitwise blk_flags_t; /* do not send FLUSH/FUA commands despite advertising a write cache */ #define BLK_FLAG_WRITE_CACHE_DISABLED ((__force blk_flags_t)(1u << 0)) /* I/O topology is misaligned */ #define BLK_FLAG_MISALIGNED ((__force blk_flags_t)(1u << 1)) /* passthrough command IO accounting */ #define BLK_FLAG_IOSTATS_PASSTHROUGH ((__force blk_flags_t)(1u << 2)) struct queue_limits { blk_features_t features; blk_flags_t flags; unsigned long seg_boundary_mask; unsigned long virt_boundary_mask; unsigned int max_hw_sectors; unsigned int max_dev_sectors; unsigned int chunk_sectors; unsigned int max_sectors; unsigned int max_user_sectors; unsigned int max_segment_size; unsigned int physical_block_size; unsigned int logical_block_size; unsigned int alignment_offset; unsigned int io_min; unsigned int io_opt; unsigned int max_discard_sectors; unsigned int max_hw_discard_sectors; unsigned int max_user_discard_sectors; unsigned int max_secure_erase_sectors; unsigned int max_write_zeroes_sectors; unsigned int max_hw_zone_append_sectors; unsigned int max_zone_append_sectors; unsigned int discard_granularity; unsigned int discard_alignment; unsigned int zone_write_granularity; /* atomic write limits */ unsigned int atomic_write_hw_max; unsigned int atomic_write_max_sectors; unsigned int atomic_write_hw_boundary; unsigned int atomic_write_boundary_sectors; unsigned int atomic_write_hw_unit_min; unsigned int atomic_write_unit_min; unsigned int atomic_write_hw_unit_max; unsigned int atomic_write_unit_max; unsigned short max_segments; unsigned short max_integrity_segments; unsigned short max_discard_segments; unsigned int max_open_zones; unsigned int max_active_zones; /* * Drivers that set dma_alignment to less than 511 must be prepared to * handle individual bvec's that are not a multiple of a SECTOR_SIZE * due to possible offsets. */ unsigned int dma_alignment; unsigned int dma_pad_mask; struct blk_integrity integrity; }; typedef int (*report_zones_cb)(struct blk_zone *zone, unsigned int idx, void *data); #define BLK_ALL_ZONES ((unsigned int)-1) int blkdev_report_zones(struct block_device *bdev, sector_t sector, unsigned int nr_zones, report_zones_cb cb, void *data); int blkdev_zone_mgmt(struct block_device *bdev, enum req_op op, sector_t sectors, sector_t nr_sectors); int blk_revalidate_disk_zones(struct gendisk *disk); /* * Independent access ranges: struct blk_independent_access_range describes * a range of contiguous sectors that can be accessed using device command * execution resources that are independent from the resources used for * other access ranges. 
This is typically found with single-LUN multi-actuator * HDDs where each access range is served by a different set of heads. * The set of independent ranges supported by the device is defined using * struct blk_independent_access_ranges. The independent ranges must not overlap * and must include all sectors within the disk capacity (no sector holes * allowed). * For a device with multiple ranges, requests targeting sectors in different * ranges can be executed in parallel. A request can straddle an access range * boundary. */ struct blk_independent_access_range { struct kobject kobj; sector_t sector; sector_t nr_sectors; }; struct blk_independent_access_ranges { struct kobject kobj; bool sysfs_registered; unsigned int nr_ia_ranges; struct blk_independent_access_range ia_range[]; }; struct request_queue { /* * The queue owner gets to use this for whatever they like. * ll_rw_blk doesn't touch it. */ void *queuedata; struct elevator_queue *elevator; const struct blk_mq_ops *mq_ops; /* sw queues */ struct blk_mq_ctx __percpu *queue_ctx; /* * various queue flags, see QUEUE_* below */ unsigned long queue_flags; unsigned int rq_timeout; unsigned int queue_depth; refcount_t refs; /* hw dispatch queues */ unsigned int nr_hw_queues; struct xarray hctx_table; struct percpu_ref q_usage_counter; struct lock_class_key io_lock_cls_key; struct lockdep_map io_lockdep_map; struct lock_class_key q_lock_cls_key; struct lockdep_map q_lockdep_map; struct request *last_merge; spinlock_t queue_lock; int quiesce_depth; struct gendisk *disk; /* * mq queue kobject */ struct kobject *mq_kobj; struct queue_limits limits; #ifdef CONFIG_PM struct device *dev; enum rpm_status rpm_status; #endif /* * Number of contexts that have called blk_set_pm_only(). If this * counter is above zero then only RQF_PM requests are processed. */ atomic_t pm_only; struct blk_queue_stats *stats; struct rq_qos *rq_qos; struct mutex rq_qos_mutex; /* * ida allocated id for this queue. Used to index queues from * ioctx. */ int id; /* * queue settings */ unsigned long nr_requests; /* Max # of requests */ #ifdef CONFIG_BLK_INLINE_ENCRYPTION struct blk_crypto_profile *crypto_profile; struct kobject *crypto_kobject; #endif struct timer_list timeout; struct work_struct timeout_work; atomic_t nr_active_requests_shared_tags; struct blk_mq_tags *sched_shared_tags; struct list_head icq_list; #ifdef CONFIG_BLK_CGROUP DECLARE_BITMAP (blkcg_pols, BLKCG_MAX_POLS); struct blkcg_gq *root_blkg; struct list_head blkg_list; struct mutex blkcg_mutex; #endif int node; spinlock_t requeue_lock; struct list_head requeue_list; struct delayed_work requeue_work; #ifdef CONFIG_BLK_DEV_IO_TRACE struct blk_trace __rcu *blk_trace; #endif /* * for flush operations */ struct blk_flush_queue *fq; struct list_head flush_list; struct mutex sysfs_lock; struct mutex limits_lock; /* * for reusing dead hctx instance in case of updating * nr_hw_queues */ struct list_head unused_hctx_list; spinlock_t unused_hctx_lock; int mq_freeze_depth; #ifdef CONFIG_BLK_DEV_THROTTLING /* Throttle data */ struct throtl_data *td; #endif struct rcu_head rcu_head; #ifdef CONFIG_LOCKDEP struct task_struct *mq_freeze_owner; int mq_freeze_owner_depth; /* * Records disk & queue state in current context, used in unfreeze * queue */ bool mq_freeze_disk_dead; bool mq_freeze_queue_dying; #endif wait_queue_head_t mq_freeze_wq; /* * Protect concurrent access to q_usage_counter by * percpu_ref_kill() and percpu_ref_reinit(). 
*/ struct mutex mq_freeze_lock; struct blk_mq_tag_set *tag_set; struct list_head tag_set_list; struct dentry *debugfs_dir; struct dentry *sched_debugfs_dir; struct dentry *rqos_debugfs_dir; /* * Serializes all debugfs metadata operations using the above dentries. */ struct mutex debugfs_mutex; }; /* Keep blk_queue_flag_name[] in sync with the definitions below */ enum { QUEUE_FLAG_DYING, /* queue being torn down */ QUEUE_FLAG_NOMERGES, /* disable merge attempts */ QUEUE_FLAG_SAME_COMP, /* complete on same CPU-group */ QUEUE_FLAG_FAIL_IO, /* fake timeout */ QUEUE_FLAG_NOXMERGES, /* No extended merges */ QUEUE_FLAG_SAME_FORCE, /* force complete on same CPU */ QUEUE_FLAG_INIT_DONE, /* queue is initialized */ QUEUE_FLAG_STATS, /* track IO start and completion times */ QUEUE_FLAG_REGISTERED, /* queue has been registered to a disk */ QUEUE_FLAG_QUIESCED, /* queue has been quiesced */ QUEUE_FLAG_RQ_ALLOC_TIME, /* record rq->alloc_time_ns */ QUEUE_FLAG_HCTX_ACTIVE, /* at least one blk-mq hctx is active */ QUEUE_FLAG_SQ_SCHED, /* single queue style io dispatch */ QUEUE_FLAG_MAX }; #define QUEUE_FLAG_MQ_DEFAULT (1UL << QUEUE_FLAG_SAME_COMP) void blk_queue_flag_set(unsigned int flag, struct request_queue *q); void blk_queue_flag_clear(unsigned int flag, struct request_queue *q); #define blk_queue_dying(q) test_bit(QUEUE_FLAG_DYING, &(q)->queue_flags) #define blk_queue_init_done(q) test_bit(QUEUE_FLAG_INIT_DONE, &(q)->queue_flags) #define blk_queue_nomerges(q) test_bit(QUEUE_FLAG_NOMERGES, &(q)->queue_flags) #define blk_queue_noxmerges(q) \ test_bit(QUEUE_FLAG_NOXMERGES, &(q)->queue_flags) #define blk_queue_nonrot(q) (!((q)->limits.features & BLK_FEAT_ROTATIONAL)) #define blk_queue_io_stat(q) ((q)->limits.features & BLK_FEAT_IO_STAT) #define blk_queue_passthrough_stat(q) \ ((q)->limits.flags & BLK_FLAG_IOSTATS_PASSTHROUGH) #define blk_queue_dax(q) ((q)->limits.features & BLK_FEAT_DAX) #define blk_queue_pci_p2pdma(q) ((q)->limits.features & BLK_FEAT_PCI_P2PDMA) #ifdef CONFIG_BLK_RQ_ALLOC_TIME #define blk_queue_rq_alloc_time(q) \ test_bit(QUEUE_FLAG_RQ_ALLOC_TIME, &(q)->queue_flags) #else #define blk_queue_rq_alloc_time(q) false #endif #define blk_noretry_request(rq) \ ((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \ REQ_FAILFAST_DRIVER)) #define blk_queue_quiesced(q) test_bit(QUEUE_FLAG_QUIESCED, &(q)->queue_flags) #define blk_queue_pm_only(q) atomic_read(&(q)->pm_only) #define blk_queue_registered(q) test_bit(QUEUE_FLAG_REGISTERED, &(q)->queue_flags) #define blk_queue_sq_sched(q) test_bit(QUEUE_FLAG_SQ_SCHED, &(q)->queue_flags) #define blk_queue_skip_tagset_quiesce(q) \ ((q)->limits.features & BLK_FEAT_SKIP_TAGSET_QUIESCE) extern void blk_set_pm_only(struct request_queue *q); extern void blk_clear_pm_only(struct request_queue *q); #define list_entry_rq(ptr) list_entry((ptr), struct request, queuelist) #define dma_map_bvec(dev, bv, dir, attrs) \ dma_map_page_attrs(dev, (bv)->bv_page, (bv)->bv_offset, (bv)->bv_len, \ (dir), (attrs)) static inline bool queue_is_mq(struct request_queue *q) { return q->mq_ops; } #ifdef CONFIG_PM static inline enum rpm_status queue_rpm_status(struct request_queue *q) { return q->rpm_status; } #else static inline enum rpm_status queue_rpm_status(struct request_queue *q) { return RPM_ACTIVE; } #endif static inline bool blk_queue_is_zoned(struct request_queue *q) { return IS_ENABLED(CONFIG_BLK_DEV_ZONED) && (q->limits.features & BLK_FEAT_ZONED); } #ifdef CONFIG_BLK_DEV_ZONED static inline unsigned int disk_nr_zones(struct gendisk *disk) { return disk->nr_zones; } 
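/*
 * Illustrative sketch, not part of this header: how a caller might use the
 * report_zones_cb / blkdev_report_zones() interface declared above, here to
 * count the conventional zones of a zoned block device. The example_* names
 * are hypothetical.
 */
static inline int example_conv_zone_cb(struct blk_zone *zone, unsigned int idx,
				       void *data)
{
	unsigned int *nr_conv = data;

	/* One callback invocation per reported zone. */
	if (zone->type == BLK_ZONE_TYPE_CONVENTIONAL)
		(*nr_conv)++;
	return 0;
}

static inline int example_count_conv_zones(struct block_device *bdev,
					   unsigned int *nr_conv)
{
	int ret;

	*nr_conv = 0;
	ret = blkdev_report_zones(bdev, 0, BLK_ALL_ZONES,
				  example_conv_zone_cb, nr_conv);
	/* blkdev_report_zones() returns the number of zones reported or a negative errno. */
	return ret < 0 ? ret : 0;
}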
bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs); #else /* CONFIG_BLK_DEV_ZONED */ static inline unsigned int disk_nr_zones(struct gendisk *disk) { return 0; } static inline bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs) { return false; } #endif /* CONFIG_BLK_DEV_ZONED */ static inline unsigned int disk_zone_no(struct gendisk *disk, sector_t sector) { if (!blk_queue_is_zoned(disk->queue)) return 0; return sector >> ilog2(disk->queue->limits.chunk_sectors); } static inline unsigned int bdev_nr_zones(struct block_device *bdev) { return disk_nr_zones(bdev->bd_disk); } static inline unsigned int bdev_max_open_zones(struct block_device *bdev) { return bdev->bd_disk->queue->limits.max_open_zones; } static inline unsigned int bdev_max_active_zones(struct block_device *bdev) { return bdev->bd_disk->queue->limits.max_active_zones; } static inline unsigned int blk_queue_depth(struct request_queue *q) { if (q->queue_depth) return q->queue_depth; return q->nr_requests; } /* * default timeout for SG_IO if none specified */ #define BLK_DEFAULT_SG_TIMEOUT (60 * HZ) #define BLK_MIN_SG_TIMEOUT (7 * HZ) /* This should not be used directly - use rq_for_each_segment */ #define for_each_bio(_bio) \ for (; _bio; _bio = _bio->bi_next) int __must_check add_disk_fwnode(struct device *parent, struct gendisk *disk, const struct attribute_group **groups, struct fwnode_handle *fwnode); int __must_check device_add_disk(struct device *parent, struct gendisk *disk, const struct attribute_group **groups); static inline int __must_check add_disk(struct gendisk *disk) { return device_add_disk(NULL, disk, NULL); } void del_gendisk(struct gendisk *gp); void invalidate_disk(struct gendisk *disk); void set_disk_ro(struct gendisk *disk, bool read_only); void disk_uevent(struct gendisk *disk, enum kobject_action action); static inline u8 bdev_partno(const struct block_device *bdev) { return atomic_read(&bdev->__bd_flags) & BD_PARTNO; } static inline bool bdev_test_flag(const struct block_device *bdev, unsigned flag) { return atomic_read(&bdev->__bd_flags) & flag; } static inline void bdev_set_flag(struct block_device *bdev, unsigned flag) { atomic_or(flag, &bdev->__bd_flags); } static inline void bdev_clear_flag(struct block_device *bdev, unsigned flag) { atomic_andnot(flag, &bdev->__bd_flags); } static inline bool get_disk_ro(struct gendisk *disk) { return bdev_test_flag(disk->part0, BD_READ_ONLY) || test_bit(GD_READ_ONLY, &disk->state); } static inline bool bdev_read_only(struct block_device *bdev) { return bdev_test_flag(bdev, BD_READ_ONLY) || get_disk_ro(bdev->bd_disk); } bool set_capacity_and_notify(struct gendisk *disk, sector_t size); void disk_force_media_change(struct gendisk *disk); void bdev_mark_dead(struct block_device *bdev, bool surprise); void add_disk_randomness(struct gendisk *disk) __latent_entropy; void rand_initialize_disk(struct gendisk *disk); static inline sector_t get_start_sect(struct block_device *bdev) { return bdev->bd_start_sect; } static inline sector_t bdev_nr_sectors(struct block_device *bdev) { return bdev->bd_nr_sectors; } static inline loff_t bdev_nr_bytes(struct block_device *bdev) { return (loff_t)bdev_nr_sectors(bdev) << SECTOR_SHIFT; } static inline sector_t get_capacity(struct gendisk *disk) { return bdev_nr_sectors(disk->part0); } static inline u64 sb_bdev_nr_blocks(struct super_block *sb) { return bdev_nr_sectors(sb->s_bdev) >> (sb->s_blocksize_bits - SECTOR_SHIFT); } int bdev_disk_changed(struct gendisk *disk, bool invalidate); void put_disk(struct gendisk 
*disk); struct gendisk *__blk_alloc_disk(struct queue_limits *lim, int node, struct lock_class_key *lkclass); /** * blk_alloc_disk - allocate a gendisk structure * @lim: queue limits to be used for this disk. * @node_id: numa node to allocate on * * Allocate and pre-initialize a gendisk structure for use with BIO based * drivers. * * Returns an ERR_PTR on error, else the allocated disk. * * Context: can sleep */ #define blk_alloc_disk(lim, node_id) \ ({ \ static struct lock_class_key __key; \ \ __blk_alloc_disk(lim, node_id, &__key); \ }) int __register_blkdev(unsigned int major, const char *name, void (*probe)(dev_t devt)); #define register_blkdev(major, name) \ __register_blkdev(major, name, NULL) void unregister_blkdev(unsigned int major, const char *name); bool disk_check_media_change(struct gendisk *disk); void set_capacity(struct gendisk *disk, sector_t size); #ifdef CONFIG_BLOCK_HOLDER_DEPRECATED int bd_link_disk_holder(struct block_device *bdev, struct gendisk *disk); void bd_unlink_disk_holder(struct block_device *bdev, struct gendisk *disk); #else static inline int bd_link_disk_holder(struct block_device *bdev, struct gendisk *disk) { return 0; } static inline void bd_unlink_disk_holder(struct block_device *bdev, struct gendisk *disk) { } #endif /* CONFIG_BLOCK_HOLDER_DEPRECATED */ dev_t part_devt(struct gendisk *disk, u8 partno); void inc_diskseq(struct gendisk *disk); void blk_request_module(dev_t devt); extern int blk_register_queue(struct gendisk *disk); extern void blk_unregister_queue(struct gendisk *disk); void submit_bio_noacct(struct bio *bio); struct bio *bio_split_to_limits(struct bio *bio); extern int blk_lld_busy(struct request_queue *q); extern int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags); extern void blk_queue_exit(struct request_queue *q); extern void blk_sync_queue(struct request_queue *q); /* Helper to convert REQ_OP_XXX to its string format XXX */ extern const char *blk_op_str(enum req_op op); int blk_status_to_errno(blk_status_t status); blk_status_t errno_to_blk_status(int errno); const char *blk_status_to_str(blk_status_t status); /* only poll the hardware once, don't continue until a completion was found */ #define BLK_POLL_ONESHOT (1 << 0) int bio_poll(struct bio *bio, struct io_comp_batch *iob, unsigned int flags); int iocb_bio_iopoll(struct kiocb *kiocb, struct io_comp_batch *iob, unsigned int flags); static inline struct request_queue *bdev_get_queue(struct block_device *bdev) { return bdev->bd_queue; /* this is never NULL */ } /* Helper to convert BLK_ZONE_ZONE_XXX to its string format XXX */ const char *blk_zone_cond_str(enum blk_zone_cond zone_cond); static inline unsigned int bio_zone_no(struct bio *bio) { return disk_zone_no(bio->bi_bdev->bd_disk, bio->bi_iter.bi_sector); } static inline bool bio_straddles_zones(struct bio *bio) { return bio_sectors(bio) && bio_zone_no(bio) != disk_zone_no(bio->bi_bdev->bd_disk, bio_end_sector(bio) - 1); } /* * Return how much within the boundary is left to be used for I/O at a given * offset. */ static inline unsigned int blk_boundary_sectors_left(sector_t offset, unsigned int boundary_sectors) { if (unlikely(!is_power_of_2(boundary_sectors))) return boundary_sectors - sector_div(offset, boundary_sectors); return boundary_sectors - (offset & (boundary_sectors - 1)); } /** * queue_limits_start_update - start an atomic update of queue limits * @q: queue to update * * This functions starts an atomic update of the queue limits. 
It takes a lock * to prevent other updates and returns a snapshot of the current limits that * the caller can modify. The caller must call queue_limits_commit_update() * to finish the update. * * Context: process context. */ static inline struct queue_limits queue_limits_start_update(struct request_queue *q) { mutex_lock(&q->limits_lock); return q->limits; } int queue_limits_commit_update_frozen(struct request_queue *q, struct queue_limits *lim); int queue_limits_commit_update(struct request_queue *q, struct queue_limits *lim); int queue_limits_set(struct request_queue *q, struct queue_limits *lim); int blk_validate_limits(struct queue_limits *lim); /** * queue_limits_cancel_update - cancel an atomic update of queue limits * @q: queue to update * * This functions cancels an atomic update of the queue limits started by * queue_limits_start_update() and should be used when an error occurs after * starting update. */ static inline void queue_limits_cancel_update(struct request_queue *q) { mutex_unlock(&q->limits_lock); } /* * These helpers are for drivers that have sloppy feature negotiation and might * have to disable DISCARD, WRITE_ZEROES or SECURE_DISCARD from the I/O * completion handler when the device returned an indicator that the respective * feature is not actually supported. They are racy and the driver needs to * cope with that. Try to avoid this scheme if you can. */ static inline void blk_queue_disable_discard(struct request_queue *q) { q->limits.max_discard_sectors = 0; } static inline void blk_queue_disable_secure_erase(struct request_queue *q) { q->limits.max_secure_erase_sectors = 0; } static inline void blk_queue_disable_write_zeroes(struct request_queue *q) { q->limits.max_write_zeroes_sectors = 0; } /* * Access functions for manipulating queue properties */ extern void blk_set_queue_depth(struct request_queue *q, unsigned int depth); extern void blk_set_stacking_limits(struct queue_limits *lim); extern int blk_stack_limits(struct queue_limits *t, struct queue_limits *b, sector_t offset); void queue_limits_stack_bdev(struct queue_limits *t, struct block_device *bdev, sector_t offset, const char *pfx); extern void blk_queue_rq_timeout(struct request_queue *, unsigned int); struct blk_independent_access_ranges * disk_alloc_independent_access_ranges(struct gendisk *disk, int nr_ia_ranges); void disk_set_independent_access_ranges(struct gendisk *disk, struct blk_independent_access_ranges *iars); bool __must_check blk_get_queue(struct request_queue *); extern void blk_put_queue(struct request_queue *); void blk_mark_disk_dead(struct gendisk *disk); struct rq_list { struct request *head; struct request *tail; }; #ifdef CONFIG_BLOCK /* * blk_plug permits building a queue of related requests by holding the I/O * fragments for a short period. This allows merging of sequential requests * into single larger request. As the requests are moved from a per-task list to * the device's request_queue in a batch, this results in improved scalability * as the lock contention for request_queue lock is reduced. * * It is ok not to disable preemption when adding the request to the plug list * or when attempting a merge. For details, please see schedule() where * blk_flush_plug() is called. 
*/ struct blk_plug { struct rq_list mq_list; /* blk-mq requests */ /* if ios_left is > 1, we can batch tag/rq allocations */ struct rq_list cached_rqs; u64 cur_ktime; unsigned short nr_ios; unsigned short rq_count; bool multiple_queues; bool has_elevator; struct list_head cb_list; /* md requires an unplug callback */ }; struct blk_plug_cb; typedef void (*blk_plug_cb_fn)(struct blk_plug_cb *, bool); struct blk_plug_cb { struct list_head list; blk_plug_cb_fn callback; void *data; }; extern struct blk_plug_cb *blk_check_plugged(blk_plug_cb_fn unplug, void *data, int size); extern void blk_start_plug(struct blk_plug *); extern void blk_start_plug_nr_ios(struct blk_plug *, unsigned short); extern void blk_finish_plug(struct blk_plug *); void __blk_flush_plug(struct blk_plug *plug, bool from_schedule); static inline void blk_flush_plug(struct blk_plug *plug, bool async) { if (plug) __blk_flush_plug(plug, async); } /* * tsk == current here */ static inline void blk_plug_invalidate_ts(struct task_struct *tsk) { struct blk_plug *plug = tsk->plug; if (plug) plug->cur_ktime = 0; current->flags &= ~PF_BLOCK_TS; } int blkdev_issue_flush(struct block_device *bdev); long nr_blockdev_pages(void); #else /* CONFIG_BLOCK */ struct blk_plug { }; static inline void blk_start_plug_nr_ios(struct blk_plug *plug, unsigned short nr_ios) { } static inline void blk_start_plug(struct blk_plug *plug) { } static inline void blk_finish_plug(struct blk_plug *plug) { } static inline void blk_flush_plug(struct blk_plug *plug, bool async) { } static inline void blk_plug_invalidate_ts(struct task_struct *tsk) { } static inline int blkdev_issue_flush(struct block_device *bdev) { return 0; } static inline long nr_blockdev_pages(void) { return 0; } #endif /* CONFIG_BLOCK */ extern void blk_io_schedule(void); int blkdev_issue_discard(struct block_device *bdev, sector_t sector, sector_t nr_sects, gfp_t gfp_mask); int __blkdev_issue_discard(struct block_device *bdev, sector_t sector, sector_t nr_sects, gfp_t gfp_mask, struct bio **biop); int blkdev_issue_secure_erase(struct block_device *bdev, sector_t sector, sector_t nr_sects, gfp_t gfp); #define BLKDEV_ZERO_NOUNMAP (1 << 0) /* do not free blocks */ #define BLKDEV_ZERO_NOFALLBACK (1 << 1) /* don't write explicit zeroes */ #define BLKDEV_ZERO_KILLABLE (1 << 2) /* interruptible by fatal signals */ extern int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, sector_t nr_sects, gfp_t gfp_mask, struct bio **biop, unsigned flags); extern int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector, sector_t nr_sects, gfp_t gfp_mask, unsigned flags); static inline int sb_issue_discard(struct super_block *sb, sector_t block, sector_t nr_blocks, gfp_t gfp_mask, unsigned long flags) { return blkdev_issue_discard(sb->s_bdev, block << (sb->s_blocksize_bits - SECTOR_SHIFT), nr_blocks << (sb->s_blocksize_bits - SECTOR_SHIFT), gfp_mask); } static inline int sb_issue_zeroout(struct super_block *sb, sector_t block, sector_t nr_blocks, gfp_t gfp_mask) { return blkdev_issue_zeroout(sb->s_bdev, block << (sb->s_blocksize_bits - SECTOR_SHIFT), nr_blocks << (sb->s_blocksize_bits - SECTOR_SHIFT), gfp_mask, 0); } static inline bool bdev_is_partition(struct block_device *bdev) { return bdev_partno(bdev) != 0; } enum blk_default_limits { BLK_MAX_SEGMENTS = 128, BLK_SAFE_MAX_SECTORS = 255, BLK_MAX_SEGMENT_SIZE = 65536, BLK_SEG_BOUNDARY_MASK = 0xFFFFFFFFUL, }; /* * Default upper limit for the software max_sectors limit used for * regular file system I/O. 
This can be increased through sysfs. * * Not to be confused with the max_hw_sector limit that is entirely * controlled by the driver, usually based on hardware limits. */ #define BLK_DEF_MAX_SECTORS_CAP 2560u static inline struct queue_limits *bdev_limits(struct block_device *bdev) { return &bdev_get_queue(bdev)->limits; } static inline unsigned long queue_segment_boundary(const struct request_queue *q) { return q->limits.seg_boundary_mask; } static inline unsigned long queue_virt_boundary(const struct request_queue *q) { return q->limits.virt_boundary_mask; } static inline unsigned int queue_max_sectors(const struct request_queue *q) { return q->limits.max_sectors; } static inline unsigned int queue_max_bytes(struct request_queue *q) { return min_t(unsigned int, queue_max_sectors(q), INT_MAX >> 9) << 9; } static inline unsigned int queue_max_hw_sectors(const struct request_queue *q) { return q->limits.max_hw_sectors; } static inline unsigned short queue_max_segments(const struct request_queue *q) { return q->limits.max_segments; } static inline unsigned short queue_max_discard_segments(const struct request_queue *q) { return q->limits.max_discard_segments; } static inline unsigned int queue_max_segment_size(const struct request_queue *q) { return q->limits.max_segment_size; } static inline bool queue_emulates_zone_append(struct request_queue *q) { return blk_queue_is_zoned(q) && !q->limits.max_hw_zone_append_sectors; } static inline bool bdev_emulates_zone_append(struct block_device *bdev) { return queue_emulates_zone_append(bdev_get_queue(bdev)); } static inline unsigned int bdev_max_zone_append_sectors(struct block_device *bdev) { return bdev_limits(bdev)->max_zone_append_sectors; } static inline unsigned int bdev_max_segments(struct block_device *bdev) { return queue_max_segments(bdev_get_queue(bdev)); } static inline unsigned queue_logical_block_size(const struct request_queue *q) { return q->limits.logical_block_size; } static inline unsigned int bdev_logical_block_size(struct block_device *bdev) { return queue_logical_block_size(bdev_get_queue(bdev)); } static inline unsigned int queue_physical_block_size(const struct request_queue *q) { return q->limits.physical_block_size; } static inline unsigned int bdev_physical_block_size(struct block_device *bdev) { return queue_physical_block_size(bdev_get_queue(bdev)); } static inline unsigned int queue_io_min(const struct request_queue *q) { return q->limits.io_min; } static inline unsigned int bdev_io_min(struct block_device *bdev) { return queue_io_min(bdev_get_queue(bdev)); } static inline unsigned int queue_io_opt(const struct request_queue *q) { return q->limits.io_opt; } static inline unsigned int bdev_io_opt(struct block_device *bdev) { return queue_io_opt(bdev_get_queue(bdev)); } static inline unsigned int queue_zone_write_granularity(const struct request_queue *q) { return q->limits.zone_write_granularity; } static inline unsigned int bdev_zone_write_granularity(struct block_device *bdev) { return queue_zone_write_granularity(bdev_get_queue(bdev)); } int bdev_alignment_offset(struct block_device *bdev); unsigned int bdev_discard_alignment(struct block_device *bdev); static inline unsigned int bdev_max_discard_sectors(struct block_device *bdev) { return bdev_limits(bdev)->max_discard_sectors; } static inline unsigned int bdev_discard_granularity(struct block_device *bdev) { return bdev_limits(bdev)->discard_granularity; } static inline unsigned int bdev_max_secure_erase_sectors(struct block_device *bdev) { return 
bdev_limits(bdev)->max_secure_erase_sectors; } static inline unsigned int bdev_write_zeroes_sectors(struct block_device *bdev) { return bdev_limits(bdev)->max_write_zeroes_sectors; } static inline bool bdev_nonrot(struct block_device *bdev) { return blk_queue_nonrot(bdev_get_queue(bdev)); } static inline bool bdev_synchronous(struct block_device *bdev) { return bdev->bd_disk->queue->limits.features & BLK_FEAT_SYNCHRONOUS; } static inline bool bdev_stable_writes(struct block_device *bdev) { struct request_queue *q = bdev_get_queue(bdev); if (IS_ENABLED(CONFIG_BLK_DEV_INTEGRITY) && q->limits.integrity.csum_type != BLK_INTEGRITY_CSUM_NONE) return true; return q->limits.features & BLK_FEAT_STABLE_WRITES; } static inline bool blk_queue_write_cache(struct request_queue *q) { return (q->limits.features & BLK_FEAT_WRITE_CACHE) && !(q->limits.flags & BLK_FLAG_WRITE_CACHE_DISABLED); } static inline bool bdev_write_cache(struct block_device *bdev) { return blk_queue_write_cache(bdev_get_queue(bdev)); } static inline bool bdev_fua(struct block_device *bdev) { return bdev_limits(bdev)->features & BLK_FEAT_FUA; } static inline bool bdev_nowait(struct block_device *bdev) { return bdev->bd_disk->queue->limits.features & BLK_FEAT_NOWAIT; } static inline bool bdev_is_zoned(struct block_device *bdev) { return blk_queue_is_zoned(bdev_get_queue(bdev)); } static inline unsigned int bdev_zone_no(struct block_device *bdev, sector_t sec) { return disk_zone_no(bdev->bd_disk, sec); } static inline sector_t bdev_zone_sectors(struct block_device *bdev) { struct request_queue *q = bdev_get_queue(bdev); if (!blk_queue_is_zoned(q)) return 0; return q->limits.chunk_sectors; } static inline sector_t bdev_offset_from_zone_start(struct block_device *bdev, sector_t sector) { return sector & (bdev_zone_sectors(bdev) - 1); } static inline sector_t bio_offset_from_zone_start(struct bio *bio) { return bdev_offset_from_zone_start(bio->bi_bdev, bio->bi_iter.bi_sector); } static inline bool bdev_is_zone_start(struct block_device *bdev, sector_t sector) { return bdev_offset_from_zone_start(bdev, sector) == 0; } /** * bdev_zone_is_seq - check if a sector belongs to a sequential write zone * @bdev: block device to check * @sector: sector number * * Check if @sector on @bdev is contained in a sequential write required zone. 
*/ static inline bool bdev_zone_is_seq(struct block_device *bdev, sector_t sector) { bool is_seq = false; #if IS_ENABLED(CONFIG_BLK_DEV_ZONED) if (bdev_is_zoned(bdev)) { struct gendisk *disk = bdev->bd_disk; unsigned long *bitmap; rcu_read_lock(); bitmap = rcu_dereference(disk->conv_zones_bitmap); is_seq = !bitmap || !test_bit(disk_zone_no(disk, sector), bitmap); rcu_read_unlock(); } #endif return is_seq; } int blk_zone_issue_zeroout(struct block_device *bdev, sector_t sector, sector_t nr_sects, gfp_t gfp_mask); static inline unsigned int queue_dma_alignment(const struct request_queue *q) { return q->limits.dma_alignment; } static inline unsigned int queue_atomic_write_unit_max_bytes(const struct request_queue *q) { return q->limits.atomic_write_unit_max; } static inline unsigned int queue_atomic_write_unit_min_bytes(const struct request_queue *q) { return q->limits.atomic_write_unit_min; } static inline unsigned int queue_atomic_write_boundary_bytes(const struct request_queue *q) { return q->limits.atomic_write_boundary_sectors << SECTOR_SHIFT; } static inline unsigned int queue_atomic_write_max_bytes(const struct request_queue *q) { return q->limits.atomic_write_max_sectors << SECTOR_SHIFT; } static inline unsigned int bdev_dma_alignment(struct block_device *bdev) { return queue_dma_alignment(bdev_get_queue(bdev)); } static inline bool bdev_iter_is_aligned(struct block_device *bdev, struct iov_iter *iter) { return iov_iter_is_aligned(iter, bdev_dma_alignment(bdev), bdev_logical_block_size(bdev) - 1); } static inline unsigned int blk_lim_dma_alignment_and_pad(struct queue_limits *lim) { return lim->dma_alignment | lim->dma_pad_mask; } static inline bool blk_rq_aligned(struct request_queue *q, unsigned long addr, unsigned int len) { unsigned int alignment = blk_lim_dma_alignment_and_pad(&q->limits); return !(addr & alignment) && !(len & alignment); } /* assumes size > 256 */ static inline unsigned int blksize_bits(unsigned int size) { return order_base_2(size >> SECTOR_SHIFT) + SECTOR_SHIFT; } int kblockd_schedule_work(struct work_struct *work); int kblockd_mod_delayed_work_on(int cpu, struct delayed_work *dwork, unsigned long delay); #define MODULE_ALIAS_BLOCKDEV(major,minor) \ MODULE_ALIAS("block-major-" __stringify(major) "-" __stringify(minor)) #define MODULE_ALIAS_BLOCKDEV_MAJOR(major) \ MODULE_ALIAS("block-major-" __stringify(major) "-*") #ifdef CONFIG_BLK_INLINE_ENCRYPTION bool blk_crypto_register(struct blk_crypto_profile *profile, struct request_queue *q); #else /* CONFIG_BLK_INLINE_ENCRYPTION */ static inline bool blk_crypto_register(struct blk_crypto_profile *profile, struct request_queue *q) { return true; } #endif /* CONFIG_BLK_INLINE_ENCRYPTION */ enum blk_unique_id { /* these match the Designator Types specified in SPC */ BLK_UID_T10 = 1, BLK_UID_EUI64 = 2, BLK_UID_NAA = 3, }; struct block_device_operations { void (*submit_bio)(struct bio *bio); int (*poll_bio)(struct bio *bio, struct io_comp_batch *iob, unsigned int flags); int (*open)(struct gendisk *disk, blk_mode_t mode); void (*release)(struct gendisk *disk); int (*ioctl)(struct block_device *bdev, blk_mode_t mode, unsigned cmd, unsigned long arg); int (*compat_ioctl)(struct block_device *bdev, blk_mode_t mode, unsigned cmd, unsigned long arg); unsigned int (*check_events) (struct gendisk *disk, unsigned int clearing); void (*unlock_native_capacity) (struct gendisk *); int (*getgeo)(struct block_device *, struct hd_geometry *); int (*set_read_only)(struct block_device *bdev, bool ro); void (*free_disk)(struct gendisk 
*disk); /* this callback is with swap_lock and sometimes page table lock held */ void (*swap_slot_free_notify) (struct block_device *, unsigned long); int (*report_zones)(struct gendisk *, sector_t sector, unsigned int nr_zones, report_zones_cb cb, void *data); char *(*devnode)(struct gendisk *disk, umode_t *mode); /* returns the length of the identifier or a negative errno: */ int (*get_unique_id)(struct gendisk *disk, u8 id[16], enum blk_unique_id id_type); struct module *owner; const struct pr_ops *pr_ops; /* * Special callback for probing GPT entry at a given sector. * Needed by Android devices, used by GPT scanner and MMC blk * driver. */ int (*alternative_gpt_sector)(struct gendisk *disk, sector_t *sector); }; #ifdef CONFIG_COMPAT extern int blkdev_compat_ptr_ioctl(struct block_device *, blk_mode_t, unsigned int, unsigned long); #else #define blkdev_compat_ptr_ioctl NULL #endif static inline void blk_wake_io_task(struct task_struct *waiter) { /* * If we're polling, the task itself is doing the completions. For * that case, we don't need to signal a wakeup, it's enough to just * mark us as RUNNING. */ if (waiter == current) __set_current_state(TASK_RUNNING); else wake_up_process(waiter); } unsigned long bdev_start_io_acct(struct block_device *bdev, enum req_op op, unsigned long start_time); void bdev_end_io_acct(struct block_device *bdev, enum req_op op, unsigned int sectors, unsigned long start_time); unsigned long bio_start_io_acct(struct bio *bio); void bio_end_io_acct_remapped(struct bio *bio, unsigned long start_time, struct block_device *orig_bdev); /** * bio_end_io_acct - end I/O accounting for bio based drivers * @bio: bio to end account for * @start_time: start time returned by bio_start_io_acct() */ static inline void bio_end_io_acct(struct bio *bio, unsigned long start_time) { return bio_end_io_acct_remapped(bio, start_time, bio->bi_bdev); } int set_blocksize(struct file *file, int size); int lookup_bdev(const char *pathname, dev_t *dev); void blkdev_show(struct seq_file *seqf, off_t offset); #define BDEVNAME_SIZE 32 /* Largest string for a blockdev identifier */ #define BDEVT_SIZE 10 /* Largest string for MAJ:MIN for blkdev */ #ifdef CONFIG_BLOCK #define BLKDEV_MAJOR_MAX 512 #else #define BLKDEV_MAJOR_MAX 0 #endif struct blk_holder_ops { void (*mark_dead)(struct block_device *bdev, bool surprise); /* * Sync the file system mounted on the block device. */ void (*sync)(struct block_device *bdev); /* * Freeze the file system mounted on the block device. */ int (*freeze)(struct block_device *bdev); /* * Thaw the file system mounted on the block device. */ int (*thaw)(struct block_device *bdev); }; /* * For filesystems using @fs_holder_ops, the @holder argument passed to * helpers used to open and claim block devices via * bd_prepare_to_claim() must point to a superblock. */ extern const struct blk_holder_ops fs_holder_ops; /* * Return the correct open flags for blkdev_get_by_* for super block flags * as stored in sb->s_flags. */ #define sb_open_mode(flags) \ (BLK_OPEN_READ | BLK_OPEN_RESTRICT_WRITES | \ (((flags) & SB_RDONLY) ? 
0 : BLK_OPEN_WRITE)) struct file *bdev_file_open_by_dev(dev_t dev, blk_mode_t mode, void *holder, const struct blk_holder_ops *hops); struct file *bdev_file_open_by_path(const char *path, blk_mode_t mode, void *holder, const struct blk_holder_ops *hops); int bd_prepare_to_claim(struct block_device *bdev, void *holder, const struct blk_holder_ops *hops); void bd_abort_claiming(struct block_device *bdev, void *holder); /* just for blk-cgroup, don't use elsewhere */ struct block_device *blkdev_get_no_open(dev_t dev); void blkdev_put_no_open(struct block_device *bdev); struct block_device *I_BDEV(struct inode *inode); struct block_device *file_bdev(struct file *bdev_file); bool disk_live(struct gendisk *disk); unsigned int block_size(struct block_device *bdev); #ifdef CONFIG_BLOCK void invalidate_bdev(struct block_device *bdev); int sync_blockdev(struct block_device *bdev); int sync_blockdev_range(struct block_device *bdev, loff_t lstart, loff_t lend); int sync_blockdev_nowait(struct block_device *bdev); void sync_bdevs(bool wait); void bdev_statx(struct path *, struct kstat *, u32); void printk_all_partitions(void); int __init early_lookup_bdev(const char *pathname, dev_t *dev); #else static inline void invalidate_bdev(struct block_device *bdev) { } static inline int sync_blockdev(struct block_device *bdev) { return 0; } static inline int sync_blockdev_nowait(struct block_device *bdev) { return 0; } static inline void sync_bdevs(bool wait) { } static inline void bdev_statx(struct path *path, struct kstat *stat, u32 request_mask) { } static inline void printk_all_partitions(void) { } static inline int early_lookup_bdev(const char *pathname, dev_t *dev) { return -EINVAL; } #endif /* CONFIG_BLOCK */ int bdev_freeze(struct block_device *bdev); int bdev_thaw(struct block_device *bdev); void bdev_fput(struct file *bdev_file); struct io_comp_batch { struct rq_list req_list; bool need_ts; void (*complete)(struct io_comp_batch *); }; static inline bool blk_atomic_write_start_sect_aligned(sector_t sector, struct queue_limits *limits) { unsigned int alignment = max(limits->atomic_write_hw_unit_min, limits->atomic_write_hw_boundary); return IS_ALIGNED(sector, alignment >> SECTOR_SHIFT); } static inline bool bdev_can_atomic_write(struct block_device *bdev) { struct request_queue *bd_queue = bdev->bd_queue; struct queue_limits *limits = &bd_queue->limits; if (!limits->atomic_write_unit_min) return false; if (bdev_is_partition(bdev)) return blk_atomic_write_start_sect_aligned(bdev->bd_start_sect, limits); return true; } static inline unsigned int bdev_atomic_write_unit_min_bytes(struct block_device *bdev) { if (!bdev_can_atomic_write(bdev)) return 0; return queue_atomic_write_unit_min_bytes(bdev_get_queue(bdev)); } static inline unsigned int bdev_atomic_write_unit_max_bytes(struct block_device *bdev) { if (!bdev_can_atomic_write(bdev)) return 0; return queue_atomic_write_unit_max_bytes(bdev_get_queue(bdev)); } #define DEFINE_IO_COMP_BATCH(name) struct io_comp_batch name = { } #endif /* _LINUX_BLKDEV_H */ |
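/*
 * Usage sketch (not part of the header above): how a driver might apply the
 * queue_limits_start_update()/queue_limits_commit_update() pattern that the
 * kerneldoc above describes. The function name example_set_discard_granularity
 * and the parameter new_gran are invented for illustration; only the limits
 * helpers and the discard_granularity field come from this header.
 */
static inline int example_set_discard_granularity(struct request_queue *q,
						  unsigned int new_gran)
{
	struct queue_limits lim;

	/* Takes q->limits_lock and hands back a snapshot we may modify. */
	lim = queue_limits_start_update(q);
	lim.discard_granularity = new_gran;

	/*
	 * Validate and publish the modified limits (or keep the old ones on
	 * error); the lock taken by queue_limits_start_update() is dropped on
	 * the way out. queue_limits_cancel_update() would be used instead if
	 * we had to bail out before committing.
	 */
	return queue_limits_commit_update(q, &lim);
}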
/* SPDX-License-Identifier: GPL-2.0 */
/*
 * Latched RB-trees
 *
 * Copyright (C) 2015 Intel Corp., Peter Zijlstra <peterz@infradead.org>
 *
 * Since RB-trees have non-atomic modifications they're not immediately suited
 * for RCU/lockless queries. Even though we made RB-tree lookups non-fatal for
 * lockless lookups; we cannot guarantee they return a correct result.
 *
 * The simplest solution is a seqlock + RB-tree, this will allow lockless
 * lookups; but has the constraint (inherent to the seqlock) that read sides
 * cannot nest in write sides.
 *
 * If we need to allow unconditional lookups (say as required for NMI context
 * usage) we need a more complex setup; this data structure provides this by
 * employing the latch technique -- see @write_seqcount_latch_begin -- to
 * implement a latched RB-tree which does allow for unconditional lookups by
 * virtue of always having (at least) one stable copy of the tree.
 *
 * However, while we have the guarantee that there is at all times one stable
 * copy, this does not guarantee an iteration will not observe modifications.
 * What might have been a stable copy at the start of the iteration, need not
 * remain so for the duration of the iteration.
 *
 * Therefore, this does require a lockless RB-tree iteration to be non-fatal;
 * see the comment in lib/rbtree.c. Note however that we only require the first
 * condition -- not seeing partial stores -- because the latch thing isolates
 * us from loops. If we were to interrupt a modification the lookup would be
 * pointed at the stable tree and complete while the modification was halted.
 */

#ifndef RB_TREE_LATCH_H
#define RB_TREE_LATCH_H

#include <linux/rbtree.h>
#include <linux/seqlock.h>
#include <linux/rcupdate.h>

struct latch_tree_node {
	struct rb_node node[2];
};

struct latch_tree_root {
	seqcount_latch_t	seq;
	struct rb_root		tree[2];
};

/**
 * latch_tree_ops - operators to define the tree order
 * @less: used for insertion; provides the (partial) order between two elements.
 * @comp: used for lookups; provides the order between the search key and an element.
 *
 * The operators are related like:
 *
 *	comp(a->key,b) < 0  := less(a,b)
 *	comp(a->key,b) > 0  := less(b,a)
 *	comp(a->key,b) == 0 := !less(a,b) && !less(b,a)
 *
 * If these operators define a partial order on the elements we make no
 * guarantee on which of the elements matching the key is found. See
 * latch_tree_find().
*/ struct latch_tree_ops { bool (*less)(struct latch_tree_node *a, struct latch_tree_node *b); int (*comp)(void *key, struct latch_tree_node *b); }; static __always_inline struct latch_tree_node * __lt_from_rb(struct rb_node *node, int idx) { return container_of(node, struct latch_tree_node, node[idx]); } static __always_inline void __lt_insert(struct latch_tree_node *ltn, struct latch_tree_root *ltr, int idx, bool (*less)(struct latch_tree_node *a, struct latch_tree_node *b)) { struct rb_root *root = <r->tree[idx]; struct rb_node **link = &root->rb_node; struct rb_node *node = <n->node[idx]; struct rb_node *parent = NULL; struct latch_tree_node *ltp; while (*link) { parent = *link; ltp = __lt_from_rb(parent, idx); if (less(ltn, ltp)) link = &parent->rb_left; else link = &parent->rb_right; } rb_link_node_rcu(node, parent, link); rb_insert_color(node, root); } static __always_inline void __lt_erase(struct latch_tree_node *ltn, struct latch_tree_root *ltr, int idx) { rb_erase(<n->node[idx], <r->tree[idx]); } static __always_inline struct latch_tree_node * __lt_find(void *key, struct latch_tree_root *ltr, int idx, int (*comp)(void *key, struct latch_tree_node *node)) { struct rb_node *node = rcu_dereference_raw(ltr->tree[idx].rb_node); struct latch_tree_node *ltn; int c; while (node) { ltn = __lt_from_rb(node, idx); c = comp(key, ltn); if (c < 0) node = rcu_dereference_raw(node->rb_left); else if (c > 0) node = rcu_dereference_raw(node->rb_right); else return ltn; } return NULL; } /** * latch_tree_insert() - insert @node into the trees @root * @node: nodes to insert * @root: trees to insert @node into * @ops: operators defining the node order * * It inserts @node into @root in an ordered fashion such that we can always * observe one complete tree. See the comment for write_seqcount_latch_begin(). * * The inserts use rcu_assign_pointer() to publish the element such that the * tree structure is stored before we can observe the new @node. * * All modifications (latch_tree_insert, latch_tree_remove) are assumed to be * serialized. */ static __always_inline void latch_tree_insert(struct latch_tree_node *node, struct latch_tree_root *root, const struct latch_tree_ops *ops) { write_seqcount_latch_begin(&root->seq); __lt_insert(node, root, 0, ops->less); write_seqcount_latch(&root->seq); __lt_insert(node, root, 1, ops->less); write_seqcount_latch_end(&root->seq); } /** * latch_tree_erase() - removes @node from the trees @root * @node: nodes to remote * @root: trees to remove @node from * @ops: operators defining the node order * * Removes @node from the trees @root in an ordered fashion such that we can * always observe one complete tree. See the comment for * write_seqcount_latch_begin(). * * It is assumed that @node will observe one RCU quiescent state before being * reused of freed. * * All modifications (latch_tree_insert, latch_tree_remove) are assumed to be * serialized. */ static __always_inline void latch_tree_erase(struct latch_tree_node *node, struct latch_tree_root *root, const struct latch_tree_ops *ops) { write_seqcount_latch_begin(&root->seq); __lt_erase(node, root, 0); write_seqcount_latch(&root->seq); __lt_erase(node, root, 1); write_seqcount_latch_end(&root->seq); } /** * latch_tree_find() - find the node matching @key in the trees @root * @key: search key * @root: trees to search for @key * @ops: operators defining the node order * * Does a lockless lookup in the trees @root for the node matching @key. 
* * It is assumed that this is called while holding the appropriate RCU read * side lock. * * If the operators define a partial order on the elements (there are multiple * elements which have the same key value) it is undefined which of these * elements will be found. Nor is it possible to iterate the tree to find * further elements with the same key value. * * Returns: a pointer to the node matching @key or NULL. */ static __always_inline struct latch_tree_node * latch_tree_find(void *key, struct latch_tree_root *root, const struct latch_tree_ops *ops) { struct latch_tree_node *node; unsigned int seq; do { seq = read_seqcount_latch(&root->seq); node = __lt_find(key, root, seq & 1, ops->comp); } while (read_seqcount_latch_retry(&root->seq, seq)); return node; } #endif /* RB_TREE_LATCH_H */ |
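/*
 * Usage sketch (not part of rbtree_latch.h): minimal wiring of latch_tree_ops
 * for objects keyed by an address. struct example_obj, example_less(),
 * example_cmp(), example_ops and example_root are invented names; only the
 * latch_tree_* API and struct definitions come from the header above.
 */
struct example_obj {
	unsigned long		addr;
	struct latch_tree_node	ltn;
};

static bool example_less(struct latch_tree_node *a, struct latch_tree_node *b)
{
	return container_of(a, struct example_obj, ltn)->addr <
	       container_of(b, struct example_obj, ltn)->addr;
}

static int example_cmp(void *key, struct latch_tree_node *n)
{
	unsigned long addr = (unsigned long)key;
	unsigned long obj_addr = container_of(n, struct example_obj, ltn)->addr;

	if (addr < obj_addr)
		return -1;
	if (addr > obj_addr)
		return 1;
	return 0;
}

static const struct latch_tree_ops example_ops = {
	.less = example_less,
	.comp = example_cmp,
};

static struct latch_tree_root example_root;

/* Writers (insert/erase) must serialize among themselves, e.g. with a lock. */
static void example_insert(struct example_obj *obj)
{
	latch_tree_insert(&obj->ltn, &example_root, &example_ops);
}

/* Lookups only need to run under the RCU read lock (NMI context included). */
static struct example_obj *example_find(unsigned long addr)
{
	struct latch_tree_node *n;

	n = latch_tree_find((void *)addr, &example_root, &example_ops);
	return n ? container_of(n, struct example_obj, ltn) : NULL;
}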
// SPDX-License-Identifier: GPL-2.0
/*
 * linux/fs/hfsplus/unicode.c
 *
 * Copyright (C) 2001
 * Brad Boyer (flar@allandria.com)
 * (C) 2003 Ardis Technologies <roman@ardistech.com>
 *
 * Handler routines for unicode strings
 */

#include <linux/types.h>
#include <linux/nls.h>
#include "hfsplus_fs.h"
#include "hfsplus_raw.h"

/* Fold the case of a unicode char, given the 16 bit value */
/* Returns folded char, or 0 if ignorable */
static inline u16 case_fold(u16 c)
{
	u16 tmp;

	tmp = hfsplus_case_fold_table[c >> 8];
	if (tmp)
		tmp = hfsplus_case_fold_table[tmp + (c & 0xff)];
	else
		tmp = c;
	return tmp;
}

/* Compare unicode strings, return values like normal strcmp */
int hfsplus_strcasecmp(const struct hfsplus_unistr *s1,
		       const struct hfsplus_unistr *s2)
{
	u16 len1, len2, c1, c2;
	const hfsplus_unichr *p1, *p2;

	len1 = be16_to_cpu(s1->length);
	len2 = be16_to_cpu(s2->length);
	p1 = s1->unicode;
	p2 = s2->unicode;

	while (1) {
		c1 = c2 = 0;

		while (len1 && !c1) {
			c1 = case_fold(be16_to_cpu(*p1));
			p1++;
			len1--;
		}
		while (len2 && !c2) {
			c2 = case_fold(be16_to_cpu(*p2));
			p2++;
			len2--;
		}

		if (c1 != c2)
			return (c1 < c2) ?
-1 : 1; if (!c1 && !c2) return 0; } } /* Compare names as a sequence of 16-bit unsigned integers */ int hfsplus_strcmp(const struct hfsplus_unistr *s1, const struct hfsplus_unistr *s2) { u16 len1, len2, c1, c2; const hfsplus_unichr *p1, *p2; int len; len1 = be16_to_cpu(s1->length); len2 = be16_to_cpu(s2->length); p1 = s1->unicode; p2 = s2->unicode; for (len = min(len1, len2); len > 0; len--) { c1 = be16_to_cpu(*p1); c2 = be16_to_cpu(*p2); if (c1 != c2) return c1 < c2 ? -1 : 1; p1++; p2++; } return len1 < len2 ? -1 : len1 > len2 ? 1 : 0; } #define Hangul_SBase 0xac00 #define Hangul_LBase 0x1100 #define Hangul_VBase 0x1161 #define Hangul_TBase 0x11a7 #define Hangul_SCount 11172 #define Hangul_LCount 19 #define Hangul_VCount 21 #define Hangul_TCount 28 #define Hangul_NCount (Hangul_VCount * Hangul_TCount) static u16 *hfsplus_compose_lookup(u16 *p, u16 cc) { int i, s, e; s = 1; e = p[1]; if (!e || cc < p[s * 2] || cc > p[e * 2]) return NULL; do { i = (s + e) / 2; if (cc > p[i * 2]) s = i + 1; else if (cc < p[i * 2]) e = i - 1; else return hfsplus_compose_table + p[i * 2 + 1]; } while (s <= e); return NULL; } int hfsplus_uni2asc(struct super_block *sb, const struct hfsplus_unistr *ustr, char *astr, int *len_p) { const hfsplus_unichr *ip; struct nls_table *nls = HFSPLUS_SB(sb)->nls; u8 *op; u16 cc, c0, c1; u16 *ce1, *ce2; int i, len, ustrlen, res, compose; op = astr; ip = ustr->unicode; ustrlen = be16_to_cpu(ustr->length); len = *len_p; ce1 = NULL; compose = !test_bit(HFSPLUS_SB_NODECOMPOSE, &HFSPLUS_SB(sb)->flags); while (ustrlen > 0) { c0 = be16_to_cpu(*ip++); ustrlen--; /* search for single decomposed char */ if (likely(compose)) ce1 = hfsplus_compose_lookup(hfsplus_compose_table, c0); if (ce1) cc = ce1[0]; else cc = 0; if (cc) { /* start of a possibly decomposed Hangul char */ if (cc != 0xffff) goto done; if (!ustrlen) goto same; c1 = be16_to_cpu(*ip) - Hangul_VBase; if (c1 < Hangul_VCount) { /* compose the Hangul char */ cc = (c0 - Hangul_LBase) * Hangul_VCount; cc = (cc + c1) * Hangul_TCount; cc += Hangul_SBase; ip++; ustrlen--; if (!ustrlen) goto done; c1 = be16_to_cpu(*ip) - Hangul_TBase; if (c1 > 0 && c1 < Hangul_TCount) { cc += c1; ip++; ustrlen--; } goto done; } } while (1) { /* main loop for common case of not composed chars */ if (!ustrlen) goto same; c1 = be16_to_cpu(*ip); if (likely(compose)) ce1 = hfsplus_compose_lookup( hfsplus_compose_table, c1); if (ce1) break; switch (c0) { case 0: c0 = 0x2400; break; case '/': c0 = ':'; break; } res = nls->uni2char(c0, op, len); if (res < 0) { if (res == -ENAMETOOLONG) goto out; *op = '?'; res = 1; } op += res; len -= res; c0 = c1; ip++; ustrlen--; } ce2 = hfsplus_compose_lookup(ce1, c0); if (ce2) { i = 1; while (i < ustrlen) { ce1 = hfsplus_compose_lookup(ce2, be16_to_cpu(ip[i])); if (!ce1) break; i++; ce2 = ce1; } cc = ce2[0]; if (cc) { ip += i; ustrlen -= i; goto done; } } same: switch (c0) { case 0: cc = 0x2400; break; case '/': cc = ':'; break; default: cc = c0; } done: res = nls->uni2char(cc, op, len); if (res < 0) { if (res == -ENAMETOOLONG) goto out; *op = '?'; res = 1; } op += res; len -= res; } res = 0; out: *len_p = (char *)op - astr; return res; } /* * Convert one or more ASCII characters into a single unicode character. * Returns the number of ASCII characters corresponding to the unicode char. 
*/ static inline int asc2unichar(struct super_block *sb, const char *astr, int len, wchar_t *uc) { int size = HFSPLUS_SB(sb)->nls->char2uni(astr, len, uc); if (size <= 0) { *uc = '?'; size = 1; } switch (*uc) { case 0x2400: *uc = 0; break; case ':': *uc = '/'; break; } return size; } /* Decomposes a non-Hangul unicode character. */ static u16 *hfsplus_decompose_nonhangul(wchar_t uc, int *size) { int off; off = hfsplus_decompose_table[(uc >> 12) & 0xf]; if (off == 0 || off == 0xffff) return NULL; off = hfsplus_decompose_table[off + ((uc >> 8) & 0xf)]; if (!off) return NULL; off = hfsplus_decompose_table[off + ((uc >> 4) & 0xf)]; if (!off) return NULL; off = hfsplus_decompose_table[off + (uc & 0xf)]; *size = off & 3; if (*size == 0) return NULL; return hfsplus_decompose_table + (off / 4); } /* * Try to decompose a unicode character as Hangul. Return 0 if @uc is not * precomposed Hangul, otherwise return the length of the decomposition. * * This function was adapted from sample code from the Unicode Standard * Annex #15: Unicode Normalization Forms, version 3.2.0. * * Copyright (C) 1991-2018 Unicode, Inc. All rights reserved. Distributed * under the Terms of Use in http://www.unicode.org/copyright.html. */ static int hfsplus_try_decompose_hangul(wchar_t uc, u16 *result) { int index; int l, v, t; index = uc - Hangul_SBase; if (index < 0 || index >= Hangul_SCount) return 0; l = Hangul_LBase + index / Hangul_NCount; v = Hangul_VBase + (index % Hangul_NCount) / Hangul_TCount; t = Hangul_TBase + index % Hangul_TCount; result[0] = l; result[1] = v; if (t != Hangul_TBase) { result[2] = t; return 3; } return 2; } /* Decomposes a single unicode character. */ static u16 *decompose_unichar(wchar_t uc, int *size, u16 *hangul_buffer) { u16 *result; /* Hangul is handled separately */ result = hangul_buffer; *size = hfsplus_try_decompose_hangul(uc, result); if (*size == 0) result = hfsplus_decompose_nonhangul(uc, size); return result; } int hfsplus_asc2uni(struct super_block *sb, struct hfsplus_unistr *ustr, int max_unistr_len, const char *astr, int len) { int size, dsize, decompose; u16 *dstr, outlen = 0; wchar_t c; u16 dhangul[3]; decompose = !test_bit(HFSPLUS_SB_NODECOMPOSE, &HFSPLUS_SB(sb)->flags); while (outlen < max_unistr_len && len > 0) { size = asc2unichar(sb, astr, len, &c); if (decompose) dstr = decompose_unichar(c, &dsize, dhangul); else dstr = NULL; if (dstr) { if (outlen + dsize > max_unistr_len) break; do { ustr->unicode[outlen++] = cpu_to_be16(*dstr++); } while (--dsize > 0); } else ustr->unicode[outlen++] = cpu_to_be16(c); astr += size; len -= size; } ustr->length = cpu_to_be16(outlen); if (len > 0) return -ENAMETOOLONG; return 0; } /* * Hash a string to an integer as appropriate for the HFS+ filesystem. * Composed unicode characters are decomposed and case-folding is performed * if the appropriate bits are (un)set on the superblock. 
*/ int hfsplus_hash_dentry(const struct dentry *dentry, struct qstr *str) { struct super_block *sb = dentry->d_sb; const char *astr; const u16 *dstr; int casefold, decompose, size, len; unsigned long hash; wchar_t c; u16 c2; u16 dhangul[3]; casefold = test_bit(HFSPLUS_SB_CASEFOLD, &HFSPLUS_SB(sb)->flags); decompose = !test_bit(HFSPLUS_SB_NODECOMPOSE, &HFSPLUS_SB(sb)->flags); hash = init_name_hash(dentry); astr = str->name; len = str->len; while (len > 0) { int dsize; size = asc2unichar(sb, astr, len, &c); astr += size; len -= size; if (decompose) dstr = decompose_unichar(c, &dsize, dhangul); else dstr = NULL; if (dstr) { do { c2 = *dstr++; if (casefold) c2 = case_fold(c2); if (!casefold || c2) hash = partial_name_hash(c2, hash); } while (--dsize > 0); } else { c2 = c; if (casefold) c2 = case_fold(c2); if (!casefold || c2) hash = partial_name_hash(c2, hash); } } str->hash = end_name_hash(hash); return 0; } /* * Compare strings with HFS+ filename ordering. * Composed unicode characters are decomposed and case-folding is performed * if the appropriate bits are (un)set on the superblock. */ int hfsplus_compare_dentry(const struct dentry *dentry, unsigned int len, const char *str, const struct qstr *name) { struct super_block *sb = dentry->d_sb; int casefold, decompose, size; int dsize1, dsize2, len1, len2; const u16 *dstr1, *dstr2; const char *astr1, *astr2; u16 c1, c2; wchar_t c; u16 dhangul_1[3], dhangul_2[3]; casefold = test_bit(HFSPLUS_SB_CASEFOLD, &HFSPLUS_SB(sb)->flags); decompose = !test_bit(HFSPLUS_SB_NODECOMPOSE, &HFSPLUS_SB(sb)->flags); astr1 = str; len1 = len; astr2 = name->name; len2 = name->len; dsize1 = dsize2 = 0; dstr1 = dstr2 = NULL; while (len1 > 0 && len2 > 0) { if (!dsize1) { size = asc2unichar(sb, astr1, len1, &c); astr1 += size; len1 -= size; if (decompose) dstr1 = decompose_unichar(c, &dsize1, dhangul_1); if (!decompose || !dstr1) { c1 = c; dstr1 = &c1; dsize1 = 1; } } if (!dsize2) { size = asc2unichar(sb, astr2, len2, &c); astr2 += size; len2 -= size; if (decompose) dstr2 = decompose_unichar(c, &dsize2, dhangul_2); if (!decompose || !dstr2) { c2 = c; dstr2 = &c2; dsize2 = 1; } } c1 = *dstr1; c2 = *dstr2; if (casefold) { c1 = case_fold(c1); if (!c1) { dstr1++; dsize1--; continue; } c2 = case_fold(c2); if (!c2) { dstr2++; dsize2--; continue; } } if (c1 < c2) return -1; else if (c1 > c2) return 1; dstr1++; dsize1--; dstr2++; dsize2--; } if (len1 < len2) return -1; if (len1 > len2) return 1; return 0; } |
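/*
 * Worked example (not part of unicode.c): the Hangul arithmetic used by
 * hfsplus_try_decompose_hangul() and the composition code in
 * hfsplus_uni2asc(), spelled out for U+AC01. example_hangul_roundtrip() is an
 * invented name used purely for illustration.
 */
static void example_hangul_roundtrip(void)
{
	u16 jamo[3];
	u16 composed;
	int n;

	/*
	 * Decompose 0xac01: index = 0xac01 - Hangul_SBase = 1, hence
	 *   L = Hangul_LBase + 1 / Hangul_NCount                = 0x1100
	 *   V = Hangul_VBase + (1 % Hangul_NCount) / Hangul_TCount = 0x1161
	 *   T = Hangul_TBase + 1 % Hangul_TCount                = 0x11a8
	 * T differs from Hangul_TBase, so three jamo are returned.
	 */
	n = hfsplus_try_decompose_hangul(0xac01, jamo);	/* n == 3 */

	/* Compose it back, mirroring the arithmetic in hfsplus_uni2asc(). */
	composed = (jamo[0] - Hangul_LBase) * Hangul_VCount;
	composed = (composed + (jamo[1] - Hangul_VBase)) * Hangul_TCount;
	composed += Hangul_SBase;
	composed += jamo[2] - Hangul_TBase;	/* back to 0xac01 */

	(void)n;
	(void)composed;
}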
// SPDX-License-Identifier: GPL-2.0-only
/*
 * Shared Memory Communications over RDMA (SMC-R) and RoCE
 *
 * SMC statistics netlink routines
 *
 * Copyright IBM Corp.
2021 * * Author(s): Guvenc Gulce */ #include <linux/init.h> #include <linux/mutex.h> #include <linux/percpu.h> #include <linux/ctype.h> #include <linux/smc.h> #include <net/genetlink.h> #include <net/sock.h> #include "smc_netlink.h" #include "smc_stats.h" int smc_stats_init(struct net *net) { net->smc.fback_rsn = kzalloc(sizeof(*net->smc.fback_rsn), GFP_KERNEL); if (!net->smc.fback_rsn) goto err_fback; net->smc.smc_stats = alloc_percpu(struct smc_stats); if (!net->smc.smc_stats) goto err_stats; mutex_init(&net->smc.mutex_fback_rsn); return 0; err_stats: kfree(net->smc.fback_rsn); err_fback: return -ENOMEM; } void smc_stats_exit(struct net *net) { kfree(net->smc.fback_rsn); if (net->smc.smc_stats) free_percpu(net->smc.smc_stats); } static int smc_nl_fill_stats_rmb_data(struct sk_buff *skb, struct smc_stats *stats, int tech, int type) { struct smc_stats_rmbcnt *stats_rmb_cnt; struct nlattr *attrs; if (type == SMC_NLA_STATS_T_TX_RMB_STATS) stats_rmb_cnt = &stats->smc[tech].rmb_tx; else stats_rmb_cnt = &stats->smc[tech].rmb_rx; attrs = nla_nest_start(skb, type); if (!attrs) goto errout; if (nla_put_u64_64bit(skb, SMC_NLA_STATS_RMB_REUSE_CNT, stats_rmb_cnt->reuse_cnt, SMC_NLA_STATS_RMB_PAD)) goto errattr; if (nla_put_u64_64bit(skb, SMC_NLA_STATS_RMB_SIZE_SM_PEER_CNT, stats_rmb_cnt->buf_size_small_peer_cnt, SMC_NLA_STATS_RMB_PAD)) goto errattr; if (nla_put_u64_64bit(skb, SMC_NLA_STATS_RMB_SIZE_SM_CNT, stats_rmb_cnt->buf_size_small_cnt, SMC_NLA_STATS_RMB_PAD)) goto errattr; if (nla_put_u64_64bit(skb, SMC_NLA_STATS_RMB_FULL_PEER_CNT, stats_rmb_cnt->buf_full_peer_cnt, SMC_NLA_STATS_RMB_PAD)) goto errattr; if (nla_put_u64_64bit(skb, SMC_NLA_STATS_RMB_FULL_CNT, stats_rmb_cnt->buf_full_cnt, SMC_NLA_STATS_RMB_PAD)) goto errattr; if (nla_put_u64_64bit(skb, SMC_NLA_STATS_RMB_ALLOC_CNT, stats_rmb_cnt->alloc_cnt, SMC_NLA_STATS_RMB_PAD)) goto errattr; if (nla_put_u64_64bit(skb, SMC_NLA_STATS_RMB_DGRADE_CNT, stats_rmb_cnt->dgrade_cnt, SMC_NLA_STATS_RMB_PAD)) goto errattr; nla_nest_end(skb, attrs); return 0; errattr: nla_nest_cancel(skb, attrs); errout: return -EMSGSIZE; } static int smc_nl_fill_stats_bufsize_data(struct sk_buff *skb, struct smc_stats *stats, int tech, int type) { struct smc_stats_memsize *stats_pload; struct nlattr *attrs; if (type == SMC_NLA_STATS_T_TXPLOAD_SIZE) stats_pload = &stats->smc[tech].tx_pd; else if (type == SMC_NLA_STATS_T_RXPLOAD_SIZE) stats_pload = &stats->smc[tech].rx_pd; else if (type == SMC_NLA_STATS_T_TX_RMB_SIZE) stats_pload = &stats->smc[tech].tx_rmbsize; else if (type == SMC_NLA_STATS_T_RX_RMB_SIZE) stats_pload = &stats->smc[tech].rx_rmbsize; else goto errout; attrs = nla_nest_start(skb, type); if (!attrs) goto errout; if (nla_put_u64_64bit(skb, SMC_NLA_STATS_PLOAD_8K, stats_pload->buf[SMC_BUF_8K], SMC_NLA_STATS_PLOAD_PAD)) goto errattr; if (nla_put_u64_64bit(skb, SMC_NLA_STATS_PLOAD_16K, stats_pload->buf[SMC_BUF_16K], SMC_NLA_STATS_PLOAD_PAD)) goto errattr; if (nla_put_u64_64bit(skb, SMC_NLA_STATS_PLOAD_32K, stats_pload->buf[SMC_BUF_32K], SMC_NLA_STATS_PLOAD_PAD)) goto errattr; if (nla_put_u64_64bit(skb, SMC_NLA_STATS_PLOAD_64K, stats_pload->buf[SMC_BUF_64K], SMC_NLA_STATS_PLOAD_PAD)) goto errattr; if (nla_put_u64_64bit(skb, SMC_NLA_STATS_PLOAD_128K, stats_pload->buf[SMC_BUF_128K], SMC_NLA_STATS_PLOAD_PAD)) goto errattr; if (nla_put_u64_64bit(skb, SMC_NLA_STATS_PLOAD_256K, stats_pload->buf[SMC_BUF_256K], SMC_NLA_STATS_PLOAD_PAD)) goto errattr; if (nla_put_u64_64bit(skb, SMC_NLA_STATS_PLOAD_512K, stats_pload->buf[SMC_BUF_512K], SMC_NLA_STATS_PLOAD_PAD)) goto errattr; if 
(nla_put_u64_64bit(skb, SMC_NLA_STATS_PLOAD_1024K, stats_pload->buf[SMC_BUF_1024K], SMC_NLA_STATS_PLOAD_PAD)) goto errattr; if (nla_put_u64_64bit(skb, SMC_NLA_STATS_PLOAD_G_1024K, stats_pload->buf[SMC_BUF_G_1024K], SMC_NLA_STATS_PLOAD_PAD)) goto errattr; nla_nest_end(skb, attrs); return 0; errattr: nla_nest_cancel(skb, attrs); errout: return -EMSGSIZE; } static int smc_nl_fill_stats_tech_data(struct sk_buff *skb, struct smc_stats *stats, int tech) { struct smc_stats_tech *smc_tech; struct nlattr *attrs; smc_tech = &stats->smc[tech]; if (tech == SMC_TYPE_D) attrs = nla_nest_start(skb, SMC_NLA_STATS_SMCD_TECH); else attrs = nla_nest_start(skb, SMC_NLA_STATS_SMCR_TECH); if (!attrs) goto errout; if (smc_nl_fill_stats_rmb_data(skb, stats, tech, SMC_NLA_STATS_T_TX_RMB_STATS)) goto errattr; if (smc_nl_fill_stats_rmb_data(skb, stats, tech, SMC_NLA_STATS_T_RX_RMB_STATS)) goto errattr; if (smc_nl_fill_stats_bufsize_data(skb, stats, tech, SMC_NLA_STATS_T_TXPLOAD_SIZE)) goto errattr; if (smc_nl_fill_stats_bufsize_data(skb, stats, tech, SMC_NLA_STATS_T_RXPLOAD_SIZE)) goto errattr; if (smc_nl_fill_stats_bufsize_data(skb, stats, tech, SMC_NLA_STATS_T_TX_RMB_SIZE)) goto errattr; if (smc_nl_fill_stats_bufsize_data(skb, stats, tech, SMC_NLA_STATS_T_RX_RMB_SIZE)) goto errattr; if (nla_put_u64_64bit(skb, SMC_NLA_STATS_T_CLNT_V1_SUCC, smc_tech->clnt_v1_succ_cnt, SMC_NLA_STATS_PAD)) goto errattr; if (nla_put_u64_64bit(skb, SMC_NLA_STATS_T_CLNT_V2_SUCC, smc_tech->clnt_v2_succ_cnt, SMC_NLA_STATS_PAD)) goto errattr; if (nla_put_u64_64bit(skb, SMC_NLA_STATS_T_SRV_V1_SUCC, smc_tech->srv_v1_succ_cnt, SMC_NLA_STATS_PAD)) goto errattr; if (nla_put_u64_64bit(skb, SMC_NLA_STATS_T_SRV_V2_SUCC, smc_tech->srv_v2_succ_cnt, SMC_NLA_STATS_PAD)) goto errattr; if (nla_put_u64_64bit(skb, SMC_NLA_STATS_T_RX_BYTES, smc_tech->rx_bytes, SMC_NLA_STATS_PAD)) goto errattr; if (nla_put_u64_64bit(skb, SMC_NLA_STATS_T_TX_BYTES, smc_tech->tx_bytes, SMC_NLA_STATS_PAD)) goto errattr; if (nla_put_uint(skb, SMC_NLA_STATS_T_RX_RMB_USAGE, smc_tech->rx_rmbuse)) goto errattr; if (nla_put_uint(skb, SMC_NLA_STATS_T_TX_RMB_USAGE, smc_tech->tx_rmbuse)) goto errattr; if (nla_put_u64_64bit(skb, SMC_NLA_STATS_T_RX_CNT, smc_tech->rx_cnt, SMC_NLA_STATS_PAD)) goto errattr; if (nla_put_u64_64bit(skb, SMC_NLA_STATS_T_TX_CNT, smc_tech->tx_cnt, SMC_NLA_STATS_PAD)) goto errattr; if (nla_put_u64_64bit(skb, SMC_NLA_STATS_T_SENDPAGE_CNT, 0, SMC_NLA_STATS_PAD)) goto errattr; if (nla_put_u64_64bit(skb, SMC_NLA_STATS_T_CORK_CNT, smc_tech->cork_cnt, SMC_NLA_STATS_PAD)) goto errattr; if (nla_put_u64_64bit(skb, SMC_NLA_STATS_T_NDLY_CNT, smc_tech->ndly_cnt, SMC_NLA_STATS_PAD)) goto errattr; if (nla_put_u64_64bit(skb, SMC_NLA_STATS_T_SPLICE_CNT, smc_tech->splice_cnt, SMC_NLA_STATS_PAD)) goto errattr; if (nla_put_u64_64bit(skb, SMC_NLA_STATS_T_URG_DATA_CNT, smc_tech->urg_data_cnt, SMC_NLA_STATS_PAD)) goto errattr; nla_nest_end(skb, attrs); return 0; errattr: nla_nest_cancel(skb, attrs); errout: return -EMSGSIZE; } int smc_nl_get_stats(struct sk_buff *skb, struct netlink_callback *cb) { struct smc_nl_dmp_ctx *cb_ctx = smc_nl_dmp_ctx(cb); struct net *net = sock_net(skb->sk); struct smc_stats *stats; struct nlattr *attrs; int cpu, i, size; void *nlh; u64 *src; u64 *sum; if (cb_ctx->pos[0]) goto errmsg; nlh = genlmsg_put(skb, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq, &smc_gen_nl_family, NLM_F_MULTI, SMC_NETLINK_GET_STATS); if (!nlh) goto errmsg; attrs = nla_nest_start(skb, SMC_GEN_STATS); if (!attrs) goto errnest; stats = kzalloc(sizeof(*stats), GFP_KERNEL); if (!stats) 
goto erralloc; size = sizeof(*stats) / sizeof(u64); for_each_possible_cpu(cpu) { src = (u64 *)per_cpu_ptr(net->smc.smc_stats, cpu); sum = (u64 *)stats; for (i = 0; i < size; i++) *(sum++) += *(src++); } if (smc_nl_fill_stats_tech_data(skb, stats, SMC_TYPE_D)) goto errattr; if (smc_nl_fill_stats_tech_data(skb, stats, SMC_TYPE_R)) goto errattr; if (nla_put_u64_64bit(skb, SMC_NLA_STATS_CLNT_HS_ERR_CNT, stats->clnt_hshake_err_cnt, SMC_NLA_STATS_PAD)) goto errattr; if (nla_put_u64_64bit(skb, SMC_NLA_STATS_SRV_HS_ERR_CNT, stats->srv_hshake_err_cnt, SMC_NLA_STATS_PAD)) goto errattr; nla_nest_end(skb, attrs); genlmsg_end(skb, nlh); cb_ctx->pos[0] = 1; kfree(stats); return skb->len; errattr: kfree(stats); erralloc: nla_nest_cancel(skb, attrs); errnest: genlmsg_cancel(skb, nlh); errmsg: return skb->len; } static int smc_nl_get_fback_details(struct sk_buff *skb, struct netlink_callback *cb, int pos, bool is_srv) { struct smc_nl_dmp_ctx *cb_ctx = smc_nl_dmp_ctx(cb); struct net *net = sock_net(skb->sk); int cnt_reported = cb_ctx->pos[2]; struct smc_stats_fback *trgt_arr; struct nlattr *attrs; int rc = 0; void *nlh; if (is_srv) trgt_arr = &net->smc.fback_rsn->srv[0]; else trgt_arr = &net->smc.fback_rsn->clnt[0]; if (!trgt_arr[pos].fback_code) return -ENODATA; nlh = genlmsg_put(skb, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq, &smc_gen_nl_family, NLM_F_MULTI, SMC_NETLINK_GET_FBACK_STATS); if (!nlh) goto errmsg; attrs = nla_nest_start(skb, SMC_GEN_FBACK_STATS); if (!attrs) goto errout; if (nla_put_u8(skb, SMC_NLA_FBACK_STATS_TYPE, is_srv)) goto errattr; if (!cnt_reported) { if (nla_put_u64_64bit(skb, SMC_NLA_FBACK_STATS_SRV_CNT, net->smc.fback_rsn->srv_fback_cnt, SMC_NLA_FBACK_STATS_PAD)) goto errattr; if (nla_put_u64_64bit(skb, SMC_NLA_FBACK_STATS_CLNT_CNT, net->smc.fback_rsn->clnt_fback_cnt, SMC_NLA_FBACK_STATS_PAD)) goto errattr; cnt_reported = 1; } if (nla_put_u32(skb, SMC_NLA_FBACK_STATS_RSN_CODE, trgt_arr[pos].fback_code)) goto errattr; if (nla_put_u16(skb, SMC_NLA_FBACK_STATS_RSN_CNT, trgt_arr[pos].count)) goto errattr; cb_ctx->pos[2] = cnt_reported; nla_nest_end(skb, attrs); genlmsg_end(skb, nlh); return rc; errattr: nla_nest_cancel(skb, attrs); errout: genlmsg_cancel(skb, nlh); errmsg: return -EMSGSIZE; } int smc_nl_get_fback_stats(struct sk_buff *skb, struct netlink_callback *cb) { struct smc_nl_dmp_ctx *cb_ctx = smc_nl_dmp_ctx(cb); struct net *net = sock_net(skb->sk); int rc_srv = 0, rc_clnt = 0, k; int skip_serv = cb_ctx->pos[1]; int snum = cb_ctx->pos[0]; bool is_srv = true; mutex_lock(&net->smc.mutex_fback_rsn); for (k = 0; k < SMC_MAX_FBACK_RSN_CNT; k++) { if (k < snum) continue; if (!skip_serv) { rc_srv = smc_nl_get_fback_details(skb, cb, k, is_srv); if (rc_srv && rc_srv != -ENODATA) break; } else { skip_serv = 0; } rc_clnt = smc_nl_get_fback_details(skb, cb, k, !is_srv); if (rc_clnt && rc_clnt != -ENODATA) { skip_serv = 1; break; } if (rc_clnt == -ENODATA && rc_srv == -ENODATA) break; } mutex_unlock(&net->smc.mutex_fback_rsn); cb_ctx->pos[1] = skip_serv; cb_ctx->pos[0] = k; return skb->len; } |
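The aggregation loop in smc_nl_get_stats() above only works because struct smc_stats consists entirely of u64 counters, so each per-CPU copy can be folded into the total by walking it as a flat u64 array. A minimal userspace sketch of that idea, assuming a made-up struct demo_stats and a fixed DEMO_NR_CPUS in place of the real per-CPU data:

#include <inttypes.h>
#include <stdio.h>
#include <string.h>

#define DEMO_NR_CPUS 4	/* stand-in for the set of possible CPUs */

struct demo_stats {	/* hypothetical stand-in for struct smc_stats: u64 counters only */
	uint64_t rx_cnt;
	uint64_t tx_cnt;
	uint64_t rx_bytes;
	uint64_t tx_bytes;
};

int main(void)
{
	struct demo_stats percpu[DEMO_NR_CPUS] = {
		{ 1, 2, 100, 200 }, { 3, 4, 300, 400 },
		{ 5, 6, 500, 600 }, { 7, 8, 700, 800 },
	};
	struct demo_stats total;
	size_t cpu, i, nwords = sizeof(total) / sizeof(uint64_t);

	memset(&total, 0, sizeof(total));
	for (cpu = 0; cpu < DEMO_NR_CPUS; cpu++) {
		const uint64_t *src = (const uint64_t *)&percpu[cpu];
		uint64_t *sum = (uint64_t *)&total;

		/* same word-by-word fold as the for_each_possible_cpu() loop above */
		for (i = 0; i < nwords; i++)
			*sum++ += *src++;
	}
	printf("rx_cnt=%" PRIu64 " tx_bytes=%" PRIu64 "\n", total.rx_cnt, total.tx_bytes);
	return 0;
}

The fold relies on that all-u64 layout; a differently sized member in the stats struct would silently break the word-by-word summation.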
/* SPDX-License-Identifier: GPL-2.0 */ #ifndef _LINUX_CPUSET_H #define _LINUX_CPUSET_H /* * cpuset interface * * Copyright (C) 2003 BULL SA * Copyright (C) 2004-2006 Silicon Graphics, Inc. * */ #include <linux/sched.h> #include <linux/sched/topology.h> #include <linux/sched/task.h> #include <linux/cpumask.h> #include <linux/nodemask.h> #include <linux/mm.h> #include <linux/mmu_context.h> #include <linux/jump_label.h> #ifdef CONFIG_CPUSETS /* * Static branch rewrites can happen in an arbitrary order for a given * key. In code paths where we need to loop with read_mems_allowed_begin() and * read_mems_allowed_retry() to get a consistent view of mems_allowed, we need * to ensure that begin() always gets rewritten before retry() in the * disabled -> enabled transition. If not, then if local irqs are disabled * around the loop, we can deadlock since retry() would always be * comparing the latest value of the mems_allowed seqcount against 0 as * begin() still would see cpusets_enabled() as false. The enabled -> disabled * transition should happen in reverse order for the same reasons (want to stop * looking at real value of mems_allowed.sequence in retry() first). */ extern struct static_key_false cpusets_pre_enable_key; extern struct static_key_false cpusets_enabled_key; extern struct static_key_false cpusets_insane_config_key; static inline bool cpusets_enabled(void) { return static_branch_unlikely(&cpusets_enabled_key); } static inline void cpuset_inc(void) { static_branch_inc_cpuslocked(&cpusets_pre_enable_key); static_branch_inc_cpuslocked(&cpusets_enabled_key); } static inline void cpuset_dec(void) { static_branch_dec_cpuslocked(&cpusets_enabled_key); static_branch_dec_cpuslocked(&cpusets_pre_enable_key); } /* * This will get enabled whenever a cpuset configuration is considered * unsupportable in general. E.g. movable only node which cannot satisfy * any non movable allocations (see update_nodemask). Page allocator * needs to make additional checks for those configurations and this * check is meant to guard those checks without any overhead for sane * configurations.
*/ static inline bool cpusets_insane_config(void) { return static_branch_unlikely(&cpusets_insane_config_key); } extern int cpuset_init(void); extern void cpuset_init_smp(void); extern void cpuset_force_rebuild(void); extern void cpuset_update_active_cpus(void); extern void inc_dl_tasks_cs(struct task_struct *task); extern void dec_dl_tasks_cs(struct task_struct *task); extern void cpuset_lock(void); extern void cpuset_unlock(void); extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask); extern bool cpuset_cpus_allowed_fallback(struct task_struct *p); extern bool cpuset_cpu_is_isolated(int cpu); extern nodemask_t cpuset_mems_allowed(struct task_struct *p); #define cpuset_current_mems_allowed (current->mems_allowed) void cpuset_init_current_mems_allowed(void); int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask); extern bool cpuset_node_allowed(int node, gfp_t gfp_mask); static inline bool __cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask) { return cpuset_node_allowed(zone_to_nid(z), gfp_mask); } static inline bool cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask) { if (cpusets_enabled()) return __cpuset_zone_allowed(z, gfp_mask); return true; } extern int cpuset_mems_allowed_intersects(const struct task_struct *tsk1, const struct task_struct *tsk2); #ifdef CONFIG_CPUSETS_V1 #define cpuset_memory_pressure_bump() \ do { \ if (cpuset_memory_pressure_enabled) \ __cpuset_memory_pressure_bump(); \ } while (0) extern int cpuset_memory_pressure_enabled; extern void __cpuset_memory_pressure_bump(void); #else static inline void cpuset_memory_pressure_bump(void) { } #endif extern void cpuset_task_status_allowed(struct seq_file *m, struct task_struct *task); extern int proc_cpuset_show(struct seq_file *m, struct pid_namespace *ns, struct pid *pid, struct task_struct *tsk); extern int cpuset_mem_spread_node(void); static inline int cpuset_do_page_mem_spread(void) { return task_spread_page(current); } extern bool current_cpuset_is_being_rebound(void); extern void rebuild_sched_domains(void); extern void cpuset_print_current_mems_allowed(void); /* * read_mems_allowed_begin is required when making decisions involving * mems_allowed such as during page allocation. mems_allowed can be updated in * parallel and depending on the new value an operation can fail potentially * causing process failure. A retry loop with read_mems_allowed_begin and * read_mems_allowed_retry prevents these artificial failures. */ static inline unsigned int read_mems_allowed_begin(void) { if (!static_branch_unlikely(&cpusets_pre_enable_key)) return 0; return read_seqcount_begin(&current->mems_allowed_seq); } /* * If this returns true, the operation that took place after * read_mems_allowed_begin may have failed artificially due to a concurrent * update of mems_allowed. It is up to the caller to retry the operation if * appropriate.
*/ static inline bool read_mems_allowed_retry(unsigned int seq) { if (!static_branch_unlikely(&cpusets_enabled_key)) return false; return read_seqcount_retry(&current->mems_allowed_seq, seq); } static inline void set_mems_allowed(nodemask_t nodemask) { unsigned long flags; task_lock(current); local_irq_save(flags); write_seqcount_begin(&current->mems_allowed_seq); current->mems_allowed = nodemask; write_seqcount_end(&current->mems_allowed_seq); local_irq_restore(flags); task_unlock(current); } #else /* !CONFIG_CPUSETS */ static inline bool cpusets_enabled(void) { return false; } static inline bool cpusets_insane_config(void) { return false; } static inline int cpuset_init(void) { return 0; } static inline void cpuset_init_smp(void) {} static inline void cpuset_force_rebuild(void) { } static inline void cpuset_update_active_cpus(void) { partition_sched_domains(1, NULL, NULL); } static inline void inc_dl_tasks_cs(struct task_struct *task) { } static inline void dec_dl_tasks_cs(struct task_struct *task) { } static inline void cpuset_lock(void) { } static inline void cpuset_unlock(void) { } static inline void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask) { cpumask_copy(mask, task_cpu_possible_mask(p)); } static inline bool cpuset_cpus_allowed_fallback(struct task_struct *p) { return false; } static inline bool cpuset_cpu_is_isolated(int cpu) { return false; } static inline nodemask_t cpuset_mems_allowed(struct task_struct *p) { return node_possible_map; } #define cpuset_current_mems_allowed (node_states[N_MEMORY]) static inline void cpuset_init_current_mems_allowed(void) {} static inline int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask) { return 1; } static inline bool __cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask) { return true; } static inline bool cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask) { return true; } static inline int cpuset_mems_allowed_intersects(const struct task_struct *tsk1, const struct task_struct *tsk2) { return 1; } static inline void cpuset_memory_pressure_bump(void) {} static inline void cpuset_task_status_allowed(struct seq_file *m, struct task_struct *task) { } static inline int cpuset_mem_spread_node(void) { return 0; } static inline int cpuset_do_page_mem_spread(void) { return 0; } static inline bool current_cpuset_is_being_rebound(void) { return false; } static inline void rebuild_sched_domains(void) { partition_sched_domains(1, NULL, NULL); } static inline void cpuset_print_current_mems_allowed(void) { } static inline void set_mems_allowed(nodemask_t nodemask) { } static inline unsigned int read_mems_allowed_begin(void) { return 0; } static inline bool read_mems_allowed_retry(unsigned int seq) { return false; } #endif /* !CONFIG_CPUSETS */ #endif /* _LINUX_CPUSET_H */ |
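The comments above describe a snapshot-and-retry protocol: take a seqcount snapshot of mems_allowed, attempt the operation, and retry if the mask was rewritten concurrently. A minimal sketch of how a caller is expected to use the pair, assuming a hypothetical try_alloc_on_node() helper that stands in for the real page-allocator internals:

#include <linux/cpuset.h>
#include <linux/gfp.h>
#include <linux/nodemask.h>

struct page;
/* hypothetical allocation helper, used only for this sketch */
struct page *try_alloc_on_node(int nid, gfp_t gfp);

static struct page *demo_alloc_respecting_mems_allowed(gfp_t gfp)
{
	struct page *page;
	unsigned int seq;
	int nid;

	do {
		/* snapshot mems_allowed; returns 0 when cpusets are disabled */
		seq = read_mems_allowed_begin();
		nid = first_node(cpuset_current_mems_allowed);
		page = try_alloc_on_node(nid, gfp);
		/*
		 * A failure observed while mems_allowed was being rewritten
		 * concurrently may be artificial; retry against the updated
		 * mask instead of reporting the failure to the caller.
		 */
	} while (!page && read_mems_allowed_retry(seq));

	return page;
}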
/* SPDX-License-Identifier: GPL-2.0 */ /* Copyright (C) B.A.T.M.A.N. contributors: * * Marek Lindner */ #ifndef _NET_BATMAN_ADV_GATEWAY_CLIENT_H_ #define _NET_BATMAN_ADV_GATEWAY_CLIENT_H_ #include "main.h" #include <linux/kref.h> #include <linux/netlink.h> #include <linux/skbuff.h> #include <linux/types.h> #include <uapi/linux/batadv_packet.h> void batadv_gw_check_client_stop(struct batadv_priv *bat_priv); void batadv_gw_reselect(struct batadv_priv *bat_priv); void batadv_gw_election(struct batadv_priv *bat_priv); struct batadv_orig_node * batadv_gw_get_selected_orig(struct batadv_priv *bat_priv); void batadv_gw_check_election(struct batadv_priv *bat_priv, struct batadv_orig_node *orig_node); void batadv_gw_node_update(struct batadv_priv *bat_priv, struct batadv_orig_node *orig_node, struct batadv_tvlv_gateway_data *gateway); void batadv_gw_node_delete(struct batadv_priv *bat_priv, struct batadv_orig_node *orig_node); void batadv_gw_node_free(struct batadv_priv *bat_priv); void batadv_gw_node_release(struct kref *ref); struct batadv_gw_node * batadv_gw_get_selected_gw_node(struct batadv_priv *bat_priv); int batadv_gw_dump(struct sk_buff *msg, struct netlink_callback *cb); bool batadv_gw_out_of_range(struct batadv_priv *bat_priv, struct sk_buff *skb); enum batadv_dhcp_recipient batadv_gw_dhcp_recipient_get(struct sk_buff *skb, unsigned int *header_len, u8 *chaddr); struct batadv_gw_node *batadv_gw_node_get(struct batadv_priv *bat_priv, struct batadv_orig_node *orig_node); /** * batadv_gw_node_put() - decrement the gw_node refcounter and possibly release * it * @gw_node: gateway node to free */ static inline void batadv_gw_node_put(struct batadv_gw_node *gw_node) { if (!gw_node) return; kref_put(&gw_node->refcount, batadv_gw_node_release); } #endif /* _NET_BATMAN_ADV_GATEWAY_CLIENT_H_ */ |
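The kernel-doc for batadv_gw_node_put() above implies the usual get/put reference-counting contract: the node returned by batadv_gw_get_selected_gw_node() carries a reference that the caller must drop. A short usage sketch under that assumption; demo_have_selected_gw() is invented for illustration and the includes mirror what batman-adv callers typically pull in:

#include "gateway_client.h"
#include "types.h"

/* illustration only: report whether a gateway is currently selected */
static bool demo_have_selected_gw(struct batadv_priv *bat_priv)
{
	struct batadv_gw_node *gw_node;
	bool selected;

	/* takes a reference on the returned node, or returns NULL */
	gw_node = batadv_gw_get_selected_gw_node(bat_priv);
	selected = gw_node != NULL;

	/* always pair the get with a put; NULL is handled by the helper */
	batadv_gw_node_put(gw_node);

	return selected;
}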
// SPDX-License-Identifier: GPL-2.0 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt #include <linux/objtool.h> #include <linux/percpu.h> #include <asm/debugreg.h> #include <asm/mmu_context.h> #include "x86.h" #include "cpuid.h" #include "hyperv.h" #include "mmu.h" #include "nested.h" #include "pmu.h" #include "posted_intr.h" #include "sgx.h" #include "trace.h" #include "vmx.h" #include "smm.h" static bool __read_mostly enable_shadow_vmcs = 1; module_param_named(enable_shadow_vmcs, enable_shadow_vmcs, bool, S_IRUGO); static bool __read_mostly nested_early_check = 0; module_param(nested_early_check, bool, S_IRUGO); #define CC KVM_NESTED_VMENTER_CONSISTENCY_CHECK /* * Hyper-V requires all of these, so mark them as supported even though * they are just treated the same as all-context. */ #define VMX_VPID_EXTENT_SUPPORTED_MASK \ (VMX_VPID_EXTENT_INDIVIDUAL_ADDR_BIT | \ VMX_VPID_EXTENT_SINGLE_CONTEXT_BIT | \ VMX_VPID_EXTENT_GLOBAL_CONTEXT_BIT | \ VMX_VPID_EXTENT_SINGLE_NON_GLOBAL_BIT) #define VMX_MISC_EMULATED_PREEMPTION_TIMER_RATE 5 enum { VMX_VMREAD_BITMAP, VMX_VMWRITE_BITMAP, VMX_BITMAP_NR }; static unsigned long *vmx_bitmap[VMX_BITMAP_NR]; #define vmx_vmread_bitmap (vmx_bitmap[VMX_VMREAD_BITMAP]) #define vmx_vmwrite_bitmap (vmx_bitmap[VMX_VMWRITE_BITMAP]) struct shadow_vmcs_field { u16 encoding; u16 offset; }; static struct shadow_vmcs_field shadow_read_only_fields[] = { #define SHADOW_FIELD_RO(x, y) { x, offsetof(struct vmcs12, y) }, #include "vmcs_shadow_fields.h" }; static int max_shadow_read_only_fields = ARRAY_SIZE(shadow_read_only_fields); static struct shadow_vmcs_field shadow_read_write_fields[] = { #define SHADOW_FIELD_RW(x, y) { x, offsetof(struct vmcs12, y) }, #include "vmcs_shadow_fields.h" }; static int max_shadow_read_write_fields = ARRAY_SIZE(shadow_read_write_fields); static void init_vmcs_shadow_fields(void) { int i, j; memset(vmx_vmread_bitmap, 0xff, PAGE_SIZE); memset(vmx_vmwrite_bitmap, 0xff, PAGE_SIZE); for (i = j = 0; i < max_shadow_read_only_fields; i++) { struct shadow_vmcs_field entry = shadow_read_only_fields[i]; u16 field = entry.encoding; if (vmcs_field_width(field) == VMCS_FIELD_WIDTH_U64 && (i + 1 == max_shadow_read_only_fields || shadow_read_only_fields[i + 1].encoding != field + 1)) pr_err("Missing field from shadow_read_only_field %x\n", field + 1); clear_bit(field, vmx_vmread_bitmap); if (field & 1) #ifdef CONFIG_X86_64 continue; #else entry.offset += sizeof(u32); #endif shadow_read_only_fields[j++] = entry; } max_shadow_read_only_fields = j; for (i = j = 0; i < max_shadow_read_write_fields; i++) { struct shadow_vmcs_field entry = shadow_read_write_fields[i]; u16 field = entry.encoding; if (vmcs_field_width(field) == VMCS_FIELD_WIDTH_U64 && (i + 1 == max_shadow_read_write_fields || shadow_read_write_fields[i + 1].encoding != field + 1)) pr_err("Missing field from shadow_read_write_field %x\n", field + 1); WARN_ONCE(field >= GUEST_ES_AR_BYTES && field <= GUEST_TR_AR_BYTES, "Update vmcs12_write_any() to drop reserved bits from AR_BYTES"); /* * PML and the preemption timer can be emulated, but the * processor cannot vmwrite to fields that don't exist * on bare metal.
*/ switch (field) { case GUEST_PML_INDEX: if (!cpu_has_vmx_pml()) continue; break; case VMX_PREEMPTION_TIMER_VALUE: if (!cpu_has_vmx_preemption_timer()) continue; break; case GUEST_INTR_STATUS: if (!cpu_has_vmx_apicv()) continue; break; default: break; } clear_bit(field, vmx_vmwrite_bitmap); clear_bit(field, vmx_vmread_bitmap); if (field & 1) #ifdef CONFIG_X86_64 continue; #else entry.offset += sizeof(u32); #endif shadow_read_write_fields[j++] = entry; } max_shadow_read_write_fields = j; } /* * The following 3 functions, nested_vmx_succeed()/failValid()/failInvalid(), * set the success or error code of an emulated VMX instruction (as specified * by Vol 2B, VMX Instruction Reference, "Conventions"), and skip the emulated * instruction. */ static int nested_vmx_succeed(struct kvm_vcpu *vcpu) { vmx_set_rflags(vcpu, vmx_get_rflags(vcpu) & ~(X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF | X86_EFLAGS_ZF | X86_EFLAGS_SF | X86_EFLAGS_OF)); return kvm_skip_emulated_instruction(vcpu); } static int nested_vmx_failInvalid(struct kvm_vcpu *vcpu) { vmx_set_rflags(vcpu, (vmx_get_rflags(vcpu) & ~(X86_EFLAGS_PF | X86_EFLAGS_AF | X86_EFLAGS_ZF | X86_EFLAGS_SF | X86_EFLAGS_OF)) | X86_EFLAGS_CF); return kvm_skip_emulated_instruction(vcpu); } static int nested_vmx_failValid(struct kvm_vcpu *vcpu, u32 vm_instruction_error) { vmx_set_rflags(vcpu, (vmx_get_rflags(vcpu) & ~(X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF | X86_EFLAGS_SF | X86_EFLAGS_OF)) | X86_EFLAGS_ZF); get_vmcs12(vcpu)->vm_instruction_error = vm_instruction_error; /* * We don't need to force sync to shadow VMCS because * VM_INSTRUCTION_ERROR is not shadowed. Enlightened VMCS 'shadows' all * fields and thus must be synced. */ if (nested_vmx_is_evmptr12_set(to_vmx(vcpu))) to_vmx(vcpu)->nested.need_vmcs12_to_shadow_sync = true; return kvm_skip_emulated_instruction(vcpu); } static int nested_vmx_fail(struct kvm_vcpu *vcpu, u32 vm_instruction_error) { struct vcpu_vmx *vmx = to_vmx(vcpu); /* * failValid writes the error number to the current VMCS, which * can't be done if there isn't a current VMCS. */ if (vmx->nested.current_vmptr == INVALID_GPA && !nested_vmx_is_evmptr12_valid(vmx)) return nested_vmx_failInvalid(vcpu); return nested_vmx_failValid(vcpu, vm_instruction_error); } static void nested_vmx_abort(struct kvm_vcpu *vcpu, u32 indicator) { /* TODO: not to reset guest simply here. 
*/ kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu); pr_debug_ratelimited("nested vmx abort, indicator %d\n", indicator); } static inline bool vmx_control_verify(u32 control, u32 low, u32 high) { return fixed_bits_valid(control, low, high); } static inline u64 vmx_control_msr(u32 low, u32 high) { return low | ((u64)high << 32); } static void vmx_disable_shadow_vmcs(struct vcpu_vmx *vmx) { secondary_exec_controls_clearbit(vmx, SECONDARY_EXEC_SHADOW_VMCS); vmcs_write64(VMCS_LINK_POINTER, INVALID_GPA); vmx->nested.need_vmcs12_to_shadow_sync = false; } static inline void nested_release_evmcs(struct kvm_vcpu *vcpu) { #ifdef CONFIG_KVM_HYPERV struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu); struct vcpu_vmx *vmx = to_vmx(vcpu); kvm_vcpu_unmap(vcpu, &vmx->nested.hv_evmcs_map); vmx->nested.hv_evmcs = NULL; vmx->nested.hv_evmcs_vmptr = EVMPTR_INVALID; if (hv_vcpu) { hv_vcpu->nested.pa_page_gpa = INVALID_GPA; hv_vcpu->nested.vm_id = 0; hv_vcpu->nested.vp_id = 0; } #endif } static bool nested_evmcs_handle_vmclear(struct kvm_vcpu *vcpu, gpa_t vmptr) { #ifdef CONFIG_KVM_HYPERV struct vcpu_vmx *vmx = to_vmx(vcpu); /* * When Enlightened VMEntry is enabled on the calling CPU we treat * memory area pointer by vmptr as Enlightened VMCS (as there's no good * way to distinguish it from VMCS12) and we must not corrupt it by * writing to the non-existent 'launch_state' field. The area doesn't * have to be the currently active EVMCS on the calling CPU and there's * nothing KVM has to do to transition it from 'active' to 'non-active' * state. It is possible that the area will stay mapped as * vmx->nested.hv_evmcs but this shouldn't be a problem. */ if (!guest_cpu_cap_has_evmcs(vcpu) || !evmptr_is_valid(nested_get_evmptr(vcpu))) return false; if (nested_vmx_evmcs(vmx) && vmptr == vmx->nested.hv_evmcs_vmptr) nested_release_evmcs(vcpu); return true; #else return false; #endif } static void vmx_sync_vmcs_host_state(struct vcpu_vmx *vmx, struct loaded_vmcs *prev) { struct vmcs_host_state *dest, *src; if (unlikely(!vmx->guest_state_loaded)) return; src = &prev->host_state; dest = &vmx->loaded_vmcs->host_state; vmx_set_host_fs_gs(dest, src->fs_sel, src->gs_sel, src->fs_base, src->gs_base); dest->ldt_sel = src->ldt_sel; #ifdef CONFIG_X86_64 dest->ds_sel = src->ds_sel; dest->es_sel = src->es_sel; #endif } static void vmx_switch_vmcs(struct kvm_vcpu *vcpu, struct loaded_vmcs *vmcs) { struct vcpu_vmx *vmx = to_vmx(vcpu); struct loaded_vmcs *prev; int cpu; if (WARN_ON_ONCE(vmx->loaded_vmcs == vmcs)) return; cpu = get_cpu(); prev = vmx->loaded_vmcs; vmx->loaded_vmcs = vmcs; vmx_vcpu_load_vmcs(vcpu, cpu, prev); vmx_sync_vmcs_host_state(vmx, prev); put_cpu(); vcpu->arch.regs_avail = ~VMX_REGS_LAZY_LOAD_SET; /* * All lazily updated registers will be reloaded from VMCS12 on both * vmentry and vmexit. */ vcpu->arch.regs_dirty = 0; } static void nested_put_vmcs12_pages(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); kvm_vcpu_unmap(vcpu, &vmx->nested.apic_access_page_map); kvm_vcpu_unmap(vcpu, &vmx->nested.virtual_apic_map); kvm_vcpu_unmap(vcpu, &vmx->nested.pi_desc_map); vmx->nested.pi_desc = NULL; } /* * Free whatever needs to be freed from vmx->nested when L1 goes down, or * just stops using VMX. 
*/ static void free_nested(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); if (WARN_ON_ONCE(vmx->loaded_vmcs != &vmx->vmcs01)) vmx_switch_vmcs(vcpu, &vmx->vmcs01); if (!vmx->nested.vmxon && !vmx->nested.smm.vmxon) return; kvm_clear_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu); vmx->nested.vmxon = false; vmx->nested.smm.vmxon = false; vmx->nested.vmxon_ptr = INVALID_GPA; free_vpid(vmx->nested.vpid02); vmx->nested.posted_intr_nv = -1; vmx->nested.current_vmptr = INVALID_GPA; if (enable_shadow_vmcs) { vmx_disable_shadow_vmcs(vmx); vmcs_clear(vmx->vmcs01.shadow_vmcs); free_vmcs(vmx->vmcs01.shadow_vmcs); vmx->vmcs01.shadow_vmcs = NULL; } kfree(vmx->nested.cached_vmcs12); vmx->nested.cached_vmcs12 = NULL; kfree(vmx->nested.cached_shadow_vmcs12); vmx->nested.cached_shadow_vmcs12 = NULL; nested_put_vmcs12_pages(vcpu); kvm_mmu_free_roots(vcpu->kvm, &vcpu->arch.guest_mmu, KVM_MMU_ROOTS_ALL); nested_release_evmcs(vcpu); free_loaded_vmcs(&vmx->nested.vmcs02); } /* * Ensure that the current vmcs of the logical processor is the * vmcs01 of the vcpu before calling free_nested(). */ void nested_vmx_free_vcpu(struct kvm_vcpu *vcpu) { vcpu_load(vcpu); vmx_leave_nested(vcpu); vcpu_put(vcpu); } #define EPTP_PA_MASK GENMASK_ULL(51, 12) static bool nested_ept_root_matches(hpa_t root_hpa, u64 root_eptp, u64 eptp) { return VALID_PAGE(root_hpa) && ((root_eptp & EPTP_PA_MASK) == (eptp & EPTP_PA_MASK)); } static void nested_ept_invalidate_addr(struct kvm_vcpu *vcpu, gpa_t eptp, gpa_t addr) { unsigned long roots = 0; uint i; struct kvm_mmu_root_info *cached_root; WARN_ON_ONCE(!mmu_is_nested(vcpu)); for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) { cached_root = &vcpu->arch.mmu->prev_roots[i]; if (nested_ept_root_matches(cached_root->hpa, cached_root->pgd, eptp)) roots |= KVM_MMU_ROOT_PREVIOUS(i); } if (roots) kvm_mmu_invalidate_addr(vcpu, vcpu->arch.mmu, addr, roots); } static void nested_ept_inject_page_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault) { struct vmcs12 *vmcs12 = get_vmcs12(vcpu); struct vcpu_vmx *vmx = to_vmx(vcpu); unsigned long exit_qualification; u32 vm_exit_reason; if (vmx->nested.pml_full) { vm_exit_reason = EXIT_REASON_PML_FULL; vmx->nested.pml_full = false; /* * It should be impossible to trigger a nested PML Full VM-Exit * for anything other than an EPT Violation from L2. KVM *can* * trigger nEPT page fault injection in response to an EPT * Misconfig, e.g. if the MMIO SPTE was stale and L1's EPT * tables also changed, but KVM should not treat EPT Misconfig * VM-Exits as writes. */ WARN_ON_ONCE(vmx->exit_reason.basic != EXIT_REASON_EPT_VIOLATION); /* * PML Full and EPT Violation VM-Exits both use bit 12 to report * "NMI unblocking due to IRET", i.e. the bit can be propagated * as-is from the original EXIT_QUALIFICATION. */ exit_qualification = vmx_get_exit_qual(vcpu) & INTR_INFO_UNBLOCK_NMI; } else { if (fault->error_code & PFERR_RSVD_MASK) { vm_exit_reason = EXIT_REASON_EPT_MISCONFIG; exit_qualification = 0; } else { exit_qualification = fault->exit_qualification; exit_qualification |= vmx_get_exit_qual(vcpu) & (EPT_VIOLATION_GVA_IS_VALID | EPT_VIOLATION_GVA_TRANSLATED); vm_exit_reason = EXIT_REASON_EPT_VIOLATION; } /* * Although the caller (kvm_inject_emulated_page_fault) would * have already synced the faulting address in the shadow EPT * tables for the current EPTP12, we also need to sync it for * any other cached EPTP02s based on the same EP4TA, since the * TLB associates mappings to the EP4TA rather than the full EPTP. 
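		 * (The EP4TA is the EPT root table address, i.e. bits 51:12
		 * of the EPTP, which is what nested_ept_root_matches()
		 * compares via EPTP_PA_MASK.)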
*/ nested_ept_invalidate_addr(vcpu, vmcs12->ept_pointer, fault->address); } nested_vmx_vmexit(vcpu, vm_exit_reason, 0, exit_qualification); vmcs12->guest_physical_address = fault->address; } static void nested_ept_new_eptp(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); bool execonly = vmx->nested.msrs.ept_caps & VMX_EPT_EXECUTE_ONLY_BIT; int ept_lpage_level = ept_caps_to_lpage_level(vmx->nested.msrs.ept_caps); kvm_init_shadow_ept_mmu(vcpu, execonly, ept_lpage_level, nested_ept_ad_enabled(vcpu), nested_ept_get_eptp(vcpu)); } static void nested_ept_init_mmu_context(struct kvm_vcpu *vcpu) { WARN_ON(mmu_is_nested(vcpu)); vcpu->arch.mmu = &vcpu->arch.guest_mmu; nested_ept_new_eptp(vcpu); vcpu->arch.mmu->get_guest_pgd = nested_ept_get_eptp; vcpu->arch.mmu->inject_page_fault = nested_ept_inject_page_fault; vcpu->arch.mmu->get_pdptr = kvm_pdptr_read; vcpu->arch.walk_mmu = &vcpu->arch.nested_mmu; } static void nested_ept_uninit_mmu_context(struct kvm_vcpu *vcpu) { vcpu->arch.mmu = &vcpu->arch.root_mmu; vcpu->arch.walk_mmu = &vcpu->arch.root_mmu; } static bool nested_vmx_is_page_fault_vmexit(struct vmcs12 *vmcs12, u16 error_code) { bool inequality, bit; bit = (vmcs12->exception_bitmap & (1u << PF_VECTOR)) != 0; inequality = (error_code & vmcs12->page_fault_error_code_mask) != vmcs12->page_fault_error_code_match; return inequality ^ bit; } static bool nested_vmx_is_exception_vmexit(struct kvm_vcpu *vcpu, u8 vector, u32 error_code) { struct vmcs12 *vmcs12 = get_vmcs12(vcpu); /* * Drop bits 31:16 of the error code when performing the #PF mask+match * check. All VMCS fields involved are 32 bits, but Intel CPUs never * set bits 31:16 and VMX disallows setting bits 31:16 in the injected * error code. Including the to-be-dropped bits in the check might * result in an "impossible" or missed exit from L1's perspective. */ if (vector == PF_VECTOR) return nested_vmx_is_page_fault_vmexit(vmcs12, (u16)error_code); return (vmcs12->exception_bitmap & (1u << vector)); } static int nested_vmx_check_io_bitmap_controls(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { if (!nested_cpu_has(vmcs12, CPU_BASED_USE_IO_BITMAPS)) return 0; if (CC(!page_address_valid(vcpu, vmcs12->io_bitmap_a)) || CC(!page_address_valid(vcpu, vmcs12->io_bitmap_b))) return -EINVAL; return 0; } static int nested_vmx_check_msr_bitmap_controls(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { if (!nested_cpu_has(vmcs12, CPU_BASED_USE_MSR_BITMAPS)) return 0; if (CC(!page_address_valid(vcpu, vmcs12->msr_bitmap))) return -EINVAL; return 0; } static int nested_vmx_check_tpr_shadow_controls(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { if (!nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW)) return 0; if (CC(!page_address_valid(vcpu, vmcs12->virtual_apic_page_addr))) return -EINVAL; return 0; } /* * For x2APIC MSRs, ignore the vmcs01 bitmap. L1 can enable x2APIC without L1 * itself utilizing x2APIC. All MSRs were previously set to be intercepted, * only the "disable intercept" case needs to be handled. 
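 * Note that the x2APIC MSR range in question is 0x800 - 0x8ff, matching the
 * loop in enable_x2apic_msr_intercepts() below.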
*/ static void nested_vmx_disable_intercept_for_x2apic_msr(unsigned long *msr_bitmap_l1, unsigned long *msr_bitmap_l0, u32 msr, int type) { if (type & MSR_TYPE_R && !vmx_test_msr_bitmap_read(msr_bitmap_l1, msr)) vmx_clear_msr_bitmap_read(msr_bitmap_l0, msr); if (type & MSR_TYPE_W && !vmx_test_msr_bitmap_write(msr_bitmap_l1, msr)) vmx_clear_msr_bitmap_write(msr_bitmap_l0, msr); } static inline void enable_x2apic_msr_intercepts(unsigned long *msr_bitmap) { int msr; for (msr = 0x800; msr <= 0x8ff; msr += BITS_PER_LONG) { unsigned word = msr / BITS_PER_LONG; msr_bitmap[word] = ~0; msr_bitmap[word + (0x800 / sizeof(long))] = ~0; } } #define BUILD_NVMX_MSR_INTERCEPT_HELPER(rw) \ static inline \ void nested_vmx_set_msr_##rw##_intercept(struct vcpu_vmx *vmx, \ unsigned long *msr_bitmap_l1, \ unsigned long *msr_bitmap_l0, u32 msr) \ { \ if (vmx_test_msr_bitmap_##rw(vmx->vmcs01.msr_bitmap, msr) || \ vmx_test_msr_bitmap_##rw(msr_bitmap_l1, msr)) \ vmx_set_msr_bitmap_##rw(msr_bitmap_l0, msr); \ else \ vmx_clear_msr_bitmap_##rw(msr_bitmap_l0, msr); \ } BUILD_NVMX_MSR_INTERCEPT_HELPER(read) BUILD_NVMX_MSR_INTERCEPT_HELPER(write) static inline void nested_vmx_set_intercept_for_msr(struct vcpu_vmx *vmx, unsigned long *msr_bitmap_l1, unsigned long *msr_bitmap_l0, u32 msr, int types) { if (types & MSR_TYPE_R) nested_vmx_set_msr_read_intercept(vmx, msr_bitmap_l1, msr_bitmap_l0, msr); if (types & MSR_TYPE_W) nested_vmx_set_msr_write_intercept(vmx, msr_bitmap_l1, msr_bitmap_l0, msr); } /* * Merge L0's and L1's MSR bitmap, return false to indicate that * we do not use the hardware. */ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { struct vcpu_vmx *vmx = to_vmx(vcpu); int msr; unsigned long *msr_bitmap_l1; unsigned long *msr_bitmap_l0 = vmx->nested.vmcs02.msr_bitmap; struct kvm_host_map map; /* Nothing to do if the MSR bitmap is not in use. */ if (!cpu_has_vmx_msr_bitmap() || !nested_cpu_has(vmcs12, CPU_BASED_USE_MSR_BITMAPS)) return false; /* * MSR bitmap update can be skipped when: * - MSR bitmap for L1 hasn't changed. * - Nested hypervisor (L1) is attempting to launch the same L2 as * before. * - Nested hypervisor (L1) has enabled 'Enlightened MSR Bitmap' feature * and tells KVM (L0) there were no changes in MSR bitmap for L2. */ if (!vmx->nested.force_msr_bitmap_recalc) { struct hv_enlightened_vmcs *evmcs = nested_vmx_evmcs(vmx); if (evmcs && evmcs->hv_enlightenments_control.msr_bitmap && evmcs->hv_clean_fields & HV_VMX_ENLIGHTENED_CLEAN_FIELD_MSR_BITMAP) return true; } if (kvm_vcpu_map_readonly(vcpu, gpa_to_gfn(vmcs12->msr_bitmap), &map)) return false; msr_bitmap_l1 = (unsigned long *)map.hva; /* * To keep the control flow simple, pay eight 8-byte writes (sixteen * 4-byte writes on 32-bit systems) up front to enable intercepts for * the x2APIC MSR range and selectively toggle those relevant to L2. */ enable_x2apic_msr_intercepts(msr_bitmap_l0); if (nested_cpu_has_virt_x2apic_mode(vmcs12)) { if (nested_cpu_has_apic_reg_virt(vmcs12)) { /* * L0 need not intercept reads for MSRs between 0x800 * and 0x8ff, it just lets the processor take the value * from the virtual-APIC page; take those 256 bits * directly from the L1 bitmap. 
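			 * The loop below does exactly that, copying the
			 * read-intercept words for MSRs 0x800 - 0x8ff from
			 * the L1 bitmap one long at a time.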
*/ for (msr = 0x800; msr <= 0x8ff; msr += BITS_PER_LONG) { unsigned word = msr / BITS_PER_LONG; msr_bitmap_l0[word] = msr_bitmap_l1[word]; } } nested_vmx_disable_intercept_for_x2apic_msr( msr_bitmap_l1, msr_bitmap_l0, X2APIC_MSR(APIC_TASKPRI), MSR_TYPE_R | MSR_TYPE_W); if (nested_cpu_has_vid(vmcs12)) { nested_vmx_disable_intercept_for_x2apic_msr( msr_bitmap_l1, msr_bitmap_l0, X2APIC_MSR(APIC_EOI), MSR_TYPE_W); nested_vmx_disable_intercept_for_x2apic_msr( msr_bitmap_l1, msr_bitmap_l0, X2APIC_MSR(APIC_SELF_IPI), MSR_TYPE_W); } } /* * Always check vmcs01's bitmap to honor userspace MSR filters and any * other runtime changes to vmcs01's bitmap, e.g. dynamic pass-through. */ #ifdef CONFIG_X86_64 nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0, MSR_FS_BASE, MSR_TYPE_RW); nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0, MSR_GS_BASE, MSR_TYPE_RW); nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0, MSR_KERNEL_GS_BASE, MSR_TYPE_RW); #endif nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0, MSR_IA32_SPEC_CTRL, MSR_TYPE_RW); nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0, MSR_IA32_PRED_CMD, MSR_TYPE_W); nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0, MSR_IA32_FLUSH_CMD, MSR_TYPE_W); kvm_vcpu_unmap(vcpu, &map); vmx->nested.force_msr_bitmap_recalc = false; return true; } static void nested_cache_shadow_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { struct vcpu_vmx *vmx = to_vmx(vcpu); struct gfn_to_hva_cache *ghc = &vmx->nested.shadow_vmcs12_cache; if (!nested_cpu_has_shadow_vmcs(vmcs12) || vmcs12->vmcs_link_pointer == INVALID_GPA) return; if (ghc->gpa != vmcs12->vmcs_link_pointer && kvm_gfn_to_hva_cache_init(vcpu->kvm, ghc, vmcs12->vmcs_link_pointer, VMCS12_SIZE)) return; kvm_read_guest_cached(vmx->vcpu.kvm, ghc, get_shadow_vmcs12(vcpu), VMCS12_SIZE); } static void nested_flush_cached_shadow_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { struct vcpu_vmx *vmx = to_vmx(vcpu); struct gfn_to_hva_cache *ghc = &vmx->nested.shadow_vmcs12_cache; if (!nested_cpu_has_shadow_vmcs(vmcs12) || vmcs12->vmcs_link_pointer == INVALID_GPA) return; if (ghc->gpa != vmcs12->vmcs_link_pointer && kvm_gfn_to_hva_cache_init(vcpu->kvm, ghc, vmcs12->vmcs_link_pointer, VMCS12_SIZE)) return; kvm_write_guest_cached(vmx->vcpu.kvm, ghc, get_shadow_vmcs12(vcpu), VMCS12_SIZE); } /* * In nested virtualization, check if L1 has set * VM_EXIT_ACK_INTR_ON_EXIT */ static bool nested_exit_intr_ack_set(struct kvm_vcpu *vcpu) { return get_vmcs12(vcpu)->vm_exit_controls & VM_EXIT_ACK_INTR_ON_EXIT; } static int nested_vmx_check_apic_access_controls(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { if (nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES) && CC(!page_address_valid(vcpu, vmcs12->apic_access_addr))) return -EINVAL; else return 0; } static int nested_vmx_check_apicv_controls(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { if (!nested_cpu_has_virt_x2apic_mode(vmcs12) && !nested_cpu_has_apic_reg_virt(vmcs12) && !nested_cpu_has_vid(vmcs12) && !nested_cpu_has_posted_intr(vmcs12)) return 0; /* * If virtualize x2apic mode is enabled, * virtualize apic access must be disabled. */ if (CC(nested_cpu_has_virt_x2apic_mode(vmcs12) && nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))) return -EINVAL; /* * If virtual interrupt delivery is enabled, * we must exit on external interrupts. 
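	 * i.e. the "external-interrupt exiting" pin-based control must be 1
	 * in vmcs12, which is what nested_exit_on_intr() verifies below.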
*/ if (CC(nested_cpu_has_vid(vmcs12) && !nested_exit_on_intr(vcpu))) return -EINVAL; /* * bits 15:8 should be zero in posted_intr_nv, * the descriptor address has been already checked * in nested_get_vmcs12_pages. * * bits 5:0 of posted_intr_desc_addr should be zero. */ if (nested_cpu_has_posted_intr(vmcs12) && (CC(!nested_cpu_has_vid(vmcs12)) || CC(!nested_exit_intr_ack_set(vcpu)) || CC((vmcs12->posted_intr_nv & 0xff00)) || CC(!kvm_vcpu_is_legal_aligned_gpa(vcpu, vmcs12->posted_intr_desc_addr, 64)))) return -EINVAL; /* tpr shadow is needed by all apicv features. */ if (CC(!nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW))) return -EINVAL; return 0; } static int nested_vmx_check_msr_switch(struct kvm_vcpu *vcpu, u32 count, u64 addr) { if (count == 0) return 0; if (!kvm_vcpu_is_legal_aligned_gpa(vcpu, addr, 16) || !kvm_vcpu_is_legal_gpa(vcpu, (addr + count * sizeof(struct vmx_msr_entry) - 1))) return -EINVAL; return 0; } static int nested_vmx_check_exit_msr_switch_controls(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { if (CC(nested_vmx_check_msr_switch(vcpu, vmcs12->vm_exit_msr_load_count, vmcs12->vm_exit_msr_load_addr)) || CC(nested_vmx_check_msr_switch(vcpu, vmcs12->vm_exit_msr_store_count, vmcs12->vm_exit_msr_store_addr))) return -EINVAL; return 0; } static int nested_vmx_check_entry_msr_switch_controls(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { if (CC(nested_vmx_check_msr_switch(vcpu, vmcs12->vm_entry_msr_load_count, vmcs12->vm_entry_msr_load_addr))) return -EINVAL; return 0; } static int nested_vmx_check_pml_controls(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { if (!nested_cpu_has_pml(vmcs12)) return 0; if (CC(!nested_cpu_has_ept(vmcs12)) || CC(!page_address_valid(vcpu, vmcs12->pml_address))) return -EINVAL; return 0; } static int nested_vmx_check_unrestricted_guest_controls(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { if (CC(nested_cpu_has2(vmcs12, SECONDARY_EXEC_UNRESTRICTED_GUEST) && !nested_cpu_has_ept(vmcs12))) return -EINVAL; return 0; } static int nested_vmx_check_mode_based_ept_exec_controls(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { if (CC(nested_cpu_has2(vmcs12, SECONDARY_EXEC_MODE_BASED_EPT_EXEC) && !nested_cpu_has_ept(vmcs12))) return -EINVAL; return 0; } static int nested_vmx_check_shadow_vmcs_controls(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { if (!nested_cpu_has_shadow_vmcs(vmcs12)) return 0; if (CC(!page_address_valid(vcpu, vmcs12->vmread_bitmap)) || CC(!page_address_valid(vcpu, vmcs12->vmwrite_bitmap))) return -EINVAL; return 0; } static int nested_vmx_msr_check_common(struct kvm_vcpu *vcpu, struct vmx_msr_entry *e) { /* x2APIC MSR accesses are not allowed */ if (CC(vcpu->arch.apic_base & X2APIC_ENABLE && e->index >> 8 == 0x8)) return -EINVAL; if (CC(e->index == MSR_IA32_UCODE_WRITE) || /* SDM Table 35-2 */ CC(e->index == MSR_IA32_UCODE_REV)) return -EINVAL; if (CC(e->reserved != 0)) return -EINVAL; return 0; } static int nested_vmx_load_msr_check(struct kvm_vcpu *vcpu, struct vmx_msr_entry *e) { if (CC(e->index == MSR_FS_BASE) || CC(e->index == MSR_GS_BASE) || CC(e->index == MSR_IA32_SMM_MONITOR_CTL) || /* SMM is not supported */ nested_vmx_msr_check_common(vcpu, e)) return -EINVAL; return 0; } static int nested_vmx_store_msr_check(struct kvm_vcpu *vcpu, struct vmx_msr_entry *e) { if (CC(e->index == MSR_IA32_SMBASE) || /* SMM is not supported */ nested_vmx_msr_check_common(vcpu, e)) return -EINVAL; return 0; } static u32 nested_vmx_max_atomic_switch_msrs(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); u64 vmx_misc = 
vmx_control_msr(vmx->nested.msrs.misc_low, vmx->nested.msrs.misc_high); return (vmx_misc_max_msr(vmx_misc) + 1) * VMX_MISC_MSR_LIST_MULTIPLIER; } /* * Load guest's/host's msr at nested entry/exit. * return 0 for success, entry index for failure. * * One of the failure modes for MSR load/store is when a list exceeds the * virtual hardware's capacity. To maintain compatibility with hardware inasmuch * as possible, process all valid entries before failing rather than precheck * for a capacity violation. */ static u32 nested_vmx_load_msr(struct kvm_vcpu *vcpu, u64 gpa, u32 count) { u32 i; struct vmx_msr_entry e; u32 max_msr_list_size = nested_vmx_max_atomic_switch_msrs(vcpu); for (i = 0; i < count; i++) { if (unlikely(i >= max_msr_list_size)) goto fail; if (kvm_vcpu_read_guest(vcpu, gpa + i * sizeof(e), &e, sizeof(e))) { pr_debug_ratelimited( "%s cannot read MSR entry (%u, 0x%08llx)\n", __func__, i, gpa + i * sizeof(e)); goto fail; } if (nested_vmx_load_msr_check(vcpu, &e)) { pr_debug_ratelimited( "%s check failed (%u, 0x%x, 0x%x)\n", __func__, i, e.index, e.reserved); goto fail; } if (kvm_set_msr_with_filter(vcpu, e.index, e.value)) { pr_debug_ratelimited( "%s cannot write MSR (%u, 0x%x, 0x%llx)\n", __func__, i, e.index, e.value); goto fail; } } return 0; fail: /* Note, max_msr_list_size is at most 4096, i.e. this can't wrap. */ return i + 1; } static bool nested_vmx_get_vmexit_msr_value(struct kvm_vcpu *vcpu, u32 msr_index, u64 *data) { struct vcpu_vmx *vmx = to_vmx(vcpu); /* * If the L0 hypervisor stored a more accurate value for the TSC that * does not include the time taken for emulation of the L2->L1 * VM-exit in L0, use the more accurate value. */ if (msr_index == MSR_IA32_TSC) { int i = vmx_find_loadstore_msr_slot(&vmx->msr_autostore.guest, MSR_IA32_TSC); if (i >= 0) { u64 val = vmx->msr_autostore.guest.val[i].value; *data = kvm_read_l1_tsc(vcpu, val); return true; } } if (kvm_get_msr_with_filter(vcpu, msr_index, data)) { pr_debug_ratelimited("%s cannot read MSR (0x%x)\n", __func__, msr_index); return false; } return true; } static bool read_and_check_msr_entry(struct kvm_vcpu *vcpu, u64 gpa, int i, struct vmx_msr_entry *e) { if (kvm_vcpu_read_guest(vcpu, gpa + i * sizeof(*e), e, 2 * sizeof(u32))) { pr_debug_ratelimited( "%s cannot read MSR entry (%u, 0x%08llx)\n", __func__, i, gpa + i * sizeof(*e)); return false; } if (nested_vmx_store_msr_check(vcpu, e)) { pr_debug_ratelimited( "%s check failed (%u, 0x%x, 0x%x)\n", __func__, i, e->index, e->reserved); return false; } return true; } static int nested_vmx_store_msr(struct kvm_vcpu *vcpu, u64 gpa, u32 count) { u64 data; u32 i; struct vmx_msr_entry e; u32 max_msr_list_size = nested_vmx_max_atomic_switch_msrs(vcpu); for (i = 0; i < count; i++) { if (unlikely(i >= max_msr_list_size)) return -EINVAL; if (!read_and_check_msr_entry(vcpu, gpa, i, &e)) return -EINVAL; if (!nested_vmx_get_vmexit_msr_value(vcpu, e.index, &data)) return -EINVAL; if (kvm_vcpu_write_guest(vcpu, gpa + i * sizeof(e) + offsetof(struct vmx_msr_entry, value), &data, sizeof(data))) { pr_debug_ratelimited( "%s cannot write MSR (%u, 0x%x, 0x%llx)\n", __func__, i, e.index, data); return -EINVAL; } } return 0; } static bool nested_msr_store_list_has_msr(struct kvm_vcpu *vcpu, u32 msr_index) { struct vmcs12 *vmcs12 = get_vmcs12(vcpu); u32 count = vmcs12->vm_exit_msr_store_count; u64 gpa = vmcs12->vm_exit_msr_store_addr; struct vmx_msr_entry e; u32 i; for (i = 0; i < count; i++) { if (!read_and_check_msr_entry(vcpu, gpa, i, &e)) return false; if (e.index == msr_index) return 
true; } return false; } static void prepare_vmx_msr_autostore_list(struct kvm_vcpu *vcpu, u32 msr_index) { struct vcpu_vmx *vmx = to_vmx(vcpu); struct vmx_msrs *autostore = &vmx->msr_autostore.guest; bool in_vmcs12_store_list; int msr_autostore_slot; bool in_autostore_list; int last; msr_autostore_slot = vmx_find_loadstore_msr_slot(autostore, msr_index); in_autostore_list = msr_autostore_slot >= 0; in_vmcs12_store_list = nested_msr_store_list_has_msr(vcpu, msr_index); if (in_vmcs12_store_list && !in_autostore_list) { if (autostore->nr == MAX_NR_LOADSTORE_MSRS) { /* * Emulated VMEntry does not fail here. Instead a less * accurate value will be returned by * nested_vmx_get_vmexit_msr_value() by reading KVM's * internal MSR state instead of reading the value from * the vmcs02 VMExit MSR-store area. */ pr_warn_ratelimited( "Not enough msr entries in msr_autostore. Can't add msr %x\n", msr_index); return; } last = autostore->nr++; autostore->val[last].index = msr_index; } else if (!in_vmcs12_store_list && in_autostore_list) { last = --autostore->nr; autostore->val[msr_autostore_slot] = autostore->val[last]; } } /* * Load guest's/host's cr3 at nested entry/exit. @nested_ept is true if we are * emulating VM-Entry into a guest with EPT enabled. On failure, the expected * Exit Qualification (for a VM-Entry consistency check VM-Exit) is assigned to * @entry_failure_code. */ static int nested_vmx_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3, bool nested_ept, bool reload_pdptrs, enum vm_entry_failure_code *entry_failure_code) { if (CC(!kvm_vcpu_is_legal_cr3(vcpu, cr3))) { *entry_failure_code = ENTRY_FAIL_DEFAULT; return -EINVAL; } /* * If PAE paging and EPT are both on, CR3 is not used by the CPU and * must not be dereferenced. */ if (reload_pdptrs && !nested_ept && is_pae_paging(vcpu) && CC(!load_pdptrs(vcpu, cr3))) { *entry_failure_code = ENTRY_FAIL_PDPTE; return -EINVAL; } vcpu->arch.cr3 = cr3; kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3); /* Re-initialize the MMU, e.g. to pick up CR4 MMU role changes. */ kvm_init_mmu(vcpu); if (!nested_ept) kvm_mmu_new_pgd(vcpu, cr3); return 0; } /* * Returns if KVM is able to config CPU to tag TLB entries * populated by L2 differently than TLB entries populated * by L1. * * If L0 uses EPT, L1 and L2 run with different EPTP because * guest_mode is part of kvm_mmu_page_role. Thus, TLB entries * are tagged with different EPTP. * * If L1 uses VPID and we allocated a vpid02, TLB entries are tagged * with different VPID (L1 entries are tagged with vmx->vpid * while L2 entries are tagged with vmx->nested.vpid02). */ static bool nested_has_guest_tlb_tag(struct kvm_vcpu *vcpu) { struct vmcs12 *vmcs12 = get_vmcs12(vcpu); return enable_ept || (nested_cpu_has_vpid(vmcs12) && to_vmx(vcpu)->nested.vpid02); } static void nested_vmx_transition_tlb_flush(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12, bool is_vmenter) { struct vcpu_vmx *vmx = to_vmx(vcpu); /* Handle pending Hyper-V TLB flush requests */ kvm_hv_nested_transtion_tlb_flush(vcpu, enable_ept); /* * If VPID is disabled, then guest TLB accesses use VPID=0, i.e. the * same VPID as the host, and so architecturally, linear and combined * mappings for VPID=0 must be flushed at VM-Enter and VM-Exit. KVM * emulates L2 sharing L1's VPID=0 by using vpid01 while running L2, * and so KVM must also emulate TLB flush of VPID=0, i.e. vpid01. 
This * is required if VPID is disabled in KVM, as a TLB flush (there are no * VPIDs) still occurs from L1's perspective, and KVM may need to * synchronize the MMU in response to the guest TLB flush. * * Note, using TLB_FLUSH_GUEST is correct even if nested EPT is in use. * EPT is a special snowflake, as guest-physical mappings aren't * flushed on VPID invalidations, including VM-Enter or VM-Exit with * VPID disabled. As a result, KVM _never_ needs to sync nEPT * entries on VM-Enter because L1 can't rely on VM-Enter to flush * those mappings. */ if (!nested_cpu_has_vpid(vmcs12)) { kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu); return; } /* L2 should never have a VPID if VPID is disabled. */ WARN_ON(!enable_vpid); /* * VPID is enabled and in use by vmcs12. If vpid12 is changing, then * emulate a guest TLB flush as KVM does not track vpid12 history nor * is the VPID incorporated into the MMU context. I.e. KVM must assume * that the new vpid12 has never been used and thus represents a new * guest ASID that cannot have entries in the TLB. */ if (is_vmenter && vmcs12->virtual_processor_id != vmx->nested.last_vpid) { vmx->nested.last_vpid = vmcs12->virtual_processor_id; kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu); return; } /* * If VPID is enabled, used by vmc12, and vpid12 is not changing but * does not have a unique TLB tag (ASID), i.e. EPT is disabled and * KVM was unable to allocate a VPID for L2, flush the current context * as the effective ASID is common to both L1 and L2. */ if (!nested_has_guest_tlb_tag(vcpu)) kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu); } static bool is_bitwise_subset(u64 superset, u64 subset, u64 mask) { superset &= mask; subset &= mask; return (superset | subset) == superset; } static int vmx_restore_vmx_basic(struct vcpu_vmx *vmx, u64 data) { const u64 feature_bits = VMX_BASIC_DUAL_MONITOR_TREATMENT | VMX_BASIC_INOUT | VMX_BASIC_TRUE_CTLS; const u64 reserved_bits = GENMASK_ULL(63, 56) | GENMASK_ULL(47, 45) | BIT_ULL(31); u64 vmx_basic = vmcs_config.nested.basic; BUILD_BUG_ON(feature_bits & reserved_bits); /* * Except for 32BIT_PHYS_ADDR_ONLY, which is an anti-feature bit (has * inverted polarity), the incoming value must not set feature bits or * reserved bits that aren't allowed/supported by KVM. Fields, i.e. * multi-bit values, are explicitly checked below. */ if (!is_bitwise_subset(vmx_basic, data, feature_bits | reserved_bits)) return -EINVAL; /* * KVM does not emulate a version of VMX that constrains physical * addresses of VMX structures (e.g. VMCS) to 32-bits. 
*/ if (data & VMX_BASIC_32BIT_PHYS_ADDR_ONLY) return -EINVAL; if (vmx_basic_vmcs_revision_id(vmx_basic) != vmx_basic_vmcs_revision_id(data)) return -EINVAL; if (vmx_basic_vmcs_size(vmx_basic) > vmx_basic_vmcs_size(data)) return -EINVAL; vmx->nested.msrs.basic = data; return 0; } static void vmx_get_control_msr(struct nested_vmx_msrs *msrs, u32 msr_index, u32 **low, u32 **high) { switch (msr_index) { case MSR_IA32_VMX_TRUE_PINBASED_CTLS: *low = &msrs->pinbased_ctls_low; *high = &msrs->pinbased_ctls_high; break; case MSR_IA32_VMX_TRUE_PROCBASED_CTLS: *low = &msrs->procbased_ctls_low; *high = &msrs->procbased_ctls_high; break; case MSR_IA32_VMX_TRUE_EXIT_CTLS: *low = &msrs->exit_ctls_low; *high = &msrs->exit_ctls_high; break; case MSR_IA32_VMX_TRUE_ENTRY_CTLS: *low = &msrs->entry_ctls_low; *high = &msrs->entry_ctls_high; break; case MSR_IA32_VMX_PROCBASED_CTLS2: *low = &msrs->secondary_ctls_low; *high = &msrs->secondary_ctls_high; break; default: BUG(); } } static int vmx_restore_control_msr(struct vcpu_vmx *vmx, u32 msr_index, u64 data) { u32 *lowp, *highp; u64 supported; vmx_get_control_msr(&vmcs_config.nested, msr_index, &lowp, &highp); supported = vmx_control_msr(*lowp, *highp); /* Check must-be-1 bits are still 1. */ if (!is_bitwise_subset(data, supported, GENMASK_ULL(31, 0))) return -EINVAL; /* Check must-be-0 bits are still 0. */ if (!is_bitwise_subset(supported, data, GENMASK_ULL(63, 32))) return -EINVAL; vmx_get_control_msr(&vmx->nested.msrs, msr_index, &lowp, &highp); *lowp = data; *highp = data >> 32; return 0; } static int vmx_restore_vmx_misc(struct vcpu_vmx *vmx, u64 data) { const u64 feature_bits = VMX_MISC_SAVE_EFER_LMA | VMX_MISC_ACTIVITY_HLT | VMX_MISC_ACTIVITY_SHUTDOWN | VMX_MISC_ACTIVITY_WAIT_SIPI | VMX_MISC_INTEL_PT | VMX_MISC_RDMSR_IN_SMM | VMX_MISC_VMWRITE_SHADOW_RO_FIELDS | VMX_MISC_VMXOFF_BLOCK_SMI | VMX_MISC_ZERO_LEN_INS; const u64 reserved_bits = BIT_ULL(31) | GENMASK_ULL(13, 9); u64 vmx_misc = vmx_control_msr(vmcs_config.nested.misc_low, vmcs_config.nested.misc_high); BUILD_BUG_ON(feature_bits & reserved_bits); /* * The incoming value must not set feature bits or reserved bits that * aren't allowed/supported by KVM. Fields, i.e. multi-bit values, are * explicitly checked below. */ if (!is_bitwise_subset(vmx_misc, data, feature_bits | reserved_bits)) return -EINVAL; if ((vmx->nested.msrs.pinbased_ctls_high & PIN_BASED_VMX_PREEMPTION_TIMER) && vmx_misc_preemption_timer_rate(data) != vmx_misc_preemption_timer_rate(vmx_misc)) return -EINVAL; if (vmx_misc_cr3_count(data) > vmx_misc_cr3_count(vmx_misc)) return -EINVAL; if (vmx_misc_max_msr(data) > vmx_misc_max_msr(vmx_misc)) return -EINVAL; if (vmx_misc_mseg_revid(data) != vmx_misc_mseg_revid(vmx_misc)) return -EINVAL; vmx->nested.msrs.misc_low = data; vmx->nested.msrs.misc_high = data >> 32; return 0; } static int vmx_restore_vmx_ept_vpid_cap(struct vcpu_vmx *vmx, u64 data) { u64 vmx_ept_vpid_cap = vmx_control_msr(vmcs_config.nested.ept_caps, vmcs_config.nested.vpid_caps); /* Every bit is either reserved or a feature bit. 
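	 * Hence the entire 64-bit value can be validated with a single
	 * is_bitwise_subset() check against an all-ones mask.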
*/ if (!is_bitwise_subset(vmx_ept_vpid_cap, data, -1ULL)) return -EINVAL; vmx->nested.msrs.ept_caps = data; vmx->nested.msrs.vpid_caps = data >> 32; return 0; } static u64 *vmx_get_fixed0_msr(struct nested_vmx_msrs *msrs, u32 msr_index) { switch (msr_index) { case MSR_IA32_VMX_CR0_FIXED0: return &msrs->cr0_fixed0; case MSR_IA32_VMX_CR4_FIXED0: return &msrs->cr4_fixed0; default: BUG(); } } static int vmx_restore_fixed0_msr(struct vcpu_vmx *vmx, u32 msr_index, u64 data) { const u64 *msr = vmx_get_fixed0_msr(&vmcs_config.nested, msr_index); /* * 1 bits (which indicates bits which "must-be-1" during VMX operation) * must be 1 in the restored value. */ if (!is_bitwise_subset(data, *msr, -1ULL)) return -EINVAL; *vmx_get_fixed0_msr(&vmx->nested.msrs, msr_index) = data; return 0; } /* * Called when userspace is restoring VMX MSRs. * * Returns 0 on success, non-0 otherwise. */ int vmx_set_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data) { struct vcpu_vmx *vmx = to_vmx(vcpu); /* * Don't allow changes to the VMX capability MSRs while the vCPU * is in VMX operation. */ if (vmx->nested.vmxon) return -EBUSY; switch (msr_index) { case MSR_IA32_VMX_BASIC: return vmx_restore_vmx_basic(vmx, data); case MSR_IA32_VMX_PINBASED_CTLS: case MSR_IA32_VMX_PROCBASED_CTLS: case MSR_IA32_VMX_EXIT_CTLS: case MSR_IA32_VMX_ENTRY_CTLS: /* * The "non-true" VMX capability MSRs are generated from the * "true" MSRs, so we do not support restoring them directly. * * If userspace wants to emulate VMX_BASIC[55]=0, userspace * should restore the "true" MSRs with the must-be-1 bits * set according to the SDM Vol 3. A.2 "RESERVED CONTROLS AND * DEFAULT SETTINGS". */ return -EINVAL; case MSR_IA32_VMX_TRUE_PINBASED_CTLS: case MSR_IA32_VMX_TRUE_PROCBASED_CTLS: case MSR_IA32_VMX_TRUE_EXIT_CTLS: case MSR_IA32_VMX_TRUE_ENTRY_CTLS: case MSR_IA32_VMX_PROCBASED_CTLS2: return vmx_restore_control_msr(vmx, msr_index, data); case MSR_IA32_VMX_MISC: return vmx_restore_vmx_misc(vmx, data); case MSR_IA32_VMX_CR0_FIXED0: case MSR_IA32_VMX_CR4_FIXED0: return vmx_restore_fixed0_msr(vmx, msr_index, data); case MSR_IA32_VMX_CR0_FIXED1: case MSR_IA32_VMX_CR4_FIXED1: /* * These MSRs are generated based on the vCPU's CPUID, so we * do not support restoring them directly. */ return -EINVAL; case MSR_IA32_VMX_EPT_VPID_CAP: return vmx_restore_vmx_ept_vpid_cap(vmx, data); case MSR_IA32_VMX_VMCS_ENUM: vmx->nested.msrs.vmcs_enum = data; return 0; case MSR_IA32_VMX_VMFUNC: if (data & ~vmcs_config.nested.vmfunc_controls) return -EINVAL; vmx->nested.msrs.vmfunc_controls = data; return 0; default: /* * The rest of the VMX capability MSRs do not support restore. */ return -EINVAL; } } /* Returns 0 on success, non-0 otherwise. 
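 * (1 is returned for MSR indices this function does not handle.)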
*/ int vmx_get_vmx_msr(struct nested_vmx_msrs *msrs, u32 msr_index, u64 *pdata) { switch (msr_index) { case MSR_IA32_VMX_BASIC: *pdata = msrs->basic; break; case MSR_IA32_VMX_TRUE_PINBASED_CTLS: case MSR_IA32_VMX_PINBASED_CTLS: *pdata = vmx_control_msr( msrs->pinbased_ctls_low, msrs->pinbased_ctls_high); if (msr_index == MSR_IA32_VMX_PINBASED_CTLS) *pdata |= PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR; break; case MSR_IA32_VMX_TRUE_PROCBASED_CTLS: case MSR_IA32_VMX_PROCBASED_CTLS: *pdata = vmx_control_msr( msrs->procbased_ctls_low, msrs->procbased_ctls_high); if (msr_index == MSR_IA32_VMX_PROCBASED_CTLS) *pdata |= CPU_BASED_ALWAYSON_WITHOUT_TRUE_MSR; break; case MSR_IA32_VMX_TRUE_EXIT_CTLS: case MSR_IA32_VMX_EXIT_CTLS: *pdata = vmx_control_msr( msrs->exit_ctls_low, msrs->exit_ctls_high); if (msr_index == MSR_IA32_VMX_EXIT_CTLS) *pdata |= VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR; break; case MSR_IA32_VMX_TRUE_ENTRY_CTLS: case MSR_IA32_VMX_ENTRY_CTLS: *pdata = vmx_control_msr( msrs->entry_ctls_low, msrs->entry_ctls_high); if (msr_index == MSR_IA32_VMX_ENTRY_CTLS) *pdata |= VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR; break; case MSR_IA32_VMX_MISC: *pdata = vmx_control_msr( msrs->misc_low, msrs->misc_high); break; case MSR_IA32_VMX_CR0_FIXED0: *pdata = msrs->cr0_fixed0; break; case MSR_IA32_VMX_CR0_FIXED1: *pdata = msrs->cr0_fixed1; break; case MSR_IA32_VMX_CR4_FIXED0: *pdata = msrs->cr4_fixed0; break; case MSR_IA32_VMX_CR4_FIXED1: *pdata = msrs->cr4_fixed1; break; case MSR_IA32_VMX_VMCS_ENUM: *pdata = msrs->vmcs_enum; break; case MSR_IA32_VMX_PROCBASED_CTLS2: *pdata = vmx_control_msr( msrs->secondary_ctls_low, msrs->secondary_ctls_high); break; case MSR_IA32_VMX_EPT_VPID_CAP: *pdata = msrs->ept_caps | ((u64)msrs->vpid_caps << 32); break; case MSR_IA32_VMX_VMFUNC: *pdata = msrs->vmfunc_controls; break; default: return 1; } return 0; } /* * Copy the writable VMCS shadow fields back to the VMCS12, in case they have * been modified by the L1 guest. Note, "writable" in this context means * "writable by the guest", i.e. tagged SHADOW_FIELD_RW; the set of * fields tagged SHADOW_FIELD_RO may or may not align with the "read-only" * VM-exit information fields (which are actually writable if the vCPU is * configured to support "VMWRITE to any supported field in the VMCS"). 
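 * Note that only the SHADOW_FIELD_RW fields are read back from the shadow
 * VMCS here; copy_vmcs12_to_shadow() below loads both the RW and RO sets
 * into the shadow VMCS for L1 to read.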
*/ static void copy_shadow_to_vmcs12(struct vcpu_vmx *vmx) { struct vmcs *shadow_vmcs = vmx->vmcs01.shadow_vmcs; struct vmcs12 *vmcs12 = get_vmcs12(&vmx->vcpu); struct shadow_vmcs_field field; unsigned long val; int i; if (WARN_ON(!shadow_vmcs)) return; preempt_disable(); vmcs_load(shadow_vmcs); for (i = 0; i < max_shadow_read_write_fields; i++) { field = shadow_read_write_fields[i]; val = __vmcs_readl(field.encoding); vmcs12_write_any(vmcs12, field.encoding, field.offset, val); } vmcs_clear(shadow_vmcs); vmcs_load(vmx->loaded_vmcs->vmcs); preempt_enable(); } static void copy_vmcs12_to_shadow(struct vcpu_vmx *vmx) { const struct shadow_vmcs_field *fields[] = { shadow_read_write_fields, shadow_read_only_fields }; const int max_fields[] = { max_shadow_read_write_fields, max_shadow_read_only_fields }; struct vmcs *shadow_vmcs = vmx->vmcs01.shadow_vmcs; struct vmcs12 *vmcs12 = get_vmcs12(&vmx->vcpu); struct shadow_vmcs_field field; unsigned long val; int i, q; if (WARN_ON(!shadow_vmcs)) return; vmcs_load(shadow_vmcs); for (q = 0; q < ARRAY_SIZE(fields); q++) { for (i = 0; i < max_fields[q]; i++) { field = fields[q][i]; val = vmcs12_read_any(vmcs12, field.encoding, field.offset); __vmcs_writel(field.encoding, val); } } vmcs_clear(shadow_vmcs); vmcs_load(vmx->loaded_vmcs->vmcs); } static void copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx, u32 hv_clean_fields) { #ifdef CONFIG_KVM_HYPERV struct vmcs12 *vmcs12 = vmx->nested.cached_vmcs12; struct hv_enlightened_vmcs *evmcs = nested_vmx_evmcs(vmx); struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(&vmx->vcpu); /* HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE */ vmcs12->tpr_threshold = evmcs->tpr_threshold; vmcs12->guest_rip = evmcs->guest_rip; if (unlikely(!(hv_clean_fields & HV_VMX_ENLIGHTENED_CLEAN_FIELD_ENLIGHTENMENTSCONTROL))) { hv_vcpu->nested.pa_page_gpa = evmcs->partition_assist_page; hv_vcpu->nested.vm_id = evmcs->hv_vm_id; hv_vcpu->nested.vp_id = evmcs->hv_vp_id; } if (unlikely(!(hv_clean_fields & HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_BASIC))) { vmcs12->guest_rsp = evmcs->guest_rsp; vmcs12->guest_rflags = evmcs->guest_rflags; vmcs12->guest_interruptibility_info = evmcs->guest_interruptibility_info; /* * Not present in struct vmcs12: * vmcs12->guest_ssp = evmcs->guest_ssp; */ } if (unlikely(!(hv_clean_fields & HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_PROC))) { vmcs12->cpu_based_vm_exec_control = evmcs->cpu_based_vm_exec_control; } if (unlikely(!(hv_clean_fields & HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_EXCPN))) { vmcs12->exception_bitmap = evmcs->exception_bitmap; } if (unlikely(!(hv_clean_fields & HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_ENTRY))) { vmcs12->vm_entry_controls = evmcs->vm_entry_controls; } if (unlikely(!(hv_clean_fields & HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_EVENT))) { vmcs12->vm_entry_intr_info_field = evmcs->vm_entry_intr_info_field; vmcs12->vm_entry_exception_error_code = evmcs->vm_entry_exception_error_code; vmcs12->vm_entry_instruction_len = evmcs->vm_entry_instruction_len; } if (unlikely(!(hv_clean_fields & HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1))) { vmcs12->host_ia32_pat = evmcs->host_ia32_pat; vmcs12->host_ia32_efer = evmcs->host_ia32_efer; vmcs12->host_cr0 = evmcs->host_cr0; vmcs12->host_cr3 = evmcs->host_cr3; vmcs12->host_cr4 = evmcs->host_cr4; vmcs12->host_ia32_sysenter_esp = evmcs->host_ia32_sysenter_esp; vmcs12->host_ia32_sysenter_eip = evmcs->host_ia32_sysenter_eip; vmcs12->host_rip = evmcs->host_rip; vmcs12->host_ia32_sysenter_cs = evmcs->host_ia32_sysenter_cs; vmcs12->host_es_selector = evmcs->host_es_selector; 
vmcs12->host_cs_selector = evmcs->host_cs_selector; vmcs12->host_ss_selector = evmcs->host_ss_selector; vmcs12->host_ds_selector = evmcs->host_ds_selector; vmcs12->host_fs_selector = evmcs->host_fs_selector; vmcs12->host_gs_selector = evmcs->host_gs_selector; vmcs12->host_tr_selector = evmcs->host_tr_selector; vmcs12->host_ia32_perf_global_ctrl = evmcs->host_ia32_perf_global_ctrl; /* * Not present in struct vmcs12: * vmcs12->host_ia32_s_cet = evmcs->host_ia32_s_cet; * vmcs12->host_ssp = evmcs->host_ssp; * vmcs12->host_ia32_int_ssp_table_addr = evmcs->host_ia32_int_ssp_table_addr; */ } if (unlikely(!(hv_clean_fields & HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP1))) { vmcs12->pin_based_vm_exec_control = evmcs->pin_based_vm_exec_control; vmcs12->vm_exit_controls = evmcs->vm_exit_controls; vmcs12->secondary_vm_exec_control = evmcs->secondary_vm_exec_control; } if (unlikely(!(hv_clean_fields & HV_VMX_ENLIGHTENED_CLEAN_FIELD_IO_BITMAP))) { vmcs12->io_bitmap_a = evmcs->io_bitmap_a; vmcs12->io_bitmap_b = evmcs->io_bitmap_b; } if (unlikely(!(hv_clean_fields & HV_VMX_ENLIGHTENED_CLEAN_FIELD_MSR_BITMAP))) { vmcs12->msr_bitmap = evmcs->msr_bitmap; } if (unlikely(!(hv_clean_fields & HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2))) { vmcs12->guest_es_base = evmcs->guest_es_base; vmcs12->guest_cs_base = evmcs->guest_cs_base; vmcs12->guest_ss_base = evmcs->guest_ss_base; vmcs12->guest_ds_base = evmcs->guest_ds_base; vmcs12->guest_fs_base = evmcs->guest_fs_base; vmcs12->guest_gs_base = evmcs->guest_gs_base; vmcs12->guest_ldtr_base = evmcs->guest_ldtr_base; vmcs12->guest_tr_base = evmcs->guest_tr_base; vmcs12->guest_gdtr_base = evmcs->guest_gdtr_base; vmcs12->guest_idtr_base = evmcs->guest_idtr_base; vmcs12->guest_es_limit = evmcs->guest_es_limit; vmcs12->guest_cs_limit = evmcs->guest_cs_limit; vmcs12->guest_ss_limit = evmcs->guest_ss_limit; vmcs12->guest_ds_limit = evmcs->guest_ds_limit; vmcs12->guest_fs_limit = evmcs->guest_fs_limit; vmcs12->guest_gs_limit = evmcs->guest_gs_limit; vmcs12->guest_ldtr_limit = evmcs->guest_ldtr_limit; vmcs12->guest_tr_limit = evmcs->guest_tr_limit; vmcs12->guest_gdtr_limit = evmcs->guest_gdtr_limit; vmcs12->guest_idtr_limit = evmcs->guest_idtr_limit; vmcs12->guest_es_ar_bytes = evmcs->guest_es_ar_bytes; vmcs12->guest_cs_ar_bytes = evmcs->guest_cs_ar_bytes; vmcs12->guest_ss_ar_bytes = evmcs->guest_ss_ar_bytes; vmcs12->guest_ds_ar_bytes = evmcs->guest_ds_ar_bytes; vmcs12->guest_fs_ar_bytes = evmcs->guest_fs_ar_bytes; vmcs12->guest_gs_ar_bytes = evmcs->guest_gs_ar_bytes; vmcs12->guest_ldtr_ar_bytes = evmcs->guest_ldtr_ar_bytes; vmcs12->guest_tr_ar_bytes = evmcs->guest_tr_ar_bytes; vmcs12->guest_es_selector = evmcs->guest_es_selector; vmcs12->guest_cs_selector = evmcs->guest_cs_selector; vmcs12->guest_ss_selector = evmcs->guest_ss_selector; vmcs12->guest_ds_selector = evmcs->guest_ds_selector; vmcs12->guest_fs_selector = evmcs->guest_fs_selector; vmcs12->guest_gs_selector = evmcs->guest_gs_selector; vmcs12->guest_ldtr_selector = evmcs->guest_ldtr_selector; vmcs12->guest_tr_selector = evmcs->guest_tr_selector; } if (unlikely(!(hv_clean_fields & HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP2))) { vmcs12->tsc_offset = evmcs->tsc_offset; vmcs12->virtual_apic_page_addr = evmcs->virtual_apic_page_addr; vmcs12->xss_exit_bitmap = evmcs->xss_exit_bitmap; vmcs12->encls_exiting_bitmap = evmcs->encls_exiting_bitmap; vmcs12->tsc_multiplier = evmcs->tsc_multiplier; } if (unlikely(!(hv_clean_fields & HV_VMX_ENLIGHTENED_CLEAN_FIELD_CRDR))) { vmcs12->cr0_guest_host_mask = evmcs->cr0_guest_host_mask; 
vmcs12->cr4_guest_host_mask = evmcs->cr4_guest_host_mask; vmcs12->cr0_read_shadow = evmcs->cr0_read_shadow; vmcs12->cr4_read_shadow = evmcs->cr4_read_shadow; vmcs12->guest_cr0 = evmcs->guest_cr0; vmcs12->guest_cr3 = evmcs->guest_cr3; vmcs12->guest_cr4 = evmcs->guest_cr4; vmcs12->guest_dr7 = evmcs->guest_dr7; } if (unlikely(!(hv_clean_fields & HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_POINTER))) { vmcs12->host_fs_base = evmcs->host_fs_base; vmcs12->host_gs_base = evmcs->host_gs_base; vmcs12->host_tr_base = evmcs->host_tr_base; vmcs12->host_gdtr_base = evmcs->host_gdtr_base; vmcs12->host_idtr_base = evmcs->host_idtr_base; vmcs12->host_rsp = evmcs->host_rsp; } if (unlikely(!(hv_clean_fields & HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_XLAT))) { vmcs12->ept_pointer = evmcs->ept_pointer; vmcs12->virtual_processor_id = evmcs->virtual_processor_id; } if (unlikely(!(hv_clean_fields & HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1))) { vmcs12->vmcs_link_pointer = evmcs->vmcs_link_pointer; vmcs12->guest_ia32_debugctl = evmcs->guest_ia32_debugctl; vmcs12->guest_ia32_pat = evmcs->guest_ia32_pat; vmcs12->guest_ia32_efer = evmcs->guest_ia32_efer; vmcs12->guest_pdptr0 = evmcs->guest_pdptr0; vmcs12->guest_pdptr1 = evmcs->guest_pdptr1; vmcs12->guest_pdptr2 = evmcs->guest_pdptr2; vmcs12->guest_pdptr3 = evmcs->guest_pdptr3; vmcs12->guest_pending_dbg_exceptions = evmcs->guest_pending_dbg_exceptions; vmcs12->guest_sysenter_esp = evmcs->guest_sysenter_esp; vmcs12->guest_sysenter_eip = evmcs->guest_sysenter_eip; vmcs12->guest_bndcfgs = evmcs->guest_bndcfgs; vmcs12->guest_activity_state = evmcs->guest_activity_state; vmcs12->guest_sysenter_cs = evmcs->guest_sysenter_cs; vmcs12->guest_ia32_perf_global_ctrl = evmcs->guest_ia32_perf_global_ctrl; /* * Not present in struct vmcs12: * vmcs12->guest_ia32_s_cet = evmcs->guest_ia32_s_cet; * vmcs12->guest_ia32_lbr_ctl = evmcs->guest_ia32_lbr_ctl; * vmcs12->guest_ia32_int_ssp_table_addr = evmcs->guest_ia32_int_ssp_table_addr; */ } /* * Not used? 
* vmcs12->vm_exit_msr_store_addr = evmcs->vm_exit_msr_store_addr; * vmcs12->vm_exit_msr_load_addr = evmcs->vm_exit_msr_load_addr; * vmcs12->vm_entry_msr_load_addr = evmcs->vm_entry_msr_load_addr; * vmcs12->page_fault_error_code_mask = * evmcs->page_fault_error_code_mask; * vmcs12->page_fault_error_code_match = * evmcs->page_fault_error_code_match; * vmcs12->cr3_target_count = evmcs->cr3_target_count; * vmcs12->vm_exit_msr_store_count = evmcs->vm_exit_msr_store_count; * vmcs12->vm_exit_msr_load_count = evmcs->vm_exit_msr_load_count; * vmcs12->vm_entry_msr_load_count = evmcs->vm_entry_msr_load_count; */ /* * Read only fields: * vmcs12->guest_physical_address = evmcs->guest_physical_address; * vmcs12->vm_instruction_error = evmcs->vm_instruction_error; * vmcs12->vm_exit_reason = evmcs->vm_exit_reason; * vmcs12->vm_exit_intr_info = evmcs->vm_exit_intr_info; * vmcs12->vm_exit_intr_error_code = evmcs->vm_exit_intr_error_code; * vmcs12->idt_vectoring_info_field = evmcs->idt_vectoring_info_field; * vmcs12->idt_vectoring_error_code = evmcs->idt_vectoring_error_code; * vmcs12->vm_exit_instruction_len = evmcs->vm_exit_instruction_len; * vmcs12->vmx_instruction_info = evmcs->vmx_instruction_info; * vmcs12->exit_qualification = evmcs->exit_qualification; * vmcs12->guest_linear_address = evmcs->guest_linear_address; * * Not present in struct vmcs12: * vmcs12->exit_io_instruction_ecx = evmcs->exit_io_instruction_ecx; * vmcs12->exit_io_instruction_esi = evmcs->exit_io_instruction_esi; * vmcs12->exit_io_instruction_edi = evmcs->exit_io_instruction_edi; * vmcs12->exit_io_instruction_eip = evmcs->exit_io_instruction_eip; */ return; #else /* CONFIG_KVM_HYPERV */ KVM_BUG_ON(1, vmx->vcpu.kvm); #endif /* CONFIG_KVM_HYPERV */ } static void copy_vmcs12_to_enlightened(struct vcpu_vmx *vmx) { #ifdef CONFIG_KVM_HYPERV struct vmcs12 *vmcs12 = vmx->nested.cached_vmcs12; struct hv_enlightened_vmcs *evmcs = nested_vmx_evmcs(vmx); /* * Should not be changed by KVM: * * evmcs->host_es_selector = vmcs12->host_es_selector; * evmcs->host_cs_selector = vmcs12->host_cs_selector; * evmcs->host_ss_selector = vmcs12->host_ss_selector; * evmcs->host_ds_selector = vmcs12->host_ds_selector; * evmcs->host_fs_selector = vmcs12->host_fs_selector; * evmcs->host_gs_selector = vmcs12->host_gs_selector; * evmcs->host_tr_selector = vmcs12->host_tr_selector; * evmcs->host_ia32_pat = vmcs12->host_ia32_pat; * evmcs->host_ia32_efer = vmcs12->host_ia32_efer; * evmcs->host_cr0 = vmcs12->host_cr0; * evmcs->host_cr3 = vmcs12->host_cr3; * evmcs->host_cr4 = vmcs12->host_cr4; * evmcs->host_ia32_sysenter_esp = vmcs12->host_ia32_sysenter_esp; * evmcs->host_ia32_sysenter_eip = vmcs12->host_ia32_sysenter_eip; * evmcs->host_rip = vmcs12->host_rip; * evmcs->host_ia32_sysenter_cs = vmcs12->host_ia32_sysenter_cs; * evmcs->host_fs_base = vmcs12->host_fs_base; * evmcs->host_gs_base = vmcs12->host_gs_base; * evmcs->host_tr_base = vmcs12->host_tr_base; * evmcs->host_gdtr_base = vmcs12->host_gdtr_base; * evmcs->host_idtr_base = vmcs12->host_idtr_base; * evmcs->host_rsp = vmcs12->host_rsp; * sync_vmcs02_to_vmcs12() doesn't read these: * evmcs->io_bitmap_a = vmcs12->io_bitmap_a; * evmcs->io_bitmap_b = vmcs12->io_bitmap_b; * evmcs->msr_bitmap = vmcs12->msr_bitmap; * evmcs->ept_pointer = vmcs12->ept_pointer; * evmcs->xss_exit_bitmap = vmcs12->xss_exit_bitmap; * evmcs->vm_exit_msr_store_addr = vmcs12->vm_exit_msr_store_addr; * evmcs->vm_exit_msr_load_addr = vmcs12->vm_exit_msr_load_addr; * evmcs->vm_entry_msr_load_addr = vmcs12->vm_entry_msr_load_addr; * 
evmcs->tpr_threshold = vmcs12->tpr_threshold; * evmcs->virtual_processor_id = vmcs12->virtual_processor_id; * evmcs->exception_bitmap = vmcs12->exception_bitmap; * evmcs->vmcs_link_pointer = vmcs12->vmcs_link_pointer; * evmcs->pin_based_vm_exec_control = vmcs12->pin_based_vm_exec_control; * evmcs->vm_exit_controls = vmcs12->vm_exit_controls; * evmcs->secondary_vm_exec_control = vmcs12->secondary_vm_exec_control; * evmcs->page_fault_error_code_mask = * vmcs12->page_fault_error_code_mask; * evmcs->page_fault_error_code_match = * vmcs12->page_fault_error_code_match; * evmcs->cr3_target_count = vmcs12->cr3_target_count; * evmcs->virtual_apic_page_addr = vmcs12->virtual_apic_page_addr; * evmcs->tsc_offset = vmcs12->tsc_offset; * evmcs->guest_ia32_debugctl = vmcs12->guest_ia32_debugctl; * evmcs->cr0_guest_host_mask = vmcs12->cr0_guest_host_mask; * evmcs->cr4_guest_host_mask = vmcs12->cr4_guest_host_mask; * evmcs->cr0_read_shadow = vmcs12->cr0_read_shadow; * evmcs->cr4_read_shadow = vmcs12->cr4_read_shadow; * evmcs->vm_exit_msr_store_count = vmcs12->vm_exit_msr_store_count; * evmcs->vm_exit_msr_load_count = vmcs12->vm_exit_msr_load_count; * evmcs->vm_entry_msr_load_count = vmcs12->vm_entry_msr_load_count; * evmcs->guest_ia32_perf_global_ctrl = vmcs12->guest_ia32_perf_global_ctrl; * evmcs->host_ia32_perf_global_ctrl = vmcs12->host_ia32_perf_global_ctrl; * evmcs->encls_exiting_bitmap = vmcs12->encls_exiting_bitmap; * evmcs->tsc_multiplier = vmcs12->tsc_multiplier; * * Not present in struct vmcs12: * evmcs->exit_io_instruction_ecx = vmcs12->exit_io_instruction_ecx; * evmcs->exit_io_instruction_esi = vmcs12->exit_io_instruction_esi; * evmcs->exit_io_instruction_edi = vmcs12->exit_io_instruction_edi; * evmcs->exit_io_instruction_eip = vmcs12->exit_io_instruction_eip; * evmcs->host_ia32_s_cet = vmcs12->host_ia32_s_cet; * evmcs->host_ssp = vmcs12->host_ssp; * evmcs->host_ia32_int_ssp_table_addr = vmcs12->host_ia32_int_ssp_table_addr; * evmcs->guest_ia32_s_cet = vmcs12->guest_ia32_s_cet; * evmcs->guest_ia32_lbr_ctl = vmcs12->guest_ia32_lbr_ctl; * evmcs->guest_ia32_int_ssp_table_addr = vmcs12->guest_ia32_int_ssp_table_addr; * evmcs->guest_ssp = vmcs12->guest_ssp; */ evmcs->guest_es_selector = vmcs12->guest_es_selector; evmcs->guest_cs_selector = vmcs12->guest_cs_selector; evmcs->guest_ss_selector = vmcs12->guest_ss_selector; evmcs->guest_ds_selector = vmcs12->guest_ds_selector; evmcs->guest_fs_selector = vmcs12->guest_fs_selector; evmcs->guest_gs_selector = vmcs12->guest_gs_selector; evmcs->guest_ldtr_selector = vmcs12->guest_ldtr_selector; evmcs->guest_tr_selector = vmcs12->guest_tr_selector; evmcs->guest_es_limit = vmcs12->guest_es_limit; evmcs->guest_cs_limit = vmcs12->guest_cs_limit; evmcs->guest_ss_limit = vmcs12->guest_ss_limit; evmcs->guest_ds_limit = vmcs12->guest_ds_limit; evmcs->guest_fs_limit = vmcs12->guest_fs_limit; evmcs->guest_gs_limit = vmcs12->guest_gs_limit; evmcs->guest_ldtr_limit = vmcs12->guest_ldtr_limit; evmcs->guest_tr_limit = vmcs12->guest_tr_limit; evmcs->guest_gdtr_limit = vmcs12->guest_gdtr_limit; evmcs->guest_idtr_limit = vmcs12->guest_idtr_limit; evmcs->guest_es_ar_bytes = vmcs12->guest_es_ar_bytes; evmcs->guest_cs_ar_bytes = vmcs12->guest_cs_ar_bytes; evmcs->guest_ss_ar_bytes = vmcs12->guest_ss_ar_bytes; evmcs->guest_ds_ar_bytes = vmcs12->guest_ds_ar_bytes; evmcs->guest_fs_ar_bytes = vmcs12->guest_fs_ar_bytes; evmcs->guest_gs_ar_bytes = vmcs12->guest_gs_ar_bytes; evmcs->guest_ldtr_ar_bytes = vmcs12->guest_ldtr_ar_bytes; evmcs->guest_tr_ar_bytes = vmcs12->guest_tr_ar_bytes; 
evmcs->guest_es_base = vmcs12->guest_es_base; evmcs->guest_cs_base = vmcs12->guest_cs_base; evmcs->guest_ss_base = vmcs12->guest_ss_base; evmcs->guest_ds_base = vmcs12->guest_ds_base; evmcs->guest_fs_base = vmcs12->guest_fs_base; evmcs->guest_gs_base = vmcs12->guest_gs_base; evmcs->guest_ldtr_base = vmcs12->guest_ldtr_base; evmcs->guest_tr_base = vmcs12->guest_tr_base; evmcs->guest_gdtr_base = vmcs12->guest_gdtr_base; evmcs->guest_idtr_base = vmcs12->guest_idtr_base; evmcs->guest_ia32_pat = vmcs12->guest_ia32_pat; evmcs->guest_ia32_efer = vmcs12->guest_ia32_efer; evmcs->guest_pdptr0 = vmcs12->guest_pdptr0; evmcs->guest_pdptr1 = vmcs12->guest_pdptr1; evmcs->guest_pdptr2 = vmcs12->guest_pdptr2; evmcs->guest_pdptr3 = vmcs12->guest_pdptr3; evmcs->guest_pending_dbg_exceptions = vmcs12->guest_pending_dbg_exceptions; evmcs->guest_sysenter_esp = vmcs12->guest_sysenter_esp; evmcs->guest_sysenter_eip = vmcs12->guest_sysenter_eip; evmcs->guest_activity_state = vmcs12->guest_activity_state; evmcs->guest_sysenter_cs = vmcs12->guest_sysenter_cs; evmcs->guest_cr0 = vmcs12->guest_cr0; evmcs->guest_cr3 = vmcs12->guest_cr3; evmcs->guest_cr4 = vmcs12->guest_cr4; evmcs->guest_dr7 = vmcs12->guest_dr7; evmcs->guest_physical_address = vmcs12->guest_physical_address; evmcs->vm_instruction_error = vmcs12->vm_instruction_error; evmcs->vm_exit_reason = vmcs12->vm_exit_reason; evmcs->vm_exit_intr_info = vmcs12->vm_exit_intr_info; evmcs->vm_exit_intr_error_code = vmcs12->vm_exit_intr_error_code; evmcs->idt_vectoring_info_field = vmcs12->idt_vectoring_info_field; evmcs->idt_vectoring_error_code = vmcs12->idt_vectoring_error_code; evmcs->vm_exit_instruction_len = vmcs12->vm_exit_instruction_len; evmcs->vmx_instruction_info = vmcs12->vmx_instruction_info; evmcs->exit_qualification = vmcs12->exit_qualification; evmcs->guest_linear_address = vmcs12->guest_linear_address; evmcs->guest_rsp = vmcs12->guest_rsp; evmcs->guest_rflags = vmcs12->guest_rflags; evmcs->guest_interruptibility_info = vmcs12->guest_interruptibility_info; evmcs->cpu_based_vm_exec_control = vmcs12->cpu_based_vm_exec_control; evmcs->vm_entry_controls = vmcs12->vm_entry_controls; evmcs->vm_entry_intr_info_field = vmcs12->vm_entry_intr_info_field; evmcs->vm_entry_exception_error_code = vmcs12->vm_entry_exception_error_code; evmcs->vm_entry_instruction_len = vmcs12->vm_entry_instruction_len; evmcs->guest_rip = vmcs12->guest_rip; evmcs->guest_bndcfgs = vmcs12->guest_bndcfgs; return; #else /* CONFIG_KVM_HYPERV */ KVM_BUG_ON(1, vmx->vcpu.kvm); #endif /* CONFIG_KVM_HYPERV */ } /* * This is an equivalent of the nested hypervisor executing the vmptrld * instruction. */ static enum nested_evmptrld_status nested_vmx_handle_enlightened_vmptrld( struct kvm_vcpu *vcpu, bool from_launch) { #ifdef CONFIG_KVM_HYPERV struct vcpu_vmx *vmx = to_vmx(vcpu); bool evmcs_gpa_changed = false; u64 evmcs_gpa; if (likely(!guest_cpu_cap_has_evmcs(vcpu))) return EVMPTRLD_DISABLED; evmcs_gpa = nested_get_evmptr(vcpu); if (!evmptr_is_valid(evmcs_gpa)) { nested_release_evmcs(vcpu); return EVMPTRLD_DISABLED; } if (unlikely(evmcs_gpa != vmx->nested.hv_evmcs_vmptr)) { vmx->nested.current_vmptr = INVALID_GPA; nested_release_evmcs(vcpu); if (kvm_vcpu_map(vcpu, gpa_to_gfn(evmcs_gpa), &vmx->nested.hv_evmcs_map)) return EVMPTRLD_ERROR; vmx->nested.hv_evmcs = vmx->nested.hv_evmcs_map.hva; /* * Currently, KVM only supports eVMCS version 1 * (== KVM_EVMCS_VERSION) and thus we expect guest to set this * value to first u32 field of eVMCS which should specify eVMCS * VersionNumber. 
* * Guest should be aware of supported eVMCS versions by host by * examining CPUID.0x4000000A.EAX[0:15]. Host userspace VMM is * expected to set this CPUID leaf according to the value * returned in vmcs_version from nested_enable_evmcs(). * * However, it turns out that Microsoft Hyper-V fails to comply * to their own invented interface: When Hyper-V use eVMCS, it * just sets first u32 field of eVMCS to revision_id specified * in MSR_IA32_VMX_BASIC. Instead of used eVMCS version number * which is one of the supported versions specified in * CPUID.0x4000000A.EAX[0:15]. * * To overcome Hyper-V bug, we accept here either a supported * eVMCS version or VMCS12 revision_id as valid values for first * u32 field of eVMCS. */ if ((vmx->nested.hv_evmcs->revision_id != KVM_EVMCS_VERSION) && (vmx->nested.hv_evmcs->revision_id != VMCS12_REVISION)) { nested_release_evmcs(vcpu); return EVMPTRLD_VMFAIL; } vmx->nested.hv_evmcs_vmptr = evmcs_gpa; evmcs_gpa_changed = true; /* * Unlike normal vmcs12, enlightened vmcs12 is not fully * reloaded from guest's memory (read only fields, fields not * present in struct hv_enlightened_vmcs, ...). Make sure there * are no leftovers. */ if (from_launch) { struct vmcs12 *vmcs12 = get_vmcs12(vcpu); memset(vmcs12, 0, sizeof(*vmcs12)); vmcs12->hdr.revision_id = VMCS12_REVISION; } } /* * Clean fields data can't be used on VMLAUNCH and when we switch * between different L2 guests as KVM keeps a single VMCS12 per L1. */ if (from_launch || evmcs_gpa_changed) { vmx->nested.hv_evmcs->hv_clean_fields &= ~HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL; vmx->nested.force_msr_bitmap_recalc = true; } return EVMPTRLD_SUCCEEDED; #else return EVMPTRLD_DISABLED; #endif } void nested_sync_vmcs12_to_shadow(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); if (nested_vmx_is_evmptr12_valid(vmx)) copy_vmcs12_to_enlightened(vmx); else copy_vmcs12_to_shadow(vmx); vmx->nested.need_vmcs12_to_shadow_sync = false; } static enum hrtimer_restart vmx_preemption_timer_fn(struct hrtimer *timer) { struct vcpu_vmx *vmx = container_of(timer, struct vcpu_vmx, nested.preemption_timer); vmx->nested.preemption_timer_expired = true; kvm_make_request(KVM_REQ_EVENT, &vmx->vcpu); kvm_vcpu_kick(&vmx->vcpu); return HRTIMER_NORESTART; } static u64 vmx_calc_preemption_timer_value(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); struct vmcs12 *vmcs12 = get_vmcs12(vcpu); u64 l1_scaled_tsc = kvm_read_l1_tsc(vcpu, rdtsc()) >> VMX_MISC_EMULATED_PREEMPTION_TIMER_RATE; if (!vmx->nested.has_preemption_timer_deadline) { vmx->nested.preemption_timer_deadline = vmcs12->vmx_preemption_timer_value + l1_scaled_tsc; vmx->nested.has_preemption_timer_deadline = true; } return vmx->nested.preemption_timer_deadline - l1_scaled_tsc; } static void vmx_start_preemption_timer(struct kvm_vcpu *vcpu, u64 preemption_timeout) { struct vcpu_vmx *vmx = to_vmx(vcpu); /* * A timer value of zero is architecturally guaranteed to cause * a VMExit prior to executing any instructions in the guest. 
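	 * In that case the timer callback is invoked synchronously below
	 * instead of arming an hrtimer.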
*/ if (preemption_timeout == 0) { vmx_preemption_timer_fn(&vmx->nested.preemption_timer); return; } if (vcpu->arch.virtual_tsc_khz == 0) return; preemption_timeout <<= VMX_MISC_EMULATED_PREEMPTION_TIMER_RATE; preemption_timeout *= 1000000; do_div(preemption_timeout, vcpu->arch.virtual_tsc_khz); hrtimer_start(&vmx->nested.preemption_timer, ktime_add_ns(ktime_get(), preemption_timeout), HRTIMER_MODE_ABS_PINNED); } static u64 nested_vmx_calc_efer(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12) { if (vmx->nested.nested_run_pending && (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_EFER)) return vmcs12->guest_ia32_efer; else if (vmcs12->vm_entry_controls & VM_ENTRY_IA32E_MODE) return vmx->vcpu.arch.efer | (EFER_LMA | EFER_LME); else return vmx->vcpu.arch.efer & ~(EFER_LMA | EFER_LME); } static void prepare_vmcs02_constant_state(struct vcpu_vmx *vmx) { struct kvm *kvm = vmx->vcpu.kvm; /* * If vmcs02 hasn't been initialized, set the constant vmcs02 state * according to L0's settings (vmcs12 is irrelevant here). Host * fields that come from L0 and are not constant, e.g. HOST_CR3, * will be set as needed prior to VMLAUNCH/VMRESUME. */ if (vmx->nested.vmcs02_initialized) return; vmx->nested.vmcs02_initialized = true; /* * We don't care what the EPTP value is we just need to guarantee * it's valid so we don't get a false positive when doing early * consistency checks. */ if (enable_ept && nested_early_check) vmcs_write64(EPT_POINTER, construct_eptp(&vmx->vcpu, 0, PT64_ROOT_4LEVEL)); if (vmx->ve_info) vmcs_write64(VE_INFORMATION_ADDRESS, __pa(vmx->ve_info)); /* All VMFUNCs are currently emulated through L0 vmexits. */ if (cpu_has_vmx_vmfunc()) vmcs_write64(VM_FUNCTION_CONTROL, 0); if (cpu_has_vmx_posted_intr()) vmcs_write16(POSTED_INTR_NV, POSTED_INTR_NESTED_VECTOR); if (cpu_has_vmx_msr_bitmap()) vmcs_write64(MSR_BITMAP, __pa(vmx->nested.vmcs02.msr_bitmap)); /* * PML is emulated for L2, but never enabled in hardware as the MMU * handles A/D emulation. Disabling PML for L2 also avoids having to * deal with filtering out L2 GPAs from the buffer. */ if (enable_pml) { vmcs_write64(PML_ADDRESS, 0); vmcs_write16(GUEST_PML_INDEX, -1); } if (cpu_has_vmx_encls_vmexit()) vmcs_write64(ENCLS_EXITING_BITMAP, INVALID_GPA); if (kvm_notify_vmexit_enabled(kvm)) vmcs_write32(NOTIFY_WINDOW, kvm->arch.notify_window); /* * Set the MSR load/store lists to match L0's settings. Only the * addresses are constant (for vmcs02), the counts can change based * on L2's behavior, e.g. switching to/from long mode. */ vmcs_write64(VM_EXIT_MSR_STORE_ADDR, __pa(vmx->msr_autostore.guest.val)); vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.host.val)); vmcs_write64(VM_ENTRY_MSR_LOAD_ADDR, __pa(vmx->msr_autoload.guest.val)); vmx_set_constant_host_state(vmx); } static void prepare_vmcs02_early_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12) { prepare_vmcs02_constant_state(vmx); vmcs_write64(VMCS_LINK_POINTER, INVALID_GPA); /* * If VPID is disabled, then guest TLB accesses use VPID=0, i.e. the * same VPID as the host. Emulate this behavior by using vpid01 for L2 * if VPID is disabled in vmcs12. Note, if VPID is disabled, VM-Enter * and VM-Exit are architecturally required to flush VPID=0, but *only* * VPID=0. I.e. using vpid02 would be ok (so long as KVM emulates the * required flushes), but doing so would cause KVM to over-flush. E.g. * if L1 runs L2 X with VPID12=1, then runs L2 Y with VPID12 disabled, * and then runs L2 X again, then KVM can and should retain TLB entries * for VPID12=1. 
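 *
 * In short, the selection below is roughly:
 *
 *   vmcs12 enables VPID && vpid02 was allocated  ->  use vmx->nested.vpid02
 *   otherwise (VPID12 disabled, or no vpid02)    ->  fall back to vmx->vpid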
*/ if (enable_vpid) { if (nested_cpu_has_vpid(vmcs12) && vmx->nested.vpid02) vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->nested.vpid02); else vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->vpid); } } static void prepare_vmcs02_early(struct vcpu_vmx *vmx, struct loaded_vmcs *vmcs01, struct vmcs12 *vmcs12) { u32 exec_control; u64 guest_efer = nested_vmx_calc_efer(vmx, vmcs12); if (vmx->nested.dirty_vmcs12 || nested_vmx_is_evmptr12_valid(vmx)) prepare_vmcs02_early_rare(vmx, vmcs12); /* * PIN CONTROLS */ exec_control = __pin_controls_get(vmcs01); exec_control |= (vmcs12->pin_based_vm_exec_control & ~PIN_BASED_VMX_PREEMPTION_TIMER); /* Posted interrupts setting is only taken from vmcs12. */ vmx->nested.pi_pending = false; if (nested_cpu_has_posted_intr(vmcs12)) { vmx->nested.posted_intr_nv = vmcs12->posted_intr_nv; } else { vmx->nested.posted_intr_nv = -1; exec_control &= ~PIN_BASED_POSTED_INTR; } pin_controls_set(vmx, exec_control); /* * EXEC CONTROLS */ exec_control = __exec_controls_get(vmcs01); /* L0's desires */ exec_control &= ~CPU_BASED_INTR_WINDOW_EXITING; exec_control &= ~CPU_BASED_NMI_WINDOW_EXITING; exec_control &= ~CPU_BASED_TPR_SHADOW; exec_control |= vmcs12->cpu_based_vm_exec_control; vmx->nested.l1_tpr_threshold = -1; if (exec_control & CPU_BASED_TPR_SHADOW) vmcs_write32(TPR_THRESHOLD, vmcs12->tpr_threshold); #ifdef CONFIG_X86_64 else exec_control |= CPU_BASED_CR8_LOAD_EXITING | CPU_BASED_CR8_STORE_EXITING; #endif /* * A vmexit (to either L1 hypervisor or L0 userspace) is always needed * for I/O port accesses. */ exec_control |= CPU_BASED_UNCOND_IO_EXITING; exec_control &= ~CPU_BASED_USE_IO_BITMAPS; /* * This bit will be computed in nested_get_vmcs12_pages, because * we do not have access to L1's MSR bitmap yet. For now, keep * the same bit as before, hoping to avoid multiple VMWRITEs that * only set/clear this bit. */ exec_control &= ~CPU_BASED_USE_MSR_BITMAPS; exec_control |= exec_controls_get(vmx) & CPU_BASED_USE_MSR_BITMAPS; exec_controls_set(vmx, exec_control); /* * SECONDARY EXEC CONTROLS */ if (cpu_has_secondary_exec_ctrls()) { exec_control = __secondary_exec_controls_get(vmcs01); /* Take the following fields only from vmcs12 */ exec_control &= ~(SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES | SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE | SECONDARY_EXEC_ENABLE_INVPCID | SECONDARY_EXEC_ENABLE_RDTSCP | SECONDARY_EXEC_ENABLE_XSAVES | SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE | SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY | SECONDARY_EXEC_APIC_REGISTER_VIRT | SECONDARY_EXEC_ENABLE_VMFUNC | SECONDARY_EXEC_DESC); if (nested_cpu_has(vmcs12, CPU_BASED_ACTIVATE_SECONDARY_CONTROLS)) exec_control |= vmcs12->secondary_vm_exec_control; /* PML is emulated and never enabled in hardware for L2. */ exec_control &= ~SECONDARY_EXEC_ENABLE_PML; /* VMCS shadowing for L2 is emulated for now */ exec_control &= ~SECONDARY_EXEC_SHADOW_VMCS; /* * Preset *DT exiting when emulating UMIP, so that vmx_set_cr4() * will not have to rewrite the controls just for this bit. 
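 *
 * (SECONDARY_EXEC_DESC is "descriptor-table exiting": with it set, L2's
 * LGDT/LIDT/LLDT/LTR and SGDT/SIDT/SLDT/STR trap to L0, which is how the
 * UMIP=1 behaviour is approximated when the CPU lacks native UMIP.)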
*/ if (vmx_umip_emulated() && (vmcs12->guest_cr4 & X86_CR4_UMIP)) exec_control |= SECONDARY_EXEC_DESC; if (exec_control & SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY) vmcs_write16(GUEST_INTR_STATUS, vmcs12->guest_intr_status); if (!nested_cpu_has2(vmcs12, SECONDARY_EXEC_UNRESTRICTED_GUEST)) exec_control &= ~SECONDARY_EXEC_UNRESTRICTED_GUEST; if (exec_control & SECONDARY_EXEC_ENCLS_EXITING) vmx_write_encls_bitmap(&vmx->vcpu, vmcs12); secondary_exec_controls_set(vmx, exec_control); } /* * ENTRY CONTROLS * * vmcs12's VM_{ENTRY,EXIT}_LOAD_IA32_EFER and VM_ENTRY_IA32E_MODE * are emulated by vmx_set_efer() in prepare_vmcs02(), but speculate * on the related bits (if supported by the CPU) in the hope that * we can avoid VMWrites during vmx_set_efer(). * * Similarly, take vmcs01's PERF_GLOBAL_CTRL in the hope that if KVM is * loading PERF_GLOBAL_CTRL via the VMCS for L1, then KVM will want to * do the same for L2. */ exec_control = __vm_entry_controls_get(vmcs01); exec_control |= (vmcs12->vm_entry_controls & ~VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL); exec_control &= ~(VM_ENTRY_IA32E_MODE | VM_ENTRY_LOAD_IA32_EFER); if (cpu_has_load_ia32_efer()) { if (guest_efer & EFER_LMA) exec_control |= VM_ENTRY_IA32E_MODE; if (guest_efer != kvm_host.efer) exec_control |= VM_ENTRY_LOAD_IA32_EFER; } vm_entry_controls_set(vmx, exec_control); /* * EXIT CONTROLS * * L2->L1 exit controls are emulated - the hardware exit is to L0 so * we should use its exit controls. Note that VM_EXIT_LOAD_IA32_EFER * bits may be modified by vmx_set_efer() in prepare_vmcs02(). */ exec_control = __vm_exit_controls_get(vmcs01); if (cpu_has_load_ia32_efer() && guest_efer != kvm_host.efer) exec_control |= VM_EXIT_LOAD_IA32_EFER; else exec_control &= ~VM_EXIT_LOAD_IA32_EFER; vm_exit_controls_set(vmx, exec_control); /* * Interrupt/Exception Fields */ if (vmx->nested.nested_run_pending) { vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, vmcs12->vm_entry_intr_info_field); vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE, vmcs12->vm_entry_exception_error_code); vmcs_write32(VM_ENTRY_INSTRUCTION_LEN, vmcs12->vm_entry_instruction_len); vmcs_write32(GUEST_INTERRUPTIBILITY_INFO, vmcs12->guest_interruptibility_info); vmx->loaded_vmcs->nmi_known_unmasked = !(vmcs12->guest_interruptibility_info & GUEST_INTR_STATE_NMI); } else { vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0); } } static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12) { struct hv_enlightened_vmcs *hv_evmcs = nested_vmx_evmcs(vmx); if (!hv_evmcs || !(hv_evmcs->hv_clean_fields & HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2)) { vmcs_write16(GUEST_ES_SELECTOR, vmcs12->guest_es_selector); vmcs_write16(GUEST_CS_SELECTOR, vmcs12->guest_cs_selector); vmcs_write16(GUEST_SS_SELECTOR, vmcs12->guest_ss_selector); vmcs_write16(GUEST_DS_SELECTOR, vmcs12->guest_ds_selector); vmcs_write16(GUEST_FS_SELECTOR, vmcs12->guest_fs_selector); vmcs_write16(GUEST_GS_SELECTOR, vmcs12->guest_gs_selector); vmcs_write16(GUEST_LDTR_SELECTOR, vmcs12->guest_ldtr_selector); vmcs_write16(GUEST_TR_SELECTOR, vmcs12->guest_tr_selector); vmcs_write32(GUEST_ES_LIMIT, vmcs12->guest_es_limit); vmcs_write32(GUEST_CS_LIMIT, vmcs12->guest_cs_limit); vmcs_write32(GUEST_SS_LIMIT, vmcs12->guest_ss_limit); vmcs_write32(GUEST_DS_LIMIT, vmcs12->guest_ds_limit); vmcs_write32(GUEST_FS_LIMIT, vmcs12->guest_fs_limit); vmcs_write32(GUEST_GS_LIMIT, vmcs12->guest_gs_limit); vmcs_write32(GUEST_LDTR_LIMIT, vmcs12->guest_ldtr_limit); vmcs_write32(GUEST_TR_LIMIT, vmcs12->guest_tr_limit); vmcs_write32(GUEST_GDTR_LIMIT, vmcs12->guest_gdtr_limit); 
vmcs_write32(GUEST_IDTR_LIMIT, vmcs12->guest_idtr_limit); vmcs_write32(GUEST_CS_AR_BYTES, vmcs12->guest_cs_ar_bytes); vmcs_write32(GUEST_SS_AR_BYTES, vmcs12->guest_ss_ar_bytes); vmcs_write32(GUEST_ES_AR_BYTES, vmcs12->guest_es_ar_bytes); vmcs_write32(GUEST_DS_AR_BYTES, vmcs12->guest_ds_ar_bytes); vmcs_write32(GUEST_FS_AR_BYTES, vmcs12->guest_fs_ar_bytes); vmcs_write32(GUEST_GS_AR_BYTES, vmcs12->guest_gs_ar_bytes); vmcs_write32(GUEST_LDTR_AR_BYTES, vmcs12->guest_ldtr_ar_bytes); vmcs_write32(GUEST_TR_AR_BYTES, vmcs12->guest_tr_ar_bytes); vmcs_writel(GUEST_ES_BASE, vmcs12->guest_es_base); vmcs_writel(GUEST_CS_BASE, vmcs12->guest_cs_base); vmcs_writel(GUEST_SS_BASE, vmcs12->guest_ss_base); vmcs_writel(GUEST_DS_BASE, vmcs12->guest_ds_base); vmcs_writel(GUEST_FS_BASE, vmcs12->guest_fs_base); vmcs_writel(GUEST_GS_BASE, vmcs12->guest_gs_base); vmcs_writel(GUEST_LDTR_BASE, vmcs12->guest_ldtr_base); vmcs_writel(GUEST_TR_BASE, vmcs12->guest_tr_base); vmcs_writel(GUEST_GDTR_BASE, vmcs12->guest_gdtr_base); vmcs_writel(GUEST_IDTR_BASE, vmcs12->guest_idtr_base); vmx_segment_cache_clear(vmx); } if (!hv_evmcs || !(hv_evmcs->hv_clean_fields & HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1)) { vmcs_write32(GUEST_SYSENTER_CS, vmcs12->guest_sysenter_cs); vmcs_writel(GUEST_PENDING_DBG_EXCEPTIONS, vmcs12->guest_pending_dbg_exceptions); vmcs_writel(GUEST_SYSENTER_ESP, vmcs12->guest_sysenter_esp); vmcs_writel(GUEST_SYSENTER_EIP, vmcs12->guest_sysenter_eip); /* * L1 may access the L2's PDPTR, so save them to construct * vmcs12 */ if (enable_ept) { vmcs_write64(GUEST_PDPTR0, vmcs12->guest_pdptr0); vmcs_write64(GUEST_PDPTR1, vmcs12->guest_pdptr1); vmcs_write64(GUEST_PDPTR2, vmcs12->guest_pdptr2); vmcs_write64(GUEST_PDPTR3, vmcs12->guest_pdptr3); } if (kvm_mpx_supported() && vmx->nested.nested_run_pending && (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS)) vmcs_write64(GUEST_BNDCFGS, vmcs12->guest_bndcfgs); } if (nested_cpu_has_xsaves(vmcs12)) vmcs_write64(XSS_EXIT_BITMAP, vmcs12->xss_exit_bitmap); /* * Whether page-faults are trapped is determined by a combination of * 3 settings: PFEC_MASK, PFEC_MATCH and EXCEPTION_BITMAP.PF. If L0 * doesn't care about page faults then we should set all of these to * L1's desires. However, if L0 does care about (some) page faults, it * is not easy (if at all possible?) to merge L0 and L1's desires, we * simply ask to exit on each and every L2 page fault. This is done by * setting MASK=MATCH=0 and (see below) EB.PF=1. * Note that below we don't need special code to set EB.PF beyond the * "or"ing of the EB of vmcs01 and vmcs12, because when enable_ept, * vmcs01's EB.PF is 0 so the "or" will take vmcs12's value, and when * !enable_ept, EB.PF is 1, so the "or" will always be 1. */ if (vmx_need_pf_intercept(&vmx->vcpu)) { /* * TODO: if both L0 and L1 need the same MASK and MATCH, * go ahead and use it? */ vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, 0); vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, 0); } else { vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, vmcs12->page_fault_error_code_mask); vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, vmcs12->page_fault_error_code_match); } if (cpu_has_vmx_apicv()) { vmcs_write64(EOI_EXIT_BITMAP0, vmcs12->eoi_exit_bitmap0); vmcs_write64(EOI_EXIT_BITMAP1, vmcs12->eoi_exit_bitmap1); vmcs_write64(EOI_EXIT_BITMAP2, vmcs12->eoi_exit_bitmap2); vmcs_write64(EOI_EXIT_BITMAP3, vmcs12->eoi_exit_bitmap3); } /* * Make sure the msr_autostore list is up to date before we set the * count in the vmcs02. 
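 *
 * For reference, each entry in the autostore/autoload areas is a
 * struct vmx_msr_entry, i.e. roughly:
 *
 *   struct vmx_msr_entry {
 *           u32 index;      // MSR number
 *           u32 reserved;
 *           u64 value;      // stored on VM-exit / loaded on VM-entry
 *   };
 *
 * and the VM_{EXIT,ENTRY}_MSR_*_COUNT fields written below tell the CPU
 * how many such entries to process.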
*/ prepare_vmx_msr_autostore_list(&vmx->vcpu, MSR_IA32_TSC); vmcs_write32(VM_EXIT_MSR_STORE_COUNT, vmx->msr_autostore.guest.nr); vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, vmx->msr_autoload.host.nr); vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, vmx->msr_autoload.guest.nr); set_cr4_guest_host_mask(vmx); } /* * prepare_vmcs02 is called when the L1 guest hypervisor runs its nested * L2 guest. L1 has a vmcs for L2 (vmcs12), and this function "merges" it * with L0's requirements for its guest (a.k.a. vmcs01), so we can run the L2 * guest in a way that will both be appropriate to L1's requests, and our * needs. In addition to modifying the active vmcs (which is vmcs02), this * function also has additional necessary side-effects, like setting various * vcpu->arch fields. * Returns 0 on success, 1 on failure. Invalid state exit qualification code * is assigned to entry_failure_code on failure. */ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12, bool from_vmentry, enum vm_entry_failure_code *entry_failure_code) { struct vcpu_vmx *vmx = to_vmx(vcpu); struct hv_enlightened_vmcs *evmcs = nested_vmx_evmcs(vmx); bool load_guest_pdptrs_vmcs12 = false; if (vmx->nested.dirty_vmcs12 || nested_vmx_is_evmptr12_valid(vmx)) { prepare_vmcs02_rare(vmx, vmcs12); vmx->nested.dirty_vmcs12 = false; load_guest_pdptrs_vmcs12 = !nested_vmx_is_evmptr12_valid(vmx) || !(evmcs->hv_clean_fields & HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1); } if (vmx->nested.nested_run_pending && (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS)) { kvm_set_dr(vcpu, 7, vmcs12->guest_dr7); vmcs_write64(GUEST_IA32_DEBUGCTL, vmcs12->guest_ia32_debugctl); } else { kvm_set_dr(vcpu, 7, vcpu->arch.dr7); vmcs_write64(GUEST_IA32_DEBUGCTL, vmx->nested.pre_vmenter_debugctl); } if (kvm_mpx_supported() && (!vmx->nested.nested_run_pending || !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))) vmcs_write64(GUEST_BNDCFGS, vmx->nested.pre_vmenter_bndcfgs); vmx_set_rflags(vcpu, vmcs12->guest_rflags); /* EXCEPTION_BITMAP and CR0_GUEST_HOST_MASK should basically be the * bitwise-or of what L1 wants to trap for L2, and what we want to * trap. Note that CR0.TS also needs updating - we do this later. */ vmx_update_exception_bitmap(vcpu); vcpu->arch.cr0_guest_owned_bits &= ~vmcs12->cr0_guest_host_mask; vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits); if (vmx->nested.nested_run_pending && (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PAT)) { vmcs_write64(GUEST_IA32_PAT, vmcs12->guest_ia32_pat); vcpu->arch.pat = vmcs12->guest_ia32_pat; } else if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) { vmcs_write64(GUEST_IA32_PAT, vmx->vcpu.arch.pat); } vcpu->arch.tsc_offset = kvm_calc_nested_tsc_offset( vcpu->arch.l1_tsc_offset, vmx_get_l2_tsc_offset(vcpu), vmx_get_l2_tsc_multiplier(vcpu)); vcpu->arch.tsc_scaling_ratio = kvm_calc_nested_tsc_multiplier( vcpu->arch.l1_tsc_scaling_ratio, vmx_get_l2_tsc_multiplier(vcpu)); vmcs_write64(TSC_OFFSET, vcpu->arch.tsc_offset); if (kvm_caps.has_tsc_control) vmcs_write64(TSC_MULTIPLIER, vcpu->arch.tsc_scaling_ratio); nested_vmx_transition_tlb_flush(vcpu, vmcs12, true); if (nested_cpu_has_ept(vmcs12)) nested_ept_init_mmu_context(vcpu); /* * Override the CR0/CR4 read shadows after setting the effective guest * CR0/CR4. The common helpers also set the shadows, but they don't * account for vmcs12's cr0/4_guest_host_mask. 
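 *
 * Concretely: for every bit set in vmcs12->cr0_guest_host_mask, an L2
 * read of CR0 must observe the value L1 put in vmcs12->cr0_read_shadow,
 * not the effective guest CR0. E.g. if L1 owns CR0.TS with a shadow
 * value of 1 while the effective guest CR0 has TS clear, L2's
 * mov-from-CR0 must still see TS=1, which is what nested_read_cr0()
 * below folds into vmcs02's CR0_READ_SHADOW.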
*/ vmx_set_cr0(vcpu, vmcs12->guest_cr0); vmcs_writel(CR0_READ_SHADOW, nested_read_cr0(vmcs12)); vmx_set_cr4(vcpu, vmcs12->guest_cr4); vmcs_writel(CR4_READ_SHADOW, nested_read_cr4(vmcs12)); vcpu->arch.efer = nested_vmx_calc_efer(vmx, vmcs12); /* Note: may modify VM_ENTRY/EXIT_CONTROLS and GUEST/HOST_IA32_EFER */ vmx_set_efer(vcpu, vcpu->arch.efer); /* * Guest state is invalid and unrestricted guest is disabled, * which means L1 attempted VMEntry to L2 with invalid state. * Fail the VMEntry. * * However when force loading the guest state (SMM exit or * loading nested state after migration, it is possible to * have invalid guest state now, which will be later fixed by * restoring L2 register state */ if (CC(from_vmentry && !vmx_guest_state_valid(vcpu))) { *entry_failure_code = ENTRY_FAIL_DEFAULT; return -EINVAL; } /* Shadow page tables on either EPT or shadow page tables. */ if (nested_vmx_load_cr3(vcpu, vmcs12->guest_cr3, nested_cpu_has_ept(vmcs12), from_vmentry, entry_failure_code)) return -EINVAL; /* * Immediately write vmcs02.GUEST_CR3. It will be propagated to vmcs12 * on nested VM-Exit, which can occur without actually running L2 and * thus without hitting vmx_load_mmu_pgd(), e.g. if L1 is entering L2 with * vmcs12.GUEST_ACTIVITYSTATE=HLT, in which case KVM will intercept the * transition to HLT instead of running L2. */ if (enable_ept) vmcs_writel(GUEST_CR3, vmcs12->guest_cr3); /* Late preparation of GUEST_PDPTRs now that EFER and CRs are set. */ if (load_guest_pdptrs_vmcs12 && nested_cpu_has_ept(vmcs12) && is_pae_paging(vcpu)) { vmcs_write64(GUEST_PDPTR0, vmcs12->guest_pdptr0); vmcs_write64(GUEST_PDPTR1, vmcs12->guest_pdptr1); vmcs_write64(GUEST_PDPTR2, vmcs12->guest_pdptr2); vmcs_write64(GUEST_PDPTR3, vmcs12->guest_pdptr3); } if ((vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL) && kvm_pmu_has_perf_global_ctrl(vcpu_to_pmu(vcpu)) && WARN_ON_ONCE(kvm_set_msr(vcpu, MSR_CORE_PERF_GLOBAL_CTRL, vmcs12->guest_ia32_perf_global_ctrl))) { *entry_failure_code = ENTRY_FAIL_DEFAULT; return -EINVAL; } kvm_rsp_write(vcpu, vmcs12->guest_rsp); kvm_rip_write(vcpu, vmcs12->guest_rip); /* * It was observed that genuine Hyper-V running in L1 doesn't reset * 'hv_clean_fields' by itself, it only sets the corresponding dirty * bits when it changes a field in eVMCS. Mark all fields as clean * here. */ if (nested_vmx_is_evmptr12_valid(vmx)) evmcs->hv_clean_fields |= HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL; return 0; } static int nested_vmx_check_nmi_controls(struct vmcs12 *vmcs12) { if (CC(!nested_cpu_has_nmi_exiting(vmcs12) && nested_cpu_has_virtual_nmis(vmcs12))) return -EINVAL; if (CC(!nested_cpu_has_virtual_nmis(vmcs12) && nested_cpu_has(vmcs12, CPU_BASED_NMI_WINDOW_EXITING))) return -EINVAL; return 0; } static bool nested_vmx_check_eptp(struct kvm_vcpu *vcpu, u64 new_eptp) { struct vcpu_vmx *vmx = to_vmx(vcpu); /* Check for memory type validity */ switch (new_eptp & VMX_EPTP_MT_MASK) { case VMX_EPTP_MT_UC: if (CC(!(vmx->nested.msrs.ept_caps & VMX_EPTP_UC_BIT))) return false; break; case VMX_EPTP_MT_WB: if (CC(!(vmx->nested.msrs.ept_caps & VMX_EPTP_WB_BIT))) return false; break; default: return false; } /* Page-walk levels validity. 
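 *
 * For reference, the EPTP layout being validated here is roughly:
 *
 *   bits  2:0    EPT paging-structure memory type (0 = UC, 6 = WB)
 *   bits  5:3    page-walk length minus 1 (3 -> 4-level, 4 -> 5-level)
 *   bit   6      enable accessed/dirty flags
 *   bits 11:7    reserved, must be zero (hence the (new_eptp >> 7) & 0x1f
 *                check further down)
 *   bits N-1:12  physical address of the root EPT table, where N is the
 *                guest's MAXPHYADDR (checked via kvm_vcpu_is_legal_gpa())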
*/ switch (new_eptp & VMX_EPTP_PWL_MASK) { case VMX_EPTP_PWL_5: if (CC(!(vmx->nested.msrs.ept_caps & VMX_EPT_PAGE_WALK_5_BIT))) return false; break; case VMX_EPTP_PWL_4: if (CC(!(vmx->nested.msrs.ept_caps & VMX_EPT_PAGE_WALK_4_BIT))) return false; break; default: return false; } /* Reserved bits should not be set */ if (CC(!kvm_vcpu_is_legal_gpa(vcpu, new_eptp) || ((new_eptp >> 7) & 0x1f))) return false; /* AD, if set, should be supported */ if (new_eptp & VMX_EPTP_AD_ENABLE_BIT) { if (CC(!(vmx->nested.msrs.ept_caps & VMX_EPT_AD_BIT))) return false; } return true; } /* * Checks related to VM-Execution Control Fields */ static int nested_check_vm_execution_controls(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { struct vcpu_vmx *vmx = to_vmx(vcpu); if (CC(!vmx_control_verify(vmcs12->pin_based_vm_exec_control, vmx->nested.msrs.pinbased_ctls_low, vmx->nested.msrs.pinbased_ctls_high)) || CC(!vmx_control_verify(vmcs12->cpu_based_vm_exec_control, vmx->nested.msrs.procbased_ctls_low, vmx->nested.msrs.procbased_ctls_high))) return -EINVAL; if (nested_cpu_has(vmcs12, CPU_BASED_ACTIVATE_SECONDARY_CONTROLS) && CC(!vmx_control_verify(vmcs12->secondary_vm_exec_control, vmx->nested.msrs.secondary_ctls_low, vmx->nested.msrs.secondary_ctls_high))) return -EINVAL; if (CC(vmcs12->cr3_target_count > nested_cpu_vmx_misc_cr3_count(vcpu)) || nested_vmx_check_io_bitmap_controls(vcpu, vmcs12) || nested_vmx_check_msr_bitmap_controls(vcpu, vmcs12) || nested_vmx_check_tpr_shadow_controls(vcpu, vmcs12) || nested_vmx_check_apic_access_controls(vcpu, vmcs12) || nested_vmx_check_apicv_controls(vcpu, vmcs12) || nested_vmx_check_nmi_controls(vmcs12) || nested_vmx_check_pml_controls(vcpu, vmcs12) || nested_vmx_check_unrestricted_guest_controls(vcpu, vmcs12) || nested_vmx_check_mode_based_ept_exec_controls(vcpu, vmcs12) || nested_vmx_check_shadow_vmcs_controls(vcpu, vmcs12) || CC(nested_cpu_has_vpid(vmcs12) && !vmcs12->virtual_processor_id)) return -EINVAL; if (!nested_cpu_has_preemption_timer(vmcs12) && nested_cpu_has_save_preemption_timer(vmcs12)) return -EINVAL; if (nested_cpu_has_ept(vmcs12) && CC(!nested_vmx_check_eptp(vcpu, vmcs12->ept_pointer))) return -EINVAL; if (nested_cpu_has_vmfunc(vmcs12)) { if (CC(vmcs12->vm_function_control & ~vmx->nested.msrs.vmfunc_controls)) return -EINVAL; if (nested_cpu_has_eptp_switching(vmcs12)) { if (CC(!nested_cpu_has_ept(vmcs12)) || CC(!page_address_valid(vcpu, vmcs12->eptp_list_address))) return -EINVAL; } } return 0; } /* * Checks related to VM-Exit Control Fields */ static int nested_check_vm_exit_controls(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { struct vcpu_vmx *vmx = to_vmx(vcpu); if (CC(!vmx_control_verify(vmcs12->vm_exit_controls, vmx->nested.msrs.exit_ctls_low, vmx->nested.msrs.exit_ctls_high)) || CC(nested_vmx_check_exit_msr_switch_controls(vcpu, vmcs12))) return -EINVAL; return 0; } /* * Checks related to VM-Entry Control Fields */ static int nested_check_vm_entry_controls(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { struct vcpu_vmx *vmx = to_vmx(vcpu); if (CC(!vmx_control_verify(vmcs12->vm_entry_controls, vmx->nested.msrs.entry_ctls_low, vmx->nested.msrs.entry_ctls_high))) return -EINVAL; /* * From the Intel SDM, volume 3: * Fields relevant to VM-entry event injection must be set properly. * These fields are the VM-entry interruption-information field, the * VM-entry exception error code, and the VM-entry instruction length. 
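 *
 * For reference, the VM-entry interruption-information field checked
 * below is laid out roughly as:
 *
 *   bits  7:0    vector
 *   bits 10:8    interruption type (external interrupt, NMI, hardware or
 *                software exception, software interrupt, privileged
 *                software exception, other event)
 *   bit  11      deliver error code
 *   bits 30:12   reserved, must be zero
 *   bit  31      valid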
*/ if (vmcs12->vm_entry_intr_info_field & INTR_INFO_VALID_MASK) { u32 intr_info = vmcs12->vm_entry_intr_info_field; u8 vector = intr_info & INTR_INFO_VECTOR_MASK; u32 intr_type = intr_info & INTR_INFO_INTR_TYPE_MASK; bool has_error_code = intr_info & INTR_INFO_DELIVER_CODE_MASK; bool should_have_error_code; bool urg = nested_cpu_has2(vmcs12, SECONDARY_EXEC_UNRESTRICTED_GUEST); bool prot_mode = !urg || vmcs12->guest_cr0 & X86_CR0_PE; /* VM-entry interruption-info field: interruption type */ if (CC(intr_type == INTR_TYPE_RESERVED) || CC(intr_type == INTR_TYPE_OTHER_EVENT && !nested_cpu_supports_monitor_trap_flag(vcpu))) return -EINVAL; /* VM-entry interruption-info field: vector */ if (CC(intr_type == INTR_TYPE_NMI_INTR && vector != NMI_VECTOR) || CC(intr_type == INTR_TYPE_HARD_EXCEPTION && vector > 31) || CC(intr_type == INTR_TYPE_OTHER_EVENT && vector != 0)) return -EINVAL; /* VM-entry interruption-info field: deliver error code */ should_have_error_code = intr_type == INTR_TYPE_HARD_EXCEPTION && prot_mode && x86_exception_has_error_code(vector); if (CC(has_error_code != should_have_error_code)) return -EINVAL; /* VM-entry exception error code */ if (CC(has_error_code && vmcs12->vm_entry_exception_error_code & GENMASK(31, 16))) return -EINVAL; /* VM-entry interruption-info field: reserved bits */ if (CC(intr_info & INTR_INFO_RESVD_BITS_MASK)) return -EINVAL; /* VM-entry instruction length */ switch (intr_type) { case INTR_TYPE_SOFT_EXCEPTION: case INTR_TYPE_SOFT_INTR: case INTR_TYPE_PRIV_SW_EXCEPTION: if (CC(vmcs12->vm_entry_instruction_len > 15) || CC(vmcs12->vm_entry_instruction_len == 0 && CC(!nested_cpu_has_zero_length_injection(vcpu)))) return -EINVAL; } } if (nested_vmx_check_entry_msr_switch_controls(vcpu, vmcs12)) return -EINVAL; return 0; } static int nested_vmx_check_controls(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { if (nested_check_vm_execution_controls(vcpu, vmcs12) || nested_check_vm_exit_controls(vcpu, vmcs12) || nested_check_vm_entry_controls(vcpu, vmcs12)) return -EINVAL; #ifdef CONFIG_KVM_HYPERV if (guest_cpu_cap_has_evmcs(vcpu)) return nested_evmcs_check_controls(vmcs12); #endif return 0; } static int nested_vmx_check_address_space_size(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { #ifdef CONFIG_X86_64 if (CC(!!(vmcs12->vm_exit_controls & VM_EXIT_HOST_ADDR_SPACE_SIZE) != !!(vcpu->arch.efer & EFER_LMA))) return -EINVAL; #endif return 0; } static bool is_l1_noncanonical_address_on_vmexit(u64 la, struct vmcs12 *vmcs12) { /* * Check that the given linear address is canonical after a VM exit * from L2, based on HOST_CR4.LA57 value that will be loaded for L1. */ u8 l1_address_bits_on_exit = (vmcs12->host_cr4 & X86_CR4_LA57) ? 
57 : 48; return !__is_canonical_address(la, l1_address_bits_on_exit); } static int nested_vmx_check_host_state(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { bool ia32e = !!(vmcs12->vm_exit_controls & VM_EXIT_HOST_ADDR_SPACE_SIZE); if (CC(!nested_host_cr0_valid(vcpu, vmcs12->host_cr0)) || CC(!nested_host_cr4_valid(vcpu, vmcs12->host_cr4)) || CC(!kvm_vcpu_is_legal_cr3(vcpu, vmcs12->host_cr3))) return -EINVAL; if (CC(is_noncanonical_msr_address(vmcs12->host_ia32_sysenter_esp, vcpu)) || CC(is_noncanonical_msr_address(vmcs12->host_ia32_sysenter_eip, vcpu))) return -EINVAL; if ((vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PAT) && CC(!kvm_pat_valid(vmcs12->host_ia32_pat))) return -EINVAL; if ((vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL) && CC(!kvm_valid_perf_global_ctrl(vcpu_to_pmu(vcpu), vmcs12->host_ia32_perf_global_ctrl))) return -EINVAL; if (ia32e) { if (CC(!(vmcs12->host_cr4 & X86_CR4_PAE))) return -EINVAL; } else { if (CC(vmcs12->vm_entry_controls & VM_ENTRY_IA32E_MODE) || CC(vmcs12->host_cr4 & X86_CR4_PCIDE) || CC((vmcs12->host_rip) >> 32)) return -EINVAL; } if (CC(vmcs12->host_cs_selector & (SEGMENT_RPL_MASK | SEGMENT_TI_MASK)) || CC(vmcs12->host_ss_selector & (SEGMENT_RPL_MASK | SEGMENT_TI_MASK)) || CC(vmcs12->host_ds_selector & (SEGMENT_RPL_MASK | SEGMENT_TI_MASK)) || CC(vmcs12->host_es_selector & (SEGMENT_RPL_MASK | SEGMENT_TI_MASK)) || CC(vmcs12->host_fs_selector & (SEGMENT_RPL_MASK | SEGMENT_TI_MASK)) || CC(vmcs12->host_gs_selector & (SEGMENT_RPL_MASK | SEGMENT_TI_MASK)) || CC(vmcs12->host_tr_selector & (SEGMENT_RPL_MASK | SEGMENT_TI_MASK)) || CC(vmcs12->host_cs_selector == 0) || CC(vmcs12->host_tr_selector == 0) || CC(vmcs12->host_ss_selector == 0 && !ia32e)) return -EINVAL; if (CC(is_noncanonical_base_address(vmcs12->host_fs_base, vcpu)) || CC(is_noncanonical_base_address(vmcs12->host_gs_base, vcpu)) || CC(is_noncanonical_base_address(vmcs12->host_gdtr_base, vcpu)) || CC(is_noncanonical_base_address(vmcs12->host_idtr_base, vcpu)) || CC(is_noncanonical_base_address(vmcs12->host_tr_base, vcpu)) || CC(is_l1_noncanonical_address_on_vmexit(vmcs12->host_rip, vmcs12))) return -EINVAL; /* * If the load IA32_EFER VM-exit control is 1, bits reserved in the * IA32_EFER MSR must be 0 in the field for that register. In addition, * the values of the LMA and LME bits in the field must each be that of * the host address-space size VM-exit control. 
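 *
 * E.g. with the "host address-space size" exit control set (ia32e == 1),
 * host_ia32_efer must have both LME (bit 8) and LMA (bit 10) set; with
 * it clear, both bits must be clear. That is what the CC() checks below
 * enforce, on top of the generic kvm_valid_efer() check for reserved
 * bits.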
*/ if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_EFER) { if (CC(!kvm_valid_efer(vcpu, vmcs12->host_ia32_efer)) || CC(ia32e != !!(vmcs12->host_ia32_efer & EFER_LMA)) || CC(ia32e != !!(vmcs12->host_ia32_efer & EFER_LME))) return -EINVAL; } return 0; } static int nested_vmx_check_vmcs_link_ptr(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { struct vcpu_vmx *vmx = to_vmx(vcpu); struct gfn_to_hva_cache *ghc = &vmx->nested.shadow_vmcs12_cache; struct vmcs_hdr hdr; if (vmcs12->vmcs_link_pointer == INVALID_GPA) return 0; if (CC(!page_address_valid(vcpu, vmcs12->vmcs_link_pointer))) return -EINVAL; if (ghc->gpa != vmcs12->vmcs_link_pointer && CC(kvm_gfn_to_hva_cache_init(vcpu->kvm, ghc, vmcs12->vmcs_link_pointer, VMCS12_SIZE))) return -EINVAL; if (CC(kvm_read_guest_offset_cached(vcpu->kvm, ghc, &hdr, offsetof(struct vmcs12, hdr), sizeof(hdr)))) return -EINVAL; if (CC(hdr.revision_id != VMCS12_REVISION) || CC(hdr.shadow_vmcs != nested_cpu_has_shadow_vmcs(vmcs12))) return -EINVAL; return 0; } /* * Checks related to Guest Non-register State */ static int nested_check_guest_non_reg_state(struct vmcs12 *vmcs12) { if (CC(vmcs12->guest_activity_state != GUEST_ACTIVITY_ACTIVE && vmcs12->guest_activity_state != GUEST_ACTIVITY_HLT && vmcs12->guest_activity_state != GUEST_ACTIVITY_WAIT_SIPI)) return -EINVAL; return 0; } static int nested_vmx_check_guest_state(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12, enum vm_entry_failure_code *entry_failure_code) { bool ia32e = !!(vmcs12->vm_entry_controls & VM_ENTRY_IA32E_MODE); *entry_failure_code = ENTRY_FAIL_DEFAULT; if (CC(!nested_guest_cr0_valid(vcpu, vmcs12->guest_cr0)) || CC(!nested_guest_cr4_valid(vcpu, vmcs12->guest_cr4))) return -EINVAL; if ((vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS) && CC(!kvm_dr7_valid(vmcs12->guest_dr7))) return -EINVAL; if ((vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PAT) && CC(!kvm_pat_valid(vmcs12->guest_ia32_pat))) return -EINVAL; if (nested_vmx_check_vmcs_link_ptr(vcpu, vmcs12)) { *entry_failure_code = ENTRY_FAIL_VMCS_LINK_PTR; return -EINVAL; } if ((vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL) && CC(!kvm_valid_perf_global_ctrl(vcpu_to_pmu(vcpu), vmcs12->guest_ia32_perf_global_ctrl))) return -EINVAL; if (CC((vmcs12->guest_cr0 & (X86_CR0_PG | X86_CR0_PE)) == X86_CR0_PG)) return -EINVAL; if (CC(ia32e && !(vmcs12->guest_cr4 & X86_CR4_PAE)) || CC(ia32e && !(vmcs12->guest_cr0 & X86_CR0_PG))) return -EINVAL; /* * If the load IA32_EFER VM-entry control is 1, the following checks * are performed on the field for the IA32_EFER MSR: * - Bits reserved in the IA32_EFER MSR must be 0. * - Bit 10 (corresponding to IA32_EFER.LMA) must equal the value of * the IA-32e mode guest VM-exit control. It must also be identical * to bit 8 (LME) if bit 31 in the CR0 field (corresponding to * CR0.PG) is 1. 
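 *
 * In the checks below, the "IA-32e mode guest" setting is vmcs12's
 * VM_ENTRY_IA32E_MODE entry control, cached in the ia32e local above:
 * e.g. with ia32e == 1, guest_ia32_efer must have LMA set and, when
 * guest_cr0 has PG set, LME set as well.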
*/ if (to_vmx(vcpu)->nested.nested_run_pending && (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_EFER)) { if (CC(!kvm_valid_efer(vcpu, vmcs12->guest_ia32_efer)) || CC(ia32e != !!(vmcs12->guest_ia32_efer & EFER_LMA)) || CC(((vmcs12->guest_cr0 & X86_CR0_PG) && ia32e != !!(vmcs12->guest_ia32_efer & EFER_LME)))) return -EINVAL; } if ((vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS) && (CC(is_noncanonical_msr_address(vmcs12->guest_bndcfgs & PAGE_MASK, vcpu)) || CC((vmcs12->guest_bndcfgs & MSR_IA32_BNDCFGS_RSVD)))) return -EINVAL; if (nested_check_guest_non_reg_state(vmcs12)) return -EINVAL; return 0; } static int nested_vmx_check_vmentry_hw(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); unsigned long cr3, cr4; bool vm_fail; if (!nested_early_check) return 0; if (vmx->msr_autoload.host.nr) vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, 0); if (vmx->msr_autoload.guest.nr) vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, 0); preempt_disable(); vmx_prepare_switch_to_guest(vcpu); /* * Induce a consistency check VMExit by clearing bit 1 in GUEST_RFLAGS, * which is reserved to '1' by hardware. GUEST_RFLAGS is guaranteed to * be written (by prepare_vmcs02()) before the "real" VMEnter, i.e. * there is no need to preserve other bits or save/restore the field. */ vmcs_writel(GUEST_RFLAGS, 0); cr3 = __get_current_cr3_fast(); if (unlikely(cr3 != vmx->loaded_vmcs->host_state.cr3)) { vmcs_writel(HOST_CR3, cr3); vmx->loaded_vmcs->host_state.cr3 = cr3; } cr4 = cr4_read_shadow(); if (unlikely(cr4 != vmx->loaded_vmcs->host_state.cr4)) { vmcs_writel(HOST_CR4, cr4); vmx->loaded_vmcs->host_state.cr4 = cr4; } vm_fail = __vmx_vcpu_run(vmx, (unsigned long *)&vcpu->arch.regs, __vmx_vcpu_run_flags(vmx)); if (vmx->msr_autoload.host.nr) vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, vmx->msr_autoload.host.nr); if (vmx->msr_autoload.guest.nr) vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, vmx->msr_autoload.guest.nr); if (vm_fail) { u32 error = vmcs_read32(VM_INSTRUCTION_ERROR); preempt_enable(); trace_kvm_nested_vmenter_failed( "early hardware check VM-instruction error: ", error); WARN_ON_ONCE(error != VMXERR_ENTRY_INVALID_CONTROL_FIELD); return 1; } /* * VMExit clears RFLAGS.IF and DR7, even on a consistency check. */ if (hw_breakpoint_active()) set_debugreg(__this_cpu_read(cpu_dr7), 7); local_irq_enable(); preempt_enable(); /* * A non-failing VMEntry means we somehow entered guest mode with * an illegal RIP, and that's just the tip of the iceberg. There * is no telling what memory has been modified or what state has * been exposed to unknown code. Hitting this all but guarantees * a (very critical) hardware issue. */ WARN_ON(!(vmcs_read32(VM_EXIT_REASON) & VMX_EXIT_REASONS_FAILED_VMENTRY)); return 0; } #ifdef CONFIG_KVM_HYPERV static bool nested_get_evmcs_page(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); /* * hv_evmcs may end up being not mapped after migration (when * L2 was running), map it here to make sure vmcs12 changes are * properly reflected. */ if (guest_cpu_cap_has_evmcs(vcpu) && vmx->nested.hv_evmcs_vmptr == EVMPTR_MAP_PENDING) { enum nested_evmptrld_status evmptrld_status = nested_vmx_handle_enlightened_vmptrld(vcpu, false); if (evmptrld_status == EVMPTRLD_VMFAIL || evmptrld_status == EVMPTRLD_ERROR) return false; /* * Post migration VMCS12 always provides the most actual * information, copy it to eVMCS upon entry. 
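 *
 * Setting need_vmcs12_to_shadow_sync below makes the next
 * nested_sync_vmcs12_to_shadow() call run copy_vmcs12_to_enlightened(),
 * so the vmcs12 restored via KVM_SET_NESTED_STATE overwrites whatever
 * stale contents the guest's eVMCS page may hold after migration.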
*/ vmx->nested.need_vmcs12_to_shadow_sync = true; } return true; } #endif static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu) { struct vmcs12 *vmcs12 = get_vmcs12(vcpu); struct vcpu_vmx *vmx = to_vmx(vcpu); struct kvm_host_map *map; if (!vcpu->arch.pdptrs_from_userspace && !nested_cpu_has_ept(vmcs12) && is_pae_paging(vcpu)) { /* * Reload the guest's PDPTRs since after a migration * the guest CR3 might be restored prior to setting the nested * state which can lead to a load of wrong PDPTRs. */ if (CC(!load_pdptrs(vcpu, vcpu->arch.cr3))) return false; } if (nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES)) { map = &vmx->nested.apic_access_page_map; if (!kvm_vcpu_map(vcpu, gpa_to_gfn(vmcs12->apic_access_addr), map)) { vmcs_write64(APIC_ACCESS_ADDR, pfn_to_hpa(map->pfn)); } else { pr_debug_ratelimited("%s: no backing for APIC-access address in vmcs12\n", __func__); vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR; vcpu->run->internal.suberror = KVM_INTERNAL_ERROR_EMULATION; vcpu->run->internal.ndata = 0; return false; } } if (nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW)) { map = &vmx->nested.virtual_apic_map; if (!kvm_vcpu_map(vcpu, gpa_to_gfn(vmcs12->virtual_apic_page_addr), map)) { vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, pfn_to_hpa(map->pfn)); } else if (nested_cpu_has(vmcs12, CPU_BASED_CR8_LOAD_EXITING) && nested_cpu_has(vmcs12, CPU_BASED_CR8_STORE_EXITING) && !nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES)) { /* * The processor will never use the TPR shadow, simply * clear the bit from the execution control. Such a * configuration is useless, but it happens in tests. * For any other configuration, failing the vm entry is * _not_ what the processor does but it's basically the * only possibility we have. */ exec_controls_clearbit(vmx, CPU_BASED_TPR_SHADOW); } else { /* * Write an illegal value to VIRTUAL_APIC_PAGE_ADDR to * force VM-Entry to fail. */ vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, INVALID_GPA); } } if (nested_cpu_has_posted_intr(vmcs12)) { map = &vmx->nested.pi_desc_map; if (!kvm_vcpu_map(vcpu, gpa_to_gfn(vmcs12->posted_intr_desc_addr), map)) { vmx->nested.pi_desc = (struct pi_desc *)(((void *)map->hva) + offset_in_page(vmcs12->posted_intr_desc_addr)); vmcs_write64(POSTED_INTR_DESC_ADDR, pfn_to_hpa(map->pfn) + offset_in_page(vmcs12->posted_intr_desc_addr)); } else { /* * Defer the KVM_INTERNAL_EXIT until KVM tries to * access the contents of the VMCS12 posted interrupt * descriptor. (Note that KVM may do this when it * should not, per the architectural specification.) */ vmx->nested.pi_desc = NULL; pin_controls_clearbit(vmx, PIN_BASED_POSTED_INTR); } } if (nested_vmx_prepare_msr_bitmap(vcpu, vmcs12)) exec_controls_setbit(vmx, CPU_BASED_USE_MSR_BITMAPS); else exec_controls_clearbit(vmx, CPU_BASED_USE_MSR_BITMAPS); return true; } static bool vmx_get_nested_state_pages(struct kvm_vcpu *vcpu) { #ifdef CONFIG_KVM_HYPERV /* * Note: nested_get_evmcs_page() also updates 'vp_assist_page' copy * in 'struct kvm_vcpu_hv' in case eVMCS is in use, this is mandatory * to make nested_evmcs_l2_tlb_flush_enabled() work correctly post * migration. 
*/ if (!nested_get_evmcs_page(vcpu)) { pr_debug_ratelimited("%s: enlightened vmptrld failed\n", __func__); vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR; vcpu->run->internal.suberror = KVM_INTERNAL_ERROR_EMULATION; vcpu->run->internal.ndata = 0; return false; } #endif if (is_guest_mode(vcpu) && !nested_get_vmcs12_pages(vcpu)) return false; return true; } static int nested_vmx_write_pml_buffer(struct kvm_vcpu *vcpu, gpa_t gpa) { struct vmcs12 *vmcs12; struct vcpu_vmx *vmx = to_vmx(vcpu); gpa_t dst; if (WARN_ON_ONCE(!is_guest_mode(vcpu))) return 0; if (WARN_ON_ONCE(vmx->nested.pml_full)) return 1; /* * Check if PML is enabled for the nested guest. Whether eptp bit 6 is * set is already checked as part of A/D emulation. */ vmcs12 = get_vmcs12(vcpu); if (!nested_cpu_has_pml(vmcs12)) return 0; if (vmcs12->guest_pml_index >= PML_LOG_NR_ENTRIES) { vmx->nested.pml_full = true; return 1; } gpa &= ~0xFFFull; dst = vmcs12->pml_address + sizeof(u64) * vmcs12->guest_pml_index; if (kvm_write_guest_page(vcpu->kvm, gpa_to_gfn(dst), &gpa, offset_in_page(dst), sizeof(gpa))) return 0; vmcs12->guest_pml_index--; return 0; } /* * Intel's VMX Instruction Reference specifies a common set of prerequisites * for running VMX instructions (except VMXON, whose prerequisites are * slightly different). It also specifies what exception to inject otherwise. * Note that many of these exceptions have priority over VM exits, so they * don't have to be checked again here. */ static int nested_vmx_check_permission(struct kvm_vcpu *vcpu) { if (!to_vmx(vcpu)->nested.vmxon) { kvm_queue_exception(vcpu, UD_VECTOR); return 0; } if (vmx_get_cpl(vcpu)) { kvm_inject_gp(vcpu, 0); return 0; } return 1; } static void load_vmcs12_host_state(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12); /* * If from_vmentry is false, this is being called from state restore (either RSM * or KVM_SET_NESTED_STATE). Otherwise it's called from vmlaunch/vmresume. * * Returns: * NVMX_VMENTRY_SUCCESS: Entered VMX non-root mode * NVMX_VMENTRY_VMFAIL: Consistency check VMFail * NVMX_VMENTRY_VMEXIT: Consistency check VMExit * NVMX_VMENTRY_KVM_INTERNAL_ERROR: KVM internal error */ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu, bool from_vmentry) { struct vcpu_vmx *vmx = to_vmx(vcpu); struct vmcs12 *vmcs12 = get_vmcs12(vcpu); enum vm_entry_failure_code entry_failure_code; union vmx_exit_reason exit_reason = { .basic = EXIT_REASON_INVALID_STATE, .failed_vmentry = 1, }; u32 failed_index; trace_kvm_nested_vmenter(kvm_rip_read(vcpu), vmx->nested.current_vmptr, vmcs12->guest_rip, vmcs12->guest_intr_status, vmcs12->vm_entry_intr_info_field, vmcs12->secondary_vm_exec_control & SECONDARY_EXEC_ENABLE_EPT, vmcs12->ept_pointer, vmcs12->guest_cr3, KVM_ISA_VMX); kvm_service_local_tlb_flush_requests(vcpu); if (!vmx->nested.nested_run_pending || !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS)) vmx->nested.pre_vmenter_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL); if (kvm_mpx_supported() && (!vmx->nested.nested_run_pending || !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))) vmx->nested.pre_vmenter_bndcfgs = vmcs_read64(GUEST_BNDCFGS); /* * Overwrite vmcs01.GUEST_CR3 with L1's CR3 if EPT is disabled *and* * nested early checks are disabled. In the event of a "late" VM-Fail, * i.e. a VM-Fail detected by hardware but not KVM, KVM must unwind its * software model to the pre-VMEntry host state. 
When EPT is disabled, * GUEST_CR3 holds KVM's shadow CR3, not L1's "real" CR3, which causes * nested_vmx_restore_host_state() to corrupt vcpu->arch.cr3. Stuffing * vmcs01.GUEST_CR3 results in the unwind naturally setting arch.cr3 to * the correct value. Smashing vmcs01.GUEST_CR3 is safe because nested * VM-Exits, and the unwind, reset KVM's MMU, i.e. vmcs01.GUEST_CR3 is * guaranteed to be overwritten with a shadow CR3 prior to re-entering * L1. Don't stuff vmcs01.GUEST_CR3 when using nested early checks as * KVM modifies vcpu->arch.cr3 if and only if the early hardware checks * pass, and early VM-Fails do not reset KVM's MMU, i.e. the VM-Fail * path would need to manually save/restore vmcs01.GUEST_CR3. */ if (!enable_ept && !nested_early_check) vmcs_writel(GUEST_CR3, vcpu->arch.cr3); vmx_switch_vmcs(vcpu, &vmx->nested.vmcs02); prepare_vmcs02_early(vmx, &vmx->vmcs01, vmcs12); if (from_vmentry) { if (unlikely(!nested_get_vmcs12_pages(vcpu))) { vmx_switch_vmcs(vcpu, &vmx->vmcs01); return NVMX_VMENTRY_KVM_INTERNAL_ERROR; } if (nested_vmx_check_vmentry_hw(vcpu)) { vmx_switch_vmcs(vcpu, &vmx->vmcs01); return NVMX_VMENTRY_VMFAIL; } if (nested_vmx_check_guest_state(vcpu, vmcs12, &entry_failure_code)) { exit_reason.basic = EXIT_REASON_INVALID_STATE; vmcs12->exit_qualification = entry_failure_code; goto vmentry_fail_vmexit; } } enter_guest_mode(vcpu); if (prepare_vmcs02(vcpu, vmcs12, from_vmentry, &entry_failure_code)) { exit_reason.basic = EXIT_REASON_INVALID_STATE; vmcs12->exit_qualification = entry_failure_code; goto vmentry_fail_vmexit_guest_mode; } if (from_vmentry) { failed_index = nested_vmx_load_msr(vcpu, vmcs12->vm_entry_msr_load_addr, vmcs12->vm_entry_msr_load_count); if (failed_index) { exit_reason.basic = EXIT_REASON_MSR_LOAD_FAIL; vmcs12->exit_qualification = failed_index; goto vmentry_fail_vmexit_guest_mode; } } else { /* * The MMU is not initialized to point at the right entities yet and * "get pages" would need to read data from the guest (i.e. we will * need to perform gpa to hpa translation). Request a call * to nested_get_vmcs12_pages before the next VM-entry. The MSRs * have already been set at vmentry time and should not be reset. */ kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu); } /* * Re-evaluate pending events if L1 had a pending IRQ/NMI/INIT/SIPI * when it executed VMLAUNCH/VMRESUME, as entering non-root mode can * effectively unblock various events, e.g. INIT/SIPI cause VM-Exit * unconditionally. Take care to pull data from vmcs01 as appropriate, * e.g. when checking for interrupt windows, as vmcs02 is now loaded. */ if ((__exec_controls_get(&vmx->vmcs01) & (CPU_BASED_INTR_WINDOW_EXITING | CPU_BASED_NMI_WINDOW_EXITING)) || kvm_apic_has_pending_init_or_sipi(vcpu) || kvm_apic_has_interrupt(vcpu)) kvm_make_request(KVM_REQ_EVENT, vcpu); /* * Do not start the preemption timer hrtimer until after we know * we are successful, so that only nested_vmx_vmexit needs to cancel * the timer. */ vmx->nested.preemption_timer_expired = false; if (nested_cpu_has_preemption_timer(vmcs12)) { u64 timer_value = vmx_calc_preemption_timer_value(vcpu); vmx_start_preemption_timer(vcpu, timer_value); } /* * Note no nested_vmx_succeed or nested_vmx_fail here. At this point * we are no longer running L1, and VMLAUNCH/VMRESUME has not yet * returned as far as L1 is concerned. It will only return (and set * the success flag) when L2 exits (see nested_vmx_vmexit()). 
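 *
 * (For reference, nested_vmx_succeed()/nested_vmx_fail() emulate the
 * architectural VMX success/failure convention in the guest's RFLAGS:
 * VMsucceed clears all of CF/PF/AF/ZF/SF/OF, VMfailInvalid sets CF (there
 * is no current VMCS to hold an error number), and VMfailValid sets ZF
 * and stores the error number in the VM-instruction error field.)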
*/ return NVMX_VMENTRY_SUCCESS; /* * A failed consistency check that leads to a VMExit during L1's * VMEnter to L2 is a variation of a normal VMexit, as explained in * 26.7 "VM-entry failures during or after loading guest state". */ vmentry_fail_vmexit_guest_mode: if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETTING) vcpu->arch.tsc_offset -= vmcs12->tsc_offset; leave_guest_mode(vcpu); vmentry_fail_vmexit: vmx_switch_vmcs(vcpu, &vmx->vmcs01); if (!from_vmentry) return NVMX_VMENTRY_VMEXIT; load_vmcs12_host_state(vcpu, vmcs12); vmcs12->vm_exit_reason = exit_reason.full; if (enable_shadow_vmcs || nested_vmx_is_evmptr12_valid(vmx)) vmx->nested.need_vmcs12_to_shadow_sync = true; return NVMX_VMENTRY_VMEXIT; } /* * nested_vmx_run() handles a nested entry, i.e., a VMLAUNCH or VMRESUME on L1 * for running an L2 nested guest. */ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch) { struct vmcs12 *vmcs12; enum nvmx_vmentry_status status; struct vcpu_vmx *vmx = to_vmx(vcpu); u32 interrupt_shadow = vmx_get_interrupt_shadow(vcpu); enum nested_evmptrld_status evmptrld_status; if (!nested_vmx_check_permission(vcpu)) return 1; evmptrld_status = nested_vmx_handle_enlightened_vmptrld(vcpu, launch); if (evmptrld_status == EVMPTRLD_ERROR) { kvm_queue_exception(vcpu, UD_VECTOR); return 1; } kvm_pmu_trigger_event(vcpu, kvm_pmu_eventsel.BRANCH_INSTRUCTIONS_RETIRED); if (CC(evmptrld_status == EVMPTRLD_VMFAIL)) return nested_vmx_failInvalid(vcpu); if (CC(!nested_vmx_is_evmptr12_valid(vmx) && vmx->nested.current_vmptr == INVALID_GPA)) return nested_vmx_failInvalid(vcpu); vmcs12 = get_vmcs12(vcpu); /* * Can't VMLAUNCH or VMRESUME a shadow VMCS. Despite the fact * that there *is* a valid VMCS pointer, RFLAGS.CF is set * rather than RFLAGS.ZF, and no error number is stored to the * VM-instruction error field. */ if (CC(vmcs12->hdr.shadow_vmcs)) return nested_vmx_failInvalid(vcpu); if (nested_vmx_is_evmptr12_valid(vmx)) { struct hv_enlightened_vmcs *evmcs = nested_vmx_evmcs(vmx); copy_enlightened_to_vmcs12(vmx, evmcs->hv_clean_fields); /* Enlightened VMCS doesn't have launch state */ vmcs12->launch_state = !launch; } else if (enable_shadow_vmcs) { copy_shadow_to_vmcs12(vmx); } /* * The nested entry process starts with enforcing various prerequisites * on vmcs12 as required by the Intel SDM, and act appropriately when * they fail: As the SDM explains, some conditions should cause the * instruction to fail, while others will cause the instruction to seem * to succeed, but return an EXIT_REASON_INVALID_STATE. * To speed up the normal (success) code path, we should avoid checking * for misconfigurations which will anyway be caught by the processor * when using the merged vmcs02. */ if (CC(interrupt_shadow & KVM_X86_SHADOW_INT_MOV_SS)) return nested_vmx_fail(vcpu, VMXERR_ENTRY_EVENTS_BLOCKED_BY_MOV_SS); if (CC(vmcs12->launch_state == launch)) return nested_vmx_fail(vcpu, launch ? VMXERR_VMLAUNCH_NONCLEAR_VMCS : VMXERR_VMRESUME_NONLAUNCHED_VMCS); if (nested_vmx_check_controls(vcpu, vmcs12)) return nested_vmx_fail(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD); if (nested_vmx_check_address_space_size(vcpu, vmcs12)) return nested_vmx_fail(vcpu, VMXERR_ENTRY_INVALID_HOST_STATE_FIELD); if (nested_vmx_check_host_state(vcpu, vmcs12)) return nested_vmx_fail(vcpu, VMXERR_ENTRY_INVALID_HOST_STATE_FIELD); /* * We're finally done with prerequisite checking, and can start with * the nested entry. 
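 *
 * Roughly, the outcomes of the checks above map as follows: a missing or
 * shadow VMCS pointer, or an eVMCS load problem -> VMfailInvalid;
 * MOV-SS blocking, a wrong launch state, or a bad control/host-state
 * field -> VMfailValid with the matching VMXERR_* number; guest-state
 * problems are deliberately left to nested_vmx_enter_non_root_mode(),
 * where they surface as a VM-entry failure VM-exit with
 * EXIT_REASON_INVALID_STATE.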
*/ vmx->nested.nested_run_pending = 1; vmx->nested.has_preemption_timer_deadline = false; status = nested_vmx_enter_non_root_mode(vcpu, true); if (unlikely(status != NVMX_VMENTRY_SUCCESS)) goto vmentry_failed; /* Hide L1D cache contents from the nested guest. */ vmx->vcpu.arch.l1tf_flush_l1d = true; /* * Must happen outside of nested_vmx_enter_non_root_mode() as it will * also be used as part of restoring nVMX state for * snapshot restore (migration). * * In this flow, it is assumed that vmcs12 cache was * transferred as part of captured nVMX state and should * therefore not be read from guest memory (which may not * exist on destination host yet). */ nested_cache_shadow_vmcs12(vcpu, vmcs12); switch (vmcs12->guest_activity_state) { case GUEST_ACTIVITY_HLT: /* * If we're entering a halted L2 vcpu and the L2 vcpu won't be * awakened by event injection or by an NMI-window VM-exit or * by an interrupt-window VM-exit, halt the vcpu. */ if (!(vmcs12->vm_entry_intr_info_field & INTR_INFO_VALID_MASK) && !nested_cpu_has(vmcs12, CPU_BASED_NMI_WINDOW_EXITING) && !(nested_cpu_has(vmcs12, CPU_BASED_INTR_WINDOW_EXITING) && (vmcs12->guest_rflags & X86_EFLAGS_IF))) { vmx->nested.nested_run_pending = 0; return kvm_emulate_halt_noskip(vcpu); } break; case GUEST_ACTIVITY_WAIT_SIPI: vmx->nested.nested_run_pending = 0; vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED; break; default: break; } return 1; vmentry_failed: vmx->nested.nested_run_pending = 0; if (status == NVMX_VMENTRY_KVM_INTERNAL_ERROR) return 0; if (status == NVMX_VMENTRY_VMEXIT) return 1; WARN_ON_ONCE(status != NVMX_VMENTRY_VMFAIL); return nested_vmx_fail(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD); } /* * On a nested exit from L2 to L1, vmcs12.guest_cr0 might not be up-to-date * because L2 may have changed some cr0 bits directly (CR0_GUEST_HOST_MASK). * This function returns the new value we should put in vmcs12.guest_cr0. * It's not enough to just return the vmcs02 GUEST_CR0. Rather, * 1. Bits that neither L0 nor L1 trapped, were set directly by L2 and are now * available in vmcs02 GUEST_CR0. (Note: It's enough to check that L0 * didn't trap the bit, because if L1 did, so would L0). * 2. Bits that L1 asked to trap (and therefore L0 also did) could not have * been modified by L2, and L1 knows it. So just leave the old value of * the bit from vmcs12.guest_cr0. Note that the bit from vmcs02 GUEST_CR0 * isn't relevant, because if L0 traps this bit it can set it to anything. * 3. Bits that L1 didn't trap, but L0 did. L1 believes the guest could have * changed these bits, and therefore they need to be updated, but L0 * didn't necessarily allow them to be changed in GUEST_CR0 - and rather * put them in vmcs02 CR0_READ_SHADOW. So take these bits from there. 
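 *
 * Summarised as a table, the source used for each class of bits is:
 *
 *   bit class                               taken from
 *   -------------------------------------   -------------------------
 *   1. trapped by neither L0 nor L1         vmcs02 GUEST_CR0
 *   2. trapped by L1 (and therefore L0)     vmcs12->guest_cr0 (unchanged)
 *   3. trapped by L0 but not by L1          vmcs02 CR0_READ_SHADOW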
*/ static inline unsigned long vmcs12_guest_cr0(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { return /*1*/ (vmcs_readl(GUEST_CR0) & vcpu->arch.cr0_guest_owned_bits) | /*2*/ (vmcs12->guest_cr0 & vmcs12->cr0_guest_host_mask) | /*3*/ (vmcs_readl(CR0_READ_SHADOW) & ~(vmcs12->cr0_guest_host_mask | vcpu->arch.cr0_guest_owned_bits)); } static inline unsigned long vmcs12_guest_cr4(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { return /*1*/ (vmcs_readl(GUEST_CR4) & vcpu->arch.cr4_guest_owned_bits) | /*2*/ (vmcs12->guest_cr4 & vmcs12->cr4_guest_host_mask) | /*3*/ (vmcs_readl(CR4_READ_SHADOW) & ~(vmcs12->cr4_guest_host_mask | vcpu->arch.cr4_guest_owned_bits)); } static void vmcs12_save_pending_event(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12, u32 vm_exit_reason, u32 exit_intr_info) { u32 idt_vectoring; unsigned int nr; /* * Per the SDM, VM-Exits due to double and triple faults are never * considered to occur during event delivery, even if the double/triple * fault is the result of an escalating vectoring issue. * * Note, the SDM qualifies the double fault behavior with "The original * event results in a double-fault exception". It's unclear why the * qualification exists since exits due to double fault can occur only * while vectoring a different exception (injected events are never * subject to interception), i.e. there's _always_ an original event. * * The SDM also uses NMI as a confusing example for the "original event * causes the VM exit directly" clause. NMI isn't special in any way, * the same rule applies to all events that cause an exit directly. * NMI is an odd choice for the example because NMIs can only occur on * instruction boundaries, i.e. they _can't_ occur during vectoring. */ if ((u16)vm_exit_reason == EXIT_REASON_TRIPLE_FAULT || ((u16)vm_exit_reason == EXIT_REASON_EXCEPTION_NMI && is_double_fault(exit_intr_info))) { vmcs12->idt_vectoring_info_field = 0; } else if (vcpu->arch.exception.injected) { nr = vcpu->arch.exception.vector; idt_vectoring = nr | VECTORING_INFO_VALID_MASK; if (kvm_exception_is_soft(nr)) { vmcs12->vm_exit_instruction_len = vcpu->arch.event_exit_inst_len; idt_vectoring |= INTR_TYPE_SOFT_EXCEPTION; } else idt_vectoring |= INTR_TYPE_HARD_EXCEPTION; if (vcpu->arch.exception.has_error_code) { idt_vectoring |= VECTORING_INFO_DELIVER_CODE_MASK; vmcs12->idt_vectoring_error_code = vcpu->arch.exception.error_code; } vmcs12->idt_vectoring_info_field = idt_vectoring; } else if (vcpu->arch.nmi_injected) { vmcs12->idt_vectoring_info_field = INTR_TYPE_NMI_INTR | INTR_INFO_VALID_MASK | NMI_VECTOR; } else if (vcpu->arch.interrupt.injected) { nr = vcpu->arch.interrupt.nr; idt_vectoring = nr | VECTORING_INFO_VALID_MASK; if (vcpu->arch.interrupt.soft) { idt_vectoring |= INTR_TYPE_SOFT_INTR; vmcs12->vm_entry_instruction_len = vcpu->arch.event_exit_inst_len; } else idt_vectoring |= INTR_TYPE_EXT_INTR; vmcs12->idt_vectoring_info_field = idt_vectoring; } else { vmcs12->idt_vectoring_info_field = 0; } } void nested_mark_vmcs12_pages_dirty(struct kvm_vcpu *vcpu) { struct vmcs12 *vmcs12 = get_vmcs12(vcpu); gfn_t gfn; /* * Don't need to mark the APIC access page dirty; it is never * written to by the CPU during APIC virtualization. 
*/ if (nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW)) { gfn = vmcs12->virtual_apic_page_addr >> PAGE_SHIFT; kvm_vcpu_mark_page_dirty(vcpu, gfn); } if (nested_cpu_has_posted_intr(vmcs12)) { gfn = vmcs12->posted_intr_desc_addr >> PAGE_SHIFT; kvm_vcpu_mark_page_dirty(vcpu, gfn); } } static int vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); int max_irr; void *vapic_page; u16 status; if (!vmx->nested.pi_pending) return 0; if (!vmx->nested.pi_desc) goto mmio_needed; vmx->nested.pi_pending = false; if (!pi_test_and_clear_on(vmx->nested.pi_desc)) return 0; max_irr = pi_find_highest_vector(vmx->nested.pi_desc); if (max_irr > 0) { vapic_page = vmx->nested.virtual_apic_map.hva; if (!vapic_page) goto mmio_needed; __kvm_apic_update_irr(vmx->nested.pi_desc->pir, vapic_page, &max_irr); status = vmcs_read16(GUEST_INTR_STATUS); if ((u8)max_irr > ((u8)status & 0xff)) { status &= ~0xff; status |= (u8)max_irr; vmcs_write16(GUEST_INTR_STATUS, status); } } nested_mark_vmcs12_pages_dirty(vcpu); return 0; mmio_needed: kvm_handle_memory_failure(vcpu, X86EMUL_IO_NEEDED, NULL); return -ENXIO; } static void nested_vmx_inject_exception_vmexit(struct kvm_vcpu *vcpu) { struct kvm_queued_exception *ex = &vcpu->arch.exception_vmexit; u32 intr_info = ex->vector | INTR_INFO_VALID_MASK; struct vmcs12 *vmcs12 = get_vmcs12(vcpu); unsigned long exit_qual; if (ex->has_payload) { exit_qual = ex->payload; } else if (ex->vector == PF_VECTOR) { exit_qual = vcpu->arch.cr2; } else if (ex->vector == DB_VECTOR) { exit_qual = vcpu->arch.dr6; exit_qual &= ~DR6_BT; exit_qual ^= DR6_ACTIVE_LOW; } else { exit_qual = 0; } /* * Unlike AMD's Paged Real Mode, which reports an error code on #PF * VM-Exits even if the CPU is in Real Mode, Intel VMX never sets the * "has error code" flags on VM-Exit if the CPU is in Real Mode. */ if (ex->has_error_code && is_protmode(vcpu)) { /* * Intel CPUs do not generate error codes with bits 31:16 set, * and more importantly VMX disallows setting bits 31:16 in the * injected error code for VM-Entry. Drop the bits to mimic * hardware and avoid inducing failure on nested VM-Entry if L1 * chooses to inject the exception back to L2. AMD CPUs _do_ * generate "full" 32-bit error codes, so KVM allows userspace * to inject exception error codes with bits 31:16 set. */ vmcs12->vm_exit_intr_error_code = (u16)ex->error_code; intr_info |= INTR_INFO_DELIVER_CODE_MASK; } if (kvm_exception_is_soft(ex->vector)) intr_info |= INTR_TYPE_SOFT_EXCEPTION; else intr_info |= INTR_TYPE_HARD_EXCEPTION; if (!(vmcs12->idt_vectoring_info_field & VECTORING_INFO_VALID_MASK) && vmx_get_nmi_mask(vcpu)) intr_info |= INTR_INFO_UNBLOCK_NMI; nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI, intr_info, exit_qual); } /* * Returns true if a debug trap is (likely) pending delivery. Infer the class * of a #DB (trap-like vs. fault-like) from the exception payload (to-be-DR6). * Using the payload is flawed because code breakpoints (fault-like) and data * breakpoints (trap-like) set the same bits in DR6 (breakpoint detected), i.e. * this will return false positives if a to-be-injected code breakpoint #DB is * pending (from KVM's perspective, but not "pending" across an instruction * boundary). ICEBP, a.k.a. INT1, is also not reflected here even though it * too is trap-like. 
* * KVM "works" despite these flaws as ICEBP isn't currently supported by the * emulator, Monitor Trap Flag is not marked pending on intercepted #DBs (the * #DB has already happened), and MTF isn't marked pending on code breakpoints * from the emulator (because such #DBs are fault-like and thus don't trigger * actions that fire on instruction retire). */ static unsigned long vmx_get_pending_dbg_trap(struct kvm_queued_exception *ex) { if (!ex->pending || ex->vector != DB_VECTOR) return 0; /* General Detect #DBs are always fault-like. */ return ex->payload & ~DR6_BD; } /* * Returns true if there's a pending #DB exception that is lower priority than * a pending Monitor Trap Flag VM-Exit. TSS T-flag #DBs are not emulated by * KVM, but could theoretically be injected by userspace. Note, this code is * imperfect, see above. */ static bool vmx_is_low_priority_db_trap(struct kvm_queued_exception *ex) { return vmx_get_pending_dbg_trap(ex) & ~DR6_BT; } /* * Certain VM-exits set the 'pending debug exceptions' field to indicate a * recognized #DB (data or single-step) that has yet to be delivered. Since KVM * represents these debug traps with a payload that is said to be compatible * with the 'pending debug exceptions' field, write the payload to the VMCS * field if a VM-exit is delivered before the debug trap. */ static void nested_vmx_update_pending_dbg(struct kvm_vcpu *vcpu) { unsigned long pending_dbg; pending_dbg = vmx_get_pending_dbg_trap(&vcpu->arch.exception); if (pending_dbg) vmcs_writel(GUEST_PENDING_DBG_EXCEPTIONS, pending_dbg); } static bool nested_vmx_preemption_timer_pending(struct kvm_vcpu *vcpu) { return nested_cpu_has_preemption_timer(get_vmcs12(vcpu)) && to_vmx(vcpu)->nested.preemption_timer_expired; } static bool vmx_has_nested_events(struct kvm_vcpu *vcpu, bool for_injection) { struct vcpu_vmx *vmx = to_vmx(vcpu); void *vapic = vmx->nested.virtual_apic_map.hva; int max_irr, vppr; if (nested_vmx_preemption_timer_pending(vcpu) || vmx->nested.mtf_pending) return true; /* * Virtual Interrupt Delivery doesn't require manual injection. Either * the interrupt is already in GUEST_RVI and will be recognized by CPU * at VM-Entry, or there is a KVM_REQ_EVENT pending and KVM will move * the interrupt from the PIR to RVI prior to entering the guest. */ if (for_injection) return false; if (!nested_cpu_has_vid(get_vmcs12(vcpu)) || __vmx_interrupt_blocked(vcpu)) return false; if (!vapic) return false; vppr = *((u32 *)(vapic + APIC_PROCPRI)); max_irr = vmx_get_rvi(); if ((max_irr & 0xf0) > (vppr & 0xf0)) return true; if (vmx->nested.pi_pending && vmx->nested.pi_desc && pi_test_on(vmx->nested.pi_desc)) { max_irr = pi_find_highest_vector(vmx->nested.pi_desc); if (max_irr > 0 && (max_irr & 0xf0) > (vppr & 0xf0)) return true; } return false; } /* * Per the Intel SDM's table "Priority Among Concurrent Events", with minor * edits to fill in missing examples, e.g. #DB due to split-lock accesses, * and less minor edits to splice in the priority of VMX Non-Root specific * events, e.g. MTF and NMI/INTR-window exiting. 
* * 1 Hardware Reset and Machine Checks * - RESET * - Machine Check * * 2 Trap on Task Switch * - T flag in TSS is set (on task switch) * * 3 External Hardware Interventions * - FLUSH * - STOPCLK * - SMI * - INIT * * 3.5 Monitor Trap Flag (MTF) VM-exit[1] * * 4 Traps on Previous Instruction * - Breakpoints * - Trap-class Debug Exceptions (#DB due to TF flag set, data/I-O * breakpoint, or #DB due to a split-lock access) * * 4.3 VMX-preemption timer expired VM-exit * * 4.6 NMI-window exiting VM-exit[2] * * 5 Nonmaskable Interrupts (NMI) * * 5.5 Interrupt-window exiting VM-exit and Virtual-interrupt delivery * * 6 Maskable Hardware Interrupts * * 7 Code Breakpoint Fault * * 8 Faults from Fetching Next Instruction * - Code-Segment Limit Violation * - Code Page Fault * - Control protection exception (missing ENDBRANCH at target of indirect * call or jump) * * 9 Faults from Decoding Next Instruction * - Instruction length > 15 bytes * - Invalid Opcode * - Coprocessor Not Available * *10 Faults on Executing Instruction * - Overflow * - Bound error * - Invalid TSS * - Segment Not Present * - Stack fault * - General Protection * - Data Page Fault * - Alignment Check * - x86 FPU Floating-point exception * - SIMD floating-point exception * - Virtualization exception * - Control protection exception * * [1] Per the "Monitor Trap Flag" section: System-management interrupts (SMIs), * INIT signals, and higher priority events take priority over MTF VM exits. * MTF VM exits take priority over debug-trap exceptions and lower priority * events. * * [2] Debug-trap exceptions and higher priority events take priority over VM exits * caused by the VMX-preemption timer. VM exits caused by the VMX-preemption * timer take priority over VM exits caused by the "NMI-window exiting" * VM-execution control and lower priority events. * * [3] Debug-trap exceptions and higher priority events take priority over VM exits * caused by "NMI-window exiting". VM exits caused by this control take * priority over non-maskable interrupts (NMIs) and lower priority events. * * [4] Virtual-interrupt delivery has the same priority as that of VM exits due to * the 1-setting of the "interrupt-window exiting" VM-execution control. Thus, * non-maskable interrupts (NMIs) and higher priority events take priority over * delivery of a virtual interrupt; delivery of a virtual interrupt takes * priority over external interrupts and lower priority events. */ static int vmx_check_nested_events(struct kvm_vcpu *vcpu) { struct kvm_lapic *apic = vcpu->arch.apic; struct vcpu_vmx *vmx = to_vmx(vcpu); /* * Only a pending nested run blocks a pending exception. If there is a * previously injected event, the pending exception occurred while said * event was being delivered and thus needs to be handled. */ bool block_nested_exceptions = vmx->nested.nested_run_pending; /* * Events that don't require injection, i.e. that are virtualized by * hardware, aren't blocked by a pending VM-Enter as KVM doesn't need * to regain control in order to deliver the event, and hardware will * handle event ordering, e.g. with respect to injected exceptions. * * But, new events (not exceptions) are only recognized at instruction * boundaries. If an event needs reinjection, then KVM is handling a * VM-Exit that occurred _during_ instruction execution; new events, * irrespective of whether or not they're injected, are blocked until * the instruction completes. 
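 *
 * A concrete case, for illustration: if delivery of an IRQ to L2 was
 * interrupted by an EPT violation, that IRQ is re-injected when the
 * instruction is retried; any event that arrives in the meantime (e.g.
 * an NMI or another IRQ) has to wait for the instruction boundary,
 * which is what kvm_event_needs_reinjection() captures below.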
*/ bool block_non_injected_events = kvm_event_needs_reinjection(vcpu); /* * Inject events are blocked by nested VM-Enter, as KVM is responsible * for managing priority between concurrent events, i.e. KVM needs to * wait until after VM-Enter completes to deliver injected events. */ bool block_nested_events = block_nested_exceptions || block_non_injected_events; if (lapic_in_kernel(vcpu) && test_bit(KVM_APIC_INIT, &apic->pending_events)) { if (block_nested_events) return -EBUSY; nested_vmx_update_pending_dbg(vcpu); clear_bit(KVM_APIC_INIT, &apic->pending_events); if (vcpu->arch.mp_state != KVM_MP_STATE_INIT_RECEIVED) nested_vmx_vmexit(vcpu, EXIT_REASON_INIT_SIGNAL, 0, 0); /* MTF is discarded if the vCPU is in WFS. */ vmx->nested.mtf_pending = false; return 0; } if (lapic_in_kernel(vcpu) && test_bit(KVM_APIC_SIPI, &apic->pending_events)) { if (block_nested_events) return -EBUSY; clear_bit(KVM_APIC_SIPI, &apic->pending_events); if (vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED) { nested_vmx_vmexit(vcpu, EXIT_REASON_SIPI_SIGNAL, 0, apic->sipi_vector & 0xFFUL); return 0; } /* Fallthrough, the SIPI is completely ignored. */ } /* * Process exceptions that are higher priority than Monitor Trap Flag: * fault-like exceptions, TSS T flag #DB (not emulated by KVM, but * could theoretically come in from userspace), and ICEBP (INT1). * * TODO: SMIs have higher priority than MTF and trap-like #DBs (except * for TSS T flag #DBs). KVM also doesn't save/restore pending MTF * across SMI/RSM as it should; that needs to be addressed in order to * prioritize SMI over MTF and trap-like #DBs. */ if (vcpu->arch.exception_vmexit.pending && !vmx_is_low_priority_db_trap(&vcpu->arch.exception_vmexit)) { if (block_nested_exceptions) return -EBUSY; nested_vmx_inject_exception_vmexit(vcpu); return 0; } if (vcpu->arch.exception.pending && !vmx_is_low_priority_db_trap(&vcpu->arch.exception)) { if (block_nested_exceptions) return -EBUSY; goto no_vmexit; } if (vmx->nested.mtf_pending) { if (block_nested_events) return -EBUSY; nested_vmx_update_pending_dbg(vcpu); nested_vmx_vmexit(vcpu, EXIT_REASON_MONITOR_TRAP_FLAG, 0, 0); return 0; } if (vcpu->arch.exception_vmexit.pending) { if (block_nested_exceptions) return -EBUSY; nested_vmx_inject_exception_vmexit(vcpu); return 0; } if (vcpu->arch.exception.pending) { if (block_nested_exceptions) return -EBUSY; goto no_vmexit; } if (nested_vmx_preemption_timer_pending(vcpu)) { if (block_nested_events) return -EBUSY; nested_vmx_vmexit(vcpu, EXIT_REASON_PREEMPTION_TIMER, 0, 0); return 0; } if (vcpu->arch.smi_pending && !is_smm(vcpu)) { if (block_nested_events) return -EBUSY; goto no_vmexit; } if (vcpu->arch.nmi_pending && !vmx_nmi_blocked(vcpu)) { if (block_nested_events) return -EBUSY; if (!nested_exit_on_nmi(vcpu)) goto no_vmexit; nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI, NMI_VECTOR | INTR_TYPE_NMI_INTR | INTR_INFO_VALID_MASK, 0); /* * The NMI-triggered VM exit counts as injection: * clear this one and block further NMIs. 
*/ vcpu->arch.nmi_pending = 0; vmx_set_nmi_mask(vcpu, true); return 0; } if (kvm_cpu_has_interrupt(vcpu) && !vmx_interrupt_blocked(vcpu)) { int irq; if (!nested_exit_on_intr(vcpu)) { if (block_nested_events) return -EBUSY; goto no_vmexit; } if (!nested_exit_intr_ack_set(vcpu)) { if (block_nested_events) return -EBUSY; nested_vmx_vmexit(vcpu, EXIT_REASON_EXTERNAL_INTERRUPT, 0, 0); return 0; } irq = kvm_cpu_get_extint(vcpu); if (irq != -1) { if (block_nested_events) return -EBUSY; nested_vmx_vmexit(vcpu, EXIT_REASON_EXTERNAL_INTERRUPT, INTR_INFO_VALID_MASK | INTR_TYPE_EXT_INTR | irq, 0); return 0; } irq = kvm_apic_has_interrupt(vcpu); if (WARN_ON_ONCE(irq < 0)) goto no_vmexit; /* * If the IRQ is L2's PI notification vector, process posted * interrupts for L2 instead of injecting VM-Exit, as the * detection/morphing architecturally occurs when the IRQ is * delivered to the CPU. Note, only interrupts that are routed * through the local APIC trigger posted interrupt processing, * and enabling posted interrupts requires ACK-on-exit. */ if (irq == vmx->nested.posted_intr_nv) { /* * Nested posted interrupts are delivered via RVI, i.e. * aren't injected by KVM, and so can be queued even if * manual event injection is disallowed. */ if (block_non_injected_events) return -EBUSY; vmx->nested.pi_pending = true; kvm_apic_clear_irr(vcpu, irq); goto no_vmexit; } if (block_nested_events) return -EBUSY; nested_vmx_vmexit(vcpu, EXIT_REASON_EXTERNAL_INTERRUPT, INTR_INFO_VALID_MASK | INTR_TYPE_EXT_INTR | irq, 0); /* * ACK the interrupt _after_ emulating VM-Exit, as the IRQ must * be marked as in-service in vmcs01.GUEST_INTERRUPT_STATUS.SVI * if APICv is active. */ kvm_apic_ack_interrupt(vcpu, irq); return 0; } no_vmexit: return vmx_complete_nested_posted_interrupt(vcpu); } static u32 vmx_get_preemption_timer_value(struct kvm_vcpu *vcpu) { ktime_t remaining = hrtimer_get_remaining(&to_vmx(vcpu)->nested.preemption_timer); u64 value; if (ktime_to_ns(remaining) <= 0) return 0; value = ktime_to_ns(remaining) * vcpu->arch.virtual_tsc_khz; do_div(value, 1000000); return value >> VMX_MISC_EMULATED_PREEMPTION_TIMER_RATE; } static bool is_vmcs12_ext_field(unsigned long field) { switch (field) { case GUEST_ES_SELECTOR: case GUEST_CS_SELECTOR: case GUEST_SS_SELECTOR: case GUEST_DS_SELECTOR: case GUEST_FS_SELECTOR: case GUEST_GS_SELECTOR: case GUEST_LDTR_SELECTOR: case GUEST_TR_SELECTOR: case GUEST_ES_LIMIT: case GUEST_CS_LIMIT: case GUEST_SS_LIMIT: case GUEST_DS_LIMIT: case GUEST_FS_LIMIT: case GUEST_GS_LIMIT: case GUEST_LDTR_LIMIT: case GUEST_TR_LIMIT: case GUEST_GDTR_LIMIT: case GUEST_IDTR_LIMIT: case GUEST_ES_AR_BYTES: case GUEST_DS_AR_BYTES: case GUEST_FS_AR_BYTES: case GUEST_GS_AR_BYTES: case GUEST_LDTR_AR_BYTES: case GUEST_TR_AR_BYTES: case GUEST_ES_BASE: case GUEST_CS_BASE: case GUEST_SS_BASE: case GUEST_DS_BASE: case GUEST_FS_BASE: case GUEST_GS_BASE: case GUEST_LDTR_BASE: case GUEST_TR_BASE: case GUEST_GDTR_BASE: case GUEST_IDTR_BASE: case GUEST_PENDING_DBG_EXCEPTIONS: case GUEST_BNDCFGS: return true; default: break; } return false; } static void sync_vmcs02_to_vmcs12_rare(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { struct vcpu_vmx *vmx = to_vmx(vcpu); vmcs12->guest_es_selector = vmcs_read16(GUEST_ES_SELECTOR); vmcs12->guest_cs_selector = vmcs_read16(GUEST_CS_SELECTOR); vmcs12->guest_ss_selector = vmcs_read16(GUEST_SS_SELECTOR); vmcs12->guest_ds_selector = vmcs_read16(GUEST_DS_SELECTOR); vmcs12->guest_fs_selector = vmcs_read16(GUEST_FS_SELECTOR); vmcs12->guest_gs_selector = vmcs_read16(GUEST_GS_SELECTOR); 
vmcs12->guest_ldtr_selector = vmcs_read16(GUEST_LDTR_SELECTOR); vmcs12->guest_tr_selector = vmcs_read16(GUEST_TR_SELECTOR); vmcs12->guest_es_limit = vmcs_read32(GUEST_ES_LIMIT); vmcs12->guest_cs_limit = vmcs_read32(GUEST_CS_LIMIT); vmcs12->guest_ss_limit = vmcs_read32(GUEST_SS_LIMIT); vmcs12->guest_ds_limit = vmcs_read32(GUEST_DS_LIMIT); vmcs12->guest_fs_limit = vmcs_read32(GUEST_FS_LIMIT); vmcs12->guest_gs_limit = vmcs_read32(GUEST_GS_LIMIT); vmcs12->guest_ldtr_limit = vmcs_read32(GUEST_LDTR_LIMIT); vmcs12->guest_tr_limit = vmcs_read32(GUEST_TR_LIMIT); vmcs12->guest_gdtr_limit = vmcs_read32(GUEST_GDTR_LIMIT); vmcs12->guest_idtr_limit = vmcs_read32(GUEST_IDTR_LIMIT); vmcs12->guest_es_ar_bytes = vmcs_read32(GUEST_ES_AR_BYTES); vmcs12->guest_ds_ar_bytes = vmcs_read32(GUEST_DS_AR_BYTES); vmcs12->guest_fs_ar_bytes = vmcs_read32(GUEST_FS_AR_BYTES); vmcs12->guest_gs_ar_bytes = vmcs_read32(GUEST_GS_AR_BYTES); vmcs12->guest_ldtr_ar_bytes = vmcs_read32(GUEST_LDTR_AR_BYTES); vmcs12->guest_tr_ar_bytes = vmcs_read32(GUEST_TR_AR_BYTES); vmcs12->guest_es_base = vmcs_readl(GUEST_ES_BASE); vmcs12->guest_cs_base = vmcs_readl(GUEST_CS_BASE); vmcs12->guest_ss_base = vmcs_readl(GUEST_SS_BASE); vmcs12->guest_ds_base = vmcs_readl(GUEST_DS_BASE); vmcs12->guest_fs_base = vmcs_readl(GUEST_FS_BASE); vmcs12->guest_gs_base = vmcs_readl(GUEST_GS_BASE); vmcs12->guest_ldtr_base = vmcs_readl(GUEST_LDTR_BASE); vmcs12->guest_tr_base = vmcs_readl(GUEST_TR_BASE); vmcs12->guest_gdtr_base = vmcs_readl(GUEST_GDTR_BASE); vmcs12->guest_idtr_base = vmcs_readl(GUEST_IDTR_BASE); vmcs12->guest_pending_dbg_exceptions = vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS); vmx->nested.need_sync_vmcs02_to_vmcs12_rare = false; } static void copy_vmcs02_to_vmcs12_rare(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { struct vcpu_vmx *vmx = to_vmx(vcpu); int cpu; if (!vmx->nested.need_sync_vmcs02_to_vmcs12_rare) return; WARN_ON_ONCE(vmx->loaded_vmcs != &vmx->vmcs01); cpu = get_cpu(); vmx->loaded_vmcs = &vmx->nested.vmcs02; vmx_vcpu_load_vmcs(vcpu, cpu, &vmx->vmcs01); sync_vmcs02_to_vmcs12_rare(vcpu, vmcs12); vmx->loaded_vmcs = &vmx->vmcs01; vmx_vcpu_load_vmcs(vcpu, cpu, &vmx->nested.vmcs02); put_cpu(); } /* * Update the guest state fields of vmcs12 to reflect changes that * occurred while L2 was running. (The "IA-32e mode guest" bit of the * VM-entry controls is also updated, since this is really a guest * state bit.) 
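 *
 * Only the IA-32e bit is refreshed from hardware; the rest of L1's
 * VM-entry controls are preserved, i.e. the function below effectively
 * does:
 *
 *	vmcs12->vm_entry_controls =
 *		(vmcs12->vm_entry_controls & ~VM_ENTRY_IA32E_MODE) |
 *		(vm_entry_controls_get(to_vmx(vcpu)) & VM_ENTRY_IA32E_MODE);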
*/ static void sync_vmcs02_to_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { struct vcpu_vmx *vmx = to_vmx(vcpu); if (nested_vmx_is_evmptr12_valid(vmx)) sync_vmcs02_to_vmcs12_rare(vcpu, vmcs12); vmx->nested.need_sync_vmcs02_to_vmcs12_rare = !nested_vmx_is_evmptr12_valid(vmx); vmcs12->guest_cr0 = vmcs12_guest_cr0(vcpu, vmcs12); vmcs12->guest_cr4 = vmcs12_guest_cr4(vcpu, vmcs12); vmcs12->guest_rsp = kvm_rsp_read(vcpu); vmcs12->guest_rip = kvm_rip_read(vcpu); vmcs12->guest_rflags = vmcs_readl(GUEST_RFLAGS); vmcs12->guest_cs_ar_bytes = vmcs_read32(GUEST_CS_AR_BYTES); vmcs12->guest_ss_ar_bytes = vmcs_read32(GUEST_SS_AR_BYTES); vmcs12->guest_interruptibility_info = vmcs_read32(GUEST_INTERRUPTIBILITY_INFO); if (vcpu->arch.mp_state == KVM_MP_STATE_HALTED) vmcs12->guest_activity_state = GUEST_ACTIVITY_HLT; else if (vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED) vmcs12->guest_activity_state = GUEST_ACTIVITY_WAIT_SIPI; else vmcs12->guest_activity_state = GUEST_ACTIVITY_ACTIVE; if (nested_cpu_has_preemption_timer(vmcs12) && vmcs12->vm_exit_controls & VM_EXIT_SAVE_VMX_PREEMPTION_TIMER && !vmx->nested.nested_run_pending) vmcs12->vmx_preemption_timer_value = vmx_get_preemption_timer_value(vcpu); /* * In some cases (usually, nested EPT), L2 is allowed to change its * own CR3 without exiting. If it has changed it, we must keep it. * Of course, if L0 is using shadow page tables, GUEST_CR3 was defined * by L0, not L1 or L2, so we mustn't unconditionally copy it to vmcs12. * * Additionally, restore L2's PDPTR to vmcs12. */ if (enable_ept) { vmcs12->guest_cr3 = vmcs_readl(GUEST_CR3); if (nested_cpu_has_ept(vmcs12) && is_pae_paging(vcpu)) { vmcs12->guest_pdptr0 = vmcs_read64(GUEST_PDPTR0); vmcs12->guest_pdptr1 = vmcs_read64(GUEST_PDPTR1); vmcs12->guest_pdptr2 = vmcs_read64(GUEST_PDPTR2); vmcs12->guest_pdptr3 = vmcs_read64(GUEST_PDPTR3); } } vmcs12->guest_linear_address = vmcs_readl(GUEST_LINEAR_ADDRESS); if (nested_cpu_has_vid(vmcs12)) vmcs12->guest_intr_status = vmcs_read16(GUEST_INTR_STATUS); vmcs12->vm_entry_controls = (vmcs12->vm_entry_controls & ~VM_ENTRY_IA32E_MODE) | (vm_entry_controls_get(to_vmx(vcpu)) & VM_ENTRY_IA32E_MODE); if (vmcs12->vm_exit_controls & VM_EXIT_SAVE_DEBUG_CONTROLS) vmcs12->guest_dr7 = vcpu->arch.dr7; if (vmcs12->vm_exit_controls & VM_EXIT_SAVE_IA32_EFER) vmcs12->guest_ia32_efer = vcpu->arch.efer; } /* * prepare_vmcs12 is part of what we need to do when the nested L2 guest exits * and we want to prepare to run its L1 parent. L1 keeps a vmcs for L2 (vmcs12), * and this function updates it to reflect the changes to the guest state while * L2 was running (and perhaps made some exits which were handled directly by L0 * without going back to L1), and to reflect the exit reason. * Note that we do not have to copy here all VMCS fields, just those that * could have changed by the L2 guest or the exit - i.e., the guest-state and * exit-information fields only. Other fields are modified by L1 with VMWRITE, * which already writes to vmcs12 directly. 
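 *
 * Put differently, vmcs12 fields fall into three buckets here: guest-state
 * fields (captured by sync_vmcs02_to_vmcs12() before this runs), exit
 * information fields (filled in below), and control fields, which are
 * owned by L1's own VMWRITEs and are left untouched.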
*/ static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12, u32 vm_exit_reason, u32 exit_intr_info, unsigned long exit_qualification) { /* update exit information fields: */ vmcs12->vm_exit_reason = vm_exit_reason; if (to_vmx(vcpu)->exit_reason.enclave_mode) vmcs12->vm_exit_reason |= VMX_EXIT_REASONS_SGX_ENCLAVE_MODE; vmcs12->exit_qualification = exit_qualification; /* * On VM-Exit due to a failed VM-Entry, the VMCS isn't marked launched * and only EXIT_REASON and EXIT_QUALIFICATION are updated, all other * exit info fields are unmodified. */ if (!(vmcs12->vm_exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY)) { vmcs12->launch_state = 1; /* vm_entry_intr_info_field is cleared on exit. Emulate this * instead of reading the real value. */ vmcs12->vm_entry_intr_info_field &= ~INTR_INFO_VALID_MASK; /* * Transfer the event that L0 or L1 may wanted to inject into * L2 to IDT_VECTORING_INFO_FIELD. */ vmcs12_save_pending_event(vcpu, vmcs12, vm_exit_reason, exit_intr_info); vmcs12->vm_exit_intr_info = exit_intr_info; vmcs12->vm_exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN); vmcs12->vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO); /* * According to spec, there's no need to store the guest's * MSRs if the exit is due to a VM-entry failure that occurs * during or after loading the guest state. Since this exit * does not fall in that category, we need to save the MSRs. */ if (nested_vmx_store_msr(vcpu, vmcs12->vm_exit_msr_store_addr, vmcs12->vm_exit_msr_store_count)) nested_vmx_abort(vcpu, VMX_ABORT_SAVE_GUEST_MSR_FAIL); } } /* * A part of what we need to when the nested L2 guest exits and we want to * run its L1 parent, is to reset L1's guest state to the host state specified * in vmcs12. * This function is to be called not only on normal nested exit, but also on * a nested entry failure, as explained in Intel's spec, 3B.23.7 ("VM-Entry * Failures During or After Loading Guest State"). * This function should be called when the active VMCS is L1's (vmcs01). */ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { enum vm_entry_failure_code ignored; struct kvm_segment seg; if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_EFER) vcpu->arch.efer = vmcs12->host_ia32_efer; else if (vmcs12->vm_exit_controls & VM_EXIT_HOST_ADDR_SPACE_SIZE) vcpu->arch.efer |= (EFER_LMA | EFER_LME); else vcpu->arch.efer &= ~(EFER_LMA | EFER_LME); vmx_set_efer(vcpu, vcpu->arch.efer); kvm_rsp_write(vcpu, vmcs12->host_rsp); kvm_rip_write(vcpu, vmcs12->host_rip); vmx_set_rflags(vcpu, X86_EFLAGS_FIXED); vmx_set_interrupt_shadow(vcpu, 0); /* * Note that calling vmx_set_cr0 is important, even if cr0 hasn't * actually changed, because vmx_set_cr0 refers to efer set above. * * CR0_GUEST_HOST_MASK is already set in the original vmcs01 * (KVM doesn't change it); */ vcpu->arch.cr0_guest_owned_bits = vmx_l1_guest_owned_cr0_bits(); vmx_set_cr0(vcpu, vmcs12->host_cr0); /* Same as above - no reason to call set_cr4_guest_host_mask(). */ vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK); vmx_set_cr4(vcpu, vmcs12->host_cr4); nested_ept_uninit_mmu_context(vcpu); /* * Only PDPTE load can fail as the value of cr3 was checked on entry and * couldn't have changed. 
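 *
 * (host_cr3 itself was validated by the consistency checks at VM-Entry
 * and can't have changed since, so the only failure mode left is the
 * memory-dependent PAE PDPTE load, which is why a failure here escalates
 * to a VMX abort rather than an ordinary VM-Fail.)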
*/ if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false, true, &ignored)) nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL); nested_vmx_transition_tlb_flush(vcpu, vmcs12, false); vmcs_write32(GUEST_SYSENTER_CS, vmcs12->host_ia32_sysenter_cs); vmcs_writel(GUEST_SYSENTER_ESP, vmcs12->host_ia32_sysenter_esp); vmcs_writel(GUEST_SYSENTER_EIP, vmcs12->host_ia32_sysenter_eip); vmcs_writel(GUEST_IDTR_BASE, vmcs12->host_idtr_base); vmcs_writel(GUEST_GDTR_BASE, vmcs12->host_gdtr_base); vmcs_write32(GUEST_IDTR_LIMIT, 0xFFFF); vmcs_write32(GUEST_GDTR_LIMIT, 0xFFFF); /* If not VM_EXIT_CLEAR_BNDCFGS, the L2 value propagates to L1. */ if (vmcs12->vm_exit_controls & VM_EXIT_CLEAR_BNDCFGS) vmcs_write64(GUEST_BNDCFGS, 0); if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PAT) { vmcs_write64(GUEST_IA32_PAT, vmcs12->host_ia32_pat); vcpu->arch.pat = vmcs12->host_ia32_pat; } if ((vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL) && kvm_pmu_has_perf_global_ctrl(vcpu_to_pmu(vcpu))) WARN_ON_ONCE(kvm_set_msr(vcpu, MSR_CORE_PERF_GLOBAL_CTRL, vmcs12->host_ia32_perf_global_ctrl)); /* Set L1 segment info according to Intel SDM 27.5.2 Loading Host Segment and Descriptor-Table Registers */ seg = (struct kvm_segment) { .base = 0, .limit = 0xFFFFFFFF, .selector = vmcs12->host_cs_selector, .type = 11, .present = 1, .s = 1, .g = 1 }; if (vmcs12->vm_exit_controls & VM_EXIT_HOST_ADDR_SPACE_SIZE) seg.l = 1; else seg.db = 1; __vmx_set_segment(vcpu, &seg, VCPU_SREG_CS); seg = (struct kvm_segment) { .base = 0, .limit = 0xFFFFFFFF, .type = 3, .present = 1, .s = 1, .db = 1, .g = 1 }; seg.selector = vmcs12->host_ds_selector; __vmx_set_segment(vcpu, &seg, VCPU_SREG_DS); seg.selector = vmcs12->host_es_selector; __vmx_set_segment(vcpu, &seg, VCPU_SREG_ES); seg.selector = vmcs12->host_ss_selector; __vmx_set_segment(vcpu, &seg, VCPU_SREG_SS); seg.selector = vmcs12->host_fs_selector; seg.base = vmcs12->host_fs_base; __vmx_set_segment(vcpu, &seg, VCPU_SREG_FS); seg.selector = vmcs12->host_gs_selector; seg.base = vmcs12->host_gs_base; __vmx_set_segment(vcpu, &seg, VCPU_SREG_GS); seg = (struct kvm_segment) { .base = vmcs12->host_tr_base, .limit = 0x67, .selector = vmcs12->host_tr_selector, .type = 11, .present = 1 }; __vmx_set_segment(vcpu, &seg, VCPU_SREG_TR); memset(&seg, 0, sizeof(seg)); seg.unusable = 1; __vmx_set_segment(vcpu, &seg, VCPU_SREG_LDTR); kvm_set_dr(vcpu, 7, 0x400); vmcs_write64(GUEST_IA32_DEBUGCTL, 0); if (nested_vmx_load_msr(vcpu, vmcs12->vm_exit_msr_load_addr, vmcs12->vm_exit_msr_load_count)) nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_MSR_FAIL); to_vmx(vcpu)->emulation_required = vmx_emulation_required(vcpu); } static inline u64 nested_vmx_get_vmcs01_guest_efer(struct vcpu_vmx *vmx) { struct vmx_uret_msr *efer_msr; unsigned int i; if (vm_entry_controls_get(vmx) & VM_ENTRY_LOAD_IA32_EFER) return vmcs_read64(GUEST_IA32_EFER); if (cpu_has_load_ia32_efer()) return kvm_host.efer; for (i = 0; i < vmx->msr_autoload.guest.nr; ++i) { if (vmx->msr_autoload.guest.val[i].index == MSR_EFER) return vmx->msr_autoload.guest.val[i].value; } efer_msr = vmx_find_uret_msr(vmx, MSR_EFER); if (efer_msr) return efer_msr->data; return kvm_host.efer; } static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu) { struct vmcs12 *vmcs12 = get_vmcs12(vcpu); struct vcpu_vmx *vmx = to_vmx(vcpu); struct vmx_msr_entry g, h; gpa_t gpa; u32 i, j; vcpu->arch.pat = vmcs_read64(GUEST_IA32_PAT); if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS) { /* * L1's host DR7 is lost if KVM_GUESTDBG_USE_HW_BP is set * as vmcs01.GUEST_DR7 
contains a userspace defined value * and vcpu->arch.dr7 is not squirreled away before the * nested VMENTER (not worth adding a variable in nested_vmx). */ if (vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP) kvm_set_dr(vcpu, 7, DR7_FIXED_1); else WARN_ON(kvm_set_dr(vcpu, 7, vmcs_readl(GUEST_DR7))); } /* * Note that calling vmx_set_{efer,cr0,cr4} is important as they * handle a variety of side effects to KVM's software model. */ vmx_set_efer(vcpu, nested_vmx_get_vmcs01_guest_efer(vmx)); vcpu->arch.cr0_guest_owned_bits = vmx_l1_guest_owned_cr0_bits(); vmx_set_cr0(vcpu, vmcs_readl(CR0_READ_SHADOW)); vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK); vmx_set_cr4(vcpu, vmcs_readl(CR4_READ_SHADOW)); nested_ept_uninit_mmu_context(vcpu); vcpu->arch.cr3 = vmcs_readl(GUEST_CR3); kvm_register_mark_available(vcpu, VCPU_EXREG_CR3); /* * Use ept_save_pdptrs(vcpu) to load the MMU's cached PDPTRs * from vmcs01 (if necessary). The PDPTRs are not loaded on * VMFail, like everything else we just need to ensure our * software model is up-to-date. */ if (enable_ept && is_pae_paging(vcpu)) ept_save_pdptrs(vcpu); kvm_mmu_reset_context(vcpu); /* * This nasty bit of open coding is a compromise between blindly * loading L1's MSRs using the exit load lists (incorrect emulation * of VMFail), leaving the nested VM's MSRs in the software model * (incorrect behavior) and snapshotting the modified MSRs (too * expensive since the lists are unbound by hardware). For each * MSR that was (prematurely) loaded from the nested VMEntry load * list, reload it from the exit load list if it exists and differs * from the guest value. The intent is to stuff host state as * silently as possible, not to fully process the exit load list. */ for (i = 0; i < vmcs12->vm_entry_msr_load_count; i++) { gpa = vmcs12->vm_entry_msr_load_addr + (i * sizeof(g)); if (kvm_vcpu_read_guest(vcpu, gpa, &g, sizeof(g))) { pr_debug_ratelimited( "%s read MSR index failed (%u, 0x%08llx)\n", __func__, i, gpa); goto vmabort; } for (j = 0; j < vmcs12->vm_exit_msr_load_count; j++) { gpa = vmcs12->vm_exit_msr_load_addr + (j * sizeof(h)); if (kvm_vcpu_read_guest(vcpu, gpa, &h, sizeof(h))) { pr_debug_ratelimited( "%s read MSR failed (%u, 0x%08llx)\n", __func__, j, gpa); goto vmabort; } if (h.index != g.index) continue; if (h.value == g.value) break; if (nested_vmx_load_msr_check(vcpu, &h)) { pr_debug_ratelimited( "%s check failed (%u, 0x%x, 0x%x)\n", __func__, j, h.index, h.reserved); goto vmabort; } if (kvm_set_msr_with_filter(vcpu, h.index, h.value)) { pr_debug_ratelimited( "%s WRMSR failed (%u, 0x%x, 0x%llx)\n", __func__, j, h.index, h.value); goto vmabort; } } } return; vmabort: nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_MSR_FAIL); } /* * Emulate an exit from nested guest (L2) to L1, i.e., prepare to run L1 * and modify vmcs12 to make it see what it would expect to see there if * L2 was its real guest. Must only be called when in L2 (is_guest_mode()) */ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason, u32 exit_intr_info, unsigned long exit_qualification) { struct vcpu_vmx *vmx = to_vmx(vcpu); struct vmcs12 *vmcs12 = get_vmcs12(vcpu); /* Pending MTF traps are discarded on VM-Exit. 
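 *
 * (Presumably because MTF only has meaning while L2 is executing with the
 * MTF control armed: once an exit to L1 is delivered, the synthesized MTF
 * event is either the very exit being delivered or is no longer relevant,
 * so the pending flag is dropped rather than carried across to L1.)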
*/ vmx->nested.mtf_pending = false; /* trying to cancel vmlaunch/vmresume is a bug */ WARN_ON_ONCE(vmx->nested.nested_run_pending); #ifdef CONFIG_KVM_HYPERV if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) { /* * KVM_REQ_GET_NESTED_STATE_PAGES is also used to map * Enlightened VMCS after migration and we still need to * do that when something is forcing L2->L1 exit prior to * the first L2 run. */ (void)nested_get_evmcs_page(vcpu); } #endif /* Service pending TLB flush requests for L2 before switching to L1. */ kvm_service_local_tlb_flush_requests(vcpu); /* * VCPU_EXREG_PDPTR will be clobbered in arch/x86/kvm/vmx/vmx.h between * now and the new vmentry. Ensure that the VMCS02 PDPTR fields are * up-to-date before switching to L1. */ if (enable_ept && is_pae_paging(vcpu)) vmx_ept_load_pdptrs(vcpu); leave_guest_mode(vcpu); if (nested_cpu_has_preemption_timer(vmcs12)) hrtimer_cancel(&to_vmx(vcpu)->nested.preemption_timer); if (nested_cpu_has(vmcs12, CPU_BASED_USE_TSC_OFFSETTING)) { vcpu->arch.tsc_offset = vcpu->arch.l1_tsc_offset; if (nested_cpu_has2(vmcs12, SECONDARY_EXEC_TSC_SCALING)) vcpu->arch.tsc_scaling_ratio = vcpu->arch.l1_tsc_scaling_ratio; } if (likely(!vmx->fail)) { sync_vmcs02_to_vmcs12(vcpu, vmcs12); if (vm_exit_reason != -1) prepare_vmcs12(vcpu, vmcs12, vm_exit_reason, exit_intr_info, exit_qualification); /* * Must happen outside of sync_vmcs02_to_vmcs12() as it will * also be used to capture vmcs12 cache as part of * capturing nVMX state for snapshot (migration). * * Otherwise, this flush will dirty guest memory at a * point it is already assumed by user-space to be * immutable. */ nested_flush_cached_shadow_vmcs12(vcpu, vmcs12); } else { /* * The only expected VM-instruction error is "VM entry with * invalid control field(s)." Anything else indicates a * problem with L0. And we should never get here with a * VMFail of any type if early consistency checks are enabled. */ WARN_ON_ONCE(vmcs_read32(VM_INSTRUCTION_ERROR) != VMXERR_ENTRY_INVALID_CONTROL_FIELD); WARN_ON_ONCE(nested_early_check); } /* * Drop events/exceptions that were queued for re-injection to L2 * (picked up via vmx_complete_interrupts()), as well as exceptions * that were pending for L2. Note, this must NOT be hoisted above * prepare_vmcs12(), events/exceptions queued for re-injection need to * be captured in vmcs12 (see vmcs12_save_pending_event()). */ vcpu->arch.nmi_injected = false; kvm_clear_exception_queue(vcpu); kvm_clear_interrupt_queue(vcpu); vmx_switch_vmcs(vcpu, &vmx->vmcs01); /* * If IBRS is advertised to the vCPU, KVM must flush the indirect * branch predictors when transitioning from L2 to L1, as L1 expects * hardware (KVM in this case) to provide separate predictor modes. * Bare metal isolates VMX root (host) from VMX non-root (guest), but * doesn't isolate different VMCSs, i.e. in this case, doesn't provide * separate modes for L2 vs L1. 
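 *
 * indirect_branch_prediction_barrier() is expected to boil down to an
 * IBPB, i.e. roughly a feature-guarded
 *
 *	wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
 *
 * which flushes indirect branch predictions so that L2's branch history
 * can't steer L1's indirect branches.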
*/ if (guest_cpu_cap_has(vcpu, X86_FEATURE_SPEC_CTRL)) indirect_branch_prediction_barrier(); /* Update any VMCS fields that might have changed while L2 ran */ vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, vmx->msr_autoload.host.nr); vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, vmx->msr_autoload.guest.nr); vmcs_write64(TSC_OFFSET, vcpu->arch.tsc_offset); if (kvm_caps.has_tsc_control) vmcs_write64(TSC_MULTIPLIER, vcpu->arch.tsc_scaling_ratio); if (vmx->nested.l1_tpr_threshold != -1) vmcs_write32(TPR_THRESHOLD, vmx->nested.l1_tpr_threshold); if (vmx->nested.change_vmcs01_virtual_apic_mode) { vmx->nested.change_vmcs01_virtual_apic_mode = false; vmx_set_virtual_apic_mode(vcpu); } if (vmx->nested.update_vmcs01_cpu_dirty_logging) { vmx->nested.update_vmcs01_cpu_dirty_logging = false; vmx_update_cpu_dirty_logging(vcpu); } nested_put_vmcs12_pages(vcpu); if (vmx->nested.reload_vmcs01_apic_access_page) { vmx->nested.reload_vmcs01_apic_access_page = false; kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu); } if (vmx->nested.update_vmcs01_apicv_status) { vmx->nested.update_vmcs01_apicv_status = false; kvm_make_request(KVM_REQ_APICV_UPDATE, vcpu); } if (vmx->nested.update_vmcs01_hwapic_isr) { vmx->nested.update_vmcs01_hwapic_isr = false; kvm_apic_update_hwapic_isr(vcpu); } if ((vm_exit_reason != -1) && (enable_shadow_vmcs || nested_vmx_is_evmptr12_valid(vmx))) vmx->nested.need_vmcs12_to_shadow_sync = true; /* in case we halted in L2 */ vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE; if (likely(!vmx->fail)) { if (vm_exit_reason != -1) trace_kvm_nested_vmexit_inject(vmcs12->vm_exit_reason, vmcs12->exit_qualification, vmcs12->idt_vectoring_info_field, vmcs12->vm_exit_intr_info, vmcs12->vm_exit_intr_error_code, KVM_ISA_VMX); load_vmcs12_host_state(vcpu, vmcs12); return; } /* * After an early L2 VM-entry failure, we're now back * in L1 which thinks it just finished a VMLAUNCH or * VMRESUME instruction, so we need to set the failure * flag and the VM-instruction error field of the VMCS * accordingly, and skip the emulated instruction. */ (void)nested_vmx_fail(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD); /* * Restore L1's host state to KVM's software model. We're here * because a consistency check was caught by hardware, which * means some amount of guest state has been propagated to KVM's * model and needs to be unwound to the host's state. */ nested_vmx_restore_host_state(vcpu); vmx->fail = 0; } static void nested_vmx_triple_fault(struct kvm_vcpu *vcpu) { kvm_clear_request(KVM_REQ_TRIPLE_FAULT, vcpu); nested_vmx_vmexit(vcpu, EXIT_REASON_TRIPLE_FAULT, 0, 0); } /* * Decode the memory-address operand of a vmx instruction, as recorded on an * exit caused by such an instruction (run by a guest hypervisor). * On success, returns 0. When the operand is invalid, returns 1 and throws * #UD, #GP, or #SS. */ int get_vmx_mem_address(struct kvm_vcpu *vcpu, unsigned long exit_qualification, u32 vmx_instruction_info, bool wr, int len, gva_t *ret) { gva_t off; bool exn; struct kvm_segment s; /* * According to Vol. 3B, "Information for VM Exits Due to Instruction * Execution", on an exit, vmx_instruction_info holds most of the * addressing components of the operand. Only the displacement part * is put in exit_qualification (see 3B, "Basic VM-Exit Information"). * For how an actual address is calculated from all these components, * refer to Vol. 1, "Operand Addressing". 
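 *
 * As decoded below, the relevant VM-exit instruction-information bits are
 * (a condensed view; see the SDM for the authoritative layout):
 *
 *	1:0	scaling			10	register (1) vs. memory (0) operand
 *	9:7	address size		17:15	segment register
 *	21:18	index register		22	index register invalid
 *	26:23	base register		27	base register invalid
 *	31:28	register 2 (used by the VMREAD/VMWRITE callers)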
*/ int scaling = vmx_instruction_info & 3; int addr_size = (vmx_instruction_info >> 7) & 7; bool is_reg = vmx_instruction_info & (1u << 10); int seg_reg = (vmx_instruction_info >> 15) & 7; int index_reg = (vmx_instruction_info >> 18) & 0xf; bool index_is_valid = !(vmx_instruction_info & (1u << 22)); int base_reg = (vmx_instruction_info >> 23) & 0xf; bool base_is_valid = !(vmx_instruction_info & (1u << 27)); if (is_reg) { kvm_queue_exception(vcpu, UD_VECTOR); return 1; } /* Addr = segment_base + offset */ /* offset = base + [index * scale] + displacement */ off = exit_qualification; /* holds the displacement */ if (addr_size == 1) off = (gva_t)sign_extend64(off, 31); else if (addr_size == 0) off = (gva_t)sign_extend64(off, 15); if (base_is_valid) off += kvm_register_read(vcpu, base_reg); if (index_is_valid) off += kvm_register_read(vcpu, index_reg) << scaling; vmx_get_segment(vcpu, &s, seg_reg); /* * The effective address, i.e. @off, of a memory operand is truncated * based on the address size of the instruction. Note that this is * the *effective address*, i.e. the address prior to accounting for * the segment's base. */ if (addr_size == 1) /* 32 bit */ off &= 0xffffffff; else if (addr_size == 0) /* 16 bit */ off &= 0xffff; /* Checks for #GP/#SS exceptions. */ exn = false; if (is_long_mode(vcpu)) { /* * The virtual/linear address is never truncated in 64-bit * mode, e.g. a 32-bit address size can yield a 64-bit virtual * address when using FS/GS with a non-zero base. */ if (seg_reg == VCPU_SREG_FS || seg_reg == VCPU_SREG_GS) *ret = s.base + off; else *ret = off; *ret = vmx_get_untagged_addr(vcpu, *ret, 0); /* Long mode: #GP(0)/#SS(0) if the memory address is in a * non-canonical form. This is the only check on the memory * destination for long mode! */ exn = is_noncanonical_address(*ret, vcpu, 0); } else { /* * When not in long mode, the virtual/linear address is * unconditionally truncated to 32 bits regardless of the * address size. */ *ret = (s.base + off) & 0xffffffff; /* Protected mode: apply checks for segment validity in the * following order: * - segment type check (#GP(0) may be thrown) * - usability check (#GP(0)/#SS(0)) * - limit check (#GP(0)/#SS(0)) */ if (wr) /* #GP(0) if the destination operand is located in a * read-only data segment or any code segment. */ exn = ((s.type & 0xa) == 0 || (s.type & 8)); else /* #GP(0) if the source operand is located in an * execute-only code segment */ exn = ((s.type & 0xa) == 8); if (exn) { kvm_queue_exception_e(vcpu, GP_VECTOR, 0); return 1; } /* Protected mode: #GP(0)/#SS(0) if the segment is unusable. */ exn = (s.unusable != 0); /* * Protected mode: #GP(0)/#SS(0) if the memory operand is * outside the segment limit. All CPUs that support VMX ignore * limit checks for flat segments, i.e. segments with base==0, * limit==0xffffffff and of type expand-up data or code. */ if (!(s.base == 0 && s.limit == 0xffffffff && ((s.type & 8) || !(s.type & 4)))) exn = exn || ((u64)off + len - 1 > s.limit); } if (exn) { kvm_queue_exception_e(vcpu, seg_reg == VCPU_SREG_SS ? 
SS_VECTOR : GP_VECTOR, 0); return 1; } return 0; } static int nested_vmx_get_vmptr(struct kvm_vcpu *vcpu, gpa_t *vmpointer, int *ret) { gva_t gva; struct x86_exception e; int r; if (get_vmx_mem_address(vcpu, vmx_get_exit_qual(vcpu), vmcs_read32(VMX_INSTRUCTION_INFO), false, sizeof(*vmpointer), &gva)) { *ret = 1; return -EINVAL; } r = kvm_read_guest_virt(vcpu, gva, vmpointer, sizeof(*vmpointer), &e); if (r != X86EMUL_CONTINUE) { *ret = kvm_handle_memory_failure(vcpu, r, &e); return -EINVAL; } return 0; } /* * Allocate a shadow VMCS and associate it with the currently loaded * VMCS, unless such a shadow VMCS already exists. The newly allocated * VMCS is also VMCLEARed, so that it is ready for use. */ static struct vmcs *alloc_shadow_vmcs(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); struct loaded_vmcs *loaded_vmcs = vmx->loaded_vmcs; /* * KVM allocates a shadow VMCS only when L1 executes VMXON and frees it * when L1 executes VMXOFF or the vCPU is forced out of nested * operation. VMXON faults if the CPU is already post-VMXON, so it * should be impossible to already have an allocated shadow VMCS. KVM * doesn't support virtualization of VMCS shadowing, so vmcs01 should * always be the loaded VMCS. */ if (WARN_ON(loaded_vmcs != &vmx->vmcs01 || loaded_vmcs->shadow_vmcs)) return loaded_vmcs->shadow_vmcs; loaded_vmcs->shadow_vmcs = alloc_vmcs(true); if (loaded_vmcs->shadow_vmcs) vmcs_clear(loaded_vmcs->shadow_vmcs); return loaded_vmcs->shadow_vmcs; } static int enter_vmx_operation(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); int r; r = alloc_loaded_vmcs(&vmx->nested.vmcs02); if (r < 0) goto out_vmcs02; vmx->nested.cached_vmcs12 = kzalloc(VMCS12_SIZE, GFP_KERNEL_ACCOUNT); if (!vmx->nested.cached_vmcs12) goto out_cached_vmcs12; vmx->nested.shadow_vmcs12_cache.gpa = INVALID_GPA; vmx->nested.cached_shadow_vmcs12 = kzalloc(VMCS12_SIZE, GFP_KERNEL_ACCOUNT); if (!vmx->nested.cached_shadow_vmcs12) goto out_cached_shadow_vmcs12; if (enable_shadow_vmcs && !alloc_shadow_vmcs(vcpu)) goto out_shadow_vmcs; hrtimer_init(&vmx->nested.preemption_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED); vmx->nested.preemption_timer.function = vmx_preemption_timer_fn; vmx->nested.vpid02 = allocate_vpid(); vmx->nested.vmcs02_initialized = false; vmx->nested.vmxon = true; if (vmx_pt_mode_is_host_guest()) { vmx->pt_desc.guest.ctl = 0; pt_update_intercept_for_msr(vcpu); } return 0; out_shadow_vmcs: kfree(vmx->nested.cached_shadow_vmcs12); out_cached_shadow_vmcs12: kfree(vmx->nested.cached_vmcs12); out_cached_vmcs12: free_loaded_vmcs(&vmx->nested.vmcs02); out_vmcs02: return -ENOMEM; } /* Emulate the VMXON instruction. */ static int handle_vmxon(struct kvm_vcpu *vcpu) { int ret; gpa_t vmptr; uint32_t revision; struct vcpu_vmx *vmx = to_vmx(vcpu); const u64 VMXON_NEEDED_FEATURES = FEAT_CTL_LOCKED | FEAT_CTL_VMX_ENABLED_OUTSIDE_SMX; /* * Manually check CR4.VMXE checks, KVM must force CR4.VMXE=1 to enter * the guest and so cannot rely on hardware to perform the check, * which has higher priority than VM-Exit (see Intel SDM's pseudocode * for VMXON). * * Rely on hardware for the other pre-VM-Exit checks, CR0.PE=1, !VM86 * and !COMPATIBILITY modes. For an unrestricted guest, KVM doesn't * force any of the relevant guest state. For a restricted guest, KVM * does force CR0.PE=1, but only to also force VM86 in order to emulate * Real Mode, and so there's no need to check CR0.PE manually. 
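 *
 * The resulting emulation order below is: #UD if CR4.VMXE=0, #GP(0) if
 * CPL > 0, VM-Fail if already post-VMXON, #GP(0) for invalid CR0/CR4 or
 * an unconfigured IA32_FEATURE_CONTROL, and only then the operand and
 * revision-ID checks on the VMXON region.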
*/ if (!kvm_is_cr4_bit_set(vcpu, X86_CR4_VMXE)) { kvm_queue_exception(vcpu, UD_VECTOR); return 1; } /* * The CPL is checked for "not in VMX operation" and for "in VMX root", * and has higher priority than the VM-Fail due to being post-VMXON, * i.e. VMXON #GPs outside of VMX non-root if CPL!=0. In VMX non-root, * VMXON causes VM-Exit and KVM unconditionally forwards VMXON VM-Exits * from L2 to L1, i.e. there's no need to check for the vCPU being in * VMX non-root. * * Forwarding the VM-Exit unconditionally, i.e. without performing the * #UD checks (see above), is functionally ok because KVM doesn't allow * L1 to run L2 without CR4.VMXE=0, and because KVM never modifies L2's * CR0 or CR4, i.e. it's L2's responsibility to emulate #UDs that are * missed by hardware due to shadowing CR0 and/or CR4. */ if (vmx_get_cpl(vcpu)) { kvm_inject_gp(vcpu, 0); return 1; } if (vmx->nested.vmxon) return nested_vmx_fail(vcpu, VMXERR_VMXON_IN_VMX_ROOT_OPERATION); /* * Invalid CR0/CR4 generates #GP. These checks are performed if and * only if the vCPU isn't already in VMX operation, i.e. effectively * have lower priority than the VM-Fail above. */ if (!nested_host_cr0_valid(vcpu, kvm_read_cr0(vcpu)) || !nested_host_cr4_valid(vcpu, kvm_read_cr4(vcpu))) { kvm_inject_gp(vcpu, 0); return 1; } if ((vmx->msr_ia32_feature_control & VMXON_NEEDED_FEATURES) != VMXON_NEEDED_FEATURES) { kvm_inject_gp(vcpu, 0); return 1; } if (nested_vmx_get_vmptr(vcpu, &vmptr, &ret)) return ret; /* * SDM 3: 24.11.5 * The first 4 bytes of VMXON region contain the supported * VMCS revision identifier * * Note - IA32_VMX_BASIC[48] will never be 1 for the nested case; * which replaces physical address width with 32 */ if (!page_address_valid(vcpu, vmptr)) return nested_vmx_failInvalid(vcpu); if (kvm_read_guest(vcpu->kvm, vmptr, &revision, sizeof(revision)) || revision != VMCS12_REVISION) return nested_vmx_failInvalid(vcpu); vmx->nested.vmxon_ptr = vmptr; ret = enter_vmx_operation(vcpu); if (ret) return ret; return nested_vmx_succeed(vcpu); } static inline void nested_release_vmcs12(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); if (vmx->nested.current_vmptr == INVALID_GPA) return; copy_vmcs02_to_vmcs12_rare(vcpu, get_vmcs12(vcpu)); if (enable_shadow_vmcs) { /* copy to memory all shadowed fields in case they were modified */ copy_shadow_to_vmcs12(vmx); vmx_disable_shadow_vmcs(vmx); } vmx->nested.posted_intr_nv = -1; /* Flush VMCS12 to guest memory */ kvm_vcpu_write_guest_page(vcpu, vmx->nested.current_vmptr >> PAGE_SHIFT, vmx->nested.cached_vmcs12, 0, VMCS12_SIZE); kvm_mmu_free_roots(vcpu->kvm, &vcpu->arch.guest_mmu, KVM_MMU_ROOTS_ALL); vmx->nested.current_vmptr = INVALID_GPA; } /* Emulate the VMXOFF instruction */ static int handle_vmxoff(struct kvm_vcpu *vcpu) { if (!nested_vmx_check_permission(vcpu)) return 1; free_nested(vcpu); if (kvm_apic_has_pending_init_or_sipi(vcpu)) kvm_make_request(KVM_REQ_EVENT, vcpu); return nested_vmx_succeed(vcpu); } /* Emulate the VMCLEAR instruction */ static int handle_vmclear(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); u32 zero = 0; gpa_t vmptr; int r; if (!nested_vmx_check_permission(vcpu)) return 1; if (nested_vmx_get_vmptr(vcpu, &vmptr, &r)) return r; if (!page_address_valid(vcpu, vmptr)) return nested_vmx_fail(vcpu, VMXERR_VMCLEAR_INVALID_ADDRESS); if (vmptr == vmx->nested.vmxon_ptr) return nested_vmx_fail(vcpu, VMXERR_VMCLEAR_VMXON_POINTER); if (likely(!nested_evmcs_handle_vmclear(vcpu, vmptr))) { if (vmptr == vmx->nested.current_vmptr) nested_release_vmcs12(vcpu); /* * 
Silently ignore memory errors on VMCLEAR, Intel's pseudocode * for VMCLEAR includes a "ensure that data for VMCS referenced * by the operand is in memory" clause that guards writes to * memory, i.e. doing nothing for I/O is architecturally valid. * * FIXME: Suppress failures if and only if no memslot is found, * i.e. exit to userspace if __copy_to_user() fails. */ (void)kvm_vcpu_write_guest(vcpu, vmptr + offsetof(struct vmcs12, launch_state), &zero, sizeof(zero)); } return nested_vmx_succeed(vcpu); } /* Emulate the VMLAUNCH instruction */ static int handle_vmlaunch(struct kvm_vcpu *vcpu) { return nested_vmx_run(vcpu, true); } /* Emulate the VMRESUME instruction */ static int handle_vmresume(struct kvm_vcpu *vcpu) { return nested_vmx_run(vcpu, false); } static int handle_vmread(struct kvm_vcpu *vcpu) { struct vmcs12 *vmcs12 = is_guest_mode(vcpu) ? get_shadow_vmcs12(vcpu) : get_vmcs12(vcpu); unsigned long exit_qualification = vmx_get_exit_qual(vcpu); u32 instr_info = vmcs_read32(VMX_INSTRUCTION_INFO); struct vcpu_vmx *vmx = to_vmx(vcpu); struct x86_exception e; unsigned long field; u64 value; gva_t gva = 0; short offset; int len, r; if (!nested_vmx_check_permission(vcpu)) return 1; /* Decode instruction info and find the field to read */ field = kvm_register_read(vcpu, (((instr_info) >> 28) & 0xf)); if (!nested_vmx_is_evmptr12_valid(vmx)) { /* * In VMX non-root operation, when the VMCS-link pointer is INVALID_GPA, * any VMREAD sets the ALU flags for VMfailInvalid. */ if (vmx->nested.current_vmptr == INVALID_GPA || (is_guest_mode(vcpu) && get_vmcs12(vcpu)->vmcs_link_pointer == INVALID_GPA)) return nested_vmx_failInvalid(vcpu); offset = get_vmcs12_field_offset(field); if (offset < 0) return nested_vmx_fail(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT); if (!is_guest_mode(vcpu) && is_vmcs12_ext_field(field)) copy_vmcs02_to_vmcs12_rare(vcpu, vmcs12); /* Read the field, zero-extended to a u64 value */ value = vmcs12_read_any(vmcs12, field, offset); } else { /* * Hyper-V TLFS (as of 6.0b) explicitly states, that while an * enlightened VMCS is active VMREAD/VMWRITE instructions are * unsupported. Unfortunately, certain versions of Windows 11 * don't comply with this requirement which is not enforced in * genuine Hyper-V. Allow VMREAD from an enlightened VMCS as a * workaround, as misbehaving guests will panic on VM-Fail. * Note, enlightened VMCS is incompatible with shadow VMCS so * all VMREADs from L2 should go to L1. */ if (WARN_ON_ONCE(is_guest_mode(vcpu))) return nested_vmx_failInvalid(vcpu); offset = evmcs_field_offset(field, NULL); if (offset < 0) return nested_vmx_fail(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT); /* Read the field, zero-extended to a u64 value */ value = evmcs_read_any(nested_vmx_evmcs(vmx), field, offset); } /* * Now copy part of this value to register or memory, as requested. * Note that the number of bits actually copied is 32 or 64 depending * on the guest's mode (32 or 64 bit), not on the given field's length. */ if (instr_info & BIT(10)) { kvm_register_write(vcpu, (((instr_info) >> 3) & 0xf), value); } else { len = is_64_bit_mode(vcpu) ? 
8 : 4; if (get_vmx_mem_address(vcpu, exit_qualification, instr_info, true, len, &gva)) return 1; /* _system ok, nested_vmx_check_permission has verified cpl=0 */ r = kvm_write_guest_virt_system(vcpu, gva, &value, len, &e); if (r != X86EMUL_CONTINUE) return kvm_handle_memory_failure(vcpu, r, &e); } return nested_vmx_succeed(vcpu); } static bool is_shadow_field_rw(unsigned long field) { switch (field) { #define SHADOW_FIELD_RW(x, y) case x: #include "vmcs_shadow_fields.h" return true; default: break; } return false; } static bool is_shadow_field_ro(unsigned long field) { switch (field) { #define SHADOW_FIELD_RO(x, y) case x: #include "vmcs_shadow_fields.h" return true; default: break; } return false; } static int handle_vmwrite(struct kvm_vcpu *vcpu) { struct vmcs12 *vmcs12 = is_guest_mode(vcpu) ? get_shadow_vmcs12(vcpu) : get_vmcs12(vcpu); unsigned long exit_qualification = vmx_get_exit_qual(vcpu); u32 instr_info = vmcs_read32(VMX_INSTRUCTION_INFO); struct vcpu_vmx *vmx = to_vmx(vcpu); struct x86_exception e; unsigned long field; short offset; gva_t gva; int len, r; /* * The value to write might be 32 or 64 bits, depending on L1's long * mode, and eventually we need to write that into a field of several * possible lengths. The code below first zero-extends the value to 64 * bit (value), and then copies only the appropriate number of * bits into the vmcs12 field. */ u64 value = 0; if (!nested_vmx_check_permission(vcpu)) return 1; /* * In VMX non-root operation, when the VMCS-link pointer is INVALID_GPA, * any VMWRITE sets the ALU flags for VMfailInvalid. */ if (vmx->nested.current_vmptr == INVALID_GPA || (is_guest_mode(vcpu) && get_vmcs12(vcpu)->vmcs_link_pointer == INVALID_GPA)) return nested_vmx_failInvalid(vcpu); if (instr_info & BIT(10)) value = kvm_register_read(vcpu, (((instr_info) >> 3) & 0xf)); else { len = is_64_bit_mode(vcpu) ? 8 : 4; if (get_vmx_mem_address(vcpu, exit_qualification, instr_info, false, len, &gva)) return 1; r = kvm_read_guest_virt(vcpu, gva, &value, len, &e); if (r != X86EMUL_CONTINUE) return kvm_handle_memory_failure(vcpu, r, &e); } field = kvm_register_read(vcpu, (((instr_info) >> 28) & 0xf)); offset = get_vmcs12_field_offset(field); if (offset < 0) return nested_vmx_fail(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT); /* * If the vCPU supports "VMWRITE to any supported field in the * VMCS," then the "read-only" fields are actually read/write. */ if (vmcs_field_readonly(field) && !nested_cpu_has_vmwrite_any_field(vcpu)) return nested_vmx_fail(vcpu, VMXERR_VMWRITE_READ_ONLY_VMCS_COMPONENT); /* * Ensure vmcs12 is up-to-date before any VMWRITE that dirties * vmcs12, else we may crush a field or consume a stale value. */ if (!is_guest_mode(vcpu) && !is_shadow_field_rw(field)) copy_vmcs02_to_vmcs12_rare(vcpu, vmcs12); /* * Some Intel CPUs intentionally drop the reserved bits of the AR byte * fields on VMWRITE. Emulate this behavior to ensure consistent KVM * behavior regardless of the underlying hardware, e.g. if an AR_BYTE * field is intercepted for VMWRITE but not VMREAD (in L1), then VMREAD * from L1 will return a different value than VMREAD from L2 (L1 sees * the stripped down value, L2 sees the full value as stored by KVM). */ if (field >= GUEST_ES_AR_BYTES && field <= GUEST_TR_AR_BYTES) value &= 0x1f0ff; vmcs12_write_any(vmcs12, field, offset, value); /* * Do not track vmcs12 dirty-state if in guest-mode as we actually * dirty shadow vmcs12 instead of vmcs12. Fields that can be updated * by L1 without a vmexit are always updated in the vmcs02, i.e. 
don't * "dirty" vmcs12, all others go down the prepare_vmcs02() slow path. */ if (!is_guest_mode(vcpu) && !is_shadow_field_rw(field)) { /* * L1 can read these fields without exiting, ensure the * shadow VMCS is up-to-date. */ if (enable_shadow_vmcs && is_shadow_field_ro(field)) { preempt_disable(); vmcs_load(vmx->vmcs01.shadow_vmcs); __vmcs_writel(field, value); vmcs_clear(vmx->vmcs01.shadow_vmcs); vmcs_load(vmx->loaded_vmcs->vmcs); preempt_enable(); } vmx->nested.dirty_vmcs12 = true; } return nested_vmx_succeed(vcpu); } static void set_current_vmptr(struct vcpu_vmx *vmx, gpa_t vmptr) { vmx->nested.current_vmptr = vmptr; if (enable_shadow_vmcs) { secondary_exec_controls_setbit(vmx, SECONDARY_EXEC_SHADOW_VMCS); vmcs_write64(VMCS_LINK_POINTER, __pa(vmx->vmcs01.shadow_vmcs)); vmx->nested.need_vmcs12_to_shadow_sync = true; } vmx->nested.dirty_vmcs12 = true; vmx->nested.force_msr_bitmap_recalc = true; } /* Emulate the VMPTRLD instruction */ static int handle_vmptrld(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); gpa_t vmptr; int r; if (!nested_vmx_check_permission(vcpu)) return 1; if (nested_vmx_get_vmptr(vcpu, &vmptr, &r)) return r; if (!page_address_valid(vcpu, vmptr)) return nested_vmx_fail(vcpu, VMXERR_VMPTRLD_INVALID_ADDRESS); if (vmptr == vmx->nested.vmxon_ptr) return nested_vmx_fail(vcpu, VMXERR_VMPTRLD_VMXON_POINTER); /* Forbid normal VMPTRLD if Enlightened version was used */ if (nested_vmx_is_evmptr12_valid(vmx)) return 1; if (vmx->nested.current_vmptr != vmptr) { struct gfn_to_hva_cache *ghc = &vmx->nested.vmcs12_cache; struct vmcs_hdr hdr; if (kvm_gfn_to_hva_cache_init(vcpu->kvm, ghc, vmptr, VMCS12_SIZE)) { /* * Reads from an unbacked page return all 1s, * which means that the 32 bits located at the * given physical address won't match the required * VMCS12_REVISION identifier. */ return nested_vmx_fail(vcpu, VMXERR_VMPTRLD_INCORRECT_VMCS_REVISION_ID); } if (kvm_read_guest_offset_cached(vcpu->kvm, ghc, &hdr, offsetof(struct vmcs12, hdr), sizeof(hdr))) { return nested_vmx_fail(vcpu, VMXERR_VMPTRLD_INCORRECT_VMCS_REVISION_ID); } if (hdr.revision_id != VMCS12_REVISION || (hdr.shadow_vmcs && !nested_cpu_has_vmx_shadow_vmcs(vcpu))) { return nested_vmx_fail(vcpu, VMXERR_VMPTRLD_INCORRECT_VMCS_REVISION_ID); } nested_release_vmcs12(vcpu); /* * Load VMCS12 from guest memory since it is not already * cached. 
*/ if (kvm_read_guest_cached(vcpu->kvm, ghc, vmx->nested.cached_vmcs12, VMCS12_SIZE)) { return nested_vmx_fail(vcpu, VMXERR_VMPTRLD_INCORRECT_VMCS_REVISION_ID); } set_current_vmptr(vmx, vmptr); } return nested_vmx_succeed(vcpu); } /* Emulate the VMPTRST instruction */ static int handle_vmptrst(struct kvm_vcpu *vcpu) { unsigned long exit_qual = vmx_get_exit_qual(vcpu); u32 instr_info = vmcs_read32(VMX_INSTRUCTION_INFO); gpa_t current_vmptr = to_vmx(vcpu)->nested.current_vmptr; struct x86_exception e; gva_t gva; int r; if (!nested_vmx_check_permission(vcpu)) return 1; if (unlikely(nested_vmx_is_evmptr12_valid(to_vmx(vcpu)))) return 1; if (get_vmx_mem_address(vcpu, exit_qual, instr_info, true, sizeof(gpa_t), &gva)) return 1; /* *_system ok, nested_vmx_check_permission has verified cpl=0 */ r = kvm_write_guest_virt_system(vcpu, gva, (void *)&current_vmptr, sizeof(gpa_t), &e); if (r != X86EMUL_CONTINUE) return kvm_handle_memory_failure(vcpu, r, &e); return nested_vmx_succeed(vcpu); } /* Emulate the INVEPT instruction */ static int handle_invept(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); u32 vmx_instruction_info, types; unsigned long type, roots_to_free; struct kvm_mmu *mmu; gva_t gva; struct x86_exception e; struct { u64 eptp, gpa; } operand; int i, r, gpr_index; if (!(vmx->nested.msrs.secondary_ctls_high & SECONDARY_EXEC_ENABLE_EPT) || !(vmx->nested.msrs.ept_caps & VMX_EPT_INVEPT_BIT)) { kvm_queue_exception(vcpu, UD_VECTOR); return 1; } if (!nested_vmx_check_permission(vcpu)) return 1; vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO); gpr_index = vmx_get_instr_info_reg2(vmx_instruction_info); type = kvm_register_read(vcpu, gpr_index); types = (vmx->nested.msrs.ept_caps >> VMX_EPT_EXTENT_SHIFT) & 6; if (type >= 32 || !(types & (1 << type))) return nested_vmx_fail(vcpu, VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID); /* According to the Intel VMX instruction reference, the memory * operand is read even if it isn't needed (e.g., for type==global) */ if (get_vmx_mem_address(vcpu, vmx_get_exit_qual(vcpu), vmx_instruction_info, false, sizeof(operand), &gva)) return 1; r = kvm_read_guest_virt(vcpu, gva, &operand, sizeof(operand), &e); if (r != X86EMUL_CONTINUE) return kvm_handle_memory_failure(vcpu, r, &e); /* * Nested EPT roots are always held through guest_mmu, * not root_mmu.
*/ mmu = &vcpu->arch.guest_mmu; switch (type) { case VMX_EPT_EXTENT_CONTEXT: if (!nested_vmx_check_eptp(vcpu, operand.eptp)) return nested_vmx_fail(vcpu, VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID); roots_to_free = 0; if (nested_ept_root_matches(mmu->root.hpa, mmu->root.pgd, operand.eptp)) roots_to_free |= KVM_MMU_ROOT_CURRENT; for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) { if (nested_ept_root_matches(mmu->prev_roots[i].hpa, mmu->prev_roots[i].pgd, operand.eptp)) roots_to_free |= KVM_MMU_ROOT_PREVIOUS(i); } break; case VMX_EPT_EXTENT_GLOBAL: roots_to_free = KVM_MMU_ROOTS_ALL; break; default: BUG(); break; } if (roots_to_free) kvm_mmu_free_roots(vcpu->kvm, mmu, roots_to_free); return nested_vmx_succeed(vcpu); } static int handle_invvpid(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); u32 vmx_instruction_info; unsigned long type, types; gva_t gva; struct x86_exception e; struct { u64 vpid; u64 gla; } operand; u16 vpid02; int r, gpr_index; if (!(vmx->nested.msrs.secondary_ctls_high & SECONDARY_EXEC_ENABLE_VPID) || !(vmx->nested.msrs.vpid_caps & VMX_VPID_INVVPID_BIT)) { kvm_queue_exception(vcpu, UD_VECTOR); return 1; } if (!nested_vmx_check_permission(vcpu)) return 1; vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO); gpr_index = vmx_get_instr_info_reg2(vmx_instruction_info); type = kvm_register_read(vcpu, gpr_index); types = (vmx->nested.msrs.vpid_caps & VMX_VPID_EXTENT_SUPPORTED_MASK) >> 8; if (type >= 32 || !(types & (1 << type))) return nested_vmx_fail(vcpu, VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID); /* according to the intel vmx instruction reference, the memory * operand is read even if it isn't needed (e.g., for type==global) */ if (get_vmx_mem_address(vcpu, vmx_get_exit_qual(vcpu), vmx_instruction_info, false, sizeof(operand), &gva)) return 1; r = kvm_read_guest_virt(vcpu, gva, &operand, sizeof(operand), &e); if (r != X86EMUL_CONTINUE) return kvm_handle_memory_failure(vcpu, r, &e); if (operand.vpid >> 16) return nested_vmx_fail(vcpu, VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID); /* * Always flush the effective vpid02, i.e. never flush the current VPID * and never explicitly flush vpid01. INVVPID targets a VPID, not a * VMCS, and so whether or not the current vmcs12 has VPID enabled is * irrelevant (and there may not be a loaded vmcs12). */ vpid02 = nested_get_vpid02(vcpu); switch (type) { case VMX_VPID_EXTENT_INDIVIDUAL_ADDR: /* * LAM doesn't apply to addresses that are inputs to TLB * invalidation. */ if (!operand.vpid || is_noncanonical_invlpg_address(operand.gla, vcpu)) return nested_vmx_fail(vcpu, VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID); vpid_sync_vcpu_addr(vpid02, operand.gla); break; case VMX_VPID_EXTENT_SINGLE_CONTEXT: case VMX_VPID_EXTENT_SINGLE_NON_GLOBAL: if (!operand.vpid) return nested_vmx_fail(vcpu, VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID); vpid_sync_context(vpid02); break; case VMX_VPID_EXTENT_ALL_CONTEXT: vpid_sync_context(vpid02); break; default: WARN_ON_ONCE(1); return kvm_skip_emulated_instruction(vcpu); } /* * Sync the shadow page tables if EPT is disabled, L1 is invalidating * linear mappings for L2 (tagged with L2's VPID). Free all guest * roots as VPIDs are not tracked in the MMU role. * * Note, this operates on root_mmu, not guest_mmu, as L1 and L2 share * an MMU when EPT is disabled. * * TODO: sync only the affected SPTEs for INVDIVIDUAL_ADDR. 
*/ if (!enable_ept) kvm_mmu_free_guest_mode_roots(vcpu->kvm, &vcpu->arch.root_mmu); return nested_vmx_succeed(vcpu); } static int nested_vmx_eptp_switching(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { u32 index = kvm_rcx_read(vcpu); u64 new_eptp; if (WARN_ON_ONCE(!nested_cpu_has_ept(vmcs12))) return 1; if (index >= VMFUNC_EPTP_ENTRIES) return 1; if (kvm_vcpu_read_guest_page(vcpu, vmcs12->eptp_list_address >> PAGE_SHIFT, &new_eptp, index * 8, 8)) return 1; /* * If the (L2) guest does a vmfunc to the currently * active ept pointer, we don't have to do anything else */ if (vmcs12->ept_pointer != new_eptp) { if (!nested_vmx_check_eptp(vcpu, new_eptp)) return 1; vmcs12->ept_pointer = new_eptp; nested_ept_new_eptp(vcpu); if (!nested_cpu_has_vpid(vmcs12)) kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu); } return 0; } static int handle_vmfunc(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); struct vmcs12 *vmcs12; u32 function = kvm_rax_read(vcpu); /* * VMFUNC should never execute cleanly while L1 is active; KVM supports * VMFUNC for nested VMs, but not for L1. */ if (WARN_ON_ONCE(!is_guest_mode(vcpu))) { kvm_queue_exception(vcpu, UD_VECTOR); return 1; } vmcs12 = get_vmcs12(vcpu); /* * #UD on out-of-bounds function has priority over VM-Exit, and VMFUNC * is enabled in vmcs02 if and only if it's enabled in vmcs12. */ if (WARN_ON_ONCE((function > 63) || !nested_cpu_has_vmfunc(vmcs12))) { kvm_queue_exception(vcpu, UD_VECTOR); return 1; } if (!(vmcs12->vm_function_control & BIT_ULL(function))) goto fail; switch (function) { case 0: if (nested_vmx_eptp_switching(vcpu, vmcs12)) goto fail; break; default: goto fail; } return kvm_skip_emulated_instruction(vcpu); fail: /* * This is effectively a reflected VM-Exit, as opposed to a synthesized * nested VM-Exit. Pass the original exit reason, i.e. don't hardcode * EXIT_REASON_VMFUNC as the exit reason. */ nested_vmx_vmexit(vcpu, vmx->exit_reason.full, vmx_get_intr_info(vcpu), vmx_get_exit_qual(vcpu)); return 1; } /* * Return true if an IO instruction with the specified port and size should cause * a VM-exit into L1. */ bool nested_vmx_check_io_bitmaps(struct kvm_vcpu *vcpu, unsigned int port, int size) { struct vmcs12 *vmcs12 = get_vmcs12(vcpu); gpa_t bitmap, last_bitmap; u8 b; last_bitmap = INVALID_GPA; b = -1; while (size > 0) { if (port < 0x8000) bitmap = vmcs12->io_bitmap_a; else if (port < 0x10000) bitmap = vmcs12->io_bitmap_b; else return true; bitmap += (port & 0x7fff) / 8; if (last_bitmap != bitmap) if (kvm_vcpu_read_guest(vcpu, bitmap, &b, 1)) return true; if (b & (1 << (port & 7))) return true; port++; size--; last_bitmap = bitmap; } return false; } static bool nested_vmx_exit_handled_io(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { unsigned long exit_qualification; unsigned short port; int size; if (!nested_cpu_has(vmcs12, CPU_BASED_USE_IO_BITMAPS)) return nested_cpu_has(vmcs12, CPU_BASED_UNCOND_IO_EXITING); exit_qualification = vmx_get_exit_qual(vcpu); port = exit_qualification >> 16; size = (exit_qualification & 7) + 1; return nested_vmx_check_io_bitmaps(vcpu, port, size); } /* * Return 1 if we should exit from L2 to L1 to handle an MSR access, * rather than handle it ourselves in L0. I.e., check whether L1 expressed * disinterest in the current event (read or write a specific MSR) by using an * MSR bitmap. This may be the case even when L0 doesn't use MSR bitmaps. 
*/ static bool nested_vmx_exit_handled_msr(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12, union vmx_exit_reason exit_reason) { u32 msr_index = kvm_rcx_read(vcpu); gpa_t bitmap; if (!nested_cpu_has(vmcs12, CPU_BASED_USE_MSR_BITMAPS)) return true; /* * The MSR_BITMAP page is divided into four 1024-byte bitmaps, * for the four combinations of read/write and low/high MSR numbers. * First we need to figure out which of the four to use: */ bitmap = vmcs12->msr_bitmap; if (exit_reason.basic == EXIT_REASON_MSR_WRITE) bitmap += 2048; if (msr_index >= 0xc0000000) { msr_index -= 0xc0000000; bitmap += 1024; } /* Then read the msr_index'th bit from this bitmap: */ if (msr_index < 1024*8) { unsigned char b; if (kvm_vcpu_read_guest(vcpu, bitmap + msr_index/8, &b, 1)) return true; return 1 & (b >> (msr_index & 7)); } else return true; /* let L1 handle the wrong parameter */ } /* * Return 1 if we should exit from L2 to L1 to handle a CR access exit, * rather than handle it ourselves in L0. I.e., check if L1 wanted to * intercept (via guest_host_mask etc.) the current event. */ static bool nested_vmx_exit_handled_cr(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { unsigned long exit_qualification = vmx_get_exit_qual(vcpu); int cr = exit_qualification & 15; int reg; unsigned long val; switch ((exit_qualification >> 4) & 3) { case 0: /* mov to cr */ reg = (exit_qualification >> 8) & 15; val = kvm_register_read(vcpu, reg); switch (cr) { case 0: if (vmcs12->cr0_guest_host_mask & (val ^ vmcs12->cr0_read_shadow)) return true; break; case 3: if (nested_cpu_has(vmcs12, CPU_BASED_CR3_LOAD_EXITING)) return true; break; case 4: if (vmcs12->cr4_guest_host_mask & (vmcs12->cr4_read_shadow ^ val)) return true; break; case 8: if (nested_cpu_has(vmcs12, CPU_BASED_CR8_LOAD_EXITING)) return true; break; } break; case 2: /* clts */ if ((vmcs12->cr0_guest_host_mask & X86_CR0_TS) && (vmcs12->cr0_read_shadow & X86_CR0_TS)) return true; break; case 1: /* mov from cr */ switch (cr) { case 3: if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_CR3_STORE_EXITING) return true; break; case 8: if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_CR8_STORE_EXITING) return true; break; } break; case 3: /* lmsw */ /* * lmsw can change bits 1..3 of cr0, and only set bit 0 of * cr0. Other attempted changes are ignored, with no exit. 
*/ val = (exit_qualification >> LMSW_SOURCE_DATA_SHIFT) & 0x0f; if (vmcs12->cr0_guest_host_mask & 0xe & (val ^ vmcs12->cr0_read_shadow)) return true; if ((vmcs12->cr0_guest_host_mask & 0x1) && !(vmcs12->cr0_read_shadow & 0x1) && (val & 0x1)) return true; break; } return false; } static bool nested_vmx_exit_handled_encls(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12) { u32 encls_leaf; if (!guest_cpu_cap_has(vcpu, X86_FEATURE_SGX) || !nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENCLS_EXITING)) return false; encls_leaf = kvm_rax_read(vcpu); if (encls_leaf > 62) encls_leaf = 63; return vmcs12->encls_exiting_bitmap & BIT_ULL(encls_leaf); } static bool nested_vmx_exit_handled_vmcs_access(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12, gpa_t bitmap) { u32 vmx_instruction_info; unsigned long field; u8 b; if (!nested_cpu_has_shadow_vmcs(vmcs12)) return true; /* Decode instruction info and find the field to access */ vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO); field = kvm_register_read(vcpu, (((vmx_instruction_info) >> 28) & 0xf)); /* Out-of-range fields always cause a VM exit from L2 to L1 */ if (field >> 15) return true; if (kvm_vcpu_read_guest(vcpu, bitmap + field/8, &b, 1)) return true; return 1 & (b >> (field & 7)); } static bool nested_vmx_exit_handled_mtf(struct vmcs12 *vmcs12) { u32 entry_intr_info = vmcs12->vm_entry_intr_info_field; if (nested_cpu_has_mtf(vmcs12)) return true; /* * An MTF VM-exit may be injected into the guest by setting the * interruption-type to 7 (other event) and the vector field to 0. Such * is the case regardless of the 'monitor trap flag' VM-execution * control. */ return entry_intr_info == (INTR_INFO_VALID_MASK | INTR_TYPE_OTHER_EVENT); } /* * Return true if L0 wants to handle an exit from L2 regardless of whether or not * L1 wants the exit. Only call this when in is_guest_mode (L2). */ static bool nested_vmx_l0_wants_exit(struct kvm_vcpu *vcpu, union vmx_exit_reason exit_reason) { u32 intr_info; switch ((u16)exit_reason.basic) { case EXIT_REASON_EXCEPTION_NMI: intr_info = vmx_get_intr_info(vcpu); if (is_nmi(intr_info)) return true; else if (is_page_fault(intr_info)) return vcpu->arch.apf.host_apf_flags || vmx_need_pf_intercept(vcpu); else if (is_debug(intr_info) && vcpu->guest_debug & (KVM_GUESTDBG_SINGLESTEP | KVM_GUESTDBG_USE_HW_BP)) return true; else if (is_breakpoint(intr_info) && vcpu->guest_debug & KVM_GUESTDBG_USE_SW_BP) return true; else if (is_alignment_check(intr_info) && !vmx_guest_inject_ac(vcpu)) return true; else if (is_ve_fault(intr_info)) return true; return false; case EXIT_REASON_EXTERNAL_INTERRUPT: return true; case EXIT_REASON_MCE_DURING_VMENTRY: return true; case EXIT_REASON_EPT_VIOLATION: /* * L0 always deals with the EPT violation. If nested EPT is * used, and the nested mmu code discovers that the address is * missing in the guest EPT table (EPT12), the EPT violation * will be injected with nested_ept_inject_page_fault() */ return true; case EXIT_REASON_EPT_MISCONFIG: /* * L2 never uses directly L1's EPT, but rather L0's own EPT * table (shadow on EPT) or a merged EPT table that L0 built * (EPT on EPT). So any problems with the structure of the * table is L0's fault. */ return true; case EXIT_REASON_PREEMPTION_TIMER: return true; case EXIT_REASON_PML_FULL: /* * PML is emulated for an L1 VMM and should never be enabled in * vmcs02, always "handle" PML_FULL by exiting to userspace. */ return true; case EXIT_REASON_VMFUNC: /* VM functions are emulated through L2->L0 vmexits. 
*/ return true; case EXIT_REASON_BUS_LOCK: /* * At present, bus lock VM exit is never exposed to L1. * Handle L2's bus locks in L0 directly. */ return true; #ifdef CONFIG_KVM_HYPERV case EXIT_REASON_VMCALL: /* Hyper-V L2 TLB flush hypercall is handled by L0 */ return guest_hv_cpuid_has_l2_tlb_flush(vcpu) && nested_evmcs_l2_tlb_flush_enabled(vcpu) && kvm_hv_is_tlb_flush_hcall(vcpu); #endif default: break; } return false; } /* * Return 1 if L1 wants to intercept an exit from L2. Only call this when in * is_guest_mode (L2). */ static bool nested_vmx_l1_wants_exit(struct kvm_vcpu *vcpu, union vmx_exit_reason exit_reason) { struct vmcs12 *vmcs12 = get_vmcs12(vcpu); u32 intr_info; switch ((u16)exit_reason.basic) { case EXIT_REASON_EXCEPTION_NMI: intr_info = vmx_get_intr_info(vcpu); if (is_nmi(intr_info)) return true; else if (is_page_fault(intr_info)) return true; return vmcs12->exception_bitmap & (1u << (intr_info & INTR_INFO_VECTOR_MASK)); case EXIT_REASON_EXTERNAL_INTERRUPT: return nested_exit_on_intr(vcpu); case EXIT_REASON_TRIPLE_FAULT: return true; case EXIT_REASON_INTERRUPT_WINDOW: return nested_cpu_has(vmcs12, CPU_BASED_INTR_WINDOW_EXITING); case EXIT_REASON_NMI_WINDOW: return nested_cpu_has(vmcs12, CPU_BASED_NMI_WINDOW_EXITING); case EXIT_REASON_TASK_SWITCH: return true; case EXIT_REASON_CPUID: return true; case EXIT_REASON_HLT: return nested_cpu_has(vmcs12, CPU_BASED_HLT_EXITING); case EXIT_REASON_INVD: return true; case EXIT_REASON_INVLPG: return nested_cpu_has(vmcs12, CPU_BASED_INVLPG_EXITING); case EXIT_REASON_RDPMC: return nested_cpu_has(vmcs12, CPU_BASED_RDPMC_EXITING); case EXIT_REASON_RDRAND: return nested_cpu_has2(vmcs12, SECONDARY_EXEC_RDRAND_EXITING); case EXIT_REASON_RDSEED: return nested_cpu_has2(vmcs12, SECONDARY_EXEC_RDSEED_EXITING); case EXIT_REASON_RDTSC: case EXIT_REASON_RDTSCP: return nested_cpu_has(vmcs12, CPU_BASED_RDTSC_EXITING); case EXIT_REASON_VMREAD: return nested_vmx_exit_handled_vmcs_access(vcpu, vmcs12, vmcs12->vmread_bitmap); case EXIT_REASON_VMWRITE: return nested_vmx_exit_handled_vmcs_access(vcpu, vmcs12, vmcs12->vmwrite_bitmap); case EXIT_REASON_VMCALL: case EXIT_REASON_VMCLEAR: case EXIT_REASON_VMLAUNCH: case EXIT_REASON_VMPTRLD: case EXIT_REASON_VMPTRST: case EXIT_REASON_VMRESUME: case EXIT_REASON_VMOFF: case EXIT_REASON_VMON: case EXIT_REASON_INVEPT: case EXIT_REASON_INVVPID: /* * VMX instructions trap unconditionally. This allows L1 to * emulate them for its L2 guest, i.e., allows 3-level nesting! 
*/ return true; case EXIT_REASON_CR_ACCESS: return nested_vmx_exit_handled_cr(vcpu, vmcs12); case EXIT_REASON_DR_ACCESS: return nested_cpu_has(vmcs12, CPU_BASED_MOV_DR_EXITING); case EXIT_REASON_IO_INSTRUCTION: return nested_vmx_exit_handled_io(vcpu, vmcs12); case EXIT_REASON_GDTR_IDTR: case EXIT_REASON_LDTR_TR: return nested_cpu_has2(vmcs12, SECONDARY_EXEC_DESC); case EXIT_REASON_MSR_READ: case EXIT_REASON_MSR_WRITE: return nested_vmx_exit_handled_msr(vcpu, vmcs12, exit_reason); case EXIT_REASON_INVALID_STATE: return true; case EXIT_REASON_MWAIT_INSTRUCTION: return nested_cpu_has(vmcs12, CPU_BASED_MWAIT_EXITING); case EXIT_REASON_MONITOR_TRAP_FLAG: return nested_vmx_exit_handled_mtf(vmcs12); case EXIT_REASON_MONITOR_INSTRUCTION: return nested_cpu_has(vmcs12, CPU_BASED_MONITOR_EXITING); case EXIT_REASON_PAUSE_INSTRUCTION: return nested_cpu_has(vmcs12, CPU_BASED_PAUSE_EXITING) || nested_cpu_has2(vmcs12, SECONDARY_EXEC_PAUSE_LOOP_EXITING); case EXIT_REASON_MCE_DURING_VMENTRY: return true; case EXIT_REASON_TPR_BELOW_THRESHOLD: return nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW); case EXIT_REASON_APIC_ACCESS: case EXIT_REASON_APIC_WRITE: case EXIT_REASON_EOI_INDUCED: /* * The controls for "virtualize APIC accesses," "APIC- * register virtualization," and "virtual-interrupt * delivery" only come from vmcs12. */ return true; case EXIT_REASON_INVPCID: return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_INVPCID) && nested_cpu_has(vmcs12, CPU_BASED_INVLPG_EXITING); case EXIT_REASON_WBINVD: return nested_cpu_has2(vmcs12, SECONDARY_EXEC_WBINVD_EXITING); case EXIT_REASON_XSETBV: return true; case EXIT_REASON_XSAVES: case EXIT_REASON_XRSTORS: /* * This should never happen, since it is not possible to * set XSS to a non-zero value---neither in L1 nor in L2. * If if it were, XSS would have to be checked against * the XSS exit bitmap in vmcs12. */ return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_XSAVES); case EXIT_REASON_UMWAIT: case EXIT_REASON_TPAUSE: return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE); case EXIT_REASON_ENCLS: return nested_vmx_exit_handled_encls(vcpu, vmcs12); case EXIT_REASON_NOTIFY: /* Notify VM exit is not exposed to L1 */ return false; default: return true; } } /* * Conditionally reflect a VM-Exit into L1. Returns %true if the VM-Exit was * reflected into L1. */ bool nested_vmx_reflect_vmexit(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); union vmx_exit_reason exit_reason = vmx->exit_reason; unsigned long exit_qual; u32 exit_intr_info; WARN_ON_ONCE(vmx->nested.nested_run_pending); /* * Late nested VM-Fail shares the same flow as nested VM-Exit since KVM * has already loaded L2's state. */ if (unlikely(vmx->fail)) { trace_kvm_nested_vmenter_failed( "hardware VM-instruction error: ", vmcs_read32(VM_INSTRUCTION_ERROR)); exit_intr_info = 0; exit_qual = 0; goto reflect_vmexit; } trace_kvm_nested_vmexit(vcpu, KVM_ISA_VMX); /* If L0 (KVM) wants the exit, it trumps L1's desires. */ if (nested_vmx_l0_wants_exit(vcpu, exit_reason)) return false; /* If L1 doesn't want the exit, handle it in L0. */ if (!nested_vmx_l1_wants_exit(vcpu, exit_reason)) return false; /* * vmcs.VM_EXIT_INTR_INFO is only valid for EXCEPTION_NMI exits. For * EXTERNAL_INTERRUPT, the value for vmcs12->vm_exit_intr_info would * need to be synthesized by querying the in-kernel LAPIC, but external * interrupts are never reflected to L1 so it's a non-issue. 
*/ exit_intr_info = vmx_get_intr_info(vcpu); if (is_exception_with_error_code(exit_intr_info)) { struct vmcs12 *vmcs12 = get_vmcs12(vcpu); vmcs12->vm_exit_intr_error_code = vmcs_read32(VM_EXIT_INTR_ERROR_CODE); } exit_qual = vmx_get_exit_qual(vcpu); reflect_vmexit: nested_vmx_vmexit(vcpu, exit_reason.full, exit_intr_info, exit_qual); return true; } static int vmx_get_nested_state(struct kvm_vcpu *vcpu, struct kvm_nested_state __user *user_kvm_nested_state, u32 user_data_size) { struct vcpu_vmx *vmx; struct vmcs12 *vmcs12; struct kvm_nested_state kvm_state = { .flags = 0, .format = KVM_STATE_NESTED_FORMAT_VMX, .size = sizeof(kvm_state), .hdr.vmx.flags = 0, .hdr.vmx.vmxon_pa = INVALID_GPA, .hdr.vmx.vmcs12_pa = INVALID_GPA, .hdr.vmx.preemption_timer_deadline = 0, }; struct kvm_vmx_nested_state_data __user *user_vmx_nested_state = &user_kvm_nested_state->data.vmx[0]; if (!vcpu) return kvm_state.size + sizeof(*user_vmx_nested_state); vmx = to_vmx(vcpu); vmcs12 = get_vmcs12(vcpu); if (guest_cpu_cap_has(vcpu, X86_FEATURE_VMX) && (vmx->nested.vmxon || vmx->nested.smm.vmxon)) { kvm_state.hdr.vmx.vmxon_pa = vmx->nested.vmxon_ptr; kvm_state.hdr.vmx.vmcs12_pa = vmx->nested.current_vmptr; if (vmx_has_valid_vmcs12(vcpu)) { kvm_state.size += sizeof(user_vmx_nested_state->vmcs12); /* 'hv_evmcs_vmptr' can also be EVMPTR_MAP_PENDING here */ if (nested_vmx_is_evmptr12_set(vmx)) kvm_state.flags |= KVM_STATE_NESTED_EVMCS; if (is_guest_mode(vcpu) && nested_cpu_has_shadow_vmcs(vmcs12) && vmcs12->vmcs_link_pointer != INVALID_GPA) kvm_state.size += sizeof(user_vmx_nested_state->shadow_vmcs12); } if (vmx->nested.smm.vmxon) kvm_state.hdr.vmx.smm.flags |= KVM_STATE_NESTED_SMM_VMXON; if (vmx->nested.smm.guest_mode) kvm_state.hdr.vmx.smm.flags |= KVM_STATE_NESTED_SMM_GUEST_MODE; if (is_guest_mode(vcpu)) { kvm_state.flags |= KVM_STATE_NESTED_GUEST_MODE; if (vmx->nested.nested_run_pending) kvm_state.flags |= KVM_STATE_NESTED_RUN_PENDING; if (vmx->nested.mtf_pending) kvm_state.flags |= KVM_STATE_NESTED_MTF_PENDING; if (nested_cpu_has_preemption_timer(vmcs12) && vmx->nested.has_preemption_timer_deadline) { kvm_state.hdr.vmx.flags |= KVM_STATE_VMX_PREEMPTION_TIMER_DEADLINE; kvm_state.hdr.vmx.preemption_timer_deadline = vmx->nested.preemption_timer_deadline; } } } if (user_data_size < kvm_state.size) goto out; if (copy_to_user(user_kvm_nested_state, &kvm_state, sizeof(kvm_state))) return -EFAULT; if (!vmx_has_valid_vmcs12(vcpu)) goto out; /* * When running L2, the authoritative vmcs12 state is in the * vmcs02. When running L1, the authoritative vmcs12 state is * in the shadow or enlightened vmcs linked to vmcs01, unless * need_vmcs12_to_shadow_sync is set, in which case, the authoritative * vmcs12 state is in the vmcs12 already. */ if (is_guest_mode(vcpu)) { sync_vmcs02_to_vmcs12(vcpu, vmcs12); sync_vmcs02_to_vmcs12_rare(vcpu, vmcs12); } else { copy_vmcs02_to_vmcs12_rare(vcpu, get_vmcs12(vcpu)); if (!vmx->nested.need_vmcs12_to_shadow_sync) { if (nested_vmx_is_evmptr12_valid(vmx)) /* * L1 hypervisor is not obliged to keep eVMCS * clean fields data always up-to-date while * not in guest mode, 'hv_clean_fields' is only * supposed to be actual upon vmentry so we need * to ignore it here and do full copy. 
*/ copy_enlightened_to_vmcs12(vmx, 0); else if (enable_shadow_vmcs) copy_shadow_to_vmcs12(vmx); } } BUILD_BUG_ON(sizeof(user_vmx_nested_state->vmcs12) < VMCS12_SIZE); BUILD_BUG_ON(sizeof(user_vmx_nested_state->shadow_vmcs12) < VMCS12_SIZE); /* * Copy over the full allocated size of vmcs12 rather than just the size * of the struct. */ if (copy_to_user(user_vmx_nested_state->vmcs12, vmcs12, VMCS12_SIZE)) return -EFAULT; if (nested_cpu_has_shadow_vmcs(vmcs12) && vmcs12->vmcs_link_pointer != INVALID_GPA) { if (copy_to_user(user_vmx_nested_state->shadow_vmcs12, get_shadow_vmcs12(vcpu), VMCS12_SIZE)) return -EFAULT; } out: return kvm_state.size; } void vmx_leave_nested(struct kvm_vcpu *vcpu) { if (is_guest_mode(vcpu)) { to_vmx(vcpu)->nested.nested_run_pending = 0; nested_vmx_vmexit(vcpu, -1, 0, 0); } free_nested(vcpu); } static int vmx_set_nested_state(struct kvm_vcpu *vcpu, struct kvm_nested_state __user *user_kvm_nested_state, struct kvm_nested_state *kvm_state) { struct vcpu_vmx *vmx = to_vmx(vcpu); struct vmcs12 *vmcs12; enum vm_entry_failure_code ignored; struct kvm_vmx_nested_state_data __user *user_vmx_nested_state = &user_kvm_nested_state->data.vmx[0]; int ret; if (kvm_state->format != KVM_STATE_NESTED_FORMAT_VMX) return -EINVAL; if (kvm_state->hdr.vmx.vmxon_pa == INVALID_GPA) { if (kvm_state->hdr.vmx.smm.flags) return -EINVAL; if (kvm_state->hdr.vmx.vmcs12_pa != INVALID_GPA) return -EINVAL; /* * KVM_STATE_NESTED_EVMCS used to signal that KVM should * enable eVMCS capability on vCPU. However, since then * code was changed such that flag signals vmcs12 should * be copied into eVMCS in guest memory. * * To preserve backwards compatibility, allow user * to set this flag even when there is no VMXON region. */ if (kvm_state->flags & ~KVM_STATE_NESTED_EVMCS) return -EINVAL; } else { if (!guest_cpu_cap_has(vcpu, X86_FEATURE_VMX)) return -EINVAL; if (!page_address_valid(vcpu, kvm_state->hdr.vmx.vmxon_pa)) return -EINVAL; } if ((kvm_state->hdr.vmx.smm.flags & KVM_STATE_NESTED_SMM_GUEST_MODE) && (kvm_state->flags & KVM_STATE_NESTED_GUEST_MODE)) return -EINVAL; if (kvm_state->hdr.vmx.smm.flags & ~(KVM_STATE_NESTED_SMM_GUEST_MODE | KVM_STATE_NESTED_SMM_VMXON)) return -EINVAL; if (kvm_state->hdr.vmx.flags & ~KVM_STATE_VMX_PREEMPTION_TIMER_DEADLINE) return -EINVAL; /* * SMM temporarily disables VMX, so we cannot be in guest mode, * nor can VMLAUNCH/VMRESUME be pending. Outside SMM, SMM flags * must be zero. */ if (is_smm(vcpu) ? (kvm_state->flags & (KVM_STATE_NESTED_GUEST_MODE | KVM_STATE_NESTED_RUN_PENDING)) : kvm_state->hdr.vmx.smm.flags) return -EINVAL; if ((kvm_state->hdr.vmx.smm.flags & KVM_STATE_NESTED_SMM_GUEST_MODE) && !(kvm_state->hdr.vmx.smm.flags & KVM_STATE_NESTED_SMM_VMXON)) return -EINVAL; if ((kvm_state->flags & KVM_STATE_NESTED_EVMCS) && (!guest_cpu_cap_has(vcpu, X86_FEATURE_VMX) || !vmx->nested.enlightened_vmcs_enabled)) return -EINVAL; vmx_leave_nested(vcpu); if (kvm_state->hdr.vmx.vmxon_pa == INVALID_GPA) return 0; vmx->nested.vmxon_ptr = kvm_state->hdr.vmx.vmxon_pa; ret = enter_vmx_operation(vcpu); if (ret) return ret; /* Empty 'VMXON' state is permitted if no VMCS loaded */ if (kvm_state->size < sizeof(*kvm_state) + sizeof(*vmcs12)) { /* See vmx_has_valid_vmcs12. 
*/ if ((kvm_state->flags & KVM_STATE_NESTED_GUEST_MODE) || (kvm_state->flags & KVM_STATE_NESTED_EVMCS) || (kvm_state->hdr.vmx.vmcs12_pa != INVALID_GPA)) return -EINVAL; else return 0; } if (kvm_state->hdr.vmx.vmcs12_pa != INVALID_GPA) { if (kvm_state->hdr.vmx.vmcs12_pa == kvm_state->hdr.vmx.vmxon_pa || !page_address_valid(vcpu, kvm_state->hdr.vmx.vmcs12_pa)) return -EINVAL; set_current_vmptr(vmx, kvm_state->hdr.vmx.vmcs12_pa); #ifdef CONFIG_KVM_HYPERV } else if (kvm_state->flags & KVM_STATE_NESTED_EVMCS) { /* * nested_vmx_handle_enlightened_vmptrld() cannot be called * directly from here as HV_X64_MSR_VP_ASSIST_PAGE may not be * restored yet. EVMCS will be mapped from * nested_get_vmcs12_pages(). */ vmx->nested.hv_evmcs_vmptr = EVMPTR_MAP_PENDING; kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu); #endif } else { return -EINVAL; } if (kvm_state->hdr.vmx.smm.flags & KVM_STATE_NESTED_SMM_VMXON) { vmx->nested.smm.vmxon = true; vmx->nested.vmxon = false; if (kvm_state->hdr.vmx.smm.flags & KVM_STATE_NESTED_SMM_GUEST_MODE) vmx->nested.smm.guest_mode = true; } vmcs12 = get_vmcs12(vcpu); if (copy_from_user(vmcs12, user_vmx_nested_state->vmcs12, sizeof(*vmcs12))) return -EFAULT; if (vmcs12->hdr.revision_id != VMCS12_REVISION) return -EINVAL; if (!(kvm_state->flags & KVM_STATE_NESTED_GUEST_MODE)) return 0; vmx->nested.nested_run_pending = !!(kvm_state->flags & KVM_STATE_NESTED_RUN_PENDING); vmx->nested.mtf_pending = !!(kvm_state->flags & KVM_STATE_NESTED_MTF_PENDING); ret = -EINVAL; if (nested_cpu_has_shadow_vmcs(vmcs12) && vmcs12->vmcs_link_pointer != INVALID_GPA) { struct vmcs12 *shadow_vmcs12 = get_shadow_vmcs12(vcpu); if (kvm_state->size < sizeof(*kvm_state) + sizeof(user_vmx_nested_state->vmcs12) + sizeof(*shadow_vmcs12)) goto error_guest_mode; if (copy_from_user(shadow_vmcs12, user_vmx_nested_state->shadow_vmcs12, sizeof(*shadow_vmcs12))) { ret = -EFAULT; goto error_guest_mode; } if (shadow_vmcs12->hdr.revision_id != VMCS12_REVISION || !shadow_vmcs12->hdr.shadow_vmcs) goto error_guest_mode; } vmx->nested.has_preemption_timer_deadline = false; if (kvm_state->hdr.vmx.flags & KVM_STATE_VMX_PREEMPTION_TIMER_DEADLINE) { vmx->nested.has_preemption_timer_deadline = true; vmx->nested.preemption_timer_deadline = kvm_state->hdr.vmx.preemption_timer_deadline; } if (nested_vmx_check_controls(vcpu, vmcs12) || nested_vmx_check_host_state(vcpu, vmcs12) || nested_vmx_check_guest_state(vcpu, vmcs12, &ignored)) goto error_guest_mode; vmx->nested.dirty_vmcs12 = true; vmx->nested.force_msr_bitmap_recalc = true; ret = nested_vmx_enter_non_root_mode(vcpu, false); if (ret) goto error_guest_mode; if (vmx->nested.mtf_pending) kvm_make_request(KVM_REQ_EVENT, vcpu); return 0; error_guest_mode: vmx->nested.nested_run_pending = 0; return ret; } void nested_vmx_set_vmcs_shadowing_bitmap(void) { if (enable_shadow_vmcs) { vmcs_write64(VMREAD_BITMAP, __pa(vmx_vmread_bitmap)); vmcs_write64(VMWRITE_BITMAP, __pa(vmx_vmwrite_bitmap)); } } /* * Indexing into the vmcs12 uses the VMCS encoding rotated left by 6. Undo * that madness to get the encoding for comparison. */ #define VMCS12_IDX_TO_ENC(idx) ((u16)(((u16)(idx) >> 6) | ((u16)(idx) << 10))) static u64 nested_vmx_calc_vmcs_enum_msr(void) { /* * Note these are the so called "index" of the VMCS field encoding, not * the index into vmcs12. */ unsigned int max_idx, idx; int i; /* * For better or worse, KVM allows VMREAD/VMWRITE to all fields in * vmcs12, regardless of whether or not the associated feature is * exposed to L1. Simply find the field with the highest index. 
*/ max_idx = 0; for (i = 0; i < nr_vmcs12_fields; i++) { /* The vmcs12 table is very, very sparsely populated. */ if (!vmcs12_field_offsets[i]) continue; idx = vmcs_field_index(VMCS12_IDX_TO_ENC(i)); if (idx > max_idx) max_idx = idx; } return (u64)max_idx << VMCS_FIELD_INDEX_SHIFT; } static void nested_vmx_setup_pinbased_ctls(struct vmcs_config *vmcs_conf, struct nested_vmx_msrs *msrs) { msrs->pinbased_ctls_low = PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR; msrs->pinbased_ctls_high = vmcs_conf->pin_based_exec_ctrl; msrs->pinbased_ctls_high &= PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING | PIN_BASED_VIRTUAL_NMIS | (enable_apicv ? PIN_BASED_POSTED_INTR : 0); msrs->pinbased_ctls_high |= PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR | PIN_BASED_VMX_PREEMPTION_TIMER; } static void nested_vmx_setup_exit_ctls(struct vmcs_config *vmcs_conf, struct nested_vmx_msrs *msrs) { msrs->exit_ctls_low = VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR; msrs->exit_ctls_high = vmcs_conf->vmexit_ctrl; msrs->exit_ctls_high &= #ifdef CONFIG_X86_64 VM_EXIT_HOST_ADDR_SPACE_SIZE | #endif VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT | VM_EXIT_CLEAR_BNDCFGS; msrs->exit_ctls_high |= VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR | VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER | VM_EXIT_SAVE_VMX_PREEMPTION_TIMER | VM_EXIT_ACK_INTR_ON_EXIT | VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL; /* We support free control of debug control saving. */ msrs->exit_ctls_low &= ~VM_EXIT_SAVE_DEBUG_CONTROLS; } static void nested_vmx_setup_entry_ctls(struct vmcs_config *vmcs_conf, struct nested_vmx_msrs *msrs) { msrs->entry_ctls_low = VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR; msrs->entry_ctls_high = vmcs_conf->vmentry_ctrl; msrs->entry_ctls_high &= #ifdef CONFIG_X86_64 VM_ENTRY_IA32E_MODE | #endif VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS; msrs->entry_ctls_high |= (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | VM_ENTRY_LOAD_IA32_EFER | VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL); /* We support free control of debug control loading. */ msrs->entry_ctls_low &= ~VM_ENTRY_LOAD_DEBUG_CONTROLS; } static void nested_vmx_setup_cpubased_ctls(struct vmcs_config *vmcs_conf, struct nested_vmx_msrs *msrs) { msrs->procbased_ctls_low = CPU_BASED_ALWAYSON_WITHOUT_TRUE_MSR; msrs->procbased_ctls_high = vmcs_conf->cpu_based_exec_ctrl; msrs->procbased_ctls_high &= CPU_BASED_INTR_WINDOW_EXITING | CPU_BASED_NMI_WINDOW_EXITING | CPU_BASED_USE_TSC_OFFSETTING | CPU_BASED_HLT_EXITING | CPU_BASED_INVLPG_EXITING | CPU_BASED_MWAIT_EXITING | CPU_BASED_CR3_LOAD_EXITING | CPU_BASED_CR3_STORE_EXITING | #ifdef CONFIG_X86_64 CPU_BASED_CR8_LOAD_EXITING | CPU_BASED_CR8_STORE_EXITING | #endif CPU_BASED_MOV_DR_EXITING | CPU_BASED_UNCOND_IO_EXITING | CPU_BASED_USE_IO_BITMAPS | CPU_BASED_MONITOR_TRAP_FLAG | CPU_BASED_MONITOR_EXITING | CPU_BASED_RDPMC_EXITING | CPU_BASED_RDTSC_EXITING | CPU_BASED_PAUSE_EXITING | CPU_BASED_TPR_SHADOW | CPU_BASED_ACTIVATE_SECONDARY_CONTROLS; /* * We can allow some features even when not supported by the * hardware. For example, L1 can specify an MSR bitmap - and we * can use it to avoid exits to L1 - even when L0 runs L2 * without MSR bitmaps. */ msrs->procbased_ctls_high |= CPU_BASED_ALWAYSON_WITHOUT_TRUE_MSR | CPU_BASED_USE_MSR_BITMAPS; /* We support free control of CR3 access interception. 
*/ msrs->procbased_ctls_low &= ~(CPU_BASED_CR3_LOAD_EXITING | CPU_BASED_CR3_STORE_EXITING); } static void nested_vmx_setup_secondary_ctls(u32 ept_caps, struct vmcs_config *vmcs_conf, struct nested_vmx_msrs *msrs) { msrs->secondary_ctls_low = 0; msrs->secondary_ctls_high = vmcs_conf->cpu_based_2nd_exec_ctrl; msrs->secondary_ctls_high &= SECONDARY_EXEC_DESC | SECONDARY_EXEC_ENABLE_RDTSCP | SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE | SECONDARY_EXEC_WBINVD_EXITING | SECONDARY_EXEC_APIC_REGISTER_VIRT | SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY | SECONDARY_EXEC_RDRAND_EXITING | SECONDARY_EXEC_ENABLE_INVPCID | SECONDARY_EXEC_ENABLE_VMFUNC | SECONDARY_EXEC_RDSEED_EXITING | SECONDARY_EXEC_ENABLE_XSAVES | SECONDARY_EXEC_TSC_SCALING | SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE; /* * We can emulate "VMCS shadowing," even if the hardware * doesn't support it. */ msrs->secondary_ctls_high |= SECONDARY_EXEC_SHADOW_VMCS; if (enable_ept) { /* nested EPT: emulate EPT also to L1 */ msrs->secondary_ctls_high |= SECONDARY_EXEC_ENABLE_EPT; msrs->ept_caps = VMX_EPT_PAGE_WALK_4_BIT | VMX_EPT_PAGE_WALK_5_BIT | VMX_EPTP_WB_BIT | VMX_EPT_INVEPT_BIT | VMX_EPT_EXECUTE_ONLY_BIT; msrs->ept_caps &= ept_caps; msrs->ept_caps |= VMX_EPT_EXTENT_GLOBAL_BIT | VMX_EPT_EXTENT_CONTEXT_BIT | VMX_EPT_2MB_PAGE_BIT | VMX_EPT_1GB_PAGE_BIT; if (enable_ept_ad_bits) { msrs->secondary_ctls_high |= SECONDARY_EXEC_ENABLE_PML; msrs->ept_caps |= VMX_EPT_AD_BIT; } /* * Advertise EPTP switching irrespective of hardware support, * KVM emulates it in software so long as VMFUNC is supported. */ if (cpu_has_vmx_vmfunc()) msrs->vmfunc_controls = VMX_VMFUNC_EPTP_SWITCHING; } /* * Old versions of KVM use the single-context version without * checking for support, so declare that it is supported even * though it is treated as global context. The alternative is * not failing the single-context invvpid, and it is worse. */ if (enable_vpid) { msrs->secondary_ctls_high |= SECONDARY_EXEC_ENABLE_VPID; msrs->vpid_caps = VMX_VPID_INVVPID_BIT | VMX_VPID_EXTENT_SUPPORTED_MASK; } if (enable_unrestricted_guest) msrs->secondary_ctls_high |= SECONDARY_EXEC_UNRESTRICTED_GUEST; if (flexpriority_enabled) msrs->secondary_ctls_high |= SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES; if (enable_sgx) msrs->secondary_ctls_high |= SECONDARY_EXEC_ENCLS_EXITING; } static void nested_vmx_setup_misc_data(struct vmcs_config *vmcs_conf, struct nested_vmx_msrs *msrs) { msrs->misc_low = (u32)vmcs_conf->misc & VMX_MISC_SAVE_EFER_LMA; msrs->misc_low |= VMX_MISC_VMWRITE_SHADOW_RO_FIELDS | VMX_MISC_EMULATED_PREEMPTION_TIMER_RATE | VMX_MISC_ACTIVITY_HLT | VMX_MISC_ACTIVITY_WAIT_SIPI; msrs->misc_high = 0; } static void nested_vmx_setup_basic(struct nested_vmx_msrs *msrs) { /* * This MSR reports some information about VMX support. We * should return information about the VMX we emulate for the * guest, and the VMCS structure we give it - not about the * VMX support of the underlying hardware. */ msrs->basic = vmx_basic_encode_vmcs_info(VMCS12_REVISION, VMCS12_SIZE, X86_MEMTYPE_WB); msrs->basic |= VMX_BASIC_TRUE_CTLS; if (cpu_has_vmx_basic_inout()) msrs->basic |= VMX_BASIC_INOUT; } static void nested_vmx_setup_cr_fixed(struct nested_vmx_msrs *msrs) { /* * These MSRs specify bits which the guest must keep fixed on * while L1 is in VMXON mode (in L1's root mode, or running an L2). * We picked the standard core2 setting. 
*/ #define VMXON_CR0_ALWAYSON (X86_CR0_PE | X86_CR0_PG | X86_CR0_NE) #define VMXON_CR4_ALWAYSON X86_CR4_VMXE msrs->cr0_fixed0 = VMXON_CR0_ALWAYSON; msrs->cr4_fixed0 = VMXON_CR4_ALWAYSON; /* These MSRs specify bits which the guest must keep fixed off. */ rdmsrl(MSR_IA32_VMX_CR0_FIXED1, msrs->cr0_fixed1); rdmsrl(MSR_IA32_VMX_CR4_FIXED1, msrs->cr4_fixed1); if (vmx_umip_emulated()) msrs->cr4_fixed1 |= X86_CR4_UMIP; } /* * nested_vmx_setup_ctls_msrs() sets up variables containing the values to be * returned for the various VMX controls MSRs when nested VMX is enabled. * The same values should also be used to verify that vmcs12 control fields are * valid during nested entry from L1 to L2. * Each of these control msrs has a low and high 32-bit half: A low bit is on * if the corresponding bit in the (32-bit) control field *must* be on, and a * bit in the high half is on if the corresponding bit in the control field * may be on. See also vmx_control_verify(). */ void nested_vmx_setup_ctls_msrs(struct vmcs_config *vmcs_conf, u32 ept_caps) { struct nested_vmx_msrs *msrs = &vmcs_conf->nested; /* * Note that as a general rule, the high half of the MSRs (bits in * the control fields which may be 1) should be initialized by the * intersection of the underlying hardware's MSR (i.e., features which * can be supported) and the list of features we want to expose - * because they are known to be properly supported in our code. * Also, usually, the low half of the MSRs (bits which must be 1) can * be set to 0, meaning that L1 may turn off any of these bits. The * reason is that if one of these bits is necessary, it will appear * in vmcs01 and prepare_vmcs02, when it bitwise-or's the control * fields of vmcs01 and vmcs02, will turn these bits off - and * nested_vmx_l1_wants_exit() will not pass related exits to L1. * These rules have exceptions below. */ nested_vmx_setup_pinbased_ctls(vmcs_conf, msrs); nested_vmx_setup_exit_ctls(vmcs_conf, msrs); nested_vmx_setup_entry_ctls(vmcs_conf, msrs); nested_vmx_setup_cpubased_ctls(vmcs_conf, msrs); nested_vmx_setup_secondary_ctls(ept_caps, vmcs_conf, msrs); nested_vmx_setup_misc_data(vmcs_conf, msrs); nested_vmx_setup_basic(msrs); nested_vmx_setup_cr_fixed(msrs); msrs->vmcs_enum = nested_vmx_calc_vmcs_enum_msr(); } void nested_vmx_hardware_unsetup(void) { int i; if (enable_shadow_vmcs) { for (i = 0; i < VMX_BITMAP_NR; i++) free_page((unsigned long)vmx_bitmap[i]); } } __init int nested_vmx_hardware_setup(int (*exit_handlers[])(struct kvm_vcpu *)) { int i; if (!cpu_has_vmx_shadow_vmcs()) enable_shadow_vmcs = 0; if (enable_shadow_vmcs) { for (i = 0; i < VMX_BITMAP_NR; i++) { /* * The vmx_bitmap is not tied to a VM and so should * not be charged to a memcg. 
*/ vmx_bitmap[i] = (unsigned long *) __get_free_page(GFP_KERNEL); if (!vmx_bitmap[i]) { nested_vmx_hardware_unsetup(); return -ENOMEM; } } init_vmcs_shadow_fields(); } exit_handlers[EXIT_REASON_VMCLEAR] = handle_vmclear; exit_handlers[EXIT_REASON_VMLAUNCH] = handle_vmlaunch; exit_handlers[EXIT_REASON_VMPTRLD] = handle_vmptrld; exit_handlers[EXIT_REASON_VMPTRST] = handle_vmptrst; exit_handlers[EXIT_REASON_VMREAD] = handle_vmread; exit_handlers[EXIT_REASON_VMRESUME] = handle_vmresume; exit_handlers[EXIT_REASON_VMWRITE] = handle_vmwrite; exit_handlers[EXIT_REASON_VMOFF] = handle_vmxoff; exit_handlers[EXIT_REASON_VMON] = handle_vmxon; exit_handlers[EXIT_REASON_INVEPT] = handle_invept; exit_handlers[EXIT_REASON_INVVPID] = handle_invvpid; exit_handlers[EXIT_REASON_VMFUNC] = handle_vmfunc; return 0; } struct kvm_x86_nested_ops vmx_nested_ops = { .leave_nested = vmx_leave_nested, .is_exception_vmexit = nested_vmx_is_exception_vmexit, .check_events = vmx_check_nested_events, .has_events = vmx_has_nested_events, .triple_fault = nested_vmx_triple_fault, .get_state = vmx_get_nested_state, .set_state = vmx_set_nested_state, .get_nested_state_pages = vmx_get_nested_state_pages, .write_log_dirty = nested_vmx_write_pml_buffer, #ifdef CONFIG_KVM_HYPERV .enable_evmcs = nested_enable_evmcs, .get_evmcs_version = nested_get_evmcs_version, .hv_inject_synthetic_vmexit_post_tlb_flush = vmx_hv_inject_synthetic_vmexit_post_tlb_flush, #endif }; |
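/*
 * Illustrative userspace sketch (not kernel code) of the MSR-bitmap lookup
 * performed by nested_vmx_exit_handled_msr() above: the 4 KiB bitmap is four
 * 1024-byte regions -- read-low, read-high, write-low, write-high -- and MSRs
 * 0xc0000000..0xc0001fff index the "high" regions.  The helper name and the
 * main() below are hypothetical and exist only to demonstrate the arithmetic.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool msr_bitmap_wants_exit(const uint8_t bitmap[4096],
				  uint32_t msr, bool is_write)
{
	unsigned int base = is_write ? 2048 : 0;   /* write bitmaps follow the read bitmaps */

	if (msr >= 0xc0000000u) {                  /* high MSR range */
		msr -= 0xc0000000u;
		base += 1024;
	}
	if (msr >= 1024 * 8)                       /* outside both ranges: let L1 handle it */
		return true;

	return (bitmap[base + msr / 8] >> (msr & 7)) & 1;
}

int main(void)
{
	static uint8_t bitmap[4096];
	uint32_t low_bits = 0xc0000081u - 0xc0000000u;  /* MSR_STAR within the high range */

	/* Intercept writes (but not reads) to MSR_STAR. */
	bitmap[2048 + 1024 + low_bits / 8] |= 1u << (low_bits & 7);

	printf("write MSR_STAR exits to L1: %d\n",
	       msr_bitmap_wants_exit(bitmap, 0xc0000081u, true));
	printf("read  MSR_STAR exits to L1: %d\n",
	       msr_bitmap_wants_exit(bitmap, 0xc0000081u, false));
	return 0;
}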
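/*
 * A second userspace-only sketch, this time of the I/O-bitmap walk in
 * nested_vmx_check_io_bitmaps(): bitmap A covers ports 0x0000-0x7fff, bitmap B
 * covers 0x8000-0xffff, and an access of `size` bytes exits to L1 if any
 * covered port has its bit set.  Plain arrays stand in for the guest pages
 * read with kvm_vcpu_read_guest(), and the last-byte caching is dropped;
 * everything named here is hypothetical.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool io_access_exits(const uint8_t bitmap_a[4096],
			    const uint8_t bitmap_b[4096],
			    unsigned int port, int size)
{
	while (size > 0) {
		const uint8_t *bitmap;

		if (port < 0x8000)
			bitmap = bitmap_a;
		else if (port < 0x10000)
			bitmap = bitmap_b;
		else
			return true;            /* ran past port 0xffff: always exit */

		if (bitmap[(port & 0x7fff) / 8] & (1u << (port & 7)))
			return true;

		port++;
		size--;
	}
	return false;
}

int main(void)
{
	static uint8_t a[4096], b[4096];

	b[(0x8040 & 0x7fff) / 8] |= 1u << (0x8040 & 7);  /* intercept port 0x8040 */

	/* A 4-byte access at 0x803e covers 0x803e..0x8041 and therefore exits. */
	printf("4-byte access at 0x803e exits: %d\n", io_access_exits(a, b, 0x803e, 4));
	printf("1-byte access at 0x8041 exits: %d\n", io_access_exits(a, b, 0x8041, 1));
	return 0;
}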
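/*
 * Minimal sketch of the "must be 1 / may be 1" split that the comment above
 * nested_vmx_setup_ctls_msrs() describes: the low half of each VMX control
 * MSR holds bits that must be set in the corresponding VMCS control field,
 * the high half holds bits that may be set.  ctl_allowed() is a hypothetical
 * stand-in for the kernel's verification helper, written only to make the
 * rule concrete; the example masks are made up.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool ctl_allowed(uint32_t ctl, uint32_t must_be_1, uint32_t may_be_1)
{
	/* Every mandatory bit set, and nothing outside the permitted set. */
	return (ctl & must_be_1) == must_be_1 && (ctl & ~may_be_1) == 0;
}

int main(void)
{
	uint32_t must_be_1 = 0x00000016u;   /* example "always on" bits */
	uint32_t may_be_1  = 0x0000001eu;   /* example "may be on" bits (superset) */

	printf("0x16 valid: %d\n", ctl_allowed(0x16, must_be_1, may_be_1)); /* 1 */
	printf("0x1e valid: %d\n", ctl_allowed(0x1e, must_be_1, may_be_1)); /* 1 */
	printf("0x14 valid: %d\n", ctl_allowed(0x14, must_be_1, may_be_1)); /* 0: a mandatory bit is clear */
	printf("0x36 valid: %d\n", ctl_allowed(0x36, must_be_1, may_be_1)); /* 0: bit 5 is not allowed */
	return 0;
}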
861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 | // SPDX-License-Identifier: GPL-2.0+ /* * the_nilfs shared structure. * * Copyright (C) 2005-2008 Nippon Telegraph and Telephone Corporation. * * Written by Ryusuke Konishi. * */ #include <linux/buffer_head.h> #include <linux/slab.h> #include <linux/blkdev.h> #include <linux/backing-dev.h> #include <linux/log2.h> #include <linux/crc32.h> #include "nilfs.h" #include "segment.h" #include "alloc.h" #include "cpfile.h" #include "sufile.h" #include "dat.h" #include "segbuf.h" static int nilfs_valid_sb(struct nilfs_super_block *sbp); void nilfs_set_last_segment(struct the_nilfs *nilfs, sector_t start_blocknr, u64 seq, __u64 cno) { spin_lock(&nilfs->ns_last_segment_lock); nilfs->ns_last_pseg = start_blocknr; nilfs->ns_last_seq = seq; nilfs->ns_last_cno = cno; if (!nilfs_sb_dirty(nilfs)) { if (nilfs->ns_prev_seq == nilfs->ns_last_seq) goto stay_cursor; set_nilfs_sb_dirty(nilfs); } nilfs->ns_prev_seq = nilfs->ns_last_seq; stay_cursor: spin_unlock(&nilfs->ns_last_segment_lock); } /** * alloc_nilfs - allocate a nilfs object * @sb: super block instance * * Return: a pointer to the allocated nilfs object on success, or NULL on * failure. */ struct the_nilfs *alloc_nilfs(struct super_block *sb) { struct the_nilfs *nilfs; nilfs = kzalloc(sizeof(*nilfs), GFP_KERNEL); if (!nilfs) return NULL; nilfs->ns_sb = sb; nilfs->ns_bdev = sb->s_bdev; atomic_set(&nilfs->ns_ndirtyblks, 0); init_rwsem(&nilfs->ns_sem); mutex_init(&nilfs->ns_snapshot_mount_mutex); INIT_LIST_HEAD(&nilfs->ns_dirty_files); INIT_LIST_HEAD(&nilfs->ns_gc_inodes); spin_lock_init(&nilfs->ns_inode_lock); spin_lock_init(&nilfs->ns_last_segment_lock); nilfs->ns_cptree = RB_ROOT; spin_lock_init(&nilfs->ns_cptree_lock); init_rwsem(&nilfs->ns_segctor_sem); nilfs->ns_sb_update_freq = NILFS_SB_FREQ; return nilfs; } /** * destroy_nilfs - destroy nilfs object * @nilfs: nilfs object to be released */ void destroy_nilfs(struct the_nilfs *nilfs) { might_sleep(); if (nilfs_init(nilfs)) { brelse(nilfs->ns_sbh[0]); brelse(nilfs->ns_sbh[1]); } kfree(nilfs); } static int nilfs_load_super_root(struct the_nilfs *nilfs, struct super_block *sb, sector_t sr_block) { struct buffer_head *bh_sr; struct nilfs_super_root *raw_sr; struct nilfs_super_block **sbp = nilfs->ns_sbp; struct nilfs_inode *rawi; unsigned int dat_entry_size, segment_usage_size, checkpoint_size; unsigned int inode_size; int err; err = nilfs_read_super_root_block(nilfs, sr_block, &bh_sr, 1); if (unlikely(err)) return err; down_read(&nilfs->ns_sem); dat_entry_size = le16_to_cpu(sbp[0]->s_dat_entry_size); checkpoint_size = le16_to_cpu(sbp[0]->s_checkpoint_size); segment_usage_size = le16_to_cpu(sbp[0]->s_segment_usage_size); up_read(&nilfs->ns_sem); inode_size = nilfs->ns_inode_size; rawi = (void *)bh_sr->b_data + NILFS_SR_DAT_OFFSET(inode_size); err = nilfs_dat_read(sb, dat_entry_size, rawi, &nilfs->ns_dat); if (err) goto failed; rawi = (void *)bh_sr->b_data + NILFS_SR_CPFILE_OFFSET(inode_size); err = nilfs_cpfile_read(sb, checkpoint_size, rawi, &nilfs->ns_cpfile); if (err) goto failed_dat; rawi = (void *)bh_sr->b_data + NILFS_SR_SUFILE_OFFSET(inode_size); err = nilfs_sufile_read(sb, segment_usage_size, rawi, &nilfs->ns_sufile); if (err) goto failed_cpfile; raw_sr = (struct 
nilfs_super_root *)bh_sr->b_data; nilfs->ns_nongc_ctime = le64_to_cpu(raw_sr->sr_nongc_ctime); failed: brelse(bh_sr); return err; failed_cpfile: iput(nilfs->ns_cpfile); failed_dat: iput(nilfs->ns_dat); goto failed; } static void nilfs_init_recovery_info(struct nilfs_recovery_info *ri) { memset(ri, 0, sizeof(*ri)); INIT_LIST_HEAD(&ri->ri_used_segments); } static void nilfs_clear_recovery_info(struct nilfs_recovery_info *ri) { nilfs_dispose_segment_list(&ri->ri_used_segments); } /** * nilfs_store_log_cursor - load log cursor from a super block * @nilfs: nilfs object * @sbp: buffer storing super block to be read * * nilfs_store_log_cursor() reads the last position of the log * containing a super root from a given super block, and initializes * relevant information on the nilfs object preparatory for log * scanning and recovery. * * Return: 0 on success, or %-EINVAL if current segment number is out * of range. */ static int nilfs_store_log_cursor(struct the_nilfs *nilfs, struct nilfs_super_block *sbp) { int ret = 0; nilfs->ns_last_pseg = le64_to_cpu(sbp->s_last_pseg); nilfs->ns_last_cno = le64_to_cpu(sbp->s_last_cno); nilfs->ns_last_seq = le64_to_cpu(sbp->s_last_seq); nilfs->ns_prev_seq = nilfs->ns_last_seq; nilfs->ns_seg_seq = nilfs->ns_last_seq; nilfs->ns_segnum = nilfs_get_segnum_of_block(nilfs, nilfs->ns_last_pseg); nilfs->ns_cno = nilfs->ns_last_cno + 1; if (nilfs->ns_segnum >= nilfs->ns_nsegments) { nilfs_err(nilfs->ns_sb, "pointed segment number is out of range: segnum=%llu, nsegments=%lu", (unsigned long long)nilfs->ns_segnum, nilfs->ns_nsegments); ret = -EINVAL; } return ret; } /** * nilfs_get_blocksize - get block size from raw superblock data * @sb: super block instance * @sbp: superblock raw data buffer * @blocksize: place to store block size * * nilfs_get_blocksize() calculates the block size from the block size * exponent information written in @sbp and stores it in @blocksize, * or aborts with an error message if it's too large. * * Return: 0 on success, or %-EINVAL if the block size is too large. */ static int nilfs_get_blocksize(struct super_block *sb, struct nilfs_super_block *sbp, int *blocksize) { unsigned int shift_bits = le32_to_cpu(sbp->s_log_block_size); if (unlikely(shift_bits > ilog2(NILFS_MAX_BLOCK_SIZE) - BLOCK_SIZE_BITS)) { nilfs_err(sb, "too large filesystem blocksize: 2 ^ %u KiB", shift_bits); return -EINVAL; } *blocksize = BLOCK_SIZE << shift_bits; return 0; } /** * load_nilfs - load and recover the nilfs * @nilfs: the_nilfs structure to be released * @sb: super block instance used to recover past segment * * load_nilfs() searches and load the latest super root, * attaches the last segment, and does recovery if needed. * The caller must call this exclusively for simultaneous mounts. * * Return: 0 on success, or one of the following negative error codes on * failure: * * %-EINVAL - No valid segment found. * * %-EIO - I/O error. * * %-ENOMEM - Insufficient memory available. 
* * %-EROFS - Read only device or RO compat mode (if recovery is required) */ int load_nilfs(struct the_nilfs *nilfs, struct super_block *sb) { struct nilfs_recovery_info ri; unsigned int s_flags = sb->s_flags; int really_read_only = bdev_read_only(nilfs->ns_bdev); int valid_fs = nilfs_valid_fs(nilfs); int err; if (!valid_fs) { nilfs_warn(sb, "mounting unchecked fs"); if (s_flags & SB_RDONLY) { nilfs_info(sb, "recovery required for readonly filesystem"); nilfs_info(sb, "write access will be enabled during recovery"); } } nilfs_init_recovery_info(&ri); err = nilfs_search_super_root(nilfs, &ri); if (unlikely(err)) { struct nilfs_super_block **sbp = nilfs->ns_sbp; int blocksize; if (err != -EINVAL) goto scan_error; if (!nilfs_valid_sb(sbp[1])) { nilfs_warn(sb, "unable to fall back to spare super block"); goto scan_error; } nilfs_info(sb, "trying rollback from an earlier position"); /* * restore super block with its spare and reconfigure * relevant states of the nilfs object. */ memcpy(sbp[0], sbp[1], nilfs->ns_sbsize); nilfs->ns_crc_seed = le32_to_cpu(sbp[0]->s_crc_seed); nilfs->ns_sbwtime = le64_to_cpu(sbp[0]->s_wtime); /* verify consistency between two super blocks */ err = nilfs_get_blocksize(sb, sbp[0], &blocksize); if (err) goto scan_error; if (blocksize != nilfs->ns_blocksize) { nilfs_warn(sb, "blocksize differs between two super blocks (%d != %d)", blocksize, nilfs->ns_blocksize); err = -EINVAL; goto scan_error; } err = nilfs_store_log_cursor(nilfs, sbp[0]); if (err) goto scan_error; /* drop clean flag to allow roll-forward and recovery */ nilfs->ns_mount_state &= ~NILFS_VALID_FS; valid_fs = 0; err = nilfs_search_super_root(nilfs, &ri); if (err) goto scan_error; } err = nilfs_load_super_root(nilfs, sb, ri.ri_super_root); if (unlikely(err)) { nilfs_err(sb, "error %d while loading super root", err); goto failed; } err = nilfs_sysfs_create_device_group(sb); if (unlikely(err)) goto sysfs_error; if (valid_fs) goto skip_recovery; if (s_flags & SB_RDONLY) { __u64 features; if (nilfs_test_opt(nilfs, NORECOVERY)) { nilfs_info(sb, "norecovery option specified, skipping roll-forward recovery"); goto skip_recovery; } features = le64_to_cpu(nilfs->ns_sbp[0]->s_feature_compat_ro) & ~NILFS_FEATURE_COMPAT_RO_SUPP; if (features) { nilfs_err(sb, "couldn't proceed with recovery because of unsupported optional features (%llx)", (unsigned long long)features); err = -EROFS; goto failed_unload; } if (really_read_only) { nilfs_err(sb, "write access unavailable, cannot proceed"); err = -EROFS; goto failed_unload; } sb->s_flags &= ~SB_RDONLY; } else if (nilfs_test_opt(nilfs, NORECOVERY)) { nilfs_err(sb, "recovery cancelled because norecovery option was specified for a read/write mount"); err = -EINVAL; goto failed_unload; } err = nilfs_salvage_orphan_logs(nilfs, sb, &ri); if (err) goto failed_unload; down_write(&nilfs->ns_sem); nilfs->ns_mount_state |= NILFS_VALID_FS; /* set "clean" flag */ err = nilfs_cleanup_super(sb); up_write(&nilfs->ns_sem); if (err) { nilfs_err(sb, "error %d updating super block. 
recovery unfinished.", err); goto failed_unload; } nilfs_info(sb, "recovery complete"); skip_recovery: nilfs_clear_recovery_info(&ri); sb->s_flags = s_flags; return 0; scan_error: nilfs_err(sb, "error %d while searching super root", err); goto failed; failed_unload: nilfs_sysfs_delete_device_group(nilfs); sysfs_error: iput(nilfs->ns_cpfile); iput(nilfs->ns_sufile); iput(nilfs->ns_dat); failed: nilfs_clear_recovery_info(&ri); sb->s_flags = s_flags; return err; } static unsigned long long nilfs_max_size(unsigned int blkbits) { unsigned int max_bits; unsigned long long res = MAX_LFS_FILESIZE; /* page cache limit */ max_bits = blkbits + NILFS_BMAP_KEY_BIT; /* bmap size limit */ if (max_bits < 64) res = min_t(unsigned long long, res, (1ULL << max_bits) - 1); return res; } /** * nilfs_nrsvsegs - calculate the number of reserved segments * @nilfs: nilfs object * @nsegs: total number of segments * * Return: Number of reserved segments. */ unsigned long nilfs_nrsvsegs(struct the_nilfs *nilfs, unsigned long nsegs) { return max_t(unsigned long, NILFS_MIN_NRSVSEGS, DIV_ROUND_UP(nsegs * nilfs->ns_r_segments_percentage, 100)); } /** * nilfs_max_segment_count - calculate the maximum number of segments * @nilfs: nilfs object * * Return: Maximum number of segments */ static u64 nilfs_max_segment_count(struct the_nilfs *nilfs) { u64 max_count = U64_MAX; max_count = div64_ul(max_count, nilfs->ns_blocks_per_segment); return min_t(u64, max_count, ULONG_MAX); } void nilfs_set_nsegments(struct the_nilfs *nilfs, unsigned long nsegs) { nilfs->ns_nsegments = nsegs; nilfs->ns_nrsvsegs = nilfs_nrsvsegs(nilfs, nsegs); } static int nilfs_store_disk_layout(struct the_nilfs *nilfs, struct nilfs_super_block *sbp) { u64 nsegments, nblocks; if (le32_to_cpu(sbp->s_rev_level) < NILFS_MIN_SUPP_REV) { nilfs_err(nilfs->ns_sb, "unsupported revision (superblock rev.=%d.%d, current rev.=%d.%d). 
Please check the version of mkfs.nilfs(2).", le32_to_cpu(sbp->s_rev_level), le16_to_cpu(sbp->s_minor_rev_level), NILFS_CURRENT_REV, NILFS_MINOR_REV); return -EINVAL; } nilfs->ns_sbsize = le16_to_cpu(sbp->s_bytes); if (nilfs->ns_sbsize > BLOCK_SIZE) return -EINVAL; nilfs->ns_inode_size = le16_to_cpu(sbp->s_inode_size); if (nilfs->ns_inode_size > nilfs->ns_blocksize) { nilfs_err(nilfs->ns_sb, "too large inode size: %d bytes", nilfs->ns_inode_size); return -EINVAL; } else if (nilfs->ns_inode_size < NILFS_MIN_INODE_SIZE) { nilfs_err(nilfs->ns_sb, "too small inode size: %d bytes", nilfs->ns_inode_size); return -EINVAL; } nilfs->ns_first_ino = le32_to_cpu(sbp->s_first_ino); if (nilfs->ns_first_ino < NILFS_USER_INO) { nilfs_err(nilfs->ns_sb, "too small lower limit for non-reserved inode numbers: %u", nilfs->ns_first_ino); return -EINVAL; } nilfs->ns_blocks_per_segment = le32_to_cpu(sbp->s_blocks_per_segment); if (nilfs->ns_blocks_per_segment < NILFS_SEG_MIN_BLOCKS) { nilfs_err(nilfs->ns_sb, "too short segment: %lu blocks", nilfs->ns_blocks_per_segment); return -EINVAL; } nilfs->ns_first_data_block = le64_to_cpu(sbp->s_first_data_block); nilfs->ns_r_segments_percentage = le32_to_cpu(sbp->s_r_segments_percentage); if (nilfs->ns_r_segments_percentage < 1 || nilfs->ns_r_segments_percentage > 99) { nilfs_err(nilfs->ns_sb, "invalid reserved segments percentage: %lu", nilfs->ns_r_segments_percentage); return -EINVAL; } nsegments = le64_to_cpu(sbp->s_nsegments); if (nsegments > nilfs_max_segment_count(nilfs)) { nilfs_err(nilfs->ns_sb, "segment count %llu exceeds upper limit (%llu segments)", (unsigned long long)nsegments, (unsigned long long)nilfs_max_segment_count(nilfs)); return -EINVAL; } nblocks = sb_bdev_nr_blocks(nilfs->ns_sb); if (nblocks) { u64 min_block_count = nsegments * nilfs->ns_blocks_per_segment; /* * To avoid failing to mount early device images without a * second superblock, exclude that block count from the * "min_block_count" calculation. */ if (nblocks < min_block_count) { nilfs_err(nilfs->ns_sb, "total number of segment blocks %llu exceeds device size (%llu blocks)", (unsigned long long)min_block_count, (unsigned long long)nblocks); return -EINVAL; } } nilfs_set_nsegments(nilfs, nsegments); nilfs->ns_crc_seed = le32_to_cpu(sbp->s_crc_seed); return 0; } static int nilfs_valid_sb(struct nilfs_super_block *sbp) { static unsigned char sum[4]; const int sumoff = offsetof(struct nilfs_super_block, s_sum); size_t bytes; u32 crc; if (!sbp || le16_to_cpu(sbp->s_magic) != NILFS_SUPER_MAGIC) return 0; bytes = le16_to_cpu(sbp->s_bytes); if (bytes < sumoff + 4 || bytes > BLOCK_SIZE) return 0; crc = crc32_le(le32_to_cpu(sbp->s_crc_seed), (unsigned char *)sbp, sumoff); crc = crc32_le(crc, sum, 4); crc = crc32_le(crc, (unsigned char *)sbp + sumoff + 4, bytes - sumoff - 4); return crc == le32_to_cpu(sbp->s_sum); } /** * nilfs_sb2_bad_offset - check the location of the second superblock * @sbp: superblock raw data buffer * @offset: byte offset of second superblock calculated from device size * * nilfs_sb2_bad_offset() checks if the position on the second * superblock is valid or not based on the filesystem parameters * stored in @sbp. If @offset points to a location within the segment * area, or if the parameters themselves are not normal, it is * determined to be invalid. * * Return: true if invalid, false if valid. 
*/ static bool nilfs_sb2_bad_offset(struct nilfs_super_block *sbp, u64 offset) { unsigned int shift_bits = le32_to_cpu(sbp->s_log_block_size); u32 blocks_per_segment = le32_to_cpu(sbp->s_blocks_per_segment); u64 nsegments = le64_to_cpu(sbp->s_nsegments); u64 index; if (blocks_per_segment < NILFS_SEG_MIN_BLOCKS || shift_bits > ilog2(NILFS_MAX_BLOCK_SIZE) - BLOCK_SIZE_BITS) return true; index = offset >> (shift_bits + BLOCK_SIZE_BITS); do_div(index, blocks_per_segment); return index < nsegments; } static void nilfs_release_super_block(struct the_nilfs *nilfs) { int i; for (i = 0; i < 2; i++) { if (nilfs->ns_sbp[i]) { brelse(nilfs->ns_sbh[i]); nilfs->ns_sbh[i] = NULL; nilfs->ns_sbp[i] = NULL; } } } void nilfs_fall_back_super_block(struct the_nilfs *nilfs) { brelse(nilfs->ns_sbh[0]); nilfs->ns_sbh[0] = nilfs->ns_sbh[1]; nilfs->ns_sbp[0] = nilfs->ns_sbp[1]; nilfs->ns_sbh[1] = NULL; nilfs->ns_sbp[1] = NULL; } void nilfs_swap_super_block(struct the_nilfs *nilfs) { struct buffer_head *tsbh = nilfs->ns_sbh[0]; struct nilfs_super_block *tsbp = nilfs->ns_sbp[0]; nilfs->ns_sbh[0] = nilfs->ns_sbh[1]; nilfs->ns_sbp[0] = nilfs->ns_sbp[1]; nilfs->ns_sbh[1] = tsbh; nilfs->ns_sbp[1] = tsbp; } static int nilfs_load_super_block(struct the_nilfs *nilfs, struct super_block *sb, int blocksize, struct nilfs_super_block **sbpp) { struct nilfs_super_block **sbp = nilfs->ns_sbp; struct buffer_head **sbh = nilfs->ns_sbh; u64 sb2off, devsize = bdev_nr_bytes(nilfs->ns_bdev); int valid[2], swp = 0, older; if (devsize < NILFS_SEG_MIN_BLOCKS * NILFS_MIN_BLOCK_SIZE + 4096) { nilfs_err(sb, "device size too small"); return -EINVAL; } sb2off = NILFS_SB2_OFFSET_BYTES(devsize); sbp[0] = nilfs_read_super_block(sb, NILFS_SB_OFFSET_BYTES, blocksize, &sbh[0]); sbp[1] = nilfs_read_super_block(sb, sb2off, blocksize, &sbh[1]); if (!sbp[0]) { if (!sbp[1]) { nilfs_err(sb, "unable to read superblock"); return -EIO; } nilfs_warn(sb, "unable to read primary superblock (blocksize = %d)", blocksize); } else if (!sbp[1]) { nilfs_warn(sb, "unable to read secondary superblock (blocksize = %d)", blocksize); } /* * Compare two super blocks and set 1 in swp if the secondary * super block is valid and newer. Otherwise, set 0 in swp. */ valid[0] = nilfs_valid_sb(sbp[0]); valid[1] = nilfs_valid_sb(sbp[1]); swp = valid[1] && (!valid[0] || le64_to_cpu(sbp[1]->s_last_cno) > le64_to_cpu(sbp[0]->s_last_cno)); if (valid[swp] && nilfs_sb2_bad_offset(sbp[swp], sb2off)) { brelse(sbh[1]); sbh[1] = NULL; sbp[1] = NULL; valid[1] = 0; swp = 0; } if (!valid[swp]) { nilfs_release_super_block(nilfs); nilfs_err(sb, "couldn't find nilfs on the device"); return -EINVAL; } if (!valid[!swp]) nilfs_warn(sb, "broken superblock, retrying with spare superblock (blocksize = %d)", blocksize); if (swp) nilfs_swap_super_block(nilfs); /* * Calculate the array index of the older superblock data. * If one has been dropped, set index 0 pointing to the remaining one, * otherwise set index 1 pointing to the old one (including if both * are the same). * * Divided case valid[0] valid[1] swp -> older * ------------------------------------------------------------- * Both SBs are invalid 0 0 N/A (Error) * SB1 is invalid 0 1 1 0 * SB2 is invalid 1 0 0 0 * SB2 is newer 1 1 1 0 * SB2 is older or the same 1 1 0 1 */ older = valid[1] ^ swp; nilfs->ns_sbwcount = 0; nilfs->ns_sbwtime = le64_to_cpu(sbp[0]->s_wtime); nilfs->ns_prot_seq = le64_to_cpu(sbp[older]->s_last_seq); *sbpp = sbp[0]; return 0; } /** * init_nilfs - initialize a NILFS instance. 
* @nilfs: the_nilfs structure * @sb: super block * * init_nilfs() performs common initialization per block device (e.g. * reading the super block, getting disk layout information, initializing * shared fields in the_nilfs). * * Return: 0 on success, or a negative error code on failure. */ int init_nilfs(struct the_nilfs *nilfs, struct super_block *sb) { struct nilfs_super_block *sbp; int blocksize; int err; down_write(&nilfs->ns_sem); blocksize = sb_min_blocksize(sb, NILFS_MIN_BLOCK_SIZE); if (!blocksize) { nilfs_err(sb, "unable to set blocksize"); err = -EINVAL; goto out; } err = nilfs_load_super_block(nilfs, sb, blocksize, &sbp); if (err) goto out; err = nilfs_store_magic(sb, sbp); if (err) goto failed_sbh; err = nilfs_check_feature_compatibility(sb, sbp); if (err) goto failed_sbh; err = nilfs_get_blocksize(sb, sbp, &blocksize); if (err) goto failed_sbh; if (blocksize < NILFS_MIN_BLOCK_SIZE) { nilfs_err(sb, "couldn't mount because of unsupported filesystem blocksize %d", blocksize); err = -EINVAL; goto failed_sbh; } if (sb->s_blocksize != blocksize) { int hw_blocksize = bdev_logical_block_size(sb->s_bdev); if (blocksize < hw_blocksize) { nilfs_err(sb, "blocksize %d too small for device (sector-size = %d)", blocksize, hw_blocksize); err = -EINVAL; goto failed_sbh; } nilfs_release_super_block(nilfs); if (!sb_set_blocksize(sb, blocksize)) { nilfs_err(sb, "bad blocksize %d", blocksize); err = -EINVAL; goto out; } err = nilfs_load_super_block(nilfs, sb, blocksize, &sbp); if (err) goto out; /* * Not to failed_sbh; sbh is released automatically * when reloading fails. */ } nilfs->ns_blocksize_bits = sb->s_blocksize_bits; nilfs->ns_blocksize = blocksize; err = nilfs_store_disk_layout(nilfs, sbp); if (err) goto failed_sbh; sb->s_maxbytes = nilfs_max_size(sb->s_blocksize_bits); nilfs->ns_mount_state = le16_to_cpu(sbp->s_state); err = nilfs_store_log_cursor(nilfs, sbp); if (err) goto failed_sbh; set_nilfs_init(nilfs); err = 0; out: up_write(&nilfs->ns_sem); return err; failed_sbh: nilfs_release_super_block(nilfs); goto out; } int nilfs_discard_segments(struct the_nilfs *nilfs, __u64 *segnump, size_t nsegs) { sector_t seg_start, seg_end; sector_t start = 0, nblocks = 0; unsigned int sects_per_block; __u64 *sn; int ret = 0; sects_per_block = (1 << nilfs->ns_blocksize_bits) / bdev_logical_block_size(nilfs->ns_bdev); for (sn = segnump; sn < segnump + nsegs; sn++) { nilfs_get_segment_range(nilfs, *sn, &seg_start, &seg_end); if (!nblocks) { start = seg_start; nblocks = seg_end - seg_start + 1; } else if (start + nblocks == seg_start) { nblocks += seg_end - seg_start + 1; } else { ret = blkdev_issue_discard(nilfs->ns_bdev, start * sects_per_block, nblocks * sects_per_block, GFP_NOFS); if (ret < 0) return ret; nblocks = 0; } } if (nblocks) ret = blkdev_issue_discard(nilfs->ns_bdev, start * sects_per_block, nblocks * sects_per_block, GFP_NOFS); return ret; } int nilfs_count_free_blocks(struct the_nilfs *nilfs, sector_t *nblocks) { unsigned long ncleansegs; ncleansegs = nilfs_sufile_get_ncleansegs(nilfs->ns_sufile); *nblocks = (sector_t)ncleansegs * nilfs->ns_blocks_per_segment; return 0; } int nilfs_near_disk_full(struct the_nilfs *nilfs) { unsigned long ncleansegs, nincsegs; ncleansegs = nilfs_sufile_get_ncleansegs(nilfs->ns_sufile); nincsegs = atomic_read(&nilfs->ns_ndirtyblks) / nilfs->ns_blocks_per_segment + 1; return ncleansegs <= nilfs->ns_nrsvsegs + nincsegs; } struct nilfs_root *nilfs_lookup_root(struct the_nilfs *nilfs, __u64 cno) { struct rb_node *n; struct nilfs_root *root; 
spin_lock(&nilfs->ns_cptree_lock); n = nilfs->ns_cptree.rb_node; while (n) { root = rb_entry(n, struct nilfs_root, rb_node); if (cno < root->cno) { n = n->rb_left; } else if (cno > root->cno) { n = n->rb_right; } else { refcount_inc(&root->count); spin_unlock(&nilfs->ns_cptree_lock); return root; } } spin_unlock(&nilfs->ns_cptree_lock); return NULL; } struct nilfs_root * nilfs_find_or_create_root(struct the_nilfs *nilfs, __u64 cno) { struct rb_node **p, *parent; struct nilfs_root *root, *new; int err; root = nilfs_lookup_root(nilfs, cno); if (root) return root; new = kzalloc(sizeof(*root), GFP_KERNEL); if (!new) return NULL; spin_lock(&nilfs->ns_cptree_lock); p = &nilfs->ns_cptree.rb_node; parent = NULL; while (*p) { parent = *p; root = rb_entry(parent, struct nilfs_root, rb_node); if (cno < root->cno) { p = &(*p)->rb_left; } else if (cno > root->cno) { p = &(*p)->rb_right; } else { refcount_inc(&root->count); spin_unlock(&nilfs->ns_cptree_lock); kfree(new); return root; } } new->cno = cno; new->ifile = NULL; new->nilfs = nilfs; refcount_set(&new->count, 1); atomic64_set(&new->inodes_count, 0); atomic64_set(&new->blocks_count, 0); rb_link_node(&new->rb_node, parent, p); rb_insert_color(&new->rb_node, &nilfs->ns_cptree); spin_unlock(&nilfs->ns_cptree_lock); err = nilfs_sysfs_create_snapshot_group(new); if (err) { kfree(new); new = NULL; } return new; } void nilfs_put_root(struct nilfs_root *root) { struct the_nilfs *nilfs = root->nilfs; if (refcount_dec_and_lock(&root->count, &nilfs->ns_cptree_lock)) { rb_erase(&root->rb_node, &nilfs->ns_cptree); spin_unlock(&nilfs->ns_cptree_lock); nilfs_sysfs_delete_snapshot_group(root); iput(root->ifile); kfree(root); } } |
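/*
 * Illustrative userspace sketch (not NILFS kernel code): the idea behind
 * nilfs_sb2_bad_offset() above.  The byte offset of the second superblock,
 * derived from the device size, must not fall inside the segment area
 * described by the superblock parameters, otherwise that superblock copy is
 * rejected.  The constants and the example geometry below are assumptions
 * made up for this sketch, not values taken from the NILFS headers.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define EX_BLOCK_SIZE_BITS	10	/* assumed base block size of 1 KiB */
#define EX_SEG_MIN_BLOCKS	16	/* assumed minimum segment length */
#define EX_MAX_BLKSZ_SHIFT	6	/* assumed max log2(blocksize) - 10 */

static bool sb2_offset_is_bad(uint64_t offset, uint32_t log_block_size,
			      uint32_t blocks_per_segment, uint64_t nsegments)
{
	uint64_t index;

	/* Reject obviously bogus geometry before dividing. */
	if (blocks_per_segment < EX_SEG_MIN_BLOCKS ||
	    log_block_size > EX_MAX_BLKSZ_SHIFT)
		return true;

	/* Which segment would this byte offset land in? */
	index = offset >> (log_block_size + EX_BLOCK_SIZE_BITS);
	index /= blocks_per_segment;

	/* Inside the segment area -> invalid location for a spare superblock. */
	return index < nsegments;
}

int main(void)
{
	/* 1 GiB device, 4 KiB blocks (log=2), 2048-block segments, 128 segments */
	uint64_t devsize = 1ULL << 30;
	uint64_t sb2off = (devsize & ~0xfffULL) - 4096;	/* last 4 KiB of the device */

	printf("sb2 offset %llu bad? %d\n",
	       (unsigned long long)sb2off,
	       sb2_offset_is_bad(sb2off, 2, 2048, 128));
	/* prints 1: the segment area covers the whole device, so the offset collides */
	return 0;
}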
789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 | // SPDX-License-Identifier: GPL-2.0 #include "bcachefs.h" #include "btree_locking.h" #include "btree_types.h" static struct lock_class_key bch2_btree_node_lock_key; void bch2_btree_lock_init(struct btree_bkey_cached_common *b, enum six_lock_init_flags flags) { __six_lock_init(&b->lock, "b->c.lock", &bch2_btree_node_lock_key, flags); lockdep_set_notrack_class(&b->lock); } /* Btree node locking: */ struct six_lock_count bch2_btree_node_lock_counts(struct btree_trans *trans, struct btree_path *skip, struct btree_bkey_cached_common *b, unsigned level) { struct btree_path *path; struct six_lock_count ret; unsigned i; memset(&ret, 0, sizeof(ret)); if (IS_ERR_OR_NULL(b)) return ret; trans_for_each_path(trans, path, i) if (path != skip && &path->l[level].b->c == b) { int t = btree_node_locked_type(path, level); if (t != BTREE_NODE_UNLOCKED) ret.n[t]++; } return ret; } /* unlock */ void bch2_btree_node_unlock_write(struct btree_trans *trans, struct btree_path *path, struct btree *b) { bch2_btree_node_unlock_write_inlined(trans, path, b); } /* lock */ /* * @trans wants to lock @b with type @type */ struct trans_waiting_for_lock { struct btree_trans *trans; struct btree_bkey_cached_common *node_want; enum six_lock_type lock_want; /* for iterating over held locks :*/ u8 path_idx; u8 level; u64 lock_start_time; }; struct lock_graph { struct trans_waiting_for_lock g[8]; unsigned nr; }; static noinline void print_cycle(struct printbuf *out, struct lock_graph *g) { struct trans_waiting_for_lock *i; prt_printf(out, "Found lock cycle (%u entries):\n", g->nr); for (i = g->g; i < g->g + g->nr; i++) { struct task_struct *task = READ_ONCE(i->trans->locking_wait.task); if (!task) continue; bch2_btree_trans_to_text(out, i->trans); bch2_prt_task_backtrace(out, task, i == g->g ? 
5 : 1, GFP_NOWAIT); } } static noinline void print_chain(struct printbuf *out, struct lock_graph *g) { struct trans_waiting_for_lock *i; for (i = g->g; i != g->g + g->nr; i++) { struct task_struct *task = i->trans->locking_wait.task; if (i != g->g) prt_str(out, "<- "); prt_printf(out, "%u ", task ?task->pid : 0); } prt_newline(out); } static void lock_graph_up(struct lock_graph *g) { closure_put(&g->g[--g->nr].trans->ref); } static noinline void lock_graph_pop_all(struct lock_graph *g) { while (g->nr) lock_graph_up(g); } static noinline void lock_graph_pop_from(struct lock_graph *g, struct trans_waiting_for_lock *i) { while (g->g + g->nr > i) lock_graph_up(g); } static void __lock_graph_down(struct lock_graph *g, struct btree_trans *trans) { g->g[g->nr++] = (struct trans_waiting_for_lock) { .trans = trans, .node_want = trans->locking, .lock_want = trans->locking_wait.lock_want, }; } static void lock_graph_down(struct lock_graph *g, struct btree_trans *trans) { closure_get(&trans->ref); __lock_graph_down(g, trans); } static bool lock_graph_remove_non_waiters(struct lock_graph *g, struct trans_waiting_for_lock *from) { struct trans_waiting_for_lock *i; if (from->trans->locking != from->node_want) { lock_graph_pop_from(g, from); return true; } for (i = from + 1; i < g->g + g->nr; i++) if (i->trans->locking != i->node_want || i->trans->locking_wait.start_time != i[-1].lock_start_time) { lock_graph_pop_from(g, i); return true; } return false; } static void trace_would_deadlock(struct lock_graph *g, struct btree_trans *trans) { struct bch_fs *c = trans->c; count_event(c, trans_restart_would_deadlock); if (trace_trans_restart_would_deadlock_enabled()) { struct printbuf buf = PRINTBUF; buf.atomic++; print_cycle(&buf, g); trace_trans_restart_would_deadlock(trans, buf.buf); printbuf_exit(&buf); } } static int abort_lock(struct lock_graph *g, struct trans_waiting_for_lock *i) { if (i == g->g) { trace_would_deadlock(g, i->trans); return btree_trans_restart(i->trans, BCH_ERR_transaction_restart_would_deadlock); } else { i->trans->lock_must_abort = true; wake_up_process(i->trans->locking_wait.task); return 0; } } static int btree_trans_abort_preference(struct btree_trans *trans) { if (trans->lock_may_not_fail) return 0; if (trans->locking_wait.lock_want == SIX_LOCK_write) return 1; if (!trans->in_traverse_all) return 2; return 3; } static noinline int break_cycle(struct lock_graph *g, struct printbuf *cycle, struct trans_waiting_for_lock *from) { struct trans_waiting_for_lock *i, *abort = NULL; unsigned best = 0, pref; int ret; if (lock_graph_remove_non_waiters(g, from)) return 0; /* Only checking, for debugfs: */ if (cycle) { print_cycle(cycle, g); ret = -1; goto out; } for (i = from; i < g->g + g->nr; i++) { pref = btree_trans_abort_preference(i->trans); if (pref > best) { abort = i; best = pref; } } if (unlikely(!best)) { struct printbuf buf = PRINTBUF; buf.atomic++; prt_printf(&buf, bch2_fmt(g->g->trans->c, "cycle of nofail locks")); for (i = g->g; i < g->g + g->nr; i++) { struct btree_trans *trans = i->trans; bch2_btree_trans_to_text(&buf, trans); prt_printf(&buf, "backtrace:\n"); printbuf_indent_add(&buf, 2); bch2_prt_task_backtrace(&buf, trans->locking_wait.task, 2, GFP_NOWAIT); printbuf_indent_sub(&buf, 2); prt_newline(&buf); } bch2_print_string_as_lines_nonblocking(KERN_ERR, buf.buf); printbuf_exit(&buf); BUG(); } ret = abort_lock(g, abort); out: if (ret) lock_graph_pop_all(g); else lock_graph_pop_from(g, abort); return ret; } static int lock_graph_descend(struct lock_graph *g, struct 
btree_trans *trans, struct printbuf *cycle) { struct btree_trans *orig_trans = g->g->trans; struct trans_waiting_for_lock *i; for (i = g->g; i < g->g + g->nr; i++) if (i->trans == trans) { closure_put(&trans->ref); return break_cycle(g, cycle, i); } if (g->nr == ARRAY_SIZE(g->g)) { closure_put(&trans->ref); if (orig_trans->lock_may_not_fail) return 0; lock_graph_pop_all(g); if (cycle) return 0; trace_and_count(trans->c, trans_restart_would_deadlock_recursion_limit, trans, _RET_IP_); return btree_trans_restart(orig_trans, BCH_ERR_transaction_restart_deadlock_recursion_limit); } __lock_graph_down(g, trans); return 0; } static bool lock_type_conflicts(enum six_lock_type t1, enum six_lock_type t2) { return t1 + t2 > 1; } int bch2_check_for_deadlock(struct btree_trans *trans, struct printbuf *cycle) { struct lock_graph g; struct trans_waiting_for_lock *top; struct btree_bkey_cached_common *b; btree_path_idx_t path_idx; int ret = 0; g.nr = 0; if (trans->lock_must_abort && !trans->lock_may_not_fail) { if (cycle) return -1; trace_would_deadlock(&g, trans); return btree_trans_restart(trans, BCH_ERR_transaction_restart_would_deadlock); } lock_graph_down(&g, trans); /* trans->paths is rcu protected vs. freeing */ rcu_read_lock(); if (cycle) cycle->atomic++; next: if (!g.nr) goto out; top = &g.g[g.nr - 1]; struct btree_path *paths = rcu_dereference(top->trans->paths); if (!paths) goto up; unsigned long *paths_allocated = trans_paths_allocated(paths); trans_for_each_path_idx_from(paths_allocated, *trans_paths_nr(paths), path_idx, top->path_idx) { struct btree_path *path = paths + path_idx; if (!path->nodes_locked) continue; if (path_idx != top->path_idx) { top->path_idx = path_idx; top->level = 0; top->lock_start_time = 0; } for (; top->level < BTREE_MAX_DEPTH; top->level++, top->lock_start_time = 0) { int lock_held = btree_node_locked_type(path, top->level); if (lock_held == BTREE_NODE_UNLOCKED) continue; b = &READ_ONCE(path->l[top->level].b)->c; if (IS_ERR_OR_NULL(b)) { /* * If we get here, it means we raced with the * other thread updating its btree_path * structures - which means it can't be blocked * waiting on a lock: */ if (!lock_graph_remove_non_waiters(&g, g.g)) { /* * If lock_graph_remove_non_waiters() * didn't do anything, it must be * because we're being called by debugfs * checking for lock cycles, which * invokes us on btree_transactions that * aren't actually waiting on anything. 
* Just bail out: */ lock_graph_pop_all(&g); } goto next; } if (list_empty_careful(&b->lock.wait_list)) continue; raw_spin_lock(&b->lock.wait_lock); list_for_each_entry(trans, &b->lock.wait_list, locking_wait.list) { BUG_ON(b != trans->locking); if (top->lock_start_time && time_after_eq64(top->lock_start_time, trans->locking_wait.start_time)) continue; top->lock_start_time = trans->locking_wait.start_time; /* Don't check for self deadlock: */ if (trans == top->trans || !lock_type_conflicts(lock_held, trans->locking_wait.lock_want)) continue; closure_get(&trans->ref); raw_spin_unlock(&b->lock.wait_lock); ret = lock_graph_descend(&g, trans, cycle); if (ret) goto out; goto next; } raw_spin_unlock(&b->lock.wait_lock); } } up: if (g.nr > 1 && cycle) print_chain(cycle, &g); lock_graph_up(&g); goto next; out: if (cycle) --cycle->atomic; rcu_read_unlock(); return ret; } int bch2_six_check_for_deadlock(struct six_lock *lock, void *p) { struct btree_trans *trans = p; return bch2_check_for_deadlock(trans, NULL); } int __bch2_btree_node_lock_write(struct btree_trans *trans, struct btree_path *path, struct btree_bkey_cached_common *b, bool lock_may_not_fail) { int readers = bch2_btree_node_lock_counts(trans, NULL, b, b->level).n[SIX_LOCK_read]; int ret; /* * Must drop our read locks before calling six_lock_write() - * six_unlock() won't do wakeups until the reader count * goes to 0, and it's safe because we have the node intent * locked: */ six_lock_readers_add(&b->lock, -readers); ret = __btree_node_lock_nopath(trans, b, SIX_LOCK_write, lock_may_not_fail, _RET_IP_); six_lock_readers_add(&b->lock, readers); if (ret) mark_btree_node_locked_noreset(path, b->level, BTREE_NODE_INTENT_LOCKED); return ret; } void bch2_btree_node_lock_write_nofail(struct btree_trans *trans, struct btree_path *path, struct btree_bkey_cached_common *b) { int ret = __btree_node_lock_write(trans, path, b, true); BUG_ON(ret); } /* relock */ static inline bool btree_path_get_locks(struct btree_trans *trans, struct btree_path *path, bool upgrade, struct get_locks_fail *f) { unsigned l = path->level; int fail_idx = -1; do { if (!btree_path_node(path, l)) break; if (!(upgrade ? bch2_btree_node_upgrade(trans, path, l) : bch2_btree_node_relock(trans, path, l))) { fail_idx = l; if (f) { f->l = l; f->b = path->l[l].b; } } l++; } while (l < path->locks_want); /* * When we fail to get a lock, we have to ensure that any child nodes * can't be relocked so bch2_btree_path_traverse has to walk back up to * the node that we failed to relock: */ if (fail_idx >= 0) { __bch2_btree_path_unlock(trans, path); btree_path_set_dirty(path, BTREE_ITER_NEED_TRAVERSE); do { path->l[fail_idx].b = upgrade ? 
ERR_PTR(-BCH_ERR_no_btree_node_upgrade) : ERR_PTR(-BCH_ERR_no_btree_node_relock); --fail_idx; } while (fail_idx >= 0); } if (path->uptodate == BTREE_ITER_NEED_RELOCK) path->uptodate = BTREE_ITER_UPTODATE; return path->uptodate < BTREE_ITER_NEED_RELOCK; } bool __bch2_btree_node_relock(struct btree_trans *trans, struct btree_path *path, unsigned level, bool trace) { struct btree *b = btree_path_node(path, level); int want = __btree_lock_want(path, level); if (race_fault()) goto fail; if (six_relock_type(&b->c.lock, want, path->l[level].lock_seq) || (btree_node_lock_seq_matches(path, b, level) && btree_node_lock_increment(trans, &b->c, level, want))) { mark_btree_node_locked(trans, path, level, want); return true; } fail: if (trace && !trans->notrace_relock_fail) trace_and_count(trans->c, btree_path_relock_fail, trans, _RET_IP_, path, level); return false; } /* upgrade */ bool bch2_btree_node_upgrade(struct btree_trans *trans, struct btree_path *path, unsigned level) { struct btree *b = path->l[level].b; if (!is_btree_node(path, level)) return false; switch (btree_lock_want(path, level)) { case BTREE_NODE_UNLOCKED: BUG_ON(btree_node_locked(path, level)); return true; case BTREE_NODE_READ_LOCKED: BUG_ON(btree_node_intent_locked(path, level)); return bch2_btree_node_relock(trans, path, level); case BTREE_NODE_INTENT_LOCKED: break; case BTREE_NODE_WRITE_LOCKED: BUG(); } if (btree_node_intent_locked(path, level)) return true; if (race_fault()) return false; if (btree_node_locked(path, level) ? six_lock_tryupgrade(&b->c.lock) : six_relock_type(&b->c.lock, SIX_LOCK_intent, path->l[level].lock_seq)) goto success; if (btree_node_lock_seq_matches(path, b, level) && btree_node_lock_increment(trans, &b->c, level, BTREE_NODE_INTENT_LOCKED)) { btree_node_unlock(trans, path, level); goto success; } trace_and_count(trans->c, btree_path_upgrade_fail, trans, _RET_IP_, path, level); return false; success: mark_btree_node_locked_noreset(path, level, BTREE_NODE_INTENT_LOCKED); return true; } /* Btree path locking: */ /* * Only for btree_cache.c - only relocks intent locks */ int bch2_btree_path_relock_intent(struct btree_trans *trans, struct btree_path *path) { unsigned l; for (l = path->level; l < path->locks_want && btree_path_node(path, l); l++) { if (!bch2_btree_node_relock(trans, path, l)) { __bch2_btree_path_unlock(trans, path); btree_path_set_dirty(path, BTREE_ITER_NEED_TRAVERSE); trace_and_count(trans->c, trans_restart_relock_path_intent, trans, _RET_IP_, path); return btree_trans_restart(trans, BCH_ERR_transaction_restart_relock_path_intent); } } return 0; } __flatten bool bch2_btree_path_relock_norestart(struct btree_trans *trans, struct btree_path *path) { struct get_locks_fail f; bool ret = btree_path_get_locks(trans, path, false, &f); bch2_trans_verify_locks(trans); return ret; } int __bch2_btree_path_relock(struct btree_trans *trans, struct btree_path *path, unsigned long trace_ip) { if (!bch2_btree_path_relock_norestart(trans, path)) { trace_and_count(trans->c, trans_restart_relock_path, trans, trace_ip, path); return btree_trans_restart(trans, BCH_ERR_transaction_restart_relock_path); } return 0; } bool bch2_btree_path_upgrade_noupgrade_sibs(struct btree_trans *trans, struct btree_path *path, unsigned new_locks_want, struct get_locks_fail *f) { EBUG_ON(path->locks_want >= new_locks_want); path->locks_want = new_locks_want; bool ret = btree_path_get_locks(trans, path, true, f); bch2_trans_verify_locks(trans); return ret; } bool __bch2_btree_path_upgrade(struct btree_trans *trans, struct btree_path 
*path, unsigned new_locks_want, struct get_locks_fail *f) { bool ret = bch2_btree_path_upgrade_noupgrade_sibs(trans, path, new_locks_want, f); if (ret) goto out; /* * XXX: this is ugly - we'd prefer to not be mucking with other * iterators in the btree_trans here. * * On failure to upgrade the iterator, setting iter->locks_want and * calling get_locks() is sufficient to make bch2_btree_path_traverse() * get the locks we want on transaction restart. * * But if this iterator was a clone, on transaction restart what we did * to this iterator isn't going to be preserved. * * Possibly we could add an iterator field for the parent iterator when * an iterator is a copy - for now, we'll just upgrade any other * iterators with the same btree id. * * The code below used to be needed to ensure ancestor nodes get locked * before interior nodes - now that's handled by * bch2_btree_path_traverse_all(). */ if (!path->cached && !trans->in_traverse_all) { struct btree_path *linked; unsigned i; trans_for_each_path(trans, linked, i) if (linked != path && linked->cached == path->cached && linked->btree_id == path->btree_id && linked->locks_want < new_locks_want) { linked->locks_want = new_locks_want; btree_path_get_locks(trans, linked, true, NULL); } } out: bch2_trans_verify_locks(trans); return ret; } void __bch2_btree_path_downgrade(struct btree_trans *trans, struct btree_path *path, unsigned new_locks_want) { unsigned l, old_locks_want = path->locks_want; if (trans->restarted) return; EBUG_ON(path->locks_want < new_locks_want); path->locks_want = new_locks_want; while (path->nodes_locked && (l = btree_path_highest_level_locked(path)) >= path->locks_want) { if (l > path->level) { btree_node_unlock(trans, path, l); } else { if (btree_node_intent_locked(path, l)) { six_lock_downgrade(&path->l[l].b->c.lock); mark_btree_node_locked_noreset(path, l, BTREE_NODE_READ_LOCKED); } break; } } bch2_btree_path_verify_locks(path); trace_path_downgrade(trans, _RET_IP_, path, old_locks_want); } /* Btree transaction locking: */ void bch2_trans_downgrade(struct btree_trans *trans) { struct btree_path *path; unsigned i; if (trans->restarted) return; trans_for_each_path(trans, path, i) if (path->ref) bch2_btree_path_downgrade(trans, path); } static inline void __bch2_trans_unlock(struct btree_trans *trans) { struct btree_path *path; unsigned i; trans_for_each_path(trans, path, i) __bch2_btree_path_unlock(trans, path); } static noinline __cold int bch2_trans_relock_fail(struct btree_trans *trans, struct btree_path *path, struct get_locks_fail *f, bool trace) { if (!trace) goto out; if (trace_trans_restart_relock_enabled()) { struct printbuf buf = PRINTBUF; bch2_bpos_to_text(&buf, path->pos); prt_printf(&buf, " l=%u seq=%u node seq=", f->l, path->l[f->l].lock_seq); if (IS_ERR_OR_NULL(f->b)) { prt_str(&buf, bch2_err_str(PTR_ERR(f->b))); } else { prt_printf(&buf, "%u", f->b->c.lock.seq); struct six_lock_count c = bch2_btree_node_lock_counts(trans, NULL, &f->b->c, f->l); prt_printf(&buf, " self locked %u.%u.%u", c.n[0], c.n[1], c.n[2]); c = six_lock_counts(&f->b->c.lock); prt_printf(&buf, " total locked %u.%u.%u", c.n[0], c.n[1], c.n[2]); } trace_trans_restart_relock(trans, _RET_IP_, buf.buf); printbuf_exit(&buf); } count_event(trans->c, trans_restart_relock); out: __bch2_trans_unlock(trans); bch2_trans_verify_locks(trans); return btree_trans_restart(trans, BCH_ERR_transaction_restart_relock); } static inline int __bch2_trans_relock(struct btree_trans *trans, bool trace) { bch2_trans_verify_locks(trans); if 
(unlikely(trans->restarted)) return -((int) trans->restarted); if (unlikely(trans->locked)) goto out; struct btree_path *path; unsigned i; trans_for_each_path(trans, path, i) { struct get_locks_fail f; if (path->should_be_locked && !btree_path_get_locks(trans, path, false, &f)) return bch2_trans_relock_fail(trans, path, &f, trace); } trans_set_locked(trans, true); out: bch2_trans_verify_locks(trans); return 0; } int bch2_trans_relock(struct btree_trans *trans) { return __bch2_trans_relock(trans, true); } int bch2_trans_relock_notrace(struct btree_trans *trans) { return __bch2_trans_relock(trans, false); } void bch2_trans_unlock_noassert(struct btree_trans *trans) { __bch2_trans_unlock(trans); trans_set_unlocked(trans); } void bch2_trans_unlock(struct btree_trans *trans) { __bch2_trans_unlock(trans); trans_set_unlocked(trans); } void bch2_trans_unlock_long(struct btree_trans *trans) { bch2_trans_unlock(trans); bch2_trans_srcu_unlock(trans); } void bch2_trans_unlock_write(struct btree_trans *trans) { struct btree_path *path; unsigned i; trans_for_each_path(trans, path, i) for (unsigned l = 0; l < BTREE_MAX_DEPTH; l++) if (btree_node_write_locked(path, l)) bch2_btree_node_unlock_write(trans, path, path->l[l].b); } int __bch2_trans_mutex_lock(struct btree_trans *trans, struct mutex *lock) { int ret = drop_locks_do(trans, (mutex_lock(lock), 0)); if (ret) mutex_unlock(lock); return ret; } /* Debug */ #ifdef CONFIG_BCACHEFS_DEBUG void bch2_btree_path_verify_locks(struct btree_path *path) { /* * A path may be uptodate and yet have nothing locked if and only if * there is no node at path->level, which generally means we were * iterating over all nodes and got to the end of the btree */ BUG_ON(path->uptodate == BTREE_ITER_UPTODATE && btree_path_node(path, path->level) && !path->nodes_locked); if (!path->nodes_locked) return; for (unsigned l = 0; l < BTREE_MAX_DEPTH; l++) { int want = btree_lock_want(path, l); int have = btree_node_locked_type(path, l); BUG_ON(!is_btree_node(path, l) && have != BTREE_NODE_UNLOCKED); BUG_ON(is_btree_node(path, l) && (want == BTREE_NODE_UNLOCKED || have != BTREE_NODE_WRITE_LOCKED) && want != have); BUG_ON(btree_node_locked(path, l) && path->l[l].lock_seq != six_lock_seq(&path->l[l].b->c.lock)); } } static bool bch2_trans_locked(struct btree_trans *trans) { struct btree_path *path; unsigned i; trans_for_each_path(trans, path, i) if (path->nodes_locked) return true; return false; } void bch2_trans_verify_locks(struct btree_trans *trans) { if (!trans->locked) { BUG_ON(bch2_trans_locked(trans)); return; } struct btree_path *path; unsigned i; trans_for_each_path(trans, path, i) bch2_btree_path_verify_locks(path); } #endif |
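/*
 * Illustrative userspace sketch (not bcachefs code): the wait-for-graph idea
 * used by bch2_check_for_deadlock() above.  Each blocked transaction records
 * which transaction it is waiting on; walking that chain and meeting a
 * transaction that is already on the walk means a cycle exists and someone
 * must be restarted.  The fixed-size "seen" table and the single "waits_for"
 * pointer are simplifications invented for this example; the real code walks
 * btree paths and six-lock wait lists instead.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_TRANS 8

struct example_trans {
	int id;
	struct example_trans *waits_for;	/* NULL if not blocked */
};

/* Returns true if starting a wait from @t would close a cycle. */
static bool would_deadlock(struct example_trans *t)
{
	struct example_trans *seen[MAX_TRANS];
	unsigned nr = 0;

	for (; t; t = t->waits_for) {
		for (unsigned i = 0; i < nr; i++)
			if (seen[i] == t)
				return true;	/* already on the walk: cycle */
		if (nr == MAX_TRANS)
			break;			/* give up, like the recursion limit */
		seen[nr++] = t;
	}
	return false;
}

int main(void)
{
	struct example_trans a = { .id = 1 }, b = { .id = 2 }, c = { .id = 3 };

	a.waits_for = &b;
	b.waits_for = &c;
	printf("cycle? %d\n", would_deadlock(&a));	/* 0: just a chain */

	c.waits_for = &a;				/* close the loop */
	printf("cycle? %d\n", would_deadlock(&a));	/* 1: a -> b -> c -> a */
	return 0;
}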
| 10 2 1 1 1 1 4 3 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 | // SPDX-License-Identifier: GPL-2.0-only /* * Copyright (c) 2008-2009 Patrick McHardy <kaber@trash.net> * * Development of this code funded by Astaro AG (http://www.astaro.com/) */ #include <linux/unaligned.h> #include <linux/kernel.h> #include <linux/init.h> #include <linux/module.h> #include <linux/netlink.h> #include <linux/netfilter.h> #include <linux/netfilter/nf_tables.h> #include <net/netfilter/nf_tables_core.h> #include <net/netfilter/nf_tables.h> struct nft_byteorder { u8 sreg; u8 dreg; enum nft_byteorder_ops op:8; u8 len; u8 size; }; void nft_byteorder_eval(const struct nft_expr *expr, struct nft_regs *regs, const struct nft_pktinfo *pkt) { const struct nft_byteorder *priv = nft_expr_priv(expr); u32 *src = ®s->data[priv->sreg]; u32 *dst = ®s->data[priv->dreg]; u16 *s16, *d16; unsigned int i; s16 = (void *)src; d16 = (void *)dst; switch (priv->size) { case 8: { u64 *dst64 = (void *)dst; u64 src64; switch (priv->op) { case NFT_BYTEORDER_NTOH: for (i = 0; i < priv->len / 8; i++) { src64 = nft_reg_load64(&src[i]); nft_reg_store64(&dst64[i], be64_to_cpu((__force __be64)src64)); } break; case NFT_BYTEORDER_HTON: for (i = 0; i < priv->len / 8; i++) { src64 = (__force __u64) cpu_to_be64(nft_reg_load64(&src[i])); nft_reg_store64(&dst64[i], src64); } break; } break; } case 4: switch (priv->op) { case NFT_BYTEORDER_NTOH: for (i = 0; i < priv->len / 4; i++) dst[i] = ntohl((__force __be32)src[i]); break; case NFT_BYTEORDER_HTON: for (i = 0; i < priv->len / 4; i++) dst[i] = (__force __u32)htonl(src[i]); break; } break; case 2: switch (priv->op) { case NFT_BYTEORDER_NTOH: for (i = 0; i < priv->len / 2; i++) d16[i] = ntohs((__force __be16)s16[i]); break; case NFT_BYTEORDER_HTON: for (i = 0; i < priv->len / 2; i++) d16[i] = (__force __u16)htons(s16[i]); break; } break; } } static const struct nla_policy nft_byteorder_policy[NFTA_BYTEORDER_MAX + 1] = { [NFTA_BYTEORDER_SREG] = { .type = NLA_U32 }, [NFTA_BYTEORDER_DREG] = { .type = NLA_U32 }, [NFTA_BYTEORDER_OP] = NLA_POLICY_MAX(NLA_BE32, 255), [NFTA_BYTEORDER_LEN] = NLA_POLICY_MAX(NLA_BE32, 255), [NFTA_BYTEORDER_SIZE] = NLA_POLICY_MAX(NLA_BE32, 255), }; static int nft_byteorder_init(const struct nft_ctx *ctx, const struct nft_expr *expr, const struct nlattr * const tb[]) { struct nft_byteorder *priv = nft_expr_priv(expr); u32 size, len; int err; if (tb[NFTA_BYTEORDER_SREG] == NULL || tb[NFTA_BYTEORDER_DREG] == NULL || tb[NFTA_BYTEORDER_LEN] == NULL || tb[NFTA_BYTEORDER_SIZE] == NULL || tb[NFTA_BYTEORDER_OP] == NULL) return -EINVAL; priv->op = ntohl(nla_get_be32(tb[NFTA_BYTEORDER_OP])); switch (priv->op) { case NFT_BYTEORDER_NTOH: case NFT_BYTEORDER_HTON: break; default: return -EINVAL; } err = nft_parse_u32_check(tb[NFTA_BYTEORDER_SIZE], U8_MAX, &size); if (err < 0) return err; priv->size = size; switch (priv->size) { case 2: case 4: case 8: break; 
default: return -EINVAL; } err = nft_parse_u32_check(tb[NFTA_BYTEORDER_LEN], U8_MAX, &len); if (err < 0) return err; priv->len = len; err = nft_parse_register_load(ctx, tb[NFTA_BYTEORDER_SREG], &priv->sreg, priv->len); if (err < 0) return err; return nft_parse_register_store(ctx, tb[NFTA_BYTEORDER_DREG], &priv->dreg, NULL, NFT_DATA_VALUE, priv->len); } static int nft_byteorder_dump(struct sk_buff *skb, const struct nft_expr *expr, bool reset) { const struct nft_byteorder *priv = nft_expr_priv(expr); if (nft_dump_register(skb, NFTA_BYTEORDER_SREG, priv->sreg)) goto nla_put_failure; if (nft_dump_register(skb, NFTA_BYTEORDER_DREG, priv->dreg)) goto nla_put_failure; if (nla_put_be32(skb, NFTA_BYTEORDER_OP, htonl(priv->op))) goto nla_put_failure; if (nla_put_be32(skb, NFTA_BYTEORDER_LEN, htonl(priv->len))) goto nla_put_failure; if (nla_put_be32(skb, NFTA_BYTEORDER_SIZE, htonl(priv->size))) goto nla_put_failure; return 0; nla_put_failure: return -1; } static bool nft_byteorder_reduce(struct nft_regs_track *track, const struct nft_expr *expr) { struct nft_byteorder *priv = nft_expr_priv(expr); nft_reg_track_cancel(track, priv->dreg, priv->len); return false; } static const struct nft_expr_ops nft_byteorder_ops = { .type = &nft_byteorder_type, .size = NFT_EXPR_SIZE(sizeof(struct nft_byteorder)), .eval = nft_byteorder_eval, .init = nft_byteorder_init, .dump = nft_byteorder_dump, .reduce = nft_byteorder_reduce, }; struct nft_expr_type nft_byteorder_type __read_mostly = { .name = "byteorder", .ops = &nft_byteorder_ops, .policy = nft_byteorder_policy, .maxattr = NFTA_BYTEORDER_MAX, .owner = THIS_MODULE, }; |
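/*
 * Illustrative userspace sketch (not nftables code): what the byteorder
 * expression above does for the 4-byte element size - walk a region of
 * priv->len bytes and convert each 32-bit element between network and host
 * byte order.  The buffer contents and length here are made up for the
 * example; only standard ntohl()/htonl() are used.
 */
#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>

enum ex_byteorder_op { EX_NTOH, EX_HTON };

static void ex_byteorder32(uint32_t *dst, const uint32_t *src,
			   unsigned int len, enum ex_byteorder_op op)
{
	for (unsigned int i = 0; i < len / 4; i++)
		dst[i] = (op == EX_NTOH) ? ntohl(src[i]) : htonl(src[i]);
}

int main(void)
{
	uint32_t src[2] = { 0x11223344, 0xaabbccdd };
	uint32_t dst[2];

	ex_byteorder32(dst, src, sizeof(src), EX_HTON);
	printf("%08x %08x\n", dst[0], dst[1]);	/* byte-swapped on little-endian hosts */
	return 0;
}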
| 6 22 22 38 6 2 4 4 4 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 | /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _NETFILTER_NETDEV_H_ #define _NETFILTER_NETDEV_H_ #include <linux/netfilter.h> #include <linux/netdevice.h> #ifdef CONFIG_NETFILTER_INGRESS static inline bool nf_hook_ingress_active(const struct sk_buff *skb) { #ifdef CONFIG_JUMP_LABEL if (!static_key_false(&nf_hooks_needed[NFPROTO_NETDEV][NF_NETDEV_INGRESS])) return false; #endif return rcu_access_pointer(skb->dev->nf_hooks_ingress); } /* caller must hold rcu_read_lock */ static inline int nf_hook_ingress(struct sk_buff *skb) { struct nf_hook_entries *e = rcu_dereference(skb->dev->nf_hooks_ingress); struct nf_hook_state state; int ret; /* Must recheck the ingress hook head, in the event it became NULL * after the check in nf_hook_ingress_active evaluated to true. */ if (unlikely(!e)) return 0; nf_hook_state_init(&state, NF_NETDEV_INGRESS, NFPROTO_NETDEV, skb->dev, NULL, NULL, dev_net(skb->dev), NULL); ret = nf_hook_slow(skb, &state, e, 0); if (ret == 0) return -1; return ret; } #else /* CONFIG_NETFILTER_INGRESS */ static inline int nf_hook_ingress_active(struct sk_buff *skb) { return 0; } static inline int nf_hook_ingress(struct sk_buff *skb) { return 0; } #endif /* CONFIG_NETFILTER_INGRESS */ #ifdef CONFIG_NETFILTER_EGRESS static inline bool nf_hook_egress_active(void) { #ifdef CONFIG_JUMP_LABEL if (!static_key_false(&nf_hooks_needed[NFPROTO_NETDEV][NF_NETDEV_EGRESS])) return false; #endif return true; } /** * nf_hook_egress - classify packets before transmission * @skb: packet to be classified * @rc: result code which shall be returned by __dev_queue_xmit() on failure * @dev: netdev whose egress hooks shall be applied to @skb * * Caller must hold rcu_read_lock. * * On ingress, packets are classified first by tc, then by netfilter. * On egress, the order is reversed for symmetry. Conceptually, tc and * netfilter can be thought of as layers, with netfilter layered above tc: * When tc redirects a packet to another interface, netfilter is not applied * because the packet is on the tc layer. * * The nf_skip_egress flag controls whether netfilter is applied on egress. * It is updated by __netif_receive_skb_core() and __dev_queue_xmit() when the * packet passes through tc and netfilter. Because __dev_queue_xmit() may be * called recursively by tunnel drivers such as vxlan, the flag is reverted to * false after sch_handle_egress(). This ensures that netfilter is applied * both on the overlay and underlying network. * * Returns: @skb on success or %NULL if the packet was consumed or filtered. 
*/ static inline struct sk_buff *nf_hook_egress(struct sk_buff *skb, int *rc, struct net_device *dev) { struct nf_hook_entries *e; struct nf_hook_state state; int ret; #ifdef CONFIG_NETFILTER_SKIP_EGRESS if (skb->nf_skip_egress) return skb; #endif e = rcu_dereference_check(dev->nf_hooks_egress, rcu_read_lock_bh_held()); if (!e) return skb; nf_hook_state_init(&state, NF_NETDEV_EGRESS, NFPROTO_NETDEV, NULL, dev, NULL, dev_net(dev), NULL); /* nf assumes rcu_read_lock, not just read_lock_bh */ rcu_read_lock(); ret = nf_hook_slow(skb, &state, e, 0); rcu_read_unlock(); if (ret == 1) { return skb; } else if (ret < 0) { *rc = NET_XMIT_DROP; return NULL; } else { /* ret == 0 */ *rc = NET_XMIT_SUCCESS; return NULL; } } #else /* CONFIG_NETFILTER_EGRESS */ static inline bool nf_hook_egress_active(void) { return false; } static inline struct sk_buff *nf_hook_egress(struct sk_buff *skb, int *rc, struct net_device *dev) { return skb; } #endif /* CONFIG_NETFILTER_EGRESS */ static inline void nf_skip_egress(struct sk_buff *skb, bool skip) { #ifdef CONFIG_NETFILTER_SKIP_EGRESS skb->nf_skip_egress = skip; #endif } static inline void nf_hook_netdev_init(struct net_device *dev) { #ifdef CONFIG_NETFILTER_INGRESS RCU_INIT_POINTER(dev->nf_hooks_ingress, NULL); #endif #ifdef CONFIG_NETFILTER_EGRESS RCU_INIT_POINTER(dev->nf_hooks_egress, NULL); #endif } #endif /* _NETFILTER_NETDEV_H_ */ |
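/*
 * Illustrative userspace sketch (not kernel code): the calling convention of
 * nf_hook_egress() above.  A hook run can (a) accept the packet, in which
 * case the caller keeps transmitting it, (b) consume it, or (c) drop it; the
 * last two both return NULL and report the outcome through the *rc
 * out-parameter.  The verdict values, return codes and "struct pkt" are
 * invented for this example.
 */
#include <stddef.h>
#include <stdio.h>

enum ex_verdict { EX_DROP = -1, EX_CONSUMED = 0, EX_ACCEPT = 1 };
enum ex_xmit_rc { EX_XMIT_SUCCESS = 0, EX_XMIT_DROP = 1 };

struct pkt { int len; };

static struct pkt *ex_run_egress_hook(struct pkt *p, int *rc,
				      enum ex_verdict verdict)
{
	switch (verdict) {
	case EX_ACCEPT:
		return p;		/* caller continues transmission */
	case EX_DROP:
		*rc = EX_XMIT_DROP;
		return NULL;		/* caller reports a drop */
	case EX_CONSUMED:
	default:
		*rc = EX_XMIT_SUCCESS;
		return NULL;		/* hook took ownership of the packet */
	}
}

int main(void)
{
	struct pkt p = { .len = 64 };
	int rc = EX_XMIT_SUCCESS;

	if (!ex_run_egress_hook(&p, &rc, EX_DROP))
		printf("packet not transmitted, rc=%d\n", rc);
	return 0;
}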
| 25 31 30 10 23 23 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 | /* SPDX-License-Identifier: GPL-2.0 */ /* * Declarations of NET/ROM type objects. * * Jonathan Naylor G4KLX 9/4/95 */ #ifndef _NETROM_H #define _NETROM_H #include <linux/netrom.h> #include <linux/list.h> #include <linux/slab.h> #include <net/sock.h> #include <linux/refcount.h> #include <linux/seq_file.h> #include <net/ax25.h> #define NR_NETWORK_LEN 15 #define NR_TRANSPORT_LEN 5 #define NR_PROTO_IP 0x0C #define NR_PROTOEXT 0x00 #define NR_CONNREQ 0x01 #define NR_CONNACK 0x02 #define NR_DISCREQ 0x03 #define NR_DISCACK 0x04 #define NR_INFO 0x05 #define NR_INFOACK 0x06 #define NR_RESET 0x07 #define NR_CHOKE_FLAG 0x80 #define NR_NAK_FLAG 0x40 #define NR_MORE_FLAG 0x20 /* Define Link State constants. */ enum { NR_STATE_0, NR_STATE_1, NR_STATE_2, NR_STATE_3 }; #define NR_COND_ACK_PENDING 0x01 #define NR_COND_REJECT 0x02 #define NR_COND_PEER_RX_BUSY 0x04 #define NR_COND_OWN_RX_BUSY 0x08 #define NR_DEFAULT_T1 120000 /* Outstanding frames - 120 seconds */ #define NR_DEFAULT_T2 5000 /* Response delay - 5 seconds */ #define NR_DEFAULT_N2 3 /* Number of Retries - 3 */ #define NR_DEFAULT_T4 180000 /* Busy Delay - 180 seconds */ #define NR_DEFAULT_IDLE 0 /* No Activity Timeout - none */ #define NR_DEFAULT_WINDOW 4 /* Default Window Size - 4 */ #define NR_DEFAULT_OBS 6 /* Default Obsolescence Count - 6 */ #define NR_DEFAULT_QUAL 10 /* Default Neighbour Quality - 10 */ #define NR_DEFAULT_TTL 16 /* Default Time To Live - 16 */ #define NR_DEFAULT_ROUTING 1 /* Is routing enabled ? */ #define NR_DEFAULT_FAILS 2 /* Link fails until route fails */ #define NR_DEFAULT_RESET 0 /* Sent / accept reset cmds? 
*/ #define NR_MODULUS 256 #define NR_MAX_WINDOW_SIZE 127 /* Maximum Window Allowable - 127 */ #define NR_MAX_PACKET_SIZE 236 /* Maximum Packet Length - 236 */ struct nr_sock { struct sock sock; ax25_address user_addr, source_addr, dest_addr; struct net_device *device; unsigned char my_index, my_id; unsigned char your_index, your_id; unsigned char state, condition, bpqext, window; unsigned short vs, vr, va, vl; unsigned char n2, n2count; unsigned long t1, t2, t4, idle; unsigned short fraglen; struct timer_list t1timer; struct timer_list t2timer; struct timer_list t4timer; struct timer_list idletimer; struct sk_buff_head ack_queue; struct sk_buff_head reseq_queue; struct sk_buff_head frag_queue; }; #define nr_sk(sk) ((struct nr_sock *)(sk)) struct nr_neigh { struct hlist_node neigh_node; ax25_address callsign; ax25_digi *digipeat; ax25_cb *ax25; struct net_device *dev; unsigned char quality; unsigned char locked; unsigned short count; unsigned int number; unsigned char failed; refcount_t refcount; }; struct nr_route { unsigned char quality; unsigned char obs_count; struct nr_neigh *neighbour; }; struct nr_node { struct hlist_node node_node; ax25_address callsign; char mnemonic[7]; unsigned char which; unsigned char count; struct nr_route routes[3]; refcount_t refcount; spinlock_t node_lock; }; /********************************************************************* * nr_node & nr_neigh lists, refcounting and locking *********************************************************************/ #define nr_node_hold(__nr_node) \ refcount_inc(&((__nr_node)->refcount)) static __inline__ void nr_node_put(struct nr_node *nr_node) { if (refcount_dec_and_test(&nr_node->refcount)) { kfree(nr_node); } } #define nr_neigh_hold(__nr_neigh) \ refcount_inc(&((__nr_neigh)->refcount)) static __inline__ void nr_neigh_put(struct nr_neigh *nr_neigh) { if (refcount_dec_and_test(&nr_neigh->refcount)) { if (nr_neigh->ax25) ax25_cb_put(nr_neigh->ax25); kfree(nr_neigh->digipeat); kfree(nr_neigh); } } /* nr_node_lock and nr_node_unlock also hold/put the node's refcounter. 
*/ static __inline__ void nr_node_lock(struct nr_node *nr_node) { nr_node_hold(nr_node); spin_lock_bh(&nr_node->node_lock); } static __inline__ void nr_node_unlock(struct nr_node *nr_node) { spin_unlock_bh(&nr_node->node_lock); nr_node_put(nr_node); } #define nr_neigh_for_each(__nr_neigh, list) \ hlist_for_each_entry(__nr_neigh, list, neigh_node) #define nr_neigh_for_each_safe(__nr_neigh, node2, list) \ hlist_for_each_entry_safe(__nr_neigh, node2, list, neigh_node) #define nr_node_for_each(__nr_node, list) \ hlist_for_each_entry(__nr_node, list, node_node) #define nr_node_for_each_safe(__nr_node, node2, list) \ hlist_for_each_entry_safe(__nr_node, node2, list, node_node) /*********************************************************************/ /* af_netrom.c */ extern int sysctl_netrom_default_path_quality; extern int sysctl_netrom_obsolescence_count_initialiser; extern int sysctl_netrom_network_ttl_initialiser; extern int sysctl_netrom_transport_timeout; extern int sysctl_netrom_transport_maximum_tries; extern int sysctl_netrom_transport_acknowledge_delay; extern int sysctl_netrom_transport_busy_delay; extern int sysctl_netrom_transport_requested_window_size; extern int sysctl_netrom_transport_no_activity_timeout; extern int sysctl_netrom_routing_control; extern int sysctl_netrom_link_fails_count; extern int sysctl_netrom_reset_circuit; int nr_rx_frame(struct sk_buff *, struct net_device *); void nr_destroy_socket(struct sock *); /* nr_dev.c */ int nr_rx_ip(struct sk_buff *, struct net_device *); void nr_setup(struct net_device *); /* nr_in.c */ int nr_process_rx_frame(struct sock *, struct sk_buff *); /* nr_loopback.c */ void nr_loopback_init(void); void nr_loopback_clear(void); int nr_loopback_queue(struct sk_buff *); /* nr_out.c */ void nr_output(struct sock *, struct sk_buff *); void nr_send_nak_frame(struct sock *); void nr_kick(struct sock *); void nr_transmit_buffer(struct sock *, struct sk_buff *); void nr_establish_data_link(struct sock *); void nr_enquiry_response(struct sock *); void nr_check_iframes_acked(struct sock *, unsigned short); /* nr_route.c */ void nr_rt_device_down(struct net_device *); struct net_device *nr_dev_first(void); struct net_device *nr_dev_get(ax25_address *); int nr_rt_ioctl(unsigned int, void __user *); void nr_link_failed(ax25_cb *, int); int nr_route_frame(struct sk_buff *, ax25_cb *); extern const struct seq_operations nr_node_seqops; extern const struct seq_operations nr_neigh_seqops; void nr_rt_free(void); /* nr_subr.c */ void nr_clear_queues(struct sock *); void nr_frames_acked(struct sock *, unsigned short); void nr_requeue_frames(struct sock *); int nr_validate_nr(struct sock *, unsigned short); int nr_in_rx_window(struct sock *, unsigned short); void nr_write_internal(struct sock *, int); void __nr_transmit_reply(struct sk_buff *skb, int mine, unsigned char cmdflags); /* * This routine is called when a Connect Acknowledge with the Choke Flag * set is needed to refuse a connection. */ #define nr_transmit_refusal(skb, mine) \ do { \ __nr_transmit_reply((skb), (mine), NR_CONNACK | NR_CHOKE_FLAG); \ } while (0) /* * This routine is called when we don't have a circuit matching an incoming * NET/ROM packet. This is an G8PZT Xrouter extension. 
*/ #define nr_transmit_reset(skb, mine) \ do { \ __nr_transmit_reply((skb), (mine), NR_RESET); \ } while (0) void nr_disconnect(struct sock *, int); /* nr_timer.c */ void nr_init_timers(struct sock *sk); void nr_start_heartbeat(struct sock *); void nr_start_t1timer(struct sock *); void nr_start_t2timer(struct sock *); void nr_start_t4timer(struct sock *); void nr_start_idletimer(struct sock *); void nr_stop_heartbeat(struct sock *); void nr_stop_t1timer(struct sock *); void nr_stop_t2timer(struct sock *); void nr_stop_t4timer(struct sock *); void nr_stop_idletimer(struct sock *); int nr_t1timer_running(struct sock *); /* sysctl_net_netrom.c */ int nr_register_sysctl(void); void nr_unregister_sysctl(void); #endif |
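/*
 * Illustrative userspace sketch (not kernel code): the hold-then-lock /
 * unlock-then-put pattern used by nr_node_lock()/nr_node_unlock() above.
 * Taking the lock also takes a reference so the object cannot be freed while
 * it is locked; dropping the lock drops that reference, and the last put
 * frees the object.  A pthread mutex and a plain (non-atomic) counter stand
 * in for spinlock_t and refcount_t here, which is only safe because the demo
 * is single-threaded.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct ex_node {
	int refcount;			/* would be refcount_t in the kernel */
	pthread_mutex_t lock;
	char name[8];
};

static void ex_node_hold(struct ex_node *n) { n->refcount++; }

static void ex_node_put(struct ex_node *n)
{
	if (--n->refcount == 0) {
		printf("freeing %s\n", n->name);
		pthread_mutex_destroy(&n->lock);
		free(n);
	}
}

static void ex_node_lock(struct ex_node *n)
{
	ex_node_hold(n);		/* pin the node while it is locked */
	pthread_mutex_lock(&n->lock);
}

static void ex_node_unlock(struct ex_node *n)
{
	pthread_mutex_unlock(&n->lock);
	ex_node_put(n);			/* may free if this was the last ref */
}

int main(void)
{
	struct ex_node *n = calloc(1, sizeof(*n));

	n->refcount = 1;
	pthread_mutex_init(&n->lock, NULL);
	snprintf(n->name, sizeof(n->name), "node0");

	ex_node_lock(n);
	ex_node_unlock(n);
	ex_node_put(n);			/* drop the initial reference: frees */
	return 0;
}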
| 3 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 | /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _LINUX_STRING_HELPERS_H_ #define _LINUX_STRING_HELPERS_H_ #include <linux/bits.h> #include <linux/ctype.h> #include <linux/string_choices.h> #include <linux/string.h> #include <linux/types.h> struct device; struct file; struct task_struct; static inline bool string_is_terminated(const char *s, int len) { return memchr(s, '\0', len) ? true : false; } /* Descriptions of the types of units to print in */ enum string_size_units { STRING_UNITS_10, /* use powers of 10^3 (standard SI) */ STRING_UNITS_2, /* use binary powers of 2^10 */ STRING_UNITS_MASK = BIT(0), /* Modifiers */ STRING_UNITS_NO_SPACE = BIT(30), STRING_UNITS_NO_BYTES = BIT(31), }; int string_get_size(u64 size, u64 blk_size, const enum string_size_units units, char *buf, int len); int parse_int_array_user(const char __user *from, size_t count, int **array); #define UNESCAPE_SPACE BIT(0) #define UNESCAPE_OCTAL BIT(1) #define UNESCAPE_HEX BIT(2) #define UNESCAPE_SPECIAL BIT(3) #define UNESCAPE_ANY \ (UNESCAPE_SPACE | UNESCAPE_OCTAL | UNESCAPE_HEX | UNESCAPE_SPECIAL) #define UNESCAPE_ALL_MASK GENMASK(3, 0) int string_unescape(char *src, char *dst, size_t size, unsigned int flags); static inline int string_unescape_inplace(char *buf, unsigned int flags) { return string_unescape(buf, buf, 0, flags); } static inline int string_unescape_any(char *src, char *dst, size_t size) { return string_unescape(src, dst, size, UNESCAPE_ANY); } static inline int string_unescape_any_inplace(char *buf) { return string_unescape_any(buf, buf, 0); } #define ESCAPE_SPACE BIT(0) #define ESCAPE_SPECIAL BIT(1) #define ESCAPE_NULL BIT(2) #define ESCAPE_OCTAL BIT(3) #define ESCAPE_ANY \ (ESCAPE_SPACE | ESCAPE_OCTAL | ESCAPE_SPECIAL | ESCAPE_NULL) #define ESCAPE_NP BIT(4) #define ESCAPE_ANY_NP (ESCAPE_ANY | ESCAPE_NP) #define ESCAPE_HEX BIT(5) #define ESCAPE_NA BIT(6) #define ESCAPE_NAP BIT(7) #define ESCAPE_APPEND BIT(8) #define ESCAPE_ALL_MASK GENMASK(8, 0) int string_escape_mem(const char *src, size_t isz, char *dst, size_t osz, unsigned int flags, const char *only); static inline int string_escape_mem_any_np(const char *src, size_t isz, char *dst, size_t osz, const char *only) { return string_escape_mem(src, isz, dst, osz, ESCAPE_ANY_NP, only); } static inline int string_escape_str(const char *src, char *dst, size_t sz, unsigned int flags, const char *only) { return string_escape_mem(src, strlen(src), dst, sz, flags, only); } static inline int string_escape_str_any_np(const char *src, char *dst, size_t sz, const char *only) { return string_escape_str(src, dst, sz, ESCAPE_ANY_NP, only); } static inline void string_upper(char *dst, const char *src) { do { *dst++ = toupper(*src); } while (*src++); } static inline void string_lower(char *dst, const char *src) { do { *dst++ = tolower(*src); } while (*src++); } char *kstrdup_quotable(const char *src, gfp_t gfp); char *kstrdup_quotable_cmdline(struct task_struct *task, gfp_t gfp); char *kstrdup_quotable_file(struct file *file, gfp_t gfp); char *kstrdup_and_replace(const char *src, char old, char new, gfp_t gfp); char **kasprintf_strarray(gfp_t gfp, const char *prefix, size_t n); void 
kfree_strarray(char **array, size_t n); char **devm_kasprintf_strarray(struct device *dev, const char *prefix, size_t n); #endif
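/*
 * Illustrative userspace sketch (not kernel code): the copy-and-convert loop
 * used by string_upper()/string_lower() above, plus the memchr-based check
 * from string_is_terminated().  Standard <ctype.h> and <string.h> stand in
 * for the kernel headers; the ex_ names are invented for this example.
 */
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

static bool ex_string_is_terminated(const char *s, int len)
{
	return memchr(s, '\0', len) != NULL;
}

static void ex_string_upper(char *dst, const char *src)
{
	/* Copies the terminating NUL as well, exactly like the kernel helper. */
	do {
		*dst++ = toupper((unsigned char)*src);
	} while (*src++);
}

int main(void)
{
	char out[16];

	ex_string_upper(out, "nilfs2");
	printf("%s terminated=%d\n", out,
	       ex_string_is_terminated(out, sizeof(out)));	/* NILFS2 terminated=1 */
	return 0;
}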
| 7 1 1 1 3 1 1 1 2 2 2 1 1 1 1 2 1 45 45 10 10 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 | // SPDX-License-Identifier: GPL-2.0-or-later /* * Stateless NAT actions * * Copyright (c) 2007 Herbert Xu <herbert@gondor.apana.org.au> */ #include <linux/errno.h> #include <linux/init.h> #include <linux/kernel.h> #include <linux/module.h> #include <linux/netfilter.h> #include <linux/rtnetlink.h> #include <linux/skbuff.h> #include <linux/slab.h> #include <linux/spinlock.h> #include <linux/string.h> #include <linux/tc_act/tc_nat.h> #include <net/act_api.h> #include <net/pkt_cls.h> #include <net/icmp.h> #include <net/ip.h> #include <net/netlink.h> #include <net/tc_act/tc_nat.h> #include <net/tcp.h> #include <net/udp.h> #include <net/tc_wrapper.h> static struct tc_action_ops act_nat_ops; static const struct nla_policy nat_policy[TCA_NAT_MAX + 1] = { [TCA_NAT_PARMS] = { .len = sizeof(struct tc_nat) }, }; static int tcf_nat_init(struct net *net, struct nlattr *nla, struct nlattr *est, struct tc_action **a, struct tcf_proto *tp, u32 flags, struct netlink_ext_ack *extack) { struct tc_action_net *tn = net_generic(net, act_nat_ops.net_id); bool bind = flags & TCA_ACT_FLAGS_BIND; struct tcf_nat_parms *nparm, *oparm; struct nlattr *tb[TCA_NAT_MAX + 1]; struct tcf_chain *goto_ch = NULL; struct tc_nat *parm; int ret = 0, err; struct tcf_nat *p; u32 index; if (nla == NULL) return -EINVAL; err = nla_parse_nested_deprecated(tb, TCA_NAT_MAX, nla, nat_policy, NULL); if (err < 0) return err; if (tb[TCA_NAT_PARMS] == NULL) return -EINVAL; parm = nla_data(tb[TCA_NAT_PARMS]); index = parm->index; err = tcf_idr_check_alloc(tn, &index, a, bind); if (!err) { ret = tcf_idr_create_from_flags(tn, index, est, a, &act_nat_ops, bind, flags); if (ret) { tcf_idr_cleanup(tn, index); return ret; } ret = ACT_P_CREATED; } else if (err > 0) { if (bind) return ACT_P_BOUND; if (!(flags & TCA_ACT_FLAGS_REPLACE)) { tcf_idr_release(*a, bind); return -EEXIST; } } else { return err; } err = tcf_action_check_ctrlact(parm->action, tp, &goto_ch, extack); if (err < 0) goto release_idr; nparm = kzalloc(sizeof(*nparm), GFP_KERNEL); if (!nparm) { err = -ENOMEM; goto release_idr; } nparm->old_addr = parm->old_addr; nparm->new_addr = parm->new_addr; nparm->mask = parm->mask; 
nparm->flags = parm->flags; p = to_tcf_nat(*a); spin_lock_bh(&p->tcf_lock); goto_ch = tcf_action_set_ctrlact(*a, parm->action, goto_ch); oparm = rcu_replace_pointer(p->parms, nparm, lockdep_is_held(&p->tcf_lock)); spin_unlock_bh(&p->tcf_lock); if (goto_ch) tcf_chain_put_by_act(goto_ch); if (oparm) kfree_rcu(oparm, rcu); return ret; release_idr: tcf_idr_release(*a, bind); return err; } TC_INDIRECT_SCOPE int tcf_nat_act(struct sk_buff *skb, const struct tc_action *a, struct tcf_result *res) { struct tcf_nat *p = to_tcf_nat(a); struct tcf_nat_parms *parms; struct iphdr *iph; __be32 old_addr; __be32 new_addr; __be32 mask; __be32 addr; int egress; int action; int ihl; int noff; tcf_lastuse_update(&p->tcf_tm); tcf_action_update_bstats(&p->common, skb); action = READ_ONCE(p->tcf_action); parms = rcu_dereference_bh(p->parms); old_addr = parms->old_addr; new_addr = parms->new_addr; mask = parms->mask; egress = parms->flags & TCA_NAT_FLAG_EGRESS; if (unlikely(action == TC_ACT_SHOT)) goto drop; noff = skb_network_offset(skb); if (!pskb_may_pull(skb, sizeof(*iph) + noff)) goto drop; iph = ip_hdr(skb); if (egress) addr = iph->saddr; else addr = iph->daddr; if (!((old_addr ^ addr) & mask)) { if (skb_try_make_writable(skb, sizeof(*iph) + noff)) goto drop; new_addr &= mask; new_addr |= addr & ~mask; /* Rewrite IP header */ iph = ip_hdr(skb); if (egress) iph->saddr = new_addr; else iph->daddr = new_addr; csum_replace4(&iph->check, addr, new_addr); } else if ((iph->frag_off & htons(IP_OFFSET)) || iph->protocol != IPPROTO_ICMP) { goto out; } ihl = iph->ihl * 4; /* It would be nice to share code with stateful NAT. */ switch (iph->frag_off & htons(IP_OFFSET) ? 0 : iph->protocol) { case IPPROTO_TCP: { struct tcphdr *tcph; if (!pskb_may_pull(skb, ihl + sizeof(*tcph) + noff) || skb_try_make_writable(skb, ihl + sizeof(*tcph) + noff)) goto drop; tcph = (void *)(skb_network_header(skb) + ihl); inet_proto_csum_replace4(&tcph->check, skb, addr, new_addr, true); break; } case IPPROTO_UDP: { struct udphdr *udph; if (!pskb_may_pull(skb, ihl + sizeof(*udph) + noff) || skb_try_make_writable(skb, ihl + sizeof(*udph) + noff)) goto drop; udph = (void *)(skb_network_header(skb) + ihl); if (udph->check || skb->ip_summed == CHECKSUM_PARTIAL) { inet_proto_csum_replace4(&udph->check, skb, addr, new_addr, true); if (!udph->check) udph->check = CSUM_MANGLED_0; } break; } case IPPROTO_ICMP: { struct icmphdr *icmph; if (!pskb_may_pull(skb, ihl + sizeof(*icmph) + noff)) goto drop; icmph = (void *)(skb_network_header(skb) + ihl); if (!icmp_is_err(icmph->type)) break; if (!pskb_may_pull(skb, ihl + sizeof(*icmph) + sizeof(*iph) + noff)) goto drop; icmph = (void *)(skb_network_header(skb) + ihl); iph = (void *)(icmph + 1); if (egress) addr = iph->daddr; else addr = iph->saddr; if ((old_addr ^ addr) & mask) break; if (skb_try_make_writable(skb, ihl + sizeof(*icmph) + sizeof(*iph) + noff)) goto drop; icmph = (void *)(skb_network_header(skb) + ihl); iph = (void *)(icmph + 1); new_addr &= mask; new_addr |= addr & ~mask; /* XXX Fix up the inner checksums. 
*/ if (egress) iph->daddr = new_addr; else iph->saddr = new_addr; inet_proto_csum_replace4(&icmph->checksum, skb, addr, new_addr, false); break; } default: break; } out: return action; drop: tcf_action_inc_drop_qstats(&p->common); return TC_ACT_SHOT; } static int tcf_nat_dump(struct sk_buff *skb, struct tc_action *a, int bind, int ref) { unsigned char *b = skb_tail_pointer(skb); struct tcf_nat *p = to_tcf_nat(a); struct tc_nat opt = { .index = p->tcf_index, .refcnt = refcount_read(&p->tcf_refcnt) - ref, .bindcnt = atomic_read(&p->tcf_bindcnt) - bind, }; struct tcf_nat_parms *parms; struct tcf_t t; spin_lock_bh(&p->tcf_lock); opt.action = p->tcf_action; parms = rcu_dereference_protected(p->parms, lockdep_is_held(&p->tcf_lock)); opt.old_addr = parms->old_addr; opt.new_addr = parms->new_addr; opt.mask = parms->mask; opt.flags = parms->flags; if (nla_put(skb, TCA_NAT_PARMS, sizeof(opt), &opt)) goto nla_put_failure; tcf_tm_dump(&t, &p->tcf_tm); if (nla_put_64bit(skb, TCA_NAT_TM, sizeof(t), &t, TCA_NAT_PAD)) goto nla_put_failure; spin_unlock_bh(&p->tcf_lock); return skb->len; nla_put_failure: spin_unlock_bh(&p->tcf_lock); nlmsg_trim(skb, b); return -1; } static void tcf_nat_cleanup(struct tc_action *a) { struct tcf_nat *p = to_tcf_nat(a); struct tcf_nat_parms *parms; parms = rcu_dereference_protected(p->parms, 1); if (parms) kfree_rcu(parms, rcu); } static struct tc_action_ops act_nat_ops = { .kind = "nat", .id = TCA_ID_NAT, .owner = THIS_MODULE, .act = tcf_nat_act, .dump = tcf_nat_dump, .init = tcf_nat_init, .cleanup = tcf_nat_cleanup, .size = sizeof(struct tcf_nat), }; MODULE_ALIAS_NET_ACT("nat"); static __net_init int nat_init_net(struct net *net) { struct tc_action_net *tn = net_generic(net, act_nat_ops.net_id); return tc_action_net_init(net, tn, &act_nat_ops); } static void __net_exit nat_exit_net(struct list_head *net_list) { tc_action_net_exit(net_list, act_nat_ops.net_id); } static struct pernet_operations nat_net_ops = { .init = nat_init_net, .exit_batch = nat_exit_net, .id = &act_nat_ops.net_id, .size = sizeof(struct tc_action_net), }; MODULE_DESCRIPTION("Stateless NAT actions"); MODULE_LICENSE("GPL"); static int __init nat_init_module(void) { return tcf_register_action(&act_nat_ops, &nat_net_ops); } static void __exit nat_cleanup_module(void) { tcf_unregister_action(&act_nat_ops, &nat_net_ops); } module_init(nat_init_module); module_exit(nat_cleanup_module); |
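/*
 * A minimal userspace sketch (not kernel code) of the masked address
 * rewrite that tcf_nat_act() performs on matching packets: an address
 * matches when it agrees with old_addr on the bits selected by mask, and
 * the rewritten address takes the new_addr prefix while keeping the host
 * bits of the original. The addresses below are arbitrary examples. In the
 * kernel path the same substitution is followed by an incremental checksum
 * fixup (csum_replace4()/inet_proto_csum_replace4()) rather than a full
 * recompute.
 */
#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>

/* Returns 1 and writes the rewritten address if addr matches old_addr/mask. */
static int nat_rewrite(uint32_t addr, uint32_t old_addr, uint32_t new_addr,
		       uint32_t mask, uint32_t *out)
{
	if ((old_addr ^ addr) & mask)			/* prefix does not match */
		return 0;
	*out = (new_addr & mask) | (addr & ~mask);	/* swap prefix, keep host bits */
	return 1;
}

static uint32_t pton(const char *s)
{
	struct in_addr in;

	inet_pton(AF_INET, s, &in);
	return in.s_addr;	/* network byte order, like the __be32 values above */
}

int main(void)
{
	char buf[INET_ADDRSTRLEN];
	uint32_t out;

	/* 192.0.2.17 falls inside 192.0.2.0/24 and is rewritten. */
	if (nat_rewrite(pton("192.0.2.17"), pton("192.0.2.0"),
			pton("198.51.100.0"), pton("255.255.255.0"), &out))
		printf("rewritten to %s\n",
		       inet_ntop(AF_INET, &out, buf, sizeof(buf)));	/* 198.51.100.17 */

	/* 203.0.113.5 is outside the old prefix and is left alone. */
	if (!nat_rewrite(pton("203.0.113.5"), pton("192.0.2.0"),
			 pton("198.51.100.0"), pton("255.255.255.0"), &out))
		printf("203.0.113.5 not rewritten\n");

	return 0;
}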
// SPDX-License-Identifier: GPL-2.0-only
/*
 * xsave/xrstor support.
 *
 * Author: Suresh Siddha <suresh.b.siddha@intel.com>
 */
#include <linux/bitops.h>
#include <linux/compat.h>
#include <linux/cpu.h>
#include <linux/mman.h>
#include <linux/nospec.h>
#include <linux/pkeys.h>
#include <linux/seq_file.h>
#include <linux/proc_fs.h>
#include <linux/vmalloc.h>
#include <linux/coredump.h>

#include <asm/fpu/api.h>
#include <asm/fpu/regset.h>
#include <asm/fpu/signal.h>
#include <asm/fpu/xcr.h>

#include <asm/cpuid.h>
#include <asm/tlbflush.h>
#include <asm/prctl.h>
#include <asm/elf.h>

#include <uapi/asm/elf.h>

#include "context.h"
#include "internal.h"
#include "legacy.h"
#include "xstate.h"

#define for_each_extended_xfeature(bit, mask)				\
	(bit) = FIRST_EXTENDED_XFEATURE;				\
	for_each_set_bit_from(bit, (unsigned long *)&(mask), 8 * sizeof(mask))

/*
 * Although we spell it out in here, the Processor Trace
 * xfeature is completely unused. We use other mechanisms
 * to save/restore PT state in Linux.
*/ static const char *xfeature_names[] = { "x87 floating point registers", "SSE registers", "AVX registers", "MPX bounds registers", "MPX CSR", "AVX-512 opmask", "AVX-512 Hi256", "AVX-512 ZMM_Hi256", "Processor Trace (unused)", "Protection Keys User registers", "PASID state", "Control-flow User registers", "Control-flow Kernel registers (unused)", "unknown xstate feature", "unknown xstate feature", "unknown xstate feature", "unknown xstate feature", "AMX Tile config", "AMX Tile data", "unknown xstate feature", }; static unsigned short xsave_cpuid_features[] __initdata = { [XFEATURE_FP] = X86_FEATURE_FPU, [XFEATURE_SSE] = X86_FEATURE_XMM, [XFEATURE_YMM] = X86_FEATURE_AVX, [XFEATURE_BNDREGS] = X86_FEATURE_MPX, [XFEATURE_BNDCSR] = X86_FEATURE_MPX, [XFEATURE_OPMASK] = X86_FEATURE_AVX512F, [XFEATURE_ZMM_Hi256] = X86_FEATURE_AVX512F, [XFEATURE_Hi16_ZMM] = X86_FEATURE_AVX512F, [XFEATURE_PT_UNIMPLEMENTED_SO_FAR] = X86_FEATURE_INTEL_PT, [XFEATURE_PKRU] = X86_FEATURE_OSPKE, [XFEATURE_PASID] = X86_FEATURE_ENQCMD, [XFEATURE_CET_USER] = X86_FEATURE_SHSTK, [XFEATURE_XTILE_CFG] = X86_FEATURE_AMX_TILE, [XFEATURE_XTILE_DATA] = X86_FEATURE_AMX_TILE, }; static unsigned int xstate_offsets[XFEATURE_MAX] __ro_after_init = { [ 0 ... XFEATURE_MAX - 1] = -1}; static unsigned int xstate_sizes[XFEATURE_MAX] __ro_after_init = { [ 0 ... XFEATURE_MAX - 1] = -1}; static unsigned int xstate_flags[XFEATURE_MAX] __ro_after_init; #define XSTATE_FLAG_SUPERVISOR BIT(0) #define XSTATE_FLAG_ALIGNED64 BIT(1) /* * Return whether the system supports a given xfeature. * * Also return the name of the (most advanced) feature that the caller requested: */ int cpu_has_xfeatures(u64 xfeatures_needed, const char **feature_name) { u64 xfeatures_missing = xfeatures_needed & ~fpu_kernel_cfg.max_features; if (unlikely(feature_name)) { long xfeature_idx, max_idx; u64 xfeatures_print; /* * So we use FLS here to be able to print the most advanced * feature that was requested but is missing. So if a driver * asks about "XFEATURE_MASK_SSE | XFEATURE_MASK_YMM" we'll print the * missing AVX feature - this is the most informative message * to users: */ if (xfeatures_missing) xfeatures_print = xfeatures_missing; else xfeatures_print = xfeatures_needed; xfeature_idx = fls64(xfeatures_print)-1; max_idx = ARRAY_SIZE(xfeature_names)-1; xfeature_idx = min(xfeature_idx, max_idx); *feature_name = xfeature_names[xfeature_idx]; } if (xfeatures_missing) return 0; return 1; } EXPORT_SYMBOL_GPL(cpu_has_xfeatures); static bool xfeature_is_aligned64(int xfeature_nr) { return xstate_flags[xfeature_nr] & XSTATE_FLAG_ALIGNED64; } static bool xfeature_is_supervisor(int xfeature_nr) { return xstate_flags[xfeature_nr] & XSTATE_FLAG_SUPERVISOR; } static unsigned int xfeature_get_offset(u64 xcomp_bv, int xfeature) { unsigned int offs, i; /* * Non-compacted format and legacy features use the cached fixed * offsets. */ if (!cpu_feature_enabled(X86_FEATURE_XCOMPACTED) || xfeature <= XFEATURE_SSE) return xstate_offsets[xfeature]; /* * Compacted format offsets depend on the actual content of the * compacted xsave area which is determined by the xcomp_bv header * field. */ offs = FXSAVE_SIZE + XSAVE_HDR_SIZE; for_each_extended_xfeature(i, xcomp_bv) { if (xfeature_is_aligned64(i)) offs = ALIGN(offs, 64); if (i == xfeature) break; offs += xstate_sizes[i]; } return offs; } /* * Enable the extended processor state save/restore feature. * Called once per CPU onlining. 
*/ void fpu__init_cpu_xstate(void) { if (!boot_cpu_has(X86_FEATURE_XSAVE) || !fpu_kernel_cfg.max_features) return; cr4_set_bits(X86_CR4_OSXSAVE); /* * Must happen after CR4 setup and before xsetbv() to allow KVM * lazy passthrough. Write independent of the dynamic state static * key as that does not work on the boot CPU. This also ensures * that any stale state is wiped out from XFD. Reset the per CPU * xfd cache too. */ if (cpu_feature_enabled(X86_FEATURE_XFD)) xfd_set_state(init_fpstate.xfd); /* * XCR_XFEATURE_ENABLED_MASK (aka. XCR0) sets user features * managed by XSAVE{C, OPT, S} and XRSTOR{S}. Only XSAVE user * states can be set here. */ xsetbv(XCR_XFEATURE_ENABLED_MASK, fpu_user_cfg.max_features); /* * MSR_IA32_XSS sets supervisor states managed by XSAVES. */ if (boot_cpu_has(X86_FEATURE_XSAVES)) { wrmsrl(MSR_IA32_XSS, xfeatures_mask_supervisor() | xfeatures_mask_independent()); } } static bool xfeature_enabled(enum xfeature xfeature) { return fpu_kernel_cfg.max_features & BIT_ULL(xfeature); } /* * Record the offsets and sizes of various xstates contained * in the XSAVE state memory layout. */ static void __init setup_xstate_cache(void) { u32 eax, ebx, ecx, edx, i; /* start at the beginning of the "extended state" */ unsigned int last_good_offset = offsetof(struct xregs_state, extended_state_area); /* * The FP xstates and SSE xstates are legacy states. They are always * in the fixed offsets in the xsave area in either compacted form * or standard form. */ xstate_offsets[XFEATURE_FP] = 0; xstate_sizes[XFEATURE_FP] = offsetof(struct fxregs_state, xmm_space); xstate_offsets[XFEATURE_SSE] = xstate_sizes[XFEATURE_FP]; xstate_sizes[XFEATURE_SSE] = sizeof_field(struct fxregs_state, xmm_space); for_each_extended_xfeature(i, fpu_kernel_cfg.max_features) { cpuid_count(CPUID_LEAF_XSTATE, i, &eax, &ebx, &ecx, &edx); xstate_sizes[i] = eax; xstate_flags[i] = ecx; /* * If an xfeature is supervisor state, the offset in EBX is * invalid, leave it to -1. */ if (xfeature_is_supervisor(i)) continue; xstate_offsets[i] = ebx; /* * In our xstate size checks, we assume that the highest-numbered * xstate feature has the highest offset in the buffer. Ensure * it does. */ WARN_ONCE(last_good_offset > xstate_offsets[i], "x86/fpu: misordered xstate at %d\n", last_good_offset); last_good_offset = xstate_offsets[i]; } } static void __init print_xstate_feature(u64 xstate_mask) { const char *feature_name; if (cpu_has_xfeatures(xstate_mask, &feature_name)) pr_info("x86/fpu: Supporting XSAVE feature 0x%03Lx: '%s'\n", xstate_mask, feature_name); } /* * Print out all the supported xstate features: */ static void __init print_xstate_features(void) { print_xstate_feature(XFEATURE_MASK_FP); print_xstate_feature(XFEATURE_MASK_SSE); print_xstate_feature(XFEATURE_MASK_YMM); print_xstate_feature(XFEATURE_MASK_BNDREGS); print_xstate_feature(XFEATURE_MASK_BNDCSR); print_xstate_feature(XFEATURE_MASK_OPMASK); print_xstate_feature(XFEATURE_MASK_ZMM_Hi256); print_xstate_feature(XFEATURE_MASK_Hi16_ZMM); print_xstate_feature(XFEATURE_MASK_PKRU); print_xstate_feature(XFEATURE_MASK_PASID); print_xstate_feature(XFEATURE_MASK_CET_USER); print_xstate_feature(XFEATURE_MASK_XTILE_CFG); print_xstate_feature(XFEATURE_MASK_XTILE_DATA); } /* * This check is important because it is easy to get XSTATE_* * confused with XSTATE_BIT_*. 
*/ #define CHECK_XFEATURE(nr) do { \ WARN_ON(nr < FIRST_EXTENDED_XFEATURE); \ WARN_ON(nr >= XFEATURE_MAX); \ } while (0) /* * Print out xstate component offsets and sizes */ static void __init print_xstate_offset_size(void) { int i; for_each_extended_xfeature(i, fpu_kernel_cfg.max_features) { pr_info("x86/fpu: xstate_offset[%d]: %4d, xstate_sizes[%d]: %4d\n", i, xfeature_get_offset(fpu_kernel_cfg.max_features, i), i, xstate_sizes[i]); } } /* * This function is called only during boot time when x86 caps are not set * up and alternative can not be used yet. */ static __init void os_xrstor_booting(struct xregs_state *xstate) { u64 mask = fpu_kernel_cfg.max_features & XFEATURE_MASK_FPSTATE; u32 lmask = mask; u32 hmask = mask >> 32; int err; if (cpu_feature_enabled(X86_FEATURE_XSAVES)) XSTATE_OP(XRSTORS, xstate, lmask, hmask, err); else XSTATE_OP(XRSTOR, xstate, lmask, hmask, err); /* * We should never fault when copying from a kernel buffer, and the FPU * state we set at boot time should be valid. */ WARN_ON_FPU(err); } /* * All supported features have either init state all zeros or are * handled in setup_init_fpu() individually. This is an explicit * feature list and does not use XFEATURE_MASK*SUPPORTED to catch * newly added supported features at build time and make people * actually look at the init state for the new feature. */ #define XFEATURES_INIT_FPSTATE_HANDLED \ (XFEATURE_MASK_FP | \ XFEATURE_MASK_SSE | \ XFEATURE_MASK_YMM | \ XFEATURE_MASK_OPMASK | \ XFEATURE_MASK_ZMM_Hi256 | \ XFEATURE_MASK_Hi16_ZMM | \ XFEATURE_MASK_PKRU | \ XFEATURE_MASK_BNDREGS | \ XFEATURE_MASK_BNDCSR | \ XFEATURE_MASK_PASID | \ XFEATURE_MASK_CET_USER | \ XFEATURE_MASK_XTILE) /* * setup the xstate image representing the init state */ static void __init setup_init_fpu_buf(void) { BUILD_BUG_ON((XFEATURE_MASK_USER_SUPPORTED | XFEATURE_MASK_SUPERVISOR_SUPPORTED) != XFEATURES_INIT_FPSTATE_HANDLED); if (!boot_cpu_has(X86_FEATURE_XSAVE)) return; print_xstate_features(); xstate_init_xcomp_bv(&init_fpstate.regs.xsave, init_fpstate.xfeatures); /* * Init all the features state with header.xfeatures being 0x0 */ os_xrstor_booting(&init_fpstate.regs.xsave); /* * All components are now in init state. Read the state back so * that init_fpstate contains all non-zero init state. This only * works with XSAVE, but not with XSAVEOPT and XSAVEC/S because * those use the init optimization which skips writing data for * components in init state. * * XSAVE could be used, but that would require to reshuffle the * data when XSAVEC/S is available because XSAVEC/S uses xstate * compaction. But doing so is a pointless exercise because most * components have an all zeros init state except for the legacy * ones (FP and SSE). Those can be saved with FXSAVE into the * legacy area. Adding new features requires to ensure that init * state is all zeroes or if not to add the necessary handling * here. 
*/ fxsave(&init_fpstate.regs.fxsave); } int xfeature_size(int xfeature_nr) { u32 eax, ebx, ecx, edx; CHECK_XFEATURE(xfeature_nr); cpuid_count(CPUID_LEAF_XSTATE, xfeature_nr, &eax, &ebx, &ecx, &edx); return eax; } /* Validate an xstate header supplied by userspace (ptrace or sigreturn) */ static int validate_user_xstate_header(const struct xstate_header *hdr, struct fpstate *fpstate) { /* No unknown or supervisor features may be set */ if (hdr->xfeatures & ~fpstate->user_xfeatures) return -EINVAL; /* Userspace must use the uncompacted format */ if (hdr->xcomp_bv) return -EINVAL; /* * If 'reserved' is shrunken to add a new field, make sure to validate * that new field here! */ BUILD_BUG_ON(sizeof(hdr->reserved) != 48); /* No reserved bits may be set */ if (memchr_inv(hdr->reserved, 0, sizeof(hdr->reserved))) return -EINVAL; return 0; } static void __init __xstate_dump_leaves(void) { int i; u32 eax, ebx, ecx, edx; static int should_dump = 1; if (!should_dump) return; should_dump = 0; /* * Dump out a few leaves past the ones that we support * just in case there are some goodies up there */ for (i = 0; i < XFEATURE_MAX + 10; i++) { cpuid_count(CPUID_LEAF_XSTATE, i, &eax, &ebx, &ecx, &edx); pr_warn("CPUID[%02x, %02x]: eax=%08x ebx=%08x ecx=%08x edx=%08x\n", CPUID_LEAF_XSTATE, i, eax, ebx, ecx, edx); } } #define XSTATE_WARN_ON(x, fmt, ...) do { \ if (WARN_ONCE(x, "XSAVE consistency problem: " fmt, ##__VA_ARGS__)) { \ __xstate_dump_leaves(); \ } \ } while (0) #define XCHECK_SZ(sz, nr, __struct) ({ \ if (WARN_ONCE(sz != sizeof(__struct), \ "[%s]: struct is %zu bytes, cpu state %d bytes\n", \ xfeature_names[nr], sizeof(__struct), sz)) { \ __xstate_dump_leaves(); \ } \ true; \ }) /** * check_xtile_data_against_struct - Check tile data state size. * * Calculate the state size by multiplying the single tile size which is * recorded in a C struct, and the number of tiles that the CPU informs. * Compare the provided size with the calculation. * * @size: The tile data state size * * Returns: 0 on success, -EINVAL on mismatch. */ static int __init check_xtile_data_against_struct(int size) { u32 max_palid, palid, state_size; u32 eax, ebx, ecx, edx; u16 max_tile; /* * Check the maximum palette id: * eax: the highest numbered palette subleaf. */ cpuid_count(CPUID_LEAF_TILE, 0, &max_palid, &ebx, &ecx, &edx); /* * Cross-check each tile size and find the maximum number of * supported tiles. */ for (palid = 1, max_tile = 0; palid <= max_palid; palid++) { u16 tile_size, max; /* * Check the tile size info: * eax[31:16]: bytes per title * ebx[31:16]: the max names (or max number of tiles) */ cpuid_count(CPUID_LEAF_TILE, palid, &eax, &ebx, &edx, &edx); tile_size = eax >> 16; max = ebx >> 16; if (tile_size != sizeof(struct xtile_data)) { pr_err("%s: struct is %zu bytes, cpu xtile %d bytes\n", __stringify(XFEATURE_XTILE_DATA), sizeof(struct xtile_data), tile_size); __xstate_dump_leaves(); return -EINVAL; } if (max > max_tile) max_tile = max; } state_size = sizeof(struct xtile_data) * max_tile; if (size != state_size) { pr_err("%s: calculated size is %u bytes, cpu state %d bytes\n", __stringify(XFEATURE_XTILE_DATA), state_size, size); __xstate_dump_leaves(); return -EINVAL; } return 0; } /* * We have a C struct for each 'xstate'. We need to ensure * that our software representation matches what the CPU * tells us about the state's size. */ static bool __init check_xstate_against_struct(int nr) { /* * Ask the CPU for the size of the state. 
*/ int sz = xfeature_size(nr); /* * Match each CPU state with the corresponding software * structure. */ switch (nr) { case XFEATURE_YMM: return XCHECK_SZ(sz, nr, struct ymmh_struct); case XFEATURE_BNDREGS: return XCHECK_SZ(sz, nr, struct mpx_bndreg_state); case XFEATURE_BNDCSR: return XCHECK_SZ(sz, nr, struct mpx_bndcsr_state); case XFEATURE_OPMASK: return XCHECK_SZ(sz, nr, struct avx_512_opmask_state); case XFEATURE_ZMM_Hi256: return XCHECK_SZ(sz, nr, struct avx_512_zmm_uppers_state); case XFEATURE_Hi16_ZMM: return XCHECK_SZ(sz, nr, struct avx_512_hi16_state); case XFEATURE_PKRU: return XCHECK_SZ(sz, nr, struct pkru_state); case XFEATURE_PASID: return XCHECK_SZ(sz, nr, struct ia32_pasid_state); case XFEATURE_XTILE_CFG: return XCHECK_SZ(sz, nr, struct xtile_cfg); case XFEATURE_CET_USER: return XCHECK_SZ(sz, nr, struct cet_user_state); case XFEATURE_XTILE_DATA: check_xtile_data_against_struct(sz); return true; default: XSTATE_WARN_ON(1, "No structure for xstate: %d\n", nr); return false; } return true; } static unsigned int xstate_calculate_size(u64 xfeatures, bool compacted) { unsigned int topmost = fls64(xfeatures) - 1; unsigned int offset = xstate_offsets[topmost]; if (topmost <= XFEATURE_SSE) return sizeof(struct xregs_state); if (compacted) offset = xfeature_get_offset(xfeatures, topmost); return offset + xstate_sizes[topmost]; } /* * This essentially double-checks what the cpu told us about * how large the XSAVE buffer needs to be. We are recalculating * it to be safe. * * Independent XSAVE features allocate their own buffers and are not * covered by these checks. Only the size of the buffer for task->fpu * is checked here. */ static bool __init paranoid_xstate_size_valid(unsigned int kernel_size) { bool compacted = cpu_feature_enabled(X86_FEATURE_XCOMPACTED); bool xsaves = cpu_feature_enabled(X86_FEATURE_XSAVES); unsigned int size = FXSAVE_SIZE + XSAVE_HDR_SIZE; int i; for_each_extended_xfeature(i, fpu_kernel_cfg.max_features) { if (!check_xstate_against_struct(i)) return false; /* * Supervisor state components can be managed only by * XSAVES. */ if (!xsaves && xfeature_is_supervisor(i)) { XSTATE_WARN_ON(1, "Got supervisor feature %d, but XSAVES not advertised\n", i); return false; } } size = xstate_calculate_size(fpu_kernel_cfg.max_features, compacted); XSTATE_WARN_ON(size != kernel_size, "size %u != kernel_size %u\n", size, kernel_size); return size == kernel_size; } /* * Get total size of enabled xstates in XCR0 | IA32_XSS. * * Note the SDM's wording here. "sub-function 0" only enumerates * the size of the *user* states. If we use it to size a buffer * that we use 'XSAVES' on, we could potentially overflow the * buffer because 'XSAVES' saves system states too. * * This also takes compaction into account. So this works for * XSAVEC as well. */ static unsigned int __init get_compacted_size(void) { unsigned int eax, ebx, ecx, edx; /* * - CPUID function 0DH, sub-function 1: * EBX enumerates the size (in bytes) required by * the XSAVES instruction for an XSAVE area * containing all the state components * corresponding to bits currently set in * XCR0 | IA32_XSS. * * When XSAVES is not available but XSAVEC is (virt), then there * are no supervisor states, but XSAVEC still uses compacted * format. */ cpuid_count(CPUID_LEAF_XSTATE, 1, &eax, &ebx, &ecx, &edx); return ebx; } /* * Get the total size of the enabled xstates without the independent supervisor * features. 
*/ static unsigned int __init get_xsave_compacted_size(void) { u64 mask = xfeatures_mask_independent(); unsigned int size; if (!mask) return get_compacted_size(); /* Disable independent features. */ wrmsrl(MSR_IA32_XSS, xfeatures_mask_supervisor()); /* * Ask the hardware what size is required of the buffer. * This is the size required for the task->fpu buffer. */ size = get_compacted_size(); /* Re-enable independent features so XSAVES will work on them again. */ wrmsrl(MSR_IA32_XSS, xfeatures_mask_supervisor() | mask); return size; } static unsigned int __init get_xsave_size_user(void) { unsigned int eax, ebx, ecx, edx; /* * - CPUID function 0DH, sub-function 0: * EBX enumerates the size (in bytes) required by * the XSAVE instruction for an XSAVE area * containing all the *user* state components * corresponding to bits currently set in XCR0. */ cpuid_count(CPUID_LEAF_XSTATE, 0, &eax, &ebx, &ecx, &edx); return ebx; } static int __init init_xstate_size(void) { /* Recompute the context size for enabled features: */ unsigned int user_size, kernel_size, kernel_default_size; bool compacted = cpu_feature_enabled(X86_FEATURE_XCOMPACTED); /* Uncompacted user space size */ user_size = get_xsave_size_user(); /* * XSAVES kernel size includes supervisor states and uses compacted * format. XSAVEC uses compacted format, but does not save * supervisor states. * * XSAVE[OPT] do not support supervisor states so kernel and user * size is identical. */ if (compacted) kernel_size = get_xsave_compacted_size(); else kernel_size = user_size; kernel_default_size = xstate_calculate_size(fpu_kernel_cfg.default_features, compacted); if (!paranoid_xstate_size_valid(kernel_size)) return -EINVAL; fpu_kernel_cfg.max_size = kernel_size; fpu_user_cfg.max_size = user_size; fpu_kernel_cfg.default_size = kernel_default_size; fpu_user_cfg.default_size = xstate_calculate_size(fpu_user_cfg.default_features, false); return 0; } /* * We enabled the XSAVE hardware, but something went wrong and * we can not use it. Disable it. */ static void __init fpu__init_disable_system_xstate(unsigned int legacy_size) { fpu_kernel_cfg.max_features = 0; cr4_clear_bits(X86_CR4_OSXSAVE); setup_clear_cpu_cap(X86_FEATURE_XSAVE); /* Restore the legacy size.*/ fpu_kernel_cfg.max_size = legacy_size; fpu_kernel_cfg.default_size = legacy_size; fpu_user_cfg.max_size = legacy_size; fpu_user_cfg.default_size = legacy_size; /* * Prevent enabling the static branch which enables writes to the * XFD MSR. */ init_fpstate.xfd = 0; fpstate_reset(¤t->thread.fpu); } /* * Enable and initialize the xsave feature. * Called once per system bootup. */ void __init fpu__init_system_xstate(unsigned int legacy_size) { unsigned int eax, ebx, ecx, edx; u64 xfeatures; int err; int i; if (!boot_cpu_has(X86_FEATURE_FPU)) { pr_info("x86/fpu: No FPU detected\n"); return; } if (!boot_cpu_has(X86_FEATURE_XSAVE)) { pr_info("x86/fpu: x87 FPU will use %s\n", boot_cpu_has(X86_FEATURE_FXSR) ? "FXSAVE" : "FSAVE"); return; } /* * Find user xstates supported by the processor. */ cpuid_count(CPUID_LEAF_XSTATE, 0, &eax, &ebx, &ecx, &edx); fpu_kernel_cfg.max_features = eax + ((u64)edx << 32); /* * Find supervisor xstates supported by the processor. */ cpuid_count(CPUID_LEAF_XSTATE, 1, &eax, &ebx, &ecx, &edx); fpu_kernel_cfg.max_features |= ecx + ((u64)edx << 32); if ((fpu_kernel_cfg.max_features & XFEATURE_MASK_FPSSE) != XFEATURE_MASK_FPSSE) { /* * This indicates that something really unexpected happened * with the enumeration. Disable XSAVE and try to continue * booting without it. 
This is too early to BUG(). */ pr_err("x86/fpu: FP/SSE not present amongst the CPU's xstate features: 0x%llx.\n", fpu_kernel_cfg.max_features); goto out_disable; } fpu_kernel_cfg.independent_features = fpu_kernel_cfg.max_features & XFEATURE_MASK_INDEPENDENT; /* * Clear XSAVE features that are disabled in the normal CPUID. */ for (i = 0; i < ARRAY_SIZE(xsave_cpuid_features); i++) { unsigned short cid = xsave_cpuid_features[i]; /* Careful: X86_FEATURE_FPU is 0! */ if ((i != XFEATURE_FP && !cid) || !boot_cpu_has(cid)) fpu_kernel_cfg.max_features &= ~BIT_ULL(i); } if (!cpu_feature_enabled(X86_FEATURE_XFD)) fpu_kernel_cfg.max_features &= ~XFEATURE_MASK_USER_DYNAMIC; if (!cpu_feature_enabled(X86_FEATURE_XSAVES)) fpu_kernel_cfg.max_features &= XFEATURE_MASK_USER_SUPPORTED; else fpu_kernel_cfg.max_features &= XFEATURE_MASK_USER_SUPPORTED | XFEATURE_MASK_SUPERVISOR_SUPPORTED; fpu_user_cfg.max_features = fpu_kernel_cfg.max_features; fpu_user_cfg.max_features &= XFEATURE_MASK_USER_SUPPORTED; /* Clean out dynamic features from default */ fpu_kernel_cfg.default_features = fpu_kernel_cfg.max_features; fpu_kernel_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC; fpu_user_cfg.default_features = fpu_user_cfg.max_features; fpu_user_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC; /* Store it for paranoia check at the end */ xfeatures = fpu_kernel_cfg.max_features; /* * Initialize the default XFD state in initfp_state and enable the * dynamic sizing mechanism if dynamic states are available. The * static key cannot be enabled here because this runs before * jump_label_init(). This is delayed to an initcall. */ init_fpstate.xfd = fpu_user_cfg.max_features & XFEATURE_MASK_USER_DYNAMIC; /* Set up compaction feature bit */ if (cpu_feature_enabled(X86_FEATURE_XSAVEC) || cpu_feature_enabled(X86_FEATURE_XSAVES)) setup_force_cpu_cap(X86_FEATURE_XCOMPACTED); /* Enable xstate instructions to be able to continue with initialization: */ fpu__init_cpu_xstate(); /* Cache size, offset and flags for initialization */ setup_xstate_cache(); err = init_xstate_size(); if (err) goto out_disable; /* Reset the state for the current task */ fpstate_reset(¤t->thread.fpu); /* * Update info used for ptrace frames; use standard-format size and no * supervisor xstates: */ update_regset_xstate_info(fpu_user_cfg.max_size, fpu_user_cfg.max_features); /* * init_fpstate excludes dynamic states as they are large but init * state is zero. */ init_fpstate.size = fpu_kernel_cfg.default_size; init_fpstate.xfeatures = fpu_kernel_cfg.default_features; if (init_fpstate.size > sizeof(init_fpstate.regs)) { pr_warn("x86/fpu: init_fpstate buffer too small (%zu < %d), disabling XSAVE\n", sizeof(init_fpstate.regs), init_fpstate.size); goto out_disable; } setup_init_fpu_buf(); /* * Paranoia check whether something in the setup modified the * xfeatures mask. */ if (xfeatures != fpu_kernel_cfg.max_features) { pr_err("x86/fpu: xfeatures modified from 0x%016llx to 0x%016llx during init, disabling XSAVE\n", xfeatures, fpu_kernel_cfg.max_features); goto out_disable; } /* * CPU capabilities initialization runs before FPU init. So * X86_FEATURE_OSXSAVE is not set. Now that XSAVE is completely * functional, set the feature bit so depending code works. */ setup_force_cpu_cap(X86_FEATURE_OSXSAVE); print_xstate_offset_size(); pr_info("x86/fpu: Enabled xstate features 0x%llx, context size is %d bytes, using '%s' format.\n", fpu_kernel_cfg.max_features, fpu_kernel_cfg.max_size, boot_cpu_has(X86_FEATURE_XCOMPACTED) ? 
"compacted" : "standard"); return; out_disable: /* something went wrong, try to boot without any XSAVE support */ fpu__init_disable_system_xstate(legacy_size); } /* * Restore minimal FPU state after suspend: */ void fpu__resume_cpu(void) { /* * Restore XCR0 on xsave capable CPUs: */ if (cpu_feature_enabled(X86_FEATURE_XSAVE)) xsetbv(XCR_XFEATURE_ENABLED_MASK, fpu_user_cfg.max_features); /* * Restore IA32_XSS. The same CPUID bit enumerates support * of XSAVES and MSR_IA32_XSS. */ if (cpu_feature_enabled(X86_FEATURE_XSAVES)) { wrmsrl(MSR_IA32_XSS, xfeatures_mask_supervisor() | xfeatures_mask_independent()); } if (fpu_state_size_dynamic()) wrmsrl(MSR_IA32_XFD, current->thread.fpu.fpstate->xfd); } /* * Given an xstate feature nr, calculate where in the xsave * buffer the state is. Callers should ensure that the buffer * is valid. */ static void *__raw_xsave_addr(struct xregs_state *xsave, int xfeature_nr) { u64 xcomp_bv = xsave->header.xcomp_bv; if (WARN_ON_ONCE(!xfeature_enabled(xfeature_nr))) return NULL; if (cpu_feature_enabled(X86_FEATURE_XCOMPACTED)) { if (WARN_ON_ONCE(!(xcomp_bv & BIT_ULL(xfeature_nr)))) return NULL; } return (void *)xsave + xfeature_get_offset(xcomp_bv, xfeature_nr); } /* * Given the xsave area and a state inside, this function returns the * address of the state. * * This is the API that is called to get xstate address in either * standard format or compacted format of xsave area. * * Note that if there is no data for the field in the xsave buffer * this will return NULL. * * Inputs: * xstate: the thread's storage area for all FPU data * xfeature_nr: state which is defined in xsave.h (e.g. XFEATURE_FP, * XFEATURE_SSE, etc...) * Output: * address of the state in the xsave area, or NULL if the * field is not present in the xsave buffer. */ void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr) { /* * Do we even *have* xsave state? */ if (!boot_cpu_has(X86_FEATURE_XSAVE)) return NULL; /* * We should not ever be requesting features that we * have not enabled. */ if (WARN_ON_ONCE(!xfeature_enabled(xfeature_nr))) return NULL; /* * This assumes the last 'xsave*' instruction to * have requested that 'xfeature_nr' be saved. * If it did not, we might be seeing and old value * of the field in the buffer. * * This can happen because the last 'xsave' did not * request that this feature be saved (unlikely) * or because the "init optimization" caused it * to not be saved. */ if (!(xsave->header.xfeatures & BIT_ULL(xfeature_nr))) return NULL; return __raw_xsave_addr(xsave, xfeature_nr); } EXPORT_SYMBOL_GPL(get_xsave_addr); /* * Given an xstate feature nr, calculate where in the xsave buffer the state is. * The xsave buffer should be in standard format, not compacted (e.g. user mode * signal frames). */ void __user *get_xsave_addr_user(struct xregs_state __user *xsave, int xfeature_nr) { if (WARN_ON_ONCE(!xfeature_enabled(xfeature_nr))) return NULL; return (void __user *)xsave + xstate_offsets[xfeature_nr]; } #ifdef CONFIG_ARCH_HAS_PKEYS /* * This will go out and modify PKRU register to set the access * rights for @pkey to @init_val. */ int arch_set_user_pkey_access(struct task_struct *tsk, int pkey, unsigned long init_val) { u32 old_pkru, new_pkru_bits = 0; int pkey_shift; /* * This check implies XSAVE support. OSPKE only gets * set if we enable XSAVE and we enable PKU in XCR0. */ if (!cpu_feature_enabled(X86_FEATURE_OSPKE)) return -EINVAL; /* * This code should only be called with valid 'pkey' * values originating from in-kernel users. Complain * if a bad value is observed. 
*/ if (WARN_ON_ONCE(pkey >= arch_max_pkey())) return -EINVAL; /* Set the bits we need in PKRU: */ if (init_val & PKEY_DISABLE_ACCESS) new_pkru_bits |= PKRU_AD_BIT; if (init_val & PKEY_DISABLE_WRITE) new_pkru_bits |= PKRU_WD_BIT; /* Shift the bits in to the correct place in PKRU for pkey: */ pkey_shift = pkey * PKRU_BITS_PER_PKEY; new_pkru_bits <<= pkey_shift; /* Get old PKRU and mask off any old bits in place: */ old_pkru = read_pkru(); old_pkru &= ~((PKRU_AD_BIT|PKRU_WD_BIT) << pkey_shift); /* Write old part along with new part: */ write_pkru(old_pkru | new_pkru_bits); return 0; } #endif /* ! CONFIG_ARCH_HAS_PKEYS */ static void copy_feature(bool from_xstate, struct membuf *to, void *xstate, void *init_xstate, unsigned int size) { membuf_write(to, from_xstate ? xstate : init_xstate, size); } /** * __copy_xstate_to_uabi_buf - Copy kernel saved xstate to a UABI buffer * @to: membuf descriptor * @fpstate: The fpstate buffer from which to copy * @xfeatures: The mask of xfeatures to save (XSAVE mode only) * @pkru_val: The PKRU value to store in the PKRU component * @copy_mode: The requested copy mode * * Converts from kernel XSAVE or XSAVES compacted format to UABI conforming * format, i.e. from the kernel internal hardware dependent storage format * to the requested @mode. UABI XSTATE is always uncompacted! * * It supports partial copy but @to.pos always starts from zero. */ void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate, u64 xfeatures, u32 pkru_val, enum xstate_copy_mode copy_mode) { const unsigned int off_mxcsr = offsetof(struct fxregs_state, mxcsr); struct xregs_state *xinit = &init_fpstate.regs.xsave; struct xregs_state *xsave = &fpstate->regs.xsave; struct xstate_header header; unsigned int zerofrom; u64 mask; int i; memset(&header, 0, sizeof(header)); header.xfeatures = xsave->header.xfeatures; /* Mask out the feature bits depending on copy mode */ switch (copy_mode) { case XSTATE_COPY_FP: header.xfeatures &= XFEATURE_MASK_FP; break; case XSTATE_COPY_FX: header.xfeatures &= XFEATURE_MASK_FP | XFEATURE_MASK_SSE; break; case XSTATE_COPY_XSAVE: header.xfeatures &= fpstate->user_xfeatures & xfeatures; break; } /* Copy FP state up to MXCSR */ copy_feature(header.xfeatures & XFEATURE_MASK_FP, &to, &xsave->i387, &xinit->i387, off_mxcsr); /* Copy MXCSR when SSE or YMM are set in the feature mask */ copy_feature(header.xfeatures & (XFEATURE_MASK_SSE | XFEATURE_MASK_YMM), &to, &xsave->i387.mxcsr, &xinit->i387.mxcsr, MXCSR_AND_FLAGS_SIZE); /* Copy the remaining FP state */ copy_feature(header.xfeatures & XFEATURE_MASK_FP, &to, &xsave->i387.st_space, &xinit->i387.st_space, sizeof(xsave->i387.st_space)); /* Copy the SSE state - shared with YMM, but independently managed */ copy_feature(header.xfeatures & XFEATURE_MASK_SSE, &to, &xsave->i387.xmm_space, &xinit->i387.xmm_space, sizeof(xsave->i387.xmm_space)); if (copy_mode != XSTATE_COPY_XSAVE) goto out; /* Zero the padding area */ membuf_zero(&to, sizeof(xsave->i387.padding)); /* Copy xsave->i387.sw_reserved */ membuf_write(&to, xstate_fx_sw_bytes, sizeof(xsave->i387.sw_reserved)); /* Copy the user space relevant state of @xsave->header */ membuf_write(&to, &header, sizeof(header)); zerofrom = offsetof(struct xregs_state, extended_state_area); /* * This 'mask' indicates which states to copy from fpstate. 
* Those extended states that are not present in fpstate are * either disabled or initialized: * * In non-compacted format, disabled features still occupy * state space but there is no state to copy from in the * compacted init_fpstate. The gap tracking will zero these * states. * * The extended features have an all zeroes init state. Thus, * remove them from 'mask' to zero those features in the user * buffer instead of retrieving them from init_fpstate. */ mask = header.xfeatures; for_each_extended_xfeature(i, mask) { /* * If there was a feature or alignment gap, zero the space * in the destination buffer. */ if (zerofrom < xstate_offsets[i]) membuf_zero(&to, xstate_offsets[i] - zerofrom); if (i == XFEATURE_PKRU) { struct pkru_state pkru = {0}; /* * PKRU is not necessarily up to date in the * XSAVE buffer. Use the provided value. */ pkru.pkru = pkru_val; membuf_write(&to, &pkru, sizeof(pkru)); } else { membuf_write(&to, __raw_xsave_addr(xsave, i), xstate_sizes[i]); } /* * Keep track of the last copied state in the non-compacted * target buffer for gap zeroing. */ zerofrom = xstate_offsets[i] + xstate_sizes[i]; } out: if (to.left) membuf_zero(&to, to.left); } /** * copy_xstate_to_uabi_buf - Copy kernel saved xstate to a UABI buffer * @to: membuf descriptor * @tsk: The task from which to copy the saved xstate * @copy_mode: The requested copy mode * * Converts from kernel XSAVE or XSAVES compacted format to UABI conforming * format, i.e. from the kernel internal hardware dependent storage format * to the requested @mode. UABI XSTATE is always uncompacted! * * It supports partial copy but @to.pos always starts from zero. */ void copy_xstate_to_uabi_buf(struct membuf to, struct task_struct *tsk, enum xstate_copy_mode copy_mode) { __copy_xstate_to_uabi_buf(to, tsk->thread.fpu.fpstate, tsk->thread.fpu.fpstate->user_xfeatures, tsk->thread.pkru, copy_mode); } static int copy_from_buffer(void *dst, unsigned int offset, unsigned int size, const void *kbuf, const void __user *ubuf) { if (kbuf) { memcpy(dst, kbuf + offset, size); } else { if (copy_from_user(dst, ubuf + offset, size)) return -EFAULT; } return 0; } /** * copy_uabi_to_xstate - Copy a UABI format buffer to the kernel xstate * @fpstate: The fpstate buffer to copy to * @kbuf: The UABI format buffer, if it comes from the kernel * @ubuf: The UABI format buffer, if it comes from userspace * @pkru: The location to write the PKRU value to * * Converts from the UABI format into the kernel internal hardware * dependent format. * * This function ultimately has three different callers with distinct PKRU * behavior. * 1. When called from sigreturn the PKRU register will be restored from * @fpstate via an XRSTOR. Correctly copying the UABI format buffer to * @fpstate is sufficient to cover this case, but the caller will also * pass a pointer to the thread_struct's pkru field in @pkru and updating * it is harmless. * 2. When called from ptrace the PKRU register will be restored from the * thread_struct's pkru field. A pointer to that is passed in @pkru. * The kernel will restore it manually, so the XRSTOR behavior that resets * the PKRU register to the hardware init value (0) if the corresponding * xfeatures bit is not set is emulated here. * 3. When called from KVM the PKRU register will be restored from the vcpu's * pkru field. A pointer to that is passed in @pkru. KVM hasn't used * XRSTOR and hasn't had the PKRU resetting behavior described above. To * preserve that KVM behavior, it passes NULL for @pkru if the xfeatures * bit is not set. 
*/ static int copy_uabi_to_xstate(struct fpstate *fpstate, const void *kbuf, const void __user *ubuf, u32 *pkru) { struct xregs_state *xsave = &fpstate->regs.xsave; unsigned int offset, size; struct xstate_header hdr; u64 mask; int i; offset = offsetof(struct xregs_state, header); if (copy_from_buffer(&hdr, offset, sizeof(hdr), kbuf, ubuf)) return -EFAULT; if (validate_user_xstate_header(&hdr, fpstate)) return -EINVAL; /* Validate MXCSR when any of the related features is in use */ mask = XFEATURE_MASK_FP | XFEATURE_MASK_SSE | XFEATURE_MASK_YMM; if (hdr.xfeatures & mask) { u32 mxcsr[2]; offset = offsetof(struct fxregs_state, mxcsr); if (copy_from_buffer(mxcsr, offset, sizeof(mxcsr), kbuf, ubuf)) return -EFAULT; /* Reserved bits in MXCSR must be zero. */ if (mxcsr[0] & ~mxcsr_feature_mask) return -EINVAL; /* SSE and YMM require MXCSR even when FP is not in use. */ if (!(hdr.xfeatures & XFEATURE_MASK_FP)) { xsave->i387.mxcsr = mxcsr[0]; xsave->i387.mxcsr_mask = mxcsr[1]; } } for (i = 0; i < XFEATURE_MAX; i++) { mask = BIT_ULL(i); if (hdr.xfeatures & mask) { void *dst = __raw_xsave_addr(xsave, i); offset = xstate_offsets[i]; size = xstate_sizes[i]; if (copy_from_buffer(dst, offset, size, kbuf, ubuf)) return -EFAULT; } } if (hdr.xfeatures & XFEATURE_MASK_PKRU) { struct pkru_state *xpkru; xpkru = __raw_xsave_addr(xsave, XFEATURE_PKRU); *pkru = xpkru->pkru; } else { /* * KVM may pass NULL here to indicate that it does not need * PKRU updated. */ if (pkru) *pkru = 0; } /* * The state that came in from userspace was user-state only. * Mask all the user states out of 'xfeatures': */ xsave->header.xfeatures &= XFEATURE_MASK_SUPERVISOR_ALL; /* * Add back in the features that came in from userspace: */ xsave->header.xfeatures |= hdr.xfeatures; return 0; } /* * Convert from a ptrace standard-format kernel buffer to kernel XSAVE[S] * format and copy to the target thread. Used by ptrace and KVM. */ int copy_uabi_from_kernel_to_xstate(struct fpstate *fpstate, const void *kbuf, u32 *pkru) { return copy_uabi_to_xstate(fpstate, kbuf, NULL, pkru); } /* * Convert from a sigreturn standard-format user-space buffer to kernel * XSAVE[S] format and copy to the target thread. This is called from the * sigreturn() and rt_sigreturn() system calls. */ int copy_sigframe_from_user_to_xstate(struct task_struct *tsk, const void __user *ubuf) { return copy_uabi_to_xstate(tsk->thread.fpu.fpstate, NULL, ubuf, &tsk->thread.pkru); } static bool validate_independent_components(u64 mask) { u64 xchk; if (WARN_ON_FPU(!cpu_feature_enabled(X86_FEATURE_XSAVES))) return false; xchk = ~xfeatures_mask_independent(); if (WARN_ON_ONCE(!mask || mask & xchk)) return false; return true; } /** * xsaves - Save selected components to a kernel xstate buffer * @xstate: Pointer to the buffer * @mask: Feature mask to select the components to save * * The @xstate buffer must be 64 byte aligned and correctly initialized as * XSAVES does not write the full xstate header. Before first use the * buffer should be zeroed otherwise a consecutive XRSTORS from that buffer * can #GP. * * The feature mask must be a subset of the independent features. 
*/ void xsaves(struct xregs_state *xstate, u64 mask) { int err; if (!validate_independent_components(mask)) return; XSTATE_OP(XSAVES, xstate, (u32)mask, (u32)(mask >> 32), err); WARN_ON_ONCE(err); } /** * xrstors - Restore selected components from a kernel xstate buffer * @xstate: Pointer to the buffer * @mask: Feature mask to select the components to restore * * The @xstate buffer must be 64 byte aligned and correctly initialized * otherwise XRSTORS from that buffer can #GP. * * Proper usage is to restore the state which was saved with * xsaves() into @xstate. * * The feature mask must be a subset of the independent features. */ void xrstors(struct xregs_state *xstate, u64 mask) { int err; if (!validate_independent_components(mask)) return; XSTATE_OP(XRSTORS, xstate, (u32)mask, (u32)(mask >> 32), err); WARN_ON_ONCE(err); } #if IS_ENABLED(CONFIG_KVM) void fpstate_clear_xstate_component(struct fpstate *fps, unsigned int xfeature) { void *addr = get_xsave_addr(&fps->regs.xsave, xfeature); if (addr) memset(addr, 0, xstate_sizes[xfeature]); } EXPORT_SYMBOL_GPL(fpstate_clear_xstate_component); #endif #ifdef CONFIG_X86_64 #ifdef CONFIG_X86_DEBUG_FPU /* * Ensure that a subsequent XSAVE* or XRSTOR* instruction with RFBM=@mask * can safely operate on the @fpstate buffer. */ static bool xstate_op_valid(struct fpstate *fpstate, u64 mask, bool rstor) { u64 xfd = __this_cpu_read(xfd_state); if (fpstate->xfd == xfd) return true; /* * The XFD MSR does not match fpstate->xfd. That's invalid when * the passed in fpstate is current's fpstate. */ if (fpstate->xfd == current->thread.fpu.fpstate->xfd) return false; /* * XRSTOR(S) from init_fpstate are always correct as it will just * bring all components into init state and not read from the * buffer. XSAVE(S) raises #PF after init. */ if (fpstate == &init_fpstate) return rstor; /* * XSAVE(S): clone(), fpu_swap_kvm_fpstate() * XRSTORS(S): fpu_swap_kvm_fpstate() */ /* * No XSAVE/XRSTOR instructions (except XSAVE itself) touch * the buffer area for XFD-disabled state components. */ mask &= ~xfd; /* * Remove features which are valid in fpstate. They * have space allocated in fpstate. */ mask &= ~fpstate->xfeatures; /* * Any remaining state components in 'mask' might be written * by XSAVE/XRSTOR. Fail validation it found. */ return !mask; } void xfd_validate_state(struct fpstate *fpstate, u64 mask, bool rstor) { WARN_ON_ONCE(!xstate_op_valid(fpstate, mask, rstor)); } #endif /* CONFIG_X86_DEBUG_FPU */ static int __init xfd_update_static_branch(void) { /* * If init_fpstate.xfd has bits set then dynamic features are * available and the dynamic sizing must be enabled. */ if (init_fpstate.xfd) static_branch_enable(&__fpu_state_size_dynamic); return 0; } arch_initcall(xfd_update_static_branch) void fpstate_free(struct fpu *fpu) { if (fpu->fpstate && fpu->fpstate != &fpu->__fpstate) vfree(fpu->fpstate); } /** * fpstate_realloc - Reallocate struct fpstate for the requested new features * * @xfeatures: A bitmap of xstate features which extend the enabled features * of that task * @ksize: The required size for the kernel buffer * @usize: The required size for user space buffers * @guest_fpu: Pointer to a guest FPU container. NULL for host allocations * * Note vs. vmalloc(): If the task with a vzalloc()-allocated buffer * terminates quickly, vfree()-induced IPIs may be a concern, but tasks * with large states are likely to live longer. * * Returns: 0 on success, -ENOMEM on allocation error. 
*/ static int fpstate_realloc(u64 xfeatures, unsigned int ksize, unsigned int usize, struct fpu_guest *guest_fpu) { struct fpu *fpu = ¤t->thread.fpu; struct fpstate *curfps, *newfps = NULL; unsigned int fpsize; bool in_use; fpsize = ksize + ALIGN(offsetof(struct fpstate, regs), 64); newfps = vzalloc(fpsize); if (!newfps) return -ENOMEM; newfps->size = ksize; newfps->user_size = usize; newfps->is_valloc = true; /* * When a guest FPU is supplied, use @guest_fpu->fpstate * as reference independent whether it is in use or not. */ curfps = guest_fpu ? guest_fpu->fpstate : fpu->fpstate; /* Determine whether @curfps is the active fpstate */ in_use = fpu->fpstate == curfps; if (guest_fpu) { newfps->is_guest = true; newfps->is_confidential = curfps->is_confidential; newfps->in_use = curfps->in_use; guest_fpu->xfeatures |= xfeatures; guest_fpu->uabi_size = usize; } fpregs_lock(); /* * If @curfps is in use, ensure that the current state is in the * registers before swapping fpstate as that might invalidate it * due to layout changes. */ if (in_use && test_thread_flag(TIF_NEED_FPU_LOAD)) fpregs_restore_userregs(); newfps->xfeatures = curfps->xfeatures | xfeatures; newfps->user_xfeatures = curfps->user_xfeatures | xfeatures; newfps->xfd = curfps->xfd & ~xfeatures; /* Do the final updates within the locked region */ xstate_init_xcomp_bv(&newfps->regs.xsave, newfps->xfeatures); if (guest_fpu) { guest_fpu->fpstate = newfps; /* If curfps is active, update the FPU fpstate pointer */ if (in_use) fpu->fpstate = newfps; } else { fpu->fpstate = newfps; } if (in_use) xfd_update_state(fpu->fpstate); fpregs_unlock(); /* Only free valloc'ed state */ if (curfps && curfps->is_valloc) vfree(curfps); return 0; } static int validate_sigaltstack(unsigned int usize) { struct task_struct *thread, *leader = current->group_leader; unsigned long framesize = get_sigframe_size(); lockdep_assert_held(¤t->sighand->siglock); /* get_sigframe_size() is based on fpu_user_cfg.max_size */ framesize -= fpu_user_cfg.max_size; framesize += usize; for_each_thread(leader, thread) { if (thread->sas_ss_size && thread->sas_ss_size < framesize) return -ENOSPC; } return 0; } static int __xstate_request_perm(u64 permitted, u64 requested, bool guest) { /* * This deliberately does not exclude !XSAVES as we still might * decide to optionally context switch XCR0 or talk the silicon * vendors into extending XFD for the pre AMX states, especially * AVX512. */ bool compacted = cpu_feature_enabled(X86_FEATURE_XCOMPACTED); struct fpu *fpu = ¤t->group_leader->thread.fpu; struct fpu_state_perm *perm; unsigned int ksize, usize; u64 mask; int ret = 0; /* Check whether fully enabled */ if ((permitted & requested) == requested) return 0; /* Calculate the resulting kernel state size */ mask = permitted | requested; /* Take supervisor states into account on the host */ if (!guest) mask |= xfeatures_mask_supervisor(); ksize = xstate_calculate_size(mask, compacted); /* Calculate the resulting user state size */ mask &= XFEATURE_MASK_USER_SUPPORTED; usize = xstate_calculate_size(mask, false); if (!guest) { ret = validate_sigaltstack(usize); if (ret) return ret; } perm = guest ? 
&fpu->guest_perm : &fpu->perm; /* Pairs with the READ_ONCE() in xstate_get_group_perm() */ WRITE_ONCE(perm->__state_perm, mask); /* Protected by sighand lock */ perm->__state_size = ksize; perm->__user_state_size = usize; return ret; } /* * Permissions array to map facilities with more than one component */ static const u64 xstate_prctl_req[XFEATURE_MAX] = { [XFEATURE_XTILE_DATA] = XFEATURE_MASK_XTILE_DATA, }; static int xstate_request_perm(unsigned long idx, bool guest) { u64 permitted, requested; int ret; if (idx >= XFEATURE_MAX) return -EINVAL; /* * Look up the facility mask which can require more than * one xstate component. */ idx = array_index_nospec(idx, ARRAY_SIZE(xstate_prctl_req)); requested = xstate_prctl_req[idx]; if (!requested) return -EOPNOTSUPP; if ((fpu_user_cfg.max_features & requested) != requested) return -EOPNOTSUPP; /* Lockless quick check */ permitted = xstate_get_group_perm(guest); if ((permitted & requested) == requested) return 0; /* Protect against concurrent modifications */ spin_lock_irq(¤t->sighand->siglock); permitted = xstate_get_group_perm(guest); /* First vCPU allocation locks the permissions. */ if (guest && (permitted & FPU_GUEST_PERM_LOCKED)) ret = -EBUSY; else ret = __xstate_request_perm(permitted, requested, guest); spin_unlock_irq(¤t->sighand->siglock); return ret; } int __xfd_enable_feature(u64 xfd_err, struct fpu_guest *guest_fpu) { u64 xfd_event = xfd_err & XFEATURE_MASK_USER_DYNAMIC; struct fpu_state_perm *perm; unsigned int ksize, usize; struct fpu *fpu; if (!xfd_event) { if (!guest_fpu) pr_err_once("XFD: Invalid xfd error: %016llx\n", xfd_err); return 0; } /* Protect against concurrent modifications */ spin_lock_irq(¤t->sighand->siglock); /* If not permitted let it die */ if ((xstate_get_group_perm(!!guest_fpu) & xfd_event) != xfd_event) { spin_unlock_irq(¤t->sighand->siglock); return -EPERM; } fpu = ¤t->group_leader->thread.fpu; perm = guest_fpu ? &fpu->guest_perm : &fpu->perm; ksize = perm->__state_size; usize = perm->__user_state_size; /* * The feature is permitted. State size is sufficient. Dropping * the lock is safe here even if more features are added from * another task, the retrieved buffer sizes are valid for the * currently requested feature(s). */ spin_unlock_irq(¤t->sighand->siglock); /* * Try to allocate a new fpstate. If that fails there is no way * out. */ if (fpstate_realloc(xfd_event, ksize, usize, guest_fpu)) return -EFAULT; return 0; } int xfd_enable_feature(u64 xfd_err) { return __xfd_enable_feature(xfd_err, NULL); } #else /* CONFIG_X86_64 */ static inline int xstate_request_perm(unsigned long idx, bool guest) { return -EPERM; } #endif /* !CONFIG_X86_64 */ u64 xstate_get_guest_group_perm(void) { return xstate_get_group_perm(true); } EXPORT_SYMBOL_GPL(xstate_get_guest_group_perm); /** * fpu_xstate_prctl - xstate permission operations * @option: A subfunction of arch_prctl() * @arg2: option argument * Return: 0 if successful; otherwise, an error code * * Option arguments: * * ARCH_GET_XCOMP_SUPP: Pointer to user space u64 to store the info * ARCH_GET_XCOMP_PERM: Pointer to user space u64 to store the info * ARCH_REQ_XCOMP_PERM: Facility number requested * * For facilities which require more than one XSTATE component, the request * must be the highest state component number related to that facility, * e.g. for AMX which requires XFEATURE_XTILE_CFG(17) and * XFEATURE_XTILE_DATA(18) this would be XFEATURE_XTILE_DATA(18). 
*/ long fpu_xstate_prctl(int option, unsigned long arg2) { u64 __user *uptr = (u64 __user *)arg2; u64 permitted, supported; unsigned long idx = arg2; bool guest = false; switch (option) { case ARCH_GET_XCOMP_SUPP: supported = fpu_user_cfg.max_features | fpu_user_cfg.legacy_features; return put_user(supported, uptr); case ARCH_GET_XCOMP_PERM: /* * Lockless snapshot as it can also change right after the * dropping the lock. */ permitted = xstate_get_host_group_perm(); permitted &= XFEATURE_MASK_USER_SUPPORTED; return put_user(permitted, uptr); case ARCH_GET_XCOMP_GUEST_PERM: permitted = xstate_get_guest_group_perm(); permitted &= XFEATURE_MASK_USER_SUPPORTED; return put_user(permitted, uptr); case ARCH_REQ_XCOMP_GUEST_PERM: guest = true; fallthrough; case ARCH_REQ_XCOMP_PERM: if (!IS_ENABLED(CONFIG_X86_64)) return -EOPNOTSUPP; return xstate_request_perm(idx, guest); default: return -EINVAL; } } #ifdef CONFIG_PROC_PID_ARCH_STATUS /* * Report the amount of time elapsed in millisecond since last AVX512 * use in the task. */ static void avx512_status(struct seq_file *m, struct task_struct *task) { unsigned long timestamp = READ_ONCE(task->thread.fpu.avx512_timestamp); long delta; if (!timestamp) { /* * Report -1 if no AVX512 usage */ delta = -1; } else { delta = (long)(jiffies - timestamp); /* * Cap to LONG_MAX if time difference > LONG_MAX */ if (delta < 0) delta = LONG_MAX; delta = jiffies_to_msecs(delta); } seq_put_decimal_ll(m, "AVX512_elapsed_ms:\t", delta); seq_putc(m, '\n'); } /* * Report architecture specific information */ int proc_pid_arch_status(struct seq_file *m, struct pid_namespace *ns, struct pid *pid, struct task_struct *task) { /* * Report AVX512 state if the processor and build option supported. */ if (cpu_feature_enabled(X86_FEATURE_AVX512F)) avx512_status(m, task); return 0; } #endif /* CONFIG_PROC_PID_ARCH_STATUS */ #ifdef CONFIG_COREDUMP static const char owner_name[] = "LINUX"; /* * Dump type, size, offset and flag values for every xfeature that is present. */ static int dump_xsave_layout_desc(struct coredump_params *cprm) { int num_records = 0; int i; for_each_extended_xfeature(i, fpu_user_cfg.max_features) { struct x86_xfeat_component xc = { .type = i, .size = xstate_sizes[i], .offset = xstate_offsets[i], /* reserved for future use */ .flags = 0, }; if (!dump_emit(cprm, &xc, sizeof(xc))) return 0; num_records++; } return num_records; } static u32 get_xsave_desc_size(void) { u32 cnt = 0; u32 i; for_each_extended_xfeature(i, fpu_user_cfg.max_features) cnt++; return cnt * (sizeof(struct x86_xfeat_component)); } int elf_coredump_extra_notes_write(struct coredump_params *cprm) { int num_records = 0; struct elf_note en; if (!fpu_user_cfg.max_features) return 0; en.n_namesz = sizeof(owner_name); en.n_descsz = get_xsave_desc_size(); en.n_type = NT_X86_XSAVE_LAYOUT; if (!dump_emit(cprm, &en, sizeof(en))) return 1; if (!dump_emit(cprm, owner_name, en.n_namesz)) return 1; if (!dump_align(cprm, 4)) return 1; num_records = dump_xsave_layout_desc(cprm); if (!num_records) return 1; /* Total size should be equal to the number of records */ if ((sizeof(struct x86_xfeat_component) * num_records) != en.n_descsz) return 1; return 0; } int elf_coredump_extra_notes_size(void) { int size; if (!fpu_user_cfg.max_features) return 0; /* .note header */ size = sizeof(struct elf_note); /* Name plus alignment to 4 bytes */ size += roundup(sizeof(owner_name), 4); size += get_xsave_desc_size(); return size; } #endif /* CONFIG_COREDUMP */ |
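A minimal userspace sketch of the permission flow described in the fpu_xstate_prctl() kernel-doc above: querying supported dynamic components and requesting AMX tile data. It is illustrative only; the ARCH_* values below are assumed to mirror <asm/prctl.h> and XFEATURE_XTILE_DATA is component 18.

/*
 * Illustrative userspace example (not part of this file). Constants are
 * copied locally under the assumption they match <asm/prctl.h>.
 */
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

#define ARCH_GET_XCOMP_SUPP	0x1021
#define ARCH_GET_XCOMP_PERM	0x1022
#define ARCH_REQ_XCOMP_PERM	0x1023
#define XFEATURE_XTILE_DATA	18

int main(void)
{
	uint64_t supp = 0, perm = 0;

	/* Which dynamically enabled components does the kernel support? */
	if (syscall(SYS_arch_prctl, ARCH_GET_XCOMP_SUPP, &supp))
		return 1;

	/* Ask for permission to use AMX tile data in this process */
	if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILE_DATA))
		return 1;

	/* Confirm that the permission bitmap now includes component 18 */
	syscall(SYS_arch_prctl, ARCH_GET_XCOMP_PERM, &perm);
	printf("supported=%#llx permitted=%#llx\n",
	       (unsigned long long)supp, (unsigned long long)perm);
	return 0;
}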
// SPDX-License-Identifier: GPL-2.0
/*
 * bio-integrity.c - bio data integrity extensions
 *
 * Copyright (C) 2007, 2008, 2009 Oracle Corporation
 * Written by: Martin K. Petersen <martin.petersen@oracle.com>
 */

#include <linux/blk-integrity.h>
#include <linux/mempool.h>
#include <linux/export.h>
#include <linux/bio.h>
#include <linux/workqueue.h>
#include <linux/slab.h>
#include "blk.h"

static struct kmem_cache *bip_slab;
static struct workqueue_struct *kintegrityd_wq;

void blk_flush_integrity(void)
{
	flush_workqueue(kintegrityd_wq);
}

/**
 * bio_integrity_free - Free bio integrity payload
 * @bio:	bio containing bip to be freed
 *
 * Description: Free the integrity portion of a bio.
*/ void bio_integrity_free(struct bio *bio) { struct bio_integrity_payload *bip = bio_integrity(bio); struct bio_set *bs = bio->bi_pool; if (bs && mempool_initialized(&bs->bio_integrity_pool)) { if (bip->bip_vec) bvec_free(&bs->bvec_integrity_pool, bip->bip_vec, bip->bip_max_vcnt); mempool_free(bip, &bs->bio_integrity_pool); } else { kfree(bip); } bio->bi_integrity = NULL; bio->bi_opf &= ~REQ_INTEGRITY; } /** * bio_integrity_alloc - Allocate integrity payload and attach it to bio * @bio: bio to attach integrity metadata to * @gfp_mask: Memory allocation mask * @nr_vecs: Number of integrity metadata scatter-gather elements * * Description: This function prepares a bio for attaching integrity * metadata. nr_vecs specifies the maximum number of pages containing * integrity metadata that can be attached. */ struct bio_integrity_payload *bio_integrity_alloc(struct bio *bio, gfp_t gfp_mask, unsigned int nr_vecs) { struct bio_integrity_payload *bip; struct bio_set *bs = bio->bi_pool; unsigned inline_vecs; if (WARN_ON_ONCE(bio_has_crypt_ctx(bio))) return ERR_PTR(-EOPNOTSUPP); if (!bs || !mempool_initialized(&bs->bio_integrity_pool)) { bip = kmalloc(struct_size(bip, bip_inline_vecs, nr_vecs), gfp_mask); inline_vecs = nr_vecs; } else { bip = mempool_alloc(&bs->bio_integrity_pool, gfp_mask); inline_vecs = BIO_INLINE_VECS; } if (unlikely(!bip)) return ERR_PTR(-ENOMEM); memset(bip, 0, sizeof(*bip)); /* always report as many vecs as asked explicitly, not inline vecs */ bip->bip_max_vcnt = nr_vecs; if (nr_vecs > inline_vecs) { bip->bip_vec = bvec_alloc(&bs->bvec_integrity_pool, &bip->bip_max_vcnt, gfp_mask); if (!bip->bip_vec) goto err; } else if (nr_vecs) { bip->bip_vec = bip->bip_inline_vecs; } bip->bip_bio = bio; bio->bi_integrity = bip; bio->bi_opf |= REQ_INTEGRITY; return bip; err: if (bs && mempool_initialized(&bs->bio_integrity_pool)) mempool_free(bip, &bs->bio_integrity_pool); else kfree(bip); return ERR_PTR(-ENOMEM); } EXPORT_SYMBOL(bio_integrity_alloc); static void bio_integrity_unpin_bvec(struct bio_vec *bv, int nr_vecs, bool dirty) { int i; for (i = 0; i < nr_vecs; i++) { if (dirty && !PageCompound(bv[i].bv_page)) set_page_dirty_lock(bv[i].bv_page); unpin_user_page(bv[i].bv_page); } } static void bio_integrity_uncopy_user(struct bio_integrity_payload *bip) { unsigned short orig_nr_vecs = bip->bip_max_vcnt - 1; struct bio_vec *orig_bvecs = &bip->bip_vec[1]; struct bio_vec *bounce_bvec = &bip->bip_vec[0]; size_t bytes = bounce_bvec->bv_len; struct iov_iter orig_iter; int ret; iov_iter_bvec(&orig_iter, ITER_DEST, orig_bvecs, orig_nr_vecs, bytes); ret = copy_to_iter(bvec_virt(bounce_bvec), bytes, &orig_iter); WARN_ON_ONCE(ret != bytes); bio_integrity_unpin_bvec(orig_bvecs, orig_nr_vecs, true); } /** * bio_integrity_unmap_user - Unmap user integrity payload * @bio: bio containing bip to be unmapped * * Unmap the user mapped integrity portion of a bio. */ void bio_integrity_unmap_user(struct bio *bio) { struct bio_integrity_payload *bip = bio_integrity(bio); if (bip->bip_flags & BIP_COPY_USER) { if (bio_data_dir(bio) == READ) bio_integrity_uncopy_user(bip); kfree(bvec_virt(bip->bip_vec)); return; } bio_integrity_unpin_bvec(bip->bip_vec, bip->bip_max_vcnt, bio_data_dir(bio) == READ); } /** * bio_integrity_add_page - Attach integrity metadata * @bio: bio to update * @page: page containing integrity metadata * @len: number of bytes of integrity metadata in page * @offset: start offset within page * * Description: Attach a page containing integrity metadata to bio. 
*/ int bio_integrity_add_page(struct bio *bio, struct page *page, unsigned int len, unsigned int offset) { struct request_queue *q = bdev_get_queue(bio->bi_bdev); struct bio_integrity_payload *bip = bio_integrity(bio); if (bip->bip_vcnt > 0) { struct bio_vec *bv = &bip->bip_vec[bip->bip_vcnt - 1]; bool same_page = false; if (bvec_try_merge_hw_page(q, bv, page, len, offset, &same_page)) { bip->bip_iter.bi_size += len; return len; } if (bip->bip_vcnt >= min(bip->bip_max_vcnt, queue_max_integrity_segments(q))) return 0; /* * If the queue doesn't support SG gaps and adding this segment * would create a gap, disallow it. */ if (bvec_gap_to_prev(&q->limits, bv, offset)) return 0; } bvec_set_page(&bip->bip_vec[bip->bip_vcnt], page, len, offset); bip->bip_vcnt++; bip->bip_iter.bi_size += len; return len; } EXPORT_SYMBOL(bio_integrity_add_page); static int bio_integrity_copy_user(struct bio *bio, struct bio_vec *bvec, int nr_vecs, unsigned int len, unsigned int direction) { bool write = direction == ITER_SOURCE; struct bio_integrity_payload *bip; struct iov_iter iter; void *buf; int ret; buf = kmalloc(len, GFP_KERNEL); if (!buf) return -ENOMEM; if (write) { iov_iter_bvec(&iter, direction, bvec, nr_vecs, len); if (!copy_from_iter_full(buf, len, &iter)) { ret = -EFAULT; goto free_buf; } bip = bio_integrity_alloc(bio, GFP_KERNEL, 1); } else { memset(buf, 0, len); /* * We need to preserve the original bvec and the number of vecs * in it for completion handling */ bip = bio_integrity_alloc(bio, GFP_KERNEL, nr_vecs + 1); } if (IS_ERR(bip)) { ret = PTR_ERR(bip); goto free_buf; } if (write) bio_integrity_unpin_bvec(bvec, nr_vecs, false); else memcpy(&bip->bip_vec[1], bvec, nr_vecs * sizeof(*bvec)); ret = bio_integrity_add_page(bio, virt_to_page(buf), len, offset_in_page(buf)); if (ret != len) { ret = -ENOMEM; goto free_bip; } bip->bip_flags |= BIP_COPY_USER; bip->bip_vcnt = nr_vecs; return 0; free_bip: bio_integrity_free(bio); free_buf: kfree(buf); return ret; } static int bio_integrity_init_user(struct bio *bio, struct bio_vec *bvec, int nr_vecs, unsigned int len) { struct bio_integrity_payload *bip; bip = bio_integrity_alloc(bio, GFP_KERNEL, nr_vecs); if (IS_ERR(bip)) return PTR_ERR(bip); memcpy(bip->bip_vec, bvec, nr_vecs * sizeof(*bvec)); bip->bip_iter.bi_size = len; bip->bip_vcnt = nr_vecs; return 0; } static unsigned int bvec_from_pages(struct bio_vec *bvec, struct page **pages, int nr_vecs, ssize_t bytes, ssize_t offset) { unsigned int nr_bvecs = 0; int i, j; for (i = 0; i < nr_vecs; i = j) { size_t size = min_t(size_t, bytes, PAGE_SIZE - offset); struct folio *folio = page_folio(pages[i]); bytes -= size; for (j = i + 1; j < nr_vecs; j++) { size_t next = min_t(size_t, PAGE_SIZE, bytes); if (page_folio(pages[j]) != folio || pages[j] != pages[j - 1] + 1) break; unpin_user_page(pages[j]); size += next; bytes -= next; } bvec_set_page(&bvec[nr_bvecs], pages[i], size, offset); offset = 0; nr_bvecs++; } return nr_bvecs; } int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter) { struct request_queue *q = bdev_get_queue(bio->bi_bdev); unsigned int align = blk_lim_dma_alignment_and_pad(&q->limits); struct page *stack_pages[UIO_FASTIOV], **pages = stack_pages; struct bio_vec stack_vec[UIO_FASTIOV], *bvec = stack_vec; size_t offset, bytes = iter->count; unsigned int direction, nr_bvecs; int ret, nr_vecs; bool copy; if (bio_integrity(bio)) return -EINVAL; if (bytes >> SECTOR_SHIFT > queue_max_hw_sectors(q)) return -E2BIG; if (bio_data_dir(bio) == READ) direction = ITER_DEST; else direction = 
ITER_SOURCE; nr_vecs = iov_iter_npages(iter, BIO_MAX_VECS + 1); if (nr_vecs > BIO_MAX_VECS) return -E2BIG; if (nr_vecs > UIO_FASTIOV) { bvec = kcalloc(nr_vecs, sizeof(*bvec), GFP_KERNEL); if (!bvec) return -ENOMEM; pages = NULL; } copy = !iov_iter_is_aligned(iter, align, align); ret = iov_iter_extract_pages(iter, &pages, bytes, nr_vecs, 0, &offset); if (unlikely(ret < 0)) goto free_bvec; nr_bvecs = bvec_from_pages(bvec, pages, nr_vecs, bytes, offset); if (pages != stack_pages) kvfree(pages); if (nr_bvecs > queue_max_integrity_segments(q)) copy = true; if (copy) ret = bio_integrity_copy_user(bio, bvec, nr_bvecs, bytes, direction); else ret = bio_integrity_init_user(bio, bvec, nr_bvecs, bytes); if (ret) goto release_pages; if (bvec != stack_vec) kfree(bvec); return 0; release_pages: bio_integrity_unpin_bvec(bvec, nr_bvecs, false); free_bvec: if (bvec != stack_vec) kfree(bvec); return ret; } static void bio_uio_meta_to_bip(struct bio *bio, struct uio_meta *meta) { struct bio_integrity_payload *bip = bio_integrity(bio); if (meta->flags & IO_INTEGRITY_CHK_GUARD) bip->bip_flags |= BIP_CHECK_GUARD; if (meta->flags & IO_INTEGRITY_CHK_APPTAG) bip->bip_flags |= BIP_CHECK_APPTAG; if (meta->flags & IO_INTEGRITY_CHK_REFTAG) bip->bip_flags |= BIP_CHECK_REFTAG; bip->app_tag = meta->app_tag; } int bio_integrity_map_iter(struct bio *bio, struct uio_meta *meta) { struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk); unsigned int integrity_bytes; int ret; struct iov_iter it; if (!bi) return -EINVAL; /* * original meta iterator can be bigger. * process integrity info corresponding to current data buffer only. */ it = meta->iter; integrity_bytes = bio_integrity_bytes(bi, bio_sectors(bio)); if (it.count < integrity_bytes) return -EINVAL; /* should fit into two bytes */ BUILD_BUG_ON(IO_INTEGRITY_VALID_FLAGS >= (1 << 16)); if (meta->flags && (meta->flags & ~IO_INTEGRITY_VALID_FLAGS)) return -EINVAL; it.count = integrity_bytes; ret = bio_integrity_map_user(bio, &it); if (!ret) { bio_uio_meta_to_bip(bio, meta); bip_set_seed(bio_integrity(bio), meta->seed); iov_iter_advance(&meta->iter, integrity_bytes); meta->seed += bio_integrity_intervals(bi, bio_sectors(bio)); } return ret; } /** * bio_integrity_prep - Prepare bio for integrity I/O * @bio: bio to prepare * * Description: Checks if the bio already has an integrity payload attached. * If it does, the payload has been generated by another kernel subsystem, * and we just pass it through. Otherwise allocates integrity payload. * The bio must have data direction, target device and start sector set priot * to calling. In the WRITE case, integrity metadata will be generated using * the block device's integrity function. In the READ case, the buffer * will be prepared for DMA and a suitable end_io handler set up. */ bool bio_integrity_prep(struct bio *bio) { struct bio_integrity_payload *bip; struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk); unsigned int len; void *buf; gfp_t gfp = GFP_NOIO; if (!bi) return true; if (!bio_sectors(bio)) return true; /* Already protected? */ if (bio_integrity(bio)) return true; switch (bio_op(bio)) { case REQ_OP_READ: if (bi->flags & BLK_INTEGRITY_NOVERIFY) return true; break; case REQ_OP_WRITE: if (bi->flags & BLK_INTEGRITY_NOGENERATE) return true; /* * Zero the memory allocated to not leak uninitialized kernel * memory to disk for non-integrity metadata where nothing else * initializes the memory. 
*/ if (bi->csum_type == BLK_INTEGRITY_CSUM_NONE) gfp |= __GFP_ZERO; break; default: return true; } /* Allocate kernel buffer for protection data */ len = bio_integrity_bytes(bi, bio_sectors(bio)); buf = kmalloc(len, gfp); if (unlikely(buf == NULL)) { goto err_end_io; } bip = bio_integrity_alloc(bio, GFP_NOIO, 1); if (IS_ERR(bip)) { kfree(buf); goto err_end_io; } bip->bip_flags |= BIP_BLOCK_INTEGRITY; bip_set_seed(bip, bio->bi_iter.bi_sector); if (bi->csum_type == BLK_INTEGRITY_CSUM_IP) bip->bip_flags |= BIP_IP_CHECKSUM; /* describe what tags to check in payload */ if (bi->csum_type) bip->bip_flags |= BIP_CHECK_GUARD; if (bi->flags & BLK_INTEGRITY_REF_TAG) bip->bip_flags |= BIP_CHECK_REFTAG; if (bio_integrity_add_page(bio, virt_to_page(buf), len, offset_in_page(buf)) < len) { printk(KERN_ERR "could not attach integrity payload\n"); goto err_end_io; } /* Auto-generate integrity metadata if this is a write */ if (bio_data_dir(bio) == WRITE) blk_integrity_generate(bio); else bip->bio_iter = bio->bi_iter; return true; err_end_io: bio->bi_status = BLK_STS_RESOURCE; bio_endio(bio); return false; } EXPORT_SYMBOL(bio_integrity_prep); /** * bio_integrity_verify_fn - Integrity I/O completion worker * @work: Work struct stored in bio to be verified * * Description: This workqueue function is called to complete a READ * request. The function verifies the transferred integrity metadata * and then calls the original bio end_io function. */ static void bio_integrity_verify_fn(struct work_struct *work) { struct bio_integrity_payload *bip = container_of(work, struct bio_integrity_payload, bip_work); struct bio *bio = bip->bip_bio; blk_integrity_verify(bio); kfree(bvec_virt(bip->bip_vec)); bio_integrity_free(bio); bio_endio(bio); } /** * __bio_integrity_endio - Integrity I/O completion function * @bio: Protected bio * * Description: Completion for integrity I/O * * Normally I/O completion is done in interrupt context. However, * verifying I/O integrity is a time-consuming task which must be run * in process context. This function postpones completion * accordingly. */ bool __bio_integrity_endio(struct bio *bio) { struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk); struct bio_integrity_payload *bip = bio_integrity(bio); if (bio_op(bio) == REQ_OP_READ && !bio->bi_status && bi->csum_type) { INIT_WORK(&bip->bip_work, bio_integrity_verify_fn); queue_work(kintegrityd_wq, &bip->bip_work); return false; } kfree(bvec_virt(bip->bip_vec)); bio_integrity_free(bio); return true; } /** * bio_integrity_advance - Advance integrity vector * @bio: bio whose integrity vector to update * @bytes_done: number of data bytes that have been completed * * Description: This function calculates how many integrity bytes the * number of completed data bytes correspond to and advances the * integrity vector accordingly. */ void bio_integrity_advance(struct bio *bio, unsigned int bytes_done) { struct bio_integrity_payload *bip = bio_integrity(bio); struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk); unsigned bytes = bio_integrity_bytes(bi, bytes_done >> 9); bip->bip_iter.bi_sector += bio_integrity_intervals(bi, bytes_done >> 9); bvec_iter_advance(bip->bip_vec, &bip->bip_iter, bytes); } /** * bio_integrity_trim - Trim integrity vector * @bio: bio whose integrity vector to update * * Description: Used to trim the integrity vector in a cloned bio. 
*/ void bio_integrity_trim(struct bio *bio) { struct bio_integrity_payload *bip = bio_integrity(bio); struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk); bip->bip_iter.bi_size = bio_integrity_bytes(bi, bio_sectors(bio)); } EXPORT_SYMBOL(bio_integrity_trim); /** * bio_integrity_clone - Callback for cloning bios with integrity metadata * @bio: New bio * @bio_src: Original bio * @gfp_mask: Memory allocation mask * * Description: Called to allocate a bip when cloning a bio */ int bio_integrity_clone(struct bio *bio, struct bio *bio_src, gfp_t gfp_mask) { struct bio_integrity_payload *bip_src = bio_integrity(bio_src); struct bio_integrity_payload *bip; BUG_ON(bip_src == NULL); bip = bio_integrity_alloc(bio, gfp_mask, 0); if (IS_ERR(bip)) return PTR_ERR(bip); bip->bip_vec = bip_src->bip_vec; bip->bip_iter = bip_src->bip_iter; bip->bip_flags = bip_src->bip_flags & BIP_CLONE_FLAGS; bip->app_tag = bip_src->app_tag; return 0; } int bioset_integrity_create(struct bio_set *bs, int pool_size) { if (mempool_initialized(&bs->bio_integrity_pool)) return 0; if (mempool_init_slab_pool(&bs->bio_integrity_pool, pool_size, bip_slab)) return -1; if (biovec_init_pool(&bs->bvec_integrity_pool, pool_size)) { mempool_exit(&bs->bio_integrity_pool); return -1; } return 0; } EXPORT_SYMBOL(bioset_integrity_create); void bioset_integrity_free(struct bio_set *bs) { mempool_exit(&bs->bio_integrity_pool); mempool_exit(&bs->bvec_integrity_pool); } void __init bio_integrity_init(void) { /* * kintegrityd won't block much but may burn a lot of CPU cycles. * Make it highpri CPU intensive wq with max concurrency of 1. */ kintegrityd_wq = alloc_workqueue("kintegrityd", WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_CPU_INTENSIVE, 1); if (!kintegrityd_wq) panic("Failed to create kintegrityd\n"); bip_slab = kmem_cache_create("bio_integrity_payload", sizeof(struct bio_integrity_payload) + sizeof(struct bio_vec) * BIO_INLINE_VECS, 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL); } |
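A hedged sketch of how a caller might attach its own pre-computed protection-information buffer to a bio, following the same bio_integrity_alloc()/bio_integrity_add_page() pattern that bio_integrity_prep() uses above. The helper name and the pi_buf/pi_len parameters are hypothetical; pi_buf is assumed to be a kmalloc'ed buffer.

/*
 * Hypothetical helper (not part of this file): attach a caller-supplied
 * protection-information buffer to @bio. Mirrors the allocation and
 * add-page steps shown in bio_integrity_prep() above.
 */
static int my_attach_pi(struct bio *bio, void *pi_buf, unsigned int pi_len)
{
	struct bio_integrity_payload *bip;

	bip = bio_integrity_alloc(bio, GFP_NOIO, 1);
	if (IS_ERR(bip))
		return PTR_ERR(bip);

	/* bio_integrity_add_page() returns the number of bytes attached */
	if (bio_integrity_add_page(bio, virt_to_page(pi_buf), pi_len,
				   offset_in_page(pi_buf)) < pi_len)
		return -ENOMEM;

	return 0;
}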
/*
 * llc_sap.c - driver routines for SAP component.
 *
 * Copyright (c) 1997 by Procom Technology, Inc.
 *		 2001-2003 by Arnaldo Carvalho de Melo <acme@conectiva.com.br>
 *
 * This program can be redistributed or modified under the terms of the
 * GNU General Public License as published by the Free Software Foundation.
 * This program is distributed without any warranty or implied warranty
 * of merchantability or fitness for a particular purpose.
 *
 * See the GNU General Public License for more details.
 */

#include <net/llc.h>
#include <net/llc_if.h>
#include <net/llc_conn.h>
#include <net/llc_pdu.h>
#include <net/llc_sap.h>
#include <net/llc_s_ac.h>
#include <net/llc_s_ev.h>
#include <net/llc_s_st.h>
#include <net/sock.h>
#include <net/tcp_states.h>
#include <linux/llc.h>
#include <linux/slab.h>

static int llc_mac_header_len(unsigned short devtype)
{
	switch (devtype) {
	case ARPHRD_ETHER:
	case ARPHRD_LOOPBACK:
		return sizeof(struct ethhdr);
	}
	return 0;
}

/**
 *	llc_alloc_frame - allocates sk_buff for frame
 *	@sk:  socket to allocate frame to
 *	@dev: network device this skb will be sent over
 *	@type: pdu type to allocate
 *	@data_size: data size to allocate
 *
 *	Allocates an sk_buff for frame and initializes sk_buff fields.
 *	Returns allocated skb or %NULL when out of memory.
 */
struct sk_buff *llc_alloc_frame(struct sock *sk, struct net_device *dev,
				u8 type, u32 data_size)
{
	int hlen = type == LLC_PDU_TYPE_U ?
3 : 4; struct sk_buff *skb; hlen += llc_mac_header_len(dev->type); skb = alloc_skb(hlen + data_size, GFP_ATOMIC); if (skb) { skb_reset_mac_header(skb); skb_reserve(skb, hlen); skb_reset_network_header(skb); skb_reset_transport_header(skb); skb->protocol = htons(ETH_P_802_2); skb->dev = dev; if (sk != NULL) skb_set_owner_w(skb, sk); } return skb; } void llc_save_primitive(struct sock *sk, struct sk_buff *skb, u8 prim) { struct sockaddr_llc *addr; /* save primitive for use by the user. */ addr = llc_ui_skb_cb(skb); memset(addr, 0, sizeof(*addr)); addr->sllc_family = sk->sk_family; addr->sllc_arphrd = skb->dev->type; addr->sllc_test = prim == LLC_TEST_PRIM; addr->sllc_xid = prim == LLC_XID_PRIM; addr->sllc_ua = prim == LLC_DATAUNIT_PRIM; llc_pdu_decode_sa(skb, addr->sllc_mac); llc_pdu_decode_ssap(skb, &addr->sllc_sap); } /** * llc_sap_rtn_pdu - Informs upper layer on rx of an UI, XID or TEST pdu. * @sap: pointer to SAP * @skb: received pdu */ void llc_sap_rtn_pdu(struct llc_sap *sap, struct sk_buff *skb) { struct llc_sap_state_ev *ev = llc_sap_ev(skb); struct llc_pdu_un *pdu = llc_pdu_un_hdr(skb); switch (LLC_U_PDU_RSP(pdu)) { case LLC_1_PDU_CMD_TEST: ev->prim = LLC_TEST_PRIM; break; case LLC_1_PDU_CMD_XID: ev->prim = LLC_XID_PRIM; break; case LLC_1_PDU_CMD_UI: ev->prim = LLC_DATAUNIT_PRIM; break; } ev->ind_cfm_flag = LLC_IND; } /** * llc_find_sap_trans - finds transition for event * @sap: pointer to SAP * @skb: happened event * * This function finds transition that matches with happened event. * Returns the pointer to found transition on success or %NULL for * failure. */ static const struct llc_sap_state_trans *llc_find_sap_trans(struct llc_sap *sap, struct sk_buff *skb) { int i = 0; const struct llc_sap_state_trans *rc = NULL; const struct llc_sap_state_trans **next_trans; struct llc_sap_state *curr_state = &llc_sap_state_table[sap->state - 1]; /* * Search thru events for this state until list exhausted or until * its obvious the event is not valid for the current state */ for (next_trans = curr_state->transitions; next_trans[i]->ev; i++) if (!next_trans[i]->ev(sap, skb)) { rc = next_trans[i]; /* got event match; return it */ break; } return rc; } /** * llc_exec_sap_trans_actions - execute actions related to event * @sap: pointer to SAP * @trans: pointer to transition that it's actions must be performed * @skb: happened event. * * This function executes actions that is related to happened event. * Returns 0 for success and 1 for failure of at least one action. */ static int llc_exec_sap_trans_actions(struct llc_sap *sap, const struct llc_sap_state_trans *trans, struct sk_buff *skb) { int rc = 0; const llc_sap_action_t *next_action = trans->ev_actions; for (; next_action && *next_action; next_action++) if ((*next_action)(sap, skb)) rc = 1; return rc; } /** * llc_sap_next_state - finds transition, execs actions & change SAP state * @sap: pointer to SAP * @skb: happened event * * This function finds transition that matches with happened event, then * executes related actions and finally changes state of SAP. It returns * 0 on success and 1 for failure. 
*/ static int llc_sap_next_state(struct llc_sap *sap, struct sk_buff *skb) { const struct llc_sap_state_trans *trans; int rc = 1; if (sap->state > LLC_NR_SAP_STATES) goto out; trans = llc_find_sap_trans(sap, skb); if (!trans) goto out; /* * Got the state to which we next transition; perform the actions * associated with this transition before actually transitioning to the * next state */ rc = llc_exec_sap_trans_actions(sap, trans, skb); if (rc) goto out; /* * Transition SAP to next state if all actions execute successfully */ sap->state = trans->next_state; out: return rc; } /** * llc_sap_state_process - sends event to SAP state machine * @sap: sap to use * @skb: pointer to occurred event * * After executing actions of the event, upper layer will be indicated * if needed(on receiving an UI frame). sk can be null for the * datalink_proto case. * * This function always consumes a reference to the skb. */ static void llc_sap_state_process(struct llc_sap *sap, struct sk_buff *skb) { struct llc_sap_state_ev *ev = llc_sap_ev(skb); ev->ind_cfm_flag = 0; llc_sap_next_state(sap, skb); if (ev->ind_cfm_flag == LLC_IND && skb->sk->sk_state != TCP_LISTEN) { llc_save_primitive(skb->sk, skb, ev->prim); /* queue skb to the user. */ if (sock_queue_rcv_skb(skb->sk, skb) == 0) return; } kfree_skb(skb); } /** * llc_build_and_send_test_pkt - TEST interface for upper layers. * @sap: sap to use * @skb: packet to send * @dmac: destination mac address * @dsap: destination sap * * This function is called when upper layer wants to send a TEST pdu. * Returns 0 for success, 1 otherwise. */ void llc_build_and_send_test_pkt(struct llc_sap *sap, struct sk_buff *skb, u8 *dmac, u8 dsap) { struct llc_sap_state_ev *ev = llc_sap_ev(skb); ev->saddr.lsap = sap->laddr.lsap; ev->daddr.lsap = dsap; memcpy(ev->saddr.mac, skb->dev->dev_addr, IFHWADDRLEN); memcpy(ev->daddr.mac, dmac, IFHWADDRLEN); ev->type = LLC_SAP_EV_TYPE_PRIM; ev->prim = LLC_TEST_PRIM; ev->prim_type = LLC_PRIM_TYPE_REQ; llc_sap_state_process(sap, skb); } /** * llc_build_and_send_xid_pkt - XID interface for upper layers * @sap: sap to use * @skb: packet to send * @dmac: destination mac address * @dsap: destination sap * * This function is called when upper layer wants to send a XID pdu. * Returns 0 for success, 1 otherwise. */ void llc_build_and_send_xid_pkt(struct llc_sap *sap, struct sk_buff *skb, u8 *dmac, u8 dsap) { struct llc_sap_state_ev *ev = llc_sap_ev(skb); ev->saddr.lsap = sap->laddr.lsap; ev->daddr.lsap = dsap; memcpy(ev->saddr.mac, skb->dev->dev_addr, IFHWADDRLEN); memcpy(ev->daddr.mac, dmac, IFHWADDRLEN); ev->type = LLC_SAP_EV_TYPE_PRIM; ev->prim = LLC_XID_PRIM; ev->prim_type = LLC_PRIM_TYPE_REQ; llc_sap_state_process(sap, skb); } /** * llc_sap_rcv - sends received pdus to the sap state machine * @sap: current sap component structure. * @skb: received frame. * @sk: socket to associate to frame * * Sends received pdus to the sap state machine. 
*/ static void llc_sap_rcv(struct llc_sap *sap, struct sk_buff *skb, struct sock *sk) { struct llc_sap_state_ev *ev = llc_sap_ev(skb); ev->type = LLC_SAP_EV_TYPE_PDU; ev->reason = 0; skb_orphan(skb); sock_hold(sk); skb->sk = sk; skb->destructor = sock_efree; llc_sap_state_process(sap, skb); } static inline bool llc_dgram_match(const struct llc_sap *sap, const struct llc_addr *laddr, const struct sock *sk, const struct net *net) { struct llc_sock *llc = llc_sk(sk); return sk->sk_type == SOCK_DGRAM && net_eq(sock_net(sk), net) && llc->laddr.lsap == laddr->lsap && ether_addr_equal(llc->laddr.mac, laddr->mac); } /** * llc_lookup_dgram - Finds dgram socket for the local sap/mac * @sap: SAP * @laddr: address of local LLC (MAC + SAP) * @net: netns to look up a socket in * * Search socket list of the SAP and finds connection using the local * mac, and local sap. Returns pointer for socket found, %NULL otherwise. */ static struct sock *llc_lookup_dgram(struct llc_sap *sap, const struct llc_addr *laddr, const struct net *net) { struct sock *rc; struct hlist_nulls_node *node; int slot = llc_sk_laddr_hashfn(sap, laddr); struct hlist_nulls_head *laddr_hb = &sap->sk_laddr_hash[slot]; rcu_read_lock_bh(); again: sk_nulls_for_each_rcu(rc, node, laddr_hb) { if (llc_dgram_match(sap, laddr, rc, net)) { /* Extra checks required by SLAB_TYPESAFE_BY_RCU */ if (unlikely(!refcount_inc_not_zero(&rc->sk_refcnt))) goto again; if (unlikely(llc_sk(rc)->sap != sap || !llc_dgram_match(sap, laddr, rc, net))) { sock_put(rc); continue; } goto found; } } rc = NULL; /* * if the nulls value we got at the end of this lookup is * not the expected one, we must restart lookup. * We probably met an item that was moved to another chain. */ if (unlikely(get_nulls_value(node) != slot)) goto again; found: rcu_read_unlock_bh(); return rc; } static inline bool llc_mcast_match(const struct llc_sap *sap, const struct llc_addr *laddr, const struct sk_buff *skb, const struct sock *sk) { struct llc_sock *llc = llc_sk(sk); return sk->sk_type == SOCK_DGRAM && llc->laddr.lsap == laddr->lsap && llc->dev == skb->dev; } static void llc_do_mcast(struct llc_sap *sap, struct sk_buff *skb, struct sock **stack, int count) { struct sk_buff *skb1; int i; for (i = 0; i < count; i++) { skb1 = skb_clone(skb, GFP_ATOMIC); if (!skb1) { sock_put(stack[i]); continue; } llc_sap_rcv(sap, skb1, stack[i]); sock_put(stack[i]); } } /** * llc_sap_mcast - Deliver multicast PDU's to all matching datagram sockets. * @sap: SAP * @laddr: address of local LLC (MAC + SAP) * @skb: PDU to deliver * * Search socket list of the SAP and finds connections with same sap. * Deliver clone to each. 
*/ static void llc_sap_mcast(struct llc_sap *sap, const struct llc_addr *laddr, struct sk_buff *skb) { int i = 0; struct sock *sk; struct sock *stack[256 / sizeof(struct sock *)]; struct llc_sock *llc; struct hlist_head *dev_hb = llc_sk_dev_hash(sap, skb->dev->ifindex); spin_lock_bh(&sap->sk_lock); hlist_for_each_entry(llc, dev_hb, dev_hash_node) { sk = &llc->sk; if (!llc_mcast_match(sap, laddr, skb, sk)) continue; sock_hold(sk); if (i < ARRAY_SIZE(stack)) stack[i++] = sk; else { llc_do_mcast(sap, skb, stack, i); i = 0; } } spin_unlock_bh(&sap->sk_lock); llc_do_mcast(sap, skb, stack, i); } void llc_sap_handler(struct llc_sap *sap, struct sk_buff *skb) { struct llc_addr laddr; llc_pdu_decode_da(skb, laddr.mac); llc_pdu_decode_dsap(skb, &laddr.lsap); if (is_multicast_ether_addr(laddr.mac)) { llc_sap_mcast(sap, &laddr, skb); kfree_skb(skb); } else { struct sock *sk = llc_lookup_dgram(sap, &laddr, dev_net(skb->dev)); if (sk) { llc_sap_rcv(sap, skb, sk); sock_put(sk); } else kfree_skb(skb); } } |
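A userspace sketch of the datagram path served by llc_lookup_dgram() and llc_sap_rcv() above: an AF_LLC SOCK_DGRAM socket bound to a local MAC and SAP. This is illustrative only; the sockaddr_llc fields are the ones decoded in llc_save_primitive() above, and the SAP value and MAC are placeholders supplied by the caller.

/*
 * Illustrative userspace example (not part of this file).
 */
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/llc.h>
#include <linux/if.h>
#include <linux/if_arp.h>

static int open_llc_dgram(const unsigned char local_mac[IFHWADDRLEN],
			  unsigned char sap)
{
	struct sockaddr_llc me = {
		.sllc_family = AF_LLC,
		.sllc_arphrd = ARPHRD_ETHER,
		.sllc_sap    = sap,
	};
	int fd = socket(AF_LLC, SOCK_DGRAM, 0);

	if (fd < 0)
		return -1;

	memcpy(me.sllc_mac, local_mac, IFHWADDRLEN);
	if (bind(fd, (struct sockaddr *)&me, sizeof(me)) < 0) {
		close(fd);
		return -1;
	}
	return fd;	/* recvfrom()/sendto() now use this MAC + SAP */
}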
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _BCACHEFS_BBPOS_H
#define _BCACHEFS_BBPOS_H

#include "bbpos_types.h"
#include "bkey_methods.h"
#include "btree_cache.h"

static inline int bbpos_cmp(struct bbpos l, struct bbpos r)
{
	return cmp_int(l.btree, r.btree) ?: bpos_cmp(l.pos, r.pos);
}

static inline struct bbpos bbpos_successor(struct bbpos pos)
{
	if (bpos_cmp(pos.pos, SPOS_MAX)) {
		pos.pos = bpos_successor(pos.pos);
		return pos;
	}

	if (pos.btree != BTREE_ID_NR) {
		pos.btree++;
		pos.pos = POS_MIN;
		return pos;
	}

	BUG();
}

static inline void bch2_bbpos_to_text(struct printbuf *out, struct bbpos pos)
{
	bch2_btree_id_to_text(out, pos.btree);
	prt_char(out, ':');
	bch2_bpos_to_text(out, pos.pos);
}

#endif /* _BCACHEFS_BBPOS_H */
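A short sketch of how the two helpers above combine to walk an inclusive range of btree/key positions. The function name and the start/end bounds are hypothetical; the loop checks the end bound before advancing because, as the code above shows, bbpos_successor() BUGs when called on the very last position.

/*
 * Illustrative sketch (not part of this header). Assumes start <= end.
 */
static inline void bbpos_walk_example(struct bbpos start, struct bbpos end)
{
	struct bbpos pos = start;

	while (1) {
		/* visit @pos here */
		if (!bbpos_cmp(pos, end))
			break;
		pos = bbpos_successor(pos);
	}
}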
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _ASM_X86_PARAVIRT_H
#define _ASM_X86_PARAVIRT_H
/* Various instructions on x86 need to be replaced for
 * para-virtualization: those hooks are defined here.
*/ #include <asm/paravirt_types.h> #ifndef __ASSEMBLY__ struct mm_struct; #endif #ifdef CONFIG_PARAVIRT #include <asm/pgtable_types.h> #include <asm/asm.h> #include <asm/nospec-branch.h> #ifndef __ASSEMBLY__ #include <linux/bug.h> #include <linux/types.h> #include <linux/cpumask.h> #include <linux/static_call_types.h> #include <asm/frame.h> u64 dummy_steal_clock(int cpu); u64 dummy_sched_clock(void); DECLARE_STATIC_CALL(pv_steal_clock, dummy_steal_clock); DECLARE_STATIC_CALL(pv_sched_clock, dummy_sched_clock); void paravirt_set_sched_clock(u64 (*func)(void)); static __always_inline u64 paravirt_sched_clock(void) { return static_call(pv_sched_clock)(); } struct static_key; extern struct static_key paravirt_steal_enabled; extern struct static_key paravirt_steal_rq_enabled; __visible void __native_queued_spin_unlock(struct qspinlock *lock); bool pv_is_native_spin_unlock(void); __visible bool __native_vcpu_is_preempted(long cpu); bool pv_is_native_vcpu_is_preempted(void); static inline u64 paravirt_steal_clock(int cpu) { return static_call(pv_steal_clock)(cpu); } #ifdef CONFIG_PARAVIRT_SPINLOCKS void __init paravirt_set_cap(void); #endif /* The paravirtualized I/O functions */ static inline void slow_down_io(void) { PVOP_VCALL0(cpu.io_delay); #ifdef REALLY_SLOW_IO PVOP_VCALL0(cpu.io_delay); PVOP_VCALL0(cpu.io_delay); PVOP_VCALL0(cpu.io_delay); #endif } void native_flush_tlb_local(void); void native_flush_tlb_global(void); void native_flush_tlb_one_user(unsigned long addr); void native_flush_tlb_multi(const struct cpumask *cpumask, const struct flush_tlb_info *info); static inline void __flush_tlb_local(void) { PVOP_VCALL0(mmu.flush_tlb_user); } static inline void __flush_tlb_global(void) { PVOP_VCALL0(mmu.flush_tlb_kernel); } static inline void __flush_tlb_one_user(unsigned long addr) { PVOP_VCALL1(mmu.flush_tlb_one_user, addr); } static inline void __flush_tlb_multi(const struct cpumask *cpumask, const struct flush_tlb_info *info) { PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info); } static inline void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table) { PVOP_VCALL2(mmu.tlb_remove_table, tlb, table); } static inline void paravirt_arch_exit_mmap(struct mm_struct *mm) { PVOP_VCALL1(mmu.exit_mmap, mm); } static inline void notify_page_enc_status_changed(unsigned long pfn, int npages, bool enc) { PVOP_VCALL3(mmu.notify_page_enc_status_changed, pfn, npages, enc); } #ifdef CONFIG_PARAVIRT_XXL static inline void load_sp0(unsigned long sp0) { PVOP_VCALL1(cpu.load_sp0, sp0); } /* The paravirtualized CPUID instruction. 
*/ static inline void __cpuid(unsigned int *eax, unsigned int *ebx, unsigned int *ecx, unsigned int *edx) { PVOP_VCALL4(cpu.cpuid, eax, ebx, ecx, edx); } /* * These special macros can be used to get or set a debugging register */ static __always_inline unsigned long paravirt_get_debugreg(int reg) { return PVOP_CALL1(unsigned long, cpu.get_debugreg, reg); } #define get_debugreg(var, reg) var = paravirt_get_debugreg(reg) static __always_inline void set_debugreg(unsigned long val, int reg) { PVOP_VCALL2(cpu.set_debugreg, reg, val); } static inline unsigned long read_cr0(void) { return PVOP_CALL0(unsigned long, cpu.read_cr0); } static inline void write_cr0(unsigned long x) { PVOP_VCALL1(cpu.write_cr0, x); } static __always_inline unsigned long read_cr2(void) { return PVOP_ALT_CALLEE0(unsigned long, mmu.read_cr2, "mov %%cr2, %%rax;", ALT_NOT_XEN); } static __always_inline void write_cr2(unsigned long x) { PVOP_VCALL1(mmu.write_cr2, x); } static inline unsigned long __read_cr3(void) { return PVOP_ALT_CALL0(unsigned long, mmu.read_cr3, "mov %%cr3, %%rax;", ALT_NOT_XEN); } static inline void write_cr3(unsigned long x) { PVOP_ALT_VCALL1(mmu.write_cr3, x, "mov %%rdi, %%cr3", ALT_NOT_XEN); } static inline void __write_cr4(unsigned long x) { PVOP_VCALL1(cpu.write_cr4, x); } static __always_inline void arch_safe_halt(void) { PVOP_VCALL0(irq.safe_halt); } static inline void halt(void) { PVOP_VCALL0(irq.halt); } static inline u64 paravirt_read_msr(unsigned msr) { return PVOP_CALL1(u64, cpu.read_msr, msr); } static inline void paravirt_write_msr(unsigned msr, unsigned low, unsigned high) { PVOP_VCALL3(cpu.write_msr, msr, low, high); } static inline u64 paravirt_read_msr_safe(unsigned msr, int *err) { return PVOP_CALL2(u64, cpu.read_msr_safe, msr, err); } static inline int paravirt_write_msr_safe(unsigned msr, unsigned low, unsigned high) { return PVOP_CALL3(int, cpu.write_msr_safe, msr, low, high); } #define rdmsr(msr, val1, val2) \ do { \ u64 _l = paravirt_read_msr(msr); \ val1 = (u32)_l; \ val2 = _l >> 32; \ } while (0) #define wrmsr(msr, val1, val2) \ do { \ paravirt_write_msr(msr, val1, val2); \ } while (0) #define rdmsrl(msr, val) \ do { \ val = paravirt_read_msr(msr); \ } while (0) static inline void wrmsrl(unsigned msr, u64 val) { wrmsr(msr, (u32)val, (u32)(val>>32)); } #define wrmsr_safe(msr, a, b) paravirt_write_msr_safe(msr, a, b) /* rdmsr with exception handling */ #define rdmsr_safe(msr, a, b) \ ({ \ int _err; \ u64 _l = paravirt_read_msr_safe(msr, &_err); \ (*a) = (u32)_l; \ (*b) = _l >> 32; \ _err; \ }) static inline int rdmsrl_safe(unsigned msr, unsigned long long *p) { int err; *p = paravirt_read_msr_safe(msr, &err); return err; } static inline unsigned long long paravirt_read_pmc(int counter) { return PVOP_CALL1(u64, cpu.read_pmc, counter); } #define rdpmc(counter, low, high) \ do { \ u64 _l = paravirt_read_pmc(counter); \ low = (u32)_l; \ high = _l >> 32; \ } while (0) #define rdpmcl(counter, val) ((val) = paravirt_read_pmc(counter)) static inline void paravirt_alloc_ldt(struct desc_struct *ldt, unsigned entries) { PVOP_VCALL2(cpu.alloc_ldt, ldt, entries); } static inline void paravirt_free_ldt(struct desc_struct *ldt, unsigned entries) { PVOP_VCALL2(cpu.free_ldt, ldt, entries); } static inline void load_TR_desc(void) { PVOP_VCALL0(cpu.load_tr_desc); } static inline void load_gdt(const struct desc_ptr *dtr) { PVOP_VCALL1(cpu.load_gdt, dtr); } static inline void load_idt(const struct desc_ptr *dtr) { PVOP_VCALL1(cpu.load_idt, dtr); } static inline void set_ldt(const void *addr, unsigned 
entries) { PVOP_VCALL2(cpu.set_ldt, addr, entries); } static inline unsigned long paravirt_store_tr(void) { return PVOP_CALL0(unsigned long, cpu.store_tr); } #define store_tr(tr) ((tr) = paravirt_store_tr()) static inline void load_TLS(struct thread_struct *t, unsigned cpu) { PVOP_VCALL2(cpu.load_tls, t, cpu); } static inline void load_gs_index(unsigned int gs) { PVOP_VCALL1(cpu.load_gs_index, gs); } static inline void write_ldt_entry(struct desc_struct *dt, int entry, const void *desc) { PVOP_VCALL3(cpu.write_ldt_entry, dt, entry, desc); } static inline void write_gdt_entry(struct desc_struct *dt, int entry, void *desc, int type) { PVOP_VCALL4(cpu.write_gdt_entry, dt, entry, desc, type); } static inline void write_idt_entry(gate_desc *dt, int entry, const gate_desc *g) { PVOP_VCALL3(cpu.write_idt_entry, dt, entry, g); } #ifdef CONFIG_X86_IOPL_IOPERM static inline void tss_invalidate_io_bitmap(void) { PVOP_VCALL0(cpu.invalidate_io_bitmap); } static inline void tss_update_io_bitmap(void) { PVOP_VCALL0(cpu.update_io_bitmap); } #endif static inline void paravirt_enter_mmap(struct mm_struct *next) { PVOP_VCALL1(mmu.enter_mmap, next); } static inline int paravirt_pgd_alloc(struct mm_struct *mm) { return PVOP_CALL1(int, mmu.pgd_alloc, mm); } static inline void paravirt_pgd_free(struct mm_struct *mm, pgd_t *pgd) { PVOP_VCALL2(mmu.pgd_free, mm, pgd); } static inline void paravirt_alloc_pte(struct mm_struct *mm, unsigned long pfn) { PVOP_VCALL2(mmu.alloc_pte, mm, pfn); } static inline void paravirt_release_pte(unsigned long pfn) { PVOP_VCALL1(mmu.release_pte, pfn); } static inline void paravirt_alloc_pmd(struct mm_struct *mm, unsigned long pfn) { PVOP_VCALL2(mmu.alloc_pmd, mm, pfn); } static inline void paravirt_release_pmd(unsigned long pfn) { PVOP_VCALL1(mmu.release_pmd, pfn); } static inline void paravirt_alloc_pud(struct mm_struct *mm, unsigned long pfn) { PVOP_VCALL2(mmu.alloc_pud, mm, pfn); } static inline void paravirt_release_pud(unsigned long pfn) { PVOP_VCALL1(mmu.release_pud, pfn); } static inline void paravirt_alloc_p4d(struct mm_struct *mm, unsigned long pfn) { PVOP_VCALL2(mmu.alloc_p4d, mm, pfn); } static inline void paravirt_release_p4d(unsigned long pfn) { PVOP_VCALL1(mmu.release_p4d, pfn); } static inline pte_t __pte(pteval_t val) { return (pte_t) { PVOP_ALT_CALLEE1(pteval_t, mmu.make_pte, val, "mov %%rdi, %%rax", ALT_NOT_XEN) }; } static inline pteval_t pte_val(pte_t pte) { return PVOP_ALT_CALLEE1(pteval_t, mmu.pte_val, pte.pte, "mov %%rdi, %%rax", ALT_NOT_XEN); } static inline pgd_t __pgd(pgdval_t val) { return (pgd_t) { PVOP_ALT_CALLEE1(pgdval_t, mmu.make_pgd, val, "mov %%rdi, %%rax", ALT_NOT_XEN) }; } static inline pgdval_t pgd_val(pgd_t pgd) { return PVOP_ALT_CALLEE1(pgdval_t, mmu.pgd_val, pgd.pgd, "mov %%rdi, %%rax", ALT_NOT_XEN); } #define __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION static inline pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep) { pteval_t ret; ret = PVOP_CALL3(pteval_t, mmu.ptep_modify_prot_start, vma, addr, ptep); return (pte_t) { .pte = ret }; } static inline void ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep, pte_t old_pte, pte_t pte) { PVOP_VCALL4(mmu.ptep_modify_prot_commit, vma, addr, ptep, pte.pte); } static inline void set_pte(pte_t *ptep, pte_t pte) { PVOP_VCALL2(mmu.set_pte, ptep, pte.pte); } static inline void set_pmd(pmd_t *pmdp, pmd_t pmd) { PVOP_VCALL2(mmu.set_pmd, pmdp, native_pmd_val(pmd)); } static inline pmd_t __pmd(pmdval_t val) { return (pmd_t) { 
PVOP_ALT_CALLEE1(pmdval_t, mmu.make_pmd, val, "mov %%rdi, %%rax", ALT_NOT_XEN) }; } static inline pmdval_t pmd_val(pmd_t pmd) { return PVOP_ALT_CALLEE1(pmdval_t, mmu.pmd_val, pmd.pmd, "mov %%rdi, %%rax", ALT_NOT_XEN); } static inline void set_pud(pud_t *pudp, pud_t pud) { PVOP_VCALL2(mmu.set_pud, pudp, native_pud_val(pud)); } static inline pud_t __pud(pudval_t val) { pudval_t ret; ret = PVOP_ALT_CALLEE1(pudval_t, mmu.make_pud, val, "mov %%rdi, %%rax", ALT_NOT_XEN); return (pud_t) { ret }; } static inline pudval_t pud_val(pud_t pud) { return PVOP_ALT_CALLEE1(pudval_t, mmu.pud_val, pud.pud, "mov %%rdi, %%rax", ALT_NOT_XEN); } static inline void pud_clear(pud_t *pudp) { set_pud(pudp, native_make_pud(0)); } static inline void set_p4d(p4d_t *p4dp, p4d_t p4d) { p4dval_t val = native_p4d_val(p4d); PVOP_VCALL2(mmu.set_p4d, p4dp, val); } #if CONFIG_PGTABLE_LEVELS >= 5 static inline p4d_t __p4d(p4dval_t val) { p4dval_t ret = PVOP_ALT_CALLEE1(p4dval_t, mmu.make_p4d, val, "mov %%rdi, %%rax", ALT_NOT_XEN); return (p4d_t) { ret }; } static inline p4dval_t p4d_val(p4d_t p4d) { return PVOP_ALT_CALLEE1(p4dval_t, mmu.p4d_val, p4d.p4d, "mov %%rdi, %%rax", ALT_NOT_XEN); } static inline void __set_pgd(pgd_t *pgdp, pgd_t pgd) { PVOP_VCALL2(mmu.set_pgd, pgdp, native_pgd_val(pgd)); } #define set_pgd(pgdp, pgdval) do { \ if (pgtable_l5_enabled()) \ __set_pgd(pgdp, pgdval); \ else \ set_p4d((p4d_t *)(pgdp), (p4d_t) { (pgdval).pgd }); \ } while (0) #define pgd_clear(pgdp) do { \ if (pgtable_l5_enabled()) \ set_pgd(pgdp, native_make_pgd(0)); \ } while (0) #endif /* CONFIG_PGTABLE_LEVELS == 5 */ static inline void p4d_clear(p4d_t *p4dp) { set_p4d(p4dp, native_make_p4d(0)); } static inline void set_pte_atomic(pte_t *ptep, pte_t pte) { set_pte(ptep, pte); } static inline void pte_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep) { set_pte(ptep, native_make_pte(0)); } static inline void pmd_clear(pmd_t *pmdp) { set_pmd(pmdp, native_make_pmd(0)); } #define __HAVE_ARCH_START_CONTEXT_SWITCH static inline void arch_start_context_switch(struct task_struct *prev) { PVOP_VCALL1(cpu.start_context_switch, prev); } static inline void arch_end_context_switch(struct task_struct *next) { PVOP_VCALL1(cpu.end_context_switch, next); } #define __HAVE_ARCH_ENTER_LAZY_MMU_MODE static inline void arch_enter_lazy_mmu_mode(void) { PVOP_VCALL0(mmu.lazy_mode.enter); } static inline void arch_leave_lazy_mmu_mode(void) { PVOP_VCALL0(mmu.lazy_mode.leave); } static inline void arch_flush_lazy_mmu_mode(void) { PVOP_VCALL0(mmu.lazy_mode.flush); } static inline void __set_fixmap(unsigned /* enum fixed_addresses */ idx, phys_addr_t phys, pgprot_t flags) { pv_ops.mmu.set_fixmap(idx, phys, flags); } #endif #if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT_SPINLOCKS) static __always_inline void pv_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val) { PVOP_VCALL2(lock.queued_spin_lock_slowpath, lock, val); } static __always_inline void pv_queued_spin_unlock(struct qspinlock *lock) { PVOP_ALT_VCALLEE1(lock.queued_spin_unlock, lock, "movb $0, (%%" _ASM_ARG1 ");", ALT_NOT(X86_FEATURE_PVUNLOCK)); } static __always_inline void pv_wait(u8 *ptr, u8 val) { PVOP_VCALL2(lock.wait, ptr, val); } static __always_inline void pv_kick(int cpu) { PVOP_VCALL1(lock.kick, cpu); } static __always_inline bool pv_vcpu_is_preempted(long cpu) { return PVOP_ALT_CALLEE1(bool, lock.vcpu_is_preempted, cpu, "xor %%" _ASM_AX ", %%" _ASM_AX ";", ALT_NOT(X86_FEATURE_VCPUPREEMPT)); } void __raw_callee_save___native_queued_spin_unlock(struct qspinlock *lock); bool 
__raw_callee_save___native_vcpu_is_preempted(long cpu); #endif /* SMP && PARAVIRT_SPINLOCKS */ #ifdef CONFIG_X86_32 /* save and restore all caller-save registers, except return value */ #define PV_SAVE_ALL_CALLER_REGS "pushl %ecx;" #define PV_RESTORE_ALL_CALLER_REGS "popl %ecx;" #else /* save and restore all caller-save registers, except return value */ #define PV_SAVE_ALL_CALLER_REGS \ "push %rcx;" \ "push %rdx;" \ "push %rsi;" \ "push %rdi;" \ "push %r8;" \ "push %r9;" \ "push %r10;" \ "push %r11;" #define PV_RESTORE_ALL_CALLER_REGS \ "pop %r11;" \ "pop %r10;" \ "pop %r9;" \ "pop %r8;" \ "pop %rdi;" \ "pop %rsi;" \ "pop %rdx;" \ "pop %rcx;" #endif /* * Generate a thunk around a function which saves all caller-save * registers except for the return value. This allows C functions to * be called from assembler code where fewer than normal registers are * available. It may also help code generation around calls from C * code if the common case doesn't use many registers. * * When a callee is wrapped in a thunk, the caller can assume that all * arg regs and all scratch registers are preserved across the * call. The return value in rax/eax will not be saved, even for void * functions. */ #define PV_THUNK_NAME(func) "__raw_callee_save_" #func #define __PV_CALLEE_SAVE_REGS_THUNK(func, section) \ extern typeof(func) __raw_callee_save_##func; \ \ asm(".pushsection " section ", \"ax\";" \ ".globl " PV_THUNK_NAME(func) ";" \ ".type " PV_THUNK_NAME(func) ", @function;" \ ASM_FUNC_ALIGN \ PV_THUNK_NAME(func) ":" \ ASM_ENDBR \ FRAME_BEGIN \ PV_SAVE_ALL_CALLER_REGS \ "call " #func ";" \ PV_RESTORE_ALL_CALLER_REGS \ FRAME_END \ ASM_RET \ ".size " PV_THUNK_NAME(func) ", .-" PV_THUNK_NAME(func) ";" \ ".popsection") #define PV_CALLEE_SAVE_REGS_THUNK(func) \ __PV_CALLEE_SAVE_REGS_THUNK(func, ".text") /* Get a reference to a callee-save function */ #define PV_CALLEE_SAVE(func) \ ((struct paravirt_callee_save) { __raw_callee_save_##func }) /* Promise that "func" already uses the right calling convention */ #define __PV_IS_CALLEE_SAVE(func) \ ((struct paravirt_callee_save) { func }) #ifdef CONFIG_PARAVIRT_XXL static __always_inline unsigned long arch_local_save_flags(void) { return PVOP_ALT_CALLEE0(unsigned long, irq.save_fl, "pushf; pop %%rax;", ALT_NOT_XEN); } static __always_inline void arch_local_irq_disable(void) { PVOP_ALT_VCALLEE0(irq.irq_disable, "cli;", ALT_NOT_XEN); } static __always_inline void arch_local_irq_enable(void) { PVOP_ALT_VCALLEE0(irq.irq_enable, "sti;", ALT_NOT_XEN); } static __always_inline unsigned long arch_local_irq_save(void) { unsigned long f; f = arch_local_save_flags(); arch_local_irq_disable(); return f; } #endif /* Make sure as little as possible of this mess escapes. 
*/ #undef PARAVIRT_CALL #undef __PVOP_CALL #undef __PVOP_VCALL #undef PVOP_VCALL0 #undef PVOP_CALL0 #undef PVOP_VCALL1 #undef PVOP_CALL1 #undef PVOP_VCALL2 #undef PVOP_CALL2 #undef PVOP_VCALL3 #undef PVOP_CALL3 #undef PVOP_VCALL4 #undef PVOP_CALL4 extern void default_banner(void); void native_pv_lock_init(void) __init; #else /* __ASSEMBLY__ */ #ifdef CONFIG_X86_64 #ifdef CONFIG_PARAVIRT_XXL #ifdef CONFIG_DEBUG_ENTRY #define PARA_INDIRECT(addr) *addr(%rip) .macro PARA_IRQ_save_fl ANNOTATE_RETPOLINE_SAFE; call PARA_INDIRECT(pv_ops+PV_IRQ_save_fl); .endm #define SAVE_FLAGS ALTERNATIVE_2 "PARA_IRQ_save_fl;", \ "ALT_CALL_INSTR;", ALT_CALL_ALWAYS, \ "pushf; pop %rax;", ALT_NOT_XEN #endif #endif /* CONFIG_PARAVIRT_XXL */ #endif /* CONFIG_X86_64 */ #endif /* __ASSEMBLY__ */ #else /* CONFIG_PARAVIRT */ # define default_banner x86_init_noop #ifndef __ASSEMBLY__ static inline void native_pv_lock_init(void) { } #endif #endif /* !CONFIG_PARAVIRT */ #ifndef __ASSEMBLY__ #ifndef CONFIG_PARAVIRT_XXL static inline void paravirt_enter_mmap(struct mm_struct *mm) { } #endif #ifndef CONFIG_PARAVIRT static inline void paravirt_arch_exit_mmap(struct mm_struct *mm) { } #endif #ifndef CONFIG_PARAVIRT_SPINLOCKS static inline void paravirt_set_cap(void) { } #endif #endif /* __ASSEMBLY__ */ #endif /* _ASM_X86_PARAVIRT_H */ |
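A hedged sketch of the callee-save thunk machinery explained above: a guest implementation is wrapped with PV_CALLEE_SAVE_REGS_THUNK() and then installed into pv_ops via PV_CALLEE_SAVE(). The function and init-hook names are hypothetical; real users (e.g. KVM guest support) follow the same pattern for lock.vcpu_is_preempted, and this path assumes CONFIG_PARAVIRT_SPINLOCKS.

/*
 * Illustrative sketch (not part of this header).
 */
__visible bool my_vcpu_is_preempted(long cpu)
{
	/* consult hypervisor-shared state for @cpu here */
	return false;
}
/* Emit __raw_callee_save_my_vcpu_is_preempted, preserving scratch regs */
PV_CALLEE_SAVE_REGS_THUNK(my_vcpu_is_preempted);

static void __init my_guest_spinlock_init(void)
{
	/* Point the paravirt slot at the callee-save wrapper */
	pv_ops.lock.vcpu_is_preempted = PV_CALLEE_SAVE(my_vcpu_is_preempted);
}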
// SPDX-License-Identifier: GPL-2.0-or-later
/*
 * ip_vs_proto_tcp.c:	TCP load balancing support for IPVS
 *
 * Authors:	Wensong Zhang <wensong@linuxvirtualserver.org>
 *		Julian Anastasov <ja@ssi.bg>
 *
 * Changes:	Hans Schillstrom <hans.schillstrom@ericsson.com>
 *
 *		Network name space (netns) aware.
* Global data moved to netns i.e struct netns_ipvs * tcp_timeouts table has copy per netns in a hash table per * protocol ip_vs_proto_data and is handled by netns */ #define KMSG_COMPONENT "IPVS" #define pr_fmt(fmt) KMSG_COMPONENT ": " fmt #include <linux/kernel.h> #include <linux/ip.h> #include <linux/tcp.h> /* for tcphdr */ #include <net/ip.h> #include <net/tcp.h> /* for csum_tcpudp_magic */ #include <net/ip6_checksum.h> #include <linux/netfilter.h> #include <linux/netfilter_ipv4.h> #include <linux/indirect_call_wrapper.h> #include <net/ip_vs.h> static int tcp_csum_check(int af, struct sk_buff *skb, struct ip_vs_protocol *pp); static int tcp_conn_schedule(struct netns_ipvs *ipvs, int af, struct sk_buff *skb, struct ip_vs_proto_data *pd, int *verdict, struct ip_vs_conn **cpp, struct ip_vs_iphdr *iph) { struct ip_vs_service *svc; struct tcphdr _tcph, *th; __be16 _ports[2], *ports = NULL; /* In the event of icmp, we're only guaranteed to have the first 8 * bytes of the transport header, so we only check the rest of the * TCP packet for non-ICMP packets */ if (likely(!ip_vs_iph_icmp(iph))) { th = skb_header_pointer(skb, iph->len, sizeof(_tcph), &_tcph); if (th) { if (th->rst || !(sysctl_sloppy_tcp(ipvs) || th->syn)) return 1; ports = &th->source; } } else { ports = skb_header_pointer( skb, iph->len, sizeof(_ports), &_ports); } if (!ports) { *verdict = NF_DROP; return 0; } /* No !th->ack check to allow scheduling on SYN+ACK for Active FTP */ if (likely(!ip_vs_iph_inverse(iph))) svc = ip_vs_service_find(ipvs, af, skb->mark, iph->protocol, &iph->daddr, ports[1]); else svc = ip_vs_service_find(ipvs, af, skb->mark, iph->protocol, &iph->saddr, ports[0]); if (svc) { int ignored; if (ip_vs_todrop(ipvs)) { /* * It seems that we are very loaded. * We have to drop this packet :( */ *verdict = NF_DROP; return 0; } /* * Let the virtual server select a real server for the * incoming connection, and create a connection entry. 
*/ *cpp = ip_vs_schedule(svc, skb, pd, &ignored, iph); if (!*cpp && ignored <= 0) { if (!ignored) *verdict = ip_vs_leave(svc, skb, pd, iph); else *verdict = NF_DROP; return 0; } } /* NF_ACCEPT */ return 1; } static inline void tcp_fast_csum_update(int af, struct tcphdr *tcph, const union nf_inet_addr *oldip, const union nf_inet_addr *newip, __be16 oldport, __be16 newport) { #ifdef CONFIG_IP_VS_IPV6 if (af == AF_INET6) tcph->check = csum_fold(ip_vs_check_diff16(oldip->ip6, newip->ip6, ip_vs_check_diff2(oldport, newport, ~csum_unfold(tcph->check)))); else #endif tcph->check = csum_fold(ip_vs_check_diff4(oldip->ip, newip->ip, ip_vs_check_diff2(oldport, newport, ~csum_unfold(tcph->check)))); } static inline void tcp_partial_csum_update(int af, struct tcphdr *tcph, const union nf_inet_addr *oldip, const union nf_inet_addr *newip, __be16 oldlen, __be16 newlen) { #ifdef CONFIG_IP_VS_IPV6 if (af == AF_INET6) tcph->check = ~csum_fold(ip_vs_check_diff16(oldip->ip6, newip->ip6, ip_vs_check_diff2(oldlen, newlen, csum_unfold(tcph->check)))); else #endif tcph->check = ~csum_fold(ip_vs_check_diff4(oldip->ip, newip->ip, ip_vs_check_diff2(oldlen, newlen, csum_unfold(tcph->check)))); } INDIRECT_CALLABLE_SCOPE int tcp_snat_handler(struct sk_buff *skb, struct ip_vs_protocol *pp, struct ip_vs_conn *cp, struct ip_vs_iphdr *iph) { struct tcphdr *tcph; unsigned int tcphoff = iph->len; bool payload_csum = false; int oldlen; #ifdef CONFIG_IP_VS_IPV6 if (cp->af == AF_INET6 && iph->fragoffs) return 1; #endif oldlen = skb->len - tcphoff; /* csum_check requires unshared skb */ if (skb_ensure_writable(skb, tcphoff + sizeof(*tcph))) return 0; if (unlikely(cp->app != NULL)) { int ret; /* Some checks before mangling */ if (!tcp_csum_check(cp->af, skb, pp)) return 0; /* Call application helper if needed */ if (!(ret = ip_vs_app_pkt_out(cp, skb, iph))) return 0; /* ret=2: csum update is needed after payload mangling */ if (ret == 1) oldlen = skb->len - tcphoff; else payload_csum = true; } tcph = (void *)skb_network_header(skb) + tcphoff; tcph->source = cp->vport; /* Adjust TCP checksums */ if (skb->ip_summed == CHECKSUM_PARTIAL) { tcp_partial_csum_update(cp->af, tcph, &cp->daddr, &cp->vaddr, htons(oldlen), htons(skb->len - tcphoff)); } else if (!payload_csum) { /* Only port and addr are changed, do fast csum update */ tcp_fast_csum_update(cp->af, tcph, &cp->daddr, &cp->vaddr, cp->dport, cp->vport); if (skb->ip_summed == CHECKSUM_COMPLETE) skb->ip_summed = cp->app ? 
CHECKSUM_UNNECESSARY : CHECKSUM_NONE; } else { /* full checksum calculation */ tcph->check = 0; skb->csum = skb_checksum(skb, tcphoff, skb->len - tcphoff, 0); #ifdef CONFIG_IP_VS_IPV6 if (cp->af == AF_INET6) tcph->check = csum_ipv6_magic(&cp->vaddr.in6, &cp->caddr.in6, skb->len - tcphoff, cp->protocol, skb->csum); else #endif tcph->check = csum_tcpudp_magic(cp->vaddr.ip, cp->caddr.ip, skb->len - tcphoff, cp->protocol, skb->csum); skb->ip_summed = CHECKSUM_UNNECESSARY; IP_VS_DBG(11, "O-pkt: %s O-csum=%d (+%zd)\n", pp->name, tcph->check, (char*)&(tcph->check) - (char*)tcph); } return 1; } static int tcp_dnat_handler(struct sk_buff *skb, struct ip_vs_protocol *pp, struct ip_vs_conn *cp, struct ip_vs_iphdr *iph) { struct tcphdr *tcph; unsigned int tcphoff = iph->len; bool payload_csum = false; int oldlen; #ifdef CONFIG_IP_VS_IPV6 if (cp->af == AF_INET6 && iph->fragoffs) return 1; #endif oldlen = skb->len - tcphoff; /* csum_check requires unshared skb */ if (skb_ensure_writable(skb, tcphoff + sizeof(*tcph))) return 0; if (unlikely(cp->app != NULL)) { int ret; /* Some checks before mangling */ if (!tcp_csum_check(cp->af, skb, pp)) return 0; /* * Attempt ip_vs_app call. * It will fix ip_vs_conn and iph ack_seq stuff */ if (!(ret = ip_vs_app_pkt_in(cp, skb, iph))) return 0; /* ret=2: csum update is needed after payload mangling */ if (ret == 1) oldlen = skb->len - tcphoff; else payload_csum = true; } tcph = (void *)skb_network_header(skb) + tcphoff; tcph->dest = cp->dport; /* * Adjust TCP checksums */ if (skb->ip_summed == CHECKSUM_PARTIAL) { tcp_partial_csum_update(cp->af, tcph, &cp->vaddr, &cp->daddr, htons(oldlen), htons(skb->len - tcphoff)); } else if (!payload_csum) { /* Only port and addr are changed, do fast csum update */ tcp_fast_csum_update(cp->af, tcph, &cp->vaddr, &cp->daddr, cp->vport, cp->dport); if (skb->ip_summed == CHECKSUM_COMPLETE) skb->ip_summed = cp->app ? CHECKSUM_UNNECESSARY : CHECKSUM_NONE; } else { /* full checksum calculation */ tcph->check = 0; skb->csum = skb_checksum(skb, tcphoff, skb->len - tcphoff, 0); #ifdef CONFIG_IP_VS_IPV6 if (cp->af == AF_INET6) tcph->check = csum_ipv6_magic(&cp->caddr.in6, &cp->daddr.in6, skb->len - tcphoff, cp->protocol, skb->csum); else #endif tcph->check = csum_tcpudp_magic(cp->caddr.ip, cp->daddr.ip, skb->len - tcphoff, cp->protocol, skb->csum); skb->ip_summed = CHECKSUM_UNNECESSARY; } return 1; } static int tcp_csum_check(int af, struct sk_buff *skb, struct ip_vs_protocol *pp) { unsigned int tcphoff; #ifdef CONFIG_IP_VS_IPV6 if (af == AF_INET6) tcphoff = sizeof(struct ipv6hdr); else #endif tcphoff = ip_hdrlen(skb); switch (skb->ip_summed) { case CHECKSUM_NONE: skb->csum = skb_checksum(skb, tcphoff, skb->len - tcphoff, 0); fallthrough; case CHECKSUM_COMPLETE: #ifdef CONFIG_IP_VS_IPV6 if (af == AF_INET6) { if (csum_ipv6_magic(&ipv6_hdr(skb)->saddr, &ipv6_hdr(skb)->daddr, skb->len - tcphoff, ipv6_hdr(skb)->nexthdr, skb->csum)) { IP_VS_DBG_RL_PKT(0, af, pp, skb, 0, "Failed checksum for"); return 0; } } else #endif if (csum_tcpudp_magic(ip_hdr(skb)->saddr, ip_hdr(skb)->daddr, skb->len - tcphoff, ip_hdr(skb)->protocol, skb->csum)) { IP_VS_DBG_RL_PKT(0, af, pp, skb, 0, "Failed checksum for"); return 0; } break; default: /* No need to checksum. 
*/ break; } return 1; } #define TCP_DIR_INPUT 0 #define TCP_DIR_OUTPUT 4 #define TCP_DIR_INPUT_ONLY 8 static const int tcp_state_off[IP_VS_DIR_LAST] = { [IP_VS_DIR_INPUT] = TCP_DIR_INPUT, [IP_VS_DIR_OUTPUT] = TCP_DIR_OUTPUT, [IP_VS_DIR_INPUT_ONLY] = TCP_DIR_INPUT_ONLY, }; /* * Timeout table[state] */ static const int tcp_timeouts[IP_VS_TCP_S_LAST+1] = { [IP_VS_TCP_S_NONE] = 2*HZ, [IP_VS_TCP_S_ESTABLISHED] = 15*60*HZ, [IP_VS_TCP_S_SYN_SENT] = 2*60*HZ, [IP_VS_TCP_S_SYN_RECV] = 1*60*HZ, [IP_VS_TCP_S_FIN_WAIT] = 2*60*HZ, [IP_VS_TCP_S_TIME_WAIT] = 2*60*HZ, [IP_VS_TCP_S_CLOSE] = 10*HZ, [IP_VS_TCP_S_CLOSE_WAIT] = 60*HZ, [IP_VS_TCP_S_LAST_ACK] = 30*HZ, [IP_VS_TCP_S_LISTEN] = 2*60*HZ, [IP_VS_TCP_S_SYNACK] = 120*HZ, [IP_VS_TCP_S_LAST] = 2*HZ, }; static const char *const tcp_state_name_table[IP_VS_TCP_S_LAST+1] = { [IP_VS_TCP_S_NONE] = "NONE", [IP_VS_TCP_S_ESTABLISHED] = "ESTABLISHED", [IP_VS_TCP_S_SYN_SENT] = "SYN_SENT", [IP_VS_TCP_S_SYN_RECV] = "SYN_RECV", [IP_VS_TCP_S_FIN_WAIT] = "FIN_WAIT", [IP_VS_TCP_S_TIME_WAIT] = "TIME_WAIT", [IP_VS_TCP_S_CLOSE] = "CLOSE", [IP_VS_TCP_S_CLOSE_WAIT] = "CLOSE_WAIT", [IP_VS_TCP_S_LAST_ACK] = "LAST_ACK", [IP_VS_TCP_S_LISTEN] = "LISTEN", [IP_VS_TCP_S_SYNACK] = "SYNACK", [IP_VS_TCP_S_LAST] = "BUG!", }; static const bool tcp_state_active_table[IP_VS_TCP_S_LAST] = { [IP_VS_TCP_S_NONE] = false, [IP_VS_TCP_S_ESTABLISHED] = true, [IP_VS_TCP_S_SYN_SENT] = true, [IP_VS_TCP_S_SYN_RECV] = true, [IP_VS_TCP_S_FIN_WAIT] = false, [IP_VS_TCP_S_TIME_WAIT] = false, [IP_VS_TCP_S_CLOSE] = false, [IP_VS_TCP_S_CLOSE_WAIT] = false, [IP_VS_TCP_S_LAST_ACK] = false, [IP_VS_TCP_S_LISTEN] = false, [IP_VS_TCP_S_SYNACK] = true, }; #define sNO IP_VS_TCP_S_NONE #define sES IP_VS_TCP_S_ESTABLISHED #define sSS IP_VS_TCP_S_SYN_SENT #define sSR IP_VS_TCP_S_SYN_RECV #define sFW IP_VS_TCP_S_FIN_WAIT #define sTW IP_VS_TCP_S_TIME_WAIT #define sCL IP_VS_TCP_S_CLOSE #define sCW IP_VS_TCP_S_CLOSE_WAIT #define sLA IP_VS_TCP_S_LAST_ACK #define sLI IP_VS_TCP_S_LISTEN #define sSA IP_VS_TCP_S_SYNACK struct tcp_states_t { int next_state[IP_VS_TCP_S_LAST]; }; static const char * tcp_state_name(int state) { if (state >= IP_VS_TCP_S_LAST) return "ERR!"; return tcp_state_name_table[state] ? 
tcp_state_name_table[state] : "?"; } static bool tcp_state_active(int state) { if (state >= IP_VS_TCP_S_LAST) return false; return tcp_state_active_table[state]; } static struct tcp_states_t tcp_states[] = { /* INPUT */ /* sNO, sES, sSS, sSR, sFW, sTW, sCL, sCW, sLA, sLI, sSA */ /*syn*/ {{sSR, sES, sES, sSR, sSR, sSR, sSR, sSR, sSR, sSR, sSR }}, /*fin*/ {{sCL, sCW, sSS, sTW, sTW, sTW, sCL, sCW, sLA, sLI, sTW }}, /*ack*/ {{sES, sES, sSS, sES, sFW, sTW, sCL, sCW, sCL, sLI, sES }}, /*rst*/ {{sCL, sCL, sCL, sSR, sCL, sCL, sCL, sCL, sLA, sLI, sSR }}, /* OUTPUT */ /* sNO, sES, sSS, sSR, sFW, sTW, sCL, sCW, sLA, sLI, sSA */ /*syn*/ {{sSS, sES, sSS, sSR, sSS, sSS, sSS, sSS, sSS, sLI, sSR }}, /*fin*/ {{sTW, sFW, sSS, sTW, sFW, sTW, sCL, sTW, sLA, sLI, sTW }}, /*ack*/ {{sES, sES, sSS, sES, sFW, sTW, sCL, sCW, sLA, sES, sES }}, /*rst*/ {{sCL, sCL, sSS, sCL, sCL, sTW, sCL, sCL, sCL, sCL, sCL }}, /* INPUT-ONLY */ /* sNO, sES, sSS, sSR, sFW, sTW, sCL, sCW, sLA, sLI, sSA */ /*syn*/ {{sSR, sES, sES, sSR, sSR, sSR, sSR, sSR, sSR, sSR, sSR }}, /*fin*/ {{sCL, sFW, sSS, sTW, sFW, sTW, sCL, sCW, sLA, sLI, sTW }}, /*ack*/ {{sES, sES, sSS, sES, sFW, sTW, sCL, sCW, sCL, sLI, sES }}, /*rst*/ {{sCL, sCL, sCL, sSR, sCL, sCL, sCL, sCL, sLA, sLI, sCL }}, }; static struct tcp_states_t tcp_states_dos[] = { /* INPUT */ /* sNO, sES, sSS, sSR, sFW, sTW, sCL, sCW, sLA, sLI, sSA */ /*syn*/ {{sSR, sES, sES, sSR, sSR, sSR, sSR, sSR, sSR, sSR, sSA }}, /*fin*/ {{sCL, sCW, sSS, sTW, sTW, sTW, sCL, sCW, sLA, sLI, sSA }}, /*ack*/ {{sES, sES, sSS, sSR, sFW, sTW, sCL, sCW, sCL, sLI, sSA }}, /*rst*/ {{sCL, sCL, sCL, sSR, sCL, sCL, sCL, sCL, sLA, sLI, sCL }}, /* OUTPUT */ /* sNO, sES, sSS, sSR, sFW, sTW, sCL, sCW, sLA, sLI, sSA */ /*syn*/ {{sSS, sES, sSS, sSA, sSS, sSS, sSS, sSS, sSS, sLI, sSA }}, /*fin*/ {{sTW, sFW, sSS, sTW, sFW, sTW, sCL, sTW, sLA, sLI, sTW }}, /*ack*/ {{sES, sES, sSS, sES, sFW, sTW, sCL, sCW, sLA, sES, sES }}, /*rst*/ {{sCL, sCL, sSS, sCL, sCL, sTW, sCL, sCL, sCL, sCL, sCL }}, /* INPUT-ONLY */ /* sNO, sES, sSS, sSR, sFW, sTW, sCL, sCW, sLA, sLI, sSA */ /*syn*/ {{sSA, sES, sES, sSR, sSA, sSA, sSA, sSA, sSA, sSA, sSA }}, /*fin*/ {{sCL, sFW, sSS, sTW, sFW, sTW, sCL, sCW, sLA, sLI, sTW }}, /*ack*/ {{sES, sES, sSS, sES, sFW, sTW, sCL, sCW, sCL, sLI, sES }}, /*rst*/ {{sCL, sCL, sCL, sSR, sCL, sCL, sCL, sCL, sLA, sLI, sCL }}, }; static void tcp_timeout_change(struct ip_vs_proto_data *pd, int flags) { int on = (flags & 1); /* secure_tcp */ /* ** FIXME: change secure_tcp to independent sysctl var ** or make it per-service or per-app because it is valid ** for most if not for all of the applications. Something ** like "capabilities" (flags) for each object. */ pd->tcp_state_table = (on ? 
tcp_states_dos : tcp_states); } static inline int tcp_state_idx(struct tcphdr *th) { if (th->rst) return 3; if (th->syn) return 0; if (th->fin) return 1; if (th->ack) return 2; return -1; } static inline void set_tcp_state(struct ip_vs_proto_data *pd, struct ip_vs_conn *cp, int direction, struct tcphdr *th) { int state_idx; int new_state = IP_VS_TCP_S_CLOSE; int state_off = tcp_state_off[direction]; /* * Update state offset to INPUT_ONLY if necessary * or delete NO_OUTPUT flag if output packet detected */ if (cp->flags & IP_VS_CONN_F_NOOUTPUT) { if (state_off == TCP_DIR_OUTPUT) cp->flags &= ~IP_VS_CONN_F_NOOUTPUT; else state_off = TCP_DIR_INPUT_ONLY; } if ((state_idx = tcp_state_idx(th)) < 0) { IP_VS_DBG(8, "tcp_state_idx=%d!!!\n", state_idx); goto tcp_state_out; } new_state = pd->tcp_state_table[state_off+state_idx].next_state[cp->state]; tcp_state_out: if (new_state != cp->state) { struct ip_vs_dest *dest = cp->dest; IP_VS_DBG_BUF(8, "%s %s [%c%c%c%c] c:%s:%d v:%s:%d " "d:%s:%d state: %s->%s conn->refcnt:%d\n", pd->pp->name, ((state_off == TCP_DIR_OUTPUT) ? "output " : "input "), th->syn ? 'S' : '.', th->fin ? 'F' : '.', th->ack ? 'A' : '.', th->rst ? 'R' : '.', IP_VS_DBG_ADDR(cp->af, &cp->caddr), ntohs(cp->cport), IP_VS_DBG_ADDR(cp->af, &cp->vaddr), ntohs(cp->vport), IP_VS_DBG_ADDR(cp->daf, &cp->daddr), ntohs(cp->dport), tcp_state_name(cp->state), tcp_state_name(new_state), refcount_read(&cp->refcnt)); if (dest) { if (!(cp->flags & IP_VS_CONN_F_INACTIVE) && !tcp_state_active(new_state)) { atomic_dec(&dest->activeconns); atomic_inc(&dest->inactconns); cp->flags |= IP_VS_CONN_F_INACTIVE; } else if ((cp->flags & IP_VS_CONN_F_INACTIVE) && tcp_state_active(new_state)) { atomic_inc(&dest->activeconns); atomic_dec(&dest->inactconns); cp->flags &= ~IP_VS_CONN_F_INACTIVE; } } if (new_state == IP_VS_TCP_S_ESTABLISHED) ip_vs_control_assure_ct(cp); } if (likely(pd)) cp->timeout = pd->timeout_table[cp->state = new_state]; else /* What to do ? */ cp->timeout = tcp_timeouts[cp->state = new_state]; } /* * Handle state transitions */ static void tcp_state_transition(struct ip_vs_conn *cp, int direction, const struct sk_buff *skb, struct ip_vs_proto_data *pd) { struct tcphdr _tcph, *th; #ifdef CONFIG_IP_VS_IPV6 int ihl = cp->af == AF_INET ? 
ip_hdrlen(skb) : sizeof(struct ipv6hdr); #else int ihl = ip_hdrlen(skb); #endif th = skb_header_pointer(skb, ihl, sizeof(_tcph), &_tcph); if (th == NULL) return; spin_lock_bh(&cp->lock); set_tcp_state(pd, cp, direction, th); spin_unlock_bh(&cp->lock); } static inline __u16 tcp_app_hashkey(__be16 port) { return (((__force u16)port >> TCP_APP_TAB_BITS) ^ (__force u16)port) & TCP_APP_TAB_MASK; } static int tcp_register_app(struct netns_ipvs *ipvs, struct ip_vs_app *inc) { struct ip_vs_app *i; __u16 hash; __be16 port = inc->port; int ret = 0; struct ip_vs_proto_data *pd = ip_vs_proto_data_get(ipvs, IPPROTO_TCP); hash = tcp_app_hashkey(port); list_for_each_entry(i, &ipvs->tcp_apps[hash], p_list) { if (i->port == port) { ret = -EEXIST; goto out; } } list_add_rcu(&inc->p_list, &ipvs->tcp_apps[hash]); atomic_inc(&pd->appcnt); out: return ret; } static void tcp_unregister_app(struct netns_ipvs *ipvs, struct ip_vs_app *inc) { struct ip_vs_proto_data *pd = ip_vs_proto_data_get(ipvs, IPPROTO_TCP); atomic_dec(&pd->appcnt); list_del_rcu(&inc->p_list); } static int tcp_app_conn_bind(struct ip_vs_conn *cp) { struct netns_ipvs *ipvs = cp->ipvs; int hash; struct ip_vs_app *inc; int result = 0; /* Default binding: bind app only for NAT */ if (IP_VS_FWD_METHOD(cp) != IP_VS_CONN_F_MASQ) return 0; /* Lookup application incarnations and bind the right one */ hash = tcp_app_hashkey(cp->vport); list_for_each_entry_rcu(inc, &ipvs->tcp_apps[hash], p_list) { if (inc->port == cp->vport) { if (unlikely(!ip_vs_app_inc_get(inc))) break; IP_VS_DBG_BUF(9, "%s(): Binding conn %s:%u->" "%s:%u to app %s on port %u\n", __func__, IP_VS_DBG_ADDR(cp->af, &cp->caddr), ntohs(cp->cport), IP_VS_DBG_ADDR(cp->af, &cp->vaddr), ntohs(cp->vport), inc->name, ntohs(inc->port)); cp->app = inc; if (inc->init_conn) result = inc->init_conn(inc, cp); break; } } return result; } /* * Set LISTEN timeout. (ip_vs_conn_put will setup timer) */ void ip_vs_tcp_conn_listen(struct ip_vs_conn *cp) { struct ip_vs_proto_data *pd = ip_vs_proto_data_get(cp->ipvs, IPPROTO_TCP); spin_lock_bh(&cp->lock); cp->state = IP_VS_TCP_S_LISTEN; cp->timeout = (pd ? pd->timeout_table[IP_VS_TCP_S_LISTEN] : tcp_timeouts[IP_VS_TCP_S_LISTEN]); spin_unlock_bh(&cp->lock); } /* --------------------------------------------- * timeouts is netns related now. * --------------------------------------------- */ static int __ip_vs_tcp_init(struct netns_ipvs *ipvs, struct ip_vs_proto_data *pd) { ip_vs_init_hash_table(ipvs->tcp_apps, TCP_APP_TAB_SIZE); pd->timeout_table = ip_vs_create_timeout_table((int *)tcp_timeouts, sizeof(tcp_timeouts)); if (!pd->timeout_table) return -ENOMEM; pd->tcp_state_table = tcp_states; return 0; } static void __ip_vs_tcp_exit(struct netns_ipvs *ipvs, struct ip_vs_proto_data *pd) { kfree(pd->timeout_table); } struct ip_vs_protocol ip_vs_protocol_tcp = { .name = "TCP", .protocol = IPPROTO_TCP, .num_states = IP_VS_TCP_S_LAST, .dont_defrag = 0, .init = NULL, .exit = NULL, .init_netns = __ip_vs_tcp_init, .exit_netns = __ip_vs_tcp_exit, .register_app = tcp_register_app, .unregister_app = tcp_unregister_app, .conn_schedule = tcp_conn_schedule, .conn_in_get = ip_vs_conn_in_get_proto, .conn_out_get = ip_vs_conn_out_get_proto, .snat_handler = tcp_snat_handler, .dnat_handler = tcp_dnat_handler, .state_name = tcp_state_name, .state_transition = tcp_state_transition, .app_conn_bind = tcp_app_conn_bind, .debug_packet = ip_vs_tcpudp_debug_packet, .timeout_change = tcp_timeout_change, }; |
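/*
 * Standalone illustration (not part of the file above): the "fast csum
 * update" path in the SNAT/DNAT handlers avoids re-summing the payload by
 * patching the existing one's-complement checksum with the difference
 * between the old and new header fields (RFC 1624 style arithmetic). The
 * userspace sketch below demonstrates that property and checks the
 * incremental result against a full recomputation; all names are made up.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Fold a 32-bit accumulator into 16 bits with end-around carry. */
static uint16_t fold32(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Full one's-complement checksum over n 16-bit words. */
static uint16_t csum_full(const uint16_t *words, size_t n)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += words[i];
	return (uint16_t)~fold32(sum);
}

/* RFC 1624: HC' = ~(~HC + ~m + m') when one word changes from m to m'. */
static uint16_t csum_update(uint16_t hc, uint16_t m_old, uint16_t m_new)
{
	uint32_t sum = (uint16_t)~hc;

	sum += (uint16_t)~m_old;
	sum += m_new;
	return (uint16_t)~fold32(sum);
}

int main(void)
{
	uint16_t pkt[4] = { 0x1234, 0x5678, 0x9abc, 0x0000 };
	uint16_t old_csum = csum_full(pkt, 4);
	uint16_t fast, full;

	pkt[1] = 0x1f90;	/* rewrite one field, e.g. a port number */
	fast = csum_update(old_csum, 0x5678, 0x1f90);
	full = csum_full(pkt, 4);

	printf("full=%04x incremental=%04x\n", full, fast);
	return fast == full ? 0 : 1;
}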
/* SPDX-License-Identifier: GPL-2.0 */
/*
 * Backlight Lowlevel Control Abstraction
 *
 * Copyright (C) 2003,2004 Hewlett-Packard Company
 *
 */

#ifndef _LINUX_BACKLIGHT_H
#define _LINUX_BACKLIGHT_H

#include <linux/device.h>
#include <linux/fb.h>
#include <linux/mutex.h>
#include <linux/notifier.h>
#include <linux/types.h>

/**
 * enum backlight_update_reason - what method was used to update backlight
 *
 * A driver indicates the method (reason) used for updating the backlight
 * when calling backlight_force_update().
 */
enum backlight_update_reason {
	/**
	 * @BACKLIGHT_UPDATE_HOTKEY: The backlight was updated using a hot-key.
	 */
	BACKLIGHT_UPDATE_HOTKEY,

	/**
	 * @BACKLIGHT_UPDATE_SYSFS: The backlight was updated using sysfs.
	 */
	BACKLIGHT_UPDATE_SYSFS,
};

/**
 * enum backlight_type - the type of backlight control
 *
 * The type of interface used to control the backlight.
 */
enum backlight_type {
	/**
	 * @BACKLIGHT_RAW:
	 *
	 * The backlight is controlled using hardware registers.
	 */
	BACKLIGHT_RAW = 1,

	/**
	 * @BACKLIGHT_PLATFORM:
	 *
	 * The backlight is controlled using a platform-specific interface.
	 */
	BACKLIGHT_PLATFORM,

	/**
	 * @BACKLIGHT_FIRMWARE:
	 *
	 * The backlight is controlled using a standard firmware interface.
	 */
	BACKLIGHT_FIRMWARE,

	/**
	 * @BACKLIGHT_TYPE_MAX: Number of entries.
	 */
	BACKLIGHT_TYPE_MAX,
};

/** enum backlight_scale - the type of scale used for brightness values
 *
 * The type of scale used for brightness values.
 */
enum backlight_scale {
	/**
	 * @BACKLIGHT_SCALE_UNKNOWN: The scale is unknown.
	 */
	BACKLIGHT_SCALE_UNKNOWN = 0,

	/**
	 * @BACKLIGHT_SCALE_LINEAR: The scale is linear.
	 *
	 * The linear scale will increase brightness the same for each step.
	 */
	BACKLIGHT_SCALE_LINEAR,

	/**
	 * @BACKLIGHT_SCALE_NON_LINEAR: The scale is not linear.
* * This is often used when the brightness values tries to adjust to * the relative perception of the eye demanding a non-linear scale. */ BACKLIGHT_SCALE_NON_LINEAR, }; struct backlight_device; /** * struct backlight_ops - backlight operations * * The backlight operations are specified when the backlight device is registered. */ struct backlight_ops { /** * @options: Configure how operations are called from the core. * * The options parameter is used to adjust the behaviour of the core. * Set BL_CORE_SUSPENDRESUME to get the update_status() operation called * upon suspend and resume. */ unsigned int options; #define BL_CORE_SUSPENDRESUME (1 << 0) /** * @update_status: Operation called when properties have changed. * * Notify the backlight driver some property has changed. * The update_status operation is protected by the update_lock. * * The backlight driver is expected to use backlight_is_blank() * to check if the display is blanked and set brightness accordingly. * update_status() is called when any of the properties has changed. * * RETURNS: * * 0 on success, negative error code if any failure occurred. */ int (*update_status)(struct backlight_device *); /** * @get_brightness: Return the current backlight brightness. * * The driver may implement this as a readback from the HW. * This operation is optional and if not present then the current * brightness property value is used. * * RETURNS: * * A brightness value which is 0 or a positive number. * On failure a negative error code is returned. */ int (*get_brightness)(struct backlight_device *); /** * @controls_device: Check against the display device * * Check if the backlight controls the given display device. This * operation is optional and if not implemented it is assumed that * the display is always the one controlled by the backlight. * * RETURNS: * * If display_dev is NULL or display_dev matches the device controlled by * the backlight, return true. Otherwise return false. */ bool (*controls_device)(struct backlight_device *bd, struct device *display_dev); }; /** * struct backlight_properties - backlight properties * * This structure defines all the properties of a backlight. */ struct backlight_properties { /** * @brightness: The current brightness requested by the user. * * The backlight core makes sure the range is (0 to max_brightness) * when the brightness is set via the sysfs attribute: * /sys/class/backlight/<backlight>/brightness. * * This value can be set in the backlight_properties passed * to devm_backlight_device_register() to set a default brightness * value. */ int brightness; /** * @max_brightness: The maximum brightness value. * * This value must be set in the backlight_properties passed to * devm_backlight_device_register() and shall not be modified by the * driver after registration. */ int max_brightness; /** * @power: The current power mode. * * User space can configure the power mode using the sysfs * attribute: /sys/class/backlight/<backlight>/bl_power * When the power property is updated update_status() is called. * * The possible values are: (0: full on, 4: full off), see * BACKLIGHT_POWER constants. * * When the backlight device is enabled, @power is set to * BACKLIGHT_POWER_ON. When the backlight device is disabled, * @power is set to BACKLIGHT_POWER_OFF. */ int power; #define BACKLIGHT_POWER_ON (0) #define BACKLIGHT_POWER_OFF (4) #define BACKLIGHT_POWER_REDUCED (1) // deprecated; don't use in new code /** * @type: The type of backlight supported. 
* * The backlight type allows userspace to make appropriate * policy decisions based on the backlight type. * * This value must be set in the backlight_properties * passed to devm_backlight_device_register(). */ enum backlight_type type; /** * @state: The state of the backlight core. * * The state is a bitmask. BL_CORE_FBBLANK is set when the display * is expected to be blank. BL_CORE_SUSPENDED is set when the * driver is suspended. * * backlight drivers are expected to use backlight_is_blank() * in their update_status() operation rather than reading the * state property. * * The state is maintained by the core and drivers may not modify it. */ unsigned int state; #define BL_CORE_SUSPENDED (1 << 0) /* backlight is suspended */ #define BL_CORE_FBBLANK (1 << 1) /* backlight is under an fb blank event */ /** * @scale: The type of the brightness scale. */ enum backlight_scale scale; }; /** * struct backlight_device - backlight device data * * This structure holds all data required by a backlight device. */ struct backlight_device { /** * @props: Backlight properties */ struct backlight_properties props; /** * @update_lock: The lock used when calling the update_status() operation. * * update_lock is an internal backlight lock that serialise access * to the update_status() operation. The backlight core holds the update_lock * when calling the update_status() operation. The update_lock shall not * be used by backlight drivers. */ struct mutex update_lock; /** * @ops_lock: The lock used around everything related to backlight_ops. * * ops_lock is an internal backlight lock that protects the ops pointer * and is used around all accesses to ops and when the operations are * invoked. The ops_lock shall not be used by backlight drivers. */ struct mutex ops_lock; /** * @ops: Pointer to the backlight operations. * * If ops is NULL, the driver that registered this device has been unloaded, * and if class_get_devdata() points to something in the body of that driver, * it is also invalid. */ const struct backlight_ops *ops; /** * @fb_notif: The framebuffer notifier block */ struct notifier_block fb_notif; /** * @entry: List entry of all registered backlight devices */ struct list_head entry; /** * @dev: Parent device. */ struct device dev; /** * @fb_bl_on: The state of individual fbdev's. * * Multiple fbdev's may share one backlight device. The fb_bl_on * records the state of the individual fbdev. */ bool fb_bl_on[FB_MAX]; /** * @use_count: The number of uses of fb_bl_on. 
*/ int use_count; }; /** * backlight_update_status - force an update of the backlight device status * @bd: the backlight device */ static inline int backlight_update_status(struct backlight_device *bd) { int ret = -ENOENT; mutex_lock(&bd->update_lock); if (bd->ops && bd->ops->update_status) ret = bd->ops->update_status(bd); mutex_unlock(&bd->update_lock); return ret; } /** * backlight_enable - Enable backlight * @bd: the backlight device to enable */ static inline int backlight_enable(struct backlight_device *bd) { if (!bd) return 0; bd->props.power = BACKLIGHT_POWER_ON; bd->props.state &= ~BL_CORE_FBBLANK; return backlight_update_status(bd); } /** * backlight_disable - Disable backlight * @bd: the backlight device to disable */ static inline int backlight_disable(struct backlight_device *bd) { if (!bd) return 0; bd->props.power = BACKLIGHT_POWER_OFF; bd->props.state |= BL_CORE_FBBLANK; return backlight_update_status(bd); } /** * backlight_is_blank - Return true if display is expected to be blank * @bd: the backlight device * * Display is expected to be blank if any of these is true:: * * 1) if power in not UNBLANK * 2) if state indicate BLANK or SUSPENDED * * Returns true if display is expected to be blank, false otherwise. */ static inline bool backlight_is_blank(const struct backlight_device *bd) { return bd->props.power != BACKLIGHT_POWER_ON || bd->props.state & (BL_CORE_SUSPENDED | BL_CORE_FBBLANK); } /** * backlight_get_brightness - Returns the current brightness value * @bd: the backlight device * * Returns the current brightness value, taking in consideration the current * state. If backlight_is_blank() returns true then return 0 as brightness * otherwise return the current brightness property value. * * Backlight drivers are expected to use this function in their update_status() * operation to get the brightness value. */ static inline int backlight_get_brightness(const struct backlight_device *bd) { if (backlight_is_blank(bd)) return 0; else return bd->props.brightness; } struct backlight_device * backlight_device_register(const char *name, struct device *dev, void *devdata, const struct backlight_ops *ops, const struct backlight_properties *props); struct backlight_device * devm_backlight_device_register(struct device *dev, const char *name, struct device *parent, void *devdata, const struct backlight_ops *ops, const struct backlight_properties *props); void backlight_device_unregister(struct backlight_device *bd); void devm_backlight_device_unregister(struct device *dev, struct backlight_device *bd); void backlight_force_update(struct backlight_device *bd, enum backlight_update_reason reason); struct backlight_device *backlight_device_get_by_name(const char *name); struct backlight_device *backlight_device_get_by_type(enum backlight_type type); int backlight_device_set_brightness(struct backlight_device *bd, unsigned long brightness); #define to_backlight_device(obj) container_of(obj, struct backlight_device, dev) /** * bl_get_data - access devdata * @bl_dev: pointer to backlight device * * When a backlight device is registered the driver has the possibility * to supply a void * devdata. bl_get_data() return a pointer to the * devdata. * * RETURNS: * * pointer to devdata stored while registering the backlight device. 
 */
static inline void * bl_get_data(struct backlight_device *bl_dev)
{
	return dev_get_drvdata(&bl_dev->dev);
}

#ifdef CONFIG_OF
struct backlight_device *of_find_backlight_by_node(struct device_node *node);
#else
static inline struct backlight_device *
of_find_backlight_by_node(struct device_node *node)
{
	return NULL;
}
#endif

#if IS_ENABLED(CONFIG_BACKLIGHT_CLASS_DEVICE)
struct backlight_device *devm_of_find_backlight(struct device *dev);
#else
static inline struct backlight_device *
devm_of_find_backlight(struct device *dev)
{
	return NULL;
}
#endif

#endif
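/*
 * Illustrative sketch (not from the header above): a minimal backlight
 * driver registration using the interfaces declared in this file. The
 * "example-bl" name, struct example_bl and example_bl_apply() are
 * hypothetical placeholders for real hardware access; the assumption is a
 * platform device whose probe() owns a backlight with 0..255 levels.
 */
#include <linux/backlight.h>
#include <linux/module.h>
#include <linux/platform_device.h>
#include <linux/slab.h>

struct example_bl {
	int current_level;	/* last level "written" to the imaginary hardware */
};

/* Pretend hardware access: just remember the level. */
static void example_bl_apply(struct example_bl *ebl, int level)
{
	ebl->current_level = level;
}

static int example_bl_update_status(struct backlight_device *bd)
{
	struct example_bl *ebl = bl_get_data(bd);

	/* backlight_get_brightness() returns 0 while blanked or suspended. */
	example_bl_apply(ebl, backlight_get_brightness(bd));
	return 0;
}

static const struct backlight_ops example_bl_ops = {
	.options	= BL_CORE_SUSPENDRESUME,
	.update_status	= example_bl_update_status,
};

static int example_bl_probe(struct platform_device *pdev)
{
	struct backlight_properties props = {
		.type		= BACKLIGHT_RAW,
		.max_brightness	= 255,
		.brightness	= 255,
	};
	struct example_bl *ebl;
	struct backlight_device *bd;

	ebl = devm_kzalloc(&pdev->dev, sizeof(*ebl), GFP_KERNEL);
	if (!ebl)
		return -ENOMEM;

	bd = devm_backlight_device_register(&pdev->dev, "example-bl",
					    &pdev->dev, ebl, &example_bl_ops,
					    &props);
	if (IS_ERR(bd))
		return PTR_ERR(bd);

	/* Push the initial brightness to the hardware. */
	backlight_update_status(bd);
	return 0;
}

static struct platform_driver example_bl_driver = {
	.probe	= example_bl_probe,
	.driver	= { .name = "example-bl" },
};
module_platform_driver(example_bl_driver);
MODULE_LICENSE("GPL");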