参考文献

  • MySQL技术内幕 InnoDB存储引擎

Checkpoint技术

  • 缓冲池的设计目的为了协调CPU速度与磁盘速度的鸿沟.因此页的操作首先都是在缓冲池中完成的.如果一条DML语句,如UPDATE或DELETE改变了页中的记录,那么此时页是脏的,即缓冲池中页的版本要比磁盘的新.数据库需要将新版本的页从缓冲池刷新到磁盘.

  • 若每次一个页发生变化,就将新页的版本刷新到磁盘,那么这个开销是非常大的.若热点数据集中在某几个页中,那么数据库的性能将变得非常差.同时,如果在从缓冲池将页的新版本刷新到磁盘时发生了宕机,那么数据就不能恢复了.

  • 为了避免发送数据丢失的问题,当前事务数据库系统都普遍采用了Write Ahead Log策略,即当事务提交时,先写重做日志,再修改页.当由于发生宕机而导致数据丢失时,通过重做日志来完成数据的恢复.

  • Checkpoint(检查点)技术的目的是解决以下几个问题:

    • 缩短数据库的恢复时间;
    • 缓冲池不够用时,将脏页刷新到磁盘;
    • 重做日志不可用时,刷新脏页;
  • 当数据库发生宕机时,数据库不需要重做所有日志,因为Checkpoint之前的页都已经刷新回磁盘.因此数据库只需要对Checkpoint之后的重做日志进行恢复,这样就大大缩短了恢复的时间.

    • 此外,当缓冲池不够用时,根据LRU算法会溢出最近最少使用的页,若此页为脏页,那么需要强制执行Checkpoint,将脏页也就是页的新版本刷回磁盘.
  • 对于InnoDB存储引擎而言,其是通过LSN(Log Sequence Number)来标记版本的.而LSN是8字节的数字,其单位是字节.每个页都有LSN,重做日志也有LSN,Checkpoint也有LSN.可以通过命令show engine innodb status\G来观察.

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    ---
    LOG
    ---
    Log sequence number 1273529290
    Log buffer assigned up to 1273529290
    Log buffer completed up to 1273529290
    Log written up to 1273529290
    Log flushed up to 1273529290
    Added dirty pages up to 1273529290
    Pages flushed up to 1273529290
    Last checkpoint at 1273529290
    142951 log i/o's done, 0.11 log i/o's/second

Checkpoint的种类

  • 在InnoDB存储引擎中,Checkpoint发生的时间,条件以及脏页的选择都是非常复杂.而Checkpoint所做的事情无外乎是将缓冲池中的脏页刷回磁盘.不同之处在于每次刷新多少页到磁盘,每次从哪里取脏页,以及什么时间触发Checkpoint.

  • 在InnoDB存储引擎内部,有两种Checkpoint,分别为:

    • Sharp Checkpoint
    • Fuzzy Checkpoint
  • Sharp Checkpoint发生在数据库关闭时将所有的脏页都刷新回磁盘,这是默认的工作方式,即参数innodb_fast_shutdown=1

  • 但是若数据库在运行时也使用Sharp Checkpoint,那么数据库的可用性就会受到很大影像.故在InnoDB存储引擎内部使用Fuzzy Checkpoint进行页的刷新,即只刷新一部分脏页,而不是刷新所有的脏页回磁盘.

  • 在InnoDB存储引擎中可能发生如下集中情况的Fuzzy Checkpoint:

    • Master Thread Checkpoint

      • 对于Master Thread中发生的Checkpoint,差不多以每秒或每十秒的速度从缓冲池的脏页列表中刷新一定比例的页回磁盘.这个过程是异步的,即此时InnoDB存储引擎可以进行其他的操作,用户查询线程不会阻塞.
    • FLUSH_LRU_LIST Checkpoint

      • FLUSH_LRU_LIST Checkpoint是因为InnoDB存储引擎需要保证LRU列表中需要有差不多100个空闲页可供使用.在InnoDB1.1.x版本之前,需要检查LRU列表中是否有足够的可用空间操作发生在用户查询线程中,显然这会阻塞用户的查询操作.倘若没有100个可用空闲页,那么InnoDB存储引擎会将LRU列表尾端的页移除.如果这些页中有脏页,那么需要进行Checkpoint,而这些页是来自LRU列表的,因此称为FLUSH_LRU_LIST Checkpoint

      • 在MySQL5.6版本,也就是InnoDB1.2.x版本开始,这个检查被放在一个单独的Page Cleaner Thread中进行,并且可用通过参数innodb_lru_scan_depth来控制LRU列表中可用页的数量,该值默认为1024.

        1
        2
        3
        4
        5
        6
        7
        mysql> show variables like 'innodb_lru_scan_depth';
        +-----------------------+-------+
        | Variable_name | Value |
        +-----------------------+-------+
        | innodb_lru_scan_depth | 1024 |
        +-----------------------+-------+
        1 row in set (0.01 sec)
    • Async/Sync Flush Checkpoint

      • Async/Sync Flush Checkpoint指的是重做日志不可用的情况,这是需要强制将一些页刷新磁盘,而此时脏页是从脏页列表中选取的.

      • 若将已经写入到重做日志的LSN记为redos_lsn,将已经刷新回磁盘最新页的LSN记为checkpoint_lsn,则可定义:

        1
        2
        3
        4
        checkpoint_age = redo_lsn - checkpoint_lsn

        async_water_mark = 75% * total_redo_log_file_size
        sync_water_mark = 80% * total_redo_log_file_size
      • 若每个重做日志的大小为1GB,并且定义两个重做日志,则重做日志文件的总大小为2GB.那么async_water_mark=1.5GB,sync_water_mark=1.8GB则:

        • checkpoint_age < async_water_mark 时,不需要刷新任何脏页到磁盘
        • async_water_mark < checkpoint_age < sync_water_mark 时触发Async Flush,从Flush列表中刷新足够的脏页磁盘,使得刷新后满足checkpoint_age < async_water_mark
        • checkpoint_age > sync_water_mark这种情况很少发生,除非设置的重做日志文件太小,并且在进行LOAD DATABULK INSERT操作,此时触发Sync Flush操作,从Flush列表中刷新足够的脏页回磁盘,石凳刷新后满足checkpoint_age < async_water_mark.
      • Async/Sync Flush Checkpoint是为了保证重做日志的循环使用的可用性.

      • 在MySQL5.6版本,也就是InnoDB1.2.x版本开始,这个检查也被放在一个单独的Page Cleaner Thread中进行

    • Dirty Page too much Checkpoint

      • 即脏页的数量太多,导致InnoDB存储引擎强制进行Checkpoint.其目的总的来说是为了保证缓冲池中有足够可用的页,可用由参数innodb_max_dirty_pages_pct

        1
        2
        3
        4
        5
        6
        7
        mysql> show variables like 'innodb_max_dirty_pages_pct';
        +----------------------------+-----------+
        | Variable_name | Value |
        +----------------------------+-----------+
        | innodb_max_dirty_pages_pct | 90.000000 |
        +----------------------------+-----------+
        1 row in set (0.00 sec)
        • 该参数值表示,当缓冲池中脏页的数量占据90%时,强制进行Checkpoint,刷新一部分的脏页回到磁盘.

Master Thread的工作方式

InnoDB1.0.x版本之前的Master Thread

  • Master Thread具有最高的线程优先级别,其内部由多个循环组成,其中包括:主循环(loop),后台循环(background loop),刷新循环(flush loop),暂停循环(suspend loop).
  • Master Thread会根据数据库运行状态在loop,background loop,flush loop,suspend loop中进行切换.

主循环Loop

  • Loop被称为主循环,因为大多数的操作都是在这个循环中,其中有两大部分的操作:每秒的操作和每10秒的操作.

    • 每一秒的操作包括:
      • 日志缓冲刷新到磁盘,即使事务没有事务没有提交(总是);
      • 合并插入缓冲(可能);
      • 至多刷新100个InnoDB的缓冲池中的脏页到磁盘(可能);
      • 如果当前没有用户活动,则切换到background loop(可能);
    • 下面代码为MySQL5.1.11版本下storage/innobase/srv/srv0srv.c文件srv_master_thread方法的主循环部分代码:
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    50
    51
    52
    53
    54
    55
    56
    57
    58
    59
    60
    61
    62
    63
    64
    65
    66
    67
    68
    69
    70
    71
    72
    73
    74
    75
    76
    77
    78
    79
    80
    81
    82
    83
    84
    85
    86
    87
    88
    89
    90
    91
    92
    93
    94
    95
    loop:
    // ... 略

    /* ---- We run the following loop approximately once per second
    when there is database activity */

    skip_sleep = FALSE;

    for (i = 0; i < 10; i++) {
    n_ios_old = log_sys->n_log_ios + buf_pool->n_pages_read
    + buf_pool->n_pages_written;
    srv_main_thread_op_info = "sleeping";

    if (!skip_sleep) {
    // sleep 1 秒
    os_thread_sleep(1000000);
    }

    skip_sleep = FALSE;

    /* ALTER TABLE in MySQL requires on Unix that the table handler
    can drop tables lazily after there no longer are SELECT
    queries to them. */

    srv_main_thread_op_info = "doing background drop tables";

    row_drop_tables_for_mysql_in_background();

    srv_main_thread_op_info = "";

    if (srv_fast_shutdown && srv_shutdown_state > 0) {

    goto background_loop;
    }

    /* We flush the log once in a second even if no commit
    is issued or the we have specified in my.cnf no flush
    at transaction commit */

    srv_main_thread_op_info = "flushing log";
    // 日志缓冲刷新到磁盘,即使事务没有事务没有提交
    log_buffer_flush_to_disk();

    srv_main_thread_op_info = "making checkpoint";
    log_free_check();

    /* If there were less than 5 i/os during the
    one second sleep, we assume that there is free
    disk i/o capacity available, and it makes sense to
    do an insert buffer merge. */

    n_pend_ios = buf_get_n_pending_ios()
    + log_sys->n_pending_writes;
    n_ios = log_sys->n_log_ios + buf_pool->n_pages_read
    + buf_pool->n_pages_written;
    // 合并插入缓冲(Insert Buffer)并不是每秒都发生的.
    // InnoDB存储引擎会判断当前一秒内发生的IO次数是否小于5次,
    // 若小于5次,InnoDB认为当前IO压力很小,可以执行合并插入缓冲的操作
    if (n_pend_ios < 3 && (n_ios - n_ios_old < 5)) {
    // 进行合并插入缓冲
    srv_main_thread_op_info = "doing insert buffer merge";
    ibuf_contract_for_n_pages(TRUE, 5);

    srv_main_thread_op_info = "flushing log";
    log_buffer_flush_to_disk();
    }
    // 刷新100个脏页也不是每秒发生的.
    // InnoDB存储引擎通过判断当前缓冲池中脏页的比例(buf_get_modified_ratio_pct)
    // 是否超过了配置文件中innodb_max_dirty_pages_pct参数(默认为90,代表90%),
    // 若超过了这个阈值,InnoDB存储引擎则认为需要进行磁盘同步操作,将100个脏页写入磁盘中.
    if (buf_get_modified_ratio_pct() >
    srv_max_buf_pool_modified_pct) {

    /* Try to keep the number of modified pages in the
    buffer pool under the limit wished by the user */
    // 至多刷新100个InnoDB的缓冲池中的脏页到磁盘
    n_pages_flushed = buf_flush_batch(BUF_FLUSH_LIST, 100,
    ut_dulint_max);

    /* If we had to do the flush, it may have taken
    even more than 1 second, and also, there may be more
    to flush. Do not sleep 1 second during the next
    iteration of this loop. */

    skip_sleep = TRUE;
    }
    // 若当前没有用户活动,则切换到background loop
    if (srv_activity_count == old_activity_count) {

    /* There is no user activity at the moment, go to
    the background loop */

    goto background_loop;
    }
    }
    • 即使某个事务还没有提交,InnoDB存储引擎仍然每秒会将重做日志缓冲中的内容刷新到重做日志文件中.

    • 每10秒的操作,包括如下内容:

      • 刷新100个脏页到磁盘(可能的情况下);
      • 合并至多5个插入缓冲(总是);
      • 将日志缓冲刷新到磁盘(总是);
      • 删除无用的Undo页(总是);
      • 刷新100个或者10个脏页到磁盘(总是);
    • 部分代码如下:

      1
      2
      3
      4
      5
      6
      7
      8
      9
      10
      11
      12
      13
      14
      15
      16
      17
      18
      19
      20
      21
      22
      23
      24
      25
      26
      27
      28
      29
      30
      31
      32
      33
      34
      35
      36
      37
      38
      39
      40
      41
      42
      43
      44
      45
      46
      47
      48
      49
      50
      51
      52
      53
      54
      55
      56
      57
      58
      59
      60
      61
      62
      63
      64
      65
      66
      67
      68
      69
      70
      71
      72
      73
      74
      75
      76
      77
      78
      79
      80
      81
      82
      83
      84
      85
      86
      87
      88
      89
      90
      91
      92
      93
      94
      95
      96
      97
      98
      99
      100
      101
      102
      103
      104
      105
      106
      107
      108
      109
      	/* ---- We perform the following code approximately once per 10 seconds when there is database activity */

      #ifdef MEM_PERIODIC_CHECK
      /* Check magic numbers of every allocated mem block once in 10
      seconds */
      mem_validate_all_blocks();
      #endif
      /* If there were less than 200 i/os during the 10 second period,
      we assume that there is free disk i/o capacity available, and it
      makes sense to flush 100 pages. */

      n_pend_ios = buf_get_n_pending_ios() + log_sys->n_pending_writes;
      n_ios = log_sys->n_log_ios + buf_pool->n_pages_read
      + buf_pool->n_pages_written;
      // InnoDB存储引擎会先判断过去10秒之内磁盘的IO操作是否小于200次,
      // 若存储引擎认为当前有足够的磁盘IO能力,则会将100个脏页刷新到磁盘
      if (n_pend_ios < 3 && (n_ios - n_ios_very_old < 200)) {

      srv_main_thread_op_info = "flushing buffer pool pages";
      buf_flush_batch(BUF_FLUSH_LIST, 100, ut_dulint_max);

      srv_main_thread_op_info = "flushing log";
      log_buffer_flush_to_disk();
      }

      /* We run a batch of insert buffer merge every 10 seconds,
      even if the server were active */
      // 接着存储引擎会合并插入缓冲,不同于每秒一次操作时可能发生的合并插入缓冲操作.
      // 这次的合并插入缓冲总会在这个阶段进行.
      srv_main_thread_op_info = "doing insert buffer merge";
      ibuf_contract_for_n_pages(TRUE, 5);
      // 之后,InnoDB存储引擎会再次进行一次将日志缓冲刷新到磁盘的操作.
      // 这和每秒一次时发生的操作是一样的.
      srv_main_thread_op_info = "flushing log";
      log_buffer_flush_to_disk();

      /* We run a full purge every 10 seconds, even if the server
      were active */

      n_pages_purged = 1;

      last_flush_time = time(NULL);
      // InnoDB存储引擎会进行一步执行full purge操作,即删除无用的Undo页.
      // 对表进行update,delete这类操作,原先的行被标记为删除,但是因为一致性读的关系,需要保留这些行版本的信息.
      while (n_pages_purged) {

      if (srv_fast_shutdown && srv_shutdown_state > 0) {
      goto background_loop;
      }

      srv_main_thread_op_info = "purging";
      // 在full purge过程中,InnoDB存储引擎会判断当前事务系统中已被删除的行是否可以删除,
      // 比如有时候可能还有查询操作需要读取之前版本的undo信息,如果可以删除,InnoDB会立即将删除.
      // InnoDB存储引擎在执行full purge操作时,每次最多尝试回收20个undo页
      n_pages_purged = trx_purge();

      current_time = time(NULL);

      if (difftime(current_time, last_flush_time) > 1) {
      srv_main_thread_op_info = "flushing log";

      log_buffer_flush_to_disk();
      last_flush_time = current_time;
      }
      }

      srv_main_thread_op_info = "flushing buffer pool pages";

      /* Flush a few oldest pages to make a new checkpoint younger */
      // 然后InnoDB存储引擎会判断缓冲池中脏页的比例(buf_get_modified_ratio_pct)
      if (buf_get_modified_ratio_pct() > 70) {

      /* If there are lots of modified pages in the buffer pool
      (> 70 %), we assume we can afford reserving the disk(s) for
      the time it requires to flush 100 pages */
      // 若超过70%的脏页,则需要刷新100个脏页到磁盘
      n_pages_flushed = buf_flush_batch(BUF_FLUSH_LIST, 100,
      ut_dulint_max);
      } else {
      /* Otherwise, we only flush a small number of pages so that
      we do not unnecessarily use much disk i/o capacity from
      other work */
      // 若脏页比例小于70%,则只需要刷新10%的脏页到磁盘
      n_pages_flushed = buf_flush_batch(BUF_FLUSH_LIST, 10,
      ut_dulint_max);
      }

      srv_main_thread_op_info = "making checkpoint";

      /* Make a new checkpoint about once in 10 seconds */

      log_checkpoint(TRUE, FALSE);

      srv_main_thread_op_info = "reserving kernel mutex";

      mutex_enter(&kernel_mutex);

      /* ---- When there is database activity, we jump from here back to
      the start of loop */

      if (srv_activity_count != old_activity_count) {
      mutex_exit(&kernel_mutex);
      goto loop;
      }

      mutex_exit(&kernel_mutex);

      /* If the database is quiet, we enter the background loop */

      • storage/innobase/trx/trx0purge.c文件trx_purge方法
      1
      2
      3
      4
      5
      6
      7
      /* Handle at most 20 undo log pages in one purge batch */

      purge_sys->handle_limit = purge_sys->n_pages_handled + 20;

      old_pages_handled = purge_sys->n_pages_handled;

      mutex_exit(&(purge_sys->mutex));
    • background loop,若当前没有用户活动(数据库空闲时)或者数据库关闭(shutdown),就会切换到这个循环.

      • background loop会执行一下操作:
        • 删除无用的Undo页(总是):
        • 合并20个插入缓冲(总是);
        • 跳回到主循环(总是);
        • 不断刷新100个页直到符合条件(可能,跳转到flusn loop中完成).
      • background loop部分代码:
      1
      2
      3
      4
      5
      6
      7
      8
      9
      10
      11
      12
      13
      14
      15
      16
      17
      18
      19
      20
      21
      22
      23
      24
      25
      26
      27
      28
      29
      30
      31
      32
      33
      34
      35
      36
      37
      38
      39
      40
      41
      42
      43
      44
      45
      46
      47
      48
      49
      50
      51
      52
      53
      54
      55
      56
      57
      58
      59
      60
      61
      62
      63
      64
      65
      66
      67
      68
      69
      70
      71
      72
      73
      74
      background_loop:
      /* ---- In this loop we run background operations when the server
      is quiet from user activity. Also in the case of a shutdown, we
      loop here, flushing the buffer pool to the data files. */

      /* The server has been quiet for a while: start running background
      operations */

      srv_main_thread_op_info = "doing background drop tables";

      n_tables_to_drop = row_drop_tables_for_mysql_in_background();

      if (n_tables_to_drop > 0) {
      /* Do not monopolize the CPU even if there are tables waiting
      in the background drop queue. (It is essentially a bug if
      MySQL tries to drop a table while there are still open handles
      to it and we had to put it to the background drop queue.) */

      os_thread_sleep(100000);
      }

      srv_main_thread_op_info = "purging";

      /* Run a full purge */

      n_pages_purged = 1;

      last_flush_time = time(NULL);

      while (n_pages_purged) {
      if (srv_fast_shutdown && srv_shutdown_state > 0) {

      break;
      }

      srv_main_thread_op_info = "purging";
      n_pages_purged = trx_purge();

      current_time = time(NULL);

      if (difftime(current_time, last_flush_time) > 1) {
      srv_main_thread_op_info = "flushing log";

      log_buffer_flush_to_disk();
      last_flush_time = current_time;
      }
      }

      srv_main_thread_op_info = "reserving kernel mutex";

      mutex_enter(&kernel_mutex);
      if (srv_activity_count != old_activity_count) {
      mutex_exit(&kernel_mutex);
      goto loop;
      }
      mutex_exit(&kernel_mutex);

      srv_main_thread_op_info = "doing insert buffer merge";

      if (srv_fast_shutdown && srv_shutdown_state > 0) {
      n_bytes_merged = 0;
      } else {
      n_bytes_merged = ibuf_contract_for_n_pages(TRUE, 20);
      }

      srv_main_thread_op_info = "reserving kernel mutex";

      mutex_enter(&kernel_mutex);
      if (srv_activity_count != old_activity_count) {
      mutex_exit(&kernel_mutex);
      goto loop;
      }
      mutex_exit(&kernel_mutex);

    • flush loop 部分代码:

      1
      2
      3
      4
      5
      6
      7
      8
      9
      10
      11
      12
      13
      14
      15
      16
      17
      18
      19
      20
      21
      22
      23
      24
      25
      26
      27
      28
      29
      30
      31
      32
      33
      34
      35
      36
      37
      38
      39
      40
      41
      42
      43
      44
      45
      46
      47
      48
      49
      50
      51
      52
      53
      54
      55
      56
      57
      58
      59
      60
      61
      62
      63
      64
      65
      66
      67
      68
      69
      70
      71
      72
      73
      74
      75
      76
      77
      78
      79
      80

      flush_loop:
      srv_main_thread_op_info = "flushing buffer pool pages";

      if (srv_fast_shutdown < 2) {
      n_pages_flushed =
      buf_flush_batch(BUF_FLUSH_LIST, 100, ut_dulint_max);
      } else {
      /* In the fastest shutdown we do not flush the buffer pool
      to data files: we set n_pages_flushed to 0 artificially. */

      n_pages_flushed = 0;
      }

      srv_main_thread_op_info = "reserving kernel mutex";

      mutex_enter(&kernel_mutex);
      if (srv_activity_count != old_activity_count) {
      mutex_exit(&kernel_mutex);
      goto loop;
      }
      mutex_exit(&kernel_mutex);

      srv_main_thread_op_info = "waiting for buffer pool flush to end";
      buf_flush_wait_batch_end(BUF_FLUSH_LIST);

      srv_main_thread_op_info = "flushing log";

      log_buffer_flush_to_disk();

      srv_main_thread_op_info = "making checkpoint";

      log_checkpoint(TRUE, FALSE);

      if (buf_get_modified_ratio_pct() > srv_max_buf_pool_modified_pct) {

      /* Try to keep the number of modified pages in the
      buffer pool under the limit wished by the user */

      goto flush_loop;
      }

      srv_main_thread_op_info = "reserving kernel mutex";

      mutex_enter(&kernel_mutex);
      if (srv_activity_count != old_activity_count) {
      mutex_exit(&kernel_mutex);
      goto loop;
      }
      mutex_exit(&kernel_mutex);
      /*
      srv_main_thread_op_info = "archiving log (if log archive is on)";

      log_archive_do(FALSE, &n_bytes_archived);
      */
      n_bytes_archived = 0;

      /* Keep looping in the background loop if still work to do */

      if (srv_fast_shutdown && srv_shutdown_state > 0) {
      if (n_tables_to_drop + n_pages_flushed
      + n_bytes_archived != 0) {

      /* If we are doing a fast shutdown (= the default)
      we do not do purge or insert buffer merge. But we
      flush the buffer pool completely to disk.
      In a 'very fast' shutdown we do not flush the buffer
      pool to data files: we have set n_pages_flushed to
      0 artificially. */

      goto background_loop;
      }
      } else if (n_tables_to_drop +
      n_pages_purged + n_bytes_merged + n_pages_flushed
      + n_bytes_archived != 0) {
      /* In a 'slow' shutdown we run purge and the insert buffer
      merge to completion */

      goto background_loop;
      }
    • suspend loop

      1
      2
      3
      4
      5
      6
      7
      8
      9
      10
      11
      12
      13
      14
      15
      16
      17
      18
      19
      20
      21
      22
      23
      24
      25
      26
      27
      28
      29
      30
      31
      32
      33
      34
      35
      36
      37
      38
      39
      40
      41
      42
      /* There is no work for background operations either: suspend
      master thread to wait for more server activity */

      suspend_thread:
      srv_main_thread_op_info = "suspending";

      mutex_enter(&kernel_mutex);

      if (row_get_background_drop_list_len_low() > 0) {
      mutex_exit(&kernel_mutex);

      goto loop;
      }

      event = srv_suspend_thread();

      mutex_exit(&kernel_mutex);

      srv_main_thread_op_info = "waiting for server activity";

      os_event_wait(event);

      if (srv_shutdown_state == SRV_SHUTDOWN_EXIT_THREADS) {
      /* This is only extra safety, the thread should exit
      already when the event wait ends */

      os_thread_exit(NULL);
      }

      /* When there is user activity, InnoDB will set the event and the main
      thread goes back to loop: */

      goto loop;

      /* We count the number of threads in os_thread_exit(). A created
      thread should always use that to exit and not use return() to exit.
      The thread actually never comes here because it is exited in an
      os_event_wait(). */

      os_thread_exit(NULL);

      OS_THREAD_DUMMY_RETURN;

InnoDB1.2.x版本之前的Master Thread

  • 1.0.x版本之前的Master Thread存在对IO有限制,在缓冲池向磁盘刷新时其实都做了一定的硬编码.

  • 无论何时,InnoDB存储引擎最大只会刷新100个脏页到磁盘,合并20个插入缓冲.若是在写入密集的应用程序中,每秒可能会产生大于100个的脏页,如果是产生大于20插入缓冲的情况,Master Thread似乎会"忙不过来",或者说做的很慢.

  • 因此InnoDB Plugin提供了参数innodb_io_capacity,用来表示磁盘IO的吞入量,默认为200.

    1
    2
    3
    4
    5
    6
    7
    mysql> show variables like 'innodb_io_capacity';
    +--------------------+-------+
    | Variable_name | Value |
    +--------------------+-------+
    | innodb_io_capacity | 200 |
    +--------------------+-------+
    1 row in set (0.02 sec)
  • 对于刷新到磁盘页的数量,会按照innodb_io_capacity的百分比来进行控制:

    • 在合并插入缓冲时,合并插入缓冲的数量为innodb_io_capacity值的5%.
    • 在从缓冲区刷新脏页时,刷新脏页的数量为innodb_io_capacity.
  • 参数innodb_adaptive_flushing(自适应刷新),该值影响每秒刷新脏页的数量.

    1
    2
    3
    4
    5
    6
    7
    mysql> show variables like 'innodb_adaptive_flushing';
    +--------------------------+-------+
    | Variable_name | Value |
    +--------------------------+-------+
    | innodb_adaptive_flushing | ON |
    +--------------------------+-------+
    1 row in set (0.00 sec)
  • 参数innodb_purge_batch_size,该参数可以控制每次full purge回收Undo页的数量.

    1
    2
    3
    4
    5
    6
    7
    mysql> show variables like 'innodb_purge_batch_size';
    +-------------------------+-------+
    | Variable_name | Value |
    +-------------------------+-------+
    | innodb_purge_batch_size | 300 |
    +-------------------------+-------+
    1 row in set (0.00 sec)

InnoDB1.2.x版本的Master Thread

  • MySQL5.6版本storage/innobase/srv/srv0srv.cc源码:

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    50
    51
    52
    53
    54
    55
    56
    57
    58
    59
    60
    61
    62
    63
    64
    65
    66
    67
    68
    69
    70
    71
    static
    void
    srv_master_sleep(void)
    /*==================*/
    {
    srv_main_thread_op_info = "sleeping";
    os_thread_sleep(1000000);
    srv_main_thread_op_info = "";
    }

    extern "C" UNIV_INTERN
    os_thread_ret_t
    DECLARE_THREAD(srv_master_thread)(
    /*==============================*/
    void* arg MY_ATTRIBUTE((unused)))
    /*!< in: a dummy parameter required by
    os_thread_create */
    {
    // ... 略
    loop:
    // 数据库启动时,检查是否需要数据库恢复操作
    if (srv_force_recovery >= SRV_FORCE_NO_BACKGROUND) {
    // 若需要恢复进入 suspend_thread 循环部分
    goto suspend_thread;
    }
    // 数据库处于正常状态
    while (srv_shutdown_state == SRV_SHUTDOWN_NONE) {
    // sleep 1 秒
    srv_master_sleep();

    MONITOR_INC(MONITOR_MASTER_THREAD_SLEEP);
    // 判断数据库是否处于空闲
    if (srv_check_activity(old_activity_count)) {
    old_activity_count = srv_get_activity_count();
    // 数据库非空闲时 每一秒的操作
    srv_master_do_active_tasks();
    } else {
    // 数据库空闲时 每10秒的操作
    srv_master_do_idle_tasks();
    }
    }

    while (srv_master_do_shutdown_tasks(&last_print_time)) {

    /* Shouldn't loop here in case of very fast shutdown */
    ut_ad(srv_fast_shutdown < 2);
    }

    suspend_thread:
    //挂起 master thread 以在其线程槽 slot 中等待事件。
    srv_main_thread_op_info = "suspending";

    srv_suspend_thread(slot);

    /* DO NOT CHANGE THIS STRING. innobase_start_or_create_for_mysql()
    waits for database activity to die down when converting < 4.1.x
    databases, and relies on this string being exactly as it is. InnoDB
    manual also mentions this string in several places. */
    srv_main_thread_op_info = "waiting for server activity";

    os_event_wait(slot->event);
    // 判断是否需要退出所有线程
    if (srv_shutdown_state == SRV_SHUTDOWN_EXIT_THREADS) {
    my_thread_end();
    os_thread_exit(NULL);
    }

    goto loop;

    OS_THREAD_DUMMY_RETURN; /* Not reached, avoid compiler warning */
    }
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    enum srv_shutdown_state {
    SRV_SHUTDOWN_NONE = 0, /*!< Database running normally */
    SRV_SHUTDOWN_CLEANUP, /*!< Cleaning up in
    logs_empty_and_mark_files_at_shutdown() */
    SRV_SHUTDOWN_FLUSH_PHASE,/*!< At this phase the master and the
    purge threads must have completed their
    work. Once we enter this phase the
    page_cleaner can clean up the buffer
    pool and exit */
    SRV_SHUTDOWN_LAST_PHASE,/*!< Last phase after ensuring that
    the buffer pool can be freed: flush
    all file spaces and close all files */
    SRV_SHUTDOWN_EXIT_THREADS/*!< Exit all threads */
    };