[Repost] Golang Memory Alignment


## Preface: Terminology

  • Word (machine word)

A word is the natural unit of data for a processor, also called a machine word. It is the fixed-length chunk a computer handles in a single operation.

  • Word length

The number of bits in a word.

Modern computers typically have word lengths of 16, 32, or 64 bits. (In general, an N-bit system has a word length of N/8 bytes.)

Most registers in a computer are one word wide, the unit of data transfer between the CPU and memory is usually one word, and the addresses used to identify memory locations are also commonly expressed in word-sized units.
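
To see the word size of the machine you are on, here is a minimal sketch (my own example, standard library only) that prints the bit width of int and the byte size of a pointer-sized word:

  package main
  
  import (
      "fmt"
      "strconv"
      "unsafe"
  )
  
  func main() {
      // strconv.IntSize is the bit width of int on this platform (32 or 64).
      fmt.Println("int bits:", strconv.IntSize)
      // A uintptr is one machine word; its size is 4 or 8 bytes.
      fmt.Println("word bytes:", unsafe.Sizeof(uintptr(0)))
  }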


## 1. Why Memory Alignment?

Simply put, the CPU does not access memory one byte at a time; it accesses it in word-sized chunks of 2, 4, or 8 bytes.

So when the processor reads data from the memory subsystem into a register, or writes register data back to memory, the transfer size is usually one word.

For example, a 32-bit system has an access granularity of 4 bytes, and a 64-bit system 8 bytes.

When the data being accessed is n bytes long and its address is n-byte aligned, the processor can reach it in a single access, which is more efficient: no extra reads and no alignment fix-up work are needed.


## 2. The Performance Gain from Memory Alignment

  package test
  import (
      "testing"
      "unsafe"
  )
  
  var ptrSize uintptr
  
  func init() {
      ptrSize = unsafe.Sizeof(uintptr(1))
  }
  
  type SType struct {
      b [32]byte
  }
  
  func BenchmarkUnAligned(b *testing.B) {
      x := SType{}
      address := unsafe.Pointer(&x.b[9])
      if uintptr(address)%ptrSize == 0 {
          b.Error("not UnAligned Address")
      }
      tmp := (*int64)(address)
      b.ResetTimer()
      for i := 0; i < b.N; i++ {
          *tmp = int64(i)
      }
  }
  
  func BenchmarkAligned(b *testing.B) {
      x := SType{}
      address := uintptr(unsafe.Pointer(&x.b))
      if address%ptrSize != 0 {
          b.Error("not Aligned Address")
      }
      tmp := (*int64)(unsafe.Pointer(address))
      b.ResetTimer()
      for i := 0; i < b.N; i++ {
          *tmp = int64(i)
      }
  }

Run

go test -gcflags='-N -l' ./11_test.go -bench . -count 3

output:

  goos: darwin
  goarch: amd64
  BenchmarkUnAligned-12    614625067    2.04 ns/op
  BenchmarkUnAligned-12    518480124    1.94 ns/op
  BenchmarkUnAligned-12    602900706    2.02 ns/op
  BenchmarkAligned-12      682062852    1.77 ns/op
  BenchmarkAligned-12      684985354    1.76 ns/op
  BenchmarkAligned-12      690433512    1.76 ns/op
  PASS
  ok      command-line-arguments    8.294s

### Conclusion: when memory is misaligned (the memory address is not aligned), the processor has to touch two memory blocks to modify the data, which incurs extra overhead


## 3. Data Structure Alignment, Illustrated

### 3.1 Memory Footprint of Basic Types

| type | size in bytes |
| :--- | :--- |
| byte, uint8, int8 | 1 |
| uint16, int16 | 2 |
| uint32, int32, float32 | 4 |
| uint64, int64, float64, complex64 | 8 |
| complex128 | 16 |
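
These sizes can be checked directly with unsafe.Sizeof; a minimal sketch:

  package main
  
  import (
      "fmt"
      "unsafe"
  )
  
  func main() {
      fmt.Println(unsafe.Sizeof(int8(0)))       // 1
      fmt.Println(unsafe.Sizeof(int16(0)))      // 2
      fmt.Println(unsafe.Sizeof(float32(0)))    // 4
      fmt.Println(unsafe.Sizeof(int64(0)))      // 8
      fmt.Println(unsafe.Sizeof(complex64(0)))  // 8
      fmt.Println(unsafe.Sizeof(complex128(0))) // 16
  }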



### 3.2 Recommended Struct Layout Tools

  # layout
  go get github.com/ajstarks/svgo/structlayout-svg   # converts the JSON output into an SVG image
  go get -u honnef.co/go/tools
  go install honnef.co/go/tools/cmd/structlayout
  go install honnef.co/go/tools/cmd/structlayout-pretty
  
  # optimize
  go install honnef.co/go/tools/cmd/structlayout-optimize



### 3.3 Example

  // /Users/eric/GoProject/temp_test/test/ag/12.go
  type Ag struct {
      arr [2]int8
      bl  bool
      sl  []int16
      ptr *int64
      st  struct {
          str string
      }
      m map[string]int64
      i interface{}
  }
  

Run


structlayout -json ./test/ag Ag | structlayout-svg -t 'align-guarantee' > ag.svg

(Figure: layout of struct Ag, generated as ag.svg)

From the figure (the sketch below this list verifies the same numbers with unsafe.Offsetof):

  • the array occupies element size × element count = 2 bytes
  • bool is 1 byte
  • because of memory alignment, 5 bytes of padding are inserted automatically
  • slice is 24 bytes
  • pointer is 8 bytes
  • string is 16 bytes
  • map is 8 bytes
  • interface is 16 bytes
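
The same numbers can be reproduced without the SVG tooling; this sketch (my own addition) prints the field offsets and total size of Ag on a 64-bit system:

  package main
  
  import (
      "fmt"
      "unsafe"
  )
  
  type Ag struct {
      arr [2]int8
      bl  bool
      sl  []int16
      ptr *int64
      st  struct {
          str string
      }
      m map[string]int64
      i interface{}
  }
  
  func main() {
      var a Ag
      // Expected on 64-bit: arr=0, bl=2, sl=8 (5 bytes of padding before it),
      // ptr=32, st=40, m=56, i=64, total size 80.
      fmt.Println(unsafe.Offsetof(a.arr), unsafe.Offsetof(a.bl), unsafe.Offsetof(a.sl),
          unsafe.Offsetof(a.ptr), unsafe.Offsetof(a.st), unsafe.Offsetof(a.m),
          unsafe.Offsetof(a.i), unsafe.Sizeof(a))
  }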


### 3.4 Data Structure Alignment

  // Note: int is 8 bytes or 4 bytes depending on whether the platform is 64-bit or 32-bit
  
  // reflect/value.go
  // StringHeader is the runtime representation of a string.
  // It cannot be used safely or portably and its representation may
  // change in a later release.
  // Moreover, the Data field is not sufficient to guarantee the data
  // it references will not be garbage collected, so programs must keep
  // a separate, correctly typed pointer to the underlying data.
  type StringHeader struct {
  	Data uintptr // 8 byte
  	Len  int // 8 byte
  }// total 16 byte
  
  // SliceHeader is the runtime representation of a slice.
  // It cannot be used safely or portably and its representation may
  // change in a later release.
  // Moreover, the Data field is not sufficient to guarantee the data
  // it references will not be garbage collected, so programs must keep
  // a separate, correctly typed pointer to the underlying data.
  type SliceHeader struct {
  	Data uintptr // 8 byte
  	Len  int // 8 byte
  	Cap  int // 8 byte
  }// total 24 byte
  
  // runtime/map.go
  // A header for a Go map.
  // Although a map's internal structure is fairly complex, a map value is really just a pointer to this hmap struct, so its size is 8 bytes
  type hmap struct {
  	// Note: the format of the hmap is also encoded in cmd/compile/internal/gc/reflect.go.
  	// Make sure this stays in sync with the compiler's definition.
  	count     int // # live cells == size of map.  Must be first (used by len() builtin)
  	flags     uint8
  	B         uint8  // log_2 of # of buckets (can hold up to loadFactor * 2^B items)
  	noverflow uint16 // approximate number of overflow buckets; see incrnoverflow for details
  	hash0     uint32 // hash seed
  
  	buckets    unsafe.Pointer // array of 2^B Buckets. may be nil if count==0.
  	oldbuckets unsafe.Pointer // previous bucket array of half the size, non-nil only when growing
  	nevacuate  uintptr        // progress counter for evacuation (buckets less than this have been evacuated)
  
  	extra *mapextra // optional fields
  }
  
  
  // runtime/runtime2.go
  type iface struct {
  	tab  *itab // 8 byte
  	data unsafe.Pointer // 8 byte
  }// total 16 byte
  
  type eface struct {
  	_type *_type // 8 byte 
  	data  unsafe.Pointer // 8 byte 
  }// total 16 byte
  

### 3.5 Special Case (Final Zero Field)

  
  package main
  
  import (
      "fmt"
      "unsafe"
  )
  
  type T1 struct {
      a struct{}
      x int64
  }
  type T2 struct {
      x int64
      a struct{}
  }
  
  func main() {
      a1 := T1{}
      a2 := T2{}
      fmt.Printf("zero size struct of T1 is %d , T2 (as final filed) size :%d",
          unsafe.Sizeof(a1),
          unsafe.Sizeof(a2),
      )
  }
  

output


zero size struct of T1 is 8 , T2 (as final field) size :16

T2 on a 64-bit system: 16 bytes

T2 on a 32-bit system: 12 bytes

#### Reason:

When the last field of a Go struct has zero size, the compiler adds trailing padding. The reason is that a pointer to that final zero-size field would otherwise point just past the end of the struct's memory, into a neighboring object, which would confuse garbage collection; padding the struct keeps such a pointer inside the allocation.
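
A short sketch (my own addition) makes the cause visible: in T2 the zero-size field sits at offset 8, which would otherwise be the end of the struct, so the struct is padded out to 16 bytes:

  package main
  
  import (
      "fmt"
      "unsafe"
  )
  
  type T2 struct {
      x int64
      a struct{}
  }
  
  func main() {
      var t T2
      // &t.a must stay inside the object, so padding is added after the zero-size field.
      fmt.Println(unsafe.Offsetof(t.a), unsafe.Sizeof(t)) // 8 16 on a 64-bit system
  }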


### 3.6 Reordering Optimization (the blunt approach: sort fields by decreasing alignment)

Reordering the fields can reduce the memory a struct occupies; the sketch after the listing below compares the sizes before and after.

  
  type MakeMeBetter struct {
      i16 int16
      i64 int64
      i8 int8
      i32 int32
      ptr *string
      b bool
  }
  // $ structlayout -json ./ MakeMeBetter |structlayout-optimize -r
  //output :
  //MakeMeBetter.i64 int64: 0-8 (size 8, align 8)
  //MakeMeBetter.ptr *string: 8-16 (size 8, align 8)
  //MakeMeBetter.i32 int32: 16-20 (size 4, align 4)
  //MakeMeBetter.i16 int16: 20-22 (size 2, align 2)
  //MakeMeBetter.i8 int8: 22-23 (size 1, align 1)
  //MakeMeBetter.b bool: 23-24 (size 1, align 1)
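
A rough before/after comparison (my own sketch; the reordered type name MadeBetter is made up for illustration): on a 64-bit system the original ordering takes 40 bytes, while the reordered version packs into 24 bytes.

  package main
  
  import (
      "fmt"
      "unsafe"
  )
  
  type MakeMeBetter struct { // original order
      i16 int16
      i64 int64
      i8  int8
      i32 int32
      ptr *string
      b   bool
  }
  
  type MadeBetter struct { // fields sorted by decreasing alignment
      i64 int64
      ptr *string
      i32 int32
      i16 int16
      i8  int8
      b   bool
  }
  
  func main() {
      fmt.Println(unsafe.Sizeof(MakeMeBetter{}), unsafe.Sizeof(MadeBetter{})) // 40 24
  }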

(Figure: layout before optimization)

(Figure: layout after optimization)


## 4. Memory Address Alignment

A computer architecture may require memory addresses to be aligned; that is, the address of a variable must be a multiple of a factor, namely the alignment guarantee of the variable's type.

The function unsafe.Alignof takes an expression denoting a variable of any type and returns the alignment of that variable's type, in bytes.

For a variable x:

  uintptr(unsafe.Pointer(&x)) % unsafe.Alignof(x) == 0
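
A quick check of this invariant for a few variables (my own sketch):

  package main
  
  import (
      "fmt"
      "unsafe"
  )
  
  func main() {
      var b bool
      var i32 int32
      var i64 int64
      // Every address is a multiple of its type's alignment guarantee.
      fmt.Println(uintptr(unsafe.Pointer(&b))%unsafe.Alignof(b) == 0)     // true
      fmt.Println(uintptr(unsafe.Pointer(&i32))%unsafe.Alignof(i32) == 0) // true
      fmt.Println(uintptr(unsafe.Pointer(&i64))%unsafe.Alignof(i64) == 0) // true
  }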

### A misaligned 64-bit atomic operation panics on a 32-bit system:

  
  x := SType{}
  for i := range x.b {
    fmt.Println(i, uintptr(unsafe.Pointer(&x.b[i])))
  }
  a := (*uint64)(unsafe.Pointer(&x.b[9]))
  fmt.Printf("地址值:%d\n", uintptr(unsafe.Pointer(a)))
  fmt.Printf("地址值除以8的余数:%d\n", uintptr(unsafe.Pointer(a))%8)
  fmt.Printf("原子操作结果:%d\n", atomic.AddUint64(a, math.MaxUint64))
  fmt.Println(atomic.LoadUint64(a))
  

### Output:

  Address: 405799145
  Address mod 8: 1
  panic: runtime error: invalid memory address or nil pointer dereference
  [signal 0xb code=0x1 addr=0x1 pc=0x8081aac]
  
  goroutine 16 [running]:
  runtime.panic(0x80e0080, 0x814add3)
  	/usr/lib/go/src/pkg/runtime/panic.c:279 +0xe9
  sync/atomic.AddUint64(0x183000e9, 0xffffffff, 0xffffffff, 0x1, 0x1)
  	/usr/lib/go/src/pkg/sync/atomic/asm_386.s:118 +0xc
  main.main()
  	/root/14.go:33 +0x21a
  
  goroutine 17 [runnable]:
  runtime.MHeap_Scavenger()
  	/usr/lib/go/src/pkg/runtime/mheap.c:507
  runtime.goexit()
  	/usr/lib/go/src/pkg/runtime/proc.c:1445
  
  goroutine 18 [runnable]:
  bgsweep()
  	/usr/lib/go/src/pkg/runtime/mgc0.c:1976
  runtime.goexit()
  	/usr/lib/go/src/pkg/runtime/proc.c:1445
  
  goroutine 19 [runnable]:
  runfinq()
  	/usr/lib/go/src/pkg/runtime/mgc0.c:2606
  runtime.goexit()
  	/usr/lib/go/src/pkg/runtime/proc.c:1445
  exit status 2


### 4.1 Example

  
  // A WaitGroup waits for a collection of goroutines to finish.
  // The main goroutine calls Add to set the number of
  // goroutines to wait for. Then each of the goroutines
  // runs and calls Done when finished. At the same time,
  // Wait can be used to block until all goroutines have finished.
  //
  // A WaitGroup must not be copied after first use.
  type WaitGroup struct {
  	noCopy noCopy
  
  	// 64-bit value: high 32 bits are counter, low 32 bits are waiter count.
  	// 64-bit atomic operations require 64-bit alignment, but 32-bit
  	// compilers do not ensure it. So we allocate 12 bytes and then use
  	// the aligned 8 bytes in them as state, and the other 4 as storage
  	// for the sema.
  	state1 [3]uint32
  }
  
  // state returns pointers to the state and sema fields stored within wg.state1.
  func (wg *WaitGroup) state() (statep *uint64, semap *uint32) {
    // check whether wg.state1 is 8-byte aligned
  	if uintptr(unsafe.Pointer(&wg.state1))%8 == 0 {
      // if aligned, the first 8 bytes are the uint64 state and the last 4 bytes are the sema
  		return (*uint64)(unsafe.Pointer(&wg.state1)), &wg.state1[2]
  	} else {
      // otherwise, the first 4 bytes are the sema and the aligned last 8 bytes are the state
  		return (*uint64)(unsafe.Pointer(&wg.state1[1])), &wg.state1[0]
  	}
  }
  

### 4.2 Safe Access to 64-bit Words (32-bit Systems)

On x86-32, the 64-bit functions use instructions unavailable before the Pentium MMX.

On non-Linux ARM, the 64-bit functions use instructions unavailable before the ARMv6k core.

On ARM, x86-32, and 32-bit MIPS, it is the caller's responsibility to arrange for 64-bit alignment of 64-bit words accessed atomically. The first word in a variable or in an allocated struct, array, or slice can be relied upon to be 64-bit aligned.

https://golang.org/pkg/sync/atomic/#pkg-note-BUG
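
In practice this guarantee is usually exploited by putting the atomically accessed 64-bit field first in the struct. A minimal sketch (the Stats type is my own example, not from the original post):

  package main
  
  import (
      "fmt"
      "sync/atomic"
  )
  
  // counter is the first word of the struct, so in an allocated Stats it is
  // 64-bit aligned even on 32-bit platforms, making the atomic calls safe.
  type Stats struct {
      counter uint64
      name    string
  }
  
  func main() {
      s := &Stats{name: "requests"}
      atomic.AddUint64(&s.counter, 1)
      fmt.Println(atomic.LoadUint64(&s.counter))
  }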


### 4.3 Safe Access to 64-bit Words: Why?

This is because the int64 that follows the bool is misaligned: it is 32-bit aligned but not 64-bit aligned, since on a 32-bit system it is effectively just two 32-bit values sitting side by side.

https://github.com/golang/go/issues/6404#issuecomment-66085602

  type WillPanic struct {
  	init      bool
  	uncounted int64
  }
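
The usual fix is simply to move the 64-bit field to the front so it becomes the first word of the struct (a sketch relying on the guarantee quoted in 4.2; WontPanic is a made-up name):

  type WontPanic struct {
  	uncounted int64 // first word: 64-bit aligned in an allocated struct
  	init      bool
  }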

### 4.4 Safe Access to 64-bit Words: How?

The first word in a variable or in an allocated struct, array, or slice can be relied upon to be 64-bit aligned.

  	// The slice's element data is 64-bit aligned: inside the struct the slice is
  	// just a header pointing at a separate allocation, whose address does not
  	// depend on the struct's layout.
  	c5 := struct {
  		val   int64
  		valid bool
  		val2  []int64
  	}{val2: []int64{0}}
  	fmt.Println("64-bit word in a slice inside a struct:", atomic.AddInt64(&c5.val2[0], 1))
  
  	// The array is NOT independently aligned: its memory is laid out inline in the
  	// struct, so it inherits the struct's layout and may be misaligned on 32-bit
  	// systems (this call can panic there).
  	c6 := struct {
  		val   int64
  		valid bool
  		val2  [3]int64
  	}{val2: [3]int64{0}}
  	fmt.Println("64-bit word in an array inside a struct:", atomic.AddInt64(&c6.val2[0], 1))
      

### 4.5 Examples from the Runtime Source

In the scheduler's per-P context p (which manages a P's local goroutine run queue), the uint64 field timer0When is updated atomically and therefore must be 8-byte aligned even on 32-bit systems; note the explicit _ uint32 padding field placed just before it in the listing below.

  // runtime/runtime2.go
  
  type p struct {
  	id          int32
  	status      uint32 // one of pidle/prunning/...
  	link        puintptr
  	schedtick   uint32     // incremented on every scheduler call
  	syscalltick uint32     // incremented on every system call
  	sysmontick  sysmontick // last tick observed by sysmon
  	m           muintptr   // back-link to associated m (nil if idle)
  	mcache      *mcache
  	pcache      pageCache
  	raceprocctx uintptr
  
  	deferpool    [5][]*_defer // pool of available defer structs of different sizes (see panic.go)
  	deferpoolbuf [5][32]*_defer
  
  	// Cache of goroutine ids, amortizes accesses to runtime·sched.goidgen.
  	goidcache    uint64
  	goidcacheend uint64
  
  	// Queue of runnable goroutines. Accessed without lock.
  	runqhead uint32
  	runqtail uint32
  	runq     [256]guintptr
  	// runnext, if non-nil, is a runnable G that was ready'd by
  	// the current G and should be run next instead of what's in
  	// runq if there's time remaining in the running G's time
  	// slice. It will inherit the time left in the current time
  	// slice. If a set of goroutines is locked in a
  	// communicate-and-wait pattern, this schedules that set as a
  	// unit and eliminates the (potentially large) scheduling
  	// latency that otherwise arises from adding the ready'd
  	// goroutines to the end of the run queue.
  	runnext guintptr
  
  	// Available G's (status == Gdead)
  	gFree struct {
  		gList
  		n int32
  	}
  
  	sudogcache []*sudog
  	sudogbuf   [128]*sudog
  
  	// Cache of mspan objects from the heap.
  	mspancache struct {
  		// We need an explicit length here because this field is used
  		// in allocation codepaths where write barriers are not allowed,
  		// and eliminating the write barrier/keeping it eliminated from
  		// slice updates is tricky, moreso than just managing the length
  		// ourselves.
  		len int
  		buf [128]*mspan
  	}
  
  	tracebuf traceBufPtr
  
  	// traceSweep indicates the sweep events should be traced.
  	// This is used to defer the sweep start event until a span
  	// has actually been swept.
  	traceSweep bool
  	// traceSwept and traceReclaimed track the number of bytes
  	// swept and reclaimed by sweeping in the current sweep loop.
  	traceSwept, traceReclaimed uintptr
  
  	palloc persistentAlloc // per-P to avoid mutex
  
  	_ uint32 // Alignment for atomic fields below
  
  	// The when field of the first entry on the timer heap.
  	// This is updated using atomic functions.
  	// This is 0 if the timer heap is empty.
  	timer0When uint64
  
  	// Per-P GC state
  	gcAssistTime         int64    // Nanoseconds in assistAlloc
  	gcFractionalMarkTime int64    // Nanoseconds in fractional mark worker (atomic)
  	gcBgMarkWorker       guintptr // (atomic)
  	gcMarkWorkerMode     gcMarkWorkerMode
  
  	// gcMarkWorkerStartTime is the nanotime() at which this mark
  	// worker started.
  	gcMarkWorkerStartTime int64
  
  	// gcw is this P's GC work buffer cache. The work buffer is
  	// filled by write barriers, drained by mutator assists, and
  	// disposed on certain GC state transitions.
  	gcw gcWork
  
  	// wbBuf is this P's GC write barrier buffer.
  	//
  	// TODO: Consider caching this in the running G.
  	wbBuf wbBuf
  
  	runSafePointFn uint32 // if 1, run sched.safePointFn at next safe point
  
  	// Lock for timers. We normally access the timers while running
  	// on this P, but the scheduler can also do it from a different P.
  	timersLock mutex
  
  	// Actions to take at some time. This is used to implement the
  	// standard library's time package.
  	// Must hold timersLock to access.
  	timers []*timer
  
  	// Number of timers in P's heap.
  	// Modified using atomic instructions.
  	numTimers uint32
  
  	// Number of timerModifiedEarlier timers on P's heap.
  	// This should only be modified while holding timersLock,
  	// or while the timer status is in a transient state
  	// such as timerModifying.
  	adjustTimers uint32
  
  	// Number of timerDeleted timers in P's heap.
  	// Modified using atomic instructions.
  	deletedTimers uint32
  
  	// Race context used while executing timer functions.
  	timerRaceCtx uintptr
  
  	// preempt is set to indicate that this P should be enter the
  	// scheduler ASAP (regardless of what G is running on it).
  	preempt bool
  
  	pad cpu.CacheLinePad
  }
  

In mheap, which handles heap object allocation, the central free lists (central) backing the global caches must take a mutex on allocate/free; to keep the mutexes of different lists from falsely sharing a cache line, cache-line padding is inserted between them.

For background on cache lines, see: https://appliedgo.net/concurrencyslower/

  // runtime/mheap.go
  
  // Main malloc heap.
  // The heap itself is the "free" and "scav" treaps,
  // but all the other global data is here too.
  //
  // mheap must not be heap-allocated because it contains mSpanLists,
  // which must not be heap-allocated.
  //
  //go:notinheap
  type mheap struct {
  	// lock must only be acquired on the system stack, otherwise a g
  	// could self-deadlock if its stack grows with the lock held.
  	lock      mutex
  	pages     pageAlloc // page allocation data structure
  	sweepgen  uint32    // sweep generation, see comment in mspan; written during STW
  	sweepdone uint32    // all spans are swept
  	sweepers  uint32    // number of active sweepone calls
  
  	// allspans is a slice of all mspans ever created. Each mspan
  	// appears exactly once.
  	//
  	// The memory for allspans is manually managed and can be
  	// reallocated and move as the heap grows.
  	//
  	// In general, allspans is protected by mheap_.lock, which
  	// prevents concurrent access as well as freeing the backing
  	// store. Accesses during STW might not hold the lock, but
  	// must ensure that allocation cannot happen around the
  	// access (since that may free the backing store).
  	allspans []*mspan // all spans out there
  
  	// sweepSpans contains two mspan stacks: one of swept in-use
  	// spans, and one of unswept in-use spans. These two trade
  	// roles on each GC cycle. Since the sweepgen increases by 2
  	// on each cycle, this means the swept spans are in
  	// sweepSpans[sweepgen/2%2] and the unswept spans are in
  	// sweepSpans[1-sweepgen/2%2]. Sweeping pops spans from the
  	// unswept stack and pushes spans that are still in-use on the
  	// swept stack. Likewise, allocating an in-use span pushes it
  	// on the swept stack.
  	sweepSpans [2]gcSweepBuf
  
  	// _ uint32 // align uint64 fields on 32-bit for atomics
  
  	// Proportional sweep
  	//
  	// These parameters represent a linear function from heap_live
  	// to page sweep count. The proportional sweep system works to
  	// stay in the black by keeping the current page sweep count
  	// above this line at the current heap_live.
  	//
  	// The line has slope sweepPagesPerByte and passes through a
  	// basis point at (sweepHeapLiveBasis, pagesSweptBasis). At
  	// any given time, the system is at (memstats.heap_live,
  	// pagesSwept) in this space.
  	//
  	// It's important that the line pass through a point we
  	// control rather than simply starting at a (0,0) origin
  	// because that lets us adjust sweep pacing at any time while
  	// accounting for current progress. If we could only adjust
  	// the slope, it would create a discontinuity in debt if any
  	// progress has already been made.
  	pagesInUse         uint64  // pages of spans in stats mSpanInUse; updated atomically
  	pagesSwept         uint64  // pages swept this cycle; updated atomically
  	pagesSweptBasis    uint64  // pagesSwept to use as the origin of the sweep ratio; updated atomically
  	sweepHeapLiveBasis uint64  // value of heap_live to use as the origin of sweep ratio; written with lock, read without
  	sweepPagesPerByte  float64 // proportional sweep ratio; written with lock, read without
  	// TODO(austin): pagesInUse should be a uintptr, but the 386
  	// compiler can't 8-byte align fields.
  
  	// scavengeGoal is the amount of total retained heap memory (measured by
  	// heapRetained) that the runtime will try to maintain by returning memory
  	// to the OS.
  	scavengeGoal uint64
  
  	// Page reclaimer state
  
  	// reclaimIndex is the page index in allArenas of next page to
  	// reclaim. Specifically, it refers to page (i %
  	// pagesPerArena) of arena allArenas[i / pagesPerArena].
  	//
  	// If this is >= 1<<63, the page reclaimer is done scanning
  	// the page marks.
  	//
  	// This is accessed atomically.
  	reclaimIndex uint64
  	// reclaimCredit is spare credit for extra pages swept. Since
  	// the page reclaimer works in large chunks, it may reclaim
  	// more than requested. Any spare pages released go to this
  	// credit pool.
  	//
  	// This is accessed atomically.
  	reclaimCredit uintptr
  
  	// Malloc stats.
  	largealloc  uint64                  // bytes allocated for large objects
  	nlargealloc uint64                  // number of large object allocations
  	largefree   uint64                  // bytes freed for large objects (>maxsmallsize)
  	nlargefree  uint64                  // number of frees for large objects (>maxsmallsize)
  	nsmallfree  [_NumSizeClasses]uint64 // number of frees for small objects (<=maxsmallsize)
  
  	// arenas is the heap arena map. It points to the metadata for
  	// the heap for every arena frame of the entire usable virtual
  	// address space.
  	//
  	// Use arenaIndex to compute indexes into this array.
  	//
  	// For regions of the address space that are not backed by the
  	// Go heap, the arena map contains nil.
  	//
  	// Modifications are protected by mheap_.lock. Reads can be
  	// performed without locking; however, a given entry can
  	// transition from nil to non-nil at any time when the lock
  	// isn't held. (Entries never transitions back to nil.)
  	//
  	// In general, this is a two-level mapping consisting of an L1
  	// map and possibly many L2 maps. This saves space when there
  	// are a huge number of arena frames. However, on many
  	// platforms (even 64-bit), arenaL1Bits is 0, making this
  	// effectively a single-level map. In this case, arenas[0]
  	// will never be nil.
  	arenas [1 << arenaL1Bits]*[1 << arenaL2Bits]*heapArena
  
  	// heapArenaAlloc is pre-reserved space for allocating heapArena
  	// objects. This is only used on 32-bit, where we pre-reserve
  	// this space to avoid interleaving it with the heap itself.
  	heapArenaAlloc linearAlloc
  
  	// arenaHints is a list of addresses at which to attempt to
  	// add more heap arenas. This is initially populated with a
  	// set of general hint addresses, and grown with the bounds of
  	// actual heap arena ranges.
  	arenaHints *arenaHint
  
  	// arena is a pre-reserved space for allocating heap arenas
  	// (the actual arenas). This is only used on 32-bit.
  	arena linearAlloc
  
  	// allArenas is the arenaIndex of every mapped arena. This can
  	// be used to iterate through the address space.
  	//
  	// Access is protected by mheap_.lock. However, since this is
  	// append-only and old backing arrays are never freed, it is
  	// safe to acquire mheap_.lock, copy the slice header, and
  	// then release mheap_.lock.
  	allArenas []arenaIdx
  
  	// sweepArenas is a snapshot of allArenas taken at the
  	// beginning of the sweep cycle. This can be read safely by
  	// simply blocking GC (by disabling preemption).
  	sweepArenas []arenaIdx
  
  	// curArena is the arena that the heap is currently growing
  	// into. This should always be physPageSize-aligned.
  	curArena struct {
  		base, end uintptr
  	}
  
  	_ uint32 // ensure 64-bit alignment of central
  
  	// central free lists for small size classes.
  	// the padding makes sure that the mcentrals are
  	// spaced CacheLinePadSize bytes apart, so that each mcentral.lock
  	// gets its own cache line.
  	// central is indexed by spanClass.
  	central [numSpanClasses]struct {
  		mcentral mcentral
  		pad      [cpu.CacheLinePadSize - unsafe.Sizeof(mcentral{})%cpu.CacheLinePadSize]byte
  	}
  
  	spanalloc             fixalloc // allocator for span*
  	cachealloc            fixalloc // allocator for mcache*
  	specialfinalizeralloc fixalloc // allocator for specialfinalizer*
  	specialprofilealloc   fixalloc // allocator for specialprofile*
  	speciallock           mutex    // lock for special record allocators.
  	arenaHintAlloc        fixalloc // allocator for arenaHints
  
  	unused *specialfinalizer // never set, just here to force the specialfinalizer type into DWARF
  }

### 4.6 Safe Access to 64-bit Words (a Bug!)

If the struct containing the first 64-bit word is 12 bytes in size, 64-bit alignment of that word is not always guaranteed.

This is because the tiny allocator does not provide that alignment guarantee for small objects: as the mallocgc code below shows, it only aligns to 8 bytes when the requested size is itself a multiple of 8.

https://github.com/golang/go/issues/37262#issuecomment-587576192

  //runtime/malloc.go
  
  // Allocate an object of size bytes.
  // Small objects are allocated from the per-P cache's free lists.
  // Large objects (> 32 kB) are allocated straight from the heap.
  func mallocgc(size uintptr, typ *_type, needzero bool) unsafe.Pointer {
  	if gcphase == _GCmarktermination {
  		throw("mallocgc called with gcphase == _GCmarktermination")
  	}
  
  	if size == 0 {
  		return unsafe.Pointer(&zerobase)
  	}
  
  	if debug.sbrk != 0 {
  		align := uintptr(16)
  		if typ != nil {
  			// TODO(austin): This should be just
  			//   align = uintptr(typ.align)
  			// but that's only 4 on 32-bit platforms,
  			// even if there's a uint64 field in typ (see #599).
  			// This causes 64-bit atomic accesses to panic.
  			// Hence, we use stricter alignment that matches
  			// the normal allocator better.
  			if size&7 == 0 {
  				align = 8
  			} else if size&3 == 0 {
  				align = 4
  			} else if size&1 == 0 {
  				align = 2
  			} else {
  				align = 1
  			}
  		}
  		return persistentalloc(size, align, &memstats.other_sys)
  	}
  
  	// assistG is the G to charge for this allocation, or nil if
  	// GC is not currently active.
  	var assistG *g
  	if gcBlackenEnabled != 0 {
  		// Charge the current user G for this allocation.
  		assistG = getg()
  		if assistG.m.curg != nil {
  			assistG = assistG.m.curg
  		}
  		// Charge the allocation against the G. We'll account
  		// for internal fragmentation at the end of mallocgc.
  		assistG.gcAssistBytes -= int64(size)
  
  		if assistG.gcAssistBytes < 0 {
  			// This G is in debt. Assist the GC to correct
  			// this before allocating. This must happen
  			// before disabling preemption.
  			gcAssistAlloc(assistG)
  		}
  	}
  
  	// Set mp.mallocing to keep from being preempted by GC.
  	mp := acquirem()
  	if mp.mallocing != 0 {
  		throw("malloc deadlock")
  	}
  	if mp.gsignal == getg() {
  		throw("malloc during signal")
  	}
  	mp.mallocing = 1
  
  	shouldhelpgc := false
  	dataSize := size
  	c := gomcache()
  	var x unsafe.Pointer
  	noscan := typ == nil || typ.ptrdata == 0
  	if size <= maxSmallSize {
  		if noscan && size < maxTinySize {
  			// Tiny allocator.
  			//
  			// Tiny allocator combines several tiny allocation requests
  			// into a single memory block. The resulting memory block
  			// is freed when all subobjects are unreachable. The subobjects
  			// must be noscan (don't have pointers), this ensures that
  			// the amount of potentially wasted memory is bounded.
  			//
  			// Size of the memory block used for combining (maxTinySize) is tunable.
  			// Current setting is 16 bytes, which relates to 2x worst case memory
  			// wastage (when all but one subobjects are unreachable).
  			// 8 bytes would result in no wastage at all, but provides less
  			// opportunities for combining.
  			// 32 bytes provides more opportunities for combining,
  			// but can lead to 4x worst case wastage.
  			// The best case winning is 8x regardless of block size.
  			//
  			// Objects obtained from tiny allocator must not be freed explicitly.
  			// So when an object will be freed explicitly, we ensure that
  			// its size >= maxTinySize.
  			//
  			// SetFinalizer has a special case for objects potentially coming
  			// from tiny allocator, it such case it allows to set finalizers
  			// for an inner byte of a memory block.
  			//
  			// The main targets of tiny allocator are small strings and
  			// standalone escaping variables. On a json benchmark
  			// the allocator reduces number of allocations by ~12% and
  			// reduces heap size by ~20%.
  			off := c.tinyoffset
  			// Align tiny pointer for required (conservative) alignment.
  			if size&7 == 0 {
  				off = alignUp(off, 8)
  			} else if size&3 == 0 {
  				off = alignUp(off, 4)
  			} else if size&1 == 0 {
  				off = alignUp(off, 2)
  			}
  			if off+size <= maxTinySize && c.tiny != 0 {
  				// The object fits into existing tiny block.
  				x = unsafe.Pointer(c.tiny + off)
  				c.tinyoffset = off + size
  				c.local_tinyallocs++
  				mp.mallocing = 0
  				releasem(mp)
  				return x
  			}
  			// Allocate a new maxTinySize block.
  			span := c.alloc[tinySpanClass]
  			v := nextFreeFast(span)
  			if v == 0 {
  				v, _, shouldhelpgc = c.nextFree(tinySpanClass)
  			}
  			x = unsafe.Pointer(v)
  			(*[2]uint64)(x)[0] = 0
  			(*[2]uint64)(x)[1] = 0
  			// See if we need to replace the existing tiny block with the new one
  			// based on amount of remaining free space.
  			if size < c.tinyoffset || c.tiny == 0 {
  				c.tiny = uintptr(x)
  				c.tinyoffset = size
  			}
  			size = maxTinySize
  		} else {
  			var sizeclass uint8
  			if size <= smallSizeMax-8 {
  				sizeclass = size_to_class8[(size+smallSizeDiv-1)/smallSizeDiv]
  			} else {
  				sizeclass = size_to_class128[(size-smallSizeMax+largeSizeDiv-1)/largeSizeDiv]
  			}
  			size = uintptr(class_to_size[sizeclass])
  			spc := makeSpanClass(sizeclass, noscan)
  			span := c.alloc[spc]
  			v := nextFreeFast(span)
  			if v == 0 {
  				v, span, shouldhelpgc = c.nextFree(spc)
  			}
  			x = unsafe.Pointer(v)
  			if needzero && span.needzero != 0 {
  				memclrNoHeapPointers(unsafe.Pointer(v), size)
  			}
  		}
  	} else {
  		var s *mspan
  		shouldhelpgc = true
  		systemstack(func() {
  			s = largeAlloc(size, needzero, noscan)
  		})
  		s.freeindex = 1
  		s.allocCount = 1
  		x = unsafe.Pointer(s.base())
  		size = s.elemsize
  	}
  
  	var scanSize uintptr
  	if !noscan {
  		// If allocating a defer+arg block, now that we've picked a malloc size
  		// large enough to hold everything, cut the "asked for" size down to
  		// just the defer header, so that the GC bitmap will record the arg block
  		// as containing nothing at all (as if it were unused space at the end of
  		// a malloc block caused by size rounding).
  		// The defer arg areas are scanned as part of scanstack.
  		if typ == deferType {
  			dataSize = unsafe.Sizeof(_defer{})
  		}
  		heapBitsSetType(uintptr(x), size, dataSize, typ)
  		if dataSize > typ.size {
  			// Array allocation. If there are any
  			// pointers, GC has to scan to the last
  			// element.
  			if typ.ptrdata != 0 {
  				scanSize = dataSize - typ.size + typ.ptrdata
  			}
  		} else {
  			scanSize = typ.ptrdata
  		}
  		c.local_scan += scanSize
  	}
  
  	// Ensure that the stores above that initialize x to
  	// type-safe memory and set the heap bits occur before
  	// the caller can make x observable to the garbage
  	// collector. Otherwise, on weakly ordered machines,
  	// the garbage collector could follow a pointer to x,
  	// but see uninitialized memory or stale heap bits.
  	publicationBarrier()
  
  	// Allocate black during GC.
  	// All slots hold nil so no scanning is needed.
  	// This may be racing with GC so do it atomically if there can be
  	// a race marking the bit.
  	if gcphase != _GCoff {
  		gcmarknewobject(uintptr(x), size, scanSize)
  	}
  
  	if raceenabled {
  		racemalloc(x, size)
  	}
  
  	if msanenabled {
  		msanmalloc(x, size)
  	}
  
  	mp.mallocing = 0
  	releasem(mp)
  
  	if debug.allocfreetrace != 0 {
  		tracealloc(x, size, typ)
  	}
  
  	if rate := MemProfileRate; rate > 0 {
  		if rate != 1 && size < c.next_sample {
  			c.next_sample -= size
  		} else {
  			mp := acquirem()
  			profilealloc(mp, x, size)
  			releasem(mp)
  		}
  	}
  
  	if assistG != nil {
  		// Account for internal fragmentation in the assist
  		// debt now that we know it.
  		assistG.gcAssistBytes -= int64(size - dataSize)
  	}
  
  	if shouldhelpgc {
  		if t := (gcTrigger{kind: gcTriggerHeap}); t.test() {
  			gcStart(t)
  		}
  	}
  
  	return x
  }
  

### 4.7 Safe Access to 64-bit Words (just use a mutex instead!)

  package main
  
  import "sync"
  
  func main() {
  	c := struct {
  		val  int16
  		val2 int64
  	}{}
  	// A mutex imposes no alignment requirement on the field it protects,
  	// so this works on 32-bit systems regardless of where val2 sits.
  	var mutex sync.Mutex
  	mutex.Lock()
  	c.val2 += 1
  	mutex.Unlock()
  }
      

## 5. Summary

  • Memory alignment exists so the CPU can access data in memory more efficiently
  • Struct alignment relies on each type's size guarantee and alignment guarantee
  • The address alignment guarantee is: if the alignment guarantee of type t is n, then the address of every value of type t must be a multiple of n at run time
  • If a struct carries a lot of padding, try reordering its fields so they pack more tightly and waste less memory
  • Avoid making a zero-size field the last field of a struct; it causes extra padding
  • Atomic access to a 64-bit word on a 32-bit system requires it to be 8-byte aligned; when that is not strictly necessary, a mutex is the clearer and simpler choice

Study hard and improve every day.