Oct 19 2015

Weekly Reading Notes (Issue 2)

From Struggling Student to Top Student - A Brief History of My 100 Days of Reading

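Commentary:
    1. Taking action comes first: when you have an idea, act on it. The author is clearly a doer, which I admire!  
    2. The time right after getting up in the morning can also be put to use. For me, that means going to bed a little earlier, getting up an hour earlier, and spending that hour on a few pomodoros of reading!
    3. Books of different kinds call for different ways of reading.  
    4. Can I become a doer myself? Can I truly become someone who loves reading? As the saying goes, what is learned on paper always feels shallow; to really understand something you have to practice it yourself.  
    5. After finishing each section of a book, writing a short review (say, 125 characters) to summarize it is a good way to organize what you have learned.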

A Great Tool for Showing the Progress of Linux Commands

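Summary:
    1. Coreutils Viewer (cv) is a simple program that shows the progress of any coreutils command
    (e.g. cp, mv, dd, tar, gzip, gunzip, cat, grep, fgrep, egrep, cut, sort, xz, exiting).
Commentary:
    1. It does not feel all that practical to me, though, so I am just leaving it here as a memo.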

Over 100 Million NetEase 163/126 Mailbox Records Leaked

Commentary:
    Nothing is safe on the internet. Time to go and, sadly, change the affected passwords.

What’s New in CPUs Since the 80s and How Does It Affect Programmers?

Summary:
    1. Memory/Caches: The solution to the problem of having relatively slow memory has been to add
       caching, which provides fast access to frequently used data, and prefetching, which preloads
       data into caches if the access pattern is predictable.  
    2. TLBs (the most important special-purpose caches in the CPU):
        a. TLBs are caches for virtual memory lookups (done via
           a 4-level page table structure on x86).
        b. If you use 4k pages, the limited size of TLBs limits the amount
           of memory you can address without incurring a TLB miss.
        c. x86 also supports 2MB and 1GB pages; some applications will
           benefit a lot from using larger page sizes.
        d. Also, first-level caches are usually limited by the page size times
           the associativity of the cache.
           Haswell has an 8-way associative cache and 4kB pages. Its L1 data cache
           is 8 * 4kB = 32kB.
    3. Out of Order Execution/Serialization: For a couple decades now, x86 chips
       have been able to speculatively execute and re-order execution (to avoid
       blocking on a single stalled resource).
    4. Memory/Concurrency (i.e., multicore):
        a. If core0 and core1 interact, there’s no guarantee that their interaction is ordered.
        b. To make a sequence atomic, we can use xchg (always locked when it has a memory operand) or cmpxchg with the lock prefix as atomic swap and compare-and-swap primitives.
    5. Memory/Non-Temporal Stores/Write-Combine Memory:
        a. UC memory: uncacheable memory
        b. WC: write combine
        c. WC is a kind of eventually consistent UC. Writes have to eventually
           make it to memory, but they can be buffered internally.
    6. Memory/NUMA:
        a. Non-uniform memory access, where memory latencies and bandwidth are
           different for different processors.
        b. The takeaway here is that threads that share memory should be
           on the same socket, and a memory-mapped I/O heavy thread should
           make sure it's on the socket that's closest to the I/O device it's
           talking to.
    7. Context Switches/Syscalls:
        a. A side effect of all the caching that modern cores have is that
           context switches are expensive, which causes syscalls to be expensive.
        b. The high cost of syscalls is the reason people have switched to using
           batched versions of syscalls for high-performance code (e.g., epoll, or recvmmsg)
           and the reason that people who need very high performance I/O often use userspace I/O
           stacks.
        c. More generally, the cost of context switches is why high-performance code
           is often thread-per-core (or even single threaded on a pinned thread) and not
           thread-per-logical-task.
    8. SIMD (Single Instruction Multiple Data):
        a. Since it’s common to want to do the same operation multiple times,
           Intel added instructions that will let you operate on a 128-bit
           chunk of data as 2 64-bit chunks, 4 32-bit chunks, 8 16-bit chunks, etc. 
        b. It’s pretty common to get a 2x-4x speedup from using SIMD instructions;
           it’s definitely worth looking into if you’ve got a computationally heavy workload.
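
To check the arithmetic in items 2 and 8 for myself, here is a small Python sketch. The helper names and the 64-entry TLB are assumptions made up for illustration; they are not figures from the article. Only the 8-way / 4kB Haswell numbers and the 128-bit SIMD width come from the summary above.

```python
KIB = 1024
MIB = 1024 * KIB

def l1_size_limit(ways, page_size):
    """Upper bound on a first-level cache: page size times associativity."""
    return ways * page_size

def tlb_reach(entries, page_size):
    """Memory that can be addressed without a TLB miss, given the entry count."""
    return entries * page_size

def simd_lanes(register_bits, element_bits):
    """Number of elements processed by one SIMD instruction."""
    return register_bits // element_bits

# Haswell figures quoted in the summary: 8-way associative, 4kB pages.
haswell_l1 = l1_size_limit(ways=8, page_size=4 * KIB)

# Hypothetical 64-entry TLB (an assumption, not a number from the article).
reach_4k = tlb_reach(entries=64, page_size=4 * KIB)
reach_2m = tlb_reach(entries=64, page_size=2 * MIB)

print("L1 limit:", haswell_l1 // KIB, "kB")
print("TLB reach with 4kB pages:", reach_4k // KIB, "kB")
print("TLB reach with 2MB pages:", reach_2m // MIB, "MB")
print("128-bit lanes:", [simd_lanes(128, w) for w in (64, 32, 16)])
```

With the same number of TLB entries, moving from 4kB to 2MB pages multiplies the reach by 512, which is presumably why some applications gain so much from larger pages.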

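Commentary:
    1. Out of Order Execution/Serialization:
        a. How far execution can be reordered is bounded by the size of the CPU's out-of-order (OoO) buffer.
        b. Out-of-order execution is an important foundation for branch prediction.
        c. Out-of-order does not mean unconstrained: read-after-read, write-after-read, and read-after-write orderings all restrict whether a reordering is allowed.
    2. Memory/Concurrency (i.e., multicore):
        a. The MESI protocol is a key concept for solving cache inconsistency when multiple cores access memory.
           M: Modified  E: Exclusive  S: Shared  I: Invalid
    3. Where NUMA comes from: keeping caches coherent is the problem (exchanging coherence information among
       many cores is expensive and complex), so each socket is made responsible for one region of memory, and
       that gives the NUMA structure. Its drawback is equally clear: accessing memory across sockets incurs higher latency.
    4. GPU (Graphics Processing Units): they win by sheer numbers, being built from a large number of small cores
       dedicated to specialized computation (floating point, matrix operations).  
    5. Branches: the author's point is that branches are now quite cheap.
        On one hand, on a Haswell CPU the extra cost of a branch misprediction is only 14 cycles;
        on the other, misprediction rates are now very low (the author's conclusion after measuring common programs with perf stat), i.e. branch prediction already works well and usually succeeds.
    6. Alignment: obsessively aligning to the page size is no longer necessary; CPUs already handle such code well.
       Forcing alignment today may even cost performance.  
    7. Self-modifying code: I do not yet understand what self-modifying code means. To be looked up...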
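
To make the MESI note in item 2 more concrete, here is a rough Python sketch of how the four states might change for a single cache line shared by several cores. It is a simplified teaching model under my own assumptions (one line, no bus timing, write-backs implied rather than modeled), not a description of any real CPU; the class and method names are invented for this note.

```python
from enum import Enum

class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"

class CacheLine:
    """Tracks the MESI state of one cache line in each core's private cache."""

    def __init__(self, cores):
        self.states = [State.INVALID] * cores

    def read(self, core):
        if self.states[core] is not State.INVALID:
            return  # read hit: M, E and S can all serve a read locally
        others_hold = False
        for other, state in enumerate(self.states):
            if other != core and state is not State.INVALID:
                others_hold = True
                # A Modified copy is written back first (not modeled here);
                # any other valid copy is downgraded to Shared.
                self.states[other] = State.SHARED
        self.states[core] = State.SHARED if others_hold else State.EXCLUSIVE

    def write(self, core):
        # Writing needs ownership: every other copy is invalidated.
        for other in range(len(self.states)):
            if other != core:
                self.states[other] = State.INVALID
        self.states[core] = State.MODIFIED

    def snapshot(self):
        """One letter per core, e.g. 'MI' for two cores."""
        return "".join(state.value for state in self.states)

line = CacheLine(cores=2)
for op, core in [("read", 0), ("read", 1), ("write", 0), ("read", 1)]:
    getattr(line, op)(core)
    print(op, "by core", core, "->", line.snapshot())
```

The invalidations in `write` are the coherence traffic that item 3 refers to: the more cores share a line, the more messages a single write has to generate.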

Use multiple CPU Cores with your Linux commands — awk, sed, bzip2, grep, wc, etc.

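Commentary: My current work rarely involves text processing on very large files, so things are acceptable as they are.
    The parallel usage described in this article is quite appealing, though, so I am leaving it here as a memo.  
    For my own case, if I occasionally need to process a very large file, the following approach might work:
    first cut the big file into pieces with the split command, then use multiple processes to work on the pieces across multiple cores.  
Related: http://blog.sciencenet.cn/blog-548663-812884.html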
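
A rough Python sketch of the split-then-parallelize idea. Note the substitution: instead of physically cutting the file with split, it divides the file into byte ranges that end on line boundaries and hands each range to a worker process. The function names and the task (counting lines that contain a byte string) are assumptions chosen only for illustration.

```python
import os
from multiprocessing import Pool

def byte_ranges(path, parts):
    """Return up to `parts` (path, start, end) ranges that end on line boundaries."""
    size = os.path.getsize(path)
    ranges = []
    start = 0
    with open(path, "rb") as f:
        for i in range(1, parts + 1):
            if start >= size:
                break
            target = size * i // parts
            if target <= start:
                continue  # an earlier range already covers this slice
            if i == parts:
                end = size
            else:
                f.seek(target)
                f.readline()  # move forward to the end of the current line
                end = f.tell()
            ranges.append((path, start, end))
            start = end
    return ranges

def count_lines_with(task):
    """Count lines containing `needle` inside one byte range."""
    path, start, end, needle = task
    count = 0
    with open(path, "rb") as f:
        f.seek(start)
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            if needle in line:
                count += 1
    return count

def parallel_count(path, needle, workers=4):
    """Spread the ranges over `workers` processes and sum the partial counts."""
    tasks = [(p, s, e, needle) for p, s, e in byte_ranges(path, workers)]
    with Pool(workers) as pool:
        return sum(pool.map(count_lines_with, tasks))
```

A call would look like `parallel_count("big.log", b"error")`, where `big.log` is a placeholder file name. On platforms that start worker processes by spawning, the call has to sit under an `if __name__ == "__main__":` guard.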

A Step-by-Step Guide to Diagnosing Problems with strace

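Summary:
    1. While top is running, press [1] to show the per-CPU list and [shift+p] to sort by CPU usage.
    2. Use [strace] to trace calls into the kernel (system calls) and [ltrace] to trace user-space function calls (library calls).
    3. strace examples
        # strace -p <PID>               // trace the system calls of a process; expect the screen to be flooded
        # strace -cp <PID>              // [-c] summarizes total time, call count, etc. per call; very handy
        # strace -T -e clone -p <PID>   // -T shows the time actually spent in each call; -e traces only the call named after it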
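Commentary:
    1. The article covers only a few strace options, but its troubleshooting approach is well worth learning.
    2. Troubleshooting approach:
        a. Use top to find the system resource bottleneck, and from there the responsible process (for example, if system CPU usage is high, find the process using the most CPU).
        b. CPU usage has to be split into kernel mode [sy] and user mode [us]; that decides whether to investigate with strace or ltrace.
        c. When investigating with strace, first use -c to find the most time-consuming system calls in the process.
        d. Then use -T and -e to investigate those specific calls.
        e. Finally, go back to the real application code and locate the part that corresponds to the calls identified above.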
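
Steps c and d could be scripted roughly as follows. This is only a sketch: `strace_argv` is a helper name invented for this note, the PID is a placeholder, and it only assembles the command lines using the flags listed in the summary (-c, -T, -e, -p), spelling -e in its explicit `trace=` form.

```python
def strace_argv(pid, summary=False, timing=False, syscall=None):
    """Build an strace command line for attaching to a running process."""
    argv = ["strace"]
    if summary:
        argv.append("-c")  # per-syscall totals: time, calls, errors
    if timing:
        argv.append("-T")  # time spent inside each individual call
    if syscall:
        argv += ["-e", "trace=" + syscall]  # restrict tracing to one syscall
    argv += ["-p", str(pid)]
    return argv

pid = 1234  # placeholder PID, an assumption for illustration

# Step c: get the summary first to see which syscall dominates.
overview = strace_argv(pid, summary=True)

# Step d: then zoom in on that syscall (clone, as in the article's example).
detail = strace_argv(pid, timing=True, syscall="clone")

print(" ".join(overview))
print(" ".join(detail))
```

Either list can be passed to `subprocess.run`. Attaching to a process normally needs sufficient privileges, and the summary from -c is only printed once strace is interrupted or the traced process exits.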